FPGA-based SmartNICs: OS4C and Beyond
Speaker: Scott Smith
Affiliation: UIUC
Date: Oct. 9, 2024
-
Abstract: Smart network interface cards (SmartNICs) are powerful, high-throughput programmable devices. These devices enable developers to offload specific functionality to hardware, improving performance and/or reducing CPU overhead. Over the last several years, much work has been done in this research area. This talk will cover three broad topics. First, we will briefly overview the current state-of-the-art SmartNIC architecture research. Second, we will present our recent work, “OS4C: An Open-Source SR-IOV System for SmartNIC-based Cloud Platforms,” which examines the limitations of current open-source NICs and extends the open-source NIC Corundum with support for Single Root I/O Virtualization (SR-IOV). Third, we will discuss our plans to build a new NIC architecture on the Alveo V80 FPGA. We aim for this project to be the first open-source 200+ Gbps NIC. Furthermore, our architecture will enable greater flexibility and programmability than current state-of-the-art FPGA-based SmartNICs.
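For readers unfamiliar with SR-IOV: once a device exposes virtual functions (VFs), a standard Linux host enables them through the PCI sysfs interface, and each VF then appears as an independent PCIe function that can be handed to a VM. The sketch below illustrates that host-side step only; the PCI address and VF count are hypothetical, and OS4C's contribution is the device-side SR-IOV support itself.

```python
# Hedged sketch: enabling SR-IOV virtual functions (VFs) on a Linux host.
# Assumes a standard Linux PCI sysfs layout and an SR-IOV-capable NIC at a
# hypothetical PCI address; OS4C adds the SR-IOV support on the device side.
from pathlib import Path

PCI_ADDR = "0000:3b:00.0"  # hypothetical PCI address of the NIC's physical function
dev = Path("/sys/bus/pci/devices") / PCI_ADDR

def enable_vfs(num_vfs: int) -> None:
    total = int((dev / "sriov_totalvfs").read_text())
    if num_vfs > total:
        raise ValueError(f"device only supports {total} VFs")
    # The kernel requires the VF count to be reset to 0 before it can be changed.
    (dev / "sriov_numvfs").write_text("0")
    (dev / "sriov_numvfs").write_text(str(num_vfs))

if __name__ == "__main__":
    enable_vfs(4)  # each VF then appears as an independent PCIe function for a VM
```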
DETECTive: Machine Learning-driven Automatic Test Pattern Prediction for Faults in Digital Circuits
Speaker: Prof. Debjit Pal
Affiliation: University of Illinois Chicago
Date: May 8, 2024
-
Abstract: Due to continuous technology scaling and the ever-increasing complexity and size of hardware designs, manufacturing defects have become a key obstacle to meeting end-user demand. Despite decades of research, traditional test-generation techniques often struggle to scale to massive and complex designs. Such scalability issues stem from the numerous backtracking steps that traditional test-generation techniques perform before converging on a test pattern. In this talk, we present DETECTive, which leverages deep learning on graphs to learn fault characteristics and predict test pattern(s) to expose faults without requiring backtracking. DETECTive is trained on small circuits, and its learned knowledge is transferable to predict test patterns for circuits that contain up to 29x more gates than the training circuits. Since DETECTive avoids backtracking completely, it can predict test patterns up to 15x faster than academic tools and up to 2x faster than commercial tools. DETECTive achieves up to 100% pattern accuracy on synthetic designs and up to 95% test pattern accuracy on realistic designs. To our knowledge, DETECTive is the first technique to leverage deep learning to predict test patterns for digital hardware designs and can complement traditional test generation for faster design closure.
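As a rough illustration of the general idea (a toy, not the authors' actual model), the sketch below encodes a tiny gate-level netlist as a graph and uses a small PyTorch message-passing network to predict a logic value for each primary input; every layer, feature dimension, and netlist here is hypothetical.

```python
# Toy sketch of GNN-based test-pattern prediction (hypothetical model, not
# DETECTive itself): encode a netlist as a graph, run message passing, and
# predict a logic value for each primary input to expose a target fault.
import torch
import torch.nn as nn

class ToyCircuitGNN(nn.Module):
    def __init__(self, feat_dim: int = 8, hidden: int = 16):
        super().__init__()
        self.lin1 = nn.Linear(feat_dim, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.readout = nn.Linear(hidden, 1)  # per-node probability that the input is 1

    def forward(self, adj: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(adj @ self.lin1(x))   # one round of neighbor aggregation
        h = torch.relu(adj @ self.lin2(h))   # second round
        return torch.sigmoid(self.readout(h)).squeeze(-1)

# 4-node toy netlist: two primary inputs feeding a gate whose output is observed.
adj = torch.tensor([[0, 0, 1, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=torch.float32)
x = torch.randn(4, 8)              # node features: gate type, fault flag, etc. (random here)
model = ToyCircuitGNN()
probs = model(adj, x)
pattern = (probs[:2] > 0.5).int()  # threshold the two primary-input nodes
print("predicted test pattern for inputs:", pattern.tolist())
```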
Large Language Model Acceleration on FPGA with Allo
Speaker: Hongzheng Chen
Affiliation: Cornell
Date: Apr. 10, 2024
-
Abstract: As the benefits of technology scaling diminish, specialized hardware accelerators are crucial for performance in emerging applications. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. While new accelerator design languages (ADLs) aim to enhance or replace HLS, they are typically more effective for simple applications with a single kernel, rather than for hierarchical designs with multiple kernels.
In the first part of this talk, I will introduce Allo, a composable programming model for efficient hardware accelerator design (to appear in PLDI’24). Allo decouples hardware customizations, including compute, memory, communication, and data types, from the algorithm specification and encapsulates them as a set of customization primitives. Allo also preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner. Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in the PolyBench suite.
Furthermore, I will demonstrate how Allo optimizes large-scale designs using spatial architecture for large language models (LLMs) as an example. This accelerator implements a design point of our analytical model presented in FCCM’24, where we introduce a comprehensive analytical framework for estimating the performance of a spatial LLM accelerator. Through this analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. For GPT generative inference, our accelerator attains a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
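To give a flavor of the kind of analytical reasoning such a framework performs (the actual FCCM'24 model is far more detailed, covering parallelization and buffering schemes), the sketch below estimates whether each stage of generative inference is compute- or memory-bound from peak throughput and memory bandwidth; all hardware and model numbers are hypothetical.

```python
# Back-of-the-envelope roofline estimate for LLM prefill vs. decode (illustrative
# only; the talk's analytical model is far more detailed).
def stage_time(flops: float, bytes_moved: float, peak_flops: float, mem_bw: float) -> float:
    """Latency lower bound: the slower of compute time and memory time."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK, BW = 100e12, 1e12

# Hypothetical GPT-style layer: ~1 GB of fp16 weights. Prefill processes 512 tokens
# in one pass; decode processes 1 token but still streams all the weights.
weights_bytes = 1e9
flops_per_token = 2 * (weights_bytes / 2)     # ~2 FLOPs per weight parameter

prefill = stage_time(512 * flops_per_token, weights_bytes, PEAK, BW)
decode  = stage_time(1   * flops_per_token, weights_bytes, PEAK, BW)
print(f"prefill ~{prefill*1e3:.2f} ms (compute-bound), decode ~{decode*1e3:.2f} ms (memory-bound)")
```

With these made-up numbers, prefill is limited by compute while decode is limited by weight streaming, which is exactly why the two stages favor different accelerator designs.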
HAL: Hardware-assisted Load Balancing for Energy-efficient SNIC-Host Cooperative Computing
Speaker: Jinghan Huang
Affiliation: UIUC
Date: Mar. 13, 2024
-
Abstract: A typical SmartNIC (SNIC) integrates a processor, consisting of an Arm CPU and accelerators, with a conventional NIC. The processor is designed to energy-efficiently execute functions frequently used by network-intensive datacenter applications. With such a processor, the SNIC has promised to notably increase the overall energy efficiency of datacenter servers. Nevertheless, the recent trend of integrating accelerators into server CPUs for these functions raises questions about the SNIC processor’s superiority over a host processor (i.e., a server CPU with accelerators) in system-wide energy efficiency, especially under given tail latency constraints. Answering this pressing question, we first take a processor integrated with various accelerators as a host processor and then compare it to a commercial SNIC processor. This reveals that (1) the host accelerators, coupled with a more powerful memory subsystem, can outperform the SNIC accelerators and (2) the SNIC processor can improve system-wide energy efficiency over the host processor only at low packet rates for most functions under tail latency constraints. To offer high system-wide energy efficiency without hurting tail latency at any packet rate, we propose HAL, which consists of a hardware-based load balancer and an intelligent load-balancing policy implemented inside the SNIC. When HAL detects that the SNIC processor cannot efficiently process a given function beyond a specific packet rate, it limits the rate of packets sent to the SNIC processor and lets the host processor handle the excess. HAL currently works for stateless functions with conventional PCIe-attached SNICs, but we also demonstrate that HAL can work just as effectively for stateful functions with a CXL-attached SNIC. We implement HAL with an AMD U280 FPGA connected to the SNIC and show that HAL enables the SNIC processor to improve the energy efficiency and throughput of the server by 31% and 10%, respectively, without notably hurting tail latency.
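The core policy is easy to state: up to a per-function rate limit, packets stay on the SNIC; beyond it, the excess spills to the host. A minimal sketch of that split (thresholds and function names are hypothetical; the real HAL implements this in hardware inside the SNIC and accounts for tail latency online) might look like:

```python
# Minimal sketch of a HAL-style load-balancing policy (illustrative only).
# Hypothetical per-function packet rates (packets/s) beyond which the SNIC
# processor can no longer stay energy-efficient within its tail-latency budget.
SNIC_RATE_LIMIT = {"firewall": 2_000_000, "crypto": 500_000}

def split_rate(func: str, offered_rate: float) -> tuple[float, float]:
    """Split the offered packet rate for `func` between the SNIC and the host."""
    limit = SNIC_RATE_LIMIT.get(func, float("inf"))
    snic_share = min(offered_rate, limit)         # SNIC handles traffic up to its limit
    return snic_share, offered_rate - snic_share  # the host absorbs the excess

if __name__ == "__main__":
    for rate in (1_000_000, 3_000_000):
        snic, host = split_rate("firewall", rate)
        print(f"firewall @ {rate:,} pps -> SNIC {snic:,.0f} pps, host {host:,.0f} pps")
```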
Medusa: a Novel Framework for Accelerating LLM Generation with Multiple Decoding Heads
Speaker: Yuhong Li
Affiliation: UIUC
Date: Feb. 14, 2024
-
Abstract: Large Language Models (LLMs) have changed the world. However, generating text with them can be slow and expensive. In this seminar, we present Medusa, which introduces a transformative approach to enhancing Large Language Model (LLM) inference, tackling the inherent bottleneck of auto-regressive decoding, which relies on sequential computation. Traditionally, this process is hindered by the need for each step to wait on its predecessor, coupled with the requirement to transfer extensive model parameters to the accelerator’s cache, leading to significant delays. Medusa breaks away from conventional paradigms by integrating additional decoding heads capable of predicting multiple future tokens in parallel. Utilizing a sophisticated tree-based attention mechanism, Medusa simultaneously constructs and evaluates various candidate continuations at each decoding step. This novel approach significantly reduces the number of sequential decoding steps, thereby enhancing efficiency without compromising output quality. Additionally, Medusa introduces innovative extensions, such as a self-distillation technique for scenarios lacking training data and a novel acceptance scheme that increases acceptance rates while maintaining generation quality. Extensive evaluations of Medusa across various model sizes and training techniques have demonstrated its effectiveness, achieving over 2.2x speedup without quality compromise, and further improvements of up to 2.8x speedup in optimized settings. Medusa represents a significant advancement in LLM efficiency, offering a novel solution to the challenges of traditional LLM inference processes. Medusa has gathered over 1.6k GitHub stars and has been adopted by open-source libraries such as NVIDIA TensorRT-LLM, Alibaba RTP-LLM, and Huggingface TGI. It has also been used in closed-source inference engines by startups such as Together AI, led by Professor Christopher Re, and Lepton AI, founded by Dr. Yangqing Jia.
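As a rough sketch of the idea (not Medusa's actual implementation, which uses extra decoding heads and tree attention inside the model), draft heads propose several future tokens from the current context and the base model verifies them, accepting the longest agreeing prefix so that more than one token is emitted per step. Everything below is a toy with hypothetical stand-in functions.

```python
# Toy sketch of multi-token drafting and verification (illustrative only).
from typing import List

def base_next_token(context: List[int]) -> int:
    # Stand-in for the full LLM's greedy next-token choice (hypothetical rule).
    return (sum(context) * 31 + 7) % 1000

def draft_tokens(context: List[int], k: int) -> List[int]:
    # Stand-in for k draft heads predicting the next k tokens; the first drafts
    # are intentionally accurate here and later ones get noisier.
    out, ctx = [], list(context)
    for i in range(k):
        guess = base_next_token(ctx) if i < 2 else (base_next_token(ctx) + i) % 1000
        out.append(guess)
        ctx.append(guess)
    return out

def decode_step(context: List[int], k: int = 4) -> List[int]:
    """One decoding step: accept the longest draft prefix the base model agrees with."""
    drafts, accepted, ctx = draft_tokens(context, k), [], list(context)
    for t in drafts:
        if base_next_token(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    if not accepted:                       # always make progress: emit one token
        accepted.append(base_next_token(ctx))
    return accepted

if __name__ == "__main__":
    print("tokens accepted this step:", decode_step([1, 2, 3]))  # >1 token => speedup
```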
Targeting AMD heterogeneous compute units with AIR on ROCr
Speaker: Eddie Richter
Affiliation: AMD
Date: Jan. 10, 2024
-
Abstract: In this talk, we present ROCm-air, which integrates AIEs into the ROCm runtime (ROCr) software stack using the AIR spatial compute framework and compiler. We describe how ROCm interfaces with the AIR compiler, driver, and framework, and present two demos from the release, showcasing a weather stencil application ported to AIEs running on our experimental ROCm runtime. The ROCm-air release shows how this unified software framework lowers the barrier to entry for users targeting AMD heterogeneous compute units, specifically the Ryzen IPU and the Versal VCK5000.
HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis
Speaker: Hanchen Ye
Affiliation: UIUC
Date: Nov. 8, 2023
-
Abstract: Dataflow architectures are growing in popularity due to their potential to mitigate the challenges posed by the memory wall inherent to the Von Neumann architecture. At the same time, high-level synthesis (HLS) has demonstrated its efficacy as a design methodology for generating efficient dataflow architectures within a short development cycle. However, existing HLS tools rely on developers to explore the vast dataflow design space, ultimately leading to suboptimal designs. This phenomenon is especially concerning as the size of the HLS design grows. To tackle these challenges, we introduce HIDA, a new scalable and hierarchical HLS framework that can systematically convert an algorithmic description into a dataflow implementation on hardware. We first propose a collection of efficient and versatile dataflow representations for modeling the hierarchical dataflow structure. Capitalizing on these representations, we develop an automated optimizer that decomposes the dataflow optimization problem into multiple levels based on the inherent dataflow hierarchy. Using FPGAs as the evaluation platform and a set of neural networks modeled in PyTorch, HIDA achieves up to 8.54x higher throughput compared to the state-of-the-art (SOTA) HLS optimization tool. Furthermore, despite being fully automated and able to handle various applications, HIDA achieves 1.29x higher throughput over the SOTA RTL-based neural network accelerators on an FPGA.
FPGA-based SSD Emulation Framework
Speaker: Yizhen Lu and Curtis Yu
Affiliation: UIUC
Date: Oct. 11, 2023
-
Abstract: With the growing popularity of Solid State Drives (SSDs) in high-performance computing (HPC) and cloud services, understanding their intricate behaviors becomes imperative. This gives rise to the importance of hardware-based emulation systems for SSDs, which outperform software-based simulation solutions in both speed and efficiency. In this talk, we will first introduce FSSD, an FPGA-based emulation system that models the latency and access patterns of actual NVMe SSDs. FSSD leverages FPGA flexibility to enable customization of SSD microarchitectures and robust design space exploration for data-intensive applications. It interacts directly with a real operating system, avoiding the virtual-machine-based setups that limit most existing works, and achieves a remarkable 1000x speedup over SimpleSSD, a conventional software-based simulation framework. Building on FSSD’s foundation, we will then present SSDe, an enhanced FPGA-based SSD-Express emulator. SSDe incorporates advanced features such as energy modeling, garbage collection, and rapid runtime reconfiguration, making it a superior and adaptable emulator. SSDe showcases its highly efficient design space exploration capabilities with a 236,000x faster reconfiguration speed when exploring different SSD architecture settings. The platform’s open-source nature further ensures its wide-scale applicability and advancement in the research community.
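The behavior being emulated can be pictured with a very small analytical model: a request's latency is roughly the interface transfer time plus queueing plus flash-array access. The sketch below uses made-up constants purely for illustration; FSSD and SSDe model these effects in FPGA hardware with far more fidelity (channel parallelism, garbage collection, energy, and so on).

```python
# Toy NVMe SSD latency model (hypothetical constants, illustration only).
def request_latency_us(size_bytes: int, is_write: bool, queue_depth: int) -> float:
    PCIE_BW = 3.5e9          # bytes/s of usable interface bandwidth
    FLASH_READ_US = 60.0     # per-page flash array read latency
    FLASH_PROG_US = 500.0    # per-page program (write) latency
    PAGE = 4096
    CHANNELS = 8             # pages striped across independent flash channels

    transfer = size_bytes / PCIE_BW * 1e6
    pages = max(1, (size_bytes + PAGE - 1) // PAGE)
    per_page = FLASH_PROG_US if is_write else FLASH_READ_US
    flash = per_page * ((pages + CHANNELS - 1) // CHANNELS)   # channels work in parallel
    queueing = (queue_depth - 1) * 2.0                        # crude contention penalty
    return transfer + flash + queueing

if __name__ == "__main__":
    print(f"4 KiB random read, QD1: {request_latency_us(4096, False, 1):.1f} us")
    print(f"128 KiB write,    QD8: {request_latency_us(131072, True, 8):.1f} us")
```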
MESA: Microarchitecture Extensions for Spatial Architecture Generation
Speaker: Dong Kai Wang
Affiliation: UIUC
Date: Sept. 13, 2023
-
Abstract: Modern heterogeneous CPUs incorporate various hardware accelerators to achieve improved efficiency. A well-known class among them, spatial accelerators, are designed with reconfigurability to accelerate a wide range of applications. However, they tend to require specialized compilers and software stacks, libraries, or languages to operate and cannot be utilized with ease by all applications. As a result, the accelerator’s resources sit wastefully idle when it is not explicitly programmed. Our goal is to dismantle this CPU-accelerator barrier by monitoring CPU threads for acceleration opportunities during execution and, if viable, dynamically reconfigure the accelerator to allow transparent offloading. We develop MESA, a hardware controller on the CPU that translates machine code to build an accelerator configuration tailored to the running program. While such a dynamic translation/reconfiguration approach is challenging, it has a key advantage over ahead-of-time compilers: access to runtime information, revealing not only dynamic dependencies but also performance characteristics. MESA maintains a real-time performance model of the program mapped on the accelerator in the form of a spatial dataflow graph with nodes weighted by operation latency and edges weighted by data transfer latency. Features of this dataflow graph are continuously updated with runtime information captured by performance counters, allowing a feedback loop of optimization, reconfiguration, and acceleration. We evaluate the feasibility of our solution with different accelerator configurations. Across the Rodinia benchmarks, results demonstrate an average 1.3x speedup in performance and 1.8x gain in energy efficiency against a multicore baseline.
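The performance model described here is essentially a weighted dataflow graph whose longest path bounds the latency of the mapped region. A minimal sketch of that computation is below; the node and edge weights are hypothetical stand-ins for the counter-derived latencies MESA updates at runtime.

```python
# Minimal sketch of a MESA-style performance model: a dataflow graph with node
# weights (operation latency) and edge weights (transfer latency), whose critical
# path estimates the mapped region's latency. Weights here are hypothetical.
from functools import lru_cache

node_latency = {"load_a": 4, "load_b": 4, "mul": 3, "add": 1, "store": 4}
edges = {  # (src, dst): transfer latency between spatial units
    ("load_a", "mul"): 1, ("load_b", "mul"): 1,
    ("mul", "add"): 2, ("add", "store"): 1,
}
preds = {}
for (src, dst), w in edges.items():
    preds.setdefault(dst, []).append((src, w))

@lru_cache(maxsize=None)
def finish_time(node: str) -> int:
    """Earliest completion time of `node` = latest predecessor arrival + own latency."""
    start = max((finish_time(s) + w for s, w in preds.get(node, [])), default=0)
    return start + node_latency[node]

print("estimated critical-path latency:", finish_time("store"), "cycles")
```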
Scheduling and Serverless Computing with Multi-Tenant FPGAs
Speaker: Meghna Mandava
Affiliation: UIUC
Date: May 10, 2023
-
Abstract: We introduce Nimblock for multi-tenant FPGA sharing. Nimblock explores scheduling possibilities to effectively time- and space-multiplex reconfigurable slots on a virtualized FPGA. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. We demonstrate system feasibility on a Xilinx ZCU106 FPGA and evaluate on a set of real-world benchmarks. We achieve up to 5.7x lower average response times when compared to a no-sharing scheduling algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing. We also demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms. This work will appear at ISCA’23.
We then use Nimblock to enable the use of FPGAs in serverless computing frameworks. We present Nimblock 2.0 to integrate virtualized multi-tenant FPGAs into a serverless platform. Our evaluation of the Nimblock 2.0 heterogeneous serverless computing model results in an average overhead of only 13% over a bare-metal FPGA implementation. Leveraging a heterogeneous serverless cluster with both CPU and FPGA compute resources can provide up to a 35% performance improvement compared to FPGA-only serverless computing.
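To make the scheduling problem concrete, the sketch below shows a toy priority- and deadline-aware policy for assigning waiting applications to reconfigurable slots. It is purely illustrative: the real Nimblock algorithm also models reconfiguration cost, preemption, and per-application performance degradation.

```python
# Toy slot scheduler in the spirit of Nimblock (illustrative only): rank waiting
# applications by priority and deadline slack, then fill the free slots.
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class App:
    sort_key: tuple = field(init=False, repr=False)
    name: str = field(compare=False)
    priority: int = field(compare=False)    # lower number = more important
    deadline: float = field(compare=False)  # absolute deadline (s)
    runtime: float = field(compare=False)   # estimated execution time (s)

    def __post_init__(self):
        slack = self.deadline - self.runtime
        self.sort_key = (self.priority, slack)  # priority first, then least slack

def schedule(ready: list[App], free_slots: int) -> list[App]:
    """Pick which waiting apps to load into the free reconfigurable slots now."""
    heapq.heapify(ready)
    return [heapq.heappop(ready) for _ in range(min(free_slots, len(ready)))]

if __name__ == "__main__":
    apps = [App("video", 1, deadline=0.050, runtime=0.030),
            App("crypto", 2, deadline=0.020, runtime=0.005),
            App("ml", 1, deadline=0.100, runtime=0.040)]
    print([a.name for a in schedule(apps, free_slots=2)])  # -> ['video', 'ml']
```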
Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems
Speaker: Prof. Jishen Zhao
Affiliation: UCSD
Date: Apr. 12, 2023
-
Abstract: Modern deep learning (DL) training is memory-consuming, constrained by the memory capacity of each computation component and cross-device communication bandwidth. In response to such constraints, current approaches include increasing parallelism in distributed training and optimizing inter-device communication. However, model parameter communication is becoming a key performance bottleneck in distributed DL training. This talk will introduce our recent design COARSE, a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols and MPI-like collective communication for synchronization, allowing low-latency and parallel access to training data and model parameters shared among worker GPUs. To enable high-bandwidth transfers between GPUs and the disaggregated memory system, we propose a decentralized parameter communication scheme to decouple and localize parameter synchronization traffic. Furthermore, we propose dynamic tensor routing and partitioning to fully utilize the non-uniform serial bus bandwidth that varies across different cloud computing systems. Finally, we design deadlock avoidance and dual synchronization to ensure high-performance parameter synchronization. We implement a disaggregated memory prototype based on an industrial CCI protocol using two FPGAs: a Xilinx KCU1500 and a BittWare 250-SoC, interconnected with a QSFP cable, which allows the ARM cores on the 250-SoC to access the shared memory pool through CCI. Our evaluation shows that COARSE achieves up to 48.3% faster DL training compared to the state-of-the-art MPI AllReduce communication.
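The parameter-synchronization traffic COARSE decentralizes is essentially a collective AllReduce over the workers' gradients. The toy sketch below simulates the two logical phases, reduce-scatter and all-gather, on plain Python lists; it has no relation to COARSE's actual CCI/MPI implementation and exists only to make the communication pattern concrete.

```python
# Toy simulation of an AllReduce over N workers' gradient vectors (conceptual
# only; COARSE performs the equivalent synchronization over a CCI-attached
# disaggregated memory pool with decentralized, localized traffic).
def allreduce(grads: list[list[float]]) -> list[list[float]]:
    n, length = len(grads), len(grads[0])
    chunk = (length + n - 1) // n
    # Phase 1 (reduce-scatter): worker i produces the reduced values for chunk i.
    owned = {i: [sum(w[k] for w in grads)
                 for k in range(i * chunk, min((i + 1) * chunk, length))]
             for i in range(n)}
    # Phase 2 (all-gather): every worker collects all the reduced chunks.
    full = [v for i in range(n) for v in owned[i]]
    return [list(full) for _ in range(n)]

if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0],
             [10.0, 20.0, 30.0, 40.0]]
    print(allreduce(grads)[0])   # -> [11.0, 22.0, 33.0, 44.0]
```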
VersatileSSD: Breaking I/O Barrier by Leveraging SmartSSD
Speaker: Ipoom Jeong
Affiliation: UIUC
Date: Mar. 8, 2023
-
Abstract: Modern server systems are facing the challenge of meeting the increasingly demanding performance requirements of applications that process massive amounts of data, such as databases, machine learning, and data analytics. Solid-state drives (SSDs) have grown in popularity due to their ability to significantly reduce the time required to transfer huge amounts of data, resulting in a shift in the system bottleneck from data transfer to interconnect bandwidth and operating system overhead. However, as PCIe places limitations on simultaneous I/O device access, the scalability of the system is limited when multiple I/O devices are servicing independent contexts of I/O operations. This issue, along with long-latency data transfers over the interconnect bus, has led to the development of the SmartSSD, which employs near-data processing (NDP) to move computation closer to where the data is stored. This presentation outlines our vision for leveraging the SmartSSD to maximize its computing potential and versatility in the context of NDP. Specifically, we will discuss two approaches that we are currently exploring: 1) universal predicate pushdown and 2) an OS page cache expander. We will present the high-level concepts behind these approaches and outline our development plan, which we believe will pave the way for integrating the SmartSSD into mainstream server systems.
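Predicate pushdown is the idea of evaluating a filter next to the stored data and sending only matching rows over the interconnect. The toy comparison below contrasts the two strategies in terms of bytes transferred; the row format and sizes are hypothetical, and no real SmartSSD API is used.

```python
# Toy comparison of filtering on the host vs. pushing the predicate down to the
# storage device (conceptual; a SmartSSD would run the filter on its FPGA and
# return only matching rows, shrinking the interconnect transfer).
ROW_BYTES = 128

def host_side_filter(rows, predicate):
    transferred = len(rows) * ROW_BYTES          # every row crosses the interconnect
    return [r for r in rows if predicate(r)], transferred

def pushdown_filter(rows, predicate):
    matches = [r for r in rows if predicate(r)]  # filter runs near the data
    return matches, len(matches) * ROW_BYTES     # only matches cross the interconnect

if __name__ == "__main__":
    table = [{"id": i, "temp": i % 100} for i in range(10_000)]
    hot = lambda r: r["temp"] > 95
    _, host_bytes = host_side_filter(table, hot)
    _, ndp_bytes = pushdown_filter(table, hot)
    print(f"host-side transfer: {host_bytes} B, pushdown transfer: {ndp_bytes} B")
```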
SPADES: A Rapid Compilation Flow for Versal FPGAs
Speaker: Tan Nguyen
Affiliation: UC Berkeley
Date: Feb. 8, 2023
-
Abstract: With the increasing complexity and heterogeneity of modern FPGA fabrics, the conventional “flat” compilation flow relying on the standard tools, from synthesis and implementation to bitstream generation, has become more arduous than ever. This leads to an inordinate turnaround time that severely impacts the productivity of application developers in quest of design space exploration. We propose an open-source tool flow built around a customizable overlay of Spatially Distributed Socket Engines (SPADES) to address the FPGA productivity issue. SPADES organizes the computation and communication of an application in a coarse-grained reconfigurable array of socket engines. To address the compilation time issue, we employ the hardened Network-on-Chip present in Versal, a novel commercial FPGA architecture from AMD, to alleviate the inter-socket routing task, and we cater to socket netlist reusability by utilizing regular and repeatable fabric regions. Our tool flow achieves compilation times up to 10x shorter than the standard, top-down AMD Vitis flow (from several hours down to minutes) with comparable quality of results on some benchmarks targeting the AMD Versal VCK5000 data center card.
AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis
Speaker: Greg Jun
Affiliation: UIUC
Date: Jan. 11, 2023
-
Abstract: High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from a behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large, complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformations, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS, which is based on the multi-level compiler infrastructure (MLIR). Finally, we identify a problem that limits the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by up to 59x. We additionally demonstrate the scalability of our approach by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12x on benchmarks from the MachSuite and Rodinia suites.
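A sketch of how such a classifier can prune the design space before invoking HLS is shown below. The features and the validity rule are made up for illustration; the actual engine derives its features from the code's graph structure and labels from prior synthesis runs.

```python
# Hedged sketch: using a Random Forest to reject likely-invalid design points so
# the HLS compiler is only invoked on promising candidates (features hypothetical).
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design-point features: [unroll_factor, pipeline_II, array_partition_factor]
X_train = rng.integers(1, 32, size=(200, 3))
# Hypothetical label from past synthesis runs: 1 = synthesized and met the resource budget.
y_train = ((X_train[:, 0] * X_train[:, 2]) < 256).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

candidates = rng.integers(1, 32, size=(1000, 3))
keep = candidates[clf.predict(candidates) == 1]
print(f"{len(keep)} of {len(candidates)} candidates forwarded to the HLS compiler")
```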
Generic and automated graph neural network acceleration
Speaker: Prof. Cong (Callie) Hao
Affiliation: Georgia Tech
Date: Dec. 14, 2022
-
Abstract: Graph neural networks (GNNs) are becoming increasingly important in many applications such as social science, natural science, and autonomous driving. Driven by real-time inference requirements, GNN acceleration has become a key research topic. Given the highly diverse GNN model types, such as graph convolutional networks, graph attention networks, and graph isomorphism networks, with arbitrary aggregation methods and edge attributes, designing a generic GNN accelerator is challenging. In this talk, we discuss our proposed generic and efficient GNN accelerator, called FlowGNN, which can easily accommodate a wide range of GNN types. Without losing generality, FlowGNN can outperform CPUs and GPUs by up to 400 times. In addition, we discuss an open-source automation flow, GNNBuilder, which allows users to design their own GNNs in PyTorch and then automatically generates the accelerator code targeting FPGAs.
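The generality challenge stems from the fact that GNN variants mostly differ in how messages are computed and aggregated over edges. The toy sketch below shows that shared message-passing skeleton with pluggable message and aggregation functions; it is purely illustrative and not FlowGNN's hardware architecture.

```python
# Toy message-passing skeleton showing why many GNN variants can share one engine:
# they differ mostly in the message and aggregation functions (illustrative only).
import numpy as np

def propagate(edges, feats, message_fn, aggregate_fn):
    """edges: list of (src, dst); feats: [num_nodes, dim] node features."""
    inbox = {v: [] for v in range(len(feats))}
    for src, dst in edges:
        inbox[dst].append(message_fn(feats[src], feats[dst]))
    return np.stack([aggregate_fn(inbox[v]) if inbox[v] else feats[v]
                     for v in range(len(feats))])

feats = np.arange(8, dtype=float).reshape(4, 2)
edges = [(0, 2), (1, 2), (2, 3)]

# GCN-like: identity message, mean aggregation.
gcn_out = propagate(edges, feats, lambda s, d: s, lambda msgs: np.mean(msgs, axis=0))
# GIN-like: identity message, sum aggregation.
gin_out = propagate(edges, feats, lambda s, d: s, lambda msgs: np.sum(msgs, axis=0))
print(gcn_out[2], gin_out[2])   # node 2 aggregates from nodes 0 and 1
```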
UniNET: Accelerating Tenant Container Network in IaaS Cloud
Speaker: Jeff Ma and Bill Dai
Affiliation: UIUC
Date: Nov. 9, 2022
-
Abstract: The current IaaS cloud networking stack adopts a layered approach to grant cloud tenants virtualized networks. Cloud providers usually offer a software-defined, VM-level network by subnetting or overlaying on top of the hardware fabric. In contrast, in a containerized scheme, the cloud tenants in each VM maintain the virtual network at the container level using their own methods. Unfortunately, such methods are often implemented in software and create significant overheads that degrade pod-to-pod, or process-to-process, network throughput and thus increase the overall tenant application execution time. We closely studied typical cloud-provider and cloud-tenant network solutions and examined the critical network operations in existing solutions that degrade performance. As a result, we propose UniNET, a SmartNIC-based, holistic network solution providing both VM-level and container-level network virtualization that can simultaneously achieve high throughput and low latency. To address fundamental cloud concerns, we also design UniNET with security and scalability guarantees in mind.
ThymesisFlow with Ethernet Interface
Speaker: Haiyang Zhang
Affiliation: UIUC
Date: Oct. 12, 2022
-
Abstract: Memory disaggregation, a technology that decouples memory resources from the local host and allows applications to utilize a memory pool spanning both local and remote servers, has been gaining attention for its potential to increase resource utilization and energy efficiency in data centers. ThymesisFlow, originally developed by IBM, is one of the solutions that implement software-defined disaggregated memory with FPGAs. It is based on OpenCAPI and deployed on the POWER9 system with a HW/SW co-designed data path connecting the compute node and the memory-stealing node. The original prototype has limited scalability because it implements the data link with the Aurora protocol: nodes have to be directly connected point-to-point to establish communication. Our new design addresses this problem by implementing the network interface with the Ethernet protocol, where each node is identified by a unique MAC address, allowing a much more flexible network topology. Benchmarks show that the performance degradation caused by the introduction of the Ethernet interface is around 20%, a reasonable loss considering the added frame headers and the more complex data path for frame processing. We are also planning to implement a lightweight TCP-like protocol that provides resilience and congestion control over unreliable networks.
Machine Learning Acceleration Through Algorithm and Hardware Co-design
Speaker: Prof. Caiwen Ding
Affiliation: University of Connecticut
Date: Sept. 14, 2022
-
Abstract: Machine learning-based statistical models are increasingly challenging mainstream computing platforms, from high-performance computers to low-end embedded systems, for both training and inference. In addition, the rapid deployment of ML systems has been accompanied by emerging privacy and security concerns. To achieve high performance and high energy efficiency, two research trends have attracted enormous interest, i.e., model compression and hardware acceleration. In this talk, we will discuss the current challenges and recent advances in efficient machine learning. We will present several machine learning acceleration works based on algorithm-hardware co-design, using various computing platforms such as FPGAs, MCUs, GPUs, and ReRAM. We will also discuss the challenges and recent advances in FPGA-based privacy-preserving ML implementations, which often come with very high computation and communication overhead that can limit their adoption.
Security Analysis of Complex Cyber-Physical Systems
Speaker: Prof. Kirill Levchenko
Affiliation: UIUC
Date: Aug. 17, 2022
-
Abstract: A Cyber-Physical System (CPS) is an embedded computer system that interacts with or controls a physical process. In many cases, the correct operation of such systems is critical to safety, whether it be a car, an aircraft, or an industrial plant. At the same time, many cyber-physical systems are complex and highly interconnected, exposing a broad attack surface that an attacker can use to undermine their correct operation. In this talk, I will describe the challenges and recent advances in analyzing the security of cyber-physical systems. Specifically, I will describe Jetset, a system we developed that allows an analyst to boot an embedded system’s firmware in a general-purpose CPU emulator (e.g., QEMU) without the need to simulate or model the rest of the system. I will also touch on the challenges posed by FPGAs, which are now a common component of cyber-physical systems, and discuss several future threads of research.
Making Sparse Run Fast on FPGAs
Speaker: Yixiao Du
Affiliation: Cornell University
Date: Jun. 8, 2022
-
Abstract: Sparse processing, especially graph processing, is typically memory-bound due to its low compute-to-memory-access ratio and irregular compute patterns. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth and is being adopted on FPGAs, bringing the potential to significantly boost the performance of sparse processing and relieve the programming burden. We first present HiSparse, an accelerator for sparse-matrix dense-vector multiplication (SpMV) targeting HBM-equipped FPGAs. We use SpMV as a case study since it is widely used and exhibits common characteristics of sparse processing. We illustrate approaches to tackle the memory-bound and irregularity challenges, built on the ideas of sparse-format-accelerator-architecture co-design and dynamic execution. Going beyond SpMV to domain-specific sparse processing, we propose GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming abstraction, which formulates graph algorithms as sparse linear algebra operations on different semirings. In GraphLily, different semirings share the same FPGA bitstream and are run-time configurable. GraphLily further builds a middleware to enable easy porting of existing GraphBLAS programs, requiring only slight modifications to the original code intended for CPU/GPU execution. The evaluation shows that, compared with state-of-the-art sparse processing frameworks on CPUs and GPUs, HiSparse and GraphLily deliver promising speedup with increased bandwidth and energy efficiency; HiSparse and GraphLily also achieve higher throughput than prior work on FPGA-based sparse processing.
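The key GraphBLAS insight is that many graph algorithms are SpMV on a different semiring. The toy sketch below runs the same CSR SpMV kernel under the ordinary arithmetic semiring and under the (min, +) tropical semiring used for shortest-path relaxation; it is illustrative only and unrelated to the overlay's actual runtime configuration mechanism.

```python
# Toy CSR SpMV parameterized by a semiring, the abstraction GraphLily builds on:
# swapping (add, mul, zero) turns the same kernel into a different graph step.
INF = float("inf")

def spmv(indptr, indices, data, x, add, mul, zero):
    y = []
    for row in range(len(indptr) - 1):
        acc = zero
        for k in range(indptr[row], indptr[row + 1]):
            acc = add(acc, mul(data[k], x[indices[k]]))
        y.append(acc)
    return y

# 3x3 sparse matrix in CSR form (also readable as an edge-weight matrix).
indptr, indices, data = [0, 2, 3, 4], [0, 2, 1, 0], [1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0]

arith = spmv(indptr, indices, data, x, add=lambda a, b: a + b,
             mul=lambda a, b: a * b, zero=0.0)          # ordinary SpMV
tropical = spmv(indptr, indices, data, x, add=min,
                mul=lambda a, b: a + b, zero=INF)       # one relaxation step of SSSP
print(arith, tropical)
```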
Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems for Discrete FPGA-enabled Accelerators
Speaker: Eddie Richter
Affiliation: UIUC
Date: May 11, 2022
-
Abstract: While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. Several SVM implementations have been proposed, but systematically studying the cost and benefit of each implementation is difficult as the SVM design space is not clearly defined and there is no open and flexible system to explore the tradeoffs of different SVM implementations. In this work, we provide a categorization of the SVM design space to understand differences between SVM implementations, and how design decisions impact performance, flexibility, and resource utilization. To this end, we present Qilin, an open and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation, to understand how design decisions of the SVM system impact performance. For example, using Qilin we show that utilizing local page-table walkers and the host IOMMU are 4.36x and 3.16x faster respectively than performing translations in software. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.
Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization
Speaker: Paul Reckamp
Affiliation: UIUC
Date: Apr. 13, 2022
-
Abstract: As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. In this presentation, we present Nimblock, a scheduling technique for fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time-multiplex and space-multiplex the virtualized FPGA. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores preemption as a scheduling parameter to dynamically change resource allocations. In our exploration, we evaluate five scheduling algorithms: a baseline, three existing algorithms, and our novel Nimblock algorithm. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA and evaluating on a set of real-world benchmarks. In our results, we achieve up to 9x lower median response times when compared to the baseline scheduling algorithms. We additionally demonstrate up to 21% fewer deadline violations and up to 2.1x lower tail response times when compared to other high-performance algorithms. We close the presentation with a discussion of extending Nimblock to Versal devices.
SmartNIC Benchmark Suite for Cloud Applications
Speaker: Yuan (Jeff) Ma, Scott Smith, Eddie Richter
Affiliation: UIUC
Date: Mar. 9, 2022
-
Abstract: The line rate of datacenter networks has increased from 10Gbps to 400Gbps over the past 10 years. Simultaneously, networks are becoming increasingly software-defined. SmartNIC devices are emerging in response to these two trends; they extend the functionality of a standard NIC by offering programmability. SmartNICs enable offloading of infrastructure- and user-level applications to the NIC, which saves host CPU cycles, bypasses expensive host operating system overheads, and accelerates network-related tasks. SmartNICs come in different architectures with different programming models, both of which can significantly affect system performance. The need to quantify SmartNICs’ impact in the cloud motivates the design of a benchmark suite that analyzes the performance of a given SmartNIC. In this presentation, we will discuss our approach toward developing such a SmartNIC benchmark system. The benchmark suite is broken down into three stages: (1) collect, implement, and profile representative infrastructure- and user-level network functions, (2) accumulate the testing scenarios and design a benchmark system that profiles a SmartNIC in an emulated cloud context, and (3) scale out the benchmark system by implementing a full SmartNIC-enabled cloud simulation platform on top of the open-source FireSim project.
Morpheus: A Polymorphous Design for General-purpose Code Acceleration
Speaker: Dong Kai Wang
Affiliation: UIUC
Date: Feb. 9, 2022
-
Abstract: While domain specific accelerators have been on the rise in recent years, there is a lack of a unified and transparent solution to seamlessly accelerate general applications. We propose Morpheus, a reconfigurable architecture that leverages the CPU’s microarchitectural structures to dynamically build spatial accelerators during program execution. We introduce hardware components capable of abstracting and scheduling program instructions to construct a dataflow graph (DFG) that is then mapped to FPGA/CGRA-like reconfigurable backends. Under this abstraction, Morpheus is not merely an efficient execution backend, it monitors and has command over its own architecture. By observing real-time execution behavior through activity counters, Morpheus has the capacity to dynamically tune its architecture to further adapt to the application. Morpheus provides a flexible acceleration platform that offers ISA compatibility and ease of use, enables self-managed reconfigurable computing, and maintains transparency to software.
FPGAs in the Open Cloud Testbed and Applications
Speaker: Suranga Handagala
Affiliation: Northeastern University
Date: Jan. 13, 2022
-
Abstract: The Open Cloud Testbed (OCT) is an NSF-funded community research infrastructure project aimed at cloud researchers and users of national testbeds. Users have complete bare-metal access to servers with Alveo U280 FPGAs through CloudLab and can choose which OS to install, which version of the tools to use, and so on. The FPGAs in OCT have dual 100Gb Ethernet connections to a network switch, allowing researchers to experiment with FPGAs directly connected to the network and supporting distributed and scalable applications. In this talk, I will present the OCT setup and discuss some applications we have that make use of the FPGAs directly connected to the network, including a MobileNet implementation, developed using FINN, that is split across two FPGAs.
TwinDNN: A Tale of Two Deep Neural Networks
Speaker: Paul Jeong
Affiliation: UIUC
Date: Jan. 13, 2022
-
Abstract: Compression technologies for deep neural networks (DNNs) have been widely investigated to reduce model size so that models can be implemented on hardware with strict resource restrictions. However, one major downside of model compression is accuracy degradation. To deal with this problem effectively, we propose a new compressed-network inference scheme that couples a slower, high-accuracy DNN with a highly compressed version of it that has lower accuracy. We demonstrate our design on two image classification tasks: CIFAR-10 and ImageNet. Our experiments show that our design can recover up to 94% of the accuracy drop caused by extreme network compression, with more than 90% speedup compared to just using the original DNN.
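The underlying mechanism is a cascade: run the compressed network first and fall back to the accurate network only when the fast prediction looks uncertain. A toy sketch of that dispatch logic is below; the models and confidence threshold are hypothetical stand-ins, whereas the actual TwinDNN design runs both networks on hardware.

```python
# Toy two-network cascade in the spirit of TwinDNN (illustrative only): trust the
# fast, compressed model when it is confident; otherwise use the accurate model.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cascade_predict(x, fast_model, accurate_model, conf_threshold: float = 0.9):
    probs = softmax(fast_model(x))
    if probs.max() >= conf_threshold:
        return int(probs.argmax()), "fast"
    return int(softmax(accurate_model(x)).argmax()), "accurate"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for the compressed and original classifiers.
    fast_model = lambda x: rng.normal(size=10) * 0.5      # low-confidence logits
    accurate_model = lambda x: np.eye(10)[3] * 10.0       # confidently predicts class 3
    label, which = cascade_predict(np.zeros(8), fast_model, accurate_model)
    print(f"predicted class {label} using the {which} model")
```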
HiKonv: High Throughput Quantized Convolution With Novel Bit-wise Management and Computation
Speaker: Xinheng Liu
Affiliation: UIUC
Date: Nov. 10, 2021
-
Abstract: We propose HiKonv, a unified solution that maximizes the compute throughput of given underlying hardware to process low-bitwidth quantized data inputs through novel bit-wise parallel computation. We establish theoretical performance boundaries for using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution and demonstrate new breakthroughs for high-performance computing in this important domain. For example, a single 32-bit processing unit can deliver 128 binarized convolution operations (multiplications and additions) using one CPU instruction, and a single 27×18 DSP core can deliver 8 convolution operations in one cycle when the input bitwidth is 4.
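The core trick can be demonstrated with plain integer arithmetic: pack several low-bitwidth operands into one wide word with enough guard bits so that a single full-width multiplication produces several partial products at non-overlapping bit positions. The sketch below shows a minimal unsigned 4-bit example with field widths chosen for clarity; HiKonv's actual packing also handles signed values and accumulation across guard bits.

```python
# Minimal demonstration of the bit-packing idea behind HiKonv (unsigned 4-bit
# example for clarity): one wide multiply produces two independent products.
FIELD = 8           # each 4b x 4b product fits in 8 bits, so use 8-bit fields

def packed_dual_multiply(a0: int, a1: int, b: int) -> tuple[int, int]:
    """Compute (a0*b, a1*b) with a single wide multiplication."""
    assert all(0 <= v < 16 for v in (a0, a1, b))   # 4-bit unsigned operands
    packed_a = a0 | (a1 << FIELD)                  # a1 sits above a0 with guard room
    wide = packed_a * b                            # ONE multiplication
    p0 = wide & ((1 << FIELD) - 1)                 # low field  -> a0*b
    p1 = (wide >> FIELD) & ((1 << FIELD) - 1)      # high field -> a1*b
    return p0, p1

if __name__ == "__main__":
    print(packed_dual_multiply(7, 13, 9))   # -> (63, 117), i.e. 7*9 and 13*9
```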
Investigation of ML Application Acceleration with Vitis-AI
Speaker: Katherine Yun
Affiliation: UIUC
Date: Nov. 10, 2021
-
Abstract: Vitis-AI is a development platform for inferencing machine learning applications on Xilinx hardware platforms. The toolchain supports mainstream ML frameworks and popular models for various types of applications. We are interested in investigating the potential of Vitis-AI for porting custom ML applications to cloud FPGA platforms. Based on the workflow for model deployment, we will discuss Vitis-AI’s limitations as well as its integration with other frameworks and services.