MESA: Microarchitecture Extensions for Spatial Architecture Generation
    Speaker: Dong Kai Wang
    Affiliation: UIUC
    Date: Sept. 13, 2023
  • Abstract: Modern heterogeneous CPUs incorporate various hardware accelerators to achieve improved efficiency. A well-known class among them, spatial accelerators, is designed with reconfigurability to accelerate a wide range of applications. However, these accelerators tend to require specialized compilers and software stacks, libraries, or languages to operate and cannot be utilized with ease by all applications. As a result, the accelerator’s resources sit wastefully idle when it is not explicitly programmed. Our goal is to dismantle this CPU-accelerator barrier by monitoring CPU threads for acceleration opportunities during execution and, if viable, dynamically reconfiguring the accelerator to allow transparent offloading. We develop MESA, a hardware controller on the CPU that translates machine code to build an accelerator configuration tailored to the running program. While such a dynamic translation/reconfiguration approach is challenging, it has a key advantage over ahead-of-time compilers: access to runtime information, revealing not only dynamic dependencies but also performance characteristics. MESA maintains a real-time performance model of the program mapped on the accelerator in the form of a spatial dataflow graph with nodes weighted by operation latency and edges weighted by data transfer latency. Features of this dataflow graph are continuously updated with runtime information captured by performance counters, allowing a feedback loop of optimization, reconfiguration, and acceleration. We evaluate the feasibility of our solution with different accelerator configurations. Across the Rodinia benchmarks, results demonstrate an average 1.3x speedup in performance and 1.8x gain in energy efficiency against a multicore baseline.
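
    The weighted dataflow graph described in this abstract can be illustrated with a small sketch. This is not MESA's hardware logic; it is a minimal Python illustration, assuming the quantity of interest is the critical-path latency of an acyclic dataflow graph. The function name critical_path_latency, the node names, and every latency value are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_latency(node_latency, edge_latency):
    """Return the longest-latency path through a dataflow DAG.

    node_latency: {node: operation latency in cycles}
    edge_latency: {(src, dst): data transfer latency in cycles}
    """
    # Build a predecessor map, the input form TopologicalSorter expects.
    preds = {node: [] for node in node_latency}
    for src, dst in edge_latency:
        preds[dst].append(src)

    finish = {}
    for node in TopologicalSorter(preds).static_order():
        # A node can start once every input has been produced and transferred.
        start = max(
            (finish[p] + edge_latency[(p, node)] for p in preds[node]),
            default=0,
        )
        finish[node] = start + node_latency[node]
    return max(finish.values())


# Hypothetical four-operation kernel: two loads feed a multiply, then an add.
node_latency = {"ld_a": 4, "ld_b": 4, "mul": 3, "add": 1}
edge_latency = {("ld_a", "mul"): 1, ("ld_b", "mul"): 2, ("mul", "add"): 1}
print(critical_path_latency(node_latency, edge_latency))  # 11

# Feedback step: a counter reports that ld_b is actually slow (assumed value).
node_latency["ld_b"] = 20
print(critical_path_latency(node_latency, edge_latency))  # 27
```

    Re-evaluating the model after a weight update, as in the last two lines, is one simple way to picture the feedback loop: the critical path moves to the slow load, which is where a remapping effort would then be directed.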


Scheduling and Serverless Computing with Multi-Tenant FPGAs
    Speaker: Meghna Mandava
    Affiliation: UIUC
    Date: May 10, 2023
  • Abstract: We introduce Nimblock for multi-tenant FPGA sharing. Nimblock explores scheduling possibilities to effectively time- and space-multiplex reconfigurable slots on a virtualized FPGA. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. We demonstrate system feasibility on a Xilinx ZCU106 FPGA and evaluate on a set of real-world benchmarks. We achieve up to 5.7x lower average response times when compared to a no-sharing scheduling algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing. We also demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms. This work will appear at ISCA’23.

    We then use Nimblock to enable the use of FPGAs in serverless computing frameworks. We present Nimblock 2.0 to integrate virtualized multi-tenant FPGAs into a serverless platform. Our evaluation of the Nimblock 2.0 heterogeneous serverless computing model results in an average overhead of only 13% over a bare-metal FPGA implementation. Leveraging a heterogeneous serverless cluster with both CPU and FPGA compute resources can provide up to a 35% performance improvement compared to FPGA-only serverless computing.


Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems
    Speaker: Prof. Jishen Zhao
    Affiliation: UCSD
    Date: Apr. 12, 2023
  • Abstract: Modern deep learning (DL) training is memory-consuming, constrained by the memory capacity of each computation component and cross-device communication bandwidth. In response to such constraints, current approaches include increasing parallelism in distributed training and optimizing inter-device communication. However, model parameter communication is becoming a key performance bottleneck in distributed DL training. This talk will introduce our recent design COARSE, which is a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols and MPI-like collective communication for synchronization, to allow low-latency and parallel access to training data and model parameters shared among worker GPUs. To enable high bandwidth transfers between GPUs and the disaggregated memory system, we propose a decentralized parameter communication scheme to decouple and localize parameter synchronization traffic. Furthermore, we propose dynamic tensor routing and partitioning to fully utilize the non-uniform serial bus bandwidth varied across different cloud computing systems. Finally, we design deadlock avoidance and dual synchronization to ensure high-performance parameter synchronization. We implement a disaggregated memory prototype based on an industrial CCI protocol using two FPGAs: a Xilinx KCU1500 and a BittWare 250-SoC are interconnected with one QSFP cable to allow the ARM cores on the 250-SoC to access the shared memory pool through CCI. Our evaluation shows that COARSE achieves up to 48.3% faster DL training compared to the state-of-the-art MPI AllReduce communication.


VersatileSSD: Breaking I/O Barrier by Leveraging SmartSSD
    Speaker: Ipoom Jeong
    Affiliation: UIUC
    Date: Mar. 8, 2023
  • Abstract: Modern server systems are facing the challenge of meeting the increasingly demanding performance requirements of applications that process massive amounts of data, such as databases, machine learning, and data analytics. Solid-state drives (SSDs) have grown in popularity due to their ability to significantly reduce the time required for transferring huge amounts of data, resulting in a shift in the system bottleneck from data transfer to interconnect bandwidth and operating system overhead. However, because PCIe places limitations on simultaneous I/O device access, the scalability of the system is limited when multiple I/O devices are servicing independent contexts of I/O operations. This issue, along with the long-latency interconnect bus data transfer, has led to the development of SmartSSD, which employs near-data processing (NDP) to move computations closer to the location where the data is stored. This presentation outlines our vision for leveraging SmartSSD to maximize its computing potential and versatility in the context of NDP. Specifically, we will discuss two approaches that we are currently exploring: 1) Universal predicate pushdown and 2) OS page cache expander. We will present the high-level concepts behind these approaches and outline our development plan, which we believe will pave the way for integrating SmartSSD into mainstream server systems.


SPADES: A Rapid Compilation Flow for Versal FPGAs
    Speaker: Tan Nguyen
    Affiliation: UC Berkeley
    Date: Feb. 8, 2023
  • Abstract: With the increasing complexity and heterogeneity of modern FPGA fabrics, the conventional “flat” compilation flow relying on the standard tools, from Synthesis and Implementation to Bitstream Generation, has become more arduous than ever. This leads to an inordinate turn-around time, which severely impacts the productivity of application developers pursuing design space exploration. We propose an open-source tool flow built around a customizable overlay of Spatially Distributed Socket Engines (SPADES) to address the FPGA productivity issue. SPADES organizes the computation and communication of an application in a Coarse-grained Reconfigurable Array of socket engines. To address the compilation time issue, we employ the hardened Network-on-Chip present in Versal, a novel commercial FPGA architecture from AMD, to alleviate the inter-socket routing task, and we cater to socket netlist reusability by utilizing regular and repeatable fabric regions. Our tool flow achieves up to 10x shorter compilation time than the standard, top-down AMD Vitis flow (from several hours down to minutes) with comparable Quality-of-Result on some benchmarks targeting the AMD Versal VCK5000 data center card.


AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis
    Speaker: Greg Jun
    Affiliation: UIUC
    Date: Jan. 11, 2023
  • Abstract: High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify a problem that limits the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59x. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12x for the benchmarks in the MachSuite and Rodinia set.


Generic and automated graph neural network acceleration
    Speaker: Prof. Cong (Callie) Hao
    Affiliation: Georgia Tech
    Date: Dec. 14, 2022
  • Abstract: Graph neural networks (GNNs) are becoming increasingly important in many applications such as social science, natural science, and autonomous driving. Driven by real-time inference requirements, GNN acceleration has become a key research topic. Given the largely diverse GNN model types, such as graph convolution networks, graph attention networks, and graph isomorphism networks, with arbitrary aggregation methods and edge attributes, designing a generic GNN accelerator is challenging. In this talk, we discuss our proposed generic and efficient GNN accelerator, called FlowGNN, which can easily accommodate a wide range of GNN types. Without losing generality, FlowGNN can outperform CPU and GPU by up to 400 times. In addition, we discuss an open-source automation flow, GNNBuilder, which allows users to design their own GNNs in PyTorch and then automatically generates the accelerator code targeting FPGA.


UniNET: Accelerating Tenant Container Network in IaaS Cloud
    Speaker: Jeff Ma and Bill Dai
    Affiliation: UIUC
    Date: Nov. 9, 2022
  • Abstract: The current IaaS cloud networking stack adopts a layered approach to grant cloud tenants virtualized networks. Cloud providers usually offer a software-defined, VM-level network by sub-netting or overlaying on top of the hardware fabrics. In contrast, in a containerized scheme, the cloud tenants in each VM maintain the virtual network at the container level using their own methods. Unfortunately, such methods are often implemented in software and create significant overheads that degrade the pod-to-pod, or process-to-process, network throughput and thus increase the overall tenant application execution time. We closely studied typical cloud-provider and cloud-tenant network solutions and examined the critical network operations in existing network solutions that decrease performance. As a result, we propose UniNET, a SmartNIC-based, holistic network solution providing both VM-level and container-level network virtualization that can simultaneously achieve high throughput and reduce latency. To meet fundamental cloud concerns, we also design UniNET with security and scalability guarantees in mind.


ThymesisFlow with Ethernet Interface
    Speaker: Haiyang Zhang
    Affiliation: UIUC
    Date: Oct. 12, 2022
  • Abstract: Memory disaggregation, the technology that decouples memory resources from their local host and allows applications to utilize a memory pool from both local and remote servers, has been gaining attention for its potential to increase resource utilization and energy efficiency in data centers. ThymesisFlow, originally developed by IBM, is one of the solutions that implements software-defined disaggregated memory with FPGAs. It is based on OpenCAPI and deployed on the POWER9 system with a HW/SW co-designed data path connecting the compute node and the memory-stealing node. The original prototype has limited scalability because its data link is implemented with the Aurora protocol: nodes have to be directly connected point-to-point to establish communications. Our new design seeks to address this problem by implementing the network interface with the Ethernet protocol, where each node is identified by a unique MAC address, allowing a much more flexible network topology. Benchmarks show that the performance degradation caused by the introduction of the Ethernet interface is around 20%, a reasonable loss considering the addition of the frame header and the more complex data path for frame processing. We are also planning to implement a lightweight TCP-like protocol that provides resilience and congestion control over unreliable networks.


Machine Learning Acceleration Through Algorithm and Hardware Co-design
    Speaker: Prof. Caiwen Ding
    Affiliation: University of Connecticut
    Date: Sept. 14, 2022
  • Abstract: Machine learning based statistical models are increasingly challenging the mainstream computing platforms, from high-performance computers to low-end embedded systems, for both training and inference. In addition, the rapid deployment of ML systems has raised emerging privacy and security concerns. To achieve high performance and high energy efficiency, two research trends have attracted enormous interest, i.e., model compression and hardware acceleration. In this talk, we will discuss the current challenges and recent advances in efficient machine learning. We will present several machine learning acceleration works through algorithm-hardware codesign, using various computing platforms such as FPGA, MCU, GPU, and ReRAM. We will also discuss the challenges and recent advances in FPGA-based privacy-preserving ML implementation, which often comes with very high computation and communication overhead that can hinder ML adoption.


Security Analysis of Complex Cyber-Physical Systems
    Speaker: Prof. Kirill Levchenko
    Affiliation: UIUC
    Date: Aug. 17, 2022
  • Abstract: A Cyber-Physical System (CPS) is an embedded computer system that interacts with or controls a physical process. In many cases, the correct operation of such systems is critical to safety, whether it be a car, aircraft, or industrial plant. At the same time, many cyber-physical systems are complex and highly interconnected, exposing a broad attack surface that an attacker can use to undermine their correct operation. In this talk, I will describe the challenges and recent advances in analyzing the security of cyber-physical systems. Specifically, I will describe Jetset, a system we developed that allows an analyst to boot an embedded system’s firmware in a general-purpose CPU emulator (e.g., QEMU) without the need to simulate or model the rest of the system. I will also touch on the challenges posed by FPGAs that are now a common component of cyber-physical systems and discuss several future threads of research.


Making Sparse Run Fast on FPGAs
    Speaker: Yixiao Du
    Affiliation: Cornell University
    Date: Jun. 8, 2022
  • Abstract: Sparse processing, especially graph processing, is typically memory-bound due to low compute to memory access ratio and irregular compute patterns. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth and is adopted on FPGAs, which brings the potential to significantly boost the performance of sparse processing and relieve the programming burden. We first present HiSparse, an accelerator for sparse-matrix dense-vector multiplication (SpMV) targeting HBM-equipped FPGAs. HiSparse performs a case study on SpMV since it is widely used and exhibits common characteristics of sparse processing. We illustrate approaches to tackle the memory-bound and irregularity challenges, with the ideas of sparse-format-accelerator-architecture co-design and dynamic execution. Going beyond SpMV to domain-specific sparse processing, we propose GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming abstraction, which formulates graph algorithms as sparse linear algebra operations on different semirings. In GraphLily, different semirings share the same FPGA bitstream and are run-time configurable. GraphLily further builds a middleware to enable easy porting of existing GraphBLAS programs, requiring slight modifications to the original code intended for CPU/GPU execution. The evaluation shows that compared with state-of-the-art sparse processing frameworks on CPUs and GPUs, HiSparse and GraphLily deliver promising speedup with increased bandwidth and energy efficiency; HiSparse and GraphLily also achieve higher throughput compared with prior work on FPGA-based sparse processing.
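
    The GraphBLAS idea mentioned in this abstract, one sparse kernel reused under different semirings, can be sketched in a few lines. This is a plain-Python illustration only and says nothing about how HiSparse or GraphLily are built in hardware; the Semiring class, the spmv function, and the example graph are hypothetical names and data chosen for this sketch.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    add: Callable[[float, float], float]  # reduction across a row
    mul: Callable[[float, float], float]  # combine matrix entry with vector entry
    zero: float                           # identity element of `add`


ARITHMETIC = Semiring(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0.0)
MIN_PLUS = Semiring(add=min, mul=lambda a, b: a + b, zero=math.inf)


def spmv(rows, x, sr):
    """Sparse-matrix dense-vector product over semiring `sr`.

    rows[i] lists the (column, value) nonzeros of row i.
    """
    y = []
    for row in rows:
        acc = sr.zero
        for col, val in row:
            acc = sr.add(acc, sr.mul(val, x[col]))
        y.append(acc)
    return y


# Hypothetical 3-vertex graph: rows[i] holds (j, weight of edge j -> i).
# The zero-weight diagonal lets each vertex keep its current distance.
rows = [
    [(0, 0.0)],
    [(0, 2.0), (1, 0.0)],
    [(0, 5.0), (1, 1.0), (2, 0.0)],
]

dist = [0.0, math.inf, math.inf]        # shortest-path distances from vertex 0
dist = spmv(rows, dist, MIN_PLUS)       # [0.0, 2.0, 5.0]
dist = spmv(rows, dist, MIN_PLUS)       # [0.0, 2.0, 3.0]
print(dist)

print(spmv(rows, [1.0, 1.0, 1.0], ARITHMETIC))  # [0.0, 2.0, 6.0]
```

    The same spmv loop performs a shortest-path relaxation under min-plus and an ordinary weighted sum under the arithmetic semiring; only the two operators and the identity element change, which is the property that makes a run-time configurable kernel plausible.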


Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems for Discrete FPGA-enabled Accelerators
    Speaker: Eddie Richter
    Affiliation: UIUC
    Date: May 11, 2022
  • Abstract: While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. Several SVM implementations have been proposed, but systematically studying the cost and benefit of each implementation is difficult as the SVM design space is not clearly defined and there is no open and flexible system to explore the tradeoffs of different SVM implementations. In this work, we provide a categorization of the SVM design space to understand differences between SVM implementations, and how design decisions impact performance, flexibility, and resource utilization. To this end, we present Qilin, an open and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation, to understand how design decisions of the SVM system impact performance. For example, using Qilin we show that utilizing local page-table walkers and the host IOMMU are 4.36x and 3.16x faster respectively than performing translations in software. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.


Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization
    Speaker: Paul Reckamp
    Affiliation: UIUC
    Date: Apr. 13, 2022
  • Abstract: As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. In this presentation, we present Nimblock, a scheduling technique for fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time-multiplex and space-multiplex the virtualized FPGA. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores preemption as a scheduling parameter to dynamically change resource allocations. In our exploration, we evaluate five scheduling algorithms: a baseline, three existing algorithms, and our novel Nimblock algorithm. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA and evaluating on a set of real-world benchmarks. In our results, we achieve up to 9x lower median response times when compared to the baseline scheduling algorithms. We additionally demonstrate up to 21% fewer deadline violations and up to 2.1x lower tail response times when compared to other high-performance algorithms. We close the presentation with a discussion of extending Nimblock to Versal devices.


SmartNIC Benchmark Suite for Cloud Applications
    Speaker: Yuan (Jeff) Ma, Scott Smith, Eddie Richter
    Affiliation: UIUC
    Date: Mar. 9, 2022
  • Abstract: The line-rate of datacenter networks has increased from 10 Gbps to 400 Gbps over the past 10 years. Simultaneously, networks are becoming increasingly software-defined. In response to these two trends, SmartNIC devices are emerging that extend the functionality of a standard NIC by offering programmability. SmartNICs enable offloading of infrastructure- and user-level applications to the NIC, which saves host CPU cycles, bypasses expensive host operating system overheads, and accelerates network-related tasks. SmartNICs come in different architectures with different programming models, both of which can significantly affect system performance. The need to quantify SmartNICs’ impact in the cloud motivates the design of a benchmark suite that analyzes the performance of a given SmartNIC. In this presentation, we will discuss our approach toward developing such a SmartNIC benchmark system. The benchmark suite is broken down into three stages: (1) collect, implement, and profile representative infrastructure and user-level network functions, (2) accumulate the testing scenarios and design a benchmark system that profiles a SmartNIC in the emulated cloud context, and (3) scale out the benchmark system by implementing a full SmartNIC-enabled cloud simulation platform on top of the open-source FireSim project.


Morpheus: A Polymorphous Design for General-purpose Code Acceleration
    Speaker: Dong Kai Wang
    Affiliation: UIUC
    Date: Feb. 9, 2022
  • Abstract: While domain specific accelerators have been on the rise in recent years, there is a lack of a unified and transparent solution to seamlessly accelerate general applications. We propose Morpheus, a reconfigurable architecture that leverages the CPU’s microarchitectural structures to dynamically build spatial accelerators during program execution. We introduce hardware components capable of abstracting and scheduling program instructions to construct a dataflow graph (DFG) that is then mapped to FPGA/CGRA-like reconfigurable backends. Under this abstraction, Morpheus is not merely an efficient execution backend, it monitors and has command over its own architecture. By observing real-time execution behavior through activity counters, Morpheus has the capacity to dynamically tune its architecture to further adapt to the application. Morpheus provides a flexible acceleration platform that offers ISA compatibility and ease of use, enables self-managed reconfigurable computing, and maintains transparency to software.


FPGAs in the Open Cloud Testbed and Applications
    Speaker: Suranga Handagala
    Affiliation: Northeastern University
    Date: Jan. 13, 2022
  • Abstract: The Open Cloud Testbed (OCT) is an NSF-funded community research infrastructure project aimed at cloud researchers and users of national testbeds. Users have complete bare metal access to servers with Alveo U280 FPGAs through CloudLab, and can choose what OS to install, what version of the tools to use, and so on. The FPGAs in OCT have dual 100Gb Ethernet connections to a network switch, allowing researchers to experiment with FPGAs directly connected to the network, supporting distributed and scalable applications. In this talk I will present the OCT setup and discuss some applications we have that make use of the FPGAs directly connected to the network, including a MobileNet implementation developed using FINN that is split across two FPGAs.


TwinDNN: A Tale of Two Deep Neural Networks
    Speaker: Paul Jeong
    Affiliation: UIUC
    Date: Jan. 13, 2022
  • Abstract: Compression technologies for deep neural networks (DNNs) have been widely investigated to reduce the model size so that they can be implemented on hardware with strict resource restrictions. However, one major downside of model compression is accuracy degradation. To deal with this problem effectively, we propose a new compressed network inference scheme that couples a high-accuracy but slower DNN with its highly compressed, lower-accuracy version. We demonstrate our design on two image classification tasks: CIFAR-10 and ImageNet. Our experiments show that our design can recover up to 94% of the accuracy drop caused by extreme network compression, with more than 90% speedup compared to just using the original DNN.


HiKonv: High Throughput Quantized Convolution With Novel Bit-wise Management and Computation
    Speaker: Xinheng Liu
    Affiliation: UIUC
    Date: Nov. 10, 2021
  • Abstract: We propose HiKonv, a unified solution that maximizes the compute throughput of given underlying hardware to process low-bitwidth quantized data inputs through novel bit-wise parallel computation. We establish theoretical performance boundaries of using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution and demonstrate new breakthroughs for high-performance computing in this important domain. For example, a single 32-bit processing unit can deliver 128 binarized convolution operations (multiplications and additions) using one instruction for CPU, and a single 27×18 DSP core can deliver 8 convolution operations when the input bitwidth is 4, in one cycle.
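
    The idea of one wide multiplier producing several low-bitwidth convolution results can be shown with a worked example. The sketch below is an assumption-laden simplification, not HiKonv's actual scheme: it handles unsigned inputs only, sizes the guard bits in the most straightforward way, and uses hypothetical names (pack, conv1d_by_one_multiply). It relies on the fact that multiplying two packed integers is polynomial multiplication, whose coefficients are the 1-D convolution outputs.

```python
def pack(values, slice_bits):
    """Place each value in its own slice_bits-wide field of one integer."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * slice_bits)
    return word


def conv1d_by_one_multiply(f, g, bits=4):
    """Full 1-D convolution of unsigned `bits`-wide inputs via ONE multiply."""
    limit = 1 << bits
    assert all(0 <= v < limit for v in f), "f must hold unsigned `bits`-bit values"
    assert all(0 <= v < limit for v in g), "g must hold unsigned `bits`-bit values"

    # Each output sums at most min(len(f), len(g)) products; size every slice
    # so that the largest possible sum cannot carry into the next slice.
    max_sum = min(len(f), len(g)) * (limit - 1) ** 2
    slice_bits = max_sum.bit_length()

    product = pack(f, slice_bits) * pack(g, slice_bits)  # the single multiply

    mask = (1 << slice_bits) - 1
    n_out = len(f) + len(g) - 1
    return [(product >> (k * slice_bits)) & mask for k in range(n_out)]


# Six 4-bit multiplications and two additions recovered from one product.
print(conv1d_by_one_multiply([3, 15, 7], [2, 9]))  # [6, 57, 149, 63]
```

    In this example each slice is 9 bits wide, since the largest possible output is 2 × 15 × 15 = 450. Signed inputs and the exact operand layout for a particular multiplier width need additional care that this sketch does not attempt.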


Investigation of ML Application Acceleration with Vitis-AI
    Speaker: Katherine Yun
    Affiliation: UIUC
    Date: Nov. 10, 2021
  • Abstract: Vitis-AI is a development platform for machine learning inference on Xilinx hardware platforms. The toolchain supports mainstream ML frameworks and popular models for various types of applications. We are interested in investigating the potential of Vitis-AI for porting custom ML applications to Cloud FPGA platforms. Based on the workflow for model deployment, we will discuss Vitis-AI’s limitations as well as its integration with other frameworks/services.