Machine Learning Acceleration Through Algorithm and Hardware Co-design
    Speaker: Prof. Caiwen Ding
    Affiliation: University of Connecticut
    Date: Sept. 14, 2022
  • Abstract: Machine learning based statistical models are increasingly challenging the mainstream computing platforms, across high-performance computers to low-end embedded systems for both training and inference. In addition, the rapid deployment of ML systems has witnessed emerging privacy and security concerns. To achieve high performance and high energy efficiency, two research trends have attracted enormous interest, i.e., model compression and hardware acceleration. In this talk, we will discuss the current challenges and recent advances in efficient machine learning. We will present several machine learning acceleration works through algorithm-hardware codesign, using various computing platforms such as FPGA, MCU, GPU, and ReRAM. We will also discuss the challenges and recent advances in FPGA-based privacy-preserving ML implementation, as it often comes at very high computation and communication overhead and potentially prohibit the ML popularity.


Security Analysis of Complex Cyber-Physical Systems
    Speaker: Prof. Kirill Levchenko
    Affiliation: UIUC
    Date: Aug. 17, 2022
  • Abstract: A Cyber-Physical System (CPS) is an embedded computer system that interacts with or controls a physical process. In many cases, the correct operation of such systems is critical to safety, whether it be a car, aircraft, or industrial plant. At the same time many cyber-physical systems are complex and highly interconnected, exposing a broad attack surface that an attacker can use to undermine its correct operation. In this talk, I will describe the challenges and recent advances in analyzing the security of cyber-physical systems. Specifically, I will describe Jetset, a system we developed that allows an analyst to boot an embedded system’s firmware in a general-purpose CPU emulator (e.g., QEMU) without the need to simulate or model the rest of the system. I will also touch on the challenges posed by FPGAs that are now a common component of cyber-physical systems and discuss several future threads of research.


Making Sparse Run Fast on FPGAs
    Speaker: Yixiao Du
    Affiliation: Cornell University
    Date: Jun. 8, 2022
  • Abstract: Sparse processing, especially graph processing, is typically memory-bound due to low compute to memory access ratio and irregular compute patterns. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth and is adopted on FPGAs, which brings the potential to significantly boost the performance of sparse processing and relieve the programming burden. We first present HiSparse, an accelerator for sparse-matrix dense-vector multiplication (SpMV) targeting HBM-equipped FPGAs. HiSparse performs a case study on SpMV since it is widely used and exhibits common characteristics of sparse processing. We illustrate approaches to tackle the memory-bound and irregularity challenges, with the ideas of sparse-format-accelerator-architecture co-design and dynamic execution. Going beyond SpMV to domain-specific sparse processing, we propose GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming abstraction, which formulates graph algorithms as sparse linear algebra operations on different semirings. In GraphLily, different semirings share the same FPGA bitstream and are run-time configurable. GraphLily further builds a middleware to enable easy porting of existing GraphBLAS programs, requiring slight modifications to the original code intended for CPU/GPU execution. The evaluation shows that compared with state-of-the-art sparse processing frameworks on CPUs and GPUs, HiSparse and GraphLily deliver promising speedup with increased bandwidth and energy efficiency; HiSparse and GraphLily also achieve higher throughput compared with prior work on FPGA-based sparse processing.


Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems for Discrete FPGA-enabled Accelerators
    Speaker: Eddie Richter
    Affiliation: UIUC
    Date: May 11, 2022
  • Abstract: While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. Several SVM implementations have been proposed, but systematically studying the cost and benefit of each implementation is difficult as the SVM design space is not clearly defined and there is no open and flexible system to explore the tradeoffs of different SVM implementations. In this work, we provide a categorization of the SVM design space to understand differences between SVM implementations, and how design decisions impact performance, flexibility, and resource utilization. To this end, we present Qilin, an open and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation, to understand how design decisions of the SVM system impact performance. For example, using Qilin we show that utilizing local page-table walkers and the host IOMMU are 4.36x and 3.16x faster respectively than performing translations in software. Qilin also provides application developers a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.


Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization
    Speaker: Paul Reckamp
    Affiliation: UIUC
    Date: Apr. 13, 2022
  • Abstract: As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing.  In this presentation, we present Nimblock, a scheduling technique for fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time-multiplex and space-multiplex the virtualized FPGA. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores preemption as a scheduling parameter to dynamically change resource allocations. In our exploration, we evaluate five scheduling algorithms: a baseline, three existing algorithms, and our novel Nimblock algorithm. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA and evaluating on a set of real-world benchmarks. In our results, we achieve up to 9x lower median response times when compared to the baseline scheduling algorithms.  We additionally demonstrate up to 21% fewer deadline violations and up to 2.1x lower tail response times when compared to other high-performance algorithms. We close the presentation with a discussion of extending Nimblock to Versal devices.


SmartNIC Benchmark Suite for Cloud Applications
    Speaker: Yuan (Jeff) Ma, Scott Smith, Eddie Richter
    Affiliation: UIUC
    Date: Mar. 9, 2022
  • Abstract: The line-rate of datacenter networks has increased from 10Gbps to 400Gbps over the past 10 years. Simultaneously, networks are becoming increasingly software-defined. SmartNIC devices are emerging in response to these two trends, which extend the functionality of a standard NIC by offering programmability. SmartNICs enable offloading of infrastructure- and user-level applications to the NIC, which saves host CPU cycles, bypasses expensive host operating system overheads, and accelerates network-related tasks. SmartNICs come in different architectures with different programming models, both of which can significantly affect system performance. The need to quantify SmartNICs’ impact in the cloud motivates the design of a benchmark suite that analyzes the performance of a given SmartNIC. In this presentation, we will discuss our approach toward developing such a SmartNIC benchmark system. The benchmark suite is broken down into three stages: (1) collect, implement, and profile representative infrastructure and user-level network functions, (2) accumulate the testing scenarios and design a benchmark system that profiles a SmartNIC in the emulated cloud context, and (3) scale out the benchmark system by implementing a full SmartNIC-enabled cloud simulation platform on top of the open-source Firesim project.


Morpheus: A Polymorphous Design for General-purpose Code Acceleration
    Speaker: Dong Kai Wang
    Affiliation: UIUC
    Date: Feb. 9, 2022
  • Abstract: While domain specific accelerators have been on the rise in recent years, there is a lack of a unified and transparent solution to seamlessly accelerate general applications. We propose Morpheus, a reconfigurable architecture that leverages the CPU’s microarchitectural structures to dynamically build spatial accelerators during program execution. We introduce hardware components capable of abstracting and scheduling program instructions to construct a dataflow graph (DFG) that is then mapped to FPGA/CGRA-like reconfigurable backends. Under this abstraction, Morpheus is not merely an efficient execution backend, it monitors and has command over its own architecture. By observing real-time execution behavior through activity counters, Morpheus has the capacity to dynamically tune its architecture to further adapt to the application. Morpheus provides a flexible acceleration platform that offers ISA compatibility and ease of use, enables self-managed reconfigurable computing, and maintains transparency to software.


FPGAs in the Open Cloud Testbed and Applications
    Speaker: Suranga Handagala
    Affiliation: Northeastern University
    Date: Jan. 13, 2022
  • Abstract: The Open Cloud Testbed (OCT) is an NSF funded community research infrastructure project aimed at cloud researchers and users of national testbeds. Users have complete bare metal access to servers with Alveo U280 FPGAs through CloudLab, and can chose what OS to install, what version of the tool, etc. The FPGAs in OCT have dual 100Gb Ethernet connections to a network switch,  allowing researchers to experiment with FPGAs directly connected to the network, supporting distributed and scalable applications. In this talk I will present the OCT setup and discuss some applications we have that make use of the FPGAs directly connected to the network, including a MobileNet implementation developed, using FINN, that is split across two FPGAs. 


TwinDNN: A Tale of Two Deep Neural Networks
    Speaker: Paul Jeong
    Affiliation: UIUC
    Date: Jan. 13, 2022
  • Abstract: Compression technologies for deep neural networks (DNNs) have been widely investigated to reduce the model size so that they can be implemented on hardware with strict resource restrictions. However, one major downside of model compression is accuracy degradation. To deal with this problem effectively, we propose a new compressed network inference scheme with a high accuracy but slower DNN coupled with its highly compressed version that has a lower accuracy. We demonstrate our design on two image classification tasks: CIFAR-10 and ImageNet. Our experiments show that our design can recover up to 94% of accuracy drop caused by extreme network compression, with more than 90% speedup compared to just using the original DNN.


HiKonv: High Throughput Quantized Convolution With Novel Bit-wise Management and Computation
    Speaker: Xinheng Liu
    Affiliation: UIUC
    Date: Nov. 10, 2021
  • Abstract: We propose HiKonv, a unified solution that maximizes the compute throughput of given underlying hardware to process low-bitwidth quantized data inputs through novel bit-wise parallel computation. We establish theoretical performance boundaries of using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution and demonstrate new breakthroughs for high-performance computing in this important domain. For example, a single 32-bit processing unit can deliver 128 binarized convolution operations (multiplications and additions) using one instruction for CPU, and a single 27×18 DSP core can deliver 8 convolution operations when the input bitwidth is 4, in one cycle.


Investigation of ML Application Acceleration with Vitis-AI
    Speaker: Katherine Yun
    Affiliation: UIUC
    Date: Nov. 10, 2021
  • Abstract: Vitis-AI is a development platform for inferencing machine learning applications on Xilinx hardware platforms. The toolchain supports mainstream ML frameworks and popular models for various types of application. We are interested in investigating the potential of Vitis AI for porting custom ML applications on Cloud FPGA platforms. Based on the workflow for model deployment, we will discuss Vitis-AI’s limitations as well as its integration with other frameworks/services.