DNN Inference on FPGAs for High-Throughput Applications

Project Title

Project Area of Specialization

Artificial Intelligence

Project Summary

Introduction:

Deploying deep neural networks (DNNs) for applications that require very high throughput or extremely low latency is a severe computational challenge, further exacerbated by inefficiencies in mapping the computation to hardware. We present a novel method for designing neural network topologies that map directly to a highly efficient FPGA implementation.

DNNs are the state of the art in computer vision and natural language processing systems. Deep convolutional neural networks (CNNs) have demonstrated high accuracy in a variety of applications, including image processing. These cutting-edge CNNs employ extremely deep models, requiring hundreds of ExaOps (1 ExaOp = 10^18 operations) of compute and gigabytes of model and data storage during training. Many applications, such as data collection in particle physics, packet filtering at line rate for network intrusion detection, and wireless communication networks, would require DNNs with inference rates of hundreds of millions of requests per second and sub-microsecond latency for a machine-learned component to be usable.

In streaming inference, the hardware architecture is customized to the specifics of the DNN topology, and the input is streamed through the architecture like a pipeline. This specialization raises efficiency, because only the hardware that is actually needed is instantiated. Latency is lowered because there is no buffering before or after layers. The approach remains flexible because a change in topology means a different architecture, which an FPGA can accommodate through reconfiguration.

How can we create DNN implementations that can match the performance and latency requirements of high-throughput applications?

GPUs, FPGA overlays, and dedicated tensor processors are all popular hardware platforms for speeding up DNN inference. In the conventional approach, the hardware and the software are designed separately, and a compiler then maps the network onto the available hardware. Prior research has shown that specialized co-design approaches can generate FPGA-based DNN implementations with higher speed while still allowing reconfiguration to meet changing requirements.

Our approach is based on the fact that artificial neurons with quantized inputs and outputs can be transformed into truth tables. However, an FPGA implementation of a truth table is only efficient when the number of inputs is small, because the size of the table grows exponentially with the number of input bits.
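To make the scaling concrete, the following is a minimal sketch of the truth-table size for a single neuron. The function names and the example bit widths are illustrative assumptions, not values taken from the project.

```python
def truth_table_entries(fan_in: int, input_bits: int) -> int:
    """Rows in the truth table of one neuron.

    Each of the fan_in inputs carries input_bits bits, so the neuron
    sees fan_in * input_bits input bits in total.
    """
    return 2 ** (fan_in * input_bits)


def truth_table_bits(fan_in: int, input_bits: int, output_bits: int) -> int:
    """Total storage needed for the table: one output word per row."""
    return truth_table_entries(fan_in, input_bits) * output_bits


# Hypothetical sweep: 2-bit inputs and outputs, increasing fan-in.
for fan_in in (2, 3, 4, 6, 8):
    rows = truth_table_entries(fan_in, input_bits=2)
    bits = truth_table_bits(fan_in, input_bits=2, output_bits=2)
    print(f"fan-in {fan_in}: {rows} rows, {bits} table bits")
```

With 2-bit inputs, a fan-in of 3 gives 2^6 = 64 rows, while a fan-in of 8 already gives 2^16 = 65536 rows, which is why low fan-in matters for this approach.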

The following tools and platforms are used to create high-throughput, ultra-low-latency DNN compute architectures.

LogicNets: designing, training and deploying sparse, quantized neural networks based on hardware building blocks.

FINN: a flow built around transformations that gradually convert the model into a synthesizable hardware description.

Project Objectives

OBJECTIVES:

The main objectives of our project are as follows:

To apply DNNs to extreme-throughput applications.
To investigate whether FPGA LUTs can be exposed to a machine learning framework.
To design topologies with low fan-in.
To restrict our design space exploration to smaller networks.
To build a prototype in PyTorch.
To implement it on an FPGA using the Ultra96-V2 MPSoC board.
To run a large number of experiments while avoiding high resource cost.

Project Implementation Method

METHODOLOGY:

How, then, can we build DNN implementations that meet the performance and latency constraints of extreme-throughput applications?

Here we describe a method, based on LogicNets and FINN, for designing DNN topologies that map directly to an efficient FPGA implementation for extreme-throughput applications. The scheme is based on the observation that artificial neurons with quantized inputs and outputs can be converted to truth tables.
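The observation can be illustrated with a minimal sketch in plain Python. The quantizer (round, then clamp to an unsigned range), the helper names, and the example weights are all assumptions made for illustration; they are not the LogicNets implementation.

```python
from itertools import product


def quantize(value: float, bits: int) -> int:
    """Round and clamp to the unsigned range [0, 2**bits - 1] (assumed quantizer)."""
    return max(0, min(2 ** bits - 1, round(value)))


def neuron(inputs, weights, bias, out_bits):
    """Weighted sum of quantized inputs followed by output quantization."""
    acc = sum(w * x for w, x in zip(weights, inputs)) + bias
    return quantize(acc, out_bits)


def build_truth_table(weights, bias, in_bits, out_bits):
    """Enumerate every input combination and record the neuron's output."""
    levels = range(2 ** in_bits)
    return {
        combo: neuron(combo, weights, bias, out_bits)
        for combo in product(levels, repeat=len(weights))
    }


# Hypothetical neuron: two 1-bit inputs, one 1-bit output.
table = build_truth_table(weights=[1.0, -1.0], bias=0.0, in_bits=1, out_bits=1)
for inputs, output in sorted(table.items()):
    print(inputs, "->", output)
```

Once the table exists, the weights and the arithmetic are no longer needed at inference time: the neuron is fully described by the mapping from input codes to output codes.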

LOGICNETS flow:

FINN flow:

DNNs designed and trained in this manner yield fast and efficient FPGA implementations that can fulfill the performance requirements of extreme-throughput applications.

Benefits of the Project

There are many exciting directions for future work. On the machine learning side, we would improve our understanding of how to design highly sparse and quantized topologies that retain high accuracy. On the tool flow and EDA side, better synthesis techniques are needed to scale to neurons with larger fan-in. On the application side, the goal is to discover and demonstrate more extreme-throughput applications that can leverage LogicNets and FINN.

DNN models have become deeper and more accurate, with correspondingly high complexity and computational cost, which calls for specialized accelerators for these models. Deployments fall into two kinds: cloud and edge. Cloud deployment transports the data from sensors to data centers, where both training and inference are executed. Edge computing performs inference close to where the data is produced, and the models can be pre-trained in a data center.

Our project targets edge computing, for which we will design a specialized DNN topology for the task at hand. Once the topology is chosen, we train it with standard training techniques, and once it achieves the desired accuracy, we convert the trained neural network directly into an FPGA circuit. If the topology is chosen well, the circuit will have low logic depth and can achieve the high clock rate and classification performance needed by extreme-throughput applications.
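One way to picture a topology with bounded fan-in is a fixed connectivity pattern in which each neuron reads only a few inputs. The sketch below picks those inputs at random; random selection, the function names, and the layer sizes are assumptions for illustration, since the text does not specify how connections are chosen.

```python
import random


def sparse_connectivity(n_inputs: int, n_neurons: int, fan_in: int, seed: int = 0):
    """For each neuron, choose fan_in distinct input indices (assumed: at random)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_inputs), fan_in)) for _ in range(n_neurons)]


def to_mask(connections, n_inputs: int):
    """Binary mask with one row per neuron: 1 where a connection exists."""
    return [[1 if i in conn else 0 for i in range(n_inputs)] for conn in connections]


# Hypothetical layer: 16 inputs, 4 neurons, fan-in of 3.
connections = sparse_connectivity(n_inputs=16, n_neurons=4, fan_in=3)
mask = to_mask(connections, n_inputs=16)
for row in mask:
    print(row)
```

During training, such a mask could be multiplied element-wise with a layer's weight matrix so that every neuron keeps exactly `fan_in` active inputs, which keeps each neuron's truth table small.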

Technical Details of Final Deliverable

Our toolchain uses PyTorch, an open-source machine learning framework that eases the path from research prototyping to production deployment, to train our specific DNN topology. When the network achieves the desired accuracy, we convert each neuron into a LUT (lookup table) by enumerating every possible input and observing the output to construct the truth table. Model conversion is done in Python, which generates the Verilog code. We then synthesize the design with Vivado and implement it on our Ultra96-V2 MPSoC FPGA board.
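The conversion step from a truth table to Verilog can be sketched as follows. This is a minimal illustration, not the project's generator: the function name, the module and port names, and the XOR example table are hypothetical, and the table is assumed to map integer input codes to integer output codes.

```python
def truth_table_to_verilog(name: str, table: dict, in_width: int, out_width: int) -> str:
    """Emit a combinational Verilog module implementing the table as a case statement."""
    lines = [
        f"module {name} (input wire [{in_width - 1}:0] x, "
        f"output reg [{out_width - 1}:0] y);",
        "  always @(*) begin",
        "    case (x)",
    ]
    # One case item per truth-table row, in ascending input order.
    for code in sorted(table):
        lines.append(f"      {in_width}'d{code}: y = {out_width}'d{table[code]};")
    lines += [
        # Default branch avoids inferred latches if the table is incomplete.
        f"      default: y = {out_width}'d0;",
        "    endcase",
        "  end",
        "endmodule",
    ]
    return "\n".join(lines)


# Hypothetical 2-input, 1-output table (XOR) used only as an example.
xor_table = {0: 0, 1: 1, 2: 1, 3: 0}
verilog = truth_table_to_verilog("neuron_lut", xor_table, in_width=2, out_width=1)
print(verilog)
```

The generated text can be written to a `.v` file and added to a Vivado project as a design source; the synthesis tool then maps the case statement onto the FPGA's LUTs.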