Custom Hardware Accelerator for Inference of Compressed Sparse Convolutional Neural Networks

2025-06-28 16:31:01 - Adil Khan

Project Title

Custom Hardware Accelerator for Inference of Compressed Sparse Convolutional Neural Networks

Project Area of Specialization

Artificial Intelligence

Project Summary

Simply put, our project aims to build a custom processor for running (not training) AI tasks. Just as a CPU offloads heavy graphical computations to a GPU, the CPU shall be able to delegate heavy AI computations to our processor. The aim is to provide increased throughput and energy efficiency.

Building on the novel dataflow set out in SCNN [1], we are developing a full architecture around it so that it can be used in real-world AI applications. It shall work alongside a host processor, an ARM or RISC-V CPU, which shall pass it instructions. It shall pick up the input stream of compressed data through Direct Memory Access (DMA), process it, and then write the result back to memory, from where the host CPU can use it.

The architecture being developed implements the recognizable CNN functions of convolution, pooling, and activation. It is divided into Processing Engines (PEs), which operate in parallel with one another (inter-PE parallelism), while each PE also exploits parallelism internally (intra-PE parallelism).
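To make the intra-PE parallelism concrete, the fragment below is a minimal, hypothetical Verilog sketch (the module and signal names are ours for illustration, not part of the SCNN paper or our final design) of a multiplier array that consumes F weights and I activations per cycle and produces their F*I Cartesian product of partial products, in the spirit of SCNN:

    // Hypothetical sketch of intra-PE parallelism: F weights and I
    // activations are multiplied in parallel each cycle, yielding an
    // F*I Cartesian product of partial products (as in SCNN's PE).
    module pe_mult_array #(
        parameter F = 4,            // weights consumed per cycle
        parameter I = 4,            // activations consumed per cycle
        parameter W = 16            // operand width
    ) (
        input  wire               clk,
        input  wire               en,
        input  wire [F*W-1:0]     weights,   // F packed weight values
        input  wire [I*W-1:0]     acts,      // I packed activation values
        output reg  [F*I*2*W-1:0] products   // F*I packed partial products
    );
        integer f, i;
        reg [W-1:0] wv, av;
        always @(posedge clk) begin
            if (en) begin
                for (f = 0; f < F; f = f + 1) begin
                    for (i = 0; i < I; i = i + 1) begin
                        wv = weights[f*W +: W];
                        av = acts[i*W +: W];
                        products[(f*I+i)*2*W +: 2*W] <= wv * av;
                    end
                end
            end
        end
    endmodule

In the real design the products would be scattered into accumulator banks rather than packed into one register; the sketch only shows the parallel multiply step.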

The project is a research project intended for commercialization and is being pursued jointly with Vienna University of Technology (TU Wien).

[1] Parashar, Angshuman et al. “SCNN: An accelerator for compressed-sparse convolutional neural networks.” 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (2017): 27-40.

Project Objectives

The project aims primarily to enable high-throughput, low-energy processing of CNNs. As opposed to training, inference is the deployment of a CNN model for use in applications. Much of this deployment occurs on the edge (user or small-scale devices) as opposed to the cloud (data centers and large computing clusters). Being battery operated and limited to a small form factor places a limit on the size and complexity of the CNN models that can be run in this scenario.

Sparsity is a technique that reduces the size of a CNN with minimal impact on accuracy. Existing solutions used to run CNNs, however, are unable to leverage the non-uniformity that sparsity introduces; GPUs and dense accelerators such as Eyeriss [1] are examples.
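To illustrate what a compressed-sparse stream looks like to hardware, the following is a minimal Verilog sketch of a decoder, assuming a simple zero-run-length encoding in which each nonzero value is paired with the count of zeros preceding it (similar in spirit to SCNN's compressed format; the module and its ports are hypothetical):

    // Decoder for a zero-run-length compressed stream (hypothetical
    // encoding): each input element carries a nonzero value plus the
    // number of zeros elided before it. The decoder recovers the
    // value's coordinate in the dense tensor without ever expanding
    // the zeros.
    module zrl_decoder #(
        parameter W    = 16,        // value width
        parameter Z    = 4,         // zero-run-count width
        parameter IDXW = 16         // output index width
    ) (
        input  wire            clk,
        input  wire            rst,
        input  wire            in_valid,
        input  wire [W-1:0]    in_value,   // nonzero value
        input  wire [Z-1:0]    in_zeros,   // zeros preceding this value
        output reg             out_valid,
        output reg  [W-1:0]    out_value,
        output reg  [IDXW-1:0] out_index   // dense coordinate of the value
    );
        reg [IDXW-1:0] pos;                // running position in dense tensor
        always @(posedge clk) begin
            if (rst) begin
                pos       <= 0;
                out_valid <= 1'b0;
            end else if (in_valid) begin
                out_value <= in_value;
                out_index <= pos + in_zeros;     // skip the elided zeros
                pos       <= pos + in_zeros + 1; // advance past this value
                out_valid <= 1'b1;
            end else begin
                out_valid <= 1'b0;
            end
        end
    endmodule

Because only nonzero values and short run counts are stored and moved, both memory traffic and multiplications on zeros are avoided, which is precisely the benefit dense accelerators cannot capture.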

The goal of this product is to cater to the booming AI infrastructure market, which is expected to grow by 350% over the coming four years [2]. Owing to the size, memory, and computational constraints of edge devices, our project, which exploits a compressed-sparse dataflow and promises performance improvements of around 150% and energy savings of 57%, is likely to become increasingly relevant as an AI inference chip.

Going further, an API will have to be built atop the accelerator to translate CNN models into the specific format our accelerator requires.

[1] Y. Chen, T. Krishna, J. S. Emer and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Jan. 2017.

[2] https://www.marketsandmarkets.com/Market-Reports/ai-infrastructure-market-38254348.html

Project Implementation Method

We are writing hardware. Rather than drawing circuit diagrams, we use hardware description languages (HDLs), such as VHDL and Verilog, which the software provided by FPGA vendors can synthesize into a circuit.
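As a toy example of what such a description looks like, the following Verilog module describes a registered ReLU activation unit, one of the CNN functions our architecture implements (output = max(x, 0) for a signed fixed-point input); the vendor toolchain synthesizes this text into gates and flip-flops:

    // A circuit described as code: a registered ReLU activation unit.
    // The sign bit of the signed input selects between the input and zero.
    module relu #(
        parameter W = 16
    ) (
        input  wire                clk,
        input  wire signed [W-1:0] x,
        output reg  signed [W-1:0] y
    );
        always @(posedge clk)
            y <= x[W-1] ? {W{1'b0}} : x;   // negative -> 0, else pass through
    endmodule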

This description of the circuit in the form of code, termed an Intellectual Property (IP) core, is to be the primary product. It can be implemented on an FPGA, which provides generic, reconfigurable hardware with a large amount of memory, computational, and IO resources, or it can be fabricated into a chip of the kind we are familiar with in our devices' processors.

Benefits of the Project

The project addresses an important research topic: finding novel hardware architectures to speed up and efficiently process CNNs. Sparse CNNs are an increasing trend, and our project will help make that trend practically viable.

The SCNN proposal, which was generic, will be developed in depth and a working prototype prepared. The resulting IP can be sold, turning the project into a viable startup product.

At our own university, NUST, in our own department, SEECS, a plethora of projects combine the Internet of Things (IoT) with AI in use cases such as medical diagnosis, industrial process control, smart sensors, and field data gathering and analysis. With the help of our product, these projects can be enabled to use more complex and compute-intensive AI programs.

Technical Details of Final Deliverable

The final deliverable of the project is the working architecture, receiving a compressed stream of data values and generating correct output. It shall be interfaced to a host processor, an ARM or RISC-V CPU, and shall access memory through DMA.

The host CPU triggers the accelerator through an instruction, which causes the accelerator to receive the compressed data stream from main memory through DMA. The accelerator processes the stream and writes the result back to memory. The host CPU can then display the output.
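As a rough sketch of how such a trigger could be exposed to the host, the Verilog module below implements a hypothetical memory-mapped control/status register pair (the address layout and field assignments are placeholders we chose for illustration, not the final interface):

    // Hypothetical host-facing control interface: the host CPU writes a
    // start bit into a memory-mapped control register; the accelerator
    // raises a done bit after the DMA-read -> compute -> DMA-write
    // sequence completes, and the host polls the status register.
    module ctrl_regs (
        input  wire        clk,
        input  wire        rst,
        // simple register write port driven by the host interconnect
        input  wire        wr_en,
        input  wire [3:0]  wr_addr,
        input  wire [31:0] wr_data,
        // handshake with the accelerator datapath
        output reg         start,       // pulse: begin DMA fetch + processing
        input  wire        busy,        // datapath is processing
        input  wire        done_pulse,  // result written back to memory
        output reg  [31:0] status       // host polls: {30'b0, busy, done}
    );
        localparam ADDR_CTRL = 4'h0;    // bit 0 = start (placeholder layout)
        reg done;
        always @(posedge clk) begin
            if (rst) begin
                start <= 1'b0;
                done  <= 1'b0;
            end else begin
                start <= 1'b0;                  // start is a one-cycle pulse
                if (wr_en && wr_addr == ADDR_CTRL && wr_data[0]) begin
                    start <= 1'b1;
                    done  <= 1'b0;              // clear done on a new command
                end
                if (done_pulse)
                    done <= 1'b1;
            end
            status <= {30'b0, busy, done};
        end
    endmodule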

We plan to run a face detection algorithm with the correct output being shown on a monitor.

Final Deliverable of the Project: HW/SW integrated system
Core Industry: IT
Other Industries:
Core Technology: Artificial Intelligence (AI)
Other Technologies: Internet of Things (IoT)
Sustainable Development Goals: Industry, Innovation and Infrastructure

Required Resources

Item Name                  Type       No. of Units  Per Unit Cost (in Rs)  Total (in Rs)
Genesys Development board  Equipment  1             47000                  47000

Total (in Rs): 47000
