3 May 2016 | Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, William J. Dally
EIE is an energy-efficient inference engine designed to accelerate inference on compressed deep neural networks (DNNs). The main obstacle to deploying DNNs on embedded systems is their high computational and memory demands; in particular, fetching weights from off-chip DRAM dominates energy consumption. EIE addresses this by operating directly on compressed DNNs that are small enough to be stored in on-chip SRAM, which both reduces energy consumption and improves performance.
The compression technique, known as Deep Compression, shrinks a DNN by pruning redundant connections and having the remaining weights share a small set of values. This allows large networks such as AlexNet and VGGNet to fit entirely in on-chip SRAM. EIE operates on the compressed model directly, performing sparse matrix-vector multiplication with weight sharing, which cuts energy consumption substantially: moving weights from DRAM to SRAM saves 120×, exploiting the sparsity of the pruned matrix saves 10×, weight sharing saves 8×, and skipping zero activations (produced by ReLU) saves a further 3×.
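As a rough illustration of those two steps, the hypothetical Python sketch below prunes near-zero weights and then quantizes the survivors to indices into a shared 16-entry codebook. The function name, threshold, and uniform codebook are stand-ins of my own (Deep Compression derives its codebook with k-means clustering); the point is only the shape of the compressed representation.

```python
import numpy as np

def compress_layer(W, prune_threshold=0.01, n_clusters=16):
    """Sketch of Deep Compression's two steps: prune, then share weights.

    Hypothetical illustration; the real pipeline retrains after pruning
    and fits the codebook with k-means rather than a uniform grid.
    """
    # Pruning: zero out small weights (this creates the static sparsity).
    W = np.where(np.abs(W) < prune_threshold, 0.0, W)

    # Weight sharing: replace each surviving weight with a 4-bit index
    # into a small codebook of shared values.
    nz = W[W != 0]
    codebook = np.linspace(nz.min(), nz.max(), n_clusters)
    indices = np.abs(nz[:, None] - codebook[None, :]).argmin(axis=1)

    # mask of surviving weights, 4-bit codebook indices, shared values
    return W != 0, indices.astype(np.uint8), codebook
```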
Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster than CPU and GPU implementations of the same network without compression, respectively. It processes the fully connected layers of AlexNet at 1.88×10⁴ frames/sec while dissipating only 600 mW. Compared with other accelerators such as DaDianNao, EIE offers better throughput, energy efficiency, and area efficiency.
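Taken together, those two figures imply an energy budget of roughly 0.6 W ÷ 1.88×10⁴ frames/sec ≈ 32 µJ per frame for AlexNet's fully connected layers.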
Architecturally, EIE is a scalable array of processing elements (PEs), each holding a partition of the network in its local SRAM. Weights are stored in a variant of compressed sparse column (CSC) format, and the design exploits both static sparsity (weights zeroed by pruning) and dynamic sparsity (activations zeroed by ReLU). A central control unit, per-PE activation queues, and load-balancing mechanisms keep the PEs busy while processing the sparse matrix-vector multiplication, as sketched below.
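To make that dataflow concrete, here is a minimal Python sketch of sparse matrix-vector multiplication over a CSC-encoded matrix. It is an illustration under simplifying assumptions, not the authors' implementation: the names (csc_spmv, values, row_idx, col_ptr) are mine, and it omits EIE's 4-bit relative row indexing, the weight-sharing codebook lookup, and the interleaving of rows across PEs.

```python
import numpy as np

def csc_spmv(values, row_idx, col_ptr, x, n_rows):
    """Compute y = W @ x where W is stored in CSC form.

    values  : nonzero weights, stored column by column
    row_idx : row index of each nonzero weight
    col_ptr : col_ptr[j]..col_ptr[j+1] delimits column j's nonzeros
    """
    y = np.zeros(n_rows)
    for j, a in enumerate(x):
        if a == 0:
            continue                        # dynamic sparsity: skip zero activations
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * a  # static sparsity: only stored weights
    return y
```

In EIE's terms, the outer loop corresponds to broadcasting each nonzero input activation to the PEs, and the inner loop to each PE walking the nonzero weights it holds for that column.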
EIE's hardware implementation targets 45 nm CMOS technology; each PE occupies 0.638 mm² and dissipates 9.16 mW at 800 MHz. The design executes compressed DNNs efficiently, with significant energy savings over traditional CPU and GPU implementations, making EIE well suited to real-time applications where latency is critical.
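As a sanity check on the earlier power figure: the paper's reported configuration uses 64 PEs, and 64 × 9.16 mW ≈ 586 mW, consistent with the roughly 600 mW quoted for the full accelerator.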