Jan. 31-Feb. 4, 2016 | Chen, Yu-Hsin, Tushar Krishna, Joel Emer, and Vivienne Sze
The paper presents Eyeriss, an energy-efficient reconfigurable accelerator for deep convolutional neural networks (CNNs). CNNs are widely used in computer vision tasks due to their high accuracy, but they require significant computational resources, especially for large networks with many filters and varying layer shapes. Eyeriss addresses these challenges by minimizing data movement and energy consumption through two key methods: (1) an efficient dataflow and supporting hardware that reduce data movement by exploiting data reuse and accommodate different layer shapes; and (2) exploitation of data statistics, using zero skipping/gating to minimize energy and data compression to reduce off-chip memory bandwidth.
The accelerator's architecture includes a shared SRAM buffer for input image data, filter weights, and partial sums, which facilitates temporal reuse of loaded data. Image data and filter weights are read from DRAM to the buffer and streamed into the spatial computation array, allowing for overlap of memory traffic and computation. The spatial array computes inner products between the image and filter weights, generating partial sums that are returned to the buffer and optionally rectified (ReLU) and compressed before being sent to DRAM. Run-length-based compression reduces the average image bandwidth by 2x.
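The run-length idea can be sketched as follows: sparse activation data is stored as (zero-run, value) pairs rather than raw values. This is a minimal illustration only; the chip packs runs and values into fixed-width bit fields, which this sketch omits, and the function names are hypothetical.

```python
def rlc_encode(values):
    """Run-length encode a sequence with many zeros.

    Each nonzero value is stored with the count of zeros preceding it;
    any trailing zeros are returned as a separate count.
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))  # (zeros before this value, value)
            run = 0
    return pairs, run  # trailing zero count kept separately


def rlc_decode(pairs, trailing):
    """Invert rlc_encode, reconstructing the original sequence."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    out.extend([0] * trailing)
    return out
```

For activation maps where most values are zeroed by ReLU, the pair stream is much shorter than the raw stream, which is the source of the roughly 2x bandwidth reduction reported.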
The accelerator supports configurable image and filter sizes that do not fit completely into the spatial array by saving partial sums in the buffer and later restoring them to the spatial array. The sizes of the spatial array and buffer determine the number of passes needed for calculations. Unused PEs are clock gated to save energy.
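As a rough illustration of how the physical array size bounds the work per pass, the sketch below counts passes as a product of ceiling ratios, assuming the chip's 12x14 PE array and a purely spatial tiling. This is a simplification I am introducing for illustration: the real mapper also folds and replicates work within the array and is further constrained by buffer capacity.

```python
import math

def num_passes(logical_rows, logical_cols, phys_rows=12, phys_cols=14):
    """Passes needed when a layer's logical PE requirement exceeds
    the physical array, under a naive tile-by-tile model."""
    return (math.ceil(logical_rows / phys_rows)
            * math.ceil(logical_cols / phys_cols))
```

Partial sums produced in one pass are saved to the buffer and restored when the corresponding tile returns to the array in a later pass.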
The paper also describes the use of a Network-on-Chip (NoC) to support address-based data delivery, which allows for efficient multicast to a variable number of PEs within a single cycle. The NoC comprises one Global Y bus and 12 Global X buses, with each PE configured with a (row, col) ID. Data from the buffer is tagged with the target PEs' (row, col) ID, and multicast controllers deliver data only to those PEs that match the target ID.
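The ID-matching delivery can be sketched as follows: each PE is configured with a (row, col) ID, and a bus transaction tagged with a target ID reaches every PE whose configured ID matches. Since several physical PEs may be configured with the same ID, one transaction can serve a variable-sized group in a single cycle. This is a behavioral model, not the bus hierarchy itself.

```python
def multicast(tag, pe_ids):
    """Return the indices of PEs whose configured (row, col) ID
    matches the transaction's target tag.

    pe_ids: list of (row, col) tuples, one per physical PE.
    """
    return [i for i, pid in enumerate(pe_ids) if pid == tag]
```

For example, if two PEs in a row share the ID (0, 1) because they need the same filter row, a single tagged transfer from the buffer reaches both.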
The accelerator's processing engine (PE) is a three-stage pipeline that computes the inner product of the input image and filter weights for a single row of the filter; the sequence of partial sums for the sliding filter window is computed sequentially. Local scratch pads enable energy-efficient temporal reuse of input image data and filter weights by recirculating values needed by different windows. Data gating is achieved by recording zero-valued input image data in a 'zero buffer' and skipping the filter read and computation for those values, yielding a 45% power savings in the PE.
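A minimal sketch of the zero-gating idea, with a per-value flag list standing in for the PE's zero buffer. The function name and the skip counter are illustrative additions, not the chip's RTL:

```python
def gated_row_inner_products(image_row, filter_row):
    """Sliding-window inner products for one filter row,
    skipping zero-valued inputs.

    Returns the partial-sum sequence plus a count of skipped MACs
    (each skip also avoids a filter scratch-pad read on the chip).
    """
    zero_flags = [v == 0 for v in image_row]  # the PE's 'zero buffer'
    F = len(filter_row)
    sums = []
    skipped = 0
    for start in range(len(image_row) - F + 1):
        acc = 0
        for k in range(F):
            if zero_flags[start + k]:
                skipped += 1      # no filter read, no multiply-accumulate
                continue
            acc += image_row[start + k] * filter_row[k]
        sums.append(acc)
    return sums, skipped
```

Because ReLU zeroes a large fraction of activations, a large share of MACs and scratch-pad reads can be gated off, which is where the reported PE power savings comes from.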
The test chip is implemented in 65nm CMOS and operates at a 200MHz core clock and 60MHz link clock, achieving a frame rate of 34.7fps on the five convolutional layers in AlexNet and a measured power of 278mW at 1V. The PE array, NoC, and on-chip buffer consume 77.8%,