June 2024 | Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, and Zhiru Zhang
Allo is a composable programming model for efficient spatial accelerator design. It decouples hardware customizations (compute, memory, communication, and data types) from the algorithm specification, encapsulating them as primitives. Allo preserves the hierarchical structure of input programs by combining customizations from different functions in a bottom-up, type-safe manner, enabling holistic optimizations across function boundaries. Comprehensive experiments on HLS benchmarks and deep learning models show that Allo outperforms state-of-the-art HLS tools and ADLs, achieving 1.7× lower inference latency and 5.4× higher energy efficiency for the GPT2 model compared to an NVIDIA A100 GPU.
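The core idea of decoupling can be illustrated with a toy sketch in plain Python (this mimics the style of Allo's programming model but is not its real API): the algorithm is written once as ordinary loop code, while customizations are recorded separately as a list of schedule primitives.

```python
import numpy as np

def gemm(A, B):
    """Algorithm specification only: a plain matrix multiply, with no
    hardware-specific annotations mixed into the loop body."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

class Schedule:
    """Hypothetical stand-in for a schedule object: it collects
    customization primitives without modifying the algorithm."""
    def __init__(self, fn):
        self.fn = fn
        self.primitives = []
    def split(self, axis, factor):
        self.primitives.append(("split", axis, factor))
        return self
    def pipeline(self, axis):
        self.primitives.append(("pipeline", axis))
        return self

# Customizations are applied as chained primitives, kept apart from gemm.
s = Schedule(gemm).split("j", 4).pipeline("j_in")
```

Because the primitives live in the schedule rather than the kernel body, the same algorithm can be retargeted with a different schedule without being rewritten.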
Allo supports parameterized kernel templates, allowing users to declare type variables during kernel creation and instantiate the kernel when building the hardware executable. It also provides composable schedules, enabling users to construct kernels incrementally from the bottom up, adding customizations one at a time while validating each submodule. Allo introduces holistic dataflow optimizations, using a hierarchical dataflow graph to support the composition of multiple kernels within a complex design while maintaining function boundaries. It models interface unification as a type inference problem and solves it efficiently through dataflow analysis.
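A parameterized kernel template can be sketched as follows (a hypothetical illustration, not Allo's actual template syntax): the element type `T` and size `M` are template parameters declared at kernel creation and fixed only when the kernel is instantiated for a concrete build.

```python
import numpy as np

def make_vector_add(T, M):
    """Template: returns a concrete kernel specialized for element
    type T and vector length M."""
    def vadd(a, b):
        assert a.shape == (M,) and a.dtype == T
        out = np.empty(M, dtype=T)
        for i in range(M):
            # Arithmetic is carried out in the declared element type.
            out[i] = T(a[i] + b[i])
        return out
    return vadd

# Instantiate the template at "build" time with concrete parameters.
vadd_i32 = make_vector_add(np.int32, 4)
```

One template thus yields many specialized kernels, which is how polymorphism over data types and matrix sizes is obtained without duplicating the kernel body.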
Allo's frontend is implemented in Python, allowing flexible programming with minimal type annotations. An end-to-end optimizing compiler lets users write Python programs and generate hardware bitstreams. An MLIR dialect supports decoupled hardware customizations at the IR level and can potentially support multiple input languages.
Allo addresses two major challenges in high-performance accelerator design: balancing manual control with automated compiler optimizations and bridging the gap from single-kernel optimization to complex multi-kernel designs. It provides progressive hardware customizations, reusable parameterized kernel templates, composable schedules, and holistic dataflow optimizations. Allo's ability to compose individual kernels and construct large-scale, high-performance designs makes it distinct from other ADLs.
Allo's compilation flow comprises a Python-embedded ADL, the Allo compiler, and an MLIR dialect. It supports multiple backend targets, generating LLVM IR for CPU simulation and HLS C/C++ for hardware synthesis. Customizable hardware transformations let users express complex single-kernel designs such as systolic arrays. Allo's verification procedures, including functional simulation testing and formal equivalence checking, ensure the correctness of the generated accelerator.
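Functional simulation testing can be sketched as a golden-model comparison (a simplified illustration: here a plain Python loop nest stands in for the CPU-simulation build of the kernel, and NumPy serves as the reference):

```python
import numpy as np

def gemm_kernel(A, B):
    """Stand-in for the CPU-simulation (LLVM IR) build of an
    accelerator GEMM kernel."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.int64)
    for i in range(M):
        for k in range(K):
            for j in range(N):
                C[i, j] += A[i, k] * B[k, j]
    return C

# Functional simulation test: check the kernel against a NumPy golden
# model on random inputs before committing to hardware synthesis.
rng = np.random.default_rng(0)
A = rng.integers(0, 10, (8, 8))
B = rng.integers(0, 10, (8, 8))
assert np.array_equal(gemm_kernel(A, B), A @ B)
```

Catching a functional bug at this stage is far cheaper than discovering it after hours of synthesis and place-and-route.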
Allo's parameterized kernel templates allow users to define functions with type parameters, enabling polymorphism and flexibility in handling variable-sized input matrices. Allo's composable schedules enable the integration of external kernels and holistic optimization of the design. Allo's hierarchical dataflow graph preserves the hierarchy of modules during scheduling, facilitating analysis of interfaces between functions. Allo's schedule replay algorithm allows the composition of multiple schedules, ensuring that primitives are applied correctly and conflicts are resolved. Allo's memory layout composition ensures consistency between function call arguments and actual function definitions, maintaining data layout integrity.
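A toy model of a hierarchical dataflow graph (hypothetical, not Allo's internal representation) shows how function boundaries survive composition: each node may contain a nested subgraph of sub-kernels, so composing schedules walks the hierarchy rather than flattening it.

```python
class Node:
    """A function in the design; children are its nested sub-kernels,
    edges are dataflow connections to sibling nodes."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.edges = []

    def connect(self, other):
        self.edges.append(other)

def boundaries(node, depth=0):
    """List every function boundary in the hierarchy, top-down,
    as (nesting depth, name) pairs."""
    out = [(depth, node.name)]
    for child in node.children:
        out += boundaries(child, depth + 1)
    return out

# Example hierarchy loosely modeled on a transformer block.
attn = Node("attention", [Node("qk_matmul"), Node("softmax"), Node("v_matmul")])
ffn = Node("ffn")
top = Node("gpt2_block", [attn, ffn])
attn.connect(ffn)  # dataflow between sibling kernels
```

Because every boundary remains visible, interface analysis between functions (e.g., unifying argument layouts across a call) can be run per edge instead of over one monolithic flat graph.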