1996 | Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang
This paper argues against the trend of ever-wider instruction issue in superscalar processors and makes the case for a single-chip multiprocessor (MP) instead. For applications with little parallelism, the MP delivers performance comparable to that of a wide-issue superscalar (SS) processor; for applications with high parallelism, the MP outperforms the SS by a significant margin. The MP combines a localized implementation of simple, high-clock-rate processors, which suits inherently sequential applications, with low-latency interprocessor communication, which suits parallel ones.
The paper compares a six-issue dynamically scheduled SS processor with an MP built from four two-issue processors (4x2-issue). For applications with little parallelism, the SS is 30% faster than the MP. For applications with fine-grained thread-level parallelism, the MP exploits that parallelism well enough that the SS is at most 10% faster. For applications with large-grained parallelism and for multiprogramming workloads, the MP performs 50–100% better than the SS.
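The percentages above mix two baselines (SS faster in some cases, MP faster in others), so it can help to put them on one scale. The following is a minimal sketch, not taken from the paper: it normalizes SS performance to 1.0 and assumes that "A is X% faster than B" means a performance ratio A/B of 1 + X/100. The function names are hypothetical.

```python
# Put the summarized results on a single scale with SS performance = 1.0.
# Assumption (not from the paper): "A is X% faster than B" means A/B = 1 + X/100.

def mp_vs_ss_when_ss_faster(ss_advantage_pct: float) -> float:
    """MP performance relative to SS when the SS is faster by the given percent."""
    return 1.0 / (1.0 + ss_advantage_pct / 100.0)


def mp_vs_ss_when_mp_faster(mp_advantage_pct: float) -> float:
    """MP performance relative to SS when the MP is faster by the given percent."""
    return 1.0 + mp_advantage_pct / 100.0


def relative_performance() -> dict:
    """MP/SS performance ratio for each workload class named in the summary."""
    return {
        "little parallelism": mp_vs_ss_when_ss_faster(30),
        "fine-grained parallelism (worst case)": mp_vs_ss_when_ss_faster(10),
        "large-grained / multiprogramming (low)": mp_vs_ss_when_mp_faster(50),
        "large-grained / multiprogramming (high)": mp_vs_ss_when_mp_faster(100),
    }


if __name__ == "__main__":
    for workload, ratio in relative_performance().items():
        print(f"{workload}: MP = {ratio:.2f} x SS")
```

Under this reading, the MP runs at roughly 0.77 of SS performance on code with little parallelism, at no less than about 0.91 on fine-grained parallel code, and at 1.5 to 2.0 times SS performance on large-grained and multiprogramming workloads.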
The paper discusses the limitations of wide superscalar designs, in particular the quadratic growth in area and complexity of the instruction issue queue, register files, and cache systems as issue width increases. It argues that an MP composed of simpler processors can be implemented in a similar area and performs better on applications with high parallelism. The MP also benefits from low-latency interprocessor communication, making it suitable for both integer and floating-point applications. The study concludes that the MP makes more efficient use of silicon resources and, because of its simpler design, can achieve higher clock rates.
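The quadratic-growth argument can be illustrated with a toy cost model. This is a sketch under a stated assumption, not the paper's area estimate: it treats the cost of a core's issue-related logic as proportional to the square of its issue width and ignores constants, caches, and interconnect. The function names and the unit of "cost" are hypothetical.

```python
# Toy model: issue-related cost of a core grows as (issue width) ** exponent.
# Assumption: exponent = 2 stands in for the "quadratic increase" described above;
# the numbers are unitless and are not the paper's area figures.

def issue_logic_cost(issue_width: int, exponent: float = 2.0) -> float:
    """Relative issue-logic cost of a single core under the assumed model."""
    return float(issue_width) ** exponent


def chip_cost(cores: int, issue_width: int, exponent: float = 2.0) -> float:
    """Total relative issue-logic cost of a chip with identical cores."""
    return cores * issue_logic_cost(issue_width, exponent)


if __name__ == "__main__":
    ss_cost = chip_cost(cores=1, issue_width=6)  # one six-issue core
    mp_cost = chip_cost(cores=4, issue_width=2)  # four two-issue cores
    print(f"SS: 6 issue slots, relative cost {ss_cost}")  # 36.0
    print(f"MP: 8 issue slots, relative cost {mp_cost}")  # 16.0
```

In this simplified model the four two-issue cores provide more total issue slots (8 versus 6) at less than half the issue-logic cost (16 versus 36), which is the intuition behind fitting the MP into an area similar to that of the SS.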