1998 | Matteo Frigo, Charles E. Leiserson, Keith H. Randall
The fifth release of the multithreaded language Cilk, Cilk-5, uses a provably efficient "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
The work-first principle is central to Cilk-5's design. It minimizes scheduling overheads that contribute to the work of a computation, even if this means increasing overheads that contribute to the critical path. This principle was used to develop a "two-clone" compilation strategy, where each Cilk procedure is compiled into two clones: a fast clone and a slow clone. The fast clone is used in the common case where serial semantics suffice, while the slow clone is used in the infrequent case where parallel semantics are required. The slow clone handles all communication due to scheduling, contributing to critical-path overhead but not to work overhead.
The work-first principle also inspired a Dijkstra-like shared-memory mutual-exclusion protocol, called the THE protocol, for managing the runtime deque in the work-stealing scheduler. This protocol allows exceptions to be signaled to a working processor with no additional work overhead, a feature used in Cilk's abort mechanism. The THE protocol ensures that the cost of stealing contributes only to critical-path overhead, and not to work overhead. This strategy of amortizing costs against steal attempts permeates virtually every decision made in the design of the scheduler.
The paper also describes the performance of Cilk-5 on various machines, showing that the work overhead $ c_1 $ is close to 1 for most programs, indicating that the Cilk-5 implementation effectively exploits the work-first principle. The critical-path overhead $ c_\infty $ is also reasonably small, validating the assumption of parallel slackness. The results show that Cilk-5 achieves linear speedup on a variety of applications, demonstrating its efficiency and effectiveness in parallel computing.The fifth release of the multithreaded language Cilk, Cilk-5, uses a provably efficient "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
The work-first principle is central to Cilk-5's design. It minimizes scheduling overheads that contribute to the work of a computation, even if this means increasing overheads that contribute to the critical path. This principle was used to develop a "two-clone" compilation strategy, where each Cilk procedure is compiled into two clones: a fast clone and a slow clone. The fast clone is used in the common case where serial semantics suffice, while the slow clone is used in the infrequent case where parallel semantics are required. The slow clone handles all communication due to scheduling, contributing to critical-path overhead but not to work overhead.
The work-first principle also inspired a Dijkstra-like shared-memory mutual-exclusion protocol, called the THE protocol, for managing the runtime deque in the work-stealing scheduler. This protocol allows exceptions to be signaled to a working processor with no additional work overhead, a feature used in Cilk's abort mechanism. The THE protocol ensures that the cost of stealing contributes only to critical-path overhead, and not to work overhead. This strategy of amortizing costs against steal attempts permeates virtually every decision made in the design of the scheduler.
The paper also describes the performance of Cilk-5 on various machines, showing that the work overhead $ c_1 $ is close to 1 for most programs, indicating that the Cilk-5 implementation effectively exploits the work-first principle. The critical-path overhead $ c_\infty $ is also reasonably small, validating the assumption of parallel slackness. The results show that Cilk-5 achieves linear speedup on a variety of applications, demonstrating its efficiency and effectiveness in parallel computing.