1998 | Matteo Frigo, Charles E. Leiserson, Keith H. Randall
The paper discusses the implementation of Cilk-5, a multithreaded language for parallel programming that builds upon the C language. The new version of Cilk uses a provably efficient "work-stealing" scheduling algorithm, similar to its predecessor, but with significant redesigns in both the language and the runtime system. The efficiency of the new implementation is enhanced by a clear strategy that minimizes overheads contributing to work, even if it means moving overheads onto the critical path. This "work-first" principle has led to a portable Cilk-5 implementation where the typical cost of spawning a parallel thread is only 2 to 6 times the cost of a C function call on contemporary machines. Many Cilk programs run almost as efficiently on a single processor as they do on multiple processors.
The paper describes the design of Cilk-5's compiler and runtime system, focusing on the "two-clone" compilation strategy and a Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler. The two-clone strategy generates a fast clone and a slow clone for each procedure, with the fast clone executing in the common case where serial semantics suffice, and the slow clone handling parallel semantics and bookkeeping. The mutual-exclusion protocol, called the *THE* protocol, manages the runtime deque of ready threads, ensuring that scheduling costs are amortized against the critical path.
The paper also presents empirical evidence showing that the Cilk-5 scheduler is efficient, with the work overhead being close to 1 for most applications. The critical-path overhead is also found to be reasonably small, indicating that the work-first principle effectively minimizes scheduling overheads. The conclusion discusses related work and highlights the advantages of Cilk-5's design, particularly its focus on leveraging C compiler technology for high-performance codes.The paper discusses the implementation of Cilk-5, a multithreaded language for parallel programming that builds upon the C language. The new version of Cilk uses a provably efficient "work-stealing" scheduling algorithm, similar to its predecessor, but with significant redesigns in both the language and the runtime system. The efficiency of the new implementation is enhanced by a clear strategy that minimizes overheads contributing to work, even if it means moving overheads onto the critical path. This "work-first" principle has led to a portable Cilk-5 implementation where the typical cost of spawning a parallel thread is only 2 to 6 times the cost of a C function call on contemporary machines. Many Cilk programs run almost as efficiently on a single processor as they do on multiple processors.
The paper describes the design of Cilk-5's compiler and runtime system, focusing on the "two-clone" compilation strategy and a Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler. The two-clone strategy generates a fast clone and a slow clone for each procedure, with the fast clone executing in the common case where serial semantics suffice, and the slow clone handling parallel semantics and bookkeeping. The mutual-exclusion protocol, called the *THE* protocol, manages the runtime deque of ready threads, ensuring that scheduling costs are amortized against the critical path.
The paper also presents empirical evidence showing that the Cilk-5 scheduler is efficient, with the work overhead being close to 1 for most applications. The critical-path overhead is also found to be reasonably small, indicating that the work-first principle effectively minimizes scheduling overheads. The conclusion discusses related work and highlights the advantages of Cilk-5's design, particularly its focus on leveraging C compiler technology for high-performance codes.