The Codelet Program Execution Model
We have defined a new program execution model (PXM), the Codelet Model, which emphasizes the use of fine-grain event-driven tasks to carry the computation. It is inspired by the data flow execution model principles, coupled with the von Neumann execution model.
1. What Are Codelets?
A codelet is a (usually short) sequence of machine instructions that executes until completion: Save for very specific events (e.g., the core on which a codelet executes has been flagged as faulty), a codelet cannot be interrupted and migrated elsewhere (i.e., it is non-preemptible).
2. The Codelet Firing Rule
A codelet can fire if all of its dependencies are satisfied. If all of the data dependencies of a codelet are satisfied, it is enabled. If all of the resources and other events on which the codelet depends upon are also satisfied, then it is ready to fire. A codelet is firing when it is currently executing on a computation core.
3. The Codelet Abstract Machine Model
The Codelet Model relies on an abstract machine, which describes the mechanisms on which codelets rely to be allocated, stored, and scheduled. Our abstract machine is meant to reflect how future extreme-scale systems will look like. Hence, we picture a hierarchical machine with compute nodes linked by some kind of interconnect. Each compute node is composed of one or more many-core chips also linked through some interconnection network (à la HyperTransport or QPI). Each many-core chip is composed of clusters of cores with some interconnect (most likely, some kind of Network-on-Chip). Each cluster of cores is composed of two kinds of cores: Computation Units (CUs), and at least on Synchronization (sometimes called Scheduling) Unit (SU). Each CU has access to a pool of ready codelets. Each time a codelet is done running on its CU, the CU pops another ready codelet from its ready pool. The SU is in charge of managing resources (e.g., memory allocation, network bandwidth distribution, ...) and scheduling ready codelets to the right CU.
4. Codelet Graphs
When a codelet releases some resources, or produces/updates some data items, it signals the codelets and/or system software that depend on such events to become ready for execution. Hence, codelets which share dependences form a graph, which we denote as Codelet Graph (CDG). A CDG is a directed graph where the vertices are codelets, and arcs are dependencies flowing from one codelet to the other. While a CDG can be distributed over the whole Codelet Abstract Machine Model, portions of it are usually allocated to specific parts of it, usually a specific cluster. Codelet subgraphs are most often allocated through Threaded Procedures.
5. Threaded Procedures
Threaded Procedures (TPs) are asynchronous functions. They are called in a control-flow manner. TPs is composed of two parts: a frame, and a codelet graph. The frame contains all the data required by the codelets contained in the CDG to: (1) Store the inputs provided to the TP and which will act as initial data inputs for some of the codelets in the CDG, (2) Allocate enough memory for any intermediate data generated by codelets and used by others in the same CDG, and eventually bound to be deallocated when the last codelet in the TP is done firing, and finally (3) Allocate output space to store the result of the computation carried by the CDG contained in the TP. All the codelets forming the CDG contained in a TP are allocated at once: all dependencies are statically known to the TP. Hence, while the global CDG of an application is dynamically allocated through the various invocations of threaded procedures, each TP allocates a sub-graph with statically known dependences. Moreover, once scheduled for execution by an SU, i.e., once its CDG and TP-frame are allocated to a specific cluster, a TP (and its bound codelets) cannot be migrated elsewhere (barring some specific exceptions tied to fault-tolerance and resiliency issues).
Who Should Use Codelets?
Everyone! Codelets are meant to be a way to describe how parallel tasks execute at a low-level in a parallel machine. Ideally, a high-level language with relatively intuitive ways to express parallelism and locality constraints is used by the user; a compiler and its associated runtime system then generate fine-grain event-driven tasks (the codelets) to provide a "best fit" to the underlying hardware and keep it usefully busy.
Who Should Use Codelets Directly?
The Codelet Model and its implementations are meant to act as some kind of "assembly language" of parallel
applications: unless one is a tuning expert with expert hardware knowledge to better pin work to specific
parts of the underlying system, we expect codelets to be generated through some kind of high-level language
rather than specified directly using the codelet API.
Current Codelet-Based Implementations
There are multiple implementations of the codelet model, whether as a runtime system, a computer architecture project, etc. Some are direct implementations, while others "simply" take a deep inspiration from the model itself.
The Delaware Adaptive Run-Time System is the University of Delaware's own implementation of the Codelet Model specification. It is intended to be a faithful implementation of the Codelet Model itself: it maps a version of a codelet abstract machine to the underlying hardware; it also provides a two-level scheduling system to invoke threaded procedures and map them on a given cluster of cores, and run the codelets they contain within it; etc. DARTS is written in C++ and was designed to be modular: it is meant to be a research vehicle to evaluate the Codelet Model itself under various aspects: scheduling strategies, fault-resilience and recovery, energy efficiency, etc. It is currently running on x86 (64-bit) architectures, and was ported on a Dataflow-inspired simulation framework in the context of the TERAFLUX consortium. A description of DARTS' design and implementation is available on our website in the publications section (Euro-Par 2013).
The SWift Adaptive Runtime Machine is the first official implementation of the Codelet Model. It is published by ET International. A beta version is available upon request on their website. It runs on both shared-memory and distributed systems. A description of SWARM and its application to extreme-scale programming was recently published (EXADAPT 2012).
FreshBreeze is a novel computer architecture project started at MIT. The University of Delaware and MIT are closely working together to propose a simulation and compilation environment to generate FreshBreeze programs. The unit of execution is a codelet: it is a data-driven fine-grain task which executes only once all its dependencies are satisfied, and runs until completion. FreshBreeze brings its own set of features: a unified view of memory, including mass storage (at least for SSDs); a tree-based representation of memory (with an arity of 16), where each chunk is write-once; and various hardware mechanisms to facilitate the integration of this extension of virtual memory and automated resource management (e.g., and auto-buffer mechanism to deliver new chunks). Several publications describe the FreshBreeze system.
The Open-Community Runtime is a joint effort led by Intel and Rice University in the context of the DOE X-Stack initiative. While it is not officially a direct implementation of the Codelet Model, it is strongly inspired by it (and in particular, the quantum of scheduling of OCR is an "event-driven task" which has many common points with a codelet): the University of Delaware was part of the board which birthed OCR. It is written in C.
Other Codelet-Related Work
The ParalleX family of execution models describes the use of fine-grain multithreading and leverages several mechanisms such as active messages, continuations, etc. Its flagship implementation, HPX, has been evaluated on a number of applications and benchmarks.
Stéphane Zuckerman, Joshua Suetterlein, Rob Knauerhase, and Guang R. Gao. 2011.
Using a "codelet" program execution model for exascale machines: position paper.
In Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop
Era(EXADAPT '11). ACM, New York, NY, USA, 64-69. DOI=10.1145/2000417.2000424
© CAPSL 1996-2013. All Rights Reserved. firstname.lastname@example.org