The Codelet Execution Model

Fine-Grain Multithreading for Extreme-Scale Computing



Contents

The Codelet Program Execution Model

    We have defined a new program execution model (PXM), the Codelet Model, which emphasizes the use of fine-grain event-driven tasks to carry the computation. It is inspired by the data flow execution model principles, coupled with the von Neumann execution model.

1. What Are Codelets?

    A codelet is a (usually short) sequence of machine instructions that executes until completion: Save for very specific events (e.g., the core on which a codelet executes has been flagged as faulty), a codelet cannot be interrupted and migrated elsewhere (i.e., it is non-preemptible).

2. The Codelet Firing Rule

    A codelet can fire if all of its dependencies are satisfied. If all of the data dependencies of a codelet are satisfied, it is enabled. If all of the resources and other events on which the codelet depends upon are also satisfied, then it is ready to fire. A codelet is firing when it is currently executing on a computation core.

3. The Codelet Abstract Machine Model

    The Codelet Model relies on an abstract machine, which describes the mechanisms on which codelets rely to be allocated, stored, and scheduled. Our abstract machine is meant to reflect how future extreme-scale systems will look like. Hence, we picture a hierarchical machine with compute nodes linked by some kind of interconnect. Each compute node is composed of one or more many-core chips also linked through some interconnection network (à la HyperTransport or QPI). Each many-core chip is composed of clusters of cores with some interconnect (most likely, some kind of Network-on-Chip). Each cluster of cores is composed of two kinds of cores: Computation Units (CUs), and at least on Synchronization (sometimes called Scheduling) Unit (SU). Each CU has access to a pool of ready codelets. Each time a codelet is done running on its CU, the CU pops another ready codelet from its ready pool. The SU is in charge of managing resources (e.g., memory allocation, network bandwidth distribution, ...) and scheduling ready codelets to the right CU.

4. Codelet Graphs

    When a codelet releases some resources, or produces/updates some data items, it signals the codelets and/or system software that depend on such events to become ready for execution. Hence, codelets which share dependences form a graph, which we denote as Codelet Graph (CDG). A CDG is a directed graph where the vertices are codelets, and arcs are dependencies flowing from one codelet to the other. While a CDG can be distributed over the whole Codelet Abstract Machine Model, portions of it are usually allocated to specific parts of it, usually a specific cluster. Codelet subgraphs are most often allocated through Threaded Procedures.

5. Threaded Procedures

    Threaded Procedures (TPs) are asynchronous functions. They are called in a control-flow manner. TPs is composed of two parts: a frame, and a codelet graph. The frame contains all the data required by the codelets contained in the CDG to: (1) Store the inputs provided to the TP and which will act as initial data inputs for some of the codelets in the CDG, (2) Allocate enough memory for any intermediate data generated by codelets and used by others in the same CDG, and eventually bound to be deallocated when the last codelet in the TP is done firing, and finally (3) Allocate output space to store the result of the computation carried by the CDG contained in the TP. All the codelets forming the CDG contained in a TP are allocated at once: all dependencies are statically known to the TP. Hence, while the global CDG of an application is dynamically allocated through the various invocations of threaded procedures, each TP allocates a sub-graph with statically known dependences. Moreover, once scheduled for execution by an SU, i.e., once its CDG and TP-frame are allocated to a specific cluster, a TP (and its bound codelets) cannot be migrated elsewhere (barring some specific exceptions tied to fault-tolerance and resiliency issues).

Who Should Use Codelets?

    Everyone! Codelets are meant to be a way to describe how parallel tasks execute at a low-level in a parallel machine. Ideally, a high-level language with relatively intuitive ways to express parallelism and locality constraints is used by the user; a compiler and its associated runtime system then generate fine-grain event-driven tasks (the codelets) to provide a "best fit" to the underlying hardware and keep it usefully busy.

Who Should Use Codelets Directly?

    The Codelet Model and its implementations are meant to act as some kind of "assembly language" of parallel applications: unless one is a tuning expert with expert hardware knowledge to better pin work to specific parts of the underlying system, we expect codelets to be generated through some kind of high-level language rather than specified directly using the codelet API.
    In our experience, writing parallel programs with codelets is akin to writing PThread programs: with enough "self-control" it is possible to write efficient parallel code which correctly exploits the underlying hardware. However, it can be a tedious task to do if one is not a computer scientist or computer engineer.
    There are multiple projects which leverage the ideas expressed in the Codelet Model and translate high-level languages (e.g., Chapel, OpenMP, etc.) down to event-driven fine-grain tasking systems (e.g., OCR, SWARM, DARTS).


Current Codelet-Based Implementations

    There are multiple implementations of the codelet model, whether as a runtime system, a computer architecture project, etc. Some are direct implementations, while others "simply" take a deep inspiration from the model itself.

DARTS

    The Delaware Adaptive Run-Time System is the University of Delaware's own implementation of the Codelet Model specification. It is intended to be a faithful implementation of the Codelet Model itself: it maps a version of a codelet abstract machine to the underlying hardware; it also provides a two-level scheduling system to invoke threaded procedures and map them on a given cluster of cores, and run the codelets they contain within it; etc. DARTS is written in C++ and was designed to be modular: it is meant to be a research vehicle to evaluate the Codelet Model itself under various aspects: scheduling strategies, fault-resilience and recovery, energy efficiency, etc. It is currently running on x86 (64-bit) architectures, and was ported on a Dataflow-inspired simulation framework in the context of the TERAFLUX consortium. A description of DARTS' design and implementation is available on our website in the publications section (Euro-Par 2013).

SWARM

    The SWift Adaptive Runtime Machine is the first official implementation of the Codelet Model. It is published by ET International. A beta version is available upon request on their website. It runs on both shared-memory and distributed systems. A description of SWARM and its application to extreme-scale programming was recently published (EXADAPT 2012).

FreshBreeze

    FreshBreeze is a novel computer architecture project started at MIT. The University of Delaware and MIT are closely working together to propose a simulation and compilation environment to generate FreshBreeze programs. The unit of execution is a codelet: it is a data-driven fine-grain task which executes only once all its dependencies are satisfied, and runs until completion. FreshBreeze brings its own set of features: a unified view of memory, including mass storage (at least for SSDs); a tree-based representation of memory (with an arity of 16), where each chunk is write-once; and various hardware mechanisms to facilitate the integration of this extension of virtual memory and automated resource management (e.g., and auto-buffer mechanism to deliver new chunks). Several publications describe the FreshBreeze system.

OCR

    The Open-Community Runtime is a joint effort led by Intel and Rice University in the context of the DOE X-Stack initiative. While it is not officially a direct implementation of the Codelet Model, it is strongly inspired by it (and in particular, the quantum of scheduling of OCR is an "event-driven task" which has many common points with a codelet): the University of Delaware was part of the board which birthed OCR. It is written in C.

Other Codelet-Related Work

    The ParalleX family of execution models describes the use of fine-grain multithreading and leverages several mechanisms such as active messages, continuations, etc. Its flagship implementation, HPX, has been evaluated on a number of applications and benchmarks.


Software Download


Publications

Stéphane Zuckerman, Joshua Suetterlein, Rob Knauerhase, and Guang R. Gao. 2011. Using a "codelet" program execution model for exascale machines: position paper. In Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era(EXADAPT '11). ACM, New York, NY, USA, 64-69. DOI=10.1145/2000417.2000424
http://doi.acm.org/10.1145/2000417.2000424

Daniel Orozco, Elkin Garcia, Robert Pavel, Rishi Khan, and Guang R. Gao. TIDeFlow: The Time Iterated Dependency Flow Execution Model. In Proceedings of the First Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM), Galveston Island, TX, USA, 2011 vol., no., pp.1,9, 10-10 Oct. 2011 i
http://dx.doi.org/10.1109/DFM.2011.11
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6176399&isnumber=6176385

Daniel Orozco, Elkin Garcia, Robert Pavel, Rishi Khan, and Guang R. Gao. Polytasks: A Compressed Task Representation for HPC Runtimes. In Proceedings of Languages and Compilers for Parallel Computing (LCPC) 2011, Fort Collins, CO, USA. Sept. 2011.

Tom St. John, Jack B. Dennis, and Guang R. Gao. 2012. Massively parallel breadth first search using a tree-structured memory model. In Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '12). ACM, New York, NY, USA, 115-123. DOI=10.1145/2141702.2141715
http://doi.acm.org/10.1145/2141702.2141715

Chen Chen, Yao Wu, Joshua Suetterlein, Long Zheng, Minyi Guo, and Guang R. Gao. 2013. Automatic Locality Exploitation in the Codelet Model. In Proceedings of the 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications(TRUSTCOM '13). IEEE Computer Society, Washington, DC, USA, 853-862. DOI=10.1109/TrustCom.2013.104
http://dx.doi.org/10.1109/TrustCom.2013.104

Joshua Suettlerlein, Stéphane Zuckerman, and Guang R. Gao. 2013. An implementation of the codelet model. In Proceedings of the 19th international conference on Parallel Processing (Euro-Par'13), Felix Wolf, Bernd Mohr, and Dieter Mey (Eds.). Springer-Verlag, Berlin, Heidelberg, 633-644. DOI=10.1007/978-3-642-40047-6_63
http://dx.doi.org/10.1007/978-3-642-40047-6_63

Chen Chen, Yao Wu, Stéphane Zuckerman, and Guang R. Gao. 2013. Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW '13). IEEE Computer Society, Washington, DC, USA, 1607-1617. DOI=10.1109/IPDPSW.2013.47
http://dx.doi.org/10.1109/IPDPSW.2013.47

Marco Solinas, Rosa M. Badia, François Bodin, Albert Cohen, Paraskevas Evripidou, Paolo Faraboschi , Bernhard Fechner, Guang R. Gao, Arne Garbade, Sylvain Girbal, Daniel Goodman, Behram Khan, Souad Koliai, Feng Li, Mikel Luján, Laurent Morin, Avi Mendelson, Nacho Navarro, Antoniu Pop, Pedro Trancoso, Theo Ungerer, Mateo Valero, Sebastian Weis, Ian Watson, Stéphane Zuckerman, Roberto Giorgi. The TERAFLUX Project: Exploiting the DataFlow Paradigm in Next Generation Teradevices. DSD 2013: 272-279

R. Giorgi, R. M. Badia, F. Bodin, A. Cohen, P. Evripidou, P. Faraboschi, B. Fechner, G. R. Gao, A. Garbade, R. Gayatri, S. Girbal, D. Goodman, B. Khan, S. Koliaï, J. Landwehr, N. Minh L, F. Li, M. Luján, A. Mendelson, L. Morin, N. Navarro, T. Patejko, A. Pop, P. Trancoso, T. Ungerer, I. Watson, S. Weis, S. Zuckerman, M. Valero. TERAFLUX: Harnessing dataflow in next generation teradevices, Journal of Microprocessors and Microsystems: Embedded Hardware Design (MICPRO), April 2014,
http://doi.org/10.1016/j.micpro.2014.04.001.

Haitao Wei, Stéphane Zuckerman, Xiaoming Li, and Guang R. Gao. A Dataflow Programming Language and its Compiler for Streaming Systems. Procedia Computer Science, Volume 29, 2014, Pages 1289-1298, ISSN 1877-0509,
http://dx.doi.org/10.1016/j.procs.2014.05.116.
(http://www.sciencedirect.com/science/article/pii/S1877050914002932).



 


© CAPSL 1996-2013. All Rights Reserved.
capslwww@capsl.udel.edu