My Research Interests


After I joined McGill University in 1987, I established the Advanced Computer Architecture Program Structures Group (ACAPS). For related publications, please refer to Publications.

I am interested in areas related to:
  • Computer Architecture and Systems,
  • Optimizing and Parallelizing Compilers,
  • Programming Languages, and
  • Computational Science.

I. Computer Architectures and Systems

The evolution of architecture ideas at ACAPS, to a great extent, reflects the evolution of early dataflow models into multithreaded program execution models and architectures. This evolution at McGill has been centered on a succession of architecture models, described below in:
Section I.1 The McGill Dataflow Architecture (MDFA),
Section I.2 The Super-Actor Architecture Model, and
Section I.3 The EARTH Project.

Other aspects of our work include the measurement of instruction-level parallelism and the SITA tool, discussed in Section I.4, and the work on memory consistency models in Section I.5.

I.1 The McGill Dataflow Architecture Model (MDFA: 1987--1991)

Dataflow architectures have promised solutions to the two fundamental problems of von Neumann computers in multiprocessing: memory latency and synchronization overhead. However, we must not ignore the efficiency and simplicity of the instruction sequencing mechanism in von Neumann architecture models, as well as over 50 years of experience in optimizing sequential instruction execution mechanisms.

Conceived in the fall of 1987 and based on my Ph.D. research experience with the static dataflow model at MIT, the McGill Dataflow Architecture Model evolved into an efficient hybrid architecture model which employs (1) conventional architecture techniques to achieve fast pipelined instruction execution, while exploiting fine-grain parallelism by data-driven instruction scheduling; (2) a mechanism which supports concurrent operation of multiple instruction threads on the hybrid architecture model; and (3) a compiling paradigm for dataflow software pipelining which efficiently exploits loop parallelism through balancing techniques. The MDFA utilizes the so-called argument-fetching dataflow principle and is one of the first architecture models to be substantially developed based on this principle. The MDFA has been simulated on a dataflow architecture testbed developed at McGill, which includes a detailed architecture simulator and a compiler based on a SISAL compiler. The experimental results have demonstrated the effectiveness of the features of the proposed hybrid architecture. The MDFA architecture has been documented in Zaharias Paraskivas' and Jean-Marc Monti's Master's theses and Herbert Hum's Ph.D. thesis. For an introduction to the MDFA model, the reader is referred to the paper by Gao in the Journal of Parallel and Distributed Systems, Dec. 1993.

I.2 The Super-Actor Architecture Model (SAM: 1989--1993)

Based on our experience with the MDFA, we have proposed the Super-Actor Machine (SAM), which is based on a multithreaded program execution model. The SAM model retains the simplicity of the MDFA model in handling loop parallelism, such as dataflow software pipelining, and extends the MDFA model to include efficient support for dynamic parallel function invocations.

The architecture of SAM contains a novel organization of high-speed memories, known as the register-cache, for a multithreaded architecture. As the term suggests, it is organized both as a register file and as a cache. Viewed from the execution unit, its contents are addressable similarly to ordinary CPU registers, using relatively short addresses. From the main memory perspective, it is content addressable, i.e., its contents are tagged just as in conventional caches. Register allocation for the register-cache is performed adaptively at runtime, resulting in a dynamically allocated register file. The SAM has been simulated on a dataflow architecture testbed developed at McGill and CRIM, and experimental results have demonstrated its effectiveness.
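
To make the two views concrete, here is a minimal, hypothetical C sketch of the register-cache idea: the execution unit reads entries by a short index (as with registers), while the memory side locates or allocates entries by address tag (as with a cache). The structure, the direct-mapped replacement choice, and all names (rc_entry, rc_lookup_or_allocate, etc.) are invented for illustration and do not reflect the actual SAM design.

```c
#include <stdio.h>

#define RC_SIZE 8

typedef struct {
    int valid;
    unsigned long tag;   /* memory address of the cached value */
    long value;
} rc_entry;

static rc_entry rc[RC_SIZE];

/* Execution-unit view: a short register index, just like a CPU register. */
static long rc_read_by_index(int idx) { return rc[idx].value; }

/* Memory-side view: associative lookup by address tag, just like a cache.
 * On a miss, allocate an entry at runtime (here: a trivial direct-mapped
 * slot), which is what makes the register file "dynamically allocated". */
static int rc_lookup_or_allocate(unsigned long addr, long value_from_memory) {
    for (int i = 0; i < RC_SIZE; i++)
        if (rc[i].valid && rc[i].tag == addr)
            return i;                     /* hit: value already register-resident */
    int slot = (int)(addr % RC_SIZE);     /* miss: simplistic replacement choice */
    rc[slot] = (rc_entry){ 1, addr, value_from_memory };
    return slot;
}

int main(void) {
    int r = rc_lookup_or_allocate(0x1000, 42);   /* memory side fills an entry */
    printf("register %d holds %ld\n", r, rc_read_by_index(r));  /* EU-side access */
    return 0;
}
```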

The SAM architecture has been documented in Herbert Hum's Ph.D. thesis, and the register-cache idea has been described in a pending patent application. A short introduction to SAM can be found in the PARLE'91 paper by Hum and Gao.

I.3 The EARTH Project (EARTH: 1993 -- )

Evolving from MDFA and SAM, the EARTH (Efficient Architecture for Running THreads) project has the following objectives:

  • Define a multithreaded execution model based on the EARTH model. (We have changed the name of our project from MTA to EARTH (Efficient Architecture for Running THreads) because `MTA' is used as a trademark of an international computer company, and we also liked the new name better!) This model should permit a smooth integration of interprocessor communication with computation in conventional microprocessor-based high-performance machines, and be suitable for supporting parallel computation in both regular and irregular scientific and symbolic applications.

  • Implement our EARTH model on the following high-performance platforms:

    GMD MANNA 2.0 Parallel Machine: This platform will be used to study innovative hardware and software concepts for a node architecture under the EARTH model.

    Workstation Farm of IBM RS/6000s: This high-performance workstation cluster will be used to study the multiprocessing issues of multithreading as they relate to conventional distributed memory architectures.

  • Develop advanced compiler techniques for pervasive dependence analysis and thread-making transformations. Implement these techniques within the McCAT compiler and evaluate their effectiveness on the two implementation platforms.

  • Investigate the impact of the EARTH model on conventional node architectures, in particular the architectural support which may be required for efficient computation within the context of the EARTH model.

  • Demonstrate the applicability of our approach on a collection of regular and irregular scientific and symbolic applications.
  • On August 31, 1995, as a milestone of the project, we completed the inauguration of the high-performance EARTH-MANNA system on the 20-node MANNA architecture -- the result of a joint German-Canadian collaboration initiative in high-performance computing, with support from the governments of both Germany and Canada. The EARTH-MANNA system is fully functional, enabling us to experiment with large-scale benchmarks at impressive speed. We will illustrate how the results of our research on EARTH, MANNA, and EARTH-MANNA have significantly augmented our understanding of critical issues in the design and utilization of the next generation of high-performance computer systems.

For an introduction to the EARTH model and the EARTH-MANNA testbed, the reader is referred to our PACT'95 and EURO-PAR'95 papers.
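
To give a flavour of what integrating interprocessor communication with computation means in a multithreaded model of this kind, here is a minimal, hypothetical C sketch of a split-phase operation with a synchronization slot: a consumer fiber fires only after all of its outstanding events have been signalled. The names (sync_slot, remote_read, etc.) are invented for illustration and are not the EARTH programming interface; a real system would complete the remote reads asynchronously.

```c
#include <stdio.h>

typedef struct {
    int count;              /* outstanding events before the fiber may fire */
    void (*fiber)(void *);  /* code to run when count reaches zero */
    void *arg;
} sync_slot;

/* Signal one completed event; fire the fiber when all events have arrived. */
static void sync_signal(sync_slot *s) {
    if (--s->count == 0)
        s->fiber(s->arg);
}

/* Split-phase "remote read": the request returns immediately; the reply
 * deposits the value and signals the consumer's sync slot. */
static void remote_read(const int *remote, int *local, sync_slot *done) {
    *local = *remote;       /* in a real system this completes asynchronously */
    sync_signal(done);
}

static int a = 41, b = 1, x, y;

static void consumer(void *unused) {
    (void)unused;
    printf("sum = %d\n", x + y);   /* runs only after both reads complete */
}

int main(void) {
    sync_slot done = { 2, consumer, NULL };  /* wait for two replies */
    remote_read(&a, &x, &done);
    remote_read(&b, &y, &done);
    return 0;
}
```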

I.4 Parallelism Limit Studies

We have designed and developed the SITA testbed, which provides a powerful set of object-code analysis techniques for the study of algorithmic parallelism, compiler optimization strategies, and parallel architecture structures. With SITA, we can conduct studies on the limits of parallelism and the smoothability of parallelism for a wide variety of applications.

This tool has already shown a high level of potential parallelism for both scientific and non-scientific applications. The results indicate that high amounts of parallelism can be exploited smoothly (i.e., delayed without affecting the overall execution time) under various combinations of conventional architecture features, multithreading, and compiler optimization techniques.

This work has been documented in the MICRO25 paper by Theobald, Gao and Hendren.
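
As a rough, hypothetical illustration of what a parallelism limit study measures (not the SITA tool itself), the sketch below schedules a tiny invented instruction trace on an idealized unit-latency machine constrained only by true data dependences, and reports the ratio of trace length to critical-path length as the available parallelism.

```c
#include <stdio.h>

#define NREGS 8

int main(void) {
    /* Each trace entry writes one register and reads up to two others
     * (-1 means "no source").  This tiny trace is purely illustrative. */
    struct { int dst, src1, src2; } trace[] = {
        {0, -1, -1}, {1, -1, -1}, {2, 0, 1}, {3, 0, -1}, {4, 2, 3}, {5, 1, -1},
    };
    int n = sizeof trace / sizeof trace[0];

    long ready[NREGS] = {0};   /* cycle at which each register's value is ready */
    long critical_path = 0;

    for (int i = 0; i < n; i++) {
        long t = 0;            /* earliest start: after all producers finish */
        if (trace[i].src1 >= 0 && ready[trace[i].src1] > t) t = ready[trace[i].src1];
        if (trace[i].src2 >= 0 && ready[trace[i].src2] > t) t = ready[trace[i].src2];
        long finish = t + 1;   /* unit-latency oracle machine */
        ready[trace[i].dst] = finish;
        if (finish > critical_path) critical_path = finish;
    }

    printf("instructions = %d, critical path = %ld cycles, parallelism = %.2f\n",
           n, critical_path, (double)n / (double)critical_path);
    return 0;
}
```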

I.5 Memory Consistency Models: Location Consistency

A memory consistency model represents a binding "contract" between software and hardware in a shared-memory multiprocessor system. It is important to provide a memory consistency model that is easy to understand and that also facilitates efficient implementation.

The memory consistency model that has been most commonly used in the past is sequential consistency (SC), which requires the execution of a parallel program to appear as some interleaving of the memory operations on a sequential machine. To reduce the rigid constraints of the SC model, several relaxed consistency models have been proposed, notably weak ordering (or weak consistency) (WC), release consistency (RC), data-race-free-0, and data-race-free-1. These models allow performance optimizations to be correctly applied, while guaranteeing that sequential consistency is retained for a specified class of programs. We call these models SC-derived models. A central assumption in the definitions of all SC-derived memory consistency models is the memory coherence assumption, which can be informally stated as follows: "all writes to the same location are serialized in some order and are performed in that order with respect to any processor."

In our work (joint with V. Sarkar at IBM), we propose the location consistency (LC) model for memory consistency. A distinguishing feature of the LC model is that it does not rely on the assumption of memory coherence. Instead of assuming that all writes to the same location are serialized according to some total order, we model the state of a memory location as a partially ordered multiset (pomset) of write operations and synchronization operations. The partial order in the pomset naturally follows from the ordering constraints defined by a concurrent program and its execution. The LC model provides a very simple "contract" between software and hardware: only the partial order defined by the program needs to be obeyed, and there is no requirement for all processors to observe the same ordering of concurrent write operations. Furthermore, concurrent asynchronous reads and writes to the same memory location are well-defined in the LC model, rather than being considered as error conditions. We demonstrate that the LC model allows more compiler and hardware performance optimizations to be applied, compared to the other memory consistency models mentioned above. The LC model is also easier to implement, because serialization constraints imposed by SC and the SC-derived models (e.g., due to the memory coherence assumption) are removed.
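
The toy C program below is a simplified, hypothetical rendering of this idea (not the formal LC definition): a location's state is a set of write events under a partial order, and a read may legally return any write that is neither ordered after the read nor superseded by another write lying between it and the read. With two unordered writes, both values are legal, and no single total write order is imposed on all processors.

```c
#include <stdio.h>
#include <stdbool.h>

#define NEV 4

/* before[a][b] == true means event a is ordered before event b (program or
 * synchronization order), transitively closed by hand for this tiny example. */
static bool before[NEV][NEV];

/* value[e] >= 0 marks event e as a write of that value; -1 marks the read. */
static int value[NEV] = { 0, 1, 2, -1 };

int main(void) {
    /* w0 = initial write of 0; w1 = P1 writes 1; w2 = P2 writes 2 (concurrent
     * with w1); r = P1 reads after w1. */
    before[0][1] = before[0][2] = before[0][3] = true;
    before[1][3] = true;

    int read = 3;
    printf("values the read may legally return under this LC-style rule:");
    for (int w = 0; w < NEV; w++) {
        if (value[w] < 0) continue;            /* not a write */
        if (before[read][w]) continue;         /* write ordered after the read */
        bool overwritten = false;
        for (int w2 = 0; w2 < NEV; w2++) {
            if (w2 == w || value[w2] < 0) continue;
            /* w is superseded if another write lies between w and the read */
            if (before[w][w2] && before[w2][read]) overwritten = true;
        }
        if (!overwritten) printf(" %d", value[w]);
    }
    printf("\n");   /* prints: 1 2  -- no single total write order is imposed */
    return 0;
}
```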

The basic LC model has appeared in ICPP'95. Shamir Merali at ACAPS is now investigating an efficient cache management scheme under the LC model, which will lead to his Master's thesis.

I.6 International Interaction and Collaboration

Members of the ACAPS group and I have been invited to present our research results at several important international workshops, conferences and panel discussions. For example, I served as a co-chair of the International Workshop on Multithreaded Computers, held in conjunction with IEEE Supercomputing'91 in Albuquerque, New Mexico, the first such meeting in this field. I also served as a co-chair of the Second International Workshop on Dataflow and Multithreaded Computers, held in conjunction with the 19th ACM Symposium on Computer Architecture in Queensland, Australia, May 1992, and as the Program Chairman of the International Conference on Parallel Architectures and Compilation Techniques (PACT '94), held in Aug. 1994 in Montreal, co-sponsored by IFIP and ACM SIGARCH, in association with ACM SIGPLAN, IEEE TCCA (Technical Committee on Computer Architecture) and IEEE TCPP (Technical Committee on Parallel Processing).

Three monographs based on the above meetings have been co-edited by my colleagues and myself, and have been a major source of research references in the field of multithreaded computers.

II. Optimizing and Parallelizing Compilers

In the design of today's high-performance architectures, new compiler concepts and techniques are essential to cope with rapidly advancing hardware technology. In fact, recent architectural breakthroughs have often been directly accompanied by innovative compiler optimizations.

In Section II.1 we discuss our work on instruction-level parallelism (ILP). In Section II.2, we summarize our work on compiling for distributed memory machines. In Section II.3, we outline our work on program flow analysis based on DJ graphs.

Our main contributions are listed below.

II.1 Compiling for Instruction-Level Parallelism (ILP)

  • Dataflow Software Pipelining

    Dataflow software pipelining originated from my Ph.D. dissertation work as an efficient code mapping methodology for pipelining array operations embedded in loops on static dataflow computers. See the monograph I edited for Kluwer Publishers in 1992 for an extensive treatment.

  • A Timed-Petri Net Model for Loop Scheduling

    Efficient execution of loops is one of the most important challenges for high-performance computer architectures. Loop scheduling involves handling a partially ordered set of operations which are to be performed repetitively over a sequence of iterations. Our original contribution is the proposal of applying the theory and methods of timed Petri nets to fine-grain loop scheduling, due to their unique power in modeling both the partial order and the cyclic nature of the loop execution sequence in a unified fashion. The behavior of a Petri net can be modeled by constructing its behavior graph (at compile time), which contains a repetitive pattern in its firing sequence known as the cyclic frustum.

    Our work based on the Petri net model has been documented in Y-B. Wang's Master's thesis. The loop storage model for dataflow graphs has been included in Qi Ning's Ph.D. thesis.

  • Software Pipelining Based on Modulo Scheduling Techniques

    We have focused on the use of periodic schedules in software pipelining for single and nested loops. In our POPL'93 paper, we developed a framework to handle scheduling and register allocation simultaneously based on integer linear programming. A key intuition leading to our surprisingly simple formulation and its efficient solution is the association of the maximum computation rate of a program graph with its critical cycles, following Reiter's pioneering work on Karp-Miller computation graphs. In particular, our method generalizes the work by Callahan, Carr and Kennedy on scalar expansion, the work by Lam on modulo variable expansion for software pipelined loops, and the work by Rau et al. on register allocation for modulo scheduled loops. This framework has also been described in Ning's Ph.D. thesis.
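
    As a small, hypothetical illustration of the critical-cycle bound behind such periodic scheduling formulations (the cycle data below are invented), the sketch computes the recurrence-constrained minimum initiation interval of a loop: each dependence cycle c forces II >= ceil(delay(c)/distance(c)), and the maximum computation rate is the reciprocal of the largest such bound.

```c
#include <stdio.h>

struct cycle { int delay; int distance; const char *name; };

int main(void) {
    /* delay(c): total operation latency around the cycle;
     * distance(c): total loop-carried dependence distance around the cycle. */
    struct cycle cycles[] = {
        { 6, 2, "c1" },   /* bound: ceil(6/2) = 3 */
        { 5, 1, "c2" },   /* bound: ceil(5/1) = 5  -> the critical cycle */
    };
    int n = sizeof cycles / sizeof cycles[0];

    int rec_mii = 1;
    for (int i = 0; i < n; i++) {
        int bound = (cycles[i].delay + cycles[i].distance - 1) / cycles[i].distance;
        printf("cycle %s: delay=%d distance=%d -> II >= %d\n",
               cycles[i].name, cycles[i].delay, cycles[i].distance, bound);
        if (bound > rec_mii) rec_mii = bound;
    }
    /* The maximum computation rate is 1/rec_mii iterations per cycle. */
    printf("recurrence-constrained minimum II = %d\n", rec_mii);
    return 0;
}
```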

    We have also proposed "interval graph" based register allocation algorithms, which appear to provide a good representation for studying combined instruction scheduling and register allocation methods. Most recently, we have utilized this ILP based framework to find rate-optimal periodic schedules under both resource and register constraints. We focus on the following software pipelining problem: given a loop and a machine architecture with a fixed number of processor resources (e.g., function units), how can one construct a software-pipelined schedule which runs on the given architecture at the maximum possible iteration rate (so-called rate-optimal) while minimizing the number of registers or fitting within given register constraints? Our main contributions are as follows. First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear overall picture which permits one to visualize the solution space (for rate-optimal schedules) under different sets of constraints. Secondly, we show that a precise mathematical formulation and its solution do make a significant performance difference: we evaluated the performance of our method against three other leading contemporary heuristic methods and demonstrated that our method performed significantly and consistently better than these methods.

    We further expanded our work to address the following challenge: how to represent resource constraints for unclean pipelines (e.g., non-pipelined, or pipelined but having structural hazards) and their assignment (mapping) simultaneously under a unified ILP framework. We have proposed a method to construct rate-optimal software pipelined schedules for pipelined architectures with structural hazards.

    A distinct feature of this work is that it gives a unified treatment to two challenging and interlaced aspects of software pipelining, assignment (mapping) and scheduling for the available function units, under a single ILP framework.

    Research along this line is now leading to a Ph.D. thesis by Erik Altman. For highlights of this work, the reader is referred to our MICRO28 and PLDI papers by Erik, Govind and Gao.

    Over the years we have developed a comprehensive scheduling toolkit called MOST, which integrates several tools for modulo scheduling. It has been used extensively in the evaluation of our techniques for industry-strength compiler applications. The results of such studies are now leading to a Master's thesis by Artour Stouchinin.

    The impact of dynamic scheduling mechanisms, such as the reorder buffer (or completion buffer) and the renaming mechanisms in modern superscalar processors, on scheduling and register allocation has been studied in Luis Lozano's Master's thesis. A detailed simulation tool for a generic superscalar processor architecture, called SuperDLX, has been developed as a Master's thesis project by Cecile Moura.

  • A Methodology for Collective Loop Optimization

    Traditionally, the methodology of loop optimization and parallelization is to perform optimization on a single loop (or loop nest), one at a time. We have developed a methodology of collective loop optimization. This methodology has been applied to a collection of loops to perform a novel optimization called array contraction, which saves space and time by converting an array variable into a scalar variable or a buffer containing a small number of scalar variables. We show that the array contraction problem can be solved efficiently for a class of loops; the small example below gives the flavour of the transformation. These results have been documented in Russel Olsen's Master's thesis.
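
    As a tiny, hypothetical example of array contraction (the loop and array names are invented), the temporary array a[] below only carries a value within the same iteration that produces it, so the whole array can be contracted to a single scalar:

```c
#include <stddef.h>

/* Before contraction: the temporary array a[] holds one value per iteration,
 * but each element is consumed in the same iteration that produces it. */
void before(const double *b, double *c, double *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0 * b[i];
        c[i] = a[i] + 1.0;
    }
}

/* After contraction: the array a[] collapses to one scalar, saving n elements
 * of storage and the associated memory traffic. */
void after(const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double t = 2.0 * b[i];
        c[i] = t + 1.0;
    }
}
```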

II.2 Compiling for Distributed Memory Machines

We have been working on compiling methodologies and techniques for distributed memory machines.

We have proposed and implemented a parallelizing technique which automatically computes computation and data decompositions. Under our decomposition algorithms, a program is divided into collections (called clusters) of loop nests. Within each cluster of nests, the data locality constraints are formulated as a system of homogeneous linear equations which is solved by polynomial-time algorithms. Our algorithms also attempt to minimize data redistribution communication between loop clusters. The data locality constraints within a cluster may need to be relaxed to trade locality for parallelism. Such relaxations are guided by exploiting the hierarchical program nesting structures from outer to inner nesting levels. Our method can handle complex programs containing perfectly and non-perfectly nested loops with or without loop-carried dependences. To apply our method, we have extended the array dataflow analysis technique by P. Feautrier and performed an initial implementation and experimentation on benchmark programs such as the Perfect Club Benchmarks. This work is partly documented in R. Wen's Master's thesis, now in preparation.

Our technique has been tested in a research High-Performance C (HPC) compiler developed under the EPPP (Environment of Portable Parallel Programming) project at CRIM, in which I have been a main investigator, and preliminary results are very encouraging. The reader is referred to our paper in Parallel Programming Letters (to appear).

We have also proposed a novel technique for computation and data decomposition based on collective loop optimization, through joint work with Dr. Vivek Sarkar at the IBM Santa Teresa Laboratory.

II.3 Efficient Program Analysis Using DJ Graphs

Dataflow analysis frameworks based on Static Single Assignment (SSA) form, Sparse Evaluation Graphs (SEG), Dependence Flow Graphs (DFG), etc., demand fast computation of the program points where data flow information must be merged, the so-called Phi-nodes. To the best of our knowledge, the problem of finding an algorithm for computing Phi-nodes for an arbitrary SEG in linear time had remained open. We have presented a surprisingly simple algorithm for computing Phi-nodes for arbitrary flow graphs (reducible or irreducible) that runs in linear time. We employ a novel program representation called the DJ graph.

The skeleton of the DJ graph of a program is the dominator tree of its flow graph. The tree skeleton is augmented with join edges (called J-edges) from the original flow graph, which potentially lead to join nodes where data flow information is merged. Based on DJ graphs, we have developed other novel and efficient algorithms to solve a series of problems in flow graph analysis, such as multi-node immediate dominator analysis, identification of reducible and irreducible graphs, and incremental algorithms for maintaining dominator trees. This research is leading to V.C. Sreedhar's Ph.D. thesis. Highlights of this work can be found in a POPL'95 paper by Sreedhar and Gao.
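
As a small, hypothetical illustration of the representation (the example CFG and its precomputed immediate dominators are invented, and real use would first compute the dominator tree, e.g., with the Lengauer-Tarjan algorithm), the sketch below builds a DJ graph by taking the dominator tree edges as D-edges and adding every flow-graph edge x -> y with x != idom(y) as a J-edge:

```c
#include <stdio.h>

#define N 5

int main(void) {
    /* Tiny CFG: 0->1, 0->2, 1->3, 2->3, 3->4 (node 0 is the entry). */
    int edges[][2] = { {0,1}, {0,2}, {1,3}, {2,3}, {3,4} };
    int nedges = sizeof edges / sizeof edges[0];

    /* Precomputed immediate dominators (idom[entry] = -1). */
    int idom[N] = { -1, 0, 0, 0, 3 };

    /* D-edges come straight from the dominator tree. */
    for (int y = 0; y < N; y++)
        if (idom[y] >= 0)
            printf("D-edge: %d -> %d\n", idom[y], y);

    /* A flow-graph edge x->y is a J-edge when x is not y's immediate
     * dominator, i.e. y is a potential join point for dataflow information. */
    for (int i = 0; i < nedges; i++) {
        int x = edges[i][0], y = edges[i][1];
        if (x != idom[y])
            printf("J-edge: %d -> %d\n", x, y);
    }
    return 0;
}
```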

II.4 International Interaction and Collaborations

The work at McGill on compiling has been in constant interaction with the international research community in this field, and I have been invited to present papers at the 2nd, 3rd, 4th, 5th, 6th and 8th Workshops on Languages and Compilers for Parallel Computing (1989, '90, '91, '92, '93, '95). This workshop series is a very important and well-known international event in this research area, and has been a forum for invited speakers to present their most recent significant research results. I have also been elected to the program committees of several international conferences in the field, most recently PACT'95, PACT'96 and MICRO28.

III. Programming Languages

IV. Computational Science
