Ph.D. Defense

Abstract:

Traditional superscalar architectures exploit instruction level parallelism by issuing instructions from a window of instructions. However, the amount of fetched parallelism available from a single stream of instructions is limited and thus increasing the size of the instruction window yields diminishing returns in performance improvement.

An alternative is to use multithreading i.e., to allow instruction dispatch from multiple loci of control in a program. This approach increases the potential amount of parallelism available in the instruction window. However, unless care is taken to dispatch into the window instructions that can actually be executed, an instruction that depends on a long latency operation is likely to occupy a slot in the window for a long time before it can be actually issued. One approach to solving this problem is to increase the size of the instruction window. The drawback is that the complexity of the instruction window increases quadratically with the window size. To reduce the complexity, recent research has proposed to partition the instruction window into multiple levels of instruction queues with only one queue bearing the quadratic matching complexity. The problem with the latter approach is to decide at runtime which instruction groups to enable for matching.

This research proposes a new approach to solving this problem by the use of judicious algorithms to select the instructions that go into the instruction window. Rather than relying on existing superscalar processor fetch, dispatch and scheduling logic and on partitioned instruction queues, the compiler in a "Compiler Aided Reorder Engine" (CARE) architecture partitions the program into multiple units of execution, called strands, each of them 1 or more instructions in size. Strands are fetched, dispatched, issued and executed ballistically i.e., as atomic units, benefiting from locality and well defined latency between its instruction members at all stages of the processor pipeline. Strands are partitioned at compile time following the "split-phase constraint": The request and use of the result of a unpredictable or long latency instruction is placed in different strands. Instructions that fall into this category are unpredictable memory loads and long latency branches.

The CARE architecture introduces the concepts of multiple program counters that fetch strand chains with separate dispatch windows, a scoreboard that tracks the status of multiple concurrent strands, a register file that guides the computation according to a compiler preestablished partial order and loop hardware that supports the overlap of multiple concurrent iterations.

The CARE compiler framework introduces the concept of an "Attractor Graph" to partition a function into strands according to the split-phase constraint. In addition, the CARE compiler organizes strands into "strand chains" where the strand members are fetched according to the total order imposed by the chain. Strand chaining can help improve the utilization of architecture resources, such as program counters (one for each active strand chain) or the instruction window.