Abstract:
Traditional superscalar architectures exploit instruction level
parallelism by issuing instructions from a window of
instructions. However, the amount of fetched parallelism available
from a single stream of instructions is limited and thus increasing
the size of the instruction window yields diminishing returns in
performance improvement.
An alternative is to use multithreading i.e., to allow instruction
dispatch from multiple loci of control in a program. This approach
increases the potential amount of parallelism available in the
instruction window. However, unless care is taken to dispatch into
the window instructions that can actually be executed, an instruction
that depends on a long latency operation is likely to occupy a slot in
the window for a long time before it can be actually issued. One
approach to solving this problem is to increase the size of the
instruction window. The drawback is that the complexity of the
instruction window increases quadratically with the window size. To
reduce the complexity, recent research has proposed to partition the
instruction window into multiple levels of instruction queues with
only one queue bearing the quadratic matching complexity. The problem
with the latter approach is to decide at runtime which instruction
groups to enable for matching.
This research proposes a new approach to solving this problem by the
use of judicious algorithms to select the instructions that go into
the instruction window. Rather than relying on existing superscalar
processor fetch, dispatch and scheduling logic and on partitioned
instruction queues, the compiler in a "Compiler Aided Reorder Engine"
(CARE) architecture partitions the program into multiple units of
execution, called strands, each of them 1 or more instructions in
size. Strands are fetched, dispatched, issued and executed
ballistically i.e., as atomic units, benefiting from locality and well
defined latency between its instruction members at all stages of the
processor pipeline. Strands are partitioned at compile time following
the "split-phase constraint": The request and use of the result of a
unpredictable or long latency instruction is placed in different
strands. Instructions that fall into this category are unpredictable
memory loads and long latency branches.
The CARE architecture introduces the concepts of multiple program
counters that fetch strand chains with separate dispatch windows, a
scoreboard that tracks the status of multiple concurrent strands, a
register file that guides the computation according to a compiler
preestablished partial order and loop hardware that supports the
overlap of multiple concurrent iterations.
The CARE compiler framework introduces the concept of an "Attractor
Graph" to partition a function into strands according to the
split-phase constraint. In addition, the CARE compiler organizes
strands into "strand chains" where the strand members are fetched
according to the total order imposed by the chain. Strand chaining can
help improve the utilization of architecture resources, such as
program counters (one for each active strand chain) or the instruction
window.