Power-Aware High Performance Compilation Techniques
Abstract:
The widening gap between processor cycle time and memory access time
significantly impedes the performance of modern processors, a problem
known as the ``memory wall''. Although the conventional cache
hierarchy has had considerable success in bridging this gap, recent research
results show that existing hardware-only cache solutions (for cache
mapping and replacement) often result in unsatisfactory cache utilization
and poor cache performance. To achieve better performance, it is
desirable to separate data with ``high'' reuse from data with ``poor''
reuse. In other words, a caching scheme should not let data with poor
reuse pollute the cache and displace useful data with high reuse.
Based on this observation, a number of schemes have been proposed at
both the architectural and compiler levels, with varying degrees of success.
In this talk we develop a compile-time method that analyzes application code
and effectively separates data with high reuse from data with low reuse.
This method can be directly applied to architectures that support cache
bypassing~\cite{ChiDie89,TysonEtAl95}, where data with poor reuse is
accessed directly from lower levels of the memory hierarchy and does not
occupy precious cache space. This helps to avoid cache pollution and
retain useful data in the higher levels of the memory hierarchy (say, the
L1 cache). Our approach is also applicable to other architectures,
such as caches with an ``Evict-Me'' tag.
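To make the effect of bypassing concrete, the following is a minimal sketch of a toy simulation, not the method presented in the talk. It assumes a fully associative LRU cache and a hand-written block trace; the function name `count_misses`, the `bypass` set, and the trace itself are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def count_misses(trace, capacity, bypass=frozenset()):
    """Count misses for a toy fully associative LRU cache of `capacity` blocks.

    Blocks in `bypass` are fetched from the next memory level on every
    access and are never allocated in the cache (illustrative assumption).
    """
    cache = OrderedDict()  # keys are resident blocks, ordered oldest first
    misses = 0
    for block in trace:
        if block in bypass:
            misses += 1  # served from the lower level, evicts nothing
            continue
        if block in cache:
            cache.move_to_end(block)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
    return misses


# Two reused blocks (A, B) interleaved with streaming blocks touched once each.
trace = []
for i in range(4):
    trace += ["A", "B", f"S{2 * i}", f"S{2 * i + 1}"]
streaming = frozenset(b for b in trace if b.startswith("S"))

misses_plain = count_misses(trace, capacity=2)
misses_bypass = count_misses(trace, capacity=2, bypass=streaming)
```

In this contrived trace the streaming blocks evict A and B before they are reused, so every access misses without bypassing; with the streaming blocks bypassed, A and B stay resident after their first access.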
A central question addressed by our method is: ``How does the compiler
decide which data references should bypass the data cache and which should
not, in order to achieve the best cache performance?'' We have developed a
mathematical formulation for this problem and related it to the well-known
knapsack problem, which can be solved by dynamic programming in
pseudo-polynomial time (polynomial in the number of items and the capacity).
We have implemented our method in our compiler testbed
and evaluated its effectiveness on SPEC benchmarks. Our results show that
our approach can reduce the number of L1 cache misses by up to 50\%.
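The abstract does not spell out the formulation, so the following is only a sketch of how a knapsack-style selection could look, under stated assumptions: each data reference is given a cache footprint (in blocks) and an estimated reuse benefit (hits gained if it is cached), and the cache capacity plays the role of the knapsack size. The function `select_cached_refs`, the tuple layout, and the numbers are hypothetical; the algorithm is the textbook 0/1 knapsack dynamic program.

```python
def select_cached_refs(refs, capacity):
    """Choose which references to keep in the cache via 0/1 knapsack DP.

    refs: list of (name, footprint, benefit) tuples, all values illustrative.
    Returns (total_benefit, names_to_cache); every other reference bypasses.
    """
    n = len(refs)
    # best[i][c]: maximum benefit using the first i references within capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, size, gain) in enumerate(refs, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: reference i bypasses
            if size <= c:
                with_ref = best[i - 1][c - size] + gain  # option 2: cache it
                if with_ref > best[i][c]:
                    best[i][c] = with_ref

    # Walk the table backwards to recover the chosen references.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, size, _ = refs[i - 1]
            chosen.append(name)
            c -= size
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical references: (name, footprint in blocks, estimated hits gained)
refs = [("a[i]", 4, 40), ("b[j]", 3, 10), ("c[k]", 2, 30), ("d[i+j]", 5, 20)]
benefit, cached = select_cached_refs(refs, capacity=6)
bypassed = [name for name, _, _ in refs if name not in cached]
```

With these made-up numbers the sketch caches `a[i]` and `c[k]` and marks `b[j]` and `d[i+j]` for bypassing. The table has (n + 1) x (capacity + 1) entries, so the running time grows with both the number of references and the capacity.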