# A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures

Chen Chen<sup>1</sup>, Joseph B. Manzano<sup>2</sup>, Ge Gan<sup>2</sup>, Guang R. Gao<sup>2</sup>, and Vivek Sarkar<sup>3</sup>

<sup>1</sup> Tsinghua University, Beijing 100084, P.R. China

<sup>2</sup> University of Delaware, Newark DE 19716, USA

<sup>3</sup> Rice University, Houston TX 77251, USA

**Abstract.** This paper is motivated by the desire to provide an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM CELL architecture in particular. In this paper, we propose an instantiation of the OpenMP memory model with the following advantages: (1) The proposed instantiation prohibits undefined values that may cause problems of safety, security, programming and debugging. (2) The proposed instantiation is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable. (3) The proposed instantiation avoids the ambiguity of the original memory model definition proposed on the OpenMP Specification 3.0.

We also introduce a new cache protocol for this instantiation, which can be implemented as a software-controlled cache. Experimental results on the Cell Broadband Engine show that our instantiation results in nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage when comparing it to a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations.

# 1 Introduction

An important open problem for future multicore and manycore chip architectures is the development of shared-memory organizations and memory consistency models (or memory models for short) that are effective for small local memory sizes in each core, scalable to a large number of cores, and still productive for software to use. Despite the fact that strong memory models such as Sequential Consistency (SC) [1] are supported on mainstream small-scale SMPs, it seems likely that weaker memory models will be explored in current and future multicore and manycore architectures such as the Cell Broadband Engine [2], Tilera [3], and Cyclops64 [4].

OpenMP [5] is a natural candidate as a programming model for multicore and manycore processors with software-managed local memories, thanks to its weak memory model. In the OpenMP memory model, each thread may maintain a *temporary view* of the shared memory which "allows the thread to cache variables and thereby to avoid going to memory for every reference to a variable" [5]. It includes a *flush* operation on

P. D'Ambra, M. Guarracino, and D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 341–352, 2010. © Springer-Verlag Berlin Heidelberg 2010

a specified *flush-set* that can be used to synchronize the temporary view with the shared memory for the variables in the flush-set. It is a weak consistency model "because a thread; s temporary view of memory is not required to be consistent with memory at all times" [5]. This relaxation of the memory consistency constraints provides room for computer system designers to experiment with a wide range of caching schemes, each of which has different performance and cost tradeoff. Therefore, the OpenMP memory model can exhibit very different instantiations, each of which is a memory model that is stronger than the OpenMP memory model, *i.e.*, any legal value under an instantiation is also a legal value under the OpenMP memory model, but not vice versa.

Among various instantiations of the OpenMP memory model, an important problem is to find an instantiation that can be efficiently implemented on multicore and manycore architectures and easily understood by programmers.

#### 1.1 A Key Observation for Implementing the Flush Operation Efficiently

The flush operation synchronizes temporary views with the shared memory. So it is more expensive than read and write operations. In order to efficiently implement the OpenMP memory model, the instantiation should be able to implement the flush operation efficiently.

Unfortunately, the OpenMP memory model has the serialization requirement for flush operations, *i.e.*, "if the intersection of the flush-sets of two flushes performed by two different threads is non-empty, then the two flushes must be completed as if in some sequential order, seen by all threads" [5]. Therefore, it seems that it is very hard to efficiently implement the flush operation because of the serialization requirement. However, this requirement has a hidden meaning that is not clearly explained in [5]. The hidden meaning is the key for efficiently implement the flush operation.

We use an example to explain the real meaning of the serialization requirement. For the program in Fig. 1, it seems that the final status of the shared memory must be either x = y = 1 or x = y = 2 according to the serialization requirement. However, after discussion with the OpenMP community, x = 1, y = 2 and x = 2, y = 1 are also legal results under the OpenMP memory model. The reason is that the OpenMP memory model allows flush operations to be completed earlier (but cannot be later) than the flush points (statements 3 and 6 in this program). Therefore, one possible way to get the result x = 1, y = 2 is that firstly thread 2 assigns 2 to x and immediately flushes x into the shared memory, then thread 1 assigns 1 to x and 1 to y and then flushes x and y, and finally thread 2 assigns 2 to y and flushes y. Therefore, we get a key observation for implementing the flush operation efficiently as follows.

| Thread 1                                    | Thread 2                          |
|---------------------------------------------|-----------------------------------|
| 1: $x = 1$ ;                                | 4: x = 2;                         |
| 2: $y = 1;$                                 | 5: y = 2;                         |
| 3: flush(x,y);                              | 6: flush(x,y);                    |
| Is $x = 1, y = 2$ (or $x = 2, y = 1$ ) lega | al under the OpenMP memory model? |

Fig. 1. A motivating example for understanding the serialization requirement under the OpenMP memory model

**The Key Observation:** A flush operation on a flush-set of shared locations can be decomposed into unordered flush operations on each individual location. Each flush operation after decomposition must be completed no later than the flush point of the original flush operation. Assuming that a memory location is the minimal unit for atomic memory accesses, the serialization requirement is naturally satisfied.

## **1.2 Main Contributions**

In this paper, we propose an instantiation of the OpenMP memory model based on the key observation in Section 1.1. It has the following advantages.

- Our instantiation prohibits undefined values that may cause problems of safety, security, programming and debugging. The OpenMP memory model may allow programs with data races to generate undefined values. However, in our instantiation, all return values must be in a subset of initial value and the values that was written by some thread before. Since the OpenMP memory model allows programs with data races<sup>1</sup>, our instantiation would be helpful when programming in such cases.
- Our instantiation is scalable with respect to the number of threads because it does not rely on communication among threads or a centralized directory that maintains consistency of multiple copies of each shared variable.
- Our instantiation avoids the ambiguity of the original memory model definition proposed on the OpenMP Specification 3.0, such as the unclear serialization requirement, the problem of handling temporary overflow and the unclear semantics for programs with data races. Therefore, our instantiation is easy to understand from the angle of efficient implementations.

We also propose the cache protocol of the instantiation and implement the softwarecontrolled cache on Cell Broadband Engine. The experimental results show that our instantiation has nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage when comparing it to a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations.

The rest of the paper is organized as follows. Section 2 introduces our instantiation of the OpenMP memory model. Section 3 introduces the cache protocol of the instantiation. Section 4 presents the experimental results. Section 5 discusses the related work. The conclusion is presented in Section 6.

# 2 Formalization of Our OpenMP Memory Model Instantiation

A necessary prerequisite to build OpenMP's software cache implementations is the availability of formal memory models that establish the legality conditions for determining if an implementation is correct. As observed in [6], "it is impossible to verify OpenMP applications formally since the prose does not provide a formal consistency

<sup>&</sup>lt;sup>1</sup> Section 2.8.6 of the OpenMP specification 3.0 [5] shows a program with data races that implements critical sections.

model that precisely describes how reads and writes on different threads interact". While there is general agreement that the OpenMP memory model is based on *temporary views* and *flush* operations, discussion with OpenMP experts led us to conclude that the OpenMP specification provides a lot of leeway on when *flush* operations can be performed and on the inclusion of additional flush operations (not specified by the programmer) to deal with local memory size constraints.

In this section, we formalize an instantiation of the OpenMP Memory Model — *Model*LF, based on the key observation in Section 1.1. *Model*LF builds on OpenMP's relaxed-consistency memory model in which each worker thread maintains a temporary view of shared data which may not always be consistent with the actual data stored in the shared memory. The OpenMP *flush* operation is used to establish consistency between these temporary views and the shared memory at specific program points. In ModelLF, each flush operation only forces local temporary view to be consistent with the shared memory. That is why we call it *Model*LF where "LF" means local flush. A flush operation is only applied on a single location. We assume that a memory location is the minimal unit for atomic memory accesses. Therefore, the serialization requirement of flush operations is naturally satisfied. A flush operation on a set of shared locations is decomposed into unordered flush operations on each individual location, where those flush operations after decomposition must be completed no later than the flush point of the original flush operation. So it avoids the known problem of decomposition as explained in Section 2.8.6 of the OpenMP specification 3.0 [5], where the compiler may reorder the flush operations after decomposition to a later position than the flush point and cause incorrect semantics.

#### 2.1 Operational Semantics of ModelLF

In this section, we define the operational semantics of *Model*LF. Firstly, we introduce a little background for the definition. A store,  $\sigma$ , is a mathematical representation of the machine's shared memory, which maps memory location addresses to values  $(\sigma : addr \mapsto val)$ . We model temporary views by introducing a distinct store,  $\sigma_i$ , for each worker thread  $T_i$  in an OpenMP parallel region. Following OpenMP's convention, thread  $T_0$  is assumed to be the master thread.  $\sigma_i[l]$  represents the value stored in location l in thread  $T_i$ 's temporary view. The flush operation,  $flush(T_i, l)$  makes temporary view  $\sigma_i$  consistent with the shared memory  $\sigma$  on location l.

Under *Model*LF, program flush operations are performed at the program points specified by the programmer. Moreover, additional flush operations may be inserted nondeterministically by the implementation at any program point, which makes it possible to implement the memory model with bounded space for temporary views, such as caches. The operational semantics of memory operations of *Model*LF include the read, write, program flush operation and nondeterministic flush operation defined as follows:

- Memory read: If thread  $T_i$  needs to read the value of the location l, it performs a  $read(T_i, l)$  operation on store  $\sigma_i$ . If  $\sigma_i$  does not contain any value of l, the value in the shared memory will be loaded to  $\sigma_i$  and returned to the read operation.
- Memory write: If thread  $T_i$  needs to write value v to the location l, it performs a  $write(T_i, v, l)$  operation on store  $\sigma_i$ .

- **Program / Nondeterministic flush:** If thread  $T_i$  needs to synchronize  $\sigma_i$  with the shared memory on a shared location l, it performs a  $flush(T_i, l)$  operation. If  $\sigma_i$  contains a "dirty value" <sup>2</sup> of l, it will write back the value into the shared memory. After the flush operation,  $\sigma_i$  will discard the value of l. A thread performs program flush operations at program points specified by the programmer, and can nondeterministically perform flush operations at any program point. All the program and nondeterministic flush operations on the same shared location must be observed by all threads to be completed in the same sequential order.

# **3** Cache Protocol of *Model*LF

In this section, we introduce the cache protocol that implements *Model*LF. We assume that each thread contains a cache which corresponds to its temporary view. Therefore, performing operations on temporary views is equivalent to performing such operations on the caches. Without loss of generality, in this section, we assume that each operation is performed on one cache line. The reason is that an operation on one cache line can be decomposed into sub operations; each of which is performed on a single location. We use per-location dirty bits in a cache line to take care of the decomposition problem.

#### 3.1 Cache Line States

We assume that each cache line contains multiple locations. Each location contains a value that can be a "clean value" <sup>3</sup>, a "dirty value", or an "invalid value". Each cache line can be in one of the five states as follows.

Invalid: All the locations contain "invalid values".

Clean: All the locations contain "clean values".

Dirty: All the locations contain "dirty values".

Clean-Dirty: Each location contains either a "clean value" or a "dirty value".

Invalid-Dirty: Each location contains either an "invalid value" or a "dirty value".

For simplicity, the cache line cannot be in other states such as **Invalid-Clean**. Additional nondeterministic flush operations may be performed when necessary to force the cache line to be in one of the five states as above. We use a per-line flag bit together with the dirty bits to represent the state of the cache line. The flag bit indicates whether those non-dirty values in the cache line are clean or invalid.

## 3.2 Cache Operations and State Transitions

The state transition diagram of *Model*LF cache protocol is shown in Fig. 2. Now we explain how each cache operation affects the state transition diagram.

**Memory read:** If the original state of the cache line is invalid or invalid-dirty, the invalid locations will load "clean values" from memory. Therefore, the state will change to clean or clean-dirty, respectively. In other cases, the state will not change. After that, the values in the cache line will be returned.

<sup>&</sup>lt;sup>2</sup> The term "dirty value" means that the value of location l was modified by thread  $T_i$ .

<sup>&</sup>lt;sup>3</sup> The term "clean value" means that the value was read but not modified by the thread.



Fig. 2. State transition diagram for the cache protocol of ModelLF

**Memory write:** A write operation writes specified "dirty values" to the cache line. Therefore, if the original state is invalid or invalid-dirty, it becomes either invalid-dirty or dirty after the write operation, which depends on whether all the locations contain "dirty values". In other cases, the state will become either clean-dirty or dirty, which depends on whether all the locations contain "dirty values".

**Program / Nondeterministic flush:** A flush operation forces all the "dirty values" of the cache line to be written back into memory. Then, the state will become invalid.

There may be various ways to implement the flush operation. For example, many architectures support a block of data to be written back at a time. So a possible way of implementing the flush operation is to write back the entire cache line that is being flushed together with the dirty bits and then merge the "dirty values" into the corresponding memory line in the shared memory. If the mergence in memory is not supported, a thread has to load the memory line, and then merge it with the cache line, and finally write back the merged line, where the process must be atomic to handle the false sharing problem. For example, on the Cell processor, atomic DMA operations can be used to guarantee atomicity of the process.

## **4** Experimental Results and Analyses

In this section, we introduce our experimental results under *Model*LF cache protocol. In section 4.1, we introduce the experimental testbed. Then in section 4.2, we introduce the major observations of our experiments. Finally, we introduce the details and analyses of the observations in the last two sections.

#### 4.1 Experimental Testbed

The experimental results presented in this paper were obtained on CBEA (Cell Broadband Engine Architecture) [2] under the OPELL (OPenmp for cELL) framework [7].

**CBEA:** CBEA has a main processor called the Power Processing Element (PPE) and a number of co-processors called the Synergistic Processing Elements (SPEs). The PPE handles most of the computational workload and has control over the SPEs, *i.e.*, start, stop, interrupt, and schedule processes onto the SPEs. Each SPE has a 256KB local

storage which is used to store both instructions and data. An SPE can only access its own local storage directly. Both PPE and SPEs share main memory. SPEs access main memory via DMA (direct memory access) transfers which are much slower than the access on each SPE's own local storage.

We executed the programs on a PlayStation 3 [8] which has one 3.2 GHz Cell Broadband Engine CPU (with 6 accessible SPEs) and 256MB global shared memory. Our experiments used all 6 SPEs with the exception of the evaluation of speedup which used various numbers of SPEs from 1 to 6.

**OPELL Framework:** OPELL is an open source toolchain / runtime effort to implement OpenMP for the CBEA. OPELL has a single source compiler which compiles an OpenMP program to a single source file that is executable on CBEA.

During runtime, the executable file starts to run the sequential region of the program on PPE. Once the program enters a parallel region, PPE will assign tasks of computing parallel codes to SPEs. After SPEs finish the tasks, the parallel region ends and PPE will go ahead to execute the following sequential region.

Since each SPE only has 256KB local storage to store both instructions and data, OPELL has a partition /overlay manager runtime library that partitions the parallel codes into small pieces to fit for the local storage size, and loads and replaces those pieces on demand.

Since a DMA transfer is much slower than an access on the local storage, OPELL has a software cache runtime library to take advantage of locality. The runtime library manages a part of local storages as caches and has a user interface for accessing. We implement our cache protocol in OPELL's software cache runtime library. The cache protocol uses 4-way set associative caches. The size of each cache line is 128 bytes. We ran the experiments on various cache sizes which range from 4KB to 64KB. We did not try bigger cache size because the size of local storage is very limited (256KB) and a part of it is used to store instructions and maintain stack.

**Benchmarks:** We used three benchmark programs in our experiments — Integer Sort (IS), Embarrassingly Parallel (EP) and Multigrid (MG) from the NAS Parallel Benchmarks [9].

## 4.2 Summary of Main Results

The main results of our experiments are as follows:

**Result I:** Scalability (Section 4.3): ModelLF cache protocol has nearly linear speedup with respect to the number of threads for the tested benchmarks.

**Result II:** Impact of Cache Size (Section 4.4): We use another instantiation of the OpenMP memory model —  $ModelGF^4$ , to compare with ModelLF. ModelGF maintains a global total ordering among flush operations. The difference between ModelGF and ModelLF is that when ModelGF performs a flush operation on a location l, it enforces the temporary views of all threads to see the same value of l by discarding the values of l in the temporary views. To implement ModelGF, we simulate a centralized directory

<sup>&</sup>lt;sup>4</sup> Operational semantics of *Model*GF is defined in [10].

that maintains the information for all the caches. When a flush operation on a location *l* is performed, the directory informs all the threads that contain the value of *l* to discard the value. We assume that the centralized directory is "ideal", *i.e.*, the cost of maintenance and lookup is trivial. However, the cost of informing a thread is as expensive as a DMA transfer because the directory is placed in main memory. *Model*LF outperforms *Model*GF due to its cheaper flush operations. Our results show that the performance gap between *Model*LF and *Model*GF cache protocols increases as the cache size becomes smaller. This observation is significant because the current trend in multicore and many-core processors is that the local memory size per core decreases as the number of cores increases.

# 4.3 Scalability

Fig. 3 shows the speedup as a function of the number of SPEs (Each SPE runs one thread.) under *Model*LF cache protocol. The tested applications are MG with a 32KB cache size, and IS and EP with a 64KB cache size. All the three applications have input size *W*. We can see that for IS and EP benchmarks, *Model*LF cache protocol nearly achieves linear speedup. For MG benchmark, the speedup is not as good as the other two when the number of threads is 3, 5 and 6. The reason is that the workloads among threads are not balanced when the number of threads is not a power of 2.



Fig. 3. Speedup as a function of the number of SPEs under ModelLF cache protocol

## 4.4 Impact of Cache Size

Fig. 4 and 5 show execution time and cache eviction ratio curves for IS and MG with input size *W* on various cache sizes (4KB, 8KB, 16KB, 32KB and 64KB<sup>5</sup>) per thread. The two figures show that the cache eviction ratio curves under the two cache protocols are equal, but the execution time curves are not. Moreover, the difference in execution time becomes larger as the cache size becomes smaller. This is because the cost of cache eviction in *Model*GF cache protocol is much higher. Moreover, the smaller the cache size is, the higher the cache eviction ratio is. To show the change of performance gap clearly, we normalize the execution time into the interval [0, 1] by applying division on every execution time where the divisor is the maximal execution time are 4KB cache sizes under *Model*GF for both MG and IS. The performance gap between *Model*GF and *Model*LF keeps constantly for EP when we change the cache sizes. The reason is that EP has very bad temporal locality. So it is insensitive to the change of cache sizes.

<sup>&</sup>lt;sup>5</sup> 64KB is only for IS.

#### A Study of a Software Cache Implementation of the OpenMP Memory Model 349



SPE was decreased from 64KB to 4KB.



Fig. 4. Trends of execution time and cache eviction ratio for IS-W on various cache sizes

Fig. 5. Trends of execution time and cache eviction ratio for MG-W on various cache sizes

# 5 Related Work

Despite over two decades of research on memory consistency models, there does not appear to be a consensus on how memory models should be formalized [11,12,13,14]. The efforts to formalize memory models for mainstream parallel languages such as the Java memory model [15], the C++ memory model [16], and the OpenMP memory model [6] all take different approaches.

The authoritative source for the OpenMP memory model can be found in the specifications for OpenMP 3.0 [5], but the memory model definition therein is provided in terms of informal prose. To address this limitation, a formalization of the OpenMP memory model was presented in [6]. In this paper, the authors developed a formal, mathematical language to model the relevant features of OpenMP. They developed an operational model to verify its conformance to the OpenMP standard. Through these tools, the authors found that the OpenMP memory model is weaker than the weak consistency model [17]. The authors also claimed that they found some ambiguities in the informal definition of the OpenMP memory model presented in the OpenMP specification version 2.5 [18]. Since there is no significant change of the OpenMP memory model from version 2.5 to version 3.0, their work demonstrates the need for the

OpenMP community to work towards a formal and complete definition of the OpenMP memory model.

Some early research on software controlled caches can be found in the NYU Ultracomputer [19], Cedar [20], and IBM RP3 [21] projects. All three machines have local memories that can be used as programmable caches, with software taking responsibility for maintaining consistency by inserting explicit synchronization and cache consistency operations. By default, this responsibility falls on the programmer but compiler techniques have also been developed in which these operations are inserted by the compiler instead, *e.g.*, [22]. Interest in software caching has been renewed with the advent of multicore processors with local memories such as the Cell Broadband Engine. There have been a number of reports on more recent software cache optimization from compiler angle as described in [23,24,25].

Examples of recent work on software cache protocol implementation on Cell processors can be found in [26,27,28]. The cache protocol used in [26] uses a centralized directory to keep tract cache line state information in the implementation - reminds us the *Model*GF cache protocol in this paper. The cache protocols reported in [27,28] do not appear to use a centralized directory - hence appear to be more close to the *Model*LF cache protocol. However, we do not have access to the detailed information on the implementations of these models, and cannot make a more definitive comparisons at the time when this paper is written.

OPELL [7] is an open source toolchain / runtime effort to implement OpenMP for the Cell Broadband Engine. Our cache protocol framework reported here has been developed much earlier in 2006-2007 frame and embedded in OPELL (see [7])- but the protocols themselves are not published externally.

# 6 Conclusion and Future Work

In this paper, we investigate the problem of software cache implementations for the OpenMP memory model on multicore and manycore processors. We propose an instantiation of the OpenMP memory model — *Model*LF which prohibits undefined values and avoids the ambiguity of the original memory model definition on OpenMP Specification 3.0. *Model*LF is scalable with respect to the number of threads because it does not rely on communications among threads or a centralized directory that maintains consistency of multiple copies of each shared variable.

We propose the corresponding cache protocol and implement the cache protocol by software cache on the Cell processor. The experimental results show that *Model*LF cache protocol has nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage when comparing it to *Model*GF cache protocol derived from a stronger memory model that maintains a global total ordering among flush operations.

This provides a useful way that how to formalize (architecture unspecified) OpenMP memory model in different ways and evaluate the instantiations to produce different performance profiles. Our conclusion is that OpenMP's relaxed memory model with temporary views is a good match for software cache implementations, and that the refinements in *ModelLF* can lead to good opportunities for scalable implementations of OpenMP on future multicore and manycore processors.

In the future, we will investigate the possibility of implementing our instantiation on different architectures and study its scalability in the case that the architecture contains big number of cores (e.g. over 100).

# Acknowledgment

This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244), and other government sponsors. We thank all the members of CAPSL group at University of Delaware. We thank Ziang Hu for his suggestions on the experimental design. We thank Bronis R. de Supinski and Greg Bronevetsky for answering questions regarding the OpenMP memory model. We thank the efforts of our reviewers for their helpful suggestions that have led to several important improvements of our work.

# References

- Lamport, L.: How to make a multiprocessor that correctly executes multiprocess programs. IEEE Trans. on Computers C-28(9), 690–691 (1979)
- IBM Microelectronics: Cell Broadband Engine, http://www-01.ibm.com/chips/techlib/techlib.nsf/products/ Cell\_Broadband\_Engine
- 3. Tilera Corporation: Tilera, http://www.tilera.com/
- Cuvillo, J., Zhu, W., Hu, Z., Gao, G.R.: Fast: A functionally accurate simulation toolset for the Cyclops-64 cellular architecture. In: Proceedings of the Workshop on Modeling, Benchmarking and Simulation, Held in conjunction with the 32nd Annual International Symposium on Computer Architecture, Madison, Wisconsin, pp. 11–20 (2005)
- 5. OpenMP Architecture Review Board: OpenMP Application Program Interface Version 3.0 (May 2008), http://www.openmp.org/mp-documents/spec30.pdf
- Bronevetsky, G., de Supinski, B.R.: Complete formal specification of the OpenMP memory model. Int. J. Parallel Program. 35(4), 335–392 (2007)
- Manzano, J., Hu, Z., Jiang, Y., Gan, G.: Towards an automatic code layout framework. In: Chapman, B., Zheng, W., Gao, G.R., Sato, M., Ayguadé, E., Wang, D. (eds.) IWOMP 2007. LNCS, vol. 4935, pp. 157–160. Springer, Heidelberg (2008)
- Sony Computer Entertainment: PlayStation3, http://www.us.playstation.com/ps3/features
- 9. NASA Ames Research Center: NAS Parallel Benchmark, http://www.nas.nasa.gov/Resources/Software/npb.html
- Chen, C., Manzano, J.B., Gan, G., Gao, G.R., Sarkar, V.: A study of a software cache implementation of the openmp memory model for multicore and manycore architectures. Technical Memo CAPSL/TM-93 (February 2010)
- Adve, S., Hill, M.D.: A unified formalization of four shared-memory models. IEEE Transactions on Parallel and Distributed Systems 4, 613–624 (1993)
- Shen, X., Arvind, Rudolph, L.: Commit-Reconcile & Fences (CRF): a new memory model for architects and compiler writers. In: ISCA 1999: Proceedings of the 26th annual international symposium on Computer architecture, pp. 150–161. IEEE Computer Society, Washington (1999)
- Saraswat, V.A., Jagadeesan, R., Michael, M., von Praun, C.: A theory of memory models. In: PPoPP 2007: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 161–172. ACM, New York (2007)

- 352 C. Chen et al.
- Arvind, A., Maessen, J.W.: Memory model = instruction reordering + store atomicity. SIGARCH Comput. Archit. News 34(2), 29–40 (2006)
- Manson, J., Pugh, W., Adve, S.V.: The Java memory model. In: POPL 2005: Proceedings of the 32nd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 378–391. ACM, New York (2005)
- Boehm, H.J., Adve, S.V.: Foundations of the C++ concurrency memory model. In: PLDI 2008: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, pp. 68–78. ACM, New York (2008)
- Dubois, M., Scheurich, C., Briggs, F.: Memory access buffering in multiprocessors. In: ISCA 1998: 25 years of the international symposia on Computer architecture (selected papers), pp. 320–328. ACM, New York (1998)
- 18. OpenMP Architecture Review Board: OpenMP Application Program Interface (2005), http://www.openmp.org/mp-documents/spec25.pdf
- Gottlieb, A., Grishman, R., Kruskal, C.P., McAuliffe, K.P., Rudolph, L., Snir, M.: The NYU ultracomputer—designing a MIMD, shared-memory parallel machine. In: ISCA 1998: 25 years of the international symposia on Computer architecture (selected papers), pp. 239–254. ACM, New York (1998)
- Gajski, D., Kuck, D., Lawrie, D., Sameh, A.: CEDAR—a large scale multiprocessor, pp. 69–74. IEEE Computer Society Press, Los Alamitos (1986)
- Pfister, G., Brantley, W., George, D., Harvey, S., Kleinfelder, W., McAuliffe, K., Melton, E., Norton, V., Weiss, J.: The research parallel processor prototype (RP3): Introduction and architecture. In: ICPP 1985: Proceedings of the 1985 International Conference on Parallel Processing, pp. 764–771 (1985)
- Cytron, R., Karlovsky, S., McAuliffe, K.P.: Automatic management of programmable caches. In: ICPP 1988: Proceedings of the 1988 International Conference on Parallel Processing, pp. 229–238 (August 1988)
- Eichenberger, A.E., O'Brien, K., O'Brien, K., Wu, P., Chen, T., Oden, P.H., Prener, D.A., Shepherd, J.C., So, B., Sura, Z., Wang, A., Zhang, T., Zhao, P., Gschwind, M.: Optimizing compiler for the CELL processor. In: PACT 2005: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pp. 161–172. IEEE Computer Society, Los Alamitos (2005)
- Eichenberger, A.E., O'Brien, J.K., O'Brien, K.M., Wu, P., Chen, T., Oden, P.H., Prener, D.A., Shepherd, J.C., So, B., Sura, Z., Wang, A., Zhang, T., Zhao, P., Gschwind, M.K., Archambault, R., Gao, Y., Koo, R.: Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture. IBM Syst. J. 45(1), 59–84 (2006)
- Chen, T., Zhang, T., Sura, Z., Tallada, M.G.: Prefetching irregular references for software cache on CELL. In: CGO 2008: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, pp. 155–164. ACM, New York (2008)
- Lee, J., Seo, S., Kim, C., Kim, J., Chun, P., Sura, Z., Kim, J., Han, S.: COMIC: a coherent shared memory interface for Cell BE. In: PACT 2008: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pp. 303–314. ACM, New York (2008)
- Chen, T., Lin, H., Zhang, T.: Orchestrating data transfer for the Cell/B.E. processor. In: ICS 2008: Proceedings of the 22nd annual international conference on Supercomputing, pp. 289– 298. ACM, New York (2008)
- Gonzàlez, M., Vujic, N., Martorell, X., Ayguadé, E., Eichenberger, A.E., Chen, T., Sura, Z., Zhang, T., O'Brien, K., O'Brien, K.: Hybrid access-specific software cache techniques for the Cell BE architecture. In: PACT 2008: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pp. 292–302. ACM, New York (2008)