Data Prefetching in the Era of Multicore Processors
Surendra Byna, Ph.D.

Introduction

Processor performance has been increasing much faster than memory performance over the last three decades. While processor performance followed Moore's law, improving by about 50% annually until 2004 and roughly 20% annually since, memory performance has improved by a mere 9% a year. This widening gap has made memory a performance bottleneck. The problem is only getting worse with the rapid growth of multicore processors, as multiple cores contend for data in a memory that they share. Traditionally, cache memories have been used to improve the performance of memory accesses by exploiting the principle of locality. Yet further optimizations are necessary to improve the use of cache memories and to reduce cache misses.

Data prefetching has long been considered an effective way to mask the data access latency caused by cache misses and to bridge the performance gap between processor and memory. It is a latency-hiding technique that decouples and overlaps data transfers and computation. To reduce CPU stalling on a cache miss, data prefetching predicts future data accesses, initiates a data fetch, and brings the data closer to the computing processor before it is requested.
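To make the idea concrete, the following minimal sketch (not from the original article) uses the GCC/Clang __builtin_prefetch intrinsic to request data a fixed number of iterations ahead, so the memory transfer overlaps with the computation on the current element; the array, the reduction, and the distance of 16 are illustrative assumptions.

```c
#include <stddef.h>

/* Sum an array while prefetching elements a fixed distance ahead,
 * so memory transfers overlap with the computation on current data.
 * PREFETCH_DIST is an illustrative value; it must be tuned per machine. */
#define PREFETCH_DIST 16

double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            /* Hints: read access (0), moderate temporal locality (1). */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
        }
        sum += a[i];   /* computation overlaps with the outstanding prefetch */
    }
    return sum;
}
```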

Now, what is needed to design a data prefetching strategy? Traditionally, prefetching has considered two issues: what to prefetch and when to prefetch. What to prefetch decides which data a processor is likely to need in the future. When to prefetch decides how early the data has to be fetched so that it arrives in time without causing negative side effects. Most existing prefetching methods focus on what to prefetch. Now that multicore processors have become mainstream, prefetching can be a powerful technique for attacking the data access latency problem. In this article, we discuss the factors that must be considered in designing a powerful prefetching strategy, provide a taxonomy that classifies the design concerns, and discuss various prefetching strategies along with the issues that arise in designing a prefetching strategy for multi-core processors.

Prefetching Scenarios

Figure 1 shows various scenarios for implementing prefetching strategies. In Scenario A, a prefetch engine (PE) observes the history of L1 cache misses and initiates prefetch operations. In multi-threaded and multi-core processors, pre-execution based approaches use a separate thread to predict future accesses (Scenario B). In a pre-execution based approach, a prefetching thread pre-executes the data references of a main computation thread and initiates prefetching of data into a shared cache memory (L2 cache) before the computation thread issues a demand request. In the memory-side prefetching strategy (Scenario C), the prefetching thread runs on an intelligent main memory, where a memory processor pre-executes helper threads and the predicted data is pushed toward the processor.

Figure 1. Prefetching Scenarios

The scenarios in Figure 1 expose the five most fundamental issues that a good prefetching strategy has to address: what data to prefetch, when to prefetch, what the prefetching source is, what the prefetching destination is, and who initiates a prefetch.

We will discuss each of these design decisions briefly.

The Five Questions

What to Prefetch?
Predicting what data to prefetch is the most important requirement of prefetching. If a prefetching strategy can predict which misses will occur, prefetch instructions can be issued ahead of time so that the data arrives before the miss happens. To mask the stall time caused by cache misses effectively, the accuracy of predicting what to prefetch must be high; inaccurate predictions fetch useless data and pollute the cache. Figure 2 classifies approaches to predicting what data to prefetch based on where they are implemented and on the prediction techniques they use.

Figure 2. Classification of prediction approaches for deciding what to prefetch
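As an illustration of one "what to prefetch" predictor (a sketch in the spirit of strided prefetching [4], not code from that paper), a small reference prediction table keyed by the load instruction's address remembers the last data address and stride, and proposes the next address once the stride repeats:

```c
#include <stdint.h>
#include <stdbool.h>

#define RPT_ENTRIES 256   /* illustrative table size */

/* One entry of a reference prediction table (RPT). */
struct rpt_entry {
    uint64_t pc;         /* load instruction address (tag) */
    uint64_t last_addr;  /* last data address seen for this load */
    int64_t  stride;     /* last observed stride */
    bool     confirmed;  /* same non-zero stride seen twice in a row */
};

static struct rpt_entry rpt[RPT_ENTRIES];

/* Update the table on each load; return true and set *prefetch_addr
 * when a repeating stride suggests the next address to prefetch. */
bool rpt_observe(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr)
{
    struct rpt_entry *e = &rpt[pc % RPT_ENTRIES];
    if (e->pc != pc) {                     /* new load: (re)initialize entry */
        e->pc = pc; e->last_addr = addr; e->stride = 0; e->confirmed = false;
        return false;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confirmed = (stride != 0 && stride == e->stride);
    e->stride = stride;
    e->last_addr = addr;
    if (e->confirmed) {
        *prefetch_addr = addr + stride;    /* predict the next access */
        return true;
    }
    return false;
}
```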

When to Prefetch?
The time at which a prefetch instruction is issued has a significant effect on the overall performance of prefetching. Prefetched data should arrive at its destination before the corresponding demand miss occurs. Timely prefetching depends on the total prefetching overhead (i.e., the overhead of predicting future accesses plus the overhead of transferring the data) and the time until the next cache miss. If the total prefetching overhead exceeds the time until the next cache miss, the prefetch arrives late; increasing the prefetching distance, that is, issuing prefetches further ahead of the demand access, can avoid such late prefetches.
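A common rule of thumb from the software prefetching literature (stated here as an illustration, not taken from the article) sets the prefetching distance d, in loop iterations, so that it covers the total prefetching overhead:

```latex
% l: total prefetching overhead (prediction time + data transfer time)
% s: time per loop iteration (time between successive demand accesses)
d = \left\lceil \frac{l}{s} \right\rceil
```

Issuing the prefetch for iteration i + d while computing iteration i gives the data enough time to arrive; too small a d yields late prefetches, while too large a d risks the data being evicted before it is used.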

What is the source of prefetching?
The memory hierarchy contains multiple levels, including cache memories, main memory, secondary storage, and tertiary storage, and data prefetching can be implemented at any of these levels. Data can be prefetched between cache memories and main memory, or between main memory and storage. To design a prefetching strategy, it is necessary to know where the latest copy of the data is. In existing deep memory hierarchies with a write-back policy, data can exist at more than one location in the hierarchy. In single-core processors, the prefetching source is usually the main memory or a lower-level cache memory. In multi-core processors, the memory hierarchy contains cache memories that are private to each core as well as cache memories that are shared by multiple cores. Handling multiple copies of data in local cache memories raises data coherence concerns, which makes the design challenging: a prefetching strategy has to ensure that it prefetches the latest copy of the data.

What is the destination of prefetching?
The destination of prefetched data is another major concern of a prefetching strategy. The destination should be closer to the CPU than the source in order to obtain any performance benefit. Data can be prefetched into a cache memory that is local to a processor core, into a cache memory that is shared by multiple cores, or into a separate prefetch cache. A separate prefetch cache can itself be either private to a processor core or shared by multiple cores.

Who initiates prefetch instructions?
Prefetch instructions can be issued either by the processor that requires the data or by a processor that provides prefetching as a service. The first method is called client-initiated or pull-based prefetching; the latter is called push-based prefetching. Pull-based prefetching can be initiated by the same thread that is running the application or by a different thread. For example, in Intel processors with Hyper-Threading support, one hardware thread can be used to prefetch data for the other, computing thread. Push-based prefetching can be divided into processor-side, memory-side, and server-based prefetching. In processor-side prefetching, a core of the processor does the prefetching work (predicting future accesses and initiating prefetch requests). In memory-side prefetching, an intelligent memory performs prefetching for the application running on the processor by pushing data toward it. In server-based prefetching, a separate server core is dedicated to pushing data toward the processor: it analyzes what data an application might need, finds out where that data is in the memory hierarchy, and pushes it to its destination, usually closer to the processor.
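The pull-based, helper-thread variant can be sketched as follows (an illustration assuming POSIX threads; the work structure, the run-ahead window, and the busy-wait synchronization are simplifications, not the design of any cited paper): the helper runs ahead of the compute thread on a sibling hardware thread and touches upcoming data so it is pulled into the shared cache.

```c
#include <pthread.h>
#include <stddef.h>

/* Shared work description: the compute thread processes 'items';
 * the helper thread runs ahead and touches the same data so it is
 * pulled into the cache hierarchy before the compute thread needs it. */
struct work {
    const double *items;
    size_t n;
    volatile size_t compute_pos;   /* progress of the compute thread */
};

#define RUN_AHEAD 64               /* illustrative run-ahead window */

static void *helper_prefetch(void *arg)
{
    struct work *w = arg;
    size_t i = 0;
    while (i < w->n) {
        /* Stay only a bounded distance ahead of the compute thread;
         * otherwise busy-wait until it catches up (simplified). */
        if (i < w->compute_pos + RUN_AHEAD) {
            __builtin_prefetch(&w->items[i], 0, 1);
            i++;
        }
    }
    return NULL;
}
```

The compute thread advances compute_pos as it processes items and starts the helper with pthread_create on a sibling hardware thread.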

Comparison of various prefetching strategies
In the table below, we compare selected prefetching strategies published in the research literature and categorize them according to the five questions. The What? column lists the type of implementation and the algorithm used to predict future data accesses; the implementation is either hardware controlled, software controlled, or a combination of both. (A small sketch of one of the history-based predictors appears after the table.)

| Publication | What? | When? | Source | Destination | Initiator |
|---|---|---|---|---|---|
| Dahlgren et al. [2][3]* | Hardware controlled/software controlled; next k blocks (sequential prefetching) | Event based | Lower-level cache/main memory | Private L1 cache/L2 cache | Processor side |
| Chen et al. [4]* | Hardware controlled/software controlled; constant regular strides (strided prefetching) | Event based/lookahead counter based | Lower-level cache/main memory | Private L1 cache/L2 cache | Processor side/memory side |
| Joseph et al. [5]* | Hardware controlled/software controlled; repeating data accesses (Markov chain based prediction) | Event based | Lower-level cache/main memory | Private L1 cache/L2 cache | Processor side/memory side |
| Kandiraju et al. [6] | Hardware controlled/software controlled; repeating strides (Markov chain based distance prediction) | Event based | Lower-level cache/main memory | Private TLB, private L1 cache/L2 cache | Processor side/memory side |
| Byna et al. [7]** | Hardware controlled/software controlled; complex and nested regular patterns (Multi-Level Difference Table based prediction) | Prediction based | Lower-level cache/main memory | Private prefetch cache | Memory side |
| Solihin et al. [12] | Hybrid controlled; history based prediction for pair-wise correlation | Event based | Main memory | L2 cache | Memory side |
| Luk et al. [14]*** | Software controlled; helper thread predicts future accesses | Software controlled synchronization | Main memory | L1/L2 cache | Memory side |
| Zilles et al. [13] | Hybrid controlled; helper thread predicts future accesses | Software controlled synchronization | Main memory | L1/L2 cache | Memory side |
| Annavaram et al. [1]# | Hardware controlled; pre-computation based (data graph pre-computation) | Event based | Main memory | L1/L2 cache | Processor side |
| Hassanein et al. [9]## | Hybrid controlled; helper thread predicts future accesses | Software controlled synchronization | Main memory | Private L1 cache and CPU registers | Memory side |
| Ganusov et al. [8]### | Hardware controlled; run-ahead execution based prediction | Event based | Main memory | Shared L2 cache | Processor side |
| Kim et al. [10] | Hardware controlled; offline analysis of history of accesses | Event based (on cache miss) | L2 cache/main memory | L1 cache/L2 cache | Processor side |
* Originally proposed for hardware controlled prefetching, but can be used in software controlled prefetching as well.
** Server-based prefetching scheme, where a server is used to predict complex data access patterns.
*** Hardware support can improve synchronization of the computation thread and the pre-execution thread.
# A dependence graph generator scans the pre-decoded instructions, and load/store instructions are marked for prefetching.
## Called data forwarding; a pre-computation slice generator component selects the instructions for pre-execution.
### Called future execution; assumes that a sequence of instructions will execute again in the future, hence the second core executes n iterations ahead of the program running on the first core.
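To illustrate one of the history-based predictors in the table, here is a minimal sketch of Markov-style miss-address correlation in the spirit of [5] (a simplification: real designs keep several weighted successors per address, while this keeps only the most recent one):

```c
#include <stdint.h>
#include <stdbool.h>

#define MARKOV_ENTRIES 1024        /* illustrative table size */

/* Simplest form of a Markov miss-address predictor:
 * table_[hash(prev_miss)] remembers the address that missed next
 * the last time prev_miss was seen. */
struct markov_entry {
    uint64_t miss_addr;            /* tag: the miss address */
    uint64_t next_addr;            /* successor observed after it */
    bool     valid;
};

static struct markov_entry table_[MARKOV_ENTRIES];
static uint64_t prev_miss;
static bool have_prev;

/* Called on every cache miss; returns true and sets *prefetch_addr
 * when a successor for the current miss address is known. */
bool markov_on_miss(uint64_t addr, uint64_t *prefetch_addr)
{
    if (have_prev) {               /* learn: prev_miss was followed by addr */
        struct markov_entry *e = &table_[prev_miss % MARKOV_ENTRIES];
        e->miss_addr = prev_miss;
        e->next_addr = addr;
        e->valid = true;
    }
    prev_miss = addr;
    have_prev = true;

    struct markov_entry *e = &table_[addr % MARKOV_ENTRIES];
    if (e->valid && e->miss_addr == addr) {
        *prefetch_addr = e->next_addr;   /* predict the follower of addr */
        return true;
    }
    return false;
}
```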

Prefetching for Multicore Processors
Designing prefetching strategies for multi-core processors poses new challenges. Multiple computing cores compete for shared memory bandwidth to fetch both demand data and prefetched data. With single-core processors, main memory accepts prefetch requests from just one core; in multi-core processors, prefetch requests from multiple cores add further pressure on main memory on top of the regular demand fetches. For example, memory-processor based solutions [9][12] do not scale when a single memory processor must monitor data access history or pre-execute threads and predict future references for multiple cores. This problem can be addressed by decoupling data access support from the computing cores. In server-based push prefetching [15], we proposed using a dedicated server core to provide data access support by predicting and prefetching data for the computing cores.
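The server-based push idea can be sketched roughly as follows (an illustration only: the address queue, the core number, and the use of ordinary prefetch instructions to "push" data into the cache level shared with the compute cores are simplifying assumptions):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stddef.h>

/* Single-producer queue of predicted addresses (simplified, no locking). */
#define QSIZE 4096
struct addr_queue {
    volatile uint64_t buf[QSIZE];
    volatile size_t head, tail;
};

/* Dedicated "prefetch server" loop: pin the thread to its own core,
 * then keep consuming predicted addresses and touching the data so it
 * lands in the cache level shared with the compute cores. */
static void *prefetch_server(void *arg)
{
    struct addr_queue *q = arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                       /* illustrative: core 3 is the server */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;) {
        while (q->head == q->tail)          /* wait for predictions (busy-wait) */
            ;
        uint64_t addr = q->buf[q->head % QSIZE];
        q->head++;
        __builtin_prefetch((const void *)(uintptr_t)addr, 0, 3);
    }
    return NULL;
}
```

A predictor running on the server core (or on the compute cores themselves) would enqueue addresses into buf at tail; the compute cores then find the data already resident in the shared cache.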

As mentioned earlier when identifying the source of prefetching, maintaining cache coherence is a challenge for multi-core prefetching. Multiple cores share the main memory, and at some level of the memory hierarchy conflicting accesses have to be resolved. Cache coherence in multi-core processors is handled either by a directory-based approach or by snooping cache accesses. There are two options for dealing with data that is shared: (1) implement coherence protocols for prefetched data, or (2) drop prefetch requests to shared data. Implementing coherence protocols increases complexity, because every prefetch request to shared data has to trigger an update to the latest copy. Alternatively, prefetch requests to shared data (for writing) can be dropped to reduce the complexity of coherence; in this case prefetching may not benefit applications with a large number of writes. Since many applications have far more reads than writes, dropping prefetch requests for shared data is usually a reasonable choice.
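The second option can be made concrete with a small filter (a sketch only; directory_lookup_remote is a hypothetical stand-in for the hardware coherence directory): before issuing a prefetch, check whether another core holds the line modified and, if so, drop the request instead of forcing a coherence transaction.

```c
#include <stdint.h>
#include <stdbool.h>

/* Coherence states in a MESI-style protocol. */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* Stub standing in for the hardware coherence directory (illustrative):
 * returns the state of the line containing 'addr' in any other core's
 * private cache. */
static enum line_state directory_lookup_remote(uint64_t addr)
{
    (void)addr;
    return INVALID;
}

/* Issue the prefetch only when it cannot create a coherence conflict:
 * if another core holds the line Modified, drop the request instead of
 * triggering a write-back/ownership transfer for a speculative fetch. */
bool maybe_prefetch(uint64_t addr)
{
    if (directory_lookup_remote(addr) == MODIFIED)
        return false;                      /* drop: data is dirty elsewhere */
    __builtin_prefetch((const void *)(uintptr_t)addr, 0, 3);
    return true;
}
```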
The use of aggressive prediction algorithms on single-core processors has long been discouraged because their complexity becomes counterproductive. With the large amount of computing power available in multicore processors, transferring that complexity to idle or dedicated cores, as in the server-based push prefetching architecture [15], becomes beneficial.

Conclusions
The performance gains of prefetching strategies depend on how the design questions above are answered. With the emergence of multi-core and multi-threaded processors, new challenges and issues need to be considered when prefetching data. In this article, we provide a taxonomy of the five primary issues (what, when, source, destination, and initiator) that must be addressed in designing prefetching strategies. To be effective, a prefetching strategy for multi-core processing environments has to be adaptive and choose among multiple methods of predicting future data accesses. When the data access patterns of an application are easy to find, the strategy can use history-based prediction algorithms; if data accesses are effectively random, a pre-execution based approach is more beneficial.

The full paper, "A Taxonomy of Data Prefetching Mechanisms" (co-authors: Xian-He Sun and Yong Chen), was published in the Proceedings of the International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN) 2008, Sydney, Australia, May 2008.

Send your comments to the author at suren_byna@multicoreinfo.com.

References
[1] M. Annavaram, J. M. Patel and E. S. Davidson, “Data Prefetching by Dependence Graph Pre-computation”, Proceedings of the 28th International Symposium on Computer Architecture, 2001.
[2] F. Dahlgren, M. Dubois and P. Stenström, “Fixed and Adaptive Sequential Prefetching in Shared-memory Multiprocessors,” Proceedings of International Conference on Parallel Processing, pp. I56-I63, 1993.
[3] F. Dahlgren, M. Dubois and P. Stenström, “Sequential Hardware Prefetching in Shared-Memory Multiprocessors”, IEEE Transactions on Parallel and Distributed Systems, 6(7), 1995.
[4] T.F. Chen and J.L. Baer, “Effective Hardware-Based Data Prefetching for High Performance Processors,” IEEE Transactions on Computers, pp. 609-623, 1995.
[5] D. Joseph and D. Grunwald, “Prefetching Using Markov Predictors”, Proceedings of the 24th International Symposium on Computer Architecture, pp 252-263, 1997.
[6] G. Kandiraju and A. Sivasubramaniam, “Going the Distance for TLB Prefetching: An Application-Driven Study”, Proceedings of the 29th International Symposium on Computer Architecture, 2002.
[7] S. Byna, X.H. Sun and Y. Chen, “Server-based Data Push for Multi-processor Environments”, CS TR-2006-031, Illinois Institute of Technology, 2006.
[8] I. Ganusov and M. Burtscher, “Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors”, Proceedings of the 14th Parallel Architectures and Compilation Techniques, 2005.
[9] W. Hassanein, J. Fortes and R. Eigenmann, “Data Forwarding through In-Memory Precomputation Threads”, Proceedings of the 18th International Conference on Supercomputing, 2004.
[10] J. Kim, K. V. Palem and W.F. Wong, “A Framework for Data Prefetching using Off-line Training of Markovian Predictors”, Proceedings of the 20th International Conference on Computer Design, 2002.
[11] K.S. McKinley, S. Carr and C.W. Tseng, “Improving Data Locality with Loop Transformations”, ACM Transactions on Programming Languages and Systems, 18(4), pp. 424-453, 1996.
[12] Y. Solihin, J. Lee and J. Torrellas, “Using a User-Level Memory Thread for Correlation Prefetching”, Proceedings of the 29th International Symposium on Computer Architecture, pp. 171-182, 2002.
[13] C. Zilles and G. Sohi, “Execution-based Prediction Using Speculative Slices”, Proceedings of the 28th International Symposium on Computer Architecture, 2001.
[14] C.K. Luk, “Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors”, Proceedings of the 28th International Symposium on Computer Architecture, 2001.
[15] X.H. Sun, S. Byna and Y. Chen, “Improving Data Access Performance with Server Push Architecture”, Proceedings of the NSF Next Generation Software Program Workshop (with IPDPS ‘07), 2007.
