1 Mini-Project Presentation: Prefetching TDT4260 Computer Architecture Stefano Nichele, Angelo Spalluto Department of Computer and Information Science 2011, April 15th Stefano Nichele – Angelo Spalluto, 2011


Page 1

Mini-Project Presentation: Prefetching TDT4260 Computer Architecture

Stefano Nichele, Angelo Spalluto
Department of Computer and Information Science
2011, April 15th

Stefano Nichele – Angelo Spalluto, 2011

Page 2

Agenda
• Moore’s law – Memory wall
• Related work
• Fixed Sequential Prefetching
• Sequential Aggressive Prefetching (M-Adaptive, DM-Adaptive)
• DCPT, DCPT-P
• WA-DCPT and SA-DCPT
• Results
• Conclusion
• References

Page 3

Moore vs. Mem. Wall

• Spatial Locality
• Temporal Locality

Page 4

Prefetching

Predicting Fetching

1 – Which data will be needed by the next instructions?

2 – Deliver it into the cache before it is referenced!

• Sequential
• RPT
• PC/DC
• DCPT
• Adaptive

Page 5

Fixed Sequential Prefetching

[Figure: speedup per benchmark with a fixed-size window]

Sequential Algorithm
• The prefetcher issues N requests after a miss occurs.
• The window size is constant for the whole execution of the program.

Sequential benchmarks
• Wupwise
• Applu
• Galgel

Non-sequential benchmarks
• Ammp
• Art110
• Art470
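
As a rough illustration of the fixed-window scheme on this slide, the sketch below computes the N sequential block addresses that would be requested after a miss. The 64-byte `BLOCK_SIZE` and the function name are assumptions made for the example; they are not taken from the slides.

```python
BLOCK_SIZE = 64  # assumed cache block size in bytes (not stated in the slides)


def fixed_sequential_prefetch(miss_addr, n):
    """Return the addresses of the n blocks that follow the block that missed."""
    # Align the missing address down to the start of its cache block.
    base = miss_addr - (miss_addr % BLOCK_SIZE)
    # The window size n never changes during the run in the fixed scheme.
    return [base + i * BLOCK_SIZE for i in range(1, n + 1)]


if __name__ == "__main__":
    # A miss inside block 0x1000 with N = 3 requests the next three blocks.
    print([hex(a) for a in fixed_sequential_prefetch(0x1004, 3)])
```

Because N is fixed, the same number of requests is issued for sequential and non-sequential access patterns alike, which is the limitation the adaptive variants address.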

Page 6

Sequential Aggressive Adaptive Prefetcher

Sequential Aggressive Adaptive

The adaptive prefetcher dynamically adjusts the degree of prefetching (N).

Adaptive window parameters

• Window: the number N of contiguous blocks issued by the prefetcher
• Accuracy: the number of good prefetches within a window
• Threshold: the number of good prefetches required to increase the window (Accuracy >= Threshold)
• Lock window: the number of times (L) the window is kept at its current size
• Listening state: the state in which the prefetcher counts the number of good prefetches

Prefetcher algorithm

1. The prefetcher initialises Window, Threshold and Lock Window.

2. Upon a request issued by the CPU, the prefetcher issues N prefetches.

3. It waits for N steps (listening state).

4. At step N it checks whether Accuracy >= Threshold.

5. If the condition is satisfied, it uses the same window for another L-1 times. Otherwise it decreases the window and issues N requests (back to step 3).

6. If step 4 succeeds L times, the prefetcher increases the window and issues another N requests.

7. Back to step 3.
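
The window update in steps 4 to 6 can be sketched as a small state machine. This is only an illustration under stated assumptions: the class name, the method name, the default values, the window bounds and the step size of one block per adjustment are all hypothetical, since the slides do not give them.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveWindow:
    window: int = 4       # N: blocks prefetched per window (assumed start value)
    threshold: int = 2    # good prefetches needed for a window to pass
    lock: int = 2         # L: consecutive passing windows needed before growing
    min_window: int = 1   # assumed lower bound
    max_window: int = 16  # assumed upper bound
    passes: int = 0       # consecutive windows that met the threshold so far

    def end_of_window(self, accuracy: int) -> int:
        """Update the window after one listening state and return the new N."""
        if accuracy >= self.threshold:
            self.passes += 1
            if self.passes == self.lock:
                # The check succeeded L times in a row: grow the window.
                self.window = min(self.window + 1, self.max_window)
                self.passes = 0
        else:
            # The check failed: shrink the window and start counting again.
            self.window = max(self.window - 1, self.min_window)
            self.passes = 0
        return self.window


if __name__ == "__main__":
    w = AdaptiveWindow()
    for good in (3, 2, 0):
        print(w.end_of_window(good))
```

With the assumed defaults, two passing windows in a row grow N from 4 to 5, and a single failing window shrinks it back to 4.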

Page 7

Example Seq. Aggressive Adaptive Prefetcher

Page 8

Different listening states

Sequential Aggressive
• Prefetching occurs immediately after the last element in the window has been checked (whether it is a miss or a hit)
• Each window is composed of P elements = #hits + #misses

Miss-Adaptive (M-Adaptive)
• M-Adaptive issues a prefetch (restarting a new window) only when the first miss occurs after the whole window has been checked (hits do not trigger prefetching)
• Each window is composed of P elements = #hits + #misses

Discard Miss-Adaptive (DM-Adaptive)
• DM-Adaptive issues a prefetch immediately after the first miss occurs inside the window
• Each window is composed of P elements = #hits

Page 9

DCPT and DCPT-P

• No last prefetched
• Test if in cache before prefetching
• Maybe in the queue
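
For readers unfamiliar with DCPT, the sketch below shows the delta-correlation step as it is described in the DCPT papers listed in the references: the last two deltas are searched for earlier in the delta history of a load, and the deltas that followed the match are replayed. It is not the presenters' code. The function names are hypothetical, and the filter is one possible reading of the bullets on this slide (check the cache, and possibly the request queue, instead of keeping a last-prefetch field).

```python
def dcpt_candidates(last_addr, deltas):
    """Replay the deltas that followed the most recent earlier occurrence
    of the last two deltas, starting from last_addr."""
    d = list(deltas)
    if len(d) < 4:
        return []  # too little history for a non-overlapping match
    pair = (d[-2], d[-1])
    # Search backwards, skipping positions that overlap the final pair.
    for i in range(len(d) - 3, 0, -1):
        if (d[i - 1], d[i]) == pair:
            addr, out = last_addr, []
            for delta in d[i + 1:]:
                addr += delta
                out.append(addr)
            return out
    return []


def filter_candidates(candidates, in_cache, in_flight):
    """Drop candidates that are already cached or already requested."""
    return [a for a in candidates if a not in in_cache and a not in in_flight]


if __name__ == "__main__":
    # Block addresses 10, 11, 20, 21, 30 give the deltas 1, 9, 1, 9.
    cands = dcpt_candidates(30, [1, 9, 1, 9])
    print(cands, filter_candidates(cands, in_cache={31}, in_flight=set()))
```

In the example the pair (1, 9) is found at the start of the history, so the deltas 1 and 9 are replayed from address 30.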

Page 10

Aggressive Adaptive - DCPT

[Diagram labels: Aggressive Adaptive, DCPT, Aggressive Adaptive-DCPT]

"Stefano, Aggressive Adaptive works pretty well with sequential benchmarks. What about DCPT?"

"Great! DCPT works very well with non-sequential benchmarks."

"Ja ja, we may achieve better results!"

"Let’s try to combine them together!"

SA-DCPT, WA-DCPT

Page 11

WA-DCPT and SA-DCPT

WA-DCPT
• WA-DCPT adds the concept of a window to DCPT
• When DCPT issues a prefetch for a specific PC, it also delivers all subsequent blocks according to its window size
• WA-DCPT is more memory demanding than DCPT: it uses a larger data structure

SA-DCPT
• At runtime it selects the better algorithm between DCPT and Aggressive Sequential
• The switch threshold is the major concern
• The best switch threshold is 4
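
The WA-DCPT idea of delivering the subsequent blocks along with each DCPT prefetch can be sketched as a simple expansion step. The function name, the 64-byte `BLOCK_SIZE` and the duplicate suppression are assumptions made for this illustration; the slides do not describe the actual data structure.

```python
BLOCK_SIZE = 64  # assumed cache block size in bytes (not stated in the slides)


def expand_with_window(candidates, window):
    """For each DCPT candidate, also request the next `window` contiguous blocks."""
    out, seen = [], set()
    for addr in candidates:
        base = addr - (addr % BLOCK_SIZE)  # align to the start of the block
        for i in range(window + 1):        # the candidate itself plus `window` more
            a = base + i * BLOCK_SIZE
            if a not in seen:              # avoid issuing the same block twice
                seen.add(a)
                out.append(a)
    return out


if __name__ == "__main__":
    print([hex(a) for a in expand_with_window([0x1000, 0x1040], 2)])
```

The expansion multiplies the number of requests per DCPT hit, which fits the remark that WA-DCPT is more demanding than plain DCPT.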

Page 12

Adaptive results

Aggressive Adaptive

• In some benchmarks (galgel, applu, wupwise) the window also reaches sizes between 13 and 15

• Using a window greater than 12 does not improve performance

• Low sequencing for ammp, art110 and art470

M-Adaptive and DM-Adaptive

• The results of M-Adaptive and DM-Adaptive are not better than those of Aggressive Adaptive

• As expected, they produce fewer “misses” and “prefetches issued”

Page 13

DCPT results

DCPT and DCPT-P

• As expected, DCPT-P is slightly better than DCPT
• For ammp, DCPT-P performs almost twice as well as the adaptive prefetcher
• A table with 16 deltas and 97 PCs is the best configuration (smaller than 8 KB)
• DCPT-P uses an 8-bit mask
• In our tests there is no improvement with a 12-bit mask

Page 14

Adaptive DCPT results

WA-DCPT

• WA-DCPT has a different data structure from DCPT (window data)

• Best results are achieved using 14 deltas

SA-DCPT

• SA-DCPT has the same data structure as DCPT
• Tuning of the switching threshold
• The best switching factor is 4
• SA-DCPT behaves like DCPT for switching factors greater than 4

Page 15

Developed and Literature Prefetchers

Developed Prefetchers

• DCPT obtains the best performance

• SA-DCPT is a good compromise when we do not know the type of benchmark

Literature vs. Developed

• Our DCPT-P implementation outperforms the reference DCPT-P

• Likely because they have different data structures

Page 16

Coverage Analysis

Coverage

• Benchmarks with low sequencing (ammp, art110 and art470) have higher coverage with DCPT-P

• Benchmarks with high sequencing (except applu) have better coverage with SA-DCPT

Coverage vs Speedup

Coverage is not directly proportional to speedup.

If the algorithm spends too much time finding the next element to prefetch, execution time may increase as a consequence.
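
The slides do not define coverage. A common definition, assumed here, is the fraction of the original misses that prefetching removes; accuracy is usually defined as the fraction of issued prefetches that turn out to be useful. The function names and the example figures below are purely illustrative.

```python
def coverage(misses_without_prefetch, misses_with_prefetch):
    """Fraction of the baseline misses eliminated by prefetching."""
    if misses_without_prefetch == 0:
        return 0.0
    removed = misses_without_prefetch - misses_with_prefetch
    return removed / misses_without_prefetch


def accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches that were referenced before eviction."""
    if issued_prefetches == 0:
        return 0.0
    return useful_prefetches / issued_prefetches


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(coverage(1000, 400), accuracy(300, 600))
```

Neither metric accounts for timeliness or for the cost of generating prefetches, which is one reason high coverage need not translate into a proportional speedup.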

Page 17

Conclusion

• Importance of the prefetcher: it can really improve performance

• Contribution: 3 new prefetcher variants:

– adaptive window (aggressive technique)

– DCPT-based with bit masking

– Combination: delta correlation with adaptive window

• Importance of parameter tuning

• DCPT-P has the best overall performance

• Difficult to combine two different (opposite) algorithms to exploit the best properties of each

Page 18

References

• G. E. Moore, Cramming more Components onto Integrated Circuits, Electronics, 38(8), April 19, 1965.

• W.A. Wulf and S.A. McKee, Hitting the Memory Wall: Implications of the Obvious, Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20–24

• A. R. Brodtkorb,  C. Dyken,  T. R. Hagen, J. M. Hjelmervik, O. O. Storaasli, State-of-the-art in heterogeneous computing, Sci. Program., Vol. 18 (January 2010), pp. 1-33.

• M. Jahre, Managing Shared Resources in Chip Multiprocessor Memory Systems, NTNU 2010 (ISBN 978-82-471-2287-7), 238 pp., Doktoravhandlinger ved NTNU (159)

• M. Grannaes, Reducing Memory Latency by Improving Resource Utilization, NTNU 2010 (ISBN 978-82-471-2177-8), 242 pp., Doktoravhandlinger ved NTNU (106)

• A. J. Smith, Cache memories, ACM Comput. Surv., vol. 14, no. 3, pp. 473–530, 1982

• F. Dahlgren, M. Dubois, and P. Stenstrom. Fixed and adaptive sequential prefetching in shared memory multiprocessors. In Parallel Processing, 1993. ICPP 1993. International Conference on, volume 1, pages 56-63, Aug. 1993.

• M. Grannaes, M. Jahre and L. Natvig. Multi-level Hardware Prefetching Using Low Complexity Delta Correlating Prediction Tables with Partial Matching. High Performance Embedded Architectures and Compilers LNCS, 2010, Volume 5952/2010, 247-261.

• M. Grannaes, M. Jahre and L. Natvig. Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables. In Data Prefetching Championships (2009)

Page 19

QUESTIONS?