![Page 1: HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs Javier Lira ψ Carlos Molina ф Antonio González ψ,λ λ Intel Barcelona Research Center Intel Labs](https://reader035.vdocuments.mx/reader035/viewer/2022062714/56649d3e5503460f94a166e3/html5/thumbnails/1.jpg)
HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
Javier Lira ψ
Carlos Molina ф
Antonio González ψ,λ
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
ф Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
ψ Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
IPDPS 2011, Anchorage, AK (USA) – May 17, 2011
Introduction
[Figure: eight-core CMP (Core 0 to Core 7) with a shared NUCA cache]
S-NUCA (Static NUCA)
One possible location in the NUCA
Simple
Trivial search of data
Does not leverage locality
D-NUCA (Dynamic NUCA)
Multiple candidate banks
Migration increases complexity
Not easy to find data
Optimize cache access latency
Motivation
Significant performance potential
Limited by the access scheme
Access schemes in D-NUCA
Directory is not an alternative
Needs to update block location on every migration
Reduces D-NUCA's potential
Potential bottleneck
Algorithmic-based schemes

| Scheme | Performance | Energy |
| --- | --- | --- |
| Serial | Low | Low |
| Parallel | High | High |

Partitioned multicast (hybrid access scheme)
1st step: Local bank + central banks (9 banks)
2nd step: The other cores' local banks
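The two-step partitioned multicast can be sketched as a message count. This is a minimal illustration, not the authors' simulator: it assumes 16 candidate banks per block (the requester's local bank, 8 central banks, and the 7 local banks of the other cores), and the bank labels and the function name `partitioned_multicast` are hypothetical.

```python
def partitioned_multicast(requester, holder, num_cores=8, num_central=8):
    """Return (step, messages) for a two-step partitioned multicast.

    `holder` is the candidate bank holding the block, or None on a miss.
    Banks are labeled ("local", core_id) or ("central", index); this
    labeling is an assumption made for illustration only.
    """
    first = [("local", requester)] + [("central", i) for i in range(num_central)]
    second = [("local", c) for c in range(num_cores) if c != requester]

    messages = len(first)        # step 1: nine banks probed in parallel
    if holder in first:
        return 1, messages
    messages += len(second)      # step 2: the other cores' local banks
    if holder in second:
        return 2, messages
    return None, messages        # NUCA miss: every candidate was probed


print(partitioned_multicast(0, ("central", 3)))  # hit in the first step
print(partitioned_multicast(0, ("local", 5)))    # hit in the second step
print(partitioned_multicast(0, None))            # miss
```

Under these assumptions a first-step hit already costs nine messages and a miss costs sixteen, which is the traffic an access scheme with better location knowledge would try to avoid.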
Serial vs Parallel
Reducing the number of messages required per access is crucial
Objectives
Optimize NUCA features
Provide fast access when the data is near the requesting core
Reduce network contention
Crucial for both performance and energy
Outline
Introduction and motivation
Methodology
HK-NUCA
Results
Conclusions
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
Two scenarios:
Multi-programmed: mix of SPEC CPU2006
Parallel applications: PARSEC

| Parameter | Value |
| --- | --- |
| Number of cores | 8 (UltraSPARC IIIi) |
| Frequency | 1.5 GHz |
| Main memory size | 4 GBytes |
| Memory bandwidth | 512 Bytes/cycle |
| Private L1 caches | 8 x 32 KBytes, 2-way |
| Shared L2 NUCA cache | 8 MBytes, 128 banks |
| NUCA bank | 64 KBytes, 8-way |
| L1 cache latency | 3 cycles |
| NUCA bank latency | 4 cycles |
| Router delay | 1 cycle |
| On-chip wire delay | 1 cycle |
| Main memory latency | 250 cycles (from core) |
Baseline architecture
D-NUCA cache: 8 MBytes, 128 banks
Bank: 64 KBytes, 8-way
Migration scheme: Gradual Promotion
Replacement: LRU
Access: Partitioned Multicast
[Figure: baseline D-NUCA layout with Core 0 to Core 7]
HK-NUCA
Home Knows where to find data in the NUCA cache
Home bank knows which other banks have at least one data block that it manages
There is an HK-PTR per cache set in every bank.
Example HK-PTR (16 bits): 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0
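The HK-PTR can be read as a bit vector. The sketch below decodes the example pointer shown above, assuming (this is not stated on the slide) that bit i corresponds to the i-th candidate bank of the set; `banks_to_probe` is a hypothetical helper name.

```python
def banks_to_probe(hk_ptr_bits):
    """Return the indices of the candidate banks flagged in an HK-PTR."""
    return [i for i, bit in enumerate(hk_ptr_bits) if bit == 1]


# Example pointer from the slide, leftmost bit taken as bank 0 (assumption).
example = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print(banks_to_probe(example))  # [2, 4, 5, 12, 14]
```

With this pointer, only 5 of the 16 candidate banks would need to be queried.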
HK-NUCA
(1) Fast access
(2) Call Home
(3) Parallel access
[Figure: the three search stages on the eight-core CMP (Core 0 highlighted), with the HK-PTR 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0]
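The three stages can be sketched as a lookup that counts messages. This is a simplified model under stated assumptions, not the authors' implementation: stage 1 is assumed to probe only the bank closest to the requesting core, stage 2 the home bank, and stage 3 only the banks flagged in the home bank's HK-PTR. The function name `hk_nuca_search` and the bank indexing are hypothetical.

```python
def hk_nuca_search(local_bank, home_bank, hk_ptr, holder):
    """Return (stage, messages) for a three-stage lookup.

    local_bank, home_bank, holder: candidate-bank indices
    (holder is None when the block is not in the NUCA).
    hk_ptr: list of 0/1 flags kept by the home bank, one per candidate bank.
    """
    messages = 1                          # (1) fast access: closest bank
    if holder == local_bank:
        return 1, messages

    if home_bank != local_bank:
        messages += 1                     # (2) call home
        if holder == home_bank:
            return 2, messages

    # (3) parallel access: probe only the banks flagged in the HK-PTR,
    # skipping the two banks that were already checked.
    targets = [b for b, flag in enumerate(hk_ptr)
               if flag and b not in (local_bank, home_bank)]
    messages += len(targets)
    if holder in targets:
        return 3, messages
    return None, messages                 # miss in the NUCA


ptr = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print(hk_nuca_search(local_bank=0, home_bank=7, hk_ptr=ptr, holder=12))
print(hk_nuca_search(local_bank=0, home_bank=7, hk_ptr=[0] * 16, holder=None))
```

In this model a miss with an empty HK-PTR is resolved after two messages, since the home bank already knows that no other bank needs to be probed.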
Managing Home knowledge
Actions that provoke an update of HK-PTR:
New data enters the cache
Eviction from the NUCA cache
Migration movements
Migrations are synchronized with HK-PTR updates
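The three update triggers listed above can be sketched for a single cache set. This is an illustrative model only: the `resident` sets are simulation bookkeeping used to decide when a bank no longer holds any block of the set, and are not claimed to be part of the hardware; the class and method names are hypothetical.

```python
class HomeSet:
    """Home knowledge for one cache set, as tracked by its home bank."""

    def __init__(self, num_banks=16):
        self.hk_ptr = [0] * num_banks
        # Simulation-only bookkeeping: blocks of this set held by each bank.
        self.resident = [set() for _ in range(num_banks)]

    def _refresh(self, bank):
        # A bit stays set while the bank holds at least one block of the set.
        self.hk_ptr[bank] = 1 if self.resident[bank] else 0

    def insert(self, block, bank):
        """New data enters the cache."""
        self.resident[bank].add(block)
        self._refresh(bank)

    def evict(self, block, bank):
        """A block is evicted from the NUCA cache."""
        self.resident[bank].discard(block)
        self._refresh(bank)

    def migrate(self, block, src, dst):
        """A block moves between banks; both bits are updated together."""
        self.evict(block, src)
        self.insert(block, dst)


hs = HomeSet()
hs.insert("A", 3)
hs.insert("B", 3)
hs.evict("A", 3)        # bank 3 still holds B, so its bit stays set
hs.migrate("B", 3, 5)   # bank 3 is now empty; bank 5 becomes flagged
print(hs.hk_ptr)
```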
Overheads
Hardware: implementation of the HK-PTRs
Network: home knowledge updates
NUCA cache: 8 MBytes
HK-PTRs: 32 KBytes
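The 32 KBytes figure can be cross-checked with a short calculation. The sketch below assumes a 64-byte cache block, which the slides do not state, and a 16-bit HK-PTR as in the example pointer; under those assumptions the storage overhead matches the reported value.

```python
NUM_BANKS = 128
BANK_BYTES = 64 * 1024
WAYS = 8
BLOCK_BYTES = 64    # assumption: block size is not given in the slides
HK_PTR_BITS = 16    # width of the example HK-PTR

sets_per_bank = BANK_BYTES // (WAYS * BLOCK_BYTES)
hk_ptr_bytes = NUM_BANKS * sets_per_bank * HK_PTR_BITS // 8
cache_bytes = NUM_BANKS * BANK_BYTES

print(sets_per_bank)                     # 128 sets per bank
print(hk_ptr_bytes // 1024)              # 32 (KBytes of HK-PTR storage)
print(100 * hk_ptr_bytes / cache_bytes)  # 0.390625 (% of the 8 MBytes cache)
```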
Performance results
Overall performance improvement of 4-6%
Workloads with high miss rate
Low miss rate, but high hit rate in the first two HK-NUCA stages
Low miss rate, high hit rate in the parallel access stage of HK-NUCA
HK-NUCA accuracy
[Figure: messages sent during the parallel access stage of HK-NUCA; x-axis: 0 to 14 messages, y-axis: 0% to 20%]
85% of memory requests send fewer than 6 messages to the NUCA
On-chip network traffic

| Access scheme | Avg. messages sent per request |
| --- | --- |
| Partitioned multicast | 10.03 |
| HK-NUCA (3 steps) | 3.82 |
| HK-NUCA (2 steps) | 4.06 |
| Perfect search | 1 |
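The relative traffic reduction follows directly from these averages. The short calculation below uses only the four reported values; the variable names are illustrative.

```python
avg_messages = {
    "Partitioned multicast": 10.03,
    "HK-NUCA (3 steps)": 3.82,
    "HK-NUCA (2 steps)": 4.06,
    "Perfect search": 1.0,
}

baseline = avg_messages["Partitioned multicast"]
# Percentage of messages saved relative to partitioned multicast.
reduction = {name: round(100 * (1 - m / baseline), 1)
             for name, m in avg_messages.items()}

for name, pct in reduction.items():
    print(f"{name}: {pct}% fewer messages")
```

With these numbers, the 3-step variant sends about 61.9% fewer messages than partitioned multicast and the 2-step variant about 59.5% fewer, while a perfect search would save about 90%.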
Energy consumption results
HK-NUCA reduces dynamic energy consumption by more than 50%
Conclusions
D-NUCA makes it possible to exploit the non-uniformity of NUCA caches
D-NUCA benefits are restricted by the access scheme used
HK-NUCA is an access scheme for D-NUCA organizations
Allows fast accesses to data that is near the requesting core
Home knowledge reduces miss resolution time and network contention
Outperforms the best-performing access scheme by 6%
Reduces dynamic energy consumption by 50%
HK-NUCA: Boosting data searches in Dynamic NUCA for CMPs
Questions?
Migration is not the problem
[Figure: S-NUCA vs. D-NUCA]
Access scheme is the main limitation in D-NUCA