SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong ([email protected])
Memory Hierarchy
Jinkyu Jeong ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
The CPU-Memory Gap
• The gap widens between DRAM, disk, and CPU speeds
[Figure: time in ns (log scale) versus year, 1985 to 2015, for disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time]
Principle of Locality
• Temporal locality
  – Recently referenced items are likely to be referenced in the near future
• Spatial locality
  – Items with nearby addresses tend to be referenced close together in time
Principle of Locality: Example
• Data
  – Reference array elements in succession: spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
Memory Hierarchy
• Some fundamental and enduring properties of hardware and software
  – Fast storage technologies cost more per byte, have less capacity, and require more power
  – The gap between CPU and main memory speed is widening
  – Well-written programs tend to exhibit good locality
• These fundamental properties complement each other beautifully
• They suggest an approach for organizing memory and storage systems known as a memory hierarchy
Memory Hierarchy: Example
L0: Registers
L1: L1 cache (SRAM)
L2: L2 cache (SRAM)
L3: L3 cache (SRAM)
L4: Main memory (DRAM)
L5: Local secondary storage (local disks)
L6: Remote secondary storage (e.g., Web cache)
Toward L0: smaller, faster, and costlier (per byte) storage devices
Toward L6: larger, slower, and cheaper (per byte) storage devices
Exploiting Locality
• How to exploit temporal locality?
  – Speed up data accesses by caching data in faster storage
  – Caching in multiple levels: form a memory hierarchy
  – The lower levels of the memory hierarchy tend to be slower, but larger and cheaper
• How to exploit spatial locality?
  – Larger cache line size
  – Cache nearby data together
Cache
• A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device
• Improves the average access time
• Exploits both temporal and spatial locality
General Cache Concepts
• Memory: larger, slower, cheaper memory viewed as partitioned into “blocks” (blocks 0 to 15 in the figure)
• Cache: smaller, faster, more expensive memory caches a subset of the blocks (blocks 8, 9, 14, and 3 in the figure)
• Data is copied in block-sized transfer units
[Figure: blocks 4 and 10 shown being copied between memory and cache]
General Cache Concepts: Hit
• Memory: blocks 0 to 15
• Cache: blocks 8, 9, 14, 3
• Data in block b is needed (request: 14)
• Block b is in cache: hit!
General Cache Concepts: Miss
• Memory: blocks 0 to 15
• Cache: blocks 8, 9, 14, 3
• Data in block b is needed (request: 12)
• Block b is not in cache: miss!
• Block b is fetched from memory
• Block b is stored in cache
  – Placement policy: determines where b goes
  – Replacement policy: determines which block gets evicted (victim)
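A minimal sketch of the hit and miss cases above, assuming a direct-mapped placement policy (block number modulo the number of cache lines) and a 4-line cache; the slides do not fix either choice. The names NUM_LINES, cache_lines, and access_block are hypothetical. Under direct mapping the replacement policy is trivial: the victim is whichever block already occupies the target line.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_LINES 4   /* assumed cache capacity, in blocks */
#define EMPTY (-1)    /* marker for a line that holds no block yet */

static int cache_lines[NUM_LINES] = { EMPTY, EMPTY, EMPTY, EMPTY };

/* Look up a block. On a miss, "fetch" it and overwrite the block that
 * occupied its line. Returns true on a hit, false on a miss. */
bool access_block(int block)
{
    assert(block >= 0);
    int line = block % NUM_LINES;   /* placement policy: direct mapped (assumption) */

    if (cache_lines[line] == block)
        return true;                /* hit: block is already cached */

    cache_lines[line] = block;      /* miss: the old occupant is the victim */
    return false;
}
```

Under these assumptions, loading blocks 8, 9, 14, and 3 and then requesting 14 gives a hit, while requesting 12 misses and evicts block 8, because both map to line 0.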
CPU Cache Design Issues
• Cache size
  – 8 KB ~ 64 KB for L1
• Cache line size
  – Typically 32 ~ 64 bytes for L1
• Lookup
  – Fully associative
  – Set associative: 2-way, 4-way, 8-way, 16-way, etc.
  – Direct mapped
• Replacement
  – LRU (Least Recently Used)
  – FIFO (First-In First-Out), RANDOM, etc.
Intel Core i7 Cache Hierarchy
• Processor package
  – Core 0 … Core 3, each with: Regs, L1 d-cache, L1 i-cache, L2 unified cache
  – L3 unified cache (shared by all cores)
• Main memory
• L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
• L2 unified cache: 256 KB, 8-way, access: 10 cycles
• L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
• Block size: 64 bytes for all caches
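The sizes and associativities listed here determine the number of sets in each cache, and so how an address would split into tag, set index, and block offset. The sketch below assumes the conventional layout (block offset in the low bits, set index in the bits just above it), which the slides do not spell out; cache_geom and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a cache: capacity, associativity, block size. */
typedef struct {
    uint32_t size_bytes;
    uint32_t ways;
    uint32_t block_bytes;
} cache_geom;

/* Number of sets = capacity / (ways * block size). */
uint32_t num_sets(cache_geom g)
{
    assert(g.ways > 0 && g.block_bytes > 0);
    return g.size_bytes / (g.ways * g.block_bytes);
}

/* Byte position within the block (low-order part of the address). */
uint32_t block_offset(cache_geom g, uint64_t addr)
{
    return (uint32_t)(addr % g.block_bytes);
}

/* Which set the block maps to (assumed: bits just above the offset). */
uint32_t set_index(cache_geom g, uint64_t addr)
{
    return (uint32_t)((addr / g.block_bytes) % num_sets(g));
}

/* Remaining high-order part, stored to identify the block within its set. */
uint64_t tag_bits(cache_geom g, uint64_t addr)
{
    return addr / g.block_bytes / num_sets(g);
}
```

With the figures on this slide, L1 works out to 32 KB / (8 * 64 B) = 64 sets, L2 to 512 sets, and L3 to 8192 sets.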
Cache Performance
• Average memory access time = T_hit + R_miss * T_miss
• Hit time (T_hit)
  – Time to deliver a line in the cache to the processor
  – Includes time to determine whether the line is in the cache
  – 1 clock cycle for L1, 3 ~ 8 clock cycles for L2
• Miss rate (R_miss)
  – Fraction of memory references not found in cache
  – 3 ~ 10% for L1, < 1% for L2
• Miss penalty (T_miss)
  – Additional time required because of a miss
  – Typically 25 ~ 100 cycles for main memory
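As a worked illustration of the average memory access time formula (hit time plus miss rate times miss penalty), here is a small sketch; the function name amat and the sample inputs are illustrative choices taken from the ranges quoted on this slide, not measurements.

```c
#include <assert.h>

/* Average memory access time in cycles:
 * hit time + miss rate * miss penalty. */
double amat(double t_hit, double r_miss, double t_miss)
{
    assert(r_miss >= 0.0 && r_miss <= 1.0);   /* miss rate is a fraction */
    return t_hit + r_miss * t_miss;
}
```

For example, with a 1-cycle hit time and a 100-cycle miss penalty, a 3% miss rate gives 1 + 0.03 * 100 = 4 cycles on average, while a 10% miss rate gives 11 cycles. A few percentage points of miss rate nearly triple the average cost.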
Writing Cache Friendly Code
• Make the common case go fast
  – Focus on the inner loops of the core functions
• Minimize the misses in the inner loops
  – Repeated references to variables are good (temporal locality)
  – Sequential reference patterns are good (spatial locality)
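To make the two guidelines concrete, the sketch below sums the same 2-D array in two loop orders. The dimensions and function names are hypothetical. Both functions return the same value; they differ only in access pattern. C stores arrays in row-major order, so the first version is a sequential (stride-1) scan and the second jumps COLS elements between consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 4, COLS = 8 };   /* hypothetical dimensions */

/* Inner loop walks along a row: consecutive accesses touch adjacent
 * addresses (stride 1), and sum is reused on every iteration. */
long sum_row_major(int a[ROWS][COLS])
{
    assert(a != NULL);
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Inner loop walks down a column: consecutive accesses are COLS
 * elements apart, so spatial locality is poor for large arrays. */
long sum_col_major(int a[ROWS][COLS])
{
    assert(a != NULL);
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

On an array this small the difference would not be measurable; the pattern matters once a row no longer fits in the cache.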
Matrix Multiplication (1)
• Description
  – Multiply N x N matrices
  – O(N^3) total operations
• Assumptions
  – Line size = 32 bytes (big enough for four 64-bit words)
  – Matrix dimension (N) is very large
    /* ijk */
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;                /* variable sum held in register */
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }
[Figure: row (i,*) of A times column (*,j) of B gives element (i,j) of C]
Matrix Multiplication (2)
• Matrix multiplication (ijk)
    /* ijk */
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }
• Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
• Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
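The per-matrix miss counts follow from the assumptions stated earlier: a 32-byte line holds four 8-byte elements, so a row-wise (stride-1) scan misses once every four iterations, a column-wise scan of a very large matrix misses on every iteration, and a fixed element kept in a register does not miss. A small sketch of that arithmetic follows; access_pattern and misses_per_iteration are hypothetical names, and the model assumes N is large enough that no line survives in the cache until its next use.

```c
#include <assert.h>

typedef enum { FIXED, ROW_WISE, COLUMN_WISE } access_pattern;

/* Estimated cache misses per inner-loop iteration for one matrix,
 * under the simplified model described above. */
double misses_per_iteration(access_pattern p,
                            unsigned line_bytes, unsigned elem_bytes)
{
    assert(elem_bytes > 0 && elem_bytes <= line_bytes);
    switch (p) {
    case FIXED:       return 0.0;                             /* stays in a register */
    case ROW_WISE:    return (double)elem_bytes / line_bytes; /* one miss per line   */
    case COLUMN_WISE: return 1.0;                             /* new line every time */
    }
    return 0.0;   /* unreachable for valid patterns */
}
```

Summing the three matrices for the ijk order gives 0.25 + 1.0 + 0.0 = 1.25 misses per iteration.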
Matrix Multiplication (3)
• Matrix multiplication (jik)
    /* jik */
    for (j=0; j<n; j++) {
      for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }
• Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
• Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (4)
• Matrix multiplication (kij)
    /* kij */
    for (k=0; k<n; k++) {
      for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }
• Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
• Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (5)
• Matrix multiplication (ikj)
    /* ikj */
    for (i=0; i<n; i++) {
      for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }
• Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
• Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (6)
• Matrix multiplication (jki)
    /* jki */
    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }
• Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
• Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (7)
• Matrix multiplication (kji)
    /* kji */
    for (k=0; k<n; k++) {
      for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }
• Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
• Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (8)
• Summary
  – ijk (& jik): 2 loads, 0 stores, misses/iter = 1.25 (A = 0.25, B = 1.0, C = 0.0)
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }
  – kij (& ikj): 2 loads, 1 store, misses/iter = 0.5 (A = 0.0, B = 0.25, C = 0.25)
    for (k=0; k<n; k++) {
      for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }
  – jki (& kji): 2 loads, 1 store, misses/iter = 2.0 (A = 1.0, B = 0.0, C = 1.0)
    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }
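The three loop orders compute the same product and differ only in access pattern. Below is a compilable sketch of one representative from each pair, with a fixed small dimension N and double elements chosen purely for illustration. One detail the slide fragments leave implicit: the kij and jki forms accumulate into c, so c has to start at zero. The zero_matrix helper is an addition for that purpose.

```c
#include <assert.h>

enum { N = 4 };   /* hypothetical small dimension, for checking only */

/* Helper (not in the slides): clear c before accumulating into it. */
static void zero_matrix(double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
}

/* ijk: A row-wise, B column-wise, C fixed in the inner loop. */
void matmul_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a && b && c);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* kij: A fixed, B and C row-wise in the inner loop. */
void matmul_kij(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a && b && c);
    zero_matrix(c);
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* jki: B fixed, A and C column-wise in the inner loop. */
void matmul_jki(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a && b && c);
    zero_matrix(c);
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double r = b[k][j];
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;
        }
}
```

Timing these fairly would need a much larger N and a cycle counter, which this sketch does not attempt.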
Matrix Multiplication (9)
• Performance in Core i7
[Figure: cycles per inner loop iteration (log scale, 1 to 100) versus array size n (50 to 700) for jki, kji, ijk, jik, kij, and ikj; the curves fall into three pairs: jki / kji, ijk / jik, and kij / ikj]
Summary
• Programmer can optimize for cache performance
  – How data structures are organized
  – How data are accessed
    • Nested loop structure
    • Blocking is a general technique
• All systems favor “cache friendly code”
  – Getting absolute optimum performance is very platform specific
    • Cache sizes, line sizes, associativities, etc.
  – Can get most of the advantage with generic code
    • Keep working set reasonably small (temporal locality)
    • Use small strides (spatial locality)
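Blocking is named here but not shown in these slides. The following is a minimal sketch of one common formulation, assuming the matrix dimension N is a multiple of the block size BSIZE; both constants and the function name are hypothetical, and a real BSIZE would be tuned to the cache size. The idea is to work on BSIZE x BSIZE sub-matrices so that the data touched by the inner loops forms a small working set that stays cached while it is reused.

```c
#include <assert.h>

enum { N = 8, BSIZE = 4 };   /* hypothetical sizes; N must be a multiple of BSIZE */

/* Blocked matrix multiply: c = a * b, computed one BSIZE x BSIZE tile
 * of partial products at a time. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
{
    assert(N % BSIZE == 0);

    /* c accumulates partial sums across kk tiles, so clear it first. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += BSIZE)
        for (int jj = 0; jj < N; jj += BSIZE)
            for (int kk = 0; kk < N; kk += BSIZE)
                /* Multiply tile (ii,kk) of a by tile (kk,jj) of b. */
                for (int i = ii; i < ii + BSIZE; i++)
                    for (int j = jj; j < jj + BSIZE; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + BSIZE; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}
```

The result is the same as the unblocked loops; only the order in which the partial products are accumulated changes.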