Latency vs. Bandwidth: Which Matters More?
Katherine Yelick
U.C. Berkeley and LBNL
Joint work with: Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands (LBNL)
The Berkeley IRAM group: Dave Patterson, Joe Gebis, Dave Judd, Christoforos Kozyrakis, Sam Williams, …
The Berkeley Bebop group: Jim Demmel, Rich Vuduc, Ben Lee, Rajesh Nishtala, …
K. Yelick, PIM Software 2004
Blame the Memory Bus
Many scientific applications run at less than 10% of hardware peak, even on a single processor. The trend is to blame the memory bus. Is this accurate?
We need to understand the bottlenecks to design better machines and design better algorithms.
Two parts: algorithm bottlenecks on microprocessors, and bottlenecks on a PIM system, VIRAM.
[Figure: Processor-memory performance gap, 1980-2000, log scale. CPU performance grows ~60%/yr while DRAM grows ~7%/yr, so the gap grows ~50%/yr, opening up around 1982.]
Note: this is latency, not bandwidth.
Memory Intensive Applications
Poor performance is especially problematic for memory-intensive applications: a low ratio of arithmetic operations to memory operations, and irregular memory access patterns.
Example: sparse matrix-vector multiply (the dominant kernel of NAS CG). Many scientific applications do this from some perspective. Compute y = y + A*x.
The matrix is stored as two main arrays: a column-index array (int) and a value array (floating point).
For each element y[i], compute the sum over j of x[index[j]] * value[j].
So latency (to x) dominates, right? The accesses to x are irregular and not necessarily in cache.
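A scalar sketch of the kernel just described. The slide gives only the column-index and value arrays; the `row_ptr` array delimiting each row's entries is an added assumption (a CRS-style layout), and all names are illustrative.

```python
# Sparse matrix-vector multiply y = y + A*x with the two-array layout
# described above: a column-index array (int) and a value array (float),
# plus assumed row pointers delimiting each row's nonzeros.
def spmv(row_ptr, col_index, value, x, y):
    for i in range(len(row_ptr) - 1):          # one output element per row
        acc = y[i]
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += value[j] * x[col_index[j]]  # indexed (irregular) load from x
        y[i] = acc
    return y

# Tiny example: A = [[2, 0], [1, 3]], x = [1, 1], y starts at 0.
y = spmv([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0], [0.0, 0.0])
```

Note that the index and value streams are read once each, unit stride, while only the loads from `x` are irregular; that asymmetry is what the next slide's model quantifies.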
Performance Model is Revealing
A simple analytical model for the sparse matvec kernel:
(# loads from memory × cost of a memory load) + (# loads from cache × cost of a cache load) + …
Two versions: one with only compulsory misses to the source vector x, and one where all accesses to x produce a miss to memory.
Conclusion: cache misses to the source vector (memory latency) are not the dominant cost. PAPI measurements confirm this.
So bandwidth to the matrix dominates, right?
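The two model variants can be written down in a few lines. The latency numbers below are illustrative assumptions, not measurements from the talk; the matrix dimensions are the DIS matrix used later in the deck.

```python
# Sketch of the slide's load-cost model for sparse matvec:
#   cycles ~ (#loads from memory) * memory latency
#          + (#loads from cache)  * cache latency
# mem_lat and cache_lat are assumed, illustrative cycle counts.
def matvec_cycles(nnz, x_misses, mem_lat=100, cache_lat=2):
    matrix_loads = 2 * nnz          # one index + one value per nonzero
    x_hits = nnz - x_misses         # remaining x accesses hit in cache
    return matrix_loads * mem_lat + x_misses * mem_lat + x_hits * cache_lat

nnz, n = 177_820, 10_000
lower = matvec_cycles(nnz, x_misses=n)    # only compulsory misses to x
upper = matvec_cycles(nnz, x_misses=nnz)  # every access to x misses
# The two bounds differ by well under 2x: the matrix streams, not the
# misses to x, dominate the total load cost.
```

With these assumed latencies the ratio `upper/lower` is about 1.4, which is the slide's point: even treating every access to x as a memory miss barely moves the total.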
Memory Bandwidth Measurements
Yes, but be careful about how you measure bandwidth: it is not a constant.
An Architectural Probe
Sqmat is a tunable probe to measure architectures: it processes a stream of small matrices and squares each matrix to some power, which sets the computational intensity. The stream may be direct (dense) or indirect (sparse); if indirect, a parameter controls how frequently there is a non-unit-stride jump.
Parameters: matrix size within the stream, computational intensity, indirection (yes/no), and the number of unit strides before a jump.
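A scalar sketch of the probe's inner loop. The names are illustrative, and the slide does not give the probe's source, so this is only a sketch of the stated behavior: each small matrix is squared k times (the computational-intensity knob), and an optional index stream makes the accesses indirect.

```python
# Sqmat-style probe sketch: square each small matrix k times.
# `index` adds an indirect (sparse-style) load stream; None means
# direct (dense), unit-stride access through the stream.
def sqmat(matrices, k, index=None):
    order = index if index is not None else range(len(matrices))
    out = []
    for i in order:                       # unit stride, or indexed jumps
        m = matrices[i]
        for _ in range(k):                # k squarings = computational intensity
            m = [[sum(m[r][t] * m[t][c] for t in range(len(m)))
                  for c in range(len(m))] for r in range(len(m))]
        out.append(m)
    return out

# 1x1 matrices make the arithmetic easy to check: 3 squared twice is 81.
assert sqmat([[[3]]], k=2) == [[[81]]]
```

Raising k adds flops without adding memory traffic, so sweeping k separates compute-bound from bandwidth-bound behavior on a given machine.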
Cost of Indirection
Adding a second load stream for indices into the stream has a big effect on some machines.
This is truly a bandwidth issue.
[Figure: slowdown (1-6) vs. number of squarings (computational intensity, 1-512) for Itanium 2, Opteron, Power3, and Power4.]
Cost of Irregularity
Slowdown relative to the previous slide's results. Even a tiny bit of irregularity (1/S) can have a big effect.
[Figure: four panels (Opteron, Itanium 2, Power3, Power4) of slowdown vs. number of squarings (1-64), for S = 1, 2, 4, 8, 16, 128 unit strides between jumps.]
What Does This Have to Do with PIMs?
Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!).
Imagine is much faster for long streams, slower for short ones.
Mflop/s with varying stream lengths:
[Figure: Mflop/s (0-3500) for IMAGINE, IRAM, DIVA, and Power3 at stream lengths 8, 16, 32, …, 1024.]
VIRAM Overview
Technology: IBM SA-27E, 0.18 µm CMOS, 6 metal layers.
Die area: 290 mm²; 225 mm² for memory/logic.
Transistor count: ~130M; 13 MB of DRAM.
Power supply: 1.2 V for logic, 1.8 V for DRAM.
Typical power consumption: 2.0 W = 0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) + 0.3 W (misc).
MIPS scalar core + 4-lane vector unit. Peak vector performance: 1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations); 3.2/6.4/12.8 Gops with madd; 1.6 Gflops (single-precision).
Die: 14.5 mm × 20.0 mm.
Vector IRAM ISA Summary
Vector ALU: signed/unsigned integer, single/double FP; operand forms .v, .vv, .vs, .sv.
Vector memory: unit-stride, constant-stride, and indexed loads and stores (signed/unsigned integer).
Data widths: 8, 16, 32, 64 bits.
Scalar: MIPS64 scalar instruction set.
• 91 instructions
• 660 opcodes
ALU operations: integer, floating-point, fixed-point and DSP, convert, logical, vector processing, flag processing.
VIRAM Compiler
Based on Cray's production compiler.
Challenges: narrow data types and scalar/vector memory consistency.
Advantages relative to media extensions: powerful addressing modes and an ISA independent of datapath width.
[Diagram: frontends (C, Fortran95, C++) feed Cray's PDGCS optimizer, which feeds code generators for T3D/T3E, SV2/VIRAM, and C90/T90/X1.]
Compiler and OS Enhancements
Compiler based on Cray PDGCS:
Outer-loop vectorization. Strided and indexed vector loads and stores.
Vectorization of loops with if statements: full predicated execution of vector instructions using flag registers.
Vectorization of reductions and FFTs: instructions for simple intra-register permutations; automatic for reductions, manual (or StreamIT) for FFTs.
Vectorization of loops with break statements: software speculation support for vector loads.
OS development: MMU-based virtual memory.
OS performance: dirty and valid bits for registers to reduce context-switch overhead.
HW Resources Visible to Software
[Figure: in Vector IRAM, the memory hierarchy is visible to software; in a Pentium III, most of it is transparent to software.]
• Software (applications/compiler/OS) can control: main memory, registers, execution datapaths
VIRAM Chip Statistics
Technology: IBM SA-27E, 0.18 µm CMOS, 6 layers of copper; deep-trench DRAM cell, full-speed logic
Area: 270 mm² (65 mm² logic, 140 mm² DRAM)
Transistors: ~130 million (7.5M logic, 122.5M DRAM)
Supply: 1.2 V logic, 1.8 V DRAM, 3.3 V I/O
Clock: 200 MHz
Power: 2 W (0.5 W MIPS core, 1 W vector unit, 0.5 W DRAM/I-O)
Package: 304-lead quad ceramic package (125 signal I/Os)
Crossbar BW: 12.8 GB/s per direction (load or store, peak)
Peak performance: integer without madd 1.6/3.2/6.4 Gops (64b/32b/16b); integer with madd 3.2/6.4/12.8 Gops (64b/32b/16b); FP 1.6 Gflops (32b, without madd)
VIRAM Design Statistics
RTL model: 170K lines of Verilog
Design methodology: synthesized (MIPS core, vector unit control, FP datapath); full-custom (vector register file, crossbar, integer datapaths); macros (DRAM, SRAM for caches)
IP sources: UC Berkeley (vector coprocessor, crossbar, I/O), MIPS Technologies (MIPS core), IBM (DRAM/SRAM macros), MIT (FP datapath)
Verification: 566K lines of directed tests (9.8M lines of assembly); 4 months of random testing on 20 Linux workstations
Design team: 5 graduate students
Status: place & route, chip assembly
Tape-out: October 2002
Design time: ~2.5 years
VIRAM Chip
Taped out to IBM in October '02. Received wafers in June 2003; chips were thinned, diced, and packaged. Parts were sent to ISI, who produced test boards.
[Die photo: DRAM banks, I/O, MIPS core, and 4 64-bit vector lanes.]
Demonstration System
Based on the MIPS Malta development board: PCI, Ethernet, AMR, IDE, USB, CompactFlash, parallel, serial.
VIRAM daughter-card, designed at ISI-East: VIRAM processor, Galileo GT64120 chipset, 1 DIMM slot for external DRAM.
Software support and OS: a monitor utility for debugging, and a modified version of MIPS Linux.
Benchmarks for Scientific Problems
Dense and sparse matrix-vector multiplication: compare to tuned codes on conventional machines.
Transitive closure (small & large data sets), on a dense graph representation.
NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit): fetch-and-increment a stream of "random" addresses.
Sparse matrix-vector product: order 10000, 177820 nonzeros.
Computing a histogram, used for image processing of a 16-bit greyscale image (1536 × 1536); two algorithms: a 64-element sorting kernel, and privatization. Also used in sorting.
2D unstructured mesh adaptation: initial grid 4802 triangles, final grid 24010 triangles.
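The GUPS kernel in the list above is essentially a fetch-and-increment over a stream of pseudo-random table locations. A minimal sketch; the real benchmark uses a fixed address-generator stream, for which Python's `random` is only a stand-in here.

```python
import random

# GUPS-style kernel sketch: read-modify-write a table at pseudo-random
# locations. Each update is an irregular, essentially uncacheable access,
# which is what makes the benchmark memory-system-bound.
def gups(table_size, n_updates, seed=0):
    rng = random.Random(seed)
    table = [0] * table_size
    for _ in range(n_updates):
        addr = rng.randrange(table_size)   # "random" address stream
        table[addr] += 1                   # fetch-and-increment
    return table

t = gups(1 << 10, 4096)
```

Because each increment depends on a load from an unpredictable address, the rate is set by how many independent memory operations the machine can keep in flight, not by arithmetic throughput.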
Sparse MVM Performance
Performance is matrix-dependent: the lp matrix.
Compiled for VIRAM using the "independent" pragma, with a sparse column layout.
Sparsity-optimized for the other machines, with a sparse row (or blocked row) layout.
[Figure: MFLOPS (0-250) for VIRAM-4, VIRAM-8, Sun Ultra I, MIPS R10K, Alpha 21264, and PowerPC 604e.]
Power and Performance on BLAS-2
100×100 matrix-vector multiplication (column layout). The VIRAM result is compiled; the others are hand-coded or Atlas-optimized. VIRAM performance improves with larger matrices, and VIRAM power includes on-chip main memory. An 8-lane version of VIRAM nearly doubles MFLOPS.
[Figure: MFLOPS and MFLOPS/Watt (0-400) for VIRAM, Sun Ultra I, Sun Ultra II, MIPS R12K, Alpha 21264, PowerPC G3, and Power3 630.]
Performance Comparison
IRAM was designed for media processing: low power was a higher priority than high performance.
IRAM (at 200 MHz) is better for apps with sufficient parallelism.
[Figure: MOPS (0-1000) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, and Mesh for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
Power Efficiency
Same data on a log plot; includes the low-power processors (Mobile PIII). The picture is the same for operations/cycle.
[Figure: MOPS/Watt (0.1-1000, log scale) on the same benchmarks for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
Which Problems are Limited by Bandwidth?
What is the bottleneck in each case? Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak). SPMV and Mesh are limited by address generation and bank conflicts. For Histogram there is insufficient parallelism.
[Figure: memory bandwidth in MB/s (0-6000, left axis) and MOPS (0-1000, right axis) for Transitive, GUPS, SPMV (regular), SPMV (random), Histogram, and Mesh.]
Summary of 1-PIM Results
Programmability advantage: all benchmarks vectorized by the VIRAM compiler (Cray vectorizer), with restructuring and hints from programmers.
Performance advantage: large on applications limited only by bandwidth; more address generators/sub-banks would help irregular performance.
Performance/power advantage: over both low-power and high-performance processors. Both PIM and data parallelism are key.
Alternative VIRAM Designs
"VIRAM-4Lane": 4 lanes, 8 MB, ~190 mm², 3.2 Gops at 200 MHz.
"VIRAM-2Lanes": 2 lanes, 4 MB, ~120 mm², 1.6 Gops at 200 MHz.
"VIRAM-Lite": 1 lane, 2 MB, ~60 mm², 0.8 Gops at 200 MHz.
Compiled Multimedia Performance
A single executable serves multiple implementations, and performance scales linearly with the number of lanes. Remember, this is a 200 MHz, 2 W processor.
[Figure: millions of operations per second (0-4000) for matmul 64x64, saxpy 4K, fir filter, decrypt, detect, convolve, compose, and colorspace (integer and floating-point kernels), with 1, 2, and 4 lanes.]
Third Party Comparison (I)
[Figure: ISI results for SLIIC kernels (performance): speedup over PPC G3 (0-25) on Corner Turn, Coherent Sidelobe Canceller, and Beam Steering, for PPC G3-400MHz, M32R/D-80MHz, PPC G4-733MHz, Pentium III-733MHz, VIRAM-200MHz, and Imagine-400MHz.]
Third Party Comparison (II)
[Figure: ISI-East results for SLIIC kernels (performance/Watt): improvement over PPC G3 (0-40) on the same three kernels for the same six processors.]
Vectors vs. SIMD or VLIW
SIMD: short, fixed-length vector extensions. They require wide issue or an ISA change to scale, and they don't support vector memory accesses. They are difficult to compile for, and performance is wasted on pack/unpack, shifts, rotates, …
VLIW: an architecture for instruction-level parallelism, orthogonal to vectors for data parallelism, but inefficient for data parallelism: large code size (3X for IA-64?), extra work for software (scheduling more instructions), and extra work for hardware (decoding more instructions).
Vector vs. Wide Word SIMD: Example
Vector instruction sets have strided and scatter/gather load/store operations; SIMD extensions load only contiguous memory. Vector length is implementation-independent; SIMD extensions change the ISA with the hardware's bit width.
A simple example: conversion from RGB to YUV. Thanks to Christoforos Kozyrakis.
Y = [( 9798*R + 19235*G +  3736*B) / 32768]
U = [(-4784*R -  9437*G +  4221*B) / 32768] + 128
V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
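In scalar form the conversion is three fixed-point dot products per pixel, using exactly the coefficients above; reading the bracketed divide by 32768 as an arithmetic shift (floor) is an assumption. This is the reference against which the vector and MMX versions on the next slides can be compared.

```python
# Scalar RGB -> YUV using the slide's Q15 fixed-point coefficients.
# ">> 15" implements the divide by 32768 (floor semantics assumed).
def rgb_to_yuv(r, g, b):
    y = ((  9798 * r + 19235 * g +  3736 * b) >> 15)
    u = (( -4784 * r -  9437 * g +  4221 * b) >> 15) + 128
    v = (( 20218 * r - 16941 * g -  3277 * b) >> 15) + 128
    return y, u, v

# Black maps to (0, 128, 128); white maps to Y = 255 and V = 128,
# since the V coefficients sum to exactly zero.
```

The vector version below does the same three multiply-adds per component with one strided load per color plane; the MMX version needs dozens of pack/unpack and shift instructions to achieve the same arithmetic.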
VIRAM Code
RGBtoYUV:
vlds.u.b r_v, r_addr, stride3, addr_inc # load R
vlds.u.b g_v, g_addr, stride3, addr_inc # load G
vlds.u.b b_v, b_addr, stride3, addr_inc # load B
xlmul.u.sv o1_v, t0_s, r_v # calculate Y
xlmadd.u.sv o1_v, t1_s, g_v
xlmadd.u.sv o1_v, t2_s, b_v
vsra.vs o1_v, o1_v, s_s
xlmul.u.sv o2_v, t3_s, r_v # calculate U
xlmadd.u.sv o2_v, t4_s, g_v
xlmadd.u.sv o2_v, t5_s, b_v
vsra.vs o2_v, o2_v, s_s
vadd.sv o2_v, a_s, o2_v
xlmul.u.sv o3_v, t6_s, r_v # calculate V
xlmadd.u.sv o3_v, t7_s, g_v
xlmadd.u.sv o3_v, t8_s, b_v
vsra.vs o3_v, o3_v, s_s
vadd.sv o3_v, a_s, o3_v
vsts.b o1_v, y_addr, stride3, addr_inc # store Y
vsts.b o2_v, u_addr, stride3, addr_inc # store U
vsts.b o3_v, v_addr, stride3, addr_inc # store V
subu pix_s,pix_s, len_s
bnez pix_s, RGBtoYUV
MMX Code (1)
RGBtoYUV:
movq mm1, [eax]
pxor mm6, mm6
movq mm0, mm1
psrlq mm1, 16
punpcklbw mm0, ZEROS
movq mm7, mm1
punpcklbw mm1, ZEROS
movq mm2, mm0
pmaddwd mm0, YR0GR
movq mm3, mm1
pmaddwd mm1, YBG0B
movq mm4, mm2
pmaddwd mm2, UR0GR
movq mm5, mm3
pmaddwd mm3, UBG0B
punpckhbw mm7, mm6;
pmaddwd mm4, VR0GR
paddd mm0, mm1
pmaddwd mm5, VBG0B
movq mm1, 8[eax]
paddd mm2, mm3
movq mm6, mm1
paddd mm4, mm5
movq mm5, mm1
psllq mm1, 32
paddd mm1, mm7
punpckhbw mm6, ZEROS
movq mm3, mm1
pmaddwd mm1, YR0GR
movq mm7, mm5
pmaddwd mm5, YBG0B
psrad mm0, 15
movq TEMP0, mm6
movq mm6, mm3
pmaddwd mm6, UR0GR
psrad mm2, 15
paddd mm1, mm5
movq mm5, mm7
pmaddwd mm7, UBG0B
psrad mm1, 15
pmaddwd mm3, VR0GR
packssdw mm0, mm1
pmaddwd mm5, VBG0B
psrad mm4, 15
movq mm1, 16[eax]
MMX Code (2)
  paddd mm6, mm7
movq mm7, mm1
psrad mm6, 15
paddd mm3, mm5
psllq mm7, 16
movq mm5, mm7
psrad mm3, 15
movq TEMPY, mm0
packssdw mm2, mm6
movq mm0, TEMP0
punpcklbw mm7, ZEROS
movq mm6, mm0
movq TEMPU, mm2
psrlq mm0, 32
paddw mm7, mm0
movq mm2, mm6
pmaddwd mm2, YR0GR
movq mm0, mm7
pmaddwd mm7, YBG0B
packssdw mm4, mm3
add eax, 24
add edx, 8
movq TEMPV, mm4
movq mm4, mm6
pmaddwd mm6, UR0GR
movq mm3, mm0
pmaddwd mm0, UBG0B
paddd mm2, mm7
pmaddwd mm4,
pxor mm7, mm7
pmaddwd mm3, VBG0B
punpckhbw mm1,
paddd mm0, mm6
movq mm6, mm1
pmaddwd mm6, YBG0B
punpckhbw mm5,
movq mm7, mm5
paddd mm3, mm4
pmaddwd mm5, YR0GR
movq mm4, mm1
pmaddwd mm4, UBG0B
psrad mm0, 15
paddd mm0, OFFSETW
psrad mm2, 15
paddd mm6, mm5
movq mm5, mm7
MMX Code (3)
  pmaddwd mm7, UR0GR
psrad mm3, 15
pmaddwd mm1, VBG0B
psrad mm6, 15
paddd mm4, OFFSETD
packssdw mm2, mm6
pmaddwd mm5, VR0GR
paddd mm7, mm4
psrad mm7, 15
movq mm6, TEMPY
packssdw mm0, mm7
movq mm4, TEMPU
packuswb mm6, mm2
movq mm7, OFFSETB
paddd mm1, mm5
paddw mm4, mm7
psrad mm1, 15
movq [ebx], mm6
packuswb mm4,
movq mm5, TEMPV
packssdw mm3, mm4
paddw mm5, mm7
paddw mm3, mm7
movq [ecx], mm4
packuswb mm5, mm3
add ebx, 8
add ecx, 8
movq [edx], mm5
dec edi
jnz RGBtoYUV
Summary
Combination of vectors and PIM: a simple execution model for hardware pushes complexity to the compiler; low power/footprint/etc.; PIM provides the bandwidth needed by vectors, and vectors hide latency effectively.
Programmability: programmable from a "high-level" language, with a more compact instruction stream.
Works well for: applications with fine-grained data parallelism, memory-intensive problems, and both scientific and multimedia applications.
The End

Algorithm Space
[Figure: algorithms arranged along two axes, regularity and reuse: two-sided dense linear algebra, one-sided dense linear algebra, FFTs, sparse iterative solvers, sparse direct solvers, asynchronous discrete event simulation, Gröbner basis ("symbolic LU"), search, and sorting.]
VIRAM Overview
Die: 14.5 mm × 20.0 mm.
MIPS core (200 MHz): single-issue, 8 KB I & D caches.
Vector unit (200 MHz): 32 64b elements per register; 256b datapaths (16b, 32b, 64b ops); 4 address-generation units.
Main memory system: 13 MB of on-chip DRAM in 8 banks; 12.8 GB/s peak bandwidth.
Typical power consumption: 2.0 W.
Peak vector performance: 1.6/3.2/6.4 Gops without multiply-add; 1.6 Gflops (single-precision).
Fabrication by IBM; tape-out in O(1 month).
Power Efficiency
Huge power/performance advantage in VIRAM, from both PIM technology and the data-parallel execution model (compiler-controlled).
[Figure: MOPS/Watt (0-500, linear scale) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, and Mesh for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
Analysis of a Multi-PIM System
Machine parameters:
Floating-point performance: PIM-node dependent; application dependent, not theoretical peak.
Amount of memory per processor: use 1/10th for algorithm data.
Communication overhead: time the processor is busy sending a message; cannot be overlapped.
Communication latency: time across the network (can be overlapped).
Communication bandwidth: single node and bisection.
Back-of-the-envelope calculations!
Real Data from an Old Machine (T3E)
UPC uses a global address space with a non-blocking remote put/get model; it does not cache remote data.
[Figure: sparse matrix-vector multiply on the T3E: Mflops (0-250) vs. processors (1-32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small.]
Running Sparse MVM on a Pflop PIM
1 GHz × 8 pipes × 8 ALUs/pipe = 64 GFLOPS/node peak; 8 address generators limit performance to 16 Gflops. 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead. There are programmability differences too: packing vs. a global address space.
[Figure: achieved ops/sec (1e7-1e16, log scale) for put/get, blocking read/write, synchronous MP, asynchronous MP, and peak.]
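The bullet arithmetic above can be checked in a few lines. The link bandwidth, the 8-byte message size, and reading "16 Gflops" as one multiply-add per generated address are assumptions layered on the slide's numbers.

```python
# Back-of-the-envelope node model from the slide's parameters.
# The 6.4 GB/s link bandwidth and 8-byte messages are assumptions.
CLOCK_HZ = 1e9

flops_peak = CLOCK_HZ * 8 * 8   # 8 pipes x 8 ALUs/pipe = 64 GFLOPS/node
flops_addr = CLOCK_HZ * 8 * 2   # 8 address generators, assuming one
                                # multiply-add (2 flops) per address

def message_cost_s(nbytes, overhead_cycles, latency_s=500e-9, bw_bytes=6.4e9):
    """Time for one remote access: software overhead (not overlappable)
    + network latency (overlappable) + transfer time."""
    return overhead_cycles / CLOCK_HZ + latency_s + nbytes / bw_bytes

put_get  = message_cost_s(8, overhead_cycles=1)    # lightweight put/get
msg_pass = message_cost_s(8, overhead_cycles=100)  # message-passing overhead
```

For small messages the 100-cycle software overhead is the only term that differs, which is why the put/get curves in the figure sit well above the message-passing ones until messages get large.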
Effect of Memory Size
For small memory nodes or smaller problem sizes, low overhead is more important.
For large memory nodes and large problems, packing is better.
[Figure: ops/sec (1e7-1e16, log scale) vs. MB/node of data (0.3-4210) for put/get, blocking read/write, synchronous MP, asynchronous MP, and peak.]
Conclusions
The performance advantage for PIMs depends on the application: fine-grained parallelism is needed to utilize on-chip bandwidth. Data parallelism is one model, with the usual trade-offs: hardware and programming simplicity, but limited expressibility.
The largest advantages for PIMs are power and packaging, enabling a peta-scale machine.
Multiprocessor PIMs should be easier to program, at least at the scale of current machines (Tflops). Can we get rid of the current programming-model hierarchy?
Benchmarks
Kernels designed to stress memory systems; some taken from the Data Intensive Systems stressmarks.
Unit and constant stride: dense matrix-vector multiplication, transitive closure.
Constant stride: FFT.
Indirect addressing: NSA Giga-Updates Per Second (GUPS), sparse matrix-vector multiplication, histogram calculation (sorting).
Frequent branching as well as irregular memory access: unstructured mesh adaptation.
Conclusions and VIRAM Future Directions
VIRAM outperforms the Pentium III on scientific problems, with lower power and clock rate than the Mobile Pentium.
Vectorization techniques developed for the Cray PVPs are applicable. PIM technology provides a low-power, low-cost memory system. A similar combination is used in the Sony Playstation.
Small ISA changes can have a large impact: limited in-register permutations sped up a 1K FFT by 5x.
The memory system can still be a bottleneck: indexed/variable stride is costly, due to address generation.
Future work: ongoing investigations into the impact of lanes and subbanks; technical paper in preparation (expect completion 09/01); run benchmarks on real VIRAM chips; examine multiprocessor VIRAM configurations.
Management Plan
Roles of different groups and PIs: senior researchers working on particular classes of benchmarks.
Parry: sorting and histograms. Sherry: sparse matrices. Lenny: unstructured mesh adaptation. Brian: simulation. Jin and Hyun: specific benchmarks.
Plan to hire an additional postdoc for next year (focus on Imagine). Undergrad model used for targeted benchmark efforts.
Plan for using computational resources at NERSC: few resources used, except for comparisons.
Future Funding Prospects
FY2003 and beyond: DARPA initiated the DIS program, and related projects are continuing under Polymorphic Computing. A new BAA is coming in "High Productivity Systems". Interest from other DOE labs (LANL) in the general problem.
General model: most architectural research projects need benchmarking, and the work has higher quality if done by people who understand the apps. The expertise for hardware projects is different: system-level design, circuit design, etc. Interest from both the IRAM and Imagine groups shows the level of interest.
Long Term Impact
Potential impact on computer science: promote research on new architectures and micro-architectures, and understand future architectures.
Preparation for procurements.
Provide visibility of NERSC in core CS research areas.
Correlate applications: DOE vs. large-market problems.
Influence future machines through research collaborations.
Benchmark Performance on IRAM Simulator
IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
Project Goals for FY02 and Beyond
Use the established data-intensive scientific benchmarks with other emerging architectures:
IMAGINE (Stanford Univ.): designed for graphics and image/signal processing; peak 20 GFLOPS (32-bit FP). Key features: vector processing, VLIW, and a streaming memory system. (Not a PIM-based design.) Preliminary discussions with Bill Dally.
DIVA (DARPA-sponsored: USC/ISI): based on a PIM "smart memory" design, but for multiprocessors; moves computation to data. Designed for irregular data structures and dynamic databases. Discussions with Mary Hall about benchmark comparisons.
Media Benchmarks
FFT uses in-register permutations and a generalized reduction. All others are written in C with the Cray vectorizing compiler.
[Figure: GOPS (0-4) for the media benchmarks.]
Integer Benchmarks
Strided access is important (e.g., RGB); narrow types are limited by address generation.
Outer-loop vectorization and unrolling are used: this helps avoid short vectors, but spilling can be a problem.
[Figure: performance (0-7000) with 1, 2, and 4 lanes.]
Status of benchmarking software release
Release contents:
Build and test scripts (Makefiles, timing, analysis, ...)
Standard random number generator
GUPS: C codes, optimized inner loop, docs
Pointer jumping (with and without update), Transitive, Field
Conjugate Gradient (matrix), Neighborhood
Optimized vector histogram code; vector histogram code generator
Test cases (small and large working sets); optimized and unoptimized versions
Future work:
• Write more documentation, add better test cases as we find them
• Incorporate media benchmarks, AMR code, library of frequently-used compiler flags & pragmas
Status of benchmarking work
Two performance models: a simulator (vsim-p) and a trace analyzer (vsimII).
Recent work on vsim-p: refining the performance model for double-precision FP performance.
Recent work on vsimII: making the backend modular (goal: model different architectures with the same ISA); fixing bugs in the memory model of the VIRAM-1 backend; better comments in the code for maintainability; completing a new backend for a new decoupled cluster architecture.
Comparison with Mobile Pentium
GUPS: VIRAM gets 6x more GUPS.

Data element width:   16-bit  32-bit  64-bit
Mobile Pentium GUPS:  .045    .046    .036
VIRAM GUPS:           .295    .295    .244
[Figures: total execution time (seconds) for Transitive (matrix sizes 50-550), Update (working sets 0tiny through test4), and Pointer (working sets 0tiny through test3), comparing P-III and VIRAM 4-lane.]
VIRAM is 30-50% faster than the P-III, and its execution time rises much more slowly with data size.
Sparse CG
Solve Ax = b; sparse matrix-vector multiplication dominates.
The traditional CRS format requires indexed load/store for the X/Y vectors and variable, usually short, vector lengths.
Other formats vectorize better:
CRS with a narrow band (e.g., RCM ordering): smaller strides for the X vector.
Segmented sum (modified from the old code developed for the Cray PVP): long vector lengths of the same size; unit stride.
ELL format: make all rows the same length by padding with zeros; long vector lengths of the same size, at the cost of extra flops.
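A sketch of the ELL padding idea (CRS-style inputs; names hypothetical): every row is padded to the longest row's length, buying uniform, unit-stride vectors at the cost of extra flops on the explicit zeros.

```python
# Convert CRS (row_ptr, col_index, value) to ELL: pad each row with
# explicit zeros so all rows have the same length. Vector length becomes
# uniform and the access pattern regular; the zeros cost extra flops.
def crs_to_ell(row_ptr, col_index, value):
    nrows = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(nrows))
    ell_val = [[0.0] * width for _ in range(nrows)]
    ell_col = [[0] * width for _ in range(nrows)]   # padding points at column 0
    for i in range(nrows):
        for k, j in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_val[i][k] = value[j]
            ell_col[i][k] = col_index[j]
    return ell_col, ell_val

cols, vals = crs_to_ell([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0])
```

The padded zeros multiply harmlessly into the result, which is why the next slide's ELL row reports both raw MFLOPS and (in parentheses) the useful rate after discounting the 4.6x extra flops.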
SMVM Performance
DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row). Mobile PIII (500 MHz) CRS: 35 MFLOPS.

IRAM results (MFLOPS) by number of sub-banks:
                         1         2         4         8
CRS                      91        106       109       110
CRS banded               110       110       110       110
SEG-SUM                  135       154       163       165
ELL (4.6x more flops)    511(111)  570(124)  612(133)  632(137)
2D Unstructured Mesh Adaptation
A powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation).
Complicated logic and data structures make it difficult to achieve high efficiency: irregular data access patterns (pointer chasing), many conditionals, integer intensive.
Adaptation is a tool for making the numerical solution cost-effective. Three types of element subdivision.
Vectorization Strategy and Performance Results
Color elements based on vertices (not edges)
This guarantees no conflicts during vector operations.
Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time. This is difficult: many conditionals, low flops, irregular data access, dependencies.
Initial grid: 4802 triangles; final grid: 24010 triangles.
Preliminary results show VIRAM 4.5x faster than a Mobile Pentium III 500, at higher code complexity (requires graph coloring + reordering).
Time (ms): Pentium III 500: 61; 1 lane: 18; 2 lanes: 14; 4 lanes: 13.