TRANSCRIPT
IBM Research
© 2008
Feeding the Multicore Beast: It’s All About the Data!
Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept.
[email protected]
Outline
History: Data challenge
Motivation for multicore
Implications for programmers
How Cell addresses these implications
Examples
– 2D/3D FFT: Medical imaging, petroleum, general HPC…
– Green’s Functions: Seismic imaging (petroleum)
– String Matching: Network processing (DPI & intrusion detection)
– Neural Networks: Finance
The Hungry Beast
[Figure: Data (“food”) flowing through a Data Pipe to the Processor (“beast”)]
Pipe too small = starved beast
Pipe big enough = well-fed beast
Pipe too big = wasted resources
If flops grow faster than pipe capacity…
… the beast gets hungrier!
Move the food closer
Example: Intel Tulsa (Xeon MP 7100 series)
– 65nm, 349mm², 2 cores
– 3.4 GHz @ 150W
– ~54.4 SP GFLOPS
– http://www.intel.com/products/processor/xeon/index.htm
Large cache on chip
– ~50% of area
– Keeps data close for efficient access
If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
[Figure: data set too large to fit in cache]
If the data set doesn’t fit in cache
– Cache misses
– Memory latency exposed
– Performance degraded
Several important application classes don’t fit
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
[Figure: data set still exceeding a larger cache]
Cache size steadily increasing
Implications
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
But…
– Important application working sets are growing faster
– Multicore even more demanding on cache than uni-core
Power Density – The fundamental problem
[Figure: power density (W/cm²) vs. process node (1.5µm down to 0.07µm) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III – rising past “hot plate” toward “nuclear reactor” levels]
Source: Fred Pollack, Intel. “New Microprocessor Challenges in the Coming Generations of CMOS Technologies,” Micro32
What’s causing the problem?
Gate dielectric approaching a fundamental limit (a few atomic layers)
[Figure: gate stack cross-section with oxide thickness Tox ≈ 11Å]
[Figure: power density (W/cm²) vs. gate length (microns), 1994 vs. 2004 – passive (leakage) power catching up with active power as gate lengths shrink toward 65nm]
Power, signal jitter, etc...
Diminishing Returns on Frequency
[Figure: chip clock speed (MHz, log scale 10²–10⁴) vs. year (1990–2010), with frequency-driven design points leveling off]
In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures.
Power vs. Performance Trade-Offs
[Figure: relative power vs. relative performance – power grows superlinearly with single-core performance along frequency-driven design points]
We need to adapt our algorithms to get performance out of multicore.
Implications of Multicore
There are more mouths to feed
– Data movement will take center stage
Complexity of cores will stop increasing
… and has started to decrease in some cases
Complexity increases will center around communication
Assumption
– Achieving a significant % of peak performance is important
Feeding the Cell Processor
8 SPEs, each with
– LS (local store)
– MFC (memory flow controller)
– SXU (synergistic execution unit)
PPE (64-bit Power Architecture with VMX)
– OS functions
– Disk IO
– Network IO
[Figure: Cell BE block diagram – the PPE (PPU with L1 and L2) and 8 SPEs (SPU/SXU + LS + MFC each) on the EIB (up to 96B/cycle; 16B/cycle per port), with the MIC to dual XDR memory and the BIC to FlexIO]
Cell Approach: Feed the beast more efficiently
Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
– Use the SPE’s DMA engine to gather & scatter data between main memory and local store
– Enables detailed programmer control of data flow
• Get/Put data when & where you want it
• Hides latency: simultaneous reads, writes & computes
– Avoids restrictive HW cache management
• Unlikely to determine optimal data flow
• Potentially very inefficient
– Allows more efficient use of the existing bandwidth
BOTTOM LINE:
It’s all about the data!
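The get/compute/put orchestration above can be sketched as a double-buffered loop. This is an illustrative plain-Python stand-in for what would be C with `mfc_get`/`mfc_put` DMA intrinsics on an SPE: the two buffers model local store, and the list copies model DMA transfers that would run concurrently with the compute.

```python
# Double-buffering sketch: while buffer (k+1) % 2 is being "fetched",
# the core computes on buffer k % 2, hiding transfer latency.
def process(stream, compute, chunk=4):
    chunks = [stream[i:i + chunk] for i in range(0, len(stream), chunk)]
    bufs = [None, None]                   # two local-store buffers
    out = []
    if not chunks:
        return out
    bufs[0] = list(chunks[0])             # "DMA get" the first chunk
    for k in range(len(chunks)):
        nxt = (k + 1) % 2
        if k + 1 < len(chunks):
            bufs[nxt] = list(chunks[k + 1])  # prefetch next chunk
        out.extend(compute(v) for v in bufs[k % 2])  # compute on current
        # a "DMA put" of these results would overlap the next compute
    return out
```

In real SPE code the get, compute, and put genuinely overlap; here the pattern is shown sequentially for clarity.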
Cell Comparison: ~4x the FLOPS @ ~½ the power
Both 65nm technology (to scale)
Memory Managing Processor vs. Traditional General Purpose Processor
[Figure: die photos comparing conventional IBM, AMD, and Intel general-purpose processors with the Cell BE]
Examples of Feeding Cell
2D and 3D FFTs
Seismic Imaging
String Matching
Neural Networks (function approximation)
Feeding FFTs to Cell
[Figure: input image streamed through double-buffered tiles; tiles transposed into the transposed image]
SIMDized data
DMAs double buffered
Pass 1: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
Pass 2: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
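The two passes above can be sketched end to end: each pass does 1D transforms along rows and then transposes, so both passes touch only unit-stride rows (the access pattern the DMA engine moves efficiently). A naive O(N²) DFT in plain Python stands in for the SIMDized 1D FFT kernel a Cell implementation would use.

```python
# Two-pass 2D FFT: row transforms + transpose, applied twice.
import cmath

def dft_1d(row):
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft_2d(image):
    # Pass 1: 1D transforms along rows, then transpose.
    pass1 = transpose([dft_1d(r) for r in image])
    # Pass 2: identical code; the second transpose restores the layout.
    return transpose([dft_1d(r) for r in pass1])
```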
3D FFTs
Long stride trashes cache
Cell DMA allows prefetch
[Figure: N×N×N volume – stride-1 access along the fastest axis vs. stride-N² access along the slowest; DMA gathers a contiguous data envelope around each element]
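A minimal sketch of the DMA-prefetch idea for the stride-N² axis, assuming the N×N×N volume is stored flat with x fastest: rather than letting a cache chase stride-N² loads, each z-pencil is explicitly gathered into a contiguous buffer, transformed there, and scattered back. `gather_pencil`/`scatter_pencil` are illustrative names for what the MFC’s DMA-list transfers would do.

```python
# Flat N^3 volume, index = x + N*y + N*N*z (x fastest).
N = 4
vol = list(range(N * N * N))

def gather_pencil(vol, x, y):
    """Copy the stride-N^2 z-pencil at (x, y) into a contiguous buffer."""
    base = x + N * y
    return [vol[base + N * N * z] for z in range(N)]

def scatter_pencil(vol, x, y, buf):
    """Write a transformed pencil back to its strided home locations."""
    base = x + N * y
    for z, v in enumerate(buf):
        vol[base + N * N * z] = v
```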
Feeding Seismic Imaging to Cell
New G at each (x, y)
Radial symmetry of G reduces BW requirements
[Figure: data volume with a Green’s function window centered at each (X, Y)]
Σᵢⱼ D(x+i, y+j) G(x, y, i, j)
Feeding Seismic Imaging to Cell
[Figure: the data volume partitioned into overlapping column blocks, one per SPE (SPE 0 – SPE 7)]
Feeding Seismic Imaging to Cell
For each X
– Load next column of data
– Load next column of indices
– For each Y
• Load Green’s functions
• SIMDize Green’s functions
• Compute convolution at (X, Y)
– Cycle buffers
[Figure: data buffer of height H and width 2R+1, with a Green’s index buffer, sliding across the volume centered at (X, Y)]
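The per-(X, Y) step in the loop above amounts to a windowed sum over a (2R+1)-wide neighborhood. A minimal sketch, with the Green’s function simplified to a fixed (2R+1)×(2R+1) table G(i, j) rather than the full G(x, y, i, j), and plain Python standing in for SIMDized C with double-buffered columns:

```python
# Windowed sum: sum over i, j in [-R, R] of D(x+i, y+j) * G(i, j).
R = 1

def convolve_at(D, G, x, y):
    # D indexed as D[x][y]; G indexed as G[i + R][j + R].
    return sum(D[x + i][y + j] * G[i + R][j + R]
               for i in range(-R, R + 1)
               for j in range(-R, R + 1))
```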
Feeding String Matching to Cell
Find (lots of) substrings in (long) string
Build graph of words & represent as DFA
Problem: Graph doesn’t fit in LS
Sample word list: “the”, “that”, “math”
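One standard way to build the word graph above is an Aho-Corasick automaton: a trie over the word list plus failure links, which together behave as a DFA scanning the input in a single pass. This plain-Python sketch makes no attempt at the local-store tiling a Cell port would need.

```python
# Aho-Corasick: trie + BFS-computed failure links, then one-pass scan.
from collections import deque

def build_dfa(words):
    trie = [{}]            # per-node transition tables; node 0 = root
    out = [set()]          # words ending at each node
    for w in words:
        s = 0
        for ch in w:
            if ch not in trie[s]:
                trie.append({})
                out.append(set())
                trie[s][ch] = len(trie) - 1
            s = trie[s][ch]
        out[s].add(w)
    fail = [0] * len(trie)  # failure link = longest proper-suffix state
    q = deque(trie[0].values())
    while q:
        s = q.popleft()
        for ch, t in trie[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[t] = trie[f].get(ch, 0)
            out[t] |= out[fail[t]]   # inherit matches ending here
    return trie, fail, out

def match(text, dfa):
    trie, fail, out = dfa
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in trie[s]:
            s = fail[s]
        s = trie[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))   # (start index, word)
    return hits
```

The transition tables are exactly the graph that, for large word lists, outgrows the 256KB local store, which is why the data flow (tiling the tables) dominates the design on Cell.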
Feeding Neural Networks to Cell
Neural net function F(X)
– RBF, MLP, KNN, etc.
If too big for LS, BW bound
N basis functions: dot product + nonlinearity
D input dimensions
D×N matrix of parameters
[Figure: network diagram mapping input X through N basis functions to output F]
Convert BW Bound to Compute Bound
Split function over multiple SPEs
Avoids unnecessary memory traffic
Reduces compute time per SPE
Minimal merge overhead
[Figure: the network split across SPEs, partial outputs merged into F]
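The split-and-merge above can be sketched for an RBF-style network F(X) = Σₙ wₙ φ(‖X − cₙ‖²): each "SPE" (a plain Python worker here) keeps its slice of the N basis functions resident, so the parameters are read once instead of streamed repeatedly, and the merge is a single addition per worker. The Gaussian φ and all names are illustrative.

```python
# Partition N basis functions across workers; merge partial sums.
import math

def partial_F(x, centers, weights):
    """One worker's share: its resident slice of basis functions."""
    return sum(w * math.exp(-sum((xi - ci) ** 2
                                 for xi, ci in zip(x, c)))
               for c, w in zip(centers, weights))

def F(x, centers, weights, n_spes=4):
    # Strided split of the N basis functions into n_spes chunks.
    chunks = [(centers[i::n_spes], weights[i::n_spes])
              for i in range(n_spes)]
    return sum(partial_F(x, c, w) for c, w in chunks)
```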
Moral of the Story: It’s All About the Data!
The data problem is growing: multicore
Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Abstract
Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the path of multicore processors. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the design of the Cell/B.E. processor addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.