TRANSCRIPT
PetaScale Execution Time Analysis
Architecture/VLSI
Chip Floorplan
Monarch Chip Overview
Computational Sciences Division
Bob Lucas – Director
Poster Participants: Jeff Draper, Mary Hall, Jacqueline Chame, Pedro Diniz, Jeff Sondeen, Spundun Bhatt, Tim Barrett
USC Viterbi School of Engineering
System: four boards with eight PIM chips
LD on PIMs in IA64 Host
App/Sys Prototype
Automatic Performance Tuning
Model Guided Empirical Optimization
ECO: Combining models and guided empirical search for memory hierarchy optimization
Authors: Pedro Diniz, Jeremy Abramson, Tejus Krishna Contact: [email protected]
Performance Expectation
Objective: Evaluate link discovery (LD) algorithms on Godiva H/W.
Hypothesis:
• LD algorithms are data-intensive and highly parallel
• Largely read-only data
• Irregular memory accesses lead to poor cache performance
• PIM technology would yield performance improvement
Expected Results:
• Parallel PIM implementations of LD computations
• Performance comparisons with Itanium-2 host
• Analysis of software/hardware scalability requirements
• Analysis of programming complexity
Results of Scalability Analysis
Raw Performance Measurements
PIMS for KNOWLEDGE DISCOVERY
in collaboration with Hans Chalupsky & Jafar Adibi, USC ISI
Tools Organization and Rationale
Code Isolator
Model Guided Empirical Optimization Results
• IBM Cu-08 90nm CMOS
• Clock 333 MHz
• 64 GOPS/GFLOPS
• Power 3-6 GFLOPS/W
• 12 Arithmetic Clusters
  – 96 ALUs (32-bit integer/float)
• 31 Memory Clusters
  – 256W x 32 bits each (128KB)
• 6 RISC processors
• 12 MBytes eDRAM
• 2 memory interfaces (8 GB/s BW)
• 2 RapidIO (x4 serial) interfaces
• 17 DIFL ports (2.6 GB/s ea)
• On-chip quad ring (40 GB/s)
DIFL = Differential Inter FPCA Link
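As a rough sanity check on the headline figures, the sketch below relates the ALU count and clock rate to the quoted 64 GOPS. The factor of two operations per ALU per cycle is an assumption (for example a multiply-add); the poster does not state how the peak number was derived.

```python
# Back-of-the-envelope check of the chip overview numbers.
CLOCK_HZ = 333e6            # from the poster
NUM_ALUS = 96               # from the poster
NUM_CLUSTERS = 12           # from the poster
OPS_PER_ALU_PER_CYCLE = 2   # ASSUMPTION, not stated on the poster


def peak_gops(alus, clock_hz, ops_per_cycle):
    """Peak throughput in giga-operations per second."""
    return alus * clock_hz * ops_per_cycle / 1e9


alus_per_cluster = NUM_ALUS // NUM_CLUSTERS
peak = peak_gops(NUM_ALUS, CLOCK_HZ, OPS_PER_ALU_PER_CYCLE)
print(alus_per_cluster, round(peak, 1))
```

Under that assumption the result lands within rounding of the 64 GOPS figure; with one operation per cycle it would be half that.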
[Figure: MONARCH chip block diagram. Labels: PBDIFLs, EDRP, P, Memory Interface, CM, ROM Port, DIFLs, RIO, DI/DO]
MONARCH Project
• MOrphable Networked ARCHitecture (MONARCH)
  – DARPA-funded collaboration between USC, Raytheon, Mercury, IBM, Georgia Tech
• Combines two radically different computing paradigms
  – Conventional thread-level parallel programming model
    • RISC processor with extensions
    • WideWord (MMX-like) unit formed through morphing
    • Useful for complex code sets containing data-dependent control flow decisions
  – Stream programming model (dataflow stream operation)
    • Field Programmable Compute Array (FPCA)
    • Useful for predictable operations on large data streams, e.g., pre-filtering of sensor data
    • Achieves highest data throughput
[Figure: MONARCH chip floorplan. Block labels: AC RISC, ACNWW, eDRAM, PBUF, IC, MC, HSS, PLL]
Status - currently in fab
- First silicon expected 4Q06
- Prototype boards/modules expected 1Q06
ASIC Area Breakdown
Full MONARCH Chip
Based on IBM’s max die size of 352 sq mm (18.76 mm on a side)
Total Active Cu-08 Cells = 280,054,413
~100M Gate Equivalents
• Low-level Binary Instrumentation is too Expensive
  • Takes time, thus precluding observation of real runs
  • Generates lots of data, thus forcing the use of sampling techniques
• Approach: synergistic combination of compiler static analysis and dynamic run-time data extraction
  • Static analysis uncovers some program behavior information and identifies data to be extracted at run-time
  • Instruments source code to extract missing data at run-time
• Advantages:
  • Much faster than the binary instrumentation approach
  • Can relate observed metrics to the source-level program
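A minimal sketch of the idea, assuming a toy kernel: static analysis supplies the per-block operation mix, and a source-level probe supplies only what cannot be known statically (here, the trip count). The `probe` function, the `STATIC_OPS` table and the kernel are hypothetical stand-ins, not the project's actual instrumentation library.

```python
from collections import Counter

# Hypothetical stand-in for the run-time instrumentation library.
counts = Counter()


def probe(block_id):
    """Record one execution of a source-level basic block."""
    counts[block_id] += 1


# Static information a compiler pass could record per basic block
# (illustrative numbers for the kernel below, not from the poster).
STATIC_OPS = {"loop_body": {"fp": 2, "load": 2, "store": 1}}


def saxpy_instrumented(a, x, y):
    """Source-level instrumented version of y = a*x + y."""
    for i in range(len(x)):
        probe("loop_body")  # inserted probe: trip count is known only at run time
        y[i] = a * x[i] + y[i]
    return y


def workload(block_counts, static_ops):
    """Combine dynamic block counts with static per-block operation mixes."""
    total = Counter()
    for block, n in block_counts.items():
        for kind, per_exec in static_ops[block].items():
            total[kind] += n * per_exec
    return dict(total)


result = saxpy_instrumented(2.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
summary = workload(counts, STATIC_OPS)
print(result, summary)
```

Because the probe sits at a source-level block, the resulting counts map directly back to source lines, which is the advantage claimed over binary instrumentation.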
[Figure: tool flow diagram. Boxes: Source Code (C/Fortran); Open64 Front-End (gccfe, fef90); Whirl B Files; Static Analysis (source level); What to Instrument; Source Code Instrumentation; Open64 Tools (whirl2c, whirl2f); Instrumented Source Code (C/Fortran); Target Arch Compiler (gcc, f90, gf); Instrumentation Library; Application Executable; Execution; Application Inputs; Application Outputs; Static Info. Data Files; Dynamic Info. Data Files; Text Files; Off-Line Analysis; Analysis Files]
Static Analysis (source level) and Off-Line Analysis both cover:
• Basic Blocks
• Loop & Bounds
• Array Refs & Stride Info
• Symbolic Address Ranges
• Workloads (int, fp, load, ...)
• Locality Metrics Info
• Goal: Derive Performance Expectations from Source Code for Different Architectures
  • What Should the Performance be and Why?
  • What is Limiting the Performance?
    • Data-Dependences
    • Architecture Limitations
• Approach: Use Data-Flow Analysis & Scheduling Techniques
  • Extract DFG from the High-Level Source Code
  • Make Assumptions about Memory Hierarchy
  • Compute As-Soon-As-Possible Schedule
  • Vary Number and Implementation Features of Units
    • Load/Store Units
    • Functional Units
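The approach above can be sketched as a greedy as-soon-as-possible scheduler over a small data-flow graph, re-run with different unit counts. This is an illustration under stated assumptions (fully pipelined units, fixed latencies, an acyclic graph, a hypothetical eight-operation DFG); it is not the tool's actual scheduler.

```python
def asap_schedule(ops, deps, units, latency):
    """Greedy as-soon-as-possible schedule of an acyclic data-flow graph.

    ops:     {op_name: unit_kind}, e.g. {"l1": "ldst", "m1": "alu"}
    deps:    {op_name: [predecessor names]}
    units:   {unit_kind: units available per cycle} (assumed fully pipelined)
    latency: {unit_kind: cycles until the result is available}
    Returns (start_cycle_per_op, total_cycles).
    """
    start, finish = {}, {}
    remaining = set(ops)
    cycle = 0
    while remaining:
        free = dict(units)
        # Ops whose predecessors have all produced results by this cycle.
        ready = sorted(
            o for o in remaining
            if all(p in finish and finish[p] <= cycle for p in deps.get(o, []))
        )
        for o in ready:
            kind = ops[o]
            if free[kind] > 0:  # issue only if a unit of this kind is free
                free[kind] -= 1
                start[o] = cycle
                finish[o] = cycle + latency[kind]
                remaining.discard(o)
        cycle += 1
    return start, max(finish.values())


# Hypothetical DFG for: out = a*b + c*d (4 loads, 2 multiplies, 1 add, 1 store).
OPS = {"l1": "ldst", "l2": "ldst", "l3": "ldst", "l4": "ldst",
       "m1": "alu", "m2": "alu", "a1": "alu", "s1": "ldst"}
DEPS = {"m1": ["l1", "l2"], "m2": ["l3", "l4"], "a1": ["m1", "m2"], "s1": ["a1"]}
LATENCY = {"ldst": 2, "alu": 1}  # assumed latencies


def cycles(n_ldst, n_alu):
    return asap_schedule(OPS, DEPS, {"ldst": n_ldst, "alu": n_alu}, LATENCY)[1]


for n_ldst in (1, 2):
    for n_alu in (1, 2):
        print(n_ldst, "ld/st,", n_alu, "ALU ->", cycles(n_ldst, n_alu), "cycles")
```

Sweeping the unit counts this way shows where extra hardware stops paying off: in this toy graph a second ALU does not shorten the schedule once there are two load/store units, because the dependence chain dominates.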
Compiler Approach to Performance Expectation
Architectural Exploration Results for UMT2K
[Figure: two plots of Cycles vs. Number of Load/Store Units (1 to 5), with one curve each for 1 ALU to 5 ALUs. Panels: No Unrolling of Inner Loop; Unrolling Inner Loop by 4x]
• Code:
  – Inner Loop of the Angular Loop in snswp3D procedure
  – 272 Operations, 4 FP div (non-pipelined); 41 FP Mults; 95 Int Ops; 84 Load/Store; 22 Int Mults.
• Analysis:
  – Compute-bound: adding more load/store units won’t help
  – Not cost effective to have more than 2 ALUs (non-unrolled) or 4 ALUs (4x unrolled)
Authors: Chun Chen, YoonJu L. Nelson, Jacqueline Chame, Mary Hall Contact: [email protected]
Authors: Jacqueline Chame, Mary Hall, Spundun Bhatt, Tim Barrett Contact: [email protected]
Authors: Jeff Draper, Jeff Sondeen, Sumit Mediratta, Rashed Bhatti, TJ Kwon, Tim Barrett, et al. Contact: [email protected]
Model-guided compiler optimization
• static models of architecture, profitability
Empirical optimization
• empirical data guide optimization decisions
• self-tuning libraries such as ATLAS, PHiPAC, FFTW and SPIRAL
Exploit complementary strengths of both approaches
• compiler models prune unprofitable solutions
• empirical data provide accurate measure of optimization impact
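The combination can be illustrated with a two-phase sketch: a simple capacity model prunes tile sizes for a blocked matrix multiply, then the surviving variants are timed and the fastest is kept. The three-tile footprint rule, the 32 KB cache size and all function names are assumptions made for illustration; ECO's actual models and search are not described at this level on the poster. In an interpreted language the timings mostly reflect loop overhead rather than cache behavior, so only the structure is meaningful.

```python
import time


def tiled_matmul(A, B, n, tile):
    """n x n matrix multiply on row-major lists, blocked by `tile`."""
    C = [0.0] * (n * n)
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i * n + k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C


def model_prune(tile_sizes, cache_bytes, elem_bytes=8):
    """Phase 1 (model): keep tiles whose three-tile working set fits in cache."""
    return [t for t in tile_sizes if 3 * t * t * elem_bytes <= cache_bytes]


def empirical_search(candidates, n, repeats=2):
    """Phase 2 (empirical): time each surviving variant, return the fastest."""
    A = [float(i % 7) for i in range(n * n)]
    B = [float(i % 5) for i in range(n * n)]
    timings = {}
    for t in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            tiled_matmul(A, B, n, t)
            best = min(best, time.perf_counter() - t0)
        timings[t] = best
    return min(timings, key=timings.get), timings


candidates = model_prune([4, 8, 16, 32, 64], cache_bytes=32 * 1024)
best_tile, timings = empirical_search(candidates, n=24)
print("candidates after pruning:", candidates, "fastest:", best_tile)
```

The model removes the tile size that cannot fit, so the empirical phase spends its time only on plausible variants.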
[Figure: ECO two-phase flow. Labels: application code; architecture specification; analysis/models; transformation modules; code variant generation (phase 1); set of parameterized code variants + constraints on unbound parameters; search engine; performance monitoring support; execution environment (phase 2); optimized code variant + representative input data set; optimized code]
[Figure: ECO vs. ATLAS, vendor BLAS and native compiler: matrix multiply on SGI R10K. Series: Vendor BLAS, ATLAS BLAS, Native, ECO]
Targeting multimedia extension architectures (Superword-Level Parallelism (SLP))
[Figure: two-phase flow for SLP targets. Labels: application code; architecture specification; analysis/models; transformation modules; code variant generation; phase 1; parameterized code variants + constraints on unbound parameters; code variants optimized for caches/TLB + unroll&jam to expose SLP; phase 2; empirical search engine; performance monitoring; execution environment; optimized code + representative input data set]
Loop-level transformations:
• select loop order
• cache and TLB optimizations
• unroll&jam loops with SLP and spatial reuse
On unrolled code:
• pack isomorphic operations
• align operands
• register optimizations: superword replacement, register packing
• low-level optimizations
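To make the unroll&jam step concrete, here is a small sketch on a matrix-vector product: the outer loop is unrolled by two and the two inner loops are fused, which places two isomorphic statements side by side and lets them share one load of `x[j]`. The kernel and function names are hypothetical; actual packing into superword (SIMD) instructions happens in a compiler back end and is not shown.

```python
def matvec(A, x):
    """Reference loop nest: y[i] += A[i][j] * x[j]."""
    n, m = len(A), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y


def matvec_unroll_jam2(A, x):
    """Outer loop unrolled by 2, with the two inner loops fused (jammed)."""
    n, m = len(A), len(x)
    y = [0.0] * n
    main = n - n % 2
    for i in range(0, main, 2):
        for j in range(m):
            xj = x[j]                     # one load reused by both statements
            y[i] += A[i][j] * xj          # isomorphic statement 1
            y[i + 1] += A[i + 1][j] * xj  # isomorphic statement 2
    for i in range(main, n):              # cleanup iteration when n is odd
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, 1.0]
print(matvec(A, x), matvec_unroll_jam2(A, x))
```

The two adjacent statements perform the same operation on different data, which is the pattern the "pack isomorphic operations" step looks for.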
Results for Intel SSE
In process: PPC AltiVec
[Pie chart: ASIC area breakdown by block]
• 2xDDR, 4%
• 17xFD Hybrid DIFL + PBUS DMA, 10%
• DT Decaps, 0%
• 6xAC-RISC, 8%
• 6xAC-No_WW, 5%
• 12xPBUF, 3%
• 1xROM Port, 0%
• Decaps, 6%
• System, 3%
• Reserved, 25%
• eFuses, 1%
• 2xXPIRX (as Serial RapidIO), 1%
• Serial RapidIO (Mercury), 1%
• 31xMC, 14%
• 6xeDRAM+BIST+Wrapper, 17%
• 10xANBI (IOC), 2%
Intel SSE

                             Program     Energy Loop   Angle Loop
Size (LOC)                   232K        150           1.3K
Execution Time (hh:mm:ss)    41:02:05    00:00:12      00:10:00
#Args.                                   16            50
Input Data (Bytes)           0.57M       61.69M        442.84M
UMT2K Summary
Develop “benchmark” of computation kernel from large application
Performance behavior equivalent to full application
Programmer and/or compiler tool
Support Model-guided Empirical Optimization (ECO project)
Increase machine and programmer efficiencies
Develop tool support for automatic performance tuning
Locality optimizations
Shared-memory parallel optimizations
MUTUAL INFORMATION
                                               Clock     Execution Time   Cycles   Instructions Per Cycle
Itanium-2                                      900 MHz   5.5 ms           4.9M     1.588
Single PIM (superword, compiler+hand tuned)    140 MHz   32.1 ms          4M       n/a
18% Fewer Cycles

GRAPH CLUSTERING
                                               Clock     Execution Time   Cycles   Instructions Per Cycle
Itanium-2                                      900 MHz   0.26 ms          233K     0.806
Single PIM (scalar, compiler)                  140 MHz   1.11 ms          155K     n/a
33% Fewer Cycles
Assume same clock on PIM and Itanium-2:
Speedup using 1 PIM = IT2 Cycles / PIM Cycles = 1.225 for MI, 1.503 for GC (1.008 for 2 PIMs)
Now normalize by IPC of scaled data, since PIM behavior is consistent across data sets:
Speedup = IT2 Cycles * (IPCtest / IPCscaled) / PIM Cycles = 1.316 for MI, 2.611 for GC (1.75 for 2 PIMs)
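The equal-clock figures can be reproduced from the cycle counts in the tables, as in the sketch below. The IPC-normalized variant is written as a function only, because the poster does not list the IPCscaled values; the function and variable names are illustrative.

```python
# Cycle counts as reported in the tables (Itanium-2 vs. a single PIM).
CYCLES = {
    "MI": {"itanium2": 4.9e6, "pim": 4.0e6},
    "GC": {"itanium2": 233e3, "pim": 155e3},
}


def equal_clock_speedup(it2_cycles, pim_cycles):
    """Speedup of the PIM if both chips ran at the same clock."""
    return it2_cycles / pim_cycles


def fewer_cycles_pct(it2_cycles, pim_cycles):
    """Percentage reduction in cycles on the PIM relative to Itanium-2."""
    return 100.0 * (1.0 - pim_cycles / it2_cycles)


def ipc_normalized_speedup(it2_cycles, pim_cycles, ipc_test, ipc_scaled):
    """Equal-clock speedup scaled by the ratio IPCtest / IPCscaled."""
    return it2_cycles * (ipc_test / ipc_scaled) / pim_cycles


for name, c in CYCLES.items():
    s = equal_clock_speedup(c["itanium2"], c["pim"])
    p = fewer_cycles_pct(c["itanium2"], c["pim"])
    print(f"{name}: speedup {s:.3f}, {p:.0f}% fewer cycles")
```

The printed speedups and percentages match the 1.225 and 1.503 figures and the 18% and 33% annotations on the tables.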
[Figure: Code Isolator. The Original Program contains the code fragment to be executed and is annotated with StoreInitialDataValues and CaptureMachineState; the Isolated Program is annotated with SetMachineState and <InputParameters> = SetInitialDataValues]
  void main() {
    Call OutlineFunc(<InputParameters>)
  }
  void OutlineFunc(<InputParameters>) {
  }
Isolated code:
1. Compilable
2. Executable
3. Machine State
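A minimal sketch of the isolation idea: the inputs of an outlined fragment are saved while the original program runs, and a separate driver later restores them and calls the fragment on its own. The helper names and the kernel are hypothetical, and machine state (cache contents, for example) is not modeled here.

```python
import pickle


def capture_inputs(func, store):
    """Wrap `func` so its arguments are saved on the first call
    (plays the role of StoreInitialDataValues)."""
    def wrapper(*args, **kwargs):
        if "snapshot" not in store:
            # Snapshot before the call, in case the fragment mutates its inputs.
            store["snapshot"] = pickle.dumps((args, kwargs))
        return func(*args, **kwargs)
    return wrapper


def run_isolated(func, store):
    """Isolated driver: restore the saved inputs (SetInitialDataValues)
    and call the outlined fragment without the original program."""
    args, kwargs = pickle.loads(store["snapshot"])
    return func(*args, **kwargs)


def outline_func(values, scale):
    """Outlined code fragment (hypothetical kernel)."""
    return [scale * v for v in values]


store = {}
instrumented = capture_inputs(outline_func, store)
original_result = instrumented([1, 2, 3], scale=10)   # inside the original program
isolated_result = run_isolated(outline_func, store)   # replayed in isolation
print(original_result, isolated_result)
```

Replaying from a stored snapshot lets the kernel be timed or tuned repeatedly without paying for the full application run.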