Slide 1
IRAM and ISTORE Projects
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson
http://iram.cs.berkeley.edu/[istore]
Fall 1999 DIS DARPA Meeting
Slide 2
ISTORE Hardware Vision
• System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
• Target for 5-7 years out:
  – building block: 2006 MicroDrive integrated with IRAM
    » 9 GB disk, 50 MB/sec from disk
    » connected via crossbar switch
  – 10,000+ nodes fit into one rack!
Slide 4
VIRAM: System on a Chip
• Prototype scheduled for tape-out 1H 2000
• 0.18 um EDL process
• 16 MB DRAM, 8 banks
• MIPS scalar core and caches @ 200 MHz
• 4 64-bit vector unit pipelines @ 200 MHz
• 4 100 MB/s parallel I/O lines
• 17x17 mm, 2 Watts
• 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
• 1.6 Gflops (64-bit), 6.4 GOPs (16-bit)
[Die floorplan: CPU + caches, I/O, 4 vector pipes/lanes, crossbar (Xbar), and two 64-Mbit (8-MByte) DRAM halves]
Slide 5
Intelligent PDA (2003?)
Pilot PDA
+ gameboy, cell phone, radio, timer, camera, TV remote, am/fm radio, garage door opener, ...
+ Wireless data (WWW)
+ Speech, vision recognition
+ Voice output for conversations
Speech control + vision to see, scan documents, read bar codes, ...
Slide 6
IRAM Update
• IBM to supply embedded DRAM/logic (99%)
  – DRAM macro added to 0.18 micron logic process
  – DRAM specs under NDA; final agreement soon
• Sandcraft to supply scalar core
  – 64-bit MIPS embedded processor, caches, TLB, FPU
• Test chip received from LG Semicon
• ISA manual and simulator complete
  – better fixed-point model and instructions
  – better support for short vectors
    » auto-increment memory addressing
    » instructions for in-register reductions & butterfly permutations
• VIRAM-1 tape-out scheduled for 1H 2000
  – writing Verilog of control now
  – layout of multiplier and register file nearly complete
Slide 7
IRAM Update
• Vectorizing compiler for VIRAM
  – preliminary version complete using SUIF
  – retargeting CRAY/SGI compiler
    » scalar codegen validated on commercial suite (~100 tests)
    » debug and test of vector instructions underway
    » scheduling and memory barriers leverage Cray SV2 work
• Speech & video applications & media library underway
• Benchmarking results
Slide 8
VIRAM-1 block diagram
Slide 9
Microarchitecture configuration
• 2 arithmetic units
  – both execute integer operations
  – one executes FP operations
  – 4 64-bit datapaths (lanes) per unit
• 2 flag processing units
  – for conditional execution and speculation support
• 1 load-store unit
  – optimized for strides 1, 2, 3, and 4
  – 4 addresses/cycle for indexed and strided operations
  – decoupled indexed and strided stores
• Memory system
  – 8 DRAM banks
  – 256-bit synchronous interface
  – 1 sub-bank per bank
  – 16 MBytes total capacity
• Peak performance
  – 3.2 GOPS64, 12.8 GOPS16 (w. madd)
  – 1.6 GOPS64, 6.4 GOPS16 (w/o madd)
  – 0.8 GFLOPS64, 3.2 GFLOPS32 (w. madd)
  – 6.4 GByte/s memory bandwidth
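A quick accounting of these peak numbers (my arithmetic from the configuration above, counting a multiply-add as 2 operations):

  2 integer units x 4 lanes x 200 MHz = 1.6 GOPS64; x2 for madd = 3.2 GOPS64
  16-bit operation packs 4 subwords per 64-bit lane: 4 x 1.6 = 6.4 GOPS16, or 12.8 GOPS16 with madd
  FP uses the single FP-capable unit: 1 x 4 lanes x 200 MHz = 0.8 GFLOPS64

The 3.2 GFLOPS32 figure follows from packing 2 subwords per lane and counting madd as 2 flops.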
Slide 10
Media Kernel Performance

Kernel                Peak Perf.   Sustained Perf.   % of Peak
Image Composition     6.4 GOPS     6.40 GOPS         100.0%
iDCT                  6.4 GOPS     1.97 GOPS          30.7%
Color Conversion      3.2 GOPS     3.07 GOPS          96.0%
Image Convolution     3.2 GOPS     3.16 GOPS          98.7%
Integer MV Multiply   3.2 GOPS     2.77 GOPS          86.5%
Integer VM Multiply   3.2 GOPS     3.00 GOPS          93.7%
FP MV Multiply        3.2 GFLOPS   2.80 GFLOPS        87.5%
FP VM Multiply        3.2 GFLOPS   3.19 GFLOPS        99.6%
AVERAGE                                                86.6%
Slide 11
Base-line system comparison

Kernel              VIRAM   MMX            VIS            TMS320C82
Image Composition   0.13    -              2.22 (17.0x)   -
iDCT                1.18    3.75 (3.2x)    -              -
Color Conversion    0.78    8.00 (10.2x)   -              5.70 (7.6x)
Image Convolution   1.21    5.49 (4.5x)    6.19 (5.1x)    6.50 (5.3x)

• All numbers in cycles/pixel (lower is better); (Nx) is the slowdown relative to VIRAM
• MMX and VIS results assume all data in L1 cache
Slide 12
IRAM/VSUIF Decryption (IDEA)
[Chart: GOP/s (0-8) vs. virtual processor width (16, 32, 64 bits) for 2, 4, and 8 lanes]
• IDEA decryption operates on 16-bit ints
• Compiled with IRAM/VSUIF
• Note scalability in both # of lanes and data width
• Some hand optimizations (unrolling) will be automated by the Cray compiler
Slide 13
1D FFT on IRAM
• FFT study on IRAM
  – bit-reversal time included; its cost hidden using indexed store (see the sketch below)
  – faster than DSPs on floating-point (32-bit) FFTs
  – CRI Pathfinder does 24-bit fixed point, 1K points in 28 usec (2 Watts without SRAM)
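A minimal C sketch of the indexed-store idea: the bit-reversal reorder rides along with a store the FFT must do anyway, instead of a separate pass. bitrev() is an illustrative helper, not a VIRAM primitive.

  #include <stddef.h>

  /* Reverse the low log2n bits of i (helper for the scatter below). */
  static size_t bitrev(size_t i, unsigned log2n) {
      size_t r = 0;
      for (unsigned b = 0; b < log2n; b++) {
          r = (r << 1) | (i & 1);
          i >>= 1;
      }
      return r;
  }

  /* Write FFT results in bit-reversed order; on VIRAM-style hardware
     this loop becomes a vector indexed (scatter) store, hiding the
     reordering cost inside the store itself. */
  void store_bitrev(const float *src, float *dst, unsigned log2n) {
      size_t n = (size_t)1 << log2n;
      for (size_t i = 0; i < n; i++)
          dst[bitrev(i, log2n)] = src[i];
  }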
Slide 14
3D FFT on ISTORE 2006
• Performance of large 3D FFTs depends on 2 factors
  – speed of the 1D FFT on a single node (next slide)
  – network bandwidth for "transposing" data
  – 1.3 Tflop FFT possible with 1K IRAM nodes, if network bisection bandwidth scales (!)
Slide 15
Scaling to 10K Processors
• IRAM + micro-disk offer huge scaling opportunities
• Still many hard system problems (SAM)
  – Scalability
    » dynamic scaling with plug-and-play components
    » scalable performance, gracefully down as well as up
    » machines become heterogeneous in performance at scale
  – Availability
    » 24x7 databases without human intervention
    » discrete vs. continuous model of machine being up
  – Maintainability
    » 42% of system failures are due to administrative errors
    » self-monitoring, tuning, and repair
Slide 16
ISTORE-1: Hardware for SAM
• Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault-injection hardware
  – intelligence used to collect and filter monitoring data
  – diagnostics and fault injection enhance robustness
  – networked to create a scalable shared-nothing cluster
• Scheduled for 4Q 99 and 1Q 2000
• Intelligent Chassis: 80 nodes, 8 per tray; 2 levels of switches (20 100 Mb/s, 2 1 Gb/s); environment monitoring: UPS, redundant PS, fans, heat and vibration sensors, ...
• Intelligent Disk "Brick": portable PC processor (Pentium II) + DRAM; redundant NICs (4 100 Mb/s links); diagnostic processor; disk; half-height canister
Slide 17
ISTORE Software Approach
• Two-pronged approach to providing reliability:
  1) reactive self-maintenance: dynamic reaction to exceptional system events
     » self-diagnosing, self-monitoring hardware
     » software monitoring and problem detection
     » automatic reaction to detected problems
  2) proactive self-maintenance: continuous online self-testing and self-analysis
     » automatic characterization of system components
     » in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they're used
Slide 18
ISTORE Applications
• Storage-intensive, reliable services for ISTORE-1
  – infrastructure for "thin clients," e.g., PDAs
  – web services
  – databases, including decision support
• Scalable memory-intensive computations for ISTORE in 2006
  – DIS benchmarks
    » 3D FFT: 1.4 Gflops on IRAM nodes
    » electromagnetic scattering (MoM): sparse matrix/vector multiply, 500/250 Mflops on IRAM nodes
  – RT-STAP
    » QR decomposition currently in use as test case for the compiler
  – performance estimates through IRAM simulation + model
Slide 19
Performance Heterogeneity
• System performance limited by the weakest link
• NOW Sort experience: performance heterogeneity is the norm
  – disks: inner vs. outer track (50%), fragmentation
  – processors: load (1.5-5x) and heat
• Virtual Streams: dynamically off-load I/O work from slower disks to faster ones
[Chart: minimum per-process bandwidth (MB/sec, 0-6) vs. efficiency of single slow disk (100%, 67%, 39%, 29%), comparing Ideal, Virtual Streams, and Static]
Slide 20
ISTORE-1: Prototype Hardware
(Same hardware overview as Slide 16, "ISTORE-1: Hardware for SAM".)
Slide 21
ISTORE Brick Block Diagram
[Block diagram: Mobile Pentium II module (CPU + North Bridge) with 256 MB DRAM; PCI bus to SCSI and the diagnostic processor; South Bridge to Super I/O, BIOS, dual UART, 4x100 Mb/s Ethernets, diagnostic net, flash, RTC, and RAM; monitor & control logic; 18 GB disk]
• Sensors for heat and vibration
• Control over power to individual nodes
Slide 22
Conclusion
• IRAM attractive for two Post-PC applications because of low power, small size, and high memory bandwidth
  – mobile consumer electronic devices
  – scaleable infrastructure
• IRAM benchmarking result: faster than DSPs
• ISTORE: hardware/software architecture for single-use, introspective storage
• Scaling systems requires
  – new continuous models of availability
  – performance not limited by the weakest link
  – self-* systems to reduce human interaction
Slide 23
Backup Slides
Slide 24
ISTORE-1 System Layout
[Rack diagram: 8 brick shelves]
Slide 25
V-IRAM1: 0.18 µm, Fast Logic, 200 MHz
1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
[Block diagram: 2-way superscalar processor with 16K I-cache and 16K D-cache; vector instruction queue; vector registers feeding +, x, and / pipelines plus a load/store unit, each configurable as 4 x 64, 8 x 32, or 16 x 16; memory crossbar switch to the DRAM macros; I/O at 100 MB each]
Slide 26
Fixed-point multiply-add model
• Same basic model, different set of instructions
  – fixed-point: multiply & shift & round; shift right & round; shift left & saturate
  – integer saturated arithmetic: add or sub & saturate
  – added multiply-add instruction for improved performance and energy consumption
[Datapath diagram: multiply halfword operands (n/2 bits each) into an n-bit product, shift & round, then add an n-bit operand & saturate to the n-bit result]
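A minimal C sketch of this model, assuming 16-bit halfword operands, a programmable rounding shift, and saturation to 16 bits (names are illustrative, not VIRAM ISA mnemonics):

  #include <stdint.h>

  /* Saturate a 32-bit intermediate to the 16-bit output range. */
  static int16_t sat16(int32_t v) {
      if (v > INT16_MAX) return INT16_MAX;
      if (v < INT16_MIN) return INT16_MIN;
      return (int16_t)v;
  }

  /* Multiply halfwords, shift & round, then add & saturate. */
  int16_t fixmadd16(int16_t x, int16_t y, int16_t w, unsigned shift) {
      int32_t prod = (int32_t)x * (int32_t)y;   /* n/2 * n/2 -> n bits */
      if (shift > 0)
          prod += (int32_t)1 << (shift - 1);    /* round to nearest */
      prod >>= shift;                           /* scale back */
      return sat16(prod + (int32_t)w);          /* add & saturate */
  }

Folding the multiply, round, add, and saturate into one instruction saves the intermediate register traffic, which is where the performance and energy win comes from.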
Slide 27
Other ISA modifications
• Auto-increment loads/stores
  – a vector load/store can post-increment its base address
  – added base (16), stride (8), and increment (8) registers
  – necessary for applications with short vectors or scaled-up implementations
• Butterfly permutation instructions
  – perform one step of a butterfly permutation within a vector register
  – used for FFT and reduction operations (see the sketch after this list)
• Miscellaneous instructions added
  – min and max instructions (integer and FP)
  – FP reciprocal and reciprocal square root
Slide 28
Major architecture updates
• Integer arithmetic units support multiply-add instructions
• 1 load-store unit
  – complexity vs. benefit
• Optimize for strides 2, 3, and 4
  – useful for complex arithmetic and image-processing functions
• Decoupled strided and indexed stores
  – memory stalls due to bank conflicts do not stall the arithmetic pipelines
  – allows scheduling of independent arithmetic operations in parallel with stores that experience many stalls
  – implemented with address, not data, buffering
  – currently examining a similar optimization for loads
Slide 29
Micro-kernel results: simulated systems

Parameter                                         1-Lane   2-Lane   4-Lane   8-Lane
# of 64-bit lanes                                 1        2        4        8
Addresses/cycle for strided & indexed accesses    1        2        4        8
Crossbar width                                    64b      128b     256b     512b
Width of DRAM bank interface                      64b      128b     256b     512b
DRAM banks                                        8        8        8        8

• Note: simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4
Slide 30
Micro-kernels

Benchmark                        Op. Type      Data Width   Memory Accesses        Other Comments
Image Composition (Blending)     Integer       16b          Unit-stride
2D iDCT (8x8 image blocks)       Integer       16b          Unit-stride, Strided
Color Conversion (RGB to YUV)    Integer       32b          Unit-stride
Image Convolution                Integer       32b          Unit-stride
Matrix-vector Multiply (MV)      Integer, FP   32b          Unit-stride            Uses reductions
Vector-matrix Multiply (VM)      Integer, FP   32b          Unit-stride

• Vectorization and scheduling performed manually
Slide 31
Scaled system results
• Near-linear speedup for all applications apart from iDCT
• iDCT bottlenecks:
  – large number of bank conflicts
  – 4 addresses/cycle for strided accesses
[Chart: speedup (0-8) for Compositing, iDCT, Color Conversion, Convolution, MxV INT (32), VxM INT (32), MxV FP (32), and VxM FP (32) with 1, 2, 4, and 8 lanes]
Slide 32
iDCT scaling with sub-banks
• Sub-banks reduce bank conflicts and increase performance
• Alternative (but not as effective) ways to reduce conflicts:
  – different memory layout
  – different address-interleaving schemes
[Chart: speedup (0-8) with 1, 2, 4, and 8 sub-banks for 1, 2, 4, and 8 lanes]
Slide 33
Compiling for VIRAM
• Long-term success of DIS technology depends on a simple programming model, i.e., a compiler
• Needs to handle a significant class of applications
  – IRAM: multimedia, graphics, speech and image processing
  – ISTORE: databases, signal processing, other DIS benchmarks
• Needs to utilize hardware features for performance
  – IRAM: vectorization
  – ISTORE: scalability of shared-nothing programming model
Slide 34
IRAM Compilers
• IRAM/Cray vectorizing compiler [Judd]
  – production compiler
    » used on the T90 and C90, as well as the T3D and T3E
    » being ported (by SGI/Cray) to the SV2 architecture
  – has C, C++, and Fortran front-ends (focus on C)
  – extensive vectorization capability
    » outer-loop vectorization, scatter/gather, short loops, ...
  – VIRAM port is under way
• IRAM/VSUIF vectorizing compiler [Krashinsky]
  – based on VSUIF from Corinna Lee's group at Toronto, which is based on MachineSUIF from Mike Smith's group at Harvard, which is based on the SUIF compiler from Monica Lam's group at Stanford
  – a "research" compiler, not intended for compiling large complex applications
  – working since 5/99
Slide 35
IRAM/Cray Compiler Status
• MIPS backend developed this year
  – validated using a commercial test suite for code generation
• Vector backend recently started
  – testing with simulator under way
• Leveraging from Cray
  – automatic vectorization
[Diagram: C, Fortran, and C++ front-ends feed PDGCS (vectorizer), which feeds code generators for IRAM and C90]
Slide 36
VIRAM/VSUIF Matrix/Vector Multiply
• VIRAM/VSUIF does reasonably well on long loops
[Chart: Mflop/s (0-1200) for matrix-vector (mvm) and vector-matrix (vmm) multiply in dot, padded, saxpy, and hand-opt variants]
• 256x256 single-precision matrix
• Compare to 1600 Mflop/s peak (without multadd)
• Note: BLAS-2 (little reuse)
• ~350 Mflop/s on Power3 and EV6
• Problems specific to VSUIF (see the sketch below):
  – hand strip-mining results in short loops
  – reductions
  – no multadd support
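A sketch of the kind of strip-mined matrix-vector multiply loop involved here; VLEN stands in for the hardware maximum vector length, and the inner strip is what the vectorizer turns into vector code (illustrative, not the benchmark source):

  #include <stddef.h>

  #define VLEN 64  /* stand-in for the maximum vector length */

  /* y = A*x for an n x n row-major matrix. The j-loop is strip-mined;
     each strip becomes one vector multiply plus a reduction. */
  void mvm(size_t n, const float *a, const float *x, float *y) {
      for (size_t i = 0; i < n; i++) {
          float sum = 0.0f;
          for (size_t j0 = 0; j0 < n; j0 += VLEN) {
              size_t jmax = (j0 + VLEN < n) ? j0 + VLEN : n;
              for (size_t j = j0; j < jmax; j++)   /* vectorizable strip */
                  sum += a[i * n + j] * x[j];      /* reduction */
          }
          y[i] = sum;
      }
  }

With n = 256 and VLEN = 64, each row yields only 4 short strips, which is the short-loop overhead the slide refers to.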
Slide 37
Reactive Self-Maintenance
• ISTORE defines a layered system model for monitoring and reaction:
[Diagram: self-monitoring hardware → SW monitoring → problem detection → coordination of reaction → reaction mechanisms; monitoring through coordination provided by the ISTORE runtime system, reaction mechanisms by the application; policies drive the layers, with the ISTORE API between runtime system and application]
• ISTORE API defines the interface between the runtime system and application reaction mechanisms
• Policies define the system's monitoring, detection, and reaction behavior
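A minimal C sketch of the monitor → detect → react loop this layering implies; all names (poll_monitors, matches, react) are hypothetical placeholders, not the actual ISTORE API:

  #include <stdbool.h>

  /* Hypothetical types standing in for runtime-system state. */
  typedef struct { int node; double value; } event_t;
  typedef struct {
      bool (*matches)(const event_t *);  /* problem detection (policy) */
      void (*react)(const event_t *);    /* app-provided reaction */
  } policy_t;

  /* One iteration of the layered model: collect monitoring data,
     detect problems against configured policies, coordinate reaction. */
  void runtime_step(policy_t *policies, int npol,
                    int (*poll_monitors)(event_t *, int),
                    event_t *buf, int cap) {
      int n = poll_monitors(buf, cap);          /* SW monitoring layer */
      for (int e = 0; e < n; e++)
          for (int p = 0; p < npol; p++)
              if (policies[p].matches(&buf[e]))
                  policies[p].react(&buf[e]);
  }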
Slide 38
Proactive Self-Maintenance
• Continuous online self-testing of HW and SW
  – detects flaky, failing, or buggy components via:
    » fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
    » stress testing: pushing HW/SW components past normal operating parameters
    » scrubbing: periodic restoration of potentially "decaying" hardware or software state
  – automates preventive maintenance
• Dynamic HW/SW component characterization
  – used to adapt to heterogeneous hardware and behavior of application software components
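As one concrete instance, a sketch of a scrubbing pass that re-reads stored blocks and verifies checksums to catch decaying state early; read_block and stored_crc are hypothetical helpers, and the CRC is a plain illustrative implementation:

  #include <stdint.h>
  #include <stddef.h>

  /* Simple bitwise CRC-32 (reflected polynomial 0xEDB88320). */
  static uint32_t crc32(const uint8_t *p, size_t n) {
      uint32_t c = 0xFFFFFFFFu;
      for (size_t i = 0; i < n; i++) {
          c ^= p[i];
          for (int b = 0; b < 8; b++)
              c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1)));
      }
      return ~c;
  }

  /* Scrub one block: re-read it and compare against the stored CRC.
     Returns 0 if clean, -1 if the block failed or silently decayed. */
  int scrub_block(int (*read_block)(size_t, uint8_t *, size_t),
                  uint32_t (*stored_crc)(size_t),
                  size_t blockno, uint8_t *buf, size_t blksz) {
      if (read_block(blockno, buf, blksz) != 0)
          return -1;
      return crc32(buf, blksz) == stored_crc(blockno) ? 0 : -1;
  }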
Slide 39
ISTORE-0 Prototype and Plans
• ISTORE-0: testbed for early experimentation with ISTORE research ideas
• Hardware: cluster of 6 PCs
  – intended to model ISTORE-1 using COTS components
  – nodes interconnected using ISTORE-1 network fabric
  – custom fault-injection hardware on a subset of nodes
• Initial research plans
  – runtime system software
  – fault injection
  – scalability, availability, and maintainability benchmarking
  – applications: block storage server, database, FFT
Slide 40
Runtime System Software
• Demonstrate simple policy-driven adaptation
  – within the context of a single OS and application
  – software monitoring information collected and processed in real time
    » e.g., health & performance parameters of OS, application
  – problem detection and coordination of reaction
    » controlled by a stock set of configurable policies
  – application-level adaptation mechanisms
    » invoked to implement reaction
• Use experience to inform ISTORE API design
• Investigate reinforcement learning as a technique to infer appropriate reactions from goals
Slide 41
Record-breaking performance is not the common case
• NOW-Sort records demonstrate peak performance
• But perturb just 1 of 8 nodes and...
[Chart: slowdown (0-5x) relative to best case for bad disk layout, busy disk, light CPU, heavy CPU, and paging]
Slide 42
Virtual Streams: dynamic load balancing for I/O
• Replicas of data serve as second sources
• Maintain a notion of each process's progress
• Arbitrate use of disks to ensure equal progress
• The right behavior, but what mechanism? (see the sketch below)
[Diagram: processes → Virtual Streams software (arbiter) → disk]
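A sketch of the progress-based arbitration idea: on each scheduling decision, serve the process furthest behind; names and structures are illustrative, not the Virtual Streams implementation:

  #include <stddef.h>

  typedef struct {
      double progress;  /* bytes delivered so far, normalized */
  } stream_t;

  /* Pick the stream to serve next: the one with least progress.
     Because data is replicated, any disk holding it can serve the
     request, so slow disks shed work to fast ones automatically. */
  size_t arbitrate(const stream_t *s, size_t n) {
      size_t best = 0;
      for (size_t i = 1; i < n; i++)
          if (s[i].progress < s[best].progress)
              best = i;
      return best;
  }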
Slide 43
Graduated Declustering: a Virtual Streams implementation
• Clients send progress, servers schedule in response
[Diagram: 4 clients, each reading from two of 4 servers at B/2 apiece, so every client sees bandwidth B. After server 1 slows to B/2, the servers reapportion their shares (e.g., 3B/8 + 5B/8, B/4 + 5B/8, ...) so that every client still sees 7B/8]
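The arithmetic behind the 7B/8 (my derivation from the diagram's numbers): before the slowdown the four servers supply 4B in aggregate, i.e., B per client. After server 1 drops to B/2 the aggregate is 3B + B/2 = 7B/2, and because clients report progress, the servers spread the shortfall evenly: (7B/2) / 4 clients = 7B/8 each, rather than one unlucky client collapsing to B/2.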
Slide 44
Read Performance: Multiple Slow Disks
[Chart: minimum per-process bandwidth (MB/sec, 0-6) vs. # of slow disks (out of 8), comparing Ideal, Virtual Streams, and Static]
Slide 45
Storage Priorities: Research vs. Users
• Traditional research priorities: 1) performance, 1') cost (easy to measure); 3) scalability, 4) availability, 5) maintainability (hard to measure)
• ISTORE priorities: 1) maintainability, 2) availability, 3) scalability, 4) performance, 5) cost
Slide 46
Intelligent Storage Project Goals
• ISTORE: a hardware/software architecture for building scaleable, self-maintaining storage
  – an introspective system: it monitors itself and acts on its observations
• Self-maintenance: does not rely on administrators to configure, monitor, or tune the system
Slide 47
Self-maintenance
• Failure management
  – devices must fail fast without interrupting service
  – predict failures and initiate replacement
  – failures should not require immediate human intervention
• System upgrades and scaling
  – new hardware automatically incorporated without interruption
  – new devices immediately improve performance or repair failures
• Performance management
  – system must adapt to changes in workload or access patterns
Slide 48
ISTORE-I: 2H99
• Intelligent disk
  – portable PC hardware: Pentium II, DRAM
  – low-profile SCSI disk (9 to 18 GB)
  – 4 100-Mbit/s Ethernet links per node
  – placed inside half-height canister
  – monitor processor/path to power off components?
• Intelligent chassis
  – 64 nodes: 8 enclosures, 8 nodes/enclosure
    » 64 x 4 = 256 Ethernet ports
  – 2 levels of Ethernet switches: 14 small, 2 large
    » small: 20 100-Mbit/s + 2 1-Gbit; large: 25 1-Gbit
    » just for prototype; crossbar chips for real system
  – enclosure sensing, UPS, redundant PS, fans, ...
Slide 49
Disk Limit
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read whole disk:

Year   Sequentially   Randomly (1 sector/seek)
1990   4 minutes      6 hours
1999   35 minutes     1 week (!)

• Does the 3.5" form factor make sense in 5-7 years?
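As a rough cross-check against the Cheetah 36 numbers on Slide 53 (my arithmetic, not from the slide): sequentially, 36,400 MB / ~18 MB/s media rate ≈ 2,000 s ≈ 34 minutes; randomly, 71,132,960 sectors x ~8.5 ms each (5.2 ms avg. seek + 3 ms half-rotation at 10,000 RPM) ≈ 6 x 10^5 s ≈ 1 week.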
Slide 50
Related Work
• ISTORE adds to several recent research efforts
• Active Disks, NASD (UCSB, CMU)
• Network service appliances (NetApp, Snap!, Qube, ...)
• High-availability systems (Compaq/Tandem, ...)
• Adaptive systems (HP AutoRAID, M/S AutoAdmin, M/S Millennium)
• Plug-and-play system construction (Jini, PC Plug&Play, ...)
Slide 51
Other (Potential) Benefits of ISTORE
• Scalability: add processing power, memory, and network bandwidth as you add disks
• Smaller footprint vs. traditional server/disk
• Less power
  – embedded processors vs. servers
  – spin down idle disks?
• For decision-support or web-service applications, potentially better performance than traditional servers
Slide 52
Disk Limit: I/O Buses
[Diagram: CPU → memory bus → memory; data then crosses a chain of controllers ("C") and buses: PCI, external I/O bus (SCSI), and internal I/O bus, with up to 15 disks per controller; multiple copies of data and SW layers along the way]
• Bus rate vs. disk rate
  – SCSI: Ultra2 (40 MHz), Wide (16 bit): 80 MByte/s
  – FC-AL: 1 Gbit/s = 125 MByte/s (single disk in 2002)
• Cannot use 100% of bus
  – queuing theory (< 70%)
  – command overhead (effective size = size x 1.2)
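Combining the slide's two discounts (my reading of the factors): an 80 MByte/s Ultra2 bus delivers at most roughly 80 x 0.7 / 1.2 ≈ 47 MByte/s of user data, so two or three disks at the Cheetah's 14-21 MB/s user rate are enough to saturate it.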
Slide 53
State of the Art: Seagate Cheetah 36
• 36.4 GB, 3.5 inch disk
• 12 platters, 24 surfaces
• 10,000 RPM
• 18.3 to 28 MB/s internal media transfer rate (14 to 21 MB/s user data)
• 9,772 cylinders (tracks); 71,132,960 sectors total
• Avg. seek: read 5.2 ms, write 6.0 ms (max. seek: 12/13 ms; 1-track: 0.6/0.9 ms)
• $2100, or 17 MB/$ (6¢/MB) (list price)
• 0.15 ms controller time
Source: www.seagate.com
Slide 54
User Decision Support Demand vs. Processor Speed
[Chart, log scale 1-100 over 1996-2000: CPU speed doubles every 18 months ("Moore's Law"); database demand doubles every 9-12 months ("Greg's Law"); the widening gap is the database-processor performance gap]