Slide 1
ISTORE Update
David Patterson, University of California at Berkeley
UC Berkeley IRAM Group, UC Berkeley ISTORE Group
[email protected], May 2000
Slide 2
Lampson: Systems Challenges
• Systems that work
  – Meeting their specs
  – Always available
  – Adapting to changing environment
  – Evolving while they run
  – Made from unreliable components
  – Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
  – Understanding when it doesn't matter
"Computer Systems Research: Past and Future," keynote address, 17th SOSP, Dec. 1999
Butler Lampson, Microsoft
Slide 3
Hennessy: What Should the "New World" Focus Be?
• Availability
  – Both appliance & service
• Maintainability
  – Two functions:
    » Enhancing availability by preventing failure
    » Ease of SW and HW upgrades
• Scalability
  – Especially of service
• Cost
  – Per device and per service transaction
• Performance
  – Remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" keynote address, FCRC, May 1999
John Hennessy, Stanford
Slide 4
The real scalability problems: AME
• Availability
  – systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
  – systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
  – systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today's scales, and will only get worse as systems grow
Slide 5
Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important than peak performance
  – "performance robustness" implies that real-world performance is comparable to best-case performance
• Performance can be sacrificed for improvements in AME
  – resources should be dedicated to AME
    » compare: biological systems spend > 50% of resources on maintenance
  – can make up performance by scaling system
Slide 6
Principles for achieving AME (2)
• Introspection
  – reactive techniques to detect and adapt to failures, workload variations, and system evolution
  – proactive techniques to anticipate and avert problems before they happen
Slide 7
ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4 TB storage
  – cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
    » a single field-replaceable unit to simplify maintenance
  – each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
  – more CPU than NAS; fewer disks/node than cluster
ISTORE Chassis: 80 nodes, 8 per tray; 2 levels of switches:
• 20 at 100 Mbit/s
• 2 at 1 Gbit/s
Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
Intelligent Disk "Brick": portable PC CPU (Pentium II/266) + DRAM; redundant NICs (4 100 Mb/s links); diagnostic processor; disk; half-height canister
Slide 8
ISTORE-1 Status
• 10 nodes manufactured; 45 boards fabbed, 40 to go
• Boots OS
• Diagnostic processor interface SW complete
• PCB backplane: not yet designed
• Finish 80-node system: Summer 2000
Slide 9
Hardware techniques
• Fully shared-nothing cluster organization
  – truly scalable architecture
  – architecture that tolerates partial failure
  – automatic hardware redundancy
Slide 10
Hardware techniques (2)
• No Central Processor Unit: distribute processing with storage
  – Serial lines, switches also growing with Moore's Law; less need today to centralize vs. bus-oriented systems
  – Most storage servers limited by speed of CPUs; why does this make sense?
  – Why not amortize sheet metal, power, cooling infrastructure for disk to add processor, memory, and network?
  – If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage
Slide 11
Hardware techniques (3)
• Heavily instrumented hardware
  – sensors for temp, vibration, humidity, power, intrusion
  – helps detect environmental problems before they can affect system integrity
• Independent diagnostic processor on each node
  – provides remote control of power, remote console access to the node, selection of node boot code
  – collects, stores, processes environmental data for abnormalities
  – non-volatile "flight recorder" functionality
  – all diagnostic processors connected via independent diagnostic network
Slide 12
Hardware techniques (4)
• On-demand network partitioning/isolation
  – Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance
  – Allows testing, repair of online system
  – Managed by diagnostic processor and network switches via diagnostic network
Slide 13
Hardware techniques (5)
• Built-in fault injection capabilities
  – Power control to individual node components
  – Injectable glitches into I/O and memory buses
  – Managed by diagnostic processor
  – Used for proactive hardware introspection
    » automated detection of flaky components
    » controlled testing of error-recovery mechanisms
  – Important for AME benchmarking (see next slide)
Slide 14
"Hardware" techniques (6)
• Benchmarking
  – One reason for 1000X processor performance was ability to measure (vs. debate) which is better
    » e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
  – Need AME benchmarks
    » "what gets measured gets done"
    » "benchmarks shape a field"
    » "quantification brings rigor"
Slide 15
Availability benchmark methodology
• Goal: quantify variation in QoS metrics as events occur that affect system availability
• Leverage existing performance benchmarks
  – to generate fair workloads
  – to measure & trace quality of service metrics
• Use fault injection to compromise system
  – hardware faults (disk, memory, network, power)
  – software faults (corrupt input, driver error returns)
  – maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
  – the availability analogues of performance micro- and macro-benchmarks
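The single-fault methodology above can be sketched in a few lines: establish a 99%-confidence "normal" band from a no-fault run, then re-run the workload with one injected fault and record which samples fall outside the band. This is a minimal illustration, not the actual ISTORE harness; the `measure_qos` and `inject_fault` callbacks are hypothetical stand-ins for the real workload and fault injector.

```python
import statistics

def run_availability_benchmark(measure_qos, inject_fault, n_samples=60, fault_at=20):
    """Sketch of the single-fault methodology: sample a QoS metric over
    time, inject one fault partway through, and flag samples that fall
    outside the 99% band derived from a no-fault run."""
    # Phase 1: a no-fault run establishes "normal" behavior.
    baseline = [measure_qos(t) for t in range(n_samples)]
    mean = statistics.mean(baseline)
    band = 2.58 * statistics.stdev(baseline)  # ~99% interval, assuming normality

    # Phase 2: the same workload with one injected fault.
    trace, degraded = [], []
    for t in range(n_samples):
        if t == fault_at:
            inject_fault()          # e.g., take one disk offline
        q = measure_qos(t)
        trace.append(q)
        if abs(q - mean) > band:
            degraded.append(t)      # QoS outside normal variation
    return trace, degraded
```

The samples collected in `degraded` directly measure the length and depth of the availability dip that the graphs on the following slides visualize.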
Slide 16
Benchmark availability? Methodology for reporting results
[Figure: a QoS metric (performance) plotted over time, showing the band of normal behavior (99% confidence), an injected disk failure, and the subsequent reconstruction period]
• Results are most accessible graphically
  – plot change in QoS metrics over time
  – compare to "normal" behavior
    » 99% confidence intervals calculated from no-fault runs
Slide 17
Example single-fault result
[Figure: two plots of hits per second and #failures tolerated over 0 to 110 minutes, one for Linux and one for Solaris, with the reconstruction period marked on each]
• Compares Linux and Solaris reconstruction
  – Linux: minimal performance impact but longer window of vulnerability to second fault
  – Solaris: large perf. impact but restores redundancy fast
Slide 18
Reconstruction Policy
• Linux: favors performance over data availability
  – automatically-initiated reconstruction, idle bandwidth
  – virtually no performance impact on application
  – very long window of vulnerability (>1 hr for 3 GB RAID)
• Solaris: favors data availability over app. perf.
  – automatically-initiated reconstruction at high BW
  – as much as 34% drop in application performance
  – short window of vulnerability (10 minutes for 3 GB)
• Windows: favors neither!
  – manually-initiated reconstruction at moderate BW
  – as much as 18% app. performance drop
  – somewhat short window of vulnerability (23 min/3 GB)
Slide 19
Transient-error handling policy
• Linux is paranoid with respect to transients
  – stops using affected disk (and reconstructs) on any error, transient or not
    » fragile: system is more vulnerable to multiple faults
    » disk-inefficient: wastes two disks per transient
    » but no chance of slowly-failing disk impacting perf.
• Solaris and Windows are more forgiving
  – both ignore most benign/transient faults
    » robust: less likely to lose data, more disk-efficient
    » less likely to catch slowly-failing disks and remove them
• Neither policy is ideal!
  – need a hybrid that detects streams of transients
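One way to realize the hybrid policy the slide calls for is a sliding-window error counter: tolerate isolated transients (as Solaris and Windows do), but retire the disk once transients arrive in a stream, which suggests a slowly-failing drive (the case Linux catches). The class below is a hypothetical sketch; the `max_errors` and `window_s` thresholds are illustrative assumptions, not values from the ISTORE work.

```python
from collections import deque

class TransientErrorFilter:
    """Hybrid transient-error policy sketch: ignore isolated transient
    errors, but retire a disk once a burst of them appears in a window."""
    def __init__(self, max_errors=3, window_s=600):
        self.max_errors = max_errors
        self.window_s = window_s
        self.errors = deque()        # timestamps of recent transient errors

    def on_error(self, now_s, is_hard_error):
        if is_hard_error:
            return "retire"          # real failure: reconstruct immediately
        self.errors.append(now_s)
        while self.errors and now_s - self.errors[0] > self.window_s:
            self.errors.popleft()    # forget errors outside the window
        # A stream of transients suggests a slowly-failing disk.
        return "retire" if len(self.errors) >= self.max_errors else "ignore"
```

This keeps the disk efficiency of the forgiving policies while still removing drives that degrade gradually.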
Slide 20
Software techniques
• Fully-distributed, shared-nothing code
  – centralization breaks as systems scale up to O(10,000)
  – avoids single-point-of-failure front ends
• Redundant data storage
  – required for high availability, simplifies self-testing
  – replication at the level of application objects
    » application can control consistency policy
    » more opportunity for data placement optimization
Slide 21
Software techniques (2)
• "River" storage interfaces
  – NOW Sort experience: performance heterogeneity is the norm
    » e.g., disks: outer vs. inner track (1.5X), fragmentation
    » e.g., processors: load (1.5-5X)
  – So demand-driven delivery of data to apps
    » via distributed queues and graduated declustering
    » for apps that can handle unordered data delivery
  – Automatically adapts to variations in performance of producers and consumers
  – Also helps with evolutionary growth of cluster
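The core of demand-driven delivery can be shown with a shared queue: consumers pull records only when they are ready for more, so a fast node naturally takes a larger share of the work and a slow node never becomes a bottleneck. This is a single-machine, threaded sketch of the idea only; the real river interface uses distributed queues and graduated declustering across nodes.

```python
import queue
import threading
import time

def river_sketch(records, consumer_delays):
    """Demand-driven delivery sketch: producers fill a shared queue and
    each consumer pulls at its own pace, so faster consumers automatically
    take more records (delivery order is not guaranteed)."""
    q = queue.Queue()
    for r in records:
        q.put(r)
    results = {i: [] for i in range(len(consumer_delays))}

    def consume(i, delay):
        # Each consumer asks for the next record only when it is ready.
        while True:
            try:
                r = q.get_nowait()
            except queue.Empty:
                return               # no more work: all records delivered
            time.sleep(delay)        # simulate a slower node
            results[i].append(r)

    threads = [threading.Thread(target=consume, args=(i, d))
               for i, d in enumerate(consumer_delays)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because no record is pre-assigned to a node, the same mechanism absorbs both performance heterogeneity and cluster growth: a newly added consumer simply starts pulling from the queue.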
Slide 22
Software techniques (3)
• Reactive introspection
  – Use statistical techniques to identify normal behavior and detect deviations from it
  – Policy-driven automatic adaptation to abnormal behavior once detected
    » initially, rely on human administrator to specify policy
    » eventually, system learns to solve problems on its own by experimenting on isolated subsets of the nodes
      • one candidate: reinforcement learning
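One simple statistical technique for "identify normal behavior and detect deviations" is an online detector that tracks a metric's running mean and mean absolute deviation, and flags observations that stray too far from them. The sketch below is an illustrative assumption, not the ISTORE implementation; the `alpha`, `threshold`, and `warmup` parameters are invented for the example.

```python
class DeviationDetector:
    """Reactive-introspection sketch: learn 'normal' for one metric with
    exponentially weighted running estimates, then flag outliers."""
    def __init__(self, alpha=0.1, threshold=4.0, warmup=20):
        self.alpha = alpha          # how quickly "normal" adapts
        self.threshold = threshold  # outlier cutoff, in units of typical deviation
        self.warmup = warmup        # samples observed before flagging begins
        self.mean = None
        self.mad = 0.0              # smoothed mean absolute deviation
        self.n = 0

    def observe(self, x):
        self.n += 1
        if self.mean is None:       # first sample seeds the baseline
            self.mean = x
            return False
        err = abs(x - self.mean)
        abnormal = self.n > self.warmup and err > self.threshold * self.mad
        # Update the model only with behavior judged normal, so a fault
        # does not get absorbed into the baseline it is measured against.
        if not abnormal:
            self.mean += self.alpha * (x - self.mean)
            self.mad += self.alpha * (err - self.mad)
        return abnormal
```

A flagged observation would then be handed to the policy layer the slide describes, which decides how to adapt.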
Slide 23
Software techniques (4)
• Proactive introspection
  – Continuous online self-testing of HW and SW
    » in deployed systems!
    » goal is to shake out "Heisenbugs" before they're encountered in normal operation
    » needs data redundancy, node isolation, fault injection
  – Techniques:
    » fault injection: triggering hardware and software error handling paths to verify their integrity/existence
    » stress testing: push HW/SW to their limits
    » scrubbing: periodic restoration of potentially "decaying" hardware or software state
      • self-scrubbing data structures (like MVS)
      • ECC scrubbing for disks and memory
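The scrubbing idea can be illustrated in a few lines: periodically re-verify every stored block against its checksum and repair silently corrupted blocks from a redundant copy, so latent errors are fixed before normal operation trips over them. This is a toy sketch using CRC32 and an in-memory replica, standing in for real ECC scrubbing of disks and memory.

```python
import zlib

def scrub(blocks, checksums, replica):
    """Scrubbing sketch: detect silent corruption via checksums and
    restore damaged blocks from a redundant copy. Returns the indices
    of the blocks that were repaired."""
    repaired = []
    for i, block in enumerate(blocks):
        if zlib.crc32(block) != checksums[i]:   # silent corruption detected
            blocks[i] = replica[i]              # repair from redundant copy
            repaired.append(i)
    return repaired
```

Run on a timer across all nodes, this kind of pass turns rare, hard-to-debug corruption events into routine, logged repairs.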
Slide 24
Initial Applications
• ISTORE is not one super-system that demonstrates all these techniques!
  – Initially provide middleware, library to support AME goals
• Initial application targets
  – cluster web/email servers
    » self-scrubbing data structures, online self-testing
    » statistical identification of normal behavior
  – information retrieval for multimedia data
    » self-scrubbing data structures, structuring performance-robust distributed computation
Slide 25
A glimpse into the future?
• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
• ISTORE HW in 5-7 years:
  – building block: 2006 MicroDrive integrated with IRAM
    » 9 GB disk, 50 MB/sec from disk
    » connected via crossbar switch
  – If low power, 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point
Slide 26
Future Targets
• Maintenance in DoD applications
• Security in computer systems
• Computer vision
Slide 27
Maintenance in DoD systems
• Introspective middleware, built-in fault injection, diagnostic computer, isolatable subsystems ... should reduce maintenance of DoD hardware and software systems
• Is maintenance a major concern of DoD?
• Does improved maintenance fit within goals of Polymorphous Computing Architecture?
Slide 28
Security in DoD Systems?
• Separate diagnostic processor and network gives interesting security possibilities
  – Monitoring of behavior by separate computer
  – Isolation of portion of cluster from rest of network
  – Remote reboot, software installation
Slide 29
Attacking Computer Vision
• Analogy: computer vision recognition in 2000 is like computer speech recognition in 1985
  – Pre-1985, community searching for good algorithms: classic AI vs. statistics?
  – By 1985, reached consensus on statistics
  – Field focuses and makes progress, uses special hardware
  – Systems become fast enough that they can be trained rather than preloaded with information, which accelerates progress
  – By 1995, speech recognition systems starting to deploy
  – By 2000, widely used, available on PCs
Slide 30
Computer Vision at Berkeley
• Jitendra Malik believes he has an approach that is very promising
• 2-step process:
  1) Segmentation: divide image into regions of coherent color, texture, and motion
  2) Recognition: combine regions and search image database to find a match
• Algorithms for 1) work well, just slowly (300 seconds per image using a PC)
• Algorithms for 2) being tested this summer using hundreds of PCs; will determine accuracy
Slide 31
Human Quality Computer Vision
• Suppose the algorithms work: what would it take to match human vision?
• At 30 images per second: segmentation
  – Convolution and vector-matrix multiply of sparse matrices (10,000 x 10,000, 10% nonzero/row)
  – 32-bit floating point
  – 300 seconds on a PC (assuming 333 MFLOPS) => 100 GFLOP/image
  – 30 Hz => 3000 GFLOPS machine to do segmentation
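The segmentation estimate above follows from two multiplications, reproduced here as a quick check of the slide's arithmetic (the 333 MFLOPS sustained PC rate is the slide's own assumption):

```python
def segmentation_flops():
    """Check the slide's arithmetic: 300 s per image on a 333 MFLOPS PC
    implies ~100 GFLOP/image; at 30 images/s that needs ~3000 GFLOPS."""
    pc_flops = 333e6                          # assumed sustained PC rate
    flop_per_image = 300 * pc_flops           # 300 s of PC work ~= 1e11 FLOP
    gflops_needed = 30 * flop_per_image / 1e9 # sustain 30 Hz
    return flop_per_image / 1e9, gflops_needed
```

The result is 99.9 GFLOP per image and about 2997 GFLOPS for 30 Hz, matching the rounded figures on the slide.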
Slide 32
Human Quality Computer Vision
• At 1 image / second: object recognition
  – Humans can remember 10,000 to 100,000 objects per category (e.g., 10k faces, 10k Chinese characters, high school vocabulary of 50k words, ...)
  – To recognize a 3D object, need ~10 2D views
  – 100 x 100 x 8 bit (or fewer bits) per view => 10,000 x 10 x 100 x 100 bytes, or 10^9 bytes
  – Pruning using color and texture and by organizing shapes into an index reduces shape matches to 1000
  – Compare 1000 candidate merged regions with 1000 candidate object images
  – If 10 hours on a PC (333 MFLOPS) => 12,000 GFLOPS
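The recognition numbers also reduce to straightforward arithmetic, checked below (again using the slide's assumed 333 MFLOPS PC):

```python
def recognition_requirements():
    """Check the slide's object-recognition arithmetic: database size for
    10,000 objects, and the rate needed to do in 1 second what takes
    10 hours on a 333 MFLOPS PC."""
    views = 10_000 * 10                  # objects x ~10 2D views per object
    db_bytes = views * 100 * 100         # 100 x 100 bytes per view
    total_flop = 10 * 3600 * 333e6       # work done in 10 PC-hours
    gflops_needed = total_flop / 1e9     # to finish in 1 second instead
    return db_bytes, gflops_needed
```

This gives a 10^9-byte image database and roughly 12,000 GFLOPS, the figure the next slide compares against a 10,000-node ISTORE successor.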
Slide 33
ISTORE Successor does Human Quality Vision?
• 10,000 nodes with system-on-a-chip + Microdrive + network
  – 1 to 10 GFLOPS/node => 10,000 to 100,000 GFLOPS
  – High-bandwidth network
  – 1 to 10 GB of disk storage per node => can replicate images per node
  – Need dependability, maintainability advances to keep 10,000 nodes useful
• Human quality vision useful for DoD apps? Retrainable recognition?
Slide 34
Conclusions (1): ISTORE
• Availability, Maintainability, and Evolutionary growth are key challenges for server systems
  – more important even than performance
• ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
  – via clusters of network-attached, computationally-enhanced storage nodes running distributed code
  – via hardware and software introspection
  – we are currently performing application studies to investigate and compare techniques
• Availability benchmarks a powerful tool?
  – revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000