The 3rd International Conference on Emerging Ubiquitous Systems and Pervasive Networks, www.iasks.org/conferences/EUSPN2011, Amman, Jordan, October 10-13, 2011

The 3rd International Conference on Emerging Ubiquitous Systems and Pervasive Networks
www.iasks.org/conferences/EUSPN2011
Amman, Jordan
October 10-13, 2011

Challenges to High Productivity Computing Systems and Networks

Mohammad Malkawi
Dean of Engineering, Jadara University

[email protected]

Outline

High Productivity Computing Systems (HPCS) - The Big Picture

The Challenges
IBM PERCS
Cray Cascade
SUN Hero Program
Cloud Computing

HPCS: The Big Picture

Manufacture and deliver a peta-flop class computer

Complex architecture
High performance
Easier to program
Easier to use

HPCS Goals

Productivity
➔ Reduce code development time
Processing power
➔ Floating point & integer arithmetic
Memory
➔ Large size, high bandwidth & low latency
Interconnection
➔ Large bisection bandwidth

HPCS Challenges

High Effective Bandwidth
➔ High bandwidth/low latency memory systems
Balanced System Architecture
➔ Processors, memory, interconnects, programming environments
Robustness
➔ Hardware and software reliability
➔ Compute through failure
➔ Intrusion identification and resistance techniques

HPCS Challenges

Performance Measurement and Prediction
➔ New class of metrics and benchmarks to measure and predict performance of system architecture and applications software
Scalability
➔ Adapt and optimize to changing workload and user requirements; e.g., multiple programming models, selectable machine abstractions, and configurable software/hardware architectures

Productivity Challenges

Quantify productivity for code development and production

Identify characteristics of
➔ Application codes
➔ Workflow
➔ Bottlenecks and obstacles
➔ Lessons learned
so that decisions by the productivity team and the vendors are based on real data rather than anecdotal data

Did Not Learn the Lessons

[Figure 2: Defect Arrival Rate for R8, R9 and R10. Chart of defect rate per month (y-axis, 0.0 to 300.0) over time (x-axis, dates through Nov-06), with one series each for R8, R9 and R10.]

Productivity Dilemma - 1

Diminishing productivity is alarming
➔ Coding
➔ Debugging
➔ Optimizing
➔ Modifying
➔ Over-provisioning hardware
➔ Running high-end applications

Productivity Dilemma - 2

Not long ago, a computational scientist could personally write, debug and optimize code to run on a leadership class high performance computing system without the help of others.

Today, programming a cluster of machines is significantly more difficult than traditional programming, and the scale of the machines and problems has increased more than 1,000 times.

Productivity Dilemma - 3

Owning and running high-end computational facilities for nuclear research, seismic modeling, gene sequencing or business intelligence takes a sizeable investment in staffing, procurement and operations.

Applications achieve 5 to 10 percent of the theoretical peak performance of the system.

Applications must be restarted from scratch every time a hardware or software failure interrupts the job.

HPCS Trends: Productivity Crisis

High Productivity Computing

Scaling the Program Without Scaling the Programmer

Bandwidth enables productivity and allows for simpler programming environments and systems with greater fault tolerance

Language Challenges

MPI is a fairly low-level programming interface
➔ Reliable, predictable and works
➔ Library extension of Fortran, C and C++
New languages with higher level of abstraction
➔ Improve legacy applications
➔ Scale to petascale levels
➔ SUN – Fortress
➔ IBM – X10
➔ Cray – Chapel
➔ OpenMP

Global View Programming Model

Global View programs present a single, global view of the program's data structures.
They begin with a single main thread.
Parallel execution then spreads out dynamically as work becomes available.

Unprecedented Performance Leap

Performance targets require aggressive improvements in system parameters traditionally ignored by the "Linpack" benchmark.

Improve system performance under the most demanding benchmarks (GUPS)

Determine whether general applications will be written or modified to benefit from these features.
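GUPS (giga-updates per second) is measured by kernels that perform read-modify-write updates at random locations in a large table, which stresses memory and interconnect latency rather than floating-point throughput. A minimal single-process sketch of such a kernel is shown here. The table size, update count, and use of Python's `random` module are illustrative assumptions; this does not reproduce any official benchmark's random stream or run rules.

```python
import random
import time

def random_updates(table_bits=16, n_updates=100_000, seed=1):
    """XOR pseudo-random values into pseudo-random slots of a table."""
    size = 1 << table_bits                    # power-of-two table size (assumed)
    table = list(range(size))                 # table[i] starts as i
    rng = random.Random(seed)
    start = time.perf_counter()
    for _ in range(n_updates):
        value = rng.getrandbits(64)
        table[value & (size - 1)] ^= value    # random read-modify-write
    elapsed = time.perf_counter() - start
    gups = n_updates / elapsed / 1e9          # giga-updates per second
    return table, gups

table, gups = random_updates()
print(f"{gups:.6f} GUPS on this toy run")
```

Because consecutive updates touch unrelated addresses, caches and prefetchers help very little, which is why this style of benchmark exposes system parameters that Linpack does not.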

Trade-Offs

Portability versus innovation
Abstraction versus difficulty of programming and performance overhead
Shared memory versus message passing

Cost of Petascale Computing

Require petabytes of memory
On the order of 10^6 processors
Hundreds of petabytes of disk storage for capacity and bandwidth
Power consumption and cost for DRAM and disks (tens of megawatts)
Operational cost

The DARPA HPCS Program

First major program to devote effort to make high end computers more user-friendly

Mask the difficulty of developing and running codes on HPCS

Mask the challenge of getting good performance for a general code

Fast, large, and low latency RAM
Fast processing
Quantitative measure of productivity

IBM HPCS EXAMPLE

IBM HPCS Program – PERCS 2011

Productive, Easy-to-use, Reliable Computing System
Rich programming environment
Develop new applications and maintain existing ones
Support existing programming models and languages
Scalability to the peta-level
Automate performance tuning tasks
Rich graphical interfaces
Automate monitoring and recovery tasks
Fewer system administrators to handle larger systems more effectively

IBM Blue Gene – HPCS Base

IBM Approach - Hardware

Innovative processor chip design, leveraging the POWER processor server line
Lower Soft Error Rates (SER)
Reduce the latency of memory accesses by placing the processors close to large memory arrays
Multiple chip configurations to suit different workloads

IBM Approach - Software

Large set of tools integrated into a modern, user-friendly programming environment.

Support legacy programming models and languages (MPI, OpenMP, C, C++, Fortran, etc.)
Support emerging ones (PGAS)
Design a new experimental programming language, called X10

X10 Features

Designed for parallel processing from the ground up
Falls under the Partitioned Global Address Space (PGAS) category
Balance between a high-level abstraction and exposing the topology of the system
Asynchronous interactions among the parallel threads
Avoid the blocking synchronization style
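X10 expresses asynchrony with `async` (spawn an activity) and `finish` (wait for all activities spawned inside a scope). The sketch below is only a loose Python analogy of that pattern using the standard `concurrent.futures` module; it is not X10 code, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Work done by one asynchronous activity."""
    return sum(x * x for x in chunk)

def sum_of_squares(data, n_chunks=4):
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    # Leaving the 'with' block waits for every submitted task,
    # loosely like reaching the end of a 'finish' scope.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        futures = [pool.submit(partial_sum, c) for c in chunks]  # like 'async'
    return sum(f.result() for f in futures)

print(sum_of_squares(list(range(10))))  # prints 285
```

The spawning code never blocks on an individual task; it only waits once, at the end of the scope, for everything it started.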

CRAY HPCS EXAMPLE

Multiple Processing Technologies

In high performance computing: one size does not fit all

Heterogeneous computing using custom processing technologies.

Performance achieved via deeper pipelining and more complex microarchitectures

Introduction of multi-core processors:
➔ Further stresses processor-memory balance issues
➔ Drives up the number of processors required to solve large problems

Specialized Computing Technologies

Vector processing and field programmable gate arrays (FPGAs)

➔ Ability to extract more performance out of the transistors on a chip with less control overhead.

➔ Allow higher processor performance, with lower power

➔ Reduce the number of processors required to solve a given problem

➔ Vector processors tolerate memory latency extremely well

Specialized Computing Technologies

Multithreading improves latency tolerance
Cascade design will combine multiple computing technologies
➔ Pure scalar nodes, based on Opteron microprocessors
➔ Nodes providing vector, massively multithreaded, and FPGA-based acceleration
➔ Nodes that can adapt their mode of operation to the application

Cray: The Cascade Approach

Scalable, high-bandwidth system
Globally addressable memory
Heterogeneous processing technologies
➔ Fast serial execution
➔ Massive multithreading
➔ Vector processing and FPGA-based application acceleration
Adaptive supercomputing:
➔ The system adapts to the application rather than requiring the programmer to adapt the application to the system.

Cascade Approach

Use Cray T3E™ massively parallel system
Use best-of-class microprocessor
Processors directly access global memory with very low overhead and at very high data rates.

Hierarchical address translation allows the processors to access very large data sets without suffering from TLB faults

AMD's Opteron will be the base processor for Cascade

Cray – Adaptive Supercomputing

The system adapts to the application
The user logs into a single system, and sees one global file system.
The compiler analyzes the code to determine which processing technology best fits the code

The scheduling software automatically deploys the code on the appropriate nodes.

Balanced Hardware Design

A balanced hardware design
➔ Complements processor flops with memory, network and I/O bandwidth
➔ Scalable performance
➔ Improving programmability and breadth of applicability
Balanced systems also require fewer processors to scale to a given level of performance, reducing failure rates and administrative overhead.
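The link between processor count and failure rate can be illustrated with a simple model. The sketch assumes independent nodes with a constant failure rate, so that the system-wide mean time between failures (MTBF) is the per-node MTBF divided by the node count; the per-node figure below is hypothetical, not a number from the talk.

```python
def system_mtbf_hours(node_mtbf_hours, n_nodes):
    """MTBF of the whole system, assuming independent nodes
    with a constant failure rate."""
    return node_mtbf_hours / n_nodes

NODE_MTBF = 500_000  # hypothetical per-node MTBF in hours

for n in (10_000, 100_000, 1_000_000):
    print(n, "nodes ->", system_mtbf_hours(NODE_MTBF, n), "hours between failures")
```

Under this model, a design that reaches the same performance with ten times fewer processors sees failures ten times less often.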

Cray – System Bandwidth Challenge

The Cascade program is attacking this problem on two fronts: signalling technology and network design.

Provide truly massive global bandwidth at an affordable cost.

A key part of the design is a common, globally addressable memory across the whole machine.

Efficient, low-overhead communication.

Cray – System Bandwidth Challenge

Accessing remote data is as simple as issuing a load or store instruction, rather than calling a library function to pass messages between processors.

Allows many outstanding references to be overlapped with each other and with ongoing computation.

Cray Programming Model

Support MPI for legacy purposes
Unified Parallel C (UPC) and Coarray Fortran (CAF) are simpler and easier to write than MPI
➔ Reference memory on remote nodes as easily as referencing memory on the local node
➔ Data sharing is much more natural
➔ Communication overhead is much lower
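The idea of referencing remote memory as easily as local memory can be sketched with a toy global array. The class name `GlobalArray`, its `load` and `store` methods, and the block distribution are hypothetical choices for illustration; this is neither UPC nor Coarray Fortran syntax, and the "nodes" are just Python lists in one process.

```python
class GlobalArray:
    """Toy globally addressable array, block-distributed over 'nodes'."""

    def __init__(self, n_elems, n_nodes):
        self.block = -(-n_elems // n_nodes)            # ceiling division
        self.parts = [[0] * min(self.block, max(0, n_elems - k * self.block))
                      for k in range(n_nodes)]
        self.remote_refs = 0                           # bookkeeping only

    def _locate(self, i):
        return divmod(i, self.block)                   # (owner node, local offset)

    def load(self, i, me=0):
        node, off = self._locate(i)
        self.remote_refs += int(node != me)            # count remote accesses
        return self.parts[node][off]

    def store(self, i, value, me=0):
        node, off = self._locate(i)
        self.remote_refs += int(node != me)
        self.parts[node][off] = value

a = GlobalArray(n_elems=8, n_nodes=4)
a.store(5, 42)                   # element 5 lives on node 2; the caller is node 0
print(a.load(5), a.remote_refs)  # prints: 42 2
```

The caller uses one global index for every element; the translation to an owning node and a local offset happens underneath, which is the part that hardware and compiler support make cheap in a globally addressable machine.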

The Chapel – Cray HPCS Language

Support for graphs, hash tables, sparse arrays, and iterators.

Ability to separate the specification of an algorithm from structural details of the computation, including
➔ Data layouts
➔ Work decomposition and communication
Simplifies the creation of the basic algorithms
Allows these structural components to be gradually tuned over time
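Separating an algorithm from its data layout can be sketched in a few lines. In this illustrative Python example (not Chapel syntax; all function names are hypothetical), the dot product is written once, and the way indices are divided among workers is passed in as a separate, swappable component.

```python
def block_layout(n, n_workers):
    """Contiguous index ranges, one per worker."""
    step = -(-n // n_workers)                      # ceiling division
    return [range(k * step, min(n, (k + 1) * step)) for k in range(n_workers)]

def cyclic_layout(n, n_workers):
    """Round-robin indices, one strided range per worker."""
    return [range(k, n, n_workers) for k in range(n_workers)]

def dot(x, y, layout, n_workers=4):
    """The algorithm: written once, independent of the layout."""
    parts = layout(len(x), n_workers)
    return sum(sum(x[i] * y[i] for i in idx) for idx in parts)

x = [1, 2, 3, 4, 5, 6]
y = [6, 5, 4, 3, 2, 1]
assert dot(x, y, block_layout) == dot(x, y, cyclic_layout)
print(dot(x, y, block_layout))  # prints 56
```

Switching from a block to a cyclic layout changes only the argument passed to `dot`, so the layout can be tuned later without touching the algorithm.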

Cray's Programming Tools

Reduce the complexity of working on highly scalable applications.

The Cascade debugger solution will
➔ Focus on data rather than control
➔ Support application porting
➔ Allow scaling commensurate with the application
Integrated development environment (IDE)

Cascade Performance Analysis Tools

Hardware performance counters
Software introspection techniques
Present the user with insight, rather than statistics
Act as a parallel programming expert
Provide high-level feedback on program behaviour
Provide suggestions for program modifications to remove key bottlenecks or otherwise improve performance

SUN HPCS EXAMPLE

Evolution of HPCS at SUN

Grid
➔ Loosely coupled heterogeneous resources
➔ Multiple administrative domains
➔ Wide area network
Clusters
➔ Tightly coupled high performance systems
➔ Message passing – MPI
Ultrascale
➔ Distributed scalable systems
➔ High productivity shared memory systems
➔ High bandwidth, global address space, unified administration tools

SUN Approach – The Hero System

Rich bandwidth
Low latencies
Very high levels of fault tolerance
Highly integrated toolset to scale the program and not the programmers
Multithreading technologies (> 100 concurrent threads)

SUN Approach – The Hero System

Globally addressable memory
System level and application checkpointing
Hardware and software telemetry for dramatically improved fault tolerance
The system appears more like a flat memory system
Focus on solving the problem at hand rather than making elaborate efforts to distribute data in a robust manner
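Application-level checkpointing, as opposed to restarting from scratch after a failure, can be sketched as follows. The file format (`pickle`), the function name `run`, the `fail_at` parameter, and checkpointing after every step are assumptions made for illustration; a real job would checkpoint far less often.

```python
import os
import pickle
import tempfile

def run(n_steps, ckpt_path, fail_at=None):
    """Sum 0..n_steps-1, saving state after every step."""
    state = {"step": 0, "total": 0}
    if os.path.exists(ckpt_path):                  # resume if a checkpoint exists
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        tmp = ckpt_path + ".tmp"
        with open(tmp, "wb") as f:                 # write, then rename atomically
            pickle.dump(state, f)
        os.replace(tmp, ckpt_path)
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run(10, path, fail_at=6)                       # first attempt dies at step 6
except RuntimeError:
    pass
print(run(10, path))                               # resumes at step 6, prints 45
```

Writing to a temporary file and renaming it means a failure during the write cannot corrupt the last good checkpoint.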

Definition: Bisection Bandwidth

Split a system into equal halves such that there is the minimum number of connections across the split; the bandwidth across the split is the bisection bandwidth
A standard metric for a system's ability to globally move data
Example is an all-to-all interconnect between 8 cabinets
➔ There are 28 total connections, of which 16 cross the bisection (orange) and 12 do not (blue)
High bandwidth optical connections are key to meeting the HPCS peta-scale bisection bandwidth target
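The counts in the 8-cabinet example can be checked with a short sketch. The function name is hypothetical; it enumerates every pair of cabinets in an all-to-all interconnect and counts the links whose endpoints fall in different halves. In an all-to-all network every even split cuts the same number of links, so any even split gives the minimum.

```python
from itertools import combinations

def bisection_links(n_cabinets):
    """Return (total links, links crossing an even split, links that do not)."""
    links = list(combinations(range(n_cabinets), 2))
    half = n_cabinets // 2
    crossing = [(a, b) for a, b in links if (a < half) != (b < half)]
    return len(links), len(crossing), len(links) - len(crossing)

print(bisection_links(8))  # prints (28, 16, 12)
```

Multiplying the number of crossing links by the bandwidth of one link gives the bisection bandwidth of such a system.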

System Bandwidth Over Time
A giant leap in productivity expected

High Bandwidth Required by HPCS
Radical Changes From Today’s Architecture Necessary

Motivation for Higher Bandwidth

Growing BW demand in HPCS

➔ Multicore CPUs: Aggregation of multiple cores is unstoppable and copper interconnects are stressed at very large scale

➔ Silicon Photonics is the solution since it brings a potential of unlimited BW on the best medium allowing for large aggregation of multicore CPUs

Growing BW demand in HPCS

Clusters are growing in number of nodes and in performance/node

Interconnects are the limiting factor in BW, latency, distance

Protocols reduce latency & copper increases latency.

Silicon Photonics brings high BW and low latency

Growing BW demand in HPCS

Storage I/O BW is increasing exponentially due to faster data rates and the parallelism introduced by striping technologies

WDM will eventually allow 10Tb of data to be transmitted down a single piece of fiber

Silicon Photonics is at the beginning of its life cycle with headroom for explosive BW growth without any increase in latency or reduction in reach

Proximity + CMOS Photonics

Proximity Communication -2

Proximity Communication -3

Proximity Communication

Capacitive coupling enables high-speed data communication between neighboring chips without the need for wires of any kind
➔ Allows for the alignment of metal plates on one chip with metal plates on a neighboring chip and the transfer of data between them
➔ Reduces power
➔ Improves cross-section bandwidth and communication power

Proximity Communication - SUN

➔ 3.6 x 4.1 mm test chip
➔ 0.35 um technology
➔ 50 um bit pitch
➔ 1.35 Gbps/channel for 16 simultaneous channels
➔ < 10^-12 BER @ 1 Gbps
➔ 3.6 mW/channel static power
➔ 3.9 pJ/bit dynamic power
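The test-chip figures can be combined into an aggregate bandwidth and a rough power budget. The sketch assumes that all 16 channels run simultaneously at the full 1.35 Gbps and that the quoted dynamic energy per bit applies at that rate; the constant and function names are illustrative.

```python
CHANNELS = 16
RATE_BPS = 1.35e9            # 1.35 Gbps per channel
STATIC_W = 3.6e-3            # 3.6 mW per channel
DYNAMIC_J_PER_BIT = 3.9e-12  # 3.9 pJ per bit

def link_budget(channels=CHANNELS, rate=RATE_BPS):
    """Return (aggregate bit rate, dynamic W per channel, total W)."""
    aggregate_bps = channels * rate
    dynamic_w = DYNAMIC_J_PER_BIT * rate           # energy per bit x bits per second
    total_w = channels * (STATIC_W + dynamic_w)
    return aggregate_bps, dynamic_w, total_w

agg, dyn, total = link_budget()
print(f"{agg / 1e9:.1f} Gbps aggregate, {total * 1e3:.1f} mW total")
```

Under these assumptions the chip moves 21.6 Gbps in aggregate for roughly 142 mW, of which about 5.3 mW per channel is dynamic power.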

Proximity Communication -4

Proximity Communication -5

Low Cost, Low Power Optics

DWDM CMOS Photonics

CMOS Photonics Module

SUN Programming Model
Simpler Code with High Bandwidth Shared Memory
NAS Parallel Benchmark CG (Conjugate Gradient) Lines of Code

SUN Fortress Language

➔ Catch stupid mistakes
➔ Extensive libraries
➔ Platform independence
➔ Security model
➔ Type safety
➔ Multithreading
➔ Dynamic compilation

To Do For Fortran What Java™ Did For C

Object-Based “Smart” Storage

With Object Storage File Systems For Massive Scalability and Extreme Performance

Ultra-scale Computing in 2010

Simpler development environments will make HPC more accessible to a diverse range of users

Lone researchers and small teams will once again be able to harness the computational power of leadership class systems

Many gaps regarding commercial and scientific computing will narrow

Cloud Computing

Service computing
The net is the computer
More than 100 vendors
Growing fast
Programming environment

BACKUP SLIDES

HPCS Technologies
Some Publicly Announced Projects

IBM HPCS - PERCS

Open source operating systems and hypervisors will provide HPC-oriented
➔ Virtualization
➔ Security
➔ Resource management
➔ Affinity control
➔ Resource limits
➔ Checkpoint-restart and reliability features
that will improve the robustness and availability of the system.

MPI Paradigm

➔ Writing applications in MPI requires breaking up all the data and computation into a large number of discrete pieces
➔ and then using library code to explicitly bundle up data and pass it between processors in messages whenever processors need to share data.

➔ It's a cumbersome affair that distracts scientists from their primary focus.

➔ Once an application is written, it's generally a time-consuming process to debug and tune it.

➔ Traditional debugging models just don't scale well to thousands or tens of thousands of processors (try opening up 10,000 debugger windows, one for each thread!).

➔ Trying to figure out why your application isn't getting the performance you think it should is also exceedingly difficult at large scales.

➔ Traditional profiling and even sophisticated statistics-gathering may be insufficient to ascertain why the performance is lagging, much less how to change the code to improve it.
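The message-passing style described above can be sketched without MPI itself. In this illustrative Python example, threads and `queue.Queue` objects stand in for processes and for MPI send/receive calls; the function names are hypothetical and none of this is the MPI API.

```python
import queue
import threading

def worker(rank, inbox, outbox):
    """Each rank receives its piece, computes, and sends a result back."""
    chunk = inbox.get()                        # explicit receive
    outbox.put((rank, sum(chunk)))             # explicit send

def message_passing_sum(data, n_ranks=4):
    inboxes = [queue.Queue() for _ in range(n_ranks)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(r, inboxes[r], results))
               for r in range(n_ranks)]
    for t in threads:
        t.start()
    for r in range(n_ranks):                   # break the data into pieces
        inboxes[r].put(data[r::n_ranks])       # bundle each piece into a message
    total = sum(results.get()[1] for _ in range(n_ranks))
    for t in threads:
        t.join()
    return total

print(message_passing_sum(list(range(100))))  # prints 4950
```

Even for a simple sum, most of the code handles partitioning, sending, and receiving rather than the computation, which is the overhead the slide describes.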

Productivity Challenges

The time spent trying to structure an application to fit the attributes of the target machine.

If the machine is a cluster with limited interconnect bandwidth

➔ the programmer must carefully minimize communication

➔ make sure that any sparse data to be communicated is first bundled together into larger messages to reduce communication overheads.
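The benefit of bundling small messages can be estimated with a simple latency-plus-bandwidth cost model, in which each message pays a fixed startup latency and the payload moves at the link bandwidth. The latency, bandwidth, and message sizes below are hypothetical values chosen for illustration.

```python
def transfer_time(n_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Per-message startup cost plus time to move the payload."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s

LATENCY = 5e-6        # hypothetical 5 microseconds per message
BANDWIDTH = 1e9       # hypothetical 1 GB/s link
N_ITEMS, ITEM_BYTES = 10_000, 8

separate = transfer_time(N_ITEMS, N_ITEMS * ITEM_BYTES, LATENCY, BANDWIDTH)
bundled = transfer_time(1, N_ITEMS * ITEM_BYTES, LATENCY, BANDWIDTH)
print(f"separate: {separate * 1e3:.3f} ms, bundled: {bundled * 1e3:.3f} ms")
```

With these numbers, sending 10,000 eight-byte items one at a time is dominated by startup latency, while one bundled message is dominated by payload transfer and is several hundred times faster.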

Productivity Challenges

If the machine uses conventional microprocessors
➔ Care must be taken to maximize cache re-use
➔ Eliminate global memory references, which tend to stall the processor
If the machine looks like a hammer
➔ You'd better make all your codes look like nails!
This can lead to "unnatural" algorithms and data structures, which significantly reduces programmer productivity