Computer Architecture II: Introduction

Posted on 19-Dec-2015

TRANSCRIPT

Page 1: Computer Architecture II (Introduction)

Page 2:

Today’s overview

• Why parallel computing?
  – Technology trends
    • Processors
    • Storage
    • Architecture
  – Application trends
  – Challenging computational problems
• What is a parallel computer?
• Classical parallel computer classifications
  – Architecture
  – Memory access
• Cluster and grid computing (definitions)
• Top 500
• Parallel architectures and their convergence

Page 3:

Units of Measure in HPC

• High Performance Computing (HPC) units are:
  – Flops: floating point operations
  – Flop/s: floating point operations per second
  – Bytes: size of data (a double-precision floating point number is 8 bytes long)
• Typical sizes are millions, billions, trillions…

  Mega  Mflop/s = 10^6  flop/sec   Mbyte = 10^6  bytes (also 2^20 = 1,048,576)
  Giga  Gflop/s = 10^9  flop/sec   Gbyte = 10^9  bytes (also 2^30 = 1,073,741,824)
  Tera  Tflop/s = 10^12 flop/sec   Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
  Peta  Pflop/s = 10^15 flop/sec   Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
  Exa   Eflop/s = 10^18 flop/sec   Ebyte = 10^18 bytes
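The decimal versus binary readings of these prefixes can be checked directly; a small sketch (not from the slides):

```python
# Decimal (10^k) vs. binary (2^10k) readings of the HPC prefixes above.
DECIMAL = {"Mega": 10**6, "Giga": 10**9, "Tera": 10**12,
           "Peta": 10**15, "Exa": 10**18}
BINARY = {"Mega": 2**20, "Giga": 2**30, "Tera": 2**40,
          "Peta": 2**50, "Exa": 2**60}

for prefix in DECIMAL:
    dec, bi = DECIMAL[prefix], BINARY[prefix]
    # The binary value is always a few percent larger than the decimal one.
    print(f"{prefix}: {dec} vs {bi} (ratio {bi / dec:.4f})")
```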

Page 4:

Why parallel computing?

• Sequential computer
  – von Neumann model
  – One processor
  – One memory
  – One instruction executed at a time
  – Fastest machines: a couple of billion operations per second (GFLOPS)

[Figure: von Neumann block diagram: processor (control unit and arithmetic logic unit), connecting logic, memory, and I/O system]

Page 5:

Tunnel Vision by Experts

• “I think there is a world market for maybe five computers.”
  – Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have a computer in their home.”
  – Ken Olsen, president and founder of Digital Equipment Corporation, 1977.
• “640K [of memory] ought to be enough for anybody.”
  – Bill Gates, chairman of Microsoft, 1981.

Slide source: Warfield et al.

Page 6:

Technology Trends: Microprocessor Capacity

Moore’s Law: 2x transistors/chip every 1.5 years

Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Slide source: Jack Dongarra

Page 7: (figure slide; content not captured in the transcript)

Page 8:

Impact of Device Shrinkage

• What happens when the transistor size shrinks by a factor of x?
• Clock rate goes up by x, because wires are shorter
  – actually less than x, because of power consumption
• Transistors per unit area goes up by x^2
• Die size also tends to increase
  – typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4!
  – of which x^3 is devoted either to parallelism or locality
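The scaling rules above can be sketched numerically; the shrink factor x = 2 below is a hypothetical example value, not from the slides:

```python
# Back-of-the-envelope device-shrink scaling from the slide.
x = 2.0  # hypothetical shrink factor

clock_gain = x              # shorter wires (in practice less, due to power)
density_gain = x ** 2       # transistors per unit area
die_gain = x                # die size tends to grow by ~x as well

raw_power_gain = clock_gain * density_gain * die_gain   # ~x^4 overall
print(raw_power_gain)            # 16.0 for x = 2
# Of that, ~x^3 (the transistor count) goes to parallelism or locality.
print(density_gain * die_gain)   # 8.0
```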

Page 9:

Microprocessor Transistors per Chip

[Figure: growth in transistors per chip, 1970-2005, on a log scale from 1,000 to 100,000,000; data points include the i4004, i8080, i8086, i80286, i80386, Pentium, R2000, R3000, and R10000. Companion plot: increase in clock rate (MHz), 1970-2000, log scale from 0.1 to 1000]

Page 10:

Limiting forces: Increased cost and difficulty of manufacturing

Page 11:

How fast can a serial computer be? (James Demmel)

• Consider a 1 Tflop/s sequential machine:
  – Data must travel some distance, r, to get from memory to CPU.
  – To get 1 data element per cycle, data must make that trip 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  – Each bit occupies about 1 square Angstrom, the size of a small atom.
• No choice but parallelism

[Figure: a 1 Tflop/s, 1 Tbyte sequential machine confined to radius r = 0.3 mm]
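The argument above can be checked with a few lines of arithmetic (an illustrative sketch, not from the slides):

```python
# Numerical check of the 1 Tflop/s serial-machine argument.
c = 3.0e8       # speed of light, m/s
rate = 1.0e12   # 1 Tflop/s -> one memory round a cycle, 10^12 per second

r = c / rate    # maximum memory-to-CPU distance, in metres
print(r * 1e3)  # ~0.3 mm

# Pack 1 TByte into an r x r square: area per bit in square Angstroms.
bits = 8.0e12                          # 1 TByte = 8 * 10^12 bits
area_m2 = r * r                        # 0.3 mm x 0.3 mm
per_bit_A2 = area_m2 / bits / 1e-20    # 1 square Angstrom = 10^-20 m^2
print(per_bit_A2)                      # ~1 square Angstrom per bit
```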

Page 12:

Storage: Locality and Parallelism

• Large memories are slow; fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches ($)
  – the slow accesses to “remote” data we call “communication”
• An algorithm should do most of its work on local data

[Figure: conventional storage hierarchy (processor with cache, L2 cache, L3 cache, memory), replicated per node, with potential interconnects between the nodes]

Page 13:

Processor-DRAM Gap (latency)

[Figure: performance vs. time, 1980-2000, log scale from 1 to 1000. CPU performance (“Moore’s Law”) grows at ~60%/yr while DRAM latency improves at only ~7%/yr; the processor-memory performance gap grows ~50%/year]

Page 14:

Storage Trends

• Divergence between memory capacity and speed is even more pronounced
  – Capacity increased by 1000x from 1980-95, speed only 2x
  – Gigabit DRAM by c. 2000, but the gap with processor speed grew much greater
• Larger memories are slower, while processors get faster
  – Need to transfer more data in parallel
  – Need deeper cache hierarchies
  – How to organize caches?
• Parallelism increases the effective size of each level of the hierarchy, without increasing access time
• Disks: parallel disks plus caching

Page 15:

Architectural Trends

• Resolve the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  => helps build intuition about design issues of parallel machines
  => shows the fundamental role of parallelism even in “sequential” computers

Page 16:

Phases in “VLSI” Generation

[Figure: transistors per chip, 1970-2005, log scale from 1,000 to 100,000,000, annotated with the successive phases bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?); data points include the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, R10000, and Pentium]

Page 17:

Architectural Trends

• The greatest trend across VLSI generations is the increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
    • slows after 32 bit
    • adoption of 64-bit now under way (Opteron, Itanium); 128-bit still far off (not a performance issue)
  – Mid 80s to mid 90s: instruction-level parallelism (ILP)
    • pipelining and simple instruction sets, plus compiler advances (RISC)
    • on-chip caches and functional units => superscalar execution
    • greater sophistication: out-of-order execution, speculation, prediction
  – Current step:
    • thread-level parallelism
    • multicore

Page 18:

Pipeline of a superscalar processor

[Figure: Instruction Fetch → Instruction Decode and Rename → Instruction Window → Issue → Reservation Stations → Execution units (several, in parallel) → Retire and Write Back. Fetch, decode and retirement proceed in-order; issue and execution are out-of-order]

Page 19:

How far will ILP go?

• Simulation for discovering the maximum available ILP, under idealized assumptions:
  – Infinite fetch bandwidth
  – Infinite functional units
  – Perfect branch prediction
  – Cache misses: 0 cycles

[Figure: left, fraction of total cycles (%) vs. number of instructions issued per cycle; right, speedup vs. instructions issued per cycle]

Page 20:

Multithreaded architectures

Page 21:

Multithreaded architectures

Examples: Pentium 4 Xeon, UltraSPARC T1 (32 & 64 threads), Itanium Montecito (also dual-core)

Page 22:

Multi-core

• Intel:
  – Dual-core Pentium Extreme Edition 840 (first)
  – Quad-core Xeon 5300
  – 80-core research chip capable of 1.28 Tflop/s
• AMD: dual-core Opteron, quad-core FX (3 GHz)
• Sun: Rock: 16 cores (due 2008)
• IBM: dual-core Power6 at 5 GHz

Page 23:

Alternative: Cell

• General-purpose Power Architecture core of modest performance
• Coprocessing elements for multimedia and vector-processing applications
• The PowerPC core controls 8 SPEs (Synergistic Processing Elements): SIMD
• Cache coherent
• 25.6 GB/s XDR memory controller

Page 24:

Alternative: Cell

• SPE: register/storage hierarchy
  – 128 registers of 128 bits, single-cycle access
  – 16k x 128-bit (256 KB) local store, 6-cycle access
• DMA in parallel with SIMD processing

Page 25:

Overview of Cell processor

Page 26:

Application Trends

• Demand for cycles fuels advances in hardware, and vice-versa
  – More performance enables new applications; new applications demand more performance
  – This cycle drives the exponential increase in microprocessor performance
  – It drives parallel architecture even harder: the most demanding applications
• Goal of applications in using parallel machines: Speedup

  Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time

  Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
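The fixed-problem-size formula above is easy to sketch in code; the timing numbers below are hypothetical, not from the slides:

```python
# Sketch of the fixed-problem-size speedup formula.
def speedup_fixed(t1: float, tp: float) -> float:
    """Speedup on p processors for a fixed problem size: Time(1) / Time(p)."""
    return t1 / tp

# Hypothetical example: 120 s serially, 20 s on 8 processors.
s = speedup_fixed(120.0, 20.0)
print(s)        # 6.0
print(s / 8)    # parallel efficiency: 0.75
```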

Page 27:

Improving the Speedup of Parallel Applications

• AMBER molecular dynamics simulation program
  – Motion of large biological models (proteins, DNA)
• 145 Mflop/s on a Cray C90; 406 Mflop/s for the final version on a 128-processor Paragon; 891 Mflop/s on a 128-processor Cray T3D
  – 8/94 version: optimized the communication
  – 9/94 version: optimized the load balance

Page 28:

Particularly Challenging Computations

• Science
  – Global climate modeling
  – Astrophysical modeling
  – Biology: genomics, protein folding, drug design
  – Computational chemistry
  – Computational materials science and nanoscience
• Engineering
  – Crash simulation
  – Semiconductor design
  – Earthquake and structural modeling
  – Computational fluid dynamics (airplane design)
  – Combustion (engine design)
• Business
  – Financial and economic modeling
  – Transaction processing, web services and search engines
• Defense
  – Nuclear weapons: testing by simulation
  – Cryptography

Page 29:

$5B Market in Technical Computing

[Figure: stacked 100% bar chart of the technical-computing market by application segment, 1998-2003. Segments: Other Technical Management and Support; Simulation; Scientific Research and R&D; Mechanical Design/Engineering Analysis; Mechanical Design and Drafting; Imaging; Geoscience and Geo-engineering; Electrical Design/Engineering Analysis; Economics/Financial; Digital Content Creation and Distribution; Classified Defense; Chemical Engineering; Biosciences]

Source: IDC 2004, from the USA’s National Research Council Future of Supercomputing report

Page 30:

Scientific Computing Demand

Page 31:

NRC report on Future of Supercomputing

• “In climate modeling or plasma physics, there is a broad consensus that up to seven orders of magnitude of performance improvements will be needed to achieve well-defined computational goals.”

Page 32:

What is Parallel Architecture?

• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:
  – Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  – Data access, communication and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  – Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?

Page 33:

Why Study Parallel Architecture?

Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.

Parallelism:
• Provides an alternative to a faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing

Page 35:

SISD

• One instruction stream

• One data stream

• One instruction issued on each clock cycle

• One instruction executed on one element of data (scalar) at a time

• Traditional von Neumann architecture

Page 36:

SIMD

• Also von Neumann architectures, but with more powerful instructions
• Each instruction may operate on more than one data element
• Usually an intermediate host executes the program logic and broadcasts instructions to the other processors
• Synchronous (lockstep)
• Rating how fast these machines can issue instructions is not a good measure of their performance
• Two major types:
  – Vector SIMD
  – Parallel SIMD

Page 37:

Vector SIMD

• A single instruction results in multiple operands being updated
• Scalar processing operates on single data elements; vector processing operates on whole vectors (groups) of data at a time
• Examples: Cell, Cray 1, NEC SX-2, Fujitsu VP, Hitachi S820

Page 38:

Parallel SIMD

• Several processors execute the same instruction in lockstep

• Each processor modifies a different element of data

• Drawback: idle processors• Advantage: no explicit synchronization required• Examples

– Connection Machine CM-2 – Maspar MP-1, MP-2

Page 39:

MIMD

• Several processors executing different instructions on different data
• Advantages:
  – different jobs can be performed at the same time
  – better utilization can be achieved
• Drawbacks:
  – explicit synchronization needed
  – difficult to program
• Examples:
  – MIMD via parallel SISD machines: Sequent, nCUBE, Intel iPSC/2, IBM RS6000 cluster, all clusters
  – MIMD via parallel SIMD machines: Cray C90, Cray 2, NEC SX-3, Fujitsu VP 2000, Convex C-2, Intel Paragon, CM-5, KSR-1, IBM SP1, IBM SP2
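The MIMD traits listed above can be illustrated with threads; this sketch (not from the slides) runs different instructions on different data and uses a lock for the explicit synchronization the slide mentions:

```python
# Illustration of MIMD-style execution: two workers run *different*
# code on *different* data; a lock synchronizes the shared result.
import threading

results = {}
lock = threading.Lock()

def summer(data):            # one "processor" sums its data
    s = sum(data)
    with lock:               # explicit synchronization
        results["sum"] = s

def maxer(data):             # another runs a different job on other data
    m = max(data)
    with lock:
        results["max"] = m

t1 = threading.Thread(target=summer, args=([1, 2, 3, 4],))
t2 = threading.Thread(target=maxer, args=([10, 7, 42],))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)   # {'sum': 10, 'max': 42} (key order may vary)
```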

Page 40:

2nd Classification: Memory Architectures

• Shared memory
  – UMA
  – NUMA
    • CC-NUMA
• Distributed memory
  – COMA

Page 41:

UMA (Uniform Memory Access)

[Figure: processors P1, P2, …, Pn connected through an interconnect to shared memory modules M1, M2, …, Mk]

Page 42:

NUMA (Non-Uniform Memory Access)

[Figure: processing elements PE1, …, PEn, each pairing a processor Pi with a local memory Mi, connected by an interconnect]

Page 43:

CC-NUMA (Cache-Coherent NUMA)

[Figure: as NUMA, but each processing element PEi adds a cache Ci between processor Pi and memory Mi; the caches are kept coherent across the interconnect]

Page 44:

Distributed memory

[Figure: processing elements PE1, …, PEn (processor Pi plus memory Mi) connected by an interconnect; no shared memory]

Page 45:

COMA (Cache-Only Memory Architecture)

[Figure: processing elements PE1, …, PEn, each a processor Pi with a cache Ci and no conventional main memory, connected by an interconnect]

Page 46:

Memory architecture

                          | Physically shared | Physically distributed
  ------------------------+-------------------+-----------------------
  Logically shared        | UMA               | NUMA
  Logically distributed   |                   | Distributed memory

• A shared logical view gives “easy” programming
• A distributed physical organization gives scalability (very important!)
• Logically shared but physically distributed (NUMA): the future!

Page 47:

Generic Parallel Architecture

[Figure: a node consisting of a processor P with cache ($), memory, and a communication assist (CA), attached to a scalable network]

• Node: processor(s), memory system, plus communication assist
  – network interface and communication controller
• Scalable network

Page 48:

Clusters and Cluster Computing

• Definition of a cluster:

  “A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.” [Buyya98]

• Communication infrastructure:
  – High-performance networks, faster than a traditional LAN (Myrinet, Infiniband, Gbit Ethernet)
  – Low-latency communication protocols
  – Loosely coupled compared to traditional proprietary supercomputers (e.g. IBM SP, Intel Paragon)

Page 49:

Cluster architecture

Page 50:

Clusters and Cluster Computing

• Cluster networks:
  – Ethernet (10 Mbps) (*), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps), ATM, Myrinet (1.2 Gbps), Fibre Channel, FDDI, Infiniband, etc.
• Cluster projects:
  – Beowulf (CalTech and NASA) - USA
  – Condor - University of Wisconsin-Madison, USA
  – DQS (Distributed Queuing System) - Florida State University, USA
  – HPVM (High Performance Virtual Machine) - UIUC, now UCSB, USA
  – far - University of Liverpool, UK
  – Gardens - Queensland University of Technology, Australia
  – Kerrighed - INRIA, France
  – MOSIX - Hebrew University of Jerusalem, Israel
  – NOW (Network of Workstations) - Berkeley, USA

Page 51:

What is a Grid?

• 1969, Len Kleinrock:
  “We will probably see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.”
• 1998, Kesselman & Foster:
  “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”
• 2000, Kesselman, Foster, Tuecke:
  “…coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.”

Page 52:

Grid vs. Cluster

• Cluster: a computer network typically dedicated 100% to executing a specific task
• Grid: computer networks distributed planet-wide that can be shared by means of resource-management software

Page 53:

Cluster computing vs. others

[Figure: paradigms ordered by distance between nodes. A chip: SM (shared-memory) parallel computing; a rack or a room: cluster computing; a building: distributed computing; the world: grid computing]

Page 54:

Top500

Page 55:

Generalities

• Published since 1993, twice a year: June and November
• Ranking of the most powerful computing systems in the world
• Ranking criterion: performance on the LINPACK benchmark
• Driven by Jack Dongarra
• Web site: www.top500.org

Page 56:

HPL: High-Performance Linpack

• Solves a dense system of linear equations
  – variant of LU factorization of matrices of size N
• Measure of a computer’s floating-point rate of execution
• Computation done in 64-bit floating-point arithmetic
• Rpeak: theoretical system performance
  – upper bound on the real performance (in Mflop/s)
  – Example: an Intel Itanium 2 at 1.5 GHz executing 4 FP operations per cycle -> 6 Gflop/s
• Nmax: obtained by varying N and choosing the size that gives maximum performance
• Rmax: maximum real performance, achieved for Nmax
• N1/2: problem size needed to achieve half of Rmax
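The Rpeak arithmetic in the Itanium 2 example works out as below; the helper function and the Rmax figure in the second print are illustrative, not from the slides:

```python
# Sketch of the Rpeak calculation: clock rate x FP operations per cycle.
def rpeak_gflops(clock_ghz: float, flops_per_cycle: int, cores: int = 1) -> float:
    """Theoretical peak performance in Gflop/s."""
    return clock_ghz * flops_per_cycle * cores

# Intel Itanium 2 at 1.5 GHz, 4 FP operations per cycle (from the slide):
print(rpeak_gflops(1.5, 4))          # 6.0 Gflop/s

# Rmax / Rpeak is the HPL efficiency; a machine reaching a hypothetical
# 4.5 Gflop/s on HPL would have efficiency 0.75.
print(4.5 / rpeak_gflops(1.5, 4))    # 0.75
```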

Page 57: (Jack Dongarra’s slide; figure not captured in the transcript)

Page 58: (end of presentation)