Computer Architecture II
Introduction
Today’s overview
• Why parallel computing?
  – Technology trends: processors, storage, architecture
  – Application trends
  – Challenging computational problems
• What is a parallel computer?
• Classical parallel computer classifications
  – Architecture
  – Memory access
• Cluster and grid computing (definitions)
• Top 500
• Parallel architectures and their convergence
Units of Measure in HPC
• High Performance Computing (HPC) units are:
  – Flops: floating point operations
  – Flop/s: floating point operations per second
  – Bytes: size of data (a double-precision floating point number is 8 bytes long)
• Typical sizes are millions, billions, trillions…

  Mega  Mflop/s = 10^6  flop/s   Mbyte = 10^6  bytes (also 2^20 = 1,048,576)
  Giga  Gflop/s = 10^9  flop/s   Gbyte = 10^9  bytes (also 2^30 = 1,073,741,824)
  Tera  Tflop/s = 10^12 flop/s   Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
  Peta  Pflop/s = 10^15 flop/s   Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
  Exa   Eflop/s = 10^18 flop/s   Ebyte = 10^18 bytes
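To make the flop/s unit concrete, here is a toy micro-benchmark (my own sketch, not from the slides; it assumes a POSIX system for clock_gettime and ignores warm-up and measurement noise) that times a daxpy-style loop and reports its rate in Mflop/s:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const long n = 10 * 1000 * 1000;          /* 10^7 elements */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++)
        y[i] += 3.0 * x[i];                   /* 1 multiply + 1 add = 2 flops */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f Mflop/s (check: y[0] = %g)\n",
           2.0 * n / secs / 1e6, y[0]);       /* Mflop/s = 10^6 flop/s */
    free(x); free(y);
    return 0;
}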
Why parallel computing?
• Sequential computer
  – von Neumann model
  – One processor
  – One memory
  – One instruction executed at a time
  – Fastest machines: a few billion operations per second (GFLOPS)

[Figure: the von Neumann model: a processor (control unit + arithmetic logic unit) attached through connecting logic to the memory and the I/O system.]
Tunnel Vision by Experts
• “I think there is a world market for maybe five computers.”
  – Thomas Watson, chairman of IBM, 1943
• “There is no reason for any individual to have a computer in their home.”
  – Ken Olson, president and founder of Digital Equipment Corporation, 1977
• “640K [of memory] ought to be enough for anybody.”
  – Bill Gates, chairman of Microsoft, 1981
Slide source: Warfield et al.
Technology Trends: Microprocessor Capacity

Moore’s Law: 2x transistors per chip every 1.5 years.

Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. (Slide source: Jack Dongarra)
Impact of Device Shrinkage
• What happens when the transistor size shrinks by a factor of x?
• Clock rate goes up by x because wires are shorter
  – actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
  – typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4 (see the arithmetic below)
  – of which x^3 is devoted either to parallelism or to locality
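Written out as a back-of-the-envelope product (my gloss on the list above, not an exact model):

\[
\underbrace{x}_{\text{clock rate}}
\times \underbrace{x^{2}}_{\text{transistors per unit area}}
\times \underbrace{x}_{\text{die area growth}}
\;=\; x^{4}
\qquad \text{(raw operations per second per chip).}
\]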
Microprocessor Transistors per Chip
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year, 1970–2005, marking the i4004, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000. Companion plot: growth in clock rate (MHz, 0.1 to 1000) over 1970–2000.]
Limiting forces: Increased cost and difficulty of manufacturing
How fast can a serial computer be? (James Demmel)
• Consider a 1 Tflop/s sequential machine:
  – Data must travel some distance, r, to get from memory to CPU.
  – To get 1 data element per cycle, data must make that trip 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  – Each bit then occupies about 1 square Angstrom, the size of a small atom.
• No choice but parallelism

[Figure: a 1 Tflop/s, 1 Tbyte sequential machine squeezed into a disc of radius r = 0.3 mm.]
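The two steps of the argument, worked out (the per-bit arithmetic is my reconstruction of the slide’s numbers):

\[
r \;<\; \frac{c}{10^{12}\,\mathrm{s}^{-1}}
  \;=\; \frac{3\times 10^{8}\,\mathrm{m/s}}{10^{12}\,\mathrm{s}^{-1}}
  \;=\; 3\times 10^{-4}\,\mathrm{m} \;=\; 0.3\,\mathrm{mm}
\]
\[
\frac{(3\times 10^{-4}\,\mathrm{m})^{2}}{8\times 10^{12}\ \mathrm{bits}}
  \;\approx\; 1.1\times 10^{-20}\,\mathrm{m}^{2}
  \;\approx\; 1\ \text{\AA}^{2}\ \text{per bit.}
\]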
Storage: Locality and Parallelism
• Large memories are slow; fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches ($)
  – the slow accesses to “remote” data are what we call “communication”
• An algorithm should do most of its work on local data (see the sketch after the figure)
[Figure: a conventional storage hierarchy (processor and cache, L2 cache, L3 cache, memory) next to a parallel machine built from several such hierarchies, with potential interconnects between the nodes.]
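One way to see the “work on local data” rule even within a single address space (my own illustration, not from the slides): summing the same row-major N x N matrix with unit stride keeps whole cache lines useful, while striding by N wastes them.

#include <stdio.h>
#include <stdlib.h>

#define N 4096                         /* 4096 x 4096 doubles = 128 MB */

/* Unit stride: consecutive elements share cache lines. */
static double sum_row_major(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[(size_t)i * N + j];
    return s;
}

/* Stride N: nearly every access touches a new cache line. */
static double sum_col_major(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[(size_t)i * N + j];
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t k = 0; k < (size_t)N * N; k++) a[k] = 1.0;
    /* Same result, very different memory behavior; time the two calls to see it. */
    printf("%.0f %.0f\n", sum_row_major(a), sum_col_major(a));
    free(a);
    return 0;
}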
Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) vs. year, 1980–2000. CPU performance (“Moore’s Law”) improves ~60%/yr while DRAM latency improves ~7%/yr, so the processor-memory performance gap grows ~50% per year.]
Storage Trends
• The divergence between memory capacity and speed is even more pronounced
  – Capacity increased 1000x from 1980–95; speed increased only 2x
  – Gigabit DRAM by c. 2000, but the gap with processor speed grew much greater
• Larger memories are slower, while processors get faster
  – Need to transfer more data in parallel
  – Need deeper cache hierarchies
  – How to organize caches?
• Parallelism increases the effective size of each level of the hierarchy without increasing access time
• Disks: parallel disks plus caching
Architectural Trends
• Resolve the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  => helps build intuition about design issues of parallel machines
  => shows the fundamental role of parallelism even in “sequential” computers
Phases in “VLSI” Generation

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year, 1970–2005, annotated with the dominant form of parallelism in each phase: bit-level parallelism (i4004, i8008, i8080, i8086, i80286, i80386), instruction-level parallelism (R2000, R3000, Pentium, R10000), thread-level parallelism (?).]
Architectural Trends
• The greatest trend across VLSI generations is the increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
    • slows after 32 bit
    • adoption of 64-bit now under way (Opteron, Itanium); 128-bit still far off (not a performance issue)
  – Mid 80s to mid 90s: instruction-level parallelism (ILP)
    • pipelining and simple instruction sets, plus compiler advances (RISC)
    • on-chip caches and functional units => superscalar execution
    • greater sophistication: out-of-order execution, speculation, prediction
  – Current step:
    • thread-level parallelism
    • multicore
Pipeline of a superscalar processor

[Figure: Instruction Fetch -> Instruction Decode and Rename -> Instruction Window -> Issue -> Reservation Stations -> Execution units (several in parallel) -> Retire and Write Back. Fetch, decode, and rename proceed in order; issue and execution proceed out of order; retirement is again in order.]
[Figure: left panel, fraction of total cycles (%) (0 to 30) vs. number of instructions issued (0 to 6+); right panel, speedup (0 to 3) vs. instructions issued per cycle (0 to 15).]
How far will ILP go?
• Simulation for discovering the maximum available ILP, under ideal assumptions:
  – Infinite fetch bandwidth
  – Infinite functional units
  – Perfect branch prediction
  – Cache misses: 0 cycles
Multithreaded architectures
Multithreaded architectures
• Examples: Pentium 4 Xeon, UltraSPARC T1 (32 and 64 threads), Itanium Montecito (also dual-core)
Multi-core
• Intel:
  – Dual-core Pentium Extreme Edition 840 (the first)
  – Quad-core Xeon 5300
  – 80-core research chip capable of 1.28 Tflop/s
• AMD: dual-core Opteron, quad-core FX (3 GHz)
• Sun: Rock, 16 cores (due 2008)
• IBM: Power6, 2 cores at 5 GHz
Alternative: Cell
• A general-purpose Power Architecture core of modest performance
• Coprocessing elements for multimedia and vector-processing applications
• The PowerPC core controls 8 SPEs (Synergistic Processing Elements): SIMD
• Cache coherent
• 25.6 GB/s XDR memory controller
Alternative: Cell
• SPE: register hierarchy
  – 128 x 128-bit single-cycle registers
  – 16K x 128-bit, 6-cycle registers (the local store)
• DMA in parallel with SIMD processing
Overview of Cell processor
Application Trends
• Demand for cycles fuels advances in hardware, and vice versa
  – This cycle drives the exponential increase in microprocessor performance
  – It also drives parallel architecture harder: the most demanding applications
• Goal of applications in using parallel machines: speedup

  Speedup(p processors) = Performance(p processors) / Performance(1 processor)

• For a fixed problem size (input data set), performance = 1/time:

  Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)

  (a worked example follows below)

[Diagram: feedback loop between “More Performance” and “New Applications”.]
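A worked instance with invented numbers (mine, for illustration): a fixed-size problem taking 100 s on one processor and 8 s on 16 processors gives

\[
\text{Speedup}(16) \;=\; \frac{\text{Time}(1)}{\text{Time}(16)} \;=\; \frac{100\,\mathrm{s}}{8\,\mathrm{s}} \;=\; 12.5,
\qquad
\text{Efficiency} \;=\; \frac{12.5}{16} \;\approx\; 0.78.
\]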
Improving the speedup of Parallel Applications
• AMBER molecular dynamics simulation program
  – Motion of large biological models (proteins, DNA)
• 145 Mflop/s on the Cray C90; 406 Mflop/s for the final version on a 128-processor Intel Paragon; 891 Mflop/s on a 128-processor Cray T3D
• 8/94: optimized the communication
• 9/94: optimized the load balance
Particularly Challenging Computations
• Science
  – Global climate modeling
  – Astrophysical modeling
  – Biology: genomics, protein folding, drug design
  – Computational chemistry
  – Computational material sciences and nanosciences
• Engineering
  – Crash simulation
  – Semiconductor design
  – Earthquake and structural modeling
  – Computational fluid dynamics (airplane design)
  – Combustion (engine design)
• Business
  – Financial and economic modeling
  – Transaction processing, web services and search engines
• Defense
  – Nuclear weapons: test by simulations
  – Cryptography
$5B Market in Technical Computing
[Figure: 100% stacked bars of the technical computing market by application segment, 1998–2003. Segments: Biosciences; Chemical Engineering; Classified Defense; Digital Content Creation and Distribution; Economics/Financial; Electrical Design/Engineering Analysis; Geoscience and Geo-engineering; Imaging; Mechanical Design and Drafting; Mechanical Design/Engineering Analysis; Scientific Research and R&D; Simulation; Technical Management and Support; Other.]

Source: IDC 2004, from the US National Research Council’s Future of Supercomputing report.
Scientific Computing Demand
NRC report on Future of Supercomputing
• “In climate modeling or plasma physics, there is a broad consensus that up to seven orders of magnitude of performance improvements will be needed to achieve well-defined computational goals.”
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:
  – Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  – Data access, communication and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  – Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?
Why Study Parallel Architecture?

Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.

Parallelism:
• Provides an alternative to a faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
1st Architecture classification
• There are several different methods used to classify computers
• No single taxonomy fits all designs
• Flynn’s taxonomy uses the relationship of program instructions to program data:
  – SISD: Single Instruction, Single Data stream
  – SIMD: Single Instruction, Multiple Data streams
  – MISD: Multiple Instruction, Single Data stream (no practical examples)
  – MIMD: Multiple Instruction, Multiple Data streams
SISD
• One instruction stream
• One data stream
• One instruction issued on each clock cycle
• One instruction executed on one element of data (scalar) at a time
• Traditional von Neumann architecture
SIMD
• Also von Neumann architectures, but with more powerful instructions
• Each instruction may operate on more than one data element
• Usually an intermediate host executes the program logic and broadcasts instructions to the other processors
• Synchronous (lockstep)
• Rating how fast these machines can issue instructions is not a good measure of their performance
• Two major types:
  – Vector SIMD
  – Parallel SIMD
Vector SIMD
• A single instruction results in multiple operands being updated
• Scalar processing operates on single data elements; vector processing operates on whole vectors (groups) of data at a time
• Examples: Cell, Cray 1, NEC SX-2, Fujitsu VP, Hitachi S820 (a modern sketch follows below)
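A minimal modern sketch of the same idea (my own illustration, using x86 SSE rather than any of the machines above; compile with -msse): one instruction updates four single-precision operands at once.

#include <stdio.h>
#include <xmmintrin.h>                      /* SSE intrinsics */

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* Each _mm_add_ps is one instruction performing four additions. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);    /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++) printf("%g ", c[i]);
    printf("\n");
    return 0;
}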
Parallel SIMD
• Several processors execute the same instruction in lockstep
• Each processor modifies a different element of data
• Drawback: idle processors
• Advantage: no explicit synchronization required
• Examples:
  – Connection Machine CM-2
  – MasPar MP-1, MP-2
MIMD
• Several processors execute different instructions on different data
• Advantages:
  – Different jobs can be performed at the same time
  – Better utilization can be achieved
• Drawbacks:
  – Explicit synchronization needed
  – Difficult to program
• Examples (a minimal sketch follows below):
  – MIMD via parallel SISD machines: Sequent, nCUBE, Intel iPSC/2, IBM RS6000 cluster, all clusters
  – MIMD via parallel SIMD machines: Cray C90, Cray 2, NEC SX-3, Fujitsu VP 2000, Convex C-2, Intel Paragon, CM-5, KSR-1, IBM SP1, IBM SP2
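A minimal MIMD sketch (my illustration, assuming a POSIX system; compile with -pthread): two threads run different instruction streams on different data at the same time, then join, which is exactly the explicit synchronization the list mentions.

#include <pthread.h>
#include <stdio.h>

static double dot;    /* written only by stream_dot   */
static long   len;    /* written only by stream_count */

/* Instruction stream 1: a small dot product. */
static void *stream_dot(void *arg) {
    (void)arg;
    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    for (int i = 0; i < 4; i++) dot += x[i] * y[i];
    return NULL;
}

/* Instruction stream 2: completely different work on different data. */
static void *stream_count(void *arg) {
    (void)arg;
    for (const char *p = "parallel"; *p; p++) len++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stream_dot, NULL);
    pthread_create(&t2, NULL, stream_count, NULL);
    pthread_join(t1, NULL);   /* the explicit synchronization point */
    pthread_join(t2, NULL);
    printf("dot = %g, len = %ld\n", dot, len);
    return 0;
}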
2nd Classification: Memory architectures
• Shared memory (a programming sketch follows below)
  – UMA
  – NUMA
    • CC-NUMA
• Distributed memory
  – COMA
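As a taste of what the shared (logical) view buys the programmer (my own sketch, using OpenMP; UMA and NUMA differ in access latency, not in this programming model; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];                /* one array in the shared address space */
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* threads split the iterations */
    for (int i = 0; i < N; i++)
        sum += a[i];                   /* every thread reads the same a[] */

    printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}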
UMA (Uniform Memory Access)
[Figure: processors P1, P2, …, Pn and memory modules M1, M2, …, Mk all attached to a single interconnect; every processor reaches every memory module with the same latency.]
NUMA (Non Uniform Memory Access)
[Figure: processing elements PE1, …, PEn, each pairing a processor Pi with a local memory Mi, attached to an interconnect; local memory is faster to reach than remote memory.]
CC-NUMA (Cache Coherent NUMA)
[Figure: as NUMA, but each processing element PEi also holds a cache Ci between its processor Pi and memory Mi, kept coherent across the machine.]
Distributed memory
[Figure: processing elements PE1, …, PEn, each a processor Pi with a private memory Mi, connected only through the interconnect; cooperation happens by message passing.]
COMA (Cache Only Machine)
[Figure: processing elements PE1, …, PEn, each a processor Pi with only a cache Ci and no conventional main memory; all storage behaves as cache.]
Memory architecture

                              Physical view: shared     Physical view: distributed
  Logical view: shared        UMA                       NUMA (the future!)
  Logical view: distributed   -                         Distributed memory

  The shared logical view is what makes programming “easy” (very important!); the distributed physical view is what gives scalability.
Generic Parallel Architecture

[Figure: each node combines a processor P with its cache ($), a memory (Mem), and a communication assist (CA), attached to a scalable network.]

• Node: processor(s), memory system, plus communication assist
  – Network interface and communication controller
• Scalable network
Clusters and Cluster Computing
• Definition of a cluster:

  “A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.” [Buyya98]

• Communication infrastructure:
  – High-performance networks, faster than traditional LANs (Myrinet, InfiniBand, Gbit Ethernet)
  – Low-latency communication protocols
  – Loosely coupled compared to traditional proprietary supercomputers (e.g. IBM SP, Intel Paragon)

(a minimal message-passing sketch follows below)
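A minimal message-passing sketch for such a machine (mine, using standard MPI calls; build with mpicc and run with mpirun -np 2): each node runs its own copy of the program, and data crosses the cluster network explicitly.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 1) {                     /* needs at least 2 processes */
        double x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        double x;
        MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 of %d received %g over the interconnect\n", size, x);
    }

    MPI_Finalize();
    return 0;
}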
Cluster architecture
Clusters and Cluster Computing
• Cluster networks:
  – Ethernet (10 Mbps) (*), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps), ATM, Myrinet (1.2 Gbps), Fibre Channel, FDDI, InfiniBand, etc.
• Cluster projects:
  – Beowulf (CalTech and NASA), USA
  – Condor, University of Wisconsin, USA
  – DQS (Distributed Queuing System), Florida State University, USA
  – HPVM (High Performance Virtual Machine), UIUC and now UCSB, USA
  – far, University of Liverpool, UK
  – Gardens, Queensland University of Technology, Australia
  – Kerrighed, INRIA, France
  – MOSIX, Hebrew University of Jerusalem, Israel
  – NOW (Network of Workstations), Berkeley, USA
What is a Grid?
• 1969, Len Kleinrock:
  “We will probably see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.”
• 1998, Kesselman & Foster:
  “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”
• 2000, Kesselman, Foster, Tuecke:
  “…coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.”
GRID vs. Cluster
• Cluster: a computer network typically dedicated 100% to executing a specific task
• GRID: computer networks distributed planet-wide that can be shared by means of resource-management software
Cluster computing vs. others

[Figure: computing paradigms arranged by distance between nodes: a chip (shared-memory parallel computing), a rack and a room (cluster computing), a building and the world (distributed and grid computing).]
Top500
Generalities
• Published since 1993, twice a year: June and November
• A ranking of the most powerful computing systems in the world
• Ranking criterion: performance on the LINPACK benchmark
• Jack Dongarra is its driving force
• Web site: www.top500.org
HPL: High-Performance Linpack
• Solves a dense system of linear equations
  – A variant of LU factorization on matrices of size N
• A measure of a computer’s floating-point rate of execution
• Computation is done in 64-bit floating point arithmetic
• Rpeak: theoretical system performance
  – An upper bound for the real performance (in Mflop/s)
  – Example: Intel Itanium 2 at 1.5 GHz, 4 flops/cycle -> 6 Gflop/s (worked out below)
• Nmax: obtained by varying N and choosing the size of maximum performance
• Rmax: maximum real performance, achieved for Nmax
• N1/2: size of problem needed to achieve 1/2 of Rmax
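The Rpeak example, written out:

\[
R_{\text{peak}} \;=\; 4\ \tfrac{\text{flops}}{\text{cycle}} \times 1.5\times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \;=\; 6\ \text{Gflop/s per processor},
\qquad R_{\max} \le R_{\text{peak}}.
\]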
[Figure-only slide; source: Jack Dongarra.]