SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong ([email protected])
Why Parallel Computing?
Jinkyu Jeong ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
What is Parallel Computing?
• Software with multiple threads
• Parallel vs. concurrent
  – Parallel computing executes multiple threads at the same time on multiple processors – performance
  – Concurrent computing can either alternate multiple threads on a single processor or execute them in parallel on multiple processors – convenient design
What is Parallel Architecture?
• Machines with multiple processors
Intel Quad Core i7 (Nehalem), Sony Playstation 3/4, IBM Blue Gene
Google Data Center Locations
Parallel Computing vs. Distributed Computing
Facebook Data Center
What is Parallel Architecture?
• One definition
  – A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some general issues
  – Resource allocation
  – Data access, communication, and synchronization
  – Performance and scalability
Why Study Parallel Computing?
• 10 years ago
  – Some important applications demanded high performance
  – CPU clock frequency scaling alone was not enough
  – Parallel computing provided higher performance
• Today
  – Many interesting applications demand high performance
  – CPU clock rates are no longer increasing!
  – Instruction-level parallelism is not increasing either!
  – Parallel computing is the only way to achieve higher performance in the foreseeable future
Why Study Parallel Architecture?
• Role of a computer architect
  – To design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost
• Parallelism
  – Provides an alternative to a faster clock for performance
  – Applies at all levels of system design
    • Bits, instructions, loops, threads, …
Why Parallel Computing?
• Application demands
• Architectural Trends
• The impact of technology and power
Application Trends
• A positive feedback cycle between the two
  – Delivered performance
  – Applications' demand for performance
• Example application domains
  – Scientific computing: computational fluid dynamics, biology, meteorology, …
  – General-purpose computing: video, graphics, …
  – Commercial computing: databases, …
More Performance ⇄ New Applications
Speedup
• Goal of applications in using parallel machines
  – Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
  – Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
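Both ratios are one-liners in code; a minimal sketch (the function names are illustrative, not from the slides):

```python
def speedup(perf_1, perf_p):
    """Speedup = Performance(p processors) / Performance(1 processor)."""
    return perf_p / perf_1

def speedup_fixed_problem(time_1, time_p):
    """For a fixed problem size, performance = 1/time, so the ratio flips:
    Speedup = Time(1 processor) / Time(p processors)."""
    return time_1 / time_p

# A program taking 120 s on 1 processor and 30 s on 8 processors
# achieves a 4x speedup (parallel efficiency 4/8 = 50%).
print(speedup_fixed_problem(120.0, 30.0))
```

Note the two definitions agree: feeding performance = 1/time into the first formula gives the second.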
Scientific Computing Demand
Example: N-body Simulation
• Simulation of a dynamical system of particles
  – N particles (location, mass, velocity)
  – Influence of physical forces (e.g., gravity)
  – Physics, astronomy, …
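The heart of such a simulation is an all-pairs force calculation followed by a position/velocity update. A toy sketch of one 2-D gravity time step (the function, units, and softening constant are illustrative assumptions, not from the slides):

```python
def nbody_step(pos, vel, mass, dt, G=1.0, eps=1e-3):
    """One time step for N particles under mutual gravity (2-D).
    pos, vel: lists of [x, y]; mass: list of floats.
    The all-pairs force loop is O(N^2): this is the part parallel
    machines accelerate by splitting particles across processors."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):                      # force on particle i ...
        for j in range(n):                  # ... from every particle j
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps * eps   # softened to avoid r = 0
            inv_r3 = r2 ** -1.5
            acc[i][0] += G * mass[j] * dx * inv_r3
            acc[i][1] += G * mass[j] * dy * inv_r3
    for i in range(n):                      # simple Euler update
        vel[i][0] += acc[i][0] * dt
        vel[i][1] += acc[i][1] * dt
        pos[i][0] += vel[i][0] * dt
        pos[i][1] += vel[i][1] * dt
    return pos, vel

# Two unit masses a unit distance apart drift toward each other.
p, v = nbody_step([[0.0, 0.0], [1.0, 0.0]],
                  [[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0], dt=0.01)
```

Because each particle's acceleration depends only on the previous positions, the outer loop parallelizes naturally.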
Example: Graph Computation
• Google page-rank algorithm
  – Determines the importance of a web page by counting the number and quality of links to the page
• Relative importance of a web page u
  – Repeat the rank update until it stabilizes
• Trillions of pages on the Web
Credit: Wikipedia
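The usual formulation of the update is PR(u) = (1 − d)/N + d · Σ PR(v)/outdeg(v) over the pages v linking to u, repeated until stable. A toy sketch on a three-page link graph (the damping factor d = 0.85 is the conventional choice, not stated on the slide):

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict page -> list of pages it links to.
    Repeats the standard rank update until (approximately) stabilized:
    PR(u) = (1 - d)/N + d * sum over v linking to u of PR(v)/outdeg(v)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        pr = {
            u: (1 - d) / n
               + d * sum(pr[v] / len(links[v]) for v in pages if u in links[v])
            for u in pages
        }
    return pr

# Toy web: "a" is linked to by both other pages, so it ranks highest.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

On a trillion-page graph the same update becomes a giant sparse matrix-vector product per iteration, which is why it is a flagship parallel workload.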
Engineering Computing
• Large parallel machines are mainstays in many industries
  – Petroleum – reservoir analysis
  – Automotive – crash simulation, drag analysis, combustion efficiency
  – Aeronautics – airflow analysis, engine efficiency, structural mechanics, electromagnetism
  – Computer-aided design – VLSI layout
  – Pharmaceuticals – molecular modeling
  – Visualization – all of the above, entertainment, architecture
  – Financial modeling – yield and derivative analysis
  – …
Commercial Computing
• Relies on parallelism for the high end
  – Computational power determines the scale of business that can be handled
  – e.g., databases, online transaction processing, decision support, data mining, data warehousing, …
• TPC benchmarks
  – TPC-C (order entry), TPC-D (decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size is no longer fixed as p increases, so throughput is used as the performance measure, in transactions per minute (tpm)
Commercial Computing
• Hardware threads in the Top-10 TPC-C machines
Why Parallel Computing?
• Application demands
• Architectural Trends
• The impact of technology and power
Architectural Trends
• Architecture translates technology’s gifts into performance and capability
• Four generations of architectural history
  – Tube, transistor, IC, VLSI
  – The greatest delineation in the VLSI generation has been in the type of parallelism exploited
Moore’s Law
• 1965: the number of transistors/inch² doubles every 1.5 years

Gordon Moore, co-founder of Intel Corp.
Credit: shigeru23 (CC-BY-SA-3.0)
Evolution of Intel Microprocessors
Intel 8088, 1978: 1 core, no cache, 29K transistors
Intel Nehalem-EX, 2009: 8 cores, 24MB cache, 2.3B transistors
Intel Knights Landing, 2016(?): 72 cores, 16GB DDR3 LLC(?), 7.1B transistors
Figure credits: ExtremeTech; The future of microprocessors, Communications of the ACM, vol. 54, no. 5
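These counts roughly track the doubling trend. A quick sanity check, using 2 years per doubling (Moore's 1975 revision, rather than the 1.5 years quoted on the previous slide; the helper function is illustrative):

```python
def project_transistors(start_count, start_year, end_year, doubling_years=2.0):
    """Project a transistor count forward assuming exponential doubling."""
    doublings = (end_year - start_year) / doubling_years
    return start_count * 2.0 ** doublings

# Intel 8088 (1978): 29K transistors. Projecting to 2009 gives ~1.3 billion;
# Nehalem-EX actually shipped with ~2.3B - the same order of magnitude.
projected = project_transistors(29_000, 1978, 2009)
```

A 1.5-year doubling period would overshoot by more than an order of magnitude over 31 years, which is why the 2-year figure is the one usually quoted for this era.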
Transistors & Clock Frequency
Credit: The future of computing performance: game over or next level?, The National Academies Press, 2011
Impact of Device Shrinkage
• What happens when the transistor size shrinks by a factor of x?
• Clock rate increases by x because wires are shorter
  – Actually less than x, because of power consumption
• Transistors per unit area increase by x²
• Die size also tends to increase
  – Typically by another factor of ~x
• Raw computing power of the chip goes up by ~x⁴!
  – Typically, the x³ part is devoted to either on-chip
    • Parallelism: hidden parallelism such as ILP
    • Locality: caches
• So most programs run x³ times faster, without changing them
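The factors above can be tabulated directly. A toy sketch with an example shrink factor of x = 2 (the function and dictionary keys are mine, and the model ignores the power caveat noted above):

```python
def shrink_effects(x):
    """Idealized consequences of shrinking transistor size by a factor x."""
    return {
        "clock_rate": x,      # shorter wires -> up to x faster clock
        "density": x ** 2,    # transistors per unit area
        "die_area": x,        # die size tends to grow by ~x as well
        "raw_power": x ** 4,  # clock * density * die area
    }

effects = shrink_effects(2)
# clock 2x, density 4x, die 2x => raw computing power ~16x, of which
# the ~x^3 = 8x part historically went to ILP tricks and caches.
```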
Architectural Trends
• Greatest trend in the VLSI generation is an increase in parallelism
  – Up to 1985: bit-level parallelism
    • 4-bit → 8-bit → 16-bit; slows after 32-bit
    • Adoption of 64-bit now under way; 128-bit far off (not a performance issue)
  – Mid 80s to mid 90s: instruction-level parallelism
    • Pipelining and simple instruction sets + compiler advances (RISC)
    • On-chip caches and functional units ⇒ superscalar execution
    • Greater sophistication: out-of-order execution, speculation, prediction
      – To deal with control transfer and latency problems
  – Now: thread-level parallelism
    • Multicore processors
[Figure: Phases in the "VLSI" generation. Transistors per chip (1,000 to 100,000,000, log scale) vs. year (1970-2005), showing three phases: bit-level parallelism (4 bits → 8 bits → 16 bits → 32 bits; i4004, i8008, i8080, i8086, i80286, i80386), instruction-level parallelism (pipelining, superscalar, branch prediction, out-of-order execution, VLIW; R2000, R3000, Pentium, R10000), and thread-level parallelism (?) (multicore/manycore, heterogeneous multicore, GPGPU).]
Superscalar Execution
• Exploits instruction-level parallelism
  – With N pipelines, N instructions can be issued (executed) in parallel

2-way superscalar execution pipeline
1. Load  R1, @1000
2. Load  R2, @1008
3. Add   R1, @1004
4. Add   R2, @100C
5. Add   R1, R2
6. Store R1, @2000
Out-of-order Execution
• Exploits instruction-level parallelism
  – With N pipelines, N instructions can be issued (executed) in parallel
  – Reorders instructions

2-way superscalar execution pipeline (arrows in the original figure mark dependencies)
1. Load  R1, @1000
2. Load  R2, @1008
3. Add   R1, @1004
4. Add   R2, @100C
5. Add   R1, R2
6. Store R1, @2000
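Which of the six instructions can share a cycle follows mechanically from the register dependences. A toy 2-way in-order issue model (a sketch under strong simplifications: register dependences only, one-cycle result latency, no memory or structural hazards; the helper names are mine):

```python
def schedule(instrs, width=2):
    """instrs: list of (reads, writes) register sets, in program order.
    Greedy in-order issue, up to `width` instructions per cycle.
    An instruction issues one cycle after every earlier instruction
    that writes a register it reads or writes (RAW/WAW dependences).
    Returns the issue cycle of each instruction."""
    cycle_of = []
    for i, (reads, writes) in enumerate(instrs):
        earliest = 1
        for j in range(i):
            _, w_j = instrs[j]
            if w_j & (reads | writes):            # dependence on j
                earliest = max(earliest, cycle_of[j] + 1)
        if cycle_of:
            earliest = max(earliest, cycle_of[-1])  # in-order issue
        while sum(1 for c in cycle_of if c == earliest) >= width:
            earliest += 1                         # this cycle is full
        cycle_of.append(earliest)
    return cycle_of

# The slide's sequence; memory operands are modeled as pseudo-registers.
seq = [
    ({"m1000"}, {"R1"}),     # 1. Load  R1, @1000
    ({"m1008"}, {"R2"}),     # 2. Load  R2, @1008
    ({"R1"}, {"R1"}),        # 3. Add   R1, @1004
    ({"R2"}, {"R2"}),        # 4. Add   R2, @100C
    ({"R1", "R2"}, {"R1"}),  # 5. Add   R1, R2
    ({"R1"}, set()),         # 6. Store R1, @2000
]
```

Under this model the two loads pair up, the two adds pair up, but instructions 5 and 6 must each wait a cycle: six instructions finish issuing in four cycles, not three.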
Architecture-driven Advances
Credit: The Future of Microprocessors, Communications of the ACM, 2011
Limitation in ILP
• Speedup begins to saturate after an issue width of 4
  – Assumes an ideal machine
    • Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – But with realistic memory
    • Real caches and non-zero miss latencies
[Figure: left, fraction of total cycles (%) vs. number of instructions issued (0 to 6+); right, speedup (0 to 3) vs. instructions issued per cycle (issue width, 0 to 15), saturating around an issue width of 4.]
Instruction-level Parallelism Concerns
• Issue waste
[Figure: issue slots (4-way) vs. cycles, with full and empty issue slots; horizontal waste = 9 slots, vertical waste = 12 slots. Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995]
• Contributing factors
  • Instruction dependencies
  • Long-latency operations within a thread
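Both kinds of waste can be counted directly from an issue-slot grid. A toy sketch (the example grid below is made up for illustration; it is not the one in the figure):

```python
def count_waste(grid):
    """grid: list of cycles, each a list of booleans (True = full slot)
    for a fixed-width issue machine.
    Horizontal waste: empty slots in cycles where something issued.
    Vertical waste: empty slots in cycles where nothing issued."""
    horizontal = vertical = 0
    for cycle in grid:
        empty = cycle.count(False)
        if any(cycle):
            horizontal += empty
        else:
            vertical += empty
    return horizontal, vertical

# A 4-way machine over 5 cycles: two fully idle cycles,
# three partially filled ones.
grid = [
    [True, True, False, False],    # 2 slots of horizontal waste
    [False, False, False, False],  # fully idle: 4 slots of vertical waste
    [True, False, False, False],   # 3 horizontal
    [False, False, False, False],  # 4 vertical
    [True, True, True, False],     # 1 horizontal
]
```

The distinction matters because the remedies differ: vertical waste calls for another thread to fill whole idle cycles, while horizontal waste calls for mixing instructions from several threads within one cycle, which is what SMT does.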
Some Sources of Wasted Issue Slots
• Memory hierarchy
  – TLB miss
  – Instruction cache miss
  – Data cache miss
  – Load delays (L1 hit latency)
• Control flow
  – Branch misprediction
• Instruction stream
  – Instruction dependency
  – Memory conflict
Simulation of 8-issue Superscalar Processor
Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995
Highly underutilized processor cycles
• In most SPEC92 applications
• On average < 1.5 IPC (19%)
• The dominant source of waste differs by application
Single-Thread Performance Curve
The rate of single-thread performance improvement has decreased
Technology-driven
Architecture-driven
[Computer Architecture, Hennessy & Patterson]
Single-chip Multiprocessor (Multicore)
• Limited instruction-level parallelism
• → Exploits thread-level parallelism instead

6-way superscalar processor vs. 4 × 2-way multiprocessor
30% area
Credit: K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," ASPLOS 1996
Why Parallel Computing?
• Application demands
• Architectural Trends
• The impact of technology and power
Desperate Cooling?
Cooling
• Power delivery, packaging, and cooling costs
  – Increased cost of thermal packaging
    • $1 per watt for CPUs above 35W [Tiwari, DAC98]
  – At the high end, 1W of cooling for every 1W of power
• Cooling
  – Physical solutions
    • Heatsink, thermal paste, fan
  – APM (Advanced Power Management)
    • BIOS manages clock speed
    • Triggered by thermal diodes
  – ACPI (Advanced Configuration and Power Interface)
    • OS manages power for all devices
    • Idle, nap, sleep modes
Power Consumption Trend
• Reasons for the power trends
  – Smaller feature size
  – Increased compaction
  – Higher clock frequency
  – More complicated processors
Year           | 2000     | 2005     | 2010  | 2015
Process        | 180nm    | 90nm     | 55nm  | 3nm
Transistors    | 0.2B     | 1.2B     | 7.8B  | 50B
Supply voltage | 1.8-1.5V | 1.2-0.9V | 0.7V  | 0.5V
Frequency      | 1GHz     | 3GHz     | 10GHz | 30GHz
Power          | 100W     | 175W     | 300W  | 540W
Source: Intel (2002)
[Figure: power density (W/cm²), log scale from 1 to 10,000, vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, with reference lines for a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Shenkar Bokar, Intel]
Power Density Limits Serial Performance
• Single-core processor
  – Dynamic power = V²fC
    • V = supply voltage, f = frequency, C = capacitance
  – Increasing frequency (f) also increases supply voltage (V) → a cubic effect
• Multicore processor
  – Dynamic power = V²fC
  – Increasing the number of cores increases capacitance (C) only linearly
• More transistors by Moore's law → more cores for higher performance instead, or the same performance at reduced power
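The V²fC relation makes the cubic-vs-linear contrast concrete. A sketch in arbitrary consistent units (the specific voltage/frequency pairing below is an illustrative assumption; real DVFS curves are chip-specific):

```python
def dynamic_power(v, f, c):
    """Dynamic power = V^2 * f * C (arbitrary consistent units)."""
    return v * v * f * c

base = dynamic_power(v=1.0, f=1.0, c=1.0)

# Doubling frequency typically forces a proportional voltage increase,
# so power grows roughly with f^3: here 2^2 * 2 = 8x.
faster_clock = dynamic_power(v=2.0, f=2.0, c=1.0)

# Doubling cores at fixed V and f only doubles switched capacitance: 2x.
two_cores = dynamic_power(v=1.0, f=1.0, c=2.0)
```

The same algebra also works in reverse: halving V and f on twice as many cores can keep throughput while cutting power, which is the "same performance at reduced power" option above.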
[Figure: power consumption (watts) of the Core 2 Duo 1.2GHz (mobile), Core 2 Duo 2.4GHz (mobile), Core 2 Duo 3.0GHz, and Core 2 Quad 3.0GHz. Source: MIT 6.198 IAP]
Evolution of Microprocessors
• Chip density continues to increase 2× every 2 years
• Clock speed does not
• The number of cores increases
• Power no longer increases
Recent Multicore Processors
• 2H 15: Intel Knights Landing
  – 60+ cores; 4-way SMT; 16GB (on package)
• 2015: Oracle SPARC M7
  – 32 cores; 8-way fine-grain MT; 64MB cache
• Fall 14: Intel Haswell
  – 18 cores; 2-way SMT; 45MB cache
• June 14: IBM Power8
  – 12 cores; 8-way SMT; 96MB cache
• Sept 13: SPARC M6
  – 12 cores; 8-way fine-grain MT; 48MB cache
• Q2 13: Intel Knights Corner (coprocessor)
  – 61 cores; 2-way SMT; 16MB cache
• May 12: AMD Trinity
  – 4 CPU cores; 384 graphics cores
• Feb 12: Blue Gene/Q
  – 16+1+1 cores; 4-way SMT; 32MB cache

Blue Gene/Q compute chip
Credit: The IBM Blue Gene/Q Compute Chip, IEEE Micro 2012
Simultaneous Multithreading (SMT)
• Permit several independent threads to issue to multiple functional units each cycle
Credit: R. Kalla, B. Sinharoy, J. Tendler, "Simultaneous Multi-threading Implementation in Power5," Hot Chips, 2003
Simultaneous Multithreading (SMT)
• SMT is easily added to a superscalar architecture
  – Two sets of PCs and registers
  – The PCs share the I-fetch unit
  – Register renaming maps the second set of registers

Credit: R. Kalla, B. Sinharoy, J. Tendler, "Simultaneous Multi-threading Implementation in Power5," Hot Chips, 2003
Large-scale Parallel Hardware
• Top 7 sites in Top500 supercomputers (Nov. 2014)
148th, 169th, 170th: Korea Meteorological Administration
Credit: top500.org
Cores/Socket & Cores in Top500
[Figure: number of Top500 systems (0 to 500) from 1993 to 2008, broken down by core count per system, binned as 1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k, 128k-. Credit: top500.org, [email protected]]
Blue Gene/Q Packing Hierarchy
1. Chip: 16 cores
2. Module: single chip
3. Compute Card: one single-chip module, 16GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: 20 PF/s
Credit: The Blue Gene/Q Compute Chip, Hot Chips 2011
Summary: Why Parallel Computing?
• If you can't exploit parallelism, performance becomes a zero-sum game
  – If you want to add a new feature and maintain performance, you will need to remove something else.
• The future is "multicore" (i.e., parallel) according to all major processor vendors
  – We expect more and more cores to be delivered with each generation
• Starting now, everyone will need to know how to write parallel programs
  – It is no longer a niche area: it is mainstream.
References
• Some slides are adopted and adapted from
  – CS194: Parallel Programming by Prof. Katherine Yelick at UC Berkeley
  – CS267: Applications of Parallel Computers by Prof. Jim Demmel at UC Berkeley
  – CS258: Parallel Computer Architecture by Prof. David E. Culler at UC Berkeley
  – COMP422: Parallel Computing by Prof. John Mellor-Crummey at Rice Univ.