SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong ([email protected])
Why Parallel Computing?
Jinkyu Jeong ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
What is Parallel Computing?
• Software with multiple threads
• Parallel vs. concurrent
  – Parallel computing executes multiple threads at the same time on multiple processors – performance
  – Concurrent computing can either alternate multiple threads on a single processor or execute them in parallel on multiple processors – convenient design
What is Parallel Architecture?
• Machines with multiple processors
Intel Quad Core i7 (Nehalem), Sony Playstation 3/4, IBM Blue Gene
Google Data Center Locations
Parallel Computing vs. Distributed Computing
Facebook Data Center
What is Parallel Architecture?
• One definition
  – A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some general issues
  – Resource allocation
  – Data access, communication, and synchronization
  – Performance and scalability
Why Study Parallel Computing?
• 10 years ago
  – Some important applications demanded high performance
  – CPU clock frequency scaling alone was not enough
  – Parallel computing provided higher performance
• Today
  – Many interesting applications demand high performance
  – CPU clock rates are no longer increasing!
  – Instruction-level parallelism is not increasing either!
  – Parallel computing is the only way to achieve higher performance in the foreseeable future
Why Study Parallel Architecture?
• Role of a computer architect
  – To design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost
• Parallelism
  – Provides an alternative to a faster clock for performance
  – Applies at all levels of system design
    • Bits, instructions, loops, threads, …
Why Parallel Computing?
• Application demands
• Architectural Trends
• The impact of technology and power
Application Trends
• A positive feedback cycle between the two
  – Delivered performance
  – Applications' demand for performance
• Example application domains
  – Scientific computing: computational fluid dynamics, biology, meteorology, …
  – General-purpose computing: video, graphics, …
  – Commercial computing: databases, …
More Performance ⇄ New Applications
Speedup
• Goal of applications in using parallel machines
  – Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
  – Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
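Both ratios are one-liners in code; a minimal sketch (the function names are illustrative, not from the slides):

```python
def speedup(perf_1, perf_p):
    """Speedup = Performance(p processors) / Performance(1 processor)."""
    return perf_p / perf_1

def speedup_fixed_problem(time_1, time_p):
    """For a fixed problem size, performance = 1/time, so the ratio flips:
    Speedup = Time(1 processor) / Time(p processors)."""
    return time_1 / time_p

# A program taking 120 s on 1 processor and 30 s on 8 processors
# achieves a 4x speedup (parallel efficiency 4/8 = 50%).
print(speedup_fixed_problem(120.0, 30.0))
```

Note the two definitions agree: feeding performance = 1/time into the first formula gives the second.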
Scientific Computing Demand
Example: N-body Simulation
• Simulation of a dynamical system of particles
  – N particles (location, mass, velocity)
  – Influence of physical forces (e.g., gravity)
  – Physics, astronomy, …
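The heart of such a simulation is an all-pairs force calculation followed by a position/velocity update. A toy sketch of one 2-D gravity time step (the function, units, and softening constant are illustrative assumptions, not from the slides):

```python
def nbody_step(pos, vel, mass, dt, G=1.0, eps=1e-3):
    """One time step for N particles under mutual gravity (2-D).
    pos, vel: lists of [x, y]; mass: list of floats.
    The all-pairs force loop is O(N^2): this is the part parallel
    machines accelerate by splitting particles across processors."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):                      # force on particle i ...
        for j in range(n):                  # ... from every particle j
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps * eps   # softened to avoid r = 0
            inv_r3 = r2 ** -1.5
            acc[i][0] += G * mass[j] * dx * inv_r3
            acc[i][1] += G * mass[j] * dy * inv_r3
    for i in range(n):                      # simple Euler update
        vel[i][0] += acc[i][0] * dt
        vel[i][1] += acc[i][1] * dt
        pos[i][0] += vel[i][0] * dt
        pos[i][1] += vel[i][1] * dt
    return pos, vel

# Two unit masses a unit distance apart drift toward each other.
p, v = nbody_step([[0.0, 0.0], [1.0, 0.0]],
                  [[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0], dt=0.01)
```

Because each particle's acceleration depends only on the previous positions, the outer loop parallelizes naturally.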
Example: Graph Computation
• Google page-rank algorithm
  – Determines the importance of a web page by counting the number and quality of links to the page
• Relative importance of a web page u
  – Repeat the rank update until it stabilizes
• Trillions of pages on the Web
Credit: Wikipedia
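The usual formulation of the update is PR(u) = (1 − d)/N + d · Σ PR(v)/outdeg(v) over the pages v linking to u, repeated until stable. A toy sketch on a three-page link graph (the damping factor d = 0.85 is the conventional choice, not stated on the slide):

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict page -> list of pages it links to.
    Repeats the standard rank update until (approximately) stabilized:
    PR(u) = (1 - d)/N + d * sum over v linking to u of PR(v)/outdeg(v)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        pr = {
            u: (1 - d) / n
               + d * sum(pr[v] / len(links[v]) for v in pages if u in links[v])
            for u in pages
        }
    return pr

# Toy web: "a" is linked to by both other pages, so it ranks highest.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

On a trillion-page graph the same update becomes a giant sparse matrix-vector product per iteration, which is why it is a flagship parallel workload.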
Engineering Computing
• Large parallel machines are mainstays in many industries
  – Petroleum – reservoir analysis
  – Automotive – crash simulation, drag analysis, combustion efficiency
  – Aeronautics – airflow analysis, engine efficiency, structural mechanics, electromagnetism
  – Computer-aided design – VLSI layout
  – Pharmaceuticals – molecular modeling
  – Visualization – all of the above, entertainment, architecture
  – Financial modeling – yield and derivative analysis
  – …
Commercial Computing
• Relies on parallelism for the high end
  – Computational power determines the scale of business that can be handled
  – e.g., databases, online transaction processing, decision support, data mining, data warehousing, …
• TPC benchmarks
  – TPC-C (order entry), TPC-D (decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size is no longer fixed as p increases, so throughput is used as the performance measure, in transactions per minute (tpm)
Commercial Computing
• Hardware threads in the Top-10 TPC-C machines
Why Parallel Computing?
• Application demands
• Architectural Trends
• The impact of technology and power
Architectural Trends
• Architecture translates technology’s gifts into performance and capability
• Four generations of architectural history
  – Tube, transistor, IC, VLSI
  – The greatest delineation in the VLSI generation has been in the type of parallelism exploited
Moore’s Law
• 1965: the number of transistors/inch² doubles every 1.5 years

Gordon Moore, co-founder of Intel Corp.
Credit: shigeru23 (CC-BY-SA-3.0)
Evolution of Intel Microprocessors
Intel 8088, 1978: 1 core, no cache, 29K transistors
Intel Nehalem-EX, 2009: 8 cores, 24MB cache, 2.3B transistors
Intel Knights Landing, 2016(?): 72 cores, 16GB DDR3 LLC(?), 7.1B transistors
Figure credits: ExtremeTech; The future of microprocessors, Communications of the ACM, vol. 54, no. 5
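These counts roughly track the doubling trend. A quick sanity check, using 2 years per doubling (Moore's 1975 revision, rather than the 1.5 years quoted on the previous slide; the helper function is illustrative):

```python
def project_transistors(start_count, start_year, end_year, doubling_years=2.0):
    """Project a transistor count forward assuming exponential doubling."""
    doublings = (end_year - start_year) / doubling_years
    return start_count * 2.0 ** doublings

# Intel 8088 (1978): 29K transistors. Projecting to 2009 gives ~1.3 billion;
# Nehalem-EX actually shipped with ~2.3B - the same order of magnitude.
projected = project_transistors(29_000, 1978, 2009)
```

A 1.5-year doubling period would overshoot by more than an order of magnitude over 31 years, which is why the 2-year figure is the one usually quoted for this era.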
Transistors & Clock Frequency
Credit: The future of computing performance: game over or next level?, The National Academies Press, 2011
Impact of Device Shrinkage
• What happens when the transistor size shrinks by a factor of x?
• Clock rate increases by x because wires are shorter
  – Actually less than x, because of power consumption
• Transistors per unit area increase by x²
• Die size also tends to increase
  – Typically by another factor of ~x
• Raw computing power of the chip goes up by ~x⁴!
  – Typically, the x³ part is devoted to either on-chip
    • Parallelism: hidden parallelism such as ILP
    • Locality: caches
• So most programs run x³ times faster, without changing them
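The factors above can be tabulated directly. A toy sketch with an example shrink factor of x = 2 (the function and dictionary keys are mine, and the model ignores the power caveat noted above):

```python
def shrink_effects(x):
    """Idealized consequences of shrinking transistor size by a factor x."""
    return {
        "clock_rate": x,      # shorter wires -> up to x faster clock
        "density": x ** 2,    # transistors per unit area
        "die_area": x,        # die size tends to grow by ~x as well
        "raw_power": x ** 4,  # clock * density * die area
    }

effects = shrink_effects(2)
# clock 2x, density 4x, die 2x => raw computing power ~16x, of which
# the ~x^3 = 8x part historically went to ILP tricks and caches.
```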
Architectural Trends
• Greatest trend in the VLSI generation is an increase in parallelism
  – Up to 1985: bit-level parallelism
    • 4-bit → 8-bit → 16-bit; slows after 32-bit
    • Adoption of 64-bit now under way; 128-bit far off (not a performance issue)
  – Mid 80s to mid 90s: instruction-level parallelism
    • Pipelining and simple instruction sets + compiler advances (RISC)
    • On-chip caches and functional units ⇒ superscalar execution
    • Greater sophistication: out-of-order execution, speculation, prediction
      – To deal with control transfer and latency problems
  – Now: thread-level parallelism
    • Multicore processors
[Figure: Phases in the "VLSI" generation. Transistors per chip (1,000 to 100,000,000, log scale) vs. year (1970-2005), showing three phases: bit-level parallelism (4 bits → 8 bits → 16 bits → 32 bits; i4004, i8008, i8080, i8086, i80286, i80386), instruction-level parallelism (pipelining, superscalar, branch prediction, out-of-order execution, VLIW; R2000, R3000, Pentium, R10000), and thread-level parallelism (?) (multicore/manycore, heterogeneous multicore, GPGPU).]
Superscalar Execution
• Exploits instruction-level parallelism
  – With N pipelines, N instructions can be issued (executed) in parallel

2-way superscalar execution pipeline
1. Load  R1, @1000
2. Load  R2, @1008
3. Add   R1, @1004
4. Add   R2, @100C
5. Add   R1, R2
6. Store R1, @2000
Out-of-order Execution
• Exploits instruction-level parallelism
  – With N pipelines, N instructions can be issued (executed) in parallel
  – Reorders instructions

2-way superscalar execution pipeline (arrows in the original figure mark dependencies)
1. Load  R1, @1000
2. Load  R2, @1008
3. Add   R1, @1004
4. Add   R2, @100C
5. Add   R1, R2
6. Store R1, @2000
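Which of the six instructions can share a cycle follows mechanically from the register dependences. A toy 2-way in-order issue model (a sketch under strong simplifications: register dependences only, one-cycle result latency, no memory or structural hazards; the helper names are mine):

```python
def schedule(instrs, width=2):
    """instrs: list of (reads, writes) register sets, in program order.
    Greedy in-order issue, up to `width` instructions per cycle.
    An instruction issues one cycle after every earlier instruction
    that writes a register it reads or writes (RAW/WAW dependences).
    Returns the issue cycle of each instruction."""
    cycle_of = []
    for i, (reads, writes) in enumerate(instrs):
        earliest = 1
        for j in range(i):
            _, w_j = instrs[j]
            if w_j & (reads | writes):            # dependence on j
                earliest = max(earliest, cycle_of[j] + 1)
        if cycle_of:
            earliest = max(earliest, cycle_of[-1])  # in-order issue
        while sum(1 for c in cycle_of if c == earliest) >= width:
            earliest += 1                         # this cycle is full
        cycle_of.append(earliest)
    return cycle_of

# The slide's sequence; memory operands are modeled as pseudo-registers.
seq = [
    ({"m1000"}, {"R1"}),     # 1. Load  R1, @1000
    ({"m1008"}, {"R2"}),     # 2. Load  R2, @1008
    ({"R1"}, {"R1"}),        # 3. Add   R1, @1004
    ({"R2"}, {"R2"}),        # 4. Add   R2, @100C
    ({"R1", "R2"}, {"R1"}),  # 5. Add   R1, R2
    ({"R1"}, set()),         # 6. Store R1, @2000
]
```

Under this model the two loads pair up, the two adds pair up, but instructions 5 and 6 must each wait a cycle: six instructions finish issuing in four cycles, not three.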
Architecture-driven Advances
Credit: The Future of Microprocessors, Communications of the ACM, 2011
Limitation in ILP
• Speedup begins to saturate after an issue width of 4
  – Assumes an ideal machine
    • Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – But with realistic memory
    • Real caches and non-zero miss latencies
[Figure: left, fraction of total cycles (%) vs. number of instructions issued (0 to 6+); right, speedup (0 to 3) vs. instructions issued per cycle (issue width, 0 to 15), saturating around an issue width of 4.]
Instruction-level Parallelism Concerns
• Issue waste
[Figure: issue slots (4-way) vs. cycles, with full and empty issue slots; horizontal waste = 9 slots, vertical waste = 12 slots. Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995]
• Contributing factors
  • Instruction dependencies
  • Long-latency operations within a thread
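Both kinds of waste can be counted directly from an issue-slot grid. A toy sketch (the example grid below is made up for illustration; it is not the one in the figure):

```python
def count_waste(grid):
    """grid: list of cycles, each a list of booleans (True = full slot)
    for a fixed-width issue machine.
    Horizontal waste: empty slots in cycles where something issued.
    Vertical waste: empty slots in cycles where nothing issued."""
    horizontal = vertical = 0
    for cycle in grid:
        empty = cycle.count(False)
        if any(cycle):
            horizontal += empty
        else:
            vertical += empty
    return horizontal, vertical

# A 4-way machine over 5 cycles: two fully idle cycles,
# three partially filled ones.
grid = [
    [True, True, False, False],    # 2 slots of horizontal waste
    [False, False, False, False],  # fully idle: 4 slots of vertical waste
    [True, False, False, False],   # 3 horizontal
    [False, False, False, False],  # 4 vertical
    [True, True, True, False],     # 1 horizontal
]
```

The distinction matters because the remedies differ: vertical waste calls for another thread to fill whole idle cycles, while horizontal waste calls for mixing instructions from several threads within one cycle, which is what SMT does.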
Some Sources of Wasted Issue Slots
• Memory hierarchy
  – TLB miss
  – Instruction cache miss
  – Data cache miss
  – Load delays (L1 hit latency)
• Control flow
  – Branch misprediction
• Instruction stream
  – Instruction dependency
  – Memory conflict
Simulation of 8-issue Superscalar Processor
Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995
Highly underutilized processor cycles
• In most SPEC92 applications
• On average < 1.5 IPC (19%)
• The dominant source of waste differs by application
Single-Thread Performance Curve
The rate of single-thread performance improvement has decreased
Technology-driven
Architecture-driven
[Computer Architecture, Hennessy & Patterson]
Single-chip Multiprocessor (Multicore)
• Limited instruction-level parallelism
• → Exploits thread-level parallelism instead

6-way superscalar processor vs. 4 × 2-way multiprocessor
30% area
Credit: K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," ASPLOS 1996
Why Parallel Computing?
• Application demands
• Architectural Trends
• The impact of technology and power
Desperate Cooling?
Cooling
• Power delivery, packaging, and cooling costs
  – Increased cost of thermal packaging
    • $1 per watt for CPUs above 35W [Tiwari, DAC98]
  – At the high end, 1W of cooling for every 1W of power
• Cooling
  – Physical solutions
    • Heatsink, thermal paste, fan
  – APM (Advanced Power Management)
    • BIOS manages clock speed
    • Triggered by thermal diodes
  – ACPI (Advanced Configuration and Power Interface)
    • OS manages power for all devices
    • Idle, nap, sleep modes
Power Consumption Trend
• Reasons for the power trends
  – Smaller feature size
  – Increased compaction
  – Higher clock frequency
  – More complicated processors
Year           | 2000     | 2005     | 2010  | 2015
Process        | 180nm    | 90nm     | 55nm  | 3nm
Transistors    | 0.2B     | 1.2B     | 7.8B  | 50B
Supply voltage | 1.8-1.5V | 1.2-0.9V | 0.7V  | 0.5V
Frequency      | 1GHz     | 3GHz     | 10GHz | 30GHz
Power          | 100W     | 175W     | 300W  | 540W
Source: Intel (2002)
[Figure: power density (W/cm²), log scale from 1 to 10,000, vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, with reference lines for a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Shenkar Bokar, Intel]
Power Density Limits Serial Performance
• Single-core processor
  – Dynamic power = V²fC
    • V = supply voltage, f = frequency, C = capacitance
  – Increasing frequency (f) also increases supply voltage (V) → a cubic effect
• Multicore processor
  – Dynamic power = V²fC
  – Increasing the number of cores increases capacitance (C) only linearly
• More transistors by Moore's law → more cores for higher performance instead, or the same performance at reduced power
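The V²fC relation makes the cubic-vs-linear contrast concrete. A sketch in arbitrary consistent units (the specific voltage/frequency pairing below is an illustrative assumption; real DVFS curves are chip-specific):

```python
def dynamic_power(v, f, c):
    """Dynamic power = V^2 * f * C (arbitrary consistent units)."""
    return v * v * f * c

base = dynamic_power(v=1.0, f=1.0, c=1.0)

# Doubling frequency typically forces a proportional voltage increase,
# so power grows roughly with f^3: here 2^2 * 2 = 8x.
faster_clock = dynamic_power(v=2.0, f=2.0, c=1.0)

# Doubling cores at fixed V and f only doubles switched capacitance: 2x.
two_cores = dynamic_power(v=1.0, f=1.0, c=2.0)
```

The same algebra also works in reverse: halving V and f on twice as many cores can keep throughput while cutting power, which is the "same performance at reduced power" option above.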
[Figure: power consumption (watts) of the Core 2 Duo 1.2GHz (mobile), Core 2 Duo 2.4GHz (mobile), Core 2 Duo 3.0GHz, and Core 2 Quad 3.0GHz. Source: MIT 6.198 IAP]
Evolution of Microprocessors
• Chip density continues to increase 2× every 2 years
• Clock speed does not
• The number of cores increases
• Power no longer increases
Recent Multicore Processors
• 2H 15: Intel Knights Landing
  – 60+ cores; 4-way SMT; 16GB (on package)
• 2015: Oracle SPARC M7
  – 32 cores; 8-way fine-grain MT; 64MB cache
• Fall 14: Intel Haswell
  – 18 cores; 2-way SMT; 45MB cache
• June 14: IBM Power8
  – 12 cores; 8-way SMT; 96MB cache
• Sept 13: SPARC M6
  – 12 cores; 8-way fine-grain MT; 48MB cache
• Q2 13: Intel Knights Corner (coprocessor)
  – 61 cores; 2-way SMT; 16MB cache
• May 12: AMD Trinity
  – 4 CPU cores; 384 graphics cores
• Feb 12: Blue Gene/Q
  – 16+1+1 cores; 4-way SMT; 32MB cache

Blue Gene/Q compute chip
Credit: The IBM Blue Gene/Q Compute Chip, IEEE Micro 2012
Simultaneous Multithreading (SMT)
• Permit several independent threads to issue to multiple functional units each cycle
Credit: R. Kalla, B. Sinharoy, J. Tendler, "Simultaneous Multi-threading Implementation in Power5," Hot Chips, 2003
Simultaneous Multithreading (SMT)
• SMT is easily added to a superscalar architecture
  – Two sets of PCs and registers
  – The PCs share the I-fetch unit
  – Register renaming maps the second set of registers

Credit: R. Kalla, B. Sinharoy, J. Tendler, "Simultaneous Multi-threading Implementation in Power5," Hot Chips, 2003
Large-scale Parallel Hardware
• Top 7 sites in Top500 supercomputers (Nov. 2014)
148th, 169th, 170th: Korea Meteorological Administration
Credit: top500.org
Cores/Socket & Cores in Top500
[Figure: number of Top500 systems (0 to 500) from 1993 to 2008, broken down by core count per system, binned as 1, 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k, 128k-. Credit: top500.org, [email protected]]
Blue Gene/Q Packing Hierarchy
1. Chip: 16 cores
2. Module: single chip
3. Compute Card: one single-chip module, 16GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: 20 PF/s
Credit: The Blue Gene/Q Compute Chip, Hot Chips 2011
Summary: Why Parallel Computing?
• If you can't exploit parallelism, performance becomes a zero-sum game
  – If you want to add a new feature and maintain performance, you will need to remove something else.
• The future is "multicore" (i.e., parallel) according to all major processor vendors
  – We expect more and more cores to be delivered with each generation
• Starting now, everyone will need to know how to write parallel programs
  – It is no longer a niche area: it is mainstream.
References
• Some slides are adopted and adapted from
  – CS194: Parallel Programming by Prof. Katherine Yelick at UC Berkeley
  – CS267: Applications of Parallel Computers by Prof. Jim Demmel at UC Berkeley
  – CS258: Parallel Computer Architecture by Prof. David E. Culler at UC Berkeley
  – COMP422: Parallel Computing by Prof. John Mellor-Crummey at Rice Univ.