Structure of Computer Systems
Course 11: Parallel computer architectures


Page 1: Structure of Computer Systems

Structure of Computer Systems

Course 11: Parallel computer architectures

Page 2: Structure of Computer Systems

Motivations

Why parallel execution?

users want faster-and-faster computers - why?
• advanced multimedia processing
• scientific computing: physics, info-biology (e.g. DNA analysis), medicine, chemistry, earth sciences
• implementation of heavy-load servers: multimedia provisioning
• why not!

performance improvement through clock frequency increases is no longer possible
• power dissipation issues limit the clock signal's frequency to 2-3 GHz

continue to maintain Moore's Law regarding performance increase, through parallelization

Page 3: Structure of Computer Systems

How?

Parallelization principle:

"if one processor cannot make a computation (execute an application) in a reasonable time, more processors should be involved in the computation"

similar to the division of labor in human activities

some parts of a computer system, or whole computer systems, can work simultaneously:
• multiple ALUs
• multiple instruction execution units
• multiple CPUs
• multiple computer systems

Page 4: Structure of Computer Systems

Flynn's taxonomy

Classification of computer systems
• Michael Flynn – 1966
• classification based on the presence of single or multiple streams of instructions and data

Instruction stream: a sequence of instructions executed by a processor

Data stream: a sequence of data required by an instruction stream

Page 5: Structure of Computer Systems

Flynn's taxonomy

                        Single instruction stream                  Multiple instruction streams
Single data stream      SISD – Single Instruction, Single Data     MISD – Multiple Instruction, Single Data
Multiple data streams   SIMD – Single Instruction, Multiple Data   MIMD – Multiple Instruction, Multiple Data

Page 6: Structure of Computer Systems

Flynn's taxonomy

[Diagram: block structures of the four classes – SISD, SIMD, MISD, MIMD – built from the units below, connected by instruction (I) and data (D) streams]

C – control unit
P – processing unit (ALU)
M – memory

Page 7: Structure of Computer Systems

Flynn's taxonomy

SISD – single instruction flow and single data flow
• not a parallel architecture
• sequential processing – one instruction and one data item at a time

SIMD – single instruction flow and multiple data flows
• data-level parallelism
• architectures with multiple ALUs
• one instruction processes multiple data items
• processes multiple data flows in parallel
• useful for vectors and matrices – regular data structures
• not useful for database applications

Page 8: Structure of Computer Systems

Flynn's taxonomy

MISD – multiple instruction flows and single data flow
• two views:
  - there is no such computer
  - pipeline architectures may be considered part of this class
• instruction-level parallelism
• superscalar architectures – sequential from the outside, parallel inside

MIMD – multiple instruction flows and multiple data flows
• true parallel architectures:
  - multi-cores
  - multiprocessor systems: parallel and distributed systems

Page 9: Structure of Computer Systems

Issues regarding parallel execution

subjective issues (which depend on us):

human thinking is mainly sequential – it is hard to imagine doing things in parallel

it is hard to divide a problem into parts that can be executed simultaneously
• multitasking, multi-threading
• some problems/applications are inherently parallel (e.g. if data is organized in vectors, if there are loops in the program, etc.)
• how to divide a problem between 100-1000 parallel units?

it is hard to predict the consequences of parallel execution
• e.g. concurrent access to shared resources
• writing multi-thread-safe applications

Page 10: Structure of Computer Systems

Issues regarding parallel execution

objective issues:

efficient access to shared resources
• shared memory
• shared data paths (buses)
• shared I/O facilities

efficient communication between intelligent parts
• interconnection networks, multiple buses, pipes, shared memory zones

synchronization and mutual exclusion (illustrated in the sketch below)
• causal dependencies
• consecutive start and end of tasks
• data races and I/O races
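As a concrete illustration of mutual exclusion (a sketch of mine, not part of the original slides), two POSIX threads in C increment a shared counter; the mutex serializes the accesses and prevents a data race:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared resource and the mutex that protects it. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);   /* enter the critical section */
            counter++;                   /* without the lock, this is a data race */
            pthread_mutex_unlock(&lock); /* leave the critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter); /* 2000000 with the mutex; less without it */
        return 0;
    }

Compile with gcc -pthread; removing the lock/unlock pair makes the final count nondeterministic, which is exactly the data-race problem listed above.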

Page 11: Structure of Computer Systems

Amdahl's Law for parallel execution

Speedup limitation caused by the sequential part of an application

an application = parts executed sequentially + parts executable in parallel

    speedup = t_seq_exec / t_parallel_exec = 1 / ((1 - q) + q/n)

where:
q – fraction of the total time in which the application can be executed in parallel; 0 < q <= 1
(1 - q) – fraction of the total time in which the application is executed sequentially
n – number of processors involved in the execution (degree of parallel execution)

Page 12: Structure of Computer Systems

Amdahl's Law for parallel execution

Examples:

1. q = 0.9 (90%); n = 2:      speedup = 1 / ((1 - 0.9) + 0.9/2)    ≈ 1.81
2. q = 0.9 (90%); n = 1000:   speedup = 1 / ((1 - 0.9) + 0.9/1000) ≈ 9.91
3. q = 0.5 (50%); n = 1000:   speedup = 1 / ((1 - 0.5) + 0.5/1000) ≈ 1.99

Page 13: Structure of Computer Systems

Parallel architectures
Data level parallelism (DLP)

SIMD architectures

use of multiple parallel ALUs

efficient if the same operation must be performed on all the elements of a vector or matrix

examples of applications that can benefit:
• signal processing, image processing
• graphical rendering and simulation
• scientific computations with vectors and matrices

versions:
• vector architectures
• systolic arrays
• neural architectures

examples:
• Pentium II – MMX; Pentium 4 – SSE2

Page 14: Structure of Computer Systems

MMX module

intended for multimedia processing

MMX = MultiMedia eXtension

used for vector computations
• addition, subtraction, multiplication, division, AND, OR, NOT
• one instruction can process 1 to 8 data items in parallel
• scalar product of 2 vectors – convolution of 2 functions
  - implementation of digital filters (e.g. image processing)

    y(kT) = Σ_i x(iT) · f(kT - iT)

[Diagram: eight samples x(0)..x(7) multiplied pairwise with eight coefficients f(0)..f(7) in parallel, followed by partial sums Σ x(i)·f(i) for i = 0..3 and i = 4..7]
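To make the data-level parallelism concrete (a sketch of mine using SSE2 intrinsics rather than the original MMX instructions), a dot product of eight 16-bit samples in C: a single _mm_madd_epi16 intrinsic performs eight multiplications and four pairwise additions at once:

    #include <emmintrin.h> /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* Two vectors of eight 16-bit samples, e.g. signal and filter taps. */
        short x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        short f[8] = {8, 7, 6, 5, 4, 3, 2, 1};

        __m128i vx = _mm_loadu_si128((const __m128i *)x);
        __m128i vf = _mm_loadu_si128((const __m128i *)f);

        /* One instruction: 8 multiplications + 4 pairwise additions. */
        __m128i prod = _mm_madd_epi16(vx, vf); /* four 32-bit partial sums */

        int p[4];
        _mm_storeu_si128((__m128i *)p, prod);
        int dot = p[0] + p[1] + p[2] + p[3];

        printf("dot = %d\n", dot); /* 1*8 + 2*7 + ... + 8*1 = 120 */
        return 0;
    }

A scalar loop would need eight separate multiply and add instructions; here one SIMD instruction processes all eight element pairs – exactly the "one instruction, multiple data" idea above.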

Page 15: Structure of Computer Systems

Systolic array

systolic array = a piped network of simple processing units (cells)

all cells are synchronized – they make one processing step simultaneously

multiple data flows cross the array, similarly to the way blood is pumped by the heart into the arteries and organs (systolic behavior)

dedicated to fast computation of a given complex operation:
• product of matrices
• evaluation of a polynomial
• multiple steps of an image-processing chain

it is data-stream-driven processing, in opposition to the traditional (von Neumann) instruction-stream processing

[Diagram: a two-dimensional array of cells, with input flows entering along two edges and output flows leaving along the opposite edges]

Page 16: Structure of Computer Systems

Systolic array

Example: matrix multiplication (see the diagram and the sketch below)

in each step, each cell performs a multiply-and-accumulate operation

at the end, each cell contains one element of the resulting matrix

[Diagram: the rows of matrix A (a0,0 .. a2,2) stream in from the left and the columns of matrix B (b0,0 .. b2,2) stream in from the top, skewed by one step per row/column; each cell accumulates terms such as a0,0*b0,0 + a0,1*b1,0 + ... and a0,0*b0,1 + ...]
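A software sketch of the same idea (mine, with simplifying assumptions such as zero-padding at the borders): a simulated 3×3 output-stationary systolic array in C. In every step, each cell multiplies the a-value arriving from the left by the b-value arriving from the top, accumulates the product, and the values move on:

    #include <stdio.h>

    #define N 3
    #define STEPS (3 * N - 2) /* cycles until the last operands meet */

    int main(void)
    {
        int A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        int B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
        int C[N][N] = {0};                /* one accumulator per cell */
        int a[N][N] = {0}, b[N][N] = {0}; /* values currently held in each cell */

        for (int t = 0; t < STEPS; t++) {
            /* Shift: a-values move one cell right, b-values one cell down. */
            for (int i = 0; i < N; i++)
                for (int j = N - 1; j > 0; j--) a[i][j] = a[i][j - 1];
            for (int j = 0; j < N; j++)
                for (int i = N - 1; i > 0; i--) b[i][j] = b[i - 1][j];

            /* Inject skewed inputs at the boundaries (zeros outside the window). */
            for (int i = 0; i < N; i++) {
                int k = t - i; /* row i is delayed by i cycles */
                a[i][0] = (k >= 0 && k < N) ? A[i][k] : 0;
            }
            for (int j = 0; j < N; j++) {
                int k = t - j; /* column j is delayed by j cycles */
                b[0][j] = (k >= 0 && k < N) ? B[k][j] : 0;
            }

            /* All cells make one multiply-and-accumulate step simultaneously. */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += a[i][j] * b[i][j];
        }

        /* C now holds the matrix product A*B. */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) printf("%4d", C[i][j]);
            printf("\n");
        }
        return 0;
    }

The skewed injection reproduces the timing of the diagram: operands A[i][k] and B[k][j] reach cell (i, j) in the same cycle (t = i + j + k), so each cell ends up holding one element of the result.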

Page 17: Structure of Computer Systems

Parallel architectures
Instruction level parallelism (ILP)

MISD – multiple instruction, single data

types:
• pipeline architectures
• VLIW – very long instruction word
• superscalar and super-pipeline architectures

Pipeline architectures – multiple instruction stages performed by specialized units in parallel:
• instruction fetch
• instruction decode and data fetch
• instruction execution
• memory operation
• write back of the result

issues – hazards (a data-hazard example follows below):
• data hazard – data dependency between consecutive instructions
• control hazard – jump instructions' unpredictability
• structural hazard – the same structural element used by different stages of consecutive instructions

see courses no. 4 and 5
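As a small illustration (mine, not from the slides), a C fragment whose consecutive operations are data-dependent – the read-after-write pattern that causes data hazards in a pipeline:

    /* Each statement reads the result of the previous one (read-after-write):
     * the pipeline must forward the value or stall, and neither the compiler
     * nor the hardware can reorder these operations. */
    int data_hazard_demo(int x)
    {
        int a = x + 1; /* produces a            */
        int b = a * 2; /* needs a immediately   */
        int c = b - 3; /* needs b, back-to-back */
        return c;
    }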

Page 18: Structure of Computer Systems

Pipeline architecture
The MIPS pipeline

[Figure: the five-stage MIPS pipeline – IF, ID, EX, MEM, WB]

Page 19: Structure of Computer Systems

Parallel architectures
Instruction level parallelism (ILP)

VLIW – very long instruction word

idea – a number of simple instructions (operations) are formatted into a very large (super) instruction, called a bundle
• it is read and executed as a single instruction, but with some parallel operations
• operations are grouped into a wide instruction code only if they can be executed in parallel (see the fragment below)
• usually the instructions are grouped by the compiler
• the solution is efficient only if there are multiple execution units that can execute the operations included in an instruction in parallel
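A minimal illustration (mine): three independent C operations with no data dependencies between them, which a VLIW compiler could pack into one bundle and issue to three execution units in the same cycle:

    /* restrict promises the arrays do not overlap, so no operation reads
     * another's result: a VLIW compiler may bundle all three operations. */
    void vliw_bundle_demo(int *restrict a, int *restrict b, int *restrict c)
    {
        a[0] = a[1] + a[2]; /* ALU operation 1 */
        b[0] = b[1] * b[2]; /* multiplier      */
        c[0] = c[1] - c[2]; /* ALU operation 2 */
    }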

Page 20: Structure of Computer Systems

Parallel architectures
Instruction level parallelism (ILP)

VLIW – very long instruction word (cont.)

advantage: parallel execution; simultaneous execution possibilities are detected at compilation

drawback: because of dependencies, the compiler cannot always find instructions that can be executed in parallel

examples of processors:
• Intel Itanium – 3 operations/instruction
• IA-64 EPIC (Explicitly Parallel Instruction Computing)
• C6000 – digital signal processor (Texas Instruments)
• embedded processors

Page 21: Structure of Computer Systems

Parallel architectures
Instruction level parallelism (ILP)

Superscalar architecture:

"more than a scalar architecture", towards parallel execution

superscalar:
• from the outside – sequential (scalar) instruction execution
• inside – parallel instruction execution

example: Pentium Pro – 3-5 instructions fetched and executed in every clock period

consequence: programs are written in a sequential manner but executed in parallel

Page 22: Structure of Computer Systems

Parallel architectures
Instruction level parallelism (ILP)

Superscalar architecture (cont.)

advantage: more instructions executed in every clock period
• extends the potential of a pipeline architecture
• CPI < 1

drawback: more complex hazard detection and correction mechanisms

example:
• P6 (Pentium Pro) architecture: 3 instructions decoded in every clock period

[Diagram: three IF-ID-EX-MEM-WB pipelines running side by side, with three instructions entering per clock period]

Page 23: Structure of Computer Systems

Parallel architectures
Instruction level parallelism (ILP)

Super-pipeline architecture

pipeline extended to extremes:
• more pipeline stages (e.g. 20 in the case of the NetBurst architecture)
• one step executed in half of the clock period (better than doubling the clock frequency)

[Diagram: IF-ID-EX-MEM-WB charts comparing a classic pipeline (one instruction started per clock period), a super-pipeline (a new instruction started every half clock period) and a superscalar pipeline (several instructions started per clock period)]

Page 24: Structure of Computer Systems

Superscalar, EPIC, VLIW

From Mark Smotherman, "Understanding EPIC Architectures and Implementations"

                 Grouping instructions   Functional unit assignment   Scheduling
Superscalar      Hardware                Hardware                     Hardware
EPIC             Compiler                Hardware                     Hardware
Dynamic VLIW     Compiler                Compiler                     Hardware
VLIW             Compiler                Compiler                     Compiler

Page 25: Structure of Computer Systems

Superscalar, EPIC, VLIW

From Mark Smotherman, "Understanding EPIC Architectures and Implementations"

[Diagram: for Superscalar, EPIC, Dynamic VLIW and VLIW, the steps code generation, instruction grouping, functional unit assignment and scheduling are shown split between compiler and hardware, matching the table above]

Page 26: Structure of Computer Systems

Parallel architectures
Instruction level parallelism (ILP)

We have reached the limits of instruction-level parallelization:

pipelining – 12-15 stages
• Pentium 4 – NetBurst architecture – 20 stages – was too much

superscalar and VLIW – 3-4 instructions fetched and executed at a time

Main issue:

hazard cases are hard to detect and resolve efficiently

Page 27: Structure of Computer Systems

Parallel architectures
Thread level parallelism (TLP)

TLP (Thread Level Parallelism)
• parallel execution at thread level
• examples:
  - hyper-threading – 2 threads executed in parallel on the same pipeline (up to 30% speedup)
  - multi-core architectures – multiple CPUs on a single chip
  - multiprocessor systems (parallel systems)

[Diagram: hyper-threading – two threads Th1 and Th2 interleaved in one IF-ID-EX-WB pipeline; multi-core and multi-processor – two chips, each with two cores (Core1, Core2), per-core L1 caches and a shared L2 cache, connected to main memory]

Page 28: Structure of Computer Systems

Parallel architectures
Thread level parallelism (TLP)

Issues:
• transforming a sequential program into a multi-threaded one (see the sketch below):
  - procedures transformed into threads
  - loops (for, while, do ...) transformed into threads
• synchronization
• concurrent access to common resources
• context-switch time
=> thread-safe programming
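As an illustration of turning a loop into threads (my sketch, assuming independent loop iterations), the iteration range of a for loop is split between two POSIX threads:

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static int data[N];

    struct range { int lo, hi; };

    /* Each thread runs its own slice of the original loop. */
    static void *loop_part(void *arg)
    {
        struct range *r = arg;
        for (int i = r->lo; i < r->hi; i++)
            data[i] = i * i; /* iterations are independent: no locking needed */
        return NULL;
    }

    int main(void)
    {
        struct range r1 = {0, N / 2}, r2 = {N / 2, N};
        pthread_t t1, t2;
        pthread_create(&t1, NULL, loop_part, &r1);
        pthread_create(&t2, NULL, loop_part, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        for (int i = 0; i < N; i++) printf("%d ", data[i]);
        printf("\n");
        return 0;
    }

If the iterations were not independent (e.g. they accumulated into one variable), the synchronization listed above would become necessary.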

Page 29: Structure of Computer Systems

Parallel architectures
Thread level parallelism (TLP)

programming example:

    int a = 1;
    int b = 100;

    Thread 1:        Thread 2:
      a = 5;           b = 50;
      print(b);        print(a);

result: depends on the memory consistency model
• no consistency control: (a, b) ->
  - Th1; Th2 => (5, 100)
  - Th2; Th1 => (1, 50)
  - Th1 interleaved with Th2 => (5, 50)
• thread-level consistency:
  - Th1 => (5, 100); Th2 => (1, 50)
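A runnable version of this example (my sketch in C with POSIX threads; print rendered as printf). Depending on the interleaving chosen by the scheduler, it prints the value pairs listed above:

    #include <pthread.h>
    #include <stdio.h>

    /* Shared variables; the unsynchronized accesses are deliberate here,
     * to expose the memory-consistency problem discussed above. */
    static int a = 1;
    static int b = 100;

    static void *thread1(void *arg)
    {
        a = 5;
        printf("b = %d\n", b);
        return NULL;
    }

    static void *thread2(void *arg)
    {
        b = 50;
        printf("a = %d\n", a);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Run it repeatedly and the printed pair varies between (5, 100), (1, 50) and (5, 50), exactly because no consistency control is imposed.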

Page 30: Structure of Computer Systems

Parallel architectures
Thread level parallelism (TLP)

when do we switch between threads?

fine-grain threading – alternate after every instruction

coarse-grain threading – alternate when one thread is stalled (e.g. on a cache miss)

Page 31: Structure of Computer Systems

Forms of parallel execution

[Diagram: processor cycles vs. issue slots for superscalar, fine-grain threading, coarse-grain threading, multiprocessor and hyper-threading (simultaneous multithreading); the slots are filled by threads 1-5, with empty slots marking stalls]

Page 32: Structure of Computer Systems

Parallel architectures
Thread level parallelism (TLP)

Fine-Grained Multithreading
• switches between threads on each instruction, causing the execution of multiple threads to be interleaved
• usually done in a round-robin fashion, skipping any stalled threads
• the CPU must be able to switch threads every clock cycle
• advantage: it can hide both short and long stalls
  - instructions from other threads are executed when one thread stalls
• disadvantage: it slows down the execution of individual threads, since a thread ready to execute without stalls is delayed by instructions from other threads
• used on Sun's Niagara

Page 33: Structure of Computer Systems

Parallel architectures
Thread level parallelism (TLP)

Coarse-Grained Multithreading
• switches threads only on costly stalls, such as L2 cache misses
• advantages:
  - relieves the need for very fast thread switching
  - doesn't slow down a thread, since instructions from other threads are issued only when the thread encounters a costly stall
• disadvantages:
  - hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  - since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen
  - a new thread must fill the pipeline before instructions can complete
• because of this start-up overhead, coarse-grained multithreading is better at reducing the penalty of high-cost stalls, where pipeline refill time << stall time
• used in the IBM AS/400

Page 34: Structure of Computer Systems

Parallel architectures
PLP – Process Level Parallelism

Process: an execution unit in UNIX
• a secured environment in which to execute an application or task
• the operating system allocates resources at process level:
  - protected memory zones
  - I/O interfaces and interrupts
  - file access system

Thread – a "lightweight process"
• a process may contain a number of threads
• threads share the resources allocated to their process
• no (or minimal) protection between threads of the same process

Page 35: Structure of Computer Systems

Parallel architectures
PLP – Process Level Parallelism

Architectural support for PLP:

Multiprocessor systems (2 or more processors in one computer system)
• processors managed by the operating system

GRID computer systems
• many computers interconnected through a network
• processors and storage managed by a middleware (Condor, gLite, Globus Toolkit)
• example – EGI – European Grid Initiative
• a special language to describe:
  - processing trees
  - input files
  - output files
• advantage – hundreds of thousands of computers available for scientific purposes
• drawback – batch processing, very little interaction between the system and the end user

Cloud computer systems
• computing infrastructure as a service
• see Amazon:
  - EC2 – computing service – Elastic Compute Cloud
  - S3 – storage service – Simple Storage Service

Page 36: Structure of Computer Systems

Parallel architectures
PLP – Process Level Parallelism

It is more a question of software than of computer architecture
• the same computers may be part of a GRID or of a Cloud

Hardware requirements:
• enough bandwidth between processors

Page 37: Structure of Computer Systems

Conclusions

data level parallelism
• still some extension possibilities, but they depend on the regular structure of the data

instruction level parallelism
• almost at the end of its improvement capabilities

thread/process level parallelism
• still an important source of performance improvement