
Page 1 (source: apachepersonal.miun.se/~bentho/ec/download/f3.pdf)

Benny Thörnberg

Associate Professor in Electronics

Computer and Hardware Architecture II

Page 2

Parallelism – Microscopic vs Macroscopic

• Microscopic parallelism – hardware solutions inside system components that provide parallel computation without being visible to the user, e.g.:

• Registers

• Memory

• Parallel busses

• Instruction pipelines

• Macroscopic parallelism – duplicated large-scale components providing parallelism on the system level

• Dual- or Quad-core processors

• Vector or Graphics processors

• Co-processors

• I/O processors

Page 3

Parallelism – Symmetric vs Asymmetric

• Symmetric parallelism – uses replications of identical processing elements that can operate in parallel

• Multicore processors

• Asymmetric parallelism – uses a set of processing elements that operate in parallel but differ from each other

• PC with CPU, Graphics processor, math processor, I/O processor

Page 4

Parallelism – Fine-grain vs Coarse-grain

• Fine-grain parallelism – computers providing parallel computations on the level of instructions or data items

• Vector processors

• Digital signal processors with special SIMD instructions

• Coarse-grain parallelism – computers providing parallelism on the level of programs or larger data structures

• Dual- or Quad-core processors

Page 5

Parallelism – Explicit vs Implicit

• Explicit parallelism – the programmer needs to control how the available parallelism is exploited in the code, e.g. through partitioning into parallel processes, constraints and special instructions (see the sketch after this list)

• Implicit parallelism – the hardware can exploit parallelism in the executed code without any constraints or special instructions defined by the programmer
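A minimal sketch of explicit parallelism in C, assuming a POSIX system with pthreads; the partitioning into two worker threads, and all names used here, are illustrative and not taken from the slides.

```c
/* Explicit parallelism: the programmer partitions a summation into
 * two threads and joins them explicitly. Illustration only. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];

struct range { int lo, hi; double sum; };

static void *partial_sum(void *arg)
{
    struct range *r = arg;
    r->sum = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    struct range a = { 0, N / 2, 0.0 }, b = { N / 2, N, 0.0 };
    pthread_t ta, tb;

    pthread_create(&ta, NULL, partial_sum, &a);   /* explicit fork */
    pthread_create(&tb, NULL, partial_sum, &b);
    pthread_join(ta, NULL);                       /* explicit join */
    pthread_join(tb, NULL);

    printf("sum = %f\n", a.sum + b.sum);
    return 0;
}
```

Compile with `gcc -pthread`; with implicit parallelism the same sequential loop would instead be parallelized by the hardware (or compiler) without these explicit calls.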

Page 6

Flynn’s taxonomy

• In 1966, Michael J. Flynn proposed a classification of computers

                          Data streams
                       One          Many
  Instruction   One    SISD         SIMD
  streams       Many   MISD         MIMD

• SISD: Single instruction stream, single data stream
• SIMD: Single instruction stream, multiple data streams
• MISD: Multiple instruction streams, single data stream
• MIMD: Multiple instruction streams, multiple data streams

Page 7

Flynn’s taxonomy - SISD

[Figure: a single processor fed by one instruction stream and one data stream]

• Capable of executing single instructions operating on a single data stream
• E.g. the conventional von Neumann architecture

Page 8

Flynn’s taxonomy - SIMD

[Figure: one instruction stream broadcast to several processors, each processor with its own data stream]

• Capable of executing the same instruction on all processing elements, each operating on a different data stream
• E.g. vector processors (a minimal sketch follows below)
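A minimal sketch of the SIMD idea in C, assuming the GCC/Clang vector-extension syntax; the vector type and values are illustrative and not taken from the slides.

```c
/* SIMD sketch: one add applied to four data elements at once.
 * Uses the GCC/Clang vector extension; illustration only. */
#include <stdio.h>

typedef float v4sf __attribute__((vector_size(16)));  /* 4 x float */

int main(void)
{
    v4sf a = { 1.0f, 2.0f, 3.0f, 4.0f };
    v4sf b = { 10.0f, 20.0f, 30.0f, 40.0f };

    v4sf c = a + b;   /* single (vector) instruction, multiple data */

    for (int i = 0; i < 4; i++)
        printf("%f\n", c[i]);
    return 0;
}
```

On a processor with SIMD support, the vector add maps to one instruction operating on all four lanes.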

Page 9

Flynn’s taxonomy - MISD

[Figure: several processors, each with its own instruction stream, all operating on the same data stream]

• Executes different instructions on each processing element, all operating on the same data stream
• (Useful for only a limited number of applications)

Page 10

Flynn’s taxonomy - MIMD

[Figure: several processors, each with its own instruction stream and its own data stream]

• Executes multiple instructions on multiple data streams
• E.g. multiprocessors

Page 11

System Bus Architectures

Multi-master point-to-point communication over a single system bus requires bus arbitration.

Processors, co-processors and DMA controllers typically operate as bus masters.


Page 12

System Bus Architectures

Time multiplexing of data and addresses on common lines:
• Lower cost
• Lower performance


Page 13

System Bus Architectures

• A computer can be designed to use multiple buses for different purposes

• A cheaper solution is to include a bridge between buses
• Typically used for e.g. USB or Ethernet


Page 14

System Bus Architectures

Fetch and Store paradigm


Page 15

System Bus Architectures

Conclusions:
• A system bus can only perform one transfer at a time
• It is thus a limited resource for communication
• More than one master (processors, co-processors and DMA controllers) can compete for access to this resource

How can the limitations on communication over a system bus be mitigated?

Page 16

Switching fabrics

Significantly more expensive than a system bus

Page 17

AXI4 channel – switch

The Xilinx AXI4 bus is a derivative of the Arm AMBA bus, developed for SoC applications. The picture shows a switch for AXI4.

It connects one or more similar AXI memory-mapped masters to one or more similar memory-mapped slaves.

Reference: Xilinx User Guide 1037

Page 18

AXI4 and AXI4-Lite bus

Consists of five channels:
• Read address channel
• Write address channel
• Read data channel
• Write data channel
• Write response channel

Data can move simultaneously in both directions. AXI4 allows bursts of up to 256 data transfers using only one address; AXI4-Lite allows only single data transactions (a rough comparison follows after the figure).

[Figure: Master connected to Slave]

A master takes the initiative for a data transfer; the slave responds.
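A rough worked comparison of the address overhead, assuming one bus cycle per address phase and one per data beat (a simplification for illustration; not from the slides):

$$\text{AXI4, one burst of 256 beats: } 1 + 256 = 257 \text{ cycles}$$

$$\text{AXI4-Lite, 256 single transactions: } 256 + 256 = 512 \text{ cycles}$$

In practice the AXI4 address and data channels can also overlap in time, so the actual gain depends on the implementation.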

Page 19

AXI4 bus – read operation (Reference: Xilinx User Guide 1037)

Page 20

AXI4 bus – write operation (Reference: Xilinx User Guide 1037)

Page 21

AXI4-stream

[Figure: Master connected to Slave]

Unidirectional streaming of data

Page 22

AXI4-stream implementation

• Used for high-speed, data-centric streaming applications, e.g. video
• TLAST indicates packet boundaries
• TVALID indicates valid data (a signal-level sketch follows below)

Reference: Xilinx User Guide 1037
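A minimal C sketch, not from the slides, of how a consumer could interpret TVALID and TLAST on a stream of beats; the struct and all names are hypothetical and only illustrate the signal semantics.

```c
/* Hypothetical model of AXI4-Stream beats as seen by a slave.
 * TVALID marks a valid beat; TLAST marks the last beat of a packet. */
#include <stdint.h>
#include <stdio.h>

struct axis_beat {
    uint32_t tdata;   /* payload word                  */
    int      tvalid;  /* 1 = data on the bus is valid  */
    int      tlast;   /* 1 = last beat of the packet   */
};

static void consume(const struct axis_beat *beat, int n)
{
    uint64_t packet_sum = 0;

    for (int i = 0; i < n; i++) {
        if (!beat[i].tvalid)
            continue;                 /* ignore idle cycles */

        packet_sum += beat[i].tdata;

        if (beat[i].tlast) {          /* packet boundary reached */
            printf("packet sum = %llu\n",
                   (unsigned long long)packet_sum);
            packet_sum = 0;
        }
    }
}

int main(void)
{
    struct axis_beat stream[] = {
        { 1, 1, 0 }, { 2, 1, 0 }, { 0, 0, 0 }, { 3, 1, 1 },  /* packet 1 */
        { 4, 1, 0 }, { 5, 1, 1 },                            /* packet 2 */
    };
    consume(stream, (int)(sizeof stream / sizeof stream[0]));
    return 0;
}
```

A real interface also uses TREADY for backpressure from the slave, which is left out here.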

Page 23

AXI4-Stream Interconnect

Parallel routing of traffic between N masters and M slaves

Reference: Xilinx User Guide 1037

Page 24

Multiprocessor architectures

Challenges for multiprocessor architectures:
• Communication
• Coordination
• Contention


Page 25

Challenges

• Communication – must be scalable to handle communication between a large number of processors

• Coordination – a strategy for how to distribute tasks among all processors is required

• Contention – situations where two or more processors try to access a resource at the same time. This problem explodes with an increasing number of processors
  • In particular, problems will occur with memory accesses
  • Caching can mitigate this but introduces another problem:

• Cache coherence – how to guarantee that the cache memories local to each processor carry the same data for shared memory locations? (A small sketch of contention on shared data follows below.)
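A minimal C sketch, not from the slides, of two threads contending for the same memory location; without coordination, increments are lost. It illustrates the kind of shared-data problem that coherence protocols and synchronization must deal with (the names and counts are illustrative).

```c
/* Contention sketch: two threads update one shared counter without
 * coordination, so read-modify-write updates are lost. Illustration only. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared memory location */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* racy read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2000000, but usually less because updates are lost. */
    printf("counter = %ld\n", counter);
    return 0;
}
```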

Page 26

Using Peripheral Processors


Page 27

Performance of Multi-Processor architectures

Speedup = …


Page 28

Data Pipelining

[Figure: input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → output data stream]

• A pipeline divides a larger computational task into a series of smaller tasks

• Benefits:
  • Smaller tasks are less complex to describe
  • Allows for reuse of code modules
  • Reveals coarse-grained parallelism that can be mapped to a multi-processor architecture for increased throughput

Page 29

Data Pipelining

[Figure: input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → output data stream]

• Necessary conditions:
  • A partitionable problem
  • Low communication overhead
  • Processor speed equivalent to that of the single-processor case

Page 30

Data Pipelining

[Figure: input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → output data stream]

$t_{max} \geq t_1 \wedge t_{max} \geq t_2 \wedge \dots \wedge t_{max} \geq t_5$  (i.e. $t_{max}$ is the largest stage time)

$$\mathrm{Throughput}_{max} = \frac{1}{t_{max}} \ \ [\text{data items / time unit}]$$

$$\mathrm{Latency} = t_1 + t_2 + t_3 + t_4 + t_5 \ \ [\text{time units}]$$
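A worked example with assumed stage times (not from the slides): suppose the five stages take $t_1, \dots, t_5 = 2, 4, 3, 5, 1$ time units.

$$t_{max} = \max(2, 4, 3, 5, 1) = 5 \quad\Rightarrow\quad \mathrm{Throughput}_{max} = \frac{1}{5} = 0.2 \ \text{data items per time unit}$$

$$\mathrm{Latency} = 2 + 4 + 3 + 5 + 1 = 15 \ \text{time units}$$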

Page 31

Data Flow Graph

[Figure: input data stream → Actor 1 → Actor 2 → Actor 3 → Actor 4 → output data stream]

A data flow graph describes computations without including any information on how the computation is to be carried out. Hence, only data flow, and no control flow, is described.

This programming paradigm is supported by functional languages such as DFL; it is suitable for digital signal processing systems and also ideal for capturing pipelined computations.

Imperative languages such as C and C++ model both control flow and data flow and are poorly suited for capturing parallelism.
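For intuition only, a minimal C sketch (not from the slides) of the actor chain in the figure written as pure functions; it captures the data flow of the graph, but, as noted above, C itself adds no parallel semantics. All actor definitions are hypothetical.

```c
/* Dataflow intuition: each actor is a side-effect-free function and the
 * output stream is their composition over the input stream. */
#include <stdio.h>

static int actor1(int x) { return x + 1; }   /* hypothetical actors */
static int actor2(int x) { return x * 2; }
static int actor3(int x) { return x - 3; }
static int actor4(int x) { return x * x; }

int main(void)
{
    int input[] = { 1, 2, 3, 4, 5 };

    for (int i = 0; i < 5; i++) {
        /* A data item flows through the whole graph; no shared state. */
        int out = actor4(actor3(actor2(actor1(input[i]))));
        printf("%d -> %d\n", input[i], out);
    }
    return 0;
}
```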

Page 32

Data Pipelining on FPGA logic

[Figure: input data stream → combinatorial network CN → D flip-flop register clocked by Clk → output data stream]

A large combinatorial network CN drives an output register.

• The propagation delay time of CN is $t_p$

• The maximum frequency of the clock signal Clk then becomes $f_{max} = \dfrac{1}{t_p}$
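A worked numeric example with an assumed delay (not from the slides): if the propagation delay of CN is $t_p = 10\ \mathrm{ns}$, then

$$f_{max} = \frac{1}{t_p} = \frac{1}{10\ \mathrm{ns}} = 100\ \mathrm{MHz}$$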

Page 33

Data Pipelining on FPGA logic

[Figure: M pipeline stages, each a combinatorial network CN 1 … CN M followed by a D flip-flop register, all clocked by Clk]

• Assume that CN is partitionable into M smaller combinatorial networks
• Insert registers in between all the combinatorial nets

$t_{p,max} \geq t_{p1} \wedge t_{p,max} \geq t_{p2} \wedge \dots \wedge t_{p,max} \geq t_{pM}$  (i.e. $t_{p,max}$ is the largest stage delay)

$$\text{New } f_{max} = \frac{1}{t_{p,max}}$$

$$\mathrm{Latency} = M \ \text{clock cycles}$$
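Continuing the assumed 10 ns example from the previous page (not from the slides): if CN is split into $M = 5$ stages with the longest stage delay $t_{p,max} = 2\ \mathrm{ns}$,

$$\text{New } f_{max} = \frac{1}{2\ \mathrm{ns}} = 500\ \mathrm{MHz}, \qquad \mathrm{Latency} = 5 \ \text{clock cycles} = 5 \times 2\ \mathrm{ns} = 10\ \mathrm{ns}$$

The throughput rises fivefold while the latency of one data item stays about the same.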

Page 34

Power in computational logic

• The dynamic energy $E_d$ consumed when changing the state of a CMOS logic output:

$$E_d = \frac{1}{2} \cdot C \cdot V_{DD}^2$$

  • $C$ is the total capacitive load of the output
  • $V_{DD}$ is the supply voltage

• The average dynamic power:

$$P_d = \frac{C \cdot V_{DD}^2}{T_{max}} = C \cdot V_{DD}^2 \cdot f_{max}$$

• We can conclude that power dissipation is proportional to the clock frequency and proportional to the square of the supply voltage

• Trying to increase the speed of a processor by simply raising the clock frequency, while technology scaling continues, can only be done until the power wall is reached

• With current technology, $P_{powerwall} = 100 \ \mathrm{W/cm^2}$
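A worked example with assumed values (not from the slides): with a total switching capacitance of $C = 1\ \mathrm{nF}$, $V_{DD} = 1.0\ \mathrm{V}$ and $f_{max} = 1\ \mathrm{GHz}$,

$$P_d = C \cdot V_{DD}^2 \cdot f_{max} = 10^{-9}\ \mathrm{F} \cdot (1.0\ \mathrm{V})^2 \cdot 10^{9}\ \mathrm{Hz} = 1\ \mathrm{W}$$

Doubling the clock frequency doubles $P_d$, whereas raising $V_{DD}$ to $1.2\ \mathrm{V}$ multiplies it by $1.44$.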

Page 35

Power in computational logic

• The delay time $\tau$ of a gate can be approximated as

$$\tau = K \cdot \frac{L \cdot V_{DD}}{V_{DD} - V_{th}}$$

  • $V_{th}$ is the CMOS threshold voltage and $K$, $L$ are technology-dependent constants

• The delay $\tau$ depends mostly on $K$ and $L$ for larger supply voltages $V_{DD}$
• The delay $\tau$ increases dramatically when $V_{DD}$ is decreased close to $V_{th}$

• Dynamic voltage and frequency scaling means that both the supply voltage and the clock frequency are adjusted so that a processor delivers just enough speed

• A reduction of both frequency and supply voltage results in a dramatic reduction of dynamic power consumption
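A rough worked illustration using the power relation from the previous page and assumed numbers (not from the slides): if both the clock frequency and the supply voltage are reduced by 30 %,

$$\frac{P_{new}}{P_{old}} = \frac{C \cdot (0.7\, V_{DD})^2 \cdot (0.7\, f_{max})}{C \cdot V_{DD}^2 \cdot f_{max}} = 0.7^3 \approx 0.34$$

i.e. roughly a threefold reduction in dynamic power for a 30 % reduction in speed, as long as $V_{DD}$ stays well above $V_{th}$.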

Page 36

Using sleep mode to control energy consumption

$E_s = P_s \cdot T_s$ — energy consumed during shutdown

$E_w = P_w \cdot T_w$ — energy consumed during wakeup

Energy consumed when running the processor for a time $t$: $E_{run} = P_{run} \cdot t$

Energy consumed when going to sleep for a time $t$: $E_{sleep} = E_s + E_w + P_{off} \cdot (t - T_s - T_w)$

Energy can be saved when $E_{sleep} < E_{run}$
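A worked example with assumed values (not from the slides): suppose $P_{run} = 100\ \mathrm{mW}$, $P_{off} = 1\ \mathrm{mW}$, $E_s + E_w = 50\ \mathrm{mJ}$ and $T_s + T_w = 1\ \mathrm{s}$. For an idle period of $t = 10\ \mathrm{s}$:

$$E_{run} = 100\ \mathrm{mW} \cdot 10\ \mathrm{s} = 1000\ \mathrm{mJ}$$

$$E_{sleep} = 50\ \mathrm{mJ} + 1\ \mathrm{mW} \cdot (10 - 1)\ \mathrm{s} = 59\ \mathrm{mJ}$$

Since $E_{sleep} < E_{run}$, entering sleep mode saves energy for this idle period.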


Page 37

Example – Battery powered oil detector for wastewater

• A smart sensor can detect petroleum contamination in wastewater

• Numerous sensors are installed at selected checkpoints, which allows tracing of the sources of contamination

• The task of the sensor is to measure the wastewater every 15 minutes and send alarm data over a radio link whenever contamination is detected

• This task finishes in milliseconds, while the rest of the 15-minute cycle is spent sleeping
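A rough duty-cycle estimate with assumed numbers (not from the slides): if the measurement task is active for about $10\ \mathrm{ms}$ out of every 15-minute ($900\ \mathrm{s}$) cycle, and we reuse the assumed $P_{run} = 100\ \mathrm{mW}$ and $P_{off} = 1\ \mathrm{mW}$ from the previous page,

$$\text{Duty cycle} = \frac{10 \times 10^{-3}\ \mathrm{s}}{900\ \mathrm{s}} \approx 1.1 \times 10^{-5}$$

$$P_{avg} \approx P_{run} \cdot 1.1 \times 10^{-5} + P_{off} \approx 1.0\ \mathrm{mW}$$

The average power, and hence the battery life, is dominated by the sleep-mode power rather than by the measurement task.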