
Page 1 (source: apachepersonal.miun.se/~bentho/ec/download/f3.pdf)

Benny Thörnberg

Associate Professor in Electronics

Computer and Hardware Architecture II

Page 2

Parallelism – Microscopic vs Macroscopic

• Microscopic parallelism – hardware solutions inside system components that provide parallel computation without being visible to the user, e.g.:

• Registers

• Memory

• Parallel busses

• Instruction pipelines

• Macroscopic parallelism – duplicated large-scale components providing parallelism on the system level

• Dual- or Quad-core processors

• Vector or Graphics processors

• Co-processors

• I/O processors

Page 3

Parallelism – Symmetric vs Asymmetric

• Symmetric parallelism – uses replications of identical processing elements that can operate in parallel

• Multicore processors

• Asymmetric parallelism – uses a set of processing elements that operate in parallel but differ from each other

• PC with CPU, Graphics processor, math processor, I/O processor

Page 4

Parallelism – Fine-grain vs Coarse-grain

• Fine-grain parallelism – computers providing parallel computations on the level of instructions or data items

• Vector processors

• Digital signal processors with special SIMD instructions

• Coarse-grain parallelism – computers providing parallelism on the level of programs or larger data structures

• Dual- or Quad-core processors

Page 5

Parallelism – Explicit vs Implicit

• Explicit parallelism – the programmer needs to control how the available parallelism is exploited in the code, e.g. through partitioning into parallel processes, constraints and special instructions (see the sketch after this list)

• Implicit parallelism – the hardware can exploit parallelism in the executed code without any constraints or special instructions defined by the programmer
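A minimal sketch of explicit parallelism in C, assuming a POSIX system with pthreads; the partitioning into two worker threads, and all names used here, are illustrative and not taken from the slides.

```c
/* Explicit parallelism: the programmer partitions a summation into
 * two threads and joins them explicitly. Illustration only. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];

struct range { int lo, hi; double sum; };

static void *partial_sum(void *arg)
{
    struct range *r = arg;
    r->sum = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    struct range a = { 0, N / 2, 0.0 }, b = { N / 2, N, 0.0 };
    pthread_t ta, tb;

    pthread_create(&ta, NULL, partial_sum, &a);   /* explicit fork */
    pthread_create(&tb, NULL, partial_sum, &b);
    pthread_join(ta, NULL);                       /* explicit join */
    pthread_join(tb, NULL);

    printf("sum = %f\n", a.sum + b.sum);
    return 0;
}
```

Compile with `gcc -pthread`; with implicit parallelism the same sequential loop would instead be parallelized by the hardware (or compiler) without these explicit calls.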

Page 6

Flynn’s taxonomy

• In 1966, Michael J. Flynn proposed a classification of computers

                          Data streams
                       One          Many
  Instruction   One    SISD         SIMD
  streams       Many   MISD         MIMD

• SISD: Single instruction stream, single data stream
• SIMD: Single instruction stream, multiple data streams
• MISD: Multiple instruction streams, single data stream
• MIMD: Multiple instruction streams, multiple data streams

Page 7

Flynn’s taxonomy - SISD

[Figure: a single processor fed by one instruction stream and one data stream]

• Capable of executing single instructions operating on a single data stream
• E.g. the conventional von Neumann architecture

Page 8

Flynn’s taxonomy - SIMD

[Figure: one instruction stream broadcast to several processors, each processor with its own data stream]

• Capable of executing the same instruction on all processing elements, each operating on a different data stream
• E.g. vector processors (a minimal sketch follows below)
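A minimal sketch of the SIMD idea in C, assuming the GCC/Clang vector-extension syntax; the vector type and values are illustrative and not taken from the slides.

```c
/* SIMD sketch: one add applied to four data elements at once.
 * Uses the GCC/Clang vector extension; illustration only. */
#include <stdio.h>

typedef float v4sf __attribute__((vector_size(16)));  /* 4 x float */

int main(void)
{
    v4sf a = { 1.0f, 2.0f, 3.0f, 4.0f };
    v4sf b = { 10.0f, 20.0f, 30.0f, 40.0f };

    v4sf c = a + b;   /* single (vector) instruction, multiple data */

    for (int i = 0; i < 4; i++)
        printf("%f\n", c[i]);
    return 0;
}
```

On a processor with SIMD support, the vector add maps to one instruction operating on all four lanes.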

Page 9

Flynn’s taxonomy - MISD

[Figure: several processors, each with its own instruction stream, all operating on the same data stream]

• Executes different instructions on each processing element, all operating on the same data stream
• (Useful for only a limited number of applications)

Page 10

Flynn’s taxonomy - MIMD

[Figure: several processors, each with its own instruction stream and its own data stream]

• Executes multiple instructions on multiple data streams
• E.g. multiprocessors

Page 11

System Bus Architectures

Multi-master point-to-point communication over a single system bus requires bus arbitration.

Processors, co-processors and DMA controllers typically operate as bus masters.


Page 12

System Bus Architectures

Time multiplexing of data and addresses on common lines:
• Lower cost
• Lower performance


Page 13

System Bus Architectures

• A computer can be designed to use multiple buses for different purposes

• A cheaper solution is to include a bridge between buses
• Typically used for e.g. USB or Ethernet


Page 14

System Bus Architectures

Fetch and Store paradigm


Page 15

System Bus Architectures

Conclusions:
• A system bus can only perform one transfer at a time
• It is thus a limited resource for communication
• More than one master (processors, co-processors and DMA controllers) can compete for access to this resource

How can the limitations on communication over a system bus be mitigated?

Page 16

Switching fabrics

Significantly more expensive than a system bus

Page 17

AXI4 channel – switch

The Xilinx AXI4 bus is a derivative of the Arm AMBA bus, developed for SoC applications. The picture shows a switch for AXI4.

It connects one or more similar AXI memory-mapped masters to one or more similar memory-mapped slaves.

Reference: Xilinx User Guide 1037

Page 18

AXI4 and AXI4-Lite bus

Consists of five channels:
• Read address channel
• Write address channel
• Read data channel
• Write data channel
• Write response channel

Data can move simultaneously in both directions. AXI4 allows bursts of up to 256 data transfers using only one address; AXI4-Lite allows only single data transactions (a rough comparison follows after the figure).

[Figure: Master connected to Slave]

A master takes the initiative for a data transfer; the slave responds.
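A rough worked comparison of the address overhead, assuming one bus cycle per address phase and one per data beat (a simplification for illustration; not from the slides):

$$\text{AXI4, one burst of 256 beats: } 1 + 256 = 257 \text{ cycles}$$

$$\text{AXI4-Lite, 256 single transactions: } 256 + 256 = 512 \text{ cycles}$$

In practice the AXI4 address and data channels can also overlap in time, so the actual gain depends on the implementation.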

Page 19

AXI4 bus – read operation (Reference: Xilinx User Guide 1037)

Page 20

AXI4 bus – write operation (Reference: Xilinx User Guide 1037)

Page 21

AXI4-stream

[Figure: Master connected to Slave]

Unidirectional streaming of data

Page 22

AXI4-stream implementation

• Used for high-speed, data-centric streaming applications, e.g. video
• TLAST indicates packet boundaries
• TVALID indicates valid data (a signal-level sketch follows below)

Reference: Xilinx User Guide 1037
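A minimal C sketch, not from the slides, of how a consumer could interpret TVALID and TLAST on a stream of beats; the struct and all names are hypothetical and only illustrate the signal semantics.

```c
/* Hypothetical model of AXI4-Stream beats as seen by a slave.
 * TVALID marks a valid beat; TLAST marks the last beat of a packet. */
#include <stdint.h>
#include <stdio.h>

struct axis_beat {
    uint32_t tdata;   /* payload word                  */
    int      tvalid;  /* 1 = data on the bus is valid  */
    int      tlast;   /* 1 = last beat of the packet   */
};

static void consume(const struct axis_beat *beat, int n)
{
    uint64_t packet_sum = 0;

    for (int i = 0; i < n; i++) {
        if (!beat[i].tvalid)
            continue;                 /* ignore idle cycles */

        packet_sum += beat[i].tdata;

        if (beat[i].tlast) {          /* packet boundary reached */
            printf("packet sum = %llu\n",
                   (unsigned long long)packet_sum);
            packet_sum = 0;
        }
    }
}

int main(void)
{
    struct axis_beat stream[] = {
        { 1, 1, 0 }, { 2, 1, 0 }, { 0, 0, 0 }, { 3, 1, 1 },  /* packet 1 */
        { 4, 1, 0 }, { 5, 1, 1 },                            /* packet 2 */
    };
    consume(stream, (int)(sizeof stream / sizeof stream[0]));
    return 0;
}
```

A real interface also uses TREADY for backpressure from the slave, which is left out here.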

Page 23

AXI4-Stream Interconnect

Parallel routing of traffic between N masters and M slaves

Reference: Xilinx User Guide 1037

Page 24

Multiprocessor architectures

Challenges for multiprocessor architectures:
• Communication
• Coordination
• Contention


Page 25

Challenges

• Communication – must be scalable to handle communication between a large number of processors

• Coordination – a strategy for how to distribute tasks among all processors is required

• Contention – situations where two or more processors try to access a resource at the same time. This problem explodes with an increasing number of processors
  • In particular, problems will occur with memory accesses
  • Caching can mitigate this but introduces another problem:

• Cache coherence – how to guarantee that the cache memories local to each processor carry the same data for shared memory locations? (A small sketch of contention on shared data follows below.)
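A minimal C sketch, not from the slides, of two threads contending for the same memory location; without coordination, increments are lost. It illustrates the kind of shared-data problem that coherence protocols and synchronization must deal with (the names and counts are illustrative).

```c
/* Contention sketch: two threads update one shared counter without
 * coordination, so read-modify-write updates are lost. Illustration only. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared memory location */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* racy read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2000000, but usually less because updates are lost. */
    printf("counter = %ld\n", counter);
    return 0;
}
```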

Page 26

Using Peripheral Processors


Page 27

Performance of Multi-Processor architectures

Speedup = …


Page 28

Data Pipelining

[Figure: input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → output data stream]

• A pipeline divides a larger computational task into a series of smaller tasks

• Benefits:
  • Smaller tasks are less complex to describe
  • Allows for reuse of code modules
  • Reveals coarse-grained parallelism that can be mapped to a multi-processor architecture for increased throughput

Page 29

Data Pipelining

[Figure: input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → output data stream]

• Necessary conditions:
  • A partitionable problem
  • Low communication overhead
  • Processor speed equivalent to that of the single-processor case

Page 30

Data Pipelining

[Figure: input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → output data stream]

$t_{max} \geq t_1 \wedge t_{max} \geq t_2 \wedge \dots \wedge t_{max} \geq t_5$  (i.e. $t_{max}$ is the largest stage time)

$$\mathrm{Throughput}_{max} = \frac{1}{t_{max}} \ \ [\text{data items / time unit}]$$

$$\mathrm{Latency} = t_1 + t_2 + t_3 + t_4 + t_5 \ \ [\text{time units}]$$
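A worked example with assumed stage times (not from the slides): suppose the five stages take $t_1, \dots, t_5 = 2, 4, 3, 5, 1$ time units.

$$t_{max} = \max(2, 4, 3, 5, 1) = 5 \quad\Rightarrow\quad \mathrm{Throughput}_{max} = \frac{1}{5} = 0.2 \ \text{data items per time unit}$$

$$\mathrm{Latency} = 2 + 4 + 3 + 5 + 1 = 15 \ \text{time units}$$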

Page 31

Data Flow Graph

[Figure: input data stream → Actor 1 → Actor 2 → Actor 3 → Actor 4 → output data stream]

A data flow graph describes computations without including any information on how the computation is to be carried out. Hence, only data flow, and no control flow, is described.

This programming paradigm is supported by functional languages such as DFL; it is suitable for digital signal processing systems and also ideal for capturing pipelined computations.

Imperative languages such as C and C++ model both control flow and data flow and are poorly suited for capturing parallelism.
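For intuition only, a minimal C sketch (not from the slides) of the actor chain in the figure written as pure functions; it captures the data flow of the graph, but, as noted above, C itself adds no parallel semantics. All actor definitions are hypothetical.

```c
/* Dataflow intuition: each actor is a side-effect-free function and the
 * output stream is their composition over the input stream. */
#include <stdio.h>

static int actor1(int x) { return x + 1; }   /* hypothetical actors */
static int actor2(int x) { return x * 2; }
static int actor3(int x) { return x - 3; }
static int actor4(int x) { return x * x; }

int main(void)
{
    int input[] = { 1, 2, 3, 4, 5 };

    for (int i = 0; i < 5; i++) {
        /* A data item flows through the whole graph; no shared state. */
        int out = actor4(actor3(actor2(actor1(input[i]))));
        printf("%d -> %d\n", input[i], out);
    }
    return 0;
}
```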

Page 32

Data Pipelining on FPGA logic

[Figure: input data stream → combinatorial network CN → D flip-flop register clocked by Clk → output data stream]

A large combinatorial network CN drives an output register.

• The propagation delay time of CN is $t_p$

• The maximum frequency of the clock signal Clk then becomes $f_{max} = \dfrac{1}{t_p}$
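A worked numeric example with an assumed delay (not from the slides): if the propagation delay of CN is $t_p = 10\ \mathrm{ns}$, then

$$f_{max} = \frac{1}{t_p} = \frac{1}{10\ \mathrm{ns}} = 100\ \mathrm{MHz}$$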

Page 33

Data Pipelining on FPGA logic

[Figure: M pipeline stages, each a combinatorial network CN 1 … CN M followed by a D flip-flop register, all clocked by Clk]

• Assume that CN is partitionable into M smaller combinatorial networks
• Insert registers in between all the combinatorial nets

$t_{p,max} \geq t_{p1} \wedge t_{p,max} \geq t_{p2} \wedge \dots \wedge t_{p,max} \geq t_{pM}$  (i.e. $t_{p,max}$ is the largest stage delay)

$$\text{New } f_{max} = \frac{1}{t_{p,max}}$$

$$\mathrm{Latency} = M \ \text{clock cycles}$$
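Continuing the assumed 10 ns example from the previous page (not from the slides): if CN is split into $M = 5$ stages with the longest stage delay $t_{p,max} = 2\ \mathrm{ns}$,

$$\text{New } f_{max} = \frac{1}{2\ \mathrm{ns}} = 500\ \mathrm{MHz}, \qquad \mathrm{Latency} = 5 \ \text{clock cycles} = 5 \times 2\ \mathrm{ns} = 10\ \mathrm{ns}$$

The throughput rises fivefold while the latency of one data item stays about the same.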

Page 34

Power in computational logic

• The dynamic energy $E_d$ consumed when changing the state of a CMOS logic output:

$$E_d = \frac{1}{2} \cdot C \cdot V_{DD}^2$$

  • $C$ is the total capacitive load of the output
  • $V_{DD}$ is the supply voltage

• The average dynamic power:

$$P_d = \frac{C \cdot V_{DD}^2}{T_{max}} = C \cdot V_{DD}^2 \cdot f_{max}$$

• We can conclude that power dissipation is proportional to the clock frequency and proportional to the square of the supply voltage

• Trying to increase the speed of a processor by simply raising the clock frequency, while technology scaling continues, can only be done until the power wall is reached

• With current technology, $P_{powerwall} = 100 \ \mathrm{W/cm^2}$
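A worked example with assumed values (not from the slides): with a total switching capacitance of $C = 1\ \mathrm{nF}$, $V_{DD} = 1.0\ \mathrm{V}$ and $f_{max} = 1\ \mathrm{GHz}$,

$$P_d = C \cdot V_{DD}^2 \cdot f_{max} = 10^{-9}\ \mathrm{F} \cdot (1.0\ \mathrm{V})^2 \cdot 10^{9}\ \mathrm{Hz} = 1\ \mathrm{W}$$

Doubling the clock frequency doubles $P_d$, whereas raising $V_{DD}$ to $1.2\ \mathrm{V}$ multiplies it by $1.44$.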

Page 35

Power in computational logic

• The delay time $\tau$ of a gate can be approximated as

$$\tau = K \cdot \frac{L \cdot V_{DD}}{V_{DD} - V_{th}}$$

  • $V_{th}$ is the CMOS threshold voltage and $K$, $L$ are technology-dependent constants

• The delay $\tau$ depends mostly on $K$ and $L$ for larger supply voltages $V_{DD}$
• The delay $\tau$ increases dramatically when $V_{DD}$ is decreased close to $V_{th}$

• Dynamic voltage and frequency scaling means that both the supply voltage and the clock frequency are adjusted so that a processor delivers just enough speed

• A reduction of both frequency and supply voltage results in a dramatic reduction of dynamic power consumption
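A rough worked illustration using the power relation from the previous page and assumed numbers (not from the slides): if both the clock frequency and the supply voltage are reduced by 30 %,

$$\frac{P_{new}}{P_{old}} = \frac{C \cdot (0.7\, V_{DD})^2 \cdot (0.7\, f_{max})}{C \cdot V_{DD}^2 \cdot f_{max}} = 0.7^3 \approx 0.34$$

i.e. roughly a threefold reduction in dynamic power for a 30 % reduction in speed, as long as $V_{DD}$ stays well above $V_{th}$.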

Page 36

Using sleep mode to control energy consumption

$E_s = P_s \cdot T_s$ — energy consumed during shutdown

$E_w = P_w \cdot T_w$ — energy consumed during wakeup

Energy consumed when running the processor for a time $t$: $E_{run} = P_{run} \cdot t$

Energy consumed when going to sleep for a time $t$: $E_{sleep} = E_s + E_w + P_{off} \cdot (t - T_s - T_w)$

Energy can be saved when $E_{sleep} < E_{run}$
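A worked example with assumed values (not from the slides): suppose $P_{run} = 100\ \mathrm{mW}$, $P_{off} = 1\ \mathrm{mW}$, $E_s + E_w = 50\ \mathrm{mJ}$ and $T_s + T_w = 1\ \mathrm{s}$. For an idle period of $t = 10\ \mathrm{s}$:

$$E_{run} = 100\ \mathrm{mW} \cdot 10\ \mathrm{s} = 1000\ \mathrm{mJ}$$

$$E_{sleep} = 50\ \mathrm{mJ} + 1\ \mathrm{mW} \cdot (10 - 1)\ \mathrm{s} = 59\ \mathrm{mJ}$$

Since $E_{sleep} < E_{run}$, entering sleep mode saves energy for this idle period.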


Page 37

Example – Battery powered oil detector for wastewater

• A smart sensor can detect petroleum contamination in wastewater

• Numerous sensors are installed at selected checkpoints, which allows tracing of the sources of contamination

• The task of the sensor is to measure the wastewater every 15 minutes and send alarm data over a radio link whenever contamination is detected

• This task finishes in milliseconds, while the rest of the 15-minute cycle is spent sleeping
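A rough duty-cycle estimate with assumed numbers (not from the slides): if the measurement task is active for about $10\ \mathrm{ms}$ out of every 15-minute ($900\ \mathrm{s}$) cycle, and we reuse the assumed $P_{run} = 100\ \mathrm{mW}$ and $P_{off} = 1\ \mathrm{mW}$ from the previous page,

$$\text{Duty cycle} = \frac{10 \times 10^{-3}\ \mathrm{s}}{900\ \mathrm{s}} \approx 1.1 \times 10^{-5}$$

$$P_{avg} \approx P_{run} \cdot 1.1 \times 10^{-5} + P_{off} \approx 1.0\ \mathrm{mW}$$

The average power, and hence the battery life, is dominated by the sleep-mode power rather than by the measurement task.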