04 - dsp architecture and microarchitecture · i von neumann architecture vs harvard architecture...

04 - DSP Architecture and Microarchitecture

Andreas Ehliar

September 11, 2014

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Conclusions - Instruction set design

I An assembly language instruction set must be more efficientthan Junior

I Accelerations shall be implemented at arithmetic andalgorithmic levels.

I Addressing and data accesses can be executed in parallel witharithmetic computing.

I Program flow control, loop or conditional execution, can alsobe accelerated


Conclusions - Instruction set design

I A DSP processor will seldomly have a pure RISC-likeinstruction set

I To accelerate important DSP kernels, CISC-like extensions areacceptable (especially if they don’t add any real hardwarecost)

I (Also, note that both RISC and CISC are losers in theprocessor wars today, real processors are typically hybrids)


What if you can’t create an ASIP?

I Trade program memory for performanceI To avoid control complexity (loop unrolling)I To avoid addressing complexity

I Other clever programming tricksI Conditional executionI (Self modifying code)I Rewrite algorithmI etc. . .


History of DSP architectures

I Von Neumann architecture vs Harvard architecture

Memory

Control unit

Arithmetic unit

In-out

Program memory

Control unit

Arithmetic unit

In-out

Data memory

[Liu2008, Figure 3.3]



DP DM

CP PM

DP DM

CP PM

MUX

DP DM

CP PM

MUX

(a) (b) (c)


I a) Normal Harvard architecture

I b) Words from PM can be sent to the datapath

I c) Use a dual port data memory



I Efficient FIR filter with only two memories (PM and DM)

Alt 1: Carries coefficient Alt 2: Override instruction

as immediate fetch, fetch data from PM

mac A, DM[AR0++%], -1 conv 8,DM[AR0++%]

mac A, DM[AR0++%], -743 .data -1

mac A, DM[AR0++%], 0 .data -743

mac A, DM[AR0++%], 8977 .data 0

mac A, DM[AR0++%], 16297 .data 8977

mac A, DM[AR0++%], 8977 .data 16297

mac A, DM[AR0++%], 0 .data 8977

mac A, DM[AR0++%], -743 .data 0

mac A, DM[AR0++%], -1 .data -743

rnd A .data -1

rnd A

More orthogonal No need for wide PM



DP DM

CP PM/DM

MUX

DP DM

CP PM

MUX

(d) (e)

Cac

he

DM


I d) Use a small (loop?) cache to allow for one memory to beshared between PM and DM

I e) Typical three memory configuration.


DSP Processor vs DSP Core

Bus and arbiration

DSP Processor

DSP core

Interrupt Timer

MMU

Other pheriph

Main memories

Chip inferface

RF Control pathADG

DM DM PMDMA

ALUMAC accelerator DSP

sub

syst

em

[Liu2008, Figure 3.2]Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Architecture selection

I Selecting a suitable ASIP architecture for the desiredapplication domain

I The decision includes how many function modules arerequired, how to interconnect these modules (relationsbetween modules), and how to connect the ASIP to theembedded system

I Closely related to instruction set selection if an efficientimplementation is desired



I DSP processor developers have an advantage over generalpurpose CPU developers (e.g. Intel, AMD, ARM):

I Known applicationsI Known scheduling requirementsI Vector based algorithms and processing



I Challenges of DSP parallelizationI Hard real time and high performanceI Low memory and low power costsI Data and control dependencies

I Remember Amdahl’s law: Your speedup is ultimately limitedby the amount of sequential parts you have in yourapplication.


Ways to speed up a processor - Discussion break

I Programmer visible:I VLIWI Multiple memoriesI AcceleratorsI SIMD

I Programmer invisibleI CacheI PipeliningI Superscalar (in- or

out-of-order)I DataforwardingI Voltage scalingI Branch predictionI Multiple clocksI Low-level circuit

optimizations


Advanced architectures: Dual MAC

* *

+ +Register files

A B

AAR BAR

MO1 MO2+

Legends in this figure

*

Accumulation arithmetic unit

Multiplier

Multiplexer

Register or a pipeline stage


Liu2008


Advanced architectures: Dual MAC

I Allows you to speed up operations such as FIR filters.

I Can allow you to calculate y [n] =∑N−1

k=0 h[k]x [n − k] and

y [n + 1] =∑N−1

k=0 h[k]x [n + 1 − k] at the same time forexample.

I Note: Will roughly halve the number of memory accesses(More on this in a later lecture.)


Advanced architectures: SIMD

Program memory carries only one instruction

Address

Execution unit

Address

Execution unit

Address

Execution unit

Address

Execution unit

I-decoding

[Liu2008, Figure 3.24 (modified)]

I Advantage: Low power and area

I Disadvantage: Difficult to use efficiently, very difficult targetfor a compiler.


Advanced architectures: VLIW

I Why: DSP tasks are relatively predictableI A parallel datapath gives higher performance

I How: Very Large Instruction WordI Multiple instruction issues per-cycleI Compiler manages data dependency

I ChallengesI Memory issue and on chip connectionsI Register (fan-out ports) costsI Hard compiler target


Advanced architectures: Superscalar

I Analyze instruction flowI Run several instruction in parallel

I (And possibly out of order)


VLIW vs Superscalar

I VLIW:I Relatively easy to design

and verify the hardwareI Not code efficient due to

instruction size and NOPinstructions

I Hard to keep binarycompatibility

I Hard to create an efficientcompiler

I SuperscalarI Hard to design and verify

the hardwareI Good code efficiency,

relatively smallinstructions, No NOPsneeded

I Easier to managecompatibility betweenprocessor versions


Multicore architectures

I Heterogenous or homogenousI Well known heterogenous architecture: CellI Well known homogenous architecture: Modern X86

I Usually harder to program than single threaded arch.I Heterogenous architectures are well suited for ASIPs

I Standard MCU for main part of applicationI Specialized DSP for performance critical parts


Summary: Advanced Architectures

I Dual MAC: Easy, not a huge improvement

I SIMD DSP: Very good for regular tasks

I VLIW: Good parallelism but hard for compiler

I Superscalar: Relatively easy for a compiler, but highest siliconcost and verification cost

I Multicore: Whenever a single core is not powerful enough


Summary: Advanced Architectures

[Liu2008, Figure 4.5 (modified)]

Liu2008


ASIP Design flow

I Understand target application

I Design architecture and assembly instruction set

I Create microarchitecture specification

I Write RTL code

I Synthesize code

I Backend design

I Tape out

I Celebrate!


ASIP Design flow

Application / function coverage analysis

Instruction set proposal and 90% 10% locality

Instruction set simulator and assembler

Benchmarking: speed up and coverage

satisfied

Release the instruction set architecture

No

yes

Micro architecture design, RTL, and VLSI

yes

Source code profiling

Instruction set design

Processor implementation


Understanding applications

I Read standards

I Read and profile reference codeI Read research papers about target application

I IEEE Xplore, Scopus, Google Scholar, etc

I Interview application expert


90-10 code locality rule

I About 10% of the instructions are used about 90% of thetime

I (Not really a rule, more of an observation)I Holds fairly well for DSP-like applications

I We should create an instruction set so that those 10% can beexecuted efficiently


Profiling

I Pen and paper methodI Look at reference code (Matlab, C, pseudo code etc)I Manually figure out required number ofI Arithmetic operationsI Control flow instructionsI Memory accessesI Address calculations

I Works for small applications/subroutinesI Such as the problems in the exam


Profiling tools

I Tell compiler that you want to use profilingI For gcc: -pg

I Signal (timer interrupt) is generated regularlyI Every 10 ms when using gcc on Linux

I Current PC is stored

I Functions are instrumentedI A counter is incremented for each function call


Profiling example: mplayer using gprof in Linux

Each sample counts as 0.01 seconds.

% cumulative self self total

time seconds seconds calls ms/call ms/call name

12.40 2.27 2.27 73363735 0.00 0.00 decode_residual

11.86 4.44 2.17 6180354 0.00 0.00 put_h264_qpel8o

8.14 5.93 1.49 21857226 0.00 0.00 h264_h_loop_fil

7.43 7.29 1.36 9922560 0.00 0.00 ff_h264_decode_

5.52 8.30 1.01 18181724 0.00 0.00 put_h264_chroma

4.70 9.16 0.86 82688 0.01 0.07 loop_filter

4.67 10.02 0.86 9922560 0.00 0.00 ff_h264_filter_

4.43 10.83 0.81 23096042 0.00 0.00 h264_h_loop_fil


Profiling flow for ASIP designers

I Find reference application codeI Compile it for your desktop computer

I With profiling/debugging informationI Run it with typical input dataI Look at profiling output to quickly determine parts that are

likely to be critical


Profiling pitfalls

I Is your reference code well optimized?I Is your reference code using hardware features in such a way

that your profiling output is misleading?I Hand optimized assembler codeI Memory subsystem, cache sizeI Specialized instructionsI Vector instructions


Profiling pitfalls

I While care must be taken in interpreting the results, profilingis still a very good friend of an ASIP designer!

I Never optimize your code unless you have profiled it and seenthat it makes sense to optimize it!

I ”The First Rule of Program Optimization: Don’t do it. TheSecond Rule of Program Optimization (for experts only!):Don’t do it yet.” — Michael A. Jackson


Advanced profiling

I Profiling of other kinds of events:I Cache hits and cache missesI Floating point operationsI Taken/not taken branchesI Predicted/mispredicted branchesI TLB missesI Instruction decoder stallsI Etc...

I Interested? Look at oprofile or VTune


ASIP profiling

I Once you have an assembler and instruction set simulator foryour ASIP you can benchmark your own ASIP

I Count the number of times each instruction is used in anumber of applications

I Useful for fine tuning of the DSP architecture but too timeconsuming to start with

I Without a compiler it is very hard to profile completeapplications

I See lab 4!


Static vs Dynamic Profiling

I Dynamic profiling: Running an application with typical inputdata

I Static profiling: Analyzing the control flow graph of anapplication and determining worst case execution time(WCET)

I Critical for hard real time applications!I Some hardware features complicates this:

I Interrupts, caches, superscalar, OoO, branch prediction, DMA,etc...


Evaluating instruction sets

I Evaluation of an instruction setI Cycle cost and memory usageI Suitability for specific applications

I How to evaluate a processorI Good assembly instruction setI Good (open and scalable) architectureI (Max clock frequency, low power, less area)

I Use benchmarking techniques!


General benchmarks

I Algorithm benchmarks/kernel benchmarks

I Normal precision and native word lengthI What to check:

I Cycle costs of kernels, prologs, and epilogsI Program/data memory costs

I Algorithms includingI FIR, IIR, LMS, FFT, DCT, FSM


Third Party Benchmarks

I BDTI: Berkeley Design Tech IncorporationI Professional hand written assemblyI http://www.bdti.com

I EEMBC (the EDN Embedded Microprocessor BenchmarkConsortium), fall into five classes:

I automotive/industrial, consumer, networking, officeautomation, and telecommunication

I http://www.eembc.org


Ideal benchmark

I The application you are interested in!I Preferably optimized for your DSP architecture

I Difficult in practice


Microarchitecture Design

I Step 1: Partition each assembly instruction intomicrooperations, allocate each microoperation intocorresponding hardware modules.

I Step 2: Collect all microoperations allocated in a module andspecify hardware multiplexing for RTL coding of the module

I Step 3: Fine-tune intermodule specifications of the ASIParchitecture and finalize the top-level connections andpipeline.


Hardware Multiplexing

I Reusing one hardware module for several different operations

I Example: Signed and unsigned 16-bit multiplication


Hardware Multiplexing

Pre processing

MA

A B

MB

A B

Kernel processing

MUL

Post processing-1

MP1

Possible functions1. A + C2. A + D3. B + C4. B + D5. A * C6. A * D7. B * C8. B * D9. SAT(A + C)10.SAT(A + D)11.SAT(B + C)12.SAT(B + D)13.SAT(A * C)14.SAT(A * D)15.SAT(B * C)16.SAT(B * D)

ADD

Control[2]

Control[1] Control[0]

0 1

0 1 0 1

Post processing-2

MP2Control[3] 0 1

Saturation

opa opb

result1

C D

[Liu2008]


Hardware multiplexing

I Hardware multiplexing can be implemented either by SW orby configuring the HW

I A processor is basically a very neat design pattern formultiplexing different HW units

I Perhaps the most important skill of a good VLSI designer


Typical design pattern for datapath modules

Pre-operation-I

Colle

cted

micr

oop

erati

ons

Pre-operation-II

Pre-operation-X

Pre-operation-Y

... ...

Kern

elop

erati

ons

Post-operation-I

Post-operation-II

Post-operation-X

Post-operation-Y

... ...

Resu

ltsan

dve

rifica

tions

[Liu2008]


Discussion break:

I What is most area expensive of these units?I 17 x 17 bit multiplierI 32-bit Adder/subtracterI 32-bit 16 to 1 muxI 32-bit AdderI 8 KiB memory


Area properties (a.k.a. what to optimize)

I Relative areas of a few different componentsI 32-bit Adder: 0.2 to 1 area unitsI 32-bit Adder/subtracter: 0.3 to 2 area unitsI 32-bit 16 to 1 mux: 0.5 – 0.6 area unitsI 17 x 17 bit multiplier: 1.3-3.7 area unitsI 8 KiB memory (32 bit wide): 33 area units


Performance properties

I Relative maximum frequenciesI 32-bit adder: 0.1 to 1I 32-bit adder/subtracter: 0.1 to 0.9I 32-bit 16 to 1 mux: 0.31 to 0.9I 17 x 17 bit multiplier: 0.11 – 0.44I 8 KiB memory (32 bit wide): 0.53


Optimizing memory size is often the most important task

I MP3 decoder example

I All memories in the chip are3 time the size of the DSPcore itself

I (I/O pads are also largerthan the DSP core itself)


Microarchitecture design of an instruction

I Required microoperations for a typical convolution instruction:

I conv ACRx,DM0[ARy%++),DM1(ARz++)

I Required microoperations:I Instruction decodingI Perform addressing calculationI Read memoriesI Perform signed multiplicationI Add guard bits to the result of the multiplicationI Accumulate the resultI Set flagsI For a combined repeat/conv instruction:

I PC <= PC while in the loopI PC <= PC + 1 as the last step in the loopI No saturation/rounding during the iterationI Saturate/round after final loop iteration


The register file (RF)

I The RF gets data from data memories by running loadinstructions while preparing for an execution of a subroutine.

I While running a subroutine, the register file is used ascomputing buffers.

I After running the subroutine, results in the RF will be storedinto data memories by running store instructions.


General register file

RF

AG

UPC

FSM

PM

Pro

gram

addr

ess

Con

figu

ratio

nan

dst

atus

Exe

cun

itA

LU

/MA

C

Inst

ruct

ion

DM

Program flow control

Operation ctrl

MEM ctrl

Res

ults

Inst

ruct

ion

deco

der

Ope

rand

&re

sult

cont

rol

Results and flags

Target addresses, configuration vectors from register file

[Liu2008]

I Connected to almost all parts of the core


Register file schematic

register 1

register 2

register 3

register n

............

from register filefrom memory 1from memory 2from ALU

from MAC

from ports

......

OPA

OPB

ctrl_reg_in

ctrl_o_a

ctrl_o_b

Write circuit

Sto

reci

rcui

t

Read circuit

[Liu2008]


Register file area comparisons

I Register file area comparisons

I Adder (for comparison): 0.1 - 1I One port, both read/write: (uncommon)

I Relative area: 2.5-2.7

I One read port, one write port: Accumulator registersI Relative area: 2.5-2.6

I Two read ports, one write: Basic RISC RFI Relative area: 3-3.2

I Four read ports, two write: Simple VLIW, Simple superscalarI Relative area: 4.6-5.8


Register file speed

I Almost (but not quite) the same speed as a very fast 32-bitadder (in this particular technology)

I Also note that it is possible to use special register filememories (but at an increased verification cost)


Read before write or write before read

I A processor architect has to decide how the register file shouldwork when reading and writing the same register

I Read before writeI The old value is read

I Write before readI The new value is read (more costly in terms of the timing

budget)


Physical design: fan-out problem

Fan-out of the control signal

For the first stage: 16*16*2 = 512


For the second stage: 16*8*2 = 256


For the third stage: 16*4*2 = 128


For the fourth stage: 16*2*2 = 64


For the fivth stage: 16*1*2 = 32

Selected operand

From 32 registers in a register file

[Liu2008]


Register File in Verilog

reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries

always @(posedge clk) begin

if(we) rf[waddr] <= wdata;

end

always @* begin

op_a = rf[opaddr_a];

op_b = rf[opaddr_b];

end


04 - dsp architecture and microarchitecture · i von neumann architecture vs harvard architecture...

Documents