04 - dsp architecture and microarchitecture · i von neumann architecture vs harvard architecture...

57
04 - DSP Architecture and Microarchitecture Andreas Ehliar September 11, 2014 Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Upload: others

Post on 26-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

04 - DSP Architecture and Microarchitecture

Andreas Ehliar

September 11, 2014

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 2: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Conclusions - Instruction set design

I An assembly language instruction set must be more efficientthan Junior

I Accelerations shall be implemented at arithmetic andalgorithmic levels.

I Addressing and data accesses can be executed in parallel witharithmetic computing.

I Program flow control, loop or conditional execution, can alsobe accelerated

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 3: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Conclusions - Instruction set design

I A DSP processor will seldomly have a pure RISC-likeinstruction set

I To accelerate important DSP kernels, CISC-like extensions areacceptable (especially if they don’t add any real hardwarecost)

I (Also, note that both RISC and CISC are losers in theprocessor wars today, real processors are typically hybrids)

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 4: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

What if you can’t create an ASIP?

I Trade program memory for performanceI To avoid control complexity (loop unrolling)I To avoid addressing complexity

I Other clever programming tricksI Conditional executionI (Self modifying code)I Rewrite algorithmI etc. . .

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 5: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

History of DSP architectures

I Von Neumann architecture vs Harvard architecture

Memory

Control unit

Arithmetic unit

In-out

Program memory

Control unit

Arithmetic unit

In-out

Data memory

[Liu2008, Figure 3.3]

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 6: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

History of DSP architectures

DP DM

CP PM

DP DM

CP PM

MUX

DP DM

CP PM

MUX

(a) (b) (c)

[Liu2008, Figure 3.4]

I a) Normal Harvard architecture

I b) Words from PM can be sent to the datapath

I c) Use a dual port data memory

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 7: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

History of DSP architectures

I Efficient FIR filter with only two memories (PM and DM)

Alt 1: Carries coefficient Alt 2: Override instruction

as immediate fetch, fetch data from PM

mac A, DM[AR0++%], -1 conv 8,DM[AR0++%]

mac A, DM[AR0++%], -743 .data -1

mac A, DM[AR0++%], 0 .data -743

mac A, DM[AR0++%], 8977 .data 0

mac A, DM[AR0++%], 16297 .data 8977

mac A, DM[AR0++%], 8977 .data 16297

mac A, DM[AR0++%], 0 .data 8977

mac A, DM[AR0++%], -743 .data 0

mac A, DM[AR0++%], -1 .data -743

rnd A .data -1

rnd A

More orthogonal No need for wide PM

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 8: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

History of DSP architectures

DP DM

CP PM/DM

MUX

DP DM

CP PM

MUX

(d) (e)

Cac

he

DM

[Liu2008, Figure 3.5]

I d) Use a small (loop?) cache to allow for one memory to beshared between PM and DM

I e) Typical three memory configuration.

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 9: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

DSP Processor vs DSP Core

Bus and arbiration

DSP Processor

DSP core

Interrupt Timer

MMU

Other pheriph

Main memories

Chip inferface

RF Control pathADG

DM DM PMDMA

ALUMAC accelerator DSP

sub

syst

em

[Liu2008, Figure 3.2]Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 10: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Architecture selection

I Selecting a suitable ASIP architecture for the desiredapplication domain

I The decision includes how many function modules arerequired, how to interconnect these modules (relationsbetween modules), and how to connect the ASIP to theembedded system

I Closely related to instruction set selection if an efficientimplementation is desired

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 11: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Architecture selection

I DSP processor developers have an advantage over generalpurpose CPU developers (e.g. Intel, AMD, ARM):

I Known applicationsI Known scheduling requirementsI Vector based algorithms and processing

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 12: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Architecture selection

I Challenges of DSP parallelizationI Hard real time and high performanceI Low memory and low power costsI Data and control dependencies

I Remember Amdahl’s law: Your speedup is ultimately limitedby the amount of sequential parts you have in yourapplication.

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 13: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Ways to speed up a processor - Discussion break

I Programmer visible:I VLIWI Multiple memoriesI AcceleratorsI SIMD

I Programmer invisibleI CacheI PipeliningI Superscalar (in- or

out-of-order)I DataforwardingI Voltage scalingI Branch predictionI Multiple clocksI Low-level circuit

optimizations

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 14: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Advanced architectures: Dual MAC

* *

+ +Register files

A B

AAR BAR

MO1 MO2+

Legends in this figure

*

Accumulation arithmetic unit

Multiplier

Multiplexer

Register or a pipeline stage

[Liu2008, Figure 3.22]

Liu2008

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 15: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Advanced architectures: Dual MAC

I Allows you to speed up operations such as FIR filters.

I Can allow you to calculate y [n] =∑N−1

k=0 h[k]x [n − k] and

y [n + 1] =∑N−1

k=0 h[k]x [n + 1 − k] at the same time forexample.

I Note: Will roughly halve the number of memory accesses(More on this in a later lecture.)

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 16: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Advanced architectures: SIMD

Program memory carries only one instruction

Address

Execution unit

Address

Execution unit

Address

Execution unit

Address

Execution unit

I-decoding

[Liu2008, Figure 3.24 (modified)]

I Advantage: Low power and area

I Disadvantage: Difficult to use efficiently, very difficult targetfor a compiler.

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 17: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Advanced architectures: VLIW

I Why: DSP tasks are relatively predictableI A parallel datapath gives higher performance

I How: Very Large Instruction WordI Multiple instruction issues per-cycleI Compiler manages data dependency

I ChallengesI Memory issue and on chip connectionsI Register (fan-out ports) costsI Hard compiler target

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 18: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Advanced architectures: Superscalar

I Analyze instruction flowI Run several instruction in parallel

I (And possibly out of order)

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 19: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

VLIW vs Superscalar

I VLIW:I Relatively easy to design

and verify the hardwareI Not code efficient due to

instruction size and NOPinstructions

I Hard to keep binarycompatibility

I Hard to create an efficientcompiler

I SuperscalarI Hard to design and verify

the hardwareI Good code efficiency,

relatively smallinstructions, No NOPsneeded

I Easier to managecompatibility betweenprocessor versions

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 20: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Multicore architectures

I Heterogenous or homogenousI Well known heterogenous architecture: CellI Well known homogenous architecture: Modern X86

I Usually harder to program than single threaded arch.I Heterogenous architectures are well suited for ASIPs

I Standard MCU for main part of applicationI Specialized DSP for performance critical parts

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 21: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Summary: Advanced Architectures

I Dual MAC: Easy, not a huge improvement

I SIMD DSP: Very good for regular tasks

I VLIW: Good parallelism but hard for compiler

I Superscalar: Relatively easy for a compiler, but highest siliconcost and verification cost

I Multicore: Whenever a single core is not powerful enough

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 22: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Summary: Advanced Architectures

[Liu2008, Figure 4.5 (modified)]

Liu2008

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 23: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

ASIP Design flow

I Understand target application

I Design architecture and assembly instruction set

I Create microarchitecture specification

I Write RTL code

I Synthesize code

I Backend design

I Tape out

I Celebrate!

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 24: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

ASIP Design flow

Application / function coverage analysis

Instruction set proposal and 90% 10% locality

Instruction set simulator and assembler

Benchmarking: speed up and coverage

satisfied

Release the instruction set architecture

No

yes

Micro architecture design, RTL, and VLSI

yes

Source code profiling

Instruction set design

Processor implementation

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 25: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Understanding applications

I Read standards

I Read and profile reference codeI Read research papers about target application

I IEEE Xplore, Scopus, Google Scholar, etc

I Interview application expert

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 26: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

90-10 code locality rule

I About 10% of the instructions are used about 90% of thetime

I (Not really a rule, more of an observation)I Holds fairly well for DSP-like applications

I We should create an instruction set so that those 10% can beexecuted efficiently

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 27: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Profiling

I Pen and paper methodI Look at reference code (Matlab, C, pseudo code etc)I Manually figure out required number ofI Arithmetic operationsI Control flow instructionsI Memory accessesI Address calculations

I Works for small applications/subroutinesI Such as the problems in the exam

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 28: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Profiling tools

I Tell compiler that you want to use profilingI For gcc: -pg

I Signal (timer interrupt) is generated regularlyI Every 10 ms when using gcc on Linux

I Current PC is stored

I Functions are instrumentedI A counter is incremented for each function call

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 29: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Profiling example: mplayer using gprof in Linux

Each sample counts as 0.01 seconds.

% cumulative self self total

time seconds seconds calls ms/call ms/call name

12.40 2.27 2.27 73363735 0.00 0.00 decode_residual

11.86 4.44 2.17 6180354 0.00 0.00 put_h264_qpel8o

8.14 5.93 1.49 21857226 0.00 0.00 h264_h_loop_fil

7.43 7.29 1.36 9922560 0.00 0.00 ff_h264_decode_

5.52 8.30 1.01 18181724 0.00 0.00 put_h264_chroma

4.70 9.16 0.86 82688 0.01 0.07 loop_filter

4.67 10.02 0.86 9922560 0.00 0.00 ff_h264_filter_

4.43 10.83 0.81 23096042 0.00 0.00 h264_h_loop_fil

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 30: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Profiling flow for ASIP designers

I Find reference application codeI Compile it for your desktop computer

I With profiling/debugging informationI Run it with typical input dataI Look at profiling output to quickly determine parts that are

likely to be critical

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 31: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Profiling pitfalls

I Is your reference code well optimized?I Is your reference code using hardware features in such a way

that your profiling output is misleading?I Hand optimized assembler codeI Memory subsystem, cache sizeI Specialized instructionsI Vector instructions

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 32: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Profiling pitfalls

I While care must be taken in interpreting the results, profilingis still a very good friend of an ASIP designer!

I Never optimize your code unless you have profiled it and seenthat it makes sense to optimize it!

I ”The First Rule of Program Optimization: Don’t do it. TheSecond Rule of Program Optimization (for experts only!):Don’t do it yet.” — Michael A. Jackson

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 33: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Advanced profiling

I Profiling of other kinds of events:I Cache hits and cache missesI Floating point operationsI Taken/not taken branchesI Predicted/mispredicted branchesI TLB missesI Instruction decoder stallsI Etc...

I Interested? Look at oprofile or VTune

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 34: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

ASIP profiling

I Once you have an assembler and instruction set simulator foryour ASIP you can benchmark your own ASIP

I Count the number of times each instruction is used in anumber of applications

I Useful for fine tuning of the DSP architecture but too timeconsuming to start with

I Without a compiler it is very hard to profile completeapplications

I See lab 4!

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 35: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Static vs Dynamic Profiling

I Dynamic profiling: Running an application with typical inputdata

I Static profiling: Analyzing the control flow graph of anapplication and determining worst case execution time(WCET)

I Critical for hard real time applications!I Some hardware features complicates this:

I Interrupts, caches, superscalar, OoO, branch prediction, DMA,etc...

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 36: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Evaluating instruction sets

I Evaluation of an instruction setI Cycle cost and memory usageI Suitability for specific applications

I How to evaluate a processorI Good assembly instruction setI Good (open and scalable) architectureI (Max clock frequency, low power, less area)

I Use benchmarking techniques!

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 37: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

General benchmarks

I Algorithm benchmarks/kernel benchmarks

I Normal precision and native word lengthI What to check:

I Cycle costs of kernels, prologs, and epilogsI Program/data memory costs

I Algorithms includingI FIR, IIR, LMS, FFT, DCT, FSM

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 38: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Third Party Benchmarks

I BDTI: Berkeley Design Tech IncorporationI Professional hand written assemblyI http://www.bdti.com

I EEMBC (the EDN Embedded Microprocessor BenchmarkConsortium), fall into five classes:

I automotive/industrial, consumer, networking, officeautomation, and telecommunication

I http://www.eembc.org

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 39: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Ideal benchmark

I The application you are interested in!I Preferably optimized for your DSP architecture

I Difficult in practice

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 40: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Microarchitecture Design

I Step 1: Partition each assembly instruction intomicrooperations, allocate each microoperation intocorresponding hardware modules.

I Step 2: Collect all microoperations allocated in a module andspecify hardware multiplexing for RTL coding of the module

I Step 3: Fine-tune intermodule specifications of the ASIParchitecture and finalize the top-level connections andpipeline.

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 41: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Hardware Multiplexing

I Reusing one hardware module for several different operations

I Example: Signed and unsigned 16-bit multiplication

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 42: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Hardware Multiplexing

Pre processing

MA

A B

MB

A B

Kernel processing

MUL

Post processing-1

MP1

Possible functions1. A + C2. A + D3. B + C4. B + D5. A * C6. A * D7. B * C8. B * D9. SAT(A + C)10.SAT(A + D)11.SAT(B + C)12.SAT(B + D)13.SAT(A * C)14.SAT(A * D)15.SAT(B * C)16.SAT(B * D)

ADD

Control[2]

Control[1] Control[0]

0 1

0 1 0 1

Post processing-2

MP2Control[3] 0 1

Saturation

opa opb

result1

C D

[Liu2008]

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 43: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Hardware multiplexing

I Hardware multiplexing can be implemented either by SW orby configuring the HW

I A processor is basically a very neat design pattern formultiplexing different HW units

I Perhaps the most important skill of a good VLSI designer

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 44: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Typical design pattern for datapath modules

Pre-operation-I

Colle

cted

micr

oop

erati

ons

Pre-operation-II

Pre-operation-X

Pre-operation-Y

... ...

Kern

elop

erati

ons

Post-operation-I

Post-operation-II

Post-operation-X

Post-operation-Y

... ...

Resu

ltsan

dve

rifica

tions

[Liu2008]

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 45: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Discussion break:

I What is most area expensive of these units?I 17 x 17 bit multiplierI 32-bit Adder/subtracterI 32-bit 16 to 1 muxI 32-bit AdderI 8 KiB memory

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 46: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Area properties (a.k.a. what to optimize)

I Relative areas of a few different componentsI 32-bit Adder: 0.2 to 1 area unitsI 32-bit Adder/subtracter: 0.3 to 2 area unitsI 32-bit 16 to 1 mux: 0.5 – 0.6 area unitsI 17 x 17 bit multiplier: 1.3-3.7 area unitsI 8 KiB memory (32 bit wide): 33 area units

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 47: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Performance properties

I Relative maximum frequenciesI 32-bit adder: 0.1 to 1I 32-bit adder/subtracter: 0.1 to 0.9I 32-bit 16 to 1 mux: 0.31 to 0.9I 17 x 17 bit multiplier: 0.11 – 0.44I 8 KiB memory (32 bit wide): 0.53

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 48: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Optimizing memory size is often the most important task

I MP3 decoder example

I All memories in the chip are3 time the size of the DSPcore itself

I (I/O pads are also largerthan the DSP core itself)

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 49: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Microarchitecture design of an instruction

I Required microoperations for a typical convolution instruction:

I conv ACRx,DM0[ARy%++),DM1(ARz++)

I Required microoperations:I Instruction decodingI Perform addressing calculationI Read memoriesI Perform signed multiplicationI Add guard bits to the result of the multiplicationI Accumulate the resultI Set flagsI For a combined repeat/conv instruction:

I PC <= PC while in the loopI PC <= PC + 1 as the last step in the loopI No saturation/rounding during the iterationI Saturate/round after final loop iteration

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 50: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

The register file (RF)

I The RF gets data from data memories by running loadinstructions while preparing for an execution of a subroutine.

I While running a subroutine, the register file is used ascomputing buffers.

I After running the subroutine, results in the RF will be storedinto data memories by running store instructions.

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 51: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

General register file

RF

AG

UPC

FSM

PM

Pro

gram

addr

ess

Con

figu

ratio

nan

dst

atus

Exe

cun

itA

LU

/MA

C

Inst

ruct

ion

DM

Program flow control

Operation ctrl

MEM ctrl

Res

ults

Inst

ruct

ion

deco

der

Ope

rand

&re

sult

cont

rol

Results and flags

Target addresses, configuration vectors from register file

[Liu2008]

I Connected to almost all parts of the core

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 52: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Register file schematic

register 1

register 2

register 3

register n

............

from register filefrom memory 1from memory 2from ALU

from MAC

from ports

......

OPA

OPB

ctrl_reg_in

ctrl_o_a

ctrl_o_b

Write circuit

Sto

reci

rcui

t

Read circuit

[Liu2008]

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 53: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Register file area comparisons

I Register file area comparisons

I Adder (for comparison): 0.1 - 1I One port, both read/write: (uncommon)

I Relative area: 2.5-2.7

I One read port, one write port: Accumulator registersI Relative area: 2.5-2.6

I Two read ports, one write: Basic RISC RFI Relative area: 3-3.2

I Four read ports, two write: Simple VLIW, Simple superscalarI Relative area: 4.6-5.8

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 54: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Register file speed

I Almost (but not quite) the same speed as a very fast 32-bitadder (in this particular technology)

I Also note that it is possible to use special register filememories (but at an increased verification cost)

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 55: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Read before write or write before read

I A processor architect has to decide how the register file shouldwork when reading and writing the same register

I Read before writeI The old value is read

I Write before readI The new value is read (more costly in terms of the timing

budget)

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 56: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Physical design: fan-out problem

Fan-out of the control signal

For the first stage: 16*16*2 = 512

Fan-out of the control signal

For the second stage: 16*8*2 = 256

Fan-out of the control signal

For the third stage: 16*4*2 = 128

Fan-out of the control signal

For the fourth stage: 16*2*2 = 64

Fan-out of the control signal

For the fivth stage: 16*1*2 = 32

Selected operand

From 32 registers in a register file

[Liu2008]

Andreas Ehliar 04 - DSP Architecture and Microarchitecture

Page 57: 04 - DSP Architecture and Microarchitecture · I Von Neumann architecture vs Harvard architecture Memory Control unit Arithmetic unit In-out Program memory Control unit Arithmetic

Register File in Verilog

reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries

always @(posedge clk) begin

if(we) rf[waddr] <= wdata;

end

always @* begin

op_a = rf[opaddr_a];

op_b = rf[opaddr_b];

end

Andreas Ehliar 04 - DSP Architecture and Microarchitecture