04 - DSP Architecture and Microarchitecture
Andreas Ehliar
September 11, 2014
Andreas Ehliar 04 - DSP Architecture and Microarchitecture
Conclusions - Instruction set design
I An assembly language instruction set must be more efficient than Junior
I Accelerations shall be implemented at arithmetic and algorithmic levels.
I Addressing and data accesses can be executed in parallel with arithmetic computing.
I Program flow control, loop or conditional execution, can also be accelerated
Conclusions - Instruction set design
I A DSP processor will seldom have a pure RISC-like instruction set
I To accelerate important DSP kernels, CISC-like extensions are acceptable (especially if they don't add any real hardware cost)
I (Also, note that both RISC and CISC are losers in the processor wars today; real processors are typically hybrids)
What if you can’t create an ASIP?
I Trade program memory for performance
  I To avoid control complexity (loop unrolling)
  I To avoid addressing complexity
I Other clever programming tricks
  I Conditional execution
  I (Self modifying code)
  I Rewrite algorithm
  I etc.
History of DSP architectures
I Von Neumann architecture vs Harvard architecture
[Figure: von Neumann architecture (Memory, Control unit, Arithmetic unit, In-out) vs Harvard architecture (Program memory, Data memory, Control unit, Arithmetic unit, In-out)] [Liu2008, Figure 3.3]
History of DSP architectures
[Figure: three memory configurations (a), (b), (c), each built from DP, DM, CP, and PM, with a MUX added in (b) and (c)] [Liu2008, Figure 3.4]
I a) Normal Harvard architecture
I b) Words from PM can be sent to the datapath
I c) Use a dual port data memory
History of DSP architectures
I Efficient FIR filter with only two memories (PM and DM)
Alt 1: Carries coefficient as immediate (more orthogonal)
    mac A, DM[AR0++%], -1
    mac A, DM[AR0++%], -743
    mac A, DM[AR0++%], 0
    mac A, DM[AR0++%], 8977
    mac A, DM[AR0++%], 16297
    mac A, DM[AR0++%], 8977
    mac A, DM[AR0++%], 0
    mac A, DM[AR0++%], -743
    mac A, DM[AR0++%], -1
    rnd A
Alt 2: Override instruction fetch, fetch data from PM (no need for wide PM)
    conv 8,DM[AR0++%]
    .data -1
    .data -743
    .data 0
    .data 8977
    .data 16297
    .data 8977
    .data 0
    .data -743
    .data -1
    rnd A
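A rough Python model of what both listings compute may help: one multiply-accumulate per tap over a circular data buffer, followed by a rounding step. The function name, the buffer handling, and the rounding (dropping 15 fractional bits, round to nearest) are assumptions made for illustration; the slide does not state the fixed-point format.

```python
# Coefficients taken from the slide's listings.
COEFFS = [-1, -743, 0, 8977, 16297, 8977, 0, -743, -1]

def fir_step(dm, ar0, coeffs=COEFFS, frac_bits=15):
    """Model one FIR output: a MAC per tap, then a rounding step.

    dm        -- circular data buffer (list of ints), standing in for DM
    ar0       -- start index, standing in for address register AR0
    frac_bits -- assumed number of fractional bits dropped by 'rnd'
    Returns (rounded_result, next_ar0).
    """
    acc = 0
    for c in coeffs:
        acc += dm[ar0] * c          # mac A, DM[AR0++%], c
        ar0 = (ar0 + 1) % len(dm)   # post-increment with modulo addressing
    # rnd A: round to nearest by adding half an LSB before shifting (assumed)
    rounded = (acc + (1 << (frac_bits - 1))) >> frac_bits
    return rounded, ar0
```

Alt 1 and Alt 2 differ only in where the coefficients come from (immediates vs. data words in PM), so the same model covers both.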
History of DSP architectures
[Figure: two further memory configurations: (d) DP, DM, CP, a combined PM/DM, a MUX, and a cache; (e) DP, CP, PM, a MUX, and two DMs] [Liu2008, Figure 3.5]
I d) Use a small (loop?) cache to allow for one memory to be shared between PM and DM
I e) Typical three memory configuration.
DSP Processor vs DSP Core
[Figure: block diagram of a DSP subsystem. Labels: Bus and arbitration; DSP processor; DSP core; Interrupt; Timer; MMU; Other peripherals; Main memories; Chip interface; RF; Control path; ADG; DM, DM, PM; DMA; ALU; MAC; Accelerator] [Liu2008, Figure 3.2]
Architecture selection
I Selecting a suitable ASIP architecture for the desired application domain
I The decision includes how many function modules are required, how to interconnect these modules (relations between modules), and how to connect the ASIP to the embedded system
I Closely related to instruction set selection if an efficient implementation is desired
Architecture selection
I DSP processor developers have an advantage over general purpose CPU developers (e.g. Intel, AMD, ARM):
  I Known applications
  I Known scheduling requirements
  I Vector based algorithms and processing
Architecture selection
I Challenges of DSP parallelization
  I Hard real time and high performance
  I Low memory and low power costs
  I Data and control dependencies
I Remember Amdahl's law: Your speedup is ultimately limited by the amount of sequential parts you have in your application.
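Amdahl's law is easy to try out numerically. The sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel units; the function and parameter names are chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Overall speedup when a fraction of the work is spread over n_units."""
    sequential = 1.0 - parallel_fraction
    return 1.0 / (sequential + parallel_fraction / n_units)
```

For example, with 90% of the work parallelized the speedup stays below 10 no matter how many units are added, because the remaining 10% is still executed sequentially.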
Ways to speed up a processor - Discussion break
I Programmer visible:
  I VLIW
  I Multiple memories
  I Accelerators
  I SIMD
I Programmer invisible:
  I Cache
  I Pipelining
  I Superscalar (in- or out-of-order)
  I Data forwarding
  I Voltage scaling
  I Branch prediction
  I Multiple clocks
  I Low-level circuit optimizations
Advanced architectures: Dual MAC
[Figure: dual MAC datapath with two multipliers and two accumulation units. Labels: Register files, A, B, AAR, BAR, MO1, MO2. Legend: accumulation arithmetic unit, multiplier, multiplexer, register or a pipeline stage] [Liu2008, Figure 3.22]
Advanced architectures: Dual MAC
I Allows you to speed up operations such as FIR filters.
I Can allow you to calculate $y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k]$ and $y[n+1] = \sum_{k=0}^{N-1} h[k]\,x[n+1-k]$ at the same time, for example.
I Note: Will roughly halve the number of memory accesses (more on this in a later lecture).
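One way to see where the saving in memory accesses comes from is to count reads in a small model. The sketch below is an illustration only, not the scheduling used in the course: it assumes both MACs share each coefficient read and that the previous data sample is kept in a register, and it requires n >= N - 1 so that all indices are valid. The function names are hypothetical.

```python
def fir_single(h, x, n):
    """y[n] with one MAC; returns (y, memory_reads)."""
    acc, reads = 0, 0
    for k in range(len(h)):
        acc += h[k] * x[n - k]
        reads += 2  # one coefficient read, one sample read
    return acc, reads

def fir_dual(h, x, n):
    """y[n] and y[n+1] with two MACs sharing operands; returns (y_n, y_n1, reads)."""
    acc0 = acc1 = 0
    newer = x[n + 1]  # sample kept in a register between iterations
    reads = 1
    for k in range(len(h)):
        older = x[n - k]
        coeff = h[k]
        reads += 2
        acc0 += coeff * older   # MAC 1: h[k] * x[n - k]
        acc1 += coeff * newer   # MAC 2: h[k] * x[n + 1 - k]
        newer = older           # x[n - k] is x[(n + 1) - (k + 1)]
    return acc0, acc1, reads
```

In this model two outputs cost 4N reads with a single MAC but 2N + 1 reads with the dual MAC, which is the "roughly halve" on the slide.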
Advanced architectures: SIMD
[Figure: SIMD organization. The program memory carries only one instruction; a single I-decoding block drives four execution units, each with its own address] [Liu2008, Figure 3.24 (modified)]
I Advantage: Low power and area
I Disadvantage: Difficult to use efficiently, very difficult target for a compiler.
Advanced architectures: VLIW
I Why: DSP tasks are relatively predictable
  I A parallel datapath gives higher performance
I How: Very Long Instruction Word
  I Multiple instructions issued per cycle
  I Compiler manages data dependencies
I Challenges
  I Memory issues and on-chip connections
  I Register (fan-out ports) costs
  I Hard compiler target
Advanced architectures: Superscalar
I Analyze instruction flow
I Run several instructions in parallel
  I (And possibly out of order)
VLIW vs Superscalar
I VLIW:
  I Relatively easy to design and verify the hardware
  I Not code efficient due to instruction size and NOP instructions
  I Hard to keep binary compatibility
  I Hard to create an efficient compiler
I Superscalar:
  I Hard to design and verify the hardware
  I Good code efficiency, relatively small instructions, no NOPs needed
  I Easier to manage compatibility between processor versions
Multicore architectures
I Heterogeneous or homogeneous
  I Well-known heterogeneous architecture: Cell
  I Well-known homogeneous architecture: modern x86
I Usually harder to program than single-threaded architectures
I Heterogeneous architectures are well suited for ASIPs
  I Standard MCU for the main part of the application
  I Specialized DSP for performance-critical parts
Summary: Advanced Architectures
I Dual MAC: Easy, not a huge improvement
I SIMD DSP: Very good for regular tasks
I VLIW: Good parallelism but hard for compiler
I Superscalar: Relatively easy for a compiler, but highest silicon cost and verification cost
I Multicore: Whenever a single core is not powerful enough
Summary: Advanced Architectures
[Liu2008, Figure 4.5 (modified)]
ASIP Design flow
I Understand target application
I Design architecture and assembly instruction set
I Create microarchitecture specification
I Write RTL code
I Synthesize code
I Backend design
I Tape out
I Celebrate!
ASIP Design flow
[Figure: ASIP design flowchart in three phases]
I Source code profiling
  I Application / function coverage analysis
I Instruction set design
  I Instruction set proposal and 90% 10% locality
  I Instruction set simulator and assembler
  I Benchmarking: speed up and coverage
  I Satisfied? (no / yes)
  I Release the instruction set architecture
I Processor implementation
  I Micro architecture design, RTL, and VLSI
Understanding applications
I Read standards
I Read and profile reference code
I Read research papers about target application
  I IEEE Xplore, Scopus, Google Scholar, etc.
I Interview application expert
90-10 code locality rule
I About 10% of the instructions are used about 90% of the time
  I (Not really a rule, more of an observation)
  I Holds fairly well for DSP-like applications
I We should create an instruction set so that those 10% can be executed efficiently
Profiling
I Pen and paper method
  I Look at reference code (Matlab, C, pseudo code etc.)
  I Manually figure out required number of
    I Arithmetic operations
    I Control flow instructions
    I Memory accesses
    I Address calculations
I Works for small applications/subroutines
  I Such as the problems in the exam
Profiling tools
I Tell compiler that you want to use profiling
  I For gcc: -pg
I Signal (timer interrupt) is generated regularly
  I Every 10 ms when using gcc on Linux
  I Current PC is stored
I Functions are instrumented
  I A counter is incremented for each function call
Profiling example: mplayer using gprof in Linux
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 12.40      2.27     2.27 73363735     0.00     0.00  decode_residual
 11.86      4.44     2.17  6180354     0.00     0.00  put_h264_qpel8o
  8.14      5.93     1.49 21857226     0.00     0.00  h264_h_loop_fil
  7.43      7.29     1.36  9922560     0.00     0.00  ff_h264_decode_
  5.52      8.30     1.01 18181724     0.00     0.00  put_h264_chroma
  4.70      9.16     0.86    82688     0.01     0.07  loop_filter
  4.67     10.02     0.86  9922560     0.00     0.00  ff_h264_filter_
  4.43     10.83     0.81 23096042     0.00     0.00  h264_h_loop_fil
Profiling flow for ASIP designers
I Find reference application code
I Compile it for your desktop computer
  I With profiling/debugging information
I Run it with typical input data
I Look at profiling output to quickly determine parts that are likely to be critical
Profiling pitfalls
I Is your reference code well optimized?
I Is your reference code using hardware features in such a way that your profiling output is misleading?
  I Hand optimized assembler code
  I Memory subsystem, cache size
  I Specialized instructions
  I Vector instructions
Profiling pitfalls
I While care must be taken in interpreting the results, profiling is still a very good friend of an ASIP designer!
I Never optimize your code unless you have profiled it and seen that it makes sense to optimize it!
I "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." (Michael A. Jackson)
Advanced profiling
I Profiling of other kinds of events:
  I Cache hits and cache misses
  I Floating point operations
  I Taken/not taken branches
  I Predicted/mispredicted branches
  I TLB misses
  I Instruction decoder stalls
  I Etc.
I Interested? Look at oprofile or VTune
ASIP profiling
I Once you have an assembler and instruction set simulator for your ASIP you can benchmark your own ASIP
I Count the number of times each instruction is used in a number of applications
I Useful for fine tuning of the DSP architecture but too time consuming to start with
I Without a compiler it is very hard to profile complete applications
I See lab 4!
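Counting instruction usage can be sketched in a few lines once a simulator can emit a trace. The example below assumes a hypothetical trace format, a plain list of executed mnemonics; a real instruction set simulator will have its own output format. The cumulative share column makes it easy to check the 90-10 observation on your own code.

```python
from collections import Counter

def instruction_profile(trace):
    """Return [(mnemonic, count, cumulative_share)] sorted by use, most used first."""
    counts = Counter(trace)
    total = sum(counts.values())
    profile, running = [], 0
    for mnemonic, count in counts.most_common():
        running += count
        profile.append((mnemonic, count, running / total))
    return profile
```

Reading the list from the top until the cumulative share passes 0.9 gives the instructions that deserve the most attention in the instruction set design.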
Static vs Dynamic Profiling
I Dynamic profiling: Running an application with typical input data
I Static profiling: Analyzing the control flow graph of an application and determining worst case execution time (WCET)
  I Critical for hard real time applications!
  I Some hardware features complicate this:
    I Interrupts, caches, superscalar, OoO, branch prediction, DMA, etc.
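In its simplest form, static WCET analysis is a longest-path search over the control flow graph. The sketch below is a strongly simplified illustration: it assumes an acyclic graph (loops already bounded and unrolled or collapsed into a single block cost) and a fixed cycle cost per basic block, which is exactly what the hardware features listed above make hard to guarantee. All names are hypothetical.

```python
from functools import lru_cache

def wcet(cfg, cost, entry):
    """Worst-case cycle count from entry to any exit of an acyclic CFG.

    cfg  -- dict: basic block name -> list of successor block names
    cost -- dict: basic block name -> cycle cost of that block
    """
    @lru_cache(maxsize=None)
    def longest(block):
        successors = cfg.get(block, [])
        return cost[block] + max((longest(s) for s in successors), default=0)
    return longest(entry)
```

Dynamic profiling of the same code would only report the paths that the chosen input data happened to exercise, which may not include the worst one.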
Evaluating instruction sets
I Evaluation of an instruction set
  I Cycle cost and memory usage
  I Suitability for specific applications
I How to evaluate a processor
  I Good assembly instruction set
  I Good (open and scalable) architecture
  I (Max clock frequency, low power, less area)
I Use benchmarking techniques!
General benchmarks
I Algorithm benchmarks/kernel benchmarks
  I Normal precision and native word length
  I What to check:
    I Cycle costs of kernels, prologs, and epilogs
    I Program/data memory costs
  I Algorithms including
    I FIR, IIR, LMS, FFT, DCT, FSM
Third Party Benchmarks
I BDTI: Berkeley Design Technology, Inc.
  I Professional hand written assembly
  I http://www.bdti.com
I EEMBC (the EDN Embedded Microprocessor Benchmark Consortium); the benchmarks fall into five classes:
  I automotive/industrial, consumer, networking, office automation, and telecommunication
  I http://www.eembc.org
Ideal benchmark
I The application you are interested in!
  I Preferably optimized for your DSP architecture
I Difficult in practice
Microarchitecture Design
I Step 1: Partition each assembly instruction into microoperations, allocate each microoperation into corresponding hardware modules.
I Step 2: Collect all microoperations allocated in a module and specify hardware multiplexing for RTL coding of the module
I Step 3: Fine-tune intermodule specifications of the ASIP architecture and finalize the top-level connections and pipeline.
Hardware Multiplexing
I Reusing one hardware module for several different operations
I Example: Signed and unsigned 16-bit multiplication
Hardware Multiplexing
[Figure: multiplexed datapath. Pre processing: multiplexers MA and MB select the operands opa and opb from the inputs A, B, C, and D. Kernel processing: MUL and ADD. Post processing-1: multiplexer MP1 produces result1. Post processing-2: multiplexer MP2 selects the result with or without saturation. Control[3:0] drive the multiplexers.] [Liu2008]
Possible functions:
1. A + C
2. A + D
3. B + C
4. B + D
5. A * C
6. A * D
7. B * C
8. B * D
9. SAT(A + C)
10. SAT(A + D)
11. SAT(B + C)
12. SAT(B + D)
13. SAT(A * C)
14. SAT(A * D)
15. SAT(B * C)
16. SAT(B * D)
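The 16 functions follow from four independent control bits. A small behavioural model makes this concrete; the bit assignment and the 16-bit saturation range below are assumptions made for illustration, since the transcript does not preserve which control bit drives which multiplexer.

```python
def saturate16(value):
    """Clamp to the signed 16-bit range (assumed result width)."""
    return max(-32768, min(32767, value))

def datapath(control, a, b, c, d):
    """Model of the multiplexed datapath; control is a 4-bit integer.

    Assumed encoding: bit 0 selects opa (0: A, 1: B), bit 1 selects opb
    (0: C, 1: D), bit 2 selects the kernel (0: ADD, 1: MUL), bit 3 enables
    saturation.
    """
    opa = b if control & 0b0001 else a
    opb = d if control & 0b0010 else c
    result1 = opa * opb if control & 0b0100 else opa + opb
    return saturate16(result1) if control & 0b1000 else result1
```

One adder, one multiplier, and one saturation unit are shared by all 16 functions; only the multiplexer settings change.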
Hardware multiplexing
I Hardware multiplexing can be implemented either by SW or by configuring the HW
I A processor is basically a very neat design pattern for multiplexing different HW units
I Perhaps the most important skill of a good VLSI designer
Typical design pattern for datapath modules
[Figure: typical design pattern for datapath modules. Collected microoperations feed the pre-operations (I, II, ..., X, Y), then the kernel operations, then the post-operations (I, II, ..., X, Y), ending in results and verifications] [Liu2008]
Discussion break:
I Which of these units is the most area expensive?
  I 17 x 17 bit multiplier
  I 32-bit adder/subtracter
  I 32-bit 16 to 1 mux
  I 32-bit adder
  I 8 KiB memory
Area properties (a.k.a. what to optimize)
I Relative areas of a few different components
  I 32-bit adder: 0.2 to 1 area units
  I 32-bit adder/subtracter: 0.3 to 2 area units
  I 32-bit 16 to 1 mux: 0.5 to 0.6 area units
  I 17 x 17 bit multiplier: 1.3 to 3.7 area units
  I 8 KiB memory (32 bit wide): 33 area units
Performance properties
I Relative maximum frequencies
  I 32-bit adder: 0.1 to 1
  I 32-bit adder/subtracter: 0.1 to 0.9
  I 32-bit 16 to 1 mux: 0.31 to 0.9
  I 17 x 17 bit multiplier: 0.11 to 0.44
  I 8 KiB memory (32 bit wide): 0.53
Optimizing memory size is often the most important task
I MP3 decoder example
I All memories in the chip are 3 times the size of the DSP core itself
I (I/O pads are also larger than the DSP core itself)
Microarchitecture design of an instruction
I Required microoperations for a typical convolution instruction:
I conv ACRx,DM0(ARy%++),DM1(ARz++)
I Required microoperations:
  I Instruction decoding
  I Perform addressing calculation
  I Read memories
  I Perform signed multiplication
  I Add guard bits to the result of the multiplication
  I Accumulate the result
  I Set flags
I For a combined repeat/conv instruction:
  I PC <= PC while in the loop
  I PC <= PC + 1 as the last step in the loop
  I No saturation/rounding during the iteration
  I Saturate/round after final loop iteration
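The microoperations can be put together in a small behavioural model. This is a sketch under assumptions that are not on the slide: signed 16-bit operands (so each product fits in 32 bits), 8 guard bits on the accumulator, and saturation to the 32-bit range after the last iteration. Rounding and flag setting are left out to keep the example short.

```python
def conv(dm0, dm1, ary, arz, taps, guard_bits=8):
    """Model of a repeated conv: multiply-accumulate 'taps' times, then saturate.

    dm0 is read with modulo post-increment addressing (ARy%++), dm1 with
    linear post-increment (ARz++). Operands are assumed to be signed 16-bit
    values, so each product fits in 32 bits; the accumulator is assumed to be
    32 + guard_bits wide.
    """
    acc_min = -(1 << (31 + guard_bits))
    acc_max = (1 << (31 + guard_bits)) - 1
    acc = 0
    for _ in range(taps):                  # PC <= PC while in the loop
        acc += dm0[ary] * dm1[arz]         # signed multiply and accumulate
        assert acc_min <= acc <= acc_max   # guard bits absorb the growth
        ary = (ary + 1) % len(dm0)
        arz += 1
    # After the final iteration: saturate to the 32-bit result range
    return max(-(1 << 31), min((1 << 31) - 1, acc))
```

Saturating only once, after the loop, is what the guard bits make possible: intermediate sums may exceed 32 bits without losing information.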
The register file (RF)
I The RF gets data from data memories by running load instructions while preparing for an execution of a subroutine.
I While running a subroutine, the register file is used as computing buffers.
I After running the subroutine, results in the RF will be stored into data memories by running store instructions.
General register file
[Figure: the register file (RF) in the context of the core. Labels: RF; AGU; PC FSM; PM; DM; Exec unit ALU/MAC; Instruction decoder; Program address; Instruction; Configuration and status; Program flow control; Operation ctrl; MEM ctrl; Operand & result control; Results; Results and flags; Target addresses, configuration vectors from register file] [Liu2008]
I Connected to almost all parts of the core
Register file schematic
[Figure: register file schematic. A write circuit, controlled by ctrl_reg_in, selects the input (from register file, from memory 1, from memory 2, from ALU, from MAC, from ports, ...); the store circuit holds register 1 to register n; a read circuit, controlled by ctrl_o_a and ctrl_o_b, drives the outputs OPA and OPB] [Liu2008]
Register file area comparisons
I Register file area comparisons
  I Adder (for comparison): 0.1 to 1
  I One port, both read/write (uncommon)
    I Relative area: 2.5 to 2.7
  I One read port, one write port: Accumulator registers
    I Relative area: 2.5 to 2.6
  I Two read ports, one write: Basic RISC RF
    I Relative area: 3 to 3.2
  I Four read ports, two write: Simple VLIW, simple superscalar
    I Relative area: 4.6 to 5.8
Register file speed
I Almost (but not quite) the same speed as a very fast 32-bit adder (in this particular technology)
I Also note that it is possible to use special register file memories (but at an increased verification cost)
Read before write or write before read
I A processor architect has to decide how the register file should work when reading and writing the same register
I Read before write
  I The old value is read
I Write before read
  I The new value is read (more costly in terms of the timing budget)
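The difference between the two policies shows up only when a register is read and written in the same cycle. The following behavioural sketch models one clock cycle of a register file with two read ports and one write port; the class, its parameters, and the 32-entry default are hypothetical and only meant to illustrate the two behaviours.

```python
class RegisterFile:
    """Behavioural model of one clock cycle of a 2-read, 1-write register file."""

    def __init__(self, entries=32, write_before_read=False):
        self.regs = [0] * entries
        self.write_before_read = write_before_read

    def cycle(self, raddr_a, raddr_b, we=False, waddr=0, wdata=0):
        """Perform the reads and the optional write of one cycle; return (op_a, op_b)."""
        if self.write_before_read and we:
            self.regs[waddr] = wdata   # new value is visible to this cycle's reads
        op_a, op_b = self.regs[raddr_a], self.regs[raddr_b]
        if not self.write_before_read and we:
            self.regs[waddr] = wdata   # reads above saw the old value
        return op_a, op_b
```

In hardware, write before read typically means bypassing the written data to the read ports, which is why it costs more of the timing budget.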
Physical design: fan-out problem
[Figure: selecting one operand from 32 registers in a register file through five multiplexer stages] [Liu2008]
Fan-out of the control signal:
I For the first stage: 16*16*2 = 512
I For the second stage: 16*8*2 = 256
I For the third stage: 16*4*2 = 128
I For the fourth stage: 16*2*2 = 64
I For the fifth stage: 16*1*2 = 32
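The numbers above follow a simple pattern that can be reproduced for other register file sizes. The sketch below reads each product as (word width) * (number of 2-to-1 multiplexers in the stage) * 2; that reading of the factors, and the function itself, are assumptions made for illustration.

```python
def mux_tree_fanout(registers=32, width=16):
    """Fan-out of each select signal in a binary tree of 2-to-1 multiplexers.

    Stage 1 is closest to the registers. Each 2-to-1 multiplexer is assumed to
    load its select signal twice per bit, matching the factor 2 on the slide.
    """
    fanouts = []
    muxes = registers // 2
    while muxes >= 1:
        fanouts.append(width * muxes * 2)
        muxes //= 2
    return fanouts
```

With the defaults this reproduces the five values on the slide; the first stage dominates, so that is where buffering of the control signal matters most.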
Register File in Verilog
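// The module name and port list are assumed; the slide shows only the body.
module regfile(input clk, input we, input [4:0] waddr, opaddr_a, opaddr_b,
               input [15:0] wdata, output reg [15:0] op_a, op_b);
   reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries
   // Synchronous write port
   always @(posedge clk) begin
      if (we) rf[waddr] <= wdata;
   end
   // Two combinational read ports
   always @* begin
      op_a = rf[opaddr_a];
      op_b = rf[opaddr_b];
   end
endmodule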