by Dake Liu: [email protected]© Copyright of Linköping University, all rights reserved ®
10/9/2017 Unit 11 of TSEA26 – 2017 –H1 1
Design of Embedded DSP Processors
Unit 11: ASIP design review and applications
Review of the course
To save your time
What should we get from the course
1. ASIP concept & design flow
2. Profiling and plan for HW design
3. ASM and micro architecture design
– ALU, MAC, and Register file
– PFC and memory addressing
4. Toolchain design
5. FW design and benchmark
6. Integration and verification
7. Advanced ASIP architectures
What is an ASIP
• HW-SW co-design for an application domain
• Accelerate the 10% of the code that runs 90% of the time
• Scoped flexibility
• Usually based on an instruction set template
• Plus custom architecture for datapath, data access, and control
• To reach custom performance, power, and silicon cost
Compare CPU, ASIP, and ASIC
CPU
• All general applications
• x86, ARM
• High end, high cost
• Strong software ecosystem support
ASIP
• For embedded applications
• Low software ecosystem requirement
• Performance designed for an application domain
ASIC
• A function module, not programmable
• Very high performance
• High design cost and short lifetime
Embedded computing performance required by market
[Figure: chart of application computing demands. Performance axis: 1 GOPS, 10, 100, 1T, 10T. Domains: Video/image, Baseband, Learning, Graphics.]
Applications placed on the chart: H.263, H.264, H.265, 1080p, 8k, 3DTV, high-end ISP for quality cameras; GSM, WCDMA, HSPA, 11g/a, LTE, LTE-A, LTE-Hi terminals, LTE base stations, 5G BS; word recognition, language learning, car registration, deep learning in terminal; 2D graphics, 3D video games, AR/VR associated; CT, radar, ultrasonic array.
ASIP on markets (mostly in SoC as IP)
ASIP | Applications | IP/year | $/year
SDR baseband | Handset, base station | 1B | 3-5B
ISP for image and video | Handset, video, camera | 1B | 1-2B
Video codec | Handset, surveillance | 0.5B | 2-4B
Storage | SSD, memory cards | >100M | ~500M
Gateways | Gateway, home gateway | >50M | ~500M
Network processors | ISP, router, industrial | >10M | ~100M
Industrial control | Motion and motor control | 100M | 300M
Robots | Vision, control | 10M | 100M
IoT | Communication, sensing | 50B | 50B
Deep learning | Server, terminal | ? | ?
Defense DFE | Baseband, sensing, ISP | ? | ?
Defense AP | Recognition, decision | ? | ?
... video applications, VR, AR, medical, toys, home, and much more ...
ASIP design flow
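[Figure: flowchart with four steps and two decision points]
1. Source code analysis, decision for the ISA of the ASIP
2. Design instruction set and toolchain for prototyping
3. Benchmark (kernel), evaluate microarchitecture
4. Microarchitecture design, VLSI design, verifications
Decision points (Yes / No branches): "Satisfied?" and "Change ISA?"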
Code profiling to find what to accelerate
• We gave examples, but did not systematically cover profiling methods and profiling tools.
– Collect algorithms from code and related textbooks
• Algorithm scope is related to product lifetime (up to you)
– Profiling flow (tool selection and use of the tool):
• Select a (static/dynamic) tool and the right data set, set up the flow
• Find the 10% of the code that runs 90% of the time, and accelerate it!
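The payoff of accelerating the profiled hot spot can be estimated with Amdahl's law. The sketch below is an illustration added to these notes, not part of the course material; the function name and the 10x local speedup are assumptions chosen only for the example.

```python
def overall_speedup(hot_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the hot part of the run time is accelerated.

    hot_fraction  -- share of total run time spent in the accelerated code (0..1)
    local_speedup -- how many times faster that code becomes (> 0)
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / local_speedup)


# Code that takes 90% of the run time, accelerated 10x (hypothetical factor).
print(round(overall_speedup(0.9, 10.0), 2))  # 5.26
```

Even with an unlimited local speedup, the remaining 10% of the run time caps the overall gain at 10x, which is why the data set used for profiling matters.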
What shall we accelerate
1. Computing: Instruction fusion or magic instructions (Eric Anders)
2. Data access: Hide (pipeline) the data access cost behind computing (Andreas K)
3. Control: Minimize control overheads by hiding them or using extra control HW (ch14)
4. NoC: Reduce SoC / NoC cost before chip integration (special I/O for core).
How do we accelerate
[Figure: ASIP design flow for acceleration. Boxes recovered from the diagram:]
• ASIP requirement specification
• Early manual partition according to application profiling
  – Implement the function as a subroutine
  – Implement the function as an instruction
• Application SW implementation
• Instruction set specification, assembly instruction set simulator, benchmarking of instruction set
• Processor architecture specification, microarchitecture design, processor HW implementation
• Design for HW acceleration
• ASIP integration, final function verification and performance validation
ISA template selection
A typical ASIP DSP processor assembly instruction set
[Figure: instruction set tree, with the labels RISC and CISC]
• Move instructions
  – Load immediate data
  – Load or store between memory and registers
  – Move between registers
• Arithmetic instructions
  – General arithmetic, logic, shift / rotate instructions
  – Division and other vector and iterative instructions
  – Long arithmetic operations
  – Bit and bits manipulations
• Control
  – Branch and call
  – Other program flow control instructions
  – Reserved for acceleration extension
• Multiplications
• CISC instructions
  – MAC and convolution
  – Repeat instruction
• Accelerate extensions
Design a basic instruction set
• Load / store instructions: between all registers and to / from memories
• ALU instructions: for single and double precision, signed and unsigned, integer and fractional, arithmetic, logic, and iterative (innermost loop) computing.
• Flow control instructions: cover conditional and unconditional jumps, calls / returns, NOP, loop control / repeat, and custom control.
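Numerical representations
Columns: Signed integer | Unsigned integer | Signed fractional | Block floating point | Floating point
(Y marks a representation used by the row; the column positions of the marks were lost in extraction.)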
Normal arithmetic Y Y Y Y
Audio, voice Y Y
Image, video Y Y
Normal DSP Y Y Y Y
Logic operations Y
Control flow Y
Addressing Y
HPC Y
Guarding and scaling before computing
Result = truncation (saturation (rounding (scaling (A))))
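The chain above can be sketched in software as follows. This is a minimal model, assuming a wide signed accumulator, a 16-bit result, and 16 low bits to drop; the function name, parameter names, and default widths are assumptions made for this illustration, not values from the course.

```python
def scale_round_saturate_truncate(acc, shift=0, drop_bits=16, out_bits=16):
    """Reduce a wide signed accumulator value to an out_bits-wide signed result."""
    # 1. Scaling: arithmetic shift left (shift > 0) or right (shift < 0).
    scaled = acc << shift if shift >= 0 else acc >> -shift
    # 2. Rounding: add half of the least significant bit that will be kept.
    rounded = scaled + (1 << (drop_bits - 1)) if drop_bits > 0 else scaled
    # 3. Saturation: clamp to the range that survives truncation without overflow.
    width = out_bits + drop_bits
    highest = (1 << (width - 1)) - 1
    lowest = -(1 << (width - 1))
    saturated = max(lowest, min(highest, rounded))
    # 4. Truncation: drop the low bits, keeping the out_bits-wide high part.
    return saturated >> drop_bits


# 0.5 * 0.5 in Q15: the Q30 product is shifted left once to Q31, then reduced to Q15.
print(scale_round_saturate_truncate(16384 * 16384, shift=1))  # 8192, i.e. 0.25 in Q15
```

Saturating after rounding matters: a value just below the positive limit can overflow when the rounding constant is added, and the clamp catches that case.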
Coding for an instruction set
All micro-operations in an assembly instruction
• Implicit micro-operations: for example bus transactions and instruction decoding
• Explicit micro-operations specified in the assembly manual:
  – Explicit micro-operations specified in assembly code and binary machine code: data memory addressing, operands, destination, operation, explicit specifiers, target addressing
  – Implicit micro-operations not specified in assembly code: for example flag ops and PC<=PC+1
• An assembly instruction set in binary format can be executed by HW.
• Orthogonal coding for structural design: easy instruction decoding
• Efficient coding for low program memory cost: may be less flexible
A code multiplexing example
Multiplexing code (control codes) | Multiplexed fields
Type 1 (2b), Sub type a (2b), Operation code (6b) | Operand A (5b), Operand B (5b)
Type 1 (2b), Sub type a (2b), Operation code (6b) | Target address (20b) for 1M PM space
Type 1 (2b), Sub type a (2b), Operation code (6b) | Operand A (5b), 16-b constant
Type 1 (2b), Sub type a (2b), Operation code (6b) | Register (5b), Memory address (16b)
The trade off between orthogonality and code density
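A small encoder and decoder can make the multiplexing idea concrete. The sketch below reuses the field widths from the example, but the 32-bit word size, the bit positions, and the mapping from the 2-bit type code to a payload layout are assumptions made only for this illustration.

```python
# Hypothetical mapping from the 2-bit type field to a payload layout.
# Each layout lists (field name, width in bits), most significant field first.
LAYOUTS = {
    0: [("operand_a", 5), ("operand_b", 5)],    # register-register
    1: [("operand_a", 5), ("constant", 16)],    # load immediate
    2: [("register", 5), ("mem_address", 16)],  # load / store
    3: [("target_address", 20)],                # jump / call
}


def encode(itype, subtype, opcode, **payload):
    """Pack the 10 control bits and the type-dependent payload into one word."""
    if itype not in LAYOUTS or not 0 <= subtype < 4 or not 0 <= opcode < 64:
        raise ValueError("control field out of range")
    # Assumed 32-bit word: control codes occupy bits 31..22.
    word = (itype << 30) | (subtype << 28) | (opcode << 22)
    shift = sum(width for _, width in LAYOUTS[itype])
    for name, width in LAYOUTS[itype]:
        shift -= width
        value = payload[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift  # payload is right-aligned in the low bits
    return word


def decode(word):
    """Read the control codes first, then use them to interpret the shared bits."""
    itype = (word >> 30) & 0x3
    fields = {"type": itype, "subtype": (word >> 28) & 0x3, "opcode": (word >> 22) & 0x3F}
    shift = sum(width for _, width in LAYOUTS[itype])
    for name, width in LAYOUTS[itype]:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


word = encode(1, 0, 0x2A, operand_a=3, constant=0xBEEF)
print(hex(word), decode(word))
```

The same low bits mean different things depending on the control codes, which keeps the word short at the cost of a decoder that must resolve the type before it can route the operands.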
Prepare for the exam
1. Check your ASIC and ASIP knowledge
   1. Basic concepts (small questions), design for acceleration
   2. Design for an ALU and a register file
   3. Design for a MAC, convolution, and other operations in MAC
   4. Design for normal / accelerated (modulo) memory addressing
2. ASIP: PFC, jump, repeat, call, return, and instruction decoding
Basic concepts, not limited to
1. Use Y-chart (behavior, structure, physical) to specify HW
2. Hardware multiplexing (MUX, keeper), pipeline, memory
concepts (principle, model, partition)
3. Finite precision and design/verification corners (Datapath,
Data access, and Control path corners)
4. Critical path, pipeline balance, and (hidden) fan-out
5. Hazard, delay slot, and pipeline-induced problems (RTL/SIM)
6. Basic concept of assembly coding tools and FW design using
HW knowledge from the course
7. Anything mentioned during teaching and tutorials as well as
lab discussions (low %, yet essential).
Register file
• Write port (how many inputs, from all SRF)
• Registers (operand keepers)
• Operand output ports (to RF, all DP, SRF, M)
• Special registers (where and how to access)
• Hidden critical path from control
• How to design a RF with multi write ports
ALU
• Learn hardware multiplexing (MUX design)
• Design multiplexer controls (control table)
• Primitive based design method and portable
design (barrel shift primitive)
• E.g., special functions, such as flags
• E.g., special functions, such as ABS, MAX
• E.g., advanced, register forwarding (control)
MAC
• Integer / fractional, signed / unsigned MUL
• How to emulate double precision MUL
• MAC using fractional data, MAC supporting
very long iteration
• Arithmetic computing using accumulator
• R = truncation(saturation(round(scale(A))))
• There will be at least one question on the MAC
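One common way to emulate a double precision multiplication is to split each 32-bit operand into a signed high half and an unsigned low half and combine four narrow products. The sketch below illustrates that idea; the function names and the 17-bit signed multiplier model are assumptions for this example, not the course's reference design.

```python
def mul17(x, y):
    """Model of one signed 17-bit x 17-bit multiplier (covers signed and unsigned 16-bit halves)."""
    for operand in (x, y):
        if not -(1 << 16) <= operand < (1 << 16):
            raise ValueError("operand does not fit in 17 signed bits")
    return x * y


def split32(value):
    """Split a signed 32-bit value into (signed high half, unsigned low half)."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("value does not fit in 32 signed bits")
    return value >> 16, value & 0xFFFF


def mul32_from_16(a, b):
    """Signed 32 x 32 -> 64-bit product built from four narrow multiplications."""
    a_hi, a_lo = split32(a)
    b_hi, b_lo = split32(b)
    high = mul17(a_hi, b_hi)                        # signed x signed
    cross = mul17(a_hi, b_lo) + mul17(a_lo, b_hi)   # signed x unsigned
    low = mul17(a_lo, b_lo)                         # unsigned x unsigned
    return (high << 32) + (cross << 16) + low


print(mul32_from_16(-123456789, 987654321) == -123456789 * 987654321)  # True
```

The low halves must be treated as unsigned, so the narrow multiplier has to handle signed by unsigned operands; one extra bit of width is a simple way to cover all three cases.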
Data access
• Basic memory sub system design knowledge:
peripheral, multi blocks, hierarchy
• Design multiple address pointers in parallel
• Modulo addressing principle, circuit design
• Overflow and underflow checking
• Acceleration for custom data accesses
• Is the addressing circuit signed or unsigned HW?
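The modulo addressing principle can be sketched as a post-update with wrap-around at the buffer boundaries. This is a behavioral illustration only; the function name and the inclusive bottom / top convention are assumptions, and a real address generator would typically do the comparison and correction in one cycle.

```python
def modulo_next(addr, step, bottom, top):
    """Next address in a circular buffer occupying [bottom, top], both inclusive."""
    size = top - bottom + 1
    if not bottom <= addr <= top:
        raise ValueError("address is outside the buffer")
    if abs(step) > size:
        raise ValueError("step is larger than the buffer")
    nxt = addr + step
    if nxt > top:       # overflow past the top: wrap to the bottom
        nxt -= size
    elif nxt < bottom:  # underflow below the bottom: wrap to the top
        nxt += size
    return nxt


# Walk a 4-word buffer at addresses 4..7 with a step of +1.
addr = 6
trace = []
for _ in range(5):
    trace.append(addr)
    addr = modulo_next(addr, 1, 4, 7)
print(trace)  # [6, 7, 4, 5, 6]
```

Restricting the step to at most the buffer size means a single compare and a single add or subtract are enough, which is what keeps the hardware small.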
Control path
• Skills to design an instruction decoder logic
• Skills to design a basic PC FSM circuit
– Hazards and handling principle /circuits
– Pipeline execution table of a processor core
– Delay slot design for (conditional) jumps
– Design flush control when a jump is taken
– Design for repeat control
• Option: control of register forwarding
Toolchain
• Basic concepts will be in the exam
• Toolchain concept from user’s point of view
• Basic knowledge from the lab to use tools
• How to design an ISS for programmers
– (Lab4) FSM, clock counting, hazard, debugging
• How to design a microarchitecture simulator
• How to write a C function for an instruction
Firmware design
• Algorithm selection, constraints (time, Memory costs)
• How to insert gain measurements / gain controls in a program while using finite precision fixed point data
• To use register variable lifetime and optimize codes
• Why cycle checking is the last step before compiling?
• Innermost loop subroutine (speed up / hazard control
by scheduling and unrolling)
• FW programming / development flow (with 3 entries)
Vector processors
• Acceleration opportunities on instruction level
– Parallel computing, hidden data accesses and control
– Instruction fusion, magic instructions
• Three kinds of SIMD architectures (our definition)
– Vector (flat), Reduce, and 2D datapath
• Three basic SIMD challenges
– Data alignment: permutation based access on SPMs
– Conditional execution: Execute true & false separately
– Compiling: avoid doing it, using intrinsic based model
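The conditional execution challenge can be illustrated with a lane mask: both branches are computed for every lane, then each lane keeps the result selected by its mask bit. The sketch below models lanes with plain Python lists; the helper names and the example condition are assumptions for illustration only.

```python
def lanewise(op, *vectors):
    """Apply the same operation to every lane, as a SIMD datapath would."""
    return [op(*lane) for lane in zip(*vectors)]


def select(mask, when_true, when_false):
    """Per-lane multiplexer driven by the condition mask."""
    return [t if m else f for m, t, f in zip(mask, when_true, when_false)]


def double_positive_negate_rest(x):
    """Lane-wise version of: y = 2 * x if x > 0 else -x."""
    mask = lanewise(lambda v: v > 0, x)       # evaluate the condition in all lanes
    true_path = lanewise(lambda v: v * 2, x)  # execute the "true" branch everywhere
    false_path = lanewise(lambda v: -v, x)    # execute the "false" branch everywhere
    return select(mask, true_path, false_path)


print(double_positive_negate_rest([3, -1, 0, -7]))  # [6, 1, 0, 7]
```

Every lane pays for both branches, so the cost is the sum of the two paths rather than the longer one.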
Integration and verification
• What to take care of during core integration
– Function, structure, and physical
• Verification
– Compliance test
– Corner test (DP, data access, control corners)
– Write ASM code & select data for a corner test
• What is a DUT and how to write a test suite
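Selecting data for a datapath corner test can be sketched as follows: drive the design under test (DUT) with operands at the edges of the number range and compare against a reference model. The saturating adder, the function names, and the deliberately wrong wrapping model are all assumptions for this illustration.

```python
import itertools


def corner_values(bits=16):
    """Operands at and next to the limits of a signed range, plus values around zero."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return [lowest, lowest + 1, -1, 0, 1, highest - 1, highest]


def saturating_add_reference(a, b, bits=16):
    """Reference model: signed addition with saturation."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, a + b))


def wrapping_add(a, b, bits=16):
    """A deliberately wrong DUT model: it wraps around instead of saturating."""
    total = (a + b) & ((1 << bits) - 1)
    return total - (1 << bits) if total >> (bits - 1) else total


def run_corner_test(dut, bits=16):
    """Return every corner operand pair for which the DUT disagrees with the reference."""
    failures = []
    for a, b in itertools.product(corner_values(bits), repeat=2):
        if dut(a, b, bits) != saturating_add_reference(a, b, bits):
            failures.append((a, b))
    return failures


print(len(run_corner_test(saturating_add_reference)))  # 0
print((32767, 1) in run_corner_test(wrapping_add))     # True
```

Random data would rarely hit the overflow cases that the corner operands reach immediately.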
Chapters / sections you can skip
• The following sections are not essential and will not be examined
– 1.7.3, 2.1.6, 2.1.8, 3.1, 3.2, 3.4, 3.5.1, 3.5.2, 3.5.4, 3.5.5, 4.6
– Chapters 5, 6, 9, 16, and 17 will not be examined
– 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.4, 7.5, 8.2, 8.3, 8.4, 8.9, 10.1, 10.5, 11.3
– Do not read chapter 14. Read my compendium instead
– 18.2.5
– 19.1 is rather old. Carefully following my lectures/slides is enough
– 19.2 is OK to read. Chapter 20 is rather old, try to follow my slides
• To reach a high score in the exam, you can skip the listed parts. You are advised to read through the whole book if you really want to design a processor.
Application case studies
Baseband function flow
[Figure: block diagram of the baseband function flow. Blocks, in the order they were extracted:]
• Transform FFT
• Beamforming
• Decimator, rate matching (Farrow filter)
• Energy measure, gain control
• Rotator, notch, bandpass filter
• Symbol and frame synchronization
• Channel estimation (single carrier: finger finder; OFDM: LS or MMSE)
• Matrix pre-process (single carrier: RAKE receiver; OFDM: LS / MMSE)
• Data detection (soft LLR, hard)
• De-interleave
• FEC (Turbo, LDPC, Viterbi, RS)
• CRC
• CRC
• Rate matching and interleaving
• Precoding
• Modulation
• Error correction coding (CC, Turbo, LDPC, RS)
• Beamforming
• CDMA: scrambling and channel access multiplexing
• OFDM: FFT and channel access multiplexing
• Filtering for pulse shaping
• DPD for RF power amplifier
• DAC interface, ADC interface
• MCU
• MAC interface (two instances)
Legends for functional partition: Symbol, FEC, BIT
Baseband subsystem
[Figure: block diagram of the baseband subsystem]
• MCU (the baseband controller)
• Baseband connection network
• Symbol processor: DFE
• Symbol processor: FFT
• Symbol processor: Matrix
• LLR processor
• FEC processor
• Bit processor
• Host interface, memory interface
• ADC port, DAC port
Different kinds of SIMD processors: SIMD1, SIMD2, SIMD3, SIMD4, SIMD5, SIMD6
DFE – Digital Front End, 2 designer-years
• Function: Low pass, band pass, biquad, and
Farrow filters, rotators for I and Q
• Structure: dedicated SIMD, translate IIR to
avoid dependence
• Physical constraints: Up to 100 MAC
operations for one sample data (I or Q), two
filter chains, low power and low silicon cost
Digital Front End
FFT DFT machine, 2 designer-years
• Function: Parallel execution of multiple R2, R4, R8, R16 FFTs and R3, R5 DFTs
• Structure: An R16 machine can be divided into 2×R8, 2×R5, 4×R4, 4×R3, 8×R2
• Physical constraints: The critical path is a 17b MUL. Internal precision is kept using block floating point. The twiddle factor memory cost is to be minimized.
Complex data matrix computing, more than 10 designer-years
• Function: +, -, MUL, determinants |A|, Hermitian,
Transpose, matrix inversion, LUD, QRD, SVD
• Structure: 2D datapath for multi-out / reduce,
data access with permutation
• Physical constraints: Matrix inversion is the
critical function. MUL is in the critical path
ePUMA for matrix and tensor
FEC ASIP based on SIMD, 4 designer-years
• Function: Convolutional and block Turbo, LDPC, Viterbi (not for Reed-Solomon)
• Structure: BCJR (Bahl, Cocke, Jelinek, Raviv) maximum a posteriori decoding based on forward and backward recursion, permuted addressing
• Physical constraints: Dual-port memory is the bottleneck. The addressing path might be critical.
Merge FBR algorithms into one flow
Implement the flow into an ASIP
Tools: Synopsys Design Compiler, Cadence Encounter
Technology: STMicroelectronics 65 nm CMOS Low Power, 1.1 V
Clock: 200 MHz (memory 400 MHz); P = 12, W = 32 (Turbo) or 64 (CC)
Turbo:
– Currently: 12 SISO, 200 MHz, 6 iterations, 186 Mbps
– Future: 24 SISO, 500 MHz, 4 iterations, 1395 Mbps
Cost
– Area: 2.12 mm²
– Power consumption: 322 mW
• ASIP-FEC: A FEC baseband processor for Turbo, LDPC, and Viterbi, 2014
PBIT ASIP based on SIMD, 3 designer-years
• Function: Bit manipulation in parallel, including all LFSR (CRC, CC, scrambling) and GF (Reed-Solomon codec, AES, DES, ZUC, SNOW 3G, ...)
• Structure: LUT (look-up table) based parallel bit SIMD GF ALU, permuted addressing
• Physical constraints: not much; small memory blocks, table address generation.
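The step from a bit-serial LFSR to a look-up-table formulation can be shown with a CRC. The sketch below computes the same CRC one bit at a time and one byte at a time through a 256-entry table. The CRC-16/CCITT polynomial 0x1021 and the 0xFFFF initial value are chosen only as a familiar example; this does not describe the processor's actual datapath.

```python
POLY = 0x1021  # CRC-16/CCITT generator polynomial (example choice)


def crc16_bit_serial(data: bytes, crc: int = 0xFFFF) -> int:
    """Reference LFSR model: one shift (and conditional XOR) per input bit."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Each table entry is the LFSR state after clocking one byte through a zeroed register.
TABLE = [crc16_bit_serial(bytes([value]), 0) for value in range(256)]


def crc16_table(data: bytes, crc: int = 0xFFFF) -> int:
    """LUT model: eight LFSR steps collapse into one table access per byte."""
    for byte in data:
        crc = ((crc << 8) & 0xFFFF) ^ TABLE[(crc >> 8) ^ byte]
    return crc


message = b"123456789"
print(hex(crc16_bit_serial(message)), hex(crc16_table(message)))  # 0x29b1 0x29b1
```

The table trades memory for fewer sequential steps, which is the same trade that makes table address generation and small memory blocks the main physical concerns.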
PBIT ASIP based on SIMD
Performance
– 11.6 Gb/s AES
– 32.0 Gb/s SNOW 3G
– 16.0 Gb/s ZUC
– 128.0 Gb/s CRC
– 8.0 Gb/s RS(255,239)
Technology: STMicroelectronics 65 nm Low Power, 1.2 V, 1.0 GHz
Cost
– Area: 0.77 mm²
– Logic gates: 207 KGates
– Power: 489 mW
• BP-ASIP: A 128-way parallel baseband processor for parallel bit manipulation covering RS codec, LFSR, and encryptions. 2016
Summarize what/how to learn
Columns (skills): System understanding | Plan HW schematic | HW coding | FW coding | Integration verification
Rows (concepts), one per line below:
Finite precision Just enough quality Where/what Sat/rnd Gain ctrl Corner cases
Micro architecture Functions to map Sharing sharing HW knowledge Balance
Register file Write conflict Critical path Fanout Life time Fanin fanout
ALU: Arithmetic & Logic HW sharing Reuse skill IP code precision corner
MAC: MUL and ACC MAC/LALU/MLU Reuse skill IP code Use MAC corner
Memory and data access Modulo Pipeline pipeline D-allocate IP coding
Program flow control PC and I-decoder PFC pipeline PC Hazard Pipeline
Assembly coding tools Behavior/arch SIM D-hazard ------ Lab4 Verification
Firmware plan & design Bit/mem/cycle ------ ------ plan vs code SW v.s. HW
Survey of Different ASIP Efficient VPU Tool limited critical Kernels
Skills (columns, as extracted): 5%, 15%, 10%, 20%, 50%
Concepts (rows): 10% each
Dake Liu, Room 556, corridor B, Hus-B, phone 281256, [email protected]
You are welcome to ask any questions you want
• I can answer
• Or we can discuss together
• I want to know what you want