vliw dsp processor design for mobile communication … · 2013. 1. 9. · embedded processing of...

VLIW DSP Processor Design for Mobile Communication Applications

Contents crafted by Dr. Christian Panis

Catena Radio Design

Agenda

Trends in mobile communication

Architectural core features with significant impact on performance

Case study: 3a – a scalable VLIW architecture

Design space exploration

Challenges of scalability

Summary

Trends in Mobile Communication

IEEE Spectrum, July 2004

Trends in Mobile Communication

Embedded systems are increasingly emerging

Bandwidth demands lead to a significant increase in computational requirements

Trade-off: power dissipation vs. flexibility vs. performance

Cost pressure, feature size, application space

Multistandard solutions

Application-specific and customizable processors

How to Tackle the Problem?

Application-specific processor architectures

provide support for application-specific requirements

provide domain-specific problem solutions

provide a trade-off between power, area effort, and flexibility

Domain Specific Processor Architectures

Things to be considered

For each core a separate tool chain?

How to analyse the application-specific requirements?

How to analyse the gain compared with a standard core solution?

How to deal with additional verification effort caused by flexibility?

Focus / Why VLIW?

Focus: embedded processing of lower communication layers

Characteristics: mix of traditional loop-centric DSP algorithms with control code

Load/store VLIW is one possible solution

Real-time requirements for signal processing algorithms (+)

High ILP support allows efficient execution of inner loops (+)

Code density drawback of VLIW (-)

Poor cache support (+/-)

Architectural Key Characteristics

Register file: size, number of entries, structure

Data path(s): number, parallel availability, type

Memory ports / bandwidth: number, data width, supported granularity

ISA, binary coding: instruction mapping, binary coding, native instruction word size

Pipeline structure: number of stages, exceptions

Register file: size, number of entries, structure

[Figure: two register file configurations, one with D[0..31], L[0..15], A[0..15] and one with D[0..15], L[0..15], A[0..15]; individual entries shown include d28 to d31, a14/l14, and a15/l15]

Data path(s): number, parallel availability, type of supported functions

[Figure: data path variants operating on operands op1 to op6 and producing results res 1 and res 2; the variants shown include SIMD MUL units]
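As a rough illustration of what a SIMD multiply unit does, the following minimal sketch multiplies two signed 16-bit lanes packed into one 32-bit word. The lane width, the packing order, and the function names are assumptions made for this example; the slide does not specify them.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a signed two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack16(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi lane in the upper half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def simd_mul16(a, b):
    """Multiply the low and high 16-bit lanes of a and b independently.

    Returns (low_product, high_product); one instruction yields two results.
    """
    lo = to_signed16(a) * to_signed16(b)
    hi = to_signed16(a >> 16) * to_signed16(b >> 16)
    return lo, hi


if __name__ == "__main__":
    a = pack16(3, -2)
    b = pack16(4, 5)
    print(simd_mul16(a, b))
```

A scalar data path would need two multiply instructions for the same work, which is the kind of trade-off between data path type and number that this slide points at.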

Memory ports / bandwidth: number, data width, supported granularity

[Figure: two memory configurations, each showing two address generation units (AGU 1, AGU 2) and two memory ports (PORT 1, PORT 2)]

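ISA, binary coding: instruction mapping, binary coding, native instruction word size

Instruction mapping: add op1, op2 vs. add op1, op2, op3; sub op1, op2 vs. sub op1, op2, op3

[Figure: instruction word layouts built from 20-bit and 16-bit words (20 | 20 | 16 | 16 bits; 16 bits; 20 | 16 bits; 16 | 16 bits), with a table whose columns are number of instructions, long instructions, and bytes; extracted values: 710164, 710215]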
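The effect of the native instruction word size on code size can be estimated with simple arithmetic. The sketch below is only an illustration: it sums the bit widths of a sequence of instruction words and rounds up to whole bytes. The example sequences reuse the 20-bit and 16-bit widths named on the slide, but which instruction gets which width is an assumption.

```python
import math


def code_size_bytes(widths_bits):
    """Return the number of bytes needed to store instruction words
    of the given bit widths when they are packed back to back."""
    return math.ceil(sum(widths_bits) / 8)


if __name__ == "__main__":
    # Hypothetical sequence using two long (20-bit) and two short (16-bit) words.
    print(code_size_bytes([20, 20, 16, 16]))
    # The same four instructions if all of them fit the short encoding.
    print(code_size_bytes([16, 16, 16, 16]))
```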

Pipeline structure: number of stages, exceptions

[Figure: pipeline variants with the stages Instruction Fetch, Alignment, Instruction Decode, Execute 1, Execute 2, and, in a longer variant, Execute 3. Load/store operation steps shown: address calculation, read op1, read op2, write back; and in a second variant: address calculation 1, address calculation 2, address register update, read op1, read op2a, read op2b, accu write back 1, write back 2]

3a: case study

Key architectural aspects

Modified Dual Harvard load-store architecture

RISC instruction set

xLIW (scalable long instruction word)

Orthogonal register file and ISA

Destination register based predicated execution

Instruction buffer for power-efficient inner loop processing

Scalable and configurable core architecture

Architecture specified with an optimizing C-compiler in mind

3a: case study

C-compiler aspects

Load/store architecture

Orthogonal ISA

Large uniform register files

Functionality stored in ISA

Simple issue rules

Mode dependent instructions

Irregular instructions

Implicit dependencies

Modes for instruction sets

3a: case study

xLIW – scalable long instruction word

[Figure: program memory holding densely packed instructions (inst1 to inst n) feeds an align unit, which issues execution bundles of varying size (for example inst0 inst1, inst3 inst4, inst6 inst7) over cycles m to m+4 to the decoder ports LD/ST, LD/ST, CMP, CMP, PSEQ]
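The code density argument behind a scalable long instruction word can be sketched by counting instruction slots. The example below assumes a classic fixed-width VLIW that pads every bundle with NOPs up to the issue width and compares it with a packed format that stores only the real instructions. The issue width of 5 mirrors the five decoder ports in the figure; the bundle contents and function names are made up for illustration.

```python
def fixed_vliw_slots(bundles, issue_width):
    """Slots stored in program memory when every bundle is padded
    with NOPs to the full issue width."""
    return len(bundles) * issue_width


def packed_slots(bundles):
    """Slots stored when only real instructions are kept and an align
    unit reassembles the bundles at fetch time."""
    return sum(len(bundle) for bundle in bundles)


if __name__ == "__main__":
    # Hypothetical program: three execution bundles of different sizes.
    bundles = [["ld", "ld", "mac"], ["add"], ["ld", "st", "cmp", "cmp"]]
    print(fixed_vliw_slots(bundles, issue_width=5))
    print(packed_slots(bundles))
```

In this toy program almost half of the fixed-width slots would be NOPs, which is the drawback the align unit is meant to remove.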

3a: case study

Destination register based predicated execution

[Figure: execution units (load/store, load/store, arithmetic, arithmetic, predicated execution) together with the flag register file]
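Predicated execution in general means that an instruction only commits its result when a flag is set, which removes short branches from the code. The sketch below shows that general idea only; how 3a binds flags to destination registers is not described on the slide, so the instruction tuple layout and the register and flag names here are assumptions.

```python
import operator


def execute_predicated(regs, flags, instr):
    """Execute one instruction of the form (dest, op, src1, src2, pred).

    If pred names a flag that is False, the instruction is squashed
    and behaves like a NOP; pred=None means unconditional execution.
    """
    dest, op, src1, src2, pred = instr
    if pred is not None and not flags[pred]:
        return
    regs[dest] = op(regs[src1], regs[src2])


if __name__ == "__main__":
    # Branch-free absolute difference |d0 - d1|.
    regs = {"d0": 3, "d1": 7, "d2": 0}
    flags = {"f0": regs["d0"] < regs["d1"]}  # set by a compare instruction
    program = [
        ("d2", operator.sub, "d0", "d1", None),  # d2 = d0 - d1
        ("d2", operator.sub, "d1", "d0", "f0"),  # if f0: d2 = d1 - d0
    ]
    for instr in program:
        execute_predicated(regs, flags, instr)
    print(regs["d2"])
```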

3a: case study

Orthogonal register file incl. flag register file

[Figure: register file split into data, address, and flag parts. Data part: data registers (d0, d1), long registers, accumulator registers, and gb. Address part: address registers (r0) and modifier registers (m0)]

3a: case study

“3-phase” pipeline

[Figure: pipeline stages, including Decode, Execute 1, and Execute 2, grouped into three phases]

Phase 1: fetch

Phase 2: decode

Phase 3: execute
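A first-order estimate of execution time on a pipelined core follows from the usual fill-plus-throughput formula. The sketch below is an idealized model, not a description of the 3a pipeline: it assumes one bundle issued per cycle, three phases, and a hypothetical fixed penalty for each taken branch.

```python
def pipeline_cycles(n_bundles, stages=3, taken_branches=0, branch_penalty=2):
    """Estimate cycles for n_bundles on an ideal pipeline.

    The first bundle needs `stages` cycles to complete; every further
    bundle completes one cycle later. Each taken branch adds
    `branch_penalty` refill cycles (an assumed, configurable value).
    """
    if n_bundles == 0:
        return 0
    return stages + (n_bundles - 1) + taken_branches * branch_penalty


if __name__ == "__main__":
    print(pipeline_cycles(100))                     # straight-line code
    print(pipeline_cycles(100, taken_branches=10))  # with branch refills
```

Deeper pipelines raise the refill cost per taken branch, which is why the number of stages appears among the key architectural characteristics.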

3a: case study

Things to be considered

For each core a separate tool chain?

How to analyse the application-specific requirements?

How to deal with additional verification effort caused by flexibility?

How to analyse the gain compared with a standard core solution?

Scalable core architecture

Adaptable to application-specific requirements

Scalable in performance

“one core” – “one tool chain”

Design Space Exploration – Design Flow

[Figure: design flow in two phases]

Evaluation Phase: configurations 1 to n feed the hardware generators, compiler generator, testcase generator, and documentation generator; the application C-code passes through the optimizing C-compiler and the assembler/linker; results are functional testing, static analysis results, dynamic analysis results, and a verification report

Production Phase: the chosen core configuration feeds the bincode generator, hardware generators, compiler generator, testcase generator, and documentation generator; the application C-code passes through the optimizing C-compiler and the assembler/linker to a binary executable; results are static analysis results, dynamic analysis results, and a verification report

Design Space Exploration – Static Analysis

Code size: measures how efficiently the application can be mapped onto a processor in terms of required code space

Parallelism: measures how efficiently the application code can be mapped onto a parallel architecture

Instruction histogram: measures how frequently instructions are used when the application code is mapped onto the chosen ISA
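The three static metrics can be computed directly from the compiled program if it is available as a list of execution bundles. The following is a minimal sketch under that assumption; the bundle representation, the flat 2 bytes per instruction, and the function name are illustrative choices, not part of the 3a tool chain.

```python
from collections import Counter


def static_analysis(bundles, bytes_per_instruction=2):
    """Compute code size, average parallelism, and an instruction
    histogram for a program given as a list of bundles, where each
    bundle is a list of mnemonics issued in the same cycle."""
    n_instr = sum(len(bundle) for bundle in bundles)
    return {
        "code_size_bytes": n_instr * bytes_per_instruction,
        "parallelism": n_instr / len(bundles) if bundles else 0.0,
        "histogram": Counter(op for bundle in bundles for op in bundle),
    }


if __name__ == "__main__":
    program = [["ld", "ld", "mac"], ["mac"], ["st", "add"]]
    result = static_analysis(program)
    print(result["code_size_bytes"], result["parallelism"])
    print(result["histogram"].most_common(2))
```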

Design Space Exploration – Dynamic Analysis

Program memory fetch: measures how efficiently memory fetches are used; mainly influenced by pipelined processors and by application code with low branch distance and high branch frequency

Execution count per bundle: measures how often a certain execution bundle is executed

Execution count per instruction: measures how frequently a certain instruction is executed
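Execution counts come from running the program, for example on an instruction set simulator that logs which bundle it executes in each cycle. The sketch below assumes such a trace is available as a list of bundle indices; the trace format and the function name are hypothetical.

```python
from collections import Counter


def dynamic_analysis(bundles, trace):
    """Derive execution counts from a simulator trace.

    bundles: list of bundles, each a list of mnemonics.
    trace:   sequence of bundle indices in execution order.
    Returns (count per bundle index, count per mnemonic).
    """
    per_bundle = Counter(trace)
    per_instruction = Counter()
    for index, count in per_bundle.items():
        for op in bundles[index]:
            per_instruction[op] += count
    return per_bundle, per_instruction


if __name__ == "__main__":
    program = [["ld", "ld", "mac"], ["mac"], ["st", "add"]]
    # Bundle 1 is an inner loop body executed three times.
    per_bundle, per_instruction = dynamic_analysis(program, [0, 1, 1, 1, 2])
    print(per_bundle.most_common(1))
    print(per_instruction["mac"])
```

Bundles with high counts are the hot spots, so they are the natural place to look when weighing additional core features against area and power.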

Design Space Exploration – Example

Design Space Exploration – Statistics

3a: case study

Things to be considered

For each core a separate tool chain?

How to analyse the application-specific requirements?

How to analyse the gain compared with a standard core solution?

How to deal with additional verification effort caused by flexibility?

Application code analysis

Detailed static and dynamic analysis

Quantitative analysis of different core-based platforms

Balance different core features against area/power consumption

Quantitative support to optimize HW/SW partitioning

Identify “hot spots”

3a: case study

Benchmarking?

Do the application requirements fit standard benchmarks?

Application benchmarking: benchmarking of the target architecture

Design Space Exploration

Things to be considered

For each core a separate tool chain?

How to analyse the application-specific requirements?

How to analyse the gain compared with a standard core solution?

How to deal with additional verification effort caused by flexibility?

Analysis of application code at “function” level

Compare MIPS/memory requirements for different application setups and for different core architectures

Benchmarking for the target architecture

Challenges of scalability

Verification effort versus Flexibility

XML based configuration file

Binary code generator

Documentation generator/adaptation

HW code generator

Testcase generator
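A single configuration file that every generator reads is what keeps the hardware, binary code, documentation, and test cases consistent with each other. The sketch below parses a small XML configuration with Python's standard `xml.etree.ElementTree` module. The element and attribute names are invented for illustration; the slide does not show the actual 3a configuration schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical core configuration; the schema is an assumption.
CONFIG_XML = """
<core name="example">
  <register_file data="16" address="16"/>
  <datapath units="2"/>
  <memory ports="2" width_bits="32"/>
</core>
"""


def load_core_config(xml_text):
    """Parse the configuration into a plain dict that each generator
    (hardware, binary code, documentation, test cases) could consume."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("name"),
        "data_registers": int(root.find("register_file").get("data")),
        "address_registers": int(root.find("register_file").get("address")),
        "datapath_units": int(root.find("datapath").get("units")),
        "memory_ports": int(root.find("memory").get("ports")),
        "memory_width_bits": int(root.find("memory").get("width_bits")),
    }


if __name__ == "__main__":
    print(load_core_config(CONFIG_XML))
```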

Summary

Application-specific processors make it possible to meet area and power dissipation requirements in SoCs for mobile communication platforms

Multistandard requirements lead to domain-specific processor architectures

“one core” – “one tool chain”

Design space exploration is required to analyse domain-specific requirements on the core subsystem
