Ultra-Low-Power Software-Defined Radio for LTE Wireless Baseband:
an embedded systems grand challenge
Chris Rowen, Founder and CTO, Tensilica
18 March 2010
Two Curves
[Figure: 2009 ITRS MPU/ASIC Metal 1 half pitch, MPU printed gate length, and MPU physical gate length, in nanometers (log scale), 1995 to 2025]
[Figure: Mobile Broadband Terminal Units (M), 2006 to 2013, for smartphones and for notebooks, netbooks, MIDs]
Copyright © 2009, Tensilica, Inc.
Why Are These Curves Exciting?
• Incredible density means true single-chip integration
• High volume on a range of semiconductor designs
• New opportunities for processors
Why Are These Curves Frightening?
• Moore's Law enables high density, but how do we cope with complexity and rate of change?
• Many required functions, but few viable single-function chips
– Silicon platforms must integrate or die
• So many transistors, so little battery capacity
– Steeper improvement in density than in energy efficiency (new transistor types?)
A Grand Challenge Problem: LTE Handsets
Design Goals
• High data rates: 150Mbps DL/50Mbps UL
• High spectral efficiency with
– Orthogonal Frequency Division Multiplexing
– Multiple-Input Multiple-Output
• Scalable bandwidth: 1.25MHz to 20MHz
• Both Frequency-Division and Time-Division Duplex
• All-IP networks
Implementation Needs
• Low silicon cost: <20mm2 for digital PHY
• Low power: <2mW/Mbps DL
Paradox: Reach performance and power goals while building a programmable system
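The power target is easier to reason about as an absolute budget. The sketch below only multiplies out the two figures quoted on the slide (150Mbps DL and <2mW/Mbps); the function and constant names are illustrative, not taken from the talk.

```python
# Design targets quoted on the slide.
DL_RATE_MBPS = 150        # peak downlink rate
POWER_PER_MBPS_MW = 2.0   # upper bound: < 2 mW per Mbps of downlink


def phy_power_budget_mw(rate_mbps, mw_per_mbps=POWER_PER_MBPS_MW):
    """Upper bound on digital PHY power at a given downlink rate."""
    return rate_mbps * mw_per_mbps


def energy_per_bit_nj(mw_per_mbps=POWER_PER_MBPS_MW):
    """1 mW per Mbps = 1e-3 W / 1e6 bit/s = 1e-9 J/bit, i.e. 1 nJ/bit."""
    return mw_per_mbps


if __name__ == "__main__":
    print(f"PHY power budget at peak DL: < {phy_power_budget_mw(DL_RATE_MBPS):.0f} mW")
    print(f"Energy per received bit:     < {energy_per_bit_nj():.0f} nJ")
```

Under these assumptions the whole digital PHY has to stay below roughly 300mW at peak downlink rate, or about 2nJ per delivered bit.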
The World’s Quickest Introduction to LTE
[Figure: LTE UE processing chain between the Apps/IP Domain and the RF.
Transmit: PDCP Encrypt; RLC, MAC; CRC, Scrambler, Bit Interleave, HARQ, Rate Match, FEC Encode; PUCCH Signal Gen; Ref Signal Gen; Resource Block Map; FFT-DFT; Frequency Rotator; Pulse Shaper FIR; Transmit RF.
Receive, on control: Receive RF; Frequency Rot; Synchronization; FFT; Channel Estimation; Decoder (ML); Soft Demapper; Viterbi Decode; Descrambler; Deinterleaver; Rate Dematch.
Receive, on data: FFT; Soft Demapper; Turbo Decode; Rate Dematch; HARQ; Decrypt; PDCP; MAC, RLC.]
Data stream types: I,Q (16b,16b); Soft bit (8b); Hard bit (1b)
Frame structure: 10ms = 10 subframes
Subframe structure: 1ms = 2 slots
Slot = 7 symbols
1 symbol = 2048 tones (sub-frequencies), 1200 useful tones
1200 useful tones = 100 RBs x 12 tones
1 tone = up to 6b (64QAM)
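The frame-structure numbers above can be multiplied out to see how much raw data one antenna stream carries. This is a rough sketch using only the slide's figures. It ignores reference signals, control channels, cyclic prefix and channel coding, so it is an upper bound per spatial layer rather than a deliverable rate. The 150Mbps downlink goal presumably relies on MIMO spatial layers on top of this, less that overhead. All names are illustrative.

```python
# Figures taken from the slide.
SLOTS_PER_SUBFRAME = 2
SYMBOLS_PER_SLOT = 7
RESOURCE_BLOCKS = 100
TONES_PER_RB = 12
BITS_PER_TONE = 6          # 64QAM
SUBFRAMES_PER_SECOND = 1000  # one subframe per 1 ms


def useful_tones():
    """Useful tones per OFDM symbol: 100 RBs x 12 tones."""
    return RESOURCE_BLOCKS * TONES_PER_RB


def symbols_per_subframe():
    return SLOTS_PER_SUBFRAME * SYMBOLS_PER_SLOT


def raw_bits_per_subframe():
    """Raw (uncoded, overhead-free) bits in one 1 ms subframe, one layer."""
    return useful_tones() * symbols_per_subframe() * BITS_PER_TONE


def raw_rate_mbps():
    """Raw per-layer bit rate in Mbps."""
    return raw_bits_per_subframe() * SUBFRAMES_PER_SECOND / 1e6


if __name__ == "__main__":
    print("useful tones per symbol:", useful_tones())
    print("symbols per subframe:   ", symbols_per_subframe())
    print("raw bits per subframe:  ", raw_bits_per_subframe())
    print("raw rate per layer:      %.1f Mbps" % raw_rate_mbps())
```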
Building Solutions to Solve Problems
The raw material: more efficient processors
A range of building blocks: baseband processors
A solution architecture: LTE Reference Architecture
The Essential Building Block: Xtensa LX3 Extensible Dataplane Processor
[Figure: Xtensa LX3 block diagram. Blocks shown: Instruction Fetch / Decode; Base ISA Execution Pipeline (Register File, Base ALU); Optional Functional Units; Designer-Defined Functional Units, Register Files and Processor State; VLIW (FLIX) Parallel Execution Pipelines; Designer-Defined Queues, Ports & Lookups (QIF32, GPIO32); Load/Store Unit and Dual Load/Store Unit; Instruction RAM, ROM and Cache; Data RAM, ROM and Cache; XLMI Local Memory; Local Memory Interface; Instruction and Data Memory Management, Protection & Error Recovery; Processor Controls (Exception Support, Exception Registers, Timers, Interrupt Control); On-Chip Debug (Trace Port, JTAG Tap Control, Data Address Watch Registers, Instruction Address Watch Registers); External Interface (Xtensa LX3 Processor Interface Control, Write Buffer, PIF, Bridge to AHB-Lite/AXI System Bus with RAM, DMA and Devices).
KEY: Base ISA Feature; Configurable Function; Optional Function; Optional & Configurable Function; Designer-Defined Features (TIE); External RTL & Peripherals (RTL, FIFO, Memory, Xtensa)]
The Essential Automation: Xtensa Processor Generator
Processor Configuration and Processor Extensions feed the Xtensa Processor Generator:
1. Select from menu
2. Explicit instruction description (TIE)
3. Automatic instruction discovery (XPRES)
Complete Hardware Design: source pre-verified RTL, EDA scripts, test suite
• Use standard ASIC/COT design techniques and libraries for any IC fabrication process
Customized Software Tools: C/C++ compiler, debuggers, simulators, RTOSes
Multiple Core Communication = Interconnect + Software
Software for Multi-core:
• Modeling at every level:
1. Gate/RTL level
2. FPGA netlist
3. Pin exact
4. Cycle accurate
5. Instruction accurate
• Multi-core debug and analysis
• Communications APIs
• Complete DSP Libraries
• Solution stacks for LTE
– L1 PHY
– L2/L3 MAC, RLC
[Figure: Xtensa Base (as small as 16K gates) with TIE Extensions; Fast Memory Interfaces to RAMs; Bus Interfaces to the on-chip bus; Extended RTL interfaces and Extended comms interfaces to RTL blocks and other processors]
Scalable platform for all design approaches
• Small, programmable Task Engines (Xtensas perform specific tasks in HW flow)
• Deeply integrated DPU/DSP (Xtensas control HW accelerators)
• Function-Specific Light-Weight DSP (Xtensa Foundation + DSP Module Extensions)
• Baseband Engines: DSPs, Multi-Standard SDR
Core 1: ConnX BBE16 Baseband Engine
Ultra-High Performance and Programming Ease
Architecture:
• 16 simultaneous 18bx18b MACs per cycle
• 8-way SIMD + 3-way VLIW
• Dual load/store unit (128b wide)
• Scalar CPU pipeline with general 32b RISC instructions
• Extended precision with guard bits
• 6 addressing modes
Performance:
• 16 multiply-adds per cycle
• Three 8-way ops per cycle
• 4 complex FIR taps / cycle
• 1 Radix-4 FFT butterfly / cycle
• 17GB/s memory bandwidth (@ 550MHz)
• 40-bit accumulation on all MAC operations without performance penalty
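The BBE16 performance figures follow from its architecture figures. The sketch below is a back-of-the-envelope check, not a model of the processor. It assumes both 128b load/store units transfer every cycle at 550MHz, and that a complex multiply-accumulate is implemented with four real multiplies. The helper names are hypothetical.

```python
CLOCK_HZ = 550e6            # clock quoted with the bandwidth figure
LOAD_STORE_UNITS = 2        # dual load/store unit
LOAD_STORE_WIDTH_BITS = 128
REAL_MACS_PER_CYCLE = 16
REAL_MULTS_PER_COMPLEX_MAC = 4  # (a+jb)(c+jd) = (ac - bd) + j(ad + bc)


def memory_bandwidth_gb_s():
    """Peak bytes per second if both load/store units move 128b each cycle."""
    bytes_per_cycle = LOAD_STORE_UNITS * LOAD_STORE_WIDTH_BITS // 8
    return bytes_per_cycle * CLOCK_HZ / 1e9


def complex_fir_taps_per_cycle():
    """Complex taps that 16 real MACs can cover in one cycle."""
    return REAL_MACS_PER_CYCLE // REAL_MULTS_PER_COMPLEX_MAC


def complex_mac(acc_re, acc_im, x_re, x_im, h_re, h_im):
    """One complex FIR tap: acc += x * h, written as four real multiplies."""
    acc_re += x_re * h_re - x_im * h_im
    acc_im += x_re * h_im + x_im * h_re
    return acc_re, acc_im


if __name__ == "__main__":
    print("peak memory bandwidth: %.1f GB/s" % memory_bandwidth_gb_s())
    print("complex FIR taps/cycle:", complex_fir_taps_per_cycle())
```

With these assumptions the peak bandwidth comes to 17.6GB/s, consistent with the quoted 17GB/s, and 16 real MACs map onto the quoted 4 complex FIR taps per cycle.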
Core 2: ConnX SSP16 Processor
Processing soft bit streams: 3x more efficient than std DSP
Data-types:
• 10-bit and 8-bit Vectors
• 8-bit, 16-bit and 32-bit Scalars
Performance:
• 16 arithmetic operations per cycle
• 2-issue VLIW architecture
• ~10GB/s memory bandwidth
Software:
• Vectorizing compiler
• Multi-core debugger
• Fast, cycle-accurate simulator
• Energy models
• Function libraries and 3rd party cellular stacks
Core 3: ConnX BSP3 Processor
“Bit Stream Processing at minimal size and power”
Data-types:
• 8-bit, 16-bit and 32-bit Scalars
Performance:
• Dual Load/Store
• 3-issue VLIW architecture
• 600MHz in 45nm
• >4GB/s memory bandwidth
Software:
• C/C++ compiler
• Multi-core debugger
• Fast, cycle-accurate simulator
• Energy models
• Function libraries and 3rd party cellular stacks
[Figure: BSP3 block diagram: two Load Store Units (32-bit), each with Local Memory and/or Cache; AR General Registers (32 or 64 x 32 bits); three issue slots, each with Arith, Logical, Shift Ops, Q interfaces, and Xtensa 32-bit Base Operations]
Core 4: ConnX Turbo16 Programmable Turbo Decoding Engine
• Customized processor-based solution for LTE Turbo decoding at 150 Mbps
• Matches throughput of hardwired RTL solution with similar area, power
• MAX-Log-MAP based decoding
– Uses: log(exp(a) + exp(b)) ~ max(a,b)
• 8-parallel windows based decoding
– Two bits from each window processed per update
– Interleaving and deinterleaving operations integrated with load/store operations to the local memories (Odd and Even D-RAM)
• Confidence estimator for software-based early termination, for lower power
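MAX-Log-MAP decoding rests on approximating the Jacobian logarithm log(exp(a) + exp(b)) by max(a, b), which trades a small accuracy loss for much cheaper arithmetic. The sketch below compares the exact and approximate forms on scalar log-domain metrics. It illustrates the approximation only, not the Turbo16 datapath, and the function names are illustrative.

```python
import math


def max_star(a, b):
    """Exact Jacobian logarithm: log(exp(a) + exp(b)), computed stably."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a, b):
    """Max-Log-MAP approximation: drop the correction term."""
    return max(a, b)


def approximation_error(a, b):
    """Correction term ignored by Max-Log-MAP; it is largest when a == b."""
    return max_star(a, b) - max_log(a, b)


if __name__ == "__main__":
    for a, b in [(0.0, 0.0), (1.0, 2.0), (10.0, 0.0)]:
        print(f"a={a:5.1f} b={b:5.1f} exact={max_star(a, b):.4f} "
              f"approx={max_log(a, b):.4f} error={approximation_error(a, b):.6f}")
```

The error never exceeds log(2), reached when the two metrics are equal, and shrinks quickly as they move apart, which is why the cheaper max operation is usually an acceptable substitute.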
LTE Reference Architecture “Atlas”
Demonstration of best practices for low-cost/low-power multi-core-based digital PHY sub-system
• 100% processor-based – no RTL blocks
• Highlights agile configuration of ConnX processors
Proves efficiency of digital-PHY solution with processors
• Reference architecture implements all blocks, memories and interconnect
• Port of complete PHY software solution: mimoOn mi!MobilePHY™ 3GPP Fully Compliant LTE PHY
7-core solution for Cat 4 UE (150Mbps DL/50Mbps UL, 20MHz):
• 3 x BBE16
• 2 x SSP16
• 1 x BSP3
• 1 x Turbo16
ATLAS-LTE UE
[Figure: ATLAS-LTE UE block diagram; labels: Transmit (50Mbps), Receive (150Mbps), High Bandwidth, Low Bandwidth Bus]
Typical utilization of cores: 60%
Wrap-up
1. Data-plane processors combine three characteristics
• Adaptable: interface, instruction set, memory system automatically fit need
• Programmable: proven instruction sets, compilers, libraries, SW stacks
• Efficient: rivals hardware area and power dissipation on complex problems
2. LTE handset challenge shows need … and possibility … for better solutions
3. “Even bigger challenges ahead” - SOC complexity in 2010 = 0.0
[Figure: Relative SOC Complexity (M1 features/mm2), scale 0 to 1, 1995 to 2025]