How to realize high-performance compute with Multicore DSP
TI Confidential – NDA Restrictions
C667x Target Applications (Non-Telecom)
Emerging Others
Test and Automation
Mission Critical
Infrastructure Audio
HPC, Imaging and Medical
Video Infrastructure
Emerging Broadband
Innovations
RF and Communication Applications
Key Customer Careabouts
• Long Term Partnership
• Financial Stability
• Strong Roadmap and R&D
• Floating Point Performance
• Size, Weight, and Power (SWaP)
• I/O Bandwidth
• Longevity of supply (10+ yrs)
Application ISR (Intelligence/Surveillance/Reconnaissance)
o SIGINT/COMINT/Signal Generators
Military Communications
o SDR (JTRS) - Manpack/LMR/Fixed
o Comm. Infra - VoIP/Video Gateways
Satellite/Avionics Communications
o Ground Receiver/Repeaters
o Weather Radar
FAA – Civil Aviation/Govt Comm.
Conventional PS – TETRA/APCO/E911
o Wireless Infrastructure
o Comm. Infra - VoIP/Video Gateways
Emerging Broadband (OFDM/LTE/WiMAX)
o Utilities/Transport/Smart Grid
Govt & Public Safety
Avionics
Military & Defense
RF and Comm. Product Requirements
• Needs raw performance in terms of MIPS/GHz/MMACS
• Floating-point capable ISA to achieve "precision" and high GFLOPS
• Large on-chip RAM to reduce accesses to slow external memory
• High-speed external memory interface
• Large addressable memory
• Efficient DMA architecture
• Wireless-specific accelerators and TCP/IP offload
• Support multiple waveforms
• Common platform for TDMA/CDMA/OFDMA
• Multi-channel VoIP/Video capability
• Support FEC and modulation
• TCP/IP networking support
End Product Need | DSP Requirement
• Reliability in mission-critical designs
• Low-power design
• High-BW interface to RF front end and telecom ports
• Connect multiple DSPs on a board, e.g. in an ATCA card
• High-BW backplane and network connectivity
• Needs multiple high-speed interfaces
– PCIe, Serial RapidIO
– OBSAI/CPRI interface
– Gigabit Ethernet, etc.
• Memory Error Correction & Checking (ECC)
• Efficient low-power DSPs
• Support extended temperature ranges from -40°C to 105°C, and other temperature ranges
• Ease of use
Imaging Product Requirements
• Dev and debug tools
• Multicore S/W frameworks
• Signal/image processing functions
• VoIP library
• Audio/video codecs
End Product Need | DSP Requirement
Introducing “Keystone Architecture” (C66x)
The Best Combination of Performance (GHz) and Power Consumption in the Industry
• 16 GFLOPS & 32 GMACS per core @ 1 GHz
• Fixed- and floating-point core @ 1.25 GHz
– 4x C64x+ MAC (32)
– 4x C67x floating-point MAC (8)
– 16 FLOP/cycle compared to 6 FLOP/cycle
• 8-core C6678 based on the C66x core delivers 320 GMACS / 160 GFLOPS @ 1.25 GHz per core (effectively a 10 GHz DSP)
• 100% code compatible with all C64x (fixed) & C67x (floating) devices
• Similar power profile to the C64x core
• Supported by Code Composer Studio IDE
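The headline numbers above follow from simple arithmetic: operations per cycle, times clock rate, times core count. The sketch below reproduces them as a sanity check. The function and variable names are illustrative only, and the results are theoretical peaks, not measured throughput.

```python
def peak_giga_ops(ops_per_cycle: int, clock_ghz: float, cores: int = 1) -> float:
    """Peak rate in giga-operations per second: ops/cycle x clock (GHz) x cores."""
    return ops_per_cycle * clock_ghz * cores


MACS_PER_CYCLE = 32    # fixed-point MACs per cycle per C66x core (slide figure)
FLOPS_PER_CYCLE = 16   # floating-point operations per cycle per core (slide figure)

# One core at 1 GHz
core_gmacs = peak_giga_ops(MACS_PER_CYCLE, 1.0)
core_gflops = peak_giga_ops(FLOPS_PER_CYCLE, 1.0)

# Eight-core C6678 at 1.25 GHz per core
c6678_gmacs = peak_giga_ops(MACS_PER_CYCLE, 1.25, cores=8)
c6678_gflops = peak_giga_ops(FLOPS_PER_CYCLE, 1.25, cores=8)
aggregate_clock_ghz = 1.25 * 8   # the "10 GHz DSP" figure

print(core_gmacs, core_gflops)
print(c6678_gmacs, c6678_gflops, aggregate_clock_ghz)
```

The "10 GHz" figure is just the sum of the eight core clocks; it only translates into real speedup when the workload parallelizes across all cores.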
[Diagram: the C64x+ core (fixed point; lowest power, highest performance DSP core) and the C67x core (floating point; industry's lowest power FP DSP core, high precision and wide dynamic range) combine into the next-generation C66x DSP core, the new multicore DSP on the KeyStone architecture]
Unmatched Performance
[Chart: BDTImark2000™ score for floating-point processors (axis 0 to 14000). Devices shown: TMS320C66xx, TMS320C67x, Renesas SH77xx (SH-4), Intel Pentium III, ADI TS202S/203S (TigerSHARC), ADI TS201S (TigerSHARC), ADI 213xx (SHARC), ADI 2126x (SHARC), ADI 2116x (SHARC)]
[Chart: BDTImark2000™ score for fixed-point processors (axis 0 to 25000). Devices shown: TMS320C66xx, TMS320C64x+, Freescale MSC815x (SC3850), Freescale MSC814x (SC3400), Freescale MSC81xx (SC140), ADI TS202S/203S (TigerSHARC), ADI TS201S (TigerSHARC), ADI BF5xx (Blackfin), NEC uPD77050]
Algorithm | Baseline (C67x @ 300 MHz or C64x+ @ 1.2 GHz) | C66x @ 1.25 GHz | Gain
Single-precision floating-point FFT, 2048 pt, radix 4 | 86.84 us | 14.00 us* | ~600%
Fixed-point FFT, 2048 pt, radix 4 | 8.23 us | 4.46 us* | ~200%
FIR filter, 40 samples, 40 taps | 0.69 us | 0.34 us* | ~200%
Matrix multiply, 32 x 32 | 17.92 us | 6.16 us* | ~300%
Matrix inverse, 4 x 4 | 0.53 us | 0.13 us* | ~400%
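The gain column can be cross-checked against the timings. The sketch below assumes "Gain" means the ratio of baseline time to C66x time, expressed as a percentage and rounded; that reading is an assumption, but it is consistent with every row. The dictionary keys are shortened labels, not names from the slide.

```python
# (baseline time in us, C66x time in us), copied from the benchmark rows
timings_us = {
    "SP floating-point FFT, 2048 pt": (86.84, 14.00),
    "Fixed-point FFT, 2048 pt": (8.23, 4.46),
    "FIR filter, 40 samples, 40 taps": (0.69, 0.34),
    "Matrix multiply 32 x 32": (17.92, 6.16),
    "Matrix inverse 4 x 4": (0.53, 0.13),
}


def speedup(baseline_us: float, c66x_us: float) -> float:
    """Ratio of baseline execution time to C66x execution time."""
    return baseline_us / c66x_us


for name, (baseline, c66x) in timings_us.items():
    ratio = speedup(baseline, c66x)
    print(f"{name}: {ratio:.2f}x (~{ratio * 100:.0f}%)")
```

The ratios come out at roughly 6.2x, 1.8x, 2.0x, 2.9x and 4.1x, matching the rounded percentages in the table. Note that the baseline devices run at different clock rates, so these are device-to-device comparisons, not per-cycle gains.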
The first network on chip infrastructure to unleash full multicore entitlement
[Diagram: network on chip. TeraNet 2 connects the C66x and ARM processing cores, shared memory behind the Multicore Shared Memory Controller, the Multicore Navigator, application accelerators, high-speed I/O, HyperLink 50, and system management (debug, clocking, power)]
TI Multicore KeyStone Architecture
• Highest Integration – Cost & Power
• Common Architecture – Portable Software
• Scalable Tailored Solutions
• Navigator – Innovative Multi-core
• Floating Point – Development Time
• Tools & Debugging – R&D Efficiency
• Quality Software – Solutions & Libraries
Product Highlights: C6670 and C6678
C6678: Power Optimized Core
• Next Generation C66x Core
– Up to 8 C66x cores @ 1 GHz to 1.25 GHz
– Available options: 1, 2, 4, and 8 core devices
• Memory Architecture
– 4 MB local L2 in total (512 KB per core)
– 4 MB multicore shared memory
• Power Optimized Core
– <10 W at 1 GHz, nominal temperature
C6670: Performance Optimized Core
• Next Generation C66x Core
– 4 C66x cores @ 1 GHz to 1.2 GHz
• Memory Architecture
– 4 MB local L2 in total (1 MB per core)
– 2 MB multicore shared memory
• Communication Accelerators
– TCP3e (Turbo Encode): up to 550 Mbps
– TCP3d (Turbo Decode): up to 600 Mbps
– FFTC: 2048-point FFT every 4.6 µs
– VCP2 for voice channel decoding
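The memory figures for the two devices can be tallied with a few lines of arithmetic. The sketch below uses only the per-core L2 and shared-memory sizes quoted above; the `Device` class and its field names are illustrative, not a TI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    cores: int
    l2_per_core_kb: int   # local L2 per core, in KB
    shared_mb: int        # multicore shared memory, in MB

    def total_l2_mb(self) -> float:
        """Sum of all per-core L2 memories, in MB."""
        return self.cores * self.l2_per_core_kb / 1024

    def total_on_chip_mb(self) -> float:
        """Local L2 plus multicore shared memory, in MB."""
        return self.total_l2_mb() + self.shared_mb


c6678 = Device("C6678", cores=8, l2_per_core_kb=512, shared_mb=4)
c6670 = Device("C6670", cores=4, l2_per_core_kb=1024, shared_mb=2)

for dev in (c6678, c6670):
    print(dev.name, dev.total_l2_mb(), dev.total_on_chip_mb())
```

Both devices carry 4 MB of L2 in total; the C6678 spreads it across eight cores, while the C6670 gives each of its four cores twice as much.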
[Block diagram, C6678: 8 x CorePac (C66x DSP with L1 and L2); memory subsystem with the Multicore Shared Memory Controller (MSMC), 4 MB shared memory, and 64-bit DDR3; Multicore Navigator; TeraNet; HyperLink; network coprocessors (Crypto, Packet Accelerator); IP interfaces (GbE switch, SGMII x2); peripherals & I/O (SRIO x4, PCIe x2, EMIF16, TSIP x2, I2C, SPI, UART); system elements (EDMA, SysMon, power management, debug)]
[Block diagram, C6670: 4 x CorePac (C66x DSP with L1 and L2); memory subsystem with the MSMC, 2 MB shared memory, and 64-bit DDR3; Multicore Navigator; TeraNet; HyperLink; communications coprocessors (4x VCP2, 3x TCP3d, 2x RAC, 1x TAC, 3x FFTC, BCP); network coprocessors (Crypto, Packet Accelerator); peripherals & I/O (SRIO x4, PCIe x2, AIF2 x6, SGMII x2, I2C, SPI, UART); system elements (EDMA, SysMon, power management, debug)]
Memory Architecture
• 0.5 MB of local memory per core
• 4 MB of shared memory
• Enhanced memory architecture through an enhanced Multicore Shared Memory Controller
• Bottleneck-free, fast on- and off-chip memory access, including a DDR3-1333MHz (64-bit) interface
• L1/L2/L3 ECC
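For a rough sense of what the external memory interface can deliver, peak bandwidth is the transfer rate times the bus width. The sketch below assumes "DDR3-1333MHz" means 1333 mega-transfers per second, the usual reading of a DDR3-1333 speed grade; the function name is illustrative, and the result is a theoretical ceiling that ignores refresh, bank conflicts, and controller overhead.

```python
def ddr_peak_gbytes_per_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers/s x bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * bytes_per_transfer / 1000


# DDR3-1333 on a 64-bit bus (assumed: 1333 MT/s)
ddr3_1333 = ddr_peak_gbytes_per_s(1333, 64)

# DDR3-1600 on a 64-bit bus, for comparison
ddr3_1600 = ddr_peak_gbytes_per_s(1600, 64)

print(f"DDR3-1333, 64-bit: {ddr3_1333:.2f} GB/s")
print(f"DDR3-1600, 64-bit: {ddr3_1600:.2f} GB/s")
```

Sustained bandwidth in a real system will be lower, which is one reason the on-chip L2 and shared memory matter for keeping eight cores fed.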
Innovation & Integration via C6678 DSP Highlights
Peripherals and I/O Interfaces: High-bandwidth peripherals that operate independently (not shared), allowing simultaneous data transfers to prevent bottlenecks, featuring:
– RapidIO v2.1: 4 lanes @ 5 Gbps with 1x, 2x and 4x support
– PCIe x2: 2 lanes, running independently of RapidIO
Improved Debug: S/W development and debug support leveraged by CCS
C66x Core: Next-generation fixed/floating-point DSP core with clock speeds ranging from 1 GHz to 1.25 GHz and options of up to 8 cores
Network Co-Processor and Accelerators: A cost-effective implementation that offloads TCP/IP and secure networking functions from the DSP
Multicore Navigator: Data transfer engine architected to move data between system elements without CPU overhead, so that maximum system efficiency is achieved
TeraNet: Switch fabric with 2 terabits of bandwidth, which allows maximum data transfer between system components to realize full system entitlement
HyperLink: Ultra-high-speed (up to 50 Gbaud), low-latency serial interface that connects to other DSPs and FPGAs in the system
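Line rates on serial interfaces are not payload rates, because line coding consumes part of the raw bandwidth. The sketch below estimates usable payload for the three serial links named in this deck, under stated assumptions: 8b/10b coding for Serial RapidIO and PCIe Gen 2, 5 Gbaud per PCIe lane (the deck elsewhere lists the interface as Gen II), and the "effectively 8b9b" coding with 12.5 GBaud x4 lanes quoted for HyperLink in the backup slides. Class and field names are illustrative, and packet or protocol overhead is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SerialLink:
    name: str
    lanes: int
    gbaud_per_lane: float
    data_bits: int    # payload bits per code group
    coded_bits: int   # bits on the wire per code group

    def raw_gbaud(self) -> float:
        """Aggregate line rate across all lanes."""
        return self.lanes * self.gbaud_per_lane

    def payload_gbps(self) -> float:
        """Line rate scaled by coding efficiency (protocol overhead ignored)."""
        return self.raw_gbaud() * self.data_bits / self.coded_bits


links = [
    SerialLink("Serial RapidIO x4", lanes=4, gbaud_per_lane=5.0, data_bits=8, coded_bits=10),
    SerialLink("PCIe Gen 2 x2", lanes=2, gbaud_per_lane=5.0, data_bits=8, coded_bits=10),
    SerialLink("HyperLink x4", lanes=4, gbaud_per_lane=12.5, data_bits=8, coded_bits=9),
]

for link in links:
    print(f"{link.name}: {link.raw_gbaud():.1f} Gbaud raw, "
          f"{link.payload_gbps():.1f} Gbps after line coding")
```

Because the peripherals operate independently, these figures add up when several links are active at once, subject to what TeraNet and the memory system can absorb.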
Competitive Analysis
Value Prop against FPGA
• C66x Performance
– 320 GMACS / 160 GFLOPS
– Baseband on a chip. Handles multiple waveforms supporting OFDM, CDMA, TDM
– L1/L2/L3 processing capability
– Wireless accelerators (VCP/TCP/FFT)
• Software Programmability
– Time to market
• Smaller package (more DSPs per board)
• Lower power: smaller battery, simpler cooling
• Low cost (MIPS/$)
Value Prop against other DSPs
• C66x fixed- and floating-point core
– Industry's fastest DSP at 10 GHz
• On-chip RAM up to 8 MB
• DDR3
– 1600 MHz, 64-bit, 8 GB address space
• Multiple independent high-speed I/O
– 4x sRIO v2.1, 2x PCIe Gen II, 2x SGMII, 2x TSIP
• High-BW FPGA connectivity
– HyperLink @ 50 Gbps
• 1/2/4/8 core option (pin compatible)
• L1/L2/L3 memory ECC for system reliability
• Low power per GFLOPS and GMACS
• Extended temperature support, -40°C to 105°C
• CCS tools + S/W collateral
• 3rd party network
TMDXEVM6678L EVM
• Single-wide AMC form factor
Code Composer Studio™ IDE
• Design • Code and Build • Debug • Analyze • Tune
• CCSv5 allows designers of all experience levels to move quickly through application development (www.ti.com/ccstudio)
• Time-limited free evaluation versions are available for download and include a C667x simulator
EVM Kit includes
• BIOS 6.x
• BIOS-MCSDK / LINUX-MCSDK 2.0 (NDK, PDK, LIB etc.)
• Sample programs and out-of-box (OOB) demos, e.g. I/O Benchmark, Image Processing Pipeline, and High Performance DSP Utility Application (HUA)
• User guide, starter guide, technical reference guide, app notes, etc.
H/W Development Tools
• TMDXEVM6678L – EVM with XDS100 emulation - $399
• TMDXEVM6678LE – EVM with XDS560V2 emulation - $599
• TMDXEVM6678LXE – EVM with XDS560V2 emulation –Encryption Enabled - $599
• TMDSEMU560v2STM-UE - XDS560v2 System Trace Emulator with 128Mb System Trace buffer and Ethernet / USB support
• Optional PCIe adapter card to connect the C6678 EVM to a standard PCI header of a desktop.
TI’s Multicore Hardware Ecosystem
[Diagram: C6678 hardware options. Standardized boards: PCI Express (with Gen 2), Advanced Mezzanine (AMC), ATCA, other. Chassis / system. Custom. Others.]
TI’s Multicore Software Ecosystem
[Diagram: software stack. Labels: Customer Application; Layer 1 UMTS; Layer 1 LTE; Layer 2+; TI Layer 1 Libraries; TI BIOS, Linux, OSE(ck); Multicore Entitlement; TI’s Device Entitlement Libraries; IP Network Stack; TI Runtime]
Multicore Tools and Software (MC-SDK)
• Tools
– Codegen with OpenMP support
– Emulator/Debugger
– Simulator
– Profiler / DVT
– 3rd party tools
• Software
– BIOS/Linux SDK
• Multicore demonstration
• 6.x DSP BIOS
– Platform abstraction
– Basic networking
– Inter-core communication
• Application-specific libraries
– Audio/video codecs
– VoIP components
– WiMAX Toolkit, LTE Toolkit
– DSPLib
• Others
[Diagram: development environment. Host computer running Eclipse-based Code Composer Studio™ (editor/IDE, compiler/linker (codegen), profiler, debugger, remote debug, SoC Analyzer) with third-party plug-ins (Polycore, ENEA Optima, 3L), connected through XDS560 V2 / XDS560 Trace to the target board]
[Diagram: Multicore Software Development Kit. Labels: Customer Application; Demo App Multicore BIOS; Demo App Multicore Linux; Speech Codec; Audio Codec; Video Codec; NDK; DSPLIB; IMGLIB; Inter Core Communication; Platform Development Kit; Operating System w/ Boot Loader (BIOS, Linux); Multicore Entitlement; Full Silicon Entitlement]
KeyStone Multicore Software – Libraries & Codecs
Libraries
• Digital Signal Processing (available free from TI)
– FFT
– Adaptive filtering
– Filtering and convolution
– Others
• Image Processing (available free from TI)
– Edge detection
– Boundary
– Morphology
– Others
• Voice and Fax (available free from TI)
– Line echo cancellation
– Voice activity detection
– Others
• Vision Lib (object only), 50+ royalty-free kernels
– Background modeling & subtraction
– Object feature extraction
– Tracking, recognition
– Low-level pixel processing
• MATLAB
– Image processing
– Math operations
• Vision Analytics
• Security/Cryptography
– AES, SHA1, 3DES
Codecs
• Voice
– G.711, G.722
– G.723, G.729
– CDMA, AMR (NB/WB), EVRC-B
– Others
• Audio
– MPEG1 Layer2
– AAC LC/HE
– AC3 2.0/5.1
– Sample rate conversion
• Video
– H.263
– H.264
– MPEG2
– MPEG4
– VC1/WMV9 decode
– Others
• Fax
– T.38
– Fax modem
High-Performance and Multicore Processor
High Value
Easy to Use
Quick to Market
Low-Cost EVM
High-Performance at the Right Power & Price
Open & Affordable Tools
User Community
Drivers & Example Code
Product Collateral
Training
Enabler Software
Frameworks & Abstraction
Generic Libraries
Application Libraries
Benchmarks & Functional Understanding
Quick-Start Hardware
Keystone Architecture
Getting Started – More Information/Links
• Product Folders:
– C66X Informational Wiki Page
– All C6000 Multicore DSPs
• TMS320C6670
• TMS320C6678
• EVMs and Software Tools:
– TMS320C6678 EVM
– TMS320C6670 EVM
– AMC to PCIe Adapter Card
– Multicore Software Development Kit for BIOS & Linux
• MCSDK Wiki
• CCS v5 Wiki
• C66x Linux Wiki
– DSP Signal Processing Library (DSPLIB)
– Image and Video Processing Library (IMGLIB)
– LTE/WiMAX Toolkit – Discuss with BDM
• Technical Support
– TI E2E Community (Online Support)
– Product Training
Online Video Training
http://focus.ti.com/docs/training/catalog/events/event.jhtml?sku=OLT110027
Mission Critical DSP Market
“What Customers Like about TI”
• Undisputed #1 DSP and SoC supplier
– Strong growth for 8 years in a row, even in 2009
– Higher R&D spending than the DSP revenue of most competitors
• KeyStone SoC architecture secures future success
– Rich product portfolio & strong roadmap
– 2 families with multiple devices and growing
• Nyquist (6670), Shannon (6678/4/2)
• 40nm -> 28nm
• Tools/software & compilers
• 3rd party eco-system
– Multiple design wins pre-announcement
• Secure supply: no DSP product discontinuation (end of life)
• History of delivering on promises (power, GHz, ...)
• Field experience: completeness of system analysis, architecture, internal switch, ...
• Customer support
• Business model: long-term relationships with key customers
– Actively seek and incorporate customer feedback in roadmap devices
[Diagram: TI SoC architecture spanning Layer 1 (PHY), Layer 2 (MAC), and Layer 3+ (Layer 3, 4) across radio and IP network, for macro, pico, and femto, with software]
[Chart: revenue, 2002 to 2009]
Backup Slides
Product Details
C6678 (Shannon) “Lightning” Half-Length PCIe Card Feature Set
• TI TMS320C6678 (8-core) x 4
– C66x core frequency: 1.25 GHz
– DDR3 memory: data frequency 1600 MHz, data bus width 64-bit
– Serial RapidIO Gen-2 interface
– PCIe Gen-2 interface
– 10/100/1000 Mbps Ethernet w/ SGMII
– HyperLink50 interface
• 1024 MB DDR3-1333 on board
• PLX PEX8624 PCIe Gen-2 switch
• Serial RapidIO daisy-chain
• Ethernet daisy-chain
• Each DSP device is linked to the PCIe switch by x2 lanes
• Dual DSPs linked by HyperLink50
• Power: max 54 W
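The card-level compute budget can be estimated from the feature list. The sketch below combines the four-DSP configuration and 54 W maximum power from this slide with the 16 FLOP/cycle per-core figure quoted earlier in the deck. Constant names are illustrative, and the result is a theoretical peak divided by a worst-case board power, so it is a rough planning number, not a benchmark.

```python
DSPS_ON_CARD = 4        # TMS320C6678 devices on the card
CORES_PER_DSP = 8
CLOCK_GHZ = 1.25
FLOPS_PER_CYCLE = 16    # per core, from the C66x core slide
MAX_CARD_POWER_W = 54   # whole board: DSPs, DDR3, PCIe switch, etc.

total_cores = DSPS_ON_CARD * CORES_PER_DSP
card_peak_gflops = total_cores * CLOCK_GHZ * FLOPS_PER_CYCLE
gflops_per_watt = card_peak_gflops / MAX_CARD_POWER_W

print(f"{total_cores} cores, {card_peak_gflops:.0f} GFLOPS peak, "
      f"{gflops_per_watt:.1f} GFLOPS/W at max board power")
```

Each DSP reaches the host through only two PCIe lanes, so kernels that keep data resident in DSP memory will get much closer to this peak than ones that stream everything over PCIe.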
What is HyperLink?
“high-speed, low-latency, and low-pin-count communication interface”
• Low pin count (24 pins)
• Point-to-point connection
• Interconnect
– DSP-to-DSP
– DSP-to-FPGA
• SerDes for data transfer
– x1 and x4 modes for Tx and Rx
– 12.5 GBaud/lane
– Effectively 8b9b encoding
• LVCMOS sideband signals for flow control & power management - errors/events/timeouts
• Simple packet-based transfer protocol for memory-mapped access
• Read/write to DSP/FPGA local memory
– Discrete memory access of any byte-aligned width up to 64 bits
– Burst transfer modes
• Write (maximum burst size 256 bytes)
– Write Request --->
– Data Packet --->
• Read (maximum burst size 256 bytes)
– Read Request --->
– Read Response <---
• Interrupt Request <-->
• Up to 64 memory-mapped regions, each region up to 256 MB
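To make the burst and region limits concrete, the sketch below splits a transfer into bursts of at most 256 bytes and computes the largest address range the 64 regions can cover. It is a plain arithmetic model: the function names are hypothetical, it is not the HyperLink driver API, and any alignment rules the hardware may impose are ignored.

```python
MAX_BURST_BYTES = 256                 # maximum read/write burst size
MAX_REGIONS = 64                      # memory-mapped regions
MAX_REGION_BYTES = 256 * 1024 * 1024  # 256 MB per region


def split_into_bursts(offset: int, length: int) -> list[tuple[int, int]]:
    """Return (offset, size) pairs covering the transfer, each at most one burst."""
    bursts = []
    end = offset + length
    while offset < end:
        size = min(MAX_BURST_BYTES, end - offset)
        bursts.append((offset, size))
        offset += size
    return bursts


def max_mapped_bytes() -> int:
    """Largest remote address range reachable if every region is at full size."""
    return MAX_REGIONS * MAX_REGION_BYTES


bursts = split_into_bursts(offset=0, length=1000)
print(len(bursts), bursts[-1])
print(max_mapped_bytes() // 2**30, "GiB mappable")
```

A 1000-byte write becomes three full bursts plus a 232-byte tail, and the full region table covers 16 GiB of remote memory.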
Universal Parallel Port (uPP)
• What is it?
– Parallel bus, two independent channels (separate data buses)
– I/O speeds up to 75 MHz with 8-16 bit data width per channel
– 1 or 2 channel parallel interface operating in RX, TX or FD mode
– Supports double data rate mode of operation (bandwidth does not change/increase)
• Application
– Each channel can interface cleanly with high-speed ADCs and/or DACs with up to 16-bit data width (per channel)
– Useful as a low-cost interface with FPGAs. Can run up to 120 MByte/s per channel in single-channel or bi-directional mode (240 MByte/s for both channels in unidirectional mode)
– Can also be used to interface two C6655/57 devices, or to connect a C6655/57 with the C674x or OMAP-L13x family of devices
• Other benefits
– Internal DMA, which leaves the CPU EDMA free
– Simple protocol with few control pins (configurable: 2-4 per channel)
– Multiple data packing formats for 9-15 bit data widths
– Interleave mode (single channel only)
– Simple interface: I/O queued by software
Throughput Estimates:
Note: Max. clock of 50 MHz in (*) configuration
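As a rough guide, a parallel port's raw data rate is its clock rate times its data width. The sketch below applies that to the uPP clock and width limits quoted above. It is an upper-bound model only: the function name is illustrative, packing formats and idle cycles are ignored, and the slide's own figure of up to 120 MByte/s per channel should be treated as the authoritative number.

```python
def upp_raw_mbytes_per_s(clock_mhz: float, data_width_bits: int) -> float:
    """Raw per-channel ceiling in MByte/s: one word of data_width_bits per clock."""
    if not 8 <= data_width_bits <= 16:
        raise ValueError("uPP channels carry 8 to 16 data bits")
    return clock_mhz * data_width_bits / 8


# 16-bit channel at the 75 MHz maximum I/O speed
ceiling_75mhz = upp_raw_mbytes_per_s(75, 16)

# 16-bit channel at the 50 MHz limit mentioned in the note
ceiling_50mhz = upp_raw_mbytes_per_s(50, 16)

# 8-bit channel at 75 MHz
ceiling_8bit = upp_raw_mbytes_per_s(75, 8)

print(ceiling_75mhz, ceiling_50mhz, ceiling_8bit)
```

Double data rate mode does not raise these ceilings, since the slide states that bandwidth does not change in that mode.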
Thank You