taking configurability and programmability to the next ... · taking configurability and...
TRANSCRIPT
Dr. Yankin TanurhanVice President of Engineering, SGSynopsys Inc.July 2016
Taking Configurability and Programmability to the Next Level for Embedded Systems Processing
2© 2016 Synopsys, Inc.
Agenda
Machine vision Scalable EV Processors Convolution Neural Network Engine (CNN)Vision programming toolsHost IntegrationConclusions
© 2016 Synopsys, Inc. 3
Embedded Vision is Coming Fast
• Embedded Vision is the use of computer vision in embedded systems to interpret meaning from images or video
• In cars to improve safety
• Surveillance for detection and tracking
• In industrial automation to improve quality and control
• Estimated $300B+ market in 2020, 35% CAGR
0
50
100
150
200
250
300
350
2013 2014 2015 2016 2017 2018 2019 2020
Bill
ions
of D
olla
rs
Vision Systems Shipments
Sources: ABI Research, Insight Media, Transparency Market Research, Markets And Markets, Synopsys
4© 2016 Synopsys, Inc.
Video Surveillance Market
• Global IP Video Surveillance Market expected to grow at CAGR of 37.3% from 2012-20
• Demand driven by– Growing installations of IP cameras – Need for surveillance cameras
with better video quality
http://www.alliedmarketresearch.com/IP-video-surveillance-VSaaS-market
3X Growth Forecast 2013 - 2019
Home Surveillance, Retail, Healthcare, Security (Airports, Govt, Banks, Casinos)
5© 2016 Synopsys, Inc.
ADAS Vision Market
Source: IHS 2015
6© 2016 Synopsys, Inc.
Scalable Embedded Vision Processors
© 2016 Synopsys, Inc. 7
Scalable Embedded Vision Processors
• Highly integrated and configurable– Configurable scalar, vector DSP and
convolutional neural network (CNN) architecture
– Supports 1080p - 4K vision streams
• User scalable for optimum performance – 1 to 4 Vision CPU cores– Programmable CNN engine
• State-of-the-art performance-efficiency
• High productivity toolset– OpenCV, OpenVX, OpenCL C
OpenCV, OpenVX™ libraries and API
MetaWare, OpenCL C Development Tools
Vision CPU (1 to 4 cores)
AXI Interconnect
Core 4
Core 3
CNN Engine
Core 2
Core 1
32-bit scalar
512-bit vector DSP
Embedded Vision Processor
Convolution
Classification32-bit scalar
512-bit vector DSP
Shared MemoryShared MemorySync & DebugSync & Debug Streaming Transfer UnitStreaming Transfer Unit
8© 2016 Synopsys, Inc.
High-Performance Vision CPU
• Implemented with 1 to 4 Vision CPU cores to run full range of vision algorithms up to 4K resolutions
• 32-bit scalar for easy system integration
• Configurable vector DSP for high-end vision processing
• Cross-bar implements scatter/gather delivering higher performance with increased efficiency
• OpenCL C optimized instruction set for high programming productivity with reduced cycle count and lower power
• Parallel operation with the CNN Engine increasing efficiency and throughput
Fast, Real-time Vision Processing
Vision CPU
32-bit Scalar32-bit Scalar
D$I$
512-bit Vector DSP512-bit Vector DSP
LN1
LN1
LN2
LN2
LN3
LN3
LNn
LNn
…
VCCM
Cross-BARCross-BAR
9© 2016 Synopsys, Inc.
Multi-Core Scalability to Increase Performance
External Bus Interface (AXI)External Bus Interface (AXI)
Vision CPU 4AUXREG
VCCM
ICCMDCCM
ICACHE
Vision CPU 4Vision CPU 4AUXREG
AUXREG
VCCMVCCM
ICCMICCMDCCMDCCM
ICACHEICACHE
Vision CPU 2AUXREG
VCCM
DCCM
D$
DCCM
Vision CPU 2Vision CPU 2AUXREG
AUXREG
VCCMVCCM
DCCMDCCM
D$D$
DCCMDCCM
Interrupt I/F to Host
Inter-core interrupt
unit
Inter-core interrupt
unit Inter-core
semaphore unit
Inter-core semaphore
unit
Inter-core message
passing unit
Inter-core message
passing unit
64-bit global RTC
64-bit global RTC
ARC
AU
X in
terfa
ceARC
AUX interface
Inter-core debug
assist unit
Inter-core debug
assist unit
I$I$
Vision CPU1 AUXREG
VCCM
D$
Vision CPU1Vision CPU1 AUXREG
AUXREG
VCCMVCCM
D$D$
I$I$
Vision CPU 3AUXREG
VCCM
D$
I$
Vision CPU 3Vision CPU 3AUXREG
AUXREG
VCCMVCCM
D$D$
I$I$
L1 Scalar D$ Coherency Unit
STU(Streaming
Transfer Unit)
STU(Streaming
Transfer Unit)
VDSP
1VD
SP2
VDSP
3VD
SP4
CSM Cluster
Shared Mem.- Multi-bank- Arbitration
© 2016 Synopsys, Inc. 10
Efficient Frame Data Access
• Streaming Transfer Unit (DMA)• Tightly integrated with VDSP vector memory
(VCCM)– Capable of addressing the memory as
Private Banks or Local Memory– Low overhead:
– Transfers can be setup in VCCM memory with vector instructions
– Event-based low-overhead synchronization (no ITs required)– Physical channels (up to 4) can be shared between VDSP cores to
serve multiple private banks in parallel• Support for 1D and 2D (image) transfers for vision applications• I/O coherency: Configurable I/O-coherent regions
External Bus Interface (AXI)External Bus Interface (AXI)
STUSTU
HS VDSP 2HS VDSP 2
VCCMVCCM
D$D$
I$I$
HS VDSP 4HS VDSP 4
VCCMVCCM
D$D$
I$I$
HS VDSP 1HS VDSP 1
VCCMVCCM
D$D$
I$I$
HS VDSP 3HS VDSP 3
VCCMVCCM
D$D$
I$I$
ClusterShared Memory
11© 2016 Synopsys, Inc.
Convolution Neural Network Engine
© 2016 Synopsys, Inc. 12
CNN is Accurate and Efficient
• Convolutional Neural Networks (CNN)– Deep learning approach outperforms other vision algorithms– CNNs attempt to replicate how the brain sees– Efficiently recognizes objects directly from pixel images with
minimal pre-processing
• CNN usage for Vision– Image classification, search similar images– Object detection, classification & localization
– Any type of object(s), depending on training phase
– Face recognition, visual attention, facial expression recognition– Gesture recognition / hand tracking– Scene recognition and labelling, semantic segmentation
CNN
© 2016 Synopsys, Inc. 13
User Benefits of a Dedicated CNN Engine• Scalable high performance
– Over 800 MAC/cycle
• PPA competitive with H/W-based CNNs– Similar area density (GMAC/s/mm2) as H/W– Over 1000 GOP/s/W
• Low external memory bandwidth– Close to image pixel bandwidth <800 MB/s
• Highest productivity and flexibility– Automatic CNN graph mapping
• Closely coupled with leading edge EV6x processor– Unique combination of scalar + vector processing + CNN– Integrated into a single, standard programming model
800 MAC/cycle
14© 2016 Synopsys, Inc.
High-Performance CNN Engine
• Dedicated CNN Engine delivers higher performance than competitive solutions
• Programmable to support full range of fixed point CNN graphs
• State-of-the-art power-efficiency >1000 GOPS/W
• Supports resolutions to 4K
• Real-time, high quality image classification, object recognition, semantic segmentation
• Parallel operation with Vision CPU increasing efficiency and throughput
AXI Interconnect
Vision CPU32 bit RISC
512-bit Vector DSP
ClusterSharedMemory
DMA
AR
Connect
CNN Engine
Convolution
Classification
© 2016 Synopsys, Inc. 15
CNN Scene segmentation benchmark• Application is to perform semantic labeling
of full image– Define sky, forest, buildings– Define border of road (w/o lane markers)– Used typically in path finding applications for autonomous driving– Unlike object detection, there is no Region-of-Interest pre-calculation
– Need to analyze entire image
• CNN graph characteristics– Video at 1080p, 30 fps– 5 channels (3 channels for colors, 2 application-specific channels)– Image bandwidth requirements:
311 MB/s (for 8 bit pixels), or 622 MB/s (for 16 bit pixels)– Compute requirements: Over 750 GMAC/s– Weights: Over 100K values
Preliminary – Subject to Change
carcar
skybuilding
road
building
16© 2016 Synopsys, Inc.
Vision Programming Tools
17© 2016 Synopsys, Inc.
High Productivity Tools Increase Flexibility• OpenVX eases vision graph
development
• OpenCV open source library of 2500 vision algorithms helps build vision applications
• MetaWare C/C++ compiler delivers optimize program coding
• OpenCL C instructions with whole function vectorization simplifies DSP programming
• CNN graph mapping tools automate CNN programming
K1 Kn…
Kernel LibUi
UjUk
Kn
Cm1
UkUm
Runtime
Cm2
Cm3
Cm3
User kernels
Ui
Uk
C/C++
OpenCL C
Cmi
CNN Primitives
CNN Graph Mapping Tools
CNN Graph Training
C/C++ compiler with OpenCL C whole function
vectorizationLib
Embedded Vision Processor
Vision Graph
Synopsys User Open Source Third Party Partner
18© 2016 Synopsys, Inc.
FPGA-Based Development System
EV6x ProcessorVision CPU Shared
Data Mem
CNN Engine
DMADMA
AXI Subsystem InterconnectAXI Subsystem Interconnect
HS VDSP 4
MEMHS VDSP 1
AXI InterconnectAXI Interconnect
DDR3
Vision CPU- Scalar & vector processing- 1 – 4 HS VDSP Cores- Supports HD to 4K streams- Host offload or standalone
operation
AXIUMRBus
HAPS-80 (S26) Prototyping System
…
HAPS-80EV6x runs at up
to 100 MHz
32-bit RISC CPU
512-bit Vision DSP
Convolution
Classifier, SVM PCIe
Software Development- High productivity tool set- OpenVX runtime & kernels- OpenCL C vectorizing compiler- MetaWare C/C++ compiler
CNN Engine- Object detection,
classification, localization- Fully programmable- >1000 GMAC/s/W
19© 2016 Synopsys, Inc.
EV Processor SoC Virtual Development Kit
• Host integration demonstrator–VDK with ARC HS38 host running Linux
O/S
–Linked to EV processor model
–Low-level mechanisms provided to support customer-specific host O/S driver development
–Simple host-side demonstrator application: e.g., send frame for object detection processing, receive detection points
SoC InterconnectSoC Interconnect
SoCMem Host
Customer’s SoC
DRAMVDK
Linux
I/Operiph
Existing ARC HS VDK
DMA
VDK-wrappedEV processor model
nSIM ISS model CNN Model
I/F
20© 2016 Synopsys, Inc.
Host Integration
21© 2016 Synopsys, Inc.
Designed for SoC Integration
• Must be designed for easy integration with a host processor– Most designs will include host
HAPS-70 S12
Virtual Platform
Reference Designs
Sensor Subsystem
EV Processor
Host
AXI
IntelVision CPU (1 to 4 cores)
AXI Interconnect
Core 4Core 3
CNN Engine
Core 2Core 1
32-bit RISC
512-bit vector DSP
Embedded Vision Processor
Convolution
Classification
Cluster Shared MemorySync & DebugSync & Debug Streaming Transfer UnitStreaming Transfer Unit
32-bit RISC
512-bit vector DSP
22© 2016 Synopsys, Inc.
Host to EV6 CommunicationAs implemented in the EV VDK
CNN-based Application
AXI Interconnect
User Application
Custom host/EV system I/F to invoke CNN processing
Memory
Interrupts
Data transfer
Data transfer
runtimeon EV61 Processor
Sensor Subsystem
Host
23© 2016 Synopsys, Inc.
Conclusions
24© 2016 Synopsys, Inc.
Conclusions
• Strong growth in vision applications – Surveillance, ADAS, IoT, Industrial, gesture recognition, object detection– Embedded vision algorithm innovation, highly dynamic– Requirements for configurability and progammability
• Vision Processors integrate HW & SW for optimized vision apps– Flexible, integrated solution (scalar + vector DSP + CNN)– Dedicated CNN engine offers higher performance and best efficiency– Low-power, high-performance, flexible programming– Fast time-to-market with high-level OpenVX and C programming tools
• High productivity tool set– OpenVX, OpenCV, OpenCL C, Optimized C/C++ MetaWare compiler
Thank YouThank You