AI Chip Trends and Forecast
Joo-Young Kim
2019. 11. 6
ICT Industry Outlook Conference
Outline
• Introduction
- Brief history & deep neural network models
- AI stack and new computing paradigm
• Trends in AI chips
- ??
• Looking forward
- ???
Motivation
Artificial Intelligence is pervasive in our everyday life.
Brief History of Neural Networks
• 1943, W. S. McCulloch - W. Pitts: adjustable but not learnable weights
• 1958, F. Rosenblatt: learnable weights and threshold
• 1960, B. Widrow - M. Hoff
• 1969, M. Minsky - S. Papert: XOR problem
• 1986, D. Rumelhart - G. Hinton - R. Williams: nonlinear problem solved; high computation; local optima and overfitting
• 2006, G. Hinton - R. Salakhutdinov: hierarchical feature learning
Periods marked on the timeline: First Winter, Second Winter, then Deep Learning!
Deep Learning!
- ImageNet
- AlphaGo
- Speech translation
- Video synthesis
- Smart factory
- …
Deep Learning ≠ AI
AI: Any technique that enables computers to mimic human behavior
- Searching, planning, knowledge representation, fuzzy logic, natural language processing, genetic algorithm
Machine Learning (ML): AI techniques that have computers learn without being explicitly programmed
Deep Learning: A subset of ML that makes the computation of multi-layer neural networks feasible
Deep Learning Revolution
Human: ~5%
ImageNet (ILSVRC) Top-5 Error
* F. Veen, The Asimov Institute, 2016
Deep learning starts to surpass human-level recognition on specific tasks
What Has Changed?
• Traditional pattern recognition
- Hand-crafted features (HoG, SIFT, Haar-like) feed simple trainable classifiers (SVM, K-Means), which output "Dog", "Ship", "Car"
• Deep learning (model + data)
- Trainable features & classifiers (CNN, DNN) output "Dog", "Ship", "Car"
[Figure: performance vs. amount of data for traditional algorithms and deep learning. Andrew Ng, Stanford CS 229 class]
Popular Types of DNNs
Type | MLP (Multi-Layer Perceptron) | CNN (Convolutional) | RNN (Recurrent)
Characteristic | Fully connected | Convolutional layer | Sequential data, feedback path
Major Application | Speech recognition | Image recognition | Speech / action recognition
Number of Layers | 3~10 layers | Max ~100 layers | 3~5 layers
Main Computation | Matrix-vector multiplication | 3D convolution | Matrix-vector multiplication
[Figure: layer diagrams for each type, with labels Input, Hidden, Output, Convolution, Pooling, Fully Connected]
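The two main computations named on this slide, matrix-vector multiplication (MLP, RNN) and 3D convolution (CNN), can be sketched in a few lines. This is a minimal illustration in plain Python, assuming stride 1, no padding, and a single filter; the function names are made up for this example and real accelerators map these loops onto parallel hardware.

```python
def matvec(weights, x):
    """Fully connected layer core: y[i] = sum_j weights[i][j] * x[j]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def conv3d_single_filter(ifmap, kernel):
    """Slide one C x K x K kernel over a C x H x W input (stride 1, no padding)."""
    channels, height, width = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    k = len(kernel[0])
    out = []
    for oy in range(height - k + 1):
        row = []
        for ox in range(width - k + 1):
            acc = 0
            # Accumulate over all channels and kernel positions (the "3D" part).
            for c in range(channels):
                for ky in range(k):
                    for kx in range(k):
                        acc += ifmap[c][oy + ky][ox + kx] * kernel[c][ky][kx]
            row.append(acc)
        out.append(row)
    return out


y = matvec([[1, 2], [3, 4]], [1, 1])
ofmap = conv3d_single_filter(
    [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]],  # 1 channel, 3x3 input
    [[[1, 0], [0, 1]]],                   # 1 channel, 2x2 kernel
)
```

Both routines reduce to multiply-accumulate operations, which is why the processing elements of most AI chips are built around them.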
And Many More Models…
[Figure: timeline of DNN models across the 1970s, 1980s, 1990s, 2012~2014, 2015, 2016, and 2017~]
Models shown: MLP, Cognitron/CNN, LeNet, LSTM, AlexNet, VGGNet, GoogLeNet, R-CNN, GRU, SegNet, ResNet, Fast R-CNN, Faster R-CNN, YOLO, CNN+RNN, DenseNet, DeepLab, ENet, YOLO v2, PointNet, WaveNet, WGAN, CycleGAN, StarGAN, DiscoGAN, FCN, DeepLab v3+, VoxelNet, PointNet++, YOLO v3, BERT, Tacotron, Attention-only network
DNN Characteristics
• Requires big data & big computation
• Modern hardware (e.g. GPU) enabled the deep learning revolution
Face recognition workload comparison:
- Local-feature-based: ~0.1 billion operations/face, ~10MB memory access/face
- Deep-learning-based: ~2 billion operations/face, ~1GB memory access/face
AI Stack
Application
• Video/Image: face recognition, image generation, video analysis, …
• Sound and Speech: speech recognition, language synthesis, music generation, …
• NLP: text analysis, language translation, human-machine communication, …
• Robotics: autopilot, UAV, industrial automation, …
Algorithm
• Neural network topology: MLP, CNN, RNN, LSTM, SNN, …
• Deep neural networks: AlexNet, ResNet, GoogLeNet, …
• Neural network algorithms: reinforcement learning, adversarial learning, …
• Machine learning algorithms: SVM, K-NN, decision tree, Markov chain, …
Chip
• Neuromorphic chip: brain-inspired computing, biological brain simulation, …
• Programmable chip: GPU, ASIC, FPGA, DSP, …
• System-on-Chip: multi-core, many-core, SIMD, systolic array, …
• Development tool-chain: frameworks, compiler, simulator, optimizer, …
Device
• High bandwidth off-chip memory: HBM, DRAM, GDDR, STT-MRAM, …
• High speed interface: SerDes, optical communication
• CMOS 3D stacking
• Emerging computing device: analog computing, memristors, …
• Emerging memory device: ReRAM, PCRAM, …
New Computational Paradigm
• Being able to handle big data
- Huge storage capacity, high bandwidth, low latency memory access
- “Memory wall” problem
• Large amount of computation
- Mainly linear algebraic operations, while control is relatively simple
- Parameters are large
• Training vs Inference
- Training: accuracy, data capacity (~10^18 bytes), weight synchronization
- Inference: speed, energy, hardware cost, efficient reading of weights
• Data precision / Model compression / Pruning
- High precision is not always required
• High configurability
- Tradeoff between energy efficiency and adaptability to new algorithms
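The data precision and pruning point can be made concrete with a small sketch. The code below is an illustration only, assuming symmetric 8-bit uniform quantization and simple magnitude-based pruning; the function names and the sample weights are hypothetical and do not come from any particular chip or framework.

```python
def quantize_int8(values):
    """Symmetric uniform quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def prune_by_magnitude(values, keep_ratio):
    """Zero out all but the largest-magnitude weights (ties may keep a few extra)."""
    keep = max(1, int(len(values) * keep_ratio))
    threshold = sorted((abs(v) for v in values), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in values]


weights = [0.8, -0.05, 0.3, -1.27, 0.01, 0.6]  # hypothetical layer weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
pruned = prune_by_magnitude(weights, keep_ratio=0.5)
```

Each weight now needs 8 bits instead of 32, and half of the multiplications can be skipped, at the cost of a bounded rounding error (at most half of `scale` per weight).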
AI Chip Landscape
https://basicmi.github.io/AI-Chip/
DNN Hardware
• Mobile Based
- Specific AI
- Real-time
- Limited resources
- Low-power
• Cloud Based
- General AI
- High computing
- Huge memory
- Fast & accurate learning
[Figure: Cloud Server, Mobile, and Edge Terminal placed on two axes, Real-Time Operation (low to high) and Global Data Sharing (low to high); the links between them are labeled "Control & Control Model" and "Data & Learned Model"]
Cloud based AI Computing
[Figure: on the Cloud / Server side, learning from training data (dataset) produces a pre-trained network, and inference runs on the cloud / server; on the Device / Edge side, a voice assistant sends a question and receives an answer]
DNN Chips for Cloud Server
• Nvidia (GPU)
• Google (TPU)
• Microsoft (BrainWave)
• Amazon (Inferentia)
• Alibaba, Baidu
[Figure: Cloud Server placed on two axes, Real-Time Operation (low to high) and Global Data Sharing (low to high)]
Control based on overall conditions
Learning with data collected from edge devices
Stand-Alone AI
NVIDIA Volta, Google Cloud TPU
Mobile/Edge based AI Inference
Self-driving vehicle, intelligent camera/speaker, IoT devices
[Figure: on the Cloud / Server side, learning from training data (dataset) produces a pretrained network, with inference on the cloud / server; on the Device / Edge side, the device loads the pretrained model and runs inference with it on local data from sensors (camera, MIC, GPS, gyro, touch), serving the user interface & apps platform]
Mobile/Edge DNN Applications
• Apple
• Huawei
• Qualcomm
• ARM
• CEVA
• Cambricon
• Horizon Robotics
• Mobileye
• Tesla
[Figure: application classes plotted by Power Consumption (low to high) and Inference Speed (slow to fast): IoT, Wearable, Smart Phone, Drone, Automotive, Mobile Robot]
Cloud vs Edge Summary
Training, Cloud / Datacenter: high performance, high precision, high flexibility, distributed, scalable
Training, Edge / Mobile: ?
Inference, Cloud / Datacenter: high throughput, low latency, power efficiency, distributed, scalable
Inference, Edge / Mobile: diverse requirements (car, wearable, IoT), low-moderate throughput, low latency, power efficiency, low cost
Functional Integration
Stage | Early | 1st Stage | 2nd Stage | ?
Hardware | Classic | Domain specific | Reconfigurable
Domain | Cloud | Cloud/Edge | Cloud/Edge
Target Workload | Training oriented | Inference | Inference & Training
Examples | Intel CPU, nVidia GPU, Xilinx FPGA | MIT Eyeriss, KAIST LNPU, Google TPU, Microsoft BrainWave, … | Wave DPU, Tsinghua Thinker, …
Courtesy of GTIC 2019
Two Different Directions
• Be more flexible
- 2014: Dedicated (DianNao)
- 2016: RS dataflow (MIT Eyeriss)
- 2017.1: Systolic array (Google TPU)
- 2017.6: Sparse-aware (Nvidia SCNN)
- 2018: Flexible bitwidth (KAIST UNPU)
- …
• Be more compact
- 2016.2: Compression, pruning (EIE)
- 2016.8: BWN
- 2016.11: TWN
- 2018.2: Low-bit training (DoReFa-Net)
- 2018.9: Low-bit quantization (LQ-Nets)
- …
Courtesy of GTIC 2019
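To show what "more compact" means for the BWN and TWN entries, here is a minimal sketch of binary and ternary weight approximation. It follows the commonly described formulations (a scaling factor times sign values, and a magnitude threshold for ternary weights); the 0.7 threshold factor, the function names, and the sample weights are assumptions for illustration, not details taken from this slide.

```python
def binarize(weights):
    """Binary weights: w ~ alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


def ternarize(weights, delta_factor=0.7):
    """Ternary weights in {-1, 0, +1}; weights below the threshold become 0."""
    delta = delta_factor * sum(abs(w) for w in weights) / len(weights)
    codes = [(1 if w > 0 else -1) if abs(w) > delta else 0 for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0  # scale of the surviving weights
    return codes, alpha


weights = [0.5, -0.25, 0.0625, -1.0]  # hypothetical layer weights
signs, alpha_b = binarize(weights)
codes, alpha_t = ternarize(weights)
```

With such codes, a multiplication by a weight becomes a sign flip or a skip, followed by a single scaling by `alpha` per output.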
Von Neumann Bottleneck for AI
• The Von Neumann architecture fetches data serially from storage
• AI applications need to access a tremendous amount of data
[Figure: the AI processor and memory are connected by a bus, which is the bottleneck. The memory hierarchy runs NVM, DRAM, SRAM (cache), processor, with the memory wall and the Von Neumann bottleneck marked]
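A rough way to see the bottleneck is a roofline-style bound: attainable performance is the smaller of the compute peak and memory bandwidth times the number of operations performed per byte fetched. The sketch below is an illustration under stated assumptions; the peak and bandwidth numbers are hypothetical, and only weight traffic is counted for the matrix-vector case.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def matvec_intensity(rows, cols, bytes_per_weight):
    """Matrix-vector multiply: 2 FLOPs (multiply + add) per weight, each read once."""
    flops = 2 * rows * cols
    bytes_moved = rows * cols * bytes_per_weight  # assumption: weight traffic dominates
    return flops / bytes_moved


# Hypothetical accelerator: 10 TFLOPS peak, 512 GB/s memory bandwidth, 16-bit weights.
intensity = matvec_intensity(rows=1024, cols=1024, bytes_per_weight=2)
bound = attainable_gflops(peak_gflops=10_000, bandwidth_gb_s=512, flops_per_byte=intensity)
```

Under these assumptions the layer reaches about 5% of the compute peak, so the memory system, not the arithmetic units, sets the speed.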
Increasing Memory Bandwidth
How can we increase bandwidth between processor and memory?
Near Memory Processing
[Figure: a processor and DRAM chips placed side by side on a PCB]
3D-Stacked Memory
High Bandwidth Memory
Advantage of HBM
ITEM | GDDR5 | HBM (High B/W Memory)
System configuration, DRAM | 8Gb GDDR5 × 12 | 4GB HBM × 4
System configuration, Size | 3120 mm² (60mm × 52mm) | 792 mm² (33mm × 24mm)
Density | 12GB | 16GB
Bandwidth | 384GB/s | 1024GB/s
Power | 18.3W (1.5W × 12 GDDR5) | 9.1W (2.3W × 4 HBM)
Pin (ball) speed | 8 Gbps | 2 Gbps
Pin (ball) # I/O | 32 per chip (total 384) | 1024 per cube (total 4096)
2016 GFX forecast specification:
• HBM 4~6 cubes
• 4~8GB, 512GB/s~1TB/s
• 10 TFLOPS
[Figure: board layouts. A processor surrounded by 12 GDDR5 (G5) chips, 60mm × 52mm, versus a processor with 4 HBM cubes, 33mm × 24mm. Callouts: -75%, 1.3x, 3.6x, +18%]
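The bandwidth figures in the table follow from the pin speed and the number of I/O pins: HBM runs each pin slower but has far more of them. The short calculation below is a worked check using only numbers from the table and the board dimensions in the figure; the function name is made up for this example.

```python
def peak_bandwidth_gb_s(pin_speed_gbps, total_io_pins):
    """Peak bandwidth in GB/s = per-pin rate (Gb/s) x I/O pins / 8 bits per byte."""
    return pin_speed_gbps * total_io_pins / 8


gddr5 = peak_bandwidth_gb_s(pin_speed_gbps=8, total_io_pins=32 * 12)   # 12 chips
hbm = peak_bandwidth_gb_s(pin_speed_gbps=2, total_io_pins=1024 * 4)    # 4 cubes

# Board area change when moving from the GDDR5 layout to the HBM layout.
area_change = (33 * 24) / (60 * 52) - 1
```

The result reproduces the 384GB/s and 1024GB/s rows, and the area shrinks by roughly three quarters.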
Emerging Non-Volatile Memories
White Paper on AI Chip Technologies (2018)
DRAM-like speed, Flash-like capacity and Non-Volatile
Moving Toward Memory
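[Figure: three organizations of the NVM, DRAM, SRAM (cache), processor hierarchy.
- Traditional: processor (P) separated from memory by the Von Neumann bottleneck
- Near-Memory/Emerging Mem: processor placed closer to the memory
- In-Memory/Memory-centric: many processors (P) distributed inside the memory]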
Processing-In-Memory (PIM)
[Figure: Von Neumann: an AI processor and memory connected by a bus (bottleneck). Non Von Neumann: an array of tiles, each combining Mem and Logic]
Converged logic + memory (high BW)
Suitable for data-intensive workloads
Little data movement (energy efficient)
PIM Chip
Renesas’s ternary SRAM PIM for AI inference
S. Okumura, et al., “A Ternary Based Bit Scalable, 8.80 TOPS/W CNN accelerator with Many-core Processing-in-memory Architecture with 896K synapses/mm2”, Symposium on VLSI Technology 2019
AI Framework
Provides higher-level abstraction to developers/users
Convolution on volumes (1 line)
Max pooling (1 line)
Non-linear ReLU (1 line)
Hyper-Scale AI Accelerators
TPU v3 (2018)
Cerebras Wafer Scale Engine (2019)
Usually hundreds of processing units in an array structure.
How do we program this?
1.2T transistors
46,225 mm2
400,000 cores
18GB SRAM
100 Pb/s interconnect
Who Fills this Gap?
Cerebras WSE
AI Software Tool-Chain
• Xilinx AI Edge Platform
SW developers, users vs. a few hardware vendors
Problem: No De Facto SW Tool & Hardware!
Software | Toolchain | Hardware
C / Java | Compiler toolchain | CPU
OpenGL / CUDA | Compiler toolchain | GPU
Verilog / VHDL | Synthesis toolchain | FPGA
?
Neuromorphic Chip
1st Generation
• Perceptron based
• No non-linear functions
• Binary output
2nd Generation
• Non-linear activation functions
• Continuous output
• Functional modeling of our brain
• Working real-life applications
• We are here (FF, CNN, RNN, …)
3rd Generation
• “Spiking neuron”
• Closely models biological neuron’s activity
• Incorporates concept of time: integrate and fire
• Computationally expensive
• Difficult to train
• Not practical at the moment
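The "integrate and fire" behavior of a spiking neuron can be sketched as a small simulation. This is a minimal leaky integrate-and-fire model for illustration only; the threshold, leak factor, and input values are hypothetical and do not describe any specific neuromorphic chip.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire: integrate input over time, fire on crossing threshold."""
    potential = 0.0
    spikes = []
    for current in input_current:
        potential = leak * potential + current  # leak, then integrate the new input
        if potential >= threshold:
            spikes.append(1)                    # fire a spike
            potential = reset                   # reset the membrane potential
        else:
            spikes.append(0)
    return spikes


spikes = simulate_lif([0.5, 0.5, 0.5, 0.0, 0.0, 1.0])
```

Unlike a 2nd-generation neuron, the output depends on when the inputs arrive, not only on their values, which is one reason these models are harder to train.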
IBM TrueNorth
• 5.4 billion transistors in 28nm CMOS process
• 64 x 64 array of neurosynaptic cores (4,096 cores) with 256 neurons each, about one million neurons in total
Paul A. Merolla, et al. "A million spiking-neuron integrated circuit with a scalable communication network and interface." Science, 2014
IBM TrueNorth
• Mimicking synapse with SRAM
• However, SRAM is not made for this (large area, cost).
A synapse is a structure that permits a neuron to pass an electrical signal to another.
[Figure: SRAM synapse array between pre-neurons (Tx) and post-neurons (Rx). Input spikes enter the array, each 8T SRAM cell acts as a synapse storing 1 or 0, and column voltages are summed (Σ) to produce the output spikes (voltage). Signal labels: WL, BLT, BL]
Neuromorphic Chip with Emerging Device
• New model requires device with new physics
• FeFET: better storing/transferring of analog signals
M. Jerry, et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training", IEEE IEDM 2017
Neuromorphic Chip with Emerging NV RAM
Z. Wang, et al., "Fully memristive neural networks for pattern classification with unsupervised learning", Nature Electronics 2018
• ReRAM (memristor)
1. Cloud and Edge Will be Closer
• Edge inference & learning will become more important due to privacy concerns, real-time operation, and power constraints
• Federated learning: leverage cloud’s big data advantage on edge devices
[Figure: federated learning loop. Cloud servers broadcast the shared model to mobile devices; each device performs local learning to obtain custom weights; the devices send back encrypted & compressed data; the cloud aggregates the encrypted data into an updated model]
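The broadcast, local learning, and aggregation loop can be sketched as follows. This is a minimal illustration in the style of federated averaging, assuming a one-dimensional linear model and one gradient step per device per round; the encryption and compression steps shown on the slide are omitted, and all names and data are hypothetical.

```python
def local_update(shared_weights, local_data, lr=0.1):
    """One gradient step of 1-D linear regression y ~ w*x + b on a device's own data."""
    w, b = shared_weights
    n = len(local_data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in local_data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in local_data) / n
    return (w - lr * grad_w, b - lr * grad_b)


def aggregate(updates, sizes):
    """Average the device models, weighted by local dataset size."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return (w, b)


def federated_round(shared_weights, devices):
    """Broadcast the shared model, learn locally, then aggregate on the server."""
    updates = [local_update(shared_weights, data) for data in devices]
    return aggregate(updates, [len(data) for data in devices])


# Hypothetical devices; each holds private samples of y = 2x + 1.
devices = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(-1.0, -1.0), (0.5, 2.0), (1.5, 4.0)]]
model = (0.0, 0.0)
for _ in range(200):
    model = federated_round(model, devices)
```

Only model weights travel between devices and server; the raw samples never leave the device, which is the privacy argument made above.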
2. AI Chips will Support More Algorithms
• State-of-the-art algorithms are moving from traditional MLP, CNN, RNN to GAN, reinforcement learning, and unsupervised learning
[Figure: stages of AI chip capability]
- Inference only (MLP/RNN or CNN)
- Inference only (MLP/CNN/RNN)
- Inference + Training (MLP/CNN/RNN)
- Inference + Training (GAN/RL/Unsupervised/MLP/CNN/RNN)
3. AI Security Will be Essential
• It is easy to break DNN-based recognition
- New cyberattack: imperceptible noise injection
[Figures: breaking state-of-the-art face recognition; physical attack on autonomous vehicles]
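The noise-injection idea can be illustrated on a toy linear classifier: every input is nudged by a small, bounded amount in the direction that lowers the score of the correct class. This is a sketch in the spirit of gradient-sign attacks, not the attack used in the works pictured; the weights, input, and epsilon are hypothetical.

```python
def score(weights, bias, x):
    """Linear classifier: positive score means class 1, otherwise class 0."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def adversarial_example(weights, x, epsilon, true_label):
    """Shift every input by at most epsilon in the direction that hurts the true class."""
    direction = -1 if true_label == 1 else 1
    # For a linear model the gradient of the score w.r.t. the input is the weight vector.
    return [v + direction * epsilon * sign(w) for w, v in zip(weights, x)]


weights = [0.5, -0.25, 0.75, -0.5]  # hypothetical trained weights
bias = 0.0
x = [0.5, 0.25, 0.5, 0.5]           # clean input, classified as class 1
x_adv = adversarial_example(weights, x, epsilon=0.25, true_label=1)

clean_score = score(weights, bias, x)
adv_score = score(weights, bias, x_adv)
```

The score drops by epsilon times the sum of absolute weights, so with thousands of inputs (pixels) a much smaller epsilon per input is enough to flip the decision.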
4. For Success of AI Chip, SW is the Key
• How did ARM dominate the mobile processor market?
- Low power consumption with reasonable performance
- ARM’s competent compiler toolchain & licensing strategy
• Why was the GPU so successful in the early DNN revolution?
- Because of CUDA, a generic programming language for data-intensive workloads like matrix-vector multiplication
- CUDA took several years of maturing before developers actually used it
AI Chip Researches at KAIST
[Figure: timeline of KAIST AI chips from 2008 to 2019, with chip photos and block diagrams. Recoverable chip labels: Multi-core OR Processor, Dual Layered 3-stage Pipeline, Simultaneous Multi-threading, Multi-classifier System, Multi-core MIMD, Heterogeneous Many-SIMD, Multi-Modal UI/UX Deep Learning Core, Stereo Matching Processor, Face Recognition & CNN–RNN, Variable Bit DNN & 3D HGR. Recoverable block labels: convolution clusters, FC LSTM processor, aggregation core, 1-D SIMD core, top controller, external interfaces/gateways, pipelined CNN PE, FWD/BWD unit, custom RISC, ICP-PSO engine, NN PIM 0 to 15]
Process: 65nm 1P8M Logic CMOS
Area: 4mm × 4mm
SRAM: 448 KB
Supply: 0.67V – 1.1V
Power: 196 mW @ 200MHz, 1.1V; 2.4 mW @ 10MHz, 0.67V
Precision: Feature – bfloat16; Weight – 16/8/4-bit FXP
Peak Performance: 204 GFLOPS @ 16b Weight
[Figure: chip layout with Core 1 to Core 3, top controller, external interfaces 0 and 1, UMEM, BMEM, PE arrays, exp. compressor, and 1-D SIMD, labeled "Supervised & Reinforcement Learning"]
[Figure: hand depth tracking results from an input image, captured with VGA cameras separated by 5cm. Hand tracking accuracy: 2.6mm@20cm, 4.6mm@30cm, 3.4mm@40cm]