TRANSCRIPT
Scalable and Energy-efficient Architecture Lab (SEAL)
http://seal.ece.ucsb.edu/ SEAL@UCSB
DRISA: A DRAM-based
Reconfigurable In-Situ Accelerator
Shuangchen Li, Dimin Niu, Krishna T. Malladi,
Hongzhong Zheng, Bob Brennan, Yuan Xie
University of California, Santa Barbara
Memory Solutions Lab, Samsung Semiconductor Inc.
Motivation and Observation
• Merging the computing resources and memory fabrics
– Memory-rich processor: low memory capacity
– Compute-capable memory: low performance
To have BOTH:
(1) Use DRAM technology
(2) Remove “sys-memory” constraints
Building an accelerator with DRAM technology
[Figure: log-log scatter plot of Normalized On-chip Mem. Capacity per Area (y-axis, 1.E+00 to 1.E+03) vs. Normalized Peak Perf. per Area (x-axis, 1E+00 to 1E+04). Points: Shidiannao (ASICs), BufferedComp, NeuroCube, Dadiannao, TITAN X (GPU), This Work. Regions: Compute-capable Memory (PIM), Memory-rich Processor.]
Key Ideas and Approaches
To have BOTH:
(1) Use DRAM technology
(2) Remove “sys-memory” constraints
Building an accelerator with DRAM technology
• DRAM technology: logic incompatible; simple Boolean logic operations
• General purpose: reconfigurable
• High perf.: improve parallelism, unblock data mov., optimize activation
– Multi-subarray active, multi-bank active
[Figure: cells on a bitline with an SA, annotated with NOR and SHIFT.]
Architecture Overview
• DRAM modifications:
– Change decoders to controllers
– Change SA to support logic operations
– Add shifters
– Others: group/bank buffers help internal data transfer, bank/subarray reorganization, split cell array regions
[Figure: three panels. (a) Chip: groups, each containing banks. (b) Bank: bCtrl, subarrays, mats. (c) Subarray and mat: sCtrl, DRAM cells, SA that supports Boolean logic operations, shifter.]
Make BL Be Able To Compute (1/2)
• Three solutions:
– 3T1C: natural NOR on BL
– 1T1C: adds gates or adopts AMBIT’s methods
– 1T1C-adder: adds full adders to BL
[Figure: three bitline circuits, each with rows Rs, Rt, Rr and an SA. 3T1C-NOR: separate read and write lines (rWL, wWL, rBL, wBL). 1T1C-NOR/MIX: latch and logic gate at the SA; AND/OR obtained with a pre-load, with the SA resolving bitline levels below or above 0.5 (example levels 0.3 and 0.6). 1T1C-ADDER: latches and an n-bit adder at the SA.]
Make BL Be Able To Compute (2/2)
• Example: selector
$R = (S == 1)\ ?\ X : Y$
$R = S \cdot X + \tilde{S} \cdot Y$
NOR-only logic: $\tilde{R} = \mathrm{NOR}(\mathrm{NOR}(\tilde{S}, \tilde{X}), \mathrm{NOR}(S, \tilde{Y}))$
Step-1: $\tilde{X} = \mathrm{NOR}(0, X)$
Step-2: $\tilde{Y} = \mathrm{NOR}(0, Y)$
Step-3: $\tilde{S} = \mathrm{NOR}(0, S)$
Step-4: $\mathrm{tmp1} = \mathrm{NOR}(\tilde{S}, \tilde{X})$
Step-5: $\mathrm{tmp2} = \mathrm{NOR}(S, \tilde{Y})$
Step-6: $\tilde{R} = \mathrm{NOR}(\mathrm{tmp1}, \mathrm{tmp2})$
Step-7: $R = \mathrm{NOR}(0, \tilde{R})$
[Figure: rows on the bitline, filled in one per step: X, Y, S, !X, !Y, !S, !(!X+!S), !(!Y+S), !R, R.]
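The seven NOR steps above can be checked with a short sketch. This is a minimal Python model, not the DRISA hardware: it assumes each row is a bit vector held in an integer and that the only primitive is a bitwise NOR across `WIDTH` bitlines. The names `nor`, `select`, `WIDTH`, and `MASK` are illustrative and do not come from the slides.

```python
WIDTH = 8                      # assumed number of bitlines in this toy row
MASK = (1 << WIDTH) - 1        # keeps every result within WIDTH bits


def nor(a: int, b: int) -> int:
    """Bitwise NOR of two rows, one bit per bitline."""
    return ~(a | b) & MASK


def select(s: int, x: int, y: int) -> int:
    """Per-bitline R = X if S == 1 else Y, built from NOR only."""
    not_x = nor(0, x)          # Step-1: !X
    not_y = nor(0, y)          # Step-2: !Y
    not_s = nor(0, s)          # Step-3: !S
    tmp1 = nor(not_s, not_x)   # Step-4: equals S AND X
    tmp2 = nor(s, not_y)       # Step-5: equals (NOT S) AND Y
    not_r = nor(tmp1, tmp2)    # Step-6: !R
    return nor(0, not_r)       # Step-7: R


if __name__ == "__main__":
    s, x, y = 0b11110000, 0b10101010, 0b01010101
    print(f"{select(s, x, y):08b}")  # upper half taken from x, lower half from y
```

Every bitline evaluates its own selector in parallel, so one pass through the seven steps selects across the whole row.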
Shifters (1/2)
• Why include shifters:
– E.g., carry-in propagation
[Figure: two neighboring bitlines. Bit 0 holds rows X0, Y0, Cin0, S0, Cout0; bit 1 holds X1, Y1. Cout0 is shifted to the neighboring bitline, where it becomes Cin1.]
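The need for a shifter in carry propagation can be illustrated with a minimal sketch. It assumes a row is modeled as a `WIDTH`-bit integer with one bit per bitline, and it uses Python's AND and XOR directly; on a NOR-only bitline these would themselves be composed from NOR steps. The function name `add_with_shift` is hypothetical.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def add_with_shift(x: int, y: int) -> int:
    """Add two WIDTH-bit values using bitwise logic plus a 1-bit shift."""
    while y:
        carry = x & y              # carry-out produced locally on each bitline
        x = x ^ y                  # per-bitline sum, ignoring carry-in
        y = (carry << 1) & MASK    # the shift moves Cout of bit i to Cin of bit i+1
    return x                       # equals (x + y) modulo 2**WIDTH


if __name__ == "__main__":
    print(add_with_shift(0b0111, 0b0001))
```

Without the shift, every carry stays on the bitline that produced it and can never reach the next bit position, which is the situation the slide describes.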
Shifters (2/2)
• Multiple hierarchies:
– Intra-lane: bit shift inside an 8-bit lane
– Inter-lane: array element shift
– Forwarding: access any element in the array
[Figure: a row divided into virtual lanes (INT8).]
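The three shift levels can be sketched as follows. This is an illustrative model only: it assumes a row is a list of 8-bit lane values, that shifts fill with zeros, and that forwarding copies one lane's element into another lane. The shift directions and the function names are assumptions, not details taken from the slides.

```python
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1


def intra_lane_shift(row: list, k: int) -> list:
    """Shift bits left by k inside every lane; bits never cross a lane boundary."""
    return [(v << k) & LANE_MASK for v in row]


def inter_lane_shift(row: list, n: int) -> list:
    """Move whole elements n lanes toward higher indices, zero-filling the gap."""
    n = min(max(n, 0), len(row))
    return [0] * n + row[: len(row) - n]


def forward(row: list, src: int, dst: int) -> list:
    """Copy the element held in lane src into lane dst."""
    out = list(row)
    out[dst] = out[src]
    return out


if __name__ == "__main__":
    row = [0x81, 0x01, 0xFF, 0x10]
    print(intra_lane_shift(row, 1))
    print(inter_lane_shift(row, 1))
    print(forward(row, 2, 0))
```

The intra-lane case masks each lane separately, which is what distinguishes it from a plain shift of the whole row.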
Putting Compute-capable BLs and Shifters Together
• Observations:
– CSA (carry-save adder) is preferred over FA (full adder): reduction works fine
– Affordable MUL: one operand must be within 2 bits
[Figure, left: cycles (0 to 40) vs. operand bit length (2, 4, 8, 16) for CSA and FA.]
[Figure, right: cycles (log scale, 1 to 1000) vs. operand-1 bit length (1, 2, 4, 8, 16), one series per operand-2 bit length (2, 4, 8, 16).]
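Why a reduction suits a carry-save adder can be seen in a small sketch. It reads CSA as a carry-save adder and assumes operands are non-negative Python integers; `csa` and `reduce_sum` are hypothetical names. Each CSA step compresses three operands into two without propagating carries across bit positions, so only the final addition pays for full carry propagation.

```python
def csa(a: int, b: int, c: int) -> tuple:
    """Carry-save step: compress three operands into a sum word and a carry word."""
    partial_sum = a ^ b ^ c                       # per-bit sum, no carry-in
    carry = ((a & b) | (b & c) | (a & c)) << 1    # per-bit majority, moved up one bit
    return partial_sum, carry


def reduce_sum(values) -> int:
    """Sum many operands with CSA steps, then one carry-propagating add."""
    operands = list(values)
    while len(operands) > 2:
        a, b, c = operands.pop(), operands.pop(), operands.pop()
        operands.extend(csa(a, b, c))             # three operands in, two out
    return sum(operands)                          # the only full addition


if __name__ == "__main__":
    print(reduce_sum([3, 5, 7, 9, 11]))
```

The single shift per step matches the bitline-plus-shifter structure of the earlier slides: the majority result is computed in place and then moved one bit position up.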
Optimizations for high performance
• Adopting commodity DRAM:
– 13 cycles for 8-bit CSA
– tRC (46 ns)
• DRAM technology: logic incompatible; simple Boolean logic + serially run
• General purpose: reconfigurable
• High perf.: improve parallelism, unblock data mov., optimize activation
[Figure: log-log plot of Normalized On-chip Mem. Capacity per Area (y-axis, 1.E+00 to 1.E+03) vs. Normalized Peak Perf. per Area (x-axis, 1E+00 to 1E+04), with regions Compute-capable Memory (PIM) and Memory-rich Processor, and two marked points: "un-optimized" and "Target".]
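The two numbers on this slide combine into a rough latency estimate. The sketch below assumes that each of the 13 cycles occupies one full tRC, which the slide implies but does not state; the variable names are illustrative.

```python
T_RC_NS = 46            # row cycle time from the slide, in nanoseconds
CSA_CYCLES_8BIT = 13    # cycles for one 8-bit CSA, from the slide

# Assumption: every cycle costs one full tRC on unmodified commodity DRAM timing.
latency_ns = CSA_CYCLES_8BIT * T_RC_NS

if __name__ == "__main__":
    print(f"{latency_ns} ns per 8-bit CSA")
```

Under that assumption a single 8-bit CSA takes 598 ns, which is consistent with the "un-optimized" point sitting far from the target on the plot.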
Experiment Setup
• DRISA circuit simulator:
– Heavily modified CACTI
– Digital circuit (controller, logic gates)
• From Design Compiler synthesis
• Scaled to DRAM process with 20% perf. overhead and 80% area overhead (ISCAS’99)
• DRISA performance simulator:
– A behavior-level simulator
– Including a mapping optimization framework
[Figure: tool flow with two blocks. Circuit Simulator [DesignCompiler + CACTI-3DD], labeled with device parameter, design options, circuits, latency/cycles, power/ops, area, leakage. Performance Simulator [In-house], labeled with NN topology, mapping scheme, design options, # mat/subarray/bank, speed, power.]
Binary weight, 8-bit activation CNN inference case study
• 3T1C is not good
– The lowest area overhead
– Large memory cells
• 1T1C-adder is not the best
– The best peak performance
– Low effective performance
• 1T1C-mixed is the best solution
[Figure: Perf/Area (fr./s/mm2), log scale from 1E-02 to 1E+02, for AlexNet, vgg-16, vgg-19, resnet-152, and GM, each at 1, 8, and 64. Series: 3T1C, 1T1C-nor, 1T1C-mixed, 1T1C-adder, GPU-INT.]
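A binary-weight layer avoids multiplication, which fits the earlier observation that MUL is affordable only when one operand is very narrow. The sketch below is a hedged illustration, not the mapping used in the paper: it assumes weight bit 1 encodes +1 and weight bit 0 encodes -1, and `binary_weight_dot` is a hypothetical name.

```python
def binary_weight_dot(activations, weight_bits) -> int:
    """Dot product where each weight bit selects +a (bit 1) or -a (bit 0)."""
    total = 0
    for a, w in zip(activations, weight_bits):
        total += a if w == 1 else -a   # select, then accumulate; no multiplier
    return total


if __name__ == "__main__":
    acts = [10, 20, 30, 40]            # stand-ins for 8-bit activations
    bits = [1, 0, 1, 1]                # assumed encoding: 1 -> +1, 0 -> -1
    print(binary_weight_dot(acts, bits))
```

Under this encoding the inner loop needs only a selector and an adder, the two operations built from NOR and SHIFT on the earlier slides.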
More in the paper
• Microarchitectures of BL-logic operations and shifter
• Interface design
• Optimizations for high performance
• Impact of variation
• CNN mapping and optimizations
• Detailed experiment setup and more results
Summary
• In-situ computing: building an accelerator with DRAM technology
– DRAM for large memory capacity
– BL-computing logic design + shifter for general-purpose instructions
– Optimized for high computing performance
• Experiments on binary CNN acceleration:
– Perf. per area: 8.8x vs. ASIC, 7.7x vs. GPU
– Energy efficiency per area: 1.2x vs. ASIC, 15x vs. GPU
[Figure: icons for multi-subarray active, multi-bank active, bitline NOR, and bitline NOR with SHIFT.]
Questions?