
Page 1

A 3D NAND Flash Ready 8-Bit Convolutional Neural Network Core Demonstrated in a Standard Logic Process

M. Kim¹, M. Liu¹, L. Everson¹, G. Park¹, Y. Jeon², S. Kim², S. Lee², S. Song² and C. H. Kim¹

¹Dept. of ECE, University of Minnesota
²Anaflash Inc.

Page 2

Outline

• 3D NAND based Neural Network

• Prototyping in a Standard Logic Process

‒ Architecture, synapse cell, memory array design

• 65nm Test Chip Results

‒ Programming results, MNIST demonstration, retention

• Conclusions

Page 3

Deep Neural Network Complexity


• State-of-the-art deep neural networks: ~150 layers, ~150M parameters, ~100MB memory per image

An Analysis of Deep Neural Network Models for Practical Applications, arXiv, 2017

Page 4

In-Memory Computing and Flash-based Design


• Ultra-high density, reduced data traffic, low cost, mature technology

• Low program/erase speed (but fast read speed), limited endurance cycles (but acceptable for neural network inference, where weights are rewritten infrequently)

[Figure: the memory hierarchy (Flash SSD, e.g. 2TB ↔ DRAM, e.g. 64GB ↔ SRAM, e.g. 22MB) with data moving between levels; a 3D NAND cell and array (source: Toshiba); and the analog multiply-and-accumulate (MAC) principle of in-memory computing (e.g. SRAM, RRAM): inputs X0…Xi applied to weights W0…Wi produce currents that sum into ISum.]

Page 5


H. Maejima, Toshiba, ISSCC 2018

State-of-the-Art 3D NAND Flash Memory

[Figure: 3D NAND block (Block Y): WL<0>…WL<95> stacked between SGS and SGD<0>…SGD<3>, CSL at the source end, 16KB of BLs crossing the blocks; axes X = BL, Y = SGD, Z = WL.]

• (x, y, z) = (BL, SGD, WL)
‒ BL shared across multiple blocks
‒ WL is a shared plane (not a line)
‒ SGD can be individually controlled

Technology: 96-WL-layer
Capacity: 512Gb (3 bit/cell)
Bit density: 5.95Gb/mm²
Organization: (1822 + EXT) blocks / plane
Read throughput (tR): 58µs (ABL: 16KB)
Cost: 4 cents/Gb

Page 6


Analog MAC in 3D NAND Array

• Σ Xi × Wi
‒ Xi: binary input applied to individual SGD lines (no analog voltages)
‒ Wi: multi-level weight (MLC, TLC, QLC)
‒ Σ (accumulate): bitline currents of different blocks summed up
‒ High-resolution MAC can be realized using multiple weight cells, bit-serial operation, and partial-product post-processing (sketched in code below)

[Figure: one 3D NAND block with binary inputs X0…X3 on the SGD lines, WL<0>…WL<95> between SGS and the bitline, and CSL at the source end.]

Pages 7‒9 step through the same slide as an animation: the selected WL is driven to VREAD and the unselected WLs to VPASS, the stored weights w00‒w03 of the selected layer multiply the SGD inputs X0‒X3, and the resulting per-block bitline currents accumulate into the outputs Y0‒Y3 and Z0‒Z3.
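As a rough numerical illustration of the scheme above (a minimal sketch with assumed values, not the chip's analog signal path), the bitline simply sums one current per block, each proportional to the stored weight and gated by the binary SGD input:

```python
# Minimal model of one bitline's analog MAC (illustrative values only).
# Each block contributes the current of its selected cell; the shared
# bitline sums them, computing ISum ~ sum(Xi * Wi).

I_UNIT = 3e-6  # assumed current per weight level (~3uA, per the cell data later)

def bitline_mac(x, w):
    """x: binary inputs on the SGD lines (one per block).
    w: multi-level weights (0..3 for 2 bit/cell) on the selected WL.
    Returns the summed bitline current in amperes."""
    assert all(xi in (0, 1) for xi in x), "inputs are binary, no analog voltages"
    return sum(xi * wi * I_UNIT for xi, wi in zip(x, w))

# Four blocks, inputs X0..X3 = 1,0,1,1 and weights 2,3,1,0 -> (2+1)*3uA = 9uA
print(bitline_mac([1, 0, 1, 1], [2, 3, 1, 0]))  # 9e-06
```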

Page 10

Bit Serial Operation for 8bit x 8bit MAC


[Figure: bit-serial MAC walkthrough. The 8-bit input (LSB…MSB 11000101) is applied one bit per cycle, and the 8-bit weight (e.g. 10111000, stored as a positive 7-bit plus a negative 7-bit magnitude) is read one cell group at a time, for 32 analog MAC cycles in total. Each cycle's analog sum is scaled by 2^(input-bit position + weight-bit position) and accumulated, separately for positive and negative weights: +4 × 2⁰ in cycle 1, +3 × 2² in cycle 2, …, +3 × 2^(7+4) in cycle 31, and the final partial product scaled by 2^(7+6) in cycle 32.]

P. Judd, et al., IEEE CAL, 2017
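The arithmetic behind this scheme can be checked with a short sketch (my reconstruction from the slide, with the 7-bit magnitudes split across 2/2/2/1-bit cells as described on a later slide; on chip, the per-cycle sums are computed in analog):

```python
# Sketch of the bit-serial 8b x 8b MAC. An 8-bit weight is stored as a 7-bit
# positive and a 7-bit negative magnitude, each split across cells holding
# 2+2+2+1 bits. Each cycle applies one input bit against one weight-cell
# group; the partial sum is shifted by 2^(input bit pos + cell bit offset).

CELL_WIDTHS = (2, 2, 2, 1)    # bits per cell for one 7-bit magnitude
CELL_OFFSETS = (0, 2, 4, 6)   # bit position of each cell's LSB

def split_magnitude(mag):
    """Split a 7-bit magnitude into 2/2/2/1-bit cell values."""
    return [(mag >> off) & ((1 << w) - 1)
            for w, off in zip(CELL_WIDTHS, CELL_OFFSETS)]

def bit_serial_mac(x, w):
    """x: 8-bit unsigned input; w: signed weight in [-127, 127]."""
    pos = split_magnitude(w if w > 0 else 0)
    neg = split_magnitude(-w if w < 0 else 0)
    acc = 0
    for b in range(8):                            # 8 input bits, LSB first
        xb = (x >> b) & 1
        for cell, off in enumerate(CELL_OFFSETS): # 4 cell groups -> 32 cycles
            partial = xb * (pos[cell] - neg[cell])  # analog sum per cycle
            acc += partial << (b + off)             # digital shift-and-add
    return acc

assert bit_serial_mac(0b11000101, 93) == 0b11000101 * 93
assert bit_serial_mac(200, -75) == 200 * -75
```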

Page 11

Logic Compatible 3T eFlash Based NAND String


• Cell current proportional to X∙W (= 0µA, 3µA, 6µA, or 9µA)
• BL voltage pinned at 0.8V during read and verify operations

[Figure: 3T eFlash based NAND string under read bias: input X on the SGD, SGS: 2.5V, CSL: 0V, VBL = 0.8V; selected WL at VREAD (1.1V), unselected WLs at VPASS (2.6V). VFG: floating-gate voltage.]

X   W   X∙W   ICELL
0   0   0     0µA
0   1   0     0µA
0   2   0     0µA
0   3   0     0µA
1   0   0     0µA
1   1   1     3µA
1   2   2     6µA
1   3   3     9µA

[Figure: measured cell current (µA) vs. BL voltage (0.0‒1.0V) at 65nm, VDD 1.2V, 25°C, X(SGD) = 2.5V. The curve at VFG = -0.6V (X=1 with W=0, or X=0 with any W) carries 0µA; VFG = -0.290V (X=1, W=1), -0.066V (X=1, W=2) and 0.150V (X=1, W=3) carry 3, 6 and 9µA respectively at VBL = 0.8V.]
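The truth table maps directly to a one-line model (a sketch using the slide's measured 3µA step; real currents vary with VFG as the plot shows):

```python
# Reproduce the X*W truth table: ICELL = X * W * 3uA at VBL = 0.8V.
I_LSB = 3.0  # uA per weight level (from the measured cell characteristics)

def cell_current_ua(x, w):
    """x: binary SGD input (0/1); w: 2-bit stored weight (0..3)."""
    return x * w * I_LSB

for x in (0, 1):
    for w in range(4):
        print(f"X={x} W={w} X*W={x * w} ICELL={cell_current_ua(x, w)}uA")
```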

Page 12


Prototype Design in a Standard Logic Process

• Flattened to 2D while preserving the 3D NAND array architecture
• Unit block size: 4 SGD × 16 WL × 40 BL

[Figure: the 3D block (SGD<0>…SGD<3> × WL<0>…WL<15> × 40 BLs, with SGS and CSL) flattened to 2D: each of BL0…BL39 becomes a column of four NAND strings, one per SGD<3:0>.]

Page 13

Positive and Negative Weight Storage


X   W0   W1   X∙W0   X∙W1   ICELL0   ICELL1   ∆I
1   0    3    0      3      0µA      9µA      -9µA
1   0    2    0      2      0µA      6µA      -6µA
1   0    1    0      1      0µA      3µA      -3µA
1   0    0    0      0      0µA      0µA      0µA
1   1    0    1      0      3µA      0µA      3µA
1   2    0    2      0      6µA      0µA      6µA
1   3    0    3      0      9µA      0µA      9µA

[Figure: positive 2-bit weights (W0) stored on the even BL and negative 2-bit weights (W1) on the odd BL, both held at 0.8V; SGS: 2.5V, CSL: 0V; selected WL at VREAD (1.1V), unselected WLs at VPASS (2.6V). The signed value is sensed as ICELL0 − ICELL1. An 8-bit weight pair consists of a 7-bit positive and a 7-bit negative magnitude, each split across four cells of 2 + 2 + 2 + 1 bits.]
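A short sketch of the differential readout (assumed ideal currents; on silicon the sense circuit subtracts the even- and odd-BL currents):

```python
# Signed weights via an even/odd bitline pair: positive 2-bit weight W0 on
# the even BL, negative 2-bit weight W1 on the odd BL; dI = ICELL0 - ICELL1.
I_LSB = 3.0  # uA per level

def delta_i_ua(x, w0, w1):
    """Differential current for binary input x, 2-bit weights w0 (pos), w1 (neg)."""
    return x * (w0 - w1) * I_LSB

print(delta_i_ua(1, 0, 3))  # -9.0 uA, matches the first table row
print(delta_i_ua(1, 3, 0))  #  9.0 uA, matches the last table row
```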

Page 14

Erase and Program Operations in NAND String


• FN tunneling utilized for erase and program
• Program inhibition of unselected cells via self-boosting

[Figure: bias conditions for erase and program on WL15. Erase: WWL<15> = HV, PWL<15> = 0V, all other WLs 0V, SGD<3:0> = 0V, SGS = 0V, CSL = 0V, BL = 0V (strings off, channel ~0V); FN tunneling through the write device M1, which is sized 8× the read devices M2,3. Program: WWL<15> = HV, PWL<15> = HV, unselected WLs at VPASS, SGS = 0V, CSL = VDD. Cell A<0> (SGD<0> = VDD, BLA = 0V) has its string on and is programmed; cells A<3:1> (SGD<3:1> = 0V) and cell B<0> (BLB = VDD) have their strings off and are program-inhibited via channel self-boosting.]

Page 15

64 Row x 320 Column Core Architecture


• 16-stack eNAND array, high-voltage switches, BL sensing circuits
• Input data loaded onto 25 SGD lines, with 3 SGD lines for bias

[Figure: core block diagram. Eight blocks (Block0…Block7, with Block7 serving as a repair block), each a 16-stack eNAND array with SGDn<3:0> and SGSn, and per-WL write/program word lines WWLn0‒WWLn15 / PWLn0‒PWLn15 driven by high-voltage switches (HVS) from VPP/VPASS. Pixel-IN, WL and block SCAN chains flank the array; a BL SCAN & DRIVER with Control1/Control2 sits below. BL0…BL39 pass through an analog MUX to the BL current verify/sensing circuit built from core devices: a BL voltage regulator (VREG) pins the bitline while a VCO and frequency divider digitize the output. Inset: a 16-stack eNAND cell string, each cell a 3T device (write transistor M1 on WWL/PWL, floating gate FG, read transistors M2/M3) between select devices S1 (BL/SGD side) and S2 (CSL/SGS side).]
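One consistent way to index the array (a hypothetical mapping; the slide does not give the actual scan ordering) is to split the 64 rows into SGD and WL fields and the 320 columns into block and BL fields:

```python
# Hypothetical address split for the 64-row x 320-column synapse core:
# 64 rows = 4 SGD lines x 16 WL layers; 320 columns = 8 blocks x 40 BLs.

def synapse_address(row, col):
    assert 0 <= row < 64 and 0 <= col < 320
    sgd, wl = divmod(row, 16)    # which SGD line, which stack layer
    block, bl = divmod(col, 40)  # which eNAND block, which bitline
    return {"block": block, "sgd": sgd, "wl": wl, "bl": bl}

print(synapse_address(17, 45))  # {'block': 1, 'sgd': 1, 'wl': 1, 'bl': 5}
```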

Page 16

65nm Die Photo and Feature Summary


[Die photo: 16-stack eNAND based synapse array.]

Technology: 65nm logic
Core size: 1100 × 600 µm²
VDD (core / IO): 1.2V / 2.5V
# of synapses: 20K (= 64 × 320)
# of 8-bit weights: 2,560
Power (w/o VCO): 4.95µW per bitline
Throughput: 0.5G pixels/s per core (tREAD: 50ns)

Page 17

Program Characteristics vs. NAND String Location


• Different Vgs and Vds depending on the stack location
• Requires a different Vth for the same cell current
• Longer programming time for cells closer to the bottom of the string
• Cell current variation less than 0.6µA after program-verify operations

[Figure: cell current (µA) vs. fine program pulse count at 25°C, VPASS 2.6V, VREAD 1.1V; bottom cells (WL0-3) need more pulses than top cells (WL12-15) to reach the Weight1/Weight2/Weight3 targets of 3/6/9µA. Inset: NAND string biased with SGD = 2.5V, SGS = 2.5V, CSL = 0V, BL = 0.8V; selected WL at VREAD, unselected WLs at VPASS.]
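The program-verify flow this slide alludes to might look like the following loop (a sketch; the pulse-response model is invented for illustration, and the 0.6µA window comes from the bullet above):

```python
# Fine program-verify: pulse until the cell current lands within 0.6uA of its
# target level (3, 6, or 9uA). Erased cells start high and program downward.
import random

def program_verify(target_ua, read_ua, pulse, tol=0.6, max_pulses=50):
    """Pulse-and-verify until the cell current is within tol of target."""
    for n in range(1, max_pulses + 1):
        pulse()
        if abs(read_ua() - target_ua) <= tol:
            return n  # pulses used; bottom-of-string cells need more
    raise RuntimeError("no convergence: candidate for the repair block")

# Toy cell model: starts erased (~12uA); each fine pulse removes ~0.3uA.
cell_i = [12.0]
def read_ua(): return cell_i[0]
def pulse(): cell_i[0] -= random.uniform(0.25, 0.35)

print(program_verify(9.0, read_ua, pulse))  # a handful of pulses to hit 9uA
```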

Page 18

LeNet-5 CNN Demonstration

• 5×5 8-bit convolutions performed on chip
• ADC, pooling, and bit-serial operation performed off chip


[Figure: LeNet-5 pipeline. INPUT 28×28 (8 bit) → 5×5 convolutions → feature maps 6@24×24 → 2×2 max pooling → feature maps 6@12×12 → 5×5 convolutions → feature maps 16@8×8 → 2×2 max pooling → feature maps 16@4×4 → full connection → layer 120 → full connection → layer 84 → Gaussian connections → OUTPUT 10 (final results, digits 0‒9). The first convolution runs on the eNAND array as 6 filters (12 BLs) and the second as 96 (6×16) filters (192 BLs); one eNAND filter uses 2 BLs (pos. BL + neg. BL). The convolutions are on-chip; pooling and the remaining layers are off-chip.]
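A compact sketch of the on-chip / off-chip partition (random stand-in weights, and exact integer math in place of the analog bit-serial readout):

```python
# LeNet-5 front end as partitioned in the demo: 5x5 8-bit convolutions on the
# eNAND array (one filter per pos/neg BL pair), pooling and later layers off chip.
import numpy as np

def enand_conv5x5(img, filt):
    """Valid 5x5 convolution with 8-bit inputs and signed 8-bit weights,
    computed on chip (modeled here as exact integer MACs)."""
    h, w = img.shape
    out = np.zeros((h - 4, w - 4), dtype=np.int64)
    for r in range(h - 4):
        for c in range(w - 4):
            out[r, c] = np.sum(img[r:r + 5, c:c + 5] * filt)
    return out

def offchip_maxpool2x2(fm):
    """2x2 max pooling, performed off chip."""
    h, w = fm.shape
    return fm[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.random.randint(0, 256, (28, 28))                           # 8-bit input
filters = [np.random.randint(-127, 128, (5, 5)) for _ in range(6)]  # stand-ins
fmaps = [offchip_maxpool2x2(enand_conv5x5(img, f)) for f in filters]
print(len(fmaps), fmaps[0].shape)  # 6 feature maps @ 12x12
```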

Page 19

MNIST Digit Recognition Accuracy Results

• Recognition accuracy is 98.5%, close to the software model's 99.0% accuracy


[Figure: per-digit histograms of the recognized output index (count on a log scale, 1‒100) for input digits 0‒9.]

Page 20

Retention Test Results


• 1K erase and program cycles before baking test

• Excellent retention characteristics → additional weight levels possible (e.g. 3 bits per cell)

[Figure: cell current (µA) vs. bake time (hours, 10⁰‒10³) at Tbake = 150°C after 1K pre-cycles; the four levels Weight0‒Weight3 (0/3/6/9µA) stay close to their before-baking values and remain clearly separated.]

Page 21

Comparison Table

[1] X. Guo, et al., IEDM, 2017.
[2] W. Chen, et al., ISSCC, 2018.
[3] M. Kim, et al., IEDM, 2018.
[4] C. Xue, et al., ISSCC, 2019.
[5] X. Si, et al., ISSCC, 2019.


                       IEDM 17 [1]   ISSCC 18 [2]   IEDM 18 [3]   ISSCC 19 [4]   ISSCC 19 [5]   This work
Technology             180nm         65nm           65nm          55nm           55nm           65nm
Voltage                2.7V          1.0V           1.0V          1.0V           1.0V           1.2V
Cell type              NOR           NOR            NOR           NOR            NOR            NAND
Non-volatile?          YES (eFlash)  YES (ReRAM)    YES (eFlash)  YES (ReRAM)    NO (SRAM)      YES (eFlash)
Logic compatible?      NO            NO             YES           NO             YES            YES
Input resolution       1 Bit         1 Bit          3 Bits        2 Bits         2 Bits         8 Bits
Weight resolution      2 Bits        3 Bits         2.3 Bits      3 Bits         5 Bits         8 Bits
# of currents summed   4 Cells       8 Cells        68 Cells      14 Cells       32 Cells       28 Cells
Program-verify?        NO            NO             YES           NO             NO             YES
Neural net arch.       MLP           CNN            MLP           CNN            CNN            CNN

Page 22

Conclusions

• 3D NAND Flash ready 8bit × 8bit convolutional neural network core demonstrated in a 65nm standard logic process
‒ 16 stack NAND string
‒ 2 bit per cell weight storage
‒ Bit serial operation
‒ Back-pattern tolerant program verify (details in paper)

• LeNet-5 two-layer CNN test results show 98.5% digit recognition accuracy
