parallel stateful logic in rram: theoretical analysis and

24
Parallel Stateful Logic in RRAM: Theoretical Analysis and Arithmetic Design Feng Wang, Guojie Luo , Guangyu Sun, Jiaxi Zhang, Peng Huang, Jinfeng Kang [email protected] 2019/07/16

Upload: others

Post on 15-Oct-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Parallel Stateful Logic in RRAM: Theoretical Analysis and Arithmetic Design

Feng Wang, Guojie Luo, Guangyu Sun, Jiaxi Zhang, Peng Huang, Jinfeng Kang

[email protected]

2019/07/16

Page 2: Parallel Stateful Logic in RRAM: Theoretical Analysis and

OUTLINE

Background

Theoretical Analysis

Integer Addition

Extension

Experimental Evaluation

Conclusion

Page 3: Parallel Stateful Logic in RRAM: Theoretical Analysis and

RRAM Resistance for Representing Logic Values

[1] Akinaga, H., & Shima, H. (2010). Resistive random access memory (ReRAM) based on metal oxides. Proceedings of the IEEE.

3

HRS (logical 0)

LRS(logical 1)

SET (π‘ˆπ‘ŠπΏ βˆ’ π‘ˆπ΅πΏ > 𝑉𝑆𝐸𝑇 )

RESET (π‘ˆπ΅πΏ βˆ’ π‘ˆπ‘ŠπΏ > 𝑉𝑅𝐸𝑆𝐸𝑇 )

metal

metal

insulator

WL

BL

Page 4: Parallel Stateful Logic in RRAM: Theoretical Analysis and

RRAM State Switching as Primitive Logic Operations

[2] Kvatinsky, S., Member, S., Belousov, D., Liman, S., Satat, G., Member, S., … Weiser, U. C. (2014). MAGIC β€” Memristor-Aided Logic. TCAS-II.

4

MAGIC NOR implementation 𝑍 = NOR(𝑋, π‘Œ)

– 𝑉𝐺 > 2 β‹… 𝑉𝑅𝐸𝑆𝐸𝑇

Parallel execution over rows and columns

– WL parallelism: π‘…π‘–π‘š = NOR 𝑅𝑖1, 𝑅𝑖2 𝑖 ∈ [1,π‘š]

– BL parallelism: π‘…π‘šπ‘– = NOR(𝑅1𝑖 , 𝑅2𝑖) 𝑖 ∈ [1, π‘š]

VG VG GND R1

R1

R1

R1 R1

R1

R1

R1

R1

BL voltage controller

WL1

WL2

WLm

BL1 BL2 BLm

R11 R12 R1m

R21 R22 R2m

Rm1 Rm2 Rmm

RRAM cell

Page 5: Parallel Stateful Logic in RRAM: Theoretical Analysis and

RRAM-based Stateful Logic Families

[3] Borghetti, J., Snider, G. S., Kuekes, P. J., Yang, J. J., Stewart, D. R., & Williams, R. S. (2010). β€œMemristive” switches enable β€œstateful” logic operations via

material implication. Nature.

[4] Huang, P., Kang, J., Zhao, Y., Chen, S., Han, R., Zhou, Z., … Liu, X. (2016). Reconfigurable Nonvolatile Logic Operations in Resistance Switching Crossbar

Array for Large-Scale Circuits. Advanced Materials.

[5] Avati, V., Eggert, K., & Taylor, C. (2018). FELIX: Fast and Energy-Efficient Logic in Memory. In ICCAD.

[6] Xu, L., Bao, L., Zhang, T., Yang, K., Cai, Y., Yang, Y., & Huang, R. (2018). Nonvolatile memristor as a new platform for non-von Neumann computing. In

ICSICT.

5

Support parallel execution

Functionally complete

Work Stateful logic operations

[3] IMP

[2] NOR, NOT

[4] NAND, NIMP

[5] NOR, NAND, Min, OR

[6] NOR, NOT, NAND, NIMP, XOR

Page 6: Parallel Stateful Logic in RRAM: Theoretical Analysis and

OUTLINE

Background

Theoretical Analysis

Integer Addition

Extension

Experimental Evaluation

Conclusion

Page 7: Parallel Stateful Logic in RRAM: Theoretical Analysis and

SIML Computation Model7

All of the stateful logic families satisfy the four assumptions

– The latency of a single stateful logic operation is identical.

– The input number of a single operation cannot exceed a constant.

– The latency of WL and BL operations is identical.

– The degree of parallelism can reach the crossbar size. The crossbar size scales with the problem size.

Page 8: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Lower Bounds of the Time Complexity (1)8

(a) Theorem 1 (b) Theorem 2

Condition bitwise functions most arithmetic functions

Parallelismupper bound

max(𝑀, β„Ž) 𝑂𝑇

𝑀 + β„Ž

Time complexity lower bound

trivial bound: 𝑂𝑇

max(𝑀,β„Ž)shape bound: 𝑂(𝑀 + β„Ž)

Example function π‘Œπ‘– = 𝑋𝑖1 NOR 𝑋𝑖2 (𝑖 = 1, 2,… , 𝑛) π‘Œπ‘– = NORπ‘˜=1𝑖 π‘‹π‘˜1 NOR 𝑋𝑖2 (𝑖 = 1, 2, … , 𝑛)

Example algorithm (netlist)

Example layout in RRAM

𝑋11 𝑋21 … 𝑋𝑛1

𝑋12 𝑋22 … 𝑋𝑛2

π‘Œ1 π‘Œ2 … π‘Œπ‘›

𝑋11 𝑋21 … 𝑋𝑛1

𝑋12 𝑋22 … 𝑋𝑛2

π‘Œ1 π‘Œ2 … π‘Œπ‘›

𝑋11

𝑋12

π‘Œ1 … 𝑋𝑛1

𝑋𝑛2

π‘Œπ‘›

#1

#1 #1 #1

#n-1

#n #n #n

𝑋11

𝑋21 𝑋𝑛1

…

𝑋𝑛2

π‘Œπ‘›

𝑂(𝑇): total cycles in the series implementation, 𝑀: width of layout (# of BLs), β„Ž: height of layout (# of WLs), π‘™π‘’π‘›π‘šπ‘Žπ‘₯: length of the critical path.

Page 9: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Lower Bounds of the Time Complexity (2)9

(c) Corollary 1 (d) Theorem 3

Condition square layout a given algorithm

Parallelismupper bound

𝑂(𝑇

π‘€β„Ž) 𝑂

𝑇

π‘™π‘’π‘›π‘šπ‘Žπ‘₯

Time complexity lower bound

function bound: 𝑂( π‘€β„Ž) algorithm bound: π‘™π‘’π‘›π‘šπ‘Žπ‘₯

Example function β€” π‘Œ = NORπ‘˜=1𝑛 π‘‹π‘˜1

Example algorithm (netlist)

β€”

Example layout in RRAM

β€”

𝑂(𝑇): total cycles in the series implementation, 𝑀: width of layout (# of BLs), β„Ž: height of layout (# of WLs), π‘™π‘’π‘›π‘šπ‘Žπ‘₯: length of the critical path.

𝑋1 𝑋2 … 𝑋𝑛 π‘Œπ‘›

𝑋1

𝑋2 𝑋𝑛

π‘Œβ€¦

#1 #n-1

Page 10: Parallel Stateful Logic in RRAM: Theoretical Analysis and

OUTLINE

Background

Theoretical Analysis

Integer Addition

Extension

Experimental Evaluation

Conclusion

Page 11: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Ripple Carry Adder11

One-bit full adder

Pulse Logic operation

1 𝑇1 = NOR(𝐴, 𝐡)

2 𝑇2 = NOR(𝐴, 𝑇1)

3 𝑇3 = NOR 𝐡, 𝑇1

4 𝑇4 = NOR(𝑇1, 𝑇2)

5 𝑇5 = NOR(𝑇4, 𝐢𝑖)

6 πΆπ‘œ = NOR(𝑇1, 𝑇5)

7 𝑇6 = NOR(𝑇4, 𝑇5)

8 𝑇7 = NOR(𝑇5, 𝐢𝑖)

9 𝑆 = NOR(𝑇5, 𝑇7)

π‘¨πŸπŸ π‘¨πŸπŸ … π‘¨πŸπ’

π‘¨πŸπŸ π‘¨πŸπŸ … π‘¨πŸπ’

𝑇11 𝑇12 … 𝑇1𝑛

… … … …

𝑇41 𝑇42 … 𝑇4𝑛

𝑇51 𝑇52 … 𝑇5𝑛

𝐢𝑖1 πΆπ‘œ2 … πΆπ‘œπ‘›

πΆπ‘œ1 πΆπ‘œ1(𝐢𝑖2) … πΆπ‘œ π‘›βˆ’1 (𝐢𝑖𝑛)

𝑇61 𝑇62 … 𝑇6𝑛

… … … …

π‘ΊπŸ π‘ΊπŸ … 𝑺𝒏

β‘‘

β‘’

β‘ 

β‘ β‘’ add 𝑛 BLs in parallel β‘‘ generate and propagate carries in series

time complexity: 𝑂(𝑛)shape / algorithm lower bound

Page 12: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Carry Select Adder12

π‘¨πŸπŸ … π‘¨πŸ 𝒏 π‘¨πŸπŸ … π‘¨πŸ 𝒏 0 𝑆1 … 𝑆 𝑛

π‘¨πŸ( 𝒏+𝟏) … π‘¨πŸ(𝟐 𝒏) π‘¨πŸ( 𝒏+𝟏) … π‘¨πŸ(𝟐 𝒏) 0 𝑆 𝑛+1 … 𝑆2 𝑛

… … … … … … … … … …

π‘¨πŸ(π’βˆ’ 𝒏+𝟏) … π‘¨πŸπ’ π‘¨πŸ(π’βˆ’ 𝒏+𝟏) … π‘¨πŸπ’ 0 π‘†π‘›βˆ’ 𝑛+1 … 𝑆𝑛′

𝐴1( 𝑛+1)β€² … 𝐴1(2 𝑛)β€² 𝐴2( 𝑛+1)β€² … 𝐴2(2 𝑛)β€² 1 𝑆 𝑛+1β€² … 𝑆2 𝑛′

… … … … … … … … … …

𝐴1(π‘›βˆ’ 𝑛+1)β€² … 𝐴1𝑛′ 𝐴2(π‘›βˆ’ 𝑛+1)β€² … 𝐴2𝑛′ 1 π‘†π‘›βˆ’ 𝑛+1β€² … 𝑆𝑛′

… … … … … … π‘ΊπŸ … 𝑺 𝒏

… … … … … … 𝑺 𝒏+𝟏 … π‘ΊπŸ 𝒏

… … … … … … … … …

… … … … … … π‘Ίπ’βˆ’ 𝒏+𝟏 … 𝑺𝒏

β‘ 

β‘’

β‘‘

β‘  copy the input β‘‘ add 2 𝑛 βˆ’ 1 WLs in parallel β‘’ select the correct result

time complexity: 𝑂( 𝑛)function / algorithm lower bound

Page 13: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Carry Save Adder13

π‘¨πŸπŸ π‘¨πŸπŸ … π‘¨πŸπ’

π‘¨πŸπŸ π‘¨πŸπŸ … π‘¨πŸπ’

π‘¨πŸ‘πŸ π‘¨πŸ‘πŸ … π‘¨πŸ‘π’

… … … …

πΆπ‘œ1 πΆπ‘œ2 … πΆπ‘œπ‘›

𝑃𝑆1 𝑃𝑆2 … 𝑃𝑆𝑛

0 πΆπ‘œ1 … πΆπ‘œ(π‘›βˆ’1)

𝑃𝑆1 𝑃𝑆2 … 𝑃𝑆𝑛

… … … …

π‘ΊπŸ π‘ΊπŸ … 𝑺𝒏

shift πΆπ‘œ1, …,πΆπ‘œπ‘› right

β‘ 

β‘’

β‘  add 𝑛 BLs in parallel β‘’ add the last two addends πΆπ‘œ, 𝑃𝑆

β‘‘

time complexity: 𝑂( 𝑀)function lower bound

π‘Ž1 … π‘Ž 𝑀 𝐴1

π‘Ž 𝑀+1 … π‘Ž2 𝑀 𝐴2

… … … …

π‘Žπ‘€βˆ’ 𝑀+1 … π‘Žπ‘€ 𝐴 𝑀

… … … 𝒂

β‘  accumulate each WL in parallelβ‘‘ accumulate the sums in the same BLs

β‘ β‘‘

Page 14: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Adder Summary14

Function two 𝒏-bit integer addition 𝑴 integer addition

Algorithm ripple carry adder carry select adder carry select adder

Layout 𝑂 𝑛 Γ— 𝑂(1) 𝑂 𝑛 Γ— 𝑂( 𝑛) 𝑂 𝑀 Γ— 𝑂( 𝑀)

Time complexity 𝑂(𝑛) 𝑂( 𝑛) 𝑂( 𝑀)

Lower bound type shape / algorithm bound function / algorithm bound function bound

Page 15: Parallel Stateful Logic in RRAM: Theoretical Analysis and

OUTLINE

Background

Theoretical Analysis

Integer Addition

Extension

Experimental Evaluation

Conclusion

Page 16: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Dot Product16

π’‚πŸπŸ … π’‚πŸ 𝑴 π’‚πŸπŸ … π’‚πŸ 𝑴 𝑃1 … 𝑃 𝑀 𝑆1

π’‚πŸ( 𝑴+𝟏) … π’‚πŸ(𝟐 𝑴) π’‚πŸ( 𝑴+𝟏) … π’‚πŸ(𝟐 𝑴) 𝑃 𝑀+1 … 𝑃2 𝑀 𝑆2

… … … … … … … … … …

π’‚πŸ(π‘΄βˆ’ 𝑴+𝟏) … π’‚πŸπ‘΄ π’‚πŸ(π‘΄βˆ’ 𝑴+𝟏) … π’‚πŸπ‘΄ π‘ƒπ‘€βˆ’ 𝑀+1 … 𝑃𝑀 𝑆 𝑀

… … … … … … … … … …

… … … … … … … … … …

… … … … … … … … … πΆπ‘œ

… … … … … … … … … 𝑃𝑆

… … … … … … … … … …

… … … … … … … … … …

… … … … … … … … … π‘¨πŸ β‹… π‘¨πŸ

β‘ 

β‘’

β‘  element-wise multiplication β‘‘ accumulate 𝑀 WLs in parallel β‘’ accumulate π‘†π‘˜s using the carry save adder

β‘‘

time complexity: 𝑂( 𝑀)function lower bound

Page 17: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Real Data Type Comparison

[7] KΓΆster, U., Webb, T. J., Wang, X., Nassar, M., Bansal, A. K., Constable, W. H., … Rao, N. (2017). Flexpoint: An Adaptive Numerical Format for Efficient Training of

Deep Neural Networks. arXiv.

17

Β·Β·Β·

Β·Β·Β·

Β·Β·Β·Β·Β·Β·

Β·Β·Β·

Β·Β·Β·

Β·Β·Β· Β·Β·Β·

Β·Β·Β·

fixed-point flex-point

float-point

exponent mantissa

Page 18: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Flex-point Support18

π’‚πŸπŸ π’‚πŸπŸ … π‘Ž31

π’‚πŸπŸ π’‚πŸπŸ … π‘Ž32

… … … …

π’‚πŸπ’ π’‚πŸπ’ … π‘Ž3𝑛

Co

ntr

olle

rπ’†π’™π’‘πŸ π’†π’™π’‘πŸ … 𝑒π‘₯𝑝3

β‘  input exponent

β‘‘ arithmetic instruction

β‘’ output status

β‘£ shift instruction

β‘€ output exponent

Crossbar

Page 19: Parallel Stateful Logic in RRAM: Theoretical Analysis and

OUTLINE

Background

Theoretical Analysis

Integer Addition

Extension

Experimental Evaluation

Conclusion

Page 20: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Integer Algorithm Evaluation20

0

200

400

600

800

8 16 24 32 40 48 56 64

cycl

e

bit width n

MAGIC-adder [8]

ripple carry

carry select

100

1000

10000

100000

1000000

16 32 64 128 256 512 10242048

cycl

e

integer number M

MAGIC-adder [8]

APIM [9]

carry save

0

20000

40000

60000

80000

16 32 64 128 256 512 1024 2048

max

imal

M

crossbar size

APIM [9]

carry save

0

1000

2000

3000

4000

64 128 256 512

cycl

e

vector dimension M

IMAGING [10] Proposed

[8] Talati, N., Gupta, S., Mane, P., & Kvatinsky, S. (2016). Logic design within memristive memories using memristor-aided loGIC (MAGIC). TNANO.

[9] Imani, M., Gupta, S., & Rosing, T. (2017). Ultra-Efficient Processing In-Memory for Data Intensive Applications. In DAC.

Page 21: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Flex-support Evaluation21

0

2

4

6

8

10

100 400 700 1000

Late

ncy

(s)

Matrix size M

RRAM ARM

0

1

2

3

4

100 400 700 1000

Ener

gy /

MΒ³

(nJ)

Matrix size M

RRAM ARM

Page 22: Parallel Stateful Logic in RRAM: Theoretical Analysis and

OUTLINE

Background

Theoretical Analysis

Integer Addition

Extension

Experimental Evaluation

Conclusion

Page 23: Parallel Stateful Logic in RRAM: Theoretical Analysis and

Conclusion23

Theoretical analysis

– SIML model

– time complexity lower bound

Integer addition

– ripple carry adder

– carry select adder

– carry save adder

Extension

– multiplication support

– flex-point support

Page 24: Parallel Stateful Logic in RRAM: Theoretical Analysis and

&A