
FPT 2006 Bangkok

A Novel Memory Architecture for Elliptic Curve Cryptography with Parallel Modular Multipliers

Ralf Laue, Sorin A. Huss
Integrated Circuits and Systems Lab, Computer Science Dept.
Technische Universität Darmstadt, Germany
{laue|huss}@iss.tu-darmstadt.de

December 14th, 2006
FPT 2006, Bangkok

Introduction

• Speed-up of today's hardware stems increasingly from parallelization.

• Cryptographic implementations should take advantage of this by using parallel algorithm versions.

• We begin with a survey of parallelization on different abstraction levels of public key cryptography.

• Then, we present a novel parallel memory architecture for elliptic curve cryptography in GF(P).
  – Allows the execution time to scale with the number of parallel modular multipliers.
  – Direct memory connection leads to low resource usage.

Overview

• Parallelization on Different Abstraction Levels

• Novel Memory Architecture
  – Design Considerations
  – Proposed Memory Architecture

• Experimental Results
  – Number of Parallel Multipliers
  – Prototype Implementation
  – Application to Another EC Arithmetic Algorithm

Parallelization on Different Abstraction Levels

• In general, parallelization yields greater benefit on lower levels (as less control logic needs to be duplicated)

• Parallelization on higher levels allows further speed-up and offers advantages not available on lower levels.

• Parallelization methods on different levels do not exclude each other.

[Figure: abstraction levels, from lowest to highest: Finite Field (Modular Arithmetic); Elliptic Curve Group (Point Addition and Doubling); Discrete Logarithm/Integer Factorization (Point Multiplication/Exponentiation); Cryptographic Scheme; System. Columns: RSA and ECC/HECC.]

Parallelization on Finite Field Level

• Modular multi-word multiplication is the most critical operation. Thus, parallelization on this level is a popular strategy.

• The approaches on this level do not exclude each other.

• Data paths of full bit-width:
  – Allow for linear time complexity at the cost of a proportional increase in resources (e.g. systolic array).
  – Usual bit-widths: ECC: >100 bit, RSA: >1000 bit
  – Problem: The design targets the maximum bit-width. For smaller word counts resources stay unused; higher word counts may be infeasible.

Parallelization on Finite Field Level (cont.)

• Pipelining
  – Allows for linear time complexity, too.
  – More flexible than buses of full bit-width, because the number of pipeline stages may be chosen freely.
  – Problem: the calculated bit-width always corresponds to a multiple of the number of stages in words.
    • Resources may still stay unused.

• An ECC/RSA combination allows only for pipeline lengths designed for ECC, as those designed for RSA would waste resources and execution time if used with ECC.
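The padding effect described on this slide can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a 32-bit word size and hypothetical stage counts; neither value is taken from the slides, and the function names are made up for illustration.

```python
import math

WORD_BITS = 32  # assumed word size, not taken from the slides


def padded_words(operand_bits: int, stages: int) -> int:
    """Words actually processed when the width must be a multiple of `stages` words."""
    words = math.ceil(operand_bits / WORD_BITS)
    return math.ceil(words / stages) * stages


def utilization(operand_bits: int, stages: int) -> float:
    """Fraction of the processed words that carry operand data."""
    words = math.ceil(operand_bits / WORD_BITS)
    return words / padded_words(operand_bits, stages)


# Hypothetical cases: a short pipeline for ECC, and an RSA-sized pipeline
# used for an ECC operand and for an RSA operand.
for bits, stages in [(160, 4), (160, 32), (1024, 32)]:
    print(bits, stages, padded_words(bits, stages), utilization(bits, stages))
```

Under these assumptions, a 160-bit operand on a pipeline sized for 1024-bit operands keeps most stages idle, which is the waste the last bullet refers to.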

Parallelization on Finite Field Level (cont.)

• Karatsuba multiplication:
  – Multiplying two numbers with two words each can be done with three word multiplications.
  – Recursion leads to approx. O(n^1.585).
  – As recursion is difficult in hardware, this is usually used for multiplications in full bit-width (requires fewer resources).

• Residue Number Systems:
  – Long numbers are represented relative to a base consisting of multiple smaller moduli that are relatively prime to each other. The Chinese Remainder Theorem ensures a unique mapping.
  – Multiplication, addition and subtraction may be executed in parallel.
  – Can be interpreted as a special case of buses of full bit-width.
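The residue number system idea can be sketched in a few lines. The moduli below are chosen only for illustration (they are pairwise coprime, but do not come from the slides), and the helper names are hypothetical.

```python
from math import prod

MODULI = (13, 15, 16, 17)  # pairwise coprime, illustrative only


def to_rns(x):
    """Represent x by its residues with respect to each modulus."""
    return tuple(x % m for m in MODULI)


def rns_mul(a, b):
    """Multiply channel by channel; the channels are independent of each other,
    so hardware could process all of them in parallel."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


def from_rns(residues):
    """Recover the integer via the Chinese Remainder Theorem."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m


product = from_rns(rns_mul(to_rns(123), to_rns(321)))
print(product, 123 * 321)
```

The result is only unique below the product of the moduli (53040 here), so the base must be large enough for the largest intermediate value.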

$$
\begin{aligned}
(x_1 2^b + x_0)(y_1 2^b + y_0) &= x_1 y_1\, 2^{2b} + [x_1 y_0 + x_0 y_1]\, 2^b + x_0 y_0 \\
&= x_1 y_1\, 2^{2b} + [(x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0]\, 2^b + x_0 y_0
\end{aligned}
$$
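The two-word Karatsuba step from this slide can be checked with a short sketch. The function name and the choice of splitting at `b` bits are illustrative assumptions; the code uses the common additive form of the middle term.

```python
def karatsuba_two_word(x, y, b):
    """Multiply two non-negative integers split into two b-bit words each,
    using three word multiplications instead of four."""
    mask = (1 << b) - 1
    x1, x0 = x >> b, x & mask
    y1, y0 = y >> b, y & mask

    high = x1 * y1                               # word multiplication 1
    low = x0 * y0                                # word multiplication 2
    middle = (x1 + x0) * (y1 + y0) - high - low  # word multiplication 3

    return (high << (2 * b)) + (middle << b) + low


print(hex(karatsuba_two_word(0xDEADBEEF, 0x12345678, 16)))
```

Note that the sums `x1 + x0` and `y1 + y0` can be one bit wider than a word, which a hardware multiplier would have to accommodate.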

Parallelization on Elliptic Group Level

• EC doubling and addition may be sped up by using multiple modular units in parallel.

• Literature suggests a maximum of two or three modular multipliers (data dependencies limit further improvements).

• One instance of the remaining modular arithmetic is sufficient, because it is very fast in comparison.

• This abstraction level is well-suited for parallelization in SIMD implementations.

• Note that this level does not exist for RSA.

Parallelization on Discrete Logarithm/Integer Factorization Level

• Both point multiplication and exponentiation allow parallel use of two instances of group operations.
  – E.g. with the Montgomery Ladder (parallel point doubling/addition for ECC; parallel square/multiply for RSA).

• Parallelization on this abstraction level is (in addition to further speed-ups) often used as a countermeasure against side channel attacks.
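To show where the two parallel group operations come from, here is a minimal software sketch of the Montgomery ladder for modular exponentiation (the RSA-style case). It runs sequentially; the function name is hypothetical, and the comments mark the two products per step that are independent of each other.

```python
def montgomery_ladder_pow(base, exponent, modulus):
    """Compute base**exponent % modulus with the Montgomery ladder.

    Invariant after every step: r1 == r0 * base (mod modulus).
    """
    r0, r1 = 1 % modulus, base % modulus
    for bit in bin(exponent)[2:]:  # most significant bit first
        # Both products in a branch read only the old r0 and r1, so two
        # multiplier instances could compute them at the same time.
        if bit == "1":
            r0, r1 = (r0 * r1) % modulus, (r1 * r1) % modulus
        else:
            r0, r1 = (r0 * r0) % modulus, (r0 * r1) % modulus
    return r0


print(montgomery_ladder_pow(7, 65537, 1000003) == pow(7, 65537, 1000003))
```

Every step performs one multiplication and one squaring regardless of the exponent bit, which is the regularity that makes the ladder attractive against simple side channel analysis.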



Parallelization on Cryptographic Primitive/System Level

• Cryptographic schemes usually use only one point multiplication/exponentiation.
  – We know of no proposal for parallelization on this level.

• Possible scenario: Flexible coprocessor for RSA/ECC
  – Parallelization on lower abstraction levels is only possible to a certain degree if unused resources are to be avoided.
  – Further parallelization may be done on the level of the cryptographic primitive to increase throughput.





Design Goals

• ECC implementation for GF(P) on FPGAs.
• Ability to support different key lengths.
• Resource requirements should be relatively low, thus allowing integration of further functions on the FPGA.
  – E.g. other cryptographic modules or functions unrelated to cryptography.

• Thus, minimum execution time was less important than a high utilization of the allocated resources.

Design Decisions

• No parallelization on finite field level
  – Would lead to unused resources, at least for some key lengths.

• Instead, parallelization on elliptic group level
  – Depends on data dependencies, independent of key length.

• Modular multiplication is more complex and time consuming than the remaining modular operations.
  – The chosen architecture consists of multiple modular multipliers in parallel to each other, with the module for the remaining modular arithmetic in parallel to the multipliers.

Conventional Memory Architecture

• Memory architecture must allow all operations to be continuously supplied with data.

• Conventional memory architecture consists of one memory and modules with input and output registers.

• Registers take up FPGA resources, but contain only redundant data copied from memory.

[Figure: conventional architecture with one RAM and the modules Mult 1 ... Mult n, ALU, ..., Square.]

Novel Memory Architecture

• Each modular multiplier is assigned its own memory block via a direct connection.
  – Supports continuous data supply.
  – Low general resource usage, slightly increased memory usage.

• Remaining modular arithmetic may access memory blocks via the second port.

• Execution time scales with the number of modular multipliers.
• Modular arithmetic copies data between local memory blocks, as multipliers can only access "their" memory block.
  – Does not hinder scalability, as the remaining modular arithmetic can access all memory blocks simultaneously in parallel.
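The access rules on this slide can be mimicked in a toy software model. This is only a behavioral sketch under assumptions: the class names, the dictionary-based memory blocks, and the small modulus are all hypothetical, and timing and port arbitration are not modeled.

```python
class ModMult:
    """Modular multiplier wired to exactly one memory block (hypothetical model)."""

    def __init__(self, block, modulus):
        self.block = block      # its own memory block, modeled as a dict
        self.modulus = modulus

    def multiply(self, dst, a, b):
        # Operands and result live in the multiplier's own block only.
        self.block[dst] = (self.block[a] * self.block[b]) % self.modulus


class ModArith:
    """Remaining modular arithmetic; reaches every block via the second port."""

    def __init__(self, blocks, modulus):
        self.blocks = blocks
        self.modulus = modulus

    def copy(self, src_block, src, dst_block, dst):
        self.blocks[dst_block][dst] = self.blocks[src_block][src]

    def add(self, block, dst, a, b):
        mem = self.blocks[block]
        mem[dst] = (mem[a] + mem[b]) % self.modulus


p = 97  # small illustrative prime
blocks = [{"x": 5, "y": 7}, {"u": 11}]
mult_a, mult_b = ModMult(blocks[0], p), ModMult(blocks[1], p)
arith = ModArith(blocks, p)

mult_a.multiply("t", "x", "y")   # t = x*y, stored in block 0
arith.copy(0, "t", 1, "t")       # ModArith moves t into block 1
mult_b.multiply("v", "t", "u")   # second multiplier continues in block 1
print(blocks[1]["v"])
```

The model makes the design trade-off visible: multipliers need no operand registers because they work directly on their block, at the price of occasional copies performed by the modular arithmetic unit.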

Novel Memory Architecture (cont.)

• Usual memory blocks lack a third port.

• Cryptographic primitive and modular arithmetic share the second memory port.
  – Access from the cryptographic primitive only while no computation is executed.
  – Else: access from the modular arithmetic.

• Elliptic curve arithmetic does not directly access the data, but only indirectly via the modular arithmetic.

[Figure: block diagram of the proposed architecture: several ModMult units, each with its own BRAM; a MUX; and the layers Modular Arithmetic, Elliptic Curve Arithmetic, and Cryptographic Primitive, connected by data, commands, status, and busy signals.]

Number of Parallel Multipliers

• Determine number of multipliers to be used (IEEE 1363):
  – ECDbl can utilize only two parallel modular multipliers because of data dependencies.
  – Utilization of modular multipliers for ECAdd (16 multiplications).

• Table highlights scalability.
  – (#multipliers * #consecutive multiplications) is the smallest multiple of the number of multipliers greater than or equal to the overall number of multiplications.

#multipliers   multiplier utilization   #consecutive multiplications
2              approx. 98%              8
3              approx. 82%              6
4              approx. 74%              5

Data Flow Graph ECAdd, IEEE

• Consecutive multiplications are always executed on the same multiplier.
  – No copying between memory blocks.
  – Dark and light grey multiplications are executed on different modular multipliers.

• Longest path contains 5 modular multiplications.
  – No speed-up by using more than 4 multipliers possible.
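The interplay between the number of multiplications and the longest path can be written as a simple lower bound. The sketch below is an illustration rather than the authors' scheduling method: it takes the 16 multiplications of ECAdd and the longest path of 5 from the slides, and the function name is hypothetical.

```python
import math


def consecutive_multiplications(total_mults, longest_path, multipliers):
    """Lower bound on the number of consecutive multiplication steps:
    limited either by the work per multiplier or by the longest dependency chain."""
    return max(math.ceil(total_mults / multipliers), longest_path)


TOTAL_MULTS = 16   # ECAdd multiplications, as stated on the slides
LONGEST_PATH = 5   # longest path in the data flow graph, as stated on the slides

for n in (2, 3, 4, 5):
    print(n, consecutive_multiplications(TOTAL_MULTS, LONGEST_PATH, n))
```

With these inputs the bound gives 8, 6, 5, and 5 steps for two to five multipliers, which agrees with the numbers of consecutive multiplications reported for two to four multipliers and shows why a fifth multiplier brings no further gain.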

Schedule ECAdd, IEEE

• Schedule for two modular multipliers.

• Mapping to multipliers as shown in data flow graph on last slide.

[Figure: schedule chart with rows for ModMult A, ModMult B, and ModArith.]

Multiplier rows:
  Quad1 Mult1 Mult2 Mult3 Quad3 Mult12 Mult11 Mult15
  Quad2 Mult5 Mult4 Mult6 Mult9 Quad4 Mult10 Mult14

ModArith row: Sub1, Mult8_Add, Mult7_Add, Sub3, Sub2, Sub4, Sub5, Sub6, Mult13_Add, Sub7, Div1

Prototype Implementation - Results

• Taking its smaller resource usage into account, the execution time of our solution is comparable to previous work.

• However, because of the high resource usage, none of the previous designs fulfills the given requirements.

• Reference [5] uses GF(2^m) as finite field, thus its execution time is not comparable. Its memory architecture is similar, but it is not easily applicable to GF(P) and does not scale as well.

Design      Flip-Flops   LUTs    Slices   BRAMs   Cycle Period   Point Multiplication
this work   1128         3015    1806     3       9.898 ns       12.716 ms (160 bit)
[16]        6959         11227   n/a      n/a     10.952 ns      14.414 ms (160 bit)
[30]        5735         11416   n/a      35      25 ns          estimated 3 ms (192 bit)
[5]         n/a          n/a     18314    24      100.1 ns       114.71 µs (191 bit, GF(2^m))
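One way to make "taking its smaller resource usage into account" concrete is an area-time product. The sketch below multiplies LUT count by point multiplication time for the two 160-bit designs in the table and derives cycle counts from the cycle period. The LUT*time figure of merit is an assumption introduced here for illustration, not a metric used on the slides.

```python
# Values copied from the results table; only the two 160-bit GF(p) designs
# are compared, since key length and field differ for the others.
designs = {
    "this work": {"luts": 3015, "period_ns": 9.898, "time_ms": 12.716},
    "[16]": {"luts": 11227, "period_ns": 10.952, "time_ms": 14.414},
}


def area_time(name):
    """Illustrative figure of merit: LUTs multiplied by execution time in ms."""
    d = designs[name]
    return d["luts"] * d["time_ms"]


def cycles(name):
    """Clock cycles per point multiplication, derived from time and period."""
    d = designs[name]
    return d["time_ms"] * 1e6 / d["period_ns"]


ratio = area_time("[16]") / area_time("this work")
print(f"LUT*ms ratio [16] / this work: {ratio:.2f}")
for name in designs:
    print(name, round(cycles(name)), "cycles")
```

Such a product is a coarse measure only, as it ignores flip-flops and BRAMs and depends on the target device.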

Application to Alternative EC Arithmetic

• Application of our memory architecture to an algorithm for atomic point doubling and addition.

• The algorithm consists of more modular multiplications, thus allowing better utilization of more modular multipliers.

• Our architecture allows the parallel execution of modular additions.
• With three multipliers the atomic algorithm is faster than IEEE point addition with only two parallel multipliers.

Design      #multipliers   multiplier utilization   #consecutive multiplications   #consecutive additions
[21]        2              approx. 90%              10                             8
this work   2              approx. 94%              10                             1
this work   3              approx. 90%              7                              1
this work   4              approx. 89%              5                              5
this work   5              approx. 75%              5                              1

Schedule for Atomic ECAdd&Dbl

• Schedule for three modular multipliers.

[Figure: schedule chart with rows for ModMult A, ModMult B, ModMult C, and ModArith.]

Multiplier rows:
  Mult1 Mult22 Mult27 Mult29 Mult20 Mult28 Mult30
  Mult21 Mult9 Mult2 Mult23 Mult5 Mult14 Mult31
  Mult7 Mult8 Mult13 Mult10 Mult6

ModArith row: Add26, Add33, Sub18, Add25, Sub4, Add3, Add12, Add16, Sub32, Add15, Add19, Add17, Add11, Sub24

Conclusions

• Novel memory architecture for ECC implementations over GF(P) on FPGAs features the following advantages:

– Low register usage, because of direct memory access.

– Execution time scales with the number of modular multipliers, as long as data dependencies allow this.

– Remaining modular arithmetic is executed in parallel to all the modular multiplications.

• Thank you for your attention.

• Any questions?

References

[5] N. A. Saqib, F. Rodríguez-Henríquez, A. Díaz-Pérez, "A Parallel Architecture for Computing Scalar Multiplication on Hessian Elliptic Curves," in ITCC, vol. 2, 2004, pp. 493-497.

[16] A. B. Örs, L. Batina, B. Preneel, J. Vandewalle, "Hardware Implementation of an Elliptic Curve Processor over GF(p)," in ASAP. IEEE Computer Society, 2003, pp. 433-443.

[21] W. Fischer, C. Giraud, E. W. Knudsen, "Parallel scalar multiplication on general elliptic curves over Fp hedged against Non-Differential Side-Channel Attacks," Jan 2002.

[30] G. Orlando, C. Paar, "A Scalable GF(p) Elliptic Curve Processor Architecture for Programmable Hardware," in CHES, ser. LNCS, vol. 2162, 2001, pp. 348-363.
