FPT 2006 Bangkok
A Novel Memory Architecture for Elliptic Curve Cryptography with Parallel Modular
Multipliers
Ralf Laue, Sorin A. Huss
Integrated Circuits and Systems Lab, Computer Science Dept.
Technische Universität Darmstadt, Germany
{laue|huss}@iss.tu-darmstadt.de
December 14th, 2006, FPT 2006, Bangkok
Introduction
• Speed-up of today's hardware stems increasingly from parallelization.
• Cryptographic implementations should take advantage of this by using parallel algorithm versions.
• We begin with a survey of parallelization on different abstraction levels of public key cryptography.
• Then, we present a novel parallel memory architecture for elliptic curve cryptography in GF(P).
– Allows the execution time to scale with the number of parallel modular multipliers.
– Direct memory connection leads to low resource usage.
Overview
• Parallelization on Different Abstraction Levels
• Novel Memory Architecture
– Design Considerations
– Proposed Memory Architecture
• Experimental Results
– Number of Parallel Multipliers
– Prototype Implementation
– Application to Another EC Arithmetic Algorithm
Parallelization on Different Abstraction Levels
• In general, parallelization yields greater benefit on lower levels (as less control logic needs to be duplicated).
• Parallelization on higher levels allows further speed-up and offers advantages not available on lower levels.
• Parallelization methods on different levels do not exclude each other.
[Figure: abstraction levels of public key cryptography for RSA and ECC/HECC, from lowest to highest: Finite Field (Modular Arithmetic); Elliptic Curve Group (Point Addition and Doubling); Discrete Logarithm/Integer Factorization (Point Multiplication/Exponentiation); Cryptographic Scheme; System.]
Parallelization on Finite Field Level
• Modular multi-word multiplication is the most critical operation. Thus, parallelization on this level is a popular strategy.
• The approaches on this level do not exclude each other.
• Data paths of full bit-width:
– Allow for linear time complexity at the cost of a proportional increase in resources (e.g. systolic array).
– Usual bit-widths: ECC: >100 bit, RSA: >1000 bit.
– Problem: the design must target the maximum bit-width. For smaller word counts resources stay unused; larger word counts may be infeasible.
[Figure: abstraction-level diagram with the Finite Field level (Modular Arithmetic) highlighted.]
Parallelization on Finite Field Level (cont.)
• Pipelining
– Allows for linear time complexity, too.
– More flexible than buses of full bit-width, because the number of pipeline stages may be chosen freely.
– Problem: the calculated bit-width always corresponds to a multiple of the number of stages in words.
  • Resources may still stay unused.
• A combined ECC/RSA implementation allows only for pipeline lengths designed for ECC, as those designed for RSA would waste resources and execution time if used with ECC.
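The rounding effect named under "Problem" can be illustrated with a short calculation. This is only a sketch: the 32-bit word size, the stage counts, and the function name `pipeline_utilization` are assumptions chosen for illustration and do not come from the slides.

```python
import math

def pipeline_utilization(operand_bits, word_bits, stages):
    """Return (computed_words, utilization) for a word-serial pipeline.

    Assumption: the pipeline always processes a multiple of `stages` words,
    so the operand is padded up to the next such multiple.
    """
    operand_words = math.ceil(operand_bits / word_bits)
    computed_words = math.ceil(operand_words / stages) * stages
    return computed_words, operand_words / computed_words

# Hypothetical configurations: 32-bit words, a short pipeline sized for ECC
# and a long pipeline sized for RSA.
for label, bits, stages in [
    ("ECC operand, ECC-sized pipeline", 160, 5),
    ("RSA operand, ECC-sized pipeline", 1024, 5),
    ("ECC operand, RSA-sized pipeline", 160, 32),
]:
    words, util = pipeline_utilization(bits, 32, stages)
    print(f"{label}: {words} words computed, utilization {util:.0%}")
```

Under these assumed parameters the long pipeline leaves most of its stages idle for a 160-bit operand, while the short pipeline loses only a few words of padding on a 1024-bit operand.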
Parallelization on Finite Field Level (cont.)
• Karatsuba multiplication:
– Multiplying two numbers with two words each can be done with three word multiplications:
\[
\begin{aligned}
(x_1 2^b + x_0)(y_1 2^b + y_0) &= x_1 y_1 2^{2b} + (x_1 y_0 + x_0 y_1) 2^b + x_0 y_0 \\
&= x_1 y_1 2^{2b} + [(x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0] 2^b + x_0 y_0
\end{aligned}
\]
– Recursion leads to approx. O(n^1.585).
– As recursion is difficult in hardware, this is usually used for multiplications in full bit-width (requires fewer resources).
• Residue Number Systems:
– Long numbers are represented relative to a base consisting of multiple smaller moduli, relatively prime to each other. The Chinese Remainder Theorem ensures a unique mapping.
– Multiplication, addition and subtraction may be executed in parallel.
– Can be interpreted as a special case of buses of full bit-width.
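The two-word Karatsuba step from this slide can be sketched in software as follows. This is an illustration, not the hardware design discussed in the talk: the function name `karatsuba_two_word` and the parameter `b` (word width in bits) are assumptions, and Python integers stand in for hardware words.

```python
def karatsuba_two_word(x, y, b):
    """Multiply two non-negative 2b-bit integers with three word multiplications."""
    mask = (1 << b) - 1
    x1, x0 = x >> b, x & mask   # high and low word of x
    y1, y0 = y >> b, y & mask   # high and low word of y

    high = x1 * y1                  # word multiplication 1
    low = x0 * y0                   # word multiplication 2
    cross = (x1 + x0) * (y1 + y0)   # word multiplication 3 (operands may be b+1 bits wide)
    middle = cross - high - low     # equals x1*y0 + x0*y1

    return (high << (2 * b)) + (middle << b) + low
```

The three products `high`, `low`, and `cross` do not depend on each other, so they could be computed in parallel; only the final subtractions and shifts need all three results.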
Parallelization on Elliptic Group Level
• EC doubling and addition may be sped up by using multiple modular units in parallel.
• Literature suggests a maximum of two or three modular multipliers (data dependencies limit further improvements).
• One instance of the remaining modular arithmetic is sufficient, because it is very fast in comparison.
• This abstraction level is well-suited for parallelization in SIMD implementations.
• Note that this level does not exist for RSA.
[Figure: abstraction-level diagram with the Elliptic Curve Group level (Point Addition and Doubling) highlighted.]
Parallelization on Discrete Logarithm/ Integer Factorization Level
• Both point multiplication and exponentiation allow parallel use of two instances of group operations.
– E.g. with the Montgomery Ladder (parallel point doubling/addition for ECC; parallel square/multiply for RSA).
• Parallelization on this abstraction level is (in addition to providing further speed-up) often used as a countermeasure against side channel attacks.
[Figure: abstraction-level diagram with the Discrete Logarithm/Integer Factorization level (Point Multiplication/Exponentiation) highlighted.]
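The Montgomery Ladder mentioned on this slide can be sketched for the RSA-style case (modular exponentiation). The function name `montgomery_ladder_pow` is an assumption, and the code only shows the structure of the algorithm: it is ordinary Python and makes no claim to constant-time behaviour.

```python
def montgomery_ladder_pow(base, exponent, modulus):
    """Compute base**exponent mod modulus with a Montgomery ladder.

    Invariant: r1 == r0 * base (mod modulus) after every iteration.
    """
    r0, r1 = 1 % modulus, base % modulus
    for i in reversed(range(exponent.bit_length())):
        if (exponent >> i) & 1:
            # Multiply and square both read only the old r0, r1,
            # so two units could execute them at the same time.
            r0, r1 = (r0 * r1) % modulus, (r1 * r1) % modulus
        else:
            r0, r1 = (r0 * r0) % modulus, (r0 * r1) % modulus
    return r0
```

Every iteration performs exactly one squaring and one multiplication regardless of the exponent bit, which is what makes the two operations a natural fit for two parallel instances. For ECC the same ladder applies with point doubling in place of squaring and point addition in place of multiplication.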
Parallelization on Cryptographic Primitive/ System Level
• Cryptographic schemes usually use only one point multiplication/exponentiation.
– We know of no proposal for parallelization on this level.
• Possible scenario: flexible coprocessor for RSA/ECC
– Parallelization on lower abstraction levels is only possible to a certain degree if unused resources are to be avoided.
– Further parallelization may be done on the level of the cryptographic primitive to increase throughput.
[Figure: abstraction-level diagram with the Cryptographic Scheme and System levels highlighted.]
Design Goals
• ECC implementation for GF(P) on FPGAs.
• Ability to support different key lengths.
• Resource requirements should be relatively low, thus allowing integration of further functions on the FPGA.
– E.g. other cryptographic modules, something unrelated to cryptography.
• Thus, minimum execution time was less important than a high utilization of the allocated resources.
Design Decisions
• No parallelization on finite field level
– Would lead to unused resources, at least for some key lengths.
• Instead, parallelization on elliptic group level
– Depends on data dependencies, independent of key length.
• Modular multiplication is more complex and time consuming than the remaining modular operations.
– Chosen architecture consists of multiple modular multipliers parallel to each other and the module for the remaining modular arithmetic parallel to the multipliers.
Conventional Memory Architecture
• Memory architecture must allow all operations to be continuously supplied with data.
• Conventional memory architecture consists of one memory and modules with input and output registers.
• Registers take up FPGA resources, but contain only redundant data copied from memory.
[Figure: conventional architecture with one RAM and the modules Mult 1 ... Mult n, ALU, ..., Square.]
Novel Memory Architecture
• Each modular multiplier is assigned its own memory block via a direct connection.
– Supports continuous data supply.
– Low general resource usage, slightly increased memory usage.
• Remaining modular arithmetic may access memory blocks via the second port.
• Execution time scales with the number of modular multipliers.
• Modular arithmetic copies data between local memory blocks, as multipliers can only access "their" memory block.
– Does not hinder scalability, as the remaining modular arithmetic can access all memory blocks simultaneously in parallel.
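The access rules on this slide can be mirrored in a small behavioural model. It is a toy sketch under stated assumptions: the class names `MemoryBlock`, `ModMult`, and `ModArith`, the dictionary-based memory, and the toy prime 97 are all hypothetical, and sequential Python cannot express the actual parallel operation of the hardware.

```python
class MemoryBlock:
    """Dual-port memory block, modelled as a dict from address to value."""
    def __init__(self):
        self.cells = {}

class ModMult:
    """Modular multiplier wired to exactly one memory block (first port)."""
    def __init__(self, block, modulus):
        self.block = block
        self.modulus = modulus

    def multiply(self, dst, src_a, src_b):
        cells = self.block.cells   # only the multiplier's own block is reachable
        cells[dst] = (cells[src_a] * cells[src_b]) % self.modulus

class ModArith:
    """Remaining modular arithmetic; reaches every block via the second port."""
    def __init__(self, blocks, modulus):
        self.blocks = blocks
        self.modulus = modulus

    def copy(self, src_block, src, dst_block, dst):
        self.blocks[dst_block].cells[dst] = self.blocks[src_block].cells[src]

    def add(self, block, dst, src_a, src_b):
        cells = self.blocks[block].cells
        cells[dst] = (cells[src_a] + cells[src_b]) % self.modulus

# Usage: two multipliers work on their own blocks, then ModArith moves one
# intermediate result so that the first multiplier can combine both.
p = 97  # toy modulus, chosen only for illustration
blocks = [MemoryBlock(), MemoryBlock()]
mults = [ModMult(block, p) for block in blocks]
arith = ModArith(blocks, p)

blocks[0].cells.update({"x": 5, "y": 7})
blocks[1].cells.update({"u": 11, "v": 13})
mults[0].multiply("xy", "x", "y")
mults[1].multiply("uv", "u", "v")
arith.copy(1, "uv", 0, "uv")
mults[0].multiply("r", "xy", "uv")
```

The point of the model is the ownership rule: a `ModMult` never touches another multiplier's block, so any operand produced elsewhere has to be moved by `ModArith` first.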
Novel Memory Architecture (cont.)
• Usual memory blocks lack a third port.
• Cryptographic primitive and modular arithmetic share the second memory port.
– Access from the cryptographic primitive only while no computation is executed.
– Else: access from the modular arithmetic.
• Elliptic curve arithmetic does not directly access the data, but only indirectly via the modular arithmetic.
[Figure: block diagram of the proposed architecture. Blocks: several ModMult units, each with its own BRAM; MUX; Modular Arithmetic; Elliptic Curve Arithmetic; Cryptographic Primitive. Signal labels: data, commands, status, busy.]
Number of Parallel Multipliers
• Determine the number of multipliers to be used (IEEE 1363):
– ECDbl can utilize only two parallel modular multipliers because of data dependencies.
– Utilization of modular multipliers for ECAdd (16 multiplications).
• Table highlights scalability.
– (#multipliers * #consecutive multiplications) is the smallest multiple of the number of multipliers greater than or equal to the overall number of multiplications.
#multipliers   multiplier utilization   #consecutive multiplications
2              approx. 98%              8
3              approx. 82%              6
4              approx. 74%              5
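The "#consecutive multiplications" column can be reproduced from two figures given on the slides for ECAdd: 16 multiplications in total and a longest path of 5 multiplications in the data flow graph. The sketch below combines them as a simple lower bound; the function name is an assumption, and a bound of this kind need not be reachable for every data flow graph. The utilization percentages are not recomputed, since the slides do not state how they are derived.

```python
import math

def consecutive_multiplications(total_mults, longest_path, multipliers):
    """Lower bound on the number of multiplication steps per multiplier.

    Either the work is spread evenly over the multipliers, or the longest
    dependency chain dominates, whichever is larger.
    """
    return max(math.ceil(total_mults / multipliers), longest_path)

for n in range(2, 6):
    print(n, consecutive_multiplications(16, 5, n))
```

For two, three, and four multipliers this yields 8, 6, and 5 steps, matching the table; a fifth multiplier gives no further reduction because the longest path already fixes the count at 5.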
Data Flow Graph ECAdd, IEEE
• Consecutive multiplications are always executed on the same multiplier.
– No copying between memory blocks.
– Dark and light grey multiplications are executed on different modular multipliers.
• Longest path contains 5 modular multiplications.
– No speed-up by using more than 4 multipliers possible.
Schedule ECAdd, IEEE
• Schedule for two modular multipliers.
• Mapping to multipliers as shown in the data flow graph on the previous slide.
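[Figure: schedule for ECAdd with rows ModMult A, ModMult B, and ModArith.]
Multiplier row: Quad1, Mult1, Mult2, Mult3, Quad3, Mult12, Mult11, Mult15
Multiplier row: Quad2, Mult5, Mult4, Mult6, Mult9, Quad4, Mult10, Mult14
ModArith: Sub1, Mult8_Add, Mult7_Add, Sub3, Sub2, Sub4, Sub5, Sub6, Mult13_Add, Sub7, Div1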
Prototype Implementation - Results
• Taking its smaller resource usage into account, the execution time of our solution is comparable to previous work.
• However, because of the high resource usage, none of the previous designs fulfills the given requirements.
• Reference [5] uses GF(2^m) as finite field, thus its execution time is not comparable. Its memory architecture is similar, but it is not easily applicable to GF(P) and does not scale as well.
            FlipFlops   LUTs    Slices   BRAMs   Cycle Period   Point Multiplication
this work   1128        3015    1806     3       9.898 ns       12.716 ms (160 bit)
[16]        6959        11227   n/a      n/a     10.952 ns      14.414 ms (160 bit)
[30]        5735        11416   n/a      35      25 ns          estimated 3 ms (192 bit)
[5]         n/a         n/a     18314    24      100.1 ns       114.71 µs (191 bit, GF(2^m))
Application to Alternative EC Arithmetic
• Application of our memory architecture to an algorithm for atomic point doubling and addition.
• The algorithm consists of more modular multiplications, thus allowing better utilization of a larger number of modular multipliers.
• Our architecture allows the parallel execution of modular additions.
• With three multipliers, the atomic algorithm is faster than IEEE point addition with only two parallel multipliers.
            #multipliers   multiplier utilization   #consecutive multiplications   #consecutive additions
[21]        2              approx. 90%              10                             8
this work   2              approx. 94%              10                             1
this work   3              approx. 90%              7                              1
this work   4              approx. 89%              5                              5
this work   5              approx. 75%              5                              1
Schedule for Atomic ECAdd&Dbl
• Schedule for three modular multipliers.
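[Figure: schedule for atomic ECAdd&Dbl with rows ModMult A, ModMult B, ModMult C, and ModArith.]
Multiplier row: Mult1, Mult22, Mult27, Mult29, Mult20, Mult28, Mult30
Multiplier row: Mult21, Mult9, Mult2, Mult23, Mult5, Mult14, Mult31
Multiplier row: Mult7, Mult8, Mult13, Mult10 (Mult6 also appears in the figure)
ModArith: Add26, Add33, Sub18, Add25, Sub4, Add3, Add12, Add16, Sub32, Add15, Add19, Add17, Add11, Sub24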
Conclusions
• Novel memory architecture for ECC implementations over GF(P) on FPGAs features the following advantages:
– Low register usage, because of direct memory access.
– Execution time scales with the number of modular multipliers, as long as data dependencies allow this.
– Remaining modular arithmetic is executed in parallel to all the modular multiplications.
References
[5] N. A. Saqib, F. Rodríguez-Henríquez, A. Díaz-Pérez, "A Parallel Architecture for Computing Scalar Multiplication on Hessian Elliptic Curves," in ITCC, vol. 2, 2004, pp. 493-497.
[16] A. B. Örs, L. Batina, B. Preneel, J. Vandewalle, "Hardware Implementation of an Elliptic Curve Processor over GF(p)," in ASAP. IEEE Computer Society, 2003, pp. 433-443.
[21] W. Fischer, C. Giraud, E. W. Knudsen, "Parallel scalar multiplication on general elliptic curves over Fp hedged against Non-Differential Side-Channel Attacks," Jan 2002.
[30] G. Orlando, C. Paar, "A Scalable GF(p) Elliptic Curve Processor Architecture for Programmable Hardware," in CHES, ser. LNCS, vol. 2162, 2001, pp. 348-363.