Workshop on Cryptographic Hardware and Embedded Systems (CHES 2006), 13/10/2006
Superscalar Coprocessor for High-speed Curve-based Cryptography
K. Sakiyama, L. Batina, B. Preneel, I. Verbauwhede
Katholieke Universiteit Leuven / IBBT
Department Electrical Engineering - ESAT/COSIC
Overview
  Introduction
  Curve-based Cryptography
  HW/SW Partitioning
  Superscalar Coprocessor
  Results
  Conclusions
Introduction: Motivation
High-speed curve-based cryptography in HW/SW co-design
  How much instruction-level parallelism can we obtain from coprocessor instructions?
Performance improvement for different operation forms in datapath
  AB+C mod P vs A(B+D)+C mod P, where A, B, C, D, P are polynomials
Performance comparison of three different curve-based cryptosystems
  Which is fastest among ECC, HECC, and ECC over a composite field?
Programmability and scalability
  Programmable in order to support different cryptosystems?
  Scalable in field sizes?
Introduction: Target Architecture
Curve-based cryptography over binary fields
  Hardware can be smaller and faster than for prime fields
  ECC over a binary field, e.g. GF(2^163)
  HECC of genus 2
    Field length can be shorter by a factor of 2, e.g. GF(2^83)
  ECC over a composite field
    Field length can be shorter by a factor of 2, e.g. GF((2^83)^2)
The datapath can be shared
Programmable coprocessor supporting three curve-based cryptosystems by defining coprocessor instruction(s)
(Coprocessor) instruction-level parallelism by superscalar
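To illustrate how a composite-field multiplication can reuse the same base-field datapath, the sketch below builds GF((2^m)^2) multiplication from four base-field operations of the form AB+C mod P. The extension polynomial y^2 + y + 1 (irreducible over GF(2^m) whenever m is odd), the toy base field GF(2^7) with P(x) = x^7 + x + 1, and the names mul_add and ext_mul are assumptions made for this illustration; the talk does not state which polynomials or instruction sequence the coprocessor uses.

```python
def mul_add(a, b, c, p):
    """Return (a*b + c) mod p over GF(2); polynomials are packed into ints.
    Inputs a, b, c are assumed to be already reduced modulo p."""
    m = p.bit_length() - 1      # degree of the reduction polynomial
    r = c
    while b:
        if b & 1:
            r ^= a              # add the current multiple of a
        b >>= 1
        a <<= 1                 # a = a * x
        if (a >> m) & 1:
            a ^= p              # reduce modulo p
    return r

def ext_mul(x, y, p):
    """Multiply x = a0 + a1*Y and y = b0 + b1*Y with Y^2 = Y + 1 (assumed)."""
    a0, a1 = x
    b0, b1 = y
    t = mul_add(a1, b1, 0, p)    # a1*b1
    c0 = mul_add(a0, b0, t, p)   # a0*b0 + a1*b1
    c1 = mul_add(a0, b1, t, p)   # a0*b1 + a1*b1
    c1 = mul_add(a1, b0, c1, p)  # ... + a1*b0
    return (c0, c1)

P = 0b10000011                   # x^7 + x + 1, toy base field GF(2^7)
assert ext_mul((0, 1), (0, 1), P) == (1, 1)   # Y * Y = Y + 1
assert ext_mul((0, 1), (1, 1), P) == (1, 0)   # Y^3 = 1
```

Every step is a single AB+C operation on base-field elements, which is one way to read the statement that the datapath can be shared between the three cryptosystems.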
Curve-based Cryptography: HW/SW partitioning (1)
General hierarchy in coprocessor for curve-based cryptography
[Figure: three-level hierarchy. Top: Point/Divisor Multiplication (SW or HW controller). Middle: Point/Divisor Addition, Point/Divisor Doubling (SW or HW controller). Bottom: Finite Field Addition, Finite Field Multiplication, Finite Field Inversion (HW Datapath).]
Curve-based Cryptography: Proposed Hierarchy (1)
Single instruction for all finite field operations
Fixed-cycle execution enables efficient implementation
[Figure: two hierarchies side by side. Conventional: Point/Divisor Multiplication; Point/Divisor Addition, Point/Divisor Doubling; Finite Field Addition, Finite Field Multiplication, Finite Field Inversion. Single Instruction (Datapath): Point/Divisor Multiplication; Point/Divisor Addition, Point/Divisor Doubling; Finite Field Operation, e.g. AB+C mod P; Finite Field Inversion.]
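As a software illustration of the single-instruction idea, the sketch below computes AB+C mod P over GF(2) with a loop that always runs for exactly deg(P) iterations, whatever the operand values. The function name gf2m_mul_add and the toy field GF(2^3) are assumptions made for this example; this is not the coprocessor's actual interface.

```python
def gf2m_mul_add(a, b, c, p):
    """Return (a*b + c) mod p for polynomials over GF(2) packed into ints.
    Inputs a, b, c are assumed to be already reduced modulo p."""
    m = p.bit_length() - 1          # degree of the reduction polynomial
    t = 0
    for i in reversed(range(m)):    # fixed m iterations, MSB of a first
        t <<= 1                     # t = t * x
        if (t >> m) & 1:
            t ^= p                  # degree reached m: reduce by p
        if (a >> i) & 1:
            t ^= b                  # add b when bit i of a is set
    return t ^ c                    # addition in GF(2^m) is XOR

P = 0b1011                          # x^3 + x + 1, toy field GF(2^3)
assert gf2m_mul_add(0b010, 0b100, 0, P) == 0b011      # x * x^2 = x + 1
assert gf2m_mul_add(1, 0b101, 0b011, P) == 0b110      # A = 1 gives B + C
```

With C = 0 the operation is a plain multiplication and with A = 1 it is a plain addition, which is why one instruction form can stand in for the separate finite field operations of the conventional hierarchy.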
Curve-based Cryptography: Modular Arithmetic Logic Unit (MALU)
(a) Building block: regular XOR chains
(b) Scalable in digit size (d) and field size (k) by interconnecting several building blocks
We use MALU83 (n=83, d=12) as the building block
2xMALU83 can be configured as 1xMALU163
[Figure: (a) one building block of width n and depth d, with signals a_i B(x), m_i P(x), T(x), c_i, a_k, m_k, c_(k+1) and T_next(x); (b) two such blocks joined by interconnection.]
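The digit-serial behaviour suggested by the parameters d and k can be sketched as follows: each step consumes d bits of A, so one operation takes ceil(k/d) steps. This is a behavioural model written for illustration only, not the MALU's gate-level structure. The helper names and the degree-83 modulus in the usage lines are placeholders; the modulus is not claimed to be irreducible, which does not matter for checking the arithmetic.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials packed into ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(t, p):
    """Reduce polynomial t modulo p over GF(2)."""
    m = p.bit_length() - 1
    while t.bit_length() > m:
        t ^= p << (t.bit_length() - 1 - m)
    return t

def malu_digit_serial(a, b, c, p, d):
    """Compute (a*b + c) mod p, consuming d bits of a per step.
    Returns (result, steps) with steps == ceil(k / d)."""
    k = p.bit_length() - 1
    steps = -(-k // d)                       # ceil(k / d)
    mask = (1 << d) - 1
    t = 0
    for j in reversed(range(steps)):         # most significant digit first
        digit = (a >> (j * d)) & mask
        t = poly_mod((t << d) ^ clmul(digit, b), p)
    return t ^ c, steps

p = (1 << 83) | 0b10010101                   # placeholder degree-83 modulus
a = (1 << 82) | 0x123456789ABCDEF
b = (1 << 80) | 0xFEDCBA987654321
result, steps = malu_digit_serial(a, b, 0x55, p, d=12)
assert steps == 7                            # ceil(83 / 12)
assert result == poly_mod(clmul(a, b), p) ^ 0x55
```

In this model a larger digit size d trades more XOR logic per step for fewer steps, which is the scalability knob the slide refers to.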
HW/SW Partitioning: TYPE I: Smallest implementation (baseline)
[Figure: block diagram with the labels Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and Coprocessor (IBC, DBC, MALU83).]
HW/SW Partitioning: TYPE II: TYPE I + µ-code RAM
[Figure: block diagram with the labels Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and Coprocessor (IBC, DBC, µ-code RAM, FSM, MALU83).]
HW/SW Partitioning: TYPE III: TYPE I + Coprocessor Memory
[Figure: block diagram with the labels Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and Coprocessor (IBC, DBC, Coprocessor Memory, MALU83).]
HW/SW Partitioning: TYPE IV: TYPE I + Coprocessor Memory & µ-code RAM
[Figure: block diagram with the labels Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and Coprocessor (IBC, DBC, µ-code RAM, FSM, Coprocessor Memory, MALU83).]
HW/SW Partitioning: Co-design flow with GEZEL
[Figure: design flow with the labels: C/C++ codes for PKCs; Partitioning of functions; C/C++ codes & H/W behavior blocks w/ interface; C/C++ codes w/ physical memory map; GEZEL FDL codes; ARM (SW) and Co-processor (HW); Cycle-true sim. (GEZEL); Cross compile, giving program codes; Synthesis, giving VHDL codes.]
HW/SW Partitioning: Result: Vertical Exploration of System
HECC performance for different HW/SW partitioning (Performance: Point/Divisor multiplication)
[Figure: stacked bar chart. X axis: System Configuration (TYPE I, TYPE II, TYPE III, TYPE IV). Y axis: Required Clock Cycles [K], scale 0 to 500. Series: I/O Transfer Overhead + Others; Coprocessor Data Memory; Datapath. Data labels: 38, 38; 67, 67, 67, 67; 0, 0; 187; 2,859; 0; 2,672.]
Coprocessor Configuration
            µ-code RAM   Data Mem.
TYPE I
TYPE II     X
TYPE III                 X
TYPE IV     X            X
Superscalar Coprocessor: Proposed Hierarchy (2)
Multiple Modular Arithmetic Logic Units (MALUs) in coprocessor
[Figure: two hierarchies side by side. Single MALU: Point/Divisor Multiplication; Point/Divisor Addition, Point/Divisor Doubling; Finite Field Operation, e.g. AB+C mod P; Finite Field Inversion. Multiple MALUs: Point/Divisor Multiplication; Point/Divisor Addition, Point/Divisor Doubling; Finite Field Inversion; several parallel Finite Field Operation units, each e.g. AB+C mod P.]
Superscalar Coprocessor: Parallel Processing Architecture (TYPE IV-based)
[Figure: block diagram with the labels Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and Coprocessor (IBC, DBC, IQB, µ-code RAM, FSM, Coprocessor Memory, four MALU83 units, Buffer Full signal).]
Superscalar Coprocessor: Horizontal Exploration of System
Performance of ECC and HECC
[Figure: stacked bar chart. X axis: Coprocessor Configuration, covering HECC on 1xMALU83, 2xMALU83, 3xMALU83 and 4xMALU83 and ECC on 1xMALU163 and 2xMALU163, for the operations AB+C and A(B+D)+C. Y axis: Required Clock Cycles [K], scale 0 to 100. Series: Coprocessor Data Memory; Datapath. Data labels: 67, 58, 30, 36, 22, 20, 20; 38, 41, 25, 13, 22, 22, 8.]
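One way to see why the operation form matters: an expression of the shape A(B+D)+C takes one instruction when the datapath supports that form directly, but two instructions when only AB+C is available. The sketch below checks the equivalence on a toy field. The function names and the field GF(2^7) with P(x) = x^7 + x + 1 are assumptions for illustration; the instruction counts of the real point and divisor formulas are not derived here.

```python
def mod_mul(a, b, p):
    """Return a*b mod p over GF(2); a and b are assumed reduced modulo p."""
    m = p.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1                 # a = a * x
        if (a >> m) & 1:
            a ^= p              # reduce modulo p
    return r

def ab_c(a, b, c, p):
    """Instruction form AB + C mod P."""
    return mod_mul(a, b, p) ^ c

def a_bd_c(a, b, d, c, p):
    """Instruction form A(B + D) + C mod P."""
    return mod_mul(a, b ^ d, p) ^ c

P = 0b10000011                  # x^7 + x + 1, toy field GF(2^7)
a, b, d, c = 0x53, 0x4A, 0x1F, 0x66

fused = a_bd_c(a, b, d, c, P)                  # one instruction
two_step = ab_c(a, b, ab_c(a, d, c, P), P)     # AD + C, then AB + (AD + C)
assert fused == two_step
```

Setting D = 0 in the fused form gives back AB+C, so the richer form covers the simpler one.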
Results: Performance for ECC over GF(2^163)
Fastest of three
x1.8 speed-up by 2-way superscaling (ILPDP=6) with A(B+D)+C
Further improvement is still possible by adding MALUs
[Figure: performance charts for the operation forms AB+C and A(B+D)+C]
Results: Performance of HECC over GF(2^83)
Faster than ECC over a composite field
x2.7 speed-up by 4-way superscaling (ILPDP=5) with A(B+D)+C
Less improvement as the number of MALUs increases
[Figure: performance charts for the operation forms AB+C and A(B+D)+C]
Results: Performance for ECC over GF((2^83)^2)
Slowest of three
x2.5 speed-up by 4-way superscaling (ILPDP=6) with A(B+D)+C
Less improvement as the number of MALUs increases
[Figure: performance charts for the operation forms AB+C and A(B+D)+C]
Results: Comparison of ECC/HECC implementations on FPGAs
[11] T. Wollinger, PhD thesis, 2004.
[13] G. Orlando and C. Paar, CHES 2000.
[14] N. Gura et al., CHES 2002.
[29] Nazar A. Saqib et al., International Journal of Embedded Systems, 2005.
Conclusions
Performance improvement / Comparison
  ECC was improved by a factor of 1.8 (2-way)
  HECC (genus 2) was improved by a factor of 2.7 (4-way)
  ECC over a composite field was improved by a factor of 2.5 (4-way)
  A(B+D)+C offers better performance than AB+C
  ECC is the fastest in this case study
Programmability & flexibility
  Support for three different curve-based cryptosystems over a binary field
  Arbitrary irreducible polynomial
  Field size up to 332 bits by using 4xMALU83
Thank you!
Parallel issue of instructions: Case of using 4 MALUs
[Figure: pipeline timing diagram for MALU#0 to MALU#3, with stages labelled IF/D, R0 to R3, EX and W0 to W3. Clock-cycle annotations: 1, 4 (3*), k/d, 4.]
IF/D: Instruction Fetch & Decode
R_: Read operands (dependent on the type of operation)
EX: Execution (dependent on MALU configuration, k & d)
W_: Write (dependent on # of instructions issued in parallel)
Parallel issue of instructions: Out-of-order Execution
Check RAW (Read After Write) dependency for in-/out-of-order execution
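A RAW check of the kind mentioned above can be sketched as a greedy, in-order issue model: instructions are packed into a group until the group is full or an instruction reads a result produced inside the same group. The Instr record, the register names and the grouping policy are assumptions for illustration only. The model ignores WAW and WAR hazards and out-of-order reordering, and it is not a description of the coprocessor's actual issue logic.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination, a tuple of sources.
Instr = namedtuple("Instr", "dst srcs")

def has_raw(earlier, later):
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dst in later.srcs

def issue_groups(program, n_malus):
    """Pack instructions in program order into groups issued in parallel.
    A new group starts when all MALUs are busy or a RAW hazard is found."""
    groups, current = [], []
    for ins in program:
        if len(current) == n_malus or any(has_raw(e, ins) for e in current):
            groups.append(current)
            current = []
        current.append(ins)
    if current:
        groups.append(current)
    return groups

program = [
    Instr("r3", ("r0", "r1", "r2")),   # r3 = r0*r1 + r2
    Instr("r4", ("r0", "r2", "r1")),   # r4 = r0*r2 + r1
    Instr("r5", ("r3", "r4", "r0")),   # r5 = r3*r4 + r0, needs r3 and r4
    Instr("r6", ("r1", "r2", "r0")),   # r6 = r1*r2 + r0, independent
]

assert len(issue_groups(program, 1)) == 4
assert len(issue_groups(program, 2)) == 2
assert len(issue_groups(program, 4)) == 2   # dependency limits the gain
```

In this toy program going from two to four MALUs gives no further reduction, because the third instruction must wait for the first two.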