Cryptographic Algorithms and their Implementations

Discussion of how to map different algorithms to our architecture

Public-Key Algorithms (Modular Exponentiation)

Rijndael

Serpent

Others (Mars, RC6, Twofish, etc.)

Modular Exponentiation

Square and Multiply Algorithm for Modular Exponentiation


Montgomery Modular Multiplication


Several approaches to implementing Modular Multiplication:
  Redundant-representation based (e.g. carry-save)
  Residue Number System based
  Systolic-array based

Word-based implementations are preferable, due to their similarity with symmetric-key algorithms.
  This rules out systolic arrays.

Modular Exponentiation

The most popular and fastest implementations were based on carry-save representation.
  Carry-save based implementations were also word-oriented.

We selected the fastest, simplest implementation:
  It is extremely beneficial to have simplicity and homogeneity in algorithms when designing a custom reconfigurable fabric.

Performance when implemented on Xilinx Virtex FPGAs: almost 5 Mb/s (the highest reported that we could find)


Five-to-two Multiplier Modular Exponentiation (P, E, M)

K = 2^(2k) mod M  (computed externally)
1. P1_0, P2_0 = 5to2_MontMult(K, 0, 1, 0, M),
   Z1_0, Z2_0 = 5to2_MontMult(K, 0, P, 0, M);
2. FOR i = 0 to n-1 DO
3.   Z1_(i+1), Z2_(i+1) = 5to2_MontMult(Z1_i, Z2_i, Z1_i, Z2_i, M)
4.   IF e_i = 1 THEN
       P1_(i+1), P2_(i+1) = 5to2_MontMult(P1_i, P2_i, Z1_i, Z2_i, M)
     ELSE
       P1_(i+1), P2_(i+1) = P1_i, P2_i
5. ENDFOR
6. P1_n, P2_n = 5to2_MontMult(1, 0, P1_(n-1), P2_(n-1), M)
7. P = P1_n + P2_n
8. RETURN P
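The exponentiation loop above can be sketched in Python. This is only an illustration under several assumptions that are not in the slides: the carry-save pairs (P1, P2) and (Z1, Z2) are collapsed into ordinary integers, `mont_mult` is a plain bit-serial Montgomery multiplier standing in for 5to2_MontMult, the iteration count m is taken as the bit length of M plus two guard bits, K is computed as 2^(2m) mod M, and a final `% M` is added for safety. The function names are hypothetical.

```python
def mont_mult(a, b, M, m):
    """Bit-serial Montgomery product, congruent to a*b*2^(-m) mod M (M odd)."""
    s = 0
    for i in range(m):
        a_i = (a >> i) & 1            # bit i of the multiplier
        t = s + a_i * b
        q = t & 1                     # q_i makes t + q_i*M even
        s = (t + q * M) >> 1          # exact division by 2
    return s


def mont_exp(base, exponent, M):
    """Right-to-left square-and-multiply: base**exponent mod M, for odd M."""
    m = M.bit_length() + 2            # assumption: two guard bits keep values below 2M
    K = pow(2, 2 * m, M)              # "computed externally" in the slides
    p = mont_mult(K, 1, M, m)         # step 1: the value 1 in Montgomery form
    z = mont_mult(K, base % M, M, m)  # step 1: the base in Montgomery form
    for i in range(exponent.bit_length()):
        if (exponent >> i) & 1:       # step 4: multiply by Z_i when e_i = 1
            p = mont_mult(p, z, M, m)
        z = mont_mult(z, z, M, m)     # step 3: squaring yields Z_(i+1)
    p = mont_mult(1, p, M, m)         # step 6: leave Montgomery form
    return p % M                      # assumption: final reduction added for safety
```

The multiply is placed before the squaring so that it uses Z_i rather than Z_(i+1), matching step 4 of the listing.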


Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)

1. S1_0, S2_0 = 0, 0
2. FOR i = 0 to m-1 DO
3.   q_i = [(S1_i + S2_i) + A_i*(B1 + B2)] mod 2
4.   S1_(i+1), S2_(i+1) = CSR[(S1_i + S2_i) + A_i*(B1 + B2) + q_i*M] div 2
5. ENDFOR
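A rough Python model of the five-to-two step may help in reading the listing above. It is a sketch, not the referenced hardware design: the CSR operation is assumed to be three chained 3-to-2 carry-save adders, A_i is taken as bit i of A1 + A2, Python's unbounded integers stand in for fixed-width registers, and the names `csa` and `mont_mult_5to2` are invented for the example.

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: x + y + z == s + c, with no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def mont_mult_5to2(a1, a2, b1, b2, M, m):
    """Return (S1, S2) with S1 + S2 congruent to (a1+a2)*(b1+b2)*2^(-m) mod M."""
    a = a1 + a2                        # A_i below is bit i of A1 + A2
    s1, s2 = 0, 0
    for i in range(m):
        a_i = (a >> i) & 1
        t1, t2 = a_i * b1, a_i * b2    # A_i*B1 and A_i*B2
        # the LSB of a sum is the XOR of the operands' LSBs
        q = (s1 ^ s2 ^ t1 ^ t2) & 1
        # reduce five operands to two with three chained CSAs
        x, y = csa(s1, s2, t1)
        x, y = csa(x, y, t2)
        x, y = csa(x, y, q * M)
        # y has LSB 0 and x + y is even, so x has LSB 0 too: the shift is exact
        s1, s2 = x >> 1, y >> 1
    return s1, s2
```

Because q_i depends only on the least significant bits, it can be formed without resolving the carry-save pair, which is what keeps the loop free of long carry chains.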


Their Implementation of MM

[Figure: block diagram with three 1024-bit CSAs, 1024-bit registers holding S1[i] and S2[i], memories labelled Ai.B1, Ai.B2 and qi.n, a 1024-bit shift register, and a full adder (FA) with flip-flop (FF) labelled Ai.]


Implementing MM on our design

[Figure: the 1024-bit datapath drawn as 64-bit slices, from <63:0> up to <959:896> and <1023:960>. Each slice shows 64-bit CSAs, 64-bit registers holding S1[i] and S2[i], and memories labelled Ai.B1, Ai.B2 and qi.n for that slice.]

Modular Exponentiation

Each of the 64-bit CSA blocks maps to a single basic-block.
Outputs of the last basic-block are registered.
qi is generated by the random-logic block at the second basic-block.
  Broadcast to all groups.
Ai is generated in a similar manner, utilizing two more basic-blocks.
  Also broadcast to all groups.

[Figure: full adder (FA) and flip-flop (FF), labelled Ai.]

Modular Exponentiation

Efficient and scalable mapping to our design:
  1024-bit RSA will need 16 groups, 2048-bit will need 32, and 4096-bit will need 64.

Primary concern: clock rate may be limited by the bit-broadcasts of qi and Ai.
  Potential impediment to scalability.
  We are exploring methods for pipelining these broadcasts as well, to shorten the cycle time and improve scalability.
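The slicing into 64-bit groups can be illustrated with a small Python sketch. It assumes one 64-bit slice per group (which reproduces the 16, 32 and 64 group counts quoted above) and shows that a wide carry-save addition can be computed slice by slice. The names `SLICE_BITS`, `groups_needed` and `csa_sliced` are hypothetical.

```python
SLICE_BITS = 64  # width of one CSA basic-block, per the slides


def groups_needed(operand_bits):
    """Number of 64-bit slices (assumed: one group each) for a given width."""
    return operand_bits // SLICE_BITS


def csa_sliced(x, y, z, operand_bits):
    """Carry-save add computed slice by slice; returns (sum, carry) words."""
    mask = (1 << SLICE_BITS) - 1
    s = c = 0
    for g in range(groups_needed(operand_bits)):
        lo = g * SLICE_BITS
        xs, ys, zs = (x >> lo) & mask, (y >> lo) & mask, (z >> lo) & mask
        s |= (xs ^ ys ^ zs) << lo                             # per-slice sum bits
        c |= ((xs & ys) | (xs & zs) | (ys & zs)) << (lo + 1)  # carry word moves up one bit
    return s, c
```

Each slice works only on its own 64 bits; the one-bit upward shift of the carry word is the only data that crosses a slice boundary, which is consistent with qi and Ai being the only signals that have to be broadcast to every group.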

Rijndael

Primary operations:

Sub-Bytes

Shift-Rows

Mix-Columns

Add-Round-Key

Rijndael

Representation of Data: 128-bit state.

[Figure: the 128 bits of state drawn as 32-bit words, each made up of 8-bit values.]

Rijndael

Add-Round-Key
  Simple 128-bit XOR operation: uses 1 basic-block.

Sub-Bytes
  Simple operation: byte-wise table lookup from the S-box.
  Each S-box is 2 kbits; 16 parallel S-boxes are required.
  No basic-blocks required, but ALL memory-blocks are required!

Shift-Rows
  Simple operation: 4 x 32-bit permutations.
  Uses only 1 basic-block.
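To make the two logic-only steps concrete, here is a minimal Python sketch of Add-Round-Key and Shift-Rows. It assumes the usual column-major layout of the 16-byte state (byte index r + 4c for row r and column c) and is not a description of how the basic-blocks are configured. The function names are illustrative.

```python
def add_round_key(state, round_key):
    """128-bit XOR, done byte by byte on two 16-byte sequences."""
    return bytes(s ^ k for s, k in zip(state, round_key))


def shift_rows(state):
    """Rotate row r of the 4x4 byte matrix left by r positions."""
    out = bytearray(16)
    for c in range(4):
        for r in range(4):
            # the output byte at (r, c) comes from column (c + r) mod 4 of the same row
            out[r + 4 * c] = state[r + 4 * ((c + r) % 4)]
    return bytes(out)
```

Both functions are pure wiring and XOR, which is why each fits in a single basic-block, whereas Sub-Bytes needs a 256-entry by 8-bit table (2 kbits) for every byte processed in parallel.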

Rijndael Mix-Columns

Somewhat complicated: can be implemented using table lookups, but we’re out of memory!

Alternative implementation:
  The operation may be expressed in terms of the “xtime()” function.
  The Mix-Columns implementation requires the “xtime()” operation on each byte, followed by 4 XOR operations.

In order to implement “xtime()” efficiently, we modified it as follows. In this form, only 2 basic-blocks are needed to apply “xtime()” to all 16 bytes:
  A single basic-block takes the 128-bit data as input and generates the “xtime()” mask (0 0 0 0 x7 x7 0 x7) for each of the 16 bytes at the permute unit.
  Another basic-block then performs the XOR operation, followed by a left shift (substituting the LSB with x7) at the permute unit.

After generating the output of the “xtime()” function, 4 x 128-bit XOR operations need to be performed:
  4 basic-blocks will be used.
  Note that the Mix-Columns operation is carried out in parallel on all 4 columns.
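The modified xtime() ordering (mask, XOR, then shift with the LSB replaced by x7) can be checked against the textbook definition with a short Python sketch. The function names are made up for the example, and `mix_column` handles a single 4-byte column rather than all four columns in parallel.

```python
def xtime_masked(b):
    """xtime() in the slides' order: build mask, XOR, then shift with LSB := x7."""
    x7 = (b >> 7) & 1
    mask = 0b00001101 if x7 else 0          # pattern 0 0 0 0 x7 x7 0 x7
    return (((b ^ mask) << 1) & 0xFF) | x7


def xtime_reference(b):
    """Textbook xtime(): shift left, then reduce by 0x11B if bit 7 was set."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def mix_column(col):
    """Mix one 4-byte column: xtime() on each byte, then 4 XORs per output byte."""
    xt = [xtime_masked(b) for b in col]
    return [
        xt[i] ^ xt[(i + 1) % 4] ^ col[(i + 1) % 4] ^ col[(i + 2) % 4] ^ col[(i + 3) % 4]
        for i in range(4)
    ]
```

Shifting the mask 0 0 0 0 x7 x7 0 x7 left by one and inserting x7 as the new LSB yields 0 0 0 x7 x7 0 x7 x7, which is the reduction constant 0x1B gated by x7, so the two orderings give the same result for every byte.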

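[Figure: two basic-blocks in sequence, each with a permute unit, 4-bit random logic, 64 carry-save adders (inputs A, B, C, D; outputs O1, O2) and 64-bit registers, connected by 64-bit buses split into 32-bit halves. One block is labelled "Xtime masks for all bytes", the other "XOR operation".]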

Rijndael Implementation summary

Only 8 basic-blocks required:
  2 (1 each) for Add-Round-Key and Shift-Rows
  6 for Mix-Columns (2 for xtime(), 4 for XOR operations)

16 memory-blocks required!
  All memory blocks are used up in a single round!

Inefficient mapping, due to the memory-intensive implementation of Rijndael:
  Only 10% of the logic is used, versus 100% of the memory.

Rijndael Potential Solutions

Add lots of memory!
  At least 10 times more.
  Issues with memory placement.

Consider memory-less implementations of Sub-Bytes:
  Requires GF(2^8) constant multiplication and inverse affine transforms.
  Currently under study as the more efficient and practical option.
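As a point of reference for the memory-less option, the Rijndael S-box can be computed without a table from its standard algebraic definition: inversion in GF(2^8) followed by an affine transform. The Python sketch below shows that definition only. It is not the circuit under study, the inversion by repeated multiplication is chosen for brevity rather than efficiency, and all function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Multiplicative inverse as a^254; maps 0 to 0, as the S-box requires."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(b, k):
    """Rotate an 8-bit value left by k positions."""
    return ((b << k) | (b >> (8 - k))) & 0xFF


def sub_byte(a):
    """S-box without a table: field inversion followed by the affine transform."""
    x = gf_inv(a)
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63
```

The affine part is only rotations and XORs, so the cost of a table-free Sub-Bytes is dominated by the field inversion.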

Serpent

Substitution-permutation cipher comprised of Key Mixing, S-Box Substitution, and Linear Transformation.

S-boxes:
  4 x 4 bit
  32 copies required each round
  16 x 4 x 32 = 2048 bits per round

Serpent

The Linear Transformation step consists of:
  8 fixed permute operations, and
  8 XOR operations

All operands are 32 bits wide.
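The count of 8 fixed permutes and 8 XORs can be seen in a Python sketch of the linear transformation. The rotation and shift amounts below are written from the published Serpent specification as I recall it and should be checked against that document before any real use. The helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by a fixed amount n (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def serpent_lt(x0, x1, x2, x3):
    """Serpent linear transformation on four 32-bit words."""
    x0 = rotl32(x0, 13)                  # permute 1
    x2 = rotl32(x2, 3)                   # permute 2
    x1 ^= x0 ^ x2                        # XORs 1-2
    x3 ^= x2 ^ ((x0 << 3) & MASK32)      # permute 3 (shift), XORs 3-4
    x1 = rotl32(x1, 1)                   # permute 4
    x3 = rotl32(x3, 7)                   # permute 5
    x0 ^= x1 ^ x3                        # XORs 5-6
    x2 ^= x3 ^ ((x1 << 7) & MASK32)      # permute 6 (shift), XORs 7-8
    x0 = rotl32(x0, 5)                   # permute 7
    x2 = rotl32(x2, 22)                  # permute 8
    return x0, x1, x2, x3
```

Every operation is either a fixed rewiring of bits or a 32-bit XOR, so the step is linear over GF(2) and needs no table lookups.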

Serpent

Serpent is an ideal match for our architecture:

8 x 32-bit fixed shifts and rotates can be easily implemented by the permute units of 2 basic-blocks.

Additional 2 basic-blocks required to implement the 8 x 32-bit XOR operations.

128-bit key mixing stage per round would require 1 more basic-block

Total of 5 basic-blocks and 2kbits of memory required per round.

Each round perfectly fits in a single group of our architecture!

16 rounds of Serpent’s total of 32 may be unrolled in our architecture

Other Algorithms

DES
  Implementation of a single round is trivial: a single group may implement multiple rounds!

Twofish
  Complex structure; more time is required to define its implementation on our architecture.
  However, all its basic operations are directly supported.

RC6 and MARS
  Involve complicated operations requiring special-purpose logic:
    Data-dependent rotations
    Multiplication modulo 2^32

Other Algorithms

RC6 and MARS
  This special-purpose logic was not incorporated because:
    The algorithms are more suitable for software implementations than for hardware.
    Lack of support for, and popularity of, these algorithms.
    Addition of special-purpose logic would incur overhead beyond its area, as additional supporting interconnect must be provided.
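To show what the two unsupported operations look like, here is a small Python sketch modelled on one half of an RC6 round: a multiplication modulo 2^32 followed by a rotation whose amount is itself data. It is an illustration of the operation types only, not a complete or verified RC6 implementation, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate left by a variable amount; only the low 5 bits of n are used."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32 if n else x


def rc6_half_round(a, b, d, round_key):
    """RC6-style update of word a from words b and d (all 32-bit values)."""
    t = rotl32((b * (2 * b + 1)) & MASK32, 5)   # multiplication modulo 2^32
    u = rotl32((d * (2 * d + 1)) & MASK32, 5)   # multiplication modulo 2^32
    return (rotl32(a ^ t, u) + round_key) & MASK32   # rotation amount u is data
```

A fixed rotation is just wiring, whereas the rotation by `u` needs a full barrel shifter and the product needs a 32-bit multiplier, which is the kind of special-purpose logic the slides decline to add.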

Comparison with Related Work

Although we cannot provide results based on empirical evaluation, we can present a logical framework for comparison of individual features

Through deductive reasoning, we identify what possible advantages one approach may have over the other, assuming all other factors normalized.

Comparison with Related Work

Comparison with FPGA based implementations

Area Efficiency
  Use of basic gates instead of LUTs
  Basic-blocks with limited flexibility, thus fewer configuration bits
  Basic units (full adders) combined into clusters of 64 and programmed as a single entity: further savings in configuration memory elements

Performance
  Use of basic gates instead of LUTs
  Simpler interconnect, with fewer routing switches
  Hierarchical organization: no long wires (except for bit-broadcast)
  Far smaller configuration data required: faster reconfiguration time

Comparison with Related Work

Comparison with FPGA based implementations: potential pitfalls
  The design dedicates a considerable amount of area to inter-block interconnect.
  Until the actual area can be quantified, we are unsure of our area-efficiency estimates.
  We need to identify the most suitable performance/area tradeoff.

Comparison with Related Work

Comparison with COBRA Architecture
  Uses multiple copies of special-purpose logic blocks, coupled with extremely simple interconnect.
  Low logic utilization; we have more generic blocks.
  Fixed-latency operations
  Intermediate values registered only at the RCE boundary.

Programming Methodology

Reconfigurable Computing devices suffer from the following two critical issues:
  Lack of a comprehensive programming model
  Lack of hardware virtualization

The first issue reflects the difficulty of programming RC architectures such as FPGAs.

The second issue concerns the exposure of hardware resource limitations to the programmer.

Programming Methodology

How COBRA deals with these issues

Essentially a special-purpose programmable architecture rather than a configurable one

VLIW like instructions alleviate some of the programming model related issues

Also resolve the virtualization aspect.


The programming methodology and the impact of the issues mentioned can be seen in terms of a spectrum:

[Figure: spectrum with the labels Microprocessor, COBRA [3], Our Approach, and FPGAs.]


The programming model issue is less severe for us because of our simple, highly specialized architecture.

Hardware virtualization is still a concern.

Programming Methodology

Programming model:

Provide basic primitives that are supported by our architecture.

Programming is to be accomplished by expressing an algorithm using these primitives and interconnecting these primitives together using 32-bit interconnect.

Mapping such a description onto our design should be a trivial software challenge.

Due to special purpose nature, primitives are limited in number and thus programming should be an easy task.


Programming Primitives:
  32-bit Carry Save Adder
  32-bit XOR
  32-bit AND
  32-bit OR
  32, 64, and 128-bit Ripple Carry Adder
  32, 64, and 128-bit Fixed Shifts
  32-bit Rotates and random permutes
  64-bit, 128-bit limited permutes (TBD)
  ANDing a 32-bit value with a single bit
  128-bit shift-register
  Random bit-logic implementation, since each block is also capable of implementing:
    a single 4-input function
    two 3-input functions
    four 2-input functions
  4 global bit-broadcast lines
  32-bit interconnect, point to point
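As a rough picture of what programming with these primitives could look like, the Python sketch below models a few of them as functions and composes two of them into a three-operand addition. The function names and the composition are hypothetical; the slides do not define a concrete syntax for the primitives.

```python
MASK32 = 0xFFFFFFFF


def xor32(a, b):
    """32-bit XOR primitive."""
    return (a ^ b) & MASK32


def and_bit32(a, bit):
    """AND a 32-bit value with a single bit (e.g. a broadcast line)."""
    return a if bit else 0


def csa32(a, b, c):
    """32-bit carry-save adder: returns (sum word, carry word)."""
    s = a ^ b ^ c
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return s, carry


def rca32(a, b):
    """32-bit ripple-carry adder (result taken modulo 2^32)."""
    return (a + b) & MASK32


def add3_mod32(a, b, c):
    """Example composition: a CSA feeding a ripple-carry adder."""
    s, carry = csa32(a, b, c)
    return rca32(s, carry)
```

An algorithm description in this style is just a graph of primitive calls joined by 32-bit values, which is the form the programming model above proposes to map onto the fabric.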

Conclusion: Work in Progress

The following areas of the design are still under consideration and not completely defined yet:
  Configurable memory-block architecture
  VLSI design to evaluate performance metrics and fine-tune the logical design
    i.e. if found to be too slow, reduce the number of switches, use longer wires, minimize the amount of interconnect to that which is necessary, etc.

Furthermore, the iterative process of evaluating more symmetric-key algorithms and refining the architecture is still in progress.
