ece 545 lecture 8b hardware architectures of secret-key ... · - outer-round pipelining -...

George Mason University

Hardware Architectures of Secret-Key Block Ciphers and Hash Functions

ECE 545 Lecture 8b


Recommended reading

•  K. Gaj and P. Chodowiec, "FPGA and ASIC Implementations of AES," Chapter 10 in C.K. Koc (Ed.), Cryptographic Engineering
   Section 10.4: Parameters of Hardware Implementations
   Section 10.5: Hardware Architectures of Symmetric Block Ciphers

•  E. Homsirikamol, M. Rogawski, and K. Gaj, "Throughput vs. Area Trade-offs in High-Speed Architectures of Five Round 3 SHA-3 Candidates Implemented Using Xilinx and Altera FPGAs," in LNCS 6917, Cryptographic Hardware and Embedded Systems - CHES 2011, Nara, Japan, Sep. 28-Oct. 1, pp. 491-506.
   Sections 1-4

Secret-key Ciphers

[Figure: a cipher transforms an N-bit message into an N-bit ciphertext under a K-bit cryptographic key.]

Current American Standards: AES vs. Triple DES

              Input      Output     Key
Triple DES    64 bits    64 bits    168 bits
AES           128 bits   128 bits   128, 192, and 256 bits

Typical Flow Diagram of a Secret-Key Block Cipher

1. Initial transformation, using Round Key[0]
2. i:=1
3. Cipher Round, using Round Key[i]; i:=i+1; repeated #rounds times (loop test: i<#rounds?)
4. Final transformation, using Round Key[#rounds+1]

Top level block diagram

[Figure: two versions of the top-level block diagram. Both contain an input interface, an encryption/decryption unit, a key scheduling unit, a memory of internal keys, an output interface, and a control unit that drives them. The first version has a shared input/key port; the second has separate input and key ports, with key setup (key expansion, key scheduling) fed from the key port.]

Primary parameters of hardware implementations for secret-key block ciphers

Latency: time to encrypt/decrypt a single block of data (Mi → Ci).

Throughput: number of bits encrypted/decrypted in a unit of time (Mi, Mi+1, Mi+2 → Ci, Ci+1, Ci+2).

$$\text{Throughput} = \frac{\text{Block\_size} \cdot \text{Number\_of\_blocks\_processed\_simultaneously}}{\text{Latency}}$$
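The throughput formula above can be checked numerically with a minimal sketch. The block size, latency, and number of blocks in flight used here are hypothetical example values, not figures from the lecture.

```python
def throughput_mbit_s(block_size_bits, blocks_in_flight, latency_ns):
    """Throughput = block_size * blocks_processed_simultaneously / latency.

    Latency is given in nanoseconds; 1 bit/ns equals 1000 Mbit/s.
    """
    if latency_ns <= 0:
        raise ValueError("latency must be positive")
    bits_per_ns = block_size_bits * blocks_in_flight / latency_ns
    return bits_per_ns * 1000.0


if __name__ == "__main__":
    # Hypothetical numbers: 128-bit block, 100 ns latency.
    print(throughput_mbit_s(128, 1, 100.0))   # one block at a time
    print(throughput_mbit_s(128, 10, 100.0))  # ten blocks processed simultaneously
```

With the latency fixed, throughput scales linearly with the number of blocks processed simultaneously, which is what the pipelined architectures later in the lecture exploit.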

Dependence of the encryption time on latency and throughput

$$\text{Encryption time} = \text{Latency} + \frac{\text{Message\_size} - \text{Block\_size}}{\text{Throughput}}$$

[Figure: encryption time (Time) plotted against message size.]
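A small sketch of how encryption time depends on latency and throughput, assuming the relation Encryption time = Latency + (Message_size − Block_size) / Throughput. That relation is reconstructed here from the figure labels, and all numeric values are hypothetical.

```python
def encryption_time_ns(message_bits, block_bits, latency_ns, throughput_bits_per_ns):
    """Time to encrypt a whole message.

    The first block costs the full latency; every further bit arrives
    at the steady-state throughput.
    """
    if message_bits < block_bits:
        raise ValueError("message must contain at least one block")
    return latency_ns + (message_bits - block_bits) / throughput_bits_per_ns


if __name__ == "__main__":
    # Hypothetical unit: 128-bit block, 100 ns latency, 1.28 bit/ns throughput.
    for size in (128, 1280, 12800):
        print(size, encryption_time_ns(size, 128, 100.0, 1.28))
```

For a single block the time equals the latency; for long messages the throughput term dominates.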

Basic iterative architecture

[Figure: a multiplexer feeds a register, followed by combinational logic implementing one round and supplied with the round key.]

Basic architecture: Timing

[Figure: timing diagram with CLK, IN (P1, P2, P3), and OUT (C1, C2); processing one block takes #rounds · clock_period.]

Block vs. stream ciphers

Block cipher

•  Input M1, M2, …, Mn; key K; output C1, C2, …, Cn
•  Ci = fK(Mi)
•  Every block of ciphertext is a function of only one corresponding block of plaintext

Stream cipher (with internal state IS)

•  Input m1, m2, …, mn; key K; output c1, c2, …, cn
•  ci = fK(mi, ISi)
•  ISi+1 = gK(mi, ISi)
•  Every block of ciphertext is a function of the current block of plaintext and the current internal state of the cipher

Typical stream cipher

[Figure: the sender and the receiver each run a pseudorandom key generator initialized with the key and an initialization vector (seed). The generator produces the keystream ki. The sender combines plaintext mi with ki to obtain ciphertext ci; the receiver combines ci with ki to recover mi.]

ECB (Electronic CodeBook) mode

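Electronic CodeBook Mode – ECB Encryption

Ci = EK(Mi) for i=1..N

[Figure: each plaintext block M1, M2, M3, …, MN-1, MN is encrypted independently by E under key K, giving C1, C2, C3, …, CN-1, CN.]

Electronic CodeBook Mode – ECB Decryption

Mi = DK(Ci) for i=1..N

[Figure: each ciphertext block C1, C2, C3, …, CN-1, CN is decrypted independently by D under key K, giving M1, M2, M3, …, MN-1, MN.]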

Counter Mode

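Counter Mode - CTR Encryption

ci = mi ⊕ ki
ki = EK(IV+i-1) for i=1..N

[Figure: counter values IV, IV+1, IV+2, …, IV+N-2, IV+N-1 are each encrypted by E under key K to give keystream blocks k1, k2, k3, …, kN-1, kN, which are XORed with m1, …, mN to give c1, …, cN.]

Counter Mode - CTR Decryption

mi = ci ⊕ ki
ki = EK(IV+i-1) for i=1..N

[Figure: the same structure as for encryption, with c1, …, cN as inputs and m1, …, mN as outputs. Only the encryption function E is used.]

Counter Mode - CTR

[Figure: hardware view for sender and receiver. A counter loaded with IV feeds the IN port of E (key K); the OUT port is XORed with mi to give ci on the sender side, and with ci to give mi on the receiver side. Bit positions 1..L are marked on the data paths.]

IS_1 = IV
ci = EK(IS_i) ⊕ mi
IS_{i+1} = IS_i + 1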
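The counter-mode equations can be sketched in a few lines. The block function below is a stand-in built from SHA-256 and is not AES or any real cipher; `toy_block_encrypt`, the key, and the IV are hypothetical names and values chosen only for illustration.

```python
import hashlib

BLOCK_BYTES = 16  # 128-bit blocks, as for AES


def toy_block_encrypt(key, block):
    """Stand-in for E_K (an assumption of this sketch, NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def xor_bytes(a, b):
    # zip() stops at the shorter input, which handles a short final block.
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_process(key, iv, data):
    """c_i = m_i XOR E_K(IV + i - 1); the same routine also decrypts."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK_BYTES):
        counter = (iv + offset // BLOCK_BYTES) % (1 << (8 * BLOCK_BYTES))
        keystream = toy_block_encrypt(key, counter.to_bytes(BLOCK_BYTES, "big"))
        out += xor_bytes(data[offset:offset + BLOCK_BYTES], keystream)
    return bytes(out)


if __name__ == "__main__":
    key, iv = b"0123456789abcdef", 42
    ciphertext = ctr_process(key, iv, b"counter mode needs only E, never D")
    print(ctr_process(key, iv, ciphertext))
```

Each keystream block depends only on the counter value, so all blocks could be computed in parallel, and decryption reuses the encryption function.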

CBC (Cipher Block Chaining) Mode

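Cipher Block Chaining Mode - CBC Encryption

ci = EK(mi ⊕ ci-1) for i=1..N
c0 = IV

[Figure: each plaintext block m1, m2, m3, …, mN-1, mN is XORed with the previous ciphertext block (IV for the first block) and then encrypted by E, giving c1, c2, c3, …, cN-1, cN.]

Cipher Block Chaining Mode - CBC Decryption

mi = DK(ci) ⊕ ci-1 for i=1..N
c0 = IV

[Figure: each ciphertext block c1, c2, c3, …, cN-1, cN is decrypted by D and then XORed with the previous ciphertext block (IV for the first block), giving m1, m2, m3, …, mN-1, mN.]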
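A minimal sketch of the CBC equations, using a toy 4-round Feistel construction as a stand-in for the block cipher. The toy cipher (`toy_encrypt`, `toy_decrypt`) is an assumption of this example, is not DES or AES, and offers no security; it exists only so that E and D are true inverses.

```python
import hashlib

BLOCK = 16
HALF = BLOCK // 2


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _f(key, rnd, half):
    # Toy round function (hypothetical): keyed hash truncated to half a block.
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:HALF]


def toy_encrypt(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in range(4):
        left, right = right, _xor(left, _f(key, rnd, right))
    return left + right


def toy_decrypt(key, block):
    left, right = block[:HALF], block[HALF:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _f(key, rnd, left)), left
    return left + right


def cbc_encrypt(key, iv, blocks):
    prev, out = iv, []
    for m in blocks:
        prev = toy_encrypt(key, _xor(m, prev))  # c_i = E_K(m_i XOR c_{i-1})
        out.append(prev)
    return out


def cbc_decrypt(key, iv, blocks):
    prev, out = iv, []
    for c in blocks:
        out.append(_xor(toy_decrypt(key, c), prev))  # m_i = D_K(c_i) XOR c_{i-1}
        prev = c
    return out


if __name__ == "__main__":
    key, iv = b"k" * 16, bytes(16)
    ct = cbc_encrypt(key, iv, [b"A" * 16, b"A" * 16, b"B" * 16])
    print(cbc_decrypt(key, iv, ct))
```

In `cbc_encrypt`, block i cannot start until block i-1 has finished. That data dependency is the reason feedback modes cannot keep a pipeline full with a single message, whereas `cbc_decrypt` has no such dependency between calls to D.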

Primary factor in choosing the encryption/decryption unit architecture

Symmetric-key cipher mode of operation:

1. Non-feedback cipher modes: ECB, counter mode

2. Feedback cipher modes: CBC, CFB, OFB

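Non-feedback Counter Mode - CTR

Ci = Mi ⊕ AES(IV+i) for i=0..N

[Figure: counter values IV, IV+1, IV+2, …, IV+N-1, IV+N are each encrypted by E and XORed with the corresponding message block to give the ciphertext block. No block depends on any other block.]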

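Feedback cipher modes - CBC

C1 = AES(M1 ⊕ IV)
Ci = AES(Mi ⊕ Ci-1) for i=2..N

[Figure: message blocks M1, M2, M3, …, MN-1, MN are each XORed with the previous ciphertext block (IV for the first block) and encrypted by E, giving C1, C2, C3, …, CN-1, CN. Each block depends on the result for the previous block.]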

Feedback cipher modes: CBC, CFB, OFB

k-rounds Loop Unrolling

[Figure: a multiplexer and a register followed by combinational logic implementing k rounds (round 1, round 2, …, round k).]

Loop Unrolling: Timing

[Figure: timing diagram with CLK, IN (P1, P2, P3), and OUT (C1, C2); processing one block takes (#rounds/k) · extended_clock_period.]

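Loop Unrolling: Speed vs. Area

$$\text{speed} = \text{speed}_{\text{basic}} \cdot \frac{1 + \tau}{1 + \tau / k}, \qquad \tau \ll 1$$

[Figure: speed vs. area plot comparing the basic architecture with loop unrolling for k=2, k=3, k=4, k=5.]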
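To see why loop unrolling buys little speed, the formula can be evaluated for a few values of k. This sketch assumes that τ is the ratio of the register and multiplexer overhead to the delay of one round; the slide does not define τ, so that reading and the value 0.1 are assumptions.

```python
def unrolling_speedup(k, tau):
    """speed / speed_basic = (1 + tau) / (1 + tau / k)."""
    if k < 1 or tau < 0:
        raise ValueError("need k >= 1 and tau >= 0")
    return (1.0 + tau) / (1.0 + tau / k)


if __name__ == "__main__":
    tau = 0.1  # hypothetical overhead ratio
    for k in (1, 2, 3, 4, 5):
        print(k, round(unrolling_speedup(k, tau), 4))
```

The speedup is bounded by 1 + τ no matter how large k becomes, while the area grows roughly in proportion to k.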

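Architectures suitable for feedback modes

[Figure: three variants. (1) Basic iterative: MUX, register, and combinational logic for one round. (2) Partial loop unrolling: MUX followed by round 1, round 2, …, round K. (3) Full loop unrolling: round 1, round 2, …, round #rounds.]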

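Decreasing area by resource sharing

[Figure: Before: two copies of function F process D0 and D1 to give D0’ and D1’. After: a single copy of F, with a multiplexer and a register, processes D0 and D1 to give D0’ and D1’.]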

Resource Sharing: Speed vs. Area

[Figure: Throughput vs. Area plot comparing the basic architecture with resource sharing.]

Non-Feedback Cipher Modes: ECB, counter, OCB

Comparison for non-feedback cipher modes, e.g. Counter Mode - CTR

Ci = Mi ⊕ AES(IV+i) for i=0..N

[Figure: counter values IV, IV+1, IV+2, …, IV+N-1, IV+N are each encrypted by E and XORed with the corresponding message block to give the ciphertext block.]

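OCB

[Figure: OCB mode. Recoverable labels: message blocks M1, M2, …, MN-1, MN; ciphertext blocks C1, C2, …, CN-1, CN; offsets Z1, Z2, …, ZN-1, ZN with Zi = f(L, R); block cipher E; IV; 0; L; R; length; g(L); control sum; tag T of τ bits.]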

Increasing speed by parallel processing

[Figure: several identical units operating in parallel.]

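Increasing speed using pipelining

[Figure: Cipher 1 and Cipher 2, with different numbers of rounds (labels: round 1 … round 10 and round 1, round 2, … round 16), both pipelined to the same target clock period, e.g., 20 ns.]

$$\text{Speed} = \frac{\text{block size}}{\text{target\_clock\_period}}$$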

Pipelined operation of the encryption unit

clock cycle   stage 1   stage 2   stage 3   stage 4
     1          B1
     2          B2        B1
     3          B3        B2        B1
     4          B4        B3        B2        B1
     5          B5        B4        B3        B2
     6          B6        B5        B4        B3
     7          B7        B6        B5        B4
     8          B8        B7        B6        B5
     9          B9        B8        B7        B6
    10          B10       B9        B8        B7
    11          B11       B10       B9        B8
    12          B12       B11       B10       B9
    13          B13       B12       B11       B10
    14          B14       B13       B12       B11
    15          B15       B14       B13       B12
    16          B16       B15       B14       B13
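The pipeline occupancy shown for clock cycles 1 to 16 can be reproduced with a short simulation. This is a sketch that assumes a 4-stage pipeline accepting one new block per clock cycle; the function name is hypothetical.

```python
def pipeline_occupancy(num_blocks, num_stages, num_cycles):
    """Return, for each clock cycle, which block sits in each pipeline stage.

    Block Bn enters stage 1 in cycle n and advances one stage per cycle.
    Empty stages are reported as "".
    """
    rows = []
    for cycle in range(1, num_cycles + 1):
        row = []
        for stage in range(1, num_stages + 1):
            block = cycle - stage + 1
            row.append(f"B{block}" if 1 <= block <= num_blocks else "")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_occupancy(16, 4, 16), start=1):
        print(f"{cycle:2d}  " + "  ".join(f"{b:4s}" for b in row))
```

After the first three cycles the pipeline is full, and one block completes every clock cycle even though each block still spends four cycles in the unit.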

Full outer-round pipelining

[Figure: round 1 through round #rounds in sequence, each round forming one pipeline stage, with #rounds registers.]

Total # of pipeline stages = #rounds

Full mixed inner- and outer-round pipelining

[Figure: round 1, round 2, …, round #rounds in sequence, each round divided into k pipeline stages by k registers.]

Total # of pipeline stages = #rounds·k

k-stage Outer-Round Pipelining

[Figure: a multiplexer followed by k rounds, with register 1, register 2, …, register k; pipeline stage 1 = round 1, pipeline stage 2 = round 2, …, pipeline stage k = round k.]

Outer-Round Pipelining: Timing

[Figure: timing diagram with CLK, IN (P1, P2, P3, P4, P5, P6), and OUT (C1, C2, C3, C4); the latency of one block is #rounds · clock_period.]

Outer-Round Pipelining: Speed vs. Area

[Figure: speed vs. area plot showing the basic architecture and outer-round pipelining for k=2, k=3, k=4, k=5, with separate curves for non-feedback modes and feedback modes.]

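Outer-round Pipelining

[Figure: three variants. (1) One round, no pipelining: MUX, register, combinational logic. (2) Partial outer-round pipelining: MUX and K registers, with round 1 through round K each forming one pipeline stage. (3) Full outer-round pipelining: round 1 through round #rounds, each forming one pipeline stage.]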

k-stage Inner-Round Pipelining

[Figure: a multiplexer followed by one round divided by register 1, register 2, …, register k into pipeline stage 1, pipeline stage 2, …, pipeline stage k.]

Inner-Round Pipelining: Timing

[Figure: timing diagram with CLK, IN (P1, P2, P3, P4, P5, P6), and OUT (C1, C2, C3, C4); the latency of one block is #rounds · (k · reduced_clock_period).]

Inner-Round Pipelining: Speed vs. Area

[Figure: speed vs. area plot showing the basic architecture and inner-round pipelining for k=2, k=3, k=4, k=5, with separate curves for non-feedback modes and feedback modes.]

Mixed Inner- and Outer-round Pipelining

[Figure: four variants.
a) Basic architecture: MUX, register, combinational logic for one round, no pipelining.
b) Inner-round pipelining: MUX, k registers, one round = k pipeline stages.
c) Partial mixed inner- and outer-round pipelining: MUX, rounds 1 through K, each round = k pipeline stages.
d) Full mixed inner- and outer-round pipelining: rounds 1 through #rounds, each round = k pipeline stages, k registers per round.]

Comparison of the traditional and new design methodologies

[Figure: Throughput (Speed [Mbit/s]) vs. Area [CLB slices]. Curves: basic architecture; inner-round pipelining (k=2, …, kopt); outer-round pipelining (K=2, K=3, K=4); mixed inner- and outer-round pipelining (K=2, K=3).]

Choosing optimum architecture for non-feedback cipher modes

Latency vs. area dependence for the new design methodology

[Figure: Latency vs. Area plot. Curves: basic architecture; inner-round pipelining (k=2, …, kopt); outer-round pipelining (K=2, K=3, K=4, K=5); mixed inner- and outer-round pipelining (K=2, K=3).]

Limits on the minimum clock period after pipelining

1. Delay of a single round divided by k = number of internal pipeline stages
2. Delay of the longest indivisible operation
3. Delays within the control unit
4. Maximum latency
5. Maximum input/output bandwidth

[Figure: a round made of operations op1, op2, op3, op4, op5, divided by registers r1, r2 for k=2 and r1, r2, r3, r4 for k=4, with TCLKmin marked in each case; a second diagram adds a control unit with register rc and counters cntr1, cntr2, cntr3, cntr4.]
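The first three limits can be combined into a simple lower bound on the clock period. The sketch below assumes that the round delay is the sum of its operation delays, that register overhead is ignored, and that the three limits combine as a maximum; these are simplifications made for the example, and all delay values are hypothetical.

```python
def min_clock_period(op_delays_ns, k, control_delay_ns):
    """Lower bound on the clock period of a round split into k pipeline stages.

    Combines: (1) round delay / k, (2) the longest indivisible operation,
    (3) the delay within the control unit.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    round_delay = sum(op_delays_ns)
    return max(round_delay / k, max(op_delays_ns), control_delay_ns)


if __name__ == "__main__":
    ops = [2.0, 3.0, 1.5, 2.5, 1.0]  # hypothetical delays of op1..op5 in ns
    for k in (1, 2, 4, 10):
        print(k, min_clock_period(ops, k, control_delay_ns=2.0))
```

Beyond some k the bound stops improving, because the longest indivisible operation (or the control unit) sets the clock period rather than the round delay divided by k.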

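DES encryption/decryption core

[Figure: block diagram of the DES/3DES core with a key schedule unit and a key memory. Recoverable signals and widths:
•  clock, reset, encrypt/decrypt, DES/3DES
•  data input (64), data available, data read
•  key input (64), key available, key read
•  IV input (64), IV available, IV read
•  data output (64), write, full
•  key (56) into the key schedule
•  read bank (4), write bank (4), round number (4), round key (48), key choice (2)]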

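AES encryption/decryption core

[Figure: block diagram of the AES core with a key schedule unit and a key memory. Recoverable signals and widths:
•  clock, reset, encrypt/decrypt
•  data input (128), data available, data read
•  key input (64), key available, key read
•  IV input (128), IV available, IV read
•  data output (128), write, full
•  key material (64) into the key schedule
•  read bank (4), write bank (4), round number (4), round key (128), cycle number (6), key size (2)
The label DES/3DES also appears in the extracted diagram.]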

Performance of alternative architectures: in non-feedback cipher modes (ECB, counter)

[Figure: speed vs. area plot. Curves: basic architecture; loop unrolling (k=2, k=3, k=4, k=5); inner-round pipelining (k=2, k=3, k=4, k=5); outer-round pipelining (k=2, k=3, k=4, k=5); resource sharing.]

Performance of alternative architectures: in feedback cipher modes (CBC, CFB, OFB)

[Figure: speed vs. area plot. Curves: basic architecture; loop unrolling (k=2, k=3, k=4, k=5); inner-round pipelining; outer-round pipelining; resource sharing.]

Hash Functions

Hash Function

[Figure: a hash function h maps a message m of arbitrary length to a hash value h(m) of fixed length.]

Collision Resistance: it is computationally infeasible to find two different messages m and m’ such that h(m)=h(m’).

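Hash Functions in Digital Signature Schemes

[Figure: Alice applies the hash function to the message and then the public key cipher, with Alice’s private key, to the hash value to produce the signature. She sends the message and the signature. Bob applies the hash function to the received message (hash value 1) and the public key cipher, with Alice’s public key, to the signature (hash value 2), then compares the two values (yes/no).]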

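General scheme for constructing a secure hash function: Merkle–Damgård scheme

[Figure: message m is padded and its bit length appended, giving M, which is split into blocks M1, M2, …, Mt. Starting from H0 = IV, the compression function f computes H1, H2, …, Ht, one block at a time. The output transformation g gives h(m) from Ht.]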
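The chaining structure of the Merkle–Damgård scheme can be sketched as follows. The compression function here is a stand-in built from SHA-256 (`toy_compress` is a hypothetical name), the padding rule is one common choice, and the result is not equal to any standard hash; the point is only the iteration H_i = f(H_{i-1}, M_i).

```python
import hashlib
import struct

BLOCK = 64  # message block size in bytes (assumed for this sketch)


def pad(message):
    """Append 0x80, zero bytes, and the 64-bit big-endian bit length."""
    bit_len = 8 * len(message)
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % BLOCK)
    return padded + struct.pack(">Q", bit_len)


def toy_compress(h, block):
    """Stand-in compression function f (an assumption, not a standard one)."""
    return hashlib.sha256(h + block).digest()


def md_hash(message, iv=b"\x00" * 32, g=lambda h: h):
    m = pad(message)
    h = iv                                   # H0 = IV
    for i in range(0, len(m), BLOCK):
        h = toy_compress(h, m[i:i + BLOCK])  # H_i = f(H_{i-1}, M_i)
    return g(h)                              # output transformation g


if __name__ == "__main__":
    print(md_hash(b"abc").hex())
```

Because H_i depends on H_{i-1}, the blocks of a single message must be processed sequentially, which is why the hash architectures that follow need several independent messages to benefit from parallel units or pipelining.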

Sponge Scheme

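Basic iterative architecture of SHA-1 and SHA-2

[Figure: register R, initialized with IV, feeds the step function (Step t, with inputs Wt and Kt); register H with clear input CLR and an adder (+) accumulates the hash value.]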

Architecture with Multiple Processing Units

[Figure: k copies of the basic iterative unit (R, H, +, IV, CLR, Step t with Wt and Kt), each processing its own data stream, Data Stream 1 through Data Stream k, with a common output OUT.]

Features of architecture with multiple processing units

Pros
•  Throughput increases by a factor of k

Cons
•  Latency the same as for the basic architecture
•  Area increases by a factor of k
•  Requires k independent data streams (messages)

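Pipelined architecture

[Figure: step t divided into two pipeline stages (step t, stage 1 and step t, stage 2) by registers R1 and R2, with inputs Kt and Wt, initial value IV, adder (+), and register H.]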

Features of the pipelined architecture

Pros
•  Throughput increases by a factor close to k
•  Area increases by a factor smaller than k

Cons
•  Latency the same as for the basic architecture
•  Requires k independent data streams (messages)

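Unrolled architecture

[Figure: register R, initialized with IV, feeds k consecutive steps in combinational logic: Step t (Wt, Kt), Step t+1 (Wt+1, Kt+1), …, Step t+k-1 (Wt+k-1, Kt+k-1); register H with CLR and an adder (+) accumulates the hash value.]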

Features of the unrolled architecture

Pros
•  Reduces latency and increases throughput
•  Requires only one data stream

Cons
•  Area may increase substantially compared to the basic architecture

Starting Point: Basic Iterative Architecture

•  datapath width = state size
•  one clock cycle per one round

Currently the most common architecture used to implement SHA-1, SHA-2, and many other hash functions.

[Figure: Throughput vs. Area plot; the basic architecture x1 is at area A and throughput Th.]

Horizontal Folding - /2(h)

•  datapath width = state size
•  two clock cycles per one round

Typically Throughput/Area ratio increases

[Figure: Throughput vs. Area plot showing x1 at (A, Th) and /2(h).]

Vertical Folding - /2(v)

•  datapath width = state size/2
•  two clock cycles per one round/step

Typically Throughput/Area ratio decreases

[Figure: Throughput vs. Area plot showing x1 at (A, Th) and /2(v).]

Vertical Folding with the State Kept in Memory: BLAKE /4(h)/4(v)-m

•  datapath width = state size/4
•  16 clock cycles per one round

Throughput/Area ratio increases

[Figure: Throughput vs. Area plot showing x1 at (A, Th) and /4(h)/4(v)-m.]

Unrolling - x2

•  datapath width = state size
•  one clock cycle per two rounds

Typically Throughput/Area ratio decreases

[Figure: Throughput vs. Area plot showing x1 at (A, Th) and x2, with area 2A marked.]

Efficient Unrolling - x2

•  datapath width = state size
•  one clock cycle per two rounds

Sometimes Throughput/Area ratio increases

[Figure: Throughput vs. Area plot showing x1 at (A, Th), x2, and x2-efficient, with area 2A marked.]

Multiple Packets Available for Parallel Processing

[Figure: several input streams N1, N2, N3, …, NX, each carrying a sequence of packets of mixed sizes (40 B, 64 B, 576 B, 1500 B).]

Typical sizes of packets: 40 B – 1500 B
1500 B = Maximum Transmission Unit (MTU) for Ethernet v2

Parallel Processing Using Multi-Unit Architecture – MU2

Typically Throughput/Area ratio stays the same

[Figure: Throughput vs. Area plot showing x1 at (A, Th) and MU2 at (2A, 2Th).]

Unrolled Architecture with Pipelining - x2-PPL2

Typically Throughput/Area ratio stays almost the same

[Figure: Throughput vs. Area plot showing x1 at (A, Th) and x2-PPL2, with 2A and 2Th marked.]

Basic Architecture with Pipelining - x1-PPL2

Typically Throughput/Area ratio increases

[Figure: Throughput vs. Area plot showing x1 at (A, Th) and x1-PPL2, with 2A and 2Th marked.]

BLAKE-256 in Virtex 5

Why Interface Matters?

•  Pin limit

Total number of I/O ports ≤ total number of FPGA I/O pins

•  Support for the maximum throughput

Time to load the next message block ≤ time to process the previous block
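The throughput condition can be checked with a short calculation. This sketch assumes that the core reads one w-bit word per clock cycle and that loading and processing run on the same clock; the block size, bus widths, and cycle count are hypothetical.

```python
def load_cycles(block_bits, bus_width_bits):
    """Clock cycles needed to load one message block, one w-bit word per cycle."""
    return -(-block_bits // bus_width_bits)  # ceiling division


def supports_max_throughput(block_bits, bus_width_bits, process_cycles):
    """True if loading the next block is hidden behind processing the current one."""
    return load_cycles(block_bits, bus_width_bits) <= process_cycles


if __name__ == "__main__":
    # Hypothetical core: 512-bit message block processed in 65 clock cycles.
    for w in (64, 32, 8, 4):
        print(w, load_cycles(512, w), supports_max_throughput(512, w, 65))
```

A narrower bus saves pins, which helps with the pin limit, but once the load time exceeds the processing time the interface rather than the datapath limits throughput.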

Interface: Two possible solutions

1. Length of the message communicated at the beginning
   + easy to implement passive source circuit
   − area overhead for the counter of message bits

2. Dedicated end of message port
   − more intelligent source circuit required
   + no need for internal message bit counter

[Figure: SHA core input for the two solutions. Labels: msg_bitlen, message, zero_word for the first; message and end_of_msg for the second.]

SHA Core: Interface & Typical Configuration

•  SHA core is an active component; surrounding FIFOs are passive and widely available
•  Input interface is separate from the output interface
•  Processing the current block, reading the next block, and storing the result for the previous message can all be done in parallel

[Figure: Input FIFO → SHA core → Output FIFO, all driven by clk and rst, with w-bit data paths.
•  External side of the input FIFO: ext_idata (din), fifoin_full (full), fifoin_write (write)
•  Input FIFO to SHA core: idata (dout to din), fifoin_empty (empty to src_ready), fifoin_read (read from src_read)
•  SHA core to output FIFO: odata (dout to din), fifoout_full (full to dst_ready), fifoout_write (write from dst_write)
•  External side of the output FIFO: ext_odata (dout), fifoout_empty (empty), fifoout_read (read)]

SHA Core Interface

[Figure: SHA core with ports din (w bits), dout (w bits), src_ready, src_read, dst_ready, dst_write, clk, and rst.]

Communication Protocol for Unpadded Messages

Each row below is a sequence of w-bit words.

a) Without message splitting:
   msg_bitlen
   message
   zero_word

b) With message splitting:
   seg_0_bitlen
   seg_0
   seg_1_bitlen
   seg_1
   . . .
   seg_n-1_bitlen
   seg_n-1
   zero_word

Communication Protocol for Unpadded Messages Without Message Splitting

Each row below is a sequence of w-bit words.

   last = 1 | msg_len_bp
   message

msg_len_bp – message length before padding [bits]

Communication Protocol for Unpadded Messages With Message Splitting

Each row below is a sequence of w-bit words.

   last = 0 | seg_0_len_bp
   seg_0
   last = 0 | seg_1_len_bp
   seg_1
   . . .
   last = 1 | seg_n-1_len_bp
   seg_n-1

seg_i_len_bp – segment i length before padding [bits]

* For all i < n-1, the length of segment i is assumed to be a multiple of the message block size, b (characteristic of each function), and thus also of the word size, w. The last segment cannot consist of only padding bits; it must include at least one message bit.
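A sketch of how a source circuit might format a message under the protocol with message splitting. The slide gives only "last | seg_i_len_bp" for the header word; placing the last flag in the most significant bit, using the remaining w-1 bits for the length, zero-filling the final data word, and the segment size are all assumptions of this example. The function name is hypothetical.

```python
def encode_segments(message, w_bits=32, block_bits=512, blocks_per_segment=1):
    """Return the list of w-bit words sent to the core for one message.

    Every segment except the last is a whole number of message blocks.
    Header word layout (assumed): MSB = last flag, lower bits = length in bits.
    """
    if not message:
        raise ValueError("this sketch assumes at least one message bit")
    w_bytes = w_bits // 8
    seg_bytes = blocks_per_segment * block_bits // 8
    segments = [message[i:i + seg_bytes] for i in range(0, len(message), seg_bytes)]
    words = []
    for index, seg in enumerate(segments):
        last = 1 if index == len(segments) - 1 else 0
        words.append((last << (w_bits - 1)) | (8 * len(seg)))  # header word
        seg += b"\x00" * (-len(seg) % w_bytes)                 # fill the final word
        words.extend(int.from_bytes(seg[j:j + w_bytes], "big")
                     for j in range(0, len(seg), w_bytes))
    return words


if __name__ == "__main__":
    words = encode_segments(bytes(range(130)))
    print(len(words), [hex(x) for x in (words[0], words[17], words[34], words[35])])
```

In this example a 130-byte message becomes two full 512-bit segments with last = 0 and a final 16-bit segment with last = 1, so the core never needs an internal counter of total message bits.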

[Figure: SHA core interface with a separate I/O clock. Ports: din (w bits), dout (w bits), src_ready, src_read, dst_ready, dst_write, clk, rst, and io_clk.]

[Figure: Input FIFO → SHA core → Output FIFO configuration with the same signals as in "SHA Core: Interface & Typical Configuration" (ext_idata, fifoin_full, fifoin_write, idata, fifoin_empty, fifoin_read, odata, fifoout_full, fifoout_write, ext_odata, fifoout_empty, fifoout_read). The SHA core is clocked by clk; the FIFOs also receive io_clk.]

[Figure: the same configuration, with the FIFO clock inputs labeled "clk or io_clk".]
