AES Microcode Implementation in IXP2400 and a Study of Reconfigurable Crypto Unit
Piyush Ranjan Satapathy
CS203B Class Project Presentation

Post on 20-Dec-2015


Road Map

- AES Algorithm Overview
- IXP2400 Platform: A Quick Look
- Microcode Overview
- Implementation of AES
- Experimental Results
- Reconfigurable Crypto Unit of Intel IXP2850

Algorithm Overview

- Designed by Daemen and Rijmen and submitted to the NIST AES selection process
- Originally called Rijndael
- Symmetric-key block substitution cipher
- Replacement for DES
- Successful field testing since inception
- Three bit modes (128, 192, and 256 bits)
- State defined as a 4 x 4 array of 16 bytes
- Key size is 16, 24, or 32 bytes
- A byte is represented as a polynomial in the Galois field GF(2^8)

Bit Mode   Key Length (Nk words)   State Size (Nb words)   Number of Rounds (Nr)
128        4                       4                       10
192        6                       4                       12
256        8                       4                       14
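The table above follows a simple rule, sketched here in Python. The function name `aes_parameters` is made up for illustration; the round-count formula Nr = max(Nk, Nb) + 6 is the one Rijndael defines, with Nb fixed at 4 for AES.

```python
def aes_parameters(key_bits):
    """Return (Nk, Nb, Nr) for an AES key length given in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES supports 128-, 192-, or 256-bit keys")
    nk = key_bits // 32       # key length in 32-bit words
    nb = 4                    # AES fixes the state at four words (16 bytes)
    nr = max(nk, nb) + 6      # number of rounds
    return nk, nb, nr


if __name__ == "__main__":
    for bits in (128, 192, 256):
        print(bits, aes_parameters(bits))
```

Running it reproduces the three table rows.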

Stages of AES Algorithm

Detailed view of round n. Each round performs the following operations:

- Non-linear Layer: no linear relationship between the input and output of a round
- Linear Mixing Layer: guarantees high diffusion over multiple rounds; very small correlation between bytes of the round input and the bytes of the output
- Key Addition Layer: bytes of the input are simply XORed with the expanded round key

Result from round n-1 -> ByteSub -> ShiftRow -> MixColumn -> AddRoundKey (with round key Kn) -> pass to round n+1

1. SubBytes Function

- Multiplicative inverse in GF(2^8) followed by an affine transformation
- Direct implementation is complex
- Easily performed by a 16 x 16 LUT (ROM) with 256 one-byte entries
- Simple byte substitution
- Combinational logic

Each byte at the input of a round undergoes a non-linear byte substitution according to the substitution ("S") box.
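To show why the direct computation is expensive and the lookup table is cheap, here is a minimal Python sketch that builds the 256-entry S-box from its definition (inverse in GF(2^8), then the affine transform). The helper names are illustrative and not taken from the slides; a real implementation would store the finished table rather than recompute it.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B          # reduce by the AES polynomial
        b >>= 1
    return result


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):       # a^254 equals a^-1 in a group of order 255
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a):
    """Inverse followed by the affine transformation with constant 0x63."""
    inv = gf_inverse(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# The 16 x 16 LUT: row = high nibble of the input, column = low nibble.
SBOX = [sbox_entry(a) for a in range(256)]

if __name__ == "__main__":
    print(hex(SBOX[0x53]))
```

With the table in place, SubBytes on a state byte `b` is just `SBOX[b]`, which is what makes the LUT approach attractive.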

2. Shift Row

- Shifting is done only on the bottom three rows of the State
- Left rotate for encryption
- Right rotate for decryption

Depending on the block length, each "row" of the block is cyclically shifted by a fixed offset; for the 128-bit AES block, rows 1, 2, and 3 are rotated by 1, 2, and 3 bytes respectively, and row 0 is left unchanged.
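The row rotation can be sketched in a few lines of Python. The state is assumed here to be a list of four rows of four bytes each, and the function name `shift_rows` is illustrative only.

```python
def shift_rows(state, decrypt=False):
    """Rotate row r of a 4 x 4 state by r bytes: left to encrypt, right to decrypt."""
    out = []
    for r, row in enumerate(state):
        k = (-r if decrypt else r) % 4   # a right rotate by r is a left rotate by 4 - r
        out.append(row[k:] + row[:k])
    return out


if __name__ == "__main__":
    state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    shifted = shift_rows(state)
    print(shifted)
    print(shift_rows(shifted, decrypt=True) == state)
```

Row 0 is never moved, which matches the remark that only the bottom three rows are shifted.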

3. MixColumns Function

- Matrix multiplication in GF(2^8)
- MixColumns functionality resides primarily in the controller and instruction memory
- A series of conditional XOR and left shift operations

Each column is multiplied by a fixed polynomial
$c(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$

This corresponds to the matrix multiplication $b(x) = c(x) \otimes a(x) \bmod (x^4 + 1)$
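The "conditional XOR and left shift" description can be made concrete with a short Python sketch. `xtime` (multiplication by {02}) and `mix_column` are conventional names used here for illustration; multiplication by {03} is computed as xtime(a) XOR a.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8): shift left, then conditionally XOR 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def mix_column(col):
    """Multiply one 4-byte state column by c(x) modulo x^4 + 1."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,   # {02}a0 + {03}a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,   # a0 + {02}a1 + {03}a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),   # a0 + a1 + {02}a2 + {03}a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),   # {03}a0 + a1 + a2 + {02}a3
    ]


if __name__ == "__main__":
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Only shifts and XORs are needed, so the operation maps naturally onto ALU instructions without a multiplier.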

4. Key Expansion and Addition

- Key expansion is performed before both the encrypt and decrypt processes
- Byte values from the Key are read and manipulated into the RoundKey
- Expansion is a series of SubBytes and XOR operations with RCON ROM values and the Key
- Key addition performs an XOR operation between the State and the RoundKey
- This is the only function without a separate inverse, because XOR is its own inverse

Each word is simply XORed with the expanded round key.
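The key schedule and key addition described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the microcode from the project: the S-box is rebuilt from its definition so the block is self-contained, and the names `expand_key` and `add_round_key` are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


def build_sbox():
    """Build the AES S-box: inverse in GF(2^8), then the affine transform."""
    sbox = []
    for a in range(256):
        inv = 0
        if a:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, a)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Expand a 16-, 24-, or 32-byte key into 4 * (Nr + 1) four-byte words."""
    nk = len(key) // 4
    nr = nk + 6
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = list(words[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]            # RotWord
            temp = [SBOX[b] for b in temp]        # SubWord
            temp[0] ^= rcon                       # XOR with the round constant
            rcon = gf_mul(rcon, 2)
        elif nk > 6 and i % nk == 4:
            temp = [SBOX[b] for b in temp]        # extra SubWord for 256-bit keys
        words.append([w ^ t for w, t in zip(words[i - nk], temp)])
    return words


def add_round_key(state_words, round_words):
    """XOR each state word (column) with the matching round-key word."""
    return [[s ^ k for s, k in zip(sw, kw)]
            for sw, kw in zip(state_words, round_words)]


if __name__ == "__main__":
    key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
    w = expand_key(key)
    print(len(w), bytes(w[4]).hex())
```

Applying `add_round_key` twice with the same round key returns the original state, which is why decryption reuses the same operation.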

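IXP2400 Platform: A Quick Look

Name      Size (bytes)   Transfer size (bytes)   Reference latency (cycles)
GPR/ME    256 x 4        4                       1
TR/ME     512 x 4        4                       1
NNR/ME    128 x 4        4                       1
LM/ME     640 x 4        4                       3
Scratch   16K            4                       60
SRAM      64M            4                       90
DRAM      1G             16                      120

(GPR = general-purpose registers, TR = transfer registers, NNR = next-neighbor registers, LM = local memory; "/ME" means per MicroEngine.)

- Achieve high processing performance
- Programming flexibility
- Cheaper than ASIC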
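As a rough feel for the latency column, the cycle counts can be converted to time. The sketch below assumes the latencies are counted in MicroEngine cycles and uses the 600 MHz ME clock listed in the implementation setup; both are assumptions made for illustration, and the names are hypothetical.

```python
ME_CLOCK_HZ = 600e6  # assumption: 600 MHz ME clock from the implementation setup

# Reference latencies in cycles, copied from the table above.
LATENCY_CYCLES = {
    "GPR": 1,
    "TR": 1,
    "NNR": 1,
    "LM": 3,
    "Scratch": 60,
    "SRAM": 90,
    "DRAM": 120,
}


def latency_ns(cycles, clock_hz=ME_CLOCK_HZ):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


if __name__ == "__main__":
    for name, cycles in LATENCY_CYCLES.items():
        print(f"{name:8s} {cycles:4d} cycles  {latency_ns(cycles):7.1f} ns")
```

The gap between register access and off-chip memory is the reason lookup tables are best kept as close to the MicroEngine as their size allows, and why multithreading is used to hide memory waits.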

Microcode Overview

- alu[dest1, a, +, b]: ALU addition of a and b, storing the result in dest1
- alu[dest2, dest1, -, c]: ALU subtraction
- Move(reg1, reg2): move from reg1 to reg2; both are GPRs
- Immed[reg, 0x0020]: immediate value assignment to a register
- local_csr_wr[ACTIVE_LM_ADDR_0, 0x0]: local memory indexing with index 0
- begin … endm: macro begin and end
- if … endif: if block
- xbuf_alloc($$state, 4, read): buffer allocation in DRAM transfer registers
- reg gen_register $sram_reg $$dram_reg: register declaration
- sig sram_sig dram_sig: signal declaration
- while … endw: while loop
- for round[1,2,3,4,5,6,7,8,9,10] … endloop: for loop
- alu_shf[index, --, B, s0, >>24]: ALU shift function of B
- scratch[read, $T, index, 0, 1], ctx_swap[sram_sig]: scratch read instruction
- ld_field_w_clr[t1, 1000, $T]: performs a write to the t1 register
- dram[write, $$out[0], dst_addr, 0, 2], sig_done[dram_sig]: DRAM write
- ctx_arb[dram_sig], ctx_arb[kill]: signaling
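Two of the instructions above are byte-manipulation primitives that an S-box lookup would plausibly rely on: shifting a state word right by 24 isolates its top byte as a table index, and `ld_field_w_clr` places selected bytes into a register. The Python model below is a sketch of my understanding of those semantics (byte mask read left to right from the most significant byte, unselected bytes cleared); it is an assumption for illustration, not a statement of the hardware definition, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def alu_shf_b_right(src, shift):
    """Model of alu_shf[dest, --, B, src, >>shift]: pass B through after a logical right shift."""
    return (src & MASK32) >> shift


def ld_field_w_clr(byte_mask, src):
    """Model of ld_field_w_clr[dest, byte_mask, src].

    byte_mask is a 4-character string such as "1000"; the leftmost character
    is assumed to select the most significant byte. Selected bytes are copied
    from src and all other bytes of the result are cleared.
    """
    result = 0
    for i, bit in enumerate(byte_mask):
        if bit == "1":
            shift = 8 * (3 - i)
            result |= src & (0xFF << shift)
    return result & MASK32


if __name__ == "__main__":
    s0 = 0xAABBCCDD
    index = alu_shf_b_right(s0, 24)            # top byte of the state word
    print(hex(index))
    print(hex(ld_field_w_clr("1000", 0x11223344)))
```

In this model the index produced by the shift would then drive the scratch read that fetches the substituted byte.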

Implementation Setup

Environmental Setup

- Intel IXP 41
- 600 MHz ME configuration
- 200 MHz SRAMs
- 150 MHz RDRAMs
- Executed with multiple threads
- Executed on different MicroEngines

Experimental Results (1)

[Figure: Command Bus Arbiter Statistics (SRAM utilization). Y axis: percentage, 0 to 100. Categories: None-SRAM, SRAM. Series: idle due to memory queue fullness, idle due to no request, used.]

[Figure: MicroEngine Utilisation Percentage (ME utilization). X axis: number of threads in execution (8, 4, 2, 1). Y axis: percentage, 0 to 100. Series: idle, stalled, aborted, executing.]

Experimental Results (2)

[Figure: Throughput Improvement for 1 MicroEngine with Different Threads. X axis: number of threads (8, 4, 2, 1). Y axis: throughput (MIPS), 0 to 500. Caption: Throughput performance across threads in 1 ME.]

[Figure: AES Throughput Across MicroEngines. X axis: number of MicroEngines (1, 2, 4, 8). Y axis: throughput (MIPS), 0 to 1800.]

Crypto Unit of IXP2850

Intel IXP2850 Encryption Data Flow

Crypto Unit Overview

Simple Encrypt Example

Simple Encrypt and Hash Example

3DES Core

- 2 cores per crypto unit
- Takes a 192-bit key: (56-bit key + 8-bit parity) x 3 keys
- Operates on 8-byte blocks
- Result is written to ME transfer registers or a TBUF element
- Result can be passed to the SHA-1 unit for hashing

Security processing: pipelining and interleaving using three wires and one core; multiple keys and IVs
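The 192-bit 3DES key layout noted above, (56 + 8) x 3, can be illustrated with a short Python sketch. It splits a 24-byte key into three DES keys, checks the standard DES odd-parity convention (one parity bit per byte), and counts the effective key bits. The function names are hypothetical and nothing here touches the IXP2850 hardware.

```python
def split_3des_key(key):
    """Split a 24-byte 3DES key into its three 8-byte DES keys."""
    if len(key) != 24:
        raise ValueError("a 3DES key is 24 bytes (192 bits) including parity")
    return [key[i:i + 8] for i in range(0, 24, 8)]


def has_odd_parity(des_key):
    """True if every byte holds an odd number of set bits (DES parity convention)."""
    return all(bin(b).count("1") % 2 == 1 for b in des_key)


def effective_key_bits(key):
    """Count the key bits left after removing one parity bit per byte."""
    return sum(7 * len(k) for k in split_3des_key(key))


if __name__ == "__main__":
    key = bytes.fromhex("0123456789abcdef" * 3)
    print([k.hex() for k in split_3des_key(key)])
    print(all(has_odd_parity(k) for k in split_3des_key(key)))
    print(effective_key_bits(key))
```

The 192 bits therefore carry 168 bits of key material, 56 per DES key.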

AES Core

- All AES key sizes are supported (128, 192, or 256 bits)
- Both encryption and decryption are supported
- Operates on 16-byte blocks

AES Key Scheduler

SHA-1 Core

- 2 SHA-1 cores per crypto unit
- Operates on 64-byte (512-bit) blocks
- Data is loaded from Input RAM or the crypto cores into the SHA-1 buffer
- Can operate on unmodified packet data or on the ciphered packet data
- Has a data buffer to accumulate the ciphered data, which gives the flexibility to run SHA-1 and AES/3DES at different rates
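Because the cipher cores emit 8- or 16-byte blocks while SHA-1 consumes 64-byte blocks, some buffering between them is needed. The Python sketch below uses the standard library's `hashlib` to confirm the block size and counts how many 64-byte blocks a message occupies once SHA-1 padding is added; `sha1_block_count` is a hypothetical helper written for this illustration.

```python
import hashlib

SHA1_BLOCK_BYTES = 64  # 512-bit block size


def sha1_block_count(message_len):
    """Number of 64-byte blocks SHA-1 processes for a message of message_len bytes.

    Padding appends one 0x80 byte, zero fill, and an 8-byte length field.
    """
    return (message_len + 1 + 8 + SHA1_BLOCK_BYTES - 1) // SHA1_BLOCK_BYTES


if __name__ == "__main__":
    print(hashlib.sha1().block_size)
    for n in (0, 55, 56, 64, 1500):
        print(n, sha1_block_count(n))
    # Feeding data in cipher-sized pieces gives the same digest as one call.
    h = hashlib.sha1()
    data = bytes(range(48))
    for i in range(0, len(data), 16):      # 16-byte pieces, as an AES core would emit
        h.update(data[i:i + 16])
    print(h.hexdigest() == hashlib.sha1(data).hexdigest())
```

Four 16-byte AES blocks, or eight 8-byte 3DES blocks, fill one SHA-1 block, which is the rate mismatch the data buffer absorbs.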

SHA1 Critical Path Analysis

Some of the Crypto Commands

- crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig): perform and wait for the write
- crypto_load_iv($$iv[0], 1, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, iv_sig): loading IV data
- crypto_load_key($$key[0], 3, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, key_sig): loading key
- crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_3DES, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig)

Acknowledgement

- Yan Luo
- Chris Baron, http://cnscenter.future.co.kr/resource/rsc-center/presentation/intel/spring2003/S03USCPTS92_OS.pdf (for some slides)
- Mel Tsai, UC Berkeley (for some slides)
- Thomas Sodon et al., EE, College of New Jersey
- Zhangxi Tan et al., Tsinghua University

Q…

Page 5: AES Microcode Implementation In IXP2400 And A study of Reconfigurable Crypto Unit Piyush Ranjan Satapathy CS203B Class Project Presentation

1 SubBytes Function Affine Transformation in GF (28) Direct implementation is

complex Easily performed by a 16 x 16

LUT ROM Simple byte substitution Combinational logic

Each byte at the input of a round undergoes a

non-linear byte substitution according to the following transform

Substitution (ldquoSrdquo)-box

2 Shift Row Shifting done only on the

bottom three rows of the State Left rotate for encryption Right rotate for decryption

Depending on the block length each ldquorowrdquo of the block is cyclically shifted according to the above table

3 MixColumns Functionbull Matrix multiplication in GF (28)bull MixColumns functionality

resides primarily in the controller and instruction memory

bull A series of conditional XOR and left shift operations

Each column is multiplied by a fixed polynomialC(x) = rsquo03rsquoX3 + rsquo01rsquoX2 + rsquo01rsquoX + rsquo02rsquo

This corresponds to matrix multiplication b(x) = c(x) a(x)

4 Key Expansion and Addition Performed before both the encrypt and decrypt process Byte values from the Key are read and manipulated into the RoundKey A series of SubBytes and XOR operations with RCON ROM values and the

Key Performs XOR operation between the State and the Roundkey This is the only function without an inverse

Each word is simply EXORrsquoed with the expanded round key

IXP2400 Platform A Quick LookName SizeBytes Transfer

Size(Bytes)Reference

latency in cycles

GPRME 2564 4 1

TRME 5124 4 1

NNRME 1284 4 1

LMME 6404 4 3

Scratch 16K 4 60

SRAM 64M 4 90

DRAM 1G 16 120

bull achieve high processing performance

bull programming flexibilitybull Cheaper than ASIC

Microcode Overview

alu [ dest1 a + b] ALU addition of a and b and storing in dest1 alu [ dest2 dest1 - c] ALU subtraction Move(reg1 reg2) Moving from one reg1 to reg2 both are gprs Immed[reg ox0020] Immediate value assignment to register local_csr_wr[ACTIVE_LM_ADDR_0 0x0] Local memory indexing with index0 begin hellip endm Macro begin and end if hellip endif If loop xbuf_alloc ($$state 4 read) buffer allocation in DRAM transfer register reg gen_regiater $sram_reg $$dram_reg Register declaration sig sram_sig dram_sig signal declaration while hellip endw While looping for round[12345678910] hellip endloop For looping alu_shf[index -- B s0 gtgt24] Alu shift function of B scratch[read $T index 0 1] ctx_swap[sram_sig] scratch read instruction ld_field_w_clr[t1 1000 $T] Performs a write to t1 register dram[write $$out[0] dst_addr 0 2] sig_done[dram_sig] Dram write ctx_arb[dram_sig] ctx_arb[kill] signaling

Implementation Setup

Environmental Setup Intel IXP 41 600MHz ME configurations 200-MHz SRAMs 150-MHz RDRAMs Executed in Multi threads Executed in Different Micro Engines

Experimental Results(1)

Command Bus Arbiter Statistics

0

20

40

60

80

100

None-SRAM SRAM

Per

cent

age

idle due to memoryqueue fullness

Idle due to No request

Used

SRAM Utilization

MicroEngine Utilisation Percentage

0

20

40

60

80

100

8 Threads 4Threads 2Threads 1Thread

No of Threads in Execution

Per

cent

age

Idle

Stalled

Aborted

Executing

ME utilization

Experimental Results(2)

Throughput Improvement for 1 MicroEngine with different threads

0

100

200

300

400

500

8 Threads 4Threads 2Threads 1Thread

No of threads

Thro

ughp

ut(M

IPS

)

Series1

AES Throughput Across MicroEngines

0200400600800

10001200140016001800

1 2 4 8

No of MicroEngines

Thro

ughp

ut(M

IPS

)

Throughput PerformanceAcross Threads in 1 ME

Throughput PerformanceAcross Threads in 1 ME

Crypto Unit of IXP2850

Intel IXP2850 Encryption Data Flow

Crypto Unit Overview

Simple Encrypt Example

Simple Encrypt and Hash Example

3DES Core10486972 Cores per crypto unit

1048697Takes 192-bit key ndash(56-bit + 8-bit parity) x 3Keys

1048697Operates on 8-byte blocks 1048697Result is written to ME transfer registers or TBUF

element 1048697Result can be passed to the SHA-1 unit for hashing

Security Processing pipelining and interleaving using three wires and one core Multiple keys and IVs

AES Core

1048697All AES key sizes are supported

ndash(128 192 or 256) Both Encryption and

Decryption supported 1048697Operates on 16 byte

blocks

AES Key Scheduler

SHA1 Core

• 2 SHA-1 cores per crypto unit
• Operates on 64-byte blocks
• Data is loaded from Input RAM or from the crypto cores into the SHA-1 buffer
• Can operate on unmodified packet data or on the ciphered packet data
• Operates on a 512-bit block size and has a data buffer to accumulate the ciphered data
• This gives the flexibility to run SHA and AES/3DES at different rates
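The buffering idea on the SHA1 Core slide can be sketched in software. This is only an illustration using Python's standard `hashlib`, not the IXP2850 hardware interface; `hash_cipher_stream` and the block-size constants are names made up for the example. Cipher output arrives in 16-byte units, is accumulated, and is handed to SHA-1 in 64-byte (512-bit) units.

```python
import hashlib

AES_BLOCK = 16   # bytes produced per cipher step
SHA_BLOCK = 64   # bytes consumed per SHA-1 step (512 bits)


def hash_cipher_stream(cipher_blocks):
    """Accumulate cipher blocks and feed SHA-1 one 64-byte block at a time."""
    sha = hashlib.sha1()
    buf = bytearray()
    for block in cipher_blocks:
        buf += block
        while len(buf) >= SHA_BLOCK:
            sha.update(bytes(buf[:SHA_BLOCK]))
            del buf[:SHA_BLOCK]
    sha.update(bytes(buf))   # leftover bytes; hashlib applies the final padding
    return sha.hexdigest()


# Stand-in for ciphered packet data, split into 16-byte units.
data = bytes(range(256)) * 2 + b"tail"
blocks = [data[i:i + AES_BLOCK] for i in range(0, len(data), AES_BLOCK)]
digest = hash_cipher_stream(blocks)
```

Because the buffer decouples the two block sizes, the producer and the hash do not have to run in lockstep, which is the flexibility the slide refers to.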

SHA1 Critical Path Analysis

Some of The Crypto Commands

crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig): perform and wait for the write

crypto_load_iv($$iv[0], 1, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, iv_sig): loading IV data

crypto_load_key($$key[0], 3, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, key_sig): loading key

crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_3DES, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig)

Acknowledgement

Yan Luo
Chris Baron
http://cnscenter.future.co.kr/resource/rsc-center/presentation/intel/spring2003/S03USCPTS92_OS.pdf (for some slides)
Mel Tsai, UC Berkeley (for some slides)
Thomas Sodon et al., EE, College of New Jersey
Zhangxi Tan et al., Tsinghua University

Q……………


2. Shift Row
• Shifting is done only on the bottom three rows of the State
• Left rotate for encryption
• Right rotate for decryption

Depending on the block length, each "row" of the block is cyclically shifted according to the above table
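A minimal sketch of the Shift Row step, assuming the AES case of a 4-column State in which row r is rotated by r bytes (offsets 0, 1, 2, 3). The row-major list layout and the function name are choices made for this example, not taken from the microcode.

```python
def shift_rows(state, decrypt=False):
    """Rotate row r of a 4x4 State by r bytes: left to encrypt, right to decrypt."""
    out = []
    for r, row in enumerate(state):
        k = (-r if decrypt else r) % 4       # row 0 is never moved
        out.append(row[k:] + row[:k])
    return out


state = [[0x00, 0x01, 0x02, 0x03],
         [0x10, 0x11, 0x12, 0x13],
         [0x20, 0x21, 0x22, 0x23],
         [0x30, 0x31, 0x32, 0x33]]

shifted = shift_rows(state)
restored = shift_rows(shifted, decrypt=True)
```

Applying the right rotation to the shifted State gives back the original, which is why decryption only needs the opposite rotate direction.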

3. MixColumns Function
• Matrix multiplication in GF($2^8$)
• MixColumns functionality resides primarily in the controller and instruction memory
• A series of conditional XOR and left shift operations

Each column is multiplied by a fixed polynomial $c(x) = '03'\,x^3 + '01'\,x^2 + '01'\,x + '02'$

This corresponds to the matrix multiplication $b(x) = c(x) \otimes a(x)$
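The "conditional XOR and left shift" formulation can be sketched as follows. This is an illustrative Python version, not the IXP2400 microcode; `xtime` and `mix_column` are names chosen for the example. Multiplying by '02' is one left shift plus a conditional XOR with 0x1B, and multiplying by '03' is that result XORed with the original byte.

```python
def xtime(b):
    """Multiply a byte by '02' in GF(2^8): left shift, then a conditional XOR."""
    b <<= 1
    if b & 0x100:          # overflow out of 8 bits: reduce by x^8 + x^4 + x^3 + x + 1
        b ^= 0x11B
    return b


def mix_column(col):
    """Multiply one 4-byte column by c(x) = '03'x^3 + '01'x^2 + '01'x + '02'."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
print([hex(b) for b in mixed])
```

Each output byte needs only shifts and XORs, so the step maps onto plain ALU instructions without a multiplier.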

4. Key Expansion and Addition
• Performed before both the encrypt and decrypt process
• Byte values from the Key are read and manipulated into the RoundKey
• A series of SubBytes and XOR operations with RCON ROM values and the Key
• Performs an XOR operation between the State and the RoundKey
• This is the only function without a separate inverse (XOR is its own inverse)

Each word is simply XORed with the expanded round key
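The key expansion and addition steps can be sketched for the 128-bit case. This is a software illustration under stated assumptions, not the microcode from the project: the S-box is computed rather than read from a table, the round constants are generated instead of coming from an RCON ROM, and all function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def make_sbox():
    """Build the SubBytes table: multiplicative inverse, then the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = make_sbox()


def expand_key_128(key):
    """Expand a 16-byte key into 44 four-byte words (11 round keys)."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]            # rotate the word by one byte
            temp = [SBOX[b] for b in temp]        # SubBytes on each byte
            temp[0] ^= rcon                       # XOR with the round constant
            rcon = gf_mul(rcon, 2)
        words.append([w ^ t for w, t in zip(words[i - 4], temp)])
    return words


def add_round_key(state_words, round_words):
    """XOR each State word with the matching round-key word."""
    return [[s ^ k for s, k in zip(sw, kw)]
            for sw, kw in zip(state_words, round_words)]


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
words = expand_key_128(key)
state = [[0x32, 0x43, 0xF6, 0xA8], [0x88, 0x5A, 0x30, 0x8D],
         [0x31, 0x31, 0x98, 0xA2], [0xE0, 0x37, 0x07, 0x34]]
once = add_round_key(state, words[0:4])
twice = add_round_key(once, words[0:4])
```

Applying `add_round_key` twice with the same round key returns the original State, so decryption reuses the same operation.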

IXP2400 Platform: A Quick Look

Name      Size (Bytes)   Transfer Size (Bytes)   Reference latency (cycles)
GPR/ME    256 x 4        4                       1
TR/ME     512 x 4        4                       1
NNR/ME    128 x 4        4                       1
LM/ME     640 x 4        4                       3
Scratch   16K            4                       60
SRAM      64M            4                       90
DRAM      1G             16                      120

• Achieve high processing performance
• Programming flexibility
• Cheaper than ASIC
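To see why the reference latencies in the table matter for AES, the following back-of-the-envelope sketch counts the cycles spent waiting on S-box lookups for one 128-bit block. It assumes one lookup per State byte per round (16 x 10 = 160 lookups, SubBytes only) and fully serialized accesses with no latency hiding by other threads; where the table actually lives in the project's implementation is not stated in the slides.

```python
# Reference latencies in cycles, copied from the IXP2400 memory table.
LATENCY = {"LM": 3, "Scratch": 60, "SRAM": 90, "DRAM": 120}

ROUNDS = 10              # AES with a 128-bit key
LOOKUPS_PER_ROUND = 16   # one S-box lookup per State byte (assumption: SubBytes only)


def lookup_cycles(memory, lookups=ROUNDS * LOOKUPS_PER_ROUND):
    """Cycles spent waiting if every lookup is serialized (no latency hiding)."""
    return lookups * LATENCY[memory]


for name in LATENCY:
    print(f"{name:8s}{lookup_cycles(name):7d} cycles per 16-byte block")
```

Under these assumptions a table held in local memory costs 480 wait cycles per block against 14,400 for SRAM, which is the kind of gap that multithreading on a MicroEngine is meant to hide.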

Page 13: AES Microcode Implementation In IXP2400 And A study of Reconfigurable Crypto Unit Piyush Ranjan Satapathy CS203B Class Project Presentation

Experimental Results(2)

Throughput Improvement for 1 MicroEngine with different threads

0

100

200

300

400

500

8 Threads 4Threads 2Threads 1Thread

No of threads

Thro

ughp

ut(M

IPS

)

Series1

AES Throughput Across MicroEngines

0200400600800

10001200140016001800

1 2 4 8

No of MicroEngines

Thro

ughp

ut(M

IPS

)

Throughput PerformanceAcross Threads in 1 ME

Throughput PerformanceAcross Threads in 1 ME

Crypto Unit of IXP2850

Intel IXP2850 Encryption Data Flow

Crypto Unit Overview

Simple Encrypt Example

Simple Encrypt and Hash Example

3DES Core10486972 Cores per crypto unit

1048697Takes 192-bit key ndash(56-bit + 8-bit parity) x 3Keys

1048697Operates on 8-byte blocks 1048697Result is written to ME transfer registers or TBUF

element 1048697Result can be passed to the SHA-1 unit for hashing

Security Processing pipelining and interleaving using three wires and one core Multiple keys and IVs

AES Core

1048697All AES key sizes are supported

ndash(128 192 or 256) Both Encryption and

Decryption supported 1048697Operates on 16 byte

blocks

AES Key Scheduler

SHA1 Core

2 SHA-1 cores per crypto unit Operates on 64-byte blocks

Data is loaded from Input RAM or Crypto cores into the SHA-1 buffer

Can perform on unmodified packet data or on the ciphered packet data

Operates on 512 bit block size and has a data buffer to accumulate the ciphered data

This gives flexibility to run SHA and AES 3DES at different rates

SHA1 Critical Path Analysis

Some of The Crypto Commands

crypto_write_ram($$orig_plain_text[0]DATA_RAM_ADDR8ENCRYPT_UNIT ram_sig) Perform and wait for the write

crypto_load_iv($$iv[0] 1ENCRYPT_UNITCRYPTO_BANK ENCRYPT_STATE iv_sig) Loading IV Data

crypto_load_key($$key[0]3ENCRYPT_UNITCRYPTO_BANKENCRYPT_STATEkey_sig) Loading Key

crypto_cipher($$encrypt_data[0]DATA_RAM_ADDR8CRYPTO_CIPHER_ENCRYPTCRYPTO_CIPHER_NO_CBC CRYPTO_CIPHER_3DES ENCRYPT_UNITCRYPTO_BANK ENCRYPT_STATE cipher_sig)

Acknowledgement

Yan Luo Chris Baron httpcnscenterfuturecokrresourcersc-cent

erpresentationintelspring2003S03USCPTS92_OSpdf ( For some slides)

Mel Tsai UC Berkeley (For some slides) Thomas Sodon et al EE College of

NewJersey Zhangxi Tan et al Tsinghua University

Q……………
