UNIVERSITY OF CALIFORNIA, IRVINE

The Advanced Encryption Standard Mapping into MorphoSys Architecture

THESIS

submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in Electrical and Computer Engineering

by

Ye Tang

Thesis Committee: Professor Nader Bagherzadeh, Chair

Professor Fadi J. Kurdahi

Professor Stephen F. Jenks

2001

The thesis of Ye Tang is approved:

_____________________________

_____________________________

_____________________________ Committee Chair

University of California, Irvine

2001

DEDICATION

To

my dear wife Yang Zhao,

mother Xiuyun Zhou,

father Jiyin Tang,

sister Jun Tang,

for their love, support, understanding, and patience

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
ABSTRACT OF THE THESIS

CHAPTER 1 MorphoSys Architecture Introduction
1.1 Reconfigurable Computing Systems
1.2 MorphoSys Architecture
1.2.1 Reconfigurable Cell (RC)
1.2.2 RC Array
1.2.3 Frame Buffer and DMA Controller
1.2.4 Context Memory
1.2.5 TinyRISC
1.3 Modifications to MorphoSys
1.3.1 Size Expansion of Register File and Context Memory
1.3.2 Embedded Lookup Table in Every RC
1.3.3 New RC Array Instructions

CHAPTER 2 The Advanced Encryption Standard (AES)
2.1 Introduction of the AES
2.1.1 History of the AES Development
2.1.2 Overview of Rijndael
2.1.3 Definition of Terms, Parameters and Functions
2.2 Mathematical Background of Rijndael
2.2.1 Polynomial Representation of A Finite Field Element
2.2.2 Addition in GF(2^8)
2.2.3 Multiplication in GF(2^8)
2.2.4 Multiplication by x
2.2.5 Polynomials with Coefficients in GF(2^8)
2.3 Rijndael Specification
2.3.1 The Cipher
2.3.1.1 SubBytes( ) Function
2.3.1.2 ShiftRows( ) Function
2.3.1.3 MixColumns( ) Function
2.3.1.4 AddRoundKey( ) Function
2.3.1.5 Key Expansion
2.3.2 The Inverse Cipher

CHAPTER 3 Mapping AES into MorphoSys
3.1 Parallel Computing Exploration
3.1.1 Multi-block Processing
3.1.2 Parallel Table-lookup
3.1.3 Dedicated Data Movement for Rijndael
3.2 Algorithm Flowchart and Illustration
3.2.1 Key Expansion by TinyRISC
3.2.2 Table Loading
3.2.3 Data and Round Key Loading
3.2.4 Data Processing in RC Array
3.2.5 Data Storing
3.3 Simulation Environment
3.4 Performance Analysis
3.5 Conclusions

BIBLIOGRAPHY

APPENDIX A Constant Tables Used in AES
A.1 Lookup Table “S-box”
A.2 Lookup Table “Inv S-box”
A.3 Lookup Table “xtime”
A.4 Lookup Table “Log”
A.4 Lookup Table “Alog”
A.5 Table “Rcon”

APPENDIX B MorphoSys TinyRISC ISA
B.1 Instruction Format
B.2 Instruction Codes
B.2.1 Arithmetic Instructions
B.2.2 Logical Instructions
B.2.3 Shift Instructions
B.2.4 Comparison Instructions
B.2.5 Load-Immediate Instructions
B.2.6 Memory Access Instructions
B.2.7 Control Transfer Instructions
B.2.8 MorphoSys Instruction

APPENDIX C RC Array Instruction Set

APPENDIX D The Programs for AES Implementation in MorphoSys
D.1 Key Expansion
D.2 Data Processing
D.3 Contexts for Data Processing

LIST OF FIGURES

Figure 1.1: MorphoSys integrated architectural model
Figure 1.2: RC Architecture
Figure 1.3: 8 x 8 RC Array
Figure 1.4: Level 1 & 2 of RC Array Interconnection Network
Figure 1.5: Level 3 of interconnection network
Figure 1.6: L, M, R, T, C, B Port of MUXA
Figure 1.7: L, U, D Port of MUXB
Figure 1.8: Frame Buffer Block Diagram
Figure 1.9: Structure of Context Memory
Figure 1.10: TinyRISC block diagram
Figure 2.1: Pseudo Code for the Cipher of Rijndael Algorithm
Figure 2.2: Transformation of ShiftRows( )
Figure 2.3: Doing MixColumns( ) by xtime Approach
Figure 2.4: Pseudo Code for Key Expansion
Figure 2.5: Key Expansion and Round Key Partition for Nk = 6
Figure 2.6: Basic Pseudo Code for the Cipher of Rijndael Algorithm
Figure 2.7: Transformation of InvShiftRows( )
Figure 3.1: Intuitive Partitioning of RC Array
Figure 3.2: Actual Partitioning of RC Array
Figure 3.3: Transformation of ShiftRows( ) in 4x4 Matrix
Figure 3.4: Transformation of ShiftRows( ) in 8x2 Matrix
Figure 3.5: Data Movement for ShiftRows( )
Figure 3.6: Flowchart of Rijndael Implementation in MorphoSys
Figure 3.7: Concatenations of Round Keys
Figure 3.8: ShiftRows( ) Step 1
Figure 3.9: ShiftRows( ) Step 2
Figure 3.10: ShiftRows( ) Step 3, 4
Figure 3.11: ShiftRows( ) Step 5
Figure 3.12: ShiftRows( ) Step 6, 7, 8
Figure 3.13: InvShiftRows( ) Step 1, 2, 3, 4
Figure 3.14: InvShiftRows( ) Step 5, 6, 7, 8
Figure 3.15: Software Tools for MorphoSys
Figure 3.16: Throughputs of Different Implementations

LIST OF TABLES

Table 1.1: RC Functions
Table 1.2: MorphoSys Instructions
Table 2.1: Terms and Acronyms Used in AES
Table 2.2: Parameter and Functions Used in AES
Table 2.3: Key-Block-Round Combinations
Table 3.1: # of Cycles for Key Expansion in Several Implementations
Table 3.2: # of Cycles for AES Initialization in MorphoSys Implementation
Table 3.3: # of Cycles and Throughputs per Block in Other Implementations
Table 3.4: # of Cycles and Throughputs per Block in MorphoSys Implementation
Table 3.5: AES by Amphion ASIC Cores using TSMC 0.18µm Technology
Table 3.6: AES by Amphion Programmable Logic Cores using Altera APEX20KE-1
Table 3.7: AES by Amphion Programmable Logic Cores using Xilinx VirtexE-8

ACKNOWLEDGEMENTS

I would like to thank my advisors, Professors Fadi J. Kurdahi and Nader Bagherzadeh, for their guidance and support in my graduate studies and research towards the M.S. degree. I also thank my thesis committee member, Professor Stephen F. Jenks. This thesis would have been impossible without their work.

I would also like to thank my group members in the VLSI Design Automation Laboratory, Afshin Niktash, Chengzhi Pan, and Hooman T. Parizi, and former students in the same group, Guangming Lu, Hartej Singh, and Ming-Hau Lee. Their contributions to the MorphoSys project are very important to my work.

Special thanks go to Broadcom Corporation and Conexant Systems Inc., which provided me with a one-year fellowship for my graduate studies at UCI, and to the Defense Advanced Research Projects Agency (DARPA), which supports the MorphoSys project.

ABSTRACT OF THE THESIS

The Advanced Encryption Standard Mapping into MorphoSys Architecture

By

Ye Tang

Master of Science in Electrical and Computer Engineering

University of California, Irvine, 2001

Professor Nader Bagherzadeh, Chair

The Advanced Encryption Standard (AES) specifies a cryptographic algorithm that can be used to protect electronic data. The algorithm, called Rijndael, is a high-performance symmetric block cipher with a high level of security. AES is expected to be used by the U.S. Government and, on a voluntary basis, by the private sector. It is hoped that AES will gradually replace the current encryption standard, the Data Encryption Standard (DES).

MorphoSys is a SIMD-based reconfigurable parallel computing system. It includes a general-purpose RISC processor for the sequential and control parts of an algorithm, and 64 reconfigurable computing components for the parallel part. The intrinsic data parallelism of the AES algorithm, together with the efficient data communication and powerful data computing of MorphoSys, makes MorphoSys well suited to AES implementation.

The performance of the MorphoSys implementation is quite good. The throughput exceeds 100 Mb/s, which is adequate for applications on mobile phones and PDAs. It is one or two orders of magnitude faster than software implementations in assembly language, C/C++, and Java. To date, one of the fastest ASIC implementations is only 240% to 270% faster than the MorphoSys implementation, and one of the fastest FPGA implementations is only 30% to 60% faster. Besides high speed, another advantage of implementing AES on MorphoSys is that the same architecture can also run many other applications efficiently. This feature is critical when AES is only one part of a larger application.

Chapter 1

MorphoSys Architecture Introduction

MorphoSys is a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with a general-purpose processor for word-level, computation-intensive applications. It consists of a RISC processor, embedded memory with a high-speed memory interface, and an array of reconfigurable computing cells. Its dynamic reconfigurability, considerable depth of programmability, and large number of computing cells make MorphoSys suitable for data-parallel and high-throughput applications [1][2].

This chapter introduces the features and advantages of reconfigurable systems, the MorphoSys architecture and instructions, and the modifications to the first-generation MorphoSys architecture.

1.1 Reconfigurable Computing Systems

General-purpose processors and Application-Specific Integrated Circuits (ASICs) are two very different types of hardware. The former, such as the Intel Pentium, Motorola PowerPC, and Sun SPARC, can run a great diversity of applications, such as an operating system, a word processor, or a scientific calculation. As a consequence of this generality, their performance may be inferior to that of a system whose architecture is better suited to the application. The latter, on the other hand, implement exactly the functionality needed by a particular application. The architecture of an ASIC exploits the intrinsic characteristics of an application’s algorithm, which leads to high performance. However, the direct architecture-algorithm mapping restricts the range of applicability and reusability.

To combine the flexibility of general-purpose processors with the high speed of ASICs, the concept of a reconfigurable computing system has been proposed. A reconfigurable computing system is a hybrid between a general-purpose processor and an ASIC. Ideally, a reconfigurable system delivers the high performance typical of ASICs and still provides the flexibility of general-purpose processors (i.e., it can execute a wide range of applications).

Conventionally, field programmable gate arrays (FPGAs) are the most common devices used for implementing reconfigurable components, because FPGAs allow designers to manipulate gate-level resources such as flip-flops, memory, and logic gates. However, FPGAs have certain disadvantages, such as low logic density and inefficient performance for word-level datapath operations [3]. Hence, many researchers have proposed prototypes of coarse-grain reconfigurable systems that employ non-FPGA reconfigurable components. MorphoSys is one of them.

1.2 MorphoSys Architecture

MorphoSys M1 (M1 is the first version of its physical implementation) consists of

five main components: the Reconfigurable Cell Array (RC Array), the RISC control

processor (TinyRISC), the Context Memory, the Frame Buffer and the DMA Controller.

Figure 1.1 shows the organization of the integrated MorphoSys reconfigurable computing

system.

[Block diagram. Labels: M1 Chip; TinyRISC Core Processor; Inst. Cache; RC Array (8 x 8 RCs); Context Memory (2 x 8 x 16 x 32 bits); Frame Buffer (2 x 128 x 64 bits); DMA Controller; Mem Controller; Main Memory; TinyRISC Instruction; TinyRISC Data; Context Segment; Data Segment.]

Figure 1.1: MorphoSys integrated architectural model

The RC Array contains 64 reconfigurable computing elements. The Context Memory is the local memory that stores the configuration contexts, or instructions, for the RC Array. Together, the RC Array and the Context Memory form the reconfigurable processor array (SIMD co-processor), which is responsible for the parallel computation of the application. The main processor is TinyRISC, a general-purpose 32-bit RISC processor, which is responsible for the sequential tasks and control functions of the application. The high-bandwidth memory interface is implemented by the Frame Buffer and the DMA Controller. The data to be processed is transferred from external memory to the Frame Buffer and then from the Frame Buffer to the RC Array; result data travels in the reverse order.

In the following sections, all the components of MorphoSys architecture are

described in detail. For more information related to MorphoSys architecture, please refer

to [4][5].

1.2.1 Reconfigurable Cell (RC)

The RC is the basic element of the RC Array. Each RC incorporates an ALU-multiplier, a shift unit, input multiplexers, and a register file, as shown in Figure 1.2. The multiplier is included because many target applications require integer multiplication. In addition, a context register stores the current context and provides control/configuration signals to the RC components (namely the ALU-multiplier, the shift unit, and the input muxes).

[RC block diagram. Labels: Context Memory; Context Register; MUXA; MUXB; ALU+MULT; SHIFT; REG; FLAG; Register File (R0 through R3); Constant; Output; To_FB; Address From TinyRISC; input ports T, C, B, XQ, I, U, D, L, VE, HE. Context register fields shown include ALU_OP, MUXA, MUXB, Constant, REG_FILE, ALU_SFT, RS_LS, and Write_RF_En.]

Figure 1.2: RC Architecture

Data is supplied to the multiplier/ALU through two 16-bit input muxes. MUXA selects one input from: (1) the outputs of other RCs in the same row/column within the same RC Array quadrant (L, M, R, T, C, B ports); (2) the nearest neighbors in the adjacent quadrant (XQ port); (3) the data from the Frame Buffer (I port); (4) the internal register file (R0 through R3 ports); or (5) the vertical/horizontal express lanes (VE, HE ports). MUXB selects one input from: (1) three nearest neighbors (L, U, D ports); (2) the data from the Frame Buffer (I port); or (3) the register file (R0 through R3 ports). Please refer to Section 1.2.2 for details of these connection ports.

The 32-bit context register stores the current configuration for each RC. For example, the field ALU_OP specifies the ALU function, and the fields MUXA and MUXB select the inputs of MUXA and MUXB, respectively.

Table 1.1 shows all the RC functions implemented in M1. Special functions such as absolute value, count one's, and rounding are implemented as units separate from the ALU, which reduces the logic complexity of the ALU and improves overall performance.

Table 1.1: RC Functions

Instruction                                        Description

A OR B, A AND B, A XOR B,                          Two-operand logic functions
A OR C, A AND C, A XOR C
A + B, A − B, B − A, A + C, A − C                  Two-operand arithmetic functions
A * C                                              Multiplication with constant
A*C + B, A*C + Out(t), A*C − Out(t)                Multiply-accumulate functions
| A − B | + Out(t)                                 Absolute difference accumulate
A AND B : Count One's                              AND with count # of one's in result
A+B if A>0, A−B if A<0                             Conditional add/subtract based on sign bit of A
Rounding, RESET, BYPASS A, LOAD Constant, No-op    Miscellaneous functions

A: MUXA operand, B: MUXB operand, C: constant
Out(t) = previous output, Out(t+1) = new output
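
As an illustration of Table 1.1, the following Python sketch models a few of the listed functions in software. It is not the M1 hardware: the function names are hypothetical, plain Python integers stand in for the 16-bit datapath (so overflow is not modeled), and the result returned for A = 0 in the conditional add/subtract is an assumption, since the table does not specify that case.

```python
# Software model of selected RC functions from Table 1.1.
# Assumptions of this sketch: function names are illustrative only,
# and Python integers replace the 16-bit datapath (no overflow modeling).

def mac(a: int, c: int, b: int) -> int:
    """A*C + B: multiply the MUXA operand by the constant, add the MUXB operand."""
    return a * c + b


def mac_accumulate(a: int, c: int, out_prev: int) -> int:
    """A*C + Out(t): multiply-accumulate onto the previous output."""
    return a * c + out_prev


def abs_diff_accumulate(a: int, b: int, out_prev: int) -> int:
    """|A - B| + Out(t): absolute difference accumulated onto the previous output."""
    return abs(a - b) + out_prev


def and_count_ones(a: int, b: int) -> int:
    """A AND B, followed by a count of the 1 bits in the result."""
    return bin(a & b).count("1")


def conditional_add_sub(a: int, b: int) -> int:
    """A+B if A>0, A-B if A<0. A == 0 is unspecified in the table; A is returned here."""
    if a > 0:
        return a + b
    if a < 0:
        return a - b
    return a  # assumption, not from the table
```

Here Out(t) is passed explicitly as `out_prev`, so a sequence of calls that feeds each result back in plays the role of the RC output register.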

1.2.2 RC Array

The whole reconfigurable component is an array of RCs, or RC Array.

Considering that target applications (video compression, etc.) tend to be processed in

clusters of 8 x 8 data elements, the RC Array has 64 cells in a two-dimensional matrix, as

illustrated in Figure 1.3. This configuration is chosen to maximally utilize the parallelism

inherent in an application, which in turn enhances throughput.

Figure 1.3: 8 x 8 RC Array

The RC Array follows the SIMD model of computation. All RCs in the same

row/column share the same configuration data (context). However, each RC operates on

different data. Sharing the context across a row/column is useful for data-parallel

applications.
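
The row/column context sharing described above can be sketched in software as follows. This is a minimal illustration rather than the MorphoSys simulator: a context is modeled as a Python function applied to each RC's own data value, and the names `broadcast_rows` and `broadcast_columns` are hypothetical.

```python
from typing import Callable, List

Grid = List[List[int]]            # data[r][c] is the value held by RC(r, c)
Context = Callable[[int], int]    # a context modeled as an operation on one value


def broadcast_rows(data: Grid, row_contexts: List[Context]) -> Grid:
    """Row mode: every RC in row r executes row_contexts[r] on its own data."""
    return [[row_contexts[r](value) for value in row]
            for r, row in enumerate(data)]


def broadcast_columns(data: Grid, col_contexts: List[Context]) -> Grid:
    """Column mode: every RC in column c executes col_contexts[c] on its own data."""
    return [[col_contexts[c](value) for c, value in enumerate(row)]
            for row in data]


# Example: an 8 x 8 array in which row/column k adds the constant k.
data = [[r * 8 + c for c in range(8)] for r in range(8)]
contexts = [lambda v, k=k: v + k for k in range(8)]
by_row = broadcast_rows(data, contexts)
by_col = broadcast_columns(data, contexts)
```

Eight contexts are enough to drive all 64 cells in either mode, which is the sense in which the array is SIMD: one instruction per row or column, different data in every cell.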

The RC Array has an extensive interconnection network, designed to enable fast

data exchange between the RCs. This results in enhanced performance for application

kernels that involve a lot of data movement, such as the discrete cosine transform (DCT)

used in video compression, and the AES algorithm described in this thesis.

The RC Array interconnection network has three levels. The first level is the nearest-neighbor layer, which connects the RCs in a 2-D mesh (see Figure 1.4). The second level is at the quadrant level (a quadrant is a 4 x 4 RC group, see Figure 1.4) and provides complete row and column connectivity within a quadrant. Therefore, each RC can access data from any other RC in the same row/column within its quadrant.

[Diagram: the 8 x 8 RC Array divided into four quadrants, labeled Quad0 and Quad1 in the first row and Quad2 and Quad3 in the second.]

Figure 1.4: Level 1 & 2 of RC Array Interconnection Network
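
The second-level (quadrant) connectivity can be made concrete with a short Python sketch. It assumes the RC(row, column) indexing used in this chapter and the quadrant numbering suggested by the labels of Figure 1.4 (Quad0 and Quad1 on top, Quad2 and Quad3 below); the function names are hypothetical.

```python
from typing import List, Tuple

N = 8  # the RC Array is N x N
Q = 4  # a quadrant is Q x Q

Cell = Tuple[int, int]  # (row, column)


def quadrant(r: int, c: int) -> int:
    """Quadrant index of RC(r, c); numbering assumed from the Figure 1.4 labels."""
    return (r // Q) * 2 + (c // Q)


def quadrant_peers(r: int, c: int) -> List[Cell]:
    """RCs that RC(r, c) can read through level-2 links:
    every other RC in the same row or column of its own quadrant."""
    r0, c0 = (r // Q) * Q, (c // Q) * Q  # top-left corner of the quadrant
    row_peers = [(r, cc) for cc in range(c0, c0 + Q) if cc != c]
    col_peers = [(rr, c) for rr in range(r0, r0 + Q) if rr != r]
    return row_peers + col_peers
```

Each RC therefore has three row peers and three column peers inside its quadrant, six in total.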

At the third or global level, there are buses that support inter-quadrant connectivity (see Figure 1.5). These buses are also called express lanes, and they run across rows as well as columns. An express lane can supply data from any RC of a quadrant to the four RCs in the same row/column of the adjacent quadrant. For example, the value of RC(0,1)* can be put on the horizontal express lane (HE) and then read by RC(0,4), RC(0,5), RC(0,6), and RC(0,7); or it can be put on the vertical express lane (VE) and then read by RC(4,1), RC(5,1), RC(6,1), and RC(7,1). Thus, up to four cells in a row/column may access the output value of any one of the four cells in the same row/column of the adjacent quadrant. Express lanes greatly enhance global connectivity: some irregular communication patterns that would otherwise require extensive interconnections can be handled quite efficiently. For example, an eight-point butterfly in the FFT is accomplished in only three clock cycles, and the data movement in the AES implementation depends largely on the express lanes.

Figure 1.5: Level 3 of interconnection network

* RC(x,y) means the RC located at row x, column y.
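
The express-lane example above can be checked with a small Python sketch. It reproduces the RC(0,1) case from the text; the behavior for source RCs in the right-hand or lower quadrants is assumed by symmetry, and the function name is hypothetical.

```python
from typing import List, Tuple

Q = 4  # quadrant width/height in an 8 x 8 RC Array

Cell = Tuple[int, int]  # (row, column)


def express_lane_readers(r: int, c: int, lane: str) -> List[Cell]:
    """RCs that can read the value RC(r, c) places on an express lane.

    "HE" reaches the four RCs of the same row in the horizontally adjacent
    quadrant; "VE" reaches the four RCs of the same column in the vertically
    adjacent quadrant.
    """
    if lane == "HE":
        start = Q if c < Q else 0  # columns of the other quadrant
        return [(r, cc) for cc in range(start, start + Q)]
    if lane == "VE":
        start = Q if r < Q else 0  # rows of the other quadrant
        return [(rr, c) for rr in range(start, start + Q)]
    raise ValueError("lane must be 'HE' or 'VE'")
```

Calling `express_lane_readers(0, 1, "HE")` yields RC(0,4) through RC(0,7), and `express_lane_readers(0, 1, "VE")` yields RC(4,1) through RC(7,1), matching the example in the text.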

The L, M, R, T, C, B ports of MUXA: The L (Left), M (Middle), R (Right), T (Top), C (Center), and B (Bottom) ports of MUXA are all connected to other RCs within the same quadrant. As an example, these ports are marked for RCs X and Y in Figure 1.6. Notice that the port names do not always match their literal meanings.

[Diagram: two example cells, X and Y, with the RCs that feed their L, M, R, T, C, and B ports marked.]

Figure 1.6: L, M, R, T, C, B Port of MUXA

The L, U, D ports of MUXB: The L (Left), U (Up), and D (Down) ports of MUXB are defined by absolute location and are not necessarily confined to a quadrant. As an example, these ports are marked for RCs X, Y, and Z in Figure 1.7. Notice that they wrap around.

[Diagram: three example cells, X, Y, and Z, with the RCs that feed their L, U, and D ports marked.]

Figure 1.7: L, U, D Port of MUXB
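
One way to read the wrapped L, U, D ports is as modular (toroidal) indexing over the full 8 x 8 array. The Python sketch below encodes that reading; it is an interpretation of the description above, not a statement of the M1 wiring, and the function name is hypothetical.

```python
from typing import Dict, Tuple

N = 8  # the RC Array is N x N

Cell = Tuple[int, int]  # (row, column)


def muxb_sources(r: int, c: int) -> Dict[str, Cell]:
    """Source RC for each MUXB neighbor port of RC(r, c).

    Assumption: "wrapped" means that indices are taken modulo N, so a cell
    on the array edge reads from the cell on the opposite edge.
    """
    return {
        "L": (r, (c - 1) % N),  # left neighbor, wrapping to column N-1
        "U": ((r - 1) % N, c),  # upper neighbor, wrapping to row N-1
        "D": ((r + 1) % N, c),  # lower neighbor, wrapping to row 0
    }
```

Under this reading an interior cell sees its literal left, upper, and lower neighbors, while a corner cell such as RC(0,0) reads from RC(0,7), RC(7,0), and RC(1,0).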

1.2.3 Frame Buffer and DMA Controller

The high parallelism of the RC Array would be ineffective if the memory interface were unable to transfer data at an adequate rate. Therefore, a high-speed memory interface consisting of a streaming buffer (the Frame Buffer) and a DMA Controller is incorporated in the system. The Frame Buffer has two sets, as illustrated in Figure 1.8. Communication between the Frame Buffer and main memory is controlled by the DMA Controller. By using the two sets of the Frame Buffer alternately, the computation in the RC Array is overlapped with the data loads and stores of the Frame Buffer. Memory accesses are therefore virtually transparent to the RC Array.

[Frame Buffer diagram. Labels: SET 0; SET 1; BANK A (64 x 8 bytes); BANK B (64 x 8 bytes); MSB; LSB.]

Figure 1.8: Frame Buffer Block Diagram
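
The alternating use of the two Frame Buffer sets is a double-buffering (ping-pong) scheme. The Python sketch below shows only the scheduling idea: the function and variable names are hypothetical, and the load of the next block, which the DMA Controller performs concurrently with RC Array computation in hardware, is executed sequentially here.

```python
from typing import Callable, List, Optional, Tuple


def process_stream(blocks: List[int],
                   compute: Callable[[int], int]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Process blocks while alternating between two buffer sets.

    Returns the results and a schedule of (block index, set used for compute).
    """
    sets: List[Optional[int]] = [None, None]  # stand-ins for SET 0 and SET 1
    results: List[int] = []
    schedule: List[Tuple[int, int]] = []
    if not blocks:
        return results, schedule

    sets[0] = blocks[0]  # initial load; nothing to overlap with yet
    for i in range(len(blocks)):
        active = i % 2     # set the RC Array reads from in this step
        idle = 1 - active  # set being refilled in the meantime
        if i + 1 < len(blocks):
            sets[idle] = blocks[i + 1]  # overlapped with compute in hardware
        results.append(compute(sets[active]))
        schedule.append((i, active))
    return results, schedule


results, schedule = process_stream([1, 2, 3], lambda x: x * 2)
```

Because the idle set is always being filled while the active set is being consumed, only the very first load is exposed; this is the sense in which memory accesses become nearly transparent to the RC Array.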

1.2.4 Context Memory

The Context Memory stores configuration data, or contexts, for the RC Array. Contexts resemble the instructions of a microprocessor, except that every context can serve eight RCs in the same row or column simultaneously*.

As shown in Figure 1.9, Context Memory is logically organized into two blocks,

column context block (on the top) and row context block (on the left). Each block

consists of eight context sets, and each set consists of 16 context words.

A context word in the row context block (called a row context word) is broadcast along a row, and a context word in the column context block (called a column context word) is broadcast along a column. Picking one corresponding word from each of the eight sets in the row/column context block yields eight words (a plane) that cover all 8 rows/columns, that is, all 64 RCs.

* That also indicates the coarse-grain nature (word-level operations) of MorphoSys architecture.

[Diagram: the 8 x 8 RC Array with context sets of 16 words each drawn along its top edge and its left edge.]

Figure 1.9: Structure of Context Memory

The total number of row and column contexts is referred to as the depth of programmability. Because there are 16 words in a set, there are 16 row contexts and 16 column contexts in total, so the depth of programmability is 32. In other words, the RC Array can perform up to 32 different operations without reloading new contexts.

This depth is sufficient for many DSP and image processing applications, but not for some complicated algorithms. Because the penalty for reloading contexts while an application is running is large, a reasonable remedy is to increase the Context Memory size. In M2, the next version of MorphoSys, the depth will be increased to 256.
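
The M1 figures quoted in this section can be tied together with a short calculation. The sketch below takes the Context Memory dimensions shown in Figure 1.1 (2 x 8 x 16 x 32 bits) and reads them as blocks x sets x words x bits per word, which is an interpretation consistent with the description above; the variable names are illustrative.

```python
BLOCKS = 2           # row context block and column context block
SETS_PER_BLOCK = 8   # one set per row (or column) of the RC Array
WORDS_PER_SET = 16   # context words in each set
BITS_PER_WORD = 32   # width of a context word / context register

# One plane takes one word from each of the 8 sets, so each block holds
# WORDS_PER_SET planes; the two blocks together give the depth.
depth_of_programmability = BLOCKS * WORDS_PER_SET

total_words = BLOCKS * SETS_PER_BLOCK * WORDS_PER_SET
total_bits = total_words * BITS_PER_WORD
total_bytes = total_bits // 8

print(f"depth of programmability: {depth_of_programmability}")
print(f"context words: {total_words}, size: {total_bits} bits ({total_bytes} bytes)")
```

Under this reading the M1 Context Memory holds 256 context words, or 1024 bytes, for a depth of 32.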

1.2.5 TinyRISC

Figure 1.10 shows the block diagram of TinyRISC. Since most target applications

involve some sequential processing, a RISC processor, TinyRISC [6], is included in the

system.

[TinyRISC block diagram. Labels: Fetch Stage; Decode Stage; Execute Stage; Write-Back Stage; Program Counter; Branch Unit; ALU; Shift Unit; Memory Unit; MorphoSys Unit; Clock Driver; Register File; Data Cache Core.]

Figure 1.10: TinyRISC block diagram

TinyRISC is a MIPS-like processor with a 4-stage scalar pipeline. It has a 32-bit ALU, a register file, and an on-chip data cache memory. The processor also coordinates system operation and controls the interface with the external world. This is made possible by adding some specific instructions (besides the standard RISC instructions) to the TinyRISC Instruction Set Architecture (ISA). These instructions, called MorphoSys instructions, initiate data transfers between main memory and the MorphoSys components and control the execution of the RC Array.

These MorphoSys instructions are listed in Table 1.2. There are two major

categories of these instructions: DMA instructions and RC Array instructions.

Table 1.2: MorphoSys Instructions

Mnemonic   Description of Operation

LDCTXT     Load context from Main Memory into Context Memory.
LDFB       Load data from Main Memory into Frame Buffer.
STFB       Store data into Main Memory from Frame Buffer.
CBCAST     Context broadcast, no data from Frame Buffer.
DBCBC      Column context broadcast, get data from both banks of Frame Buffer.
DBCBR      Row context broadcast, get data from both banks of Frame Buffer.
DBCB       Context broadcast, get data from both banks of Frame Buffer.
SBCB       Context broadcast, transfer 128-bit data from Frame Buffer.
WFB        Write the processed data back into Frame Buffer with indirect address.
WFBI       Write the processed data back into Frame Buffer with immediate address.
RCRISC     Write one 16-bit data from RC Array into TinyRISC.

The DMA instructions contain fields that provide the DMA Controller with

adequate information, such as starting address in main memory, starting address in Frame

Buffer or Context Memory, number of bytes to load, load or store control, etc. This

enables the transfer of data between main memory and Frame Buffer or Context Memory

through the DMA Controller.


The RC Array instructions have fields that provide control signals to the RC

Array and Context Memory. This is essential to enable the execution of computations in

the RC Array. This information includes the contexts to be executed, the mode of context

broadcast (row or column), location of data to be loaded in from Frame Buffer, etc.

1.3 Modifications to MorphoSys

In the implementation of M2, some modifications to MorphoSys architecture are

proposed, including memory size expansion and architectural revamping of the RC. The

modifications that have impact on the implementation of AES are briefly mentioned

below.

1.3.1 Size Expansion of Register File and Context Memory

To make each RC capable of running more complicated algorithms, 8 registers (instead of 4) will be included in the register file. The context memory will be enlarged to store 256 context planes instead of 32. These upgrades are critical to the effective implementation of some complex algorithms, such as AES, FFT, and Reed-Solomon codes. Specifically, AES uses 7 registers and 27 contexts for encryption, and 8 registers and 28 contexts for decryption. Notice that these context counts cover only the data processing part of AES. Its initialization part needs more than 500 additional contexts to load two tables of 256 bytes each. Since these tables are loaded only once per session, it is acceptable to load them in several passes through a small context memory. A context memory that stores 32 contexts is therefore enough for AES. The increase in the number of registers, however, is necessary for a high-speed implementation of AES.


1.3.2 Embedded Lookup Table in Every RC

Table lookup is a common operation in quite a few algorithms. For AES, it is the most important operation (see Chapter 2). To achieve high computing parallelism, M2 will embed a 512-byte lookup table in each RC. This table will be implemented in SRAM.

1.3.3 New RC Array Instructions

To access the lookup table in every RC, two new RC Array instructions,

“LDMM” and “STMM”, are added to the instruction set. For example, “LDMM r1 > 5”

means loading the value of table element (memory) at address r1 into register r5; “STMM

r5 > 1” means storing the value of register r5 into the table element (memory) at address

r1.


Chapter 2

The Advanced Encryption Standard (AES)

The Advanced Encryption Standard (AES) is the new encryption standard that is expected to replace the current standards, the Data Encryption Standard (DES) and Triple DES. The National Institute of Standards and Technology (NIST) worked with industry and the public cryptographic community to develop the AES [7]. This chapter gives an overview of the AES and its algorithm.

2.1 Introduction of the AES

After more than three years’ work, NIST recently announced Rijndael as the AES

algorithm. The development of AES and the nature of Rijndael algorithm are briefly

introduced in this section.

2.1.1 History of the AES Development

The AES development was launched by NIST on January 2, 1997. On August 20, 1998, NIST selected fifteen algorithms as candidates for testing. After comprehensive analysis and public comment by the global cryptographic community, five of them were selected as the AES finalists in April 1999: MARS, RC6, Rijndael, Serpent, and Twofish. Then, after two rounds of further public analysis, NIST announced on October 2, 2000 that Rijndael had been selected for the AES. Four months after the announcement, NIST finished a draft Federal Information Processing Standard (FIPS) for the AES and asked for public review and comment [8]. The comment period ended on May 29, 2001. According to NIST’s schedule, the formal standard is to be published by the summer of 2001.

2.1.2 Overview of Rijndael

Rijndael is a symmetric block cipher developed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen. According to its authors, Rijndael may be pronounced like "Reign Dahl", "Rain Doll", or "Rhine Dahl".

Rijndael can be applied to data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256 bits*. Rijndael's combination of security, performance, efficiency, ease of implementation, and flexibility makes it an appropriate selection for the AES.

Specifically, Rijndael performs very well in both hardware and software across a wide range of computing environments. Its initialization time is short, and its key agility is good. Rijndael's very low memory requirements make it well suited for restricted-space environments, in which it also demonstrates excellent performance. Rijndael's operations are among the easiest to defend against power and timing attacks [9][10]. Additionally, Rijndael's internal round structure appears to have good potential to benefit from instruction-level parallelism (ILP). It is this ILP characteristic of Rijndael that motivates the research on mapping it into the MorphoSys architecture.

More information about Rijndael is available from the website maintained by its authors: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/.

* In fact, Rijndael can handle any combination of key size and block size from 128, 192, and 256 bits. In the AES, however, the block size is fixed at 128 bits so that it can be more easily accommodated by many types of block cipher design.

2.1.3 Definition of Terms, Parameters and Functions

The terms, parameters, and functions used by AES are defined in the following

two tables. They conform to the convention used by the draft FIPS.

Table 2.1: Terms and Acronyms Used in AES

Term            Explanation
Block           Sequence of binary bits that comprise the input, output, State, and Round Key. The length of a block is the number of bits it contains. For AES, the block length is 128 bits.
Byte            A group of eight bits that is treated either as a single entity or as an array of 8 individual bits.
Cipher          Series of transformations that converts plaintext to ciphertext using the Cipher Key.
Cipher Key      Secret, cryptographic key that is used by the Key Expansion routine to generate a set of Round Keys; can be pictured as a rectangular array of bytes, having four rows and Nk columns.
Ciphertext      Data output from the Cipher or input to the Inverse Cipher.
Inverse Cipher  Series of transformations that converts ciphertext to plaintext using the Cipher Key.
Key Expansion   Routine used to generate a series of Round Keys from the Cipher Key.
Plaintext       Data input to the Cipher or output from the Inverse Cipher.
Round Key       Round Keys are values derived from the Cipher Key using the Key Expansion routine; they are applied to the State in the Cipher and Inverse Cipher.
State           Intermediate Cipher result that can be pictured as a rectangular array of bytes, having four rows and Nb columns.
S-box           Non-linear substitution table used in several byte substitution transformations to perform a one-for-one substitution of a byte value.
Word            A group of 32 bits that is treated either as a single entity or as an array of 4 bytes.


Table 2.2: Parameter and Functions Used in AES

AddRoundKey( )     Transformation in the Cipher and Inverse Cipher in which a Round Key is added to the State using an XOR operation. The length of a Round Key equals the size of the State (128 bits, or 16 bytes).
SubBytes( )        Transformation in the Cipher that processes the State using a non-linear byte substitution table (S-box) that operates on each of the State bytes independently.
ShiftRows( )       Transformation in the Cipher that processes the State by cyclically shifting the last three rows of the State by different offsets.
MixColumns( )      Transformation in the Cipher that takes all of the columns of the State and mixes their data (independently of one another) to produce new columns.
InvSubBytes( )     Transformation in the Inverse Cipher that is the inverse of SubBytes( ).
InvShiftRows( )    Transformation in the Inverse Cipher that is the inverse of ShiftRows( ).
InvMixColumns( )   Transformation in the Inverse Cipher that is the inverse of MixColumns( ).
RotWord( )         Function used in the Key Expansion routine that takes a 4-byte word and performs a cyclic permutation.
SubWord( )         Function used in the Key Expansion routine that takes a 4-byte input word and applies an S-box to each of the 4 bytes to produce an output word.
Nb                 Number of columns (32-bit words) comprising the State. For AES, Nb = 4.
Nk                 Number of 32-bit words comprising the Cipher Key. For AES, Nk = 4, 6, or 8.
Nr                 Number of rounds, which is a function of Nk and Nb (which is fixed). For AES, Nr = 10, 12, or 14.

2.2 Mathematical Background of Rijndael

Before looking into the algorithm of Rijndael, it is helpful to understand the

mathematical basis used by it. In this section, the necessary mathematical concepts are

introduced, and some simple examples are given.


2.2.1 Polynomial Representation of A Finite Field Element

The basic processing unit in Rijndael is a byte, which can be represented as a group of eight contiguous bits:

$$\{b_7, b_6, b_5, b_4, b_3, b_2, b_1, b_0\}, \quad \text{where } b_i = 0 \text{ or } 1$$

Furthermore, it can be interpreted as a finite field element using a polynomial representation [11]:

$$b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0$$

For example, {10011100} identifies the following specific finite field element:

$$x^7 + x^4 + x^3 + x^2$$

To simplify the representation, hexadecimal notation is introduced. For example, the above element {10011100} can be represented as {9C}, or simpler, ‘9C’.

Since the unit in Rijndael is a byte, all elements can be represented by two hexadecimal digits. This kind of finite field is called GF(2^8). (GF stands for Galois Field.)

2.2.2 Addition in GF(2^8)

The addition of two elements is a polynomial with coefficients that are given by the sum modulo 2 of the corresponding coefficients of the two operands. For example,

‘9C’ + ‘26’ = ‘BA’

Or, with the polynomial representation:

$$(x^7 + x^4 + x^3 + x^2) + (x^5 + x^2 + x) = x^7 + x^5 + x^4 + x^3 + x$$

Not surprisingly, the addition in GF(2^8) is actually a simple and fast bitwise XOR operation. To verify it with the previous example,

‘9C’ ⊕ ‘26’ = ‘10011100’ ⊕ ‘00100110’ = ‘10111010’ = ‘BA’
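
As a minimal sketch of the statement above, addition in GF(2^8) can be written in C as a single XOR. The function name gf_add is an illustrative assumption, not a name used elsewhere in this thesis.

```c
#include <stdint.h>

/* Addition in GF(2^8): the coefficient-wise sum modulo 2 of two
 * polynomials is the bitwise XOR of the two bytes.
 * (gf_add is a hypothetical helper name used only for illustration.) */
uint8_t gf_add(uint8_t a, uint8_t b)
{
    return (uint8_t)(a ^ b);
}
```

Because every element is its own additive inverse, the same function also serves as subtraction.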


So from now on, the symbol for addition might be either + or ⊕ .

The neutral element is ‘00’ , and the inverse (or, more accurately, additive inverse)

of any element is itself. So subtraction and addition are the same here*.

2.2.3 Multiplication in GF(2^8)

Multiplication in GF(2^8) corresponds to multiplication of polynomials modulo an irreducible binary polynomial of degree 8. A polynomial is irreducible if it has no divisors other than 1 and itself. For Rijndael, this irreducible polynomial is fixed and given by

$$m(x) = x^8 + x^4 + x^3 + x + 1$$

Or, it can be represented as ‘11B’ in hexadecimal notation. Notice that it is outside the range ‘00’ ~ ‘FF’.

Here is an example of multiplication: ‘9C’ • ‘26’ = ‘63’. First, the two polynomials are multiplied (addition is XOR, so equal terms cancel):

$$(x^7 + x^4 + x^3 + x^2) \cdot (x^5 + x^2 + x) = x^{12} + x^9 + x^8 + x^7 + x^9 + x^6 + x^5 + x^4 + x^8 + x^5 + x^4 + x^3 = x^{12} + x^7 + x^6 + x^3$$

Then the product is reduced modulo m(x):

$$(x^{12} + x^7 + x^6 + x^3) \bmod (x^8 + x^4 + x^3 + x + 1) = x^6 + x^5 + x + 1$$

The modular reduction by m(x) ensures that the result is a binary polynomial of degree less than 8, and thus can be represented by a byte.

The neutral element is ‘01’, and b(x) is the multiplicative inverse of a(x) if

$$b(x) \cdot a(x) \bmod m(x) = 1$$

* For more information, please refer to mathematics about Abelian group.


Unlike addition, there is no simple operation at the byte level that corresponds to multiplication. In software implementations of Rijndael, multiplication is usually done with the help of two lookup tables:

    if (a && b) return Alogtable[(Logtable[a] + Logtable[b]) % 255];
    else return 0;

This mirrors the ordinary identity $a \cdot b = \log^{-1}(\log a + \log b)$. More information about these tables, as well as a complete software implementation of Rijndael, can be found in [12].
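
The table-based multiplication can be sketched as a self-contained C fragment. This is only an illustration under stated assumptions: the tables are generated at run time with ‘03’ as the generator of the multiplicative group (the source does not say how its tables are built), and the helper names gf_init_tables and gf_mul_log are hypothetical.

```c
#include <stdint.h>

uint8_t Logtable[256];   /* Logtable[a]  = i such that '03'^i == a (a != 0) */
uint8_t Alogtable[256];  /* Alogtable[i] = '03'^i                            */

/* Fill both tables by repeatedly multiplying by the assumed generator '03'. */
void gf_init_tables(void)
{
    uint8_t v = 1;
    for (int i = 0; i < 255; i++) {
        Alogtable[i] = v;
        Logtable[v] = (uint8_t)i;
        /* v * '03' = (v * '02') XOR v; multiplying by '02' is a left
         * shift followed by a conditional reduction with '1B'. */
        uint8_t doubled = (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1B : 0x00));
        v = (uint8_t)(doubled ^ v);
    }
    Alogtable[255] = Alogtable[0]; /* '03'^255 == '01' */
    Logtable[0] = 0;               /* never used: log of zero is undefined */
}

/* Multiplication by three table lookups (two logs, one antilog). */
uint8_t gf_mul_log(uint8_t a, uint8_t b)
{
    if (a && b)
        return Alogtable[(Logtable[a] + Logtable[b]) % 255];
    return 0;
}
```

gf_init_tables() must be called once before the first call to gf_mul_log(). Each product then costs two lookups in Logtable and one in Alogtable.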

2.2.4 Multiplication by x

When b(x) is multiplied by x, the result before reduction modulo m(x) is:

$$b_7 x^8 + b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x$$

If $b_7 = 0$, no modular reduction is needed, since the degree is already less than 8.

If $b_7 = 1$, a modular reduction is necessary. It can be implemented by a bitwise XOR with ‘1B’. Notice that m(x) is actually ‘11B’; its most significant bit is XORed with $b_7$ and produces a zero, which can be omitted.

To summarize, a multiplication by x can be implemented at byte level as a 1-bit left shift followed by a conditional bitwise XOR with ‘1B’, denoted by b(x) = xtime(a(x)), or simply b = xtime(a). The xtime operation is much faster than a general multiplication, which, as shown before, relies on table lookups.

However, xtime is not the ultimate goal. What makes xtime useful is that ANY multiplication can be implemented as a sum of a series of xtime operations. Here is the proof and an example:


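Proof:

$$b(x) \cdot a(x) = b(x) \cdot a_0 + b(x) \cdot a_1 x + \cdots + b(x) \cdot a_7 x^7$$

$$= \underbrace{b}_{\text{exists if } a_0 = 1} \oplus \underbrace{xtime(b)}_{\text{exists if } a_1 = 1} \oplus \underbrace{xtime(xtime(b))}_{\text{exists if } a_2 = 1} \oplus \cdots \oplus \underbrace{xtime(\cdots xtime(xtime(b)) \cdots)}_{\text{exists if } a_7 = 1}$$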

Example: ‘9C’ • ‘26’ = ‘63’ :

‘26’ = ‘00100110’, so a1 = a2 = a5 = ‘1’

xtime(‘9C’) = ‘00111000’ ⊕ ‘1B’ = ‘23’ # has conditional XOR

xtime(‘23’) = ‘46’ # no conditional XOR

xtime(‘46’) = ‘8C’ # no conditional XOR

xtime(‘8C’) = ‘00011000’ ⊕ ‘1B’ = ‘03’ # has conditional XOR

xtime(‘03’) = ‘06’ # no conditional XOR

‘9C’ • ‘26’ = ‘9C’ • (‘02’ ⊕ ‘04’ ⊕ ‘20’) = ‘23’ ⊕ ‘46’ ⊕ ‘06’ = ‘63’ .

Notice that multiple xtime operations may be needed to perform just one

multiplication.
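
The xtime operation and the sum-of-xtime multiplication can be sketched in C as follows. This is a minimal illustration of the byte-level description above; the function name gf_mul_xtime is an assumption made for this sketch.

```c
#include <stdint.h>

/* Multiply by x: 1-bit left shift, then XOR with '1B' if the bit
 * shifted out (b7) was set. */
uint8_t xtime(uint8_t a)
{
    uint8_t shifted = (uint8_t)(a << 1);
    return (a & 0x80) ? (uint8_t)(shifted ^ 0x1B) : shifted;
}

/* Multiply b by a as a sum (XOR) of repeated xtime results:
 * bit i of a selects the term xtime applied i times to b.
 * (gf_mul_xtime is a hypothetical helper name.) */
uint8_t gf_mul_xtime(uint8_t b, uint8_t a)
{
    uint8_t result = 0;
    while (a) {
        if (a & 1)
            result ^= b;   /* this power of x is present in a */
        b = xtime(b);      /* advance to the next power of x  */
        a >>= 1;
    }
    return result;
}
```

The loop runs at most eight times and stops as soon as the remaining bits of a are zero, so small multipliers such as ‘02’ and ‘03’ need only one xtime that contributes to the result.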

2.2.5 Polynomials with Coefficients in GF(2^8)

A polynomial can be defined with coefficients in GF(2^8). In Rijndael, this kind of polynomial has four terms, i.e., a degree less than 4. For example,

$$a(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0$$

is such a polynomial. Notice that $a_3, a_2, a_1, a_0$ are bytes, i.e., elements of GF(2^8), rather than single bits ‘0’ or ‘1’.

Addition is defined similarly:

$$a(x) + b(x) = (a_3 \oplus b_3) x^3 + (a_2 \oplus b_2) x^2 + (a_1 \oplus b_1) x + (a_0 \oplus b_0)$$


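Multiplication is a little different. The first step is

$$c(x) = a(x) \cdot b(x) = c_6 x^6 + c_5 x^5 + c_4 x^4 + c_3 x^3 + c_2 x^2 + c_1 x + c_0$$

where

$$\begin{aligned}
c_0 &= a_0 \cdot b_0 \\
c_1 &= a_1 \cdot b_0 \oplus a_0 \cdot b_1 \\
c_2 &= a_2 \cdot b_0 \oplus a_1 \cdot b_1 \oplus a_0 \cdot b_2 \\
c_3 &= a_3 \cdot b_0 \oplus a_2 \cdot b_1 \oplus a_1 \cdot b_2 \oplus a_0 \cdot b_3 \\
c_4 &= a_3 \cdot b_1 \oplus a_2 \cdot b_2 \oplus a_1 \cdot b_3 \\
c_5 &= a_3 \cdot b_2 \oplus a_2 \cdot b_3 \\
c_6 &= a_3 \cdot b_3
\end{aligned}$$

The second step is to reduce the previous result to a polynomial of degree less than 4. In Rijndael, this is accomplished by reduction modulo $M(x) = x^4 + 1$. Let d(x) be the modular product of a(x) and b(x), then

$$d(x) = d_3 x^3 + d_2 x^2 + d_1 x + d_0$$

where

$$\begin{aligned}
d_0 &= (a_0 \cdot b_0) \oplus (a_3 \cdot b_1) \oplus (a_2 \cdot b_2) \oplus (a_1 \cdot b_3) \\
d_1 &= (a_1 \cdot b_0) \oplus (a_0 \cdot b_1) \oplus (a_3 \cdot b_2) \oplus (a_2 \cdot b_3) \\
d_2 &= (a_2 \cdot b_0) \oplus (a_1 \cdot b_1) \oplus (a_0 \cdot b_2) \oplus (a_3 \cdot b_3) \\
d_3 &= (a_3 \cdot b_0) \oplus (a_2 \cdot b_1) \oplus (a_1 \cdot b_2) \oplus (a_0 \cdot b_3)
\end{aligned}$$

Using matrix form, it can be written in the following circulant format:

$$\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix} =
\begin{bmatrix}
a_0 & a_3 & a_2 & a_1 \\
a_1 & a_0 & a_3 & a_2 \\
a_2 & a_1 & a_0 & a_3 \\
a_3 & a_2 & a_1 & a_0
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}$$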
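
The modular product of two four-term polynomials can be sketched in C as below. The sketch assumes that index i of each array holds the coefficient of x^i; the names gf_mul and poly4_mul are hypothetical helpers introduced only for this illustration.

```c
#include <stdint.h>

/* Byte multiplication in GF(2^8) modulo m(x) = '11B' (shift-and-add). */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return r;
}

/* d(x) = a(x) * b(x) mod (x^4 + 1).
 * Since x^4 = 1 under this modulus, the term a_k * b_j contributes to
 * d_((k + j) mod 4), which gives the circulant structure.
 * The output array d must not alias a or b. */
void poly4_mul(const uint8_t a[4], const uint8_t b[4], uint8_t d[4])
{
    for (int i = 0; i < 4; i++) {
        d[i] = 0;
        for (int j = 0; j < 4; j++)
            d[i] ^= gf_mul(a[(i - j + 4) % 4], b[j]);
    }
}
```

A convenient check is to multiply ‘03’x^3 + ‘01’x^2 + ‘01’x + ‘02’ by ‘0B’x^3 + ‘0D’x^2 + ‘09’x + ‘0E’: the two polynomials are inverses of each other modulo x^4 + 1, so the product is ‘01’.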


2.3 Rijndael Specification

Rijndael is an iterated block cipher. The number of rounds depends on the values

of Nb and Nk. The Key-Block-Round relation is given in Table 2.3. In the following

sections, the algorithms for the cipher and inverse cipher are described separately. Since

the inverse cipher’s algorithm is very similar to the cipher’s, the discussion will be

mainly focused on the cipher.

Table 2.3: Key-Block-Round Combinations

Key Length (Nk words) Block Size (Nb words) Number of Rounds (Nr)

AES-128 4 4 10

AES-192 6 4 12

AES-256 8 4 14

2.3.1 The Cipher

The pseudo code for the cipher is listed in Figure 2.1.

KeyExpansion(CipherKey, RoundKey);   // see sec 2.3.1.5
state = in;
AddRoundKey(state);
for ( round = 1; round < Nr; round++ )
{
    SubBytes(state);                 // see sec 2.3.1.1
    ShiftRows(state);                // see sec 2.3.1.2
    MixColumns(state);               // see sec 2.3.1.3
    AddRoundKey(state);              // see sec 2.3.1.4
}
SubBytes(state);
ShiftRows(state);
AddRoundKey(state);
out = state;

Figure 2.1: Pseudo Code for the Cipher of Rijndael Algorithm


The cipher consists of three parts:

• an initial Key Expansion and Round Key addition.

• Nr-1 intermediate rounds

• a final round

Every intermediate round consists of four steps:

• Substitute Bytes

• Shift Rows

• Mix Columns

• Add Round Key

And the final round can be regarded as an incomplete intermediate round, lacking

the MixColumns step.

KeyExpansion, SubBytes, ShiftRows, MixColumns, and AddRoundKey are the distinct functions of the cipher. Their algorithms are described below. Because KeyExpansion itself relies on the S-box substitution defined for SubBytes, it is described last.

2.3.1.1 SubBytes( ) Function

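SubBytes is a non-linear byte substitution. It substitutes each byte of the State with the corresponding element of a table called the S-box. This table is constructed in two steps:

1. Take the multiplicative inverse in GF(2^8), with ‘00’ mapped onto itself.

2. Apply an affine transformation over GF(2), acting on the bits $x_0, \ldots, x_7$ of the result of step 1, defined by:

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} +
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$

Since the table is fixed, it is loaded during initialization and accessed by a table-lookup operation. For example, SubBytes(‘00’) = ‘63’ (‘63’ is the first element of the S-box). The whole S-box table is listed in Appendix A.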
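
The two construction steps can be sketched in C to compute a single S-box entry. This is only an illustration: the inverse is found by brute-force search rather than by any method described in the source, and the names gf_mul, gf_inv, rotl8, and sbox_entry are hypothetical.

```c
#include <stdint.h>

/* Byte multiplication in GF(2^8) modulo m(x) = '11B'. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return r;
}

/* Step 1: multiplicative inverse, with '00' mapped onto itself.
 * Brute force is used here only for clarity. */
uint8_t gf_inv(uint8_t a)
{
    if (a == 0)
        return 0;
    for (int b = 1; b < 256; b++)
        if (gf_mul(a, (uint8_t)b) == 1)
            return (uint8_t)b;
    return 0; /* not reached: every non-zero element has an inverse */
}

/* Rotate a byte left by n bits (1 <= n <= 7). */
uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Step 2: affine transformation. Each row of the matrix says
 * y_i = x_i ^ x_(i+4) ^ x_(i+5) ^ x_(i+6) ^ x_(i+7) (indices mod 8),
 * which is x XORed with four rotations of itself; the constant
 * vector is the byte '63'. */
uint8_t sbox_entry(uint8_t a)
{
    uint8_t x = gf_inv(a);
    return (uint8_t)(x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63);
}
```

Calling sbox_entry for all 256 input values would reproduce the complete table that an implementation loads during initialization.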

2.3.1.2 ShiftRows( ) Function

In ShiftRows, the four rows of the State are cyclically shifted over different

offsets: Row 0 is not shifted; Row 1 is shifted over 1 byte; Row 2 is shifted over 2 bytes;

Row 3 is shifted over 3 bytes. So the positions (denoted by numbers from 1 to 16) of the

bytes in a State are changed like:

1 5 9 13 1 5 9 13

2 6 10 14 6 10 14 2

3 7 11 15 11 15 3 7

4 8 12 16 16 4 8 12

Figure 2.2: Transformation of ShiftRows( )
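
The row shifts of Figure 2.2 can be sketched in C as follows. The sketch assumes the State is held as a 16-byte array in column-major order (index r + 4*c for row r, column c), matching the position numbering 1 to 16 used in the figure; the function name shift_rows is illustrative.

```c
#include <stdint.h>
#include <string.h>

/* ShiftRows on a column-major State: row r is rotated left by r bytes,
 * so the byte at (r, c) comes from (r, (c + r) mod 4). */
void shift_rows(uint8_t state[16])
{
    uint8_t tmp[16];
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            tmp[r + 4 * c] = state[r + 4 * ((c + r) % 4)];
    memcpy(state, tmp, sizeof tmp);
}
```

Filling the array with the values 1 to 16 and calling shift_rows reproduces the right-hand matrix of Figure 2.2, read column by column.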


2.3.1.3 MixColumns( ) Function

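In MixColumns, the columns of the State are considered as polynomials with coefficients in GF(2^8). They are multiplied modulo $x^4 + 1$ with a fixed polynomial c(x), given by

$$c(x) = \text{'03'}\,x^3 + \text{'01'}\,x^2 + \text{'01'}\,x + \text{'02'}$$

Recalling Section 2.2.5, the multiplication $d(x) = a(x) \cdot c(x)$ can be written as

$$\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix} =
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}$$

As a result, the four bytes in a column a(x) are transformed into the following d(x):

$$\begin{aligned}
d_0 &= (\text{'02'} \cdot a_0) \oplus (\text{'03'} \cdot a_1) \oplus a_2 \oplus a_3 \\
d_1 &= a_0 \oplus (\text{'02'} \cdot a_1) \oplus (\text{'03'} \cdot a_2) \oplus a_3 \\
d_2 &= a_0 \oplus a_1 \oplus (\text{'02'} \cdot a_2) \oplus (\text{'03'} \cdot a_3) \\
d_3 &= (\text{'03'} \cdot a_0) \oplus a_1 \oplus a_2 \oplus (\text{'02'} \cdot a_3)
\end{aligned}$$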

There are two ways to do the multiplications in above expressions. One way is to

use two tables (Logtable and Alogtable) and three table-lookup operations (see Section

2.2.3); another way is to use multiple xtime operations (see Section 2.2.4), each of which

can be implemented either by dedicated hardware or by one table-lookup operation.

As shown before, a disadvantage of the xtime approach is that multiple xtime operations may be needed to finish one multiplication, especially when the coefficients of the fixed polynomial c(x) have high degrees. Fortunately, c(x) is fixed in Rijndael and the degrees of its coefficients are not very high: in the encryption part of Rijndael, the coefficients are ‘03’, ‘01’, ‘01’, and ‘02’, so the xtime approach works very well. In the decryption part of Rijndael, the coefficients are ‘0B’, ‘0D’, ‘09’, and ‘0E’, which require more xtime operations and reduce the speed a little.

Another concern about the xtime approach comes from the structure of MorphoSys. Because MorphoSys is a reconfigurable computing system rather than an ASIC, one should not expect to implement the xtime operation at byte level as a 1-bit left shift followed by a conditional bitwise XOR with ‘1B’. Instead, xtime will be implemented by a table-lookup operation. At first sight xtime then seems unattractive, because it still needs a table lookup, and one multiplication needs multiple xtime operations. In fact, however, the M2 version of MorphoSys can do table lookups quite efficiently. More importantly, the xtime approach needs only one table, while a general multiplication needs two. Considering the tradeoff between speed (not much difference) and memory usage (a ratio of 1 to 2), the xtime approach is preferable.

In MixColumns, the four elements in a column are transformed by the following

code.

tmp = a[0]^a[1]^a[2]^a[3]; // ^ means XOR
tm = a[0]^a[1]; tm = xtime(tm); a[0] = a[0] ^ tm ^ tmp;




Figure 2.3: Doing MixColumns( ) by xtime Approach

It is easy to prove that the new a[i] equals the d_i given above. Notice that a[0]^(a[0]^a[1]^a[2]^a[3]) = a[1]^a[2]^a[3], etc., and that tmp is shared among the four expressions to save registers.
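
A complete version of the xtime-based column transformation can be sketched in C as below. This is a reconstruction under the assumption that the remaining three bytes follow the same pattern as a[0]; the function name mix_column and the saved copy a0 are choices made for this sketch, not taken from the source.

```c
#include <stdint.h>

/* Multiply by x in GF(2^8): left shift, conditional XOR with '1B'. */
uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* MixColumns on one column, in place, using one xtime per output byte.
 * For each i: new a[i] = a[i] ^ xtime(a[i] ^ a[i+1]) ^ tmp,
 * which expands to '02'*a[i] ^ '03'*a[i+1] ^ a[i+2] ^ a[i+3]. */
void mix_column(uint8_t a[4])
{
    uint8_t a0 = a[0];  /* original a[0], needed again for the last byte */
    uint8_t tmp = (uint8_t)(a[0] ^ a[1] ^ a[2] ^ a[3]);
    uint8_t tm;

    tm = (uint8_t)(a[0] ^ a[1]); tm = xtime(tm); a[0] ^= (uint8_t)(tm ^ tmp);
    tm = (uint8_t)(a[1] ^ a[2]); tm = xtime(tm); a[1] ^= (uint8_t)(tm ^ tmp);
    tm = (uint8_t)(a[2] ^ a[3]); tm = xtime(tm); a[2] ^= (uint8_t)(tm ^ tmp);
    tm = (uint8_t)(a[3] ^ a0);   tm = xtime(tm); a[3] ^= (uint8_t)(tm ^ tmp);
}
```

Because the column is updated in place, the original a[0] has to be kept in a separate variable; otherwise the last expression would use the already transformed byte.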


2.3.1.4 AddRoundKey( ) Function

Round Key addition is very simple and straightforward. In this operation, current

Round Key is applied to the current State by a bitwise XOR.

2.3.1.5 Key Expansion

The purpose of Key Expansion is to derive all Round Keys from the Cipher Key.

It should be done during the initialization. And it only needs to be done once if the Cipher

Key is not changed during the whole session*.

The pseudo code for Key Expansion is shown in Figure 2.4.

KeyExpansion (Key[4*Nk], W[4*(Nr+1)], Nk)
{
    for ( i = 0; i < Nk; i++ )
        W[i] = (Key[4*i], Key[4*i+1], Key[4*i+2], Key[4*i+3]);

    for ( i = Nk; i < 4*(Nr+1); i++ )
    {
        temp = W[i-1];
        if ( i % Nk == 0 )
            temp = SubWord(RotWord(temp)) ^ Rcon[i/Nk];
        else if ( Nk == 8 && i % Nk == 4 )
            temp = SubWord(temp);
        W[i] = W[i-Nk] ^ temp;
    }
}

SubWord (W(a, b, c, d))
{ return W(S-box(a), S-box(b), S-box(c), S-box(d)); }

RotWord (W(a, b, c, d))
{ return W(b, c, d, a); }

Figure 2.4: Pseudo Code for Key Expansion

* Usually the Cipher Key is not changed in one session of encryption/decryption. But theoretically, one can use several Cipher Keys within one session to achieve better security. In that case, each change of Cipher Key will introduce one Key Expansion.


Recall that there are an initial Round Key addition, several intermediate rounds, and a final round, so the number of Round Keys equals the number of rounds plus 1. Because Nr = 10, 12, 14 for Nk = 4, 6, 8, respectively, the numbers of Round Keys are 11, 13, 15, respectively.

The expansion processes the data at word level. The ith word, or W[i], includes

the (4* i)th, (4* i+1)th, (4* i+2)th, (4* i+3)th byte, or the ith column. For example, if Nk =

4, there are 4 words in the Cipher Key. And it would be expanded to 11*4 = 44 words, or

44*4*8 = 1408 bits.

The Rcon[ ] array in the code is a constant array listed in Appendix A.
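
The Key Expansion pseudo code of Figure 2.4 can be sketched as runnable C. Several points are assumptions of this sketch rather than statements of the source: words are packed with the first key byte in the most significant position, Nr is taken as Nk + 6 (valid for Nb = 4), the S-box is computed at run time instead of being read from Appendix A, and the round constant is generated by repeated xtime starting from ‘01’ instead of being read from the Rcon[ ] table. All function names are illustrative.

```c
#include <stdint.h>

uint8_t sbox[256];

uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (; b; b >>= 1, a = xtime(a))
        if (b & 1)
            r ^= a;
    return r;
}

/* Build the S-box: multiplicative inverse, then affine transformation. */
void init_sbox(void)
{
    for (int a = 0; a < 256; a++) {
        uint8_t x = 0;  /* inverse of a; '00' maps onto itself */
        for (int b = 1; b < 256 && a != 0; b++)
            if (gf_mul((uint8_t)a, (uint8_t)b) == 1) { x = (uint8_t)b; break; }
        uint8_t y = x;
        for (int n = 1; n <= 4; n++)
            y ^= (uint8_t)((x << n) | (x >> (8 - n)));
        sbox[a] = (uint8_t)(y ^ 0x63);
    }
}

uint32_t sub_word(uint32_t w)
{
    return ((uint32_t)sbox[w >> 24] << 24) |
           ((uint32_t)sbox[(w >> 16) & 0xFF] << 16) |
           ((uint32_t)sbox[(w >> 8) & 0xFF] << 8) |
            (uint32_t)sbox[w & 0xFF];
}

/* (a, b, c, d) -> (b, c, d, a) */
uint32_t rot_word(uint32_t w)
{
    return (w << 8) | (w >> 24);
}

/* key holds 4*Nk bytes; W must have room for 4*(Nr+1) words.
 * init_sbox() must have been called before. */
void key_expansion(const uint8_t *key, uint32_t *W, int Nk)
{
    int Nr = Nk + 6;    /* 10, 12, 14 for Nk = 4, 6, 8 */
    uint8_t rc = 0x01;  /* round constant byte, starts at x^0 */

    for (int i = 0; i < Nk; i++)
        W[i] = ((uint32_t)key[4 * i] << 24) | ((uint32_t)key[4 * i + 1] << 16) |
               ((uint32_t)key[4 * i + 2] << 8) | (uint32_t)key[4 * i + 3];

    for (int i = Nk; i < 4 * (Nr + 1); i++) {
        uint32_t temp = W[i - 1];
        if (i % Nk == 0) {
            temp = sub_word(rot_word(temp)) ^ ((uint32_t)rc << 24);
            rc = xtime(rc);  /* next power of x */
        } else if (Nk == 8 && i % Nk == 4) {
            temp = sub_word(temp);
        }
        W[i] = W[i - Nk] ^ temp;
    }
}
```

For Nk = 4 the caller provides an array of 44 words; Round Key r then consists of W[4r] to W[4r+3].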

As shown in the code, the first Nk words of the expanded Round Keys are exactly the original Cipher Key. After that, an optimized hardware implementation should carry out the expansion in a number of loops, because the expanded Round Keys can then be calculated in place, which saves a lot of memory. Please refer to Section 3.2.1 for detailed information.

The result of Key Expansion is a sequence of words that is partitioned into (Nr+1) Round Keys. The partition is very simple: starting from the beginning, every 4 words form a Round Key. Figure 2.5 shows the Round Key expansion and partition for Nk = 6. As shown below, W0 to W5 form the original Cipher Key, but every Round Key contains only 4 words.

W0 W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 W11 W12 W13 …

Round Key 0 Round Key 1 Round Key 2 …

Figure 2.5: Key Expansion and Round Key Partition for Nk = 6


2.3.2 The Inverse Cipher

In the Inverse Cipher, each function is substituted by its inverse function, and the

order is reversed. The basic pseudo code is listed below.

KeyExpansion(CipherKey, RoundKey);

state = in;

// inverse of last round

AddRoundKey(state); // use RoundKey[Nr]

InvShiftRows(state);

InvSubBytes(state);

// inverse of intermediate rounds

for ( round = Nr-1; round > 0; round --)

{

AddRoundKey(state); // inv addition = addition

InvMixColumns(state);

InvShiftRows(state);

InvSubBytes(state);

}

// inverse initial Round Key Addition

AddRoundKey(state); // use RoundKey[0]

out = state;

Figure 2.6: Basic Pseudo Code for the Inverse Cipher of Rijndael Algorithm

InvShiftRows( ) is defined as follows: Row 0 is not shifted; Row 1 is shifted over 3 bytes; Row 2 is shifted over 2 bytes; Row 3 is shifted over 1 byte. So the positions of the bytes in a State are changed like:

1 5 9 13 1 5 9 13

2 6 10 14 14 2 6 10

3 7 11 15 11 15 3 7

4 8 12 16 8 12 16 4

Figure 2.7: Transformation of InvShiftRows( )


InvSubBytes( ) is the byte substitution where the inverse table, inv S-box, is

applied. The inv S-box table is listed in Appendix A.

InvMixColumns( ) is similar to MixColumns( ), but it uses a different c(x), given by

$$c(x) = \text{'0B'}\,x^3 + \text{'0D'}\,x^2 + \text{'09'}\,x + \text{'0E'}$$

The coefficients of this polynomial are larger than those of the polynomial used by MixColumns( ), ‘03’x^3 + ‘01’x^2 + ‘01’x + ‘02’. So InvMixColumns( ) is slower, due to more xtime and XOR operations (see Section 2.3.1.3).

There are some properties of these inverse functions that can be exploited to

derive a Cipher-like structure for the Inverse Cipher.

First, the order of InvShiftRows( ) and InvSubBytes( ) does not matter. This is because InvShiftRows( ) simply moves the bytes to new positions without changing their values, while InvSubBytes( ) works on individual bytes, independent of their positions.

Second, the sequence

AddRoundKey(State, RoundKey);
InvMixColumns(State);

can be replaced by

InvMixColumns(State);
AddRoundKey(State, InvRoundKey);

because InvMixColumns( ) is linear with respect to XOR. InvRoundKey is obtained by:

1. Apply the Key Expansion.

2. Apply InvMixColumns to all Round Keys except the first one and the last one.
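
The exchange of AddRoundKey and InvMixColumns can be checked on a single column with the following C sketch. It is an illustration only: it uses a general GF(2^8) multiplication instead of the xtime tables discussed in this thesis, and the names gf_mul, inv_mix_column, order_a, and order_b are hypothetical.

```c
#include <stdint.h>

/* Byte multiplication in GF(2^8) modulo m(x) = '11B'. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return r;
}

/* InvMixColumns on one column: circulant matrix built from
 * c(x) = '0B'x^3 + '0D'x^2 + '09'x + '0E'. */
void inv_mix_column(const uint8_t a[4], uint8_t d[4])
{
    d[0] = (uint8_t)(gf_mul(0x0E, a[0]) ^ gf_mul(0x0B, a[1]) ^ gf_mul(0x0D, a[2]) ^ gf_mul(0x09, a[3]));
    d[1] = (uint8_t)(gf_mul(0x09, a[0]) ^ gf_mul(0x0E, a[1]) ^ gf_mul(0x0B, a[2]) ^ gf_mul(0x0D, a[3]));
    d[2] = (uint8_t)(gf_mul(0x0D, a[0]) ^ gf_mul(0x09, a[1]) ^ gf_mul(0x0E, a[2]) ^ gf_mul(0x0B, a[3]));
    d[3] = (uint8_t)(gf_mul(0x0B, a[0]) ^ gf_mul(0x0D, a[1]) ^ gf_mul(0x09, a[2]) ^ gf_mul(0x0E, a[3]));
}

/* Original order: AddRoundKey(RoundKey), then InvMixColumns. */
void order_a(const uint8_t s[4], const uint8_t k[4], uint8_t out[4])
{
    uint8_t t[4];
    for (int i = 0; i < 4; i++)
        t[i] = (uint8_t)(s[i] ^ k[i]);
    inv_mix_column(t, out);
}

/* Swapped order: InvMixColumns, then AddRoundKey(InvRoundKey),
 * where InvRoundKey = InvMixColumns(RoundKey). */
void order_b(const uint8_t s[4], const uint8_t k[4], uint8_t out[4])
{
    uint8_t inv_key[4];
    inv_mix_column(s, out);
    inv_mix_column(k, inv_key);
    for (int i = 0; i < 4; i++)
        out[i] ^= inv_key[i];
}
```

Both orders give the same column for any state and key bytes, which is why InvRoundKey can be computed once after the Key Expansion and reused for every block.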


Notice that the basic pseudo code in Figure 2.6 can be represented by the

following sequence:

ASB AMSB AMSB … AMSB A

where A means AddRoundKey( ), S means InvShiftRows( ), B means

InvSubBytes( ), and M means InvMixColumns( ).

Using the two properties to change the order SB to BS, AM to MA, the sequence

becomes

ABS MABS MABS … MABS A

or equivalently

A BSMA BSMA … BSMA BSA

The last sequence is exactly the Cipher’s sequence. So, with the use of InvRoundKey, the Inverse Cipher has the same structure as the Cipher. When AES is mapped into MorphoSys, the Inverse Cipher uses exactly the same architecture as the Cipher. Of course, the functions InvShiftRows( ) and InvMixColumns( ) differ slightly from ShiftRows( ) and MixColumns( ), and InvRoundKey replaces RoundKey.


Chapter 3

Mapping AES into MorphoSys

AES has already been widely implemented in different formats, such as

C/C++[13][14], Java[15], Visual Basic[16], Perl[17], Assembly[18], Ada[19], etc. It can

also be implemented by hardware, such as ASIC. MorphoSys is designed for applications

with inherent data-parallelism, high regularity, and high throughput requirement. Due to

the high data-parallelism in the AES algorithm, MorphoSys is able to implement it much

faster than those software implementations. Besides, because of the reconfigurability of

MorphoSys, the mapped AES algorithm can be part of a larger system.

In this chapter, several key features of MorphoSys that help the mapping of AES are pointed out. Then the complete mapping process is discussed, including the Key Expansion by the TinyRISC processor, the data processing by the RC Array, and the context/data loading and storing. Finally, the simulation and its results are presented and analyzed.

3.1 Parallel Computing Exploration

Rijndael is a block cipher that involves a large number of table-lookup operations and data movements; the actual ALU operations are only a small part in terms of running time or number of instructions. The main concerns are therefore how to transfer the blocks between the Frame Buffer and the RC Array, how to do the table-lookup operations, and how to move the data among RCs with the help of the three layers of the RC Array interconnection network.


3.1.1 Multi-block Processing

Every data block in Rijndael has 16 bytes, while the number of RCs in the RC

Array is 64. Because there is no data dependency between any two data blocks,

MorphoSys has the capability to process 4 data blocks at the same time.

Because each block is a 4x4 matrix, it is very natural to partition the 4 blocks as

shown in Figure 3.1. However, because the data is column-wise stored in main memory

and Frame Buffer, this partitioning will introduce data reshuffle, which is very difficult to

realize in the Frame Buffer.

Block 0 (4x4)

Block 1 (4x4)

Block 2 (4x4)

Block 3 (4x4)

Figure 3.1: Intuitive Partitioning of RC Array

The actual partitioning used in the implementation is shown in Figure 3.2.

Block 0 (8x2)

Block 1 (8x2)

Block 2 (8x2)

Block 3 (8x2)

Figure 3.2: Actual Partitioning of RC Array


Under this partitioning, the data loading/storing process is straightforward. But

the data movement for ShiftRows( ) is not the same as in a 4x4 matrix. Please refer to

Section 3.1.3 for details about the data movement.

3.1.2 Parallel Table-lookup

In M2’s architecture, there is an embedded memory for each RC. This memory

behaves as a local lookup table. When a context commands a row/column to perform a

table-lookup operation, eight table-lookups are done in parallel. Furthermore, if the eight

contexts in a whole context plane all indicate table-lookup operations, 64 table-lookups

are done in parallel. On the other hand, in a software implementation of Rijndael, the

table-lookup operation can only be done one by one. That is significantly slower than the

implementation in MorphoSys.

3.1.3 Dedicated Data Movement for Rijndael

Recall the data movement for ShiftRows( ). The new position of every byte is

shown in Figure 3.3.

1 5 9 13 1 5 9 13

2 6 10 14 6 10 14 2

3 7 11 15 11 15 3 7

4 8 12 16 16 4 8 12

Figure 3.3: Transformation of ShiftRows( ) in 4x4 Matrix

Before moving the data according to ShiftRows( ), one needs to be aware of what data is needed by the subsequent function MixColumns( ). MixColumns( ) is a “column” function, which means that a byte only needs the values of the four bytes (including itself) in the same column for the transformation. For example, the highlighted byte at position 10 needs the values of the bytes in the same column, marked 5, 10, 15, and 4, to do MixColumns( ).

In MorphoSys, a block is partitioned into an 8x2 matrix, and every RC stores one byte. So ShiftRows( ) will move the data as follows.

1 9 1 9

2 10 6 14

3 11 11 3

4 12 16 8

5 13 5 13

6 14 10 2

7 15 15 7

8 16 4 12

Figure 3.4: Transformation of ShiftRows( ) in 8x2 Matrix
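
The relation between the 4x4 State and the 8x2 partition can be sketched in C. The sketch assumes that byte index p (0 to 15, column-major in the 4x4 State) is held by the RC at row p mod 8 and column p / 8 of the block, which is what Figure 3.4 suggests; the type rc_pos and the function names are hypothetical, and the code models only the resulting permutation, not the actual RC Array data movement.

```c
#include <stdint.h>

/* Position of an RC inside one 8x2 block (hypothetical helper type). */
typedef struct { int row; int col; } rc_pos;

/* RC that holds byte index p of the column-major 4x4 State. */
rc_pos rc_of_byte(int p)
{
    rc_pos pos = { p % 8, p / 8 };
    return pos;
}

/* Index of the byte that ShiftRows moves to index p:
 * row r = p mod 4 is rotated left by r columns. */
int shiftrows_source(int p)
{
    int r = p % 4, c = p / 4;
    return r + 4 * ((c + r) % 4);
}

/* Apply ShiftRows to a block stored in the 8x2 layout. */
void shift_rows_8x2(uint8_t in[8][2], uint8_t out[8][2])
{
    for (int p = 0; p < 16; p++) {
        rc_pos dst = rc_of_byte(p);
        rc_pos src = rc_of_byte(shiftrows_source(p));
        out[dst.row][dst.col] = in[src.row][src.col];
    }
}
```

Filling the input with the position numbers 1 to 16 (column 0 holding 1 to 8, column 1 holding 9 to 16) yields the right-hand side of Figure 3.4.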

To make every RC do MixColumns( ) independently and simultaneously, it is desirable to have every RC store the four relevant values used by MixColumns( ) in its local registers. For example, because RC(5,0)* uses the input values in RC(4,0), RC(5,0), RC(6,0), and RC(7,0) for MixColumns( ), it should store them in its local registers.

Figure 3.5 shows the result of the data movement for ShiftRows( ). After the move, each RC contains the shifted data as well as the other data needed for MixColumns( ). Notice that only two columns of RCs are shown here; the other six columns of RCs (the other three blocks) apply the same move. The order of the bytes saved in the four registers is not important. As shown later, the actual order is not exactly the same as in Figure 3.5. It merely depends on the ease of implementation.

* Assume we only consider block 0 here. The corresponding RCs in other three blocks are RC(5,2), RC(5,4), and RC(5,6).

       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   1  –  –  –   9  –  –  –     1  6 11 16   9 14  3  8
Row 1   2  –  –  –  10  –  –  –     1  6 11 16   9 14  3  8
Row 2   3  –  –  –  11  –  –  –     1  6 11 16   9 14  3  8
Row 3   4  –  –  –  12  –  –  –     1  6 11 16   9 14  3  8
Row 4   5  –  –  –  13  –  –  –     5 10 15  4  13  2  7 12
Row 5   6  –  –  –  14  –  –  –     5 10 15  4  13  2  7 12
Row 6   7  –  –  –  15  –  –  –     5 10 15  4  13  2  7 12
Row 7   8  –  –  –  16  –  –  –     5 10 15  4  13  2  7 12

Figure 3.5: Data Movement for ShiftRows( )
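
The target state in Figures 3.4 and 3.5 can be checked with a small model. The sketch below assumes that the 4x4 columns 0 and 1 are stacked into RC column 0 and columns 2 and 3 into RC column 1, which is what the figures suggest; the helper names are hypothetical, and the register order follows Figure 3.5 rather than the order used by the actual contexts.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def to_rc_columns(state):
    """Map a 4x4 state onto two RC columns of eight bytes each (8x2 layout)."""
    cols = [[state[r][c] for r in range(4)] for c in range(4)]
    return [cols[0] + cols[1], cols[2] + cols[3]]

def mixcolumns_registers(rc_columns):
    """Give every RC (row, col) the four bytes of the AES column it belongs to."""
    registers = {}
    for col, data in enumerate(rc_columns):
        for row in range(8):
            start = 4 * (row // 4)          # rows 0-3 share one AES column, rows 4-7 the next
            registers[(row, col)] = data[start:start + 4]
    return registers

# Byte labels 1..16, numbered column by column as in Figure 3.3.
state = [[r + 4 * c + 1 for c in range(4)] for r in range(4)]
rc_columns = to_rc_columns(shift_rows(state))
registers = mixcolumns_registers(rc_columns)

print(rc_columns[0])        # RC column 0 after ShiftRows( ), top to bottom
print(registers[(5, 0)])    # r0..r3 of RC(5,0)
```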

The detailed data movement illustration and algorithm for encryption/decryption

are discussed in Section 3.2.4.

3.2 Algorithm Flowchart and Illustration

The whole algorithm can be divided into two parts: a sequential part and a parallel part. The sequential part is the Key Expansion, which is done by TinyRISC. The parallel part includes loading lookup tables, loading Round Keys, loading data, processing data, and storing data. It is done by the RC Array.

The complete flowchart is shown in Figure 3.6. The implementation of each block is discussed in the following sections.

1. Key Expansion by TinyRISC: store the result (Round Keys) into main memory.

2. Table Loading: load the xtime and S-box (or inv S-box) tables into every RC.

3. Data and Round Key Loading: load four data blocks and the currently-needed Round Key from main memory to Frame Buffer, then to RC Array.

4. Data Encryption/Decryption: perform the multiple-round cipher or inverse cipher in RC Array.

5. Data Storing: store four data blocks from RC Array to Frame Buffer, then to main memory.

6. End of Data? If No, continue with the next four blocks from step 3; if Yes, End.

Figure 3.6: Flowchart of Rijndael Implementation in MorphoSys


3.2.1 Key Expansion by TinyRISC

The pseudo code for Key Expansion has been discussed in Section 2.3.1.5. In order to reduce the number of registers used in TinyRISC, the assembly code uses a loop structure: Nk words are generated in each loop, until the total number reaches the desired number of words. For example, if Nk = 4, the total number of words in all Round Keys is 4*11 = 44, so the total number of loops is $\lceil 44/4 \rceil = 11$; if Nk = 6, the total number is 4*13 = 52, so the number of loops is $\lceil 52/6 \rceil = 9$; if Nk = 8, the total number is 4*15 = 60, so the number of loops is $\lceil 60/8 \rceil = 8$. Because the total is not divisible by Nk when Nk ≠ 4, more words than necessary are generated during the expansion. The extra words can simply be discarded.
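
The loop counts above follow directly from the key length. The following sketch recomputes them; the function name is hypothetical, and the round-count rule Nr = max(Nk, Nb) + 6 with Nb = 4 is the standard Rijndael rule, stated here as an assumption since this section does not repeat it.

```python
import math

def key_expansion_loops(nk, nb=4):
    """Return (total_words, loops, extra_words) for a key of nk 32-bit words."""
    nr = max(nk, nb) + 6                 # number of rounds: 10, 12, or 14
    total_words = nb * (nr + 1)          # 44, 52, or 60 Round Key words
    loops = math.ceil(total_words / nk)  # each loop generates nk words
    extra_words = loops * nk - total_words
    return total_words, loops, extra_words

for nk in (4, 6, 8):
    print(nk, key_expansion_loops(nk))
```

The third element of each tuple is the number of surplus words that are discarded.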

In the Inverse Cipher, an additional InvMixColumns( ) function is applied to

every Round Key except the first and last one.

Because the main memory and TinyRISC are 32-bit, the expanded Round Keys are also stored in 32-bit format. But this format cannot be used by the Frame Buffer, which expects 16-bit inputs. For example, when the Frame Buffer reads a 32-bit word 0x00000064 from main memory, it treats it as two numbers: 0x0000 and 0x0064. So the result needs 2-to-1 concatenations: after all Round Keys are generated and stored back into main memory in 32-bit format, they are loaded into TinyRISC again, two 32-bit words at a time, concatenated into one 32-bit word, and then stored back into main memory.

0x000000eb  \
             >  0x00eb003d
0x0000003d  /
                (next concatenation)

Figure 3.7: Concatenations of Round Keys
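
A minimal sketch of the 2-to-1 concatenation in Figure 3.7 is given below. It is written in Python rather than TinyRISC assembly, and the function names are made up for illustration; it assumes each input word carries its value in the low 16 bits.

```python
def concat_pair(first, second):
    """Pack the low 16 bits of two 32-bit words into one 32-bit word."""
    return ((first & 0xFFFF) << 16) | (second & 0xFFFF)

def concat_all(words):
    """Concatenate a list of 32-bit words pairwise; the length must be even."""
    if len(words) % 2 != 0:
        raise ValueError("expected an even number of words")
    return [concat_pair(words[i], words[i + 1]) for i in range(0, len(words), 2)]

print(hex(concat_pair(0x000000EB, 0x0000003D)))
```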


3.2.2 Table Loading

Three types of contexts are needed for loading each table element. They are:

set 0, 0 LDIM! 5 def def > 0; # load value 5 into RC’s r0

set 0, 15 STMM r0 def > 1; # store r0 into table address r1

set 8, 15 ADD r1 r2 > 1; # increase r1 by 1 (r2)

Notice that once STMM and ADD* are loaded into Context Memory, they can be used for every table element. So theoretically, the total number of contexts needed to load two 256-byte tables is 256*2 (LDIMs for the two tables) + 1 (STMM) + 1 (ADD) + 2 (setting the initial values of r0 and r1) = 516. But the Context Memory is not big enough to hold all 516 contexts: in M2, it can hold up to 256 contexts. Since the STMM, ADD, and initialization contexts are needed once in every 256 contexts, the pattern of contexts should be:

1st loading: 252 LDIMs + 1 STMM + 1 ADD + 2 Initialization

2nd loading: 252 LDIMs + 1 STMM + 1 ADD + 2 Initialization

3rd loading: 8 LDIMs + 1 STMM + 1 ADD + 2 Initialization

The total number of contexts is 256*2 + 12 = 524.
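
The loading pattern above can be derived mechanically from the Context Memory capacity. The sketch below is a simple model of that arithmetic only; the function and parameter names are hypothetical, and it assumes, as in the text, that every loading repeats the four non-LDIM contexts (STMM, ADD, and two initializations).

```python
def plan_table_loading(ldim_total=512, capacity=256, overhead=4):
    """Split the LDIM contexts into loadings that fit the Context Memory.

    Returns the number of LDIMs in each loading and the total context count.
    """
    per_loading = capacity - overhead
    loadings = []
    remaining = ldim_total
    while remaining > 0:
        count = min(per_loading, remaining)
        loadings.append(count)
        remaining -= count
    total_contexts = ldim_total + overhead * len(loadings)
    return loadings, total_contexts

print(plan_table_loading())
```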

At the time the author simulated the implementation of AES, the simulator was only able to handle up to 32 contexts (i.e., M1’s structure), so the table loading took 18 passes instead of 3. In any case this is not a big issue, because the table loading is done only once, during the initialization.

There are two tables to be loaded: xtime and S-box (or inv S-box). One of them occupies addresses 0x00 to 0xFF, and the other occupies addresses 0x100 to 0x1FF. As shown later, accessing the second table requires an extra context that adds the offset 0x100 for every table lookup operation. Because the xtime table is used more frequently (see next section), it is reasonable to load it first.

* All RC Array instructions are listed in Appendix C.

3.2.3 Data and Round Key Loading

Four blocks, or 64 bytes of data, and the currently needed Round Key (16 bytes) are loaded from main memory into the Frame Buffer, then into the RC Array. Because the four blocks use the same Round Key, the Round Key is loaded from the Frame Buffer to the RC Array four times, once for each block. The instructions involved are LDFB and SBCB.

3.2.4 Data Processing in RC Array

After the data and Round Key have been loaded into the RC Array, the next step is to process the data in the RC Array. As stated in Chapter 2, the process includes four functions: SubBytes( ), ShiftRows( ), MixColumns( ), and AddRoundKey( ).

The contexts for SubBytes( ) are very simple:

set 0, 3 ADD r0 r1 > 0; # r1 is constant 0x0100

set 0, 4 LDMM r0 def > 0; # load into r0

The first context adds the offset 0x100 to the index register r0. The second context loads the table element at the resulting address (the original r0 plus 0x100) into r0. So the result is r0 = S-box(r0) (or inv S-box(r0)).
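
The two-context lookup can be modelled in software as an index into one 512-entry local memory. The sketch below assumes the layout described in Section 3.2.2 (xtime at 0x000 to 0x0FF, S-box at 0x100 to 0x1FF) and computes the S-box from its standard definition (multiplicative inverse in GF(2^8) followed by the affine transform) instead of copying Appendix A; all function names are illustrative.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def sbox_entry(a):
    """S-box value: inverse in GF(2^8) (0 maps to 0), then the affine transform."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):            # a^254 is the multiplicative inverse of a
            inv = gf_mul(inv, a)
    rot = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63

# Model of one RC's local memory: xtime table first, S-box second.
SBOX_OFFSET = 0x100
table_memory = [xtime(i) for i in range(256)] + [sbox_entry(i) for i in range(256)]

def sub_byte(r0):
    r0 = r0 + SBOX_OFFSET       # first context: ADD r0 r1 > 0 (r1 = 0x0100)
    return table_memory[r0]     # second context: LDMM r0 def > 0

print(hex(sub_byte(0x53)))
```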

The context for AddRoundKey( ) is also very simple:

set 8, 0 XOR r0 r7 > 0; # RoundKey is saved in r7

However, the contexts for ShiftRows( ) and MixColumns( ) are more complicated. ShiftRows( ) includes eight steps of data movement, and MixColumns( ) mainly consists of xtime and XOR operations.

The data movement and contexts for ShiftRows( ) are illustrated in several figures.



       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   1  –  –  –   9  –  –  –     1  –  –  –   9  –  –  –
Row 1   2  –  –  –  10  –  –  –     2  –  –  –  10  –  –  –
Row 2   3  –  –  –  11  –  –  –     3  –  –  –  11  –  –  –
Row 3   4  –  –  –  12  –  –  –     4  5  –  –  12 13  –  –
Row 4   5  –  –  –  13  –  –  –     5  –  –  –  13  –  –  –
Row 5   6  –  –  –  14  –  –  –     6  1  –  –  14  9  –  –
Row 6   7  –  –  –  15  –  –  –     7  –  –  –  15  –  –  –
Row 7   8  –  –  –  16  –  –  –     8  –  –  –  16  –  –  –

$r_0^4 \rightarrow r_1^3$, $r_0^0 \rightarrow r_1^5$ (Express Lane, Row Mode)

set 8 , 1 BYPASS r0 def > 0 WE ;
set 9 , 1 BYPASS r0 def > 0 ;
set 11 , 1 BYPASS VE def > 1 ;

Figure 3.8: ShiftRows( ) Step 1

Figure 3.8 shows the first step of ShiftRows( ). The contexts are in Row Mode, which means one context for one row. Rows 0 and 4 put the data in r0 onto the Express Lane, and Rows 5 and 3, respectively, get the data from the corresponding vertical Express Lane and save it into register r1. In this way, the values at positions 1 and 5 are transferred to the desired positions. In this step, Rows 1, 2, 6, and 7 perform NOP operations.

($r_k^i$ in the figure means rk in Row i.)



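       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   1  –  –  –   9  –  –  –     1  –  –  –   9  –  –  –
Row 1   2  –  –  –  10  –  –  –     2  7  –  –  10 15  –  –
Row 2   3  –  –  –  11  –  –  –     3  –  –  –  11  –  –  –
Row 3   4  5  –  –  12 13  –  –     4  5  –  –  12 13  –  –
Row 4   5  –  –  –  13  –  –  –     5  –  –  –  13  –  –  –
Row 5   6  1  –  –  14  9  –  –     6  1  –  –  14  9  –  –
Row 6   7  –  –  –  15  –  –  –     7  –  –  –  15  –  –  –
Row 7   8  –  –  –  16  –  –  –     8  3  –  –  16 11  –  –

$r_0^6 \rightarrow r_1^1$, $r_0^2 \rightarrow r_1^7$

Figure 3.9: ShiftRows( ) Step 2
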
Figure 3.9 shows the second step. It is similar to the first step, but moves different data into the desired positions.

($r_k^i$ in the figure means rk in Row i.)



       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   1  –  –  –   9  –  –  –     1  –  –  –   9  –  –  –
Row 1   2  7  –  –  10 15  –  –     2  7 10 15  10 15  2  7
Row 2   3  –  –  –  11  –  –  –     3  –  –  –  11  –  –  –
Row 3   4  5  –  –  12 13  –  –     4  5  –  –  12 13  –  –
Row 4   5  –  –  –  13  –  –  –     5  –  –  –  13  –  –  –
Row 5   6  1  –  –  14  9  –  –     6  1  –  –  14  9  –  –
Row 6   7  –  –  –  15  –  –  –     7  –  –  –  15  –  –  –
Row 7   8  3  –  –  16 11  –  –     8  3 16 11  16 11  8  3

$r_0^1 \rightarrow r_2^0$, $r_1^1 \rightarrow r_3^0$ | $r_0^0 \rightarrow r_2^1$, $r_1^0 \rightarrow r_3^1$ (Left/Right, Column Mode)

set 0 , 7 BYPASS L def > 2 ;
set 2 , 7 BYPASS R def > 2 ;

Figure 3.10: ShiftRows( ) Step 3, 4

The third and fourth steps use Column Mode. Column 2i gets data from Column 2i+1 (i = 0, 1, 2, 3), and vice versa. Only one context plane is shown in Figure 3.10; the others are similar.

($r_k^i$ in the figure means rk in Column i.)



       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   1  –  –  –   9  –  –  –     6  –  –  –  14  –  –  –
Row 1   2  7 10 15  10 15  2  7     6  7 10 15  14 15  2  7
Row 2   3  –  –  –  11  –  –  –     6  –  –  –  14  –  –  –
Row 3   4  5  –  –  12 13  –  –     6  5  –  –  14 13  –  –
Row 4   5  –  –  –  13  –  –  –     4  –  –  –  12  –  –  –
Row 5   6  1  –  –  14  9  –  –     4  1  –  –  12  9  –  –
Row 6   7  –  –  –  15  –  –  –     4  –  –  –  12  –  –  –
Row 7   8  3 16 11  16 11  8  3     4  3 16 11  12 11  8  3

$r_0^5 \rightarrow r_0^{0,1,2,3}$, $r_0^3 \rightarrow r_0^{4,5,6,7}$

set 11 , 5 BYPASS VE def > 0 WE ;

Figure 3.11: ShiftRows( ) Step 5

After four steps, all the seed data used for ShiftRows( ) and MixColumns( ) are ready. Those seeds are highlighted in the left table in Figure 3.11. Then the Express Lanes are exploited again to store one byte into four other RCs at the same time. In Step 5, the seeds in register r0 of RC(3, i) and RC(5, i) are propagated through the Express Lane and fetched by register r0 of RC(4-7, i) and RC(0-3, i), respectively.

($r_k^i$ in the figure means rk in Row i.)



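       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   6  –  –  –  14  –  –  –     6  1 16 11  14  9  8  3
Row 1   6  7 10 15  14 15  2  7     6  1 16 11  14  9  8  3
Row 2   6  –  –  –  14  –  –  –     6  1 16 11  14  9  8  3
Row 3   6  5  –  –  14 13  –  –     6  1 16 11  14  9  8  3
Row 4   4  –  –  –  12  –  –  –     4  5 10 15  12 13  2  7
Row 5   4  1  –  –  12  9  –  –     4  5 10 15  12 13  2  7
Row 6   4  –  –  –  12  –  –  –     4  5 10 15  12 13  2  7
Row 7   4  3 16 11  12 11  8  3     4  5 10 15  12 13  2  7

Step 6: $r_1^5 \rightarrow r_1^{0,1,2,3}$, $r_1^3 \rightarrow r_1^{4,5,6,7}$
Step 7: $r_2^7 \rightarrow r_2^{0,1,2,3}$, $r_2^1 \rightarrow r_2^{4,5,6,7}$
Step 8: $r_3^7 \rightarrow r_3^{0,1,2,3}$, $r_3^1 \rightarrow r_3^{4,5,6,7}$
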
Figure 3.12: ShiftRows( ) Step 6, 7, 8

Steps 6, 7, and 8 are similar to Step 5. They store the data from the Express Lane into registers r1, r2, and r3, respectively. Only one context plane is shown above; the others are similar. After these eight steps, every RC contains the data for MixColumns( ).

($r_k^i$ in the figure means rk in Row i.)

The algorithm for MixColumns( ) is listed below again for convenience.

tmp = a0 ^ a1 ^ a2 ^ a3;

tm = a0 ^ a1; tm = xtime(tm); a0 ^= tm ^ tmp;

The distinct contexts for them are just “XOR” and “LDMM”. For example:

set 0 , 8 XOR r0 r4 > 4 ;

set 0 , 11 LDMM r5 def > 5 ;
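
For reference, the XOR/xtime formulation can be written out for a whole column. This is a sketch, not the context sequence itself: the lines for a1, a2, and a3 are assumed to be the cyclic analogues of the a0 line, as in the Rijndael reference code, and `xtime` is computed arithmetically here instead of being read from the lookup table with LDMM.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def mix_column(col):
    """Apply MixColumns( ) to one column [a0, a1, a2, a3] using only XOR and xtime."""
    a0, a1, a2, a3 = col
    tmp = a0 ^ a1 ^ a2 ^ a3
    return [
        a0 ^ xtime(a0 ^ a1) ^ tmp,
        a1 ^ xtime(a1 ^ a2) ^ tmp,
        a2 ^ xtime(a2 ^ a3) ^ tmp,
        a3 ^ xtime(a3 ^ a0) ^ tmp,
    ]

print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

In the MorphoSys mapping each RC only needs one of the four output bytes, which is why every RC keeps all four input bytes of its column in local registers.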

So far all the functions for the Cipher have been discussed. After optimization, the data processing part of the Cipher uses only 27 contexts in total.

In the Inverse Cipher, the SubBytes( ) and AddRoundKey( ) functions use the same contexts as in the Cipher (with the inv S-box loaded in place of the S-box), but InvShiftRows( ) and InvMixColumns( ) are slightly different. InvShiftRows( ) also takes eight steps of data movement; the only difference is the position of the target data. Figure 3.13 shows the first four steps of InvShiftRows( ).



       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   1  –  –  –   9  –  –  –     1  –  –  –   9  –  –  –
Row 1   2  –  –  –  10  –  –  –     2  5  –  –  10 13  –  –
Row 2   3  –  –  –  11  –  –  –     3  –  –  –  11  –  –  –
Row 3   4  –  –  –  12  –  –  –     4  7 12 15  12 15  4  7
Row 4   5  –  –  –  13  –  –  –     5  –  –  –  13  –  –  –
Row 5   6  –  –  –  14  –  –  –     6  3 14 11  14 11  6  3
Row 6   7  –  –  –  15  –  –  –     7  –  –  –  15  –  –  –
Row 7   8  –  –  –  16  –  –  –     8  1  –  –  16  9  –  –

Figure 3.13: InvShiftRows( ) Step 1, 2, 3, 4


Figure 3.14 shows the next four steps. The highlighted bytes in left table are

seeds. They are propagated to four RCs through the Express Lanes.



       Before                      After
       Column 0     Column 1       Column 0     Column 1
       r0 r1 r2 r3  r0 r1 r2 r3    r0 r1 r2 r3  r0 r1 r2 r3
Row 0   1  –  –  –   9  –  –  –     8  1 14 11  16  9  6  3
Row 1   2  5  –  –  10 13  –  –     8  1 14 11  16  9  6  3
Row 2   3  –  –  –  11  –  –  –     8  1 14 11  16  9  6  3
Row 3   4  7 12 15  12 15  4  7     8  1 14 11  16  9  6  3
Row 4   5  –  –  –  13  –  –  –     2  5 12 15  10 13  4  7
Row 5   6  3 14 11  14 11  6  3     2  5 12 15  10 13  4  7
Row 6   7  –  –  –  15  –  –  –     2  5 12 15  10 13  4  7
Row 7   8  1  –  –  16  9  –  –     2  5 12 15  10 13  4  7

Figure 3.14: InvShiftRows( ) Step 5, 6, 7, 8

The algorithm for InvMixColumns( ) is listed below. Due to more xtime and XOR operations, the running time is increased a little. However, with very careful arrangement of registers and table lookups, the total number of contexts for the data processing part of decryption is increased by only 1, to 28.

tm1 = a0 ^ a1 // r5 for tm1, ri for ai (i = 0, 1, 2, 3)

tmp1 = tm1 ^ a2 // r6 for tmp1, get r5 before it is destroyed

tm1 = xtime(tm1) // r5 for tm1, needs one lookup context C0

r4 = a0 ^ tm1 // r5 is free and can be used again

tm2 = a0 ^ a2 // r5 for tm2

tm2 = xtime(xtime(tm2)) // all use the same context C0 as before

r4 = r4 ^ tm2 // r5 is free again

tmp2 = tmp1 ^ a3 // r5 for tmp2, switch back to r5

r4 = r4 ^ tmp2 // tmp2 = a0 ^ a1 ^ a2 ^ a3 here

tmp2 = xtime(xtime(xtime(tmp2))) // all use context C0

r4 = r4 ^ tmp2 // r4 saves the result of InvMixColumns( )
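
The sequence above can be checked in software against the usual definition of InvMixColumns( ), in which the output byte is 0e·a0 ^ 0b·a1 ^ 0d·a2 ^ 09·a3. The sketch below follows the listed steps one by one for a single output byte; the function names are hypothetical, and rotating the column to obtain the other three output bytes is an assumption about how each RC is fed its own a0 to a3.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def inv_mix_byte(a0, a1, a2, a3):
    """One output byte of InvMixColumns( ), following the listed register steps."""
    tm1 = a0 ^ a1
    tmp1 = tm1 ^ a2
    tm1 = xtime(tm1)
    r4 = a0 ^ tm1
    tm2 = xtime(xtime(a0 ^ a2))
    r4 ^= tm2
    tmp2 = tmp1 ^ a3                      # a0 ^ a1 ^ a2 ^ a3
    r4 ^= tmp2
    tmp2 = xtime(xtime(xtime(tmp2)))
    return r4 ^ tmp2

def inv_mix_column(col):
    """All four output bytes: byte i sees the column rotated so that col[i] is a0."""
    return [inv_mix_byte(*(col[i:] + col[:i])) for i in range(4)]

print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
```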


3.2.5 Data Storing

After four data blocks are processed in the RC Array, they are stored into the Frame Buffer, and then into main memory. The instructions involved are WFBI and STFB. If there is more data to be encrypted/decrypted, the program continues to process the next four blocks with the same procedure, until it reaches the end of the data.

The result saved in main memory has the concatenated format. For example, a 32-bit word “0x00010002” means two bytes: “0x01” and “0x02”. To comply with the same format as the input*, which uses 32 bits to represent a byte, the result needs to be separated. Using the same example, “0x00010002” is separated into “0x00000001” and “0x00000002”. This separation is performed after all the data have been encrypted/decrypted.
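
The separation step is the inverse of the Round Key concatenation. A minimal Python sketch, with hypothetical function names, is:

```python
def split_word(word):
    """Split one concatenated 32-bit word into its two 16-bit halves."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

def split_all(words):
    """Expand every concatenated word into two 32-bit words, one value each."""
    result = []
    for word in words:
        result.extend(split_word(word))
    return result

print([hex(w) for w in split_all([0x00010002])])
```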

3.3 Simulation Environment

The MorphoSys group has developed a set of software tools to facilitate algorithm mapping, source code compilation, and algorithm simulation for M1. The complete set includes Tcc, TRASM, MorphoSim, mView, mLoad, mSched, and mULATE, as shown in Figure 3.15. Tcc is a C/C++ compiler that generates TinyRISC executable code. TRASM is an assembler that also generates TinyRISC executable code. MorphoSim is a VHDL simulator, which exactly matches the MorphoSys chip. mLoad, mView, and mSched are used for context generation and application scheduling. mULATE is a cycle-accurate simulator, which is more abstract than MorphoSim.

* This consistency might be unnecessary. It depends on the specific application.

[Figure 3.15 diagram. Recoverable labels: App. (C or Assembly Code); TinyRISC; RC Array; TR_app (For I=1 to 20, X[I]=X[I]+1); Tcc or TRASM; Z=RC_F(X), W=RC_F(Y); RC Array functions; mLoad; Context Lib.; Configuration context; mView; mSched; Executable; MuLate, MorphoSim (C++, VHDL); MorphoSys Chip]

Figure 3.15: Software Tools for MorphoSys

To be compatible with the modifications in M2, all of these tools need to be updated. So far, mLoad*, mULATE†, and TRASM‡ have been updated. The author therefore wrote and compiled the TinyRISC assembly code and contexts of the whole algorithm, then used mULATE to simulate it.

3.4 Performance Analysis

A comprehensive simulation of encryption and decryption under different Key sizes was performed in mULATE. The results are compared with those of implementations in assembly language, C/C++, and Java, and on ASIC/Programmable Logic cores.

* mLoad is the context compiler written in Perl. It was updated by the author.
† mULATE was updated by Afshin Niktash.
‡ TRASM was updated by Afshin Niktash.

For the initialization part, other implementations may need only the Key Expansion. The MorphoSys implementation, however, needs the Key Expansion, lookup table loading, and context loading. Table 3.1 shows the numbers of cycles for the Key Expansion implemented in ANSI C, in C++, and on MorphoSys TinyRISC*. In the MorphoSys implementation, the Key Expansion for the Inverse Cipher is much slower because the InvMixColumns( ) operation is applied to each Round Key except the first and last one, and InvMixColumns( ) involves many memory operations, which take many cycles.

Table 3.1: # of Cycles for Key Expansion in Several Implementations

Key Size   AES CD (ANSI C)           Brian Gladman (VC++)      MorphoSys TinyRISC
           Cipher   Inverse Cipher   Cipher   Inverse Cipher   Cipher   Inverse Cipher
128        2100     2900             305      1389             2770     13320
192        2600     3600             277      1595             3386     15603
256        2800     3800             374      1960             4196     19184

The numbers of cycles for all three parts of the initialization in the MorphoSys implementation are listed in Table 3.2. It shows that the Cipher and Inverse Cipher may need up to 10675 and 25671 cycles for the whole initialization, respectively. Assuming M2 runs at 200 MHz, this takes about 53 µs and 128 µs, respectively, which is short enough to be acceptable.

* The statistics for ANSI C and C++ are obtained from the AES proposal by Rijndael’s authors.

Table 3.2: # of Cycles for AES Initialization in MorphoSys Implementation

Key Size   Key Expansion   Table Loading   Context Loading   Total # of cycles
128        2770/13320      6249            230/238           9249/19807
192        3386/16029      6249            230/238           9865/22516
256        4196/19184      6249            230/238           10675/25671

* in “x/y” , “x” for encryption, “y” for decryption
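
The totals in Table 3.2 and the corresponding initialization times can be recomputed as follows. The cycle counts are copied from Table 3.2 and the 200 MHz clock is the assumption stated in the text; the names used in the code are illustrative only.

```python
CLOCK_MHZ = 200
TABLE_LOADING = 6249
CONTEXT_LOADING = {"enc": 230, "dec": 238}
KEY_EXPANSION = {
    128: {"enc": 2770, "dec": 13320},
    192: {"enc": 3386, "dec": 16029},
    256: {"enc": 4196, "dec": 19184},
}

def init_cycles(key_size, mode):
    """Total initialization cycles: Key Expansion + table loading + context loading."""
    return KEY_EXPANSION[key_size][mode] + TABLE_LOADING + CONTEXT_LOADING[mode]

def cycles_to_us(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_mhz

for key_size in (128, 192, 256):
    for mode in ("enc", "dec"):
        total = init_cycles(key_size, mode)
        print(key_size, mode, total, round(cycles_to_us(total), 1), "us")
```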

For the data processing part, the numbers of cycles and/or throughputs for

encryption implemented by assembly language, C/C++, and Java are listed in Table 3.3.

All the throughputs (unit: Mb/s) are calculated at frequency 200 MHz.

Table 3.3: # of Cycles and Throughputs per Block in Other Implementations

Key Size   Intel 8051    Motorola 68HC08   AES CD (ANSI C)      Brian Gladman (VC++)   Java
           # of cycles   # of cycles       # of cycles   Xput   # of cycles   Xput     # of cycles   Xput
128        4065          8390              950           27.0   363           70.5     23000         1.1
192        4512          10780             1125          22.8   432           59.3     27600         0.93
256        5221          12490             1295          19.8   500           51.2     32300         0.79

* result for encryption only

The MorphoSys implementation result is listed in Table 3.4. Because four blocks are processed in parallel each time, the effective number of cycles for one block is only 1/4 of the computing cycles. For example, when the Key size is 128 bits, the data processing part for encryption needs 601 / 4 = 150.25 cycles/block.

Table 3.4: # of Cycles and Throughputs per Block in MorphoSys Implementation

Key Size   Encryption            Decryption
           # of cycles   Xput    # of cycles   Xput
128        150.25        170.4   166           154.2
192        175.25        146.1   194.5         131.6
256        200.25        127.8   223           114.8

* in “a/b” , “a” for encryption, “b” for decryption
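
The throughput figures follow from the cycles per block, the 128-bit block size, and the assumed 200 MHz clock. A short sketch of the conversion (the function name is hypothetical):

```python
BLOCK_BITS = 128

def throughput_mbps(cycles_per_block, clock_mhz=200.0):
    """Throughput in Mb/s for one 128-bit block every cycles_per_block cycles."""
    return BLOCK_BITS * clock_mhz / cycles_per_block

# Four blocks are processed together, so 601 cycles cover four blocks.
print(round(throughput_mbps(601 / 4), 1))   # encryption, 128-bit Key
print(round(throughput_mbps(166), 1))       # decryption, 128-bit Key
```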

As shown in the above tables, the running time for initialization is much longer than that for processing one block, no matter how the AES is implemented. However, the initialization is only a small fraction of the total running time unless the data to be processed is very small. Assume the Key size is 128 bits and the data size is 64K Bytes, or 4K blocks; then MorphoSys needs to load data into the RC Array about 1000 times (four blocks per load). So the total time for the data processing part is about 601,000 / 664,000 cycles for encryption / decryption, and the initialization takes only about 1.5% / 3% of the whole time.
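
The share of the initialization in the total running time can be estimated as below. The cycle counts are taken from Tables 3.2 and 3.4 for a 128-bit Key, and the figure of 1000 loads is the rounded value used in the text; the function name is illustrative.

```python
def init_fraction(init_cycles, cycles_per_load, loads):
    """Fraction of the total running time spent in initialization."""
    processing = cycles_per_load * loads
    return init_cycles / (init_cycles + processing)

enc = init_fraction(init_cycles=9249, cycles_per_load=601, loads=1000)
dec = init_fraction(init_cycles=19807, cycles_per_load=664, loads=1000)
print(f"encryption: {100 * enc:.1f}%  decryption: {100 * dec:.1f}%")
```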

On Aug 8, 2001, Amphion Semiconductor Ltd. [20] announced its application-specific cores for AES applications. The performance of its CS 5210-5280 Family (standard series) ASIC cores and programmable logic cores is shown in Tables 3.5, 3.6, and 3.7. The ASIC cores are about 240% to 270% faster than the MorphoSys implementation, and the programmable logic cores are also about 30% to 60% faster. But several other issues should be considered when comparing their performance. First, encryption and decryption need different Amphion cores; second, the initialization time of the Amphion cores is unknown (though this is usually not important); third, MorphoSys is not just an ASIC or FPGA, and can run many other applications efficiently on the same architecture.

Table 3.5: AES by Amphion ASIC Cores using TSMC 0.18µm Technology


Key Size   Logic Gates   Encryption                                     Decryption
                         Timing Constraints (MHz)   Throughput (Mb/s)   Timing Constraints (MHz)   Throughput (Mb/s)
128        18.2K         200                        581                 200                        581
192        18.2K         200                        492                 200                        492
256        18.2K         200                        426                 200                        426

Table 3.6: AES by Amphion Programmable Logic Cores using Altera APEX20KE-1


Key Size   Logic Used (LE)*   Memory Used (ESB)   Encryption                              Decryption
                                                  Clock Speed (MHz)   Throughput (Mb/s)   Clock Speed (MHz)   Throughput (Mb/s)
128        1452/1560          8                   77.8                226                 74.1                215
192        1452/1560          8                   77.8                191                 74.1                182
256        1452/1560          8                   77.8                166                 74.1                158

* encryption/decryption

Table 3.7: AES by Amphion Programmable Logic Cores using Xilinx VirtexE-8


Key Size   Logic Used (LUT)*   Memory Used (BRAM)   Encryption                              Decryption
                                                    Clock Speed (MHz)   Throughput (Mb/s)   Clock Speed (MHz)   Throughput (Mb/s)
128        1008/1092           4                    92.3                268                 86.7                254
192        1008/1092           8                    92.3                227                 86.7                213
256        1008/1092           8                    92.3                196                 86.7                184

* encryption/decryption

Figure 3.16 compares the data processing throughputs of the C/C++, MorphoSys, Amphion ASIC core, and Amphion FPGA core implementations for encryption at Key size = 128 bits. The throughput of the MorphoSys implementation is close to that of the Amphion Altera core implementation.

[Bar chart, throughput in Mb/s: ANSI C 27; C++ 70.5; MorphoSys 170.4; ASIC Core 581; Altera Core 226; Xilinx Core 268]

Figure 3.16: Throughputs of Different Implementations

3.5 Conclusions

The performance of the AES implementation in MorphoSys is satisfactory. The throughput is more than 100 Mb/s, which is usually adequate for applications on mobile phones and PDAs. If the throughput requirement of an application is very stringent and cannot be met by a single MorphoSys, one can consider a larger parallel computing system consisting of several identical MorphoSys cores. Since there is no data dependency among blocks, the “scaling up” is theoretically unlimited and will not introduce the performance degradation that would exist if there were inter-block data communication. Of course, in a real implementation, the MorphoSys chip usually does not run the AES algorithm alone. It might be uneconomical to increase the number of MorphoSys cores just for the AES requirement.

Another possible approach to improving the performance is to include a programmable logic block, such as a PLD/CPLD, in MorphoSys to handle logic functions and bit-level operations. But there might be a tradeoff between flexibility and speed. This is in fact a research topic in the MorphoSys group.

Bibliography

[1] M. H. Lee, H. Singh, G. Lu, N. Bagherzadeh, F. J. Kurdahi, “Design and Implementation of the MorphoSys Reconfigurable Computing Processor,” Journal of VLSI Signal Processing Systems, vol. 24, pp. 164-172, March 2000

[2] H. Singh, M. H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, T. Lang, R. Heaton, and E. M. C. Filho, “MorphoSys: An Integrated Re-configurable Architecture,” NATO Symposium on Concepts and Integration, April 1998

[3] S. Brown and J. Rose, “Architecture of FPGAs and CPLDs: A Tutorial,” IEEE Design and Test of Computers, Vol. 13, No. 2, pp. 42-57, 1996

[4] G. Lu, “Modeling, Implementation and Scalability of the MorphoSys Dynamically Reconfigurable Computing Architecture,” Ph.D. Dissertation, 2000

[5] M. H. Lee, “Design and Implementation of the High-Performance Low-Power MorphoSys,” Ph.D. Dissertation, 2000

[6] A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor and N. Bagherzadeh, “Design and implementation of TinyRISC microprocessor,” Microprocessors and Microsystems, Vol. 16, No. 4, pp. 187-94, 1992

[7] http://csrc.nist.gov/encryption/aes/

[8] http://csrc.nist.gov/publications/drafts/dfips-AES.pdf

[9] F. Koeune, J.-J. Quisquater, “A timing attack against Rijndael,” Technical Report CG-1999/1, UCL Crypto Group, Louvain-la-Neuve, 1999.

[10] E. Biham, A. Shamir, “Power Analysis of the Key Scheduling of the AES Candidates,” Proceedings of the Second Advanced Encryption Standard (AES) Candidate Conference, 1999.

[11] R. Lidl, H. Niederreiter, Introduction to finite fields and their applications, Cambridge University Press, 1986

[12] P. Barreto, V. Rijmen, Rijndael ANSI C Reference Code, downloadable at http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaelref.zip

[13] http://fp.gladman.plus.com/cryptography_technology/index.htm

[14] http://www.cosy.sbg.ac.at/~gwesp/sw/rijndael-1.0.tar.gz


[15] http://www.webappcabaret.com/cass/security

[16] http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaelvb.zip

[17] http://www.cpan.org/authors/id/D/DI/DIDO/Crypt-Rijndael-0.04.tar.gz

[18] http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndael-80186.tar.gz

[19] http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaelada.zip

[20] http://www.amphion.com


Appendix A

Constant Tables Used in AES

A.1 Lookup Table “ S-box”

S-box is a 256-byte table used by the function SubBytes( ) in the Key Expansion

and the Cipher.

63 7C 77 7B F2 6B 6F C5 30 01 67 2B FE D7 AB 76

CA 82 C9 7D FA 59 47 F0 AD D4 A2 AF 9C A4 72 C0

B7 FD 93 26 36 3F F7 CC 34 A5 E5 F1 71 D8 31 15

04 C7 23 C3 18 96 05 9A 07 12 80 E2 EB 27 B2 75

09 83 2C 1A 1B 6E 5A A0 52 3B D6 B3 29 E3 2F 84

53 D1 00 ED 20 FC B1 5B 6A CB BE 39 4A 4C 58 CF

D0 EF AA FB 43 4D 33 85 45 F9 02 7F 50 3C 9F A8

51 A3 40 8F 92 9D 38 F5 BC B6 DA 21 10 FF F3 D2

CD 0C 13 EC 5F 97 44 17 C4 A7 7E 3D 64 5D 19 73

60 81 4F DC 22 2A 90 88 46 EE B8 14 DE 5E 0B DB

E0 32 3A 0A 49 06 24 5C C2 D3 AC 62 91 95 E4 79

E7 C8 37 6D 8D D5 4E A9 6C 56 F4 EA 65 7A AE 08

BA 78 25 2E 1C A6 B4 C6 E8 DD 74 1F 4B BD 8B 8A

70 3E B5 66 48 03 F6 0E 61 35 57 B9 86 C1 1D 9E

E1 F8 98 11 69 D9 8E 94 9B 1E 87 E9 CE 55 28 DF

8C A1 89 0D BF E6 42 68 41 99 2D 0F B0 54 BB 16


A.2 Lookup Table “ Inv S-box”

Inv S-box is a 256-byte table used by the function InvSubBytes( ) in the Inverse

Cipher.

52 09 6A D5 30 36 A5 38 BF 40 A3 9E 81 F3 D7 FB

7C E3 39 82 9B 2F FF 87 34 8E 43 44 C4 DE E9 CB

54 7B 94 32 A6 C2 23 3D EE 4C 95 0B 42 FA C3 4E

08 2E A1 66 28 D9 24 B2 76 5B A2 49 6D 8B D1 25

72 F8 F6 64 86 68 98 16 D4 A4 5C CC 5D 65 B6 92

6C 70 48 50 FD ED B9 DA 5E 15 46 57 A7 8D 9D 84

90 D8 AB 00 8C BC D3 0A F7 E4 58 05 B8 B3 45 06

D0 2C 1E 8F CA 3F 0F 02 C1 AF BD 03 01 13 8A 6B

3A 91 11 41 4F 67 DC EA 97 F2 CF CE F0 B4 E6 73

96 AC 74 22 E7 AD 35 85 E2 F9 37 E8 1C 75 DF 6E

47 F1 1A 71 1D 29 C5 89 6F B7 62 0E AA 18 BE 1B

FC 56 3E 4B C6 D2 79 20 9A DB C0 FE 78 CD 5A F4

1F DD A8 33 88 07 C7 31 B1 12 10 59 27 80 EC 5F

60 51 7F A9 19 B5 4A 0D 2D E5 7A 9F 93 C9 9C EF

A0 E0 3B 4D AE 2A F5 B0 C8 EB BB 3C 83 53 99 61

17 2B 04 7E BA 77 D6 26 E1 69 14 63 55 21 0C 7D


A.3 Lookup Table “ xtime”

xtime is a 256-byte table used both in the Cipher and Inverse Cipher to compute the multiplication by x in GF(2^8).

00 02 04 06 08 0A 0C 0E 10 12 14 16 18 1A 1C 1E

20 22 24 26 28 2A 2C 2E 30 32 34 36 38 3A 3C 3E

40 42 44 46 48 4A 4C 4E 50 52 54 56 58 5A 5C 5E

60 62 64 66 68 6A 6C 6E 70 72 74 76 78 7A 7C 7E

80 82 84 86 88 8A 8C 8E 90 92 94 96 98 9A 9C 9E

A0 A2 A4 A6 A8 AA AC AE B0 B2 B4 B6 B8 BA BC BE

C0 C2 C4 C6 C8 CA CC CE D0 D2 D4 D6 D8 DA DC DE

E0 E2 E4 E6 E8 EA EC EE F0 F2 F4 F6 F8 FA FC FE

1B 19 1F 1D 13 11 17 15 0B 09 0F 0D 03 01 07 05

3B 39 3F 3D 33 31 37 35 2B 29 2F 2D 23 21 27 25

5B 59 5F 5D 53 51 57 55 4B 49 4F 4D 43 41 47 45

7B 79 7F 7D 73 71 77 75 6B 69 6F 6D 63 61 67 65

9B 99 9F 9D 93 91 97 95 8B 89 8F 8D 83 81 87 85

BB B9 BF BD B3 B1 B7 B5 AB A9 AF AD A3 A1 A7 A5

DB D9 DF DD D3 D1 D7 D5 CB C9 CF CD C3 C1 C7 C5

FB F9 FF FD F3 F1 F7 F5 EB E9 EF ED E3 E1 E7 E5
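
The xtime table can be regenerated from its definition, which is a convenient way to check the listing above. The sketch assumes the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B); the variable names are illustrative.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

XTIME = [xtime(i) for i in range(256)]

# Print the table in the same 16-column layout as the listing.
for row in range(16):
    print(" ".join(f"{XTIME[16 * row + col]:02X}" for col in range(16)))
```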


A.4 Lookup Table “ Log”

Log is a 256-byte table used in the Key Expansion (only for the Inverse Cipher) to compute the multiplication in GF(2^8).

00 00 19 01 32 02 1A C6 4B C7 1B 68 33 EE DF 03

64 04 E0 0E 34 8D 81 EF 4C 71 08 C8 F8 69 1C C1

7D C2 1D B5 F9 B9 27 6A 4D E4 A6 72 9A C9 09 78

65 2F 8A 05 21 0F E1 24 12 F0 82 45 35 93 DA 8E

96 8F DB BD 36 D0 CE 94 13 5C D2 F1 40 46 83 38

66 DD FD 30 BF 06 8B 62 B3 25 E2 98 22 88 91 10

7E 6E 48 C3 A3 B6 1E 42 3A 6B 28 54 FA 85 3D BA

2B 79 0A 15 9B 9F 5E CA 4E D4 AC E5 F3 73 A7 57

AF 58 A8 50 F4 EA D6 74 4F AE E9 D5 E7 E6 AD E8

2C D7 75 7A EB 16 0B F5 59 CB 5F B0 9C A9 51 A0

7F 0C F6 6F 17 C4 49 EC D8 43 1F 2D A4 76 7B B7

CC BB 3E 5A FB 60 B1 86 3B 52 A1 6C AA 55 29 9D

97 B2 87 90 61 BE DC FC BC 95 CF CD 37 3F 5B D1

53 39 84 3C 41 A2 6D 47 14 2A 9E 5D 56 F2 D3 AB

44 11 92 D9 23 20 2E 89 B4 7C B8 26 77 99 E3 A5

67 4A ED DE C5 31 FE 18 0D 63 8C 80 C0 F7 70 07


A.4 Lookup Table “ Alog”

Alog is a 256-byte table used in the Key Expansion (only for the Inverse Cipher) to compute the multiplication in GF(2^8).

01 03 05 0F 11 33 55 FF 1A 2E 72 96 A1 F8 13 35

5F E1 38 48 D8 73 95 A4 F7 02 06 0A 1E 22 66 AA

E5 34 5C E4 37 59 EB 26 6A BE D9 70 90 AB E6 31

53 F5 04 0C 14 3C 44 CC 4F D1 68 B8 D3 6E B2 CD

4C D4 67 A9 E0 3B 4D D7 62 A6 F1 08 18 28 78 88

83 9E B9 D0 6B BD DC 7F 81 98 B3 CE 49 DB 76 9A

B5 C4 57 F9 10 30 50 F0 0B 1D 27 69 BB D6 61 A3

FE 19 2B 7D 87 92 AD EC 2F 71 93 AE E9 20 60 A0

FB 16 3A 4E D2 6D B7 C2 5D E7 32 56 FA 15 3F 41

C3 5E E2 3D 47 C9 40 C0 5B ED 2C 74 9C BF DA 75

9F BA D5 64 AC EF 2A 7E 82 9D BC DF 7A 8E 89 80

9B B6 C1 58 E8 23 65 AF EA 25 6F B1 C8 43 C5 54

FC 1F 21 63 A5 F4 07 09 1B 2D 77 99 B0 CB 46 CA

45 CF 4A DE 79 8B 86 91 A8 E3 3E 42 C6 51 F3 0E

12 36 5A EE 29 7B 8D 8C 8F 8A 85 94 A7 F2 0D 17

39 4B DD 7C 84 97 A2 FD 1C 24 6C B4 C7 52 F6 01
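
The following sketch shows how the Log and Alog tables can be generated and used for multiplication in GF(2^8). It assumes the generator 0x03, which is consistent with the first entries of the Alog listing (01 03 05 0F ...); the function names are hypothetical, and the TinyRISC code may organize the lookup differently.

```python
def xtime(b):
    """Multiply a byte by x in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def build_log_tables():
    """Return (log, alog) with alog[i] = 3^i and log[alog[i]] = i in GF(2^8)."""
    log = [0] * 256
    alog = [0] * 256
    value = 1
    for exponent in range(255):
        alog[exponent] = value
        log[value] = exponent
        value ^= xtime(value)           # multiply by the generator 0x03
    alog[255] = alog[0]                 # 3^255 = 1
    return log, alog

LOG, ALOG = build_log_tables()

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with two Log lookups and one Alog lookup."""
    if a == 0 or b == 0:
        return 0
    return ALOG[(LOG[a] + LOG[b]) % 255]

print(hex(gf_mul(0x57, 0x83)))
```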

A.5 Table “ Rcon”

Rcon is a 30-byte table used in the Key Expansion.

01 02 04 08 10 20 40 80 1b 36 6c d8 ab 4d 9a

2f 5e bc 63 c6 97 35 6a d4 b3 7d fa ef c5 91


Appendix B

MorphoSys TinyRISC ISA

B.1 Instruction Format

In the TinyRISC instruction set architecture (ISA), each instruction takes one of the two formats shown below:

Bits:    31 - 25   24     23 - 20   19 - 16   15 - 12   11 - 0
Field:   OpCode    Immb   SrcReg1   SrcReg2   DstReg    Unused

Bits:    31 - 25   24     23 - 20   19 - 16   15 - 0
Field:   OpCode    Immb   SrcReg1   DstReg    Immediate

• OpCode: the 7-bit instruction opcode.

• Immb: the immediate bit. If Immb = 0, the second operand is stored in a data

register file. If Immb = 1, the second operand is a 16-bit immediate value

extended to 32 bits.

• SrcReg1: the register id of the first operand.

• DstReg: the id of the destination register.

• SrcReg2: the register id of the second operand.

• Immediate: the 16-bit immediate value (if Immb = 1).
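
As an illustration of the two formats, the sketch below packs the fields into a 32-bit instruction word. It is not part of the MorphoSys tool chain: the function names are made up, and the field positions are read off the format description above (with the 16-bit immediate assumed to occupy bits 15 to 0).

```python
def encode_register_format(opcode, src1, src2, dst):
    """Register format: OpCode | Immb = 0 | SrcReg1 | SrcReg2 | DstReg | unused."""
    return (((opcode & 0x7F) << 25) | ((src1 & 0xF) << 20)
            | ((src2 & 0xF) << 16) | ((dst & 0xF) << 12))

def encode_immediate_format(opcode, src1, dst, imm):
    """Immediate format: OpCode | Immb = 1 | SrcReg1 | DstReg | 16-bit immediate."""
    return (((opcode & 0x7F) << 25) | (1 << 24) | ((src1 & 0xF) << 20)
            | ((dst & 0xF) << 16) | (imm & 0xFFFF))

ADD_OPCODE = 0b0000100   # shared by ADD (Immb = 0) and ADDI (Immb = 1)

print(hex(encode_register_format(ADD_OPCODE, src1=1, src2=2, dst=3)))
print(hex(encode_immediate_format(ADD_OPCODE, src1=1, dst=3, imm=0x0064)))
```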


B.2 Instruction Codes

The following subsections describe the instructions in each category: arithmetic,

logical, shift, comparison, load immediate, memory access, control transfer, and

MorphoSys instructions.

B.2.1 Arithmetic Instructions

ADD DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000100 0 sr1 sr2 dr unused

Description: This instruction adds the two unsigned values in registers sr1 and

sr2 and writes the result into register dr.

ADDI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000100 1 sr1 dr imm

Description: This instruction adds the unsigned value in register sr1 to the zero-

extended imm value and writes the result into register dr.

SUB DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0


Description: This instruction subtracts the unsigned value in register sr2 from

the unsigned value in register sr1 and writes the result into register dr.


SUBI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000101 1 sr1 dr imm

Description: This instruction subtracts the zero-extended imm value from the

unsigned value in register sr1 and writes the result into register dr.

B.2.2 Logical Instructions

AND DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000000 0 sr1 sr2 dr unused

Description: This instruction performs a bit-wise AND of the values in registers

sr1 and sr2 and writes the result into register dr.

ANDI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000000 1 sr1 dr imm

Description: This instruction performs a bit-wise AND of the value in register

sr1 and the zero-extended imm value and writes the result into register dr.

OR DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000001 0 sr1 sr2 dr unused

Description: This instruction performs a bit-wise OR of the values in registers

sr1 and sr2 and writes the result into register dr.


ORI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000001 1 sr1 dr imm

Description: This instruction performs a bit-wise OR of the value in register sr1

and the zero-extended imm value and writes the result into register dr.

XOR DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000010 0 sr1 sr2 dr unused

Description: This instruction performs a bit-wise exclusive-OR of the values in

registers sr1 and sr2 and writes the result into register dr.

XORI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000010 1 sr1 dr imm

Description: This instruction performs a bit-wise exclusive-OR of the value in

register sr1 and the zero-extended imm value and writes the result into register

dr.

XNOR DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000011 0 sr1 sr2 dr unused

Description: This instruction performs a bit-wise exclusive-NOR of the values

in registers sr1 and sr2 and writes the result into register dr.


XNORI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0000011 1 sr1 dr imm

Description: This instruction performs a bit-wise exclusive-NOR of the value

in register sr1 and the zero-extended imm value and writes the result into

register dr.

B.2.3 Shift Instructions

LSL DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010001 0 sr1 sr2 dr unused

Description: This instruction shifts to the left the contents of sr1 by the amount

indicated in sr2, inserting zeros on the right. The result is written into register

dr.

LSLI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010001 1 sr1 dr imm

Description: This instruction shifts to the left the contents of sr1 by the amount

indicated in imm, inserting zeros on the right. The result is written into register

dr.


LSR DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010010 0 sr1 sr2 dr unused

Description: This instruction shifts to the right the contents of sr1 by the

amount indicated in sr2, inserting zeros on the left. The result is written into

register dr.

LSRI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010010 1 sr1 dr imm

Description: This instruction shifts to the right the contents of sr1 by the

amount indicated in imm, inserting zeros on the left. The result is written into

register dr.

ASR DstReg, SrcReg1, SrcReg2

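31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010011 0 sr1 sr2 dr unused

Description: This instruction shifts the contents of sr1 to the right by the
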
amount indicated in sr2, replicating the most significant bit. The result is

written into register dr.

ASRI DstReg, SrcReg1, UnsImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010011 1 sr1 dr imm


Description: This instruction shifts the contents of sr1 to the right by the

amount indicated in imm, replicating the most significant bit. The result is

written into register dr.
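
The difference between the logical and arithmetic shifts can be seen in the following Python sketch of the three operations on 32-bit register values. It assumes shift amounts below 32; how the hardware treats larger amounts is not stated here, so the sketch does not model it.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left: zeros enter on the right."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right: zeros enter on the left."""
    return (value & MASK32) >> amount


def asr(value, amount):
    """Arithmetic shift right: the most significant bit is replicated."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32  # reinterpret as a negative number
    return (value >> amount) & MASK32


print("%08x" % lsr(0x80000000, 4))  # logical: zero fill
print("%08x" % asr(0x80000000, 4))  # arithmetic: sign fill
```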

B.2.4 Comparison Instructions

SLT DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001000 0 sr1 sr2 dr unused

Description: This instruction signed compares the values in registers sr1 and

sr2 and writes the value 0x00000001 into dr if [sr1] < [sr2] or the value

0x00000000 otherwise.

SLTI DstReg, SrcReg1, SignImm

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001000 1 sr1 dr imm

Description: This instruction signed compares the value in register sr1 and the

sign-extended value imm. It writes the value 0x00000001 into dr if [sr1] <

[imm] or the value 0x00000000 otherwise.

SLTU DstReg, SrcReg1, SrcReg2

31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001001 0 sr1 sr2 dr unused

Description: This instruction unsigned compares the values in registers sr1 and

sr2 and writes the value 0x00000001 into dr if [sr1] < [sr2] or the value

0x00000000 otherwise.

SLTUI DstReg, SrcReg1, UnsImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001001 1 sr1 dr imm

Description: This instruction unsigned compares the value in register sr1 and

the zero-extended value imm. It writes the value 0x00000001 into dr if [sr1] <

[imm] or the value 0x00000000 otherwise.

SGE DstReg, SrcReg1, SrcReg2

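31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001010 0 sr1 sr2 dr unused

Description: This instruction signed compares the values in registers sr1 and

sr2 and writes the value 0x00000001 into dr if [sr1] ≥ [sr2] or the value

0x00000000 otherwise.
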
SGEI DstReg, SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

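0001010 1 sr1 dr imm

Description: This instruction signed compares the value in register sr1 and the

sign-extended value imm. It writes the value 0x00000001 into dr if [sr1] ≥

[imm] or the value 0x00000000 otherwise.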

SGEU DstReg, SrcReg1, SrcReg2

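31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001011 0 sr1 sr2 dr unused

Description: This instruction unsigned compares the values in registers sr1 and

sr2 and writes the value 0x00000001 into dr if [sr1] ≥ [sr2] or the value

0x00000000 otherwise.
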
SGEUI DstReg, SrcReg1, UnsImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001011 1 sr1 dr imm

Description: This instruction unsigned compares the value in register sr1 and

the zero-extended value imm. It writes the value 0x00000001 into dr if [sr1] >

= [imm] or the value 0x00000000 otherwise.
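
The signed and unsigned variants differ only in how the 32-bit operands are interpreted and in how the 16-bit immediate is extended. The Python sketch below models SLT, SLTU, SLTI, and SLTUI under that reading; the function names mirror the mnemonics but are otherwise illustrative.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def sign_extend16(imm):
    """Extend a 16-bit immediate to 32 bits by copying bit 15."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if imm & 0x8000 else imm


def slt(a, b):
    return int(to_signed32(a) < to_signed32(b))


def sltu(a, b):
    return int((a & MASK32) < (b & MASK32))


def slti(a, imm):
    return slt(a, sign_extend16(imm))


def sltui(a, imm):
    return sltu(a, imm & 0xFFFF)  # zero-extended immediate


# 0xFFFFFFFF is -1 when signed but the largest value when unsigned.
print(slt(0xFFFFFFFF, 1), sltu(0xFFFFFFFF, 1))
```

The same immediate 0xFFFF therefore behaves as -1 in SLTI and as 65535 in SLTUI.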

SEQ DstReg, SrcReg1, SrcReg2

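31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0001100 0 sr1 sr2 dr unused

Description: This instruction compares the values in registers sr1 and

sr2 and writes the value 0x00000001 into dr if [sr1] = [sr2] or the value

0x00000000 otherwise.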

SEQI DstReg, SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

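0001100 1 sr1 dr imm

Description: This instruction compares the value in register sr1 and the

sign-extended value imm. It writes the value 0x00000001 into dr if [sr1] =

[imm] or the value 0x00000000 otherwise.
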
B.2.5 Load-Immediate Instructions

LDLI DstReg, UnsImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0011100 1 unused dr imm

Description: This instruction loads the immediate value into the lower 16 bits

of the dr register, zeroing the upper 16 bits.

LDUI DstReg, UnsImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0011101 1 unused dr imm

Description: This instruction loads the immediate value into the upper 16 bits

of the dr register, zeroing the lower 16 bits.
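
Because LDLI and LDUI each clear the other half of the register, a full 32-bit constant takes two instructions. One plausible sequence, sketched in Python below, is LDUI followed by ORI, relying on the zero-extended immediate of ORI described in B.2.2. The helper load_const32 is hypothetical and only models the register contents.

```python
MASK32 = 0xFFFFFFFF


def ldli(imm):
    """Lower 16 bits loaded, upper 16 bits zeroed."""
    return imm & 0xFFFF


def ldui(imm):
    """Upper 16 bits loaded, lower 16 bits zeroed."""
    return (imm & 0xFFFF) << 16


def ori(value, imm):
    """Bit-wise OR with a zero-extended 16-bit immediate."""
    return (value | (imm & 0xFFFF)) & MASK32


def load_const32(constant):
    """Model of: LDUI dr, hi16 ; ORI dr, dr, lo16."""
    return ori(ldui(constant >> 16), constant & 0xFFFF)


print("%08x" % load_const32(0x12345678))
```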


B.2.6 Memory Access Instructions

LDW DstReg, SrcReg1

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010100 0 sr1 unused dr unused

Description: This instruction loads into register dr the value from the memory

location whose address is in register sr1.

STW SrcReg1, SrcReg2

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0010101 0 sr1 sr2 unused unused

Description: This instruction stores the value in register sr2 into the memory

location whose address is in register sr1.

B.2.7 Control Transfer Instructions

BRT SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0011011 1 sr1 unused imm

Description: This instruction tests the value in register sr1 and jumps if it has

the value 0x00000001 with a one-instruction delay. The address of the target

instruction is calculated by adding the sign-extended imm offset to the

instruction's address.


BRF SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0011010 1 sr1 unused imm

Description: This instruction tests the value in register sr1 and jumps if it has

the value 0x00000000 with a one-instruction delay. The address of the target

instruction is calculated by adding the sign-extended imm offset to the

instruction's address.

BRLT SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0100000 1 sr1 dr imm

Description: This instruction signed compares the values in registers sr1 and dr

and jumps if [sr1] < [dr] with a one-instruction delay. The address of the target

instruction is calculated by adding the sign-extended imm offset to the

instruction's address.

BRLE SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0100001 1 sr1 dr imm

Description: This instruction signed compares the values in registers sr1 and dr

and jumps if [sr1] ≤ [dr] with a one-instruction delay. The address of the target

instruction is calculated by adding the sign-extended imm offset to the

instruction's address.

BREQ SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0100010 1 sr1 dr imm

Description: This instruction compares the values in registers sr1 and

dr and jumps if [sr1] = [dr] with a one-instruction delay. The address of the

target instruction is calculated by adding the sign-extended imm offset to the

instruction's address.

BRNE SrcReg1, SignImm

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0100011 1 sr1 dr imm

Description: This instruction compares the values in registers sr1 and

dr and jumps if [sr1] ≠ [dr] with a one-instruction delay. The address of the

target instruction is calculated by adding the sign-extended imm offset to the

instruction's address.

JAL DstReg, SrcReg1

31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0

0011000 0 sr1 unused dr unused

Description: This instruction unconditionally jumps with a one-instruction

delay to the target address in register sr1. The instruction's address plus 2 is

saved into register dr.
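
The target-address rules for the branch instructions and for JAL can be summarized in a short Python sketch. It treats addresses as plain integers in whatever unit the TinyRISC program counter uses, which is not specified here, and the function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Return the 16-bit immediate as a signed Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm):
    """Target = address of the branch instruction + sign-extended offset."""
    return (pc + sign_extend16(imm)) & MASK32


def jal(pc, sr1_value):
    """Return (target address, link value written to dr)."""
    return sr1_value & MASK32, (pc + 2) & MASK32


# A backward branch by 4 from address 0x40, and a call through a register.
print("%x" % branch_target(0x40, 0xFFFC))
print(jal(0x40, 0x100))
```

The link value skips the delay-slot instruction, which executes before control reaches the target; this is why the listings in Appendix D place a nop after every branch.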


B.2.8 MorphoSys Instruction

LDCTXT SrcReg1, r/c#, r/c, context#, #contexts to be loaded

31-26 25 24 23-20 19 18-16 15 14-11 10-8 7-0

100000 - - SrcReg1 - r/c# r/c context# --- # contexts to be loaded

• SrcReg1: The starting address of external memory where the context

configuration is stored (32-bit address).

• r/c #: Used to control the starting cell in the Context Memory (0-7 in the

horizontal direction).

• r/c: Select the column context or row context (0 - column, 1 - row).

• context #: Starting context (0-15).

• # contexts to be loaded: Specify how many contexts to be loaded through

DMA.

Description: This instruction is used to load context words into the Context

Memory. When this instruction is issued, TinyRISC provides appropriate

control signals to the DMAC. Based on these signals, the DMAC performs the

loading of configuration data from external memory to the Context Memory.

Note: While loading contexts, the DMAC increments r/c # first and then

context #.
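
The loading order described in the note can be sketched in Python as follows. The sketch assumes that r/c # wraps from 7 back to 0 when context # advances and that context # wraps after 15; both wrap rules are assumptions made for illustration, and ldctxt_order is a hypothetical helper.

```python
def ldctxt_order(rc_start, context_start, count, cells=8, contexts=16):
    """List the (context #, r/c #) destinations of successive context words."""
    order = []
    rc, ctx = rc_start, context_start
    for _ in range(count):
        order.append((ctx, rc))
        rc += 1                 # r/c # is incremented first
        if rc == cells:         # assumed wrap after the last cell
            rc = 0
            ctx = (ctx + 1) % contexts
    return order


# 128 words starting at r/c # 0, context # 0 fill one whole context block.
order = ldctxt_order(0, 0, 128)
print(order[:3], order[8], order[-1])
```

Under these assumptions, the "ldctxt $1, 0, 0, 0, 128" used in Appendix D fills all 16 contexts of the column block.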


LDFB SrcReg1, bank, set#, #words

31-26 25 24 23-20 19-11 10 9 8-0

100010 - - SrcReg1 -------- Bank Set# #words to load

• SrcReg1: The starting address of external memory where the data is

stored (32-bit address).

• Bank: Specifies which bank of Frame Buffer, 0 - bank A. 1 - bank B.

• Set #: Specifies which set of Frame Buffer (set number 0 or 1).

• # of words to load: Specifies how many 32-bit words to be loaded to the

Frame Buffer.

Description: This instruction is used to load image or application data into the

Frame Buffer for subsequent use by the RC Array. When this instruction is

issued, TinyRISC initiates operation of the DMAC to perform the transfer of

data from external memory to the Frame Buffer.

Note: When loading/storing data to/from Frame Buffer, it always starts from the

beginning of the specified bank. It is different from the mechanism of the

Context Memory, where the loading can start at any location. One bank has 64

rows, and each row has 2 words (64 bits).


STFB SrcReg1, bank, set#, #words

31-26 25 24 23-20 19-11 10 9 8-0

100011 - - SrcReg1 -------- Bank Set# #words to store

• SrcReg1: Provides the starting address in main memory where the data

should be stored.

• Bank: Specifies which bank of Frame Buffer, 0 - bank A. 1 - bank B.

• Set #: Specifies which set of Frame Buffer (set number 0 or 1).

• # of words to store: Specifies how many 32-bit words to be stored from the

Frame Buffer to the main memory.

Description: This instruction is used to transfer the processed image or

application data from the Frame Buffer back to the external memory through the

DMAC.

SBCB b_all, b_row_col, r/c, context#, bank, set#, bank_addr

31-26 20 18-16 15 14-11 10 9 8-0

110100 b_all b_row_col r/c context # bank set# bank_addr

• b_all: Specifies whether the entire RC Array (8 x 8) is activated or only one

row or column of the RC Array is activated. 1 = all of the RCs are activated. 0 =

only one row or column of the RC Array is activated.

• b_row_col: If b_all = 0, this field specifies which row or column of the

RC Array is activated.

• r/c: Specifies the context broadcast mode. 1 = row context broadcast. 0 =

column context broadcast.

• context #: Specifies which context (in Context Memory) to be executed.

• bank: Specifies which Frame Buffer bank to be accessed.

• set #: Specifies which set of the Frame Buffer.

• bank_addr: Provides the Frame Buffer address.

Description: When this instruction is issued, the TinyRISC provides an address

that enables the RC Array to access eight bytes (single-operand) data from the

Frame Buffer. The RC Array also executes concurrently on the context word

specified in the instruction.

Note: Since each bank in the Frame buffer has the capacity of 64 x 8 bytes, 6

address bits are required to specify which row and the other 3 bits specify the

starting word in that row. The important feature of the Frame Buffer is that it

always fetches the eight consecutive bytes of data even though the data may

wrap around to the next row.
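
The addressing described in the note can be sketched in Python as follows. The sketch assumes that the upper 6 bits of the 9-bit bank_addr select the row and the lower 3 bits select the starting position, and that a fetch beginning in the last row wraps to the start of the bank; the second point is an assumption, as are the helper names.

```python
ROWS, ROW_BYTES = 64, 8


def split_bank_addr(addr):
    """Split a 9-bit bank address into (row, offset within the row)."""
    return (addr >> 3) & 0x3F, addr & 0x7


def fetch8(bank, addr):
    """Return the eight consecutive bytes starting at addr.

    bank is a list of 64 rows of 8 bytes; the fetch may cross into the
    next row.
    """
    flat = [b for row in bank for b in row]
    return [flat[(addr + i) % (ROWS * ROW_BYTES)] for i in range(8)]


# Fill each cell with its own linear index (mod 256) to make wrap visible.
bank = [[(r * ROW_BYTES + c) & 0xFF for c in range(ROW_BYTES)] for r in range(ROWS)]
print(split_bank_addr(13), fetch8(bank, 13))
```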

DBCBC SrcReg1, bank_B_addr_base, b_all, r/c#, context#, set#, bank_A_addr

31-26 25 24 23-20 19-16 15-12 11-9 8-0

111100 set b_all SrcReg1 base_bankB context# r/c# bank_A_addr

• set: Specifies Frame Buffer set 0 or set 1.

• b_all: Same as SBCB.

• SrcReg1: Specifies the register of the TinyRISC that provides the lower 5

address bits for the bank B of the Frame Buffer.


• base_bankB: This field directly provides the base address for bank B of the

Frame Buffer. These 4 bits, along with the 5 bits from SrcReg1, provide the

complete Bank B address (9 bits).

• context #: Same as SBCB.

• r/c #: If b_all = 0, this specifies which column of the RC Array is activated.

• bank_A_addr: These nine bits specify the location of data to be loaded from

bank A of the Frame Buffer.

Description: This instruction refers to double bank access of Frame Buffer

with column-wise context broadcast. When this instruction is issued, the

Frame Buffer provides eight sets of two-operand data to the RC Array. Each

RC gets two bytes of data, where one byte is from bank A and the other is from

bank B.
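
The way the 9-bit bank B address is assembled from the base_bankB field and the register can be sketched in Python. The sketch assumes the 4 bits of base_bankB form the upper part of the address and the 5 register bits the lower part; the field list above does not state the bit order, so this ordering is an assumption.

```python
def bank_b_address(base_bank_b, reg_value):
    """Combine 4 base bits (assumed upper) with 5 register bits (lower)."""
    return ((base_bank_b & 0xF) << 5) | (reg_value & 0x1F)


# Only the lower 5 bits of the register contribute to the address.
print("%03x" % bank_b_address(0b1010, 0x3F))
```

Because part of the address comes from a register, the bank B address can vary inside a loop, unlike the immediate bank_A_addr field.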

DBCBR SrcReg1, bank_B_addr_base, b_all, r/c#, context#, set#, bank_A_addr

31-26 25 24 23-20 19-16 15-12 11-9 8-0

111101 set b_all SrcReg1 base_bankB context# r/c# bank_A_addr

Description: This instruction refers to double bank access of Frame Buffer

with row-wise context broadcast. All of the fields specify the same

information as those of DBCBC, except that r/c # is used to specify which row

is activated.


CBCAST b_all, b_row_col, r/c, context#

31-26 25-21 20 19 18-16 15 14-11 10-0

111000 ----- b_all - b_row_col r/c context# -----------

• b_all: 1 = all of the RCs are activated. 0 = only one row or column of the RC

Array is activated.

• b_row_col: If b_all = 0, then b_row_col specifies which row or column is

activated.

• r/c: This field specifies the context broadcast mode. 0 = column context

broadcast. 1 = row context broadcast.

• context #: Specifies which context in the Context Memory to be executed.

Description: This instruction assumes that all data needed for the computation

is already present in the RC Array; hence, no access to the Frame Buffer is

required.

WFBI r/c#, r/c, bank, set#, bank_addr

31-26 25-19 18-16 15 14-11 10 9 8-0

101000 ------- r/c# - ---- bank set# bank_addr

• r/c #: Specifies the column of the RC Array from which the data is

written back to the Frame Buffer.

• bank: Specifies the bank of the Frame Buffer to which the data is

written.



• bank_addr: This field provides the immediate row address (9 bits) for the

Frame Buffer that the data from the RC Array will be written to.

Description: This instruction performs the writing of data to the Frame Buffer.

The immediate address is obtained from the field bank_addr. The source data is

from the indicated column (specified by r/c #) of the RC Array. Eight bytes of

data are written concurrently to one row of the Frame Buffer through a 64-bit

bus.

WFB SrcReg1, r/c#, r/c, bank, set#

31-26 25-24 23-20 19 18-16 15 14-11 10 9 8-0

101001 -- SrcReg1 - r/c# - ---- bank set# ------

• SrcReg1: Specifies the register of the TinyRISC that provides the Frame

Buffer address.

• r/c #: Specifies the column of the RC Array from which the data is

written back to the Frame Buffer.

• bank: Specifies the bank of the Frame Buffer to which the data from the RC

Array is written.


Description: This instruction performs the writing of data to the Frame Buffer

with address obtained from the TinyRISC register specified in the field

SrcReg1. The source data is from the indicated column (specified by r/c #) of

the RC Array. Eight bytes of data are written concurrently to one row of the

Frame Buffer through a 64-bit bus.


RCRISC Dest, col#

31-26 25-19 18-16 15-12 11-0

100100 ------- col# Dest ------------

• col #: Specifies which RC (out of eight) in the first row of the RC Array will

write data to the TinyRISC.

• Dest: Specifies the destination TinyRISC register where the data of the

specified RC has to be stored.


Appendix C

RC Array Instruction Set

Type  Instruction  Input 1   Input 2  Output         Description

ALU   BYPASS       MUX A     -        reg_file, out  in1 → out
ALU   OR           MUX A     MUX B    reg_file, out  in1 or in2 → out
ALU   AND          MUX A     MUX B    reg_file, out  in1 and in2 → out
ALU   XOR          MUX A     MUX B    reg_file, out  in1 xor in2 → out
ALU   ADD          MUX A     MUX B    reg_file, out  in1 + in2 → out
ALU   SUB          MUX A     MUX B    reg_file, out  in1 - in2 → out
ALU   SUBB         MUX A     MUX B    reg_file, out  in2 - in1 → out
ALU   ANDCNT       MUX A     MUX B    reg_file, out  position of least significant 1 in (in1 and in2) → out
ALU   ADDSUBF      MUX A     MUX B    reg_file, out  in1 ± in2 (according to flag) → out
ALU   ABSSUB       MUX A     MUX B    reg_file, out  | in1 - in2 | + out → out
ALU   KEEP         -         -        reg_file, out  nop
ALU   ROUND        MUX A     MUX B    reg_file, out  round(out) → out
ALU   CADD         MUX A     MUX B    reg_file, out  complex: in1 + in2 → out (in1, in2: 8 bit real, 8 bit imag)
ALU   CSUB         MUX A     MUX B    reg_file, out  complex: in1 - in2 → out
ALU   RST          -         -        -              clear all registers
ALU   LDMM         mem       -        reg_file       mem(MAC_reg) → reg_file
ALU   STMM         reg_file  -        mem            reg_file → mem(MAC_reg)

MAC   CMUL         MUX A     MUX B    reg_file, out  sign complex: in1 * in2 → out
MAC   MUL          MUX A     MUX B    reg_file, out  sign: in1 * in2 → out (in1, in2: 16 bit)
MAC   MULADD       MUX A     MUX B    reg_file, out  in1 * MAC_reg + in2 → out (in1, in2: 16 bit)
MAC   MULADDO      MUX A     MUX B    reg_file, out  in1 * MAC_reg + out → out (in1, in2: 16 bit)
MAC   CMULADD      MUX A     MUX B    reg_file, out  in1 * MAC_reg + in2 → out
MAC   CMULADDO     MUX A     MUX B    reg_file, out  in1 * MAC_reg + out → out

MEM   LDIM         -         -        reg_file       immediate value → reg_file (immediate value in context[15..0])
MEM   LDMM         mem       -        reg_file       mem(MAC_reg) → reg_file
MEM   STMM         reg_file  -        mem            reg_file → mem(MAC_reg)


Appendix D

The Programs for AES Implementation in MorphoSys

D.1 Key Expansion

The Key Expansion program listed here is for encryption at Key size = 128 bits.

For decryption, the function InvMixColumn( ) is applied to every generated Round Key

except the first and last one. If the Key size is 192 bits or 256 bits, the only change is the

number of Rounds (loops).
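
For checking the output of the TinyRISC program, a host-side reference is convenient. The Python sketch below expands a 128-bit key into the 44 round-key words; it is a reference model and not a translation of the listing that follows. In particular, it generates the S-box arithmetically instead of reading the table of Appendix A, and all function names are illustrative. The test key is the 128-bit example key of FIPS-197 Appendix A.1.

```python
def xtime(b):
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a, b):
    """Multiply two elements of GF(2^8)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a = xtime(a)
        b >>= 1
    return product


def build_sbox():
    """S-box = multiplicative inverse followed by the AES affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = next(y for y in range(1, 256) if gmul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def expand_key_128(key):
    """Return the 44 round-key words (4 bytes each) for a 16-byte key."""
    sbox = build_sbox()
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]           # rotate the word by one byte
            temp = [sbox[b] for b in temp]       # S-box substitution
            temp[0] ^= rcon                      # round constant
            rcon = xtime(rcon)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return words


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
round_words = expand_key_128(key)
print(bytes(round_words[4]).hex(), bytes(round_words[43]).hex())
```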

#######################################################################

# Round Key generation for NK=4 (128 bits).

# by Ye Tang, 05/22/01

#######################################################################

main:

ldi $10, 0x0100 # start address of the key; tracking last round key

ldi $11, 0x000A # loop number is 10, i.e., key length = 128 bits

ldi $12, 0x0200 # start address of "rcon[ ]"

ldi $13, 0x010C # 0x010C is the start address of last word of the key;

# use it as another index

# load the last word of the original key.

# notice that this part is unnecessary afterwards because $5 to $8 have already stored the last word of last round key (at the end of the loop).

ldw $5, $13 # load a "tinyrisc word"; $5 = tk[0][KC-1]

addi $13, $13, 1

ldw $6, $13 # 2nd one; $6 = tk[1][KC-1]

addi $13, $13, 1

ldw $7, $13 # 3rd one; $7 = tk[2][KC-1]

addi $13, $13, 1

ldw $8, $13 # 4th one; $8 = tk[3][KC-1]

######################################################################################

Rounds:

# calculate the 1st round key word

ldw $1, $10 # load the 1st byte; $1 = tk[0][0]

addi $10, $10, 1

ldw $2, $10 # load the 2nd one; $2 = tk[1][0]

addi $10, $10, 1


ldw $3, $10 # load the 3rd one; $3 = tk[2][0]

addi $10, $10, 1

ldw $4, $10 # load the 4th one; $4 = tk[3][0]

ldw $5, $5 # $5 = Sbox($5); Assume S-box is at address 0x0000

ldw $6, $6

ldw $7, $7

ldw $8, $8

ldw $9, $12 # $9 = rcon[$12]

addi $12, $12, 1 # for the use of rcon[ ] in next loop

xor $1, $6, $1 # $1 xor $6 -> $1; tk[i][0] ^= tk[(i+1)%4][KC-1]

xor $2, $7, $2 # xor a, b, c means c = a xor b

xor $3, $8, $3

xor $4, $5, $4

xor $1, $9, $1 # tk[0][0] ^= rcon[$12]

######################################################################################

# calculate the 2nd round key word and store the 1st round key word

addi $10, $10, 1 # the 2nd word of last round key, i.e., W[i-Nk] (i=Nk+1, Nk+2, ... , 2Nk-1)

ldw $5, $10 # load the 1st byte; $5 = tk[0][j] of last round key

addi $10, $10, 1

ldw $6, $10 # load the 2nd byte; $6 = tk[1][j] of last round key

addi $10, $10, 1

ldw $7, $10 # load the 3rd byte; $7 = tk[2][j] of last round key

addi $10, $10, 1

ldw $8, $10 # load the 4th byte; $8 = tk[3][j] of last round key

xor $1, $5, $5 # $5 = $1 xor $5 ; tk[i][j] ^= tk[i][j-1]; $1 is tk[i][j-1]

xor $2, $6, $6

xor $3, $7, $7

xor $4, $8, $8

addi $13, $13, 1 # store $1 to $4 (the 1st round key word) back at 0x0110, 0x0120 (next loop) and so on.

stw $1, $13 # $13 is the address. Manual is incorrect again!!!

addi $13, $13, 1 # in mULATE, it displays as "stw r13, r1".

stw $2, $13

addi $13, $13, 1

stw $3, $13

addi $13, $13, 1

stw $4, $13


######################################################################################

# calculate the 3rd round key word and store the 2nd round key word

# don't use loop so we can switch registers ($1 to $4, or $5 to $8) used for the words. by this means we can save time.

addi $10, $10, 1 # the 3rd word of last round key (or, original key);

ldw $1, $10 # switch to $1 again

addi $10, $10, 1

ldw $2, $10

addi $10, $10, 1

ldw $3, $10

addi $10, $10, 1

ldw $4, $10

xor $1, $5, $1 # $1 = $1 xor $5; tk[i][j] ^= tk[i][j-1]; $5 is tk[i][j-1]

xor $2, $6, $2

xor $3, $7, $3

xor $4, $8, $4

addi $13, $13, 1 # store $5 to $8 (the 2nd round key word) back at 0x0114 and so on.

stw $5, $13

addi $13, $13, 1

stw $6, $13

addi $13, $13, 1

stw $7, $13

addi $13, $13, 1

stw $8, $13

######################################################################################

# calculate the 4th round key word and store the 3rd round key word

addi $10, $10, 1 # the 4th word of last round key (or, original key);

ldw $5, $10 # switch to $5 again

addi $10, $10, 1

ldw $6, $10

addi $10, $10, 1

ldw $7, $10

addi $10, $10, 1

ldw $8, $10

xor $1, $5, $5 # $5 = $1 xor $5; tk[i][j] ^= tk[i][j-1]; $1 is tk[i][j-1]

xor $2, $6, $6

xor $3, $7, $7

xor $4, $8, $8

addi $13, $13, 1 # store $1 to $4 (the 3rd round key word) back at 0x0118 and so on.

stw $1, $13

addi $13, $13, 1


stw $2, $13

addi $13, $13, 1

stw $3, $13

addi $13, $13, 1

stw $4, $13

######################################################################################

# store the 4th round key word

addi $13, $13, 1 # store $5 to $8 back at 0x011c and so on.

stw $5, $13

addi $13, $13, 1

stw $6, $13

addi $13, $13, 1

stw $7, $13

addi $13, $13, 1

stw $8, $13

######################################################################################

addi $10, $10, 1 # point to the start address of current round key (0x0110 and so on), used for next loop

subi $11, $11, 1

brlt $0, $11, Rounds

nop # this nop (delay slot) is necessary, otherwise $10 will be assigned the value 0x0058.

# end of the "Rounds" loop

######################################################################################

# concatenate the round key so that it can be used by FB. the format is changed from "00000001 00000002" to "00010002".

ldi $10, 0x0058 # concatenate 2 numbers each time. there are 16*11 numbers in total. so 88 loops are needed.

ldi $11, 0x0100

ldi $12, 0x0100

Concatenation:

ldw $1, $11

addi $11, $11, 1

ldw $2, $11

lsli $1, $1, 16 # left shift $1 16 bits; get something like "00020000"

or $2, $1, $1 # $1 = $1 or $2, so we get something like "00020003"

stw $1, $12 # store back

addi $11, $11, 1

addi $12, $12, 1 # $12 advances half as fast as $11

subi $10, $10, 1

brlt $0, $10, Concatenation

nop

.end main
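
The effect of the Concatenation loop above can be expressed compactly in Python. The sketch packs consecutive pairs of memory words into one word each, with the first word of a pair in the upper 16 bits; the function name pack_pairs is illustrative, and the sketch assumes each input value fits in 16 bits, as the round-key bytes do.

```python
def pack_pairs(words):
    """Pack words[0], words[1] into one 32-bit word, and so on.

    Mirrors: lsli $1, $1, 16 ; or $2, $1, $1 ; stw $1, $12
    """
    if len(words) % 2:
        raise ValueError("an even number of words is required")
    return [((words[i] << 16) | words[i + 1]) & 0xFFFFFFFF
            for i in range(0, len(words), 2)]


# "00000001 00000002" becomes "00010002", as in the comment above.
packed = pack_pairs([0x0001, 0x0002, 0x0002, 0x0003])
print(["%08x" % w for w in packed])
```

For the 16 * 11 = 176 round-key values this produces 88 packed words, matching the loop count 0x0058.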


D.2 Data Processing

The program listed here is for encryption at Key size = 128 bits. The programs for

decryption and other Key sizes are similar.

######################################################################################

# AES (Rijndael) - Encryption part

# - First part: lookup table loading

# - Second part: data process part

# By Ye Tang, 05/20/01

######################################################################################

# First Part: Table Loading

# first load the more frequently-used xtime table (so no offset is needed), then s-box.

main:

# Load Column Contexts;

ldi $1, 0x0000 # assume the column contexts address

ldi $10, 128

ldctxt $1, 0, 0, 0, 128

Delay1.DONE:

subi $10, $10, 4

nop

brle $0, $10, Delay1.DONE

nop

# Load Row Contexts;

ldi $2, 0x0080

ldi $10, 128

ldctxt $2, 0, 1, 0, 128

Delay2.DONE:

subi $10, $10, 4

nop

brle $0, $10, Delay2.DONE

nop

# Begin loading table data

# First 32 contexts consist of 4 "control contexts" (which are last two col/row contexts,

# i.e., #14 and #15) and 28 "data contexts". So 28 table data can be loaded.

# Then the first 15 col/row contexts will be flushed by new ones. Only the last

# col/row context are kept for control context (which are STMM and ADD R1, R1, R2).

# So from the second 32 contexts, 30 table data can be loaded each time.


# To fully load 256-cell S-box table, we need 9 context switches -- (14+15*7+9)*2

# After finishing S-box table loading, we will begin to load xtime table.

# Notice that the value of R1 is just what we need for the next address, and R2 is still 1.

# So actually we load s-box and xtime table seamlessly (back to back).

# There are 18 context switches in total -- (28+30*16+4)

# Because of the format of "cbcast", we can't use loop for it.

# If the context # is increased from 16 to 128. We definitely can't bear it.

# A format of "cbcast 1, 0, 0, $1" would be much better.

# For now we can use a "coarse loop" between context switches, but except the first/last

# switch.

# The order of table loading execution: execute col context #0, #1, ... #14 (or the special

# first/last one in the first/last switch), then execute row context #0, #1, ..., #14.

# Remember this is also the order you must comply with when establish the contexts.

# Of course, when the context memory size is expanded to 128, things will not be so painful.

# first 32 contexts; load first 28 data of s-box table

cbcast 1, 0, 0, 14 # r1=0; col context

cbcast 1, 0, 1, 14 # r2=1; row context

cbcast 1, 0, 0, 0 # load 1st data to r0

cbcast 1, 0, 0, 15 # store r0

cbcast 1, 0, 1, 15 # increase r1 by 1 (address )

cbcast 1, 0, 0, 1 # load 2nd data...

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 2 # load 3rd data...

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 3 # and so on.

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 4

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 5

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15


cbcast 1, 0, 0, 6

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 7

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 8

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 9

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 10

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 11

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 12

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 13

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 0 # Begin to execute row contexts

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 1

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 2

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15


cbcast 1, 0, 1, 3

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 4

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 5

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 6

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 7

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 8

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 9

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 10

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 11

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 12

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 13

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15


# 16 loops to load subsequent 15*2*16=480 numbers in the two tables.

ldi $11, 0x0010 # loop counter

# first reload contexts except the last col/row context, then cbcast

sbox:

# Load Column Contexts;

addi $1, $1, 0x100 # address of this part of contexts

ldi $10, 128

ldctxt $1, 0, 0, 0, 128

Delay3.DONE:

subi $10, $10, 4

nop

brle $0, $10, Delay3.DONE

nop

# Load Row Contexts;


ldi $10, 128

ldctxt $2, 0, 1, 0, 128

Delay4.DONE:

subi $10, $10, 4

nop

brle $0, $10, Delay4.DONE

nop

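cbcast 1, 0, 0, 0 # load 1st data to r0; col context

cbcast 1, 0, 0, 15 # store r0

cbcast 1, 0, 1, 15 # increase r1 by 1 (address)

cbcast 1, 0, 0, 1 # load 2nd data...

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 2 # load 3rd data...
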
cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 3 # and so on.

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 4

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15


cbcast 1, 0, 0, 5

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 6

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 7

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 8

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 9

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 10

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 11

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 12

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 13

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 0, 14

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 0 # Begin to execute row contexts

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15


cbcast 1, 0, 1, 1

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 2

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 3

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 4

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 5

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 6

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 7

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 8

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 9

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 10

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 11

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15


cbcast 1, 0, 1, 12

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 13

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

cbcast 1, 0, 1, 14

cbcast 1, 0, 0, 15

cbcast 1, 0, 1, 15

subi $11, $11, 1

nop

brlt $0, $11, sbox # use "brlt" rather than "brle"!!!

nop

# Load last 4 (28+480+4) data of xtime table.

# Load 6(*8) Column Contexts including (control contexts);


ldi $10, 48

ldctxt $1, 0, 0, 0, 48

Delay5.DONE:

subi $10, $10, 4

nop

brle $0, $10, Delay5.DONE

nop

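cbcast 1, 0, 0, 0 # load 1st data to r0

cbcast 1, 0, 0, 4

cbcast 1, 0, 0, 5

cbcast 1, 0, 0, 1 # load 2nd data

cbcast 1, 0, 0, 4

cbcast 1, 0, 0, 5

cbcast 1, 0, 0, 2 # load 3rd data
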
cbcast 1, 0, 0, 4

cbcast 1, 0, 0, 5

cbcast 1, 0, 0, 3 # and the 4th data.

cbcast 1, 0, 0, 4

cbcast 1, 0, 0, 5


######################################################################################

# Second Part: Data Process

# The first part left a few extra free space in context memory. But we will not make use of them.

# The reason is that 16 col and 15 contexts are needed here and there is no penalty if we load

# them all together. Of course, if context memory is increased, things are different.

# In that case, we may load these contexts with the remaining ones in last part.

# Load 12 Column Contexts;


ldi $10, 96

ldctxt $1, 0, 0, 0, 96

Delay6.DONE:

subi $10, $10, 4

nop

brle $0, $10, Delay6.DONE

nop

# Load 15 Row Contexts;

addi $2, $2, 0x0200 # address of this part of contexts, just for test purpose. notice $2 was not added by 100 last time.

ldi $10, 120

ldctxt $2, 0, 1, 0, 120

Delay7.DONE:

subi $10, $10, 4

nop


nop

# Load first Round Key and 4 blocks of data from external memory to FB;

ldi $3, 0x0009 # assume there are 9 intermediate rounds, i.e. key length = 128 bits

ldi $4, 0x1300 # assume Round Keys begin here

ldfb $4, 0, 0, 8 # load 8 words or lines, i.e., 16 effective bytes every time ( Bank 0, Set 0 )

ldi $10, 8

Delay8.DONE:

subi $10, $10, 4

nop


nop

ldi $5, 0x1400 # assume data begins here

ldfb $5, 1, 0, 32 # load 64 bytes (4 blocks) data ( Bank 1, Set 0 )

ldi $10, 32

Delay9.DONE:

subi $10, $10, 4

nop


nop

# Load Round Key from FB to RC;

sbcb 0, 0, 0, 0, 0, 0, 0 # Load 1st eight bytes to 1st column; 1 Byte/RC

sbcb 0, 1, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 2nd column;

sbcb 0, 2, 0, 0, 0, 0, 0 # Load 1st eight bytes to 3rd column; 2nd block

sbcb 0, 3, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 4th column;

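sbcb 0, 4, 0, 0, 0, 0, 0 # Load 1st eight bytes to 5th column; 3rd block

sbcb 0, 5, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 6th column;

sbcb 0, 6, 0, 0, 0, 0, 0 # Load 1st eight bytes to 7th column; 4th block

sbcb 0, 7, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 8th column;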

# Load data from FB to RC; 4 blocks

sbcb 0, 0, 0, 1, 1, 0, 0 # Load data; Bank 1, Set 0; Col Context #1,

sbcb 0, 1, 0, 1, 1, 0, 8

sbcb 0, 2, 0, 1, 1, 0, 16

sbcb 0, 3, 0, 1, 1, 0, 24

sbcb 0, 4, 0, 1, 1, 0, 32

sbcb 0, 5, 0, 1, 1, 0, 40

sbcb 0, 6, 0, 1, 1, 0, 48

sbcb 0, 7, 0, 1, 1, 0, 56

# Initial Round Key Addition

cbcast 1, 0, 1, 0 # Row context #0

# Intermediate Rounds begin (loop begins). Note that we have to load the Round Key into the FB

# in EVERY round and flush the previous one; otherwise we can't fix the address.

# If sbcb supported a register operand, as in "sbcb 0, 0, 0, 0, 0, 0, $1", things would be easier.

# Load Round Key and data from external memory to FB;

IntermediateRound:

addi $4, $4, 8 # next Round Key starts here.

ldfb $4, 0, 0, 8 # load 8 words or lines, i.e., 16 effective bytes every time

ldi $10, 8

Delay10.DONE:

subi $10, $10, 4

nop


nop










# ByteSub

cbcast 1, 0, 0, 2 # Col context #2



# ShiftRow-MixColumn

cbcast 1, 0, 1, 4 # Bypass r0, necessary!!! (because Col Ctx #4 is mem op and doesn't change out register.)




cbcast 1, 0, 1, 4 # Row context #4, bypass r0 again
















# Add Round Key


# loop condition

subi $3, $3, 1

brlt $0, $3, IntermediateRound

nop

# Final Round

# Load Round Key and data from external memory to FB;

addi $4, $4, 8 # next Round Key starts here.

ldfb $4, 0, 0, 8 # load 8 words or lines, i.e., 16 effective bytes every time

ldi $10, 8

Delay11.DONE:

subi $10, $10, 4

nop


nop










# ByteSub




# Mere ShiftRow, No MixColumn

cbcast 1, 0, 1, 4 # Bypass r0, necessary!



cbcast 1, 0, 0, 6 # Repeat Col context #6

cbcast 1, 0, 1, 6 # Bypass r1, Row context #6

cbcast 1, 0, 0, 5 # Repeat Col context #5


# Add Round Key


# Store data from RC to FB

nop # necessary before writing out

wfbi 0, 0, 1, 0, 0 # store column #0 to Bank 1, Set 0, addr 0

wfbi 1, 0, 1, 0, 8 # column #1

wfbi 2, 0, 1, 0, 16

wfbi 3, 0, 1, 0, 24

wfbi 4, 0, 1, 0, 32

wfbi 5, 0, 1, 0, 40

wfbi 6, 0, 1, 0, 48

wfbi 7, 0, 1, 0, 56

# Store data from FB to External Memory

ldi $6, 0x2000 # assume output data begins here

stfb $6, 1, 0, 32 # save the 64 bytes (32 words) data back to main memory, Bank 1, Set 0

.end main
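
The control flow of the data-processing part above can be summarized with a small host-side model. The following Python sketch is illustrative only and is not part of the MorphoSys tool chain: the function names `round_plan` and `column_offsets` are hypothetical, the base address 0x1300 and the stride of 8 are taken from the comments in the listing, and the round counts for 192-bit and 256-bit keys follow the AES definition (12 and 14 rounds) rather than anything stated in this listing.

```python
# Illustrative model of the round sequencing in the listing above.
# All names here are hypothetical; only the constants come from the listing.

ROUND_KEY_BASE = 0x1300   # "assume Round Keys begin here"
ROUND_KEY_STRIDE = 8      # "addi $4, $4, 8": next Round Key starts 8 address units later


def round_plan(key_bits=128):
    """Return [(step name, round key address), ...] in execution order."""
    total_rounds = key_bits // 32 + 6      # AES: 10, 12 or 14 rounds
    intermediate = total_rounds - 1        # 9 for a 128-bit key (register $3)
    address = ROUND_KEY_BASE
    plan = [("initial AddRoundKey", address)]
    for i in range(1, intermediate + 1):
        address += ROUND_KEY_STRIDE
        plan.append((f"intermediate round {i}", address))
    address += ROUND_KEY_STRIDE
    plan.append(("final round (no MixColumn)", address))
    return plan


def column_offsets(columns=8, bytes_per_column=8):
    """Frame Buffer offsets used by the sbcb/wfbi sequences: 0, 8, ..., 56."""
    return [c * bytes_per_column for c in range(columns)]


if __name__ == "__main__":
    for name, addr in round_plan(128):
        print(f"{name:28s} round key at 0x{addr:04X}")
    print("column offsets:", column_offsets())
```

For a 128-bit key the plan has 11 entries (one initial key addition, nine intermediate rounds, one final round), which matches the loop counter loaded into $3 and the three `ldfb $4` sites in the listing.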

D.3 Contexts for Data Processing

The contexts listed here are for encryption and apply to all key sizes. The

contexts for decryption are similar and are not listed here.

Column Contexts

set 0 , 0 BYPASS I I > 7 ; # Load Round Key
set 1 , 0 BYPASS I I > 7 ;
set 2 , 0 BYPASS I I > 7 ;
set 3 , 0 BYPASS I I > 7 ;
set 4 , 0 BYPASS I I > 7 ;
set 5 , 0 BYPASS I I > 7 ;
set 6 , 0 BYPASS I I > 7 ;
set 7 , 0 BYPASS I I > 7 ;

set 0 , 1 BYPASS I I > 0 ; # Load original data
set 1 , 1 BYPASS I I > 0 ;
set 2 , 1 BYPASS I I > 0 ;
set 3 , 1 BYPASS I I > 0 ;
set 4 , 1 BYPASS I I > 0 ;
set 5 , 1 BYPASS I I > 0 ;
set 6 , 1 BYPASS I I > 0 ;
set 7 , 1 BYPASS I I > 0 ;

set 0 , 2 LDIM!0x0100 def def > 1 ;
set 1 , 2 LDIM!0x0100 def def > 1 ;
set 2 , 2 LDIM!0x0100 def def > 1 ;
set 3 , 2 LDIM!0x0100 def def > 1 ;
set 4 , 2 LDIM!0x0100 def def > 1 ;
set 5 , 2 LDIM!0x0100 def def > 1 ;
set 6 , 2 LDIM!0x0100 def def > 1 ;
set 7 , 2 LDIM!0x0100 def def > 1 ;

set 0 , 3 ADD r0 r1 > 0 ;
set 1 , 3 ADD r0 r1 > 0 ;
set 2 , 3 ADD r0 r1 > 0 ;
set 3 , 3 ADD r0 r1 > 0 ;
set 4 , 3 ADD r0 r1 > 0 ;
set 5 , 3 ADD r0 r1 > 0 ;
set 6 , 3 ADD r0 r1 > 0 ;
set 7 , 3 ADD r0 r1 > 0 ;

set 0 , 4 LDMM r0 def > 0 ;
set 1 , 4 LDMM r0 def > 0 ;
set 2 , 4 LDMM r0 def > 0 ;
set 3 , 4 LDMM r0 def > 0 ;
set 4 , 4 LDMM r0 def > 0 ;
set 5 , 4 LDMM r0 def > 0 ;
set 6 , 4 LDMM r0 def > 0 ;
set 7 , 4 LDMM r0 def > 0 ;

set 0 , 5 BYPASS L def > 3 ;
set 1 , 5 BYPASS L def > 3 ;
set 2 , 5 BYPASS R def > 3 ; # also Final Step 4
set 3 , 5 BYPASS R def > 3 ;
set 4 , 5 BYPASS L def > 3 ;
set 5 , 5 BYPASS L def > 3 ;
set 6 , 5 BYPASS R def > 3 ;
set 7 , 5 BYPASS R def > 3 ;

set 0 , 6 BYPASS L def > 2 ;
set 1 , 6 BYPASS L def > 2 ; # also Final Step 3
set 2 , 6 BYPASS R def > 2 ;
set 3 , 6 BYPASS R def > 2 ;
set 4 , 6 BYPASS L def > 2 ;
set 5 , 6 BYPASS L def > 2 ;
set 6 , 6 BYPASS R def > 2 ;
set 7 , 6 BYPASS R def > 2 ;

set 0 , 7 XOR r0 r4 > 4 ;
set 1 , 7 XOR r0 r4 > 4 ;
set 2 , 7 XOR r0 r4 > 4 ;
set 3 , 7 XOR r0 r4 > 4 ;
set 4 , 7 XOR r0 r4 > 4 ;
set 5 , 7 XOR r0 r4 > 4 ;
set 6 , 7 XOR r0 r4 > 4 ;
set 7 , 7 XOR r0 r4 > 4 ;

set 0 , 8 XOR r2 r4 > 4 ;
set 1 , 8 XOR r2 r4 > 4 ;
set 2 , 8 XOR r2 r4 > 4 ;
set 3 , 8 XOR r2 r4 > 4 ;
set 4 , 8 XOR r2 r4 > 4 ;
set 5 , 8 XOR r2 r4 > 4 ;
set 6 , 8 XOR r2 r4 > 4 ;
set 7 , 8 XOR r2 r4 > 4 ;

set 0 , 9 XOR r3 r4 > 4 ;
set 1 , 9 XOR r3 r4 > 4 ;
set 2 , 9 XOR r3 r4 > 4 ;
set 3 , 9 XOR r3 r4 > 4 ;
set 4 , 9 XOR r3 r4 > 4 ;
set 5 , 9 XOR r3 r4 > 4 ;
set 6 , 9 XOR r3 r4 > 4 ;
set 7 , 9 XOR r3 r4 > 4 ;

set 0 , 10 LDMM r5 def > 5 ;
set 1 , 10 LDMM r5 def > 5 ;
set 2 , 10 LDMM r5 def > 5 ;
set 3 , 10 LDMM r5 def > 5 ;
set 4 , 10 LDMM r5 def > 5 ;
set 5 , 10 LDMM r5 def > 5 ;
set 6 , 10 LDMM r5 def > 5 ;
set 7 , 10 LDMM r5 def > 5 ;

set 0 , 11 XOR r0 r4 > 0 ;
set 1 , 11 XOR r0 r4 > 0 ; # r0 <-- r0 ^ r4
set 2 , 11 XOR r0 r4 > 0 ;
set 3 , 11 XOR r0 r4 > 0 ;
set 4 , 11 XOR r0 r4 > 0 ;
set 5 , 11 XOR r0 r4 > 0 ;
set 6 , 11 XOR r0 r4 > 0 ;
set 7 , 11 XOR r0 r4 > 0 ;

Row Contexts

set 8 , 0 XOR r0 r7 > 0 ; # AddRoundKey, RoundKey is saved in r7
set 9 , 0 XOR r0 r7 > 0 ;
set 10 , 0 XOR r0 r7 > 0 ;
set 11 , 0 XOR r0 r7 > 0 ;
set 12 , 0 XOR r0 r7 > 0 ;
set 13 , 0 XOR r0 r7 > 0 ;
set 14 , 0 XOR r0 r7 > 0 ;
set 15 , 0 XOR r0 r7 > 0 ;

set 8 , 1 BYPASS r0 def > 0 WE ; # ShiftRow-MixColumn Step 1
set 9 , 1 BYPASS r0 def > 0 ;
set 10 , 1 BYPASS r0 def > 0 ;
set 11 , 1 BYPASS VE def > 1 ;
set 12 , 1 BYPASS r0 def > 0 WE ;
set 13 , 1 BYPASS VE def > 1 ;
set 14 , 1 BYPASS r0 def > 0 ;
set 15 , 1 BYPASS r0 def > 0 ;

set 8 , 2 BYPASS r0 def > 0 ; # ShiftRow-MixColumn Step 2
set 9 , 2 BYPASS VE def > 1 ;
set 10 , 2 BYPASS r0 def > 0 WE ;
set 11 , 2 BYPASS r0 def > 0 ;
set 12 , 2 BYPASS r0 def > 0 ;
set 13 , 2 BYPASS r0 def > 0 ;
set 14 , 2 BYPASS r0 def > 0 WE ;
set 15 , 2 BYPASS VE def > 1 ;

set 8 , 3 BYPASS VE def > 2 ; # ShiftRow-MixColumn Step 7
set 9 , 3 BYPASS VE def > 2 WE ; # Because output register equals to r2 now,
set 10 , 3 BYPASS VE def > 2 ; # execute step 7 before step 5 and 6
set 11 , 3 BYPASS VE def > 2 ;
set 12 , 3 BYPASS VE def > 2 ;
set 13 , 3 BYPASS VE def > 2 ;
set 14 , 3 BYPASS VE def > 2 ;
set 15 , 3 BYPASS VE def > 2 WE ;

set 8 , 4 BYPASS r0 def > 0 ; # ShiftRow-MixColumn Step 5
set 9 , 4 BYPASS r0 def > 0 ;
set 10 , 4 BYPASS r0 def > 0 ;
set 11 , 4 BYPASS r0 def > 0 ;
set 12 , 4 BYPASS r0 def > 0 ;
set 13 , 4 BYPASS r0 def > 0 ;
set 14 , 4 BYPASS r0 def > 0 ;
set 15 , 4 BYPASS r0 def > 0 ;

set 8 , 5 BYPASS VE def > 0 ;
set 9 , 5 BYPASS VE def > 0 ;
set 10 , 5 BYPASS VE def > 0 ;
set 11 , 5 BYPASS VE def > 0 WE ;
set 12 , 5 BYPASS VE def > 0 ;
set 13 , 5 BYPASS VE def > 0 WE ;
set 14 , 5 BYPASS VE def > 0 ;
set 15 , 5 BYPASS VE def > 0 ;

set 8 , 6 BYPASS r1 def > 1 ; # ShiftRow-MixColumn Step 6
set 9 , 6 BYPASS r1 def > 1 ;
set 10 , 6 BYPASS r1 def > 1 ;
set 11 , 6 BYPASS r1 def > 1 ;
set 12 , 6 BYPASS r1 def > 1 ;
set 13 , 6 BYPASS r1 def > 1 ;
set 14 , 6 BYPASS r1 def > 1 ;
set 15 , 6 BYPASS r1 def > 1 ;

set 8 , 7 BYPASS VE def > 1 ;
set 9 , 7 BYPASS VE def > 1 ;
set 10 , 7 BYPASS VE def > 1 ;
set 11 , 7 BYPASS VE def > 1 WE ;
set 12 , 7 BYPASS VE def > 1 ;
set 13 , 7 BYPASS VE def > 1 WE ;
set 14 , 7 BYPASS VE def > 1 ;
set 15 , 7 BYPASS VE def > 1 ;

set 8 , 8 BYPASS r3 def > 3 ; # ShiftRow-MixColumn Step 8
set 9 , 8 BYPASS r3 def > 3 ;
set 10 , 8 BYPASS r3 def > 3 ;
set 11 , 8 BYPASS r3 def > 3 ;
set 12 , 8 BYPASS r3 def > 3 ;
set 13 , 8 BYPASS r3 def > 3 ;
set 14 , 8 BYPASS r3 def > 3 ;
set 15 , 8 BYPASS r3 def > 3 ;

set 8 , 9 BYPASS VE def > 3 ;
set 9 , 9 BYPASS VE def > 3 WE ;
set 10 , 9 BYPASS VE def > 3 ;
set 11 , 9 BYPASS VE def > 3 ;
set 12 , 9 BYPASS VE def > 3 ;
set 13 , 9 BYPASS VE def > 3 ;
set 14 , 9 BYPASS VE def > 3 ;
set 15 , 9 BYPASS VE def > 3 WE ;

set 8 , 10 XOR r0 r1 > 5 ; # MixColumn - Flowing Step (1) b
set 9 , 10 XOR r0 r3 > 5 ; # tm (r5) <-- r0 ^ r1 (or other registers)
set 10 , 10 XOR r2 r3 > 5 ;
set 11 , 10 XOR r1 r2 > 5 ;
set 12 , 10 XOR r1 r2 > 5 ;
set 13 , 10 XOR r2 r3 > 5 ;
set 14 , 10 XOR r0 r3 > 5 ;
set 15 , 10 XOR r0 r1 > 5 ;

set 8 , 11 XOR r1 r5 > 0 ; # MixColumn - Flowing Step (2) b
set 9 , 11 XOR r0 r5 > 0 ; # r0 <-- r0 ^ tm (or other registers)
set 10 , 11 XOR r3 r5 > 0 ;
set 11 , 11 XOR r2 r5 > 0 ;
set 12 , 11 XOR r1 r5 > 0 ;
set 13 , 11 XOR r2 r5 > 0 ;
set 14 , 11 XOR r3 r5 > 0 ;
set 15 , 11 XOR r0 r5 > 0 ;

set 8 , 12 BYPASS r0 def > 0 ; # Final Round Step 1
set 9 , 12 BYPASS VE def > 1 WE ; # original data is in r0
set 10 , 12 BYPASS r0 def > 0 ;
set 11 , 12 BYPASS r0 def > 0 ;
set 12 , 12 BYPASS r0 def > 0 ;
set 13 , 12 BYPASS VE def > 1 WE ;
set 14 , 12 BYPASS r0 def > 0 ;
set 15 , 12 BYPASS r0 def > 0 ;

set 8 , 13 BYPASS r0 def > 0 ; # Final Round Step 2
set 9 , 13 BYPASS r0 def > 0 ;
set 10 , 13 BYPASS r0 def > 0 ;
set 11 , 13 BYPASS VE def > 1 WE ;
set 12 , 13 BYPASS r0 def > 0 ;
set 13 , 13 BYPASS r0 def > 0 ;
set 14 , 13 BYPASS r0 def > 0 ;
set 15 , 13 BYPASS VE def > 1 WE ;

set 8 , 14 BYPASS r0 def > 0 ; # Final Round Step 5
set 9 , 14 BYPASS r1 def > 0 ;
set 10 , 14 BYPASS r2 def > 0 ;
set 11 , 14 BYPASS r3 def > 0 ;
set 12 , 14 BYPASS r0 def > 0 ;
set 13 , 14 BYPASS r3 def > 0 ;
set 14 , 14 BYPASS r2 def > 0 ;
set 15 , 14 BYPASS r1 def > 0 ;
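
When checking listings such as the ones above, it can help to read the context entries into a structured form. The following Python sketch is one possible way to do so and is not part of the MorphoSys tools. The field names (`index`, `context`, `opcode`, `immediate`, `src_a`, `src_b`, `dest`, `write_flag`) are labels chosen here for illustration; in particular, the reading of `WE` as a flag and of the value after `!` as an immediate operand is an assumption based only on the appearance of the entries.

```python
import re

# One entry looks like: "set 8 , 1 BYPASS r0 def > 0 WE ;"
# The pattern below is inferred from the listing, not from a format specification.
CONTEXT_RE = re.compile(
    r"set\s+(\d+)\s*,\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*>\s*(\d+)\s*(WE)?\s*;"
)


def parse_contexts(text):
    """Return one dict per 'set ... ;' entry found in text (comments are skipped)."""
    entries = []
    for match in CONTEXT_RE.finditer(text):
        index, context, op, src_a, src_b, dest, flag = match.groups()
        opcode, _, immediate = op.partition("!")   # e.g. "LDIM!0x0100"
        entries.append({
            "index": int(index),        # 0-7 in column contexts, 8-15 in row contexts
            "context": int(context),    # context number selected by cbcast
            "opcode": opcode,
            "immediate": int(immediate, 16) if immediate else None,
            "src_a": src_a,
            "src_b": src_b,
            "dest": int(dest),          # number following '>'
            "write_flag": flag == "WE",
        })
    return entries


if __name__ == "__main__":
    sample = (
        "set 0 , 2 LDIM!0x0100 def def > 1 ; "
        "set 8 , 1 BYPASS r0 def > 0 WE ; # ShiftRow-MixColumn Step 1 "
        "set 9 , 1 BYPASS r0 def > 0 ;"
    )
    for entry in parse_contexts(sample):
        print(entry)
```

Grouping the parsed entries by `context` makes it easy to confirm, for example, that every context defines exactly eight entries.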
