UNIVERSITY OF CALIFORNIA, IRVINE
The Advanced Encryption Standard Mapping into MorphoSys Architecture
THESIS
submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE
in Electrical and Computer Engineering
by
Ye Tang
Thesis Committee:
Professor Nader Bagherzadeh, Chair
Professor Fadi J. Kurdahi
Professor Stephen F. Jenks
2001
© 2001 Ye Tang
The thesis of Ye Tang is approved:
_____________________________
_____________________________
_____________________________ Committee Chair
University of California, Irvine
2001
DEDICATION
To
my dear wife Yang Zhao,
mother Xiuyun Zhou,
father Jiyin Tang,
sister Jun Tang,
for their love, support, understanding, and patience
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
ABSTRACT OF THE THESIS
CHAPTER 1 MorphoSys Architecture Introduction
1.1 Reconfigurable Computing Systems
1.2 MorphoSys Architecture
1.2.1 Reconfigurable Cell (RC)
1.2.2 RC Array
1.2.3 Frame Buffer and DMA Controller
1.2.4 Context Memory
1.2.5 TinyRISC
1.3 Modifications to MorphoSys
1.3.1 Size Expansion of Register File and Context Memory
1.3.2 Embedded Lookup Table in Every RC
1.3.3 New RC Array Instructions
CHAPTER 2 The Advanced Encryption Standard (AES)
2.1 Introduction of the AES
2.1.1 History of the AES Development
2.1.2 Overview of Rijndael
2.1.3 Definition of Terms, Parameters and Functions
2.2 Mathematical Background of Rijndael
2.2.1 Polynomial Representation of A Finite Field Element
2.2.2 Addition in GF(2^8)
2.2.3 Multiplication in GF(2^8)
2.2.4 Multiplication by x
2.2.5 Polynomials with Coefficients in GF(2^8)
2.3 Rijndael Specification
2.3.1 The Cipher
2.3.1.1 SubBytes( ) Function
2.3.1.2 ShiftRows( ) Function
2.3.1.3 MixColumns( ) Function
2.3.1.4 AddRoundKey( ) Function
2.3.1.5 Key Expansion
2.3.2 The Inverse Cipher
CHAPTER 3 Mapping AES into MorphoSys
3.1 Parallel Computing Exploration
3.1.1 Multi-block Processing
3.1.2 Parallel Table-lookup
3.1.3 Dedicated Data Movement for Rijndael
3.2 Algorithm Flowchart and Illustration
3.2.1 Key Expansion by TinyRISC
3.2.2 Table Loading
3.2.3 Data and Round Key Loading
3.2.4 Data Processing in RC Array
3.2.5 Data Storing
3.3 Simulation Environment
3.4 Performance Analysis
3.5 Conclusions
BIBLIOGRAPHY
APPENDIX A Constant Tables Used in AES
A.1 Lookup Table “S-box”
A.2 Lookup Table “Inv S-box”
A.3 Lookup Table “xtime”
A.4 Lookup Table “Log”
A.4 Lookup Table “Alog”
A.5 Table “Rcon”
APPENDIX B MorphoSys TinyRISC ISA
B.1 Instruction Format
B.2 Instruction Codes
B.2.1 Arithmetic Instructions
B.2.2 Logical Instructions
B.2.3 Shift Instructions
B.2.4 Comparison Instructions
B.2.5 Load-Immediate Instructions
B.2.6 Memory Access Instructions
B.2.7 Control Transfer Instructions
B.2.8 MorphoSys Instruction
APPENDIX C RC Array Instruction Set
APPENDIX D The Programs for AES Implementation in MorphoSys
D.1 Key Expansion
D.2 Data Processing
D.3 Contexts for Data Processing
LIST OF FIGURES
Figure 1.1: MorphoSys integrated architectural model
Figure 1.2: RC Architecture
Figure 1.3: 8 x 8 RC Array
Figure 1.4: Level 1 & 2 of RC Array Interconnection Network
Figure 1.5: Level 3 of interconnection network
Figure 1.6: L, M, R, T, C, B Port of MUXA
Figure 1.7: L, U, D Port of MUXB
Figure 1.8: Frame Buffer Block Diagram
Figure 1.9: Structure of Context Memory
Figure 1.10: TinyRISC block diagram
Figure 2.1: Pseudo Code for the Cipher of Rijndael Algorithm
Figure 2.2: Transformation of ShiftRows( )
Figure 2.3: Doing MixColumns( ) by xtime Approach
Figure 2.4: Pseudo Code for Key Expansion
Figure 2.5: Key Expansion and Round Key Partition for Nk = 6
Figure 2.6: Basic Pseudo Code for the Cipher of Rijndael Algorithm
Figure 2.7: Transformation of InvShiftRows( )
Figure 3.1: Intuitive Partitioning of RC Array
Figure 3.2: Actual Partitioning of RC Array
Figure 3.3: Transformation of ShiftRows( ) in 4x4 Matrix
Figure 3.4: Transformation of ShiftRows( ) in 8x2 Matrix
Figure 3.5: Data Movement for ShiftRows( )
Figure 3.6: Flowchart of Rijndael Implementation in MorphoSys
Figure 3.7: Concatenations of Round Keys
Figure 3.8: ShiftRows( ) Step 1
Figure 3.9: ShiftRows( ) Step 2
Figure 3.10: ShiftRows( ) Step 3, 4
Figure 3.11: ShiftRows( ) Step 5
Figure 3.12: ShiftRows( ) Step 6, 7, 8
Figure 3.13: InvShiftRows( ) Step 1, 2, 3, 4
Figure 3.14: InvShiftRows( ) Step 5, 6, 7, 8
Figure 3.15: Software Tools for MorphoSys
Figure 3.16: Throughputs of Different Implementations
LIST OF TABLES
Table 1.1: RC Functions
Table 1.2: MorphoSys Instructions
Table 2.1: Terms and Acronyms Used in AES
Table 2.2: Parameter and Functions Used in AES
Table 2.3: Key-Block-Round Combinations
Table 3.1: # of Cycles for Key Expansion in Several Implementations
Table 3.2: # of Cycles for AES Initialization in MorphoSys Implementation
Table 3.3: # of Cycles and Throughputs per Block in Other Implementations
Table 3.4: # of Cycles and Throughputs per Block in MorphoSys Implementation
Table 3.5: AES by Amphion ASIC Cores using TSMC 0.18µm Technology
Table 3.6: AES by Amphion Programmable Logic Cores using Altera APEX20KE-1
Table 3.7: AES by Amphion Programmable Logic Cores using Xilinx VirtexE-8
ACKNOWLEDGEMENTS
I would like to thank my advisors, Professors Fadi J. Kurdahi and Nader
Bagherzadeh, for their guidance and support in my graduate studies and research towards
the M.S. degree. I also thank my thesis committee member, Professor Stephen F. Jenks.
This thesis would have been impossible without their work.
I would also like to thank my group members in the VLSI Design Automation
Laboratory, Afshin Niktash, Chengzhi Pan, and Hooman T. Parizi, and former students in
the same group, Guangming Lu, Hartej Singh, Ming-Hau Lee. Their contributions on the
MorphoSys project are very important to my work.
Special thanks go to Broadcom Corporation and Conexant Systems Inc.,
which provided me with a one-year fellowship for my graduate studies at UCI, and to the
Defense Advanced Research Projects Agency (DARPA), which supports the MorphoSys
project.
ABSTRACT OF THE THESIS
The Advanced Encryption Standard Mapping into MorphoSys Architecture
By
Ye Tang
Master of Science in Electrical and Computer Engineering
University of California, Irvine, 2001
Professor Nader Bagherzadeh, Chair
The Advanced Encryption Standard (AES) specifies a cryptographic algorithm
that can be used to protect electronic data. The algorithm, called Rijndael, is a
high-performance symmetric block cipher with a high level of security. AES is expected to
be used by the U.S. Government and, on a voluntary basis, by the private sector.
It is hoped that AES will gradually replace the current encryption standard, the Data
Encryption Standard (DES).
MorphoSys is a SIMD-based reconfigurable parallel computing system. It
includes a general-purpose RISC processor for the sequential and control parts of an
algorithm, and 64 reconfigurable computing components for the parallel part of the
algorithm. The intrinsic data parallelism of the AES algorithm, together with the efficient
data communication and powerful computation in MorphoSys, makes MorphoSys well
suited to AES implementation.
The MorphoSys implementation achieves a throughput of more than 100 Mb/s,
which is adequate for applications on mobile phones and PDAs. It is one or
two orders of magnitude faster than software implementations in assembly language,
C/C++, and Java. To date, one of the fastest hardware implementations on an ASIC
is only 240% to 270% faster than the MorphoSys implementation, and one of the fastest
on an FPGA is only 30% to 60% faster. Besides high speed, another advantage of
implementing AES on MorphoSys is that the same architecture can also run many other
applications efficiently. This feature is critical when AES is only
part of the whole application.
Chapter 1
MorphoSys Architecture Introduction
MorphoSys is a reconfigurable computing system developed to investigate the
effectiveness of combining reconfigurable hardware with a general-purpose processor for
word-level, computation-intensive applications. It consists of a RISC processor,
embedded memory and high-speed memory interface, and an array of reconfigurable
computing cells. The dynamic reconfigurability, considerable depth of programmability,
and the large number of computing cells, make MorphoSys suitable for data-parallel and
high-throughput applications [1][2].
In this chapter, the features and advantages of reconfigurable systems, the
MorphoSys architecture and instructions, and modifications to the first generation
MorphoSys architecture are introduced.
1.1 Reconfigurable Computing Systems
General-purpose processors and Application-Specific Integrated Circuits (ASICs)
are two very different types of hardware. The former, such as the Intel Pentium,
Motorola PowerPC, and Sun SPARC, can run a great diversity of
applications, such as an operating system, a word processor, or a
scientific calculation. As a consequence of this generality, their performance may be
inferior to that achieved by a system whose architecture is better suited to the
application. The latter, on the
other hand, implement exactly the functionality needed by a particular application. The
architecture of an ASIC exploits the intrinsic characteristics of an application’s algorithm
that lead to high performance. However, the direct architecture-algorithm mapping
restricts the range of applicability and reusability.
To combine the flexibility of general-purpose processors with the high
speed of ASICs, the concept of a reconfigurable computing system has been proposed. A
reconfigurable computing system is a hybrid of a general-purpose
processor and an ASIC. Ideally, a reconfigurable system delivers the high performance
typical of ASICs while retaining the flexibility of general-purpose processors (i.e., it
can execute a wide range of applications).
Conventionally, field programmable gate arrays (FPGAs) are the most common
devices used for implementing reconfigurable components, because FPGAs allow
designers to manipulate gate-level devices such as flip-flops, memory, and other logic
gates. However, FPGAs have certain disadvantages, such as low logic density and
inefficient performance for word-level datapath operations [3]. Hence, many researchers
have proposed prototypes of coarse-grain reconfigurable systems that employ non-FPGA
reconfigurable components. MorphoSys is one of them.
1.2 MorphoSys Architecture
MorphoSys M1 (M1 is the first version of its physical implementation) consists of
five main components: the Reconfigurable Cell Array (RC Array), the RISC control
processor (TinyRISC), the Context Memory, the Frame Buffer and the DMA Controller.
Figure 1.1 shows the organization of the integrated MorphoSys reconfigurable computing
system.
[Figure: block diagram of the M1 chip. Labels in the diagram: TinyRISC Core Processor,
Inst. Cache, Context Memory (2 x 8 x 16 x 32 bits), RC Array (8 x 8 RCs),
Frame Buffer (2 x 128 x 64 bits), DMA Controller, Mem Controller, Main Memory,
TinyRISC Instruction, TinyRISC Data, Context Segment, Data Segment.]
Figure 1.1: MorphoSys integrated architectural model
The RC Array contains 64 reconfigurable computing elements. The Context
Memory is the local memory that stores the configuration contexts, or instructions, for
the RC Array. Together, the RC Array and Context Memory form the reconfigurable processor
array (a SIMD co-processor), which is responsible for the parallel computation of the
application. The main processor is TinyRISC, a general-purpose 32-bit RISC processor.
TinyRISC is responsible for the sequential tasks and control functions of the application.
The high-bandwidth memory interface is implemented through the Frame Buffer and the DMA
controller. The data to be processed is transferred from external memory to the Frame
Buffer, then from the Frame Buffer to the RC Array; result data travels in the reverse order.
In the following sections, all the components of MorphoSys architecture are
described in detail. For more information related to MorphoSys architecture, please refer
to [4][5].
1.2.1 Reconfigurable Cell (RC)
RC is the basic element of RC Array. Each RC incorporates an ALU-multiplier, a
shift unit, input multiplexors and a register file, as shown in Figure 1.2. The multiplier is
included since many target applications require integer multiplication. In addition, there
is a context register that is used to store the current context and provide
control/configuration signals to the RC components (namely the ALU-multiplier, shift
unit and the input muxes).
[Figure: block diagram of one RC. Labeled blocks: MUXA, MUXB, ALU+MULT, SHIFT,
output register (REG), FLAG, register file (R0 to R3), and the 32-bit context register
loaded from Context Memory. Context register field names shown in the diagram:
ALU_OP, MUXA, MUXB, Constant, REG_FILE, ALU_SFT, RS_LS, Write_RF_En, Write_EXPR.]
Figure 1.2: RC Architecture
The data to the multiplier/ALU is provided through two 16-bit input muxes.
MUXA selects an input from: (1) the outputs of other RCs (L, M, R, T, C, B ports) in the
same row/column within the same RC Array quadrant; (2) the nearest neighbors in the
adjacent quadrant (XQ port); (3) the data from Frame Buffer (I port); (4) the internal
register file (R0 through R3 port); or (5) the vertical/horizontal express lane (VE, HE
port). MUXB selects one input from: (1) three nearest neighbors (L, U, D port); (2) the
data from Frame Buffer (I port); or (3) the register file (R0 through R3 port). Please refer
to Section 1.2.2 for details of these connection ports.
The 32-bit context register stores current configuration for each RC. For example,
the field ALU_OP specifies the ALU function, and the field MUXA/MUXB indicates the
input from MUXA/MUXB.
Table 1.1 shows all the RC functions implemented in M1. The special functions
such as absolute value, count one's, and rounding are implemented as separate units from
the ALU to simplify the logic complexity of the ALU and improve the overall
performance.
Table 1.1: RC Functions
Instruction                                           Description
A OR B, A AND B, A XOR B, A OR C, A AND C, A XOR C    Two-operand logic functions
A + B, A − B, B − A, A + C, A − C                     Two-operand arithmetic functions
A * C                                                 Multiplication with constant
A*C + B, A*C + Out(t), A*C − Out(t)                   Multiply-accumulate functions
| A − B | + Out(t)                                    Absolute difference accumulate
A AND B : Count One's                                 AND with count # of one's in result
A+B if A>0, A−B if A<0                                Conditional add/subtract based on sign bit of A
Rounding, RESET, BYPASS A, LOAD Constant, No-op       Miscellaneous functions
A: MUXA operand, B: MUXB operand, C: constant
Out(t) = previous output, Out(t+1) = new output
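A few of the functions in Table 1.1 can be illustrated with a small behavioral model. The Python sketch below is not the M1 hardware description: the function name rc_execute and the operation labels are hypothetical, bit widths and overflow are not modeled, and the treatment of A = 0 in the conditional add/subtract is an assumption, since the table only covers A>0 and A<0.

```python
def rc_execute(op, a=0, b=0, c=0, out_prev=0):
    """Return Out(t+1) for a few of the RC functions in Table 1.1.

    a: MUXA operand, b: MUXB operand, c: constant from the context word,
    out_prev: previous output Out(t). The op strings are hypothetical
    labels for this sketch, not M1 opcodes; bit widths are not modeled.
    """
    if op == "XOR_AB":
        return a ^ b
    if op == "ADD_AB":
        return a + b
    if op == "MUL_AC":
        return a * c
    if op == "MAC_OUT":          # A*C + Out(t)
        return a * c + out_prev
    if op == "ABS_DIFF_ACC":     # |A - B| + Out(t)
        return abs(a - b) + out_prev
    if op == "AND_COUNT_ONES":   # number of 1 bits in (A AND B), non-negative inputs
        return bin(a & b).count("1")
    if op == "COND_ADD_SUB":     # A+B if A>0, A-B if A<0
        # Assumption: A == 0 is treated as non-negative (sign bit clear).
        return a + b if a >= 0 else a - b
    raise ValueError(f"unknown RC function: {op}")


# A sum of absolute differences accumulated over three operand pairs.
acc = 0
for a, b in [(5, 9), (7, 2), (4, 4)]:
    acc = rc_execute("ABS_DIFF_ACC", a=a, b=b, out_prev=acc)
print(acc)
```

The accumulate loop shows why Out(t) appears as an operand: a running result stays inside the cell between contexts instead of being written back each cycle.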
1.2.2 RC Array
The whole reconfigurable component is an array of RCs, or RC Array.
Considering that target applications (video compression, etc.) tend to be processed in
clusters of 8 x 8 data elements, the RC Array has 64 cells in a two-dimensional matrix, as
illustrated in Figure 1.3. This configuration is chosen to maximally utilize the parallelism
inherent in an application, which in turn enhances throughput.
[Figure: 8 x 8 grid of RCs.]
Figure 1.3: 8 x 8 RC Array
The RC Array follows the SIMD model of computation. All RCs in the same
row/column share the same configuration data (context). However, each RC operates on
different data. Sharing the context across a row/column is useful for data-parallel
applications.
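The row/column context sharing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: broadcast_contexts is a hypothetical helper, and plain callables stand in for context words.

```python
def broadcast_contexts(data, contexts, mode):
    """Apply one context per row (mode="row") or per column (mode="column").

    data: 8 x 8 list of lists, one value per RC.
    contexts: eight callables standing in for context words.
    Every RC in a row (or column) runs the same context on its own data.
    """
    if mode not in ("row", "column"):
        raise ValueError("mode must be 'row' or 'column'")
    size = len(data)
    result = []
    for r in range(size):
        row_out = []
        for c in range(size):
            context = contexts[r] if mode == "row" else contexts[c]
            row_out.append(context(data[r][c]))
        result.append(row_out)
    return result


data = [[8 * r + c for c in range(8)] for r in range(8)]
contexts = [lambda v, k=k: v ^ k for k in range(8)]  # context k XORs with k
by_row = broadcast_contexts(data, contexts, "row")
by_column = broadcast_contexts(data, contexts, "column")
```

Only eight context words are needed to drive all 64 cells, which is the saving that makes the shared-context scheme attractive for data-parallel kernels.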
The RC Array has an extensive interconnection network, designed to enable fast
data exchange between the RCs. This results in enhanced performance for application
kernels that involve a lot of data movement, such as the discrete cosine transform (DCT)
used in video compression, and the AES algorithm described in this thesis.
There are three levels of RC Array interconnection network. The first level of the
RC Array interconnection network is the nearest neighbor layer that connects the RCs in
a 2-D mesh (see Figure 1.4). The second layer of connectivity is at the quadrant level (a
quadrant is a 4x4 RC group, see Figure 1.4), which provides complete row and column
connectivity within a quadrant. Therefore, each RC can access data from any other RC in
the same row/column within the quadrant.
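As a minimal sketch of the second connectivity level, the following Python function lists the cells that a given RC can read within its quadrant. The name quadrant_peers and the (row, column) tuple representation are assumptions made for illustration.

```python
QUAD = 4  # a quadrant is a 4 x 4 group of RCs


def quadrant_peers(row, col):
    """RCs whose data RC(row, col) can access through level-2 connectivity:
    every other RC in the same row or column of the same quadrant."""
    r0 = (row // QUAD) * QUAD  # first row of the quadrant
    c0 = (col // QUAD) * QUAD  # first column of the quadrant
    same_row = [(row, c) for c in range(c0, c0 + QUAD) if c != col]
    same_col = [(r, col) for r in range(r0, r0 + QUAD) if r != row]
    return same_row + same_col


print(quadrant_peers(1, 5))
```

Each RC therefore has six level-2 peers: three in its row and three in its column.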
[Figure: 8 x 8 grid of RCs partitioned into four 4 x 4 quadrants labeled Quad0, Quad1,
Quad2, and Quad3.]
Figure 1.4: Level 1 & 2 of RC Array Interconnection Network
At the third or global level, there are buses that support inter-quadrant
connectivity (see Figure 1.5). These buses are also called express lanes, and they run
across rows as well as columns. These lanes can supply data from any RC of a quadrant
to four other RCs in the same row/column but in a different quadrant. For example, the value
of RC(0,1)* can be put on the horizontal express lane (HE) and then read by RC(0,4),
RC(0,5), RC(0,6) and RC(0,7); or it can be put on the vertical express lane (VE) and then
read by RC(4,1), RC(5,1), RC(6,1) and RC(7,1). Thus, up to four cells in a row/column
may access the output value of any one of four cells in the same row/column of the
adjacent quadrant. Express lanes greatly enhance global connectivity. Some irregular
communication patterns that would otherwise require extensive interconnections can be
handled quite efficiently. For example, an eight-point butterfly in FFT is accomplished in
only three clock cycles, and the data movement in the AES algorithm implementation
largely depends on the express lanes.
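The RC(0,1) example can be reproduced with a short Python sketch. The function name express_lane_readers is hypothetical; the rule it encodes (four readers in the same row or column of the adjacent quadrant) is taken from the paragraph above.

```python
def express_lane_readers(row, col, lane):
    """RCs that can read the value RC(row, col) puts on an express lane.

    lane "HE": the four RCs in the same row of the horizontally adjacent quadrant.
    lane "VE": the four RCs in the same column of the vertically adjacent quadrant.
    """
    if lane == "HE":
        start = 4 if col < 4 else 0
        return [(row, c) for c in range(start, start + 4)]
    if lane == "VE":
        start = 4 if row < 4 else 0
        return [(r, col) for r in range(start, start + 4)]
    raise ValueError("lane must be 'HE' or 'VE'")


print(express_lane_readers(0, 1, "HE"))  # [(0, 4), (0, 5), (0, 6), (0, 7)]
print(express_lane_readers(0, 1, "VE"))  # [(4, 1), (5, 1), (6, 1), (7, 1)]
```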
[Figure: 8 x 8 grid of RCs with the express lanes.]
Figure 1.5: Level 3 of interconnection network
* RC(x,y) means the RC located at row x, column y.
The L, M, R, T, C, B port in MUXA: L (Left), M (Middle), R (Right), T (Top),
C (Center), and B (Bottom) port of MUXA are all connected to other RCs within the
same quadrant. For example, these ports for RC X and Y are marked in Figure 1.6.
Notice that they do not always match their literal meanings.
[Figure: two example RCs, X and Y, with the cells reached by their L, M, R, T, C, and B
ports marked.]
Figure 1.6: L, M, R, T, C, B Port of MUXA
The L, U, D port of MUXB: L (Left), U (Up), and D (Down) port of MUXB are
defined by absolute location. They are not necessarily limited within a quadrant. For
example, these ports for RC X, Y, and Z are marked in Figure 1.7. Notice that they are
wrapped.
[Figure: three example RCs, X, Y, and Z, with the cells reached by their L, U, and D
ports marked.]
Figure 1.7: L, U, D Port of MUXB
1.2.3 Frame Buffer and DMA Controller
The high parallelism of the RC Array would be ineffective if the memory
interface were unable to transfer data at an adequate rate. Therefore, a high-speed memory
interface consisting of a streaming buffer (Frame Buffer) and a DMA controller is
incorporated in the system. The Frame Buffer has two sets, as illustrated in Figure 1.8.
The communication between the Frame Buffer and main memory is controlled by the DMA
controller. By using the two sets of the Frame Buffer alternately, the computation of the RC
Array and the data loads and stores of the Frame Buffer are overlapped. Therefore, memory
accesses are virtually transparent to the RC Array.
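The alternating use of the two Frame Buffer sets is a form of double buffering. The Python sketch below lists, under simplifying assumptions, which set the RC Array reads and which set the DMA controller fills at each step. The function name ping_pong_schedule is hypothetical, every load and every computation is assumed to take one step, and the write-back of results is left out.

```python
def ping_pong_schedule(num_blocks):
    """List, per time step, (set read by the RC Array, set filled by DMA).

    Block i is loaded into Frame Buffer set i % 2 during step i and
    processed during step i + 1. None means that side is idle.
    """
    schedule = []
    for step in range(num_blocks + 1):
        compute_set = (step - 1) % 2 if step >= 1 else None
        load_set = step % 2 if step < num_blocks else None
        schedule.append((compute_set, load_set))
    return schedule


for step, (compute_set, load_set) in enumerate(ping_pong_schedule(3)):
    print(step, compute_set, load_set)
```

In every step where both sides are busy they touch different sets, so after the first load the RC Array never has to wait for data in this idealized schedule.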
[Figure: Frame Buffer with SET 0 and SET 1, each containing BANK A and BANK B of
64 x 8 bytes; the MSB and LSB ends are marked.]
Figure 1.8: Frame Buffer Block Diagram
1.2.4 Context Memory
The context memory stores configuration data, or contexts, for RC Array.
Contexts resemble the instructions for a microprocessor. But here, every context can
serve eight RCs in the same row or column simultaneously*.
As shown in Figure 1.9, Context Memory is logically organized into two blocks,
column context block (on the top) and row context block (on the left). Each block
consists of eight context sets, and each set consists of 16 context words.
A context word in the row context block (called a row context word) is broadcast
on a row, and a context word in the column context block (called a column context word)
is broadcast on a column. By picking one corresponding word from each set in the
row/column context block, those 8 words (a plane) can cover all 8 rows/columns,
or the 64 RCs.
* That also indicates the coarse-grain nature (word-level operations) of MorphoSys architecture.
[Figure: 8 x 8 RC Array with the column context block along the top and the row context
block along the left; each of the eight sets in a block holds 16 context words.]
Figure 1.9: Structure of Context Memory
The total number of row and column contexts is referred to as the depth of
programmability. Because there are 16 words in a set, there are 16 row contexts and 16
column contexts in total, so the depth of programmability is 32. In other words, the
RC Array can perform up to 32 different operations without reloading new contexts.
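The depth and size figures follow directly from the Context Memory organization (2 blocks x 8 sets x 16 words x 32 bits). The following Python sketch works through the arithmetic; the function name context_memory_stats and its parameter names are illustrative assumptions.

```python
def context_memory_stats(blocks=2, sets_per_block=8, words_per_set=16, bits_per_word=32):
    """Return (depth of programmability, total context words, total bits)."""
    depth = blocks * words_per_set  # 16 row planes + 16 column planes
    total_words = blocks * sets_per_block * words_per_set
    total_bits = total_words * bits_per_word
    return depth, total_words, total_bits


depth, words, bits = context_memory_stats()
print(depth, words, bits // 8)  # prints 32 256 1024
```

A depth of 32 thus corresponds to 256 context words, or 1024 bytes of context storage.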
This depth is sufficient for many DSP and image processing applications.
However, it is not enough for some complicated algorithms. Because the penalty for
reloading contexts while an application is running is large, a reasonable remedy is to
increase the context memory size. In M2, the next version of MorphoSys, the depth will be
increased to 256.
1.2.5 TinyRISC
Figure 1.10 shows the block diagram of TinyRISC. Since most target applications
involve some sequential processing, a RISC processor, TinyRISC [6], is included in the
system.
[Figure: four pipeline stages (Fetch, Decode, Execute, Write-Back). Labeled blocks:
Program Counter, Branch Unit, ALU, Shift Unit, Memory Unit, MorphoSys Unit,
Register File, Data Cache Core, Clock Driver.]
Figure 1.10: TinyRISC block diagram
TinyRISC is a MIPS-like processor with a 4-stage scalar pipeline. It has a 32-bit ALU,
a register file, and an on-chip data cache memory. This processor also coordinates system
operation and controls its interface with the external world. This is made possible by
adding some specific instructions (besides the standard RISC instructions) to the TinyRISC
Instruction Set Architecture (ISA). These instructions are called MorphoSys instructions.
They can initiate data transfers between main memory and MorphoSys components, and
control the execution of the RC Array.
These MorphoSys instructions are listed in Table 1.2. There are two major
categories of these instructions: DMA instructions and RC Array instructions.
Table 1.2: MorphoSys Instructions
Mnemonic   Description of Operation
LDCTXT     Load context from main memory into Context Memory.
LDFB       Load data from main memory into Frame Buffer.
STFB       Store data into main memory from Frame Buffer.
CBCAST     Context broadcast, no data from Frame Buffer.
DBCBC      Column context broadcast, get data from both banks of Frame Buffer.
DBCBR      Row context broadcast, get data from both banks of Frame Buffer.
DBCB       Context broadcast, get data from both banks of Frame Buffer.
SBCB       Context broadcast, transfer 128-bit data from Frame Buffer.
WFB        Write the processed data back into Frame Buffer with indirect address.
WFBI       Write the processed data back into Frame Buffer with immediate address.
RCRISC     Write one 16-bit data from RC Array into TinyRISC.
The DMA instructions contain fields that provide the DMA Controller with
adequate information, such as starting address in main memory, starting address in Frame
Buffer or Context Memory, number of bytes to load, load or store control, etc. This
enables the transfer of data between main memory and Frame Buffer or Context Memory
through the DMA Controller.
The RC Array instructions have fields that provide control signals to the RC
Array and Context Memory. This is essential to enable the execution of computations in
the RC Array. This information includes the contexts to be executed, the mode of context
broadcast (row or column), location of data to be loaded in from Frame Buffer, etc.
1.3 Modifications to MorphoSys
In the implementation of M2, some modifications to MorphoSys architecture are
proposed, including memory size expansion and architectural revamping of the RC. The
modifications that have impact on the implementation of AES are briefly mentioned
below.
1.3.1 Size Expansion of Register File and Context Memory
To make the RC capable of running more complicated algorithms, 8 registers (instead of 4)
will be included in the register file. The context memory will be enlarged to
store 256 context planes instead of 32. These upgrades are critical to the
effective implementation of some complex algorithms, such as AES, FFT, and
Reed-Solomon codes. Specifically, AES uses 7 registers and 27 contexts for
encryption, and 8 registers and 28 contexts for decryption. Notice that these context
counts cover only the data processing part of AES. Its
initialization part needs more than 500 further contexts to load two tables of 256 bytes each.
Since these tables are loaded only once in a session, it is acceptable to load
them in several passes through a small context memory. So a context memory that
stores 32 contexts is enough for AES. However, the increase in the number of registers
is necessary to achieve a high-speed implementation of AES.
1.3.2 Embedded Lookup Table in Every RC
Lookup operation is common in quite a few algorithms. For AES, it is the most
important operation (see Chapter 2). To achieve high computing parallelism, M2 will
embed a 512-byte lookup table in each RC. This table will be implemented by SRAM.
1.3.3 New RC Array Instructions
To access the lookup table in every RC, two new RC Array instructions,
“LDMM” and “STMM”, are added to the instruction set. For example, “LDMM r1 > 5”
means loading the value of table element (memory) at address r1 into register r5; “STMM
r5 > 1” means storing the value of register r5 into the table element (memory) at address
r1.
Chapter 2
The Advanced Encryption Standard (AES)
The Advanced Encryption Standard (AES) is the new encryption standard that is
expected to replace the current standards, the Data Encryption Standard (DES) and Triple
DES. The National Institute of Standards and Technology (NIST) worked with industry
and the public cryptographic community to develop the AES [7]. This chapter gives a
comprehensive overview of AES and its algorithm.
2.1 Introduction of the AES
After more than three years’ work, NIST recently announced Rijndael as the AES
algorithm. The development of AES and the nature of Rijndael algorithm are briefly
introduced in this section.
2.1.1 History of the AES Development
The AES development was launched by NIST on January 2, 1997. On August 20,
1998, NIST selected fifteen algorithms as candidates for tests. After comprehensive
analysis and public comments by the global cryptographic community, five of these algorithms
were selected as the AES finalists in April 1999. They were MARS, RC6,
Rijndael, Serpent, and Twofish. Then, after two rounds of further public analysis, NIST
announced on October 2, 2000 that Rijndael had been selected for the AES. Four months
after the announcement, NIST finished a draft Federal Information Processing Standard
(FIPS) for the AES and asked for public review and comment [8]. The comment period
ended on May 29, 2001. According to NIST’s schedule, the formal standard is to be
published by the summer of 2001.
2.1.2 Overview of Rijndael
Rijndael is a symmetric block cipher developed by two Belgian cryptographers,
Joan Daemen and Vincent Rijmen. According to its authors, Rijndael can be pronounced like
"Reign Dahl", "Rain Doll", or "Rhine Dahl".
Rijndael applies to data blocks of 128 bits, using cipher keys with lengths of
128, 192, and 256 bits*. Rijndael's combination of security, performance, efficiency, ease
of implementation, and flexibility makes it an appropriate selection for the AES.
Specifically, Rijndael has very good performance in both hardware and software
across a wide range of computing environments. Its initialization time is short, and its key agility is
good. Rijndael's very low memory requirements make it very well suited for restricted-space
environments, in which it also demonstrates excellent performance. Rijndael's
operations are among the easiest to defend against power and timing attacks [9][10].
Additionally, Rijndael's internal round structure appears to have good potential to benefit
from instruction-level parallelism (ILP). It is this ILP characteristic of Rijndael that
motivates the research on mapping it onto the MorphoSys architecture.
For more information about Rijndael, a good starting point is the
website maintained by its authors: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/.
* In fact, Rijndael can handle any combination of key size and block size from 128, 192, and 256 bits. But in the AES, the block size is fixed at 128 bits to be more easily accommodated by many types of block cipher design.
2.1.3 Definition of Terms, Parameters and Functions
The terms, parameters, and functions used by AES are defined in the following
two tables. They conform to the convention used by the draft FIPS.
Table 2.1: Terms and Acronyms Used in AES
Term Explanation
Block Sequence of binary bits that comprise the input, output, State, and Round Key. The length of a block is the number of bits it contains. For AES, the block length is 128 bits.
Byte A group of eight bits that is treated either as a single entity or as an array of 8 individual bits.
Cipher Series of transformations that converts plaintext to ciphertext using the Cipher Key.
Cipher Key Secret, cryptographic key that is used by the Key Expansion routine to generate a set of Round Keys; can be pictured as a rectangular array of bytes, having four rows and Nk columns.
Ciphertext Data output from the Cipher or input to the Inverse Cipher.
Inverse Cipher Series of transformations that converts ciphertext to plaintext using the Cipher Key.
Key Expansion Routine used to generate a series of Round Keys from the Cipher Key.
Plaintext Data input to the Cipher or output from the Inverse Cipher.
Round Key Round Keys are values derived from the Cipher Key using the Key Expansion routine; they are applied to the State in the Cipher and Inverse Cipher.
State Intermediate Cipher result that can be pictured as a rectangular array of bytes, having four rows and Nb columns.
S-box Non-linear substitution table used in several byte substitution transformations and in the Key Expansion routine to perform a one-for-one substitution of a byte value.
Word A group of 32 bits that is treated either as a single entity or as an array of 4 bytes.
Table 2.2: Parameters and Functions Used in AES
AddRoundKey( ) Transformation in the Cipher and Inverse Cipher in which a Round Key is added to the State using an XOR operation. The length of a Round Key equals the size of the State (128 bits, or 16 bytes).
SubBytes( ) Transformation in the Cipher that processes the State using a non-linear byte substitution table (S-box) that operates on each of the State bytes independently.
ShiftRows( ) Transformation in the Cipher that processes the State by cyclically shifting the last three rows of the State by different offsets.
MixColumns( ) Transformation in the Cipher that takes all of the columns of the State and mixes their data (independently of one another) to produce new columns.
InvSubBytes( ) Transformation in the Inverse Cipher that is the inverse of SubBytes( ).
InvShiftRows( ) Transformation in the Inverse Cipher that is the inverse of ShiftRows( ).
InvMixColumns( ) Transformation in the Inverse Cipher that is the inverse of MixColumns( ).
RotWord( ) Function used in the Key Expansion routine that takes a 4-byte word and performs a cyclic permutation.
SubWord( ) Function used in the Key Expansion routine that takes a 4-byte input word and applies an S-box to each of the 4 bytes to produce an output word.
Nb Number of columns (32-bit words) comprising the State. For AES, Nb = 4.
Nk Number of 32-bit words comprising the Cipher Key. For AES, Nk = 4, 6, or 8.
Nr Number of rounds, which is a function of Nk and Nb (which is fixed). For AES, Nr = 10, 12, or 14.
2.2 Mathematical Background of Rijndael
Before looking into the algorithm of Rijndael, it is helpful to understand the
mathematical basis used by it. In this section, the necessary mathematical concepts are
introduced, and some simple examples are given.
2.2.1 Polynomial Representation of A Finite Field Element
The basic processing unit in Rijndael is a byte, which can be represented as a
group of eight contiguous bits:
$$\{b_7, b_6, b_5, b_4, b_3, b_2, b_1, b_0\}, \quad \text{where } b_i = 0 \text{ or } 1$$
Furthermore, it can be interpreted as a finite field element using a polynomial
representation [11]:
$$b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x^1 + b_0 x^0$$
For example, {10011100} identifies the following specific finite field element:
$$x^7 + x^4 + x^3 + x^2$$
To simplify the representation, hexadecimal notation is introduced. For example,
the above element {10011100} can be represented as {9C}, or simpler, ‘9C’.
Since the unit in Rijndael is a byte, all elements can be represented by two
hexadecimal digits. This kind of finite field is called GF(2^8). (GF stands for Galois Field.)
2.2.2 Addition in GF(2^8)
The sum of two elements is the polynomial whose coefficients are given by
the sum modulo 2 of the corresponding coefficients of the two operands. For example,
‘9C’ + ‘26’ = ‘BA’
Or, with the polynomial representation:
$$(x^7 + x^4 + x^3 + x^2) + (x^5 + x^2 + x) = x^7 + x^5 + x^4 + x^3 + x$$
Not surprisingly, addition in GF(2^8) is simply a fast bitwise XOR
operation. To verify this with the previous example,
‘9C’ ⊕ ‘26’ = ‘10011100’ ⊕ ‘00100110’ = ‘10111010’ = ‘BA’
So from now on, the symbol for addition may be either + or ⊕.
The neutral element is ‘00’, and the inverse (or, more accurately, the additive inverse)
of any element is itself. So subtraction and addition are the same here*.
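As a minimal sketch of the addition rule, the following Python fragment treats a field element as an integer in the range 0 to 255. The function name gf_add is a hypothetical helper introduced only for illustration.

```python
def gf_add(a: int, b: int) -> int:
    """Add two GF(2^8) elements given as integers 0..255 (bitwise XOR)."""
    return a ^ b


total = gf_add(0x9C, 0x26)
print(f"{total:02X}")                # BA
print(f"{gf_add(total, 0x26):02X}")  # 9C: adding '26' again undoes the addition
```

Because every element is its own additive inverse, the same function also serves as subtraction.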
2.2.3 Multiplication in GF(2^8)
Multiplication in GF(2^8) corresponds to multiplication of polynomials
modulo an irreducible binary polynomial of degree 8. A polynomial is irreducible if it has
no divisors other than 1 and itself. For Rijndael, this irreducible polynomial is fixed and
given by
$$m(x) = x^8 + x^4 + x^3 + x + 1$$
It can also be written as ‘11B’ in hexadecimal notation. Notice that it is outside the
range ‘00’ ~ ‘FF’.
Here is an example of multiplication: ‘9C’ • ‘26’ = ‘63’. First (addition is XOR),
$$\begin{aligned}
(x^7 + x^4 + x^3 + x^2) \cdot (x^5 + x^2 + x)
&= x^{12} + x^9 + x^8 + x^7 + x^9 + x^6 + x^5 + x^4 + x^8 + x^5 + x^4 + x^3 \\
&= x^{12} + x^7 + x^6 + x^3
\end{aligned}$$
then,
$$(x^{12} + x^7 + x^6 + x^3) \bmod (x^8 + x^4 + x^3 + x + 1) = x^6 + x^5 + x + 1$$
The modular reduction by m(x) ensures that the result is a binary polynomial
of degree less than 8, and thus can be represented by a byte.
The neutral element is ‘01’, and b(x) is the multiplicative inverse of a(x) if
$$b(x) \cdot a(x) \bmod m(x) = 1$$
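The two steps of the multiplication (polynomial product, then reduction modulo m(x)) can be sketched in Python as follows. The helper names poly_mul, poly_mod, gf_mul, and gf_inv are hypothetical, and the brute-force inverse search is only meant to illustrate the definition, not to be efficient.

```python
M = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def poly_mul(a: int, b: int) -> int:
    """Carry-less product of two binary polynomials (no reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(p: int, m: int = M) -> int:
    """Remainder of binary polynomial p modulo m."""
    deg_m = m.bit_length() - 1
    while p.bit_length() - 1 >= deg_m:
        p ^= m << (p.bit_length() - 1 - deg_m)
    return p


def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo m(x)."""
    return poly_mod(poly_mul(a, b))


def gf_inv(a: int) -> int:
    """Multiplicative inverse of a non-zero element, found by exhaustive search."""
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)


print(f"{poly_mul(0x9C, 0x26):04X}")  # 10C8, i.e. x^12 + x^7 + x^6 + x^3
print(f"{gf_mul(0x9C, 0x26):02X}")    # 63
```

Note that gf_inv is undefined for ‘00’, which has no multiplicative inverse.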
* For more information, refer to the mathematics of Abelian groups.
Unlike addition, there is no simple operation at the byte level that corresponds to
multiplication. In software implementations of Rijndael, the multiplication is usually
done with three table-lookup operations in two tables:
if (a && b) return Alogtable[(Logtable[a] + Logtable[b])%255];
else return 0;
This mirrors the ordinary identity $a \cdot b = \log^{-1}(\log a + \log b)$. More
information about these tables, as well as the whole software implementation of Rijndael,
can be found at [12].
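The table-based multiplication can be sketched as below. The tables here are generated on the fly, assuming ‘03’ as the generator of the multiplicative group (an assumption of this sketch; the tables in [12] may be laid out differently). The names gf_mul_slow, ALOGTABLE, LOGTABLE, and gf_mul_table are hypothetical.

```python
def gf_mul_slow(a: int, b: int) -> int:
    """Shift-and-XOR reference multiplication modulo m(x) = '11B'."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# ALOGTABLE[i] = '03'^i and LOGTABLE['03'^i] = i (assumed generator: '03').
ALOGTABLE = [0] * 255
LOGTABLE = [0] * 256
value = 1
for exponent in range(255):
    ALOGTABLE[exponent] = value
    LOGTABLE[value] = exponent
    value = gf_mul_slow(value, 0x03)


def gf_mul_table(a: int, b: int) -> int:
    """Multiplication by two log lookups and one antilog lookup."""
    if a and b:
        return ALOGTABLE[(LOGTABLE[a] + LOGTABLE[b]) % 255]
    return 0


print(f"{gf_mul_table(0x9C, 0x26):02X}")  # 63
```

The zero check is needed because ‘00’ has no logarithm.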
2.2.4 Multiplication by x
When b(x) is multiplied by x, the result before reduction modulo m(x) is:
$$b_7 x^8 + b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x^1$$
If b7 = 0, no modular reduction is needed since the degree is already less than 8.
If b7 = 1, the modular reduction is necessary, and it
can be implemented by a bitwise XOR with ‘1B’. Notice that m(x) is actually
‘11B’; its MSB is XORed with b7 and generates a zero, which can be omitted.
To summarize, a multiplication by x can be implemented at byte level as a 1-bit
left shift followed by a conditional bitwise XOR with ‘1B’, denoted by b(x) =
xtime(a(x)), or simpler, b = xtime(a). The xtime operation is much faster than a normal
multiplication, which as shown before is implemented by three table-lookup operations.
However, xtime is not the ultimate goal. An important feature that makes xtime
useful is that any multiplication can be implemented as a sum of a series of xtime
operations. Here is the proof and an example:
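Proof:
$$\begin{aligned}
b(x) \cdot a(x) &= b(x) \cdot a_0 + b(x) \cdot a_1 x + \cdots + b(x) \cdot a_7 x^7 \\
&= \underbrace{b}_{\text{exists if } a_0 = 1}
 + \underbrace{\mathrm{xtime}(b)}_{\text{exists if } a_1 = 1}
 + \underbrace{\mathrm{xtime}(\mathrm{xtime}(b))}_{\text{exists if } a_2 = 1}
 + \cdots
 + \underbrace{\mathrm{xtime}(\mathrm{xtime}(\cdots \mathrm{xtime}(b)))}_{\text{exists if } a_7 = 1}
\end{aligned}$$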
Example: ‘9C’ • ‘26’ = ‘63’ :
‘26’ = ‘00100110’, so a1 = a2 = a5 = ‘1’
xtime(‘9C’) = ‘00111000’ ⊕ ‘1B’ = ‘23’ # has conditional XOR
xtime(‘23’) = ‘46’ # no conditional XOR
xtime(‘46’) = ‘8C’ # no conditional XOR
xtime(‘8C’) = ‘00011000’ ⊕ ‘1B’ = ‘03’ # has conditional XOR
xtime(‘03’) = ‘06’ # no conditional XOR
‘9C’ • ‘26’ = ‘9C’ • (‘02’ ⊕ ‘04’ ⊕ ‘20’) = ‘23’ ⊕ ‘46’ ⊕ ‘06’ = ‘63’ .
Notice that multiple xtime operations may be needed to perform just one
multiplication.
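A minimal Python sketch of xtime and of the sum-of-xtime multiplication is given below. The function name gf_mul_xtime is hypothetical; xtime follows the definition in this section.

```python
def xtime(a: int) -> int:
    """Multiply a GF(2^8) element by x: shift left, then conditionally XOR '1B'."""
    a <<= 1
    if a & 0x100:      # b7 was 1, so reduce modulo m(x) = '11B'
        a ^= 0x11B
    return a


def gf_mul_xtime(b: int, a: int) -> int:
    """Compute b * a as the XOR of xtime^i(b) over every set bit i of a."""
    result = 0
    power = b                      # holds xtime applied i times to b
    for i in range(8):
        if (a >> i) & 1:
            result ^= power
        power = xtime(power)
    return result


print(f"{xtime(0x9C):02X}")               # 23
print(f"{gf_mul_xtime(0x9C, 0x26):02X}")  # 63
```

For ‘26’ only bits 1, 2, and 5 are set, so the result is the XOR of ‘23’, ‘46’, and ‘06’, as in the example above.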
2.2.5 Polynomials with Coefficients in GF(2^8)
A polynomial can be defined with coefficients in GF(2^8). In Rijndael, this kind of
polynomial has four terms, that is, a degree below 4. For example,
$$a(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0$$
is such a polynomial. Notice that $a_3, a_2, a_1, a_0$ are bytes in GF(2^8) rather than single bits ‘0’ or
‘1’.
Addition can be defined similarly:
$$a(x) + b(x) = (a_3 \oplus b_3) x^3 + (a_2 \oplus b_2) x^2 + (a_1 \oplus b_1) x + (a_0 \oplus b_0)$$
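Multiplication is a little different. The first step is
$$c(x) = a(x) \cdot b(x) = c_6 x^6 + c_5 x^5 + c_4 x^4 + c_3 x^3 + c_2 x^2 + c_1 x + c_0$$
where
$$\begin{aligned}
c_0 &= a_0 \cdot b_0 \\
c_1 &= a_1 \cdot b_0 \oplus a_0 \cdot b_1 \\
c_2 &= a_2 \cdot b_0 \oplus a_1 \cdot b_1 \oplus a_0 \cdot b_2 \\
c_3 &= a_3 \cdot b_0 \oplus a_2 \cdot b_1 \oplus a_1 \cdot b_2 \oplus a_0 \cdot b_3 \\
c_4 &= a_3 \cdot b_1 \oplus a_2 \cdot b_2 \oplus a_1 \cdot b_3 \\
c_5 &= a_3 \cdot b_2 \oplus a_2 \cdot b_3 \\
c_6 &= a_3 \cdot b_3
\end{aligned}$$
The second step is to reduce the previous result to a polynomial of degree less
than 4. In Rijndael, this is accomplished by reduction modulo $M(x) = x^4 + 1$. Let d(x) be the modular
product of a(x) and b(x), then
$$d(x) = d_3 x^3 + d_2 x^2 + d_1 x + d_0$$
where
$$\begin{aligned}
d_0 &= (a_0 \cdot b_0) \oplus (a_3 \cdot b_1) \oplus (a_2 \cdot b_2) \oplus (a_1 \cdot b_3) \\
d_1 &= (a_1 \cdot b_0) \oplus (a_0 \cdot b_1) \oplus (a_3 \cdot b_2) \oplus (a_2 \cdot b_3) \\
d_2 &= (a_2 \cdot b_0) \oplus (a_1 \cdot b_1) \oplus (a_0 \cdot b_2) \oplus (a_3 \cdot b_3) \\
d_3 &= (a_3 \cdot b_0) \oplus (a_2 \cdot b_1) \oplus (a_1 \cdot b_2) \oplus (a_0 \cdot b_3)
\end{aligned}$$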
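Using matrix form, it can be written in the following circulant format:
$$\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix} =
\begin{bmatrix}
a_0 & a_3 & a_2 & a_1 \\
a_1 & a_0 & a_3 & a_2 \\
a_2 & a_1 & a_0 & a_3 \\
a_3 & a_2 & a_1 & a_0
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}$$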
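The modular product of two such polynomials can be sketched as follows. Since x^4 = 1 modulo x^4 + 1, the term a_i b_j simply lands on coefficient (i + j) mod 4, which is what the circulant matrix expresses. The name poly4_mul and the list layout [a0, a1, a2, a3] are choices of this sketch; the two polynomials used in the demonstration are the c(x) polynomials given later for MixColumns( ) and InvMixColumns( ).

```python
def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo '11B' (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def poly4_mul(a, b):
    """Product of a(x) and b(x) modulo x^4 + 1; inputs are [a0, a1, a2, a3]."""
    d = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            # x^(i+j) reduces to x^((i+j) mod 4) because x^4 = 1
            d[(i + j) % 4] ^= gf_mul(a[i], b[j])
    return d


c_mix = [0x02, 0x01, 0x01, 0x03]      # '03'x^3 + '01'x^2 + '01'x + '02'
c_inv = [0x0E, 0x09, 0x0D, 0x0B]      # '0B'x^3 + '0D'x^2 + '09'x + '0E'
print(poly4_mul(c_mix, c_inv))        # [1, 0, 0, 0]
```

The product ‘01’ shows that the two polynomials are inverses of each other modulo x^4 + 1.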
2.3 Rijndael Specification
Rijndael is an iterated block cipher. The number of rounds depends on the values
of Nb and Nk. The Key-Block-Round relation is given in Table 2.3. In the following
sections, the algorithms for the cipher and inverse cipher are described separately. Since
the inverse cipher’s algorithm is very similar to the cipher’s, the discussion will be
mainly focused on the cipher.
Table 2.3: Key-Block-Round Combinations
           Key Length (Nk words)   Block Size (Nb words)   Number of Rounds (Nr)
AES-128    4                       4                       10
AES-192    6                       4                       12
AES-256    8                       4                       14
2.3.1 The Cipher
The pseudo code for the cipher is listed in Figure 2.1.
KeyExpansion(CipherKey, RoundKey); // see sec 2.3.1.5
state = in;
AddRoundKey(state);
for ( round = 1; round < Nr; round ++)
{
SubBytes(state); // see sec 2.3.1.1
ShiftRows(state); // see sec 2.3.1.2
MixColumns(state); // see sec 2.3.1.3
AddRoundKey(state); // see sec 2.3.1.4
}
SubBytes(state);
ShiftRows(state);
AddRoundKey(state);
out = state;
Figure 2.1: Pseudo Code for the Cipher of Rijndael Algorithm
The cipher consists of three parts:
• an initial Key Expansion and Round Key addition.
• Nr-1 intermediate rounds
• a final round
Every intermediate round consists of four steps:
• Substitute Bytes
• Shift Rows
• Mix Columns
• Add Round Key
The final round can be regarded as an incomplete intermediate round, lacking
the MixColumns step.
KeyExpansion, SubBytes, ShiftRows, MixColumns, and AddRoundKey are all
the distinct functions in the cipher. Their algorithms are described below. Because
KeyExpansion uses the same substitution as SubBytes, it is explained last.
2.3.1.1 SubBytes( ) Function
SubBytes is a non-linear byte substitution. It substitutes each byte of the State
with the corresponding element of a table called the S-box. This table is constructed in two
steps:
1. Take the multiplicative inverse in GF(2^8), with ‘00’ mapped onto itself.
2. Apply an affine transformation over GF(2^8) defined by:
$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} +
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$
Since the table is fixed, it is loaded during the initialization and accessed by a
table-lookup operation. For example, SubBytes(‘00’) = ‘63’. (‘63’ is the first element in the S-box.)
The whole S-box table is listed in Appendix A.
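The two construction steps can be sketched in Python as follows. This is only an illustration of how the table can be derived; the thesis itself loads the fixed table from Appendix A. The names gf_mul, gf_inv, affine, SBOX, and INV_SBOX are hypothetical, and the inverse is computed as the 254th power, which equals the multiplicative inverse for every non-zero element.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo '11B' (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Step 1: multiplicative inverse, with '00' mapped onto itself."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):          # a^254 = a^(-1) for non-zero a
        result = gf_mul(result, a)
    return result


def affine(x: int) -> int:
    """Step 2: y_i = x_i + x_(i+4) + x_(i+5) + x_(i+6) + x_(i+7) + c_i, c = '63'."""
    y = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        y |= bit << i
    return y


SBOX = [affine(gf_inv(x)) for x in range(256)]
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x

print(f"{SBOX[0x00]:02X}")  # 63
print(f"{SBOX[0x53]:02X}")  # ED
```

The inverse table built at the end corresponds to the inv S-box used by InvSubBytes( ).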
2.3.1.2 ShiftRows( ) Function
In ShiftRows, the four rows of the State are cyclically shifted over different
offsets: Row 0 is not shifted; Row 1 is shifted over 1 byte; Row 2 is shifted over 2 bytes;
Row 3 is shifted over 3 bytes. So the positions (denoted by numbers from 1 to 16) of the
bytes in a State are changed like:
 1  5  9 13         1  5  9 13
 2  6 10 14   ->    6 10 14  2
 3  7 11 15        11 15  3  7
 4  8 12 16        16  4  8 12
Figure 2.2: Transformation of ShiftRows( )
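The position changes of Figure 2.2 can be reproduced with a short Python sketch. It assumes the State is held as a flat list of 16 entries in column-major order (index = row + 4 * column), which matches the numbering 1 to 16 in the figure. The function names are hypothetical; the inverse function anticipates the InvShiftRows( ) transformation of Section 2.3.2.

```python
def shift_rows(state):
    """Cyclically shift row r of the State left by r positions."""
    return [state[row + 4 * ((col + row) % 4)]
            for col in range(4) for row in range(4)]


def inv_shift_rows(state):
    """Undo shift_rows: shift row r right by r positions."""
    return [state[row + 4 * ((col - row) % 4)]
            for col in range(4) for row in range(4)]


positions = list(range(1, 17))
shifted = shift_rows(positions)
for row in range(4):
    print([shifted[row + 4 * col] for col in range(4)])
# [1, 5, 9, 13]
# [6, 10, 14, 2]
# [11, 15, 3, 7]
# [16, 4, 8, 12]
```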
2.3.1.3 MixColumns( ) Function
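In MixColumns, the columns are considered as polynomials with coefficients in
GF(2^8). They are multiplied modulo $x^4 + 1$ with a fixed polynomial c(x), given by
$$c(x) = \text{'03'}\,x^3 + \text{'01'}\,x^2 + \text{'01'}\,x + \text{'02'}$$
Recalling Section 2.2.5, the multiplication $d(x) = a(x) \cdot c(x)$ can be written as
$$\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix} =
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}$$
As a result, the four bytes in a column a(x) are transformed into the following
d(x):
$$\begin{aligned}
d_0 &= (\text{'02'} \cdot a_0) \oplus (\text{'03'} \cdot a_1) \oplus a_2 \oplus a_3 \\
d_1 &= a_0 \oplus (\text{'02'} \cdot a_1) \oplus (\text{'03'} \cdot a_2) \oplus a_3 \\
d_2 &= a_0 \oplus a_1 \oplus (\text{'02'} \cdot a_2) \oplus (\text{'03'} \cdot a_3) \\
d_3 &= (\text{'03'} \cdot a_0) \oplus a_1 \oplus a_2 \oplus (\text{'02'} \cdot a_3)
\end{aligned}$$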
There are two ways to do the multiplications in the above expressions. One way is to
use two tables (Logtable and Alogtable) and three table-lookup operations (see Section
2.2.3); the other is to use multiple xtime operations (see Section 2.2.4), each of which
can be implemented either by dedicated hardware or by one table-lookup operation.
As shown before, a disadvantage of the xtime approach is that multiple xtime
operations may be needed to finish one multiplication, especially when the multiplier
has a high degree. Fortunately, the multipliers, that is, the coefficients of c(x), are fixed in Rijndael
and their degrees are not very high: in the encryption part of Rijndael, the coefficients are ‘03’,
‘01’, ‘01’, and ‘02’, so the xtime approach works perfectly. In the decryption
part of Rijndael, the coefficients are ‘0B’, ‘0D’, ‘09’, and ‘0E’, which introduce more
xtime operations and reduce the speed a little.
Another concern about xtime approach comes from the structure of MorphoSys.
Because MorphoSys is a reconfigurable computing system rather than an ASIC, one
should not expect to implement xtime operation at byte level as a 1-bit left shift followed
by a conditional bitwise XOR with ‘1B’. Instead, xtime operation will be implemented by
a table-lookup operation. It seems that xtime is not attractive any more because it still
needs a table-lookup, and one multiplication needs multiple xtime operations. But, in
fact, M2 of MorphoSys can do the table-lookup operation quite efficiently. And more
importantly, xtime approach needs only one table, while a normal multiplication needs
two. Considering the tradeoff between speed (not much difference) and memory usage
(ratio of 1 to 2), the xtime approach is preferable.
In MixColumns, the four elements in a column are transformed by the following
code.
tmp = a[0]^a[1]^a[2]^a[3]; // ^ means XOR
a0 = a[0]; // keep the original a[0], which is needed again in the last line
tm = a[0]^a[1]; tm = xtime(tm); a[0] = a[0] ^ tm ^ tmp;
tm = a[1]^a[2]; tm = xtime(tm); a[1] = a[1] ^ tm ^ tmp;
tm = a[2]^a[3]; tm = xtime(tm); a[2] = a[2] ^ tm ^ tmp;
tm = a[3]^a0; tm = xtime(tm); a[3] = a[3] ^ tm ^ tmp;
Figure 2.3: Doing MixColumns( ) by xtime Approach
It is easy to prove that the new a[i] equals the d_i given above.
Notice that a[0]^(a[0]^a[1]^a[2]^a[3]) = a[1]^a[2]^a[3], etc., and tmp is shared
among four expressions to save some registers.
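The equivalence between the xtime code and the d_i expressions can be checked with a small Python sketch. The function names are hypothetical; mix_column_xtime follows the steps of Figure 2.3 (keeping a copy of the original a[0] for the last step), and mix_column_direct evaluates the four d_i expressions directly. The sample column is only an illustrative input.

```python
def xtime(a: int) -> int:
    """Multiply by x in GF(2^8): shift left, conditionally XOR '1B'."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def mix_column_xtime(column):
    """MixColumns on one column [a0, a1, a2, a3] using the xtime approach."""
    a = list(column)
    tmp = a[0] ^ a[1] ^ a[2] ^ a[3]
    first = a[0]                           # original a[0], reused for a[3]
    a[0] ^= xtime(a[0] ^ a[1]) ^ tmp
    a[1] ^= xtime(a[1] ^ a[2]) ^ tmp
    a[2] ^= xtime(a[2] ^ a[3]) ^ tmp
    a[3] ^= xtime(a[3] ^ first) ^ tmp
    return a


def mix_column_direct(column):
    """MixColumns on one column from the d_i expressions ('03'*v = xtime(v) ^ v)."""
    a0, a1, a2, a3 = column
    mul2 = xtime
    mul3 = lambda v: xtime(v) ^ v
    return [mul2(a0) ^ mul3(a1) ^ a2 ^ a3,
            a0 ^ mul2(a1) ^ mul3(a2) ^ a3,
            a0 ^ a1 ^ mul2(a2) ^ mul3(a3),
            mul3(a0) ^ a1 ^ a2 ^ mul2(a3)]


column = [0xDB, 0x13, 0x53, 0x45]
print([f"{v:02X}" for v in mix_column_xtime(column)])   # ['8E', '4D', 'A1', 'BC']
print(mix_column_xtime(column) == mix_column_direct(column))  # True
```

Each output byte costs one xtime and three XORs once tmp is available, which is why only one lookup table is needed in this approach.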
2.3.1.4 AddRoundKey( ) Function
Round Key addition is very simple and straightforward. In this operation, current
Round Key is applied to the current State by a bitwise XOR.
2.3.1.5 Key Expansion
The purpose of Key Expansion is to derive all Round Keys from the Cipher Key.
It should be done during the initialization. And it only needs to be done once if the Cipher
Key is not changed during the whole session*.
The pseudo code for Key Expansion is shown in Figure 2.4.
KeyExpansion (Key[4*Nk], W[4*(Nr+1)], Nk)
{
for ( i = 0; i < Nk; i++)
W[i] = (Key[4*i], Key[4*i+1], Key[4*i+2], Key[4*i+3]);
for ( i = Nk; i < 4*(Nr+1); i++)
{
temp = W[i-1];
if ( i % Nk == 0)
temp = SubWord(RotWord(temp)) ^ Rcon[i/Nk];
else if ( Nk == 8 && i % Nk == 4)
temp = SubWord(temp);
W[i] = W[i-Nk] ^ temp;
}
}
SubWord (W(a, b, c, d))
{ return W(S-box(a), S-box(b), S-box(c), S-box(d)); }
RotWord (W(a, b, c, d))
{ return W(b, c, d, a); }
Figure 2.4: Pseudo Code for Key Expansion
* Usually the Cipher Key is not changed in one session of encryption/decryption. But theoretically, one can use several Cipher Keys within one session to achieve better security. In that case, each change of Cipher Key will introduce one Key Expansion.
Recall that there are an initial Round Key addition, several intermediate rounds, and a
final round. The number of Round Keys therefore equals the number of rounds
plus 1. Because Nr = 10, 12, 14 for Nk = 4, 6, 8, respectively, the numbers of Round
Keys are 11, 13, 15, respectively.
The expansion processes the data at word level. The ith word, or W[i], includes
the (4* i)th, (4* i+1)th, (4* i+2)th, (4* i+3)th byte, or the ith column. For example, if Nk =
4, there are 4 words in the Cipher Key. And it would be expanded to 11*4 = 44 words, or
44*4*8 = 1408 bits.
The Rcon[ ] array in the code is a constant array listed in Appendix A.
As shown in the code, the first Nk words of the whole expanded Round Keys are
exactly the original Cipher Key. After that, the optimized expansion implemented in
hardware should be done by a number of loops because by this means the expanded
Round Keys can be calculated in place to save a lot of memory. Please refer to Section
3.2.1 for detailed information.
The result of Key Expansion is a sequence of words that should be partitioned into
(Nr+1) Round Keys. The partition is very simple: from the beginning, every 4 words
form a Round Key. Figure 2.5 shows the Round Key expansion and partition for Nk = 6.
As shown below, W0 to W5 form the original Cipher Key, but every Round Key contains
only 4 words.
| W0  W1  W2  W3 | W4  W5  W6  W7 | W8  W9  W10 W11 | W12 W13 …
|  Round Key 0   |  Round Key 1   |  Round Key 2    | …
Figure 2.5: Key Expansion and Round Key Partition for Nk = 6
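The Key Expansion pseudo code and the Round Key partition can be sketched in Python as follows. This is an illustrative sketch, not the TinyRISC implementation: the S-box is regenerated from its definition instead of being read from Appendix A, Rcon is produced by repeated xtime starting from ‘01’ (the FIPS definition, assumed here to match the Appendix A array), and all function names are hypothetical. The sample key is an arbitrary 128-bit value.

```python
def xtime(a: int) -> int:
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gf_mul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def rotl8(x: int, k: int) -> int:
    return ((x << k) | (x >> (8 - k))) & 0xFF


def build_sbox():
    """S-box: multiplicative inverse ('00' -> '00') followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def key_expansion(key: bytes):
    """Return the expanded key as a list of 4*(Nr+1) words (4-byte lists)."""
    nk = len(key) // 4
    nr = nk + 6
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        temp = list(w[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]            # RotWord
            temp = [SBOX[b] for b in temp]        # SubWord
            temp[0] ^= rcon                       # Rcon[i/Nk]
            rcon = xtime(rcon)
        elif nk == 8 and i % nk == 4:
            temp = [SBOX[b] for b in temp]        # SubWord only
        w.append([p ^ q for p, q in zip(w[i - nk], temp)])
    return w


def round_keys(w):
    """Partition the expanded words into Round Keys of 4 words each."""
    return [w[4 * r:4 * r + 4] for r in range(len(w) // 4)]


words = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(words), len(round_keys(words)))  # 44 11
print(bytes(words[4]).hex())               # a0fafe17
```

For a 192-bit key the same function returns 52 words, which round_keys splits into 13 Round Keys, matching the Nk = 6 case of Figure 2.5.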
2.3.2 The Inverse Cipher
In the Inverse Cipher, each function is substituted by its inverse function, and the
order is reversed. The basic pseudo code is listed below.
KeyExpansion(CipherKey, RoundKey);
state = in;
// inverse of last round
AddRoundKey(state); // use RoundKey[Nr]
InvShiftRows(state);
InvSubBytes(state);
// inverse of intermediate rounds
for ( round = Nr-1; round > 0; round --)
{
AddRoundKey(state); // inv addition = addition
InvMixColumns(state);
InvShiftRows(state);
InvSubBytes(state);
}
// inverse initial Round Key Addition
AddRoundKey(state); // use RoundKey[0]
out = state;
Figure 2.6: Basic Pseudo Code for the Inverse Cipher of Rijndael Algorithm
InvShiftRows( ) is defined as: Row 0 is not shifted; Row 1 is shifted over 3 bytes;
Row 2 is shifted over 2 bytes; Row 3 is shifted over 1 byte. So the positions of the bytes
in a State are changed like:
 1  5  9 13         1  5  9 13
 2  6 10 14   ->   14  2  6 10
 3  7 11 15        11 15  3  7
 4  8 12 16         8 12 16  4
Figure 2.7: Transformation of InvShiftRows( )
InvSubBytes( ) is the byte substitution where the inverse table, inv S-box, is
applied. The inv S-box table is listed in Appendix A.
InvMixColumns( ) is similar to MixColumns( ). But it uses a different c(x), given
by
$$c(x) = \text{'0B'}\,x^3 + \text{'0D'}\,x^2 + \text{'09'}\,x + \text{'0E'}$$
The coefficients of this polynomial are larger than those of the polynomial used by
MixColumns( ), $\text{'03'}\,x^3 + \text{'01'}\,x^2 + \text{'01'}\,x + \text{'02'}$. So InvMixColumns( ) is slower
due to more xtime and XOR operations (see Section 2.3.1.3).
There are some properties of these inverse functions that can be exploited to
derive a Cipher-like structure for the Inverse Cipher.
First, the order of InvShiftRows( ) and InvSubBytes( ) does not matter. This is
because InvShiftRows( ) simply permutes the bytes and has no effect on their values, and
InvSubBytes( ) works on individual bytes, independent of their positions.
Second, because InvMixColumns( ) is linear with respect to XOR, the sequence
AddRoundKey(State, RoundKey);
InvMixColumns(State);
can be replaced by
InvMixColumns(State);
AddRoundKey(State, InvRoundKey);
where InvRoundKey is obtained by:
1. Apply the Key Expansion.
2. Apply InvMixColumns to all Round Keys except the first one and last one.
Notice that the basic pseudo code in Figure 2.6 can be represented by the
following sequence:
ASB AMSB AMSB … AMSB A
where A means AddRoundKey( ), S means InvShiftRows( ), B means
InvSubBytes( ), and M means InvMixColumns( ).
Using the two properties to change the order SB to BS, AM to MA, the sequence
becomes
ABS MABS MABS … MABS A
or equivalently
A BSMA BSMA … BSMA BSA
The last sequence is exactly the Cipher’s sequence. So, with the use of
InvRoundKey, the Inverse Cipher’s structure is the same as the Cipher’s. When AES is
mapped into MorphoSys, the Inverse Cipher uses exactly the same architecture as the
Cipher. Of course, the functions InvShiftRows( ) and InvMixColumns( ) are slightly
different from ShiftRows( ) and MixColumns( ), and InvRoundKey replaces the
RoundKey.
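The second property, swapping AddRoundKey and InvMixColumns by using InvRoundKey, can be checked on a single column with the Python sketch below. The function names and the sample state and key columns are hypothetical values chosen for illustration; the coefficient matrix is the circulant form of the InvMixColumns polynomial c(x) from this section.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo '11B' (shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def inv_mix_column(column):
    """InvMixColumns on one column [a0, a1, a2, a3]."""
    a0, a1, a2, a3 = column
    return [
        gf_mul(a0, 0x0E) ^ gf_mul(a1, 0x0B) ^ gf_mul(a2, 0x0D) ^ gf_mul(a3, 0x09),
        gf_mul(a0, 0x09) ^ gf_mul(a1, 0x0E) ^ gf_mul(a2, 0x0B) ^ gf_mul(a3, 0x0D),
        gf_mul(a0, 0x0D) ^ gf_mul(a1, 0x09) ^ gf_mul(a2, 0x0E) ^ gf_mul(a3, 0x0B),
        gf_mul(a0, 0x0B) ^ gf_mul(a1, 0x0D) ^ gf_mul(a2, 0x09) ^ gf_mul(a3, 0x0E),
    ]


def xor_columns(p, q):
    """AddRoundKey restricted to one column."""
    return [x ^ y for x, y in zip(p, q)]


state_col = [0x8E, 0x4D, 0xA1, 0xBC]     # sample state column
key_col = [0x2B, 0x7E, 0x15, 0x16]       # sample Round Key column

# Original order: AddRoundKey, then InvMixColumns.
original = inv_mix_column(xor_columns(state_col, key_col))
# Swapped order: InvMixColumns, then AddRoundKey with InvRoundKey.
inv_round_key_col = inv_mix_column(key_col)
swapped = xor_columns(inv_mix_column(state_col), inv_round_key_col)

print(original == swapped)  # True
```

The check succeeds for any state and key column because every output byte is an XOR of fixed multiples of the input bytes.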
Chapter 3
Mapping AES into MorphoSys
AES has already been widely implemented in different languages, such as
C/C++[13][14], Java[15], Visual Basic[16], Perl[17], Assembly[18], Ada[19], etc. It can
also be implemented in hardware, such as an ASIC. MorphoSys is designed for applications
with inherent data-parallelism, high regularity, and high throughput requirements. Due to
the high data-parallelism in the AES algorithm, MorphoSys is able to run it much
faster than those software implementations. Besides, because of the reconfigurability of
MorphoSys, the mapped AES algorithm can be part of a larger system.
In this chapter, several key features of MorphoSys that help the mapping of AES
are pointed out. Then the complete mapping process, including the Key Expansion by the
TinyRISC processor, the data processing by the RC Array, and the Context/data loading and
storing, is discussed. Finally, the simulation and results are introduced and analyzed.
3.1 Parallel Computing Exploration
Rijndael is a block cipher that involves a large number of table-lookup operations
and data movements; the actual ALU operations account for only a very small part of the
running time or the number of instructions. The main concerns are therefore how to transfer
the blocks between the Frame Buffer and the RC Array, how to do the table-lookup
operations, and how to move the data among RCs
with the help of the three layers of the RC Array interconnection network.
3.1.1 Multi-block Processing
Every data block in Rijndael has 16 bytes, while the number of RCs in the RC
Array is 64. Because there is no data dependency between any two data blocks,
MorphoSys has the capability to process 4 data blocks at the same time.
Because each block is a 4x4 matrix, it is very natural to partition the 4 blocks as
shown in Figure 3.1. However, because the data is stored column-wise in main memory
and Frame Buffer, this partitioning would require data reshuffling, which is very difficult to
realize in the Frame Buffer.
Block 0 (4x4)
Block 1 (4x4)
Block 2 (4x4)
Block 3 (4x4)
Figure 3.1: Intuitive Partitioning of RC Array
The actual partitioning used in the implementation is shown in Figure 3.2.
Block 0 (8x2)
Block 1 (8x2)
Block 2 (8x2)
Block 3 (8x2)
Figure 3.2: Actual Partitioning of RC Array
Under this partitioning, the data loading/storing process is straightforward. But
the data movement for ShiftRows( ) is not the same as in a 4x4 matrix. Please refer to
Section 3.1.3 for details about the data movement.
3.1.2 Parallel Table-lookup
In M2’s architecture, there is an embedded memory for each RC. This memory
behaves as a local lookup table. When a context commands a row/column to perform a
table-lookup operation, eight table-lookups are done in parallel. Furthermore, if the eight
contexts in a whole context plane all indicate table-lookup operations, 64 table-lookups
are done in parallel. On the other hand, in a software implementation of Rijndael, the
table-lookup operation can only be done one by one. That is significantly slower than the
implementation in MorphoSys.
3.1.3 Dedicated Data Movement for Rijndael
Recall the data movement for ShiftRows( ). The new position of every byte is
shown in Figure 3.3.
 1  5  9 13         1  5  9 13
 2  6 10 14   ->    6 10 14  2
 3  7 11 15        11 15  3  7
 4  8 12 16        16  4  8 12
Figure 3.3: Transformation of ShiftRows( ) in 4x4 Matrix
Before moving the data according to ShiftRows( ), one needs to be aware of what
data is needed in the subsequent function MixColumns( ). MixColumns( ) is a “column”
function, which means that transforming a byte needs only the values of the four bytes (including
itself) in the same column. For example, the byte at
position 10 (highlighted in the original figure) needs the values of the bytes in the same
column, marked 5, 10, 15, and 4, to do MixColumns( ).
In MorphoSys, a block is partitioned into an 8x2 matrix, and every RC stores a byte.
So ShiftRows( ) will move the data as follows.
 1  9         1  9
 2 10         6 14
 3 11        11  3
 4 12   ->   16  8
 5 13         5 13
 6 14        10  2
 7 15        15  7
 8 16         4 12
Figure 3.4: Transformation of ShiftRows( ) in 8x2 Matrix
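The 8x2 layout of Figure 3.4 can be derived from the 4x4 ShiftRows( ) with a short Python sketch. It assumes, as the figure suggests, that RC column 0 of a block holds state bytes 1 to 8 and RC column 1 holds bytes 9 to 16. The function names are hypothetical, and mix_inputs lists the four bytes that one RC of block 0 needs for the following MixColumns( ).

```python
def shift_rows(state):
    """ShiftRows on 16 entries stored column-major (index = row + 4 * column)."""
    return [state[row + 4 * ((col + row) % 4)]
            for col in range(4) for row in range(4)]


def to_rc_columns(state):
    """Split the 16 state bytes into the two 8-row RC columns of one block."""
    return state[:8], state[8:]


def mix_inputs(shifted, rc_row, rc_col):
    """Bytes of the state column that RC(rc_row, rc_col) of block 0 belongs to."""
    state_col = 2 * rc_col + rc_row // 4
    return shifted[4 * state_col:4 * state_col + 4]


shifted = shift_rows(list(range(1, 17)))
rc_col0, rc_col1 = to_rc_columns(shifted)
print(rc_col0)                    # [1, 6, 11, 16, 5, 10, 15, 4]
print(rc_col1)                    # [9, 14, 3, 8, 13, 2, 7, 12]
print(mix_inputs(shifted, 5, 0))  # [5, 10, 15, 4]
```

The last line corresponds to RC(5,0), which shares its state column with RC(4,0), RC(6,0), and RC(7,0).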
To make every RC do MixColumns( ) independently and simultaneously, it is
desirable to have every RC store four relevant values used by MixColumns( ) into its
local registers. For example, because RC(5,0)* will use the input value in RC(4,0),
RC(5,0), RC(6,0), and RC(7,0) for MixColumns( ), it should store them into its local
registers.
Figure 3.5 shows the result of the data movement for ShiftRows( ). After the move, each
RC contains the shifted data as well as the relevant data for MixColumns( ). Notice
that only two columns of RCs are shown here. The other six columns of RCs (the other three
blocks) apply the same move. The order of the bytes saved in the four registers is not
important. As shown later, the order is not exactly the same as in Figure 3.5; it merely
depends on the ease of implementation.
* Assume we only consider block 0 here. The corresponding RCs in the other three blocks are RC(5,2), RC(5,4), and RC(5,6).
         Column 0       Column 1             Column 0       Column 1
         r0 r1 r2 r3    r0 r1 r2 r3          r0 r1 r2 r3    r0 r1 r2 r3
Row 0     1  -  -  -     9  -  -  -    ->     1  6 11 16     9 14  3  8
Row 1     2  -  -  -    10  -  -  -    ->     1  6 11 16     9 14  3  8
Row 2     3  -  -  -    11  -  -  -    ->     1  6 11 16     9 14  3  8
Row 3     4  -  -  -    12  -  -  -    ->     1  6 11 16     9 14  3  8
Row 4     5  -  -  -    13  -  -  -    ->     5 10 15  4    13  2  7 12
Row 5     6  -  -  -    14  -  -  -    ->     5 10 15  4    13  2  7 12
Row 6     7  -  -  -    15  -  -  -    ->     5 10 15  4    13  2  7 12
Row 7     8  -  -  -    16  -  -  -    ->     5 10 15  4    13  2  7 12
Figure 3.5: Data Movement for ShiftRows( )
The detailed data movement illustration and algorithm for encryption/decryption
are discussed in Section 3.2.4.
3.2 Algorithm Flowchart and Illustration
The whole algorithm can be divided into two parts: a sequential part and a parallel
part. The sequential part consists of the Key Expansion and is done by TinyRISC. The parallel
part includes loading lookup tables, loading Round Keys, loading data, processing data,
and storing data. It is done by RC Array.
The complete flowchart is shown in Figure 3.6. The implementation of each
block is discussed in the following sections.
1. Key Expansion by TinyRISC: store the result (Round Keys) into main memory.
2. Table Loading: load the xtime and S-box (or inv S-box) tables into every RC.
3. Data and Round Key Loading: load four data blocks and the currently needed Round Key from main memory to Frame Buffer, then to RC Array.
4. Data Encryption/Decryption: perform the multiple-round cipher or inverse cipher in RC Array.
5. Data Storing: store four data blocks from RC Array to Frame Buffer, then to main memory.
6. End of Data? If No, continue with the next four blocks (step 3); if Yes, End.
Figure 3.6: Flowchart of Rijndael Implementation in MorphoSys
3.2.1 Key Expansion by TinyRISC
The pseudo code for Key Expansion has been discussed in Section 2.3.1.5. In
order to reduce the number of registers used in TinyRISC, the assembly code uses a loop
structure: Nk words are generated in each loop, until the total number reaches the desired
number of words. For example, if Nk = 4, the total number of words in all Round Keys
is 4*11 = 44, so the total number of loops is $\lceil 44/4 \rceil = 11$; if Nk = 6, the total number is
4*13 = 52, so the number of loops is $\lceil 52/6 \rceil = 9$; if Nk = 8, the total number is 4*15 =
60, so the number of loops is $\lceil 60/8 \rceil = 8$. The indivisibility when Nk ≠ 4 means that more
words than necessary are generated during the expansion. The extra words can
simply be discarded.
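The loop counts above can be checked with a small sketch. It assumes the standard AES relation Nr = Nk + 6 (10, 12, or 14 rounds) and four words per Round Key; the function name is hypothetical and is not taken from the TinyRISC assembly code.

```python
def key_expansion_loops(nk):
    """Return (words needed, loops of Nk words, surplus words) for a key of Nk words."""
    nr = nk + 6                        # number of rounds for Nk = 4, 6, 8
    words_needed = 4 * (nr + 1)        # four 32-bit words per Round Key
    loops = -(-words_needed // nk)     # integer ceiling of words_needed / nk
    surplus = loops * nk - words_needed
    return words_needed, loops, surplus


for nk in (4, 6, 8):
    print(nk, key_expansion_loops(nk))
```

For Nk = 6 and Nk = 8 the surplus is 2 and 4 words respectively, which are the extra words that can be discarded.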
In the Inverse Cipher, an additional InvMixColumns( ) function is applied to
every Round Key except the first and last one.
Because the main memory and TinyRISC are 32-bit, the expanded Round Keys
are also stored in 32-bit format. But this format cannot be used by Frame Buffer, which
expects 16-bit inputs. For example, when Frame Buffer reads a 32-bit word 0x00000064 from
main memory, it will treat it as two numbers: 0x0000 and 0x0064. So the result needs 2-to-1
concatenations: after all Round Keys are generated and stored back into main memory in
32-bit format, they are loaded into TinyRISC again, two 32-bit words at a time,
concatenated into one 32-bit word, and then stored back into main memory.
0x000000eb
0x0000003d    →    0x00eb003d
(next concatenation)
Figure 3.7: Concatenations of Round Keys
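The 2-to-1 concatenation of Figure 3.7 can be sketched as follows. This is an illustration in Python rather than the TinyRISC assembly actually used, and `concat_pairs` is a hypothetical name.

```python
def concat_pairs(words):
    """Pack consecutive pairs of 32-bit words (one value each) into single 32-bit words."""
    if len(words) % 2:
        raise ValueError("expected an even number of words")
    packed = []
    for first, second in zip(words[0::2], words[1::2]):
        # The first value goes to the upper 16 bits, the second to the lower 16 bits,
        # so that Frame Buffer sees two 16-bit numbers per 32-bit word.
        packed.append(((first & 0xFFFF) << 16) | (second & 0xFFFF))
    return packed


print([hex(w) for w in concat_pairs([0x000000EB, 0x0000003D])])
```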
3.2.2 Table Loading
Three types of contexts are needed for loading each table element. They are:
set 0, 0 LDIM! 5 def def > 0; # load value 5 into RC’s r0
set 0, 15 STMM r0 def > 1; # store r0 into table address r1
set 8, 15 ADD r1 r2 > 1; # increase r1 by 1 (r2)
Notice that once STMM and ADD* are loaded into Context Memory, they can be
used for every table element. So theoretically, the total number of contexts to load two
256-byte tables is 256*2 (two tables’ LDIM) + 1 (STMM) + 1 (ADD) + 2 (set r0, r1’s
initial value) = 516. But the Context Memory is not big enough to hold all 516
contexts. In M2, the Context Memory can save up to 256 contexts. Since STMM, ADD,
and initialization contexts are needed once in every 256 contexts, the pattern of contexts
should be:
1st loading: 252 LDIMs + 1 STMM + 1 ADD + 2 Initialization
2nd loading: 252 LDIMs + 1 STMM + 1 ADD + 2 Initialization
3rd loading: 8 LDIMs + 1 STMM + 1 ADD + 2 Initialization
The total number of contexts is 256*2 + 12 = 524.
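The loading pattern for M2 can be derived mechanically. The sketch below assumes, as in the pattern above, that every Context Memory load reserves four slots (STMM, ADD, and two initialization contexts) and fills the rest with LDIM contexts; the function and parameter names are illustrative.

```python
def table_loading_plan(table_bytes=512, capacity=256, overhead=4):
    """Return (LDIM contexts per load, total contexts) for loading the lookup tables."""
    per_load = capacity - overhead     # LDIM slots available in one load
    plan = []
    remaining = table_bytes
    while remaining > 0:
        n = min(per_load, remaining)
        plan.append(n)
        remaining -= n
    total = table_bytes + overhead * len(plan)
    return plan, total


print(table_loading_plan())   # M2: Context Memory holds 256 contexts
```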
At the time the author simulated the implementation of AES, the simulator was
only able to handle up to 32 contexts (i.e., M1’s structure). So the table loading takes 18
passes instead of 3. In any case this is not a big issue: the table loading is done only
once during the initialization.
There are two tables to be loaded: xtime and S-box (or inv S-box). One of them
occupies addresses 0x00 to 0xFF, and the other occupies addresses 0x100 to 0x1FF. As shown
later, to access the second table, an extra context to add the offset 0x100 is needed for every
table lookup operation. Because the xtime table is used more frequently (see next section), it
is reasonable to load it first.
* All RC Array instructions are listed in Appendix C.
3.2.3 Data and Round Key Loading
Four blocks, or 64 bytes of data, and the currently needed Round Key (16 bytes)
are loaded from main memory into Frame Buffer, then into RC Array. Because the four
blocks use the same Round Key, the Round Key is loaded from Frame
Buffer to RC Array four times. The involved instructions are LDFB and SBCB.
3.2.4 Data Processing in RC Array
After the data and Round Key have been loaded into RC Array, the next step is
to process the data in RC Array. As stated in Chapter 2, the process includes four functions:
SubBytes( ), ShiftRows( ), MixColumns( ), and AddRoundKey( ).
The contexts for SubBytes( ) are very simple:
set 0, 3 ADD r0 r1 > 0; # r1 is constant 0x0100
set 0, 4 LDMM r0 def > 0; # load into r0
The first context adds the offset 0x100 to the index register r0. The second context
loads the table element at the resulting address (the original r0 plus 0x100) into r0. So the
result is r0 = S-box(r0) (or inv S-box(r0)).
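The two-context lookup can be modeled in software. The sketch below is only a model of one RC's 512-byte table memory, assuming xtime occupies 0x000-0x0FF and the S-box 0x100-0x1FF as described in Section 3.2.2; the S-box is computed from its standard definition (multiplicative inverse followed by the affine transform) instead of being loaded, and all function names are hypothetical.

```python
def xtime(a):
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def sbox_entry(a):
    inv = 0
    if a:
        inv = 1
        for _ in range(254):           # a^254 is the multiplicative inverse of a
            inv = gf_mul(inv, a)
    rot = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


# Model of the RC table memory: xtime first, S-box at offset 0x100.
TABLE = [xtime(i) for i in range(256)] + [sbox_entry(i) for i in range(256)]


def sub_byte(r0):
    r0 = r0 + 0x100                    # first context: ADD r0 r1 > 0, with r1 = 0x0100
    return TABLE[r0]                   # second context: LDMM r0 def > 0


print(hex(sub_byte(0x53)))
```

Because xtime sits at offset 0, an xtime lookup in this model needs only the load step, which is the reason given above for loading it first.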
The context for AddRoundKey( ) is also very simple:
set 8, 0 XOR r0 r7 > 0; # RoundKey is saved in r7
However, the contexts for ShiftRows( ) and MixColumns( ) are more complicated.
ShiftRows( ) includes eight steps of data movement, and MixColumns( ) mainly consists
of xtime and XOR operations.
The data movement and contexts for ShiftRows( ) are illustrated in several
figures.
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 – – – 10 – – – 2 – – – 10 – – –
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 – – – 12 – – – 4 5 – – 12 13 – –
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 – – – 14 – – – 6 1 – – 14 9 – –
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 – – – 16 – – – 8 – – – 16 – – –
$r_0^0 \to r_1^5$, $r_0^4 \to r_1^3$    Express Lane, Row Mode
set 8 , 1 BYPASS r0 def > 0 WE ;
set 9 , 1 BYPASS r0 def > 0 ;
set 10 , 1 BYPASS r0 def > 0 ;
set 11 , 1 BYPASS VE def > 1 ;
set 12 , 1 BYPASS r0 def > 0 WE ;
set 13 , 1 BYPASS VE def > 1 ;
set 14 , 1 BYPASS r0 def > 0 ;
set 15 , 1 BYPASS r0 def > 0 ;
Figure 3.8: ShiftRows( ) Step 1
($r_k^i$ means register $r_k$ in Row $i$.)
Figure 3.8 shows the first step of ShiftRows( ). The contexts are in Row Mode,
which means one context for one row. Rows 0 and 4 put the data in r0 onto the Express Lane,
and Rows 5 and 3, respectively, get the data from the corresponding vertical Express Lane and
save it into register r1. By this means, the values at positions 1 and 5 are transferred to the
desired positions. In this step, Rows 1, 2, 6, and 7 perform NOP operations.
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 – – – 10 – – – 2 7 – – 10 15 – –
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 5 – – 12 13 – – 4 5 – – 12 13 – –
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 1 – – 14 9 – – 6 1 – – 14 9 – –
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 – – – 16 – – – 8 3 – – 16 11 – –
$r_0^2 \to r_1^7$, $r_0^6 \to r_1^1$    Express Lane, Row Mode
set 8 , 2 BYPASS r0 def > 0 ;
set 9 , 2 BYPASS VE def > 1 ;
set 10 , 2 BYPASS r0 def > 0 WE ;
set 11 , 2 BYPASS r0 def > 0 ;
set 12 , 2 BYPASS r0 def > 0 ;
set 13 , 2 BYPASS r0 def > 0 ;
set 14 , 2 BYPASS r0 def > 0 WE ;
set 15 , 2 BYPASS VE def > 1 ;
Figure 3.9: ShiftRows( ) Step 2
Figure 3.9 shows the second step. It is similar to the first step, but moves different
data into desired positions.
($r_k^i$ means register $r_k$ in Row $i$.)
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 7 – – 10 15 – – 2 7 10 15 10 15 2 7
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 5 – – 12 13 – – 4 5 – – 12 13 – –
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 1 – – 14 9 – – 6 1 – – 14 9 – –
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 3 – – 16 11 – – 8 3 16 11 16 11 8 3
$r_0^1 \to r_2^0$, $r_0^0 \to r_2^1$ | $r_1^1 \to r_3^0$, $r_1^0 \to r_3^1$    Left/Right, Column Mode
set 0 , 7 BYPASS L def > 2 ;
set 1 , 7 BYPASS L def > 2 ;
set 2 , 7 BYPASS R def > 2 ;
set 3 , 7 BYPASS R def > 2 ;
set 4 , 7 BYPASS L def > 2 ;
set 5 , 7 BYPASS L def > 2 ;
set 6 , 7 BYPASS R def > 2 ;
set 7 , 7 BYPASS R def > 2 ;
Figure 3.10: ShiftRows( ) Step 3, 4
The third and fourth steps use Column Mode. Column 2i gets data from
Column 2i+1 (i = 0, 1, 2, 3), and vice versa. Only one context plane is shown in Figure
3.10. The others are similar.
($r_k^i$ means register $r_k$ in Column $i$.)
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 6 – – – 14 – – –
Row 1 2 7 10 15 10 15 2 7 6 7 10 15 14 15 2 7
Row 2 3 – – – 11 – – – 6 – – – 14 – – –
Row 3 4 5 – – 12 13 – – 6 5 – – 14 13 – –
Row 4 5 – – – 13 – – – 4 – – – 12 – – –
Row 5 6 1 – – 14 9 – – 4 1 – – 12 9 – –
Row 6 7 – – – 15 – – – 4 – – – 12 – – –
Row 7 8 3 16 11 16 11 8 3 4 3 16 11 12 11 8 3
$r_0^3 \to r_0^{4,5,6,7}$, $r_0^5 \to r_0^{0,1,2,3}$    Express Lane, Row Mode
set 8 , 5 BYPASS VE def > 0 ;
set 9 , 5 BYPASS VE def > 0 ;
set 10 , 5 BYPASS VE def > 0 ;
set 11 , 5 BYPASS VE def > 0 WE ;
set 12 , 5 BYPASS VE def > 0 ;
set 13 , 5 BYPASS VE def > 0 WE ;
set 14 , 5 BYPASS VE def > 0 ;
set 15 , 5 BYPASS VE def > 0 ;
Figure 3.11: ShiftRows( ) Step 5
After four steps, all the seed data used for ShiftRows( ) and MixColumns( ) are
ready. Those seeds are highlighted in the left table in Figure 3.11. Then, the Express
Lanes are exploited again to store one byte into four other RCs at the same time. Here in
Step 5, the seeds in register r0 of RC(3, i) and RC(5, i) are propagated through the Express
Lane and fetched by register r0 of RC(4-7, i) and RC(0-3, i), respectively.
($r_k^i$ means register $r_k$ in Row $i$.)
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 6 – – – 14 – – – 6 1 16 11 14 9 8 3
Row 1 6 7 10 15 14 15 2 7 6 1 16 11 14 9 8 3
Row 2 6 – – – 14 – – – 6 1 16 11 14 9 8 3
Row 3 6 5 – – 14 13 – – 6 1 16 11 14 9 8 3
Row 4 4 – – – 12 – – – 4 5 10 15 12 13 2 7
Row 5 4 1 – – 12 9 – – 4 5 10 15 12 13 2 7
Row 6 4 – – – 12 – – – 4 5 10 15 12 13 2 7
Row 7 4 3 16 11 12 11 8 3 4 5 10 15 12 13 2 7
$r_1^3 \to r_1^{4,5,6,7}$, $r_1^5 \to r_1^{0,1,2,3}$    Express Lane, Row Mode
$r_2^1 \to r_2^{4,5,6,7}$, $r_2^7 \to r_2^{0,1,2,3}$    Express Lane, Row Mode
$r_3^1 \to r_3^{4,5,6,7}$, $r_3^7 \to r_3^{0,1,2,3}$    Express Lane, Row Mode
set 8 , 7 BYPASS VE def > 1 ;
set 9 , 7 BYPASS VE def > 1 ;
set 10 , 7 BYPASS VE def > 1 ;
set 11 , 7 BYPASS VE def > 1 WE ;
set 12 , 7 BYPASS VE def > 1 ;
set 13 , 7 BYPASS VE def > 1 WE ;
set 14 , 7 BYPASS VE def > 1 ;
set 15 , 7 BYPASS VE def > 1 ;
Figure 3.12: ShiftRows( ) Step 6, 7, 8
Steps 6, 7, and 8 are similar to Step 5. They store the data from the Express Lane
into registers r1, r2, and r3, respectively. Only one context plane is shown above. The others
are similar. After these eight steps, every RC contains the data for MixColumns( ).
($r_k^i$ means register $r_k$ in Row $i$.)
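The eight steps can be replayed in software to confirm that they produce the register contents on the right side of Figure 3.12. The sketch below models only block 0 (two RC columns) and assumes that the Column Mode exchange of Steps 3 and 4 is applied to every row, although only Rows 1 and 7 carry useful data at that point. It is a behavioral model, not MorphoSys context code.

```python
ROWS, COLS = 8, 2


def simulate_shift_rows():
    # regs[row][col] = [r0, r1, r2, r3]; None marks a register not yet written.
    regs = [[[1 + row + 8 * col, None, None, None] for col in range(COLS)]
            for row in range(ROWS)]

    def snapshot():
        return [[list(rc) for rc in row] for row in regs]

    def express(old, src_row, src_reg, dst_rows, dst_reg):
        # Vertical Express Lane: one row drives the listed rows of the same column.
        for col in range(COLS):
            for row in dst_rows:
                regs[row][col][dst_reg] = old[src_row][col][src_reg]

    old = snapshot()                   # Step 1
    express(old, 0, 0, [5], 1)
    express(old, 4, 0, [3], 1)
    old = snapshot()                   # Step 2
    express(old, 2, 0, [7], 1)
    express(old, 6, 0, [1], 1)
    for src_reg, dst_reg in ((0, 2), (1, 3)):      # Steps 3 and 4
        old = snapshot()
        for row in range(ROWS):
            regs[row][0][dst_reg] = old[row][1][src_reg]
            regs[row][1][dst_reg] = old[row][0][src_reg]
    # Steps 5 to 8: broadcast the seeds of register k to four rows each.
    for k, (src_for_lower, src_for_upper) in enumerate(((3, 5), (3, 5), (1, 7), (1, 7))):
        old = snapshot()
        express(old, src_for_lower, k, range(4, 8), k)
        express(old, src_for_upper, k, range(0, 4), k)
    return regs


final = simulate_shift_rows()
for row in range(ROWS):
    print("Row", row, final[row][0], final[row][1])
```

In the final state each RC holds the four bytes of its own column after ShiftRows( ), which is what MixColumns( ) needs.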
The algorithm for MixColumns( ) is listed below again for convenience.
tmp = a0 ^ a1 ^ a2 ^ a3;
tm = a0 ^ a1; tm = xtime(tm); a0 ^= tm ^ tmp;
tm = a1 ^ a2; tm = xtime(tm); a1 ^= tm ^ tmp;
tm = a2 ^ a3; tm = xtime(tm); a2 ^= tm ^ tmp;
tm = a3 ^ a0; tm = xtime(tm); a3 ^= tm ^ tmp; // a0 here is the original (pre-update) a0
The distinct contexts for them are just “XOR” and “LDMM”. For example:
set 0 , 8 XOR r0 r4 > 4 ;
set 0 , 11 LDMM r5 def > 5 ;
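The MixColumns( ) listing can be exercised with a small sketch. Here xtime is computed arithmetically rather than through the lookup table, the outputs are kept in separate variables so that the last line uses the original a0, and the test column is the widely published MixColumns example vector, not data from this thesis.

```python
def xtime(a):
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mix_column(a0, a1, a2, a3):
    tmp = a0 ^ a1 ^ a2 ^ a3
    b0 = a0 ^ xtime(a0 ^ a1) ^ tmp
    b1 = a1 ^ xtime(a1 ^ a2) ^ tmp
    b2 = a2 ^ xtime(a2 ^ a3) ^ tmp
    b3 = a3 ^ xtime(a3 ^ a0) ^ tmp     # a0 is still the original input byte
    return b0, b1, b2, b3


print([hex(b) for b in mix_column(0xDB, 0x13, 0x53, 0x45)])
```

In the RC Array each RC computes only the output byte for its own position, so each xtime call corresponds to one LDMM table lookup and each ^ to one XOR context.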
So far all the functions for the Cipher have been discussed. After optimization, the
data processing part of the Cipher uses only 27 contexts in total.
In the Inverse Cipher, the functions SubBytes( ) and AddRoundKey( ) are the same,
but InvShiftRows( ) and InvMixColumns( ) are slightly different. In InvShiftRows( ),
there are also eight steps of data movement. The only difference is the position of the target
data. Figure 3.13 shows the first four steps for InvShiftRows( ).
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 – – – 10 – – – 2 5 – – 10 13 – –
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 – – – 12 – – – 4 7 12 15 12 15 4 7
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 – – – 14 – – – 6 3 14 11 14 11 6 3
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 – – – 16 – – – 8 1 – – 16 9 – –
Figure 3.13: InvShiftRows( ) Step 1, 2, 3, 4
Figure 3.14 shows the next four steps. The highlighted bytes in left table are
seeds. They are propagated to four RCs through the Express Lanes.
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 8 1 14 11 16 9 6 3
Row 1 2 5 – – 10 13 – – 8 1 14 11 16 9 6 3
Row 2 3 – – – 11 – – – 8 1 14 11 16 9 6 3
Row 3 4 7 12 15 12 15 4 7 8 1 14 11 16 9 6 3
Row 4 5 – – – 13 – – – 2 5 12 15 10 13 4 7
Row 5 6 3 14 11 14 11 6 3 2 5 12 15 10 13 4 7
Row 6 7 – – – 15 – – – 2 5 12 15 10 13 4 7
Row 7 8 1 – – 16 9 – – 2 5 12 15 10 13 4 7
Figure 3.14: InvShiftRows( ) Step 5, 6, 7, 8
The algorithm for InvMixColumns( ) is listed below. Due to more xtime and XOR
operations, the running time increases slightly. However, with very careful
arrangement of registers and table lookups, the total number of contexts for the data
processing part of decryption increases by only 1, to 28.
tm1 = a0 ^ a1                       // r5 for tm1, ri for ai (i = 0, 1, 2, 3)
tmp1 = tm1 ^ a2                     // r6 for tmp1, get r5 before it is destroyed
tm1 = xtime(tm1)                    // r5 for tm1, needs one lookup context C0
r4 = a0 ^ tm1                       // r5 is free and can be used again
tm2 = a0 ^ a2                       // r5 for tm2
tm2 = xtime(xtime(tm2))             // all use the same context C0 as before
r4 = r4 ^ tm2                       // r5 is free again
tmp2 = tmp1 ^ a3                    // r5 for tmp2, switch back to r5
r4 = r4 ^ tmp2                      // tmp2 = a0 ^ a1 ^ a2 ^ a3 here
tmp2 = xtime(xtime(xtime(tmp2)))    // all use context C0
r4 = r4 ^ tmp2                      // r4 saves the result of InvMixColumns( )
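The register-level listing can be checked against MixColumns( ). The sketch below follows the listing step by step for one output byte; computing the other three bytes by rotating the inputs is an assumption about how each RC orders its local registers, and the function names are hypothetical.

```python
def xtime(a):
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def inv_mix_byte(a0, a1, a2, a3):
    """One output byte: 0E*a0 ^ 0B*a1 ^ 0D*a2 ^ 09*a3, using only xtime and XOR."""
    tm1 = a0 ^ a1
    tmp1 = tm1 ^ a2
    tm1 = xtime(tm1)
    r4 = a0 ^ tm1
    tm2 = a0 ^ a2
    tm2 = xtime(xtime(tm2))
    r4 ^= tm2
    tmp2 = tmp1 ^ a3
    r4 ^= tmp2
    tmp2 = xtime(xtime(xtime(tmp2)))
    r4 ^= tmp2
    return r4


def inv_mix_column(col):
    # Output byte i uses the inputs rotated so that a_i comes first.
    col = list(col)
    return tuple(inv_mix_byte(*(col[i:] + col[:i])) for i in range(4))


print([hex(b) for b in inv_mix_column((0x8E, 0x4D, 0xA1, 0xBC))])
```

The listing uses six xtime lookups per output byte, compared with one per output byte in MixColumns( ), which is why the running time increases even though only one additional context is needed.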
3.2.5 Data Storing
After four data blocks are processed in RC Array, they are stored into Frame
Buffer, and then into main memory. The involved instructions are WFBI and STFB. If
there are more data to be encrypted/decrypted, the program will continue to process next
four blocks with the same procedure, until reaching the end of data.
The result saved in the main memory is in the concatenated format. For example, a
32-bit word “0x00010002” holds two bytes: “0x01” and “0x02”. To comply with the
same format as the input*, which uses 32 bits to represent a byte, the result needs to be
separated. Using the same example, “0x00010002” will be separated as “0x00000001”
and “0x00000002”. This separation is performed after all the data have been
encrypted/decrypted.
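The separation step is the inverse of the Round Key concatenation. A minimal sketch, with a hypothetical function name, splits each stored word into its two 16-bit halves:

```python
def separate(words):
    """Split each 32-bit word into two 32-bit words holding its upper and lower halves."""
    out = []
    for w in words:
        out.append((w >> 16) & 0xFFFF)   # upper 16 bits, zero-extended
        out.append(w & 0xFFFF)           # lower 16 bits, zero-extended
    return out


print([f"0x{w:08x}" for w in separate([0x00010002])])
```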
3.3 Simulation Environment
The MorphoSys group has developed a set of software tools to facilitate algorithm
mapping, source code compilation, and algorithm simulation for M1. The complete set
of software includes Tcc, TRASM, MorphoSim, mView, mLoad, mSched, and
mULATE, as shown in Figure 3.15. Tcc is a C/C++ compiler that generates the
TinyRISC executable code. TRASM is an assembler that also generates the
TinyRISC executable code. MorphoSim is a VHDL simulator, which exactly matches the
MorphoSys chip. mLoad, mView, and mSched are used for context generation and
application scheduling. mULATE is a cycle-accurate simulator, which is more abstract
than MorphoSim.
* This consistency might be unnecessary. It depends on the specific application.
[Diagram labels: App. (C or Assembly Code); TR_app; TinyRISC; RC Array; RC Array functions; Tcc or TRASM; mLoad; Context Lib.; mSched; Executable; Configuration context; mView; MuLate, MorphoSim (C++, VHDL); MorphoSys Chip]
Figure 3.15: Software Tools for MorphoSys
To be compatible with the modifications in M2, all of these tools need to be
updated. Up to now, the mLoad*, mULATE†, and TRASM‡ have been updated. So the
author wrote and compiled the TinyRISC assembly code and contexts of the whole
algorithm, then used mULATE to simulate it.
3.4 Performance Analysis
A comprehensive simulation for the encryption and decryption under different
Key sizes is performed in mULATE. And the results are compared with those
implemented by assembly language, C/C++, Java, and ASIC/Programmable Logic cores.
* mLoad is the context compiler written in Perl. It was updated by the author.
† mULATE was updated by Afshin Niktash.
‡ TRASM was updated by Afshin Niktash.
For the initialization part, other implementations may only need the Key
Expansion. However, for the MorphoSys implementation, it needs the Key Expansion,
lookup table loading, and context loading. Table 3.1 shows the numbers of cycles for the
Key Expansion implemented by ANSI C, C++, and MorphoSys TinyRISC*. In
MorphoSys implementation, the Key Expansion for the Inverse Cipher is much slower
because the InvMixColumns( ) operation is applied to each Round Key except the first
and last one, and the InvMixColumns( ) involves a lot of memory operations which need
a lot of cycles.
Table 3.1: # of Cycles for Key Expansion in Several Implementations
            AES CD (ANSI C)            Brian Gladman (VC++)       MorphoSys TinyRISC
Key Size    Cipher   Inverse Cipher    Cipher   Inverse Cipher    Cipher   Inverse Cipher
128         2100     2900              305      1389              2770     13320
192         2600     3600              277      1595              3386     15603
256         2800     3800              374      1960              4196     19184
The numbers of cycles for all three parts of the initialization in the MorphoSys
implementation are listed in Table 3.2. It shows that the Cipher and Inverse Cipher may
need up to 10675 and 25671 cycles for the whole initialization, respectively. Assuming M2
runs at 200 MHz, this takes about 53 µs and 128 µs, respectively. This time is very
short and acceptable.
* The statistics for ANSI C and C++ are obtained from the AES proposal by Rijndael’s authors.
Table 3.2: # of Cycles for AES Initialization in MorphoSys Implementation
Key Size    Key Expansion    Table Loading    Context Loading    Total # of cycles
128         2770/13320       6249             230/238            9249/19807
192         3386/16029       6249             230/238            9865/22516
256         4196/19184       6249             230/238            10675/25671
* in “x/y”, “x” for encryption, “y” for decryption
For the data processing part, the numbers of cycles and/or throughputs for
encryption implemented by assembly language, C/C++, and Java are listed in Table 3.3.
All the throughputs (unit: Mb/s) are calculated at frequency 200 MHz.
Table 3.3: # of Cycles and Throughputs per Block in Other Implementations
            Intel 8051     Motorola 68HC08    AES CD (ANSI C)        Brian Gladman (VC++)    Java
Key Size    # of cycles    # of cycles        # of cycles    Xput    # of cycles    Xput     # of cycles    Xput
128         4065           8390               950            27.0    363            70.5     23000          1.1
192         4512           10780              1125           22.8    432            59.3     27600          0.93
256         5221           12490              1295           19.8    500            51.2     32300          0.79
* result for encryption only
The MorphoSys implementation result is listed in Table 3.4. Because each time
four blocks are processed in parallel, the actual number of cycles for one block is only
1/4 of the computing cycles. For example, when Key size is 128 bits, the data processing
part for encryption needs 601 / 4 = 150.25 cycles/block.
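The throughput figures follow directly from the cycle counts. A minimal sketch, assuming a 128-bit block and the 200 MHz clock used throughout this section (the function name is illustrative):

```python
def throughput_mbps(cycles_per_block, freq_mhz=200.0, block_bits=128):
    """Throughput in Mb/s for a given number of cycles per 128-bit block."""
    return block_bits * freq_mhz / cycles_per_block


# 601 cycles process four blocks in parallel, i.e. 150.25 cycles per block.
print(round(throughput_mbps(601 / 4), 1))
print(round(throughput_mbps(166), 1))
```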
Table 3.4: # of Cycles and Throughputs per Block in MorphoSys Implementation
            Encryption             Decryption
Key Size    # of cycles    Xput    # of cycles    Xput
128         150.25         170.4   166            154.2
192         175.25         146.1   194.5          131.6
256         200.25         127.8   223            114.8
* in “a/b” , “a” for encryption, “b” for decryption
As shown in the above tables, the running time for initialization is much longer than
that for one-block processing no matter how the AES is implemented. However, the
initialization is only a small fraction of the total running time when the size of the data to be
processed is not very small. Assume the Key size is 128 bits and the data size is 64K
Bytes, or 4K blocks. Since four blocks are loaded at a time, MorphoSys needs to load the data
into RC Array about 1000 times. So the total time for the data processing part is about
601,000 / 664,000 cycles for encryption / decryption, and the time for initialization is only
about 1.5% / 3% of the whole time.
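The amortization argument can be made concrete with a short sketch. It uses the initialization totals from Table 3.2 and the per-load cycle counts implied by Table 3.4 (601 and 664 cycles for four blocks); the exact number of loads for 64K bytes is 1024, which the text rounds to 1000. The function name is illustrative.

```python
def init_fraction(init_cycles, cycles_per_load, data_bytes, bytes_per_load=64):
    """Return (loads, processing cycles, initialization share of total time)."""
    loads = -(-data_bytes // bytes_per_load)       # four 16-byte blocks per load
    processing = loads * cycles_per_load
    return loads, processing, init_cycles / (init_cycles + processing)


enc = init_fraction(9249, 601, 64 * 1024)
dec = init_fraction(19807, 664, 64 * 1024)
print(enc[0], f"{enc[2]:.1%}", f"{dec[2]:.1%}")
```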
On Aug 8, 2001, Amphion Semiconductor Ltd. [20] announced its application-
specific cores for AES applications. The performance of its CS 5210-5280 Family
(standard series) ASIC cores and programmable logic cores is shown in Table 3.5, 3.6
and 3.7. The ASIC cores are about 240% to 270% faster than the MorphoSys
implementation, and the programmable logic cores are also about 30% to 60% faster. But
several other issues should be considered when we compare their performance. First,
encryption and decryption need different Amphion cores; second, the initialization time
in Amphion cores is unknown (though this is usually not important); third, MorphoSys is
not just an ASIC or FPGA, and is capable of doing many other applications efficiently
with the same architecture.
Table 3.5: AES by Amphion ASIC Cores using TSMC 0.18µm Technology
                           Encryption                                    Decryption
Key Size    Logic Gates    Timing Constraints (MHz)  Throughput (Mb/s)   Timing Constraints (MHz)  Throughput (Mb/s)
128         18.2K          200                       581                 200                       581
192         18.2K          200                       492                 200                       492
256         18.2K          200                       426                 200                       426
Table 3.6: AES by Amphion Programmable Logic Cores using Altera APEX20KE-1
                                                     Encryption                             Decryption
Key Size    Logic Used (LE)*    Memory Used (ESB)    Clock Speed (MHz)  Throughput (Mb/s)   Clock Speed (MHz)  Throughput (Mb/s)
128         1452/1560           8                    77.8               226                 74.1               215
192         1452/1560           8                    77.8               191                 74.1               182
256         1452/1560           8                    77.8               166                 74.1               158
* encryption/decryption
Table 3.7: AES by Amphion Programmable Logic Cores using Xilinx VirtexE-8
                                                       Encryption                             Decryption
Key Size    Logic Used (LUT)*    Memory Used (BRAM)    Clock Speed (MHz)  Throughput (Mb/s)   Clock Speed (MHz)  Throughput (Mb/s)
128         1008/1092            4                     92.3               268                 86.7               254
192         1008/1092            8                     92.3               227                 86.7               213
256         1008/1092            8                     92.3               196                 86.7               184
* encryption/decryption
Figure 3.16 compares the data processing throughputs of C/C++, MorphoSys,
Amphion ASIC core, and Amphion FPGA cores implementation for encryption at Key
size = 128 bits. The throughput of MorphoSys implementation is close to the throughput
of Amphion Altera core implementation.
Figure 3.16: Throughputs of Different Implementations (Mb/s): ANSI C 27, C++ 70.5, MorphoSys 170.4, ASIC Core 581, Altera Core 226, Xilinx Core 268
3.5 Conclusions
The performance of the AES implementation in MorphoSys is satisfactory. The
throughput is more than 100Mb/s, which is usually adequate for applications on mobile
phones and PDAs. If in an application the throughput requirement is very stringent and
cannot be met by a single MorphoSys, one can consider a larger-scale parallel computing
system consisting of several identical MorphoSys cores. Since there is no data
dependency among blocks, the “scaling up” is theoretically unlimited and will not
introduce any performance degradation that otherwise would exist if there were inter-
block data communications. Of course, in the real implementation, the MorphoSys chip
usually does not run the AES algorithm alone. It might be uneconomical if we increase
the number of MorphoSys cores just for the AES requirement.
Another possible approach to improve the performance is to include some
programmable logic block in MorphoSys, such as PLD/CPLD, to handle logic functions
and bit-level operations. But there might be a tradeoff between the flexibility and the
speed. Actually it is a research topic in the MorphoSys group.
Bibliography
[1] M. H. Lee, H. Singh, G. Lu, N. Bagherzadeh, F. J. Kurdahi, “Design and Implementation of the MorphoSys Reconfigurable Computing Processor,” Journal of VLSI Signal Processing Systems, vol. 24, pp. 164-172, March 2000
[2] H. Singh, M. H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, T. Lang, R. Heaton, and E. M. C. Filho, “MorphoSys: An Integrated Re-configurable Architecture,” NATO Symposium on Concepts and Integration, April 1998
[3] S. Brown and J. Rose, “Architecture of FPGAs and CPLDs: A Tutorial,” IEEE Design and Test of Computers, Vol. 13, No. 2, pp. 42-57, 1996
[4] G. Lu, “Modeling, Implementation and Scalability of the MorphoSys Dynamically Reconfigurable Computing Architecture,” Ph.D. Dissertation, 2000
[5] M. H. Lee, “Design and Implementation of the High-Performance Low-Power MorphoSys,” Ph.D. Dissertation, 2000
[6] A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor and N. Bagherzadeh, “Design and implementation of TinyRISC microprocessor,” Microprocessors and Microsystems, Vol. 16, No. 4, pp. 187-94, 1992
[7] http://csrc.nist.gov/encryption/aes/
[8] http://csrc.nist.gov/publications/drafts/dfips-AES.pdf
[9] F. Koeune, J.-J. Quisquater, “A timing attack against Rijndael,” Technical Report CG-1999/1, UCL Crypto Group, Louvain-la-Neuve, 1999.
[10] E. Biham, A. Shamir, “Power Analysis of the Key Scheduling of the AES Candidates,” Proceedings of the Second Advanced Encryption Standard (AES) Candidate Conference, 1999.
[11] R. Lidl, H. Niederreiter, Introduction to finite fields and their applications, Cambridge University Press, 1986
[12] P. Barreto, V. Rijmen, Rijndael ANSI C Reference Code, downloadable at http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaelref.zip
[13] http://fp.gladman.plus.com/cryptography_technology/index.htm
[14] http://www.cosy.sbg.ac.at/~gwesp/sw/rijndael-1.0.tar.gz
[15] http://www.webappcabaret.com/cass/security
[16] http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaelvb.zip
[17] http://www.cpan.org/authors/id/D/DI/DIDO/Crypt-Rijndael-0.04.tar.gz
[18] http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndael-80186.tar.gz
[19] http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaelada.zip
[20] http://www.amphion.com
Appendix A
Constant Tables Used in AES
A.1 Lookup Table “S-box”
S-box is a 256-byte table used by the function SubBytes( ) in the Key Expansion
and the Cipher.
63 7C 77 7B F2 6B 6F C5 30 01 67 2B FE D7 AB 76
CA 82 C9 7D FA 59 47 F0 AD D4 A2 AF 9C A4 72 C0
B7 FD 93 26 36 3F F7 CC 34 A5 E5 F1 71 D8 31 15
04 C7 23 C3 18 96 05 9A 07 12 80 E2 EB 27 B2 75
09 83 2C 1A 1B 6E 5A A0 52 3B D6 B3 29 E3 2F 84
53 D1 00 ED 20 FC B1 5B 6A CB BE 39 4A 4C 58 CF
D0 EF AA FB 43 4D 33 85 45 F9 02 7F 50 3C 9F A8
51 A3 40 8F 92 9D 38 F5 BC B6 DA 21 10 FF F3 D2
CD 0C 13 EC 5F 97 44 17 C4 A7 7E 3D 64 5D 19 73
60 81 4F DC 22 2A 90 88 46 EE B8 14 DE 5E 0B DB
E0 32 3A 0A 49 06 24 5C C2 D3 AC 62 91 95 E4 79
E7 C8 37 6D 8D D5 4E A9 6C 56 F4 EA 65 7A AE 08
BA 78 25 2E 1C A6 B4 C6 E8 DD 74 1F 4B BD 8B 8A
70 3E B5 66 48 03 F6 0E 61 35 57 B9 86 C1 1D 9E
E1 F8 98 11 69 D9 8E 94 9B 1E 87 E9 CE 55 28 DF
8C A1 89 0D BF E6 42 68 41 99 2D 0F B0 54 BB 16
A.2 Lookup Table “Inv S-box”
Inv S-box is a 256-byte table used by the function InvSubBytes( ) in the Inverse
Cipher.
52 09 6A D5 30 36 A5 38 BF 40 A3 9E 81 F3 D7 FB
7C E3 39 82 9B 2F FF 87 34 8E 43 44 C4 DE E9 CB
54 7B 94 32 A6 C2 23 3D EE 4C 95 0B 42 FA C3 4E
08 2E A1 66 28 D9 24 B2 76 5B A2 49 6D 8B D1 25
72 F8 F6 64 86 68 98 16 D4 A4 5C CC 5D 65 B6 92
6C 70 48 50 FD ED B9 DA 5E 15 46 57 A7 8D 9D 84
90 D8 AB 00 8C BC D3 0A F7 E4 58 05 B8 B3 45 06
D0 2C 1E 8F CA 3F 0F 02 C1 AF BD 03 01 13 8A 6B
3A 91 11 41 4F 67 DC EA 97 F2 CF CE F0 B4 E6 73
96 AC 74 22 E7 AD 35 85 E2 F9 37 E8 1C 75 DF 6E
47 F1 1A 71 1D 29 C5 89 6F B7 62 0E AA 18 BE 1B
FC 56 3E 4B C6 D2 79 20 9A DB C0 FE 78 CD 5A F4
1F DD A8 33 88 07 C7 31 B1 12 10 59 27 80 EC 5F
60 51 7F A9 19 B5 4A 0D 2D E5 7A 9F 93 C9 9C EF
A0 E0 3B 4D AE 2A F5 B0 C8 EB BB 3C 83 53 99 61
17 2B 04 7E BA 77 D6 26 E1 69 14 63 55 21 0C 7D
A.3 Lookup Table “xtime”
xtime is a 256-byte table used both in the Cipher and Inverse Cipher to compute
the multiplication by x in GF($2^8$).
00 02 04 06 08 0A 0C 0E 10 12 14 16 18 1A 1C 1E
20 22 24 26 28 2A 2C 2E 30 32 34 36 38 3A 3C 3E
40 42 44 46 48 4A 4C 4E 50 52 54 56 58 5A 5C 5E
60 62 64 66 68 6A 6C 6E 70 72 74 76 78 7A 7C 7E
80 82 84 86 88 8A 8C 8E 90 92 94 96 98 9A 9C 9E
A0 A2 A4 A6 A8 AA AC AE B0 B2 B4 B6 B8 BA BC BE
C0 C2 C4 C6 C8 CA CC CE D0 D2 D4 D6 D8 DA DC DE
E0 E2 E4 E6 E8 EA EC EE F0 F2 F4 F6 F8 FA FC FE
1B 19 1F 1D 13 11 17 15 0B 09 0F 0D 03 01 07 05
3B 39 3F 3D 33 31 37 35 2B 29 2F 2D 23 21 27 25
5B 59 5F 5D 53 51 57 55 4B 49 4F 4D 43 41 47 45
7B 79 7F 7D 73 71 77 75 6B 69 6F 6D 63 61 67 65
9B 99 9F 9D 93 91 97 95 8B 89 8F 8D 83 81 87 85
BB B9 BF BD B3 B1 B7 B5 AB A9 AF AD A3 A1 A7 A5
DB D9 DF DD D3 D1 D7 D5 CB C9 CF CD C3 C1 C7 C5
FB F9 FF FD F3 F1 F7 F5 EB E9 EF ED E3 E1 E7 E5
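The xtime table does not need to be typed in by hand; it can be regenerated from its definition. A minimal sketch, assuming the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B):

```python
def xtime(a):
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


XTIME = [xtime(i) for i in range(256)]

# Print the table in the same 16-per-row layout as above.
for row in range(16):
    print(" ".join(f"{v:02X}" for v in XTIME[16 * row:16 * row + 16]))
```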
A.4 Lookup Table “Log”
Log is a 256-byte table used in the Key Expansion (only for the Inverse Cipher) to
compute the multiplication in GF($2^8$).
00 00 19 01 32 02 1A C6 4B C7 1B 68 33 EE DF 03
64 04 E0 0E 34 8D 81 EF 4C 71 08 C8 F8 69 1C C1
7D C2 1D B5 F9 B9 27 6A 4D E4 A6 72 9A C9 09 78
65 2F 8A 05 21 0F E1 24 12 F0 82 45 35 93 DA 8E
96 8F DB BD 36 D0 CE 94 13 5C D2 F1 40 46 83 38
66 DD FD 30 BF 06 8B 62 B3 25 E2 98 22 88 91 10
7E 6E 48 C3 A3 B6 1E 42 3A 6B 28 54 FA 85 3D BA
2B 79 0A 15 9B 9F 5E CA 4E D4 AC E5 F3 73 A7 57
AF 58 A8 50 F4 EA D6 74 4F AE E9 D5 E7 E6 AD E8
2C D7 75 7A EB 16 0B F5 59 CB 5F B0 9C A9 51 A0
7F 0C F6 6F 17 C4 49 EC D8 43 1F 2D A4 76 7B B7
CC BB 3E 5A FB 60 B1 86 3B 52 A1 6C AA 55 29 9D
97 B2 87 90 61 BE DC FC BC 95 CF CD 37 3F 5B D1
53 39 84 3C 41 A2 6D 47 14 2A 9E 5D 56 F2 D3 AB
44 11 92 D9 23 20 2E 89 B4 7C B8 26 77 99 E3 A5
67 4A ED DE C5 31 FE 18 0D 63 8C 80 C0 F7 70 07
A.4 Lookup Table “Alog”
Alog is a 256-byte table used in the Key Expansion (only for the Inverse Cipher)
to compute the multiplication in GF($2^8$).
01 03 05 0F 11 33 55 FF 1A 2E 72 96 A1 F8 13 35
5F E1 38 48 D8 73 95 A4 F7 02 06 0A 1E 22 66 AA
E5 34 5C E4 37 59 EB 26 6A BE D9 70 90 AB E6 31
53 F5 04 0C 14 3C 44 CC 4F D1 68 B8 D3 6E B2 CD
4C D4 67 A9 E0 3B 4D D7 62 A6 F1 08 18 28 78 88
83 9E B9 D0 6B BD DC 7F 81 98 B3 CE 49 DB 76 9A
B5 C4 57 F9 10 30 50 F0 0B 1D 27 69 BB D6 61 A3
FE 19 2B 7D 87 92 AD EC 2F 71 93 AE E9 20 60 A0
FB 16 3A 4E D2 6D B7 C2 5D E7 32 56 FA 15 3F 41
C3 5E E2 3D 47 C9 40 C0 5B ED 2C 74 9C BF DA 75
9F BA D5 64 AC EF 2A 7E 82 9D BC DF 7A 8E 89 80
9B B6 C1 58 E8 23 65 AF EA 25 6F B1 C8 43 C5 54
FC 1F 21 63 A5 F4 07 09 1B 2D 77 99 B0 CB 46 CA
45 CF 4A DE 79 8B 86 91 A8 E3 3E 42 C6 51 F3 0E
12 36 5A EE 29 7B 8D 8C 8F 8A 85 94 A7 F2 0D 17
39 4B DD 7C 84 97 A2 FD 1C 24 6C B4 C7 52 F6 01
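The Log and Alog tables turn a GF($2^8$) multiplication into an addition of logarithms. The sketch below rebuilds both tables from the generator 0x03, which is an inference from the first Alog entries (01 03 05 0F ...), and multiplies two bytes with them; the function names are hypothetical.

```python
def xtime(a):
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def build_log_tables():
    alog = [0] * 256
    log = [0] * 256                    # log[0] is a placeholder; 0 has no logarithm
    x = 1
    for i in range(255):
        alog[i] = x
        log[x] = i
        x ^= xtime(x)                  # multiply by the generator 0x03 = x + 1
    alog[255] = alog[0]                # the powers repeat with period 255
    return log, alog


LOG, ALOG = build_log_tables()


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return ALOG[(LOG[a] + LOG[b]) % 255]


print(hex(gf_mul(0x57, 0x83)))
```

With these tables, each multiplication in InvMixColumns( ) during the Key Expansion for the Inverse Cipher needs two Log lookups, one addition, and one Alog lookup.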
A.5 Table “Rcon”
Rcon is a 30-byte table used in the Key Expansion.
01 02 04 08 10 20 40 80 1b 36 6c d8 ab 4d 9a
2f 5e bc 63 c6 97 35 6a d4 b3 7d fa ef c5 91
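The Rcon entries are successive powers of x in GF($2^8$), so the table can be regenerated by repeated xtime. A minimal sketch with illustrative names:

```python
def xtime(a):
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def rcon_table(n=30):
    """Return the first n round constants: 1, x, x^2, ... in GF(2^8)."""
    out, value = [], 0x01
    for _ in range(n):
        out.append(value)
        value = xtime(value)
    return out


RCON = rcon_table()
print(" ".join(f"{v:02x}" for v in RCON[:15]))
print(" ".join(f"{v:02x}" for v in RCON[15:]))
```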
Appendix B
MorphoSys TinyRISC ISA
B.1 Instruction Format
In the TinyRISC instruction set architecture (ISA), the instructions assume one of
the two formats shown below:
Bits:    31-25    24     23-20     19-16     15-12    11-0
Field:   OpCode   Immb   SrcReg1   SrcReg2   DstReg   Unused

Bits:    31-25    24     23-20     19-16     15-12    11-0
Field:   OpCode   Immb   SrcReg1   DstReg    Immediate (spans both columns)
• OpCode: the 7-bit instruction opcode.
• Immb: the immediate bit. If Immb = 0, the second operand is stored in a data
register file. If Immb = 1, the second operand is a 16-bit immediate value
extended to 32 bits.
• SrcReg1: the register id of the first operand.
• DstReg: the id of the destination register.
• SrcReg2: the register id of the second operand.
• Immediate: the 16-bit immediate value (if Immb = 1).
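The two formats can be illustrated with a small encoder. This is only a sketch: it assumes that the 16-bit immediate occupies bits 15-0 of the second format, and the helper names are hypothetical. The opcode 0000100 used in the example is the ADD/ADDI opcode listed in Section B.2.1.

```python
def encode_register_form(opcode, sr1, sr2, dr):
    """Immb = 0: OpCode | 0 | SrcReg1 | SrcReg2 | DstReg | unused."""
    assert 0 <= opcode < 128 and all(0 <= r < 16 for r in (sr1, sr2, dr))
    return (opcode << 25) | (sr1 << 20) | (sr2 << 16) | (dr << 12)


def encode_immediate_form(opcode, sr1, dr, imm):
    """Immb = 1: OpCode | 1 | SrcReg1 | DstReg | 16-bit immediate (assumed bits 15-0)."""
    assert 0 <= opcode < 128 and all(0 <= r < 16 for r in (sr1, dr))
    return (opcode << 25) | (1 << 24) | (sr1 << 20) | (dr << 16) | (imm & 0xFFFF)


ADD_OPCODE = 0b0000100
print(f"0x{encode_register_form(ADD_OPCODE, 1, 2, 3):08x}")        # ADD  r3, r1, r2
print(f"0x{encode_immediate_form(ADD_OPCODE, 1, 3, 0x64):08x}")    # ADDI r3, r1, 0x64
```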
B.2 Instruction Codes
The following subsections describe the instructions in each category: arithmetic,
logical, shift, comparison, load immediate, memory access, control transfer, and
MorphoSys instructions.
B.2.1 Arithmetic Instructions
ADD DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000100 0 sr1 sr2 dr unused
Description: This instruction adds the two unsigned values in registers sr1 and
sr2 and writes the result into register dr.
ADDI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000100 1 sr1 dr imm
Description: This instruction adds the unsigned value in register sr1 to the zero-
extended imm value and writes the result into register dr.
SUB DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000101 0 sr1 sr2 dr unused
Description: This instruction subtracts the unsigned value in register sr2 from
the unsigned value in register sr1 and writes the result into register dr.
SUBI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000101 1 sr1 dr imm
Description: This instruction subtracts the zero-extended imm value from the
unsigned value in register sr1 and writes the result into register dr.
B.2.2 Logical Instructions
AND DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000000 0 sr1 sr2 dr unused
Description: This instruction performs a bit-wise AND of the values in registers
sr1 and sr2 and writes the result into register dr.
ANDI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000000 1 sr1 dr imm
Description: This instruction performs a bit-wise AND of the value in register
sr1 and the zero-extended imm value and writes the result into register dr.
OR DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000001 0 sr1 sr2 dr unused
Description: This instruction performs a bit-wise OR of the values in registers
sr1 and sr2 and writes the result into register dr.
ORI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000001 1 sr1 dr imm
Description: This instruction performs a bit-wise OR of the value in register sr1
and the zero-extended imm value and writes the result into register dr.
XOR DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000010 0 sr1 sr2 dr unused
Description: This instruction performs a bit-wise exclusive-OR of the values in
registers sr1 and sr2 and writes the result into register dr.
XORI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000010 1 sr1 dr imm
Description: This instruction performs a bit-wise exclusive-OR of the value in
register sr1 and the zero-extended imm value and writes the result into register
dr.
XNOR DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000011 0 sr1 sr2 dr unused
Description: This instruction performs a bit-wise exclusive-NOR of the values
in registers sr1 and sr2 and writes the result into register dr.
XNORI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0000011 1 sr1 dr imm
Description: This instruction performs a bit-wise exclusive-NOR of the value
in register sr1 and the zero-extended imm value and writes the result into
register dr.
B.2.3 Shift Instructions
LSL DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010001 0 sr1 sr2 dr unused
Description: This instruction shifts to the left the contents of sr1 by the amount
indicated in sr2, inserting zeros on the right. The result is written into register
dr.
LSLI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010001 1 sr1 dr imm
Description: This instruction shifts to the left the contents of sr1 by the amount
indicated in imm, inserting zeros on the right. The result is written into register
dr.
LSR DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010010 0 sr1 sr2 dr unused
Description: This instruction shifts to the right the contents of sr1 by the
amount indicated in sr2, inserting zeros on the left. The result is written into
register dr.
LSRI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010010 1 sr1 dr imm
Description: This instruction shifts to the right the contents of sr1 by the
amount indicated in imm, inserting zeros on the left. The result is written into
register dr.
ASR DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010011 0 sr1 sr2 dr unused
Description: This instruction shifts to the right the contents of sr1 by the
amount indicated in sr2, replicating the most significant bit. The result is
written into register dr.
ASRI DstReg, SrcReg1, UnsImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010011 1 sr1 dr imm
Description: This instruction shifts the contents of sr1 to the right by the
amount indicated in imm, replicating the most significant bit. The result is
written into register dr.
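The difference between the logical and arithmetic shifts can be modelled on 32-bit values as follows. This is a behavioural sketch, not the hardware definition; it assumes shift amounts in the range 0 to 31, since the text does not say what happens for larger amounts, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def lsl(value: int, amount: int) -> int:
    """Logical shift left: zeros enter on the right."""
    return (value << amount) & MASK32


def lsr(value: int, amount: int) -> int:
    """Logical shift right: zeros enter on the left."""
    return (value & MASK32) >> amount


def asr(value: int, amount: int) -> int:
    """Arithmetic shift right: the most significant bit is replicated."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret as a negative number
    return (value >> amount) & MASK32


print(hex(lsr(0x80000000, 4)), hex(asr(0x80000000, 4)))
```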
B.2.4 Comparison Instructions
SLT DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001000 0 sr1 sr2 dr unused
Description: This instruction signed compares the values in registers sr1 and
sr2 and writes the value 0x00000001 into dr if [sr1] < [sr2] or the value
0x00000000 otherwise.
SLTI DstReg, SrcReg1, SignImm
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001000 1 sr1 dr imm
Description: This instruction signed compares the value in register sr1 and the
sign-extended value imm. It writes the value 0x00000001 into dr if [sr1] <
[imm] or the value 0x00000000 otherwise.
SLTU DstReg, SrcReg1, SrcReg2
31 - 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001001 0 sr1 sr2 dr unused
Description: This instruction unsigned compares the values in registers sr1 and
sr2 and writes the value 0x00000001 into dr if [sr1] < [sr2] or the value
0x00000000 otherwise.
SLTUI DstReg, SrcReg1, UnsImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001001 1 sr1 dr imm
Description: This instruction unsigned compares the value in register sr1 and
the zero-extended value imm. It writes the value 0x00000001 into dr if [sr1] <
[imm] or the value 0x00000000 otherwise.
SGE DstReg, SrcReg1, SrcReg2
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001010 0 sr1 sr2 dr unused
Description: This instruction signed compares the values in registers sr1 and
sr2 and writes the value 0x00000001 into dr if [sr1] ≥ [sr2] or the value
0x00000000 otherwise.
SGEI DstReg, SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001010 1 sr1 dr imm
Description: This instruction signed compares the value in register sr1 and the
sign-extended value imm. It writes the value 0x00000001 into dr if [sr1] ≥
[imm] or the value 0x00000000 otherwise.
SGEU DstReg, SrcReg1, SrcReg2
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001011 0 sr1 sr2 dr unused
Description: This instruction unsigned compares the values in registers sr1 and
sr2 and writes the value 0x00000001 into dr if [sr1] ≥ [sr2] or the value
0x00000000 otherwise.
SGEUI DstReg, SrcReg1, UnsImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001011 1 sr1 dr imm
Description: This instruction unsigned compares the value in register sr1 and
the zero-extended value imm. It writes the value 0x00000001 into dr if [sr1] ≥
[imm] or the value 0x00000000 otherwise.
SEQ DstReg, SrcReg1, SrcReg2
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001100 0 sr1 sr2 dr unused
Description: This instruction signed compares the values in registers sr1 and
sr2 and writes the value 0x00000001 into dr if [sr1] = [sr2] or the value
0x00000000 otherwise.
SEQI DstReg, SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0001100 1 sr1 dr imm
Description: This instruction signed compares the value in register sr1 and the
sign-extended value imm. It writes the value 0x00000001 into dr if [sr1] =
[imm] or the value 0x00000000 otherwise.
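The signed and unsigned variants differ only in how the same 32-bit pattern is interpreted, and the immediate variants differ in how the 16-bit field is widened. A small Python model of SLT, SLTU, SLTI and SLTUI illustrates this; the function names are hypothetical and registers are represented as plain integers.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def sign_extend16(imm: int) -> int:
    """Widen a 16-bit field to 32 bits by copying bit 15 upward."""
    imm &= 0xFFFF
    return imm | 0xFFFF0000 if imm & 0x8000 else imm


def zero_extend16(imm: int) -> int:
    """Widen a 16-bit field to 32 bits with zeros."""
    return imm & 0xFFFF


def slt(a: int, b: int) -> int:
    return int(to_signed32(a) < to_signed32(b))


def sltu(a: int, b: int) -> int:
    return int((a & MASK32) < (b & MASK32))


def slti(a: int, imm: int) -> int:
    return slt(a, sign_extend16(imm))


def sltui(a: int, imm: int) -> int:
    return sltu(a, zero_extend16(imm))


# 0xFFFFFFFF is -1 when signed but the largest value when unsigned.
print(slt(0xFFFFFFFF, 1), sltu(0xFFFFFFFF, 1))
```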
B.2.5 Load-Immediate Instructions
LDLI DstReg, UnsImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0011100 1 unused dr imm
Description: This instruction loads the immediate value into the lower 16 bits
of the dr register, zeroing the upper 16 bits.
LDUI DstReg, UnsImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0011101 1 unused dr imm
Description: This instruction loads the immediate value into the upper 16 bits
of the dr register, zeroing the lower 16 bits.
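Because each load-immediate instruction clears the other half of the register, a full 32-bit constant needs two steps. One possible sequence, sketched here in Python as an assumption rather than a documented idiom, is LDUI for the upper half followed by ORI for the lower half (ORI zero-extends its immediate, so the upper half is preserved).

```python
def ldli(imm: int) -> int:
    """LDLI: immediate into the lower 16 bits, upper 16 bits zeroed."""
    return imm & 0xFFFF


def ldui(imm: int) -> int:
    """LDUI: immediate into the upper 16 bits, lower 16 bits zeroed."""
    return (imm & 0xFFFF) << 16


def ori(value: int, imm: int) -> int:
    """ORI: bit-wise OR with the zero-extended immediate."""
    return (value | (imm & 0xFFFF)) & 0xFFFFFFFF


def load_const32(constant: int) -> int:
    """Build a 32-bit constant as LDUI followed by ORI."""
    reg = ldui(constant >> 16)
    return ori(reg, constant & 0xFFFF)


print(hex(load_const32(0x12345678)))
```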
B.2.6 Memory Access Instructions
LDW DstReg, SrcReg1
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010100 0 sr1 unused dr unused
Description: This instruction loads into register dr the value from the memory
location whose address is in register sr1.
STW SrcReg1, SrcReg2
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0010101 0 sr1 sr2 unused unused
Description: This instruction stores the value in register sr2 into the memory
location whose address is in register sr1.
B.2.7 Control Transfer Instructions
BRT SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0011011 1 sr1 unused imm
Description: This instruction tests the value in register sr1 and jumps if it has
the value 0x00000001 with a one-instruction delay. The address of the target
instruction is calculated by adding the sign-extended imm offset to the
instruction's address.
BRF SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0011010 1 sr1 unused imm
Description: This instruction tests the value in register sr1 and jumps if it has
the value 0x00000000 with a one-instruction delay. The address of the target
instruction is calculated by adding the sign-extended imm offset to the
instruction's address.
BRLT SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0100000 1 sr1 dr imm
Description: This instruction signed compares the values in registers sr1 and dr
and jumps if [sr1] < [dr] with a one-instruction delay. The address of the target
instruction is calculated by adding the sign-extended imm offset to the
instruction's address.
BRLE SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0100001 1 sr1 dr imm
Description: This instruction signed compares the values in registers sr1 and dr
and jumps if [sr1] ≤ [dr] with a one-instruction delay. The address of the target
instruction is calculated by adding the sign-extended imm offset to the
instruction's address.
BREQ SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0100010 1 sr1 dr imm
Description: This instruction unsigned compares the values in registers sr1 and
dr and jumps if [sr1] = [dr] with a one-instruction delay. The address of the
target instruction is calculated by adding the sign-extended imm offset to the
instruction's address.
BRNE SrcReg1, SignImm
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0100011 1 sr1 dr imm
Description: This instruction unsigned compares the values in registers sr1 and
dr and jumps if [sr1] ≠ [dr] with a one-instruction delay. The address of the
target instruction is calculated by adding the sign-extended imm offset to the
instruction's address.
JAL DstReg, SrcReg1
31 – 25 24 23 - 20 19 - 16 15 - 12 11 - 0
0011000 0 sr1 unused dr unused
Description: This instruction unconditionally jumps with a one-instruction
delay to the target address in register sr1. The instruction's address plus 2 is
saved into register dr.
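The target calculation and the one-instruction delay can be illustrated with a short Python model. It assumes that addresses count instructions, so the delay slot sits at the branch address plus 1; this is an assumption, chosen to be consistent with JAL saving the instruction's address plus 2. The function names are illustrative.

```python
def signed16(imm: int) -> int:
    """Interpret a 16-bit field as a two's-complement offset."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm: int) -> int:
    """Target = address of the branch + sign-extended offset."""
    return (pc + signed16(imm)) & 0xFFFFFFFF


def jal_link(pc: int) -> int:
    """Return address saved by JAL: the instruction after the delay slot."""
    return (pc + 2) & 0xFFFFFFFF


def execution_order(pc: int, imm: int, taken: bool) -> list[int]:
    """Addresses executed: branch, delay slot, then target or fall-through."""
    next_pc = branch_target(pc, imm) if taken else pc + 2
    return [pc, pc + 1, next_pc]


print([hex(a) for a in execution_order(0x20, 0xFFF0, taken=True)])
```

The delay slot executes whether or not the branch is taken, which is why the listings in Appendix D place a nop after each branch.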
B.2.8 MorphoSys Instructions
LDCTXT SrcReg1, r/c#, r/c, context#, #contexts to be loaded
31-26 25 24 23-20 19 18-16 15 14-11 10-8 7-0
100000 - - SrcReg1 - r/c# r/c context# --- # contexts to be loaded
• SrcReg1: The starting address of external memory where the context
configuration is stored (32-bit address).
• r/c #: Used to control the starting cell in the Context memory. (0-7 in the
horizontal direction ).
• r/c: Select the column context or row context (0 - column, 1 - row).
• context #: Starting context (0-15).
• # contexts to be loaded: Specifies how many contexts are loaded through
DMA.
Description: This instruction is used to load context words into the Context
Memory. When this instruction is issued, TinyRISC provides appropriate
control signals to the DMAC. Based on these signals, the DMAC performs the
loading of configuration data from external memory to the Context Memory.
Note: While loading contexts, the DMAC increments r/c# first and then
context#.
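The field positions and the fill order described in the note can be sketched as follows. The names are hypothetical, and the wrap from r/c# 7 back to 0 with an increment of context# is an assumption drawn from the note rather than a documented hardware detail.

```python
def encode_ldctxt(src_reg: int, rc_num: int, rc: int,
                  context: int, count: int) -> int:
    """Pack LDCTXT fields; unused bits (25, 24, 19, 10-8) are left zero."""
    return ((0b100000 << 26) | ((src_reg & 0xF) << 20)
            | ((rc_num & 0x7) << 16) | ((rc & 0x1) << 15)
            | ((context & 0xF) << 11) | (count & 0xFF))


def ldctxt_slots(rc_num: int, context: int, count: int):
    """Yield (context#, r/c#) pairs in fill order: r/c# advances first."""
    for _ in range(count):
        yield context, rc_num
        rc_num += 1
        if rc_num == 8:           # assumed wrap after the last row/column
            rc_num = 0
            context += 1


# "ldctxt $1, 0, 0, 0, 128" as used in the programs of Appendix D
print(hex(encode_ldctxt(1, 0, 0, 0, 128)))
slots = list(ldctxt_slots(0, 0, 128))
print(slots[0], slots[8], slots[-1])
```

Under these assumptions, 128 context words starting at r/c# 0 and context# 0 fill all 8 x 16 entries of one context block.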
LDFB SrcReg1, bank, set#, #words
31-26 25 24 23-20 19-11 10 9 8-0
100010 - - SrcReg1 -------- Bank Set# #words to load
• SrcReg1: The starting address of external memory where the data is
stored (32-bit address).
• Bank: Specifies which bank of the Frame Buffer is used. 0 = bank A, 1 = bank B.
• Set #: Specifies which set of the Frame Buffer is used (set number 0 or 1).
• # of words to load: Specifies how many 32-bit words are loaded into the
Frame Buffer.
Description: This instruction is used to load image or application data into the
Frame Buffer for subsequent use by the RC Array. When this instruction is
issued, TinyRISC initiates operation of the DMAC to perform the transfer of
data from external memory to the Frame Buffer.
Note: A load to or store from the Frame Buffer always starts at the
beginning of the specified bank. This differs from the Context Memory,
where loading can start at any location. One bank has 64 rows, and each
row holds 2 words (64 bits).
STFB SrcReg1, bank, set#, #words
31-26 25 24 23-20 19-11 10 9 8-0
100011 - - SrcReg1 -------- Bank Set# #words to store
• SrcReg1: Provides the starting address in main memory where the data
should be stored.
• Bank: Specifies which bank of the Frame Buffer is used. 0 = bank A, 1 = bank B.
• Set #: Specifies which set of the Frame Buffer is used (set number 0 or 1).
• # of words to store: Specifies how many 32-bit words are stored from the
Frame Buffer to the main memory.
Description: This instruction is used to transfer the processed image or
application data from the Frame Buffer back to the external memory through the
DMAC.
SBCB b_all, b_row_col, r/c, context#, bank, set#, bank_addr
31-26 20 18-16 15 14-11 10 9 8-0
110100 b_all b_row_col r/c context # bank set# bank_addr
• b_all: Specifies whether the entire RC Array (8 x 8) is activated or only one
row or column of the RC Array is activated. 1 = all of the RCs are activated. 0 =
only one row or column of the RC Array is activated.
• b_row_col: If b_all = 0, then this field specifies which row or column of
the RC Array is activated.
• r/c: Specifies the context broadcast mode. 1 = row context broadcast. 0 =
column context broadcast.
• context #: Specifies which context (in Context Memory) to be executed.
• bank: Specifies which Frame Buffer bank to be accessed.
• set #: Specifies which set of the Frame Buffer.
• bank_addr: Provides the Frame Buffer address.
Description: When this instruction is issued, the TinyRISC provides an address
that enables the RC Array to access eight bytes of (single-operand) data from
the Frame Buffer. The RC Array concurrently executes the context word
specified in the instruction.
Note: Since each bank in the Frame Buffer has a capacity of 64 x 8 bytes, 6
address bits specify the row and the other 3 bits specify the starting word
in that row. An important feature of the Frame Buffer is that it always
fetches eight consecutive bytes of data, even when the data wraps around to
the next row.
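The addressing described in the note can be modelled with a flat 512-byte bank. This sketch treats the low 3 address bits as a byte offset within the row, and wraps from the last row back to the first; both are assumptions, since the text only states that a fetch may continue into the next row. The names are illustrative.

```python
ROWS = 64        # rows per bank
ROW_BYTES = 8    # bytes per row (64 bits)


def split_bank_addr(addr: int) -> tuple[int, int]:
    """Split a 9-bit bank address into (row, offset within the row)."""
    return (addr >> 3) & 0x3F, addr & 0x7


def fetch8(bank: bytes, addr: int) -> bytes:
    """Return eight consecutive bytes starting at addr.

    The fetch may cross into the next row. Wrapping at the end of
    the bank is an assumption made for this sketch.
    """
    size = ROWS * ROW_BYTES
    return bytes(bank[(addr + i) % size] for i in range(8))


bank = bytes(i % 256 for i in range(ROWS * ROW_BYTES))
print(split_bank_addr(14), fetch8(bank, 14).hex())
```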
DBCBC SrcReg1, bank_B_addr_base, b_all, r/c#, context#, set#, bank_A_addr
31-26 25 24 23-20 19-16 15-12 11-9 8-0
111100 set b_all SrcReg1 base_bankB context# r/c# bank_A_addr
• set: Specifies Frame Buffer set 0 or set 1.
• b_all: Same as SBCB.
• SrcReg1: Specifies the register of the TinyRISC that provides the lower 5
address bits for the bank B of the Frame Buffer.
• base_bankB: This field directly provides the base address for bank B of the
Frame Buffer. These 4 bits, along with the 5 bits from SrcReg1, provide the
complete Bank B address (9 bits).
• context #: Same as SBCB.
• r/c #: If b_all = 0, this specifies which column of the RC Array is activated.
• bank_A_addr: These nine bits specify the location of data to be loaded from
bank A of the Frame Buffer.
Description: This instruction refers to double bank access of Frame Buffer
with column-wise context broadcast. When this instruction is issued, the
Frame Buffer provides eight sets of two-operand data to the RC Array. Each
RC gets two bytes of data, where one byte is from bank A and the other is from
bank B.
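The 9-bit bank B address is assembled from the 4-bit base_bankB field and the lower 5 bits of the TinyRISC register, and each active RC then receives one byte from each bank. A minimal Python sketch follows; the function names are hypothetical, and the modulo wrap at the end of a bank is an assumption.

```python
def bank_b_addr(base_bank_b: int, reg_value: int) -> int:
    """Combine the 4-bit base (upper bits) with the register's lower 5 bits."""
    return ((base_bank_b & 0xF) << 5) | (reg_value & 0x1F)


def operand_pairs(bank_a: bytes, bank_b: bytes,
                  addr_a: int, addr_b: int) -> list[tuple[int, int]]:
    """Return eight (bank A byte, bank B byte) pairs, one per RC."""
    return [(bank_a[(addr_a + i) % len(bank_a)],
             bank_b[(addr_b + i) % len(bank_b)]) for i in range(8)]


addr_b = bank_b_addr(0b1010, 0x3F)   # only the low 5 register bits are used
print(hex(addr_b))
```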
DBCBR SrcReg1, bank_B_addr_base, b_all, r/c#, context#, set#, bank_A_addr
31-26 25 24 23-20 19-16 15-12 11-9 8-0
111101 set b_all SrcReg1 base_bankB context# r/c# bank_A_addr
Description: This instruction refers to double bank access of Frame Buffer
with row-wise context broadcast. All of the fields specify the same
information as those of DBCBC, except that r/c # is used to specify which row
is activated.
CBCAST b_all, b_row_col, r/c, context#
31-26 25-21 20 19 18-16 15 14-11 10-0
111000 ----- b_all - b_row_col r/c context# -----------
• b_all: 1 = all of the RCs are activated. 0 = only one row or column of the RC
Array is activated.
• b_row_col: If b_all = 0, then b_row_col specifies which row or column is
activated.
• r/c: This field specifies the context broadcast mode. 0 = column context
broadcast. 1 = row context broadcast.
• context #: Specifies which context in the Context Memory to be executed.
Description: This instruction assumes that all data needed for the computation
is already present in the RC Array; hence, no access to the Frame Buffer is
required.
WFBI r/c#, r/c, bank, set#, bank_addr
31-26 25-19 18-16 15 14-11 10 9 8-0
101000 ------- r/c# - ---- bank set# bank_addr
• r/c #: Specifies the column of the RC Array whose data is written back to
the Frame Buffer.
• bank: Specifies the bank of the Frame Buffer that the data is written
to.
• set #: Specifies which set of the Frame Buffer.
• bank_addr: This field provides the immediate row address (9 bits) for the
Frame Buffer that the data from the RC Array will be written to.
Description: This instruction performs the writing of data to the Frame Buffer.
The immediate address is obtained from the field bank_addr. The source data is
from the indicated column (specified by r/c #) of the RC Array. Eight bytes of
data are written concurrently to one row of the Frame Buffer through a 64-bit
bus.
WFB SrcReg1, r/c#, r/c, bank, set#
31-26 25-24 23-20 19 18-16 15 14-11 10 9 8-0
101001 -- SrcReg1 - r/c# - ---- bank set# ------
• SrcReg1: Specifies the register of the TinyRISC that provides the Frame
buffer address.
• r/c #: Specifies the column of the RC Array whose data is written back to
the Frame Buffer.
• bank: Specifies the bank of the Frame Buffer that the data from the RC
Array is written to.
• set #: Specifies which set of the Frame Buffer.
Description: This instruction performs the writing of data to the Frame Buffer
with address obtained from the TinyRISC register specified in the field
SrcReg1. The source data is from the indicated column (specified by r/c #) of
the RC Array. Eight bytes of data are written concurrently to one row of the
Frame Buffer through a 64-bit bus.
RCRISC Dest, col#
31-26 25-19 18-16 15-12 11-0
100100 ------- col# Dest ------------
• col #: Specifies which RC (out of eight) in the first row of the RC Array will
write data to the TinyRISC.
• Dest: Specifies the destination TinyRISC register where the data of the
specified RC has to be stored.
Appendix C
RC Array Instruction Set
Instruction Type  Instruction  Input 1   Input 2  Output         Description
ALU               BYPASS       MUX A     -        reg_file, out  in1 → out
ALU               OR           MUX A     MUX B    reg_file, out  in1 or in2 → out
ALU               AND          MUX A     MUX B    reg_file, out  in1 and in2 → out
ALU               XOR          MUX A     MUX B    reg_file, out  in1 xor in2 → out
ALU               ADD          MUX A     MUX B    reg_file, out  in1 + in2 → out
ALU               SUB          MUX A     MUX B    reg_file, out  in1 - in2 → out
ALU               SUBB         MUX A     MUX B    reg_file, out  in2 - in1 → out
ALU               ANDCNT       MUX A     MUX B    reg_file, out  position of least significant 1 in (in1 and in2) → out
ALU               ADDSUBF      MUX A     MUX B    reg_file, out  in1 ± in2 (according to flag) → out
ALU               ABSSUB       MUX A     MUX B    reg_file, out  | in1 - in2 | + out → out
ALU               KEEP         -         -        reg_file, out  nop
ALU               ROUND        MUX A     MUX B    reg_file, out  round(out) → out
ALU               CADD         MUX A     MUX B    reg_file, out  complex: in1 + in2 → out (in1, in2: 8 bit real, 8 bit imag)
ALU               CSUB         MUX A     MUX B    reg_file, out  complex: in1 - in2 → out (in1, in2: 8 bit real, 8 bit imag)
ALU               RST          -         -        -              clear all reg's
ALU               LDMM         mem       -        reg_file       mem(MAC_reg) → reg_file
ALU               STMM         reg_file  -        mem            reg_file → mem(MAC_reg)
MAC               CMUL         MUX A     MUX B    reg_file, out  sign complex: in1 * in2 → out (in1, in2: 8 bit real, 8 bit imag)
MAC               MUL          MUX A     MUX B    reg_file, out  sign: in1 * in2 → out (in1, in2: 16 bit)
MAC               MULADD       MUX A     MUX B    reg_file, out  in1 * MAC_reg + in2 → out (in1, in2: 16 bit)
MAC               MULADDO      MUX A     MUX B    reg_file, out  in1 * MAC_reg + out → out (in1, in2: 16 bit)
MAC               CMULADD      MUX A     MUX B    reg_file, out  in1 * MAC_reg + in2 → out (in1, in2: 8 bit real, 8 bit imag)
MAC               CMULADDO     MUX A     MUX B    reg_file, out  in1 * MAC_reg + out → out (in1, in2: 8 bit real, 8 bit imag)
MEM               LDIM         -         -        reg_file       immediate value → reg_file (immediate value in context[15..0])
MEM               LDMM         mem       -        reg_file       mem(MAC_reg) → reg_file
MEM               STMM         reg_file  -        mem            reg_file → mem(MAC_reg)
Appendix D
The Programs for AES Implementation in MorphoSys
D.1 Key Expansion
The Key Expansion program listed here is for encryption at Key size = 128 bits.
For decryption, the function InvMixColumn( ) is applied to every generated Round Key
except the first and last one. If the Key size is 192 bits or 256 bits, the only change is the
number of Rounds (loops).
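For comparison with the TinyRISC listing, the same 128-bit Key Expansion can be written as a short Python reference model. This is a sketch following the standard AES definition, not a translation of the assembly: the S-box is computed rather than read from memory, the round constant is generated with `xtime` instead of the Rcon table, and all function names are illustrative.

```python
def xtime(b: int) -> int:
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def make_sbox() -> list[int]:
    """Build the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        s = 0x63
        for r in range(5):                # inv ^ rotl(inv, 1..4) ^ 0x63
            s ^= ((inv << r) | (inv >> (8 - r))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = make_sbox()


def expand_key_128(key: bytes) -> list[list[int]]:
    """Return the 44 four-byte words of the expanded 128-bit key."""
    if len(key) != 16:
        raise ValueError("expected a 16-byte key")
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            # rotate by one byte, substitute, then add the round constant
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]
            temp[0] ^= rcon
            rcon = xtime(rcon)
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return w


words = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(bytes(words[4]).hex(), bytes(words[43]).hex())
```

The ten iterations of the `Rounds` loop in the listing correspond to words 4 through 43 here, four words per iteration.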
#######################################################################
# Round Key generation for NK=4 (128 bits).
# by Ye Tang, 05/22/01
#######################################################################
main:
ldi $10, 0x0100 # start address of the key; tracking last round key
ldi $11, 0x000A # loop number is 10, i.e., key length = 128 bits
ldi $12, 0x0200 # start address of "rcon[ ]"
ldi $13, 0x010C # 0x010C is the start address of last word of the key;
# use it as another index
# load the last word of the original key.
# notice that this part is unnecessary afterwards because $5 to $8 have already stored the last word of last round key (at the end of the loop).
ldw $5, $13 # load a "tinyrisc word"; $5 = tk[0][KC-1]
addi $13, $13, 1
ldw $6, $13 # 2nd one; $6 = tk[1][KC-1]
addi $13, $13, 1
ldw $7, $13 # 3rd one; $7 = tk[2][KC-1]
addi $13, $13, 1
ldw $8, $13 # 4th one; $8 = tk[3][KC-1]
######################################################################################
Rounds:
# calculate the 1st round key word
ldw $1, $10 # load the 1st byte; $1 = tk[0][0]
addi $10, $10, 1
ldw $2, $10 # load the 2nd one; $2 = tk[1][0]
addi $10, $10, 1
ldw $3, $10 # load the 3rd one; $3 = tk[2][0]
addi $10, $10, 1
ldw $4, $10 # load the 4th one; $4 = tk[3][0]
ldw $5, $5 # $5 = Sbox($5); Assume S-box is at address 0x0000
ldw $6, $6
ldw $7, $7
ldw $8, $8
ldw $9, $12 # $9 = rcon[$12]
addi $12, $12, 1 # for the use of rcon[ ] in next loop
xor $1, $6, $1 # $1 xor $6 -> $1; tk[i][0] ^= tk[(i+1)%4][KC-1]
xor $2, $7, $2 # xor a, b, c means c = a xor b
xor $3, $8, $3
xor $4, $5, $4
xor $1, $9, $1 # tk[0][0] ^= rcon[$12]
######################################################################################
# calculate the 2nd round key word and store the 1st round key word
addi $10, $10, 1 # the 2nd word of last round key, i.e., W[i-Nk] (i=Nk+1, Nk+2, ... , 2Nk-1)
ldw $5, $10 # load the 1st byte; $5 = tk[0][j] of last round key
addi $10, $10, 1
ldw $6, $10 # load the 2nd byte; $6 = tk[1][j] of last round key
addi $10, $10, 1
ldw $7, $10 # load the 3rd byte; $7 = tk[2][j] of last round key
addi $10, $10, 1
ldw $8, $10 # load the 4th byte; $8 = tk[3][j] of last round key
xor $1, $5, $5 # $5 = $1 xor $5 ; tk[i][j] ^= tk[i][j-1]; $1 is tk[i][j-1]
xor $2, $6, $6
xor $3, $7, $7
xor $4, $8, $8
addi $13, $13, 1 # store $1 to $4 (the 1st round key word) back at 0x0110, 0x0120 (next loop) and so on.
stw $1, $13 # $13 is the address. Manual is incorrect again!!!
addi $13, $13, 1 # in mULATE, it displays as "stw r13, r1".
stw $2, $13
addi $13, $13, 1
stw $3, $13
addi $13, $13, 1
stw $4, $13
######################################################################################
# calculate the 3rd round key word and store the 2nd round key word
# don't use loop so we can switch registers ($1 to $4, or $5 to $8) used for the words. by this means we can save time.
addi $10, $10, 1 # the 3rd word of last round key (or, original key);
ldw $1, $10 # switch to $1 again
addi $10, $10, 1
ldw $2, $10
addi $10, $10, 1
ldw $3, $10
addi $10, $10, 1
ldw $4, $10
xor $1, $5, $1 # $1 = $1 xor $5 ; tk[i][j] ^= tk[i][j-1]; $5 is tk[i][j-1]
xor $2, $6, $2
xor $3, $7, $3
xor $4, $8, $4
addi $13, $13, 1 # store $5 to $8 (the 2nd round key word) back at 0x0114 and so on.
stw $5, $13
addi $13, $13, 1
stw $6, $13
addi $13, $13, 1
stw $7, $13
addi $13, $13, 1
stw $8, $13
######################################################################################
# calculate the 4th round key word and store the 3rd round key word
addi $10, $10, 1 # the 4th word of last round key (or, original key);
ldw $5, $10 # switch to $5 again
addi $10, $10, 1
ldw $6, $10
addi $10, $10, 1
ldw $7, $10
addi $10, $10, 1
ldw $8, $10
xor $1, $5, $5 # $5 = $1 xor $5 ; tk[i][j] ^= tk[i][j-1]; $1 is tk[i][j-1]
xor $2, $6, $6
xor $3, $7, $7
xor $4, $8, $8
addi $13, $13, 1 # store $1 to $4 (the 3rd round key word) back at 0x0118 and so on.
stw $1, $13
addi $13, $13, 1
stw $2, $13
addi $13, $13, 1
stw $3, $13
addi $13, $13, 1
stw $4, $13
######################################################################################
# store the 4th round key word
addi $13, $13, 1 # store $5 to $8 back at 0x011c and so on.
stw $5, $13
addi $13, $13, 1
stw $6, $13
addi $13, $13, 1
stw $7, $13
addi $13, $13, 1
stw $8, $13
######################################################################################
addi $10, $10, 1 # point to the start address of current round key (0x0110 and so on), used for next loop
subi $11, $11, 1
brlt $0, $11, Rounds
nop # this nop (delay slot) is necessary, otherwise $10 will be assigned the value 0x0058.
# end of the "Rounds" loop
######################################################################################
# concatenate the round key so that it can be used by FB. the format is changed from "00000001 00000002" to "00010002".
ldi $10, 0x0058 # concatenate 2 numbers each time. there are 16*11 numbers in total. so 88 loops are needed.
ldi $11, 0x0100
ldi $12, 0x0100
Concatenation:
ldw $1, $11
addi $11, $11, 1
ldw $2, $11
lsli $1, $1, 16 # left shift $1 16 bits; get something like "00020000"
or $2, $1, $1 # $1 = $1 or $2, so we get something like "00020003"
stw $1, $12 # store back
addi $11, $11, 1
addi $12, $12, 1 # $12 advances half as fast as $11
subi $10, $10, 1
brlt $0, $10, Concatenation
nop
.end main
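The final `Concatenation` loop packs two stored values into one 32-bit word. A Python sketch of that packing step is given below; it is illustrative only, and the 16-bit masking is an assumption added for safety (the assembly simply shifts and ORs, which gives the same result for byte-sized values).

```python
def pack_pairs(values: list[int]) -> list[int]:
    """Pack consecutive pairs as (first << 16) | second."""
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    return [((first & 0xFFFF) << 16) | (second & 0xFFFF)
            for first, second in zip(values[0::2], values[1::2])]


# "00000001 00000002" becomes "00010002"
print([f"{word:08x}" for word in pack_pairs([1, 2, 3, 4])])

# 16 * 11 = 176 stored values give 88 packed words, matching the loop count
print(len(pack_pairs(list(range(176)))))
```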
D.2 Data Processing
The program listed here is for encryption at Key size = 128 bits. The programs for
decryption and other Key sizes are similar.
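The comments in the table-loading part of the listing give a schedule of 28 table entries in the first pass, 30 entries in each of 16 loop iterations, and 4 entries in a final pass. The following Python sketch only checks that arithmetic against the two 256-entry tables (S-box and xtime); the function name is hypothetical.

```python
def table_load_schedule() -> list[int]:
    """Entries loaded per context switch, as described in the listing comments."""
    first_pass = 14 + 14            # column + row data contexts; #14 and #15 are control
    loop_passes = [15 + 15] * 16    # only context #15 stays reserved for control
    last_pass = 4                   # remaining entries of the xtime table
    return [first_pass, *loop_passes, last_pass]


schedule = table_load_schedule()
print(len(schedule), sum(schedule))
assert sum(schedule) == 2 * 256     # S-box plus xtime table
```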
######################################################################################
# AES (Rijndael) - Encryption part
# - First part: lookup table loading
# - Second part: data process part
# By Ye Tang, 05/20/01
######################################################################################
# First Part: Table Loading
# first load the more frequently-used xtime table (so no offset is needed), then s-box.
main:
# Load Column Contexts;
ldi $1, 0x0000 # assume the column contexts address
ldi $10, 128
ldctxt $1, 0, 0, 0, 128
Delay1.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay1.DONE
nop
# Load Row Contexts;
ldi $2, 0x0080
ldi $10, 128
ldctxt $2, 0, 1, 0, 128
Delay2.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay2.DONE
nop
# Begin loading table data
# First 32 contexts consist of 4 "control contexts" (which are last two col/row contexts,
# i.e., #14 and #15) and 28 "data contexts". So 28 table data can be loaded.
# Then the first 15 col/row contexts will be flushed by new ones. Only the last
# col/row context are kept for control context (which are STMM and ADD R1, R1, R2).
# So from the second 32 contexts, 30 table data can be loaded each time.
# To fully load 256-cell S-box table, we need 9 context switches -- (14+15*7+9)*2
# After finishing S-box table loading, we will begin to load xtime table.
# Notice that the value of R1 is just what we need for the next address, and R2 is still 1.
# So actually we load s-box and xtime table seamlessly (back to back).
# There are 18 context switches in total -- (28+30*16+4)
# Because of the format of "cbcast", we can't use loop for it.
# If the number of contexts is increased from 16 to 128, this becomes unmanageable.
# A format of "cbcast 1, 0, 0, $1" would be much better.
# For now we can use a "coarse loop" between context switches, but except the first/last
# switch.
# The order of table loading execution: execute col context #0, #1, ... #14 (or the special
# first/last one in the first/last switch), then execute row context #0, #1, ..., #14.
# Remember this is also the order you must follow when establishing the contexts.
# Of course, when the context memory size is expanded to 128, things will not be so painful.
# first 32 contexts; load first 28 data of s-box table
cbcast 1, 0, 0, 14 # r1=0; col context
cbcast 1, 0, 1, 14 # r2=1; row context
cbcast 1, 0, 0, 0 # load 1st data to r0
cbcast 1, 0, 0, 15 # store r0
cbcast 1, 0, 1, 15 # increase r1 by 1 (address )
cbcast 1, 0, 0, 1 # load 2nd data...
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 2 # load 3rd data...
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 3 # and so on.
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 4
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 5
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 6
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 7
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 8
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 9
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 10
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 11
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 12
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 13
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 0 # Begin to execute row contexts
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 1
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 2
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 3
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 4
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 5
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 6
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 7
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 8
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 9
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 10
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 11
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 12
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 13
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
# 16 loops to load subsequent 15*2*16=480 numbers in the two tables.
ldi $11, 0x0010 # loop counter
# first reload contexts except the last col/row context, then cbcast
sbox:
# Load Column Contexts;
addi $1, $1, 0x100 # address of this part of contexts
ldi $10, 128
ldctxt $1, 0, 0, 0, 128
Delay3.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay3.DONE
nop
# Load Row Contexts;
addi $2, $2, 0x100 # address of this part of contexts
ldi $10, 128
ldctxt $2, 0, 1, 0, 128
Delay4.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay4.DONE
nop
cbcast 1, 0, 0, 0 # load 1st data to r0; col context
cbcast 1, 0, 0, 15 # store r0
cbcast 1, 0, 1, 15 # increase r1 by 1 (address )
cbcast 1, 0, 0, 1 # load 2nd data...
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 2 # load 3rd data...
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 3 # and so on.
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 4
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 5
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 6
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 7
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 8
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 9
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 10
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 11
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 12
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 13
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 0, 14
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 0 # Begin to execute row contexts
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 1
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 2
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 3
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 4
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 5
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 6
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 7
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 8
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 9
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 10
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 11
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 12
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 13
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
cbcast 1, 0, 1, 14
cbcast 1, 0, 0, 15
cbcast 1, 0, 1, 15
subi $11, $11, 1
nop
brlt $0, $11, sbox # use "brlt" rather than "brle"!!!
nop
# Load the last 4 entries (28+480+4 = 512) of the xtime table.
# Load 6 (*8) column contexts (including control contexts);
addi $1, $1, 0x100 # address of this part of contexts
ldi $10, 48
ldctxt $1, 0, 0, 0, 48
Delay5.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay5.DONE
nop
cbcast 1, 0, 0, 0 # load 1st data to r0
cbcast 1, 0, 0, 4 # store r0
cbcast 1, 0, 0, 5 # increase r1 by 1 (address )
cbcast 1, 0, 0, 1 # load 2nd data...
cbcast 1, 0, 0, 4
cbcast 1, 0, 0, 5
cbcast 1, 0, 0, 2 # load 3rd data...
cbcast 1, 0, 0, 4
cbcast 1, 0, 0, 5
cbcast 1, 0, 0, 3 # and the 4th data.
cbcast 1, 0, 0, 4
cbcast 1, 0, 0, 5
######################################################################################
# Second Part: Data Process
# The first part left some free space in context memory, but we will not make use of it.
# The reason is that 16 col and 15 row contexts are needed here and there is no penalty if we load
# them all together. Of course, if the context memory were larger, things would be different.
# In that case, we could load these contexts together with the remaining ones in the last part.
# Load 12 Column Contexts;
addi $1, $1, 0x0100 # address of this part of contexts
ldi $10, 96
ldctxt $1, 0, 0, 0, 96
Delay6.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay6.DONE
nop
# Load 15 Row Contexts;
addi $2, $2, 0x0200 # address of this part of contexts, just for test purposes. Note that $2 was not incremented by 0x100 last time.
ldi $10, 120
ldctxt $2, 0, 1, 0, 120
Delay7.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay7.DONE
nop
# Load first Round Key and 4 blocks of data from external memory to FB;
ldi $3, 0x0009 # assume there are 9 intermediate rounds, i.e. key length = 128 bits
ldi $4, 0x1300 # assume Round Keys begin here
ldfb $4, 0, 0, 8 # load 8 words or lines, i.e., 16 effective bytes every time ( Bank 0, Set 0 )
ldi $10, 8
Delay8.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay8.DONE
nop
ldi $5, 0x1400 # assume data begins here
ldfb $5, 1, 0, 32 # load 64 bytes (4 blocks) data ( Bank 1, Set 0 )
ldi $10, 32
Delay9.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay9.DONE
nop
# Load Round Key from FB to RC;
sbcb 0, 0, 0, 0, 0, 0, 0 # Load 1st eight bytes to 1st column; 1 Byte/RC
sbcb 0, 1, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 2nd column;
sbcb 0, 2, 0, 0, 0, 0, 0 # Load 1st eight bytes to 3rd column; 2nd block
sbcb 0, 3, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 4th column;
sbcb 0, 4, 0, 0, 0, 0, 0 # Load 1st eight bytes to 5th column; 3rd block
sbcb 0, 5, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 6th column;
sbcb 0, 6, 0, 0, 0, 0, 0 # Load 1st eight bytes to 7th column; 4th block
sbcb 0, 7, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 8th column;
# Load data from FB to RC; 4 blocks
sbcb 0, 0, 0, 1, 1, 0, 0 # Load data; Bank 1, Set 0; Col Context #1,
sbcb 0, 1, 0, 1, 1, 0, 8
sbcb 0, 2, 0, 1, 1, 0, 16
sbcb 0, 3, 0, 1, 1, 0, 24
sbcb 0, 4, 0, 1, 1, 0, 32
sbcb 0, 5, 0, 1, 1, 0, 40
sbcb 0, 6, 0, 1, 1, 0, 48
sbcb 0, 7, 0, 1, 1, 0, 56
# Initial Round Key Addition
cbcast 1, 0, 1, 0 # Row context #0
# Intermediate rounds begin (loop begins). Note that we have to load the Round Key into FB
# in EVERY round and overwrite the previous one; otherwise the FB address could not stay fixed.
# If sbcb supported a register operand, as in "sbcb 0, 0, 0, 0, 0, 0, $1", things would be easier.
# Load the next Round Key from external memory to FB;
IntermediateRound:
addi $4, $4, 8 # next Round Key starts here.
ldfb $4, 0, 0, 8 # load 8 words or lines, i.e., 16 effective bytes every time
ldi $10, 8
Delay10.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay10.DONE
nop
# Load Round Key from FB to RC;
sbcb 0, 0, 0, 0, 0, 0, 0 # Load 1st eight bytes to 1st column; 1 Byte/RC
sbcb 0, 1, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 2nd column;
sbcb 0, 2, 0, 0, 0, 0, 0 # Load 1st eight bytes to 3rd column; 2nd block
sbcb 0, 3, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 4th column;
sbcb 0, 4, 0, 0, 0, 0, 0 # Load 1st eight bytes to 5th column; 3rd block
sbcb 0, 5, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 6th column;
sbcb 0, 6, 0, 0, 0, 0, 0 # Load 1st eight bytes to 7th column; 4th block
sbcb 0, 7, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 8th column;
# ByteSub
cbcast 1, 0, 0, 2 # Col context #2
cbcast 1, 0, 0, 3 # Col context #3
cbcast 1, 0, 0, 4 # Col context #4
# ShiftRow-MixColumn
cbcast 1, 0, 1, 4 # Bypass r0, necessary!!! (because Col Ctx #4 is mem op and doesn't change out register.)
cbcast 1, 0, 1, 1 # Row context #1
cbcast 1, 0, 1, 2 # Row context #2
cbcast 1, 0, 0, 5 # Col context #5
cbcast 1, 0, 1, 4 # Row context #4, bypass r0 again
cbcast 1, 0, 0, 6 # Col context #6
cbcast 1, 0, 1, 3 # Row context #3
cbcast 1, 0, 1, 4 # Row context #4
cbcast 1, 0, 1, 5 # Row context #5
cbcast 1, 0, 1, 6 # Row context #6
cbcast 1, 0, 1, 7 # Row context #7
cbcast 1, 0, 1, 8 # Row context #8
cbcast 1, 0, 1, 9 # Row context #9
cbcast 1, 0, 0, 7 # Col context #7
cbcast 1, 0, 0, 8 # Col context #8
cbcast 1, 0, 0, 9 # Col context #9
cbcast 1, 0, 1, 10 # Row context #10
cbcast 1, 0, 0, 10 # Col context #10
cbcast 1, 0, 1, 11 # Row context #11
cbcast 1, 0, 0, 11 # Col context #11
# Add Round Key
cbcast 1, 0, 1, 0 # Row context #0
# loop condition
subi $3, $3, 1
brlt $0, $3, IntermediateRound
nop
# Final Round
# Load the final Round Key from external memory to FB;
addi $4, $4, 8 # next Round Key starts here.
ldfb $4, 0, 0, 8 # load 8 words or lines, i.e., 16 effective bytes every time
ldi $10, 8
Delay11.DONE:
subi $10, $10, 4
nop
brle $0, $10, Delay11.DONE
nop
# Load Round Key from FB to RC;
sbcb 0, 0, 0, 0, 0, 0, 0 # Load 1st eight bytes to 1st column; 1 Byte/RC
sbcb 0, 1, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 2nd column;
sbcb 0, 2, 0, 0, 0, 0, 0 # Load 1st eight bytes to 3rd column; 2nd block
sbcb 0, 3, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 4th column;
sbcb 0, 4, 0, 0, 0, 0, 0 # Load 1st eight bytes to 5th column; 3rd block
sbcb 0, 5, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 6th column;
sbcb 0, 6, 0, 0, 0, 0, 0 # Load 1st eight bytes to 7th column; 4th block
sbcb 0, 7, 0, 0, 0, 0, 8 # Load 2nd eight bytes to 8th column;
# ByteSub
cbcast 1, 0, 0, 2 # Col context #2
cbcast 1, 0, 0, 3 # Col context #3
cbcast 1, 0, 0, 4 # Col context #4
# ShiftRow only, no MixColumn
cbcast 1, 0, 1, 4 # Bypass r0, necessary!
cbcast 1, 0, 1, 12 # Row context #12
cbcast 1, 0, 1, 13 # Row context #13
cbcast 1, 0, 0, 6 # Repeat Col context #6
cbcast 1, 0, 1, 6 # Bypass r1, Row context #6
cbcast 1, 0, 0, 5 # Repeat Col context #5
cbcast 1, 0, 1, 14 # Row context #14
# Add Round Key
cbcast 1, 0, 1, 0 # Row context #0
# Store data from RC to FB
nop # necessary before writing out
wfbi 0, 0, 1, 0, 0 # store column #0 to Bank 1, Set 0, addr 0
wfbi 1, 0, 1, 0, 8 # column #1
wfbi 2, 0, 1, 0, 16
wfbi 3, 0, 1, 0, 24
wfbi 4, 0, 1, 0, 32
wfbi 5, 0, 1, 0, 40
wfbi 6, 0, 1, 0, 48
wfbi 7, 0, 1, 0, 56
# Store data from FB to External Memory
ldi $6, 0x2000 # assume output data begins here
stfb $6, 1, 0, 32 # save the 64 bytes (32 words) data back to main memory, Bank 1, Set 0
.end main
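The round sequence that the program above drives (initial Round Key addition, the intermediate rounds, and a final round without MixColumn) can be sketched in software for a single block. The following Python model is only an illustration under stated assumptions: all function and table names are hypothetical, the key expansion is included only so the sketch can be run (the listing assumes the Round Keys already sit in external memory), and the `t ^ xtime(a ^ b)` form of MixColumn is the common software shortcut rather than a confirmed description of the contexts, although the comments on row contexts #10 and #11 suggest a similar approach. The two 256-entry tables correspond to the 512 table entries loaded in the first part.

```python
# Software model of the AES-128 round flow; names are hypothetical.

def xtime(b):
    """Multiply b by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gmul(a, b):
    """General GF(2^8) product, used here only to build the S-box."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = xtime(a)
        b >>= 1
    return p

def build_sbox():
    """S-box = multiplicative inverse followed by the affine transform."""
    sbox = []
    for v in range(256):
        inv = 0 if v == 0 else next(c for c in range(1, 256) if gmul(v, c) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()                     # first 256-entry table
XTIME = [xtime(v) for v in range(256)]  # second 256-entry table

# The state is a flat list of 16 bytes; index r + 4*c is row r of column c.
def sub_bytes(s):
    return [SBOX[b] for b in s]

def shift_rows(s):
    return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def mix_columns(s):
    out = []
    for c in range(4):
        col = s[4 * c:4 * c + 4]
        t = col[0] ^ col[1] ^ col[2] ^ col[3]
        out += [col[i] ^ t ^ XTIME[col[i] ^ col[(i + 1) % 4]] for i in range(4)]
    return out

def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]

def expand_key(key):
    """AES-128 key schedule: returns 11 Round Keys of 16 bytes each."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]
            t[0] ^= rcon
            rcon = XTIME[rcon]
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(11)]

def encrypt_block(block, round_keys):
    state = add_round_key(list(block), round_keys[0])   # initial addition
    for rk in round_keys[1:-1]:                         # 9 intermediate rounds
        state = add_round_key(mix_columns(shift_rows(sub_bytes(state))), rk)
    state = add_round_key(shift_rows(sub_bytes(state)), round_keys[-1])
    return bytes(state)                                 # final round, no MixColumn
```

The model can be checked against the published AES test vectors, for example by calling `encrypt_block(plaintext, expand_key(key))` with a 16-byte key and plaintext.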
D.3 Contexts for Data Processing
The contexts listed here are for encryption and apply to all key sizes. The
contexts for decryption are similar and are not listed.
Column Contexts
set 0 , 0 BYPASS I I > 7 ; # Load Round Key
set 1 , 0 BYPASS I I > 7 ;
set 2 , 0 BYPASS I I > 7 ;
set 3 , 0 BYPASS I I > 7 ;
set 4 , 0 BYPASS I I > 7 ;
set 5 , 0 BYPASS I I > 7 ;
set 6 , 0 BYPASS I I > 7 ;
set 7 , 0 BYPASS I I > 7 ;
set 0 , 1 BYPASS I I > 0 ; # Load original data
set 1 , 1 BYPASS I I > 0 ;
set 2 , 1 BYPASS I I > 0 ;
set 3 , 1 BYPASS I I > 0 ;
set 4 , 1 BYPASS I I > 0 ;
set 5 , 1 BYPASS I I > 0 ;
set 6 , 1 BYPASS I I > 0 ;
set 7 , 1 BYPASS I I > 0 ;
set 0 , 2 LDIM!0x0100 def def > 1 ;
set 1 , 2 LDIM!0x0100 def def > 1 ;
set 2 , 2 LDIM!0x0100 def def > 1 ;
set 3 , 2 LDIM!0x0100 def def > 1 ;
set 4 , 2 LDIM!0x0100 def def > 1 ;
set 5 , 2 LDIM!0x0100 def def > 1 ;
set 6 , 2 LDIM!0x0100 def def > 1 ;
set 7 , 2 LDIM!0x0100 def def > 1 ;
set 0 , 3 ADD r0 r1 > 0 ;
set 1 , 3 ADD r0 r1 > 0 ;
set 2 , 3 ADD r0 r1 > 0 ;
set 3 , 3 ADD r0 r1 > 0 ;
set 4 , 3 ADD r0 r1 > 0 ;
set 5 , 3 ADD r0 r1 > 0 ;
set 6 , 3 ADD r0 r1 > 0 ;
set 7 , 3 ADD r0 r1 > 0 ;
set 0 , 4 LDMM r0 def > 0 ;
set 1 , 4 LDMM r0 def > 0 ;
set 2 , 4 LDMM r0 def > 0 ;
set 3 , 4 LDMM r0 def > 0 ;
set 4 , 4 LDMM r0 def > 0 ;
set 5 , 4 LDMM r0 def > 0 ;
set 6 , 4 LDMM r0 def > 0 ;
set 7 , 4 LDMM r0 def > 0 ;
set 0 , 5 BYPASS L def > 3 ;
set 1 , 5 BYPASS L def > 3 ;
set 2 , 5 BYPASS R def > 3 ; # also Final Step 4
set 3 , 5 BYPASS R def > 3 ;
set 4 , 5 BYPASS L def > 3 ;
set 5 , 5 BYPASS L def > 3 ;
set 6 , 5 BYPASS R def > 3 ;
set 7 , 5 BYPASS R def > 3 ;
set 0 , 6 BYPASS L def > 2 ;
set 1 , 6 BYPASS L def > 2 ; # also Final Step 3
set 2 , 6 BYPASS R def > 2 ;
set 3 , 6 BYPASS R def > 2 ;
set 4 , 6 BYPASS L def > 2 ;
set 5 , 6 BYPASS L def > 2 ;
set 6 , 6 BYPASS R def > 2 ;
set 7 , 6 BYPASS R def > 2 ;
set 0 , 7 XOR r0 r4 > 4 ;
set 1 , 7 XOR r0 r4 > 4 ;
set 2 , 7 XOR r0 r4 > 4 ;
set 3 , 7 XOR r0 r4 > 4 ;
set 4 , 7 XOR r0 r4 > 4 ;
set 5 , 7 XOR r0 r4 > 4 ;
set 6 , 7 XOR r0 r4 > 4 ;
set 7 , 7 XOR r0 r4 > 4 ;
set 0 , 8 XOR r2 r4 > 4 ;
set 1 , 8 XOR r2 r4 > 4 ;
set 2 , 8 XOR r2 r4 > 4 ;
set 3 , 8 XOR r2 r4 > 4 ;
set 4 , 8 XOR r2 r4 > 4 ;
set 5 , 8 XOR r2 r4 > 4 ;
set 6 , 8 XOR r2 r4 > 4 ;
set 7 , 8 XOR r2 r4 > 4 ;
set 0 , 9 XOR r3 r4 > 4 ;
set 1 , 9 XOR r3 r4 > 4 ;
set 2 , 9 XOR r3 r4 > 4 ;
set 3 , 9 XOR r3 r4 > 4 ;
set 4 , 9 XOR r3 r4 > 4 ;
set 5 , 9 XOR r3 r4 > 4 ;
set 6 , 9 XOR r3 r4 > 4 ;
set 7 , 9 XOR r3 r4 > 4 ;
set 0 , 10 LDMM r5 def > 5 ;
set 1 , 10 LDMM r5 def > 5 ;
set 2 , 10 LDMM r5 def > 5 ;
set 3 , 10 LDMM r5 def > 5 ;
set 4 , 10 LDMM r5 def > 5 ;
set 5 , 10 LDMM r5 def > 5 ;
set 6 , 10 LDMM r5 def > 5 ;
set 7 , 10 LDMM r5 def > 5 ;
set 0 , 11 XOR r0 r4 > 0 ;
set 1 , 11 XOR r0 r4 > 0 ; # r0 <-- r0 ^ r4
set 2 , 11 XOR r0 r4 > 0 ;
set 3 , 11 XOR r0 r4 > 0 ;
set 4 , 11 XOR r0 r4 > 0 ;
set 5 , 11 XOR r0 r4 > 0 ;
set 6 , 11 XOR r0 r4 > 0 ;
set 7 , 11 XOR r0 r4 > 0 ;
Row Contexts
set 8 , 0 XOR r0 r7 > 0 ; # AddRoundKey, RoundKey is saved in r7
set 9 , 0 XOR r0 r7 > 0 ;
set 10 , 0 XOR r0 r7 > 0 ;
set 11 , 0 XOR r0 r7 > 0 ;
set 12 , 0 XOR r0 r7 > 0 ;
set 13 , 0 XOR r0 r7 > 0 ;
set 14 , 0 XOR r0 r7 > 0 ;
set 15 , 0 XOR r0 r7 > 0 ;
set 8 , 1 BYPASS r0 def > 0 WE ; # ShiftRow-MixColumn Step 1
set 9 , 1 BYPASS r0 def > 0 ;
set 10 , 1 BYPASS r0 def > 0 ;
set 11 , 1 BYPASS VE def > 1 ;
set 12 , 1 BYPASS r0 def > 0 WE ;
set 13 , 1 BYPASS VE def > 1 ;
set 14 , 1 BYPASS r0 def > 0 ;
set 15 , 1 BYPASS r0 def > 0 ;
set 8 , 2 BYPASS r0 def > 0 ; # ShiftRow-MixColumn Step 2
set 9 , 2 BYPASS VE def > 1 ;
set 10 , 2 BYPASS r0 def > 0 WE ;
set 11 , 2 BYPASS r0 def > 0 ;
set 12 , 2 BYPASS r0 def > 0 ;
set 13 , 2 BYPASS r0 def > 0 ;
set 14 , 2 BYPASS r0 def > 0 WE ;
set 15 , 2 BYPASS VE def > 1 ;
set 8 , 3 BYPASS VE def > 2 ; # ShiftRow-MixColumn Step 7
set 9 , 3 BYPASS VE def > 2 WE ; # Because output register equals to r2 now,
set 10 , 3 BYPASS VE def > 2 ; # execute step 7 before step 5 and 6
set 11 , 3 BYPASS VE def > 2 ;
set 12 , 3 BYPASS VE def > 2 ;
set 13 , 3 BYPASS VE def > 2 ;
set 14 , 3 BYPASS VE def > 2 ;
set 15 , 3 BYPASS VE def > 2 WE ;
set 8 , 4 BYPASS r0 def > 0 ; # ShiftRow-MixColumn Step 5
set 9 , 4 BYPASS r0 def > 0 ;
set 10 , 4 BYPASS r0 def > 0 ;
set 11 , 4 BYPASS r0 def > 0 ;
set 12 , 4 BYPASS r0 def > 0 ;
set 13 , 4 BYPASS r0 def > 0 ;
set 14 , 4 BYPASS r0 def > 0 ;
set 15 , 4 BYPASS r0 def > 0 ;
set 8 , 5 BYPASS VE def > 0 ;
set 9 , 5 BYPASS VE def > 0 ;
set 10 , 5 BYPASS VE def > 0 ;
set 11 , 5 BYPASS VE def > 0 WE ;
set 12 , 5 BYPASS VE def > 0 ;
set 13 , 5 BYPASS VE def > 0 WE ;
set 14 , 5 BYPASS VE def > 0 ;
set 15 , 5 BYPASS VE def > 0 ;
set 8 , 6 BYPASS r1 def > 1 ; # ShiftRow-MixColumn Step 6
set 9 , 6 BYPASS r1 def > 1 ;
set 10 , 6 BYPASS r1 def > 1 ;
set 11 , 6 BYPASS r1 def > 1 ;
set 12 , 6 BYPASS r1 def > 1 ;
set 13 , 6 BYPASS r1 def > 1 ;
set 14 , 6 BYPASS r1 def > 1 ;
set 15 , 6 BYPASS r1 def > 1 ;
set 8 , 7 BYPASS VE def > 1 ;
set 9 , 7 BYPASS VE def > 1 ;
set 10 , 7 BYPASS VE def > 1 ;
set 11 , 7 BYPASS VE def > 1 WE ;
set 12 , 7 BYPASS VE def > 1 ;
set 13 , 7 BYPASS VE def > 1 WE ;
set 14 , 7 BYPASS VE def > 1 ;
set 15 , 7 BYPASS VE def > 1 ;
set 8 , 8 BYPASS r3 def > 3 ; # ShiftRow-MixColumn Step 8
set 9 , 8 BYPASS r3 def > 3 ;
set 10 , 8 BYPASS r3 def > 3 ;
set 11 , 8 BYPASS r3 def > 3 ;
set 12 , 8 BYPASS r3 def > 3 ;
set 13 , 8 BYPASS r3 def > 3 ;
set 14 , 8 BYPASS r3 def > 3 ;
set 15 , 8 BYPASS r3 def > 3 ;
set 8 , 9 BYPASS VE def > 3 ;
set 9 , 9 BYPASS VE def > 3 WE ;
set 10 , 9 BYPASS VE def > 3 ;
set 11 , 9 BYPASS VE def > 3 ;
set 12 , 9 BYPASS VE def > 3 ;
set 13 , 9 BYPASS VE def > 3 ;
set 14 , 9 BYPASS VE def > 3 ;
set 15 , 9 BYPASS VE def > 3 WE ;
set 8 , 10 XOR r0 r1 > 5 ; # MixColumn - Flowing Step (1) b
set 9 , 10 XOR r0 r3 > 5 ; # tm (r5) <-- r0 ^ r1 (or other registers)
set 10 , 10 XOR r2 r3 > 5 ;
set 11 , 10 XOR r1 r2 > 5 ;
set 12 , 10 XOR r1 r2 > 5 ;
set 13 , 10 XOR r2 r3 > 5 ;
set 14 , 10 XOR r0 r3 > 5 ;
set 15 , 10 XOR r0 r1 > 5 ;
set 8 , 11 XOR r1 r5 > 0 ; # MixColumn - Flowing Step (2) b
set 9 , 11 XOR r0 r5 > 0 ; # r0 <-- r0 ^ tm (or other registers)
set 10 , 11 XOR r3 r5 > 0 ;
set 11 , 11 XOR r2 r5 > 0 ;
set 12 , 11 XOR r1 r5 > 0 ;
set 13 , 11 XOR r2 r5 > 0 ;
set 14 , 11 XOR r3 r5 > 0 ;
set 15 , 11 XOR r0 r5 > 0 ;
set 8 , 12 BYPASS r0 def > 0 ; # Final Round Step 1
set 9 , 12 BYPASS VE def > 1 WE ; # original data is in r0
set 10 , 12 BYPASS r0 def > 0 ;
set 11 , 12 BYPASS r0 def > 0 ;
set 12 , 12 BYPASS r0 def > 0 ;
set 13 , 12 BYPASS VE def > 1 WE ;
set 14 , 12 BYPASS r0 def > 0 ;
set 15 , 12 BYPASS r0 def > 0 ;
set 8 , 13 BYPASS r0 def > 0 ; # Final Round Step 2
set 9 , 13 BYPASS r0 def > 0 ;
set 10 , 13 BYPASS r0 def > 0 ;
set 11 , 13 BYPASS VE def > 1 WE ;
set 12 , 13 BYPASS r0 def > 0 ;
set 13 , 13 BYPASS r0 def > 0 ;
set 14 , 13 BYPASS r0 def > 0 ;
set 15 , 13 BYPASS VE def > 1 WE ;
set 8 , 14 BYPASS r0 def > 0 ; # Final Round Step 5
set 9 , 14 BYPASS r1 def > 0 ;
set 10 , 14 BYPASS r2 def > 0 ;
set 11 , 14 BYPASS r3 def > 0 ;
set 12 , 14 BYPASS r0 def > 0 ;
set 13 , 14 BYPASS r3 def > 0 ;
set 14 , 14 BYPASS r2 def > 0 ;
set 15 , 14 BYPASS r1 def > 0 ;
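Because every context consists of eight `set` statements, a small script can help check a listing like the one above for missing or duplicated entries. The following Python sketch is only an illustration: the function name is hypothetical, and the field names (index, context number, operation, two sources, destination, optional `WE` flag) are assumptions read off the textual layout of the statements, not a definition of the context word format.

```python
import re

# One "set" statement: index , context  op  srcA  srcB  >  dest  [WE] ;
SET_RE = re.compile(
    r"set\s+(\d+)\s*,\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*>\s*(\d+)\s*(WE)?\s*;"
)

def parse_contexts(text):
    """Group the statements found in text by context number.

    Returns {context: [(index, op, src_a, src_b, dest, we), ...]}.
    Trailing "# ..." comments are ignored because they do not match.
    """
    groups = {}
    for m in SET_RE.finditer(text):
        index, ctx, op, src_a, src_b, dest, we = m.groups()
        groups.setdefault(int(ctx), []).append(
            (int(index), op, src_a, src_b, int(dest), we is not None)
        )
    return groups

def incomplete_contexts(groups, expected=8):
    """Return the context numbers that do not have `expected` entries."""
    return sorted(c for c, entries in groups.items() if len(entries) != expected)
```

Applied separately to the column contexts and to the row contexts, `incomplete_contexts(parse_contexts(listing))` would return an empty list when each context has exactly eight statements.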