
Computer Engineering
Mekelweg 4, 2628 CD Delft
The Netherlands
http://ce.et.tudelft.nl/

2009

MSc THESIS

FPGA Hardware acceleration of co-occurring
aberrations in aCGH data

Marco R. van der Leije

Abstract

CE-MS-2009-17

Unbalanced translocations can lead to additions and deletions in genes, which can be an indication of tumor cells. These are measured with array Comparative Genomic Hybridization (aCGH). To find co-occurring aberrations in DNA, an algorithm was designed; however, its execution takes days. This thesis proposes a partial FPGA-based design in which the number of parallel computations can be increased. The FPGA communicates with a computer over gigabit Ethernet, for which a hardware-based Ethernet controller is built on the FPGA. The design is scalable, so its performance is linear in the size of the FPGA's resources. On an XC4VFX12 device, a minimum speedup of a factor 3 and a maximum speedup of several hundred is achieved.


FPGA Hardware acceleration of co-occurring

aberrations in aCGH data

THESIS

submitted in partial fulfilment of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER ENGINEERING

by

Marco R. van der Leije

born in Rotterdam, The Netherlands

Computer Engineering

Department of Electrical Engineering

Faculty of Electrical Engineering, Mathematics and Computer Science

Delft University of Technology


FPGA Hardware acceleration of co-occurring

aberrations in aCGH data

by Marco van der Leije

Abstract

Unbalanced translocations can lead to additions and deletions in genes, which can be an indication of tumor cells. These are measured with array Comparative Genomic Hybridization (aCGH). To find co-occurring aberrations in DNA, an algorithm was designed; however, its execution takes days. This thesis proposes a partial FPGA-based design in which the number of parallel computations can be increased. The FPGA communicates with a computer over gigabit Ethernet, for which a hardware-based Ethernet controller is built on the FPGA. The design is scalable, so its performance is linear in the size of the FPGA's resources. On an XC4VFX12 device, a minimum speedup of a factor 3 and a maximum speedup of several hundred is achieved.

Laboratory : Computer Engineering

Codenumber : CE-MS-2009-17

Committee Members :

Advisor: Arjan J. van Genderen, CE, TU Delft

Advisor: Marcel J.T. Reinders, ICT, TU Delft

Chairperson: Koen Bertels, CE, TU Delft

Member: Georgi N. Gaydadjiev, CE, TU Delft

Member: Jeroen de Ridder, ICT, TU Delft


Contents

List of Figures vii

List of Tables viii

Acknowledgements ix

1 Introduction 1
1.1 Aberrations in aCGH data 1
1.2 Problem statement 2
1.3 Project goals and approach 3
1.4 Chapter overview 3

2 The Algorithm 5
2.1 Inputs and outputs 5
2.2 Pair-wise space 6
2.3 Covariance and normalization 7
2.4 2D kernel 8
2.5 Peaks 9

3 Implementation in software 11
3.1 Pair-wise space 11
3.2 Covariance and normalization 12
3.3 2D kernel 14
3.4 Peaks 15

4 Implementation considerations 17
4.1 Parallelism 17
4.2 Implementation platforms 18
4.2.1 General purpose processor 18
4.2.2 Cell (microprocessor) 18
4.2.3 Graphic Processor Unit 19
4.2.4 Field programmable gate array 19
4.2.5 Conclusion 20

5 Architecture 21
5.1 Partitioning of the algorithm 21
5.2 Software versus hardware 22
5.3 Software 23
5.4 Hardware 25

6 Implementation in hardware 27
6.1 Ethernet communication 27
6.1.1 The TEMAC and the FIFOs 27
6.1.2 Communication data format 28
6.1.3 The Ethernet controller 28
6.2 The accelerator 29
6.2.1 Receive 30
6.2.2 Calculate 30
6.3 Computational hardware 32


7 Results 33
7.1 Platform and resources 33
7.2 Execution times 34
7.3 Scalability 35
7.4 Verification 35

8 Conclusions and future work 37
8.1 Conclusion 37
8.2 Future work 38

Bibliography 39


List of Figures

Figure 1.1: Loss of a tumor suppressor gene and the gain of an oncogene 1
Figure 1.2: Tumor DNAs from different persons 2
Figure 1.3: Co-occurring alterations in tumor DNA 3

Figure 2.1: The inputs 5
Figure 2.2: The output 5
Figure 2.3: Pre-computation (pseudo code) 6
Figure 2.4: Calculate the minimums 7
Figure 2.5: Sums the C matrices 7
Figure 2.6: The NORM matrix 7
Figure 2.7: Normalize the pair-wise space 8
Figure 2.8: Normal 2D kernel convolution 8
Figure 2.9: Separated 2D kernel convolution 9
Figure 2.10: Finding peaks 9

Figure 3.1: Pseudo code of calculating the pair-wise space 11
Figure 3.2: Pseudo code of the covariance and normalization step 12
Figure 3.3: Pseudo code of calculating the covariance matrix 13
Figure 3.4: Matlab code defining the kernel matrix 14
Figure 3.5: Pseudo code of the 2D kernel convolution 15
Figure 3.6: Pseudo code of the peak finding algorithm 16
Figure 3.7: Pseudo code of the subroutine EXPAND 16

Figure 5.1: Pseudo code of to_fpga function 22
Figure 5.2: Software architecture 23
Figure 5.3: Pseudo code of fpga function 23
Figure 5.4: New NORM function 24
Figure 5.5: Pseudo code of fpga_buffer function 24
Figure 5.6: Total hardware architecture 25

Figure 6.1: Communication TEMAC-FIFOs 27
Figure 6.2: Accelerator architecture 29
Figure 6.3: Computational unit 30
Figure 6.4: Receive process 31
Figure 6.5: Calculate process (0 < g < G) 31
Figure 6.7: Calculate Result Mins (left) and calculate Result Covar (right) 32
Figure 6.8: Calculate Data OUT 32

Figure 7.1: Execution time of two small input matrices 34
Figure 7.2: Execution time of one small and one big input matrix 34
Figure 7.3: FPGA execution time for different input sizes 35


List of Tables

Table 4.1: Number of arithmetic operations 17
Table 4.2: Arithmetic and time complexity 17

Table 5.1: Sizes of the matrices 21

Table 7.1: Resources used for the accelerator hardware 33
Table 7.2: Errors between KCSMART v8 software and KCSMART v8 35


Acknowledgements

This report is the result of several months of work on improving the discussed algorithm. It was a great challenge, and many aspects of design came along. Without some help, this would have been very hard to realize.

I would like to thank Arjan van Genderen for his help and his trust in me. He gave good advice and gave me much freedom to work at home. He also critically checked my first report and gave some great advice on it. I would also like to thank Marcel Reinders and Jeroen de Ridder; they explained the algorithm and helped me stay on the right track. Finally, I want to thank Chris Klijn, whose Matlab implementation formed the starting point: he gave insight into the algorithm and explained everything I needed to get started.

Marco R. van der Leije

Delft, The Netherlands

August 20, 2009


1 Introduction

This report discusses the acceleration of a process that is used to find co-occurrences of alterations in DNA strings. First, some background information on these alterations is given (1.1). To find the co-occurrences of these alterations, an algorithm has been designed. This algorithm takes a lot of execution time, which leads to the problem statement (1.2). Thereafter, the project goals and approach are explained (1.3). Finally, a chapter overview is given in paragraph 1.4, which explains the structure of this report.

1.1 Aberrations in aCGH data

Genomic instability is often observed in tumor cells [9]. This instability can lead to the loss of a tumor suppressor gene and the gain of an oncogene, which is called an unbalanced translocation (Figure 1.1). This means that deletions and additions of DNA pieces can occur. Where normally the genes come in pairs, tumor cells have more (or fewer) copies of the same genes. This abnormal number of genes is called a copy number alteration (CNA). These alterations are interesting to find, because they can help in tumor research and other DNA studies.

One of the procedures to measure CNAs is array Comparative Genomic Hybridization (aCGH) [3]. This method compares 'healthy DNA' with 'tumor DNA': a ratio of the number of genes between 'healthy DNA' and 'tumor DNA' is calculated. This ratio is usually represented in log2 form, where positive numbers represent a gain in the number of genes and negative numbers represent a loss (compared to 'healthy DNA').

Figure 1.1: Loss of a tumor suppressor gene and the gain of an oncogene


1.2 Problem statement

There are many analyses that focus on finding CNAs, but most of them look for single-location variations. Figure 1.2 displays three different tumor DNAs, where each arrow represents the ratio (between the number of genes of healthy and tumor DNA). The analyses search for shared alterations in the DNA. In this example they will find a large variation peak at position four (the fourth arrow of all tumor DNAs is high) and a smaller variation peak at position two.

However, not all single DNA changes lead to tumors or other dangerous cell mutations. For research purposes, the need for finding co-occurrences in the DNA alterations is growing. Therefore an algorithm was designed to find co-occurring aberrations [2]. In Figure 1.3 two pieces of a DNA string are combined to find co-occurring alterations. The arrows represent the same ratio as in Figure 1.2, and the size of each circle represents the importance of the co-occurring alteration at that position in DNA pieces A and B. So the big circles are co-occurring aberrations in one tumor DNA, and these positions have to be determined for each tumor DNA. The size of the circles is calculated with the minimum function, so a circle stands for the minimum of two ratios between healthy and tumor DNA. The sum of all these circles over all tumor DNAs is called the pair-wise space. The algorithm is further explained in chapter two.

The problem is that the DNA string is large. This means many computations have to be done to find alterations (the data is in the order of hundreds of megabytes). To find co-occurrences, the number of computations grows exponentially with the number of simultaneous alterations. Because the algorithm is optimal in its number of calculations (as far as known), there is a need to improve the speed of these computations.

The total DNA string is divided, because the total computation is too expensive to compute on one platform. To divide the calculation, the tumor DNA strings are separated into chromosome arms (each person has 23 chromosome pairs and each chromosome consists of two chromosome arms). These chromosome arms are used to calculate a part of the total result. In this way the problem is divided into roughly 10k jobs [1]: 46x46/2 chromosome-arm pairs are compared for 5 kernel sizes and 3 modes (gain/gain, loss/loss and gain/loss), which results in roughly 10k jobs.

Figure 1.2: Tumor DNAs from different persons


Figure 1.3: Co-occurring alterations in tumor DNA

1.3 Project goals and approach

The main goal is to accelerate the computations of the algorithm. The approach is to consider different platforms for the computations and to choose one of them. The algorithm is already implemented in Matlab and distributed over a network of computers. However, this implementation takes a long time to compute the result.

The approach is to improve one job (the co-occurring aberrations of two chromosome arms) on a different platform. In this way a new network can be created, or combined with the old network, to do the same job faster. The approach consists of the following phases:

• Literature study: understanding the algorithm and its background, and studying other accelerator approaches in software and on different platforms.

• Convert the Matlab implementation to a plain C implementation to improve the algorithm and to find parallelization possibilities.

• Search for and choose an implementation platform and design an architecture.

• Implement, test and verify the architecture.

This report focuses on accelerating the algorithm that finds the co-occurrences of two alterations in DNA. In the future more co-occurrences will be calculated; however, the computational time grows exponentially (the discussed algorithm would take 10,000 days when calculated fully).

1.4 Chapter overview

This report is written in the same order as the approach. Chapter 2 visualizes and explains the algorithm. This algorithm is implemented in plain C, where some algorithm steps are redesigned (chapter 3). An implementation platform was chosen (chapter 4) and an architecture was designed (chapter 5). This architecture was implemented (chapter 6) and tested. The results are presented in chapter 7. Finally, the conclusions and future work can be found in chapter 8.


2 The Algorithm

This chapter explains the algorithm that calculates the co-occurring aberrations. First the exact inputs and outputs of the algorithm are shown (2.1), then a description follows of how the pair-wise space is calculated (2.2). The next stage is to normalize the pair-wise space (2.3), and finally the peaks are calculated (2.5) after the 2D kernel is applied to the normalized pair-wise space (2.4).

2.1 Inputs and outputs

Within the algorithm, two different chromosome arms are compared for a number of tumor DNAs from different persons (P). These chromosome arms for the different tumor DNAs are displayed in Figure 2.1 and are called matrices A and B. Each row of matrices A and B contains the ratios between healthy and tumor DNA for one tumor DNA (Figure 1.2 is illustrative for the matrices A and B). Matrix A (chromosome arm A) is of length M and matrix B (chromosome arm B) is of length N. Some precomputed normalization vectors (NA and NB) are also given; they are used to normalize the pair-wise space, so that all ratios have the same intensity.

Figure 2.1: The inputs

The output consists of the 500 highest peaks of the pair-wise space after a kernel is applied. Only the 500 highest peaks are used, because the smaller peaks have no significant value and the output matrix would otherwise be too large. The given information includes the location (X is the position in A and Y is the position in B) and the height of each peak in the pair-wise space (Figure 2.2).

Figure 2.2: The output
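For illustration, the output of one job could be represented as follows. This is a minimal C sketch; the type and field names are assumptions, but the fields follow Figure 2.2.

// Hypothetical representation of one output entry (names are assumptions)
typedef struct {
    int    x;       // position in chromosome arm A
    int    y;       // position in chromosome arm B
    double height;  // height of the peak in the pair-wise space
} peak_entry;

peak_entry output[500];  // the 500 highest peaks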


The values in matrices A and B indicate the ratio between the tumor and healthy signal (on a log2 scale), where negative values indicate a loss and positive values a gain. The algorithm can calculate a gain/gain, loss/loss or gain/loss answer, which is selected with the 'amp' parameter. For a gain situation all negative values are nullified. For a loss situation all positive values are nullified and all negative values are inverted. All different possibilities are shown in Figure 2.3.

Figure 2.3: Pre-computation (pseudo code)

switch amp
case 1  // gain/gain
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P
        if A_pi < 0 then A_pi = 0
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P
        if B_pj < 0 then B_pj = 0
case 0  // loss/loss
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P
        if A_pi > 0 then A_pi = 0 else A_pi = -A_pi
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P
        if B_pj > 0 then B_pj = 0 else B_pj = -B_pj
case 2  // gain/loss
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P
        if A_pi < 0 then A_pi = 0
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P
        if B_pj > 0 then B_pj = 0 else B_pj = -B_pj
end

2.2 Pair-wise space

To create the pair-wise space, the two chromosome arms are combined. This is implemented with a minimum function, because a large gain or loss ratio common to both chromosome arms is searched for. Each value of matrix A is compared with each value of matrix B for each person. In this way a matrix C is computed for each person (Figure 2.4).



Figure 2.4: Calculate the minimums

for all 1 ≤ i ≤ M, 1 ≤ j ≤ N and 1 ≤ p ≤ P
    C^p_ji = MIN(A_pi, B_pj)

There are P different C matrices. Only high ratios that occur in all C matrices are important, since systematic co-occurring alterations in the chromosome arms are searched for. To find these alterations, all C matrices are added together (Figure 2.5). In this way each point in the matrix D is the sum of the minimums of the ratios between the healthy and tumor DNA.

Figure 2.5: Sums the C matrices

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
    D_ji = Σ_{p=1..P} C^p_ji

2.3 Covariance and normalization

The next step is to compute the covariance matrix to correct for continuous ratios. In addition, the normalization vectors are used to normalize the pair-wise space. The normalization matrix is computed as shown in Figure 2.6. The matrix D is multiplied by the covariance matrix and divided by the normalization matrix (Figure 2.7).

This covariance step looks like an easy computational step (in terms of multiplications and divisions), but the covariance has a complexity of MxNxP (the same complexity as the previous step). In addition, some divisions are needed for the covariance matrix and the normalization.
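Written out (consistent with the COVAR pseudo code in chapter 3, which divides by P), the covariance matrix is:

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
    COV_ji = (1/P) × Σ_{p=1..P} (A_pi − MEAN_A_i) × (B_pj − MEAN_B_j)

where MEAN_A_i and MEAN_B_j are the means of column i of A and column j of B.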

Figure 2.6: The NORM matrix

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
    NORM_ji = NA_i + NB_j



Figure 2.7: Normalize the pair-wise space

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
    E_ji = D_ji × COV_ji / NORM_ji

2.4 2D kernel

A 2D Gaussian kernel is applied to the normalized pair-wise space to look at the local enrichment of the highest values within this space. A normal 2D kernel convolution is shown in Figure 2.8, where K denotes the height and width of the kernel matrix V. In this way the complexity is MxNxKxK, so this step takes the most computational power (when KxK > P).

Because the kernel is Gaussian, it is separable. This means that its convolution can be done with one horizontal and one vertical vector of width K (Figure 2.9). In this way the complexity decreases to MxNxKx2, so this step takes less computational power (when 2xK < P).
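As a concrete illustration (taking K = 1000, the upper bound of Table 5.1): a full 2D convolution needs KxK = 10^6 multiplications per point of the pair-wise space, whereas the separated version needs only Kx2 = 2000, a reduction by a factor K/2 = 500.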

Figure 2.8: Normal 2D kernel convolution

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
    G_ji = Σ_{k1=1..K} Σ_{k2=1..K} E_(j+K/2-k2),(i+K/2-k1) × V_k1,k2


Figure 2.9: Separated 2D kernel convolution

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N
    F_ji = Σ_{k1=1..K} E_j,(i+K/2-k1) × V_k1        (horizontal)
    G_ji = Σ_{k2=1..K} F_(j+K/2-k2),i × V_k2        (vertical)

2.5 Peaks

This is the final step in the algorithm: it finds peaks in the pair-wise space to detect the DNA locations that co-aberrate to a certain degree. This peak function, as shown in Figure 2.10, produces an array of the locations and heights of the peaks. This function has complexity MxN and so it should be fast, but as will be discussed in the next chapter, this function took the most time in the Matlab implementation.

Figure 2.10: Finding peaks



3 Implementation in software

An important step in optimizing the algorithm is to get insight into the number of operations and the parallelism available in each step. This is done by translating the algorithm into pseudo code. All steps of the algorithm, calculating the pair-wise space (3.1), the covariance and the normalization (3.2), the 2D kernel (3.3) and the peak function (3.4), are translated into pseudo code. The most important differences compared to the original Matlab code are also mentioned. Note that the matrices are stored column-wise in memory (this means that the indices of the matrices are exchanged).

The pseudo code in this chapter is a simplified version of the real C code. This means that advanced code, such as the memory mapping for the kernel convolution and the flat peak detection, is left out of the report.

3.1 Pair-wise space

In this step the pair-wise space is calculated. First a matrix D is filled with zeros, and then the minimum values between matrices A and B are added into this matrix (Figure 3.1). The pseudo code shows that MxNxP minimum (MIN) functions must be performed. Each minimum function needs one comparator, and all of these comparators can work in parallel. Only the P additions for each point in matrix D have to be performed sequentially, because they write to the same address. The original Matlab code ran 5 times slower than this code because of the Matlab interpreter.

Figure 3.1: Pseudo code of calculating the pair-wise space

//Input:  matrices A and B are the ratios between the healthy and tumor DNA
//        for chromosome arms A and B, for P tumor DNAs
//Output: matrix D is the pair-wise space
mins(M, N, P, matrix A, matrix B, matrix D)
{
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            D[i,j] = 0;

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            for (p=0; p<P; p++)
                D[i,j] += MIN(A[i,p], B[j,p]);
}


3.2 Covariance and normalization

The function COVAR calculates the covariance matrix, where some subroutines are needed to perform the matrix multiplication and to calculate the mean of each column (Figure 3.3). After the covariance matrix is calculated, it is multiplied with the pair-wise space in the function COV_NORM (Figure 3.2). In this function all negative values in the covariance matrix are nullified. The result of this multiplication step is divided by the normalization matrix. Division by zero is avoided by a check in this function (if it occurs, the result is set to zero, as the original Matlab code does).

Most computations are done in the matrix multiplication, which is used in the COVAR function. The number of multiplications is MxNxP. The P additions for every point in the matrix have to be performed sequentially, because they write to the same address.

Figure 3.2: Pseudo code of the covariance and normalization step

//Input:  matrix COV is the covariance matrix
//        matrix D is the pair-wise space
//        arrays Na and Nb are the normalization vectors
//Output: matrix E is the pair-wise space corrected for continuous ratios
//        and normalized
cov_norm(M, N, matrix D, matrix COV, array Na, array Nb, matrix E)
{
    // multiply step
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            if (COV[i,j] < 0)
                E_interm[i,j] = 0;
            else
                E_interm[i,j] = D[i,j] * COV[i,j];

    // normalization step
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
        {
            temp = Na[i] + Nb[j];
            if (temp == 0)
                E[i,j] = 0;
            else
                E[i,j] = E_interm[i,j] / temp;
        }
}


Figure 3.3: Pseudo code of calculating the covariance matrix

covar(M, N, P, matrix A, matrix B, matrix COV)
{
    mean(M, P, A, A_MEAN);
    mean(N, P, B, B_MEAN);
    sub_mean(M, P, A, A_MEAN, A_SUB_MEAN);
    sub_mean(N, P, B, B_MEAN, B_SUB_MEAN);
    mult_cov(M, N, P, A_SUB_MEAN, B_SUB_MEAN, COV);
}

mult_cov(M, N, P, matrix A_SUB_MEAN, matrix B_SUB_MEAN, matrix COV)
{
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            COV[i,j] = 0;

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            for (k=0; k<P; k++)
                COV[i,j] += A_SUB_MEAN[i,k] * B_SUB_MEAN[j,k];

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            COV[i,j] /= P;  //P was M in the original code;
                            //however, the covariance is calculated over P persons
}

mean(M, P, matrix IN, array MEAN)
{
    for (i=0; i<M; i++)
        MEAN[i] = 0;

    for (i=0; i<M; i++)
        for (k=0; k<P; k++)
            MEAN[i] += IN[i,k];

    for (i=0; i<M; i++)
        MEAN[i] /= P;
}

sub_mean(M, P, matrix IN, array MEAN, matrix SUB_MEAN)
{
    for (i=0; i<M; i++)
        for (k=0; k<P; k++)
            SUB_MEAN[i,k] = IN[i,k] - MEAN[i];
}


3.3 2D kernel

The 2D Gaussian kernel convolution can be separated, as explained in the previous chapter. The separation of the Gaussian kernel itself is shown in Figure 3.4: the upper half of the figure shows the calculation of a 2D Gaussian and the lower half shows the calculation of the separated Gaussian. First a horizontal convolution with the separated Gaussian is calculated, followed by a vertical convolution, as depicted in Figure 3.5. MxNxKx2 multiplications are performed, where the Kx2 additions for each point in the matrix have to be done sequentially.

When performing a convolution, problems arise at the edges of the matrix, because values outside the matrix are needed. Therefore the matrix is extended, and the values at the edges of the original matrix are duplicated mirror-wise to the outside of the original matrix. An alternative is to use zeros for the values outside the matrix, but this leads to significantly smaller values at the edges and to false peaks in the next step.
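As an illustration, the mirrored extension could look as follows. This is a minimal sketch in the style of the pseudo code in this chapter; the exact index conventions of extend_h and the size of ext_E are assumptions (here ext_E is assumed to have K/2 extra rows on each side, matching the ext_E[i+k,j] indexing of Figure 3.5).

// Sketch of the mirrored extension (index conventions assumed)
extend_h(M, N, K, matrix E, matrix ext_E)
{
    for (i = 0; i < M; i++)        // copy the original values, shifted by K/2
        for (j = 0; j < N; j++)
            ext_E[i + K/2, j] = E[i,j];

    for (i = 0; i < K/2; i++)      // mirror the edge values outward
        for (j = 0; j < N; j++)
        {
            ext_E[K/2 - 1 - i, j] = E[i,j];
            ext_E[M + K/2 + i, j] = E[M - 1 - i, j];
        }
}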

In Matlab the separation was done by SVD decomposition (i.e., it was calculated numerically). However, performing the separation in the definition of the Gaussian leads to better accuracy of the results (floating point errors in the decomposition reduce the accuracy).

Figure 3.4: Matlab code defining the kernel matrix

[X, Y] = meshgrid([-(scale/2):.1:scale/2], [-(scale/2):.1:scale/2]);
Z_2D = exp(-(((X-x0)./sigma_x).^2 + ((Y-y0)./sigma_y).^2));

X = [-(scale/2):.1:scale/2];
Z_1D = exp(-(((X-x0)./sigma_x).^2));


Figure 3.5: Pseudo code of the 2D kernel convolution

//Input:  matrix E is the pair-wise space corrected for continuous ratios
//        and normalized
//        array V contains the kernel values
//Output: matrix G is the pair-wise space after the kernel
kernel(M, N, K, matrix E, array V, matrix G)
{
    ////////// horizontal //////////
    extend_h(E, ext_E);
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            F[i,j] = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < K; k++)
                F[i,j] += ext_E[i+k,j] * V[k];

    ////////// vertical //////////
    extend_v(F, ext_F);
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            G[i,j] = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < K; k++)
                G[i,j] += ext_F[i,j+k] * V[k];
}

3.4 Peaks

The PEAK function has to determine for each point whether it is a peak (Figure 3.6). The output is initialized to minus one, so the sub function EXPAND knows whether a point was already evaluated. The sub function EXPAND is implemented in the following way: each point in the matrix is compared with all its neighbors, and when the point is greater than its neighbors it is a peak. This is done in the first FOR loop of the EXPAND function (Figure 3.7). The second FOR loop checks for flat peaks, i.e. whether a neighbor has the same value. If this is the case, the neighbor has to be evaluated, again using the EXPAND function, to determine whether the point in the matrix is a peak. In this way all peaks are recorded in the matrix PEAK.

The complexity of this function is MxN. However, in case of a plateau (a number of equal values connected to each other), all points of the plateau have to be evaluated to determine whether the plateau is a peak. So in the worst case of one big plateau, all points have to be evaluated sequentially.

The Matlab code for finding peaks in the pair-wise space used memory inefficiently: for every peak that was found, the peak matrix was extended by one element, and a new memory allocation had to be performed. So if the result had many peaks, the calculation time increased. The Matlab code of the PEAK function took 60 percent of the total execution time.



Figure 3.6: Pseudo code of the peak finding algorithm

peak(M, N, matrix G, matrix PEAK)
{
    for (i = 0; i < M; ++i)
        for (j = 0; j < N; ++j)
            PEAK[i,j] = -1;

    for (i = 0; i < M; ++i)
        for (j = 0; j < N; ++j)
            expand(G, PEAK, i, j);
}

Figure 3.7: Pseudo code of the subroutine EXPAND

int expand(matrix G, matrix PEAK, x, y)
{
    //N_x denotes the x position of neighbor N
    //N_y denotes the y position of neighbor N
    if (PEAK[x,y] != -1)
        return PEAK[x,y];

    //code for normal peaks
    for all neighbors N
        if (G[x,y] < G[N_x,N_y])
        {
            PEAK[x,y] = 0;
            return 0;
        }
    PEAK[x,y] = 1;

    //code for flat peaks
    for all neighbors N
        if (G[x,y] == G[N_x,N_y])
            if (expand(G, PEAK, N_x, N_y) == 0)
            {
                PEAK[x,y] = 0;
                return 0;
            }
    return 1;
}


4 Implementation considerations

In this chapter some considerations for implementing the algorithm are discussed. These considerations lead to an implementation platform (4.2). But to make a good choice of platform, first the parallelism in the algorithm is explored (4.1).

4.1 Parallelism

The algorithm has many operations to perform, and therefore a lot of time is needed to execute it. The minimum number of arithmetic operations for every process step is shown in Table 4.1. From this table it can be concluded that which step costs the most in terms of operations depends on the values of P and K. When Kx2 > P the 2D kernel step costs the most, but when P is bigger, the pair-wise space and the covariance matrix cost the most.

The 2D kernel convolution step can be executed several times over the pair-wise space, for several kernel widths. This means that the 2D kernel convolution has some extra weight when it is compared to the other steps. However, for the (current) typical application most of the calculated kernel widths are very small compared to P, and not all kernel widths always have to be calculated.

When NxM arithmetic units are used, this algorithm can be executed fast: the remaining time complexity decreases to P and Kx2, except for the peak finding algorithm (because of its sequential behavior), as shown in Table 4.2. The sizes of M and N are too large to actually build NxM arithmetic units, but this shows the possibility of speeding up the algorithm when more operations can be performed at the same time.

Process step                   Comparisons   Additions     Multiplications   Divisions
Pair-wise space                M*N*P         M*N*P         -                 -
Covariance matrix              -             M*N*P+4*M*P   M*N*P             M*N+M
Covariance and normalization   2*M*N         -             M*N               M*N
2D kernel convolution          -             M*N*K*2       M*N*K*2           -
Peaks                          M*N*9         -             -                 -

Table 4.1: Number of arithmetic operations

Process step                   Arithmetic complexity   Time complexity
Pair-wise space                M*N*P                   P
Covariance and normalization   M*N*P                   P
2D kernel                      M*N*K*2                 K*2
Peaks                          M*N                     M*N

Table 4.2: Arithmetic and time complexity
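For illustration, take M = N = 1000, P = 100 and K = 100 (values within the ranges of Table 5.1; the specific numbers are only an example). The pair-wise space then needs M*N*P = 10^8 additions, the covariance matrix roughly the same, and the separated 2D kernel M*N*K*2 = 2x10^8 multiplications. With enough parallel hardware, the time complexity drops to P = 100, P = 100 and K*2 = 200 sequential steps respectively.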


4.2 Implementation platforms

There are a few possible implementation platforms, each having its advantages and disadvantages. For this algorithm the parallelism, speed, memory and flexibility of the platform are important. The possibility to construct a network with these platforms is also considered. These aspects are discussed for the general purpose processor, the Cell (microprocessor), the GPU and the FPGA.

4.2.1 General purpose processor

The general purpose processor is the most commonly used computational platform. It is often used because of its flexibility. Many high-level programming languages are available to make programming easier, but programming languages closer to the processor can also be used to obtain faster computations.

Parallelism in the general purpose processor can be achieved with instruction pipelining and SIMD (single instruction, multiple data). With these methods, instructions can be executed at the same time. Another possibility is to create threads. However, when the threads are executed on the same processor, the instructions are still executed sequentially and the threads provide only emulated parallelism. Today's machines can have multiple cores, where true parallelism can be achieved if the tasks are executed on different cores.

General purpose processors run at gigahertz clock frequencies, though most instructions take more than one clock cycle. The construction of a network with this platform is relatively easy, because such networks already exist and communicate over Ethernet (standard up to 1 gigabit).

The memory in a general purpose processor is layered, where cache memory is used for fast access and DRAM is used to store larger data sets. The cache helps when an address in the DRAM is needed several times in quick succession.

4.2.2 Cell (microprocessor)

The Cell is a relatively new computational platform, which uses one main processing element (PE) with multiple attached processing units (APUs). The PE is not the main processor, but is actually a controller for the APUs. These APUs (mostly fewer than ten) can each execute a thread, which increases the level of parallelism. The APUs can also be used sequentially, where each APU calculates part of the final product and then sends it to the next APU. The level of parallelism thus depends on the number of APUs.

Programming the Cell is somewhat harder, as knowledge of multithreading is necessary. When code written for a general purpose processor is run on the Cell, it is mainly executed on the PE, while the Cell's true power comes from the APUs. The program has to be rewritten to use threads and to define the memory sharing between the threads and between the threads and the PE. So the code has to be split up into threads to optimally use all APUs.

The Cell microprocessor, like the general purpose processor, can easily be connected to a network with Ethernet. The PE distributes the data to the APUs; this means the data distribution can be programmed, which gives a great memory advantage.


4.2.3 Graphic Processor Unit

The Graphic Processor Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor. GPUs were originally used only for video tasks and are located on the graphics card of a computer. A GPU consists of many small computational units. These computational units can function independently of each other, which creates good parallelism. To use a GPU the code has to be completely rewritten and cannot be used on other platforms; it uses special libraries and threads. This makes the flexibility of this platform poor.

The speed of a GPU is several gigahertz, though many instructions take more cycles to execute. Code written for the GPU handles multiplications and additions very well, but branching is very bad for the performance. The implementation in a network is relatively easy because many computers already have a GPU on the graphics card.

The computational units each have a small separate cache, in contrast to the Cell microprocessor. This leads to many cache misses, and thus reduced performance, when large data sets are processed.

4.2.4 Field programmable gate array

The field programmable gate array (FPGA) is currently the most used kind of programmable hardware. On this implementation platform the hardware is reconfigurable and can be tailored to the application. To program the FPGA, the languages VHDL and Verilog are mostly used. Because all hardware must be described, many lines of code are needed to implement a design. However, the code can be reused on other FPGA boards, which gives good flexibility, and the hardware can be optimized for the design.

The parallelization depends on the number of logic elements that can be implemented on the FPGA. All hardware components (adders, multipliers) can be created, but are limited by the hardware resources of the FPGA. In this way much parallelism can be achieved.

The FPGA runs at a clock frequency of 100-500 megahertz, but in every cycle many hardware components can operate at the same time. FPGA boards mostly have gigabit Ethernet on board, which makes them useful in a network.

FPGAs have different kinds of memory: SDRAM, block RAM and flip-flops. Block RAM is an array of small pieces of memory that can be read or written at the same time. Most importantly, block RAM and flip-flops can be accessed in parallel with each other, so many computations can be done in parallel.


4.2.5 Conclusion

The algorithm can easily be parallelized, but then a lot of hardware is needed. So a platform with many arithmetic operations per second has to be chosen. The Cell can deal with many operations in parallel at very high speed, but is restricted by the number of APUs (mostly fewer than ten). GPUs have more computational units, but cannot handle branches very well. These two platforms execute instructions sequentially, where many instructions take more than one clock cycle.

The FPGA is a factor of ten slower in terms of clock frequency, but can do more instructions in parallel, restricted only by the hardware resources needed to create a computational unit. An FPGA can, for example, do an if-statement, an addition and a multiplication in one clock cycle. This is possible due to the parallel memory. This makes the slower clock rate of the FPGA, compared to the Cell and the GPU, less important (provided enough instructions can be done in parallel).

Because the hardware is configurable and the FPGA can do more in parallel, it is the most suitable hardware platform. Since possibly not the whole algorithm can be implemented in hardware, and data has to be transmitted to the board, the general purpose processor is also used. Another advantage is that networks of computers can be used.


5 Architecture

In this chapter the global architecture is described. First it is discussed which parts are implemented in hardware and why (5.1 and 5.2). Then the software architecture is described and explained (5.3). Finally, all hardware blocks are discussed, and how they are connected to make an optimal design (5.4).

5.1 Partitioning of the algorithm

The first intention was to implement the whole design in hardware. However, the total hardware needed to implement the complete design is too large, because MxN hardware units would be needed. Therefore the algorithm was redesigned to divide the pair-wise space and calculate smaller pieces. The smallest pieces that can be created depend on the kernel size, and this kernel size is too big to implement in hardware (Table 5.1), because KxKxPx2 elements would have to be stored to calculate one value of the pair-wise space. This amount of memory is not available on the FPGA.

Because the whole algorithm cannot be implemented in hardware due to the limited size of the memory, the kernel convolution is implemented in software running on the general purpose processor. Then only the values needed to calculate one point in the pair-wise space have to be stored on the FPGA. These Px2 values consist of a column of matrix A and a column of matrix B (this is further explained in 5.2).

     Meaning                       Size
M    Length of chromosome arm A   100-10000
N    Length of chromosome arm B   100-10000
P    Number of tumor DNAs         1-300
K    Kernel matrix size           1-1000

Table 5.1: Sizes of the matrices
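For illustration, in the worst case of Table 5.1 (K = 1000, P = 300), KxKxPx2 = 6x10^8 elements would have to be stored to compute one output point, far more than the on-chip memory of an FPGA. Storing only one column of A and one column of B requires at most Px2 = 600 elements per computational unit.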


5.2 Software versus hardware

Due to the partitioning of the algorithm, the hardware/software split is almost determined. The kernel convolution has to stay on the general purpose processor. Divisions will also stay on the computer, because divisions are very area- and time-consuming on an FPGA. This means that the normalization, the division in the MULT_COV function and the MEAN function are not implemented in hardware.

Figure 5.1 depicts the pseudo code that will be implemented in hardware. The code inside the two outer loops will be implemented as one computational unit. With one column of matrix A and one column of matrix B, the MINS and COVAR functions can be computed, and the results of these functions can be multiplied to create one output value for the pair-wise space (E_interm).

Figure 5.1: Pseudo code of to_fpga function

to_fpga(M, N, P, matrix A, matrix B, array MEAN_A, array MEAN_B, matrix E_interm)
{
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
        {
            // this will be implemented on the FPGA
            // this is 1 computational unit

            // the mins function
            D[i,j] = 0;
            for (k=0; k<P; k++)
                D[i,j] += MIN(A[i,k], B[j,k]);

            // the covar function
            COV[i,j] = 0;
            for (k=0; k<P; k++)
                COV[i,j] += (A[i,k] - MEAN_A[i]) * (B[j,k] - MEAN_B[j]);

            // first part of cov_norm function
            if (COV[i,j] < 0)
                E_interm[i,j] = 0;
            else
                E_interm[i,j] = D[i,j] * COV[i,j];
        }
}

E_interm = FPGA(A, B)


5.3 Software

The software runs on the general purpose processor. As described in the previous paragraph, most of the computations are transferred to hardware. The software is important because it sends data to and receives data from the FPGA, and executes the smaller functions MEAN, NORM and MAXIM. The biggest computational step that remains on the computer is the 2D kernel convolution. The main software architecture is described in Figure 5.2. It shows that the MEAN function is extracted from the COVAR function and computed before the FPGA starts.

The 2D kernel convolution should be as fast as possible, because it stays on the computer. Therefore this function is optimized by using pointers and by writing the plain C code to be as fast as possible (considering the memory mapping of the matrices). This optimization also uses the symmetric values in the Gaussian kernel, by which a factor of 2 is saved in multiplications.
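A minimal sketch of this symmetry optimization for the horizontal pass (assuming K is odd and the kernel is symmetric, i.e. V[k] equals V[K-1-k]; the surrounding names follow Figure 3.5):

// Two mirrored input values share one multiplication with V[k],
// halving the number of multiplications.
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
    {
        sum = ext_E[i + K/2, j] * V[K/2];              // center tap
        for (k = 0; k < K/2; k++)                      // symmetric taps
            sum += (ext_E[i+k, j] + ext_E[i+K-1-k, j]) * V[k];
        F[i,j] = sum;
    }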

The fpga function must send packets to and receive packets from the FPGA. This is done by sending UDP over Ethernet and is implemented in the fpga function as depicted in Figure 5.3. The Send_A function sends computational_units columns of matrix A. The Send_B function sends as many columns of matrix B as fit in a packet of 1024 bytes (P_per_packet), because it takes about the same time to send a packet of 1 byte as a packet of 1024 bytes (all headers included). The Receive function returns computational_units x P_per_packet values, which are stored in the matrix E_interm.

Figure 5.2: Software architecture

main()
{
    mean(A);
    mean(B);
    fpga();
    norm();
    kernel();
    peak();
}

Figure 5.3: Pseudo code of fpga function

fpga(M, N, matrix A, matrix B, array MEAN_A, array MEAN_B, matrix E_interm)
{
    for (i = 0; i < M; i += computational_units)
    {
        Send_MEAN_A();
        Send_A();
        for (j = 0; j < N; j += P_per_packet)
        {
            Send_MEAN_B();
            Send_B();
            E_interm = Receive();
        }
    }
}
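As an illustration of the transport underneath the Send and Receive functions, a minimal host-side sketch using POSIX UDP sockets; the IP address, the port number and the helper names are hypothetical, since the thesis does not state them:

    #include <arpa/inet.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Hypothetical values: the static IP and UDP port of the FPGA
     * board are not stated in this chapter. */
    #define FPGA_IP   "192.168.0.2"
    #define FPGA_PORT 1024

    /* Open a UDP socket and fill in the FPGA's address. */
    static int fpga_socket(struct sockaddr_in *fpga)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        memset(fpga, 0, sizeof(*fpga));
        fpga->sin_family = AF_INET;
        fpga->sin_port = htons(FPGA_PORT);
        inet_pton(AF_INET, FPGA_IP, &fpga->sin_addr);
        return s;
    }

    /* Send one payload of up to 1024 bytes and block for the answer. */
    static ssize_t fpga_exchange(int s, const struct sockaddr_in *fpga,
                                 const void *tx, size_t txlen,
                                 void *rx, size_t rxlen)
    {
        sendto(s, tx, txlen, 0, (const struct sockaddr *)fpga, sizeof(*fpga));
        return recv(s, rx, rxlen, 0);
    }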


As discussed earlier, the sub-function MEAN is excluded from the COVAR function because of its dividers. This means that the mean value of each column has to be sent to the FPGA before the column itself is sent. The COV_NORM function has to be changed into the NORM function. The extra division by P that is extracted from the MULT_COV sub-function is also implemented in this NORM function (Figure 5.4). This function is effectively the second step of the COV_NORM function, where the factor P in the divisor implements the extra division.

Normally the sequence is Send_B followed by Receive. To create a double buffer, one extra Send_B is executed before the loop and one extra Receive after it (Figure 5.5). In this way the buffer in the FPGA is always full, and the FPGA can handle the next packet right after the previous one.

Figure 5.4: New NORM function

Figure 5.5: Pseudo code of fpga_buffer function

fpga_buffer(M, N, matrix A, matrix B, array MEAN_A, array MEAN_B, matrix E_interm)
{
    for (i = 0; i < M; i += computational_units)
    {
        Send_MEAN_A();
        Send_A();
        // fill the double buffer: send the first packet of B up front
        Send_MEAN_B();
        Send_B();
        for (j = P_per_packet; j < N; j += P_per_packet)
        {
            Send_MEAN_B();
            Send_B();
            E_interm = Receive();
        }
        // receive the result of the last packet
        E_interm = Receive();
    }
}

norm(M, N, matrix E_interm, array Na, array Nb, matrix E)
{
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
        {
            temp = Na[i] + Nb[j];
            if (temp == 0)
                E[i,j] = 0;
            else
                E[i,j] = E_interm[i,j] / (temp * P);
        }
}


5.4 Hardware

The hardware architecture determines the performance of the design. The communication hardware must use as little area as possible, so that many computational units can be implemented. This can be achieved with the architecture depicted in Figure 5.6. The components used are listed below:

• External PHY: The external PHY sends and receives frames on the Ethernet network, which is connected to the computer. This component processes the frames and communicates with the TEMAC through a GMII interface.

• The TEMAC: This component supports 1000 Mb/s operation in half and full duplex. But the most important aspect is the minimal area needed by the TEMAC: it consists only of an EMAC (present on most boards), a DCM (to generate multiple clock signals), some buffers and minimal logic resources [4].

• The RX and TX FIFOs: These first-in-first-out memory blocks buffer the incoming and outgoing frames; the RX FIFO buffers received frames and the TX FIFO buffers frames to be sent. Besides buffering, the FIFOs are used to synchronize the different clock domains, because the TEMAC operates at different speeds while the accelerator and the Ethernet controller operate at 150 MHz. The FIFOs are implemented with block RAM modules, which cost no extra logic resources [4].

• Ethernet controller: The Ethernet controller transforms packets into raw data and raw data into packets. This includes reading packet headers and extracting the data from the RX FIFO, but also creating packet headers with the MAC address and IP of the computer and inserting the data. The Ethernet controller also configures the TEMAC.

• Accelerator: This component performs the real computations, with as much parallelism as possible. To allow many parallel computations, the control signals and registers are minimized [5, 6].

Figure 5.6: Total hardware architecture


6 Implementation in hardware

This chapter describes the implementation in hardware. There are two main hardware components: the Ethernet communication (6.1) and the accelerator (6.2). These implementations are visualized and explained. The section on the Ethernet communication describes not only the Ethernet controller, the FIFOs and the TEMAC, but also the packet types and headers. The section on the accelerator discusses the architecture and the computational units. Finally the computational hardware is explained (6.3).

6.1 Ethernet communication

6.1.1 The TEMAC and the FIFOs

This part of the communication consumes almost no area when implemented [4]. The TEMAC is configured for 1-gigabit communication with half or full duplex. It auto-negotiates with the connected computer, but only operates at gigabit speed. The TEMAC sends its frames directly to the RX FIFO, where they are buffered (Figure 6.1). The final byte has rx_valid set to zero to mark the end of a frame. The TEMAC reads the TX FIFO when tx_fifo_lock_n is low. When tx_fifo_lock_n is set, the Ethernet controller can write to the TX FIFO. At the end of a frame, 2 zero bytes must be added to the TX FIFO, with tx_valid set to zero to mark the end of the frame [8].

Figure 6.1: Communication between the TEMAC and the FIFOs


6.1.2 Communication data format

The data used in the algorithm consists of doubles. However, to keep the FPGA implementation as small as possible, these doubles are converted to a fixed-point representation. The fixed-point number is a 24-bit value with 16 fractional bits. This means a value of matrix A or B must be smaller than 256 (8 integer bits). 256 is a relatively high limit, because the value represents a ratio in log2 form. So every double is converted to three bytes.

When data is received from the FPGA, 24 bits are not enough, because P additions have been accumulated. Therefore the result is 32 bits with 16 fractional bits. These 32-bit values are converted back to doubles and the algorithm continues.
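A minimal C sketch of these conversions, assuming two's-complement fixed-point values; the helper names are hypothetical:

    #include <math.h>
    #include <stdint.h>

    /* Double -> 24-bit fixed point (16 fractional bits). The result is
     * kept in the low 24 bits of an int32_t so it can be sent as three
     * bytes. */
    static int32_t to_fixed24(double x)
    {
        return (int32_t)lround(x * 65536.0) & 0xFFFFFF;
    }

    /* 32-bit FPGA result (16 fractional bits) -> double. */
    static double from_fixed32(int32_t v)
    {
        return (double)v / 65536.0;
    }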

6.1.3 The Ethernet controller

The Ethernet controller consists of two parts: the receiver and the sender. The receiver has to accept two types of packets: ARP and UDP. The FPGA board is configured with a static IP address. When an ARP request is received from the computer asking for the MAC address of the FPGA board, the Ethernet controller must answer with an ARP reply. This reply is predefined in the FPGA; the controller sends its MAC address back when its own IP address is found in the ARP request.

When a UDP packet is received, the MAC address is checked. When the packet is correct, the length of the data is stored and the header is skipped. Three bytes are buffered (24 bits) and a flag is raised to indicate that the accelerator can use the buffered value. When the accelerator has read the value, it lowers the flag and the Ethernet controller buffers the next value. When all bytes of the frame have been read, the process starts all over.
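As an illustration, a sketch of how three payload bytes could map onto one 24-bit value; the big-endian byte order and the function name are assumptions, since the thesis does not specify them:

    #include <stdint.h>

    /* Reassemble one 24-bit fixed-point value from three consecutive
     * payload bytes (assumed most significant byte first). */
    static int32_t bytes_to_fixed24(const uint8_t b[3])
    {
        int32_t v = ((int32_t)b[0] << 16) | ((int32_t)b[1] << 8) | (int32_t)b[2];
        if (v & 0x800000)      /* sign-extend negative 24-bit values */
            v -= 0x1000000;
        return v;
    }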

The sender of the Ethernet controller works separately from the receiver to increase performance. When a complete UDP packet has been received, a UDP packet with the result of the accelerator, which is buffered in registers, is sent back to the computer.


6.2 The accelerator

The main accelerator architecture consists of a number of computational units (Figure 6.2). These units can work in parallel and are kept as small as possible. The accelerator controller regulates the computational units; this is necessary because Data IN and Data OUT are communicated sequentially with the Ethernet controller. The accelerator controller manages G computational units and works in three main stages: receive (6.2.1), calculate (6.2.2) and buffer. A short overview of the stages:

1. Receive: G columns of matrix A and the G mean values of these columns are loaded into the units.

2. Calculate: The mean value of a column of matrix B is stored in a register, and the values of the column are iterated to calculate the MINS and MULT_COV functions. Finally the results of these functions are multiplied with each other and buffered in registers.

3. Buffer: The results are buffered in registers and wait to be sent over the Ethernet.

Figure 6.2: Accelerator architecture (the accelerator controller distributes the sequential Data IN stream over computational units 1 to G and collects their Data OUT)

The architecture of the computational unit is described in Figure 6.3. A column vector of matrix A is stored in the block RAM [6]. Mean_A, Result Mins and Result Covar are registers. Mean_A holds the mean value of the vector stored in the block RAM. Result Mins is the temporary result of the MINS function and Result Covar is the temporary result of the MULT_COV function.


Figure 6.3: Computational unit
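As a sketch, the per-unit state can be modelled in software as follows; the struct and field names are hypothetical and mirror Figure 6.3:

    #include <stdint.h>

    #define P_MAX 128  /* hypothetical upper bound on the column length P */

    /* State of one computational unit: one column of A in block RAM
     * plus the Mean_A, Result Mins and Result Covar registers, all in
     * the fixed-point format of Section 6.1.2. */
    typedef struct {
        int32_t block_ram[P_MAX];  /* column vector of matrix A     */
        int32_t mean_a;            /* mean of that column           */
        int32_t result_mins;       /* running MINS accumulator      */
        int32_t result_covar;      /* running MULT_COV accumulator  */
    } comp_unit;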

6.2.1 Receive

This stage consists of loading a column of matrix A into the block RAM. Because all Data IN arrives sequentially, this stage cannot be done in parallel. Before each block RAM is filled, the Mean_A register is filled. This process is visualized in Figure 6.4, where G denotes the number of computational units and P denotes the number of persons and thus the length of the column. To fill all block RAMs, two nested loops are necessary, which makes this stage relatively slow. However, after this stage the other two (faster) stages can be executed many times.

6.2.2 Calculate

In this stage all computational operations are performed (Figure 6.5). For this stage the column vector of matrix B and its mean need to be sent. First the mean value of the vector is stored in the Mean register. Thereafter each value of column B is used to calculate one step of the MINS and MULT_COV functions. Because the computations in all computational units are done in parallel, a large computation speed is achieved. After the whole column has been processed, the two results are multiplied with each other, and the output is set to zero if Result Covar is negative.


Figure 6.4: Receive process

    Mean_B = Data IN
    for (p = 0; p < P; p++)
    {
        Result Mins[g]  += MIN(Block ram[g,p], Data IN)
        Result Covar[g] += (Block ram[g,p] - Mean_A[g]) * (Data IN - Mean_B)
    }
    if (Result Covar[g] < 0)
        Data OUT[g] = 0
    else
        Data OUT[g] = Result Mins[g] * Result Covar[g]
    Buffer = 1

Figure 6.5: Calculate process (executed in parallel for every computational unit g, 0 ≤ g < G)


6.3 Computational hardware

In this section the hardware for the real computations is discussed. Because the implementation of the communication part consumes few resources, the computational hardware can use almost the whole FPGA (85%). But to allow many parallel computations, the area of each component must be kept small.

Computing Result Mins requires one if-then-else statement, an addition, a read from memory and a write to memory. In hardware these can all be done at the same time, because the block RAM, the registers and the computational hardware can be used in parallel. The if-then-else statement is implemented with a comparator and the addition with an adder (Figure 6.7, left). Result Covar can be computed independently of Result Mins and is visualized in Figure 6.7 (right). The final calculation of Data OUT is shown in Figure 6.8. Because many computational units can be used at the same time, a real acceleration is achieved.

One important side note has to be made: the computational hardware is implemented in a 32-bit fixed-point representation (16 fractional bits). In this way the computational hardware can be kept small in terms of logic resources and more computational units can be implemented [7].
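For reference, a minimal C model of one such fixed-point multiplication; this is a software sketch of what the DSP48-based multiplier computes, not the actual hardware description:

    #include <stdint.h>

    /* Multiply two 32-bit fixed-point numbers with 16 fractional bits.
     * The 64-bit product carries 32 fractional bits, so shifting right
     * by 16 restores the 16-fractional-bit format; the cast keeps the
     * low 32 bits, as the hardware does. */
    static int32_t fx_mul(int32_t a, int32_t b)
    {
        return (int32_t)(((int64_t)a * (int64_t)b) >> 16);
    }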

Figure 6.7: Calculate Result Mins (left: a comparator and a multiplexer select the minimum of the block RAM value and Data IN, and an adder accumulates it into Result Mins) and calculate Result Covar (right: two subtractors compute Data IN - Mean B and block RAM - Mean A, a multiplier multiplies them, and an adder accumulates into Result Covar)

Figure 6.8: Calculate Data OUT (a multiplier computes Result Mins × Result Covar; a comparator and a multiplexer output zero when Result Covar is negative)


7 Results

This chapter presents the results of the hardware implementation. First the platform and the resource usage are described (7.1). The results then cover the following subjects: the improvement in execution time (7.2), the scalability of the hardware acceleration (7.3) and the verification (7.4).

7.1 Platform and resources

To obtain the results, the following platforms were used:

• Computer: Acer Aspire T650 with two Pentium® 4 CPU 3.06 GHz processors and 1024 MB of memory. The computer runs Windows XP Professional (service pack 3).

• FPGA: Virtex®-4 XC4VFX12 device on an ML403 board. It runs on a 100 MHz clock, and the FPGA configuration is loaded from a 512-MB compact flash card.

On the ML403 at most 9 computational units can be implemented. The resource usage is listed in Table 7.1. As many DSP48 components as possible, which are used to implement the multipliers, are used to save slices [7].

Resource          Used    Total   Ratio
Slices            4948     5472     90%
Slice registers   3401    10944     31%
4-input LUTs      7294    10944     67%
Block RAMs          11       36     31%
DSP48s              30       32     94%
EMACs                1        1    100%
DCMs                 3        4     75%

Table 7.1: Resources used for the accelerator hardware


7.2 Execution times

Three versions were measured: KCSMART v7, KCSMART v8 software and KCSMART v8. KCSMART v7 is the original MATLAB code. KCSMART v8 software is the plain C translation of the pseudo code described in this report (a straightforward, unoptimized C implementation). The final version, with the FPGA implementation and the optimized kernel convolution, is KCSMART v8.

The execution times of the algorithm for relatively small input matrices and different kernel widths are shown in Figure 7.1. Normally the execution time grows with the kernel width. This is the case for both KCSMART v8 versions, where the hardware implementation is two times faster. But as can be seen in the figure, for a small kernel width the MATLAB implementation was incredibly slow. This is explained by the slow peak-finding step: when a small kernel is applied, more peaks remain in the pair-wise space, and this slows down the execution.

The execution times for one small and one big input matrix and different kernel widths are shown in Figure 7.2. When this figure is compared with the execution times for the smaller input matrices, it can be concluded that the speedup achieved by the FPGA implementation grows for larger matrices and smaller kernel widths. The speedup for the larger kernel sizes, however, remains the same (a factor of three). This shows the trend of the speedups when larger input matrices are calculated.

Figure 7.1: Execution time of two small input matrices (M = 1758, N = 1859, P = 95; 9 computational units). The plot shows execution time (sec, 0 to 250) against kernel width (0.2 to 20, ×10^6) for KCSMART v7, KCSMART v8 software and KCSMART v8.

Figure 7.2: Execution time of one small and one big input matrix (M = 1758, N = 5219, P = 95; 9 computational units). The plot shows execution time (sec, 0 to 2500) against kernel width (0.2 to 20, ×10^6) for the same three versions.


7.3 Scalability

The scalability of KCSMART v8 was measured (Figure 7.3). This is an important aspect of the FPGA design, because if a design is scalable, the execution time can be predicted from the input matrix sizes. As expected, the execution time of the FPGA is linear in the size of the output matrix (M×N). It can also be concluded that the execution time is inversely proportional to the number of computational units: when the number of computational units is doubled, the execution time is halved.
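A simple model consistent with these measurements is, as a sketch (k is an implementation-dependent constant covering the clock period and the communication overhead):

    T_FPGA ≈ k · (M · N · P) / G

so the execution time grows linearly with the output size M×N and is halved when the number of computational units G is doubled.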

Figure 7.3: FPGA execution time for different input sizes (P = 95). The plot shows execution time (sec, 0 to 300) against output size M×N (0 to 3·10^7) for 9, 6 and 4 computational units.

7.4 Verification

One of the most important steps in an implementation is the verification. Because the hardware design moves from a floating-point to a fixed-point representation, errors are introduced. However, with 16 fractional bits the quantization step is 2^-16 ≈ 1.5·10^-5, so this error stays relatively small. This can be concluded from Table 7.2, which lists the maximum errors when the peak heights of the KCSMART v8 software version and the KCSMART v8 version are compared. Despite these errors, both implementations find the same peaks, so the hardware design is verified.

M      N      Max. error
1758   1859   0.00076%
1758   5219   0.00065%
1859   5219   0.00048%
1758   7403   0.00083%
1859   7403   0.00073%

Table 7.2: Errors between KCSMART v8 software and KCSMART v8


8 Conclusions and future work

In this chapter the final conclusions of the thesis are described (8.1). The future work section (8.2) presents the main improvements that can still be made.

8.1 Conclusions

This master thesis focused on improving an algorithm to find co-occurring aberrations in DNA strings. Due to unbalanced translocations, Copy Number Alterations occur in DNA. These are detected with array Comparative Genomic Hybridization, which measures the ratio of the number of gene copies between 'healthy DNA' and 'tumor DNA'. The algorithm to calculate the aberrations from these ratios was already designed and implemented in MATLAB, but it takes a long time to compute the locations of the co-occurring aberrations. The main reason is that many computations have to be performed, because the DNA strings are long. So to improve the algorithm, the computations have to be sped up.

The algorithm consists of four big steps: the pair-wise space, covariance and normalization, 2D kernel convolution, and peak finding. When calculating the pair-wise space, the ratios of the healthy and tumor DNAs of chromosome arms A and B are compared: for each sample the minimum of the two ratios is taken, and these minima are summed over all tumor samples. Then the pair-wise space is multiplied with the covariance matrix, to lose all continuous values in the ratios, and divided by a normalization matrix. The kernel convolution is applied to the result to look for local enrichment. Finally, the peaks are sought to find the most co-occurring aberrations in the DNA strings.

The algorithm offers many opportunities for parallel execution when more hardware computational units become available. Therefore an FPGA was chosen, because of its parallelization possibilities and computational power. The partitioning is done by sending columns of the two different chromosome arms: for every pair of columns of chromosome A and chromosome B, one output of the pair-wise space can be computed. Therefore the computation of the pair-wise space and the computation of the covariance matrix are done on the FPGA. All divisions are left on the general-purpose processor, because they take too much time and area on the FPGA. The kernel convolution and the peak-finding step are also left on the general-purpose processor, because an implementation of the kernel convolution would take too much area.

An accelerator was designed on the FPGA to implement the computations. This accelerator receives UDP Ethernet frames through a purpose-built communication controller. The frames are buffered, and the accelerator controller requests the next data value when needed. The accelerator controller manages the computational units, which work with 32-bit fixed-point numbers (16 fractional bits). The implemented design on the ML403 platform consists of 9 computational units. The design is scalable and can thus be used on a larger FPGA to increase performance.

In conclusion, the performance of the algorithm to find co-occurring aberrations in DNA has been increased, with a minimum speedup of a factor 3 and a maximum of several hundred. The design can be improved further by using larger FPGA boards.


8.2 Future work

Future work on this accelerator should focus on two main aspects: mapping more computational units and increasing the communication speed. To increase the number of computational units, the design can be mapped to a larger FPGA. Another advantage of a larger FPGA is that the clock speed can increase. In this way the execution time can easily be reduced.

The second way to reduce the computation time is to increase the communication speed. The design communicates over gigabit Ethernet; however, only 25 megabytes per second is achieved when all data is sent to the FPGA and the computer does not wait for the FPGA's answer. This means the software implementation that sends the data over the Ethernet should be improved. The communication speed can be increased further by using RocketIO (available on more advanced FPGAs), which can communicate at 10 gigabits per second [5].

Another approach to improve the performance is to move the code that is left on the general-purpose processor to another platform. For example, the GPU is a good platform for implementing the kernel convolution. In this way the remaining computations can also be accelerated.


Bibliography

[1] Jan Bot, Grid Usecase BioMed, seminar (2008)

[2] Jeroen de Ridder, Jaap Kool, Co-occurrence analysis of insertional mutagenesis data reveals cooperating oncogenes, Vol. 23, ISMB/ECCB (2007)

[3] P. Hupé, N. Stransky, Analysis of array CGH data: from signal ratio to gain and loss of DNA regions, Bioinformatics, Vol. 20, no. 18, pp. 3413–3422 (2004)

[4] Xilinx system engineering group, Minimal Footprint Tri-Mode Ethernet MAC Processing Engine, Xilinx application note XAPP807 (2007)

[5] Xilinx system engineering group, RocketIO™ Transceiver User Guide, Xilinx user guide UG024 (2007)

[6] Xilinx system engineering group, Virtex-4 FPGA User Guide, Xilinx user guide UG070 (2008)

[7] Xilinx system engineering group, XtremeDSP for Virtex-4 FPGAs, Xilinx user guide UG073 (2009)

[8] Xilinx system engineering group, Virtex-4 FPGA Embedded Tri-Mode Ethernet MAC, Xilinx user guide UG074 (2009)

[9] Zhang F, Gu W, Copy number variation in human health, disease, and evolution, Annual Review of Genomics and Human Genetics, Vol. 10, pp. 451–481 (2009)