Computer Engineering Mekelweg 4,
2628 CD Delft The
Netherlands
http://ce.et.tudelft.nl/
2009
MSc THESIS
FPGA Hardware acceleration of co-occurring
aberrations in aCGH data
Marco R. van der Leije
Abstract
CE-MS-2009-17
Unbalanced translocations can lead to additions and deletions in genes, which can be an indication of tumor cells. These are measured with array Comparative Genomic Hybridization (aCGH). To find co-occurring aberrations in DNA, an algorithm was designed; however, its execution takes days. This thesis proposes a partial FPGA-based design in which the number of parallel computations can be increased. The FPGA communicates with a computer over gigabit Ethernet, for which a hardware-based Ethernet controller is built on the FPGA. The design is scalable, so its performance is linear in the size of the FPGA's resources. On an XC4VFX12 device, a minimum speedup of a factor 3 and a maximum speedup of several hundred is achieved.
FPGA Hardware acceleration of co-occurring
aberrations in aCGH data
THESIS
submitted in partial fulfilment of the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER ENGINEERING
by
Marco R. van der Leije
born in Rotterdam, The Netherlands
Computer Engineering
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology
FPGA Hardware acceleration of co-occurring
aberrations in aCGH data
by Marco van der Leije
Abstract
Unbalanced translocations can lead to additions and deletions in genes, which can be an indication of tumor cells. These are measured with array Comparative Genomic Hybridization (aCGH). To find co-occurring aberrations in DNA, an algorithm was designed; however, its execution takes days. This thesis proposes a partial FPGA-based design in which the number of parallel computations can be increased. The FPGA communicates with a computer over gigabit Ethernet, for which a hardware-based Ethernet controller is built on the FPGA. The design is scalable, so its performance is linear in the size of the FPGA's resources. On an XC4VFX12 device, a minimum speedup of a factor 3 and a maximum speedup of several hundred is achieved.
Laboratory : Computer Engineering
Codenumber : CE-MS-2009-17
Committee Members :
Advisor: Arjan J. van Genderen, CE, TU Delft
Advisor: Marcel J.T. Reinders, ICT, TU Delft
Chairperson: Koen Bertels, CE, TU Delft
Member: Georgi N. Gaydadjiev, CE, TU Delft
Member: Jeroen de Ridder, ICT, TU Delft
Contents

List of Figures
List of Tables
Acknowledgements

1 Introduction
   1.1 Aberrations in aCGH data
   1.2 Problem statement
   1.3 Project goals and approach
   1.4 Chapter overview

2 The Algorithm
   2.1 Inputs and outputs
   2.2 Pair-wise space
   2.3 Covariance and normalization
   2.4 2D kernel
   2.5 Peaks

3 Implementation in software
   3.1 Pair-wise space
   3.2 Covariance and normalization
   3.3 2D kernel
   3.4 Peaks

4 Implementation considerations
   4.1 Parallelism
   4.2 Implementation platforms
      4.2.1 General purpose processor
      4.2.2 Cell (microprocessor)
      4.2.3 Graphic Processor Unit
      4.2.4 Field programmable gate array
      4.2.5 Conclusion

5 Architecture
   5.1 Partitioning of the algorithm
   5.2 Software versus hardware
   5.3 Software
   5.4 Hardware

6 Implementation in hardware
   6.1 Ethernet communication
      6.1.1 The TEMAC and the FIFO's
      6.1.2 Communication data format
      6.1.3 The Ethernet controller
   6.2 The accelerator
      6.2.1 Receive
      6.2.2 Calculate
   6.3 Computational hardware

7 Results
   7.1 Platform and resources
   7.2 Execution times
   7.3 Scalability
   7.4 Verification

8 Conclusions and future work
   8.1 Conclusion
   8.2 Future work

Bibliography
List of Figures

Figure 1.1: Loss of a tumor suppressor gene and the gain of an oncogene
Figure 1.2: Tumor DNA's from different persons
Figure 1.3: Co-occurring alterations in tumor DNA
Figure 2.1: The inputs
Figure 2.2: The output
Figure 2.3: Pre computation (pseudo code)
Figure 2.4: Calculate the minimums
Figure 2.5: Sums the C matrices
Figure 2.6: The NORM matrix
Figure 2.7: Normalize the pair-wise space
Figure 2.8: Normal 2D kernel convolution
Figure 2.9: Separated 2D kernel convolution
Figure 2.10: Finding peaks
Figure 3.1: Pseudo code of calculating the pair-wise space
Figure 3.2: Pseudo code of the covariance and normalization step
Figure 3.3: Pseudo code of calculating the covariance matrix
Figure 3.4: Matlab code defining the kernel matrix
Figure 3.5: Pseudo code of the 2D kernel convolution
Figure 3.6: Pseudo code of the peak finding algorithm
Figure 3.7: Pseudo code of the subroutine EXPAND
Figure 5.1: Pseudo code of to_fpga function
Figure 5.2: Software architecture
Figure 5.3: Pseudo code of fpga function
Figure 5.4: New NORM function
Figure 5.5: Pseudo code of fpga_buffer function
Figure 5.6: Total hardware architecture
Figure 6.1: Communication TEMAC-FIFO's
Figure 6.2: Accelerator architecture
Figure 6.3: Computational unit
Figure 6.4: Receive process
Figure 6.5: Calculate process (0 < g < G)
Figure 6.7: Calculate Result Mins (left) and calculate Result Covar (right)
Figure 6.8: Calculate Data OUT
Figure 7.1: Execution time of two small input matrices
Figure 7.2: Execution time of one small and one big input matrix
Figure 7.3: FPGA execution time for different input sizes
List of Tables

Table 4.1: Number of arithmetic operations
Table 4.2: Arithmetic and time complexity
Table 5.1: Sizes of the matrices
Table 7.1: Resources used for the accelerator hardware
Table 7.2: Errors between KCSMART v8 software and KCSMART v8
Acknowledgements
This report is the result of several months of work on improving the discussed algorithm. It was a great challenge in which many aspects of design came along; without help it would have been very hard to realize.
I would like to thank Arjan van Genderen for his help and his trust in me. He gave good advice and gave me much freedom to work at home. He also critically checked my first report and gave some great advice there. I would also like to thank Marcel Reinders and Jeroen de Ridder; they explained the algorithm and helped me to stay on the right track. Finally I want to thank Chris Klijn, because it was his Matlab implementation that I started from. He gave insights into the algorithm and explained everything I needed to get started.
Marco R. van der Leije
Delft, The Netherlands
August 20, 2009
1 Introduction
This report discusses the acceleration of a process that is used to find co-occurrences of alterations in DNA strings. First some background information on these alterations is given (1.1). To find the co-occurrences of these alterations, an algorithm has been designed. This algorithm takes a lot of execution time, which leads to the problem statement (1.2). Thereafter the project goals and approach are explained (1.3). Finally a chapter overview is given in paragraph 1.4, which explains the structure of this report.
1.1 Aberrations in aCGH data
Genomic instability is often observed in tumor cells [9]. This instability can lead to the loss of a tumor suppressor gene and the gain of an oncogene, which is called an unbalanced translocation (Figure 1.1). This means that deletions and additions of DNA pieces can occur. Where normally the genes come in pairs, tumor cells have more (or fewer) copies of the same genes. These abnormal numbers of genes are called copy number alterations (CNA's). These alterations are interesting to find, because they can help in tumor research and other DNA studies.
One of the procedures to measure CNA's is array Comparative Genomic Hybridization (aCGH) [3]. This method compares 'healthy DNA' with 'tumor DNA': a ratio of the number of genes between 'healthy DNA' and 'tumor DNA' is calculated. This ratio is usually represented in log2 form, where positive numbers represent a gain in the number of genes and negative numbers represent a loss (compared to 'healthy DNA').
Figure 1.1: Loss of a tumor suppressor gene and the gain of an oncogene
1.2 Problem statement
There are many analyses that focus on finding CNA's, but most of them look for single-location variations. Figure 1.2 displays three different tumor DNA's, where each arrow represents the ratio (between the number of genes of healthy and tumor DNA). The analyses search for shared alterations in the DNA. In this example these analyses will find a large variation peak at position four (the fourth arrow of all tumor DNA's is high) and a smaller variation peak at position two.
However, not all single DNA changes lead to tumors or other dangerous cell mutations. For research purposes the need for finding co-occurrences in the DNA alterations is growing. Therefore an algorithm was designed to find co-occurring aberrations [2]. In Figure 1.3 two pieces of a DNA string are combined to find co-occurring alterations. The arrows represent the same ratios as in Figure 1.2, and the size of each circle represents the importance of the co-occurring alteration at that position in DNA pieces A and B. So the big circles are co-occurring aberrations in one tumor DNA, and these positions have to be determined for each tumor DNA. The size of the circles is calculated with the minimum function: a circle stands for the minimum of the two ratios between healthy and tumor DNA. The sum of all these circles over all tumor DNA's is called the pair-wise space. The algorithm is further explained in chapter two.
The problem is that the DNA string is large. This means many computations have to be done to find alterations (the data is in the order of hundreds of megabytes). To find co-occurrences, the number of computations grows exponentially with the number of simultaneous alterations. Because the algorithm is optimal in its number of calculations (as far as known), there is a need to improve the speed of these computations.
The total DNA string is divided, because the total computation is too expensive to compute on one platform. To divide the calculation, the tumor DNA strings are separated into chromosome arms (each person has 23 chromosome pairs and each chromosome consists of two chromosome arms). These chromosome arms are used to calculate a part of the total result. In this way the problem is divided into roughly 10k jobs [1]: 46×46/2 chromosome arm pairs are compared for 5 kernel sizes and 3 modes (gain/gain, loss/loss and gain/loss).
Figure 1.2: Tumor DNA’s from different persons
Figure 1.3: Co-occurring alterations in tumor DNA
1.3 Project goals and approach
The main goal is to accelerate the computations of the algorithm. The approach is to consider different platforms for the computations and to choose one of them. The algorithm is already implemented in Matlab and distributed over a network of computers. However, this implementation takes a long time to compute the result.
The approach is to improve one job (the co-occurring aberrations of two chromosome arms) on a different platform. In this way a new network can be created, or combined with the old network, to do the same job faster. The approach consists of the following phases:
• Literature study: understanding the algorithm and its background, and studying other acceleration approaches in software and on different platforms.
• Convert the Matlab implementation to a plain C implementation to improve the algorithm and to find parallelization possibilities.
• Search for and choose an implementation platform and make an architecture.
• Implement, test and verify the architecture.
This report focuses on accelerating the algorithm that finds the co-occurrences of two alterations in DNA. In the future more co-occurrences will be calculated; however, the computational time grows exponentially (a full calculation with the discussed algorithm would take roughly 10,000 days).
1.4 Chapter overview
This report is written in the same order as the approach. Chapter 2 visualizes and explains the algorithm. This algorithm is implemented in plain C, where some algorithm steps are redesigned (chapter 3). An implementation platform was chosen (chapter 4) and an architecture was designed (chapter 5). This architecture was implemented (chapter 6) and tested. The results are given in chapter 7. Finally, the conclusions and future work can be found in chapter 8.
2 The Algorithm
This chapter explains the algorithm that calculates the co-occurring aberrations. First the exact inputs and outputs of the algorithm are shown (2.1), then it is described how the pair-wise space is calculated (2.2). The next stage is to normalize the pair-wise space (2.3); finally the peaks are calculated (2.5) after the 2D kernel has been applied to the normalized pair-wise space (2.4).
2.1 Inputs and outputs
Within the algorithm two different chromosome arms are compared for a number of tumor DNA's from different persons (P). These chromosome arms for the different tumor DNA's are displayed in Figure 2.1 and are called matrix A and matrix B. Each row of matrix A and B contains the ratios between healthy and tumor DNA for one tumor DNA (Figure 1.2 is illustrative for the matrices A and B). Matrix A (chromosome arm A) is of length M and matrix B (chromosome arm B) is of length N. Some precomputed normalization vectors (NA and NB) are also given; they are used to normalize the pair-wise space, so that all ratios have the same intensity.
Figure 2.1: The inputs
The output consists of the 500 highest peaks of the pair-wise space after the kernel has been applied. Only the 500 highest peaks are used, because smaller peaks have no significant value and the output matrix would otherwise be too large. The given information includes the location (X is the position in A and Y the position in B) and the height of each peak in the pair-wise space (Figure 2.2).
Figure 2.2: The output
The values in matrices A and B indicate the ratio between the tumor and healthy signal (on a log2 scale), where negative values indicate a loss and positive values a gain. The algorithm can calculate a gain/gain, loss/loss or gain/loss answer, which is selected with the 'amp' parameter. For a gain situation all negative values are nullified. For a loss situation all positive values are nullified and all negative values are inverted. All different possibilities are shown in Figure 2.3.
Figure 2.3: Pre computation (pseudo code)

switch amp
case 1 // gain/gain
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P:  if A_pi < 0 then A_pi = 0
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P:  if B_pj < 0 then B_pj = 0
case 0 // loss/loss
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P:  if A_pi > 0 then A_pi = 0 else A_pi = -A_pi
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P:  if B_pj > 0 then B_pj = 0 else B_pj = -B_pj
case 2 // gain/loss
    for all 1 ≤ i ≤ M and 1 ≤ p ≤ P:  if A_pi < 0 then A_pi = 0
    for all 1 ≤ j ≤ N and 1 ≤ p ≤ P:  if B_pj > 0 then B_pj = 0 else B_pj = -B_pj
end
2.2 Pair-wise space
To create the pair-wise space the two chromosome arms are combined. This is implemented with a minimum function, because a ratio that shows a large common gain or loss in both chromosome arms is searched for. Each value of matrix A is compared with each value of matrix B for each person. In this way a matrix C is computed for each person (Figure 2.4).
Figure 2.4: Calculate the minimums

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N and 1 ≤ p ≤ P:
    C^p_ji = MIN(A_pi, B_pj)
There are P different C matrices. Only high ratios that occur in all C matrices are important, since systematic co-occurring alterations in the chromosome arms are searched for. To find these alterations all C matrices are added together (Figure 2.5). In this way each point of the matrix D is the sum of the minima of the ratios between healthy and tumor DNA.
Figure 2.5: Sums the C matrices

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N:
    D_ji = Σ (p = 1..P) C^p_ji
2.3 Covariance and normalization
The next step is to compute the covariance matrix to correct for continuous ratios. In addition the normalization vectors are used to normalize the pair-wise space. The normalization matrix is computed as shown in Figure 2.6. The matrix D is multiplied by the covariance matrix and divided by the normalization matrix (Figure 2.7).
This covariance step looks like an easy computational step (in terms of multiplications and divisions), but computing the covariance has a complexity of M×N×P (the same complexity as the previous step). In addition some divisions are needed for the covariance matrix and the normalization.
Figure 2.6: The NORM matrix

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N:
    NORM_ji = NA_i + NB_j
Figure 2.7: Normalize the pair-wise space

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N:
    E_ji = D_ji × COV_ji / NORM_ji
2.4 2D kernel
A 2D Gaussian kernel is applied to the normalized pair-wise space to look at the local enrichment of the highest values within this space. A normal 2D kernel convolution is shown in Figure 2.8, where K denotes the height and width of the kernel matrix V. In this way the complexity is M×N×K×K, so this step takes the most computational power (when K×K > P).
Because the kernel is Gaussian, it is separable. This means that the convolution can be done with one horizontal and one vertical vector of width K (Figure 2.9). In this way the complexity decreases to M×N×K×2, so this step takes less computational power (when 2×K < P).
Figure 2.8: Normal 2D kernel convolution

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N:
    G_ji = Σ (k1 = 1..K) Σ (k2 = 1..K) E_(j+K/2-k2),(i+K/2-k1) × V_k1,k2
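Why the kernel may be split this way (a one-step derivation, under the assumption that the 2D Gaussian is the product of two identical 1D Gaussian vectors, V_k1,k2 = V_k1 × V_k2, which the text states for the Gaussian used here):

$$
G_{ji} = \sum_{k_1=1}^{K}\sum_{k_2=1}^{K} E_{(j+K/2-k_2),(i+K/2-k_1)}\,V_{k_1}V_{k_2}
       = \sum_{k_2=1}^{K} V_{k_2}\underbrace{\sum_{k_1=1}^{K} E_{(j+K/2-k_2),(i+K/2-k_1)}\,V_{k_1}}_{=\,F_{(j+K/2-k_2),i}}
$$

The inner sum is exactly the horizontal pass F of Figure 2.9, and the outer sum is the vertical pass that produces G.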
Figure 2.9: Separated 2D kernel convolution

E → (horizontal convolution with V) → F → (vertical convolution with V) → G

for all 1 ≤ i ≤ M and 1 ≤ j ≤ N:
    F_ji = Σ (k1 = 1..K) E_j,(i+K/2-k1) × V_k1
    G_ji = Σ (k2 = 1..K) F_(j+K/2-k2),i × V_k2
2.5 Peaks
This is the final step of the algorithm: it finds peaks in the pair-wise space to detect the DNA locations that co-aberrate to a certain degree. This peak function, shown in Figure 2.10, produces an array with the location and height of each peak. This function has a complexity of M×N, so it should be fast. But as will be discussed in the next chapter, this function took the most time in the Matlab implementation.
Figure 2.10: Finding peaks
3 Implementation in software
An important step in optimizing the algorithm is to get insight into the number of operations and the parallelism available in each step. This is done by translating the algorithm into pseudo code. All steps of the algorithm, calculating the pair-wise space (3.1), the covariance and the normalization (3.2), the 2D kernel (3.3) and the peak function (3.4), are translated into pseudo code, and the most important differences compared to the original Matlab code are mentioned. Note that the matrices are stored column-wise in memory (this means that the indices of the matrices are exchanged).
The pseudo code in this chapter is a simplified version of the real C code: advanced code, like the memory mapping for the kernel convolution and flat peak detection, is left out of the report.
3.1 Pair-wise space
In this step the pair-wise space is calculated. First a matrix D is filled with zeros, then the minimum values between matrix A and B are added to it (Figure 3.1). The pseudo code shows that M×N×P minimum (MIN) functions must be performed. Each minimum function requires one comparator, and all of these comparisons can be done in parallel. Only the P additions for each point of the matrix have to be done sequentially, because they write to the same address. The original Matlab code ran 5 times slower than this code because of the Matlab interpreter.
Figure 3.1: Pseudo code of calculating the pair-wise space
//Input: matrix A and B are the ratios between the healthy and tumor DNA
//       for chromosome arms A and B for P number of tumor DNA's
//Output: matrix D is the pair-wise space
mins(M, N, P, matrix A, matrix B, matrix D)
{
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            D[i,j] = 0;

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            for (p=0; p<P; p++)
                D[i,j] += MIN( A[i,p], B[j,p]);
}
3.2 Covariance and normalization
The function COVAR calculates the covariance matrix, using some subroutines to perform the matrix multiplication and to calculate the mean of each column (Figure 3.3). After the covariance matrix has been calculated, it is multiplied with the pair-wise space by the function COV_NORM (Figure 3.2). In this function all negative values in the covariance matrix are nullified. The result of the multiplication step is divided by the normalization matrix. Division by zero is checked in this function; if it would occur, the result is set to zero (as the original Matlab code does).
Most computations are done in the matrix multiplication, which is used in the COVAR function. The number of multiplications is M×N×P. The P additions for every point of the matrix have to be performed sequentially, because they write to the same address.
Figure 3.2: Pseudo code of the covariance and normalization step
//Input: matrix COV is the covariance matrix
//       matrix D is the pair-wise space
//       array Na and Nb are the normalization vectors
//Output: matrix E is the pair-wise space corrected for continuous ratios and
//        normalized
cov_norm(M, N, matrix D, matrix COV, array Na, array Nb, matrix E)
{
    // multiply step
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            if (COV[i,j] < 0)
                E_interm[i,j] = 0;
            else
                E_interm[i,j] = D[i,j] * COV[i,j];

    // normalization step
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
        {
            temp = Na[i] + Nb[j];
            if (temp == 0)
                E[i,j] = 0;
            else
                E[i,j] = E_interm[i,j] / temp;
        }
}
Figure 3.3: Pseudo code of calculating the covariance matrix
covar(M, N, P, matrix A, matrix B, matrix COV)
{
    mean(M, P, A, A_MEAN);
    mean(N, P, B, B_MEAN);
    sub_mean(M, P, A, A_MEAN, A_SUB_MEAN);
    sub_mean(N, P, B, B_MEAN, B_SUB_MEAN);
    mult_cov(M, N, P, A_SUB_MEAN, B_SUB_MEAN, COV);
}

mult_cov(M, N, P, matrix A_SUB_MEAN, matrix B_SUB_MEAN, matrix COV)
{
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            COV[i,j] = 0;

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            for (k=0; k<P; k++)
                COV[i,j] += A_SUB_MEAN[i,k] * B_SUB_MEAN[j,k];

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            COV[i,j] /= P; //P was M in the original code
                           //However the covariance is calculated over P persons
}

mean(M, P, matrix IN, array MEAN)
{
    for (i=0; i<M; i++)
        MEAN[i] = 0;

    for (i=0; i<M; i++)
        for (k=0; k<P; k++)
            MEAN[i] += IN[i,k];

    for (i=0; i<M; i++)
        MEAN[i] /= P;
}

sub_mean(M, P, matrix IN, array MEAN, matrix SUB_MEAN)
{
    for (i=0; i<M; i++)
        for (k=0; k<P; k++)
            SUB_MEAN[i,k] = IN[i,k] - MEAN[i];
}
3.3 2D kernel
The 2D Gaussian kernel convolution can be separated as explained in the previous chapter. The separation of the Gaussian kernel itself is shown in Figure 3.4: the upper half shows the calculation of the 2D Gaussian and the lower half the calculation of the separated 1D Gaussian. First a horizontal convolution with the separated Gaussian is calculated, followed by a vertical convolution, as depicted in Figure 3.5. M×N×K×2 multiplications are performed, where the K×2 additions for each point of the matrix have to be done sequentially.
When performing a convolution, problems arise at the edges of the matrix, because values outside the matrix are needed. Therefore the matrix is extended: the values at the edges of the original matrix are duplicated mirror-wise to the outside of the matrix. An alternative is to use zeros for the values outside the matrix, but this leads to significantly smaller values at the edges and to false peaks in the next step.
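To make the mirroring concrete, here is a minimal C sketch of the horizontal extension step; the thesis references extend_h in Figure 3.5 but does not list it, so the function body and the exact mirroring convention below are our assumptions (column-wise storage, element (i,j) at index i + j*M, as stated at the start of this chapter):

```c
/* Hypothetical sketch of extend_h: mirror the edge values of each
 * length-M column so the convolution can read K/2 samples past
 * either edge. ext_E must hold (M + 2*(K/2)) * N doubles. */
void extend_h(int M, int N, int K, const double *E, double *ext_E)
{
    int pad = K / 2;
    for (int j = 0; j < N; j++)
        for (int i = -pad; i < M + pad; i++) {
            int src = i;
            if (src < 0)
                src = -src - 1;           /* mirror below the low edge  */
            else if (src >= M)
                src = 2 * M - src - 1;    /* mirror above the high edge */
            ext_E[(i + pad) + j * (M + 2 * pad)] = E[src + j * M];
        }
}
```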
In Matlab the separation was done numerically with an SVD decomposition. However, doing the separation analytically in the definition of the Gaussian leads to more accurate results, because floating-point errors make the decomposition less accurate.
Figure 3.4: Matlab code defining the kernel matrix
[X, Y] = meshgrid([-(scale/2):.1:scale/2], [-(scale/2):.1:scale/2]);
Z_2D = exp(-(((X-x0)./sigma_x).^2 + ((Y-y0)./sigma_y).^2));

X = [-(scale/2):.1:scale/2];
Z_1D = exp(-(((X-x0)./sigma_x).^2));
Figure 3.5: Pseudo code of the 2D kernel convolution

//Input: matrix E is the pair-wise space corrected for continuous ratios and normalized
//       array V contains the kernel values
//Output: matrix G is the pair-wise space after the kernel
kernel(M, N, K, matrix E, array V, matrix G)
{
    ////////// horizontal //////////
    extend_h(E, ext_E);
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            F[i,j] = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < K; k++)
                F[i,j] += ext_E[i+k,j] * V[k];

    ////////// vertical //////////
    extend_v(F, ext_F);
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            G[i,j] = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < K; k++)
                G[i,j] += ext_F[i,j+k] * V[k];
}
3.4 Peaks
The PEAK function has to determine for each point whether it is a peak (Figure 3.6). The output is initialized to minus one, so the subroutine EXPAND knows whether a point has already been evaluated. The subroutine EXPAND works as follows: each point of the matrix is compared with all its neighbors, and when the point is greater than its neighbors it is a peak. This is done in the first FOR loop of the EXPAND function (Figure 3.7). The second FOR loop checks for flat peaks, i.e. whether a neighbor has the same value. If this is the case, that neighbor has to be evaluated (again using the EXPAND function) to determine whether the point is a peak. In this way all peaks are marked in the matrix PEAK.
The complexity of this function is M×N. However, in the case of a plateau (a number of equal values connected to each other) all points of the plateau have to be evaluated to determine whether the plateau is a peak. So in the worst case of one big plateau, all points have to be evaluated sequentially.
The Matlab code for finding peaks in the pair-wise space used memory inefficiently: for every peak that was found, the peak matrix was extended by one element and a new memory allocation had to be performed. So if the result had many peaks, the calculation time increased. The Matlab code of the PEAK function took 60 percent of the total execution time.
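For contrast, the usual C remedy for this grow-by-one behavior is to grow the result buffer geometrically, so that appending a peak costs amortized constant time. A sketch under our own naming, not the thesis code (error handling omitted):

```c
#include <stdlib.h>

/* Hypothetical peak record and append helper. */
typedef struct { int x, y; double height; } peak_t;

static peak_t *push_peak(peak_t *buf, int *len, int *cap, peak_t p)
{
    if (*len == *cap) {                  /* buffer full            */
        *cap = *cap ? 2 * *cap : 64;     /* double the capacity    */
        buf = realloc(buf, *cap * sizeof *buf);
    }
    buf[(*len)++] = p;                   /* amortized O(1) append  */
    return buf;
}
```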
Figure 3.6: Pseudo code of the peak finding algorithm
peak(M, N, matrix G, matrix PEAK)
{
    for (i = 0; i < M; ++i)
        for (j = 0; j < N; ++j)
            PEAK[i,j] = -1;

    for (i = 0; i < M; ++i)
        for (j = 0; j < N; ++j)
            expand(G, PEAK, i, j);
}

Figure 3.7: Pseudo code of the subroutine EXPAND
int expand(matrix G, matrix PEAK, x, y)
{
    //N_x denotes the x position of neighbor N
    //N_y denotes the y position of neighbor N
    if (PEAK[x,y] != -1)
        return PEAK[x,y];

    //code for normal peaks
    for all neighbors N
        if (G[x,y] < G[N_x,N_y]) {
            PEAK[x,y] = 0;
            return 0;
        }
    PEAK[x,y] = 1;

    //code for flat peaks
    for all neighbors N
        if (G[x,y] == G[N_x,N_y])
            if (expand(G, PEAK, N_x, N_y) == 0) {
                PEAK[x,y] = 0;
                return 0;
            }
    return 1;
}
4 Implementation considerations
In this chapter the considerations for implementing the algorithm are discussed. These considerations lead to an implementation platform (4.2). But to make a good choice for the platform, first the parallelism in the algorithm is explored (4.1).
4.1 Parallelism
The algorithm has many operations to perform and therefore a lot of time is needed to execute it. The minimum number of arithmetic operations for every process step is shown in Table 4.1. From this table it can be concluded that the most expensive step depends on the values of P and K: when K×2 > P the 2D kernel step costs the most operations, but when P is bigger the pair-wise space and the covariance matrix cost the most. For example, with P = 300 persons and a kernel of size K = 100 (both within the ranges of Table 5.1), K×2 = 200 < P and the pair-wise space and covariance steps dominate.
The 2D kernel convolution step can be executed several times over the pair-wise space, for several kernel widths. This means that the 2D kernel convolution has some extra weight when it is compared to the other steps. However, for the (current) typical application most of the calculated kernel widths are very small compared to P, and not always do all kernel widths have to be calculated.
With N×M arithmetic units in hardware, this algorithm could be executed very fast: the remaining time complexity decreases to P and K×2, except for the peak finding algorithm (because of its sequential behavior), as shown in Table 4.2. The size of M and N is too large to actually build N×M arithmetic units, but it shows the possibility to speed up the algorithm when more operations can be performed at the same time.
Process step                   Comparisons   Additions       Multiplications   Divisions
Pair-wise space                M*N*P         M*N*P           -                 -
Covariance matrix              -             M*N*P+4*M*P     M*N*P             M*N+M
Covariance and normalization   2*M*N         -               M*N               M*N
2D kernel convolution          -             M*N*K*2         M*N*K*2           -
Peaks                          M*N*9         -               -                 -

Table 4.1: Number of arithmetic operations
Process step                   Arithmetic complexity   Time complexity
Pair-wise space                M*N*P                   P
Covariance and normalization   M*N*P                   P
2D kernel                      M*N*K*2                 K*2
Peaks                          M*N                     M*N

Table 4.2: Arithmetic and time complexity
4.2 Implementation platforms
There are several possible implementation platforms, each having its advantages and disadvantages. For the algorithm, the parallelism, speed, memory and flexibility of the platform are important, as well as the possibility to construct a network with these platforms. These aspects are discussed for the general purpose processor, the Cell microprocessor, the GPU and the FPGA.
4.2.1 General purpose processor
The general purpose processor is the most commonly used computational platform. It is often used because of its flexibility. Many high-level programming languages exist to make programming easier, but languages closer to the processor can also be used to obtain faster computations.
Parallelism in the general purpose processor can be achieved with instruction pipelining and SIMD (single instruction, multiple data), with which several instructions can be executed at the same time. Another possibility is to create threads. However, when the threads are executed on the same processor, the instructions are still executed sequentially and the threads only emulate parallelism. On today's machines there can be multiple cores, where true parallelism is achieved if the tasks are executed on different cores.
General purpose processors run at gigahertz clock frequencies, although most instructions take more than one clock cycle. The construction of a network with this platform is relatively easy, because such networks already exist and communicate over Ethernet (standard up to 1 gigabit).
The memory in a general purpose processor is layered: cache memory is used for fast access and DRAM is used to store larger data sets. The cache helps when a DRAM address is needed several times in short succession.
4.2.2 Cell (microprocessor)
The Cell is a relatively new computational platform, which uses one main processing element (PE) with multiple attached processing units (APU's). The PE is not so much the main processor as a controller for the APU's. The APU's (mostly fewer than ten) can each execute a thread, which increases the level of parallelism. The APU's can also be used as a pipeline, where each APU calculates part of the final product and then sends it to the next APU. The level of parallelism thus depends on the number of APU's.
Programming the Cell is a bit harder, since knowledge of multithreading is necessary. When code written for a general purpose processor is run on the Cell, it is mainly executed on the PE, while the Cell's true power comes from the APU's. The program has to be rewritten to use threads and to define the memory sharing between the threads and between the threads and the PE. So the code has to be split up into threads to use all APU's optimally.
Like the general purpose processor, the Cell can easily be connected to a network with Ethernet. The PE distributes the data to the APU's; because this distribution is done in software, it gives a great memory advantage.
4.2.3 Graphic Processor Unit
The Graphic Processor Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor. GPU's were originally used only for video tasks and are located on the graphics card of a computer. A GPU consists of many small computational units that can operate independently of each other, which creates good parallelism. To use a GPU the code has to be completely rewritten, using special libraries and threads, and cannot be reused on other platforms. This makes the platform inflexible.
A GPU runs at clock frequencies of several gigahertz, but many instructions take multiple cycles to execute. Code written for the GPU handles multiplications and additions very well, but branching is very bad for performance. Using GPU's in a network is relatively easy, because many computers already have a GPU on the graphics card.
The computational units all have small separate caches, in contrast to the Cell microprocessor. This leads to many cache misses, and thus reduced performance, when large data sets are used.
4.2.4 Field programmable gate array
The field programmable gate array (FPGA) is currently the most widely used kind of programmable hardware. On this implementation platform the hardware is reconfigurable and can be designed to fit the application. To program the FPGA, the languages VHDL and Verilog are mostly used. Because all hardware must be described, many lines of code are needed to implement a design. However, the code can be reused on other FPGA boards, which gives good flexibility, and the hardware can be optimized for the design.
The parallelization depends on the number of logical elements that can be implemented on the FPGA. All kinds of hardware components (adders, multipliers) can be created, limited only by the hardware resources of the FPGA. In this way much parallelism can be achieved.
The FPGA runs at a clock frequency of 100-500 megahertz, but every cycle more hardware components can be active at the same time. FPGA's (mostly) have gigabit Ethernet on board, which makes them useful in a network.
FPGA's have different kinds of memory: SDRAM, block RAM and flip-flops. The block RAM is an array of small pieces of memory that can be read or written at the same time. Most importantly, block RAM and flip-flops can be accessed in parallel with each other, so many computations can be done in parallel.
4.2.5 Conclusion
The algorithm can easily be parallelized, but doing so requires a lot of hardware. So a platform with many arithmetic operations per second has to be chosen. The Cell can perform many operations in parallel at a very high speed, but this is restricted by the number of APU's (mostly fewer than 10). GPU's have more computational units, but cannot handle branches very well. These two platforms execute instructions sequentially, and many instructions take more than one clock cycle.
The FPGA is a factor ten slower in terms of clock frequency, but can do more operations in parallel, restricted only by the hardware resources needed to create a computational unit. An FPGA can, for example, do an if-statement, an addition and a multiplication in one clock cycle, which is possible due to its parallel memory. This makes the slower clock rate of the FPGA, compared to the Cell and GPU, less important, as long as enough operations can be done in parallel.
Because the hardware is configurable and the FPGA can do more in parallel, it is the most suitable hardware platform. Since the whole algorithm may not fit in hardware and data has to be transmitted to the board, the general purpose processor is also used. Another advantage is that networks of computers can still be used.
5 Architecture
In this chapter the global architecture is described. First it is discussed which part of the algorithm is implemented in hardware and why (5.1 and 5.2). Then the software architecture is described and explained (5.3). Finally all hardware blocks, and how they are connected to make an optimal design, are discussed (5.4).
5.1 Partitioning of the algorithm
The first intention was to implement the whole design in hardware. However, the total hardware needed to implement the complete design is too large, because M×N hardware units would have to be available. Therefore the algorithm was redesigned to divide the pair-wise space and calculate smaller pieces. The smallest pieces that can be created depend on the kernel size, and this kernel size is too big to implement in hardware (Table 5.1), because K×K×P×2 elements have to be stored to calculate one value of the pair-wise space. At the upper end of Table 5.1 (K = 1000, P = 300) that is 6×10^8 elements, far more than the memory available on the FPGA.
Because the whole algorithm cannot be implemented in hardware due to the limited size of the memory, the kernel convolution is implemented in software running on the general purpose processor. Then only the values needed to calculate one point of the pair-wise space have to be stored on the FPGA. These P×2 values consist of a column of matrix A and a column of matrix B (this is further explained in 5.2).
     Meaning                      Size
M    Length of chromosome arm A   100-10000
N    Length of chromosome arm B   100-10000
P    Number of tumor DNA's        1-300
K    Kernel matrix size           1-1000

Table 5.1: Sizes of the matrices
5.2 Software versus hardware
Due to the partitioning of the algorithm, the hardware/software split is almost determined. The kernel convolution has to stay on the general purpose processor. Divisions will also stay on the computer, because divisions are very area- and time-consuming on an FPGA. This means that the normalization, the division in the MULT_COV function and the MEAN function are not implemented in hardware.
Figure 5.1 depicts the pseudo code that will be implemented in hardware. The code inside the two outer loops will be implemented as one computational unit. With one column of matrix A and one column of matrix B, the MINS and COVAR functions can be computed, and the results of these functions can be multiplied to create one output value of the pair-wise space (E_interm).
Figure 5.1: Pseudo code of to_fpga function
to_fpga(M, N, P, matrix A, matrix B, array MEAN_A, array MEAN_B, matrix E_interm)
{
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
        {
            // this will be implemented on the FPGA
            // this is 1 computational unit

            // the mins function
            D[i,j] = 0;
            for (k=0; k<P; k++)
                D[i,j] += MIN( A[i,k], B[j,k]);

            // the covar function
            COV[i,j] = 0;
            for (k=0; k<P; k++)
                COV[i,j] += (A[i,k] - MEAN_A[i]) * (B[j,k] - MEAN_B[j]);

            // first part of cov_norm function
            if (COV[i,j] < 0)
                E_interm[i,j] = 0;
            else
                E_interm[i,j] = D[i,j] * COV[i,j];
        }
}
In short: E_interm = FPGA(A, B).
5.3 Software
The software runs on the general purpose processor. As described in the previous paragraph, most of the computations are transferred to hardware. The software is still important: it sends data to and receives data from the FPGA and executes the smaller functions MEAN, NORM and MAXIM. The biggest computational step that remains on the computer is the 2D kernel convolution. The main software architecture is shown in Figure 5.2; it shows that the MEAN function is extracted from the COVAR function and computed before the FPGA starts.
The 2D kernel convolution should be as fast as possible, because it stays on the computer. Therefore this function is optimized by using pointers and writing the plain C code to be as fast as possible (taking the memory layout of the matrices into account). This optimization also exploits the symmetry of the Gaussian kernel, which saves a factor 2 in multiplications.
The fpga function sends packets to and receives packets from the FPGA. This is done with UDP over Ethernet and is implemented as depicted in Figure 5.3. The Send_A function sends as many columns of matrix A as there are computational units. The Send_B function sends as many columns of matrix B as fit in a packet of 1024 bytes (P_per_packet), because it takes roughly the same time to send a packet of 1 byte as a packet of 1024 bytes (all headers included). The Receive function returns (computational_units × P_per_packet) values, which are stored in the matrix E_interm.
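As a back-of-the-envelope check of that packet budget (a sketch: the constant names are ours, and whether the mean values share the data packets is left aside):

```c
/* Hypothetical helper: columns of B that fit in one UDP payload. */
#define PAYLOAD_BYTES   1024
#define BYTES_PER_VALUE 3     /* 24-bit fixed point, see 6.1.2 */

static int p_per_packet(int P)
{
    return PAYLOAD_BYTES / (P * BYTES_PER_VALUE);
}
```

With P = 100 persons, for example, a column of B occupies 300 bytes, so three columns fit in one 1024-byte payload.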
Figure 5.2: Software architecture
main()
{
    mean(A);
    mean(B);
    fpga();
    norm();
    kernel();
    peak();
}

Figure 5.3: Pseudo code of fpga function
fpga(M, N, matrix A, matrix B, array MEAN_A, array MEAN_B, matrix E_interm)
{
    for (i = 0; i < M; i+=computational_units)
    {
        Send_MEAN_A();
        Send_A();
        for (j = 0; j < N; j+=P_per_packet)
        {
            Send_MEAN_B();
            Send_B();
            E_interm = Receive();
        }
    }
}
As discussed earlier, the subroutine MEAN is excluded from the COVAR function because of its dividers. This means that the mean value of each column has to be sent to the FPGA before the column itself is sent. The COV_NORM function is changed into the NORM function, and the division by P that is extracted out of the MULT_COV subroutine is implemented in this NORM function (Figure 5.4). This function is essentially the second step of the COV_NORM function, where the extra factor P in the divisor implements the extracted division.
Normally the sequence is Send_B followed by Receive. To create a double buffer, an extra Send_B is executed before this sequence and an extra Receive after it (Figure 5.5). In this way the buffer in the FPGA is kept full and the FPGA can handle the next packet right after the previous one.
Figure 5.4: New NORM function

norm(M, N, matrix E_interm, array Na, array Nb, matrix E)
{
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
        {
            temp = Na[i] + Nb[j];
            if (temp == 0)
                E[i,j] = 0;
            else
                E[i,j] = E_interm[i,j] / (temp*P);
        }
}
Figure 5.5: Pseudo code of fpga_buffer function
fpga_buffer(M, N, matrix A, matrix B, array MEAN_A, array MEAN_B, matrix E_interm)
{
    for (i = 0; i < M; i+=computational_units)
    {
        Send_MEAN_A();
        Send_A();
        Send_MEAN_B();
        Send_B();
        for (j = 0; j < N; j+=P_per_packet)
        {
            Send_MEAN_B();
            Send_B();
            E_interm = Receive();
        }
        E_interm = Receive();
    }
}
5.4 Hardware
The hardware architecture determines the performance of the design. The communication hardware must use as little area as possible, so that many computational units can be implemented. This is achieved with the architecture depicted in Figure 5.6. The components used are listed below:
• External PHY: The external PHY sends and receives frames from the Ethernet network, which is connected to the computer. This component processes the frames and communicates with the TEMAC through a GMII interface.
• The TEMAC: This component supports 1000 Mb/s and half and full duplex operation. Its most important property is the minimal area it needs: it consists only of an EMAC (available on most boards), a DCM (to generate multiple clock signals), some buffers and minimal logic resources [4].
• The RX and TX FIFO: These first-in-first-out memory blocks are used to buffer the incoming and outgoing frames; the RX FIFO buffers received frames and the TX FIFO frames to be sent. Besides buffering, the FIFO's are used to synchronize the different clock domains, because the TEMAC operates at different speeds while the accelerator and the Ethernet controller operate at 150 MHz. The FIFO's are implemented with block RAM modules, which cost no extra logic resources [4].
• Ethernet controller: The Ethernet controller transforms packets into raw data and raw data into packets. This includes reading packet headers and extracting the data from the RX FIFO, but also creating packet headers with the MAC address and IP of the computer and inserting the data. The Ethernet controller also configures the TEMAC.
• Accelerator: This component does the real computations, with as much parallelization as possible. To allow many parallel computations, the control signals and registers are minimized [5, 6].
Figure 5.6: Total hardware architecture
6 Implementation in hardware
This chapter describes the implementation in hardware. There are two main hardware components: the Ethernet communication (6.1) and the accelerator (6.2). Both implementations are visualized and explained. The paragraph on the Ethernet communication describes not only the Ethernet controller, FIFO's and TEMAC, but also the packet types and headers. The paragraph on the accelerator discusses the architecture and the computational units. Finally the computational hardware is explained (6.3).
6.1 Ethernet communication
6.1.1 The TEMAC and the FIFO’s
This part of the communication consumes almost no area when implemented [4]. The TEMAC is configured for 1 gigabit communication with half or full duplex. It auto-negotiates with the connected computer, but only works at gigabit speed. The TEMAC sends its frames directly to the RX FIFO, where they are buffered (Figure 6.1). The final byte has rx_valid set to zero to mark the end of a frame. The TEMAC reads the TX FIFO when tx_fifo_lock_n is low; when tx_fifo_lock_n is set, the Ethernet controller can write to the TX FIFO. At the end of a frame, 2 zero bytes with tx_valid set to zero must be added to the TX FIFO to mark the end of the frame [8].
Figure 6.1: Communication TEMAC-FIFO’s
6.1.2 Communication data format
The data used in the algorithm consists of doubles. However, to keep the FPGA implementation as small as possible, these doubles are converted to a fixed-point representation. The fixed-point number is a 24-bit value with 16 fractional bits. This means a value of matrix A or B must be smaller than 256 (8 integer bits); 256 is a relatively high limit, because the values represent ratios in log2 form. So every double is converted to three bytes.
When data is received from the FPGA, 24 bits are not enough, because P values are accumulated. Therefore the result is 32 bits with 16 fractional bits. These 32 bits are converted back to doubles and the algorithm continues.
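A minimal C sketch of the two conversions (the helper names and the byte order on the wire are our assumptions; the thesis does not list its conversion code):

```c
#include <stdint.h>
#include <math.h>

/* Pack a double as a 24-bit fixed-point value with 16 fractional
 * bits (two's complement); requires |x| < 256. */
static void pack_fixed24(double x, uint8_t out[3])
{
    int32_t v = (int32_t)lround(x * 65536.0);  /* scale by 2^16 */
    out[0] = (uint8_t)(v >> 16);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)v;
}

/* Unpack a 32-bit fixed-point result (16 fractional bits) back to a double. */
static double unpack_fixed32(int32_t v)
{
    return (double)v / 65536.0;
}
```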
6.1.3 The Ethernet controller
The Ethernet controller consists of two parts: the receiver and the sender. The receiver has to accept two types of packets: ARP and UDP. The FPGA board is configured with a static IP address. When an ARP packet is received from the computer asking for the MAC address of the FPGA board, the Ethernet controller must answer with an ARP packet. This ARP reply is predefined in the FPGA; the controller sends its MAC address back when its own IP address is found in the ARP request.
When a UDP packet is received, the MAC address is checked. When the packet is correct, the length of the data is stored and the header is skipped. Three bytes (24 bits) are buffered and a flag is raised to indicate that the accelerator can use the buffered value. When the accelerator has read the value, it lowers the flag and the Ethernet controller buffers the next value. When all bytes of the frame have been read, the process starts over.
The sender of the Ethernet controller works separately from the receiver to increase performance. When a complete UDP packet has been received, a UDP packet with the result of the accelerator, which is buffered in registers, is sent back to the computer.
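For reference, these are the standard Ethernet II, IPv4 and UDP header layouts the controller has to parse and generate (the well-known wire formats, not the thesis's VHDL types; struct names are ours, all multi-byte fields are big-endian on the wire, and real parsing code must take struct packing into account):

```c
#include <stdint.h>

struct eth_hdr  { uint8_t  dst[6], src[6];
                  uint16_t ethertype; };       /* 0x0806 = ARP, 0x0800 = IPv4 */
struct ipv4_hdr { uint8_t  ver_ihl, tos;
                  uint16_t total_len, id, frag_off;
                  uint8_t  ttl, protocol;      /* 17 = UDP */
                  uint16_t checksum;
                  uint32_t src_ip, dst_ip; };
struct udp_hdr  { uint16_t src_port, dst_port, length, checksum; };
```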
6.2 The accelerator
The main accelerator architecture consists of a number of computational units (Figure 6.2). These units can work in parallel and are kept as small as possible. The accelerator controller is necessary to regulate the computational units, because Data IN and Data OUT are communicated sequentially with the Ethernet controller. The accelerator contains G computational units and works in three main stages: receive (6.2.1), calculate (6.2.2) and buffer. A short overview of the stages:
1. Receive: G columns of matrix A and the G mean values of these columns are loaded into the units.
2. Calculate: The mean value of a column of matrix B is stored in a register and the values of the column are iterated over to calculate the MINS and MULT_COV functions. Finally the results of these functions are multiplied with each other and buffered in registers.
3. Buffer: The results are buffered in registers and wait to be sent over the Ethernet.
Figure 6.2: Accelerator architecture
The architecture of a computational unit is shown in Figure 6.3. A column vector of matrix A is stored in the block RAM [6]. Mean_A, Result Mins and Result Covar are registers: Mean_A holds the mean value of the vector stored in the block RAM, Result Mins the intermediate result of the MINS function and Result Covar the intermediate result of the MULT_COV function.
Figure 6.3: Computational unit
6.2.1 Receive
This stage consists of loading a column of matrix A into the block RAM. Because all Data IN arrives sequentially, this stage cannot be done in parallel. Before each block RAM is filled, the Mean_A register is filled. This process is visualized in Figure 6.4, where G denotes the number of computational units and P the number of persons, and thus the length of a column. Filling all block RAMs takes two nested loops, which makes this stage relatively slow. However, after this stage the other two (faster) stages can be executed many times.
6.2.2 Calculate
In this stage all computational operations are performed (Figure 6.5). For this stage a column vector of matrix B and its mean need to be sent. First the mean value of the vector is stored in the Mean register. Thereafter each value of column B is used to calculate one step of the MINS and MULT_COV functions. Because the computations in all computational units can be done in parallel, a large computation speed is achieved. After the whole column has been processed, the two results are multiplied with each other, where the output is nullified if Result Covar is negative.
Figure 6.4: Receive process
Mean B = Data IN
for p = 0 to P - 1:
    Result Mins[g]  += MIN(Block ram[g,p], Data IN)
    Result Covar[g] += (Block ram[g,p] - Mean A[g]) * (Data IN - Mean B)
Data OUT[g] = Result Mins[g] * Result Covar[g]
if Result Covar[g] < 0:
    Data OUT[g] = 0
Buffer = 1
Figure 6.5: Calculate process (0 ≤ g < G)
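In C, the same process for all units together could be sketched as follows (fixed-point renormalization is deferred to Section 6.3, so plain integer arithmetic stands in here; next_data_in() is the same hypothetical blocking read as before, and the inner g-loop is what the hardware executes in parallel):

#include <stdint.h>

#define G 9
#define P 95

extern int32_t next_data_in(void);
extern int32_t block_ram[G][P], mean_a[G];

void calculate_stage(int32_t data_out[G])
{
    int32_t mean_b = next_data_in();          /* Mean register */
    int64_t mins[G] = {0}, covar[G] = {0};

    for (int p = 0; p < P; p++) {
        int32_t b = next_data_in();           /* one value of column B */
        for (int g = 0; g < G; g++) {         /* parallel in hardware */
            int32_t a = block_ram[g][p];
            mins[g]  += (a < b) ? a : b;                         /* MINS */
            covar[g] += (int64_t)(a - mean_a[g]) * (b - mean_b); /* MULT_COV */
        }
    }
    for (int g = 0; g < G; g++)               /* input to the buffer stage */
        data_out[g] = (covar[g] < 0) ? 0 : (int32_t)(mins[g] * covar[g]);
}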
6.3 Computational hardware
In this section the hardware for the actual computations is discussed. Because the implementation of the communication part consumes few resources, the computational hardware can occupy almost the whole FPGA (85%). But to allow many parallel computations, the area of each component must be kept small.
Computing Result Mins requires one if-then-else statement, an addition, a read from memory and a write to memory. In hardware this can all be done simultaneously, because the block RAM, the registers and the computational hardware operate in parallel. The if-then-else statement is implemented with a comparator and the addition with an adder (Figure 6.7, left). The calculation of Result Covar can be done separately from Result Mins and is visualized in Figure 6.7 (right). The final calculation of Data OUT is shown in Figure 6.8. Because multiple computational units operate at the same time, a real acceleration is achieved.
One important side note has to be made. The computational hardware is implemented in a 32-bit fixed-point representation (16 fractional bits). In this way the computational hardware can be kept small in terms of logic resources, so more computational units can be implemented [7].
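The three datapaths of Figures 6.7 and 6.8 can each be written as one C expression (a sketch: a is the block-RAM value, b is Data IN, all values in the 32-bit/16-fractional-bit format; whether the final product is renormalized is an implementation detail, and the shift is kept here for consistency):

#include <stdint.h>

typedef int32_t fix32;   /* 32-bit fixed point, 16 fractional bits */

static inline fix32 fix_mul(fix32 x, fix32 y)   /* a DSP48 multiply */
{
    return (fix32)(((int64_t)x * y) >> 16);     /* keep 16 fractional bits */
}

/* Figure 6.7, left: comparator + mux + adder. */
static inline fix32 mins_step(fix32 acc, fix32 a, fix32 b)
{
    return acc + (a < b ? a : b);
}

/* Figure 6.7, right: two subtractors, one multiplier, one adder. */
static inline fix32 covar_step(fix32 acc, fix32 a, fix32 mean_a,
                               fix32 b, fix32 mean_b)
{
    return acc + fix_mul(a - mean_a, b - mean_b);
}

/* Figure 6.8: comparator + mux against zero, then the final multiply. */
static inline fix32 calc_data_out(fix32 mins, fix32 covar)
{
    return covar < 0 ? 0 : fix_mul(mins, covar);
}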
[Figure 6.7: Calculate Result Mins (left: comparator, mux and adder combining Block ram, Data IN and Result Mins) and calculate Result Covar (right: two subtractors for Data IN − Mean B and Block ram − Mean A, a multiplier and an adder feeding Result Covar)]
[Figure 6.8: Calculate Data OUT (a comparator and mux select zero when Result Covar is negative, otherwise the product of Result Mins and Result Covar)]
7 Results
This chapter shows the results of the hardware implementation. First the platform and resources are described (7.1). The results then cover the following subjects: the improvement in execution time (7.2), the scalability of the hardware acceleration (7.3) and the verification (7.4).
7.1 Platform and resources
The following platforms were used to obtain the results:
• Computer: Acer Aspire T650 with two Pentium® 4 CPU 3.06 GHz processors and 1024 MB of memory, running Windows XP Professional (Service Pack 3).
• FPGA: Virtex®-4 XC4VFX12 device on an ML403 board. It runs on a 100 MHz clock, and the FPGA configuration is loaded from a 512-MB CompactFlash card.
On the ML403 at most 9 computational units can be implemented. The resources are listed in Table 7.1. To save slices, as many DSP48 components as possible are used to implement the multipliers [7].
Resource          Used    Total   Ratio
Slices            4948    5472    90%
Slice registers   3401    10944   31%
4-input LUTs      7294    10944   67%
Block RAMs        11      36      31%
DSP48             30      32      94%
EMAC              1       1       100%
DCM               3       4       75%
Table 7.1: Resources used for the accelerator hardware
7.2 Execution times
Three versions were measured: KCSMART v7, KCSMART v8 software and KCSMART v8. KCSMART v7 is the original Matlab code. KCSMART v8 software is the plain C translation of the pseudo code described in this report (a naive, unoptimized plain C implementation). The final version, with the FPGA implementation and optimized kernel convolution, is KCSMART v8.
The execution times of the algorithm with relatively small input matrices for different kernel widths are shown in Figure 7.1. Normally the execution time grows with the kernel width. This is the case for both KCSMART v8 versions, where the hardware implementation is two times faster. But as can be seen in the figure, for a small kernel width the Matlab implementation was incredibly slow. This is explained by the slow peak-finding step: when a small kernel is applied, more peaks remain in the pair-wise space, and this slows down the execution.
The execution times for one small and one big input matrix for different kernel widths are shown in Figure 7.2. When this figure is compared with the execution times of the smaller input matrices, it can be concluded that the speedup achieved by the FPGA implementation grows with larger matrices and smaller kernel widths. However, the speedup for the larger kernel sizes remains the same (a factor of three). This shows the trend of speedups when larger input matrices are calculated.
[Figure 7.1: Execution time of two small input matrices (M = 1758, N = 1859, P = 95; 9 computational units). Kernel widths of 0.2, 1, 2, 10 and 20 (×10^6) on the x-axis against execution time in seconds (0–250) for KCSMART v7, KCSMART v8 software and KCSMART v8.]
[Figure 7.2: Execution time of one small and one big input matrix (M = 1758, N = 5219, P = 95; 9 computational units). Kernel widths of 0.2, 1, 2, 10 and 20 (×10^6) on the x-axis against execution time in seconds (0–2500) for KCSMART v7, KCSMART v8 software and KCSMART v8.]
7.3 Scalability
The scalability of KCSMART v8 was measured (Figure 7.3). This is an important aspect of the FPGA design, because if a design is scalable, its execution time can be predicted from the input matrix sizes. As expected, the execution time of the FPGA is linear in the size of the output matrix (M×N). It can also be concluded that the execution time is inversely proportional to the number of computational units. This is a very important fact, because when the number of computational units is doubled, the execution time is halved.
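These two observations can be combined into a simple cost model (an assumption consistent with the measurements, not a fitted result): each of the M·N outputs requires P accumulation steps, spread over G parallel units running at clock frequency f, plus a communication term:

    T_FPGA ≈ (M · N · P) / (G · f) + T_comm

Doubling G halves the first term, which matches the halved execution times in Figure 7.3.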
[Figure 7.3: FPGA execution time for different input sizes (P = 95). Output-matrix size M×N (0 to 3·10^7) on the x-axis against execution time in seconds (0–300) for 9, 6 and 4 computational units.]
7.4 Verification
One of the most important steps in an implementation is the verification. Because the hardware design shifts from a floating-point to a fixed-point representation, errors are introduced. However, thanks to the 16 fractional bits, this error stays relatively small. This conclusion can be drawn from Table 7.2, which lists the maximum errors when the peak heights of the KCSMART v8 software version and the KCSMART v8 version are compared. Despite these errors, both implementations find the same peaks, so the hardware design is verified.
M      N      Max error
1758   1859   0.00076%
1758   5219   0.00065%
1859   5219   0.00048%
1758   7403   0.00083%
1859   7403   0.00073%
Table 7.2: Errors between KCSMART v8 software and KCSMART v8
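The comparison itself is straightforward; a sketch of the measurement (with hypothetical input arrays holding the peak heights of both versions) is:

#include <math.h>
#include <stddef.h>

/* Maximum relative error, in percent, between the double-precision peak
 * heights (sw) and the fixed-point ones converted back to doubles (hw). */
double max_rel_error(const double *sw, const double *hw, size_t n)
{
    double max_err = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = fabs((hw[i] - sw[i]) / sw[i]);
        if (err > max_err)
            max_err = err;
    }
    return 100.0 * max_err;   /* a percentage, as in Table 7.2 */
}

With 16 fractional bits, one quantization step is 2^-16 ≈ 1.5·10^-5, which helps explain why the observed errors stay so small.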
8 Conclusions and future work
In this chapter the final conclusions of the thesis are presented (8.1). Future work (8.2) describes the main improvements that can still be made.
8.1 Conclusions
This master thesis focused on improving an algorithm that finds co-occurring aberrations in DNA strings. Due to unbalanced translocations, Copy Number Alterations occur in DNA. These are detected by array Comparative Genomic Hybridization, which measures the ratio of the number of gene copies between 'healthy DNA' and 'tumor DNA'. The algorithm that calculates the aberrations from these ratios had already been designed and implemented in Matlab, but its execution took a long time to compute the locations of the co-occurring aberrations. The main reason is that many computations have to be performed, because the DNA strings are long. So to improve the algorithm, the computations have to be sped up.
The algorithm consists of four big steps: pair-wise, covariance and normalization, 2D kernel convolution, and peaks. When calculating the pair-wise space, all ratios between the healthy and tumor DNAs of chromosome arms A and B are compared, where the minimum value is added to the minima of the ratios from the other tumor DNAs. Then the pair-wise space is multiplied with the covariance matrix, to remove all continuous values in the ratios, and divided by a normalization matrix. The kernel convolution is applied to the result to look for local enrichment. Finally the peaks are sought to find the most co-occurring aberrations in the DNA strings.
The algorithm offers many possibilities for parallel execution when more hardware computational units become available. Therefore an FPGA was chosen, because of its parallelization possibilities and computational power. The partitioning is done by sending columns of the two different chromosome arms: for every column of chromosome A and chromosome B, one output of the pair-wise space can be computed. Therefore the computation of the pair-wise space and the computation of the covariance matrix are done on the FPGA. All divisions are left on the general-purpose processor, because they take too much time and area on the FPGA. The kernel convolution and the peak-finding step are also left on the general-purpose processor, because an implementation of the kernel convolution would take too much area.
An accelerator was designed on the FPGA to implement the computations. This accelerator receives UDP Ethernet frames through a purpose-built communication controller. The frames are buffered, and the computational controller can request the next data value. The computational controller manages the computational units, which work with 32-bit fixed-point numbers (16 fractional bits). The implemented design on the ML403 platform consists of 9 computational units. The design is scalable and can thus be used on a larger FPGA to increase the performance.
In conclusion, the performance of the algorithm to find co-occurring aberrations in DNA has been increased, with a minimum speedup of a factor 3 and a maximum of several hundreds. The design can be further improved by using larger FPGA boards.
8.2 Future work
Future work on this accelerator should focus on two main aspects: mapping more computational units and increasing the communication speed. To increase the number of computational units, the design can be mapped to a larger FPGA. Another advantage of a larger FPGA is that the clock speed will increase. With this method the execution time can easily be reduced.
The second way to reduce the computation time is to increase the communication speed. The design communicates over gigabit Ethernet, yet only 25 megabytes per second is achieved when all data is sent to the FPGA and the computer does not wait for the answer of the FPGA. This means the software implementation that sends the data over the Ethernet should be improved. The communication speed can be further increased by using RocketIO (available on more advanced FPGAs), which can communicate at 10 gigabit per second [5].
Another approach to improve the performance is to move the code that remains on the general-purpose processor to another platform. For example, the GPU is a good platform to implement the kernel convolution. In this way the last computations can also be accelerated.
Bibliography _
[1] Jan Bot, Grid Usecase BioMed, seminar (2008)
[2] Jeroen de Ridder, Jaap Kool, Co-occurrence analysis of insertional mutagenesis data
reveals cooperating oncogenes, Vol. 23 ISMB/ECCB (2007)
[3] P. Hupé, N. Stransky, Analysis of array CGH data: from signal ratio to gain and loss of
DNA regions, Bioinformatics, Vol. 20, no. 18, pp. 3413–3422 (2004).
[4] Xilinx system engineering group, Minimal Footprint Tri-Mode Ethernet MAC Processing Engine, Xilinx application note XAPP807 (2007)
[5] Xilinx system engineering group, RocketIO™ Transceiver User Guide, Xilinx user guide
UG024 (2007)
[6] Xilinx system engineering group, Virtex-4 FPGA User Guide, Xilinx user guide UG070 (2008)
[7] Xilinx system engineering group, XtremeDSP for Virtex-4 FPGAs, Xilinx user guide
UG074 (2009)
[8] Xilinx system engineering group, Virtex-4 FPGA Embedded Tri-Mode Ethernet MAC,
Xilinx user guide UG074 (2009)
[9] Zhang F, Gu W, Copy number variation in human health, disease, and evolution, Annual
Review of Genomics and Human Genetics, Vol. 10, pp. 451-481 (2009)