Expressed Sequence Tag Clustering using Commercial
Gaming Hardware
by
Charl van Deventer
DISSERTATION
submitted in partial fulfilment of the requirements for the degree
MAGISTER INGENERIAE
in
ELECTRICAL AND ELECTRONIC ENGINEERING SCIENCE
in the
FACULTY OF ENGINEERING AND THE BUILT ENVIRONMENT
at the
UNIVERSITY OF JOHANNESBURG
STUDY LEADERS: Willem A. Clarke & Scott Hazelhurst
October 14, 2013
Contents
Contents i
List of Symbols and Abbreviations vii
List of Figures ix
List of Tables xi
1 Objective/Scope 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.7 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature - Bioinformatics Theory and Algorithm Overview 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Bioinformatics Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Dataset Characteristics and Error Classification . . . . . . . . . . . . . . . 12
2.3.1 Alphabet (A,C,G,T,N) . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Read Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Redundancy (Coverage) . . . . . . . . . . . . . . . . . . . . . . 12
2.3.5 Quantity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.6 Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.7 Reverse Complement . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.8 Forward Reverse Constraints . . . . . . . . . . . . . . . . . . . . . . 13
2.3.9 Lane Tracking Errors . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.10 Gene Expression Differences . . . . . . . . . . . . . . . . . . . . . . 14
2.3.11 Low-Complexity Regions and Repeats . . . . . . . . . . . . . . . . 14
2.3.12 Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.13 Alternative Splicing . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.14 Single Nucleotide Polymorphisms (SNPs) . . . . . . . . . . . . 14
2.3.15 Base Calling Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.16 Vector or Primer Contamination . . . . . . . . . . . . . . . . . . . . 15
2.3.17 Chimera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.18 Cellular RNA contamination . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Bioinformatics Literature Study . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Bioinformatics History . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Expressed Sequence Tags History . . . . . . . . . . . . . . . . . . . 17
2.4.3 Rise of GPGPU in High Performance Computing . . . . . . . . . . 18
2.4.4 GPUs in Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Data Representation Overview . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 FASTA File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.2 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4-bit Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2-bit Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Algorithm Types Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6.1 Distance Based Algorithms . . . . . . . . . . . . . . . . . . . . . . . 23
2.6.2 Alignment Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.3 Database Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.4 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Theory - GPU Theory Study 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 GPU Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 General Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 CUDA API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.1 Introduction to the CUDA API . . . . . . . . . . . . . . . . . . . . 30
3.4.2 CUDA Compute Capabilities . . . . . . . . . . . . . . . . . . . . . 30
3.4.3 GPU Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Shared memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Global memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Local Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Texture memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Constant memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.4 CUDA Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Job level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Block level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 37
Thread Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 38
Instruction Level Parallelism . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Experimental Design 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Assumptions and Experimental Framework . . . . . . . . . . . . . . . . . . 44
4.2.1 Common Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 44
Scalability of CPU Cores . . . . . . . . . . . . . . . . . . . . . . . . 44
CPU speed has a negligible effect on GPU computation . . . . . . . 44
Operating systems have a negligible effect on performance . . . . . 44
4.2.2 Experimental Concerns . . . . . . . . . . . . . . . . . . . . . . . . . 45
Fair Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Sensitivity and Correctness . . . . . . . . . . . . . . . . . . . . . . 45
Differing Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Test PC Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.4 Theory and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Timing Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 46
GFLOPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Jaccard Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Sensitivity Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
CUDA Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Dataset Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.1 Arabidopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 SANBI 10000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Public Cotton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.4 C-Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.5 Mouse Curated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Overview of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Investigation 1: Theoretical Performance and Cost Evaluation . . . . . . . 51
4.5.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5.4 Expected Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Experiment 1: Sensitivity Comparison . . . . . . . . . . . . . . . . . . . . 52
4.6.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6.4 Expected Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.7 Experiment 2: Performance Benchmarking . . . . . . . . . . . . . . . . . . 53
4.7.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.7.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.7.4 Expected Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.8 Experiment 3: Dataset scaling tests . . . . . . . . . . . . . . . . . . . . . . 54
4.8.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.8.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.8.4 Expected Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.9 Experiment 4: Profiling Analysis . . . . . . . . . . . . . . . . . . . . . . . 55
4.9.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.9.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.9.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.9.4 Expected Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5 Selection of Algorithms 59
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Selection Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.1 Large-scale Parallelizability . . . . . . . . . . . . . . . . . . . . . . 60
5.2.2 Data Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.3 Random seeks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.4 Computation Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.5 Division into smaller tasks . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.6 Simplicity and Established algorithms . . . . . . . . . . . . . . . . . 61
5.2.7 Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.1 File Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.2 Memory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3 Job Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.4 Results Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.5 Output Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4 Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 Basic Program Structure . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.2 Parallel Program Structure . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 Heuristics Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5.1 Common word heuristics . . . . . . . . . . . . . . . . . . . . . . . . 68
Common n-word Heuristic . . . . . . . . . . . . . . . . . . . . . . . 68
t/v-word Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
u/v-sample Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chained Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.2 Suffix Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.6 Comparison Algorithm Selection . . . . . . . . . . . . . . . . . . . . . . . . 71
5.6.1 Simple Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.6.2 FFT Based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6.3 d2 distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.6.4 Levenshtein Edit Distance . . . . . . . . . . . . . . . . . . . . . . . 74
5.6.5 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6.6 Modified Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . 77
5.7 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.8.1 Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.8.2 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.8.3 Comparison Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 Implementation and Issues 83
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Program Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Detailed Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.1 Job Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.2 Memory management and paging . . . . . . . . . . . . . . . . . . . 85
6.4 Detailed Heuristics Algorithms . . . . . . . . . . . . . . . . . . . . 87
6.4.1 Word Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.2 u/v-sample Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.3 t/v-word Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.5 Detailed Comparison Algorithms . . . . . . . . . . . . . . . . . . . . 93
6.5.1 d2-Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5.2 Cumulative Smith-Waterman Distance . . . . . . . . . . . . . . . . 94
6.6 Conclusion and summary of concerns . . . . . . . . . . . . . . . . . . . . . 95
7 Results and Analysis 97
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.2 Investigation 1: Theoretical Performance and Cost Evaluation . . . . . . . 98
7.3 Experiment 1: Sensitivity Comparison . . . . . . . . . . . . . . . . . . . . 98
7.4 Experiment 2: Performance Benchmarking . . . . . . . . . . . . . . . . . . 100
7.5 Experiment 3: Dataset scaling tests . . . . . . . . . . . . . . . . . . . . . . 100
7.6 Experiment 4: Profiling Analysis . . . . . . . . . . . . . . . . . . . . . . . 102
7.7 Critical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.7.1 Multiple Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.7.2 Concurrent Execution . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.7.3 Sequences Data Size . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.7.4 Random Reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.7.5 Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8 Conclusion and Further Work 107
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2 Research Question Resolution . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.3 Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.3.1 Faster Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.3.2 Multiple GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.3.3 CPU Concurrent Use . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Bibliography 111
List of Symbols and Abbreviations
Abbreviation Description Defined on
API Application Programming Interface page 27
BOINC Berkeley Open Infrastructure for Network Computing page 4
cDNA complementary DNA page 10
CPU Central Processing Unit page 1
CUDA Compute Unified Device Architecture page 28
DNA DeoxyriboNucleic Acid page 7
EST Expressed Sequence Tag page 1
GFLOPS Giga Floating-point Operations Per Second page 47
GPGPU General Purpose Graphics Processing Unit page 1
GPU Graphics Processing Unit page 1
mRNA messenger RNA page 9
PHP PHP: Hypertext Preprocessor page 46
RNA RiboNucleic Acid page 7
List of Figures
2.1 The 5 Common Nucleotides [1] . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 DNA Chemical Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 The GPU devotes more transistors to data processing [2] . . . . . . . . . . . . 28
3.2 CUDA Memory Model [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 CUDA Grid of Thread Blocks [2] . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 CUDA Block Scheduling [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1 Visualization of the dataset as a collection of EST sequences . . . . . . . . . . 62
5.2 Many-to-Many comparison between 6 elements . . . . . . . . . . . . . . . . . 63
5.3 Many-to-many comparison between 6 elements in grid format . . . . . . . . . 64
5.4 Many-to-many comparisons of 8 elements divided into 3 separate 4 by 4 sized
jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.5 Basic Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6 Parallel Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.7 Comparison of Cumulative Score versus default Smith-Waterman scoring . . . 78
6.1 Detailed Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Word Count table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.1 Performance on the Arabidopsis data set for different sized subsets of the data 101
7.2 Ratio of performance of GPUcluster and wcdest with dataset size . . . . . . . 101
7.3 Kernel execution time plot (time is in micro-seconds) . . . . . . . . . . . . . . 104
List of Tables
2.1 Characters and meanings for FASTA sequences . . . . . . . . . . . . . . . . . 22
3.1 Comparison of various CUDA Capabilities [2] . . . . . . . . . . . . . . . . . . 31
3.2 Summary of Memory Types available to CUDA Programmers . . . . . . . . . 32
3.3 Optimal Maximums for 2.x Compute Capability GPUs for different block sizes 39
3.4 Optimal Maximums for 1.3 Compute Capability GPUs for different block sizes 39
5.1 66% Similarity Substitution Matrix . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Alignment matrix between ’GATTCGTTA’ and ’GGATCGTA’ . . . . . . . . 76
5.3 Comparison of the various algorithms introduced in this chapter . . . . . . . . 79
6.1 Comparison of Word Count kernels memory use for different k-lengths . . . . 89
7.1 Price and performance comparison of various hardware . . . . . . . . . . . . . 98
7.2 Performance comparison between different datasets . . . . . . . . . . . . . . . 99
7.3 Timing profiling results for the 10K dataset (≈ 10K ESTs) . . . . . . . . . . . 102
7.4 Timing profiling results for the A032 dataset (≈ 70K ESTs) . . . . . . . . . . 103
List of Algorithms
1 Instruction Level Parallelism Example . . . . . . . . . . . . . . . . . . . . 40
2 CPU-side Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3 Word Count kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4 Word Presence kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 u/v - Sample Heuristic Kernel . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 t/v - Word Heuristic Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Chapter 1
Objective/Scope
1.1 Introduction
Bioinformatics is one of the most rapidly advancing sciences today. It is a scientific
domain that applies modern computing and information technologies to biology, the
study of life itself. The field involves documenting and analysing genetics, proteins,
viruses, bacteria and cancer, as well as hereditary traits and diseases, and researching
cures and treatments for a whole range of health threats.
The growth of bioinformatics, and of theoretical and experimental developments in
biology, can largely be linked to the IT explosion, which gives the field ever more
powerful and cheaper processing options, paced by the steady yet significant
improvements promised by Moore’s Law [3].
This IT explosion has also driven significant advances in computer graphics hardware,
or GPUs (Graphics Processing Units), a segment under high consumer demand. This
demand has advanced GPUs far faster than classical CPUs (Central Processing Units),
outpacing CPU performance improvements by a large margin. As of early 2010, the
fastest available PC processor (the Intel Core i7 980 XE) had a theoretical performance
of 107.55 GFLOPS [4], while GPUs with TFLOPS (1000 GFLOPS) of performance had
been commercially available since 2008 (the ATI HD4800).
While GPUs were typically used only for graphical rendering, modern innovations have
greatly increased their flexibility and have given rise to the field of GPGPU (General
Purpose GPU) computing, which allows graphics processors to be applied to
non-graphics applications.
By utilizing GPU processing power to solve bioinformatics problems, the field can the-
oretically be boosted once again, increasing the amount of computational power available
to scientists by an order of magnitude or more.
This document will primarily deal with the possibility of utilizing GPUs in the prob-
lem of EST (Expressed Sequence Tag) clustering, chosen due to its high data volume,
complexity and overlap with other bioinformatics problems such as sequence reassembly.
1.2 Problem Statement
It is proposed that GPUs are appropriate and useful for bioinformatics problems, specif-
ically in the domain of clustering of EST sequences.
There are many possible advantages to implementing a computationally intensive
algorithm on the GPU, such as large speed-ups and reduced costs. Reported speed-ups
for ported applications range from 1.16× to 431× over CPU implementations [5].
The problems to which GPU computation is normally applied, and on which it
achieves high performance, usually involve high volumes of structured numerical data
and intense but predictable processing, with significant spatial locality in memory
reads and little branching.
Bioinformatics makes use of a large set of computer algorithms, including but not
limited to string manipulation, database search and manipulation, molecular physics
simulation and graph theory, none of which are known for their strength on the GPU.
EST clustering, a string manipulation problem, was selected for this dissertation
because of its importance in many bioinformatics applications despite its perceived
unsuitability to the GPU pipeline.
Many factors count against efficient implementation of bioinformatics algorithms on
the GPU: datasets that are very large compared to GPU memory (a gigabyte or more
is not uncommon), the dissimilarity between the string operations bioinformatics
requires and the native graphics pipeline, which is specialised for numerical data, the
undesirability of branching statements, and the parallelizability requirements placed on
ported algorithms.
Some of these disadvantages can be mitigated by performing part of the processing on
the GPU and part on the CPU, utilizing each architecture to its strength.
1.3 Research Questions
This thesis will seek to address the following questions:
1. Is GPGPU a practical computing platform for bioinformatics algorithms?
2. Can existing bioinformatics algorithms be practically ported to GPGPU?
3. Is the cost of GPGPU competitive with classical CPU computing?
4. Is the performance of GPGPU competitive with classical CPU computing?
1.4 Objective
The objective of this thesis is to answer the posed research questions. To do so, the
following approaches will be used:
1. The specifications of bioinformatics algorithms depend on the highly specialized
nature of the biological data they are designed to process. Proper research is
required to understand the unique demands, constraints and limits of the domain.
Research is required on the GPGPU platform and its strengths and weaknesses.
This can indicate whether GPGPU would be a good match for the requirements of
bioinformatics algorithms and data.
Research into previous GPGPU implementations in the bioinformatics field should
provide evidence either supporting or rejecting its practicality.
2. The best way to prove the practicality of porting a bioinformatics algorithm is to
perform such a port as part of the project.
The suitability of prospective algorithms to port should be researched with chal-
lenges identified.
3. Analysis of the cost of GPGPU is performed by identifying the commercial cost of
both GPU and CPU platforms.
The advertised GFLOPS of both platforms can be compared to their costs, from
which the cost per GFLOP of each can be computed and compared.
4. Advertised GFLOPS figures are not always representative of real-world performance.
To measure real-world performance, benchmark tests need to be run on CPU and
GPU implementations of the same algorithm.
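The cost comparison in points 3 and 4 can be sketched as follows. The GFLOPS values are the theoretical figures quoted in Section 1.1; the prices are placeholder assumptions for illustration only, not values taken from this dissertation:

```python
# Sketch of the cost-per-GFLOP comparison described above.
# GFLOPS figures are the theoretical values quoted in Section 1.1;
# the prices below are placeholder assumptions, not measured data.
hardware = {
    "Intel Core i7 980 XE (CPU)": {"gflops": 107.55, "price_usd": 999.0},
    "ATI HD4800 (GPU)": {"gflops": 1000.0, "price_usd": 299.0},
}

for name, spec in hardware.items():
    # Lower is better: how many dollars buy one GFLOP of theoretical throughput.
    cost_per_gflop = spec["price_usd"] / spec["gflops"]
    print(f"{name}: {cost_per_gflop:.2f} USD per GFLOP")
```

Even under rough price assumptions, the advertised figures suggest more than an order of magnitude difference in cost per GFLOP, which is the kind of gap Investigation 1 examines against actual prices and benchmarks.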
1.5 Scope
Due to the open nature of the research questions, the scope of this project will be limited
to the domain of EST clustering.
EST clustering is a well-researched topic dealing with sequence comparisons. It in-
volves high volumes of relatively short sequences and is related to genome identification,
an important process in bioinformatics.
Though the domain of EST clustering is not representative of all bioinformatics prob-
lems, it is a relatively simple processing step for short nucleotide sequences and should
lend insight into the use of GPGPU in sequence reassembly.
The choice of clustering algorithm depends on its suitability to the GPU work-flow,
its scalability to large datasets and its ability to parallelise over multiple threads. A
simple software port of the most common modern algorithm is not expected to yield
ideal performance, so the research is likely to settle on another known and proven
algorithm that is better suited to parallelisation.
The project is not meant to research and develop an entirely new clustering algorithm,
but merely to adapt an existing, proven one to a different platform.
The project deals with the clustering stage only, and mentions the reassembly stage
only where clustering can improve it, either by increasing speed or by reducing errors
and improving quality.
This project will not deal with the EST cleaning phase and assumes a dataset that
has already been pre-processed by the base caller.
This project will not deal with repeat masking.
1.6 Contributions
During the course of this project significant research was done on GPU Cluster Computing
for the purpose of scaling up to multiple PCs and GPUs. Many possible solutions exist,
but eventually BOINC was chosen.
BOINC (Berkeley Open Infrastructure for Network Computing) is a distributed grid
middleware platform. This means that it hosts distributed applications and provides
client and server software to allow new desktop grid computers to be added to a project
with a minimum of configuration [6]. This allows a whole office of computers, all outfitted
with GPUs, to contribute to a computing task without the need of special or dedicated
hardware. All the PCs used in this manner can still serve as normal desktop computers
for everyday use.
The project was initially developed with BOINC support in mind, and the feasibility
study was submitted and accepted at the SATNAC 2010 conference [7], but time
constraints prevented the development of full support for the BOINC framework in the
final application.
1.7 Overview
The remainder of this document is organised as follows:
• Chapter 2 - Literature - Bioinformatics Theory and Algorithm Overview
– Brief introduction and literature study of bioinformatics, ESTs and an expla-
nation of why clustering ESTs is important for reassembly.
– Characteristics of EST Datasets and explanation of terms from a data analysis
standpoint.
– Brief history of bioinformatics and the important advances that resulted in our
current level of understanding of ESTs and their processing.
– Brief history of contributions GPUs have made in bioinformatics.
– Explanation of the FASTA file format, used to store ESTs.
– Introduction of the categories of GPU algorithms under review in this disser-
tation.
• Chapter 3 - Theory - GPU Theory Study
– An introduction to GPU computing, its strengths and its limitations.
– Introduces and explains the theory surrounding parallelism on the GPU.
– Introduction to the CUDA API, including:
∗ Compute capabilities of different generations of GPU.
∗ Explanation of GPU memory types.
∗ Different types of parallelism offered by GPUs.
• Chapter 4 - Experimental Design
– Common terms and measurement metrics are explained.
– Experimental assumptions are enumerated.
– Datasets used for experimentation are listed and introduced.
– Individual tests and experiments are explained in detail.
– Expected results are proposed.
• Chapter 5 - Selection of Algorithms
– Selection criteria are listed and explained.
– Expected data, memory and program structures are introduced.
– Individual algorithms for use with heuristics are introduced and their strengths
and limitations are provided.
– Individual comparison algorithms are introduced, and their strengths and lim-
itations are provided.
– Proposed algorithms are compared and specific ones selected for GPU imple-
mentation.
• Chapter 6 - Implementation and Issues
– Implementation details surrounding program structure and memory management
are provided.
– Details on the implementation of individual kernels are provided.
– Issues with implementation are discussed.
– Implementation concerns are listed, including the shortcomings of parallelizing
algorithms originally meant for the CPU.
• Chapter 7 - Results and Analysis
– Experiments proposed in Chapter 4 are executed and their results are provided.
– A critical analysis of the experiments is presented.
• Chapter 8 - Conclusion and Further Work
– Summary of the project is presented.
– Areas where further work can be performed are identified.
– Conclusions of the work are provided.
Chapter 2
Literature - Bioinformatics Theory
and Algorithm Overview
2.1 Introduction
This chapter provides basic introductory knowledge of the bioinformatics theory and terms
used both in the field and throughout the rest of the thesis. This is not meant to be a
comprehensive introduction to the bioinformatics field as a whole, but should be sufficient
to understand the problem and the solutions provided by this document.
A literature study is included that gives a basic overview of both the history of EST
bioinformatics processing and the contributions of GPU computation to the
bioinformatics field in general. Many of these references serve as inspiration for this
thesis, forming the body of knowledge to which this thesis intends to contribute.
Finally, a higher level survey of the algorithms and processes used in EST comparison
and processing is introduced, which will be analysed more thoroughly in a later chapter.
2.2 Bioinformatics Theory
A nucleotide is the most basic building block of the nucleic acid macromolecules found
in all living species, the best known of which are DNA (DeoxyriboNucleic Acid) and
RNA (RiboNucleic Acid). There are 5 common nucleotides that this document will
deal with: Adenine (A), Guanine (G), Cytosine (C), Thymine (T) and Uracil (U).
Uracil is found only in RNA, while Thymine is found only in DNA; the information
these two represent can be considered equivalent.
Figure 2.1: The 5 Common Nucleotides [1]
Figure 2.2: DNA Chemical Structure
Every nucleotide base bonds natu-
rally with one other nucleotide base (Fig-
ure 2.2). Adenine bonds with Thymine
and Guanine bonds with Cytosine. Two
nucleotides bonded like this are usually
referred to as a base pair. Base pairs in
DNA and RNA form sequences. Every
base pair additionally has a direction, in-
dicated by the layout of the sugars on
the base pair. The 3’ side of a base pair
can only link with the 5’ of another base
pair to form long chains. This terminol-
ogy is used to indicate the direction of a
sequence.
By convention, sequences are usually
written from the 5’ end to the 3’. The
characters A,G,C,T are used to refer to
the different nucleotides, Adenine, Gua-
nine, Cytosine and Thymine. In addition, the character N refers to an unknown low
quality read that could be any of the 4 bases. When needed, the character ’-’ could also
refer to an undetermined gap in the sequence. These 6 characters will be used throughout
the paper to represent nucleotide sequences.
The famous double helix shape of DNA arises from two strands of nucleotides, connected
as base pairs running in opposite directions to one another. It is important to note that
the following two sequences are equivalent and represent the same subsequence of DNA:
5’ end - ACTGGC - 3’ end
3’ end - TGACCG - 5’ end
If both sides are read from 5’ to 3’, the sequences are simply written as:
ACTGGC
GCCAGT
These are called complementary sequences.
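Computing a complementary sequence as described above can be sketched in a few lines of Python; this is an illustrative fragment, and the name `reverse_complement` is our own choice rather than that of any tool discussed in this thesis:

```python
# Sketch: complement each base (A<->T, C<->G), then reverse the result
# so that it too reads from the 5' end to the 3' end.
# N (an unknown base) simply complements to N.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}

def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("ACTGGC"))  # GCCAGT, as in the example above
```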
The term base pairs (bps) also refers to the length of such sequences, with kilobase
pairs (kbps) being equal to a thousand base pairs of RNA or DNA and megabase pairs (Mbps)
being equal to a million base pairs. The above units only refer to double stranded bonded
base pairs. The single stranded equivalent unit of length is nucleotides (nts). This paper
will, however, not make this distinction and will use the unit bps in all indications of
nucleotide sequence length.
A gene is a unit of heredity in living organisms and is encoded in a sequence of DNA.
Genes determine the growth and maintenance of cells, how they divide and how and when
they die. They determine features such as eye colour, hair colour and blood type. Genes
define every characteristic of every DNA carrying living species.
Genes, as they are coded in DNA, consist of sequences of alternating coding nucleotide
sequences (exons) and non-coding nucleotide sequences (introns). During gene expression,
these genes are transcribed to RNA sequences and the introns spliced out, leaving only a
sequence of the coding exons.
Sometimes these resulting sequences are used in cellular processes, but the interest of
this paper is the subset of RNA called mRNA (messenger RNA), which transports codon
sequences to the ribosomes, cellular factories that create proteins from these sequences.
A codon is a sequence of 3 nucleotides that refers to a specific amino acid, the building
block of proteins. Though there are 64 possible codons (a sequence of 3 nucleotides drawn
from 4 bases, giving 4^3 = 64 permutations), there are only 20 amino acids. Most of the
amino acids have multiple redundant codons that they can be translated from.
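The redundancy of the genetic code can be illustrated with a few entries from the standard codon table (this is a deliberately partial table used for illustration only; a real translation would require all 64 codons):

```python
# A handful of entries from the standard genetic code.
# GAA and GAG are redundant codons: both translate to the same amino acid.
CODON_TABLE = {
    "ATG": "Methionine",
    "TGG": "Tryptophan",
    "GAA": "Glutamic acid",
    "GAG": "Glutamic acid",
}

def translate(seq: str) -> list:
    # Split the sequence into codons of 3 nucleotides and look each one up.
    return [CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3)]

print(translate("ATGGAAGAG"))
# ['Methionine', 'Glutamic acid', 'Glutamic acid']
```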
At any time, many gene expressions will be occurring in any cell simultaneously. Scientists
can, however, extract this mixture of mRNA strands before they reach the ribosomes.
mRNA is invaluable to gene discovery: comparing an mRNA sequence to a full DNA
sequence allows one to locate the gene within the DNA, as well as to determine where
the introns and exons are.
The process of actually sequencing the mRNA involves writing the mRNA back into a
DNA sequence, called cDNA (complementary DNA). This cDNA can then be sequenced,
using equipment and tools similar to that used in full genome sequencing.
Current sequencing technology can only sequence approximately 50 to 600 base pairs
per read. To work within this limitation, the cDNA is randomly fragmented into sections
much longer than this. These fragments are then read from both ends, with an unknown
sized gap in the middle. The information of which read is the 3' side and which was its
opposite 5' side's read is recorded, and is called forward-reverse constraints.
Manually examining the raw sequencer data is a time-consuming and difficult process,
so software called a ’base caller’ is used to turn the raw data output of a sequencer into
the nucleotide sequence that is normally used. It outputs the characters A,C,G and T
where it is sure of the base and N where it is unsure. Advanced base callers also have
an optional output where the certainty of the bases is given in the form of quality data.
This quality data can be used to further improve the accuracy of subsequent processing
of the data.
These sequenced fragments are called Expressed Sequence Tags (ESTs). Their value
is not in themselves, but rather that when they are reassembled, they will provide the
nucleotide sequence of the original mRNA strand: The nucleotide sequence used to create
proteins.
Before these sequences can be properly used, they need to be cleaned. Sequencing
is an error-prone process, additionally hampered by the fact that many lab errors are
indistinguishable from natural errors or mutations. However, it is still possible to
pre-process the sequences in an attempt to eliminate the most obvious errors.
First, all the vectors that were accidentally read need to be removed. Vectors are
artificial nucleotide sequences that bind to the target sequence and are essential in the
process of sequencing it. They are usually ignored, but can sometimes be mistaken
for part of an EST sequence. The vectors used are usually known, so this is a trivial step.
Secondly, the ends of the sequence are often removed. Whether they are removed and
how much is removed varies, but the reason for this is that sequence reads near the start
and end of an EST are usually significantly error prone and uncertain.
Another significant sequence cleaning step is called repeat masking. DNA often contains
sub-sequences that are repeated several times in the same transcript, or sub-sequences
that are repeated across a large number of different and otherwise independent transcripts.
These repeats make it difficult to find non-repeating sequences that can be reassembled,
and can even cause unrelated ESTs to be considered to be from the same transcript due
to their shared repeats. Repeat masking is the process of identifying and marking repeats
as low quality regions.
These cleaned sequences are then processed in a step called clustering, which groups
similar overlapping ESTs together before finally being reassembled into a sequence that
ideally is identical to the originating mRNA nucleotide sequence.
Reassembly, however, is a difficult process. An EST dataset could have up to millions of
individual EST sequences, each of which needs to be compared to every other EST sequence
in an expensive alignment process. Additionally, an EST dataset can contain sequences
from many different expressed mRNA sequences, as well as genetic information from
viruses and bacteria.
Hence the clustering process. Each EST sequence is compared to every other EST
sequence, but instead of the expensive alignment algorithm, a much quicker heuristic
comparison algorithm is used to determine whether the two tested sequences have enough
in common to be considered overlapping. If they are determined to overlap, they
are clustered together. If not, they are placed in separate clusters.
This clustering process separates the sequences belonging to different sources, ideally
with each individual cluster representing a separate original mRNA sequence.
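The clustering step described above can be sketched as follows. Here `overlaps` is a deliberately naive stand-in for the heuristic comparison algorithm, and the union-find bookkeeping is one possible way to maintain transitive clusters, not the approach of any specific tool discussed in this chapter:

```python
# Sketch of clustering: every pair of ESTs is tested with a cheap comparison,
# and overlapping sequences are merged into the same cluster (union-find).
def overlaps(a: str, b: str, word_len: int = 6) -> bool:
    # Naive placeholder heuristic: do the sequences share any word of length 6?
    words = {a[i:i + word_len] for i in range(len(a) - word_len + 1)}
    return any(b[i:i + word_len] in words for i in range(len(b) - word_len + 1))

def cluster(ests: list) -> list:
    parent = list(range(len(ests)))          # union-find parent array

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path compression
            i = parent[i]
        return i

    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if overlaps(ests[i], ests[j]):
                parent[find(i)] = find(j)    # merge the two clusters

    groups = {}
    for i in range(len(ests)):
        groups.setdefault(find(i), []).append(ests[i])
    return list(groups.values())
```

Note that clustering is transitive: two sequences that never directly overlap still end up in the same cluster if a chain of overlapping sequences connects them.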
The reassembly process is then run on each separate cluster as opposed to the entire
EST database. In practice this could save weeks or months of reassembly time,
depending on the complexity of the organism and the number of ESTs gathered from it.
Once reassembled, the output will resemble the original mRNA sequence that the
ESTs were sourced from. Errors are possible and likely, as is an incomplete sequence.
Errors, as well as mutation differences from individual to individual, make it valuable to
repeat this process, but by this point the information is already in a format desired by
biologists for genetic analysis.
2.3 Dataset Characteristics and Error Classification
2.3.1 Alphabet (A,C,G,T,N)
EST files contain sequences made from 5 characters:
A - Adenine
G - Guanine
C - Cytosine
T - Thymine
N - Unsure, could be any nucleotide
The last character (N) only occurs with low quality reads when the base caller is
uncertain.
2.3.2 Read Length
Read length is an indication of the expected number of characters in every EST. Depending
on the sequencing hardware this can be as long as 600 or as short as 40. Since errors
are most likely to occur at the ends of the reads, it is common for the ends to be trimmed,
reducing the final read length used in clustering and reassembly.
2.3.3 Orientation
Sequences have a natural direction. By convention nucleotide sequences are written from
the 5’ end to the 3’. This refers to which of the carbon atoms in a nucleotide base the
next one is bonded to, something that can be sensed by the sequencing hardware.
2.3.4 Redundancy (Coverage)
The estimated coverage suggests the number of times any single base is represented in
an EST dataset. A coverage of 3x, for instance, means that any base has been read three
times and appears in three ESTs. This is only an estimation however, and may differ
based on gene expression and random chance.
The amount of coverage aimed for depends on the sequence read length. Short
read ESTs may need coverage as high as 8x or 16x to be reassembled correctly, while
some long reads may be reassembled with coverage as low as 3x.
2.3.5 Quantity
The high amount of redundancy results in an enormous amount of data. A 5 000 base
pair mRNA read with 8x coverage means 40 000 characters that need to be stored. A
complex enough organism with a large amount of mRNA expressed at once can easily
create millions of individual ESTs, requiring gigabytes of storage.
2.3.6 Quality
Quality data is an optional output from base calling software. It refers to the
certainty of the base caller that the base read is in fact the correct one. Quality is usually
low near the start of a read sequence, improves towards the middle, and degrades again
near the end, which is why it is common practice to clip the ends of a sequence.
High quality bases can still be erroneous, but low quality bases have a greater chance
of being so.
2.3.7 Reverse Complement
DNA consists of two nucleotide sequences bonded together and running in opposite
directions, wrapped around in a double helix shape. Adenine bonds with Thymine and
Guanine bonds with Cytosine. Both sequences represent the same information however,
and as such they are called the reverse complement of one another. To illustrate:
5’ end - ACTGGC - 3’ end
3’ end - TGACCG - 5’ end
If both sides are read from 5’ to 3’, the sequences are simply written as:
ACTGGC
GCCAGT
2.3.8 Forward Reverse Constraints
Some sequencing hardware tracks reads from both ends of a cDNA fragment. With this
information two reads can be paired, with the knowledge that one sequence is on the 3' or
5' side of the other. This information implies that these two sequences should be clustered
together, and helps prevent erroneous reassembly.
2.3.9 Lane Tracking Errors
The forward-reverse constraints can also commonly include errors. These lane tracking
errors can result in unrelated pairs being reported as read from the same fragment.
2.3.10 Gene Expression Differences
The only mRNA that will be gathered is from those genes that are undergoing expression
at the time, in amounts proportional to the magnitude of their expression. The practical
result is that the amount of mRNA gathered for some genes may far exceed that of other,
less expressed genes. Additionally, genes that are expressed only rarely and only under
certain circumstances may not be represented at all.
2.3.11 Low-Complexity Regions and Repeats
Low-complexity sequences appear in many unrelated proteins and consist of repetitive
short fragments. Since the same region can occur over a wide range of proteins, these can
easily lead to a mis-clustering of ESTs.
2.3.12 Masking
Masking is the process of identifying and marking repeats and low-complexity regions.
Once marked, these sections can be assigned a lower weight in clustering, greatly
reducing the chance of ESTs being mis-clustered or wrongly assembled.
2.3.13 Alternative Splicing
The same DNA sequence can contain many separate genes. These genes share much of
the same nucleotide sequence but have different introns and exons and as such are
spliced differently. This phenomenon is called alternative splicing. Alternatively spliced
mRNA will share much of each other's sequence and will often be clustered together, with
detection possible only during the reassembly stage.
2.3.14 Single Nucleotide Polymorphisms (SNPs)
A Single Nucleotide Polymorphism is a common natural mutation. It refers to the event
where a single base in a sequence varies from individual to individual. This mutation is
often the reason for genes working differently, or not at all, so the identification of SNPs
is valuable to the medical community.
2.3.15 Base Calling Errors
It is common for the base caller to mistake one nucleotide for another, or to report an
inserted or deleted nucleotide. This means that exact string matching over long sequences
of nucleotides (>20 characters) will often fail to find matches. For this reason most
algorithms either utilize distance metrics (how similar, but not identical, two sequences
are) or perform exact string matching over many shorter sequences.
2.3.16 Vector or Primer Contamination
Vectors and Primers are special artificial DNA used in the sequencing process. These
sequences are usually removed during the sequencing and base calling process, but this
sometimes fails, resulting in contamination in the EST dataset.
2.3.17 Chimera
Chimeras are an artefact of the imperfect sequencing process. They are created when two
or more transcripts contribute to a single cDNA sequence. This sequence is then cloned
and appears to be a valid transcript when the EST fragments are reassembled.
2.3.18 Cellular RNA contamination
When extracting cells from the organism, it is possible to also extract bacteria and virus
samples, or have those contaminate the sample post-extraction. This is then sequenced
along with the organism mRNA, creating ESTs completely unrelated to the organism.
Databases of common bacteria and viruses can be used to identify and remove these
erroneous ESTs.
2.4 Bioinformatics Literature Study
2.4.1 Bioinformatics History
The study of biology has a rich history that arguably dates from ancient times [8], particularly
due to farming and animal husbandry, but much of our scientific understanding of
biology comes from more recent discoveries. Modern understanding of biology owes much
to the discoveries made in the 19th century, even if the significance of many of them only
became apparent later.
Today proteins are widely known as the building blocks of life, but the term itself
first appeared in a letter written by Gerardus Johannes Mulder in 1838 [9]. Though
he initially believed that there was only a single common large type of protein, recent
estimates suggest that the human body produces up to 84 000 different proteins [10].
Gregor Mendel is credited with the idea of genes as a unit of heredity due to the work
he published in 1865 which deals with his studies on controlled breeding of pea plants
and the propagation of traits along family lines [11, 12]. The importance of his work was
not recognised when it was first published, but its rediscovery in the 1900s led to it being
considered the foundation of modern genetic studies.
Charles Darwin's famous 'On the Origin of Species' was poorly received when it
was published in 1859, even though it is today recognised as absolutely essential in explaining
the biodiversity of species on Earth [13]. Though it sought to explain evolution through
survival of the fittest and sexual selection, the mechanism for heredity was not yet known.
Thomas Hunt Morgan, though initially critical of Darwin’s theory of evolution, used
fruit flies to replicate Gregor Mendel’s experiments. His experiments in 1910 and onward
both confirmed Gregor Mendel's work and led to an understanding of the importance of sex
chromosomes in genetics, as well as a greater understanding of how genes are inherited
between generations [14].
DNA was isolated for the first time by Friedrich Miescher in 1869 during experiments
to determine the chemical composition of cells [15]. DNA is now recognised as the
mechanism by which genes are encoded, stored and propagated.
The field of bioinformatics emerged from the overlap of biology and digital
computing, which began with the invention of the first digital computers in the 1940s. It was
only in the 1970s [16] that the field gained prevalence, due to the rising availability of
the personal computer, allowing individual researchers without large budgets to digitally
analyse their data.
The mid-1900s saw incredible advances in our understanding of biology, including
the discovery of the structure of DNA [17, 18], the encoding of genetic information for
proteins [19] and understanding of the information content of DNA [20]. Simultaneously,
new theories of computing and informatics were being developed [21].
Based on and building on these advances, the 1970s saw the beginning of radical new
methods to analyse the information content of DNA, the formulation of the first
sequence alignment algorithms [22, 23, 24, 25, 26, 27, 28] and the wide-spread use of
molecular data in evolutionary studies [29].
By the mid-1970s the theory and practice of sequence alignment was well understood
which resulted in increased activity and innovation in the latter half of the decade, a
key part being the establishment of standards used in the archiving and distribution of
protein sequences and protein structure information [30, 31, 32].
The availability of public databases in the 1980s [33, 34] and the increasing rate of
generation of molecular sequencing data in that decade [35] led to key advances such as the
formulation of the Smith-Waterman algorithm [36] and the FASTA family of algorithms
for database searching [37, 38].
Advances in hardware also occurred more rapidly and this included the use of dedicated
parallel hardware to more efficiently process this flood of data [39, 40, 41, 42, 43, 44].
Though ARPANET, the forebear of the internet, existed since the late 1960s, it was only
in the 1990s, when the internet as we know it started becoming more publicly available,
that the bioinformatics data available to researchers dramatically increased [45]. Before
this point, access to databases such as Genbank [34] was limited and distributed mostly
through CD-ROM discs [31].
Additionally new algorithms and tool-kits such as BLAST [46] became available that
further improved on the processing that can be done with the available data.
In 1990 the Human Genome Project was started with the stated goal of providing a
complete high-quality sequence of human genomic DNA to the research community as a
publicly available resource [47]. Though well-funded, the complexity of this endeavour
meant that this effort only provided a working draft of the human genome in 2001 [48]
and was only completed in 2003 [49, 50].
2.4.2 Expressed Sequence Tags History
Before the Human Genome Project began, there were debates about the need for large
scale DNA sequencing since the sequencing of ESTs would allow identification of all of the
important gene coding regions of DNA. EST sequencing proved to be a cheaper technique
which sequences only expressed genes, allowing useful genes to be identified for medical
and research use far ahead of the 12 to 15 year estimation for the completion of the Human
Genome Project.
It was then estimated that only 3% of the information content of DNA contained
coding sequences for genes, and that the sequencing of these regions should take priority [51].
Though whole-DNA sequencing was eventually used to map the human genome, far
ahead of schedule due to novel methods, sequencing and analysis of ESTs remain an
important tool to cheaply discover novel genes in a wide array of species [52].
Several software programs have previously been developed to deal with the problem of
EST clustering and reassembly. Most of them have roots in DNA shotgun sequencing,
but since the data structures and algorithms are similar, the applications can often be
adapted to deal with EST data as well. What follows are some of the more
notable EST clustering/reassembly programs.
One of the best known early genome assembly programs is called PHRAP [53]. PHRAP
is part of a suite of programs that was originally designed for whole-sequence shotgun se-
quencing of DNA, but has since been adapted and used in EST assembly [54].
CAP3 [55], another well known DNA sequence assembly program, has also since been
used for EST assembly. It operates by finding all possible overlaps quickly using a BLAST-
like method, then Smith-Waterman alignment is used to align the overlaps and generate
contigs (set of overlapping DNA segments) and the final assembly. CAP3 is known to
have fewer errors than PHRAP when assembling EST data [54].
CAP3 was not originally designed for EST sequences however, so a tool called TGI
Clustering Tool [56, 52] was developed, intended to cluster sequences, greatly decreasing
the time needed for CAP3 to reassemble the sequences.
Two notable programs that have been developed purely for the purpose of clustering
EST sequences are PaCE [57], which uses a maximal common substring algorithm to find
overlaps, and d2 cluster [58], which uses a common words method to detect similarity.
More recently, the wcdest [59] application has been developed, which is based on
d2 cluster, utilizing aggressive heuristics to improve speed significantly
while having a negligible impact on clustering accuracy. This tool only clusters the
sequences however, so a second tool is required to reassemble them. The focus of wcdest
on effective heuristics served as the groundwork and source for the heuristics employed
in this project.
2.4.3 Rise of GPGPU in High Performance Computing
The prominence of the GPU in non-graphical fields is a relatively new occurrence,
with some of the earliest papers in the field appearing during the 1990s. During this
time period GPUs were still generally limited to graphics related problems, resulting in
most GPGPU programs of the era being rendering or image manipulation projects such
as real-time textures [60], image-composition [61] and video flow detection [62].
The 1990s also saw one of the first non-graphics or visualisation related problems solved
using GPU computing, namely using clever rendering techniques to compute neural
networks [63], which highlights both the limitations of early GPUs and the inventiveness
of researchers in overcoming them.
It was not until 2001 and the release of the first GPU with programmable shaders (the
GeForce 3) that general purpose programming on the GPU really took off, with ray tracing
[64, 65], cellular automata [66, 67], sorting [68, 69], n-body simulations [70] and fluid flow
simulations [71].
Though these applications impressively use the power of GPUs, they were still limited
to programmable shaders and programming APIs designed primarily for graphics-oriented
problems. Some efforts were made to negate these disadvantages through middleware
APIs that expose these graphics APIs in a more domain-neutral stream computing
format, such as BrookGPU [72, 73] or Sh (later RapidMind) [74, 73], but it is only with
the release of CUDA in 2007 that the GPGPU computing field greatly expanded. CUDA
was developed by Ian Buck, the developer of BrookGPU, and backed by nVidia, a
manufacturer of GPUs. CUDA allows programming of GPUs at a very low level, as opposed
to simply being a middleware API. CUDA is described in more detail in Chapter 3.
GPGPU has recently entered the public eye due to the publicity of the GPU clients
of the SETI@home [75, 76] and Folding@home [77, 78] projects, both of which allow the
general public to use spare GPU cycles to help solve huge scientific problems (the search
for intelligent life and protein folding simulation respectively) that traditionally require
huge and expensive supercomputers. These projects have proven to be monumental in
raising public awareness of the computational power that GPUs can provide.
2.4.4 GPUs in Bioinformatics
Before the release of CUDA (a development language for GPUs), there were a number
of bioinformatics projects utilizing GPGPU, usually by using the OpenGL API
and modelling the data as images. These include GPU implementations of the Smith-
Waterman algorithm [79, 80], inference of evolutionary trees from DNA sequence data
[81] and fast exact string matching [82]. These generally reported favourable speed-ups
as high as 35x compared to CPU-only implementations. These improvements serve as
evidence of the value that GPGPU computing can provide to the bioinformatics field.
After the release of CUDA, many new bioinformatics applications have become available
[83] due to the increased flexibility of general purpose APIs. Most common are new
CUDA Smith-Waterman implementations such as SWcuda [84], CUDASW++ [85] and
more [86, 87, 88]. These Smith-Waterman implementations are evidence of the importance
this specific algorithm has in bioinformatics computing, and they aided in the development
of the custom implementation that is used in this project (See Section 5.6.6).
A large number of CUDA bioinformatics applications of various descriptions have been
developed, including genetic database searching through exact string matching such as
GPU-HMMER [89] and MUMmerGPU [90], several projects dealing with medical imaging
[91, 92, 93], a protein BLAST implementation [94], as well as the well known Folding@home
project [77, 78].
MUMmerGPU [90] is an interesting project since it addresses the issue of low memory
on GPUs (as little as 256MB) and large datasets. It allows high-throughput sequence
alignment of a set of queries to a reference database by transforming that database into
a suffix tree, a technique that allows for fast exact string matching, but requires a large
amount of memory to function. Through aggressive optimization of the data-structures
and subdivision of the suffix tree and the queries, MUMmerGPU pages different subsets
of the task in and out of GPU memory. Even with this overhead, MUMmerGPU still
reports 3.5 times faster performance than the C implementation. MUMmerGPU 2.0 [95]
reports a 13x improvement over the C implementation, largely through further memory
and data structure optimization.
More recently projects based on OpenCL, a more platform independent API than
CUDA, have begun to appear. To date however, CUDA is known to have better perfor-
mance than OpenCL [96], though this might change as time passes and newer GPUs and
improved drivers are released.
Bioinformatics toolkits such as Unipro UGENE [97] provide an extensive set of tools
for manipulating sequence data, alignment and assembly in a visual manner while fully
utilizing either a local CUDA-capable GPU or a remote one. This integrated approach
is valuable to scientists, since every tool will work well with the others, while providing
convenience in setup and configuration. As toolkits such as this are developed and mature
further, it is expected that GPU computing will become more common in laboratories
around the world.
2.5 Data Representation Overview
2.5.1 FASTA File Format
The industry standard data format to represent and transfer genetic data is called the
FASTA file format [98]. This file format is human-readable and flexible, capable of storing
both nucleotide sequence data and amino acid sequence data.
The FASTA file format supports a large number of characters for both nucleotide and
amino acid sequences, shown in Table 2.1. Of these we will only use the 5 basic characters
for nucleotide sequences: A, C, G, T and N. The table is included for completeness; the
additional characters are not used in this project or the experimental datasets.
An example of this file format is presented here, from the SANBI10000 dataset:
>T30671 g612769 | T30671 CLONE_LIB: Human Eye. LEN: 319 b.p. FILE
gbest3.seq 5-PRIME DEFN: EST20487 Homo sapiens cDNA 5’ end
ATGATAATGAAAGACTCTCGAAAGTTGAAAAAGCTAGACAGCTAAGAGAACAAGTGAATG
ACCTCTTTAGTCGGAAATTTGGTGAAGCTATTGGTATGGGTTTTCCTGTGAAAGTTCCCT
ACAGGAAAATCACAATTAACCCTGGCTGTNTGGTGGTTGATGGCATGCCCCCGGGGGTGT
CCTTCAAAGCCCCCAGCTACCTGGAAATCAGCTCCATGAGAAGGATCTTAGACTCTGCCG
AGTTTATCAAATTCACGGTCATTAGACCATTTCCAGGACTTGTGAATTAANAACCAGCTG
GTTGATCAGAGTGAGTCAG
This entry begins with a header, identified by the starting > character. The header
includes information such as its unique code, its source, clone data and other annotation.
Following this header is the actual sequence. When encoding nucleotide data this is
usually the 4 base characters, A, C, G and T, as well as the character N, which represents
an unknown base.
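Reading this format can be sketched in a few lines; the following is a minimal illustrative parser written for this discussion, not the parser of any tool mentioned in this thesis:

```python
# Minimal sketch of parsing FASTA-formatted text: a '>' line starts a new
# record, and subsequent lines until the next '>' form its sequence.
def parse_fasta(text: str) -> list:
    records = []                       # list of (header, sequence) pairs
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>T30671 g612769 | T30671
ATGATAATGAAAGACTCTCG
AAAGTTGAAAAAGCTAGACA"""
print(parse_fasta(example))
# [('T30671 g612769 | T30671', 'ATGATAATGAAAGACTCTCGAAAGTTGAAAAAGCTAGACA')]
```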
2.5.2 Compression
The disadvantage of the above FASTA data format is that it is not very memory efficient.
To improve on this and store the sequences using less memory, compression can be used.
In contrast to ASCII, which has character mappings for all 256 possibilities of an 8-bit
byte, nucleotide sequences only have an alphabet of 5 characters (A, C, G, T, N). Since
not all 8 bits of a byte are needed to represent a nucleotide, the data can be compressed
by having a single byte represent multiple nucleotides, a technique called data packing.
Compression has an advantage in reducing the memory footprint of an application, and
in the case of GPGPU might present speed advantages due to less data having to be
transferred between GPU and host memory. Compression does however increase the
computational complexity of an algorithm, so choosing the right compression is often a
speed/memory tradeoff. Two compression schemes are given below:

A  Adenosine
C  Cytosine
G  Guanine
T  Thymidine
U  Uracil
R  A or G (puRine)
Y  C or T (pYrimidine)
K  G or T (Ketone)
M  A or C (aMino group)
S  C or G (Strong interaction)
W  A or T (Weak interaction)
B  C, G or T (not A)
D  A, G or T (not C)
H  A, C or T (not G)
V  A, C or G (not T)
N  aNy
X  Masked
-  Gap of indeterminate length

(a) Nucleotide sequence

A  Alanine
B  Aspartic acid or Asparagine
C  Cysteine
D  Aspartic acid
E  Glutamic acid
F  Phenylalanine
G  Glycine
H  Histidine
I  Isoleucine
K  Lysine
L  Leucine
M  Methionine
N  Asparagine
O  Pyrrolysine
P  Proline
Q  Glutamine
R  Arginine
S  Serine
T  Threonine
U  Selenocysteine
V  Valine
W  Tryptophan
Y  Tyrosine
Z  Glutamic acid or Glutamine
X  any
*  translation stop
-  gap of indeterminate length

(b) Amino acid sequence

Table 2.1: Characters and meanings for FASTA sequences
4-bit Compression
This compression is achieved by assigning every nucleotide its own bit. While it does not
compress as well as the 2-bit scheme, it has a speed advantage in that comparisons are
quick bit operations. An individual nucleotide is represented by setting the bit assigned
to it to 1 and all other bits to 0, while the N character, representing a match for any
nucleotide, is encoded by setting all 4 bits to 1.
This scheme allows 2 nucleotides to be packed into a single byte, potentially halving
the amount of memory needed for the application without requiring a computationally
expensive decompression algorithm.
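The 4-bit scheme can be sketched in C as follows; the particular bit assignments and helper names are illustrative choices, not taken from a specific implementation. Note how the wildcard N matches every base because it shares a set bit with each of them:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 4-bit one-hot encoding: one bit per base, N = all bits set. */
static uint8_t encode4(char c) {
    switch (c) {
        case 'A': return 0x1;
        case 'C': return 0x2;
        case 'G': return 0x4;
        case 'T': return 0x8;
        default:  return 0xF;   /* N: wildcard, matches any base */
    }
}

/* Pack two nucleotides into one byte, the first in the high nibble. */
static uint8_t pack2(char a, char b) {
    return (uint8_t)((encode4(a) << 4) | encode4(b));
}

/* Two codes match if they share at least one set bit. */
static int match4(uint8_t x, uint8_t y) { return (x & y) != 0; }
```

The match test is a single bitwise AND, which is what gives this scheme its speed advantage over schemes that require decompression before comparison.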
2-bit Compression
The 5th character, N, a wildcard that can match any of the other 4, can be removed by
replacing it with a random nucleotide. The resulting 4-character alphabet can be
represented in 2 bits, allowing 4 nucleotides to be stored in a single byte. This is the best
practical compression available, though it does increase the computational complexity of
decompression somewhat.
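A minimal C sketch of this scheme follows (the code assignments and helper names are my own, for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative 2-bit codes; N is first replaced by a random base as described. */
static uint8_t encode2(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;   /* 'T' */
    }
}

/* Replace the wildcard N with a randomly chosen base. */
static char strip_n(char c) { return c == 'N' ? "ACGT"[rand() % 4] : c; }

/* Pack four nucleotides into one byte, first base in the top two bits. */
static uint8_t pack4(const char *s) {
    return (uint8_t)((encode2(s[0]) << 6) | (encode2(s[1]) << 4) |
                     (encode2(s[2]) << 2) |  encode2(s[3]));
}

/* Decode position i (0-3) back out of a packed byte. */
static char decode2(uint8_t byte, int i) {
    return "ACGT"[(byte >> (6 - 2 * i)) & 0x3];
}
```

The shift-and-mask in `decode2` is the extra decompression cost the text refers to; a comparison now requires unpacking rather than a single bit operation.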
2.6 Algorithm Types Overview
This section details the various classes of algorithms of concern to this research. In
Chapter 5 specific examples will be given and considered.
2.6.1 Distance Based Algorithms
Distance based algorithms is the term used for algorithms that compare two sequences
pairwise, then provide a single value as an output that represents how similar the two
sequences are.
These sequences can be of any length and might only have a small subsection in
common with one another. It is considered advantageous if the algorithm scores a long
contiguous matching region higher than several shorter matching regions spread
throughout the sequences.
2.6.2 Alignment Algorithms
Alignment algorithms have a lot in common with distance algorithms. In a nutshell,
the goal of alignment algorithms is to add insertions, deletions and substitutions to the
two sequences in an attempt to minimize their distance. This alignment provides a
visual representation of the similarity between two sequences and is an important tool in
bioinformatics.
Alignment algorithms are usually much more expensive than simple distance algo-
rithms because of the additional information they provide, but work has been done to
properly parallelize this class of algorithms, which makes them interesting in this study.
While alignment is not expressly used in EST clustering, the algorithms and de-
velopments in alignment can potentially be used to provide a higher quality distance
measurement.
As an example of alignment, consider the following two sequences:
Sequence 1: GATTCGTTA
Sequence 2: GGATCGTA
If these sequences are pairwise aligned to result in the minimum distance it would look
like the following:
Sequence 1: -GATTCGTTA
Sequence 2: GGAT-CGT-A
where the '-' character represents a gap inserted into a sequence; characters in the same
column that are identical in both sequences are matches.
In addition to providing an optimal alignment, alignment algorithms often also provide
a distance measurement of this final alignment.
The simplest metric for the distance between two aligned sequences is the Leven-
shtein edit distance. Under this metric every 'error', whether an insertion, deletion
or substitution, increases the score by 1. The goal is to minimize this distance.
For the two example aligned sequences above, their Levenshtein edit distance would
be 3.
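The Levenshtein distance itself is computed with a simple dynamic-programming recurrence; the compact two-row C version below (an illustrative sketch, not code from this project) reproduces the distance of 3 for the example pair above:

```c
#include <assert.h>
#include <string.h>

/* Straightforward O(n*m) Levenshtein distance, keeping only two DP rows.
   Assumes the second string is shorter than 128 characters. */
static int levenshtein(const char *s, const char *t) {
    int n = (int)strlen(s), m = (int)strlen(t);
    int prev[128], cur[128];
    for (int j = 0; j <= m; j++) prev[j] = j;          /* distance from "" */
    for (int i = 1; i <= n; i++) {
        cur[0] = i;
        for (int j = 1; j <= m; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int best = prev[j - 1] + cost;              /* substitution/match */
            if (prev[j] + 1 < best)    best = prev[j] + 1;     /* deletion  */
            if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1;  /* insertion */
            cur[j] = best;
        }
        memcpy(prev, cur, (size_t)(m + 1) * sizeof(int));
    }
    return prev[m];
}
```

Running this on the two sequences from the example confirms the minimum edit distance of 3.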
2.6.3 Database Algorithms
Database algorithms are algorithms that, instead of processing a multitude of short se-
quences pairwise against each other, compare a single sequence against either one large
sequence (a gene against the entire genome of an organism) or against a preprocessed
database of sequences.
These algorithms are usually characterised by the introduction of database indices or
by representing the sequences as a fingerprint rather than the raw sequence to facilitate
faster searches.
Database algorithms usually involve an expensive preprocessing stage which provides
an output which can be reused multiple times and a quick search and comparison.
2.6.4 Heuristics
Heuristics can be any class of algorithm that provides much faster comparison than other
algorithms in its class, usually through optimizations that greatly reduce accuracy. Due to
this loss of accuracy their output is usually a binary pass or fail result and they are rarely
used alone: failing comparisons are rejected as potential matches, while passing ones are
handed on to other algorithms for more exhaustive comparison.
The best heuristics are typically the ones which have a low false negative rate, thus
rejecting the fewest truly similar pairs, and a high true negative rate, rejecting a large
number of unrelated sequence pairs.
Heuristics serve as a valuable component of an EST clustering program due to mas-
sively decreasing the expensive number of computations needed for larger datasets.
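As a toy illustration of such a filter (my own construction, not an algorithm from the literature surveyed here), two sequences can be passed or failed based on the number of distinct 6-letter words they share:

```c
#include <assert.h>
#include <string.h>

/* Toy word-count heuristic: two sequences "pass" if they share at least
   `threshold` distinct 6-mers. With 2 bits per base, a 6-mer fits in 12
   bits, giving 4096 possible words. All names here are illustrative. */
enum { K = 6, WORDS = 1 << (2 * K) };

static int base2bits(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;
    }
}

/* Mark every 6-mer occurring in `s` in the presence table `seen`. */
static void mark_words(const char *s, unsigned char *seen) {
    int len = (int)strlen(s);
    for (int i = 0; i + K <= len; i++) {
        int w = 0;
        for (int j = 0; j < K; j++) w = (w << 2) | base2bits(s[i + j]);
        seen[w] = 1;
    }
}

/* Returns 1 (pass: candidate for full comparison) or 0 (fail: rejected). */
static int shared_word_filter(const char *a, const char *b, int threshold) {
    unsigned char sa[WORDS] = {0}, sb[WORDS] = {0};
    mark_words(a, sa);
    mark_words(b, sb);
    int shared = 0;
    for (int w = 0; w < WORDS; w++) shared += sa[w] & sb[w];
    return shared >= threshold;
}
```

The filter is cheap (linear scans plus a fixed-size table), and pairs that pass are then subjected to the more expensive distance or alignment computation.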
2.7 Conclusion
In this chapter a basic primer on bioinformatics in general and EST clustering in particular
is provided. Basic terms used throughout this document are given, and the basic classes of
algorithms that will be considered are explained.
This chapter also includes a short history of relevant bioinformatics algorithms and
a description of how modern GPGPU advances have been introduced and changed the
field.
This chapter is meant to provide a basic groundwork of understanding for engineers
who are not familiar with bioinformatics and biology. These descriptions and theory are
not meant to be comprehensive, since that would be outside the scope of this document.
Since only information applicable to this project is given, it is advised that further
research be done on the subjects introduced rather than using this document as a
comprehensive authority on the subject of bioinformatics.
The next chapter provides a groundwork for the GPGPU field, introducing both con-
cepts and information on GPUs' requirements and limitations when applied to the bioin-
formatics field.
Chapter 3
Theory - GPU Theory Study
3.1 Introduction
In this chapter a brief introduction to the state of GPGPU (General-Purpose computing
on Graphics Processing Units) is given, providing context for the selection of CUDA as
the API (Application Programming Interface) of choice for this project.
The theory of the different ways in which GPUs can provide parallelism to an appli-
cation is explored, and the capabilities and limits of different generations of NVidia GPUs
are provided.
3.2 GPU Introduction
A rapidly advancing technology within IT has been computer graphics. Constant de-
mand for higher quality, better and more flexible processing, together with a large market
for GPUs (Graphics Processing Units), has resulted in GPU computing ability advancing
faster than that of classical CPUs (Central Processing Units). While CPUs have been
steadily keeping to Moore's Law (which states that the number of transistors that can be
placed on an integrated circuit roughly doubles every two years [3]), GPUs have been
outpacing this law [99].
The GPUs' advantage over CPUs is their specialized design, using massive multi-
threading to utilize a large number of cores and hide global memory latency [5]. While it
is commonplace for commercial CPUs at the time of writing to reach six cores, the newly
released Geforce GTX 480 already possesses 480 separate processors. These processors are
very limited compared to CPU cores, lacking the caching capabilities and branch predic-
tion of their CPU counterparts, but their sheer number allows much greater computational
Figure 3.1: The GPU devotes more transistors to data processing [2]
throughput. Market demands have also resulted in GPUs becoming more flexible and
programmable.
GPUs are capable of operating on large amounts of data simultaneously, equivalent to
a thread per datum on the CPU. This design makes them a good platform for parallelizable
algorithms, but the requirement that inter-thread communication be kept to a minimum,
together with the undefined order of completion of these threads, makes GPUs unsuitable
for many complex and serial algorithms [5].
The field of GPGPU is the domain where these more flexible GPUs are applied to
non-graphics related problems. It started with the introduction of programmable shaders,
where these non-graphics problems were represented as graphics elements such as pixels,
vertices and textures. Though the theoretical and practical performance gain was great,
the format the data and the problem had to be presented in, as well as the required
expertise of the implementer, limited the number of problems the GPU could easily be
used to solve [100].
Languages and APIs designed to apply the GPU to problems without first having to
recast the non-graphics problem in a graphics format have mitigated these disadvantages
in recent years. It remains true, however, that the best performance is reached when the
problems most closely resemble graphics problems.
Early APIs such as BrookGPU [72, 73] or Sh (later Rapidmind) [74, 73], though
instrumental in the early shaping of the GPGPU field, do not enjoy the widespread
GPU vendor support that later APIs do.
Among the modern and widely supported APIs are CUDA (Compute Unified Device
Architecture), developed by NVidia for its range of commercial GPUs and publicly
released in 2007; OpenCL, developed initially by Apple Inc. in 2008 but supported by a
wide range of hardware vendors; and DirectCompute, an API developed by Microsoft as
part of its DirectX 11 API released in 2009.
Of these well-supported APIs, both OpenCL and DirectCompute are recent inno-
vations. DirectCompute is available only on the Windows platform, which makes it
unsuitable for the Linux and Unix-based operating systems common in research.
OpenCL has not yet been properly implemented across all platforms and suffers from
performance problems [96], though these concerns are expected to be eliminated with
time. For this project I have decided to use CUDA as my API of choice due to its
maturity and widespread use in existing bioinformatics research.
CUDA is based on the C programming language [2]. It extends the language with new
keywords, but otherwise should be familiar to anyone who has programmed in C before.
It is important to note that when programming in any GPGPU API, a distinction
must be made between the host (CPU) side and the GPU side, since they have non-shared
memory and different functions may execute on either one. It becomes even more complex
in multi-GPU situations where each GPU is explicitly managed.
3.3 General Theory
Attempting to use the GPU as a CPU (single pieces of data processed, one after another)
will leave the GPU severely underutilized and will result in sub-optimal performance [2].
In order to fully utilize the GPU, the data to be processed must be sufficiently large
and capable of being processed in parallel. There is a non-trivial overhead for copying
data to and from GPU memory, but when dealing with streaming data this can be easily
hidden by concurrently processing data already on the GPU and copying over the next
set of data simultaneously [100].
Of note is the limited memory on the GPU. Host-side CPU computation can make
use of 64-bit addressable RAM, potentially up to 8 TB worth, and can page memory
to disk, using hard drive space as temporary storage. The GPU has far more modest
amounts of memory, from under a gigabyte for most commercial GPUs up to 4 GB on some
professional hardware models available to date (2010). This is not an issue for small
problems where the entire dataset fits into a few hundred megabytes, but realistic
applications can utilize datasets of sizes measured in terabytes or even petabytes. In
addition, GPUs host several different types of memory, each requiring proper management
in order to fully optimize performance. Because of this, data streaming and explicit memory
management are essential to performance in applications utilizing large data sources.
A noted disadvantage of GPUs compared to CPUs is that GPUs are streamlined
for 32-bit floating point numbers, not integers or 64-bit floating point numbers.
While 64-bit operations are supported on modern GPUs, such computation suffers
reduced performance.
3.4 CUDA API
3.4.1 Introduction to the CUDA API
CUDA is a GPU API developed by NVidia Corporation that allows developers low level
programming ability on NVidia GPUs by programming in ‘C for CUDA’, which is the
standard C programming language with a number of extensions and restrictions [2].
This allows developers to use the computational ability of NVidia GPUs without
having to refactor the logic and data into a format that suits the graphics pipeline, greatly
increasing the potential applications of GPU hardware.
Since February 2007 when the first CUDA SDK was made public, various language
bindings and wrappers for a wide variety of programming languages have been devel-
oped, including Fortran, Java and Python. For this implementation though we will limit
ourselves to ‘C for CUDA’ and C++.
CUDA also provides various libraries built on top of CUDA to provide specialized
high performance mathematical functions: CUFFT [101] provides high performance Fast
Fourier Transforms, CUBLAS [102] is a library for linear algebra functions, CUSPARSE
[103] is a library containing functions for handling sparse matrices, and CURAND [104]
focuses on the generation of high quality pseudorandom numbers. None of these additional
libraries will be used in this project, however.
3.4.2 CUDA Compute Capabilities
The CUDA Programming Guide [2] provides the data in Table 3.1: the technical
specifications of the various CUDA capable GPUs available at the time of writing.
In addition to improving specifications, newer CUDA compute capability GPUs also
provide additional features not available in previous versions. For example, compute
Maximum number of threads per block:          512 (1.0-1.3), 1024 (2.x)
Number of threads per warp:                   32 (all)
Maximum resident blocks per multiprocessor:   8 (all)
Maximum resident threads per multiprocessor:  768 (1.0-1.1), 1024 (1.2-1.3), 1536 (2.x)
Maximum resident warps per multiprocessor:    24 (1.0-1.1), 32 (1.2-1.3), 48 (2.x)
Number of 32-bit registers per multiprocessor: 8 K (1.0-1.1), 16 K (1.2-1.3), 32 K (2.x)
Maximum shared memory per multiprocessor:     16 KB (1.0-1.3), 48 KB (2.x)
Shared memory banks per multiprocessor:       16 (1.0-1.3), 32 (2.x)
Local memory per thread:                      16 KB (1.0-1.3), 512 KB (2.x)
Constant memory size:                         64 KB (all)
Constant memory cache per multiprocessor:     8 KB (all)
Maximum number of instructions per kernel:    2 million (all)
Table 3.1: Comparison of various CUDA Capabilities [2]
capability 1.1 introduced atomic functions while compute capability 1.3 was the first to
implement native 64-bit functionality.
The differences between low, mid and high-end GPUs of the same compute capability
family include varying the clock speed, global memory size and the speed of the memory.
The most important difference however is the amount of multiprocessors active on the
GPU.
The GPU being used for development of this project, the GTX 260 has 24 active
multiprocessors. For comparison, the GTX 285 which is the high-end GPU of the same
range, has 30 active multiprocessors. In contrast, some laptop or embedded versions of
the GPUs can have as few as 1 or 2 multiprocessors.
3.4.3 GPU Memory
To make optimal use of the GPU architecture, several different types of on-GPU
memory exist. The choice of which to use depends mainly on the intended use. The
different types of relevant memory are summarized here.
Figure 3.2: CUDA Memory Model [2]
Type Speed Size Access Cached Scope
Registers Very Fast Very Limited Read/Write No Thread
Local Memory Very Slow Limited Read/Write No Thread
Shared Memory Fast Limited Read/Write No Block
Constant Memory Very Fast Limited Read Yes Global
Global Memory Very Slow Large Read/Write No Global
Texture Memory Slow Large Read Yes Global
Table 3.2: Summary of Memory Types available to CUDA Programmers
Unlike CPUs, CUDA capable GPUs do not rely on caches to obtain their high per-
formance. Instead of reducing the latency of memory reads through expensive cache
hardware, GPUs focus on high throughput. Though this results in slow, high-latency
memory access, large reads performed by multiple blocks and threads concurrently allow
a huge amount of data to be read and processed at the same time.
Global memory reads can take up to hundreds of cycles, so proper use of the different
types of memory is needed for optimal performance of a CUDA application.
Registers
The fastest memory available to a kernel, but also the most limited in size. Every multi-
processor has a limited number of registers (32 K 32-bit registers on a 2.x compute
capability GPU) that is divided between all threads of the blocks resident on it. Blocks
with fewer threads will provide more registers per thread, but might suffer performance
issues due to the reduced parallelism employed.
Every local variable created in a kernel is stored in registers, and it is important to
find ways to minimize the number of registers an application uses in order to maximize
the number of concurrent threads allowed on a multiprocessor.
When optimizing a kernel it is important to test the effects of your register usage and
whether to offload some registers to shared memory, or in other cases to use registers
instead of shared memory.
Shared memory
Shared memory is a section of on-multiprocessor temporary memory that is shared among
all threads within a block. It is much faster than global memory, though slower than
registers, and can be used for temporary storage of data during computation. Since even
2.x compute capability GPUs have only 48 KB to work with (16 KB for earlier GPUs),
this memory space needs to be well managed.
In addition, the data of all threads within a block is accessible to every other thread
in that block. This means that shared memory can be used as a form of inter-thread
communication and data sharing. This property allows many algorithms to utilize gather
and scatter operations.
Though fast, shared memory accesses should still be designed to avoid bank conflicts,
where multiple threads attempt to access the same memory bank. In the worst case
this can lead to a 16x degradation in shared memory performance, as the accesses have
to be serialized for each thread instead of occurring simultaneously. More information
on bank conflicts can be found in the CUDA programming manual [2].
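The bank-conflict behaviour can be modelled host-side with a few lines of C (a sketch assuming the 16-bank layout of 1.x hardware): a stride-1 access pattern touches every bank once, while a stride-16 pattern serializes all 16 threads of a half-warp onto a single bank.

```c
#include <assert.h>

/* Illustrative model: successive 32-bit shared-memory words map to
   successive banks; 1.x hardware has 16 banks serving a half-warp. */
enum { BANKS = 16 };

static int bank_of(int word_index) { return word_index % BANKS; }

/* Worst-case number of half-warp threads landing on the same bank when
   thread `tid` accesses word index tid * stride. */
static int max_conflict(int stride) {
    int hits[BANKS] = {0}, worst = 0;
    for (int tid = 0; tid < 16; tid++) {
        int b = bank_of(tid * stride);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

A result of 1 means conflict-free access; 16 means the access is fully serialized, the 16x degradation mentioned above.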
Global memory
Global memory is the main GPU memory, with the largest size but also the slowest access
times. It is most likely where the input data your kernel uses will be stored and
where the results of your computation will be written, so proper management of global
memory is important for optimal performance. The global memory capacity of individual
GPUs varies, usually between 256 MB and 4 GB, and is usually shown on the packaging.
When accessing global memory, the accessing threads are blocked, allowing non-
waiting threads access to the GPU multiprocessor. When the memory access is completed
these threads are unblocked and allowed to continue processing. If enough threads with
enough workload are provided, then memory access latency can be completely hidden.
Since the memory latency can be hidden, memory throughput becomes much more
important; it can be improved through a method known as coalescing. If a half-warp
of threads (16 sequential threads) accesses a sequential 64- or 128-byte segment of memory,
the memory accesses can be combined into a single transaction [2]. Uncoalesced random
reads can be many times slower, since several 64-byte reads will need to be queued one
after another, which severely limits the GPU's ability to hide memory latency.
Local Memory
In memory diagrams, local memory is often indicated as being close to individual threads.
This indicates simply that local memory is private to a thread and cannot be accessed
by any other thread. In terms of performance, local memory is about equal to global
memory, since in reality local memory is simply an abstraction of global memory.
Local memory is automatically allocated by the compiler if the registers available to
the kernel are not enough. This automatic assignment is often one of the reasons for sub-par
performance and should be minimized whenever possible.
Local memory use should be minimized by algorithm improvements that reduce the
number of registers needed in the kernel or by using shared memory to store data instead
of registers.
Texture memory
Texture memory is usually used to store textures, the large images that GPUs render with.
It resides in global memory, but marking the memory as read-only texture memory
makes various additional features available.
Most important is that texture memory is spatially cached. This caching is done to
optimize throughput, not latency, and operates best on 2D image data. To make the best
use of it, sequential memory accesses should be as near to one another in 2D space as
possible. This is useful for images, but can incur performance penalties when applied to
random access memory patterns.
Another feature of texture memory is the ability to bilinearly or trilinearly filter reads
in the same way as can be done in DirectX and OpenGL. This means that attempts to
read data at non-integer positions can return a value interpolated between the
surrounding values. This comes at no performance penalty, since the GPU contains
specialized pathways, separate from the shaders, to perform this operation.
Texture memory is, however, of limited use for non-image data, and it is important to
experiment with the performance when considering its use.
Constant memory
Constant memory is second only to registers in speed, but as the name implies it is read-only
during kernel execution. It can be written to before and after a kernel executes, however,
and is often used for constant and configuration information, as well as small lookup tables.
It is very limited in size, allowing only 64 KB of data to be stored.
3.4.4 CUDA Parallelism
When a CUDA kernel is executed, a grid of threads is launched. The total number of
threads launched depends on the chosen CUDA grid and block sizes. A CUDA
grid is a 1D, 2D or 3D array of blocks, while a block is a 1D, 2D or 3D array of threads.
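The indexing arithmetic implied by this hierarchy can be modelled host-side in plain C (1D case; the helper names are illustrative stand-ins for CUDA's built-in `blockIdx`, `blockDim` and `threadIdx` variables):

```c
#include <assert.h>

/* Host-side model of 1D CUDA indexing: every thread derives a unique
   global index from its block coordinate and its position in the block. */
static int global_index(int blockIdx_x, int blockDim_x, int threadIdx_x) {
    return blockIdx_x * blockDim_x + threadIdx_x;
}

/* Total number of threads launched for a given grid/block configuration. */
static int total_threads(int gridDim_x, int blockDim_x) {
    return gridDim_x * blockDim_x;
}
```

For example, a launch of 90 blocks of 256 threads creates 23040 threads, each of which typically uses its global index to select the datum it processes.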
This design allows a certain amount of hardware independence across generations
and markets for CUDA applications. Any GPU meeting at least the minimum CUDA
compute capability of the kernel will be able to run it. The higher end GPUs with more
multiprocessors will simply be able to execute more blocks in parallel, while the lower end
ones will execute the blocks sequentially.
Applications using large numbers of blocks can be considered future-proof, since it is
expected that newer and higher end GPUs will increase the number of cores in addition
to other technology improvements.
Figure 3.3: CUDA Grid of Thread Blocks [2]
Job level Parallelism
Job level parallelism refers to breaking up a dataset or computational workload into
independent jobs, each of which has no dependency on any other. This type of parallelism
is typically used to distribute a computational workload over multiple different computers.
Job level parallelism is in fact required even when using a single computer with two GPUs,
but it also has value when only a single GPU is involved due to streaming.
Streaming is the act of launching multiple kernels from the host side, each a different
job. This is required to take advantage of multi-GPU deployments, as well as to allow
concurrent kernel execution and host-to-device memory copies. It is not typically needed
in situations of high computational intensity and low memory requirements, where the
copies take an insignificant proportion of total computing time.
Block level Parallelism
Block level Parallelism is the scalable method used by CUDA to distribute the workload
across multiple independent multi-processors, regardless of the number of actual multi-
processors on the GPU. There is little communication between blocks, and the blocks
can be executed in any order. This necessitates that no block has data dependencies
on any other block. Block level parallelism thus shares many of the requirements
and limitations of job level parallelism, differing only in that the blocks are launched
simultaneously on a single GPU. If some sort of synchronization is
needed, multiple kernel calls are advised.
Figure 3.4: CUDA Block Scheduling [2]
The size of the grid, i.e. the total number of blocks launched by the kernel, is not
a critical variable for performance beyond the requirement that enough blocks be
launched at once so that every multiprocessor has several blocks queued,
preventing under-utilization of the GPU. Experiments with grid sizes have
shown a small performance increase when the number of blocks is a multiple of the
number of multiprocessors, but the effect was relatively minor.
If the algorithm is not reliant on a set number of blocks, the grid size could be tailored
at runtime to a multiple of the number of multiprocessors reported by the GPU. If all
blocks take equal time to complete, this will allow for the most efficient allocation of
resources to the problem.
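A hypothetical host-side helper for such runtime tailoring might look as follows, rounding the required block count up to a multiple of the multiprocessor count reported by the GPU (the function name and interface are my own; in a real CUDA program the multiprocessor count would come from `cudaGetDeviceProperties`):

```c
#include <assert.h>

/* Illustrative sketch: number of blocks needed to cover `jobs` work items,
   rounded up to a multiple of the GPU's multiprocessor count. */
static int grid_size(int jobs, int threads_per_block, int multiprocessors) {
    /* Ceiling division: blocks needed to cover all jobs. */
    int blocks = (jobs + threads_per_block - 1) / threads_per_block;
    /* Round up to the next multiple of the multiprocessor count. */
    int rem = blocks % multiprocessors;
    return rem == 0 ? blocks : blocks + (multiprocessors - rem);
}
```

For example, 10000 jobs at 256 threads per block on a 24-multiprocessor GPU needs 40 blocks, which is rounded up to 48 so that each multiprocessor receives the same number of blocks.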
Thread Level Parallelism
Unlike blocks, threads do have the ability to synchronize and communicate between them-
selves. Though it need not be used, threads have access to shared memory that is readable
by all threads in a block. This shared memory can serve as a user-managed cache, as
shared data for inter-thread communication, or as individual temporary storage. All
threads are executed in warps, groups of 32 threads that proceed in lock-step with each
other. This lock-step behaviour makes branches that diverge within a warp
expensive [105].
The number of threads in a block is an even more critical variable for maximum
throughput than the choice of grid size. Threads are executed in warps of size 32, so
selecting a block size that is not a multiple of 32 is strongly discouraged.
When selecting a block size, it is important to maximize what is called occupancy:
the ratio of resident warps on a multiprocessor to the maximum number of resident
warps it supports. Though each multiprocessor only executes one warp at a time, it is
capable of holding multiple warps, called resident warps, at once.
Instructions such as loads from global or shared memory can take many cycles to
complete. Since shared and global reads are not cached, every thread needs to wait
the full amount of time for data to arrive from memory. Instead of implementing caching to
combat this effect, GPUs simply switch to another warp when such a blocking operation is
reached and continue execution while the blocked warp finishes its memory read. In this
way memory latencies are hidden with concurrent execution, rather than eliminated.
Texture memory is the one form of global memory that is cached, but in keeping with this
model, it is cached to reduce memory bandwidth usage rather than latency: a cached
texture memory read has the same latency as an uncached read.
By keeping occupancy as high as possible, the GPU is given many more warps to
switch to while others are busy, reducing any downtime in which there are no warps
available to execute.
Table 3.3 and Table 3.4 show how the choice of threads per block can affect the
maximum possible occupancy and the availability of resources such as shared memory and
registers for different generations of GPU.
Important to note is that the total register count and total shared memory are
limited per multiprocessor. If a kernel uses more than the maximum available, fewer blocks
will be resident than would otherwise be the case, resulting in reduced occupancy.
Even with full occupancy the memory accesses of some very memory-heavy kernels
might still not be sufficiently hidden. In other cases, where the kernel is more computationally
complex, 100% occupancy is not required. In some situations it has even been
shown that programming for maximum occupancy can result in degraded performance
[106], usually in cases where using more of the fast registers per thread allows faster ex-
ecution. Maximizing occupancy is a good starting point for initial choices of block size,
but experimentation and benchmarking should be used to find the optimal solution.

# of Threads   Warps   Max Occupancy   Blocks per MP   Registers per thread   Shared Memory per block
64             2       33%             8               64                     6 KB
128            4       67%             8               32                     6 KB
192            6       100%            8               20                     6 KB
256            8       100%            6               20                     8 KB
384            12      100%            4               20                     12 KB
512            16      100%            3               20                     16 KB

Table 3.3: Optimal Maximums for 2.x Compute Capability GPUs for different block sizes

# of Threads   Warps   Max Occupancy   Blocks per MP   Registers per thread   Shared Memory per block
64             2       50%             8               32                     2 KB
128            4       100%            8               16                     2 KB
192            6       94%             5               16                     3 KB
256            8       100%            4               16                     4 KB
384            12      75%             2               21                     8 KB
512            16      100%            2               16                     8 KB

Table 3.4: Optimal Maximums for 1.3 Compute Capability GPUs for different block sizes
Utilization of instruction level parallelism, detailed in the next section, would further
reduce the reliance on occupancy.
In the case of the NVidia GTX 480, a 2.x compute capability GPU, maximizing
occupancy would result in 1536 threads per multiprocessor. With 15 multiprocessors,
this suggests a minimum of 23040 individual threads per kernel for optimal utilization.
Algorithms that can take advantage of this level of parallelism are the ones that tend to
be best suited to the GPU.
NVidia supplies a spreadsheet called the CUDA Occupancy Calculator which can be
used to help maximize occupancy for different kernels and compute capability GPUs [107].
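A much-simplified version of that calculation can be written in a few lines of C; this sketch (my own, not the NVidia spreadsheet) models only the 2.x warp and block limits and ignores register and shared-memory pressure, so it reproduces just the occupancy column of Table 3.3:

```c
#include <assert.h>

/* Simplified occupancy model for a 2.x device: 48 resident warps and 8
   resident blocks per multiprocessor; register and shared-memory limits
   are deliberately ignored. Returns occupancy as a rounded percentage. */
static int occupancy_percent(int threads_per_block) {
    const int warp_size = 32, max_warps = 48, max_blocks = 8;
    int warps_per_block = threads_per_block / warp_size;
    int blocks = max_warps / warps_per_block;   /* warp-limited block count */
    if (blocks > max_blocks) blocks = max_blocks;
    int warps = blocks * warps_per_block;       /* resident warps           */
    return (100 * warps + max_warps / 2) / max_warps;
}
```

With 64 threads per block only 16 of the 48 possible warps can be resident (33%), because the 8-block limit is hit first; from 192 threads per block upwards the warp limit dominates and full occupancy becomes possible.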
Instruction Level Parallelism
Occupancy on the CUDA architecture is not the only tool possible to hide memory laten-
cies. CUDA code has the ability to continue execution out of order if the next instruction
does not require a blocked operation. To illustrate, refer to Algorithm 1.
Algorithm 1 Instruction Level Parallelism Example
1  int a = load(mem[0]);   // Blocking memory load
2  int b = load(mem[1]);   // Blocking memory load
3  doWork(a);              // Only requires that a is loaded
4  doWork(b);              // Only requires that b is loaded
5  int c = a + b;
In this example, line 2 does not require that line 1 (the load of a from memory) has
completed, so the load of b is queued immediately. Line 3 then blocks until a is loaded;
once a has been retrieved from memory, execution on a continues concurrently
with the loading of b, since line 3 does not depend on b. Line 4 finally blocks until b is
also loaded.
In this way much of the memory fetch latency is hidden, even without the use of
multiple threads. A common way to take advantage of ILP is to compute multiple outputs
in one thread, hiding latencies by starting execution on already loaded data while waiting
for the next load. Methods like this usually increase register usage, but they also reduce
the need for full occupancy. Important to note is that some GPUs, such as the GF104
based ones, contain 48 cores per multiprocessor instead of 32, which means they require
ILP to reach more than 66% of their peak performance.
3.5 Conclusion
Even though the concept of GPGPU is a relatively recent one, the field has advanced
rapidly in a short time. Though the capabilities of current generation GPUs are impressive,
they only hint at the possibilities that future GPUs can provide.
The skills needed to effectively program applications that take advantage of the GPU
are still in short supply, but newer APIs and greater exposure to the concept should
increase not only the variety of fields that employ GPGPU, but also the capabilities of
future GPUs that will cater to these programmers.
In the next chapter, the experimental design that will be used for the developed application is detailed, providing information on the metrics that are to be used and the experiments that will be performed.
Chapter 4
Experimental Design
4.1 Introduction
“In theory, there is no difference between theory and practice. But, in practice, there is.”
– Yogi Berra
This thesis is concerned with the creation of a practical implementation of an EST clustering program utilizing the GPU. In order to gauge the success of such an attempt, several experiments are needed, identifying the characteristics, strengths and weaknesses of the created implementation as well as comparing it to existing and widely used CPU implementations.
This chapter describes experiments whose purpose is to measure and objectively compare the performance of the algorithms that will be discussed and implemented in the following chapters. It is important to plan the experiments so as to identify the criteria that any viable clustering algorithm must satisfy.
This chapter will describe the terms and definitions of the various metrics that are
used for fair comparisons in these experiments. This includes not just performance mea-
surements, but also the correctness of the results and identifications of the shortcomings
of the chosen algorithms.
The chapter begins by listing the assumptions made in the experiments due to the non-conventional hardware platform used. Experimental concerns are then listed to introduce the difficulty of a fair comparison across different hardware platforms, and the specifics of the experimental test platform are provided.
The theory and metrics section provides precise definitions of the terms and measures used in the experiments, as well as an indication of how the experimental metrics are measured.
The experiments are then listed. Each experiment details its aim, background information on its relevance to the subject as a whole, and the expected results.
The final results of the experiments are presented in Chapter 7.
4.2 Assumptions and Experimental Framework
4.2.1 Common Assumptions
Scalability of CPU Cores
Modern CPU improvements have focussed largely on memory bandwidth optimizations and increasing the number of cores, rather than increasing the speed of an individual core. Testing of CPU implementations will for that reason only utilize a single core of the CPU, with the expectation that this will allow better comparison over a wider range of hardware. This must be kept in mind when making comparisons between the GPU and CPU implementations, since in practice all the CPU cores will be used and running times will be much shorter.
While a quad-core CPU is unlikely to perform its computations in a quarter of the time, it is expected to be faster by a significant factor. Comparisons between the GPU, quad-core and dual-core setups should keep this in mind.
CPU speed has a negligible effect on GPU computation
One of the most significant assumptions in these experiments is that CPU speed has a negligible effect on the performance of GPU-only applications, and that only the GPU's computational and memory access performance affect the final running times of the application.
For this reason a single CPU core is sufficient to drive the application, and theoretically should be enough to drive computation across multiple GPUs, though this ability is not yet implemented.
Operating systems have a negligible effect on performance
Preliminary testing has also shown no difference between programs running on Linux and programs running on Windows when the hardware is identical. It is therefore assumed that the operating system has little to no impact on performance.
4.2.2 Experimental Concerns
Fair Comparison
The goal of the experiment is to compare the created algorithm with those created by other authors. Though speed is the main attribute considered, the many variations in clustering approaches all strive to address different concerns and have varying sensitivity to different experimental datasets.
In an attempt to be as fair as possible, many different datasets will be used in these experiments, drawn from differing sources and created by different sequencing experiments.
While it is not possible to be exhaustive, or to claim that a single algorithm is the best on all possible datasets, testing with a large variety at least reveals the most obvious strengths and shortcomings.
Sensitivity and Correctness
Some artificial datasets can provide a 'correct' reference clustering along with their data, but this cannot be assumed for all datasets, especially not ones sequenced from real organisms. To test datasets without provided reference clusterings, a reference clustering is created from the same dataset using a known high-quality algorithm.
It is important to note that where reference clusterings are not provided, the correct results are effectively unknown, since the varying quality of the ESTs has a great effect on comparisons.
What is tested in this case is the similarity of the output between two algorithms,
not whether either output is correct. If the reference algorithm erroneously clusters two
sequences then the tested algorithm will in fact be penalized for a correct clustering.
This disadvantage is accepted, however, on the grounds that algorithms already accepted by the bioinformatics community produce clusters of high enough quality for professional use.
The developed algorithm is thus precluded from making any statements about the
correctness of its clustering, which is why the terms accuracy, similarity or sensitivity will
be used instead, denoting how well it matches the professional quality output from other
algorithms.
Differing Hardware
CPU and GPU technology is constantly improving, as shown by newer GPUs' greater performance and flexibility, and by newer CPUs' increasing core counts, speeds, cache sizes and inter-core communication. Both platforms are moving towards greater high performance computing capability, and any experiments done in this project on a specific generation of hardware are likely to be out of date before the results are even released.
For this reason it is stressed that this experiment exists only as an indicator of future
direction and potential, and should not be used as an authoritative claim as to how specific
future generations of hardware will compare.
4.2.3 Experimental Setup
Test PC Platform
The computer that will be used to perform the experiments has the following specifica-
tions:
• CPU AMD Athlon(tm) 7750 2.7GHz Dual-Core Processor
• RAM 4GB DDR2-800 Memory
• Operating System Gentoo Linux
• GPU NVidia GTX480
In order to perform experimental repetition, a scripting language is needed to repeatedly execute the application on the same or differing datasets. The scripting language PHP (PHP: Hypertext Preprocessor) was chosen for this task for no reason other than author familiarity. Since it simply queues executions, the choice of scripting language is not expected to have any detrimental or advantageous effect on performance.
4.2.4 Theory and Metrics
Timing Methodology
In order to gain accurate performance results for the comparison of gpucluster with other tools, proper timing instruments are required. Two have been identified.
The first is the 'time <command>' command, available on all Unix-based operating systems and in MinGW on Windows. This timer measures execution up to the return of a result, including the loading of libraries, the loading of the application into memory and clean-up after the program has executed.
The chosen scripting language, PHP, provides the microtime() function, which is used to measure the performance of repeated executions using the computer's inbuilt high-performance clock.
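For illustration, the same repeated-timing approach can be sketched in Python (a hypothetical equivalent of the PHP driver, not the code used in this work; time.perf_counter() plays the role of microtime()):

```python
import statistics
import subprocess
import sys
import time

def time_command(cmd):
    """Time one execution of an external command with the high-resolution clock."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def mean_runtime(cmd, repeats=5):
    """Repeat the execution and average the wall-clock times."""
    return statistics.mean(time_command(cmd) for _ in range(repeats))

# Placeholder command; the experiments would invoke e.g. gpucluster here.
print(mean_runtime([sys.executable, "-c", "pass"], repeats=3))
```

Because the driver only queues and times executions, the overhead of the scripting layer itself stays outside the measured interval of the application proper.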
GFLOPS
Peak GFLOPS (giga floating-point operations per second) is the theoretical maximum number of floating point operations per second that a computing device can perform.
Measured GFLOPS is a figure obtained empirically, typically via a loop computing a large number of arithmetic operations in rapid succession with no memory access. These measured figures offer a more realistic indication of what a computing device is capable of, but can differ even between two devices of identical specifications.
While the measured GFLOPS is typically much lower than the peak GFLOPS, both provide an indication of the performance of one device compared to another. Note, though, that real-world performance is usually constrained far more by memory access and throughput than by GFLOPS.
Jaccard Index
In order to determine the accuracy of a clustering, a reference clustering is needed. This reference clustering is created using an existing and proven CPU algorithm and is then compared against the clustering created by gpucluster to provide a similarity score that is used to determine accuracy.
The metric chosen to compare the similarity of two different clusterings is the Jaccard
Index [108].
The Jaccard Index is a commonly used metric to compare sets or clusters and is the ratio between the intersection and the union of two sets. It is defined as:

J(A,B) = |A ∩ B| / |A ∪ B|    (4.1)
In order to apply this set theory to clustering, one has to consider a cluster as a set of pairs. For example, for 4 possible elements, {1, 2, 3, 4}, consider two clusterings A and B:
A = {{1, 2, 3}, {4}}
B = {{1}, {2, 3, 4}}
These then need to be converted to unique pairs (the singleton cluster {4} contributes no pairs):
A = {{1, 2}, {2, 3}, {1, 3}}
B = {{2, 3}, {3, 4}, {2, 4}}
Assuming clusterings A and B, this can then be implemented as:

J(A,B) = NAB / (NAB + NA + NB)    (4.2)
where
• NAB is the count of sequence pairs that are in both clustering A and B.
• NA is the count of sequence pairs that are in clustering A but not in clustering B.
• NB is the count of sequence pairs that are in clustering B but not in clustering A.
For the above example this results in:

J(A,B) = NAB / (NAB + NA + NB) = 1 / (1 + 2 + 2) = 0.2
The Jaccard Index gives a value between 0.0 (no cluster pairs overlap) and 1.0 (identical clusterings).
Sensitivity Index
In the field of EST clustering, failing to cluster two related ESTs often has drastic effects in the reassembly stage, resulting in incorrectness or in two contigs not being combined that should have been. On the other hand, clustering two unrelated ESTs has minimal impact on the correctness of reassembly, affecting only performance if there is an excessive amount of over-clustering. This is because reassembly programs are already designed to deal with chimeras and alternative splicing, so they have the ability to deal with different transcripts clustered together.
For this reason, an alternative index is used in addition to the Jaccard Index: the Sensitivity Index [59]. The benefit of this alternative index is that it penalizes under-clustering but not over-clustering, serving as a useful indicator of the value of the clustering for the accuracy of the final assembly, even if the exact clustering differs.
The Sensitivity Index is defined very similarly to the Jaccard Index:

S(A,B) = NAB / (NAB + NA)    (4.3)
The Sensitivity Index gives a value between 0.0 (clustering B contains none of clustering A's pairs) and 1.0 (clustering A is a subset of clustering B).
Due to its definition, the Sensitivity Index will always be equal to or greater than the Jaccard Index.
While a high sensitivity is required for correct reassembly, a high sensitivity combined with a low Jaccard Index is undesirable due to the great performance deterioration that the reassembly stage will experience.
Both the Jaccard Index and the Sensitivity Index will be used to gauge accuracy.
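The pair-set formulation of both indices can be computed directly. The following Python fragment is illustrative only (it is not the evaluation tooling used in the study) and reproduces the worked example above:

```python
from itertools import combinations

def cluster_pairs(clustering):
    """Convert a clustering (a list of clusters) into its set of unique
    element pairs. Singleton clusters contribute no pairs."""
    pairs = set()
    for cluster in clustering:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def jaccard_index(a, b):
    """J(A,B) = NAB / (NAB + NA + NB): penalizes under- and over-clustering."""
    pa, pb = cluster_pairs(a), cluster_pairs(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 1.0

def sensitivity_index(a, b):
    """S(A,B) = NAB / (NAB + NA): penalizes under-clustering only."""
    pa = cluster_pairs(a)
    return len(pa & cluster_pairs(b)) / len(pa) if pa else 1.0

# Worked example from the text.
A = [{1, 2, 3}, {4}]
B = [{1}, {2, 3, 4}]
print(jaccard_index(A, B))      # 0.2  (= 1 / (1 + 2 + 2))
print(sensitivity_index(A, B))  # ~0.333  (= 1 / (1 + 2))
```

Note how the example shows sensitivity exceeding the Jaccard Index, as guaranteed by the definitions.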
CUDA Occupancy
Occupancy in terms of CUDA execution is the ratio of active warps on a multiprocessor to the maximum number of warps it supports [2]. While higher occupancy does not always indicate higher performance, it does assist in hiding memory latencies in memory intensive applications.
Occupancy is discussed in more detail in Section 3.4.4, but in summary a larger occupancy value results in less chance that multiprocessors are left idle with all warps blocked by pending memory reads.
4.3 Dataset Descriptions
The following datasets were chosen mainly for their availability and their use in other published benchmarks and comparisons. They cover a wide range of species and situations, which should provide a fair test of the variety of datasets one can expect in practice.
The selection is, however, representative rather than exhaustive, so the application might perform poorly on datasets of unexpected length or complexity compared to the datasets used in this experiment.
4.3.1 Arabidopsis
Arabidopsis is a popular and commonly used dataset to test and benchmark clustering
and assembly applications due to it being sourced from a well known and understood
model plant organism.
Two datasets are used from the ESTs downloaded from Genbank:
• A686904 - A dataset containing the full ESTs from Genbank; and
• A032 - A smaller random subset of the full dataset. It has the same number of cDNA sources as the full dataset, but its coverage will be lower due to having far fewer sequences.
4.3.2 SANBI 10000
This is a benchmark dataset popularised and provided by the South African National Bioinformatics Institute (SANBI). The SANBI 10000 dataset contains 10 000 sequences and is provided so that applications can perform comparative benchmarks.
4.3.3 Public Cotton
This is a set of Sanger-style ESTs from the public cotton data set.
4.3.4 C-Series
The C-Series is an artificial dataset created using ESTsim. As opposed to the A032 dataset, the C-Series varies the number of cDNA sources with the size of the dataset, allowing the same coverage over a differing number of clusters as the dataset size increases or decreases. The C10 dataset is the full-size dataset, with C01 being 10% of its size, C02 being 20% of its size, and so on.
4.3.5 Mouse Curated
This is a curated dataset created from a limited selection of genes chosen from chromosome 4 of the mouse. As a curated dataset, any clustering derived from it should contain few large clusters with very few orphaned ESTs.
4.4 Overview of Experiments
This study consists of five experiments. The first deals with purely theoretical performance, the second gauges the practical sensitivity of the algorithm, and the remaining three deal with practical performance measurements of the algorithm in question.
The following list provides a brief summary of the experiments:
1. Theoretical Performance and Cost Evaluation This experiment deals with
the comparison of theoretical ‘Peak GFLOPS’ between modern GPU and CPUs.
2. Sensitivity Comparison This experiment measures the sensitivity of the devel-
oped application in comparison to a known good benchmark.
3. Performance Benchmarking This experiment measures performance of the de-
veloped algorithm results for different datasets and compares against the perfor-
mance of a CPU implementation.
4. Dataset scaling tests This experiment tests the performance of the algorithm
when the dataset size is varied.
5. Profiling Analysis This experiment subjects the algorithm to a dynamic profiling
analysis in order to identify its inefficiencies and computational bottlenecks.
4.5 Investigation 1: Theoretical Performance and
Cost Evaluation
4.5.1 Aim
The aim of this experiment is to calculate and compare the theoretical FLOPS per rand of investment in GPU versus classical computing hardware.
4.5.2 Background
One of the main motivations for this project is the perceived cost saving of GPU hardware compared to similarly performing PC or server hardware. While a FLOPS measurement is no guarantee of performance, the figure does serve as a performance indicator.
4.5.3 Method
Various computing options will be surveyed, including GPUs, personal computing solutions, server solutions and large data server solutions. The ratio between their price and their peak FLOPS will be compared.
Peak FLOPS is defined as the clock rate multiplied by the number of cores multiplied
by the number of floating point operations each core can theoretically perform every clock
cycle.
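This definition amounts to a single multiplication. As a hedged illustration, using the commonly quoted specifications of the GTX 480 used in this study (the figures below are published specifications, not measurements from these experiments):

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Peak GFLOPS = clock rate (GHz) x core count x floating point
    operations each core can theoretically perform per clock cycle."""
    return clock_ghz * cores * flops_per_cycle

# GTX 480: 480 CUDA cores, 1.401 GHz shader clock, 2 single-precision
# FLOPs per core per cycle (fused multiply-add).
print(peak_gflops(1.401, 480, 2))  # ~1345 GFLOPS
```

Dividing this figure by the purchase price of each option yields the FLOPS-per-rand ratio compared in this experiment.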
4.5.4 Expected Outcome
It is expected that GPU hardware will be shown to theoretically provide more performance
for the same price than classical computing alternatives.
It is not expected that this will reflect real-life performance, but it will indicate the
reason why GPUs are used in this project. Experiment 2 will show whether this translates
into real-world performance advantages.
4.6 Experiment 1: Sensitivity Comparison
4.6.1 Aim
In order to prove that sensitivity and accuracy are not lost in the transfer to new hardware and algorithms, a test is needed to validate that the results produced are of good quality when compared to other tools.
4.6.2 Background
While a fully 'correct' clustering cannot be determined, due to the interdependency between the algorithms and the generation of the reference cluster, algorithms that are well documented to provide quality results can provide the reference against which the developed algorithm is compared.
Not all datasets are supplied with a reference clustering, so a comparison between
gpucluster and a known high quality algorithm is used instead. The wcdest tool is chosen
for this purpose.
4.6.3 Method
1. A dataset is chosen and used as an input to wcdest to create the reference clustering.
2. The same dataset is used as an input to gpucluster to create the experimental
clustering.
3. The reference and experimental clusterings are compared and the Jaccard and Sensitivity Indexes are computed.
4. Steps 1-3 are repeated for all datasets.
5. Results are reported and tabulated.
4.6.4 Expected Outcome
While JI and SE scores of 1.00 are not expected due to the differences in distance algo-
rithms, it is still expected that gpucluster will show high sensitivity.
Gpucluster is expected to achieve sensitivity and Jaccard scores above 0.95. Any values lower than this suggest a deviation that could negatively influence EST reassembly.
4.7 Experiment 2: Performance Benchmarking
4.7.1 Aim
The aim of this experiment is to obtain fair benchmarks comparing the performance of
gpucluster to that of wcdest.
4.7.2 Background
While one motivation of this project is the relative cheapness of GPUs compared to classical computing hardware, it is important to quantify any performance benefit that may be gained.
The CPU used in this comparison is detailed above in Section 4.2.3. The reference
application is wcdest.
It is difficult to adequately measure the performance of two different algorithms on
two different sets of hardware fairly. Exact execution times will differ greatly with the
generation of hardware, the software implementation, the number of cores used and the
input dataset.
It is important to note that the CPU results are reported for only a single CPU core. An upper bound of 4x faster execution exists when all cores of a quad-core CPU are used, though in reality performance will not increase linearly with the number of cores [109, 110].
Based on this, it is important to assume a fairly large error on any exact results, since they will differ depending on the exact hardware used. They serve mainly to indicate suitability and to identify trends.
4.7.3 Method
1. Select a database.
2. Execute wcdest on the database and time its execution.
3. Execute gpucluster on the database and time its execution.
4. Repeat 2 and 3 for a total of 5 times and average the result.
5. Repeat 1-4 for all databases.
6. Results are reported and tabulated.
4.7.4 Expected Outcome
While gpucluster is not expected to be over an order of magnitude faster than wcdest, the
theoretical performance disparity between the CPU and GPU is still expected to provide
a significant performance advantage to gpucluster in this experiment.
4.8 Experiment 3: Dataset scaling tests
4.8.1 Aim
The aim of this experiment is to provide evidence that the GPU performance scales with
larger datasets.
4.8.2 Background
As sequencing technology becomes cheaper, the volume of data made available to geneticists grows.
This experiment should show that the GPU scales well to larger datasets in terms of performance. While an upper bound on memory use exists, as long as this limit is not reached there is no reason that any performance advantage at the small scale would not also exist at the large scale.
4.8.3 Method
1. The A686904 dataset, a very large database of ESTs, is used.
2. Execute wcdest on a subset of the database and time its execution.
3. Execute gpucluster on the same subset and time its execution.
4. Repeat steps 2-3 with gradually increasing sized subsets.
5. Results are graphed and tabulated.
In our case the subsets chosen were the first 5 000 ESTs of the A686904 database, then the first 10 000 sequences, and so on.
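Producing such prefixes of a FASTA file can be sketched as follows (a Python illustration; the actual scripting for the experiments was done in PHP, and the file names shown are hypothetical):

```python
def first_n_fasta(text, n):
    """Return the first n FASTA records (a '>' header line plus its
    sequence lines) from the contents of a FASTA file."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current))
                if len(records) == n:
                    return "\n".join(records) + "\n"
            current = [line]
        elif current:
            current.append(line)
    # Flush the final record if fewer than n were found before end-of-file.
    if current and len(records) < n:
        records.append("\n".join(current))
    return "\n".join(records) + "\n"

# Usage (hypothetical paths):
# with open("A686904.fasta") as f:
#     subset = first_n_fasta(f.read(), 5000)
# with open("A686904_5000.fasta", "w") as out:
#     out.write(subset)
```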
4.8.4 Expected Outcome
Due to the higher overheads involved in GPU computation it is expected that performance
will be relatively low when small datasets are used, but that as dataset size increases the
overhead will become insignificant.
It is expected that this experiment will show that the GPU scales well compared to
the CPU. Since both the CPU and GPU will use similar algorithms, it is expected that
the execution time ratio between the two will remain somewhat constant even at higher
dataset sizes.
4.9 Experiment 4: Profiling Analysis
4.9.1 Aim
In this experiment we aim to evaluate the efficiency and shortcomings of the developed application. An analysis of its execution should both show the suitability of the GPU for this algorithm and provide directions on which future development can focus.
4.9.2 Background
The computer science term profiling refers to the act of dynamically analysing a com-
puter program as it executes. Special software is needed for this process, but it provides
information on memory use, execution times, function calls and instruction executions,
all of which is valuable when optimizing a program.
NVidia provides software called the NVidia Compute Visual Profiler which serves this
purpose [111].
Profiling allows identification of the bottleneck that limits performance to the greatest degree. An application may be bottlenecked in several ways, with the main possibilities discussed here:
1. Computational capacity limited
The majority of processing time is spent on instruction execution. Algorithm and instruction optimizations can lead to an increase in performance.
2. Data transfer limited
The data transfer between the CPU and the GPU begins to take a non-trivial amount of processing time. To optimize this, streams should be utilized to concurrently process one block while copying the data needed for the following block.
3. Memory throughput limited
Data cannot be read fast enough for computation. Memory read optimization can lead to an increase in performance; random access reads lower memory throughput and would cause this bottleneck.
4. Memory latency limited
Computation is idle while long memory reads are performed. More concurrent
threads should be used and efforts made to increase occupancy. Low occupancy
or data dependence should be avoided and some values should be considered for
re-computation instead of memory storage.
4.9.3 Method
1. Initialize NVidia Visual Profiler.
2. Create a new project for GPUCluster.
3. Add arguments to point to the SANBI 10000 benchmark dataset.
4. Perform profiling.
5. Present results.
4.9.4 Expected Outcome
Optimally the application would prove to be computationally bottlenecked. This indicates
that the application execution speed is limited only by the speed of the GPU.
However, since this is a string manipulation problem, it is very likely that the bottleneck will be memory related. Random access read patterns are expected to be a primary negative influence on performance.
4.10 Conclusion
In this chapter, the theory and concepts used for objective experimentation with the GPU were listed and introduced, and the assumptions used were described.
Any comparison of performance between a GPU and a CPU application can draw the criticism of comparing apples and oranges. An attempt has been made to make the comparison fair, but any such results should be considered critically.
It should also be kept in mind, however, that GPU technology applied to non-graphics problems is still in its relative infancy, even with expectations that it will greatly outpace the performance improvements of the CPU.
The next chapter deals with the introduction of various algorithms that will be con-
sidered for porting to the GPU.
Chapter 5
Selection of Algorithms
5.1 Introduction
In Section 2.6 the basic classes of algorithms were introduced. In this chapter specific implementations of each are offered and their suitability to the GPGPU platform evaluated.
The selection criteria used to judge the portability of a specific algorithm are first discussed, based on the strengths and limitations of the GPU platform and CUDA API presented in Chapter 3.
Various specific algorithms of interest are introduced and the advantages and disadvantages of each are objectively discussed. Those judged most suitable for GPU acceleration will be ported to CUDA for use in an application performing high performance EST clustering.
5.2 Selection Criteria
Several factors will need to be considered for the selection process of any algorithm that
is to be ported to the GPU.
Speed is one of the most important factors, but due to the platform differences it cannot be properly observed or estimated until the algorithm is actually implemented. Instead, factors that influence the suitability of the algorithm are analysed, with the assumption that a suitable algorithm will perform much faster than an unsuitable one.
The identified criteria are described below.
5.2.1 Large-scale Parallelizability
In order for an algorithm to be considered for porting to a GPU it needs to be inherently
parallelizable. That is to say, separate parts of its execution can run concurrently.
Unlike CPU parallelization, however, the scale is much larger. Where a multi-core CPU requires 2 to 8 separate threads in order to properly utilize all its cores, a GPU requires over a thousand per multiprocessor, with most GPUs having dozens of multiprocessors. On the other hand, every GPU thread is lightweight and can be assigned per data point rather than per independent task.
Any algorithm selected for porting to the GPU needs to be able to properly take
advantage of such a large amount of concurrent threads.
5.2.2 Data Independence
Ideally every thread will work independently, read in a piece of data, process, then output.
Realistically many algorithms require either data to be processed in sequence or require
the input of many separate pieces of data to form an output.
Shared memory on a GPU can be used as a limited form of inter-thread communication to help deal with data dependence, but this is not always possible and can carry a large performance penalty.
For this reason algorithms will be selected for their data independence or suitability
for various techniques to overcome this limitation.
The data independence of an algorithm is often an indicator of its large-scale parallelizability, though a data dependent algorithm can sometimes still be scaled (such as n-body simulations, where each result relies on computation involving every other body), and high data independence does not always result in large-scale parallelizability (such as in applications with small datasets).
5.2.3 Random seeks
An algorithm that performs per-thread random seeks imposes great performance penalties on the GPU. GPU memory is designed to read contiguous memory with spatial locality, able to read the data of multiple threads in a single read operation.
If reads have no spatial locality then they need to be performed sequentially, negatively affecting the throughput of the memory.
5.2.4 Computation Size
Executing a GPU kernel causes an implicit delay as instructions and data are transmitted to the GPU, the processing occurs, and the results are transmitted back to the CPU.
For large jobs this impact is negligible, but if a large number of very short jobs occur one after another, this delay can become significant.
For this reason, large computation jobs are preferred over many small jobs.
5.2.5 Division into smaller tasks
It is assumed that an entire dataset, along with its indexing and structural overhead, cannot be loaded into the limited memory of a GPU at once. For this reason it is a requirement that the algorithms be able to operate on smaller self-contained subsections of the full task. Algorithms that require random memory access over the full dataset are therefore rejected under this constraint.
5.2.6 Simplicity and Established algorithms
Rather than developing new algorithms, or porting algorithms with long-established histories and complex optimizations, simple algorithms will be preferred. This is a new platform, so algorithms are unlikely to work perfectly the first time, and simple algorithms are easier to understand and debug.
5.2.7 Sensitivity
Where the other criteria focus mostly on performance-related concerns, sensitivity is based on the accuracy and correctness of the algorithm. Bioinformatics applications dealing with sequenced data are not binary: results can be 'more correct' or 'less correct' than others without being incorrect. This is because sequencing data has high error rates, owing both to individual genetic variation and to sequencing errors.
This criterion serves as an indicator of the algorithm's performance in this regard. It should not be used as an absolute judgement of an algorithm's value, since even one that is insensitive yet fast can prove very useful.
5.3 Data Structures
5.3.1 File Structure
In order to design the application’s data structure, it is important to first detail the format
of the data. The data is provided in an industry standard FASTA format. An example
of this format, from the SANBI10000 dataset, is given below.
>T30671 g612769 | T30671 CLONE_LIB: Human Eye. LEN: 319 b.p. FILE
gbest3.seq 5-PRIME DEFN: EST20487 Homo sapiens cDNA 5’ end
ATGATAATGAAAGACTCTCGAAAGTTGAAAAAGCTAGACAGCTAAGAGAACAAGTGAATG
ACCTCTTTAGTCGGAAATTTGGTGAAGCTATTGGTATGGGTTTTCCTGTGAAAGTTCCCT
ACAGGAAAATCACAATTAACCCTGGCTGTNTGGTGGTTGATGGCATGCCCCCGGGGGTGT
CCTTCAAAGCCCCCAGCTACCTGGAAATCAGCTCCATGAGAAGGATCTTAGACTCTGCCG
AGTTTATCAAATTCACGGTCATTAGACCATTTCCAGGACTTGTGAATTAANAACCAGCTG
GTTGATCAGAGTGAGTCAG
This entry begins with a header, identified by the starting > character. The header includes information such as the sequence's unique code, its source, clone data and other annotation. For the purpose of this application the only information used from the header is the unique code.
Following this header is the actual EST sequence. This is limited to the 4 base characters, A, C, T and G, as well as the character N, which represents an unknown base. For simplicity, any unknown bases will be replaced with a random base.
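To make the parsing concrete, the input stage can be sketched as follows. This is an illustrative Python sketch, not the application's actual code; the function name and the fixed random seed are assumptions:

```python
import random

def parse_fasta(text, seed=1):
    """Parse FASTA text into (code, sequence) pairs, keeping only the
    code from each header and replacing every unknown base 'N' with a
    random base, as described above."""
    rng = random.Random(seed)   # fixed seed keeps runs reproducible
    entries, code, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if code is not None:
                entries.append((code, "".join(parts)))
            code, parts = line[1:].split()[0], []
        else:
            parts.append("".join(rng.choice("ACGT") if c == "N" else c
                                 for c in line))
    if code is not None:
        entries.append((code, "".join(parts)))
    return entries

example = ">T30671 g612769 | T30671 CLONE_LIB: Human Eye.\nACGTN\nTTNGA"
print(parse_fasta(example))
```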
5.3.2 Memory Structure
All of the entries in the FASTA file together form a dataset. Each is assigned a consecutive
unique index, which is used throughout the rest of the program to keep track of individual
sequences.
Dataset =
  (1) T27875 CAGAGA...
  (2) T27876 TCCCTG...
  ...
  (N) H86369 ATTCGG...
Figure 5.1: Visualization of the dataset as a collection of EST sequences
Every loaded sequence has a data structure that details the index of the sequence, its
header, certain meta-data, and the starting and ending position of that sequence. This
starting and ending position relates to a large character array that contains every sequence
of the loaded dataset sequentially.
For 1.x Compute Capability GPUs the sequences are additionally padded so that every
new sequence begins at a 16 byte mark, a requirement for optimal speed when reading
from global memory. This padding is not implemented for 2.x Compute Capability GPUs
since they are much better at performing unaligned reads with no performance penalty.
The single large character array containing all used sequences is maintained because
large numbers of sequential sequences are copied at a time to the GPU for comparison.
By keeping all sequences together in this way this can be performed with a single copy
operation, which is much more efficient than dozens of smaller copy operations.
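The packed layout can be sketched as follows (an illustrative Python sketch; the padding byte and function name are assumptions, not the application's actual code):

```python
def pack_sequences(seqs, pad_to=16):
    """Concatenate all sequences into one character buffer, padding
    each sequence start to a multiple of `pad_to` bytes (the
    alignment requirement described above for 1.x compute-capability
    GPUs), and record each sequence's (start, end) span."""
    buf, spans = [], []
    for s in seqs:
        while len(buf) % pad_to:        # advance to the next boundary
            buf.append("\0")
        spans.append((len(buf), len(buf) + len(s)))
        buf.extend(s)
    return "".join(buf), spans

buf, spans = pack_sequences(["ACGTACGTACGTACGTAC", "TTGA"])
print(spans)  # every start offset is a multiple of 16
```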
5.3.3 Job Data Structure
EST clustering at its core involves a quick many-to-many comparison between all of the
elements of the dataset with each other. Figure 5.2 illustrates this, where each comparison
is shown as a line between each EST.
12
3
4 5
6
Figure 5.2: Many-to-Many comparison between 6 elements
In order to do this programmatically, however, this needs to be presented and stored in a format that can easily be processed on a PC. A matrix is the easiest to work with, with the caveats that no EST needs to be compared to itself and no pair needs to be compared twice. Figure 5.3 represents the same logical structure presented as an upper triangle matrix. This format wastes some space, but is the logical equivalent of a many-to-many comparison between many elements.
As mentioned, limited GPU memory enforces the constraint that large datasets must be capable of being subdivided into multiple smaller jobs. It is with this in mind that
     (1) (2) (3) (4) (5) (6)
(1)   ×   √   √   √   √   √
(2)   ×   ×   √   √   √   √
(3)   ×   ×   ×   √   √   √
(4)   ×   ×   ×   ×   √   √
(5)   ×   ×   ×   ×   ×   √
(6)   ×   ×   ×   ×   ×   ×
Figure 5.3: Many-to-many comparison between 6 elements in grid format
the subdivision strategy illustrated in Figure 5.4 can be employed.
         (1) (2) (3) (4)   (5) (6) (7) (8)
          i0  i1  i2  i3    i0  i1  i2  i3

(1) j0     ×   √   √   √     √   √   √   √
(2) j1     ×   ×   √   √     √   √   √   √
(3) j2     ×   ×   ×   √     √   √   √   √
(4) j3     ×   ×   ×   ×     √   √   √   √

(5) j0                       ×   √   √   √
(6) j1                       ×   ×   √   √
(7) j2                       ×   ×   ×   √
(8) j3                       ×   ×   ×   ×
Figure 5.4: Many-to-many comparisons of 8 elements divided into 3 separate 4 by 4 sized jobs
This strategy has a number of advantages, such as needing to store only the sequences involved in a specific block in GPU memory at any time. It also eliminates the memory overhead of the irrelevant lower triangle by simply discarding those jobs entirely. The diagonal jobs, where a set of sequences is compared against itself, will still have less than half the workload of other blocks, which the scheduler and algorithm will need to be programmed to process correctly.
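The block enumeration of Figure 5.4 can be sketched as follows (illustrative Python; the tuple layout is an assumption):

```python
def make_jobs(n, block):
    """Enumerate the upper-triangle blocks of the comparison matrix
    as (j_start, j_end, i_start, i_end) jobs. Lower-triangle blocks
    are never generated, and diagonal blocks (j_start == i_start)
    carry less than half the workload of the others."""
    return [(j, min(j + block, n), i, min(i + block, n))
            for j in range(0, n, block)
            for i in range(j, n, block)]

print(make_jobs(8, 4))  # → [(0, 4, 0, 4), (0, 4, 4, 8), (4, 8, 4, 8)]
```

For 8 sequences and a block size of 4 this yields exactly the 3 jobs shown in Figure 5.4.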
5.3.4 Results Data Structure
There are two ways to retrieve results from the GPU once processing has completed. The first is to return a matrix the same size as the job, with each cell containing a boolean pass or fail.
The alternative is to have the GPU process this matrix and provide the CPU with a sorted list of passes, eliminating the overhead of transmitting the failures entirely. Since the majority of comparisons are expected to result in a mismatch, this should dramatically reduce the data sent back to the CPU.
The expectation, though, is that the former method of passing back unfiltered results will perform better, since the throughput of memory copies between the GPU and CPU is not very limited. Only the latency of the copies is a problem, and this is incurred for either solution. In addition, sorting operations usually involve random access or multiple iterations, which can be done more efficiently on the CPU.
For this project, the returned results will thus be of a similar format to that detailed in Figure 5.4, and it will be the task of the CPU to iterate through the results and extract those pairs that pass the comparison score threshold. This is a task the CPU excels at, so the performance impact is expected to be minimal.
5.3.5 Output Structure
The eventual output of the application after comparison and clustering is a newline-delimited list of the indexes of the identified clusters.
1.
2 4.
3.
5 6 7.
8.
In this example 5 clusters have been identified, 3 of which are single EST clusters.
5.4 Program Structure
While an incorrect program structure selection can severely impact the performance of an application, it is not expected that different 'correct' program structures offer an advantage comparable to improvements in the algorithms used.
The planning of the higher level program is highly influenced by the selection of the
algorithms the application will use, but at this stage in the program planning several
assumptions can be made and higher level design proposals can be discussed.
The first assumption is that the program will utilize an input stage that will read in
standard FASTA files, the industry standard representation of EST and other sequence
data. It will read all of the data before moving on to the next stage.
The next stage of computation invokes heuristics: lightweight algorithms that can reject clear non-matches with much less processing than a full comparison. Using such a heuristic can greatly improve the performance of the program by minimizing the number of more expensive full comparisons. This program will utilize a heuristic algorithm for this purpose.
The full selection algorithm is then performed on all the pairs that pass the heuristic
algorithm. If a pair passes the selection algorithm then the pair is clustered together in a
clustering stage.
Once all jobs are executed, the results are collected and a cluster file is output that
details the discovered clusters.
Using these assumptions, the following proposals for the final program structure are
provided.
5.4.1 Basic Program Structure
Figure 5.5: Basic Program Structure
This proposal is typical of serial programs, completing one stage after another before
arriving at the output. This is detailed in Figure 5.5.
While simple and easy to maintain and develop, this approach unfortunately provides
little ability to properly utilize higher level parallel implementations. Without subdividing
the workload into tasks, the entire workload would need to be loaded into memory at the
same time, something that is not possible for larger datasets.
A paging system can be used to load only a portion of the required data at once, over which the ESTs are compared before loading the next set of data. This allows datasets larger than available memory to be processed, but at a performance cost owing to constant loading of new data.
While this approach is certainly simple, it is not scalable or particularly parallelizable.
Figure 5.6: Parallel Program Structure
5.4.2 Parallel Program Structure
The proposed program structure shown in Figure 5.6 is logically equivalent to the Basic Program Structure, but takes advantage of subdividing the workload into discrete 'jobs'. The process of subdivision is explained in Section 5.3. The subdivision of the workload is done to minimize the amount of memory paging required, keeping ESTs in memory to be compared against as many sequences as possible before being replaced by a new set.
One way this is done is by staging the comparison process immediately after the heuristic stage, so that comparison can be performed on the sequences already loaded into memory by the heuristic stage.
A disadvantage is the additional volatility in the workload of every job, since each job might have a dozen pairs that pass the heuristic step or none at all, with, on average, most of the sequences loaded in memory already rejected by the heuristics step. This volatility can introduce situations where the GPU is underutilized during blocks where particularly few pairs pass the heuristics stage.
Though a queuing system would need to be developed, this division allows much better flexibility in taking advantage of parallel platforms, or even in using several platforms concurrently (multiple GPUs, GPU/CPU, multiple PCs). Single-GPU scenarios, however, will not see any significant performance benefit from this approach, since it will not allow multiple jobs to be executed concurrently.
This approach does, however, result in the heuristics and comparison stages being tied more closely together, decreasing modularity and making it much harder to replace either stage with another algorithm at a later date, as well as to debug the process when errors occur.
Despite this, this proposal has the advantage of more self-contained subdivisions, which will be required should multiple platforms (multiple PCs or multiple GPUs) ever be used concurrently.
5.5 Heuristics Selection
A good heuristic algorithm is defined by both its speed and its ability to filter out large numbers of non-matches while, more importantly, having minimal impact on true positives. In this project, where the vast majority of comparisons are expected to be mismatches, heuristics can dramatically reduce the computation required by more expensive algorithms by rejecting obvious mismatches.
5.5.1 Common word heuristics
Common n-word Heuristic
This is a very simple heuristic, which simply confirms whether the sequences in a pair share a sufficient number of common n-length words. Since two sequences that are very similar to each other are almost guaranteed to share a large number of short common sub-sequences, this algorithm matches the requirements of a heuristic algorithm.
Initially a word count table must be set up, but this can be done in a parallel manner. For the actual matching, every word in one sequence can be assigned its own thread with few data dependence requirements. Searches in the word table do, however, use per-thread random seeks, possibly negatively affecting memory throughput.
Since this algorithm operates on linear memory and is based around sub-sequences, debugging is simplified. Despite the concern about random seeks, this algorithm is a strong candidate for porting to the GPU, suiting the GPU architecture almost perfectly.
Yet despite its simplicity, it is expected to be slow and not as sensitive as it could be.
Improvements on this base algorithm, such as the t/v-word and u/v-sample heuristics
should prove to be much more valuable for practical use.
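As a sketch of the base idea (illustrative Python; the word length and threshold values are arbitrary placeholders, not tuned parameters):

```python
def n_words(seq, n):
    """All overlapping n-length words of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def common_word_pass(x, y, n=6, threshold=4):
    """Pass the pair if y contains at least `threshold` words that
    also occur in x's word table."""
    table = set(n_words(x, n))              # word table built once per x
    hits = sum(1 for w in n_words(y, n) if w in table)
    return hits >= threshold
```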
t/v-word Heuristic
A concern with the common n-word heuristic is its insensitivity to the locality of similarity. To address this concern, the t/v-word heuristic introduces the additional constraint of a 100-character wide window within which at least t of the v-length words are required to appear before passing, rather than allowing matches anywhere in the sequence.
This constraint introduces a challenge by reducing the data independence of the algorithm, since each thread is now concerned with the results of other threads within a locality.
Though this heuristic is more difficult to implement and has the potential to cause
parallelisation problems, it does improve the sensitivity of the heuristic significantly.
u/v-sample Heuristic
Where the t/v-word heuristic trades performance for greater sensitivity, the u/v-sample
heuristic does the opposite, reducing sensitivity while greatly improving the performance.
Where the common n-word heuristic would compare every word in a sequence with the word table, the u/v-sample heuristic skips a number of words between tests (usually 8 or 16), testing only a fraction of the total words and requiring only that at least u v-length words appear in both the sequence and the word table.
This causes the heuristic to pass many pairs that do not match, but, more importantly, it allows fast and easy rejection of the majority of the pairs that do not match, which constitute the overwhelming majority of all pairs.
Chained Heuristic
This is not a separate heuristic, but rather a combination of the t/v-word and u/v-sample
heuristics.
By chaining the two heuristics, executing the u/v-sample heuristic first and the t/v-word heuristic only if the former passes, the best of both heuristics can be obtained.
The quick but insensitive sample heuristic rejects the majority of the EST pairs, while
the more expensive but more sensitive t/v-word heuristic confirms the pair’s candidacy for
a match. Once confirmed, the much more expensive and extensive comparison algorithm
can be used to accept or reject the clustering of the pair.
Through utilizing this combination of heuristics both high speed and high sensitivity
can be realised.
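The chain can be sketched as follows (illustrative Python; all parameter values are placeholders rather than the tuned values used later, and the window scan is a naive implementation):

```python
def word_table(x, n):
    """Set of all n-length words occurring in x."""
    return set(x[i:i + n] for i in range(len(x) - n + 1))

def uv_sample(x, y, n=6, step=8, u=2):
    """u/v-sample: test only every `step`-th word of y, passing if at
    least u sampled words occur in x."""
    table = word_table(x, n)
    hits = sum(1 for i in range(0, len(y) - n + 1, step)
               if y[i:i + n] in table)
    return hits >= u

def tv_word(x, y, n=6, window=100, t=8):
    """t/v-word: require at least t matching word positions of y to
    fall within a single `window`-character span."""
    table = word_table(x, n)
    hits = [i for i in range(len(y) - n + 1) if y[i:i + n] in table]
    return any(sum(1 for p in hits if h <= p < h + window) >= t
               for h in hits)

def chained(x, y):
    # cheap, insensitive filter first; the sensitive (and more
    # expensive) check runs only on the survivors
    return uv_sample(x, y) and tv_word(x, y)
```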
5.5.2 Suffix Arrays
Originally introduced in 1990 [112], suffix arrays have been proven to be a highly efficient
method for searching for maximal sub-sequences in a database [113].
Suffix arrays can be used very effectively as a heuristic algorithm due to the fact that
two sequences that share long sub-sequences are more likely to have a common source.
Suffix arrays operate by making the suffixes of strings searchable. As an example, if one wishes to create a suffix array of the word "ACTGCGA$" (where $ is a termination character), one can construct a database of all possible suffixes of this word and their indexes in the original word:
0 - ACTGCGA$
1 - CTGCGA$
2 - TGCGA$
3 - GCGA$
4 - CGA$
5 - GA$
6 - A$
7 - $
Assuming a sorting order of $ < A < C < T < G, these suffixes can then be sorted and grouped by first character:

  $: 7    A: 6, 0    C: 1, 4    T: 2    G: 5, 3
At its simplest, a binary search on the above suffix array against a reference sequence can be used to find the longest common exact string match. This search needs to be performed for every starting letter of the reference sequence to perform an exhaustive search for the longest common exact match that does not begin at the first character.
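The search described above can be sketched as follows (illustrative Python using plain lexicographic ordering rather than the custom $ < A < C < T < G order of the example; storing the suffixes explicitly is wasteful and done here only for clarity):

```python
from bisect import bisect_left

def suffix_array(s):
    """Indices of all suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_common_match(s, query):
    """Length of the longest exact substring of `query` occurring in
    s, found by one binary search per starting position of `query`."""
    suffixes = [s[i:] for i in suffix_array(s)]
    best = 0
    for start in range(len(query)):
        q = query[start:]
        pos = bisect_left(suffixes, q)
        # the longest shared prefix is with a lexicographic neighbour
        for cand in suffixes[max(0, pos - 1):pos + 1]:
            k = 0
            while k < min(len(q), len(cand)) and q[k] == cand[k]:
                k += 1
            best = max(best, k)
    return best

print(longest_common_match("ACTGCGA$", "TTGCGAT"))  # → 5 ("TGCGA")
```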
Many improvements and derivative algorithms have been developed that increase searching speed and reduce space consumption [113].
Suffix arrays require the initial expensive setup of a database, after which searches
on that database can be done in parallel with any number of threads with great data
independence.
Unfortunately this database tends to operate best when not subdivided, with the full database size scaling with the dataset size and realistically being larger than can be stored on the GPU.
Even using a subdivided database on a GPU, this algorithm is based around random searches, limiting the performance of a GPU. Additionally, a suffix array database uses a large number of pointers and references, resulting in increased complexity and difficulty of debugging in linear memory.
While this method can be considered a good possibility as a CPU side algorithm, it is
rejected as a candidate for GPU processing, though this assertion could be revisited and
proven incorrect in the future.
5.6 Comparison Algorithm Selection
Due to the implementation of heuristics, the selection algorithm can be far more expensive than it otherwise could be, since it will only be applied to likely matches instead of to every pair.
The main criteria for the selection algorithm are its suitability for the GPU and the expected quality of its results.
5.6.1 Simple Comparison
The simple comparison can be represented with the following formula:

d(k) = Σ_{i=1}^{|x|} f(x(i), y(i+k))        (5.1)

f(a, b) = 1 if a = b,  0 if a ≠ b           (5.2)

where |x| is the length of the string x, x(i) is the ith character of the string x and y(i+k) is the (i+k)th character of the string y. This function returns the count of all the matching characters at positional offset k.
When using the simple comparison the goal is to find a value of k where the greatest
number of characters in both sequences match.
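As a reference sketch (illustrative Python; the offset enumeration is a naive scan over all possible k):

```python
def f(a, b):
    """Equation 5.2: 1 on a character match, 0 otherwise."""
    return 1 if a == b else 0

def d(x, y, k):
    """Equation 5.1: matching characters between x and y at offset k
    (positions falling outside y contribute nothing)."""
    return sum(f(x[i], y[i + k]) for i in range(len(x))
               if 0 <= i + k < len(y))

def best_offset(x, y):
    """The offset k that maximises the number of matching characters."""
    return max(range(-len(x) + 1, len(y)), key=lambda k: d(x, y, k))

print(best_offset("ACGT", "TTACGT"), d("ACGT", "TTACGT", 2))  # → 2 4
```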
The simple comparison has the advantage of simplicity, but it is not widely used in any but the most naive applications, as it is slow and does not take insertions and deletions into account.
The simple comparison is, however, a good candidate for GPU porting due to its parallelizability, high data independence and lack of random seeks. It nevertheless suffers from low accuracy due to the above-mentioned issues.
As such, this comparison will not be considered for this project.
5.6.2 FFT Based
An optimization of the simple comparison algorithm can be made by noticing its similarity to convolution [114].
Unfortunately, this approach is most useful when dealing with protein sequence alignments, but suitable adjustments can be made, such as representing the nucleotide sequence as a 4-dimensional binary sequence:
as a 4 dimensional binary sequence:
x: ACGTNA
A: 100011
C: 010010
G: 001010
T: 000110
Once in this format, each dimension can be Fourier transformed into the frequency domain:

x_A(i) ⇔ X_A(n)        (5.3)

Convolution in the frequency domain is a simple multiplication:

D_A(n) = X*_A(n) Y_A(n)        (5.4)

where * represents complex conjugation. To obtain d_A(k) it is only necessary to invert the Fourier transform:

d_A(k) ⇔ D_A(n)        (5.5)
The above is then repeated similarly for d_C(k), d_G(k) and d_T(k), and the per-offset match count is the sum of the four channels, the distance metric being the maximum of this sequence over k:

d(k) = d_A(k) + d_C(k) + d_G(k) + d_T(k)        (5.6)
Though the comparison of two sequences requires O(n log n) operations (as opposed to the simple comparison's O(n²)), practical use will involve much less work, since the FFT transformation is only required once per reference window and is reused for every compared window.
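The channel decomposition can be checked directly (illustrative Python; a direct per-offset sum stands in for the FFT, which would produce the same values for all offsets at once in O(n log n)):

```python
def channels(seq):
    """4-dimensional binary representation of a sequence; as in the
    example above, N is mapped to 1 in every channel."""
    return {b: [1 if c == b or c == "N" else 0 for c in seq]
            for b in "ACGT"}

def d_via_channels(x, y, k):
    """d(k) computed channel-by-channel as in Equation 5.6."""
    cx, cy = channels(x), channels(y)
    return sum(cx[b][i] * cy[b][i + k]
               for b in "ACGT"
               for i in range(len(x)) if 0 <= i + k < len(y))

print(d_via_channels("ACGT", "TTACGT", 2))  # matches the simple comparison
```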
FFT on the GPU is a subject that has been greatly investigated by NVIDIA, even resulting in a dedicated CUDA library [101]. In this regard it has been shown to be greatly suited to the GPU platform, showing great improvements over the CPU for larger datasets [115].
Further research however revealed that while the FFT function is used extensively in
this algorithm, the actual FFT computation does not dominate the total computational
time of the algorithm. For this reason, even if the FFT portion can be improved greatly,
the limit to the improvement of the full algorithm will depend largely on the dynamic
programming portion, a domain that is typically much more difficult to parallelize on the
GPU.
While this algorithm features decent data independence and parallelism in the FFT stage, it is not easily scalable due to the diminishing returns of parallelizing only a part of the algorithm.
5.6.3 d2 distance
The d2 clustering algorithm is a word-based distance algorithm. When applied to pairwise comparisons it involves decomposing both sequences into n-length words, then comparing the two sequences' word counts.
For example, the sequence x = ACGTATAT can be decomposed into 6 words, each of length 3: ACG, CGT, GTA, TAT, ATA and TAT.
The function cx(w) refers to the count of occurrences of a particular word w in the sequence x. For the above example, cx(TAT) would be 2.
The definition of d2 as given by [116] is the following:
d2(x, y) = Σ_{k=1}^{K} Σ_{i=1}^{4^k} ρ(w_i) n(w_i) {c_x(w_i) − c_y(w_i)}²        (5.7)

where K is the maximum word length, ρ(w_i) is the weighting of word w_i and n(w_i) is the length of word w_i.
A simplified version of the above formula can be created by fixing the word length to a single value k [117]:

d2_k(x, y) = Σ_{i=1}^{4^k} (c_x(w_i) − c_y(w_i))²        (5.8)
A weighting term can also be added, possibly utilizing masking data to help minimize
the effects of repeats.
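The simplified formula can be sketched as follows (illustrative Python; unweighted, and applied to whole sequences rather than the moving windows discussed below):

```python
from collections import Counter

def word_counts(seq, k):
    """c_x(w): occurrence counts of every k-length word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(x, y, k=3):
    """Simplified fixed-word-length d2 of Equation 5.8: the sum of
    squared differences between the two sequences' word counts."""
    cx, cy = word_counts(x, k), word_counts(y, k)
    return sum((cx[w] - cy[w]) ** 2 for w in set(cx) | set(cy))

print(d2("ACGTATAT", "ACGTATAT"), d2("AAAA", "CCCC"))  # → 0 8
```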
The d2 algorithm is shown to provide good-quality results while being relatively simple, but there are concerns about its parallelizability due to the serial computation of windows that are not data independent.
The d2 algorithm operates on 'moving windows', meaning that only a part of the sequence is considered at once. This limits the number of threads that can be assigned to simultaneously operate on a pair of sequences. While multiple windows can be used at once, this does prove to be a limit on the parallelizability of the algorithm and the number of simultaneous threads that can contribute to the same comparison. These issues make this algorithm a poor candidate for GPU computation.
Due to its simplicity and sensitivity, however, a CPU version of this algorithm can be used to verify the accuracy of the GPU results.
5.6.4 Levenshtein Edit Distance
The Levenshtein edit distance is a distance metric defined as the number of edits needed to transform one string into another.
These edits can consist of insertions, deletions or substitutions, each with a weight of 1. A pair with a lower Levenshtein distance required fewer edits to make the strings identical, and is thus more similar than a pair with a higher distance.
For instance, in the following example, 3 edits are needed to render the two sequences identical to one another: 1 insertion, 1 deletion and 1 substitution.
ACGT-TCAG
-CGTATCGG
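The distance itself is computed with the standard dynamic-programming recurrence (illustrative Python; the row-by-row matrix reuse is an implementation detail):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance; insertions,
    deletions and substitutions all have weight 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("ACGTTCAG", "CGTATCGG"))  # → 3, as in the example above
```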
The Levenshtein edit distance usually assumes two strings of equal or similar length, both with similar frames. It does not work well with partial matches, since even if the partial region scores well, edit operations are still needed to transform the rest of the string.
It is for this reason that Levenshtein edit distances are not often used in bioinformatics
where the matches could have radically different frames and only partly overlap.
This limitation makes this algorithm unsuited for the purpose of EST clustering.
Related algorithms, detailed below, are considered instead.
5.6.5 Smith-Waterman
Smith-Waterman alignment is one of the most important algorithms in use in the bioinformatics field. It is used to obtain both a similarity score and an alignment for any pair of sequences. The process of alignment is described in Section 2.6.2.
The Smith-Waterman algorithm is based on edit distances, but unlike the Levenshtein
edit distance it can obtain partial or overlapping matches, instead of attempting to match
the entirety of the two sequences. This is called local alignment and is a valuable quality
for EST comparisons where only part of the sequences need to match for the two sequences
to be clustered together.
To explain the algorithm, recall the example used in Section 2.6.2. The goal is to align
the following two sequences:
Sequence 1: GATTCGTTA
Sequence 2: GGATCGTA
To perform this alignment it is important to first create what is known as a substitution matrix. Substitution matrices for protein comparisons are usually complex, with different weightings for every amino acid pair, dependent on the biological possibility of one protein randomly mutating into or being read as another. Such a substitution matrix accounts for both evolutionary divergence and experimental error.
When applied to this application, though, the substitution matrix tends to be simple, due to the smaller alphabet of EST sequences and due to the fact that evolutionary divergence does not have to be accounted for, since all the ESTs are sourced from the same individual. To account for experimental error only 2 variables are needed: the match score and the mismatch score. These are chosen to provide the similarity required. For instance, a match score of 1 and a mismatch score of -1 will only pass pairs with at least 50% similarity, i.e. where every second character matches.
For this algorithm to be more useful in this application, however, higher similarity requirements are needed, usually 95% or more. For this example, though, a similarity of 66% will be used, as shown in Table 5.1.
This substitution matrix results in the requirement that for every mismatch, the region must contain at least 2 matches.
The data structure that Smith-Waterman operates on is best represented as a matrix
with the height of the matrix being equal to the length of one sequence and the width
being equal to the length of the other.
      A    C    T    G
A     1   -2   -2   -2
C    -2    1   -2   -2
T    -2   -2    1   -2
G    -2   -2   -2    1

Table 5.1: 66% Similarity Substitution Matrix
Beginning at the top left of the matrix, scores are calculated based on the following mathematical rules [118]:

H_{i,j} = max( 0,
               H_{i−1,j−1} + s(A_i, B_j)    (match/mismatch),
               H_{i−1,j} + s(A_i, −)        (deletion),
               H_{i,j−1} + s(−, B_j)        (insertion) )        (5.9)
where H_{i,j} is the score at a given position in the matrix, s() is the substitution matrix and A_i and B_j are the characters of the two sequences at the ith and jth positions respectively.
From the formula, H_{i,j} relies on H_{i−1,j−1}, H_{i−1,j} and H_{i,j−1}. For this reason computation starts from H_{0,0} and proceeds in anti-diagonals until the entire matrix is computed.
Of note is that the score can never dip below 0. This means that long sections of mismatches do not penalize a later match when it occurs. This is where the property of local alignment comes from.
The alignment matrix for the two example sequences is shown in Table 5.2, with the
highest scoring path highlighted.
      G   A   T   T   C   G   T   T   A
G     1   0   0   0   0   1   0   0   0
G     1   0   0   0   0   1   0   0   0
A     0   2   0   0   0   0   0   0   1
T     0   0   3   1   0   0   1   1   0
C     0   0   1   1   2   0   0   0   0
G     1   0   0   0   0   3   0   0   0
T     0   0   1   1   0   1   4   2   0
A     0   1   0   0   0   0   2   2   3
Table 5.2: Alignment matrix between ’GATTCGTTA’ and ’GGATCGTA’
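The matrix fill can be sketched as follows (illustrative Python; only the maximum score is returned, since the track-back is not needed when Smith-Waterman is used purely as a distance metric, as discussed below):

```python
def sw_score(a, b, match=1, mismatch=-2, gap=-2):
    """Fill the Smith-Waterman matrix of Equation 5.9 with the 66%
    similarity scores of Table 5.1 (gap penalty equal to the
    mismatch penalty) and return the maximum cell value."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match/mismatch
                          H[i - 1][j] + gap,     # deletion
                          H[i][j - 1] + gap)     # insertion
            best = max(best, H[i][j])
    return best

print(sw_score("GGATCGTA", "GATTCGTTA"))  # → 4, the maximum in Table 5.2
```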
The maximum value in the above matrix serves a dual purpose: it is an indicator of the similarity of the two sequences, and it provides a starting point for tracking back along the highest values, from which the alignment can be obtained.
Sequence 1: -GATTCGTTA
Sequence 2: GGAT-CGT-A
where the ’−’ character represents gaps.
When adapting Smith-Waterman for this application, it should be noted that since the algorithm is used as a distance metric and not as an alignment algorithm, the track-back stage is unnecessary and can be skipped.
The result of this adaptation is that Smith-Waterman as written does not score pairs that have long matching regions with few errors higher than short regions with no errors. The score depends only on the ratio of errors to matches and does not strongly favor long regions. Only after performing the track-back stage would the actual length of the match become known. The next section deals with an attempt to compensate for this shortcoming.
5.6.6 Modified Smith-Waterman
In order to address some of the shortcomings of the adapted Smith-Waterman algorithm,
the modified Smith-Waterman is proposed.
In this algorithm, in addition to the score variable, another variable named the cumulative score is used. This variable increments with every match and does not decrease with mismatches, resetting to 0 only when the Smith-Waterman score is also set to 0.
The advantage of this alternate scoring is that it favors long matching sequences, even
if only barely above the needed similarity. The effect of this alternate scoring is shown
below in Figure 5.7.
In addition, several small modifications are made, such as making gap penalties equal to mismatch penalties and eliminating the track-back stage, intended to simplify Smith-Waterman and use it as a more efficient distance metric.
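Since no pseudocode is given at this point, the following is one possible interpretation of the scheme (illustrative Python; the tie-breaking between equally scored moves is an assumption):

```python
def modified_sw(a, b, match=1, mismatch=-2):
    """Alongside each Smith-Waterman cell, a cumulative score counts
    the matches accumulated since the score last reset to 0, so a
    long, merely adequate match outscores a short perfect one. Gap
    penalties equal the mismatch penalty; no track-back is kept."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    C = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h, c = max((0, 0),
                       (H[i-1][j-1] + s, C[i-1][j-1] + (s == match)),
                       (H[i-1][j] + mismatch, C[i-1][j]),
                       (H[i][j-1] + mismatch, C[i][j-1]))
            # the cumulative score resets only when the score does
            H[i][j], C[i][j] = h, (c if h > 0 else 0)
            best = max(best, C[i][j])
    return best
```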
This algorithm is not very data-independent, due to the matrix format of the computation. Though the calculation can be parallelized using a diagonal method, this is not optimal and often leaves processors idle. The maximum length of the sequences compared is dependent on the registers and threads utilized, making arbitrarily large sequences difficult to compare. The algorithm is also not very simple.
Figure 5.7: Comparison of Cumulative Score versus default Smith-Waterman scoring
Despite all this, the memory access of this algorithm is linear and predictable with no random seeks, which should allow great memory throughput, and its sensitivity is known to be good due to its wide use. This makes the algorithm a good candidate for EST pair comparisons.
5.7 Comparison
The metrics used for comparison in this scorecard are given in Section 5.2.
• Parallelizability is the algorithm’s capability to scale through multiple threads rather
than faster threads.
• Data Independence is the ability to keep each thread independent of the others.
• Random Seeks negatively affect the memory throughput of a GPU, so they are to be avoided.
• Computation Size is intended to be as large as possible, rather than split across many smaller executions.
• The ability to subdivide into smaller tasks is an important feature to allow the
computation size to be customized to the amount of available memory.
• Simplicity is important for finding errors and debugging.
The algorithms and options for this project are detailed above. Table 5.3 summarises how each option scores on the various criteria.

Name                        Parallel  Independence  Random Seeks  Size  Subdivide  Simplicity  Sensitivity  Overall

Program Structure Options (Section 5.4)
Basic Program Structure        ×           −             −          √       ×          ©            −          ×
Parallel Program Structure     ©           −             −          √       ©          √            −          ©

Heuristics (Section 5.5)
Common n-word                  ©           ©             √          ©       ©          ©            ×          ×
t/v-word                       √           √             √          ©       ×          ×            ©          √
u/v-sample                     ©           ©             √          √       ©          √            √          ©
Suffix Arrays                  ©           ©             ×          ×       ×          ×            √          ×

Comparison Algorithm (Section 5.6)
Simple Comparison              ©           ©             ©          ©       √          ©            ×          ×
FFT Based                      √           √             ©          ×       ×          √            √          √
d2 Distance                    ×           √             √          √       √          √            ©          √
Edit Distance                  √           ×             ©          ×       √          √            √          ×
Smith-Waterman                 √           ×             ©          ×       √          ×            ©          ©

Table 5.3: Comparison of the various algorithms introduced in this chapter
The symbols used in the comparison chart are the following:
• × This option does not meet this criterion particularly well.
• √ This option meets this criterion sufficiently well.
• © This option meets this criterion very well.
• − Meeting this criterion depends on implementation details, not the algorithm itself.
The criteria are judged relative to expectations of the other algorithms in the same category.
Note that these assigned scores are highly subjective, since the metrics are in many cases almost impossible to compare objectively between separate algorithm proposals. They should still serve as a guide to why certain algorithms were chosen above others.
5.8 Conclusion
From the algorithm inspection it appears that the simpler, and often more naive, an algorithm is, the better it tends to suit the GPU platform. The more complex and demanding algorithms, while they offer better results, often score worse on suitability.
This was expected from the literature, given the small number of fields to which GPGPU has usefully been applied. There simply is no over-abundance of problems that require massive computation over many processors yet are simple and useful.
Based on the criteria on selection and the research done on the individual algorithms,
the following algorithms have been selected for implementation on the GPU, which will
be discussed in more detail in the following chapter.
5.8.1 Program Structure
The Parallel Program structure is selected due to its advantages in scalability and the ease with which the workload can be subdivided and parallelized.
Since every subdivided job is independent from input to results, the program structure is simplified and future modifications that use distributed computing to improve performance remain possible.
5.8.2 Heuristics
On the face of it, the common n-word heuristic is the preferred one due to its simplicity and scalability. However, given its lack of sensitivity and the improvements made by the derivative algorithms, it performs far more calculations per comparison than the selected heuristics require.
• u/v-sample Heuristic
• t/v-word Heuristic
These two heuristics are selected and intended to be chained, with the output of the u/v-sample heuristic providing the input to the t/v-word heuristic. Such a chained algorithm is expected to provide advantages in accuracy and performance that cannot be realised by either heuristic alone.
5.8.3 Comparison Algorithm
Two algorithms were selected for the comparison stage; each is listed and explained below.
• d2 algorithm
While the d2 algorithm does not map very well to the GPU, it provides good accuracy and is simple enough to implement on the CPU. This implementation is used for debugging as well as for creating a reference clustering against which sensitivity is measured.
• Modified Smith-Waterman
The Smith-Waterman algorithm is a well-known and respected workhorse of the bioinformatics community. While the usual implementation does not serve our needs, the modified version detailed in Section 5.6.6 appears to match the sensitivity of the original while offering performance advantages when ported to the GPU and used to identify EST pairs to be clustered.
Chapter 6
Implementation and Issues
6.1 Introduction
This chapter provides detailed information about the implementation of the algorithms selected in the previous chapter, including design decisions and pseudocode for many of them. It also details the shortcomings discovered during implementation and the choices made to overcome them.
6.2 Program Implementation
The program structure used in the design of the application is based on the proposal in Section 5.3 for dividing datasets into smaller 'jobs'. Figure 6.1 gives a high-level overview of the implemented program structure.
Pseudocode of the CPU-side program structure is shown as Algorithm 2. Details and reasoning for this structure are provided in Section 6.3.
6.3 Detailed Program Structure
6.3.1 Job Division
An advantage of this job-division approach is that the program can later be enhanced to utilize multiple GPUs and even multiple computers. Such an approach, while beyond the scope of this project, has the potential to parallelise the currently sequential while loops of the program shown in Algorithm 2.
[Figure 6.1 is a diagram, not reproduced here: the EST dataset is tiled into subsets x : a..b along both axes (i : 0..n, n..2n, ... against j : 0..n, n..2n, ...), with each tile passing through the Wordcount kernel, the Heuristic kernel (or the Diagonal Heuristic kernel on diagonal tiles) and the Comparison kernel.]
Figure 6.1: Detailed Program Structure
If it is assumed that within each job a block is launched for every possible pair, this tactic results in less than half as many blocks being used for diagonal kernels, a concern that should be kept in mind when selecting job sizes. Job sizes should be as large as possible to keep the GPU fully utilized even when executing these diagonal kernels.
Algorithm 2 CPU-side Program Structure
n ← jobsize parameter
CPU: Load Dataset D from file containing N ESTs
i, j ← 0
while i < N do
    Load EST(i, ..., i+n) to GPU
    GPU: WC(i,...,i+n) ← Kernel Word Count on EST(i, ..., i+n)
    while j < N do
        Load EST(j, ..., j+n) to GPU
        GPU: Run Heuristic between WC(i,...,i+n) and EST(j, ..., j+n)
        GPU: Run Comparison kernel on passing pairs
        Copy comparison results to CPU
        CPU: Cluster pairs that pass comparison
        j ← j + n
    end while
    i, j ← i + n
end while
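The tiling logic of Algorithm 2 can be sketched on the CPU side as follows. This is an illustrative Python rendering, not code from gpucluster: the GPU kernel calls are reduced to comments, and `cluster_jobs` is a hypothetical name.

```python
def cluster_jobs(ests, n):
    """Walk the N x N comparison space in n x n jobs, with j
    restarting at i so only the upper triangle of block pairs
    is ever visited (each pair is compared once)."""
    N = len(ests)
    jobs = []
    i = 0
    while i < N:
        # here the word-count kernel would run on ests[i:i+n]
        j = i
        while j < N:
            # here the heuristic and comparison kernels would run on
            # the block pair (ests[i:i+n], ests[j:j+n])
            jobs.append((i, min(i + n, N), j, min(j + n, N)))
            j += n
        i += n
    return jobs
```

Every unordered pair of ESTs falls into exactly one tile, which is what makes the jobs independent and, as noted above, distributable.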
Memory latency and CPU processing can be hidden by larger job sizes, which incur less context-switching overhead; there is no significant advantage to smaller job sizes. For this reason it is desirable to have a job size that is as large as possible.
A job size of 512 by 512 ESTs compared at the same time was experimentally determined to be the largest that can be selected without incurring memory or addressing issues. Solving these issues might allow even larger jobs to be issued, but the benefits are not expected to be large beyond this point.
6.3.2 Memory management and paging
EST databases can be arbitrarily large, from a few megabytes to many gigabytes. The memory available to GPUs, however, tends to be far more modest, ranging from half a gigabyte for modern lower-end GPUs to a current maximum of 4 gigabytes on an NVidia Tesla GPU. In addition, if a GPU is used to drive a monitor and perform computation at the same time, it is advisable to reserve a few hundred megabytes of GPU memory for the desktop to avoid graphical latency and errors, and thus not to use the available GPU memory to its theoretical maximum.
This effectively limits the GPU memory size to a point generally lower than the expected database size. To circumvent this limitation, a strategy of paging memory between RAM and GPU memory is employed. Only the ESTs currently involved in the processing are required to be in GPU memory at any time.
When the application is initialized, memory is reserved for storing the ESTs, word tables and results matrices. Properly implemented, this means memory declarations are made only once for the entire application runtime, eliminating costly mid-execution memory allocations and garbage collection. However, since the reserved memory is static and not scaled to the needs of the application at run-time, an upper bound must be found for the reservation that will not be exceeded for any expected dataset.
The maximum number of threads per block is 512 for 1.3 compute capability GPUs. Though later GPUs support more threads per block, using more would break backwards compatibility. For this reason a maximum of 512 threads per block will be assumed for memory use calculations.
Assuming a job size of 512 by 512 ESTs being compared on the GPU at once, enough space for 1024 ESTs should be reserved. Since each EST can differ in length, an analysis of the experimental databases was performed; all ESTs in these databases are under 2000 characters in length. For this reason 2kB of memory is reserved per EST as an expected upper bound. Since the memory is pooled for all ESTs, longer ESTs are supported so long as the total EST storage for the entire job does not exceed 2MB, or approximately 2 million characters. This is expected to serve most standard datasets, though the reserved memory can be increased if datasets are found for which this is not sufficient.
Every row of ESTs also requires the space for a word count table. If a word size of 6
is chosen and the word presence table detailed in Section 6.4.1 is used, this will total an
additional 512 B per EST. Multiplied by a job size of 512 ESTs this results in 256kB of
memory per job.
The results matrix, the region of global memory to which each block independently writes its output, is sized as 512x512 bytes (1 byte for every pair). This totals another 256kB of memory per job.
Including other memory requirements such as indexing and constant storage, the total GPU memory used by gpucluster with paging is less than 3MB per job, compared to the 200 megabytes a database of only 100 000 ESTs would require. Many jobs are launched simultaneously, each requiring approximately 3MB of storage, but even so the total is unlikely to exceed the global memory available to even low-range GPUs.
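As a back-of-the-envelope check (not gpucluster code), the per-job budget implied by the figures above can be tallied:

```python
KB = 1024
MB = 1024 * KB

job = 512                        # ESTs per side of a 512x512 job
est_pool   = 2 * job * 2 * KB    # 1024 ESTs at a 2 kB upper bound
word_tabs  = job * 512           # 512 B presence table per EST
results    = job * job           # 1 byte per EST pair
per_job    = est_pool + word_tabs + results

unpaged_db = 100_000 * 2 * KB    # holding a 100 000-EST database whole
```

The 2 MB EST pool dominates; the word tables and results matrix add 256 kB each, leaving the per-job total at 2.5 MB, comfortably under the 3 MB quoted above.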
An attempt was made to utilize the CUDA concept known as streams to optimize
these transfers. CUDA streams refer to the capability of modern NVidia GPUs to accept asynchronous commands, allowing kernel calls and memory transfers to be initiated or queued without blocking. In theory this allows a GPU kernel to execute concurrently with a memory copy while the CPU continues processing, all without the need for complex multi-threaded libraries.
In practice it was found that increasing the job sizes causes the time spent on GPU memory copies to become an insignificant fraction of the total execution time. Due to this observation, a simpler sequential copy-then-process approach was chosen over interleaving copies and processing, greatly simplifying debugging and error detection. CUDA streams are still utilized to queue kernel calls and copies without explicit program involvement and to allow interleaving of CPU and GPU computation where possible.
Note that since the current program loads the database into memory, the RAM available to the program does still limit the size of database that can effectively be processed. However, since computer memory can be increased far more easily than GPU memory, and since modern operating systems already page underused memory to disk automatically, this is not a large concern. The code could potentially page explicitly to and from physical disk to lessen the strain on computer memory, but such an endeavour is beyond the scope of this project.
6.4 Detailed Heuristics Algorithms
The advantage of the chosen u/v-sample and t/v-word heuristics is that they avoid the expensive O(m^2) cost of comparison functions, where m is the average length of an EST, and implement an O(2m) function instead: a word table is created, and an EST is then compared to the word table rather than to the other EST.
The word count kernel is performed once for every sequence each time a new job series is loaded. The word heuristics are then performed on the produced word count table during every job.
A pair of heuristics was chosen to be implemented, designed to complement one another. Since no CPU processing is required between the heuristic stages, the heuristics were chained together in a single kernel, reducing the overhead involved in their execution.
6.4.1 Word Count
The word count table is a lookup table that details the number of each possible k-length
word in a sequence. This table is used in heuristics to quickly look up the number of any
k-length word in a sequence without having to parse the entire sequence every time.
Figure 6.2 shows an example length-6 word count table and the words each position corresponds to. Since the words are sorted in order of A, C, G and T, the index in the table corresponding to any word can be computed mathematically.
[Figure 6.2 is a diagram, not reproduced here: a row of counts (2, 1, 0, 3, 0, 1, ...) indexed by the length-6 words AAAAAA, AAAAAC, AAAAAG, AAAAAT, AAAACA, ..., TTTTTT.]
Figure 6.2: Word Count table
The method to compute a word table is simple and easy to parallelise, as detailed in Algorithm 3. Each block of the GPU can process a different sequence while each thread in a block is assigned a successive word. This assignment produces the fewest conflicts and makes the best use of the memory bandwidth. The incrementation of the word counters is a potential source of race conditions if two or more threads attempt to increment the same word's counter; atomic increment functions, though slower, can be used to avoid this potential source of incorrect results.
The longer the word length of the word table, the more accurate the heuristic will be. In practice, however, the word length is limited by GPU memory and performance. For a k-length word table, the memory required to store it is 4^k bytes, assuming 1 byte per position. Longer word tables will be increasingly sparse, potentially many times the size of the sequence they decompose, and these larger tables take longer to copy, reducing the potential performance.
Algorithm 3 Word Count kernel
Require: Sequence x
for all wordx in x do {Parallel section}
    countx[wordx] ← countx[wordx] + 1
end for
Since memory constraints are likely to be the largest obstacle to larger word tables, it
becomes a goal to reduce the memory footprint of this table. An analysis of the heuristic
algorithms that utilize this word count table leads to the observation that the exact count
of a word is not as important as the recording of its presence or absence. This observation
leads to the subtly improved Algorithm 4. By recording a binary true or false value instead of a count, every word table element can be stored as a bit rather than a byte, reducing the memory requirements of the word table by a factor of 8.
Algorithm 4 Word Presence kernel
Require: Sequence x
for all wordx in x do {Parallel section}
    countx[wordx] ← true
end for
This modification requires slightly more processing, since bit operations are needed to retrieve the exact bit that represents a specific word, but the reduced memory use and more compact data structure are expected to outweigh this concern: more compact data structures lead to better utilization of the memory bandwidth, which directly impacts performance for memory-intensive applications such as gpucluster.
The memory use for different k-lengths for both strategies is given in Table 6.1.
k-length        3     4      5       6       7      8      9       10
Word Count     64 B  256 B  1024 B  4096 B  16 kB  65 kB  262 kB  1024 kB
Word Presence   8 B   32 B   128 B   512 B   2 kB   8 kB   32 kB   131 kB
Table 6.1: Comparison of Word Count kernel memory use for different k-lengths
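The sizes in Table 6.1 follow directly from the 4^k possible words, one byte each for counts and one bit each for presence; a quick check:

```python
# One byte per possible k-word for counts, one bit for presence.
count_bytes    = {k: 4 ** k for k in range(3, 11)}
presence_bytes = {k: 4 ** k // 8 for k in range(3, 11)}
```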
A k-value of 6 was initially chosen due to its use in the reference application, wcdest, that will be used in the experiments. Practical testing shows that k-values of up to 10 are easily supported before memory use becomes an obstacle, but higher values also greatly reduce the performance of the application due to the additional memory operations and computations required. Lower k-values improved performance but introduced undesirable divergence of clustering results from the reference application. Though modifying the k-value is supported, the default of 6 will be used in the experiments.
Every word count table has to be computed only once per sequence, while every sequence is compared multiple times against different sequences; the latter is far more likely to dominate the computational time. For this reason the significant saving in memory is considered more valuable than the slightly increased computational cost, and the word presence algorithm was chosen for implementation.
6.4.2 u/v-sample Heuristic
The purpose of the u/v-sample Heuristic is high speed and throughput with a high rate of rejection of true negatives. It is not meant to reject any possible true positives, so its parameters should remain conservative, existing only to quickly reject the obvious mismatches.
In this kernel, every 8th word of the comparison sequence is checked against the word table. If the word is present in the word table, a counter is incremented.
Algorithm 5 u/v - Sample Heuristic Kernel
Require: Sequence y and the wordcount array of xscore← 0for all wordy in y STEP 8 do {Parallel section}
if countx[wordy] > 0 thenscore← score+ 1
end ifend forif score > u then
result ← PASSelse
result ← FAILend if
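A serial CPU sketch of Algorithm 5 follows; on the GPU each sampled word is handled by a separate thread, and the set `words_of_x` here stands in for the word presence table of Section 6.4.1:

```python
def uv_sample_pass(words_of_x, seq_y, u=7, v=6, stride=8):
    """Test every stride-th v-word of sequence y against the word
    table of x and PASS if more than u sampled words are present."""
    score = 0
    for i in range(0, len(seq_y) - v + 1, stride):  # one thread per word on the GPU
        if seq_y[i:i + v] in words_of_x:
            score += 1
    return score > u
```

The defaults mirror the parameters selected below: a stride of 8, word length 6 and threshold 7.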
A single block is assigned to every pair in the job while every thread within the block
operates on a different word in the sequence, each word’s starting character separated by
8 characters.
Shown below is a visualization of the way that words are sampled and threads and
blocks are assigned, assuming a word length of 6.
Words - [----] [----] [----] [----]
Index - 0123456789012345678901234567890...
Block1 - GGTGTTAAAACCCTGGATTGTCGAAACGTTT...
Block2 - GGAAAAAAGGAACTTTTCGTGACTTGGGACA...
Block3 - ACTTGGTGTTAAAACCCTGGATTGTCGAAAA...
The disadvantage of this approach is that, even when using only 128 threads, sequences of length 1029 are required to fully utilize all available threads in every block. Typical ESTs tend to be much shorter than this.
It is not possible to customize the number of threads per block individually for each block; a single thread count applies to all blocks executed at the same time. This leaves two possible methods for selecting the number of threads per block for this kernel.
The first is to choose a low number of threads such as 32 or 64, repeating the kernel
over subsets of the sequence until the entire sequence has been processed. The other is to
choose a high number of threads that should cover any potential sequence length while
only performing the kernel once.
Both methods were implemented, but the latter option of choosing 128 threads, supporting a sequence of up to 1029 characters, resulted in higher performance. This suggests that idle threads cost little when the GPU is otherwise fully utilized, which in this case it was due to the expensive memory read operations.
A future improvement on this kernel might allow a single block to deal with 2 or 4 pairs simultaneously to minimize the number of idle threads, but on current hardware that would seriously strain the limited number of registers available to each block and reduce performance even further.
In this kernel, the threshold u, the word length v and the number of words to skip can all be customized. Through experimentation, a stride of 8 words, a word length of 6 and a threshold of 7 were selected; these were determined to optimize the performance of the algorithm while still providing sufficient sensitivity. These parameters can be customized by a user of gpucluster through command line parameters.
6.4.3 t/v-word Heuristic
The t/v-word Heuristic is effectively more complex and harder to compute than the sim-
pler u/v-sample Heuristic. Its basic structure is shown in Algorithm 6.
Algorithm 6 t/v - Word Heuristic Kernel
Require: Sequence y and the wordcount array of x
score← 0
for all i = 1 to numWordsy do
if countx[wordy(i)] > 0 then
score← score+ 1
end if
if i ≥ 100 then {Iteration dependent section}
if countx[wordy(i − 100)] > 0 then
score ← score − 1
end if
end if
end for
if score > t then
result ← PASS
else
result ← FAIL
end if
Despite the similarity to the u/v-sample Heuristic, the introduction of windows in this algorithm greatly limits the parallelism that can be employed.
The presence of every word in the word table can easily be calculated with one thread assigned to every sequential word, repeating if the number of words exceeds the number of threads.
Unlike the u/v-sample Heuristic, every word is checked instead of every 8th word, so the tactic of using only one iteration to parse the entire sequence is not practical.
Thread1 - [----]
Thread2 - [----]
Thread3 - [----]
Thread4 - [----]
Index - 0123456789012345678901234567890...
Block1 - GGTGTTAAAACCCTGGATTGTCGAAACGTTT...
Block2 - GGAAAAAAGGAACTTTTCGTGACTTGGGACA...
Block3 - ACTTGGTGTTAAAACCCTGGATTGTCGAAAA...
Once the presence of every word has been calculated, a serial process slides a window of 100 characters across the entire EST to determine whether enough matches exist within any 100-character window to pass. Implementing this serial process on the GPU leaves much of the GPU underutilized; despite this, executing it on the GPU is still faster than copying the results to the CPU and testing every window there.
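This sliding-window step can be sketched serially as follows, assuming `hits[i]` holds the precomputed presence bit for the word starting at position i (an illustrative rendering, not gpucluster's code):

```python
def tv_window_pass(hits, t=45, window=100):
    """PASS if any window-length span of `hits` holds more than
    t set bits, maintained as a running sum."""
    score = sum(hits[:window])
    if score > t:
        return True
    for i in range(window, len(hits)):
        score += hits[i] - hits[i - window]  # word enters, word leaves
        if score > t:
            return True
    return False
```

The running sum makes each window update O(1), which is why the step remains cheap even though it is serial.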
The default parameters for this function were experimentally determined: the best balance of performance and sensitivity is achieved with a threshold of 45 words within 100 characters appearing in the word table. A higher threshold results in the rejection of a high number of potential matches.
Since the t/v-word Heuristic operates directly on the results of the u/v-sample Heuristic without the need for CPU processing, there exist two options for chaining the two Heuristics together.
The first involves calling the two kernels directly after one another using the data provided by the first kernel, while the other combines the two kernels into a single kernel, the t/v-word part executing only if the u/v-sample threshold is reached.
Since the entire block's decision is determined by the threshold, concerns about divergent branches, which occur when threads follow different branches, are eliminated.
Experimentation showed that a single kernel containing both Heuristics, though it stresses the available per-thread memory, shared memory and registers, does have a slight performance advantage. This advantage is attributed to the lower overhead of launching one kernel rather than two.
6.5 Detailed Comparison Algorithms
A customized Smith-Waterman implementation was chosen as the comparison algorithm for this program. However, since Smith-Waterman is more typically used for alignment rather than distance, a CPU version of the d2-Algorithm is used for correctness analysis.
6.5.1 d2-Algorithm
The CPU implementation of the d2-Algorithm was taken and used with permission from
the wcdest tool [117]. Though some editing and customization was needed to use the
algorithm successfully in the gpucluster program, it was confirmed to produce results identical to those of the wcdest application. The algorithm performs slower than the original tool, but this was deemed acceptable since this implementation is required only for correctness analysis and not performance comparison; the wcdest tool itself will be used for performance measurements in the experiments.
6.5.2 Cumulative Smith-Waterman Distance
The Smith-Waterman algorithm has been implemented on the GPU various times in the past. Rather than develop a novel implementation from scratch, it was decided to build on an existing implementation, namely CUDASW++ [85].
The chosen Cumulative Smith-Waterman Distance is a variation of the Smith-Waterman algorithm with the addition of a cumulative score and the removal of the requirement to actually compute the alignment. These customizations mean the exact code of CUDASW++ cannot be reused, since its larger scope includes various unneeded computations, such as alignment and protein-matching support, that impact performance negatively and are not needed for nucleotide sequence clustering.
Instead, an original implementation was developed that uses the same structure as
CUDASW++’s anti-diagonal method, but with the addition of the required modifica-
tions: the removal of alignment support, addition of a cumulative scoring and removing
the algorithm’s support of protein sequences, limiting it to only nucleotide sequences as
detailed in Section 5.6.6. This original implementation additionally allowed much finer control over the variables and execution, which leads to greater optimization opportunities in terms of memory management, access patterns, operation ordering and the placement of different parts of execution in shared memory.
The performance of this implementation was deemed sufficient for our purposes, even if the modifications in terms of scoring make a direct comparison with the original implementation impossible. A profile of the kernel revealed no significant bottlenecks.
Once the implementation of the distance algorithm was complete, it still needed to be tuned to produce results similar to those of the d2-Algorithm. This process was largely automated by executing the program multiple times for every possible variable combination across a large selection of datasets. This experimentation revealed that a match reward of 1, a mismatch and gap penalty of 20 and a cumulative score threshold of 100 produce the closest match to the d2-Algorithm's results.
The round numbers this experimentation produced are not completely unexpected: the threshold matches the default window length of the d2-Algorithm, and the mismatch and gap penalties correspond to roughly 95% similarity over that window length being required to pass as a match. This observation increases the confidence that the Cumulative Smith-Waterman Distance kernel and the d2-distance function will generate similar output over other datasets as well.
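The ~95% figure can be derived directly from the tuned parameters: the running score grows only while the per-character expected gain is positive, that is, while the fraction of matching characters p satisfies p·1 − (1 − p)·20 > 0.

```python
# Break-even identity for match reward +1 and mismatch/gap penalty -20:
# score grows per character iff p - 20 * (1 - p) > 0, i.e. p > 20/21.
match_reward, penalty = 1, 20
breakeven = penalty / (match_reward + penalty)
```

This gives a break-even identity of about 95.2%, consistent with the 95% similarity inferred above.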
6.6 Conclusion and summary of concerns
It was found that the majority of the processing time of these algorithms is spent in random-access read patterns. The GPU has great memory bandwidth, but it is optimized for either sequential reads or, in the case of texture memory, reads with 2D locality. Randomly accessing a large array such as the word count table stresses the memory capabilities of the GPU more than intended, since for every byte read and used, at least 32 bytes are actually fetched from GPU global memory.
Storing the word table in faster memory such as constant memory is not practical due to the size of the word count tables. This implies that either lower k-values should be used, which degrades sensitivity, or smaller jobs issued at once, which would leave a higher proportion of multi-processors idle and thus lower performance.
Another concern is the branching cost of control statements. Surprisingly, optimizations to exit loops early often slowed the algorithm down, whereas they speed up CPU equivalents. The ability to deal with control branches is expected to improve in successive GPUs, but this serves to illustrate the differences in expectations between CPU and GPU optimization. It underlines the importance of optimizing iteratively and testing assumptions under benchmark conditions.
Comparing the algorithm over various GPUs illustrates the clear advancements that GPUs have made from capability 1.3 to 2.0. There are raw performance improvements, but the true benefits are realised through the richer instruction set and the better handling of random reads and branching structures.
The lack of proper debugging tools under Linux and the inability to expose the internal state of GPUs added to the development time of this project, though the recent release of the NVIDIA Parallel Nsight Visual Studio plug-in promises to assist with these concerns. At the time of this writing the product is only available for Visual Studio 2008 or Visual Studio 2005 and requires that the development machine have more than a single GPU installed (one to debug on, one to continue driving the desktop).
Despite these concerns, the implementation can be considered a success, with output correct and comparable to that of a CPU-only implementation.
The next chapter will detail how this implementation performs in the experiments set
forth in Chapter 4.
Chapter 7
Results and Analysis
7.1 Introduction
With the implementation of our application complete, the experiments detailed in Chapter
4 can now be performed to measure the performance, sensitivity, scalability and short-
comings of the developed application.
The following experiments will be performed:
1. Theoretical Performance and Cost Evaluation This experiment deals with
the comparison of theoretical ‘Peak GFLOPS’ between modern GPU and CPUs.
2. Sensitivity Comparison This experiment measures the sensitivity of the devel-
oped application in comparison to a known good benchmark.
3. Performance Benchmarking This experiment measures performance of the de-
veloped algorithm for different datasets and compares against the performance of a
CPU implementation.
4. Dataset scaling tests This experiment tests the performance of the algorithm
when the dataset size is varied.
5. Profiling Analysis This experiment subjects the algorithm to a dynamic profiling
analysis in order to identify its inefficiencies and computational bottlenecks.
7.2 Investigation 1: Theoretical Performance and
Cost Evaluation
Due to the requirement of CUDA capability, only NVidia GPU hardware is compared here. Exact prices vary with time and supplier; Table 7.1 lists the cheapest prices that could be found in May 2012.
Prices are given for discrete CPU and GPU only. To utilize these a complete PC
system is needed.
Hardware                      Price (R)  Performance (GFLOPS)  Rand per GFLOPS
Core i5-3550 (mid)            1989       105.6                 18.84
Core i7-3930K (high)          5687       153.6                 37.02
NVidia Geforce GTX560 (mid)   1880       1088.6                1.73
NVidia Geforce GTX580 (high)  4560       1581.1                2.88
Table 7.1: Price and performance comparison of various hardware
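The last column of Table 7.1 can be recomputed directly from the first two (printed values may differ in the final rounding digit):

```python
# Price (R) and theoretical performance (GFLOPS) from Table 7.1.
hardware = {
    "Core i5-3550 (mid)":   (1989, 105.6),
    "Core i7-3930K (high)": (5687, 153.6),
    "GTX560 (mid)":         (1880, 1088.6),
    "GTX580 (high)":        (4560, 1581.1),
}
rand_per_gflops = {name: round(price / gflops, 2)
                   for name, (price, gflops) in hardware.items()}
```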
Observations and conclusion
From Table 7.1 it can be seen that, when only pure theoretical performance is considered, the GPU exceeds the CPU on a price basis by an order of magnitude.
This advantage is one of the main reasons why GPUs were selected for this project. Experiment 2 will indicate whether these theoretical advantages translate into real-world performance.
7.3 Experiment 1: Sensitivity Comparison
Table 7.2 lists the datasets used in this experiment, the measured execution times, and the computed JI and SE scores between the experimental and reference clusterings.
Dataset       # seqs (K)  Size (MB)  gpucluster (s)  wcd (s)  Ratio  SE     JI
SANBI 10000   10.0        3.5        4.1             12.2     2.98x  0.942  0.915
pubcot        30.0        16.6       57.7            307.1    5.32x  0.965  0.960
A032          71.5        32.4       183.3           672.5    3.66x  0.968  0.967
C01           12.5        5.6        11.6            46.3     3.99x  0.965  0.966
C02           25.0        11.3       39.5            200.8    5.08x  0.912  0.909
C10           125.7       56.3       1229.9          5695.8   4.63x  0.481  0.479
Table 7.2: Performance comparison between different datasets
As can be seen from the table, the sensitivity of the algorithm varies with the dataset used, but generally remains above 0.9 except in the case of C10.
Observations and conclusion
While the results are good across most datasets, C10's experimental and reference clusterings show a very low Jaccard and Sensitivity score.
Visual inspection of C10's clustering revealed that two large clusters were merged in the experimental clustering that were not merged in the reference clustering. The large divergence in scores is due to the large fraction of the total clusterings that these two clusters represent.
Such errors can result from even one or two incorrect merges, yet greatly affect the clustering scores. However, reassembly algorithms are designed to deal with such situations: the reassembly algorithm takes more time due to the larger solution space presented to it, but does not produce incorrect results.
Situations where the experimental clustering aggressively over-clusters can be identified by a much lower Jaccard score alongside a relatively high Sensitivity score; this is not observed in this instance.
Further investigation does not reveal this clustering to be in any way incorrect compared to the reference clustering; it is only the result of a borderline-similar EST that is accepted as a match by gpucluster but rejected by wcdest.
Gpucluster is not markedly more aggressive or reluctant to make matches than wcdest, and borderline matches are expected to go either way. This is thus not expected to have a large negative influence when used on real-world datasets.
7.4 Experiment 2: Performance Benchmarking
Table 7.2 contains the results of this experiment. As expected, the performance varies greatly with the dataset used, ranging from 2.98x for SANBI 10000 to 5.32x for the public cotton database. It is not currently known which attributes of a dataset lead to increased or decreased performance.
The possible link between database size and performance is explored further in Experiment 3.
The experiment reflects favourably on the GPU as a computation device compared to the CPU, even if the performance gains are not as high as previously expected.
Observations and conclusion
While greatly improved performance using the GPU compared to the CPU is shown, the
improvement remains within an order of magnitude and is much less than the theoretical
computational advantage of GPUs would suggest.
The gap narrows further when wcdest is run in multi-threaded mode, which results in
much more modest performance comparisons.
Section 7.6 describes an experiment intended to identify the shortcomings of the GPU
implementation and the reasons for its sub-optimal performance. Section 7.7 discusses
the reported performance figures in more depth and offers solutions.
7.5 Experiment 3: Dataset scaling tests
The A686904 database is used in this experiment: a particularly large database without
any characteristics that advantage or disadvantage the GPU.
The results are shown in Figure 7.1, with the CPU marker in red and the GPU marker
in blue. These results are not surprising and match the average 4x performance
improvement that the previous experiments have shown.
[Figure: Time (s) against Size (k) for the GPU and CPU series]
Figure 7.1: Performance on the Arabidopsis data set for different sized subsets of the data
Figure 7.2 shows the same data normalized.
[Figure: Ratio against Size (k)]
Figure 7.2: Ratio of performance of GPUcluster and wcdest with dataset size
The low performance seen on small datasets of fewer than 10k sequences is expected, due
to the greater impact program initialization and cleanup have on the total, but the large
performance gain seen at around 30k sequences is surprising.
Observations and conclusion
From this experiment it can be seen that the performance of the algorithm depends
strongly on datasets large enough to allow it to reach full throughput.
Datasets much smaller than this show the expected drop in performance. Of note is
that even this lower performance remains competitive with CPU processing, though it is
not as good as it could otherwise be.
Of particular interest is the possibility of an optimal dataset volume. This is currently
expected to simply be an attribute of the dataset used in this experiment, but further
experimentation is needed to confirm this.
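The scaling behaviour described above can be captured by a minimal fixed-overhead model: the GPU pays a constant startup and cleanup cost plus a per-sequence cost, while the CPU pays only a per-sequence cost. The constants below are illustrative, not measured values from these experiments.

```python
# Illustrative constants: GPU fixed overhead (s), GPU cost per
# sequence (s), CPU cost per sequence (s).
c0, a, b = 20.0, 0.01, 0.05

def ratio(n):
    """CPU/GPU time ratio for a dataset of n sequences."""
    return (b * n) / (c0 + a * n)

print(round(ratio(1_000), 2))    # small dataset: overhead dominates
print(round(ratio(100_000), 2))  # large dataset: approaches b/a = 5x
```

Under this model the ratio rises with dataset size and approaches the asymptotic b/a, consistent with the poor small-dataset ratios seen in Figure 7.2.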
7.6 Experiment 4: Profiling Analysis
In this experiment our application was executed on various datasets while being profiled
by NVidia’s Visual Profiler, the official tool for profiling CUDA applications.
First we analysed the kernel timings, with the intention of identifying the kernels that
constitute the greatest share of our application’s execution time. These are presented in
Table 7.3 for a smaller dataset and Table 7.4 for a larger dataset.
Note that the idle-time estimate only includes time not spent within any kernel, such
as during application initialization and CPU computation. An underutilized kernel still
reports as non-idle while it runs.
Kernel Number of Calls %GPU Time
Heuristics 378 53.79 %
SW-Distance 377 18.31 %
Memory Operations 3139 1.68 %
Wordcount 27 0.09 %
Idle - ≈ 26.13 %
Table 7.3: Timing profiling results for the 10K dataset (≈ 10K ESTs)
Kernel Number of Calls %GPU Time
Heuristics 18336 66.6 %
SW-Distance 6797 10.55 %
Memory Operations 100675 1.78 %
Wordcount 191 0.02 %
Idle - ≈ 21.05 %
Table 7.4: Timing profiling results for the A032 dataset (≈ 70K ESTs)
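The ≈ idle figures in Tables 7.3 and 7.4 are consistent with simply taking the remainder of the timeline not attributed to any kernel or memory operation, as this sketch shows:

```python
# Idle time as the remainder of the GPU timeline not attributed to
# any kernel or memory operation (percentages from Tables 7.3/7.4).
def idle_pct(kernel_pcts):
    return round(100.0 - sum(kernel_pcts), 2)

print(idle_pct([53.79, 18.31, 1.68, 0.09]))  # 10K dataset
print(idle_pct([66.6, 10.55, 1.78, 0.02]))   # A032 dataset
```

The results reproduce the 26.13 % and 21.05 % idle estimates in the tables.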
In all cases the Heuristics kernel took the majority of the execution time, with the
execution time of the Distance function depending largely on the dataset and the number
of matches found.
As expected, memory copies between the CPU and GPU have a negligible effect on
performance.
Observations and conclusion
Of interest is the fact that the GPU is not in constant use: it is idle for a quarter to
a fifth of the execution time. A kernel execution time plot, provided as Figure 7.3,
reveals the reason.
After execution of the heuristics, the results are sent back to the CPU, which organises
them and finds the pairs that meet the required thresholds, before the distance function
is executed. After the distance function the results are returned again, and the CPU
clusters the matching pairs.
These operations are simple and efficient, but due to the volume of data involved they
take an appreciable fraction of the execution time.
Optimizing the CPU portions of the application can improve the performance of the
application as written, but only multi-threaded execution is expected to fully eliminate
the GPU idle periods.
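The multi-threaded structure alluded to above can be sketched as a simple producer-consumer pipeline. The worker thread below merely stands in for the GPU kernels (squaring numbers as a placeholder); the point is the hand-over structure, in which the device-side worker never waits for CPU-side bookkeeping.

```python
import queue
import threading

work = queue.Queue()     # batches handed to the "GPU" worker
results = queue.Queue()  # results handed back for CPU clustering

def gpu_worker():
    while True:
        batch = work.get()
        if batch is None:              # sentinel: no more batches
            results.put(None)
            return
        # Placeholder for the heuristics + distance kernels.
        results.put([x * x for x in batch])

threading.Thread(target=gpu_worker, daemon=True).start()
for batch in ([1, 2], [3, 4]):
    work.put(batch)
work.put(None)

clustered = []
while (r := results.get()) is not None:
    clustered.extend(r)                # placeholder for CPU clustering
print(clustered)                       # [1, 4, 9, 16]
```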
Since the heuristics kernel is responsible for the majority of kernel execution time, any
further work on optimizing the performance of gpucluster should focus on it.
A deeper analysis of the heuristics kernel revealed that its achieved occupancy is lower
than it should be. A theoretical occupancy of 0.67 is possible with the chosen thread
count and register usage, but only 0.62 is achieved.
Figure 7.3: Kernel execution time plot (time is in micro-seconds)
This suggests that memory throughput and latency do not allow memory accesses to be
hidden completely. This is expected to be due to the random access pattern the heuristics
kernel uses to read the word count table.
That said, the shortfall in occupancy is slight, and the automated profiling tool had
no further suggestions. Any further optimization would likely require more extensive
refactoring or rewriting to achieve greater performance.
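The occupancy figures above can be approximated with a back-of-envelope calculation. The sketch below assumes Fermi-class SM limits (32768 registers and 48 resident warps per SM) and illustrative thread and register counts, not the actual values used by gpucluster, and it ignores register allocation granularity:

```python
# Register-limited occupancy estimate for an assumed Fermi-class SM.
def occupancy(threads_per_block, regs_per_thread,
              regs_per_sm=32768, max_warps=48):
    # Blocks that fit by register usage alone (granularity ignored).
    blocks = regs_per_sm // (threads_per_block * regs_per_thread)
    warps = min(blocks * (threads_per_block // 32), max_warps)
    return round(warps / max_warps, 2)

print(occupancy(256, 32))  # register-limited case: 0.67
print(occupancy(256, 16))  # lighter register use: full occupancy
```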
7.7 Critical Analysis
The performance improvement of GPU EST clustering is far less than the improvements
reported for GPU implementations in other domains. While a 300× improvement is rarely
obtained and requires a domain ideally suited to a GPU processor, a performance increase
of at least 10× was expected, as opposed to the observed 2×-5× increase (achieved without
a comparable increase in hardware cost).
Based on profiling of the application and critical thought about the domain, several
theories are presented to help explain the disparity between observed and expected
performance results.
7.7.1 Multiple Threads
When this application was first designed and programmed, CUDA did not support multi-
ple CPU threads accessing the same GPU context. Support for this scenario has since
improved. This offers the best opportunity for further work to improve the performance
of the application, by eliminating GPU idle times as well as adding support for
multiple-GPU operation.
7.7.2 Concurrent Execution
Many CUDA GPUs can ’hide’ memory latencies by performing copies concurrently with
kernel execution. While this could be employed in this application, Tables 7.3 and
7.4 indicate that it is unlikely to provide a large benefit, at best improving performance
by 1%-2%. The added complexity in the application’s structure suggests that this is not
a worthwhile effort.
7.7.3 Sequence Data Size
Many GPU applications deal with single or composite numbers such as floating-point
positions or colour components. When dealing with biological sequences, however,
analysis is done on many bytes of data, all of which must be compared with all of the
bytes of the comparison sequence.
This greatly increases the amount of data that must be read from memory and processed,
without greatly increasing the computational requirements, leaving the GPU memory-
bandwidth limited and unable to fully utilize its computational ability.
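This bandwidth limit can be illustrated with a roofline-style estimate: attainable throughput is capped by the lesser of peak compute and bandwidth times arithmetic intensity. The peak compute and bandwidth figures below are illustrative, not measurements of the test hardware.

```python
# Roofline-style bound: attainable GFLOP/s is capped by either peak
# compute or memory bandwidth times arithmetic intensity (FLOP/byte).
def attainable_gflops(intensity_flop_per_byte,
                      peak_gflops=1000.0, bandwidth_gb_s=150.0):
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)

# Sequence comparison performs few operations per byte read.
print(attainable_gflops(1.0))   # memory-bound: bandwidth decides
print(attainable_gflops(20.0))  # compute-bound: peak compute decides
```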
7.7.4 Random Reads
Random reads of GPU memory are much slower than consecutive or spatially related
reads, up to 16× slower if there is no spatial locality. This is one of the greatest negative
effects observed when profiling the application, and one of the main causes of memory
bandwidth saturation.
It occurs especially often in the u/v-sample and t/v-word heuristics, which are performed
on each and every sequence comparison.
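The penalty for random reads can be modelled by counting the memory transactions a warp issues: scattered addresses touch many memory segments where coalesced addresses touch one. The 128-byte segment size and the access patterns below are illustrative; exact figures depend on the architecture.

```python
# Count distinct memory segments touched by one warp's loads -- a
# simple model of coalesced versus random global memory access.
def transactions(addresses, segment=128):
    return len({addr // segment for addr in addresses})

warp = range(32)
coalesced = [4 * i for i in warp]     # consecutive 4-byte words
scattered = [4096 * i for i in warp]  # one word per 4 KB stride

print(transactions(coalesced))  # 1 transaction for the whole warp
print(transactions(scattered))  # 32 transactions, one per thread
```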
106 Chapter 7. Results and Analysis
7.7.5 Branching
Efforts have been made to minimize branching in the application, limiting it to block-level
decisions as much as possible, but profiling still shows a small negative effect from the
branching that remains.
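The cost of this effect can be sketched with a simple divergence model: when threads of a warp take different branches, the warp executes both paths serially, so the cost is the sum of the path lengths rather than their maximum. The path lengths below are illustrative.

```python
# Divergence model: a warp whose threads take different branches pays
# the sum of the distinct path lengths; a uniform warp pays only one.
def warp_cost(path_lengths_taken):
    distinct = set(path_lengths_taken)
    return sum(distinct) if len(distinct) > 1 else distinct.pop()

print(warp_cost([10] * 32))              # uniform branch: cost 10
print(warp_cost([10] * 16 + [40] * 16))  # divergent: cost 10 + 40
```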
7.8 Conclusion
The performance results of this experiment are disappointing, but they do serve to
illustrate why GPU computation has not spread to more domains than it already has.
While performance far greater than the CPU’s is certainly possible, there are additional
requirements on the type of data, its data dependencies, how it is streamed to the GPU
and how it is computed that limit the applications for which GPU computation is useful.
Many of these limits can be overcome, as in this application, but not without a potentially
large performance cost.
Regardless, this does show that, given a PC with a powerful GPU, gpucluster can greatly
increase performance, though not by an order of magnitude. This advantage can, however,
be largely negated by a more powerful multi-core CPU operating across all of its cores.
The quality results nevertheless show that the application is certainly useful for EST
clustering, providing correct results at good performance.
Chapter 8
Conclusion and Further Work
8.1 Summary
In this dissertation we aimed to utilize GPU technology to improve performance on the
problem of EST clustering.
Extensive research on this cross-disciplinary approach was required before it could even
be considered. It was found that, though this line of research has not received significant
attention, there are significant gains to be made by applying GPU computing to
bioinformatics problems.
GPU programming differs from classical CPU programming in significant ways, so
familiarity with the CUDA API is needed to achieve the performance goals. Understanding
the various types of parallelism and memory provided by GPUs is essential to optimizing
the execution of a CUDA application.
The metrics for performance and sensitivity measurement are important for fair
comparison between different platforms. The details and goals of the project, and the
expectations used as a measure of success, were defined before the application was
implemented.
EST clustering is a wide field with no single correct algorithm or implementation. For
this reason extensive research had to be done to identify potential algorithms for this
project to utilize. Each of the proposed algorithms was analysed for suitability for the
GPU platform, with its weaknesses and strengths identified. Most had to be discarded
due to the limited scope of the project, but suitable algorithms for porting were found.
Implementation involved a great deal of learning and adapting, and a few surprises, but
eventually a program was completed that met the goals of the project.
Though the performance improvement was not as great as initially expected, the GPU
implementation shows promise as an alternative to the classical CPU computing approach
currently used. Though many shortcomings of the implementation were identified, it still
performed well and produced correct results.
It is the opinion of the author that this project has proven to be a success, not just in
its implementation, but more importantly as an example of GPU use in the bioinformatics
field. By identifying the many pitfalls and issues, it is hoped that similar problems can
be avoided by other researchers working on similar projects.
8.2 Research Question Resolution
The objective of this research was to answer the questions posed in Section 1.3. The
scope of this project deals primarily with EST clustering, so the answers given may not
apply to the entire bioinformatics field, but the insight provided may prove useful for
future research.
1. Is GPGPU a practical computing platform for bioinformatics algorithms?
Section 2.4.4 in Chapter 2 lists various cases where GPGPU has been successfully
utilized in bioinformatics applications.
The positive results of the project lead to the conclusion that GPGPU can be a
practical computing platform for bioinformatics algorithms.
2. Can existing bioinformatics algorithms be practically ported to GPGPU?
Research listed in Section 2.4.4 provides examples of other ported algorithms that
were successfully used on the GPGPU platform. In addition, the positive results of
this project lead to the conclusion that bioinformatics algorithms can be ported
successfully to the GPGPU platform.
3. Is the cost of GPGPU competitive with classical CPU computing?
Yes, the costs have been shown to be competitive as per the cost evaluation in
Section 7.2.
4. Is the performance of GPGPU competitive with classical CPU comput-
ing?
Yes, the performance has been shown to be competitive as per the results of Exper-
iment 2 in Section 7.4.
8.3 Further Work
Much of the work detailed in this dissertation can be expanded on and improved through
further research and development. Though this project met its goals, various avenues of
potential further research and development have become apparent. This section lists
possible approaches that could improve on the developed application.
8.3.1 Faster Heuristics
While the selected heuristics perform well on the CPU, this dissertation has shown that
they port poorly to the GPU due to their heavy use of lookup tables accessed via random
reads. An alternate heuristic using fewer random reads and more linear reads could
increase the performance of gpucluster significantly.
Another option is pre-sorting words before searching, potentially placing the random
reads spatially much closer to one another and decreasing their negative impact.
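The pre-sorting idea can be sketched as follows; `lookup_order` is a hypothetical helper, not part of gpucluster. Sorting the word indices before the table lookups makes consecutive accesses land near one another, and the returned permutation allows results to be restored to their original order afterwards.

```python
# Sort word indices before table lookups so consecutive accesses are
# near-sequential; keep the permutation to undo the reordering later.
def lookup_order(word_indices):
    order = sorted(range(len(word_indices)),
                   key=lambda i: word_indices[i])
    return [word_indices[i] for i in order], order

words = [901, 3, 512, 7, 900]
sorted_words, perm = lookup_order(words)
print(sorted_words)  # near-sequential reads of the word count table
print(perm)          # permutation to invert after the lookups
```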
8.3.2 Multiple GPU
An obvious possible improvement would be to utilize the power of multiple GPUs in the
same PC; a second GPU could potentially double the performance of gpucluster.
Of note is the fact that many high-end GPUs, such as the NVidia GTX 590, are two
separate GPUs located on the same board. Logically, and from the view of gpucluster,
these remain separate GPUs that need to be managed separately, requiring the application
to implement multi-GPU support to properly utilize such a card.
8.3.3 CPU Concurrent Use
The current gpucluster implementation largely leaves the CPU idle. A possible
performance improvement would be to use the CPU and GPU concurrently on the same
dataset, increasing utilization of all of a PC’s computational assets.
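A throughput-proportional split of the work between the two devices could look as follows; the per-device rates are illustrative (e.g. sequences per second), not measured values. If both devices finish at the same time, each should receive work in proportion to its throughput.

```python
# Split one dataset across CPU and GPU so both finish together:
# each device gets work proportional to its (assumed) throughput.
def split(total_seqs, cpu_rate, gpu_rate):
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    gpu_seqs = round(total_seqs * gpu_share)
    return total_seqs - gpu_seqs, gpu_seqs

cpu_part, gpu_part = split(100_000, cpu_rate=100, gpu_rate=400)
print(cpu_part, gpu_part)  # 20000 80000
```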
8.4 Conclusion
GPU computation has great potential to become an invaluable tool in bioinformatics
processing. Though GPU computing does not depend on overly expensive equipment, a
significant investment of time and effort is needed for developers to learn the programming
paradigms involved. This limits the pool of developers capable of fully taking advantage
of GPGPU.
Despite this challenge, the rate at which GPU computing is advancing promises to enable
much cheaper and more powerful computation of complex data and interactions for both
small and large laboratories.
Bibliography
[1] Wikipedia, “Nucleotide — Wikipedia, The Free Encyclopedia,” 2011, [Online;
accessed 26-February-2011]. [Online]. Available: http://en.wikipedia.org/w/index.
php?title=Nucleotide&oldid=412067409
[2] NVidia Corporation. (2010, june) NVidia CUDA Programming Guide 3.1.
[Online]. Available: http://developer.download.nvidia.com/compute/cuda/3 1/
toolkit/docs/NVIDIA CUDA C ProgrammingGuide 3.1.pdf
[3] R. Schaller, “Moore’s law: past, present and future,” Spectrum, IEEE, vol. 34, no. 6,
pp. 52–59, 1997.
[4] Wikipedia, “FLOPS — Wikipedia, the free encyclopedia,” 2011, [Online; accessed
1-March-2011]. [Online]. Available: http://en.wikipedia.org/w/index.php?title=
FLOPS&oldid=416575050
[5] S. Ryoo, C. Rodrigues, S. Baghsorkhi, S. Stone, D. Kirk, and W. Hwu, “Optimiza-
tion principles and application performance evaluation of a multithreaded GPU us-
ing CUDA,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles
and practice of parallel programming. ACM, 2008, pp. 73–82.
[6] D. P. Anderson, “BOINC: A System for Public-Resource Computing and Storage,”
in Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing,
ser. GRID ’04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 4–10.
[Online]. Available: http://dx.doi.org/10.1109/GRID.2004.14
[7] C. van Deventer, W. Clarke, and S. Hazelhurst, “BOINC and CUDA: Distributed
High-Performance Computing for Bioinformatics String Matching Problems,” in
Proceedings of the Southern Africa Telecommunication Networks and Applications
Conference, Sept, 2010.
[8] E. Nordenskiold, “The history of biology.” 1928.
111
112 Bibliography
[9] H. Vickery, “The origin of the word protein,” The Yale journal of biology and
medicine, vol. 22, no. 5, p. 387, 1950.
[10] D. Davison, “The number of human genes and proteins,” Nanotech, vol. 2, pp. 6–11,
2002.
[11] G. Mendel, A. Corcos, and F. Monaghan, Gregor Mendel’s Experiments on plant
hybrids: a guided study. Rutgers Univ Pr, 1993.
[12] R. Henig, The monk in the garden: the lost and found genius of Gregor Mendel, the
father of genetics. Mariner Books, 2001.
[13] C. Darwin, “On the origin of species by means of natural selection. 1859,” Leipzig:
Verlag Philipp Reclam, 1859.
[14] T. Morgan et al., “Sex limited inheritance in Drosophila,” Science, vol. 32, no. 812,
pp. 120–122, 1910.
[15] R. Dahm, “Friedrich Miescher and the discovery of DNA,” Developmental Biology,
vol. 278, no. 2, pp. 274–288, 2005.
[16] P. Ceruzzi, A history of modern computing. The MIT press, 2003.
[17] J. Watson and F. Crick, “Molecular structure of nucleic acids,” Nature, vol. 171,
no. 4356, pp. 737–738, 1953.
[18] F. Crick and J. Watson, “A structure for deoxyribose nucleic acid,” Nature, vol.
171, no. 737-738, 1953.
[19] G. Gamow, A. Rich, and M. Ycas, “The problem of information transfer from the
nucleic acids to proteins.” Advances in biological and medical physics, vol. 4, p. 23,
1956.
[20] L. Gatlin, “The information content of DNA,” Journal of Theoretical Biology,
vol. 10, no. 2, pp. 281–300, 1966.
[21] C. Shannon, Mathematical theory of communication. University Illinois Press, 1963.
[22] A. Gibbs and G. Mcintyre, “The diagram, a method for comparing sequences,”
European Journal of Biochemistry, vol. 16, no. 1, pp. 1–11, 1970.
Bibliography 113
[23] W. Beyer, M. Stein, T. Smith, and S. Ulam, “A molecular sequence metric and
evolutionary trees,” Mathematical Biosciences, vol. 19, no. 1, pp. 9–25, 1974.
[24] A. Gibbs, M. Dale, H. Kinns, and H. MacKenzie, “The transition matrix method for
comparing sequences; its use in describing and classifying proteins by their amino
acid sequences,” Systematic Biology, vol. 20, no. 4, pp. 417–425, 1971.
[25] R. Grantham, “Amino acid difference formula to help explain protein evolution,”
Science, vol. 185, no. 4154, pp. 862–864, 1974.
[26] M. Sackin, “Crossassociation: a method of comparing protein sequences,” Biochem-
ical Genetics, vol. 5, no. 3, pp. 287–313, 1971.
[27] P. Sellers, “An algorithm for the distance between two finite sequences,” J. Comb.
Theory, Ser. A, vol. 16, no. 2, pp. 253–258, 1974.
[28] R. Wagner and M. Fischer, “The string-to-string correction problem,” Journal of
the ACM (JACM), vol. 21, no. 1, pp. 168–173, 1974.
[29] W. Fitch and E. Margoliash, “The usefulness of amino acid and nucleotide sequences
in evolutionary studies,” Evol. Biol, vol. 4, pp. 67–109, 1970.
[30] M. Dayhoff, W. Barker, and L. Hunt, “Establishing homologies in protein se-
quences,” Enzyme structure. Part 1. New York, Academic Press, 1983,, pp. 524–545,
1983.
[31] C. Ouzounis and A. Valencia, “Early bioinformatics: the birth of a discipline – a
personal view,” Bioinformatics, vol. 19, no. 17, pp. 2176–2190, 2003.
[32] F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. F. Meyer Jr, M. D. Brice, J. R.
Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, “The protein data bank: a
computer-based archival file for macromolecular structures,” Journal of molecular
biology, vol. 112, no. 3, pp. 535–542, 1977.
[33] L. Philipson, “The DNA data libraries,” Nature, vol. 332, no. 6166, pp. 676–676,
1988.
[34] H. Bilofsky and B. Christian, “The GenBank R© genetic sequence data bank,” Nu-
cleic acids research, vol. 16, no. 5, pp. 1861–1863, 1988.
114 Bibliography
[35] C. DeLisi, “Computers in molecular biology: current applications and emerging
trends,” Science, vol. 240, no. 4848, pp. 47–52, 1988.
[36] T. Smith, “Comparison of biosequences,” Advances in Applied Mathematics;(United
States), vol. 2, 1981.
[37] D. Lipman and W. Pearson, “Rapid and sensitive protein similarity searches,” Sci-
ence, vol. 227, no. 4693, p. 1435, 1985.
[38] W. Wilbur and D. Lipman, “Rapid similarity searches of nucleic acid and protein
data banks,” Proceedings of the National Academy of Sciences, vol. 80, no. 3, p.
726, 1983.
[39] J. Collins and A. Coulson, “Applications of parallel processing algorithms for DNA
sequence analysis,” Nucleic acids research, vol. 12, no. 1Part1, pp. 181–192, 1984.
[40] N. Core, E. Edmiston, J. Saltz, and R. Smith, “Supercomputers and biological
sequence comparison algorithms,” Computers and Biomedical Research, vol. 22,
no. 6, pp. 497–515, 1989.
[41] E. Edmiston, N. Core, J. Saltz, and R. Smith, “Parallel processing of biological
sequence comparison algorithms,” International Journal of Parallel Programming,
vol. 17, no. 3, pp. 259–275, 1988.
[42] O. Gotoh and Y. Tagashira, “Sequence search on a supercomputer,” Nucleic acids
research, vol. 14, no. 1, pp. 57–64, 1986.
[43] X. Huang, “A space-efficient parallel sequence comparison algorithm for a message-
passing multiprocessor,” International Journal of Parallel Programming, vol. 18,
no. 3, pp. 223–239, 1989.
[44] D. Lopresti, “P-NAC: A systolic array for comparing nucleic acid sequences,” Com-
puter, pp. 98–99, 1987.
[45] A. Baxevanis, Bioinformatics and the internet. Wiley Online Library, 2001.
[46] D. Altschuh, T. Vernet, P. Berti, D. Moras, and K. Nagai, “Coordinated amino
acid changes in homologous protein families,” Protein engineering, vol. 2, no. 3, pp.
193–199, 1988.
Bibliography 115
[47] F. Collins, A. Patrinos, E. Jordan, A. Chakravarti, R. Gesteland, L. Walters et al.,
“New goals for the US human genome project: 1998-2003,” Science, vol. 282, no.
5389, pp. 682–689, 1998.
[48] E. Lander, L. Linton, B. Birren, C. Nusbaum, M. Zody, J. Baldwin, K. Devon,
K. Dewar, M. Doyle, W. FitzHugh et al., “Initial sequencing and analysis of the
human genome,” Nature, vol. 409, no. 6822, pp. 860–921, 2001.
[49] F. Collins, E. Lander, J. Rogers, R. Waterston, and I. Conso, “Finishing the eu-
chromatic sequence of the human genome,” Nature, vol. 431, no. 7011, pp. 931–945,
2004.
[50] F. Collins, M. Morgan, and A. Patrinos, “The Human Genome Project: lessons
from large-scale biology,” Science, vol. 300, no. 5617, pp. 286–290, 2003.
[51] M. Adams, J. Kelley, J. Gocayne, M. Dubnick, M. Polymeropoulos, H. Xiao, C. Mer-
ril, A. Wu, B. Olde, R. Moreno et al., “Complementary DNA sequencing: expressed
sequence tags and human genome project,” Science, vol. 252, no. 5013, pp. 1651–
1656, 1991.
[52] Y. Lee, J. Tsai, S. Sunkara, S. Karamycheva, G. Pertea, R. Sultana, V. Antonescu,
A. Chan, F. Cheung, and J. Quackenbush, “The TIGR Gene Indices: clustering
and assembling EST and known genes and integration with eukaryotic genomes,”
Nucleic acids research, vol. 33, no. suppl 1, p. D71, 2005.
[53] P. Green, “Phrap,” Unpublished, available for download at
http://www.genome.washington.edu/UWGC/analysistools/phrap.htm, 1994.
[54] F. Liang, I. Holt, G. Pertea, S. Karamycheva, S. Salzberg, and J. Quackenbush, “An
optimized protocol for analysis of EST sequences,” Nucleic acids research, vol. 28,
no. 18, p. 3657, 2000.
[55] X. Huang and A. Madan, “CAP3: A DNA Sequence Assembly Program,”
Genome Research, vol. 9, no. 9, pp. 868–877, 1999. [Online]. Available:
http://genome.cshlp.org/content/9/9/868.abstract
[56] G. Pertea, X. Huang, F. Liang, V. Antonescu, R. Sultana, S. Karamycheva, Y. Lee,
J. White, F. Cheung, B. Parvizi et al., “TIGR Gene Indices clustering tools (TG-
ICL): a software system for fast clustering of large EST datasets,” Bioinformatics,
vol. 19, no. 5, p. 651, 2003.
116 Bibliography
[57] A. Kalyanaraman, S. Aluru, S. Kothari, and V. Brendel, “Efficient clustering of
large EST data sets on parallel computers,” Nucleic Acids Research, vol. 31, no. 11,
p. 2963, 2003.
[58] J. Burke, D. Davison, and W. Hide, “d2 cluster: a validated method for clustering
EST and full-length cDNA sequences,” Genome Research, vol. 9, no. 11, p. 1135,
1999.
[59] S. Hazelhurst, W. Hide, Z. Liptak, R. Nogueira, and R. Starfield, “An
overview of the wcd EST clustering tool.” Bioinformatics (Oxford, England),
vol. 24, no. 13, pp. 1542–1546, July 2008. [Online]. Available: http:
//dx.doi.org/10.1093/bioinformatics/btn203
[60] J. Rhoades, G. Turk, A. Bell, U. Neumann, A. Varshney et al., “Real-time proce-
dural textures,” in Proceedings of the 1992 symposium on Interactive 3D graphics.
ACM, 1992, pp. 95–100.
[61] J. Eyles, S. Molnar, J. Poulton, T. Greer, A. Lastra, N. England, and L. West-
over, “PixelFlow: the realization,” in Proceedings of the ACM SIGGRAPH/EURO-
GRAPHICS workshop on Graphics hardware. ACM, 1997, pp. 57–68.
[62] B. Jobard, G. Erlebacher, and M. Hussaini, “Lagrangian-eulerian advection for
unsteady flow visualization,” in Proceedings of the conference on Visualization’01.
IEEE Computer Society, 2001, pp. 53–60.
[63] C. Bohn, “Kohonen feature mapping through graphics hardware,” in Proceedings of
the 3rd Int. Conference on Computational Intelligence and Neurosciences, 1998.
[64] N. Carr, J. Hall, and J. Hart, “The ray engine,” in Proceedings of the ACM SIG-
GRAPH/EUROGRAPHICS conference on Graphics hardware. Eurographics As-
sociation, 2002, pp. 37–46.
[65] T. Purcell, I. Buck, W. Mark, and P. Hanrahan, “Ray tracing on programmable
graphics hardware,” ACM Transactions on Graphics (TOG), vol. 21, no. 3, pp.
703–712, 2002.
[66] J. Tran, D. Jordan, and D. Luebke, “New challenges for cellular automata simulation
on the GPU,” 2003.
Bibliography 117
[67] M. Harris, G. Coombe, T. Scheuermann, and A. Lastra, “Physically-based visual
simulation on graphics hardware,” in Proceedings of the ACM SIGGRAPH/EURO-
GRAPHICS conference on Graphics hardware. Eurographics Association, 2002,
pp. 109–118.
[68] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha, “GPUTeraSort: high perfor-
mance graphics co-processor sorting for large database management,” in Proceed-
ings of the 2006 ACM SIGMOD international conference on Management of data.
ACM, 2006, pp. 325–336.
[69] N. Govindaraju, D. Manocha, N. Raghuvanshi, and D. Tuft, “Gpusort: High per-
formance sorting using graphics processors,” 2006.
[70] E. Elsen, V. Vishal, M. Houston, V. Pande, P. Hanrahan, and E. Darve, “N-body
simulations on GPUs,” Arxiv preprint arXiv:0706.3060, 2007.
[71] M. Harris, “Fast fluid dynamics simulation on the GPU,” GPU gems, vol. 1, pp.
637–665, 2004.
[72] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanra-
han, “Brook for GPUs: stream computing on graphics hardware,” in ACM Trans-
actions on Graphics (TOG), vol. 23, no. 3. ACM, 2004, pp. 777–786.
[73] I. Buck, “High level languages for GPUs,” in ACM SIGGRAPH, 2005.
[74] M. McCool, K. Wadleigh, B. Henderson, and H. Lin, “Performance evaluation of
GPUs using the RapidMind development platform,” in Proceedings of the 2006
ACM/IEEE conference on Supercomputing. ACM, 2006, p. 181.
[75] Berkeley University. (2010, Aug.) SETI@home Website. [Online]. Available:
http://setiathome.berkeley.edu/
[76] D. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, “SETI@home:
an experiment in public-resource computing,” Communications of the ACM, vol. 45,
no. 11, pp. 56–61, 2002.
[77] Stanford University. (2010, Aug.) Folding@home Website. [Online]. Available:
http://folding.stanford.edu
118 Bibliography
[78] A. Beberg, D. Ensign, G. Jayachandran, S. Khaliq, and V. Pande, “Folding@home:
Lessons from eight years of volunteer distributed computing,” in Parallel & Dis-
tributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE,
2009, pp. 1–8.
[79] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Muller-Wittig, “Bio-sequence
database scanning on a GPU,” in Parallel and Distributed Processing Symposium,
2006. IPDPS 2006. 20th International. IEEE, 2006, pp. 8–pp.
[80] Y. Liu, W. Huang, J. Johnson, and S. Vaidya, “GPU accelerated smith-waterman,”
Computational Science–ICCS 2006, pp. 188–195, 2006.
[81] M. Charalambous, P. Trancoso, and A. Stamatakis, “Initial experiences porting a
bioinformatics application to a graphics processor,” Advances in Informatics, pp.
415–425, 2005.
[82] M. Schatz and C. Trapnell, “Fast exact string matching on the GPU,” Center for
Bioinformatics and Computational Biology, 2007.
[83] NVidia. (2011, Aug.) NVidia: Bio-Informatics and Life Sciences. [Online].
Available: http://www.nvidia.com/object/bio info life sciences.html
[84] S. Manavski and G. Valle, “CUDA compatible GPU cards as efficient hardware
accelerators for Smith-Waterman sequence alignment,” BMC bioinformatics, vol. 9,
no. Suppl 2, p. S10, 2008.
[85] Y. Liu, B. Schmidt, and D. Maskell, “CUDASW++ 2. 0: enhanced Smith-
Waterman protein database search on CUDA-enabled GPUs based on SIMT and
virtualized SIMD abstractions,” BMC Research Notes, vol. 3, no. 1, p. 93, 2010.
[86] Y. Munekawa, F. Ino, and K. Hagihara, “Design and implementation of the Smith-
Waterman algorithm on the CUDA-compatible GPU,” in BioInformatics and Bio-
Engineering, 2008. BIBE 2008. 8th IEEE International Conference on. IEEE,
2008, pp. 1–6.
[87] A. Akoglu and G. Striemer, “Scalable and highly parallel implementation of Smith-
Waterman on graphics processing unit using CUDA,” Cluster Computing, vol. 12,
no. 3, pp. 341–352, 2009.
[88] G. Striemer and A. Akoglu, “Sequence alignment with GPU: Performance and de-
sign challenges,” 2009.
Bibliography 119
[89] J. Walters, V. Balu, S. Kompalli, and V. Chaudhary, “Evaluating the use of GPUs in
liver image segmentation and HMMER database searches,” in Parallel & Distributed
Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009,
pp. 1–12.
[90] M. Schatz, C. Trapnell, A. Delcher, and A. Varshney, “High-throughput sequence
alignment using Graphics Processing Units,” BMC bioinformatics, vol. 8, no. 1, p.
474, 2007.
[91] A. Eklund, M. Andersson, and H. Knutsson, “fMRI analysis on the GPU - possi-
bilities and challenges,” Computer Methods and Programs in Biomedicine, vol. 105,
no. 2, pp. 145–161, 2012.
[92] T. Sumanaweera and D. Liu, “Medical image reconstruction with the FFT,” GPU
gems, vol. 2, pp. 765–784, 2005.
[93] T. Kroes, F. Post, and C. Botha, “Exposure render: An interactive photo-realistic
volume rendering framework,” PloS one, vol. 7, no. 7, p. e38586, 2012.
[94] W. Liu, B. Schmidt, and W. Muller-Wittig, “CUDA-BLASTP: Accelerating
BLASTP on CUDA-Enabled Graphics Hardware,” IEEE/ACM Transactions on
Computational Biology and Bioinformatics (TCBB), vol. 8, no. 6, pp. 1678–1684,
2011.
[95] C. Trapnell and M. Schatz, “Optimizing data intensive GPGPU computations for
DNA sequence alignment,” Parallel computing, vol. 35, no. 8, pp. 429–440, 2009.
[96] K. Karimi, N. Dickson, and F. Hamze, “A performance comparison of CUDA and
OpenCL,” arXiv preprint arXiv:1005.2581, 2010.
[97] K. Okonechnikov, O. Golosova, M. Fursov et al., “Unipro UGENE: a unified bioin-
formatics toolkit,” Bioinformatics, vol. 28, no. 8, pp. 1166–1167, 2012.
[98] Yang Zhang’s Research Group, University of Michigan, “What is FASTA format?”
2011, [Online; accessed 26-November-2011]. [Online]. Available: http://zhanglab.
ccmb.med.umich.edu/FASTA/
[99] P. Hanrahan, “Why is graphics hardware so fast?” in Proceedings of the tenth ACM
SIGPLAN symposium on Principles and practice of parallel programming. ACM,
2005, pp. 1–1.
[100] C. Gregg and K. Hazelwood, “Where is the data? Why you cannot debate CPU
vs. GPU performance without the answer,” in Performance Analysis of Systems
and Software (ISPASS), 2011 IEEE International Symposium on. IEEE, 2011, pp.
134–144.
[101] V. Podlozhnyuk, “FFT-based 2D convolution,” NVIDIA white paper, 2007.
[102] NVIDIA Corporation, “CUBLAS Library,” NVIDIA Corporation, Santa Clara, Cali-
fornia, vol. 15, 2008.
[103] M. Naumov, “CUSPARSE Library: A Set of Basic Linear Algebra Subroutines for
Sparse Matrices,” in GPU Technology Conference, vol. 2070.
[104] NVIDIA Corporation, “CURAND library,” NVIDIA Corporation, Santa Clara, Cal-
ifornia, vol. 50, 2008.
[105] M. Harris, “Optimizing CUDA,” SC07: High Performance Computing With CUDA,
2007.
[106] V. Volkov, “Better performance at lower occupancy,” in Proceedings of the GPU
Technology Conference, GTC, vol. 10, 2010.
[107] NVIDIA Corporation. (2010, Aug.) CUDA Occupancy Calculator. [Online]. Avail-
able: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_
calculator.xls
[108] P. Jaccard, “The distribution of the flora in the alpine zone,” New Phytologist,
vol. 11, no. 2, pp. 37–50, 1912.
[109] G. Amdahl, “Validity of the single processor approach to achieving large scale com-
puting capabilities,” in Proceedings of the April 18-20, 1967, spring joint computer
conference. ACM, 1967, pp. 483–485.
[110] S. Hazelhurst, “Computational Performance Benchmarking of the wcd EST Clus-
tering System,” Technical Report TR-Wits-CS-2007-1, School of Computer Science,
University of the Witwatersrand, Tech. Rep., 2007.
[111] S. Mayanglambam, A. Malony, and M. Sottile, “Performance measurement of ap-
plications with GPU acceleration using CUDA,” in International Conference on
Parallel Computing (ParCo), 2009.
[112] U. Manber and G. Myers, “Suffix arrays: a new method for on-line
string searches,” in Proceedings of the first annual ACM-SIAM symposium
on Discrete algorithms, ser. SODA ’90. Philadelphia, PA, USA: Society for
Industrial and Applied Mathematics, 1990, pp. 319–327. [Online]. Available:
http://portal.acm.org/citation.cfm?id=320176.320218
[113] S. Puglisi, W. Smyth, and A. Turpin, “A taxonomy of suffix array construction
algorithms,” ACM Computing Surveys (CSUR), vol. 39, no. 2, p. 4, 2007.
[114] K. Katoh, K. Misawa, K.-i. Kuma, and T. Miyata, “MAFFT: a novel method
for rapid multiple sequence alignment based on fast Fourier transform.” Nucleic
acids research, vol. 30, no. 14, pp. 3059–3066, July 2002. [Online]. Available:
http://dx.doi.org/10.1093/nar/gkf436
[115] M. McGraw-Herdeg, D. Enright, and B. Michel, “Benchmarking the NVIDIA
8800GTX with the CUDA Development Platform,” in Proceedings of the 11th An-
nual High-Performance Embedded Computing Workshop (HPEC’07), 2007.
[116] W. Hide, J. Burke, and D. B. Davison, “Biological evaluation of d2, an
algorithm for high-performance sequence comparison.” J Comput Biol, vol. 1,
no. 3, pp. 199–215, 1994. [Online]. Available: http://www.biomedsearch.com/nih/
Biological-evaluation-d2-algorithm-high/8790465.html
[117] S. Hazelhurst, “Algorithms for clustering expressed sequence tags: the wcd tool,”
South African Comput. J, vol. 40, pp. 51–62, 2008.
[118] T. F. Smith and M. S. Waterman, “Identification of common molecular
subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197,
1981. [Online]. Available: http://www.sciencedirect.com/science/article/
B6WK7-4DN3Y5S-24/2/b00036bf942b543981e4b5b7943b3f9a