hadoop summit 2010 multiple sequence alignment using hadoop

MSA using Hadoop. Presented by: Dr. G. Sudha Sadasivam, Professor, Dept of CSE, PSG College of Technology, Coimbatore

Upload: yahoo-developer-network

Post on 11-Jul-2015


Page 1: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

MSA using Hadoop

Presented by:

Dr. G.Sudha Sadasivam

Professor, Dept of CSE,

PSG College of Technology,

Coimbatore

Page 2: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Agenda

• Sequence alignment
• Introduction to Clouds
• Approaches for MSA
• Approach 1
• Approach 2
• Results
• Other Projects

Page 3: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

What is Sequence Alignment?

The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

• Uses
  • For sequence similarity
  • Phylogenetic tree analysis
• Factors – accuracy and speed

Page 4: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Cloud computing

Provides scalable, on-demand, RT computing services

Suitability of cloud for Sequence Alignment

• On-demand scalability of the cloud makes it suitable for the dynamic nature of MSA
• Low cost in maintenance of infrastructure for applications
• Data and compute parallelism in clouds through the map-reduce paradigm facilitates energy-efficient and fast MSA.

Page 5: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Types of Sequence Alignment

• Pair-wise Alignment
  • Alignment of two sequences
  • Global – using Needleman Wunsch algorithm.

L G P S S K Q T G K G S _ S R A W D N
|         |     | | |       |     |
L N _ A T K S A G K G A I M R L G D A

  • Local – using Smith Waterman algorithm.

_ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                    | | |
_ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _

• Multiple Sequence Alignment
  • Alignment of more than two sequences
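The difference between global and local alignment can be illustrated with a minimal Python sketch of the Smith Waterman score. The scoring values (+1 match, -1 mismatch, gap penalty 1) and the function name `smith_waterman_score` are assumptions chosen for illustration; the slide does not state the parameters used for the local example.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment score of x and y (score only, no traceback)."""
    best = 0
    prev = [0] * (len(y) + 1)          # row i-1 of the score matrix
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)      # row i; column 0 stays 0
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Unlike global alignment, a cell never drops below 0,
            # so an alignment can start anywhere in either sequence.
            curr[j] = max(0, prev[j - 1] + s, prev[j] - d, curr[j - 1] - d)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The local example on the slide: only G K G matches.
    print(smith_waterman_score("TGKG", "AGKG"))
```

With these assumed parameters the matched run G K G contributes one point per residue, which is why the local alignment ignores the mismatching flanks.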

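Page 6: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Needleman Wunsch Algorithm

• Initialization

$$F(0,0) = 0, \qquad F(i,0) = -i \cdot d, \qquad F(0,j) = -j \cdot d$$

• Main Iteration

For each i = 1…M and j = 1…N:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) & \text{case 1: } x_i \text{ aligns to } y_j \\ F(i-1,j) - d & \text{case 2: } x_i \text{ aligns to gap} \\ F(i,j-1) - d & \text{case 3: } y_j \text{ aligns to gap} \end{cases}$$

$$\mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases}$$

$$s(x_i,y_j) = \begin{cases} +1 & \text{match} \\ -1 & \text{mismatch} \end{cases}$$

Here d is the gap penalty.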
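The recurrence above can be sketched in Python as follows. This is a minimal illustration rather than the presenters' implementation: the function name, the default parameters, and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x (rows, index i) and y (columns, index j).

    Returns the score matrix F and the two aligned strings.
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: first column and first row hold pure gap penalties.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration: pick the best of the three cases for every cell.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "UP"),        # case 2: x_i aligns to gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: y_j aligns to gap
            ]
            # max() keeps the first maximum, so ties prefer DIAG (assumption).
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right cell to the origin.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    F, ax, ay = needleman_wunsch("ATA", "AGTA")
    for row in F:
        print(row)
    print(ax)
    print(ay)
```

Calling it with the sequences ATA and AGTA and d = 1 fills the same 4 x 5 matrix that the next slide works through by hand.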

Page 7: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Needleman Wunsch Algorithm

Example: x = ATA (rows, i = 0…3), y = AGTA (columns, j = 0…4), d = 1

| F(i,j) | j=0 | 1 (A) | 2 (G) | 3 (T) | 4 (A) |
|---|---|---|---|---|---|
| i=0 | 0 | -1 | -2 | -3 | -4 |
| 1 (A) | -1 | 1 | 0 | -1 | -2 |
| 2 (T) | -2 | 0 | 0 | 1 | 0 |
| 3 (A) | -3 | -1 | -1 | 0 | 2 |

$$F(1,1) = \max\{\,F(0,0) + s(x_1,y_1),\; F(0,1) - d,\; F(1,0) - d\,\} = \max\{1, -2, -2\} = 1 \quad \text{(case 1)}$$

$$F(1,2) = \max\{\,F(0,1) + s(x_1,y_2),\; F(0,2) - d,\; F(1,1) - d\,\} = \max\{-2, -3, 0\} = 0 \quad \text{(case 3)}$$

Optimal Alignment

A_TA
AGTA

Page 8: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Multiple Sequence Alignment

• A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
• The input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.
• From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.

Page 9: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

MSA Approaches

• Dynamic programming
• Progressive alignment
• Iterative approach

Page 10: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

MSA methods

• Dynamic Programming (n-dim matrix): accurate; computationally complex, $O(N^n)$; exhaustive
• Progressive approximation (aligns closest sequences first, heuristics): fast alignment; cannot be modified; local maxima; less accurate; ClustalW, MAFFT
• Iterative: probabilistic / stochastic (random); slow and less accurate; GA and HMM

N – sequence length; n – number of sequences

Page 11: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

MSA in cloud

• CloudBurst – RMAP
  • Does not split sequences to load in cloud environment
  • Not for MSA
  • No automatic scale up/down of clusters
• CLUE – proposal from Maryland University
• VM cloning – Snowflock with MPIs

Page 12: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Proposed MSA Approach – Hadoop data grid

[Figure: S1, S2 and S3 pass through a chain of Map/Reduce aligners. The first aligner produces A1S1 and A1S2; the later aligners produce A2S1, A2S2 and A1S3.]

Page 13: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

1) Identify the different permutations
   S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1
2) Perform alignment of each permutation in parallel in Map2
   S1 and S2 are aligned to form A1S1 and A1S2
3) Align the output of the first Map-Reduce with the third sequence S3 in the Map phase.
   A1S1 is aligned with S3
   A1S2 is aligned with S3
   The best of these two is chosen to form A2S1, A2S2 and A1S3.
4) Steps 2 and 3 are repeated for all the other permutations in Map1
5) The best possible combination is chosen (alignment score)
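The enumerate-then-select structure of these steps can be sketched in Python. This is a deliberately simplified illustration, not the presenters' method: it scores each order by chaining pairwise Needleman Wunsch scores on the ungapped input sequences instead of re-aligning gapped intermediate alignments, and the names `nw_score`, `score_order` and `best_order` are hypothetical. The built-in `map` and `max` stand in for the Map and Reduce phases.

```python
from itertools import permutations


def nw_score(x, y, match=1, mismatch=-1, d=1):
    """Global alignment score only, keeping one matrix row at a time."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] - d, curr[j - 1] - d))
        prev = curr
    return prev[-1]


def score_order(order):
    """Score one permutation (needs at least two sequences).

    Simplification: each later sequence is scored against the best-matching
    sequence already placed, mirroring "best among these is chosen".
    """
    total = nw_score(order[0], order[1])
    for k in range(2, len(order)):
        total += max(nw_score(placed, order[k]) for placed in order[:k])
    return total


def best_order(seqs):
    """Map: score every permutation. Reduce: keep the best-scoring one."""
    scored = list(map(lambda o: (score_order(o), o), permutations(seqs)))
    return max(scored, key=lambda t: t[0]), len(scored)


if __name__ == "__main__":
    (score, order), count = best_order(["AGTA", "ATA", "GAT"])
    print(count, score, order)
```

Because the number of permutations grows as n!, the per-permutation work is what benefits from running in parallel map tasks.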

Page 14: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Varying Number of Sequences of Same Size

[Figure: x-axis: Number of sequences (2 to 10); y-axis: Time in Sec (0 to 100); series: 2 nodes, 3 nodes.]

Page 15: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Different Block Sizes

[Figure: x-axis: Block Size in KB (10, 100, 1000, 6400); y-axis: Time in Sec (0 to 350); series: 2 nodes, 3 nodes.]

Page 16: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Complexity Analysis

| Complexity Measure | Proposed Method | Conventional Method |
|---|---|---|
| Score Calculation | $O(N)$ | $O(n \cdot N)$ |
| Pairwise alignment | $O(K^2)$ | $O(N^2)$ |
| MSA | $O[(n-1) \cdot N^2 / b]$ | $O(N^n)$ |

‘n’ – Number of Sequences
‘N’ – Average length of a sequence
‘b’ – Average number of blocks in a sequence
‘K’ – Size of 1 block

Page 17: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Proposed MSA Approach on Cloud

Time efficient approach to sequence alignment with quality (accuracy) in Cloud

• Using Hadoop framework
• Dynamic approach → accuracy
• Data and compute parallelism in Hadoop → speed
• Blocking and scalability of Hadoop
• Parallel transfer of sequence splits over the network to remote clusters
• Automated scale up/down of clusters based on computational needs of the environment.

Page 18: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

System Architecture

[Figure: a client-side virtual environment holding sequence fragments (AGT….CG), connected over the Internet to a server-side Hadoop cluster made up of a head server (VM) and new VMs.]

1. Create virtual environment
2. Split the sequences; parallel transmission over Internet
3. Copy to HDFS
4. Forking VMs / deleting VMs
5. Perform alignment
6. Report the result

Page 19: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

A single combination – an illustration

Page 20: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

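S1 = “AGTA”; S2 = “ATA”; S3 = “GAT”

1. ALIGNMENT OF S1 & S2

|   |   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
|   |   |   | A | G | T | A |
| 0 |   | 0 | -1 | -2 | -3 | -4 |
| 1 | A | -1 | 1 | 0 | -1 | -2 |
| 2 | T | -2 | 0 | 0 | 1 | 0 |
| 3 | A | -3 | -1 | -1 | 0 | 2 |

SCORE: 4

A1S1: “AGTA”; A1S2: “A_TA”

2. ALIGNMENT OF A1S1 & S3

|   |   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
|   |   |   | A | G | T | A |
| 0 |   | 0 | -1 | -2 | -3 | -4 |
| 1 | G | -1 | -1 | 0 | -1 | -2 |
| 2 | A | -2 | 0 | -1 | -1 | 0 |
| 3 | T | -3 | -1 | -1 | 0 | -1 |

SCORE: -5

A2S1: “AG_TA”; A1S3: “_GAT_”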

Page 21: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

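3. ALIGNMENT OF A1S2 & A1S3

|   |   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
|   |   |   | A | _ | T | A | _ |
| 0 |   | 0 | -1 | -2 | -3 | -4 | -5 |
| 1 | _ | -1 | 0 | 0 | -1 | -2 | -3 |
| 2 | G | -2 | -1 | -1 | -1 | -2 | -2 |
| 3 | A | -3 | -1 | -1 | -2 | 0 | -1 |
| 4 | T | -4 | -2 | -1 | 0 | -1 | 0 |
| 5 | _ | -5 | -3 | -1 | -1 | 0 | 0 |

SCORE: -3

A2S2: “A__TA_”;
A2S3: “_GAT__”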

Page 22: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Complexity Analysis

| Complexity Measure | Proposed Method | Conventional Method |
|---|---|---|
| Score Calculation | $O(N)$ | $O(n \cdot N)$ |
| Pairwise alignment | $O(K^2)$ | $O(N^2)$ |
| MSA | $O[K^2 \cdot n(n-1)/2]$ | $O(N^n)$ |

‘n’ – Number of Sequences
‘N’ – Average length of a sequence
‘k’ – Average number of blocks in a sequence
‘K’ – Size of 1 block
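To give a feel for the gap between the two MSA bounds, the short Python sketch below plugs in the configuration quoted on the cluster performance slide (30 KB sequences, 2 KB splits, 5 sequences). Treating one byte as one residue and ignoring constant factors are assumptions made for illustration, and the function names are hypothetical.

```python
def proposed_cost(n, K):
    """Cell count suggested by O[K^2 * n(n-1)/2]."""
    return K ** 2 * n * (n - 1) // 2


def conventional_cost(n, N):
    """Cell count suggested by O(N^n), the full n-dimensional matrix."""
    return N ** n


if __name__ == "__main__":
    n = 5            # number of sequences
    N = 30 * 1024    # average sequence length (assumed 1 byte per residue)
    K = 2 * 1024     # size of one block
    print(proposed_cost(n, K))       # 41943040
    print(conventional_cost(n, N))
    print(conventional_cost(n, N) // proposed_cost(n, K))
```

These are order-of-magnitude estimates only; real running times also depend on the constants hidden by the O notation and on cluster overheads.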

Page 23: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

2. Parallelised data transfer

‘T’ – Time for sequence transfer serially; ‘k’ – number of blocks
T/k – Time for sequence transfer in parallel

3. Dynamic cluster creation

Advantage: Computation power of remote cluster is optimal and not wasted
Disadvantage: Time to set up the cluster
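The split, parallel send and merge cycle behind the T/k estimate can be sketched in Python. The `send_block` function is a hypothetical stand-in for a real network upload (it just returns the block), and the worker count of 3 is an assumption; T/k is an idealised bound that ignores split and merge overhead.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(sequence, block_size):
    """Cut a sequence into consecutive blocks of at most block_size characters."""
    return [sequence[i:i + block_size] for i in range(0, len(sequence), block_size)]


def send_block(indexed_block):
    """Hypothetical stand-in for uploading one block; echoes it back with its index."""
    index, block = indexed_block
    return index, block


def parallel_transfer(sequence, block_size, workers=3):
    """Send all blocks concurrently, then merge them back in index order."""
    blocks = split_blocks(sequence, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        received = list(pool.map(send_block, enumerate(blocks)))
    received.sort(key=lambda pair: pair[0])   # merge step on the server side
    return "".join(block for _, block in received)


if __name__ == "__main__":
    print(split_blocks("AGTACGTTAG", 3))
    print(parallel_transfer("AGTACGTTAG", 3))
```

Carrying the block index with each block is what lets the receiving side reassemble the sequence even if blocks arrive out of order.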

Page 24: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Effect of parallel file transfer

| File Size (MB) | File Transfer (sec) | Split Time (sec) | Merge Time (sec) | C1 (sec) | T1 (sec) | C2 (sec) | T2 (sec) |
|---|---|---|---|---|---|---|---|
| 100 | 6.23 | 0.02 | 0.03 | 2.13 | 2.18 | 0.73 | 0.78 |
| 200 | 9.32 | 0.23 | 0.43 | 2.96 | 3.62 | 1.23 | 1.89 |
| 300 | 11.43 | 0.85 | 1.64 | 3.84 | 6.33 | 1.16 | 3.65 |

C1: Communication time from 3 client VMs to the server without multithreading.
C2: Communication time from 3 client VMs to the server with multithreading.
T1: Total time for file transfer from client to server without multithreading.
T2: Total time for file transfer from client to server with multithreading.
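The totals in the table are consistent with communication time plus split and merge time (for example 2.13 + 0.02 + 0.03 = 2.18). The Python sketch below checks that reading for every row and derives the speed-up T1/T2; the speed-up is a derived quantity, not a figure reported on the slide, and the names used are hypothetical.

```python
# (file size MB, split, merge, C1, T1, C2, T2), all times in seconds
ROWS = [
    (100, 0.02, 0.03, 2.13, 2.18, 0.73, 0.78),
    (200, 0.23, 0.43, 2.96, 3.62, 1.23, 1.89),
    (300, 0.85, 1.64, 3.84, 6.33, 1.16, 3.65),
]


def totals_consistent(rows, tol=1e-6):
    """True if T = C + split + merge holds for both T1 and T2 in every row."""
    return all(
        abs(c1 + split + merge - t1) < tol and abs(c2 + split + merge - t2) < tol
        for _, split, merge, c1, t1, c2, t2 in rows
    )


def speedups(rows):
    """Ratio T1/T2 per file size: the gain from multithreaded transfer."""
    return [round(t1 / t2, 2) for *_, t1, _c2, t2 in rows]


if __name__ == "__main__":
    print(totals_consistent(ROWS))
    print(speedups(ROWS))
```

The gain shrinks as files grow (roughly 2.8x, 1.9x and 1.7x) because the split and merge times, which are not parallelised, make up a larger share of the total.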

Page 25: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Time to start virtual machines

[Figure: x-axis: Number of VMs (1 to 4); y-axis: Time in Sec (0 to 120).]

Parallelised starting of VMs can be done to reduce time

Page 26: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Cluster performance wrt number of VMs

30 KB sequences with 2 KB splits – up to 5 sequences

[Figure: x-axis: Number of sequences (1 to 10); y-axis: Time in Sec (0 to 350); series: 4 slave VMs (sec), 6 slave VMs (sec).]

When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.

Page 27: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Dynamic scaling up/down of clusters

• VMs instantiated based on number of Map-Reduce tasks
• The number of tasks was checked dynamically → new VMs were started and tasks were reallocated
• Old VMs were destroyed if not used

Block size: 10 MB

| File Size (GB) | Static: Time (min-sec) | Static: VMs | Dynamic: Time (min-sec) | Dynamic: New VMs added |
|---|---|---|---|---|
| 1 | 5-36 | 2 | 3-16 | 1 |
| 2 | 5-52 | 3 | 5-40 | 1 |
| 3 | 8-27 | 4 | 5-48 | 2 |
| 5 | 12-13 | 5 | 6-39 | 9 |

Static: VM creation based on predicted application load (maps + reduces)
Dynamic: VM creation based on actual application load (maps + reduces)
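The scaling rule described above (size the cluster from the current number of map and reduce tasks, start VMs when short, destroy idle ones) can be sketched in Python. The `slots_per_vm` parameter and both function names are hypothetical; the slides do not say how many tasks each VM was allowed to run.

```python
import math


def vms_needed(num_tasks, slots_per_vm):
    """Smallest number of VMs whose task slots cover the pending tasks."""
    return math.ceil(num_tasks / slots_per_vm) if num_tasks > 0 else 0


def rescale(running_vms, num_tasks, slots_per_vm=2):
    """Decide how many VMs to start or destroy for the current task count."""
    target = vms_needed(num_tasks, slots_per_vm)
    return {
        "start": max(0, target - running_vms),     # scale up when short of slots
        "destroy": max(0, running_vms - target),   # scale down when VMs sit idle
    }


if __name__ == "__main__":
    print(rescale(running_vms=2, num_tasks=10))
    print(rescale(running_vms=5, num_tasks=4))
```

A real controller would also weigh the VM start-up time shown earlier before deciding to scale, since short jobs may finish before a new VM is ready.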

Page 28: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Conclusion

1) Proposed MSA improves on the computation time and also maintains the accuracy.
  • Parallelism of sequence alignment in three levels.
  • Hadoop data grids – data and compute parallelism and scalability
  • Dynamic programming – accuracy.
2) Complexity is reduced from $O(N^n)$ to $O[K^2 \cdot n(n-1)/2]$
  • Combining progressive and dynamic approaches.
  • Blocking in Hadoop
3) Enhancements (using clouds for MSA)
  • Automatic configuration of the cloud environment based on the computational needs
  • Efficient upload of data into the HDFS by parallel transfer of sequence fragments over the Internet.

Page 29: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Other Projects

• Enhancement of existing fairshare scheduler in Hadoop
  • Reliability using Reed Solomon codes
  • Hybrid scheduler
• Motif identification for MSA
• CBIR using image signatures
• Text categorization
• Hybrid PSO (PSO and GA) for job scheduling
• Semantic search using Hadoop framework.
• Others – Globus and GridSim

Page 30: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

Acknowledgement

The research has been carried out as a result of the PSG-Yahoo Research programme on Grid and Cloud computing.

Sincere thanks to

1) Dr R Rudramoorthy, Principal,
   PSG College of Technology, Coimbatore.
2) Mr K V Chidambaran,
   Director, Grid and Cloud Systems Group,
   Yahoo, Bangalore

Page 31: Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

THANK YOU

QUESTIONS?