

Parallelizing Information Retrieval

Gautam B. Singh
Department of Computer Science & Engineering
Oakland University, Rochester, MI

Parallelizing DNA Sequence DB Search

- Goal: Apply parallel algorithms for finding the neighbors of a query sequence in the database.
- Sequence retrieval is one of the most common operations performed by biologists.
- GenBank (NIH) has 6 million records and some 5 billion characters.


Method and Activities

- Method: Use information-theory-based retrieval of sequences from the database.
- Current algorithms are based on finding the Longest Common Subsequence (LCS).
- Data Set: Data set from human chromosome sequences, about 2500 sequences.
- Query: 20 sequences of variable lengths, from 500,000 to 2,000,000 characters in size.

Solution: Utilize a Distributed Parallel Paradigm

- Processors have distributed memory/database.
- An NxN Processor Interconnection Network is implemented by the WUGS switch.

[Figure: NxN Interconnection Network (WUGS Switch) linking processors P1, P2, P3, ..., Pn, each with its own memory. Labels: master processor, slave processors, centralized database.]


Sequence Retrieval Algorithm

- Based on information-theoretic measurements of sequence similarity.
- Constructs a profile or a signature of the sequences in the database.
- Construct the profile of the query.
- Establish similarity by computing the distance between the two signatures.

Profile or Signature

- Profile = n-mer word frequency.
- An n-mer profile contains 4**n words.
- Consider the sequence: A T A C C A G A C C A
- 4-mer profile (256 words): ATAC, TACC, ACCA (2), CCAG, CAGA, AGAC, GACC.
- 6-mer profile (4096 words): ATACCA, TACCAG, ACCAGA, CCAGAC, CAGACC, AGACCA.
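The n-mer counting described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `nmer_profile` is hypothetical, and the slides do not say how the original system stored its counts.

```python
from collections import Counter


def nmer_profile(sequence: str, n: int) -> Counter:
    """Count every overlapping n-mer (window of length n) in the sequence."""
    seq = sequence.replace(" ", "").upper()
    # Slide a window of length n over the sequence, one character at a time.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


# The example sequence from the slide.
profile4 = nmer_profile("A T A C C A G A C C A", 4)
profile6 = nmer_profile("A T A C C A G A C C A", 6)

print(dict(profile4))          # observed 4-mers and their counts
print(sum(profile4.values()))  # number of windows: 11 - 4 + 1
print(4 ** 4, 4 ** 6)          # vocabulary sizes for 4-mers and 6-mers
```

Only the words that actually occur are stored here; every other word in the 4**n vocabulary has an implicit count of zero.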


Sequence Profile or Signature

Comparing Profiles

- Profiles are normalized and converted into probability density functions.
- Profiles are compared using divergence measures.
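The slides do not name the divergence measure that was used. As one possible illustration, the sketch below normalizes n-mer counts into a distribution over all 4**n words and compares two distributions with the Jensen-Shannon divergence. The choice of Jensen-Shannon and the function names are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product


def normalize(counts, n):
    """Turn n-mer counts into a probability distribution over all 4**n words."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile has no counts")
    words = ("".join(w) for w in product("ACGT", repeat=n))
    return {w: counts.get(w, 0) / total for w in words}


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p == 0 contribute 0."""
    return sum(pv * math.log2(pv / q[w]) for w, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed measure): symmetric, between 0 and 1."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p = normalize(Counter({"AC": 3, "GT": 1}), 2)
q = normalize(Counter({"AC": 1, "GT": 3}), 2)
print(js_divergence(p, q))  # small positive value: similar but not identical
print(js_divergence(p, p))  # identical profiles have zero divergence
```

Jensen-Shannon is convenient for sparse profiles because the midpoint distribution is non-zero wherever either input is non-zero, so no term divides by zero.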


Retrieval System

- Profiles for the sequences are pre-computed and stored.
- The retrieval system compares the query profile to all DB profiles.
- Neighbors are reported.
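The retrieval step amounts to a linear scan over the stored profiles followed by reporting the closest ones. The sketch below assumes profiles are kept as word-to-probability dictionaries and uses a simple L1 distance as a stand-in for the divergence measure. The names `nearest_neighbors` and `l1_distance` and the toy data are hypothetical.

```python
import heapq


def l1_distance(p, q):
    """Stand-in distance: sum of absolute differences between two distributions."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


def nearest_neighbors(query_profile, db_profiles, k, distance=l1_distance):
    """Score the query against every stored profile and return the k closest."""
    scored = ((distance(query_profile, prof), seq_id)
              for seq_id, prof in db_profiles.items())
    return [(seq_id, score) for score, seq_id in heapq.nsmallest(k, scored)]


# Toy pre-computed profiles (illustrative values only).
db = {
    "seq1": {"AC": 0.5, "CA": 0.5},
    "seq2": {"AC": 0.25, "GT": 0.75},
    "seq3": {"GT": 1.0},
}
query = {"AC": 0.5, "CA": 0.25, "GT": 0.25}
neighbors = nearest_neighbors(query, db, k=2)
print(neighbors)
```

Any distance with the same signature can be passed in through the `distance` parameter, so the scan itself does not depend on which divergence measure is chosen.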

Database and System Details

- Our pilot database comprises:
  • 2047 sequences
  • 500 MB data
- Two (2) Intel 133 MHz CPUs connected via APIC/WUGS
- Profiles:
  • 8-mer profile (1.25 GB located on the centralized database)


Framework Possibilities

- Remote Method Invocation
  • CORBA paradigm
  • The program is written against the INTERFACE
  • INTERFACEs are implemented by the RMI Server.
  • Programming model is really elegant: application parallelization is transparent.
  • This IS the parallel paradigm to program to!

Framework Possibilities (Cont.)

- MPI
  • Similar programming model to PVM
  • Still have to explicitly parallelize and write the parallel program
  • Do not have to work as hard as with sockets
- Sockets
  • Explicit forking and synchronization
  • Difficult; typically done after the parallel algorithm is well established.


Parallel Application Frameworks

Application Framework                 Bandwidth      Comments
PVM: Parallel Virtual Machine         20 Mbps        Easy to program.
MPI: Message Passing Interface        32 Mbps        Varies with block size.
Java RMI: Remote Method Invocation    32-36 Mbps     Ethernet and APIC performance comparable.
Sockets                               100-200 Mbps   Varies. Best for packet size of 32 kb.

Search Algorithm

- 8-mer profiles for the DB to be searched are stored on the master processor.
- Each query sequence is communicated to all slaves.
- Slaves request work and get a database profile (0.5 MB message). They return the score and get a new profile (work).
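The request-work, score, request-again loop above can be sketched with a shared work queue. In this illustration Python threads stand in for the slave processors and in-memory queues stand in for the messages sent over the interconnect. The function names, the toy scoring function, and the data are all hypothetical and are not taken from the original implementation.

```python
import queue
import threading


def run_search(query_profile, db_profiles, score, num_slaves=2):
    """Master queues the DB profiles; each slave pulls one, scores it, repeats."""
    work = queue.Queue()
    results = queue.Queue()
    for item in db_profiles.items():  # master loads all work up front
        work.put(item)

    def slave():
        # The query is shared with every slave (the "communicated to all" step).
        while True:
            try:
                seq_id, profile = work.get_nowait()  # request work
            except queue.Empty:
                return                               # no work left
            results.put((seq_id, score(query_profile, profile)))  # return score

    threads = [threading.Thread(target=slave) for _ in range(num_slaves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    scores = {}
    while not results.empty():
        seq_id, value = results.get()
        scores[seq_id] = value
    return scores


def toy_score(p, q):
    """Stand-in score: L1 distance between two word-frequency dictionaries."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


db = {"seq1": {"AC": 1.0}, "seq2": {"GT": 1.0}, "seq3": {"AC": 0.5, "GT": 0.5}}
scores = run_search({"AC": 1.0}, db, toy_score, num_slaves=2)
print(scores)
```

Because slaves pull work as they finish, a slave that receives cheap profiles simply asks for more, which balances load without the master having to partition the database in advance.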


Evaluating Parallel Performance

- Performance of parallelization was measured by speedup:
- Speedup = Exec. Time 1 Proc. / Exec. Time N Proc.
- In our experiments, N = 2
- Ideal Speedup = 2
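As a small worked example of the speedup formula, the sketch below computes speedup from two wall-clock times. The timings are hypothetical placeholders chosen for illustration; they are not measurements from these experiments.

```python
def speedup(time_one_proc: float, time_n_proc: float) -> float:
    """Speedup = execution time on 1 processor / execution time on N processors."""
    if time_n_proc <= 0:
        raise ValueError("execution time must be positive")
    return time_one_proc / time_n_proc


# Hypothetical timings in seconds (NOT measured values from the slides).
t_one_proc = 120.0
t_two_proc = 100.0
n_procs = 2

s = speedup(t_one_proc, t_two_proc)
print(f"speedup = {s:.2f} (ideal for N = {n_procs} is {n_procs})")
```

A speedup below the ideal value of N means the time saved by dividing the work is partly offset by communication and coordination costs.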

[Figure: 8-Mer DB Search Speedup. X-axis: Sequence Id (1 to 11). Y-axis: Speedup (0 to 1.4). Series: Speedup (APIC), Speedup (Ether).]


Acknowledgements

- Washington University
- Ron Srodawa, Oakland University
- Donglin Liu, Oakland University