Parallelizing Information Retrieval
Gautam B. Singh
Department of Computer Science & Engineering
Oakland University, Rochester, MI

Parallelizing DNA Sequence DB Search
• Goal: Apply parallel algorithms for finding the neighbors of a query sequence in the database.
• Sequence retrieval is one of the most common operations performed by biologists.
• GenBank (NIH) has 6 million records and some 5 billion characters.

Method and Activities
• Method: Use information-theory-based retrieval of sequences from the database.
• Current algorithms are based on finding the Longest Common Subsequence (LCS).
• Data Set: Human chromosome sequences, about 2500 sequences.
• Query: 20 sequences of variable lengths, from 500,000 to 2,000,000 characters in size.

Solution: Utilize a Distributed Parallel Paradigm
• Processors have distributed memory/database.
• An NxN Processor Interconnection Network is implemented by the WUGS switch.
[Figure: NxN Interconnection Network (WUGS Switch) connecting processors P1, P2, P3, ..., Pn. Labels: Master Processor, Slave Processors, Memory, Centralized Database.]

Sequence Retrieval Algorithm
• Based on information-theoretic measurements of sequence similarity.
• Constructs a profile or a signature of the sequences in the database.
• Constructs the profile of the query.
• Establishes similarity by computing the distance between the two signatures.

Profile or Signature
• Profile = n-mer word frequency.
• An n-mer profile contains 4**n words.
• Consider the sequence: A T A C C A G A C C A
• 4-mer profile (256 words): ATAC, TACC, ACCA (2), CCAG, CAGA, AGAC, GACC.
• 6-mer profile (4096 words): ATACCA, TACCAG, ACCAGA, CCAGAC, CAGACC, AGACCA.

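The profile construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that a profile is a count of every overlapping n-mer window in the sequence; the function name `kmer_profile` is hypothetical and is not taken from the original system.

```python
from collections import Counter


def kmer_profile(sequence: str, n: int) -> Counter:
    """Count every overlapping n-mer (word of length n) in a DNA sequence.

    Assumption: spaces are only visual separators and are ignored.
    """
    seq = sequence.replace(" ", "").upper()
    # Slide a window of width n across the sequence, one base at a time.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


# The example sequence from the slide.
profile4 = kmer_profile("A T A C C A G A C C A", 4)
print(profile4["ACCA"])          # ACCA occurs twice
print(sum(profile4.values()))    # total number of 4-mer windows
print(4 ** 4)                    # size of the 4-mer vocabulary: 256
```

Only the words that actually occur are stored here; a dense profile would reserve one slot for each of the 4**n possible words.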
Sequence Profile or Signature

Comparing Profiles
• Profiles are normalized and converted into probability density functions.
• Profiles are compared using divergence measures.

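The two steps above can be sketched as follows. The slides do not say which divergence measure is used, so this example assumes the Jensen-Shannon divergence purely for illustration; the function names `normalize` and `js_divergence` are hypothetical.

```python
import math


def normalize(counts):
    """Convert raw word counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty profile")
    return {word: c / total for word, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Assumed measure, not confirmed by the slides. The result lies
    between 0 (identical profiles) and 1 (no words in common).
    """
    div = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two distributions
        if pw > 0:
            div += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            div += 0.5 * qw * math.log2(qw / mw)
    return div


query = normalize({"ATAC": 1, "ACCA": 2, "CCAG": 1})
same = normalize({"ATAC": 2, "ACCA": 4, "CCAG": 2})
other = normalize({"GGGG": 3})
print(js_divergence(query, same))
print(js_divergence(query, other))
```

A symmetric, bounded measure is convenient here because missing words (zero probabilities) do not produce infinite values, which a plain Kullback-Leibler divergence would.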
Retrieval System
• Profiles for the sequences are pre-computed & stored.
• Retrieval system compares the query profile to all DB profiles.
• Neighbors are reported.

Database and System Details
• Our pilot database comprises:
    • 2047 sequences
    • 500 MB data
• Two (2) Intel 133 MHz CPUs connected via APIC/WUGS
• Profiles:
    • 8-mer profile (1.25 GB located on the centralized database)

Framework Possibilities
• Remote Method Invocation (RMI)
    • CORBA paradigm
    • The program is written with the INTERFACE.
    • INTERFACEs are implemented by the RMI server.
    • Programming model is really elegant: application parallelization is transparent.
    • This IS the parallel paradigm to program to!

Framework Possibilities (Cont.)
• MPI
    • Similar programming model to PVM
    • Still have to explicitly parallelize (write the parallel program)
    • Do not have to work as hard as with sockets
• Sockets
    • Explicit forking and synchronization
    • Difficult; typically done after the parallel algorithm is well established.

Parallel Application Frameworks

Application Framework               Bandwidth      Comments
PVM: Parallel Virtual Machine       20 Mbps        Easy to program.
MPI: Message Passing Interface      32 Mbps        Varies with block size.
Java RMI: Remote Method Invocation  32-36 Mbps     Ethernet and APIC performance comparable.
Sockets                             100-200 Mbps   Varies. Best for packet size of 32 kb.

Search Algorithm
• 8-mer profiles for the DB to be searched are stored on the master processor.
• Each query sequence is communicated to all slaves.
• Slaves request work and get a database profile (0.5 MB message). They return the score and get a new profile (work).

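The request-work loop described above can be sketched on a single machine with Python threads and queues. This only illustrates the scheduling pattern; it is not the PVM, MPI, RMI, or socket implementation from the talk. The names `score`, `worker`, and `search` are hypothetical, and the scoring function is a stand-in (sum of absolute frequency differences), since the slides do not specify the distance used.

```python
import queue
import threading


def score(query, profile):
    """Stand-in distance (assumed): sum of absolute frequency differences."""
    words = set(query) | set(profile)
    return sum(abs(query.get(w, 0.0) - profile.get(w, 0.0)) for w in words)


def worker(query, work, results):
    """Slave loop: request a profile, score it, return the score, repeat."""
    while True:
        try:
            seq_id, profile = work.get_nowait()
        except queue.Empty:
            return  # all work was queued up front, so empty means done
        results.put((seq_id, score(query, profile)))


def search(query, db_profiles, n_workers=2):
    """Master: hand out DB profiles on demand and collect the scores."""
    work, results = queue.Queue(), queue.Queue()
    for item in db_profiles.items():
        work.put(item)
    threads = [threading.Thread(target=worker, args=(query, work, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    scores = [results.get() for _ in range(len(db_profiles))]
    return sorted(scores, key=lambda pair: pair[1])  # nearest neighbors first


db = {"s1": {"AA": 1.0}, "s2": {"AA": 0.5, "CC": 0.5}, "s3": {"CC": 1.0}}
ranked = search({"AA": 1.0}, db)
print(ranked)
```

Because each worker pulls a new profile as soon as it finishes the previous one, faster workers naturally take on more of the database, which is the load-balancing property of the request-driven scheme.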
Evaluating Parallel Performance
• Performance of parallelization was measured by Speedup:
• Speedup = Exec. Time 1 Proc. / Exec. Time N Proc.
• Our experiments, N = 2
• Ideal Speedup = 2

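The speedup definition above amounts to a single division. The sketch below uses hypothetical timings chosen only to show the arithmetic; they are not measurements from the experiments, and the function name `speedup` is an assumption.

```python
def speedup(time_one_proc: float, time_n_proc: float) -> float:
    """Speedup = execution time on 1 processor / execution time on N processors."""
    if time_n_proc <= 0:
        raise ValueError("execution time must be positive")
    return time_one_proc / time_n_proc


# Hypothetical timings in seconds, for illustration only.
t1, t2 = 120.0, 100.0
n_procs = 2
s = speedup(t1, t2)
print(f"speedup = {s:.2f} (ideal = {n_procs})")
```

A value below 1 would mean the parallel run was slower than the single-processor run, for example when communication cost outweighs the shared computation.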
[Figure: 8-Mer DB Search Speedup. X-axis: Sequence Id (1 to 11). Y-axis: Speedup (0 to 1.4). Series: Speedup (APIC), Speedup (Ether).]