Graph Data Management Lab, School of Computer Science
Email: [email protected]
GDM@FUDAN
gdm.fudan.edu.cn
Luyiqi, gdm@fudan
Locus-based alignment storage & XMLSnippet
Summary of works
• Locus-based storage: a novel approach to the alignment output storage problem in clinical scenarios
• XMLSnippet
Background
• A sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of relationships between the sequences.
• Data storage costs have become an appreciable proportion of total cost.
• The rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity.
Related works
Reference-based compression
• Markus Hsi-Yang Fritz et al., 2011
SAM/BAM toolset
Basic Idea
Locus-based storage
Basic Idea
Objectives
• Minimize the number of intervals by renumbering the reads
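The objective can be made concrete with a small sketch. It assumes that, at each locus, the IDs of the reads showing a given ACGT value are stored as runs of consecutive integers (intervals), so a better numbering produces fewer runs. The toy `locus_table`, the read names, and both numberings are hypothetical and only illustrate the counting.

```python
def to_intervals(ids):
    """Collapse integer read IDs into sorted inclusive (start, end) runs."""
    runs = []
    for i in sorted(set(ids)):
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)  # extend the current run
        else:
            runs.append((i, i))          # start a new run
    return runs


def count_intervals(locus_table, numbering):
    """Total number of intervals over all loci and base values."""
    total = 0
    for bases in locus_table.values():
        for reads in bases.values():
            total += len(to_intervals(numbering[r] for r in reads))
    return total


# Hypothetical toy data: locus -> base -> names of reads showing that base.
locus_table = {
    100: {"A": ["r1", "r3"], "C": ["r2", "r4"]},
    101: {"G": ["r1", "r3"], "T": ["r2", "r4"]},
}
original = {"r1": 1, "r2": 2, "r3": 3, "r4": 4}
renumbered = {"r1": 1, "r3": 2, "r2": 3, "r4": 4}

print(count_intervals(locus_table, original))    # 8
print(count_intervals(locus_table, renumbered))  # 4
```

Placing reads that agree on many loci next to each other in the numbering is what turns scattered IDs into single intervals.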
Basic Idea
Solution
1. Reduce to the Travelling Salesman Problem (TSP) to get a suboptimal read numbering function (NF).
2. Generate the intervals under the new NF.
3. Further optimization.
Travelling Salesman Problem
• Input: a set C = {c1, ..., cn} of n cities and an n × n distance matrix d
• Problem: find a tour of C, i.e., a permutation cπ(1), ..., cπ(n), such that the length of the tour is minimal (or maximal)
Reduction
TSP output: a city permutation cπ(1), ..., cπ(n)
Our problem: a read permutation rπ(1), ..., rπ(n)
Problem mapping:
• Read => City
• d(ri, rj) = -|Lij|, where Lij = { l | ri and rj both map onto locus l and share the same ACGT value on l }.
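A minimal sketch of the mapping, under stated assumptions: each read is represented as a hypothetical `{locus: base}` dictionary, the distance is the negated count of loci on which two reads agree, and a greedy nearest-neighbour walk stands in for whatever TSP heuristic is actually used (the slides do not name one). The read names and bases below are made up.

```python
def shared_loci(read_a, read_b):
    """|Lij|: number of loci covered by both reads with the same base."""
    return sum(1 for locus, base in read_a.items() if read_b.get(locus) == base)


def distance_matrix(reads):
    """d(ri, rj) = -|Lij| for every ordered pair of distinct reads."""
    names = sorted(reads)
    dist = {
        (a, b): -shared_loci(reads[a], reads[b])
        for a in names for b in names if a != b
    }
    return names, dist


def nearest_neighbour_order(names, dist):
    """Greedy tour: always move to the unvisited read with the smallest distance."""
    order = [names[0]]
    remaining = set(names[1:])
    while remaining:
        last = order[-1]
        nxt = min(sorted(remaining), key=lambda r: dist[(last, r)])
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical reads: locus -> base observed in the read.
reads = {
    "ra": {1: "A", 2: "C", 3: "G"},
    "rb": {1: "T", 2: "T", 3: "T"},
    "rc": {1: "A", 2: "C", 3: "T"},
    "rd": {2: "T", 3: "T", 4: "A"},
}
names, dist = distance_matrix(reads)
order = nearest_neighbour_order(names, dist)
numbering = {name: i + 1 for i, name in enumerate(order)}
print(order)      # ['ra', 'rc', 'rb', 'rd']
print(numbering)  # {'ra': 1, 'rc': 2, 'rb': 3, 'rd': 4}
```

Because the distances are negative overlaps, a short tour keeps reads with many shared bases adjacent, which is what the renumbering needs.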
Example
[Figure: graph over reads r001, r006, r005, r002, r003, r004 with edge weights such as -5, -4, and -1]
Example (Re-numbering)
[Figure: reads listed as r001, r006, r005, r002, r003, r004]
Optimization Technique
Locality Optimization
• d(ri, rj) = 0 if ri and rj map to positions on the reference sequence that are far away from each other.
• Split the reference sequence into bins.
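One way to read the binning idea, sketched under assumptions: reads are placed in bins by their start position on the reference, and only reads in the same or an adjacent bin are treated as close enough to need a distance computation; every other pair simply gets d = 0. The bin size, the "same or adjacent bin" rule, and the start positions are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bin_reads(starts, bin_size):
    """Group read names by the reference bin that contains their start position."""
    bins = defaultdict(list)
    for name, start in starts.items():
        bins[start // bin_size].append(name)
    return bins


def candidate_pairs(starts, bin_size):
    """Pairs of reads in the same or an adjacent bin; all other pairs get d = 0."""
    bins = bin_reads(starts, bin_size)
    pairs = set()
    for b, members in bins.items():
        nearby = members + bins.get(b + 1, [])
        for a, c in combinations(sorted(nearby), 2):
            pairs.add((a, c))
    return pairs


# Hypothetical start positions on the reference sequence.
starts = {"r1": 3, "r2": 40, "r3": 120, "r4": 950}
print(sorted(candidate_pairs(starts, bin_size=100)))
# [('r1', 'r2'), ('r1', 'r3'), ('r2', 'r3')]
```

Here only 3 of the 6 possible pairs need an overlap computation; on real data with many distant reads the saving would be much larger.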
Optimization Technique
Ordered Intervals Refinement (current)

Intervals before refinement:
T  [5,6],10
C  4,[7,8]
G  [2,3],9,11
A  1,[12,15]

1. Determine the ACGT value order
2. Refine the intervals

Intervals after refinement:
T  [5,6],10
C  [4,8]
G  [2,11]
A  [1,15]

3. How to restore?
Optimization Technique
Restore

real_now = org_now - temp;
temp = temp + org_now;

Origin        Real
T [5,6],10    T [5,6],10
C [4,8]       C 4,[7,8]
G [2,11]      G [2,3],9,11
A [1,15]      A 1,[12,15]

temp: [5,6],10 -> [4,8],10 -> [2,11]

OK!
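The refinement and restore steps can be sketched as follows. The `restore` function follows the two update rules on the slide (`real_now = org_now - temp`, then `temp = temp + org_now`). The `refine` function is an inferred reading of the example, not something the slides spell out: a gap between two IDs of the current value is bridged only when every ID in the gap already belongs to an earlier value in the chosen order.

```python
def to_intervals(ids):
    """Collapse integer read IDs into sorted inclusive (start, end) runs."""
    runs = []
    for i in sorted(set(ids)):
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


def expand(intervals):
    """Inclusive (start, end) intervals -> set of read IDs."""
    return {i for start, end in intervals for i in range(start, end + 1)}


def refine(ordered):
    """ordered: list of (base, real ID set) in the chosen ACGT value order."""
    covered, stored = set(), []
    for base, real in ordered:
        ids = set(real)
        members = sorted(ids)
        for lo, hi in zip(members, members[1:]):
            gap = set(range(lo + 1, hi))
            if gap <= covered:  # gap holds only IDs owned by earlier values
                ids |= gap
        stored.append((base, to_intervals(ids)))
        covered |= ids
    return stored


def restore(stored):
    """Undo the refinement using the two update rules from the slide."""
    temp, result = set(), []
    for base, intervals in stored:
        org_now = expand(intervals)
        real_now = org_now - temp
        temp = temp | org_now
        result.append((base, real_now))
    return result


ordered = [
    ("T", {5, 6, 10}),
    ("C", {4, 7, 8}),
    ("G", {2, 3, 9, 11}),
    ("A", {1, 12, 13, 14, 15}),
]
stored = refine(ordered)
print(stored)
# [('T', [(5, 6), (10, 10)]), ('C', [(4, 8)]), ('G', [(2, 11)]), ('A', [(1, 15)])]
print(restore(stored) == ordered)  # True
```

The restore works because each read shows exactly one ACGT value at a locus, so any ID already in `temp` cannot belong to the current value and can be subtracted safely.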
Binary Format and Index
• Inverted table
• GZIP
Motivation of XMLSnippet
• Framework-based applications are very popular in today's commercial application development. The leading examples are J2EE applications.
• Frameworks vary, and each one has its own learning curve for newcomers.
• Most of these frameworks are open source and come from open communities:
  • The documentation may not be complete.
  • The programmer may not have enough time to master all the details.
XMLSnippet – Related work
Non-mining based
• Compare the context of the code under development with the code samples in the example repositories
• Interact with a code search engine to gather relevant code samples and perform static analysis over the gathered samples
Mining based
• Mine sequence association rules
Predefined
• Directly generate sub-elements/attributes of a certain element based on the predefined schema of the XML files
Our solution Basic idea
XMLSnippet – basic idea
XMLSnippet - Framework
XMLSnippet – Key techniques
• Closed frequent tree pattern mining
• XML tree pattern & syntax tree pattern mapping