
Post on 04-Jan-2016


Graph Data Management Lab, School of Computer Science

Email: [email protected]

GDM@FUDAN

gdm.fudan.edu.cn

Luyiqi

Locus-based alignment storage & XMLSnippet


Summary of works

Locus-based storage: A Novel Approach for the Alignments Output Storage Problem Facing Clinical Scenarios

XMLSnippet


Background

A sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of relationships between the sequences.

Data storage costs have become an appreciable proportion of the total cost.

The rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity.


Related works

Reference-based compression

• Markus Hsi-Yang Fritz et al., 2011

SAM/BAM toolset


Basic Idea

Locus-based storage


Basic Idea

Objectives

• Minimize the number of intervals by renumbering the reads.
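A minimal sketch of why renumbering matters, assuming (as the interval tables later in the deck suggest) that each locus stores, per base value, the numbers of the reads showing that base, collapsed into inclusive intervals. The helper names and the sample loci are hypothetical and only illustrate the objective.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # inclusive [start, end] range of read numbers


def to_intervals(read_ids: Iterable[int]) -> List[Interval]:
    """Collapse a set of read numbers into sorted inclusive intervals."""
    intervals: List[Interval] = []
    for rid in sorted(set(read_ids)):
        if intervals and rid == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], rid)  # extend the current run
        else:
            intervals.append((rid, rid))  # start a new run
    return intervals


def count_intervals(locus: Dict[str, Iterable[int]]) -> int:
    """Total number of intervals needed to store one locus."""
    return sum(len(to_intervals(ids)) for ids in locus.values())


# Hypothetical locus: base value -> numbers of the reads showing that base.
scattered = {"A": [1, 3, 5], "C": [2, 4, 6]}
renumbered = {"A": [1, 2, 3], "C": [4, 5, 6]}  # same grouping, new numbering
```

With the scattered numbering every read needs its own interval, while the renumbered locus needs only one interval per base value.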


Basic Idea

Solution

1. Reduce the problem to the Travelling Salesman Problem (TSP) to get a suboptimal read numbering function (NF).

2. Generate the intervals under the new NF.

3. Optimize further.


Travelling Salesman Problem

Input: a set C = {c1, ..., cn} of n cities and an n × n distance matrix d.

Problem: find a tour of C, i.e., a permutation cπ(1), ..., cπ(n), such that the length of the tour is minimal (or maximal).
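To make the definition concrete, here is a small sketch that computes the length of a closed tour and finds the shortest one by exhaustive search. The function names and the 4-city matrix are illustrative assumptions; exhaustive search is only feasible for a handful of cities and is not what a real solver would use.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def tour_length(d: Sequence[Sequence[float]], tour: Sequence[int]) -> float:
    """Length of the closed tour that visits the cities in the given order."""
    n = len(tour)
    return sum(d[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_tsp(d: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Exhaustive search over all tours that start at city 0."""
    n = len(d)
    best_tour = list(range(n))
    best_len = tour_length(d, best_tour)
    for rest in permutations(range(1, n)):
        tour = [0, *rest]
        length = tour_length(d, tour)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


# Hypothetical symmetric distance matrix for four cities.
d = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
best_tour, best_len = brute_force_tsp(d)
```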


Reduction

TSP output: a city permutation cπ(1), ..., cπ(n)

Our problem: a read permutation rπ(1), ..., rπ(n)

Problem mapping:

• Read => City

• d(ri, rj) = -|Lij|, where Lij = { l | ri and rj both map onto locus l and share the same ACGT value on l }.
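The mapping above can be sketched as follows. Each read is modelled as a hypothetical dictionary from locus to observed base, and the tour is built with a nearest-neighbour heuristic. The heuristic is an assumption made for illustration; the slides do not say which TSP solver is used.

```python
from typing import Dict, List

Read = Dict[int, str]  # locus -> base observed at that locus


def distance(a: Read, b: Read) -> int:
    """d(ri, rj) = -|Lij|: minus the number of loci where both reads map
    and show the same base."""
    return -sum(1 for locus, base in a.items() if b.get(locus) == base)


def greedy_order(reads: List[Read]) -> List[int]:
    """Nearest-neighbour tour: start at read 0 and repeatedly move to the
    closest unvisited read (assumed heuristic, not taken from the slides)."""
    if not reads:
        return []
    order = [0]
    unvisited = set(range(1, len(reads)))
    while unvisited:
        last = reads[order[-1]]
        # Ties are broken by the smaller index so the result is deterministic.
        nxt = min(unvisited, key=lambda j: (distance(last, reads[j]), j))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


# Hypothetical reads over loci 1..4.
reads = [
    {1: "A", 2: "C", 3: "G"},          # read 0
    {3: "T", 4: "T"},                  # read 1
    {1: "A", 2: "C", 3: "G", 4: "A"},  # read 2
    {3: "T", 4: "T"},                  # read 3
]
order = greedy_order(reads)  # visiting order of the original read indices
numbering = {old: new + 1 for new, old in enumerate(order)}  # 1-based new NF
```

Reads that agree on many loci end up with adjacent numbers, which is what lets their numbers collapse into a single interval at those loci.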


Example

[Figure: graph whose nodes are the reads r001, r006, r005, r002, r003, r004; visible edge weights: -5, -4, -1]


Example (Re-numbering)

[Figure: the reads r001, r006, r005, r002, r003, r004 after renumbering]


Optimization Technique

Locality Optimization

• d(ri, rj) = 0 if ri and rj map to positions on the reference sequence that are far away from each other.

• Split the reference sequence into bins.
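One way to read the two bullets above is sketched below: reads are grouped by the bin containing their start position, and the distance is computed only for reads in the same bin. The read representation (start position plus ungapped sequence), the bin size, and the "same bin" test for "not far away" are all assumptions; reads that overlap across a bin boundary are treated as far apart in this simplification.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[int, str]  # (leftmost mapping position, ungapped read sequence)


def bin_reads(reads: List[Read], bin_size: int) -> Dict[int, List[int]]:
    """Group read indices by the bin that contains their start position."""
    bins: Dict[int, List[int]] = defaultdict(list)
    for idx, (pos, _seq) in enumerate(reads):
        bins[pos // bin_size].append(idx)
    return dict(bins)


def distance(a: Read, b: Read, bin_size: int) -> int:
    """-|Lij| for reads in the same bin, 0 otherwise (treated as far apart)."""
    (pa, sa), (pb, sb) = a, b
    if pa // bin_size != pb // bin_size:
        return 0
    start, end = max(pa, pb), min(pa + len(sa), pb + len(sb))
    return -sum(1 for locus in range(start, end)
                if sa[locus - pa] == sb[locus - pb])


# Hypothetical reads on a reference split into bins of 50 positions.
reads = [(0, "ACGT"), (2, "GTAA"), (100, "ACGT")]
bins = bin_reads(reads, bin_size=50)
```

Because distances across bins are fixed at 0, the pairwise matrix only has to be computed inside each bin, which keeps the TSP instances small.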


Optimization Technique

Ordered Intervals Refinement (current)

1. Determine the ACGT value order.

T  [5,6],10
C  4,[7,8]
G  [2,3],9,11
A  1,[12,15]

2. Refine the intervals.

T  [5,6],10
C  [4,8]
G  [2,11]
A  [1,15]

3. How to restore?
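The refinement step can be read as gap filling: going through the base values in the chosen order, two neighbouring intervals of a base are merged when every read number between them already belongs to a base processed earlier. This reading is inferred from the tables on the slide, and the function names are hypothetical; the base order is taken as given.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # inclusive [start, end] range of read numbers


def to_intervals(read_ids: Iterable[int]) -> List[Interval]:
    """Collapse read numbers into sorted inclusive intervals."""
    out: List[Interval] = []
    for rid in sorted(set(read_ids)):
        if out and rid == out[-1][1] + 1:
            out[-1] = (out[-1][0], rid)
        else:
            out.append((rid, rid))
    return out


def refine(origin: Dict[str, Set[int]], order: List[str]) -> Dict[str, List[Interval]]:
    """Merge intervals of each base across gaps owned by earlier bases."""
    refined: Dict[str, List[Interval]] = {}
    seen: Set[int] = set()  # read numbers of the bases processed so far
    for base in order:
        merged: List[Interval] = []
        for start, end in to_intervals(origin[base]):
            if merged and all(i in seen for i in range(merged[-1][1] + 1, start)):
                merged[-1] = (merged[-1][0], end)  # gap is covered: merge
            else:
                merged.append((start, end))
        refined[base] = merged
        seen |= origin[base]
    return refined


# The locus from the slide, with the order T, C, G, A.
origin = {
    "T": {5, 6, 10},
    "C": {4, 7, 8},
    "G": {2, 3, 9, 11},
    "A": {1, 12, 13, 14, 15},
}
refined = refine(origin, ["T", "C", "G", "A"])
```

On this locus the number of stored intervals drops from nine to five.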


Optimization Technique

Restore

Origin         Real
T [5,6],10     T [5,6],10
C [4,8]        C 4,[7,8]
G [2,11]       G [2,3],9,11
A [1,15]       A 1,[12,15]

temp: [5,6],10 -> [4,8],10 -> [2,11]

real_now = org_now - temp;
temp = temp + org_now;
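The two restore rules can be sketched with plain sets, reading "-" as set difference and "+" as set union. The stored table is the refined one from the slide, processed in the same base order that was used for refinement; the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # inclusive [start, end] range of read numbers


def expand(intervals: List[Interval]) -> Set[int]:
    """Turn inclusive intervals back into the set of read numbers."""
    return {i for start, end in intervals for i in range(start, end + 1)}


def restore(stored: Dict[str, List[Interval]], order: List[str]) -> Dict[str, Set[int]]:
    """Recover the real read numbers of each base from the refined intervals."""
    real: Dict[str, Set[int]] = {}
    temp: Set[int] = set()
    for base in order:
        org_now = expand(stored[base])
        real[base] = org_now - temp  # real_now = org_now - temp
        temp |= org_now              # temp = temp + org_now
    return real


# Refined ("Origin") table from the slide, in the order T, C, G, A.
stored = {
    "T": [(5, 6), (10, 10)],
    "C": [(4, 8)],
    "G": [(2, 11)],
    "A": [(1, 15)],
}
real = restore(stored, ["T", "C", "G", "A"])
```

After T the accumulator holds [5,6],10, after C it holds [4,8],10, and after G it holds [2,11], matching the temp values shown on the slide.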


Binary Format and Index

Inverted Table

GZIP


Motivation of XMLSnippet

Framework-based applications are very popular in the current commercial application development environment. The leading framework-based applications are J2EE applications.

Frameworks vary, and each one has a learning curve for newcomers.

Most of these frameworks are open source and come from open communities.

• The documentation may not be complete.

• The programmer may not have enough time to master all the details.


XMLSnippet – Related work

Non-mining based

• Compare the context of the code under development with the code samples in the example repositories.

• Interact with a code search engine to gather relevant code samples and perform static analysis over the gathered samples.

Mining based

• Mine sequence association rules.

Predefined

• Directly generate sub-elements/attributes of a certain element based on the predefined schema of the XML files.


Our solution

Basic idea


XMLSnippet – basic idea


XMLSnippet – Framework


XMLSnippet – Key techniques

Closed frequent tree pattern mining

XML tree pattern & syntax tree pattern mapping