A Tool for Supporting Integration Across Multiple Flat-File Datasets Xuan Zhang, Gagan Agrawal Ohio State University


Page 1:

A Tool for Supporting Integration Across Multiple Flat-File Datasets

Xuan Zhang, Gagan Agrawal

Ohio State University

Page 2:

Outline

• Motivation

• System Overview

• System Implementation
  – Languages
  – Query Execution

• Experiments

Page 3:

Motivation

• Biological research requires
  – Accessing multiple heterogeneous data sources
    • Lack of a common data model and data format
  – Tracking multiple objects

• A motivating example: protein sequence analysis

Page 4:

An Example

>unknown sequence. ……MCMFGSSVIECPNPRIWFVWPYEFPLFLLPGGDRMEI……

[Workflow diagram: unknown sequence → NCBI protein-protein BLAST service → list of similar sequences → clustering analysis (ClustalW, RiPE, etc.) → predict protein function]

Format information
– Out (BLAST): fixed column width, partial name
– In (clustering analysis): FASTA, full name

Page 5:

Current Solution

• Manual
  – Copy-and-paste keyword search
  – Format conversion programs
  – NCBI link-out
• Database
  – Load data (BLAST output, sequence database)
  – Parse input; re-format output

Page 6:

Our Approach

• Join request between BLAST output and SWISSPROT (sequence database)

• Data maintained in flat files

• Query specification and data description are high-level, declarative

• Data parsing and query processing happen behind the scenes

Page 7:

Advantages

• Retrieve multiple pieces of information all at once

• Data easily available

• Declarative languages only

• High flexibility

• Low overhead

Page 8:

System Overview

[Architecture diagram. Components: query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, QUERYINFOR, DataReader, DataWriter, Synchronizer, Source data files, Target data file. Labels: Source/target names; Schema & layout information; Mappings.]

Page 9:

Outline

• Motivation
• System Overview
• System Implementation
  – Languages
    • Query Language
    • Metadata Description Language
  – System
    • Query Analysis
    • Query Execution

• Experiments

Page 10:

Query Language

• Declarative, SQL-like
• Projection, selection, cross product, join queries
• Example

AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
  POSTBLAST.QUERY = BLASTP.QUERY
  POSTBLAST.SP_AC = BLASTP.SP_AC
  POSTBLAST.SP_ID = BLASTP.SP_ID
  POSTBLAST.FULL_DESCR = SWISSPROT.DE
  POSTBLAST.SEQUENCE = SWISSPROT.SQ
  POSTBLAST.SCORE = BLASTP.SCORE
  POSTBLAST.E_VALUE = BLASTP.E_VALUE

Clause annotations:
– AUTOWRAP: target dataset
– FROM: source datasets
– BY: join criteria
– WHERE: attribute pairs
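To make the four clauses concrete, here is a minimal Python sketch that splits such a query into target, sources, join criterion, and attribute pairs. The grammar is inferred from the single example on this slide; `AutowrapQuery` and `parse_autowrap` are hypothetical names, not part of the tool, and the sketch assumes exactly one join condition and one attribute pair per line.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AutowrapQuery:
    target: str                 # dataset named after AUTOWRAP
    sources: list               # datasets listed after FROM
    join: tuple                 # (left attribute, right attribute) from BY
    mappings: dict = field(default_factory=dict)  # target attr -> source attr


def parse_autowrap(text):
    """Split one query into its AUTOWRAP / FROM / BY / WHERE parts."""
    pattern = re.compile(
        r"AUTOWRAP\s+(?P<target>\w+)\s+"
        r"FROM\s+(?P<sources>.+?)\s+"
        r"BY\s+(?P<join>.+?)\s+"
        r"WHERE\s+(?P<pairs>.+)",
        re.DOTALL,
    )
    m = pattern.search(text)
    if m is None:
        raise ValueError("not an AUTOWRAP query")
    sources = [s.strip() for s in m.group("sources").split(",")]
    left, right = (side.strip() for side in m.group("join").split("="))
    mappings = {}
    for line in m.group("pairs").splitlines():
        if "=" in line:
            tgt, src = (side.strip() for side in line.split("="))
            mappings[tgt] = src
    return AutowrapQuery(m.group("target"), sources, (left, right), mappings)


QUERY = """
AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
POSTBLAST.QUERY = BLASTP.QUERY
POSTBLAST.SEQUENCE = SWISSPROT.SQ
"""

query = parse_autowrap(QUERY)
print(query.target, query.sources, query.join)
```

The parsed mappings play the role of the "attribute pairs" callout: each target attribute is paired with the source attribute that fills it.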

Page 11:

Metadata Description Language

• One descriptor for each flat-file dataset
  – Schema
  – Layout

• Re-usable by different queries

• Can be learned semi-automatically using data mining techniques

• Example: BLAST output

Page 12:

Schema Descriptors

• Written in XML DTD format

• Example

<?xml version='1.0' encoding='UTF-8'?>

<!ELEMENT BLASTP (QUERY, SEQUENCE*)>

<!ELEMENT QUERY (#PCDATA)>

<!ELEMENT SEQUENCE (SP_AC, SP_ID, DESCR, SCORE, E_VALUE)>

<!ELEMENT SP_AC (#PCDATA)>

<!ELEMENT SP_ID (#PCDATA)>

<!ELEMENT DESCR (#PCDATA)>

<!ELEMENT SCORE (#PCDATA)>

<!ELEMENT E_VALUE (#PCDATA)>
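As an illustration of how such a schema could be consumed, the sketch below reads the `<!ELEMENT …>` declarations into a small tree and lists the leaf attributes that sit under a repeated element (here, everything under `SEQUENCE*`). It is a simplification that handles only the comma-separated content models shown on this slide; `parse_schema` and `repeated_leaves` are hypothetical helper names, and treating "under a repeated element" as the marker for a multi-valued attribute is an assumption.

```python
import re

DTD = """
<!ELEMENT BLASTP (QUERY, SEQUENCE*)>
<!ELEMENT QUERY (#PCDATA)>
<!ELEMENT SEQUENCE (SP_AC, SP_ID, DESCR, SCORE, E_VALUE)>
<!ELEMENT SP_AC (#PCDATA)>
<!ELEMENT SP_ID (#PCDATA)>
<!ELEMENT DESCR (#PCDATA)>
<!ELEMENT SCORE (#PCDATA)>
<!ELEMENT E_VALUE (#PCDATA)>
"""

ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)>")


def parse_schema(dtd):
    """Return {element: [(child, repeated)]}; leaf elements map to []."""
    schema = {}
    for name, content in ELEMENT_RE.findall(dtd):
        children = []
        for part in content.split(","):
            part = part.strip()
            if part and part != "#PCDATA":
                repeated = part.endswith(("*", "+"))
                children.append((part.rstrip("*+?"), repeated))
        schema[name] = children
    return schema


def repeated_leaves(schema, root):
    """Leaf attributes that can occur a varying number of times per entry."""
    found = []

    def walk(name, repeated):
        children = schema[name]
        if not children:
            if repeated:
                found.append(name)
            return
        for child, child_repeated in children:
            walk(child, repeated or child_repeated)

    walk(root, False)
    return found


schema = parse_schema(DTD)
print(repeated_leaves(schema, "BLASTP"))
```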

Page 13:

Layout Descriptors

• Example

DATASET "BLASTP" {
  DATATYPE {BLASTP}
  DATASPACE LINESIZE = 90 {
    … …
  }
  DATA {data/Blast_htm.txt}
}

Annotations:
– DATASET: dataset name
– DATATYPE: schema name
– DATASPACE: file layout
– DATA: file location

Page 14:

Description of File Layout

• Layout descriptor

"BLASTP" VERSION
… …
"Query=" QUERY
"\nDatabase:" DB_NAME
<
  "\nsp|" SP_AC
  "|" SP_ID
  " " DESCR
  " " SCORE
  " " E_VALUE
>
"\n\nALIGNMENT" DUMMY

• Actual data file

BLASTP 2.2.11 [Jun-05-2005]
Reference: … …
RID: … …
Query= Random 50 residue protein sequence.
Database: Non-redundant SwissProt sequences
175,661 sequences; 64,716,374 total letters

Score E
Sequences producing significant alignments: (Bits) Value

sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosin... 30.0 1.5
sp|P00352|AL1A1_HUMAN Oxygen-independent coproporphyrinogen II... 28.1 5.7
sp|P40530|YIE2_YEAST Oxygen-independent coproporphyrinogen II... 28.1 5.7

ALIGNMENTS
>sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosine-specific … …
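The delimiter/variable pairing can be illustrated with a short Python sketch: each value starts right after its delimiter and ends where the next delimiter begins. This is not the tool's reader; `read_values` is a hypothetical helper, and because a single-space delimiter is ambiguous inside a free-text description, the sketch assumes the two trailing numeric columns can be split off from the right.

```python
def read_values(text, pairs):
    """Scan `text` left to right through a list of (delimiter, variable) pairs.

    Each value starts right after its own delimiter and ends where the
    next pair's delimiter begins (or at the end of the text).
    """
    values = {}
    pos = 0
    for i, (delimiter, variable) in enumerate(pairs):
        start = text.find(delimiter, pos)
        if start < 0:
            break  # delimiter missing: stop early
        start += len(delimiter)
        end = len(text)
        if i + 1 < len(pairs):
            nxt = text.find(pairs[i + 1][0], start)
            if nxt >= 0:
                end = nxt
        values[variable] = text[start:end].strip()
        pos = end
    return values


HEADER = (
    "Query= Random 50 residue protein sequence.\n"
    "Database: Non-redundant SwissProt sequences"
)
HIT = "sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosin... 30.0 1.5"

header = read_values(HEADER, [("Query=", "QUERY"), ("\nDatabase:", "DB_NAME")])
hit = read_values(HIT, [("sp|", "SP_AC"), ("|", "SP_ID"), (" ", "REST")])

# Assumption: SCORE and E_VALUE are the last two whitespace-separated fields.
hit["DESCR"], hit["SCORE"], hit["E_VALUE"] = hit.pop("REST").rsplit(None, 2)

print(header)
print(hit)
```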

Page 15:

Query Analysis

[Diagram: the System Overview architecture again. Components: query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, QUERYINFOR, DataReader, DataWriter, Synchronizer, Source data files, Target data file. Labels: Source/target names; Schema & layout information; Mappings.]

Page 16:

Terminology

• DLM-VAR node/pair
  – A pairing of a delimiter and an attribute value
  – E.g., "Query=" QUERY
• Reachability
  – DLM-VAR node r is reachable from node a iff configuration "ar" is allowed by the layout description
• Regular vs. semi-structured attribute
  – Regular: fixed number of values per entry
  – Semi-structured: varying number of values per entry
• Number vs. index
  – Label for layout node vs. schema node
  – One index per number; one or more numbers per index
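One way to picture reachability is to compute, for each DLM-VAR node, the nodes that may directly follow it. The Python sketch below does this for the BLASTP layout, with the elided parts left out. It assumes that a `< … >` group may repeat zero or more times (matching `SEQUENCE*` in the schema); the nested-list encoding and the function names are hypothetical, not the tool's data structures.

```python
# Layout as a sequence of variable names; a nested list is a repeated group.
LAYOUT = [
    "VERSION",
    "QUERY",
    "DB_NAME",
    ["SP_AC", "SP_ID", "DESCR", "SCORE", "E_VALUE"],  # the < ... > group
    "DUMMY",
]


def first_nodes(items, after):
    """Nodes that can come first in `items`, falling through to `after`."""
    out = []
    for item in items:
        if isinstance(item, list):
            out += first_nodes(item, [])  # the group may be entered ...
        else:
            out.append(item)              # ... but a plain node cannot be skipped
            return out
    return out + list(after)


def reachable_table(items, after=()):
    """Map each node a to the list of nodes r such that "ar" is allowed."""
    table = {}
    for i, item in enumerate(items):
        following = first_nodes(items[i + 1:], after)
        if isinstance(item, list):
            # The end of a group can loop back to its start or leave the group.
            table.update(reachable_table(item, first_nodes(item, []) + following))
        else:
            table[item] = following
    return table


table = reachable_table(LAYOUT)
for node, successors in table.items():
    print(node, "->", successors)
```

Under these assumptions, a value of `E_VALUE` ends either where the next `SP_AC` delimiter starts or where the `DUMMY` delimiter starts, which is the kind of information an "end of value" look-up needs.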

Page 17:

Application Analyzer

1. Label schema and layout tree
2. Query analysis
  – Record layout information
    • Delimiter look-up table
  – Draw correspondence between schema and layout
    • Label look-up table
  – Collect constant values in query
    • Pseudo-label look-up table
  – Calculate reachable nodes
    • Reachable look-up table
  – Other information
    • Parameters

QUERYINFOR

Page 18:

QUERY-PROC Structure

• Three general action modules
  – DataReader
  – DataWriter
  – Synchronizer
• One query-specific data module
  – QUERYINFOR

[Diagram: QUERYINFOR, DataReader, DataWriter, Synchronizer; Source 1, Source 2, Target]

Page 19:

QUERY-PROC Structure (cont.)

• One value buffer
  – Configuration varies from query to query
  – Accessible to the three general modules

[Diagram: value buffer with slots QUERY, SP_ID, SCORE, E_VALUE, SP_AC and BLASTP; labels: Source 1, Source 2; Regular, Semi-structured]

Page 20:

QUERY-PROC Action

• DataReader
  – Extract attribute value
    • Start: Delimiter look-up table
    • End: Reachable look-up table
  – Fill value buffer: Label look-up table
• DataWriter
  – Retrieve from value buffer: Label look-up table
  – Write target file: Delimiter look-up table
    • Truncate or wrap: Reachable look-up table + Label look-up table

Page 21:

QUERY-PROC Action (cont.)

• Synchronizer
  – Set up pseudo-attributes: Pseudo-label look-up table
  – Call DataReader on source 1 and 2, call DataWriter on target: Parameters
  – Test join conditions: Parameters
  – Clean value buffer: Parameters

Page 22:

Outline

• Motivation
• System Overview
• System Implementation
  – Languages
  – System
• Experiments

Page 23:

Post-BLAST Query

• Enhance BLAST output

• Join query between BLAST output and SWISSPROT

• Results in FASTA format

• 2 modes
  – UNIQUE: halt once a match is found in source 2
  – ALL: search all source 2 entries

[Figure: execution time (sec, axis 0 to 2000) vs. query size (sequence number: 3, 5, 12), for UNIQUE and ALL modes]
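The difference between the UNIQUE and ALL modes described on this slide can be sketched as a nested-loop join in Python. This is only an illustration: whether the tool scans source 2 this way is an assumption, `nested_loop_join` is a hypothetical name, and the `SQ` values are placeholders rather than real sequences.

```python
def nested_loop_join(source1, source2, key1, key2, mode="ALL"):
    """Join two lists of records; return (joined records, comparisons made).

    UNIQUE stops scanning source 2 at the first match for each source 1
    entry; ALL always scans every source 2 entry.
    """
    if mode not in ("UNIQUE", "ALL"):
        raise ValueError("mode must be 'UNIQUE' or 'ALL'")
    results = []
    comparisons = 0
    for left in source1:
        for right in source2:
            comparisons += 1
            if left[key1] == right[key2]:
                results.append({**left, **right})
                if mode == "UNIQUE":
                    break
    return results, comparisons


# Source 1: BLAST hits (values taken from the sample output).
blast_hits = [
    {"SP_ID": "AL1A1_SHEEP", "SCORE": "30.0"},
    {"SP_ID": "AL1A1_HUMAN", "SCORE": "28.1"},
]
# Source 2: sequence database entries (SQ values are placeholders).
swissprot = [
    {"ID": "AL1A1_HUMAN", "SQ": "<placeholder sequence 1>"},
    {"ID": "AL1A1_SHEEP", "SQ": "<placeholder sequence 2>"},
    {"ID": "YIE2_YEAST", "SQ": "<placeholder sequence 3>"},
]

unique_rows, unique_cmp = nested_loop_join(blast_hits, swissprot, "SP_ID", "ID", "UNIQUE")
all_rows, all_cmp = nested_loop_join(blast_hits, swissprot, "SP_ID", "ID", "ALL")
print(len(unique_rows), unique_cmp)
print(len(all_rows), all_cmp)
```

When join keys are unique in source 2, both modes return the same rows, and UNIQUE simply does fewer comparisons.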

Page 24:

Chip-supplement Query

• Look up information on microarray genes

• Join query between protein array and yeast genome database

• Results in tabular form
• 2 queries
  – Chip-Supplement: array join genome
  – Chip-Supplement-Sorted: genome join array

[Figure: execution time (sec, axis 0 to 90) by query type (Chip-Supplement, Chip-Supplement-Sorted), for UNIQUE and ALL modes]

Page 25:

OMIM-plus Query

• Add reverse links of proteins to disease database

• Join query between OMIM database and SWISSPROT database

• Results in OMIM form

• 86.38 seconds/entry × 12,158 OMIM entries ≈ 1,050,208 seconds ≈ 291.7 hours

Page 26:

Summary

• A data integration tool

• Answers queries over flat-file datasets

• Lightweight
  – Modest programming effort
  – No DBMS
  – Various flat-file formats supported