
Supporting High-Performance Data Processing on Flat-Files

Xuan Zhang

Gagan Agrawal

Ohio State University


Motivation

• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential

Figure provided by PDB


Existing Solutions

– (Relational) databases
  • Support for indexing and high-level queries
  • Not suitable for biological data
– Flat files with scripts
  • Compact, Perl scripts available
  • Lack indexing and high-level query processing
– Web services
  • Significant overhead


Our Approach

• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat-file data processing
  – Usability
    • Declarative interface
    • Low programming requirement
  – Performance
    • Incorporate indexing support


Approach Summary

• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module


System Overview

[Figure: system overview diagram with two stages, Understand Data and Process Data. Labelled elements: Data File, Layout Miner, Schema Miner, Metadata Description (Layout Descriptor and Schema Descriptor pairs), Code Generation, Request Processor, User Request, Answer, Information Integration System.]


Advantages

• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• OK performance
  – Linear scale guaranteed
  – Can improve by using indexing


System Components

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query processing
  – Query processing with indices


Data Process Overview

• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module


Metadata Description

• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy for writing and interpretation


Schema Descriptors

• Follow XML DTD standard for semi-structured data

    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
    <!ELEMENT ID (#PCDATA)>
    <!ELEMENT DESCRIPTION (#PCDATA)>
    <!ELEMENT SEQ (#PCDATA)>

• Simple attribute list for relational data

    [FASTA]               //Schema Name
    ID = string           //Data type definitions
    DESCRIPTION = string
    SEQ = string
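To illustrate how little is needed to interpret the simple attribute-list form, here is a minimal sketch of a reader for it. The function name `parse_schema` and the treatment of `//` as a trailing comment marker are assumptions made for this example; the slides do not show the system's actual parser.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor into (schema_name, fields)."""
    name = None
    fields = {}
    for raw in text.splitlines():
        # Assumption: everything after "//" on a line is a comment.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()          # e.g. [FASTA]
        elif "=" in line:
            attr, dtype = (part.strip() for part in line.split("=", 1))
            fields[attr] = dtype               # e.g. ID = string
    return name, fields


FASTA_SCHEMA = """
[FASTA]               //Schema Name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string
"""

name, fields = parse_schema(FASTA_SCHEMA)
print(name, fields)
```

The returned dictionary keeps the attributes in declaration order, which matters if the order of fields in the schema is meant to mirror their order in the file.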


Layout Descriptors

• Overall structure (FASTA example)

    DATASET "FASTAData" {          //Dataset name
      DATATYPE {FASTA}             //Schema name
      DATASPACE LINESIZE=80 {
        // ---- File layout details go here ----
      }
      DATA {osu/fasta}             //File location
    }
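As a rough illustration of how the top-level fields of such a layout descriptor could be picked out, the sketch below uses regular expressions. It is an assumption-laden example, not the system's layout parser: `parse_layout_header` is a hypothetical name, the body of `DATASPACE` is ignored, and `//` is assumed to start a comment (so a file location containing `//` would be cut short).

```python
import re

LAYOUT = '''
DATASET "FASTAData" {          //Dataset name
  DATATYPE {FASTA}             //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}             //File location
}
'''


def strip_comments(text):
    """Drop everything after '//' on each line (assumed comment syntax)."""
    return "\n".join(line.split("//", 1)[0] for line in text.splitlines())


def parse_layout_header(text):
    """Extract dataset name, schema name, line size and file location."""
    body = strip_comments(text)

    def first(pattern):
        match = re.search(pattern, body)
        return match.group(1) if match else None

    linesize = first(r'DATASPACE\s+LINESIZE\s*=\s*(\d+)')
    return {
        "dataset": first(r'DATASET\s+"([^"]+)"'),
        "schema": first(r'DATATYPE\s*\{\s*([^}\s]+)\s*\}'),
        "linesize": int(linesize) if linesize is not None else None,
        # \bDATA followed by "{" does not match DATASET/DATATYPE/DATASPACE.
        "location": first(r'\bDATA\s*\{\s*([^}\s]+)\s*\}'),
    }


info = parse_layout_header(LAYOUT)
print(info)
```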


Wrapper Generation: System Overview

[Figure: wrapper generation system diagram. Labelled elements: Layout Descriptor, Layout Parser, Data Entry Representation, Schema Descriptors, Mapping Generator, Schema Mapping, Mapping File, Mapping Parser, Application Analyzer, WRAPINFO, and the generated wrapper (DataReader, Synchronizer, DataWriter) between Source Dataset and Target Dataset.]


Query With Indices: Motivation

• Goal
  – Improve the performance of the query-proc program
• Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming


Challenges & Approaches

• Various indexing algorithms for various biological data
  – User-defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor
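The "byte offset as pointer" idea can be sketched as follows: an index maps a key to the position in the flat file where its record starts, and retrieval seeks straight to that position. The function names and the FASTA-style toy data are assumptions for illustration; the key-extraction function `fasta_id` stands in for a user-defined indexing function and is not taken from the slides.

```python
import io


def build_offset_index(stream, key_of):
    """Map key -> byte offset of each record's header line (binary stream)."""
    index = {}
    while True:
        offset = stream.tell()          # position before reading the line
        line = stream.readline()
        if not line:
            break
        if line.startswith(b">"):       # FASTA-style record header
            index[key_of(line)] = offset
    return index


def fetch_record(stream, offset):
    """Seek to a stored offset and read one record up to the next header."""
    stream.seek(offset)
    lines = [stream.readline()]
    while True:
        line = stream.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")


def fasta_id(header_line):
    """Hypothetical user-defined key function: first token after '>'."""
    return header_line[1:].split()[0].decode("ascii")


data = io.BytesIO(b">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n")
index = build_offset_index(data, fasta_id)
print(index)
print(fetch_record(data, index["seq2"]))
```

Because the index stores only keys and offsets, the data itself stays in the original flat file, which is consistent with the goal of keeping the approach flat file based.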


System Revisited

[Figure: query system diagram in two phases, Query analysis and Query execution. Labelled elements: query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, QUERYINFOR, Source/target names, Schema & Layout information mappings, DataReader, Synchronizer, DataWriter, Source data files, Target data file, Index file, Index functions.]


Language Enhancement

• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors

    DATASET "name" {
      …
      INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
             [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
    }

  – Maintain query format

    AUTOWRAP GNAMES
    FROM CHIPDATA, YEASTGENOME
    BY CHIPDATA.GENE = YEASTGENOME.ID
    WHERE …

New meaning of "=":
  If index available, use index retrieving function
  Else, compare values directly
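The extended "=" semantics can be sketched as a lookup that prefers an index retrieval function and falls back to direct comparison. Everything named here is hypothetical: `parse_index_clause`, `join_lookup`, the sample clause string, and the toy records are illustrations only, and the sketch assumes that no field of an INDEX entry contains a colon or comma.

```python
from collections import namedtuple

IndexSpec = namedtuple(
    "IndexSpec", "attribute index_file gen_fun retr_fun fun_loc")


def parse_index_clause(clause):
    """Parse 'attr:file:gen:retr:loc[, ...]' into {attribute: IndexSpec}."""
    specs = {}
    for entry in clause.split(","):
        parts = [p.strip() for p in entry.split(":")]
        if len(parts) != 5:
            raise ValueError(f"expected 5 colon-separated fields: {entry!r}")
        spec = IndexSpec(*parts)
        specs[spec.attribute] = spec
    return specs


def join_lookup(value, attribute, records, specs, retrievers):
    """'=' semantics: use the index retrieval function if one is available,
    otherwise compare values directly with a linear scan."""
    spec = specs.get(attribute)
    if spec is not None and spec.retr_fun in retrievers:
        return retrievers[spec.retr_fun](value)
    return [r for r in records if r[attribute] == value]


# Toy data and hypothetical names, for illustration only.
genome = [{"ID": "G1", "DESC": "toy gene one"},
          {"ID": "G2", "DESC": "toy gene two"}]
by_id = {r["ID"]: [r] for r in genome}
specs = parse_index_clause("ID:id.idx:make_id_index:get_by_id:lib/idx")
retrievers = {"get_by_id": lambda v: by_id.get(v, [])}

indexed = join_lookup("G2", "ID", genome, specs, retrievers)
scanned = join_lookup("G2", "ID", genome, {}, {})
print(indexed == scanned)
```

Both paths return the same records; only the cost differs, which is why the query text itself does not need to change when an index is added.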


System Enhancement

• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing


Microarray Gene Information Look-up

• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome

[Figure: bar chart of performance (sec), y-axis 0 to 90. Categories: query analysis, index generation, query with indices, query w/o indices. Data labels as extracted, in order: 0.01, 0.72, 20.89, 81.59.]


BLAST-ENHANCE Query

• Goal: add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot

[Figure: chart of performance (sec), y-axis 0 to 1200. Series: index generation, query w/ indices, query w/o indices. X-axis ticks: 3, 5, 12.]


OMIM-PLUS Query

• Goal: add Swiss-Prot link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot

[Figure: chart of performance (sec) on a logarithmic y-axis from 1 to 10000000. Series: index generation, query w/ indices, query w/o indices.]


Homology Search Query

• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values
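One simple way to picture "transformation of sub-string composition" into n-D numerical values is a k-mer frequency vector, sketched below. This is a generic illustration chosen for this example, not the specific algorithms named on the following slides (which also involve wavelet coefficients); the function names, the DNA alphabet, and the choice of Euclidean distance are all assumptions.

```python
from itertools import product


def composition_vector(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies as a fixed-length n-D vector,
    where n = len(alphabet) ** k."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:              # skip k-mers with unknown symbols
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Distance between two composition vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


query = composition_vector("ACGTACGT")
close = composition_vector("ACGTACGA")
far = composition_vector("GGGGCCCC")
print(euclidean(query, close) < euclidean(query, far))
```

Once sequences are mapped to fixed-length numeric vectors, a multidimensional index structure can be used to narrow the candidates before any exact sequence comparison.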


Homology Search (1)

• Index (Singh's algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles

[Figure: chart of performance (sec), y-axis 0 to 350, against database size (9.8MB), x-axis ticks 1 to 5. Legend entries: Index generation, 10, 20, 40.]


Homology Search (2)

• Index (Ferhatosmanoglu's algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree

[Figure: chart of performance (sec), y-axis 0 to 30, against database size (250MB), x-axis ticks 1 to 5. Legend entries: 10, 20, 40.]


Conclusions

• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
  – Support for indexing incorporated flexibly