Supporting High-Performance Data Processing on Flat-Files

Xuan Zhang

Gagan Agrawal

Ohio State University

Motivation

• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential

Figure provided by PDB

Existing Solutions

– (Relational) Databases
  • Support for indexing and high-level queries
  • Not suitable for biological data
– Flat Files with Scripts
  • Compact, Perl scripts available
  • Lack indexing and high-level query processing
– Web services
  • Significant overhead

• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat-file data processing
  – Usability
    • Declarative interface
    • Low programming requirement
  – Performance
    • Incorporate indexing support

Our Approach

Approach Summary

• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module

System Overview

[Figure: system overview of the information integration system. Understand data: Data File, Layout Miner, Schema Miner, Metadata Description (Layout Descriptor, Schema Descriptor). Process data: User Request, Code Generation, Request Processor, Answer.]

Advantages

• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• Acceptable performance
  – Linear scaling guaranteed
  – Can be improved by using indexing

System Components

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query processing
  – Query processing with indices

Data Process Overview

• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module
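The split between general modules and a task-specific data module can be pictured with a minimal sketch. It is not the generated code from this system: the class `DataModule`, the function `run_request`, and the tab-separated sample data are all hypothetical names chosen for illustration. The point is only that the processing loop carries no dataset-specific logic, while the plug-in supplies parsing and formatting.

```python
class DataModule:
    """Hypothetical task-specific plug-in: parses and formats one dataset."""

    def __init__(self, parse_line, format_record):
        self.parse_line = parse_line
        self.format_record = format_record


def run_request(lines, module, keep):
    """Hypothetical general module: read, filter, and write records.

    Nothing here depends on the dataset; all format knowledge lives in `module`.
    """
    out = []
    for line in lines:
        record = module.parse_line(line)
        if keep(record):
            out.append(module.format_record(record))
    return out


# A plug-in for a made-up tab-separated file with two attributes.
tab_module = DataModule(
    parse_line=lambda line: dict(zip(("ID", "SCORE"), line.rstrip("\n").split("\t"))),
    format_record=lambda rec: f"{rec['ID']},{rec['SCORE']}",
)

result = run_request(
    ["a\t1\n", "b\t7\n"],
    tab_module,
    keep=lambda rec: int(rec["SCORE"]) > 5,
)
print(result)
```

Swapping in a different `DataModule` changes the dataset handled without touching `run_request`, which is the reuse the slide describes.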

Metadata Description

• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy for writing and interpretation

Schema Descriptors

• Follow XML DTD standard for semi-structured data

<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

• Simple attribute list for relational data

[FASTA]               //Schema name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string
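To make the attribute-list form concrete, here is a minimal sketch of how such a schema descriptor could be read into a schema name and an attribute-to-type mapping. The function name `parse_schema` is an assumption for illustration and is not part of the system described here; the sketch also assumes one attribute per line and `//` for comments, as in the FASTA example.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor into (name, {attribute: type})."""
    name = None
    attributes = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]  # schema name, e.g. [FASTA]
        else:
            attr, _, dtype = line.partition("=")
            attributes[attr.strip()] = dtype.strip()
    return name, attributes


descriptor = """
[FASTA]               //Schema name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string
"""

schema_name, schema_attrs = parse_schema(descriptor)
print(schema_name, schema_attrs)
```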

Layout Descriptors

• Overall structure (FASTA example)

DATASET "FASTAData" {            //Dataset name
  DATATYPE {FASTA}               //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}               //File location
}
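The layout details are elided on the slide, so the following is only a sketch of the kind of reader and writer such a descriptor could drive. It assumes the conventional FASTA layout (a `>` header line holding the ID and description, followed by sequence lines wrapped at `LINESIZE` characters); the function names `read_fasta` and `format_fasta` are hypothetical.

```python
import io

LINESIZE = 80  # taken from the DATASPACE clause in the example


def read_fasta(handle):
    """Yield one dict per entry with the schema attributes ID, DESCRIPTION, SEQ."""
    entry = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if entry is not None:
                yield entry
            ident, _, desc = line[1:].partition(" ")
            entry = {"ID": ident, "DESCRIPTION": desc, "SEQ": ""}
        elif entry is not None and line:
            entry["SEQ"] += line  # rejoin the wrapped sequence lines
    if entry is not None:
        yield entry


def format_fasta(record, linesize=LINESIZE):
    """Write one entry back out, wrapping the sequence at `linesize` characters."""
    header = ">" + record["ID"]
    if record["DESCRIPTION"]:
        header += " " + record["DESCRIPTION"]
    seq = record["SEQ"]
    body = [seq[i:i + linesize] for i in range(0, len(seq), linesize)]
    return "\n".join([header] + body) + "\n"


sample = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2 second example\nGGCC\n")
records = list(read_fasta(sample))
print(records)
print(format_fasta(records[0]), end="")
```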

Wrapper Generation: System Overview

[Figure: wrapper generation system. Inputs: Mapping File, Schema Descriptors, Layout Descriptor. Components: Mapping Parser, Mapping Generator, Schema Mapping, Layout Parser, Data Entry Representation, Application Analyzer, WRAPINFO. Generated wrapper: DataReader, Synchronizer, DataWriter, between the Source Dataset and the Target Dataset.]

Query With Indices: Motivation

• Goal
  – Improve the performance of the query-processing program
• Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming

Challenges & Approaches

• Various indexing algorithms for various biological data
  – User-defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor
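The "byte offset as pointer" idea can be sketched as follows. This is an illustration under assumptions, not the system's index format: the index here is an in-memory dictionary rather than an index file, the data is assumed to be FASTA-like with `>` header lines, and `build_offset_index` and `fetch_entry` are hypothetical names.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each entry ID to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()  # position before reading the line
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_entry(path, offset):
    """Seek straight to an entry and return its text up to the next header."""
    lines = []
    with open(path, "rb") as f:
        f.seek(offset)
        lines.append(f.readline())
        for line in f:
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Small made-up flat file for the demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">g1 first gene\nACGT\n>g2 second gene\nTTGA\nCCAA\n")
    demo_path = tmp.name

offset_index = build_offset_index(demo_path)
entry_g2 = fetch_entry(demo_path, offset_index["g2"])
os.remove(demo_path)

print(offset_index)
print(entry_g2, end="")
```

Because the pointer is a plain byte offset, the data stays in its original flat file and a lookup costs one seek instead of a scan.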

System Revisited

[Figure: query system with index support. Query analysis: query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, source/target names, schema and layout information mappings, QUERYINFOR. Query execution: DataReader, Synchronizer, DataWriter, Source data files, Target data file, Index file, Index functions.]

Language Enhancement

• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors
  – Maintain query format

DATASET "name" {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
         [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}

AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …

New meaning of "=": if an index is available, use the index retrieving function; else, compare values directly.
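A minimal sketch of the two ideas above: reading the colon-separated `INDEX` entries, and giving `=` its index-aware meaning. The names `IndexSpec`, `parse_index_clause`, and `match_rows`, along with every file location and function name in the sample clause, are hypothetical and chosen for illustration; the retrieving function is a plain Python callable standing in for a user-supplied one.

```python
from typing import NamedTuple


class IndexSpec(NamedTuple):
    attribute: str
    index_file_loc: str
    index_gen_fun: str
    index_retr_fun: str
    fun_loc: str


def parse_index_clause(body):
    """Parse the text inside INDEX { ... } into IndexSpec entries keyed by attribute."""
    specs = {}
    for item in body.split(","):
        fields = [part.strip() for part in item.split(":")]
        if len(fields) != 5:
            raise ValueError(f"expected 5 colon-separated fields, got {item!r}")
        spec = IndexSpec(*fields)
        specs[spec.attribute] = spec
    return specs


def match_rows(key, rows, attribute, retrievers):
    """Evaluate 'attribute = key': use a retrieving function if present, else scan."""
    retrieve = retrievers.get(attribute)
    if retrieve is not None:
        return retrieve(key)  # index available
    return [row for row in rows if row[attribute] == key]  # direct comparison


# Hypothetical clause: all names and locations below are made up.
specs = parse_index_clause("ID:idx/id.idx:gen_id_index:get_by_id:lib/idx")

rows = [{"ID": "g1", "NAME": "alpha"}, {"ID": "g2", "NAME": "beta"}]
by_id = {row["ID"]: [row] for row in rows}  # stands in for an index file
retrievers = {"ID": lambda key: by_id.get(key, [])}

indexed = match_rows("g2", rows, "ID", retrievers)      # goes through the index
scanned = match_rows("beta", rows, "NAME", retrievers)  # no index: compares values
print(specs["ID"].index_retr_fun, indexed, scanned)
```

The query text is the same in both cases, which matches the slide's goal of keeping the query format unchanged.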

System Enhancement

• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing

Microarray Gene Information Look-up

• Goal: gather information about genes (120)

• Query: microarray output join genome database

• Index: gene names in genome

[Figure: bar chart of performance (sec). Query analysis: 0.01; index generation: 0.72; query with indices: 20.89; query without indices: 81.59.]

BLAST-ENHANCE Query

• Goal: Add extra information to BLAST output

• Query: BLAST output join Swiss-Prot database

• Index: protein ID in Swiss-Prot

[Figure: chart of performance (sec, axis 0 to 1200) for index generation, query with indices, and query without indices; x-axis values 3, 5, 12.]

OMIM-PLUS Query

• Goal: add Swiss-Prot link to OMIM

• Query: OMIM join Swiss-Prot

• Index: protein ID in Swiss-Prot

[Figure: chart of performance (sec, logarithmic axis from 1 to 10000000) for index generation, query with indices, and query without indices.]

Homology Search Query

• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values
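As a rough illustration of turning sub-string composition into n-D numerical values that can be indexed, the sketch below maps each window of a DNA sequence to a 4-D vector of letter frequencies and groups vectors under a minimum bounding rectangle. This is a simplification and not either of the published algorithms used in the experiments: it omits the wavelet transform entirely, and the function names, window sizes, and sample sequences are assumptions.

```python
def composition(window, alphabet="ACGT"):
    """Fraction of each alphabet letter in a window: one n-D point per window."""
    return tuple(window.count(ch) / len(window) for ch in alphabet)


def window_vectors(seq, width, step):
    """Composition vectors for windows of `width` letters taken every `step` letters."""
    return [composition(seq[i:i + width]) for i in range(0, len(seq) - width + 1, step)]


def bounding_rectangle(vectors):
    """Minimum bounding rectangle as (low corner, high corner)."""
    lows = tuple(min(col) for col in zip(*vectors))
    highs = tuple(max(col) for col in zip(*vectors))
    return lows, highs


def min_distance(rect, point):
    """Smallest Euclidean distance from a point to the rectangle (0 if inside)."""
    lows, highs = rect
    gaps = (max(lo - p, 0.0, p - hi) for lo, hi, p in zip(lows, highs, point))
    return sum(g * g for g in gaps) ** 0.5


vectors = window_vectors("AAAAAAAC", width=4, step=4)
rect = bounding_rectangle(vectors)
query = composition("CCCC")

# A rectangle far from the query point can be skipped without comparing sequences.
print(rect, min_distance(rect, query))
```

A search would compute `min_distance` for each stored rectangle and run the expensive sequence comparison only on rectangles within a chosen radius.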

Homology Search (1)

• Index (Singh's algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles

[Figure: chart of performance (sec, axis 0 to 350) against database size (9.8MB), x-axis ticks 1 to 5; legend: index generation, 10, 20, 40.]

Homology Search (2)

• Index (Ferhatosmanoglu's algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree

[Figure: chart of performance (sec, axis 0 to 30) against database size (250MB), x-axis ticks 1 to 5; legend: 10, 20, 40.]

Conclusions

• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
  – Support for indexing incorporated flexibly
