high-level interfaces and abstractions for data-driven applications in a grid environment gagan...

High-level Interfaces and Abstractions for Data-Driven

Applications in a Grid Environment

Gagan Agrawal Department of Computer Science and

Engineering

(joint work with Liang Chen, Wei Du, Leo Glimcher, Ruoming Jin, Xiaogang Li, Swarup

Sahoo, Li Weng, and Xuan Zhang) (Funded by ACI-9733520, EIA-9703088, ACI-

9982087, ACI-0130437, EIA-0203846, ACI-0234273, DoD PET program)

Data-Driven Applications

Becoming increasingly important Can be extremely hard to develop for a grid-

environment We need to focus on end-users who have used

Matlab / SQL like systems for data retrieval and analysis

Some issues to consider Different data layouts and formats Flexibly exploit different forms of parallelism Adapting to available resources

Research Projects Automatic Data Virtualization

XML-based high-level abstractions and use of XQuery SQL-based front-end for the STORM system

FREERIDE (Framework for Rapid Implementation of Datamining Engines)

High-level specification of a parallel data mining algorithm Flexibly exploit different forms of parallelism

GATES (Grid-based AdapTive Execution on Streams) OGSA based Support for processing distributed streams in a grid

environment Self Adaptation to meet real-time constraints

Compiler-based front-end to DataCutter Includes support for program adaptation

More details through four student posters this afternoon

Automatic Data Virtualization

Data virtualization refers to an abstract view of data for access and processing

Data Services are methods that implement a virtual view of data

Our focus: using compiler techniques to automatically generate data services to support data virtualization

Two separate ongoing implementations Using XML Schema based high-level abstractions and

XQuery (ICS 2003, LCPC 2003, DBPL 2003, prior compiler work in ICS 2002, PACT 2001)

Supporting SQL front-end for data subsetting operations (jointly with Saltz, Kurc, Catalyurek, et al.)

TEXT

….

NetCDF

RMDB

HDF5

XML

XQuery

???

Project Overview

External Schema

XQuery/XPath

Compiler

XML Mapping Service

System Architecture

logical XML schema physical XML schema

C++/C

SQL-Based Front-end

System Architecture

SELECT * FROM IPARSWHERE RID in (0,6,26,27) AND TIME>1000 AND TIME<1100

AND SOIL>0.7 AND SPEED(OILVX, OILVY, OILVZ)<30.0;

Common operations: Subsetting, filtering, user defined filtering

FREERIDE Overview Framework for Rapid

Implementation of Datamining engines

Demonstrated for a variety of standard mining algos

Targets distributed memory

parallelism, shared memory parallelism, and combination

Can be used as basis for scalable grid-based data mining implementations

Developed on top of Active Data Repository (ADR) from Saltz’s group at Maryland

Publications: SDM 01,02,03,Sigmetrics 02, Ipdps 04, TKDE 04

Key Observation from Mining Algorithms

Most popular algorithms have a common canonical loop

Can be used as the basis for supporting a common middleware

Parallelism of different forms and execution on disk-resident datasets

While( ) {

forall( data instances d) {

I = process(d)

R(I) = R(I) op d

}

…….

}

Applications of FREERIDE

Apriori and FP-tree based association mining distributed memory, shared memory, combination

K-means and EM Clustering distributed memory, shared memory, combination

Nearest-neighbor search RainForest-based decision tree construction

shared memory A new decision tree algorithms – Statistical

Pruning of Intervals for Enhanced Scalability (SPIES)

distributed memory, shared memory, combination

Applying FREERIDE for Scientific Data Mining

Joint work with Machiraju and Parthasarathy

Focusing on feature extraction, tracking, and mining approach developed by Machiraju et al.

A feature is a region of interest in a dataset

A suite of algorithms for extracting and tracking features

FREERIDE forms a basis for supporting high-level interfaces

Data Parallel Java – lcpc 2002, IPDPS 2003

Matlab / mining operators – planned in the future

GATES

Grid-based AdapTive Execution on Streams

Targets (distributed) processing of (distributed) data streams

Built on OGSA model Self adaptation to meet

real-time constraint on processing

GATES: Motivation

Many applications involve high-volume data streams

Data from large scale experiments / simulations Digitized images from a movie camera Network traffic

Data may arise from distributed sources Analysis / consumption of results may be

distributed Many users wanting different analyses/results Insufficient compute power at one site

Self Adaptation in GATES

Goal: Achieve the best accuracy with available resources, subject to real-time constraint

GATES approach: Programmer exposes certain parameters in processing of

each stage Examples include: rate of sampling, size of summary

structure Programmer also specifies direction of sensitivity e.g. larger

summary structure means more computation/communication

Parameters adjusted at runtime Currently based upon size of buffers: signal previous stage to

become faster/slower if buffer too small / too large Future possibilities: use profiling / performance models …

Summary

Application development in a grid environment is hard

Need novel runtime techniques and middleware Innovative applications of compiler technology

can help Equipment needs:

Need a controlled distributed environment Need high-bandwidth connectivity – need to simulate

High rate of data arrival External clients with ability to receive data at high rates

Scaling work to systems with Tera-bytes of storage

high-level interfaces and abstractions for data-driven applications in a grid environment gagan...

Documents