high-level interfaces and abstractions for data-driven applications in a grid environment gagan...
TRANSCRIPT
High-level Interfaces and Abstractions for Data-Driven
Applications in a Grid Environment
Gagan Agrawal Department of Computer Science and
Engineering
(joint work with Liang Chen, Wei Du, Leo Glimcher, Ruoming Jin, Xiaogang Li, Swarup
Sahoo, Li Weng, and Xuan Zhang) (Funded by ACI-9733520, EIA-9703088, ACI-
9982087, ACI-0130437, EIA-0203846, ACI-0234273, DoD PET program)
Data-Driven Applications
Becoming increasingly important Can be extremely hard to develop for a grid-
environment We need to focus on end-users who have used
Matlab / SQL like systems for data retrieval and analysis
Some issues to consider Different data layouts and formats Flexibly exploit different forms of parallelism Adapting to available resources
Research Projects Automatic Data Virtualization
XML-based high-level abstractions and use of XQuery SQL-based front-end for the STORM system
FREERIDE (Framework for Rapid Implementation of Datamining Engines)
High-level specification of a parallel data mining algorithm Flexibly exploit different forms of parallelism
GATES (Grid-based AdapTive Execution on Streams) OGSA based Support for processing distributed streams in a grid
environment Self Adaptation to meet real-time constraints
Compiler-based front-end to DataCutter Includes support for program adaptation
More details through four student posters this afternoon
Automatic Data Virtualization
Data virtualization refers to an abstract view of data for access and processing
Data Services are methods that implement a virtual view of data
Our focus: using compiler techniques to automatically generate data services to support data virtualization
Two separate ongoing implementations Using XML Schema based high-level abstractions and
XQuery (ICS 2003, LCPC 2003, DBPL 2003, prior compiler work in ICS 2002, PACT 2001)
Supporting SQL front-end for data subsetting operations (jointly with Saltz, Kurc, Catalyurek, et al.)
TEXT
….
NetCDF
RMDB
HDF5
XML
XQuery
???
Project Overview
External Schema
XQuery/XPath
Compiler
XML Mapping Service
System Architecture
logical XML schema physical XML schema
C++/C
SQL-Based Front-end
System Architecture
SELECT * FROM IPARSWHERE RID in (0,6,26,27) AND TIME>1000 AND TIME<1100
AND SOIL>0.7 AND SPEED(OILVX, OILVY, OILVZ)<30.0;
Common operations: Subsetting, filtering, user defined filtering
FREERIDE Overview Framework for Rapid
Implementation of Datamining engines
Demonstrated for a variety of standard mining algos
Targets distributed memory
parallelism, shared memory parallelism, and combination
Can be used as basis for scalable grid-based data mining implementations
Developed on top of Active Data Repository (ADR) from Saltz’s group at Maryland
Publications: SDM 01,02,03,Sigmetrics 02, Ipdps 04, TKDE 04
Key Observation from Mining Algorithms
Most popular algorithms have a common canonical loop
Can be used as the basis for supporting a common middleware
Parallelism of different forms and execution on disk-resident datasets
While( ) {
forall( data instances d) {
I = process(d)
R(I) = R(I) op d
}
…….
}
Applications of FREERIDE
Apriori and FP-tree based association mining distributed memory, shared memory, combination
K-means and EM Clustering distributed memory, shared memory, combination
Nearest-neighbor search RainForest-based decision tree construction
shared memory A new decision tree algorithms – Statistical
Pruning of Intervals for Enhanced Scalability (SPIES)
distributed memory, shared memory, combination
Applying FREERIDE for Scientific Data Mining
Joint work with Machiraju and Parthasarathy
Focusing on feature extraction, tracking, and mining approach developed by Machiraju et al.
A feature is a region of interest in a dataset
A suite of algorithms for extracting and tracking features
FREERIDE forms a basis for supporting high-level interfaces
Data Parallel Java – lcpc 2002, IPDPS 2003
Matlab / mining operators – planned in the future
GATES
Grid-based AdapTive Execution on Streams
Targets (distributed) processing of (distributed) data streams
Built on OGSA model Self adaptation to meet
real-time constraint on processing
GATES: Motivation
Many applications involve high-volume data streams
Data from large scale experiments / simulations Digitized images from a movie camera Network traffic
Data may arise from distributed sources Analysis / consumption of results may be
distributed Many users wanting different analyses/results Insufficient compute power at one site
Self Adaptation in GATES
Goal: Achieve the best accuracy with available resources, subject to real-time constraint
GATES approach: Programmer exposes certain parameters in processing of
each stage Examples include: rate of sampling, size of summary
structure Programmer also specifies direction of sensitivity e.g. larger
summary structure means more computation/communication
Parameters adjusted at runtime Currently based upon size of buffers: signal previous stage to
become faster/slower if buffer too small / too large Future possibilities: use profiling / performance models …
Summary
Application development in a grid environment is hard
Need novel runtime techniques and middleware Innovative applications of compiler technology
can help Equipment needs:
Need a controlled distributed environment Need high-bandwidth connectivity – need to simulate
High rate of data arrival External clients with ability to receive data at high rates
Scaling work to systems with Tera-bytes of storage