Data Mining and Access Pattern Discovery

Page 1: Data Mining and Access Pattern Discovery

OAK RIDGE NATIONAL LABORATORY, U.S. DEPARTMENT OF ENERGY

SDM center

Data Mining and Access Pattern Discovery

Subprojects:
• Dimension reduction and sampling (Chandrika, Imola)
• Access pattern discovery (Ghaleb)
• “Run and Render” capability in ASPECT (George, Joel, Nagiza)

Common applications: climate and astrophysics

Common goals:
• Explore data for knowledge discovery
• Knowledge is used in different ways:
  • Explain volcano and El Niño effects on changes in the earth’s surface temperature
  • Minimize disk access times
  • Reduce the amount of data stored
  • Quantify correlations between the neutrino flux and stellar core convection, between convection and spatial dimensionality, and between convection and rotation

Common tools that we use: cluster analysis, dimension reduction

Feed each other: dimension reduction <-> cluster analysis, ASPECT <-> access pattern

Page 2: Data Mining and Access Pattern Discovery

ASPECT: Adaptable Simulation Product Exploration and Control Toolkit

Nagiza Samatova, George Ostrouchov, Faisal AbuKhzam, Joel Reed, Tom Potok & Randy Burris

Computer Science and Mathematics Division
http://www.csm.ornl.gov/

SciDAC SDM ISIC All-Hands Meeting
March 26-27, 2002
Gatlinburg, TN

Page 3: Data Mining and Access Pattern Discovery

Team & Collaborators

Team:
• AbuKhzam, Faisal: distributed & streamline data mining research
• Ostrouchov, George: application coordination, sampling & data reduction, data analysis
• Reed, Joel: ASPECT’s GUI interface, agents
• Samatova, Nagiza: management, streamline & distributed data mining algorithms in ASPECT, application tie-ins
• Summer students: Java-R back-end interface development

Collaborators:
• Burris, Randy: establishing prototyping environment in Probe
• Drake, John: a lot of ideas have been inspired from
• Geist, Al: distributed and streamline data analysis research
• Mezzacappa, Tony: TSI application driver
• Million, Dan: establishing software environments in Probe
• Potok, Tom: ORMAC Agent Framework

Page 4: Data Mining and Access Pattern Discovery

Analysis & Visualization of Simulation Product: State of the Art

Post-processing data analysis tools (like PCMDI):
• Scientists must wait for the simulation completion
• Can use lots of CPU cycles on long-running simulations
• Can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations

Simulation monitoring tools:
• Need simulation code instrumentation (e.g., call to vis. libraries)
• Interference with simulation run: snapshot of data => can pause simulation
• Computationally intensive data analysis task becomes part of simulation
• Synchronous view of data and simulation run
• More control over simulation

Page 5: Data Mining and Access Pattern Discovery

Improvements through ASPECT

Data stream, not simulation monitoring tool

[Diagram labels: Simulation Data; Disks; Tapes; PROBE; ASPECT; Plug-in modules (FFT, ICA, Filters, D4, RACHET); GUI Interface; Desktop]

ASPECT’s advantages:
• No simulation code instrumentation
• Single data, multiple views of data
• No interference w/ simulation

ASPECT’s drawbacks (e.g., unlike CUMULVS/ORNL):
• No computational steering
• No collaborative visualization
• No high performance visualization

Page 6: Data Mining and Access Pattern Discovery

“Run and Render” Simulation Cycle in SciDAC: Our Vision

[Diagram labels: Computational Environment; SP3: TSI Simulation; Data Management; Disks; Tapes; Application Scientist; ASPECT; Data Analysis. Legend: Part of SciDAC; Missing]

PROBE for Storage & Analysis of Simulation Data:
• High-Dimensional
• Distributed
• Dynamic
• Massive

Visualization:
• Scalable
• Adaptable
• Interactive
• Collaborative

Page 7: Data Mining and Access Pattern Discovery

Approaching the Goal through a Collaborative Set of Activities

• Interact with Application Scientists (T. Mezzacappa, R. Toedte, D. Erickson, J. Drake)
• Build a Workflow Environment (Probe)
• Application Data Analysis Research
• CS & Math Research driven by Applications
• ASPECT Design & Implementation
• Publications, Meetings & Presentations
• Learn Application Domain (problem, software)
• Data Preparation & Processing

Page 8: Data Mining and Access Pattern Discovery

Building a Workflow Environment

Page 9: Data Mining and Access Pattern Discovery

80% => 20% Paradigm in Probe’s Research & Application driven Environment

From frustrations:
• Very limited resources
• General purpose software only
• Lack of interface with HPSS
• Homogeneous platform (e.g., Linux only)

To smooth operation:

Hardware Infrastructure:
• RS6000 S80, 6 processors
• 2 GB memory
• 1 TB IDE FibreChannel RAID
• 360 GB Sun RAID

Software Infrastructure:
• Compilers (Fortran, C, Java)
• Data Analysis (R, Java-R, Ggobi)
• Visualization (ncview, GrADS)
• Data Formats (netCDF, HDF)
• Data Storage & Transfer (HPSS, hsi, pftp, GridFTP, MPI-IO)

Page 10: Data Mining and Access Pattern Discovery

ASPECT Design and Implementation

Page 11: Data Mining and Access Pattern Discovery

ASPECT Front-End Infrastructure

Menu of Modules

Categories:
• Data Acquisition
• Data Filtering
• Data Analysis
• Visualization

Functionality:
• Instantiate Modules
• Link Modules
• Control Valid Links
• Synchronously Control
• Add Modules by XML

[Diagram labels: Create Instance; Link Modules; NetCDF Reader; FFT; Filter Module; Visualization Module]

XML Config File:

<modules>
  <module-set>
    <name> Data Acquisition </name>
    <module>
      <name> NetCDF Reader </name>
      <code> datamonitor.NetCDFReader </code>
    </module>
  </module-set>
  <module-set>
    <name> Data Filtering </name>
    <module>
      <name> Invert Filter </name>
      <code> datamonitor.Inverter </code>
    </module>
  </module-set>
</modules>
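A minimal sketch of how a front end might read this XML config into a menu of module categories. ASPECT's front end is Java; Python is used here only for brevity, and the function name `load_module_menu` is hypothetical, not part of ASPECT.

```python
import xml.etree.ElementTree as ET

# The config shown on the slide, reproduced as a string.
CONFIG = """
<modules>
  <module-set>
    <name> Data Acquisition </name>
    <module>
      <name> NetCDF Reader </name>
      <code> datamonitor.NetCDFReader </code>
    </module>
  </module-set>
  <module-set>
    <name> Data Filtering </name>
    <module>
      <name> Invert Filter </name>
      <code> datamonitor.Inverter </code>
    </module>
  </module-set>
</modules>
"""


def load_module_menu(xml_text):
    """Return {category: [(module name, class name), ...]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    menu = {}
    for module_set in root.findall("module-set"):
        # The first <name> child of a <module-set> is the category label.
        category = module_set.findtext("name").strip()
        menu[category] = [
            (m.findtext("name").strip(), m.findtext("code").strip())
            for m in module_set.findall("module")
        ]
    return menu


menu = load_module_menu(CONFIG)
for category, modules in menu.items():
    print(category, modules)
```

Adding a module then amounts to adding one more `<module>` element; the class name in `<code>` is what a loader would instantiate.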

Page 12: Data Mining and Access Pattern Discovery

ASPECT Implementation

• Front-end interface: Java
• Back-end data analysis: R (GNU S) (and C)
  • Provides richness of data analysis capabilities
  • Omegahat’s Java-R interface (http://omegahat.org)
• Networking layer: ORNL’s ORMAC Agent Architecture based on RMI
  • Other: Servlets, HORB (http://horb.a02.aist.go.jp/horb/), CORBA
• File Readers:
  • NetCDF
  • ASCII
  • HDF5 (later)

Page 13: Data Mining and Access Pattern Discovery

Agents for Distributed Data Processing

Page 14: Data Mining and Access Pattern Discovery

Agents and Parallel Computing

Astrophysics Example:
• Massive datasets
• Team of agents divides up the task
• Each agent contributes a solution for its portion of the dataset
• Agent-derived partial solutions are merged to create the total solution
• Solution is appropriately formatted for the resource
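The divide, solve, and merge pattern described here can be sketched as follows. This is an illustration only: threads stand in for agents, the per-agent "solution" is a simple (count, sum, max) summary, and all function names are hypothetical rather than part of the ORMAC agent framework.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_agents):
    """Divide the dataset into n_agents nearly equal contiguous portions."""
    size, extra = divmod(len(data), n_agents)
    chunks, start = [], 0
    for i in range(n_agents):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def agent_partial_solution(chunk):
    """Each agent summarizes only its own portion: (count, sum, max)."""
    return (len(chunk), sum(chunk), max(chunk) if chunk else float("-inf"))


def merge_partial_solutions(partials):
    """Merge the small per-agent summaries into the total solution."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    largest = max(p[2] for p in partials)
    return {"count": count, "mean": total / count, "max": largest}


def run_team(data, n_agents=4):
    chunks = split_into_chunks(data, n_agents)
    # Threads are an assumption standing in for distributed agents.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        partials = list(pool.map(agent_partial_solution, chunks))
    return merge_partial_solutions(partials)


result = run_team(list(range(1, 101)), n_agents=4)
print(result)
```

Only the small summaries travel to the merging step; the raw portions stay with the agent that processed them.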

Page 18: Data Mining and Access Pattern Discovery

Team of Agents Divide Up Data

1) Resource Aware Agent receives request
2) Announces request to agent team
3) Team responds
4) Resource Aware Agent
   - Assembles and formats for resource
   - Hands back solution

[Diagram label: Varying Resources]

Page 19: Data Mining and Access Pattern Discovery

Distributed and Streamline Data Analysis Research

Page 20: Data Mining and Access Pattern Discovery

Complexity of Scientific Data Sets Drives Algorithmic Breakthroughs

Challenge: Develop effective & efficient methods for mining scientific data sets

• Tera- & Petabytes: Existing methods do not scale in terms of time and storage
• Distributed: Existing methods work on a single centralized dataset. Data transfer is prohibitive
• High-dimensional: Existing methods do not scale up with the number of dimensions
• Dynamic: Existing methods work w/ static data. Changes lead to complete re-computation

Supernova Explosion:
• 1-D simulation: 2GB
• 2-D simulation: 1TB
• 3-D simulation: 50TB

Page 21: Data Mining and Access Pattern Discovery

Need to break the Algorithmic Complexity Bottleneck

                 Algorithm complexity
Data size, n     n              n log(n)       n^2
100B             10^-10 sec.    10^-10 sec.    10^-8 sec.
10KB             10^-8 sec.     10^-8 sec.     10^-4 sec.
1MB              10^-6 sec.     10^-5 sec.     1 sec.
100MB            10^-4 sec.     10^-3 sec.     3 hrs.
10GB             10^-2 sec.     0.1 sec.       3 yrs.

Algorithmic Complexity:
• Calculate means: O(n)
• Calculate FFT: O(n log(n))
• Calculate SVD: O(r • c)
• Clustering algorithms: O(n^2)

For illustration, the chart assumes 10^-12 sec. calculation time per data point.
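The chart entries follow from multiplying an operation count by the assumed 10^-12 sec. per data point. A small sketch that recomputes them; treating each data size label as a point count (10GB as 10^10) and using a base-10 logarithm are assumptions made here to match the chart's order of magnitude.

```python
import math

SECONDS_PER_OPERATION = 1e-12  # the slide's stated assumption

# Operation counts; the log base (10) is an assumption, not stated on the slide.
COMPLEXITIES = {
    "n": lambda n: n,
    "n log(n)": lambda n: n * math.log10(n),
    "n^2": lambda n: n ** 2,
}

# Data size labels from the chart, read as point counts (an assumption).
DATA_SIZES = {"100B": 1e2, "10KB": 1e4, "1MB": 1e6, "100MB": 1e8, "10GB": 1e10}


def estimated_seconds(n, complexity):
    """Estimated run time for n data points under the given complexity."""
    return COMPLEXITIES[complexity](n) * SECONDS_PER_OPERATION


def humanize(seconds):
    """Format a duration in years, hours, or seconds."""
    year = 365 * 24 * 3600
    if seconds >= year:
        return f"{seconds / year:.1f} yrs"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hrs"
    return f"{seconds:.0e} sec"


for label, n in DATA_SIZES.items():
    row = [humanize(estimated_seconds(n, c)) for c in COMPLEXITIES]
    print(f"{label:>6}: " + " | ".join(row))
```

The quadratic column is the one that crosses from seconds into hours and years, which is the bottleneck the slide refers to.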

Page 22: Data Mining and Access Pattern Discovery

RACHET: High Performance Framework for Distributed Cluster Analysis

Key idea: Perform cluster analysis in a distributed fashion with reasonable data transfer overheads

Strategy:
• Compute local analyses using distributed agents
• Merge minimum info into a global analysis via peer-to-peer agents’ collaboration & negotiation

Benefits:
• NO need to centralize data
• Linear scalability with data size and with data dimensionality

Page 23: Data Mining and Access Pattern Discovery

Paradigm Shift in Data Analysis

Distributed Approach (RACHET approach):
• Data distribution is driven by a science application
• Software code is sent to the data
• One time communication
• No assumptions on hardware architecture
• Provide an approximate solution

Parallel Approach:
• Data distribution is driven by algorithm performance
• Data is partitioned by a software code
• Excessive data transfers
• Hardware architecture-centric
• Aim for the “exact” computation

Page 24: Data Mining and Access Pattern Discovery

Distributed Cluster Analysis

RACHET merges local dendrograms to determine global cluster structure of the data.

[Diagram labels: Local Dendrogram (three sites); Intelligent agents; RACHET; Global Dendrogram]

RACHET costs:
• Time: [formula in |S| and N, not recoverable]
• Data transmission: O(N)
• Space: [formula in |S| and N, not recoverable]
• For |S| << N: O(N)

N: data size; S: number of sites; k: number of dimensions

Page 25: Data Mining and Access Pattern Discovery

Distributed & Streamline Data Reduction: Merging Information Rather Than Raw Data

• Global Principal Components: transmit information, not data
• Dynamic Principal Components: no need to keep all data

Method: Merge few local PCs and local means

Benefits:
• Little loss of information
• Much lower transmission costs

[Figure: Performance of Distributed PCA vs. Monolithic PCA; x-axis: # of Data Sets; y-axis: Ratio]
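To illustrate "transmit information, not data", the sketch below merges per-site summaries (count, mean vector, scatter matrix) into the summary of the pooled data without moving any raw rows. Note the swap: this merges full scatter matrices and is exact, whereas the method on the slide merges only a few local PCs and local means and is approximate. Principal components would be the eigenvectors of the merged covariance, which is not computed here. All names are hypothetical.

```python
def summarize(rows):
    """Local summary of one site's data: (count, mean vector, scatter matrix)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    scatter = [
        [sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) for j in range(d)]
        for i in range(d)
    ]
    return n, mean, scatter


def merge(a, b):
    """Combine two local summaries into the summary of the pooled data."""
    n1, m1, s1 = a
    n2, m2, s2 = b
    n = n1 + n2
    d = len(m1)
    delta = [m2[j] - m1[j] for j in range(d)]
    mean = [m1[j] + delta[j] * n2 / n for j in range(d)]
    # Pooled scatter = local scatters + correction for the shift between means.
    scatter = [
        [s1[i][j] + s2[i][j] + delta[i] * delta[j] * n1 * n2 / n for j in range(d)]
        for i in range(d)
    ]
    return n, mean, scatter


def covariance(summary):
    """Sample covariance matrix from a summary."""
    n, _, scatter = summary
    return [[v / (n - 1) for v in row] for row in scatter]


# Two sites holding different rows of the same 2-D dataset (made-up values).
site_a = [(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)]
site_b = [(4.0, 7.9), (5.0, 10.1)]

merged = merge(summarize(site_a), summarize(site_b))
central = summarize(site_a + site_b)  # what a centralized analysis would see
print(covariance(merged))
print(covariance(central))
```

Each site sends d + d*d + 1 numbers regardless of how many rows it holds, which is where the lower transmission cost comes from.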

Page 26: Data Mining and Access Pattern Discovery

DFastMap: Fast Dimension Reduction for Distributed and Streamline Data

[Diagram: stream of simulation data arriving as new chunks at t=t0, t=t1, t=t2; incremental update via fusion]

Features:
• Linear time for each chunk
• One time communication for distributed version
• ~5% deviation from monolithic version

[Figure: Accuracy of approximation to original data; x-axis: Number of dimensions (1 to 9); y-axis: Stress (0 to 0.7); series: monolithic, t=2, t=4]

[Figure: Ratio of monolithic vs. streamline; x-axis: Number of dimensions k (1 to 9); y-axis: Ratio (0.92 to 1.04); series: m/t=2, m/t=4]
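For background, a minimal sketch of the classic, single-site FastMap projection that DFastMap builds on. This is not the distributed or streamline variant from the slide; the pivot choice (start from object 0) and the function name are assumptions for illustration. Each output dimension needs only a linear number of distance evaluations.

```python
import math


def fastmap(points, k):
    """Project points to k coordinates with the classic FastMap heuristic."""
    n = len(points)
    coords = [[0.0] * k for _ in range(n)]

    def dist2(i, j, dim):
        # Squared original distance minus what earlier coordinates explain.
        d2 = math.dist(points[i], points[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)

    for dim in range(k):
        # Pivot heuristic: farthest object from object 0, then farthest from that.
        b = max(range(n), key=lambda i: dist2(0, i, dim))
        a = max(range(n), key=lambda i: dist2(b, i, dim))
        dab2 = dist2(a, b, dim)
        if dab2 == 0.0:
            break  # no residual spread left; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][dim] = (dist2(a, i, dim) + dab2 - dist2(b, i, dim)) / (2 * dab)
    return coords


# Three collinear points: one coordinate is enough to preserve all distances.
demo = fastmap([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)], k=2)
print(demo)
```

Because the demo points lie on a line, the first coordinate reproduces their pairwise distances and the second stays at zero.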

Page 27: Data Mining and Access Pattern Discovery

Application Data Reduction and Potentials for Scientific Discovery

Page 28: Data Mining and Access Pattern Discovery

Adaptive PCA-based Data Compression in Supernova Explosion Simulation

PCA vs. sub-sampling compression:
• Loss function: Mean Square Error (MSE)
• Sub-sampling: 1 point out of 9 (black)
• PCA approximation: k PCs out of 400 (red)

Compression Features:
• Adaptive
• Rate: 200 to 20 times
• PCA-based
• 3 times better than subsampling

[Figure: Original vs. PCA Restored; time step = 0; MSE = 0.004; compression rate = 200; number of PCs = 3 of 400]

Page 29: Data Mining and Access Pattern Discovery

Data Compression & Discovery of the Unusual by Fitting Local Models

Strategy:
• Segment series
• Model the usual to find the unusual

Key ideas:
• Fit simple local models to segments
• Use parameters for global analysis and monitoring

Resulting system:
• Detects specific events (targeted or unusual)
• Provides a global description of one or several data series
• Provides data reduction to parameters of local model

Page 30: Data Mining and Access Pattern Discovery

From Local Models to Annotated Time Series

• Segment series (100 obs)
• Fit simple local model (c0, c1, c2, ||e||, ||e||2)
• Select extreme (10%)
• Cluster extreme (4)
• Map back to series
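The pipeline above can be sketched in a simplified form. Assumptions made here: the local model is a straight line (c0 + c1*t) rather than the three-coefficient model on the slide, "extreme" means largest residual norm ||e||, and the clustering step is omitted. Function names and the synthetic series are hypothetical.

```python
import math


def fit_line(segment):
    """Least-squares fit y = c0 + c1*t over t = 0..len-1; returns (c0, c1, ||e||)."""
    n = len(segment)
    t_mean = (n - 1) / 2
    y_mean = sum(segment) / n
    stt = sum((t - t_mean) ** 2 for t in range(n))
    sty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(segment))
    c1 = sty / stt
    c0 = y_mean - c1 * t_mean
    resid = math.sqrt(sum((y - (c0 + c1 * t)) ** 2 for t, y in enumerate(segment)))
    return c0, c1, resid


def annotate(series, seg_len=100, extreme_fraction=0.10):
    """Return start indices of the segments with the most extreme residual norm."""
    starts = range(0, len(series) - seg_len + 1, seg_len)
    models = [(s, fit_line(series[s:s + seg_len])) for s in starts]
    n_extreme = max(1, round(extreme_fraction * len(models)))
    ranked = sorted(models, key=lambda m: m[1][2], reverse=True)
    # Map the selected segments back to positions in the original series.
    return sorted(s for s, _ in ranked[:n_extreme])


# Synthetic series: a linear trend with one unusual spike at t = 350.
series = [0.01 * t for t in range(1000)]
series[350] += 5.0
flagged = annotate(series)
print(flagged)
```

Each 100-observation segment is reduced to three numbers, and only the segment containing the spike is flagged for annotation.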

Page 31: Data Mining and Access Pattern Discovery

Decomposition and Monitoring of a GCM Run

135 year CCM3 run at T42 resolution; Average Monthly Temperature; CO2 increase to 3x

Temporal decomposition:
• Periodic + Trend: 11-13 mo bandpass, 15 yr lowpass
• Anomaly: 13 mo-15 yr bandpass + 11 mo highpass

Spatial decomposition: EOF 1 + EOF 2 + EOF 3 + EOF 4 + . . . + EOF N

Annotations:
• Circulation through 12 months
• Winter warming more severe than summer warming

[Figure panels: EOF 1; EOF 2; EOF 3; EOF 4]

Page 32: Data Mining and Access Pattern Discovery

Publications & Presentations

Publications:
• F. AbuKhzam, N. F. Samatova, and G. Ostrouchov (2002). “FastMap for Distributed Data: Fast Dimension Reduction,” in preparation.
• Y. Qu, G. Ostrouchov, N. F. Samatova, A. Geist (2002). “Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets,” in Proc. The Second SIAM International Conference on Data Mining, April 2002.
• N. F. Samatova, G. Ostrouchov, A. Geist, A. Melechko. “RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets,” Special Issue on Parallel and Distributed Data Mining, Distributed and Parallel Databases: An International Journal, Volume 11, No. 2, March 2002.
• N. Samatova, A. Geist, G. Ostrouchov. “RACHET: Petascale Distributed Data Analysis Suite,” in Proc. SPEEDUP Workshop on Distributed Supercomputing Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.

Presentations:
• N. Samatova, A. Geist, G. Ostrouchov. “RACHET: Petascale Distributed Data Analysis Suite,” SPEEDUP Workshop on Distributed Supercomputing Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.
• A. Shoshani, R. Burris, T. Potok, N. Samatova. “SDM-ISIC,” TSI All-Hands Meeting, February 2002.

Page 33: Data Mining and Access Pattern Discovery

Thank You!