Data Mining and Access Pattern Discovery
OAK RIDGE NATIONAL LABORATORY / U.S. DEPARTMENT OF ENERGY
SDM center
Data Mining and Access Pattern Discovery Subprojects:
• Dimension reduction and sampling (Chandrika, Imola)
• Access pattern discovery (Ghaleb)
• "Run and Render" capability in ASPECT (George, Joel, Nagiza)

Common applications: climate and astrophysics

Common goals: explore data for knowledge discovery. Knowledge is used in different ways:
• Explain volcano and El Niño effects on changes in the earth's surface temperature
• Minimize disk access times
• Reduce the amount of data stored
• Quantify correlations between the neutrino flux and stellar core convection, between convection and spatial dimensionality, and between convection and rotation

Common tools that we use: cluster analysis, dimension reduction

Feed each other: dimension reduction <-> cluster analysis, ASPECT <-> access pattern discovery
ASPECT: Adaptable Simulation Product Exploration and Control Toolkit

Nagiza Samatova, George Ostrouchov, Faisal AbuKhzam, Joel Reed, Tom Potok & Randy Burris
Computer Science and Mathematics Division, http://www.csm.ornl.gov/
SciDAC SDM ISIC All-Hands Meeting, March 26-27, 2002, Gatlinburg, TN
Team & Collaborators

Team:
• AbuKhzam, Faisal – distributed & streaming data mining research
• Ostrouchov, George – application coordination, sampling & data reduction, data analysis
• Reed, Joel – ASPECT's GUI interface, agents
• Samatova, Nagiza – management, streaming & distributed data mining algorithms in ASPECT, application tie-ins
• Summer students – Java-R back-end interface development

Collaborators:
• Burris, Randy – establishing the prototyping environment in Probe
• Drake, John – inspired many of the ideas
• Geist, Al – distributed and streaming data analysis research
• Mezzacappa, Tony – TSI application driver
• Million, Dan – establishing software environments in Probe
• Potok, Tom – ORMAC agent framework
Analysis & Visualization of Simulation Products – State of the Art

Post-processing data analysis tools (like PCMDI):
• Scientists must wait for the simulation to complete
• Can use lots of CPU cycles on long-running simulations
• Can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations

Simulation monitoring tools:
• Need simulation code instrumentation (e.g., calls to visualization libraries)
• Interfere with the simulation run: taking a snapshot of data can pause the simulation
• Computationally intensive data analysis tasks become part of the simulation
• Synchronous view of the data and the simulation run
• More control over the simulation
Improvements through ASPECT
A data-stream monitoring tool, not a simulation monitoring tool.

[Architecture diagram: simulation data on disks and tapes in PROBE flows through ASPECT's plug-in modules (FFT, ICA, filters, D4, RACHET) to a GUI interface on the desktop.]

ASPECT's advantages:
• No simulation code instrumentation
• Single data stream, multiple views of the data
• No interference with the simulation

ASPECT's drawbacks (e.g., unlike CUMULVS/ORNL):
• No computational steering
• No collaborative visualization
• No high-performance visualization
"Run and Render" Simulation Cycle in SciDAC: Our Vision
[Diagram: a TSI simulation on the SP3 computational environment writes to disks and tapes in PROBE, which provides storage and analysis of high-dimensional, distributed, dynamic, massive simulation data. ASPECT connects data management and data analysis to the application scientist. Visualization (scalable, adaptable, interactive, collaborative) is the piece still missing from SciDAC.]
Approaching the Goal through a Collaborative Set of Activities
• Interact with application scientists (T. Mezzacappa, R. Toedte, D. Erickson, J. Drake)
• Learn the application domain (problem, software)
• Build a workflow environment (Probe)
• Data preparation & processing
• Application data analysis research
• CS & math research driven by applications
• ASPECT design & implementation
• Publications, meetings & presentations
Building a Workflow Environment
80% => 20% Paradigm in Probe's Research & Application-Driven Environment

From frustrations:
• Very limited resources
• General-purpose software only
• Lack of an interface with HPSS
• Homogeneous platform (e.g., Linux only)

To smooth operation:

Hardware infrastructure:
• RS6000 S80, 6 processors, 2 GB memory
• 1 TB IDE FibreChannel RAID
• 360 GB Sun RAID

Software infrastructure:
• Compilers (Fortran, C, Java)
• Data analysis (R, Java-R, GGobi)
• Visualization (ncview, GrADS)
• Data formats (netCDF, HDF)
• Data storage & transfer (HPSS, hsi, pftp, GridFTP, MPI-IO)
ASPECT Design and Implementation
Menu of Modules

Categories:
• Data Acquisition
• Data Filtering
• Data Analysis
• Visualization

ASPECT front-end infrastructure functionality:
• Instantiate modules
• Link modules
• Control valid links
• Synchronous control
• Add modules via XML

[GUI screenshot: a module instance (e.g., the NetCDF Reader from the Data Acquisition menu) is created and linked to filter (FFT) and visualization modules.]

XML config file:

<modules>
  <module-set>
    <name> Data Acquisition </name>
    <module>
      <name> NetCDF Reader </name>
      <code> datamonitor.NetCDFReader </code>
    </module>
  </module-set>
  <module-set>
    <name> Data Filtering </name>
    <module>
      <name> Invert Filter </name>
      <code> datamonitor.Inverter </code>
    </module>
  </module-set>
</modules>
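The XML configuration above is self-describing enough to sketch a loader for it. ASPECT's actual front end is Java; the Python sketch below is only an illustration of how such a config file maps to a menu of instantiable modules.

```python
# Sketch: turning the XML module config above into a menu structure.
# ASPECT's real loader is Java; this Python version is illustrative only.
import xml.etree.ElementTree as ET

CONFIG = """<modules>
  <module-set>
    <name> Data Acquisition </name>
    <module>
      <name> NetCDF Reader </name>
      <code> datamonitor.NetCDFReader </code>
    </module>
  </module-set>
  <module-set>
    <name> Data Filtering </name>
    <module>
      <name> Invert Filter </name>
      <code> datamonitor.Inverter </code>
    </module>
  </module-set>
</modules>"""

def load_module_menu(xml_text):
    """Map each category (module-set) to {module name: implementing class}."""
    menu = {}
    for mset in ET.fromstring(xml_text).findall("module-set"):
        category = mset.findtext("name").strip()
        menu[category] = {m.findtext("name").strip(): m.findtext("code").strip()
                          for m in mset.findall("module")}
    return menu

menu = load_module_menu(CONFIG)
print(menu["Data Acquisition"]["NetCDF Reader"])  # datamonitor.NetCDFReader
```

New categories and modules then require only a new `module-set` or `module` element, which is the point of the "Add modules by XML" feature.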
ASPECT Implementation

• Front-end interface: Java
• Back-end data analysis: R (GNU S) and C – provides rich data analysis capabilities
• Omegahat's Java-R interface (http://omegahat.org)
• Networking layer: ORNL's ORMAC agent architecture based on RMI; other options: servlets, HORB (http://horb.a02.aist.go.jp/horb/), CORBA
• File readers: netCDF, ASCII, HDF5 (later)
Agents for Distributed Data Processing
Agents and Parallel Computing

Astrophysics example:
• Massive datasets
• A team of agents divides up the task
• Each agent contributes a solution for its portion of the dataset
• Agent-derived partial solutions are merged to create the total solution
• The solution is appropriately formatted for the resource
Team of Agents Divide Up Data

1) A resource-aware agent receives the request
2) It announces the request to the agent team
3) The team responds
4) The resource-aware agent assembles and formats the result for the resource, and hands back the solution

(The agents run on varying resources.)
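The four steps above can be sketched as a toy, single-process divide-and-merge (here computing a global mean). The real ORMAC framework distributes the agents over Java RMI; all names below are hypothetical.

```python
# Toy sketch of the request/announce/respond/assemble pattern above.
# A real deployment would run each "agent" as a remote process.

def team_mean(data, n_agents=4):
    # 1) receive request, 2) announce: split the data among the team
    chunks = [data[i::n_agents] for i in range(n_agents)]
    # 3) each agent responds with a partial solution: (sum, count)
    partials = [(sum(c), len(c)) for c in chunks if c]
    # 4) assemble: merge the partial solutions into the global answer
    total, count = map(sum, zip(*partials))
    return total / count

print(team_mean(list(range(10))))  # 4.5, same as the centralized mean
```

The merge step only needs the small (sum, count) pairs, not the raw chunks, which is what makes the pattern attractive for massive datasets.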
Distributed and Streaming Data Analysis Research
Complexity of Scientific Data Sets Drives Algorithmic Breakthroughs

Challenge: develop effective & efficient methods for mining scientific data sets.

• Tera- & petabytes: existing methods do not scale in terms of time and storage
• Distributed: existing methods work on a single centralized dataset; data transfer is prohibitive
• High-dimensional: existing methods do not scale up with the number of dimensions
• Dynamic: existing methods work with static data; changes lead to complete re-computation

Supernova explosion simulations: 1-D: 2 GB; 2-D: 1 TB; 3-D: 50 TB
Need to Break the Algorithmic Complexity Bottleneck

Run time vs. data size n (for illustration, the chart assumes 10^-12 sec of calculation time per data point):

  Data size n | O(n)        | O(n log n)  | O(n^2)
  10 GB       | 10^-2 sec   | 0.1 sec     | 3 yrs
  100 MB      | 10^-4 sec   | 10^-3 sec   | 3 hrs
  1 MB        | 10^-6 sec   | 10^-5 sec   | 1 sec
  10 KB       | 10^-8 sec   | 10^-8 sec   | 10^-4 sec
  100 B       | 10^-10 sec  | 10^-10 sec  | 10^-8 sec

Algorithmic complexity:
• Calculate means: O(n)
• Calculate FFT: O(n log n)
• Calculate SVD: O(r · c)
• Clustering algorithms: O(n^2)
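The chart's entries follow from multiplying the operation count by the assumed per-point time; the quick check below additionally assumes one data point per byte, which is a simplification.

```python
# Reproduce the chart's run-time estimates from its stated assumption
# of 1e-12 sec per data point (one data point per byte is assumed here).
import math

RATE = 1e-12  # seconds of work per data point (from the slide)

def runtime(n_points, complexity):
    ops = {"n": n_points,
           "n log n": n_points * math.log2(n_points),
           "n^2": n_points ** 2}[complexity]
    return ops * RATE

n = 1e10  # 10 GB at one point per byte
print(runtime(n, "n"))    # ~1e-2 sec, the chart's 10^-2 sec
print(runtime(n, "n^2"))  # ~1e8 sec, i.e. roughly 3 years
```

The jump from hundredths of a second to years between the O(n) and O(n^2) columns is the bottleneck the slide refers to.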
RACHET: High-Performance Framework for Distributed Cluster Analysis

Key idea: perform cluster analysis in a distributed fashion with reasonable data transfer overheads.

Strategy:
• Compute local analyses using distributed agents
• Merge minimal information into a global analysis via peer-to-peer agent collaboration & negotiation

Benefits:
• No need to centralize data
• Linear scalability with data size and with data dimensionality
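A toy sketch of the local-then-merge strategy. This is not the actual RACHET algorithm, which merges full dendrograms; it only illustrates the pattern of shipping small per-site summaries instead of raw data. All names are hypothetical.

```python
# Each site reduces its data to a (centroid, count) summary; only the
# summaries travel, giving O(sites) transfer instead of O(N).
# RACHET itself ships dendrogram descriptors, not just centroids.

def local_summary(points):
    """One site's contribution: centroid and count of its local data."""
    n = len(points)
    centroid = [sum(xs) / n for xs in zip(*points)]
    return centroid, n

def merge_summaries(summaries):
    """Global centroid from per-site summaries; no raw data needed."""
    total = sum(n for _, n in summaries)
    weighted = zip(*[[c * n for c in cen] for cen, n in summaries])
    return [sum(d) / total for d in weighted]

site_a = [(0.0, 0.0), (2.0, 0.0)]   # two points at one site
site_b = [(4.0, 4.0)]               # one point at another
print(merge_summaries([local_summary(site_a), local_summary(site_b)]))
```

The merged centroid equals the centroid of the pooled data, so for this statistic nothing is lost by never centralizing the points.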
Paradigm Shift in Data Analysis

Distributed approach (the RACHET approach):
• Data distribution is driven by a science application
• Software code is sent to the data
• One-time communication
• No assumptions about the hardware architecture
• Provides an approximate solution

Parallel approach:
• Data distribution is driven by algorithm performance
• Data is partitioned by the software
• Excessive data transfers
• Hardware-architecture-centric
• Aims for the "exact" computation
Distributed Cluster Analysis

[Diagram: intelligent agents build a local dendrogram at each site; RACHET merges the local dendrograms into a global dendrogram.]

RACHET merges local dendrograms to determine the global cluster structure of the data.

Complexity (N = data size, S = the set of sites, k = number of dimensions; |S| << N):
• Time: O(|S|^2) + O(|S| · N)
• Space: O(|S|^2) + O(N)
• Data transmission: O(N)
Distributed & Streaming Data Reduction: Merging Information Rather Than Raw Data

• Global principal components: transmit information, not data
• Dynamic principal components: no need to keep all the data

Method: merge a few local PCs and local means.

Benefits:
• Little loss of information
• Much lower transmission costs

[Plot: performance of distributed PCA vs. monolithic PCA – ratio vs. number of data sets.]
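The "merge a few local PCs and local means" idea can be made concrete. The sketch below is a generic reconstruction of the approach, not necessarily the exact algorithm of the Qu et al. paper cited at the end: each site ships only its mean, count, and a few scaled principal directions, and the stacked summaries are re-decomposed into an approximate global PCA.

```python
# Sketch of distributed PCA by merging per-site summaries (illustrative;
# details may differ from the published algorithm).
import numpy as np

def local_summary(X, k):
    """One site: mean, count, and top-k scaled principal directions."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, len(X), s[:k, None] * Vt[:k]          # k x d matrix

def merge_pca(summaries, k):
    """Approximate global mean and top-k PCs from site summaries only."""
    total = sum(n for _, n, _ in summaries)
    grand = sum(n * mu for mu, n, _ in summaries) / total
    # stacked rows reproduce the total scatter: within-site components
    # plus between-site mean offsets (weighted by sqrt of counts)
    rows = [np.vstack([comp, np.sqrt(n) * (mu - grand)])
            for mu, n, comp in summaries]
    _, _, Vt = np.linalg.svd(np.vstack(rows), full_matrices=False)
    return grand, Vt[:k]

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0, 0.5]]) \
       + 0.05 * rng.normal(size=(200, 3))
grand, pcs = merge_pca([local_summary(data[:100], 2),
                        local_summary(data[100:], 2)], k=1)
_, _, Vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
print(abs(pcs[0] @ Vt[0]))  # near 1: merged PC matches the monolithic PC
```

Each site transmits only k+1 short vectors and a count, which is where the "little loss of information, much lower transmission costs" trade-off comes from.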
DFastMap: Fast Dimension Reduction for Distributed and Streaming Data

Features:
• Linear time for each chunk
• One-time communication for the distributed version
• ~5% deviation from the monolithic version

[Diagram: a stream of simulation data arrives in chunks (t = t0, t1, t2, ...); each new chunk is folded in by incremental update via fusion.]

[Plot: accuracy of approximation to the original data – stress vs. number of dimensions (1-9) for the monolithic version and for t = 2 and t = 4.]

[Plot: the ratio of monolithic to streaming stress (m/t=2, m/t=4) stays within roughly 0.92-1.04 across 1-9 dimensions k.]
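DFastMap builds on FastMap (Faloutsos & Lin), which embeds objects into k dimensions using only pairwise distances. A minimal single-coordinate sketch of the base algorithm is below; the distributed and streaming fusion steps that DFastMap adds are not shown.

```python
# One FastMap projection step: choose a far-apart pivot pair, then place
# every object on the pivot line via the cosine law.
import math

def fastmap_coordinate(points, dist):
    """First FastMap coordinate for each point, given a distance function."""
    # pivot heuristic: start anywhere, take the farthest point, refine once
    a = 0
    b = max(range(len(points)), key=lambda i: dist(points[a], points[i]))
    a = max(range(len(points)), key=lambda i: dist(points[b], points[i]))
    d_ab = dist(points[a], points[b])
    def coord(o):
        d_ao, d_bo = dist(points[a], o), dist(points[b], o)
        return (d_ao ** 2 + d_ab ** 2 - d_bo ** 2) / (2 * d_ab)
    return [coord(o) for o in points]

def euclid(p, q):
    return math.dist(p, q)

pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(fastmap_coordinate(pts, euclid))  # [0.0, 1.0, 2.0, 3.0]
```

Each step costs O(n) distance evaluations, which is the source of the "linear time for each chunk" feature claimed above; further coordinates come from recursing on residual distances.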
Application Data Reduction and Potentials for Scientific Discovery
Adaptive PCA-based Data Compression in Supernova Explosion Simulation

PCA vs. sub-sampling compression (loss function: mean square error, MSE):
• Sub-sampling: 1 point out of 9 (black)
• PCA approximation: k PCs out of 400 (red)

Compression features:
• Adaptive
• Rate: 200 to 20 times
• PCA-based
• 3 times better than sub-sampling

[Images: original vs. PCA-restored field at time step 0; MSE = 0.004, compression rate = 200, using 3 of 400 PCs.]
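The numbers above come from truncating a PCA of the simulation field. A generic rank-k SVD compression sketch follows; the slide's adaptive selection of k is not shown, and the synthetic field is a stand-in for real simulation data.

```python
# Rank-k PCA/SVD compression of a 2-D field: store (U_k, s_k, V_k)
# instead of the full r x c array, a compression rate of
# r*c / (k*(r + c + 1)).
import numpy as np

def pca_compress(field, k):
    U, s, Vt = np.linalg.svd(field, full_matrices=False)
    restored = (U[:, :k] * s[:k]) @ Vt[:k]
    mse = np.mean((field - restored) ** 2)
    rate = field.size / (k * (sum(field.shape) + 1))
    return restored, mse, rate

# synthetic 400 x 400 "field" of exact rank 2, so 2 PCs restore it exactly
x = np.linspace(0, 1, 400)
field = np.outer(np.sin(2 * np.pi * x), x) + np.outer(x, np.cos(2 * np.pi * x))
_, mse, rate = pca_compress(field, 2)
print(mse < 1e-10, round(rate))  # near-zero error at ~100x compression
```

Real fields are not exactly low-rank, so k trades reconstruction MSE against compression rate, which is what the slide's adaptive scheme tunes.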
Data Compression & Discovery of the Unusual by Fitting Local Models

Strategy:
• Segment the series
• Model the usual to find the unusual

Key ideas:
• Fit simple local models to segments
• Use the parameters for global analysis and monitoring

Resulting system:
• Detects specific events (targeted or unusual)
• Provides a global description of one or several data series
• Provides data reduction to the parameters of the local models
From Local Models to Annotated Time Series

1. Segment the series (100 observations per segment)
2. Fit a simple local model (c0, c1, c2, ||e||, ||e||^2)
3. Select the extreme segments (10%)
4. Cluster the extremes (4 clusters)
5. Map back to the series
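Steps 1-3 can be sketched as follows (clustering and map-back omitted). The quadratic model c0 + c1·t + c2·t^2 and the residual norm ||e|| follow the slide; the synthetic series and injected event are purely illustrative.

```python
# Segment a series, fit a quadratic to each 100-observation segment,
# and flag the 10% of segments the model fits worst.
import numpy as np

def annotate(series, seg_len=100, top_frac=0.10):
    t = np.arange(seg_len)
    basis = np.vstack([np.ones(seg_len), t, t * t]).T   # [1, t, t^2] design
    params, resid = [], []
    for start in range(0, len(series) - seg_len + 1, seg_len):
        y = series[start:start + seg_len]
        c, *_ = np.linalg.lstsq(basis, y, rcond=None)   # (c0, c1, c2)
        params.append(c)
        resid.append(np.linalg.norm(y - basis @ c))     # ||e|| per segment
    k = max(1, int(top_frac * len(resid)))
    extreme = np.argsort(resid)[-k:]                    # worst-fit segments
    return np.array(params), np.array(resid), extreme

rng = np.random.default_rng(1)
y = np.sin(np.linspace(0, 20, 2000)) + 0.01 * rng.normal(size=2000)
y[700:750] += 1.0                  # inject an unusual event into segment 7
_, _, extreme = annotate(y)
print(int(extreme[-1]))            # 7: the segment holding the event
```

The 20 segments are thereby reduced to three coefficients and a residual norm each, which is the data reduction the slide describes.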
Decomposition and Monitoring of a GCM Run

135-year CCM3 run at T42 resolution; average monthly temperature; CO2 increases to 3x.

[Figure: the temperature series is decomposed as EOF 1 + EOF 2 + EOF 3 + EOF 4 + ... + EOF N, with components separated into periodic + trend (11-13 month bandpass), 15-year lowpass, anomaly (13 month to 15 year bandpass), and 11-month highpass parts.]

Findings:
• Circulation through the 12 months
• Winter warming more severe than summer warming
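A minimal illustration of the kind of band separation listed above, using centered moving averages on a synthetic monthly series. The slide's actual EOF-based analysis of the CCM3 run is far more elaborate, and the window lengths here are only rough stand-ins for the 11-month and 15-year cutoffs.

```python
# Split a monthly series into lowpass (trend), bandpass (annual cycle),
# and highpass (residual) parts; by construction the parts sum exactly
# back to the original series.
import numpy as np

def moving_average(x, w):
    """Centered moving average; edges use a shrinking window."""
    return np.array([x[max(0, i - w // 2): i + w // 2 + 1].mean()
                     for i in range(len(x))])

def band_split(x, short=11, long=180):
    low = moving_average(x, long)       # ~15-year lowpass (trend)
    smooth = moving_average(x, short)   # remove sub-annual variation
    band = smooth - low                 # roughly the 11 mo - 15 yr band
    high = x - smooth                   # ~11-month highpass
    return low, band, high

months = np.arange(135 * 12)            # a 135-year monthly record
series = (0.002 * months                        # warming trend
          + np.sin(2 * np.pi * months / 12)     # annual cycle
          + 0.05 * np.random.default_rng(2).normal(size=len(months)))
low, band, high = band_split(series)
print(np.allclose(low + band + high, series))  # True: exact decomposition
```

Separating the trend band from the annual cycle is what makes statements like "winter warming more severe than summer warming" checkable per component.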
Publications & Presentations
Publications:
• F. AbuKhzam, N. F. Samatova, and G. Ostrouchov (2002). "FastMap for Distributed Data: Fast Dimension Reduction," in preparation.
• Y. Qu, G. Ostrouchov, N. F. Samatova, and A. Geist (2002). "Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets," in Proc. Second SIAM International Conference on Data Mining, April 2002.
• N. F. Samatova, G. Ostrouchov, A. Geist, and A. Melechko (2002). "RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets," Distributed and Parallel Databases, Special Issue on Parallel and Distributed Data Mining, Vol. 11, No. 2, March 2002.
• N. Samatova, A. Geist, and G. Ostrouchov (2002). "RACHET: Petascale Distributed Data Analysis Suite," in Proc. SPEEDUP Workshop on Distributed Supercomputing: Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.

Presentations:
• N. Samatova, A. Geist, and G. Ostrouchov, "RACHET: Petascale Distributed Data Analysis Suite," SPEEDUP Workshop on Distributed Supercomputing: Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.
• A. Shoshani, R. Burris, T. Potok, and N. Samatova, "SDM-ISIC," TSI All-Hands Meeting, February 2002.
Thank You!