LSST-VAO discussion: Distributed Data Mining (DDM)

Kirk Borne
George Mason University

March 24, 2011

http://www.gmu.edu/

The LSST Data Challenges

The LSST Data Mining Challenges

1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for 100,000 events each night.

• Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
• Look at #2 and #3 in more detail ...

LSST data mining challenge #2
• Accurately characterize and classify 50 billion objects and 20 trillion source observations
• Requires VO-accessible multi-wavelength data
• Szalay’s Law: Astrophysical discovery potential grows as (number of data sources)^2

Benefits of very large datasets:
• best statistical analysis of “typical” events
• automated search for “rare” events

LSST data mining challenge #3
• Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data:

[Figure: flux versus time for a new sky event]

• More data points help!
• Characterize first, then classify.

Characterization Use Case #1
• Feature detection and extraction:
– Automated pipelines’ tasks: Characterize!
  • Identify and describe features in the data
  • Extract feature descriptors from the data
  • Curate these features for scientific re-use
– Human experts’ tasks: Categorize and Classify!
  • Associate features with astrophysical processes
  • Find boundaries between feature sets and label them
– Example: Star-Galaxy Separation
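The "characterize" step can be illustrated with a minimal sketch that turns one time/flux series into a few feature descriptors. The function name `characterize` and the chosen descriptors are assumptions made for illustration only; they are not the LSST pipeline's actual feature set.

```python
from statistics import mean, pstdev


def characterize(times, fluxes):
    """Return simple, hypothetical feature descriptors for one light curve."""
    if len(times) != len(fluxes) or len(times) < 2:
        raise ValueError("need at least two (time, flux) samples")
    # Index of the brightest sample in the series.
    peak = max(range(len(fluxes)), key=fluxes.__getitem__)
    return {
        "n_points": len(fluxes),
        "mean_flux": mean(fluxes),
        "std_flux": pstdev(fluxes),
        "amplitude": max(fluxes) - min(fluxes),
        "time_of_peak": times[peak],
        "duration": times[-1] - times[0],
    }


# Illustrative values only, not real observations.
features = characterize([0.0, 1.0, 2.0, 3.0], [10.0, 14.0, 12.0, 10.0])
```

Descriptors like these are what a human expert or a classifier would then associate with astrophysical processes; classification happens on the descriptors, not on the raw samples.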


• The clustering problem:
– Finding clusters of objects within a data set
– Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors
  • N is >10^10, so what is the most efficient way to sort?
  • Number of dimensions ~1000; therefore, we have an enormous subspace search problem
– Scientist: determine the significance of the clusters (statistically and scientifically): categorize!
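A friends-of-friends grouping can be sketched as follows: any two points closer than a linking length are "friends", and a cluster is a chain of friends. The function name and the `linking_length` parameter are illustrative assumptions. This version compares every pair, so it costs O(N^2) and only demonstrates the idea; at N > 10^10 a spatial index or a partitioned scheme would be required.

```python
from math import dist  # Python 3.8+


def friends_of_friends(points, linking_length):
    """Group point indices into clusters linked by chains of close pairs."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pair check: fine for a demo, not for survey-scale N.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= linking_length:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.1)]
groups = friends_of_friends(points, linking_length=0.6)
# groups == [[0, 1, 2], [3, 4]]
```

Points 0 and 2 are not direct friends (they are 1.0 apart), yet they end up in the same group through point 1, which is the defining property of the method.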

• Outlier detection: (unknown unknowns)
– Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
– These may be real scientific discoveries or garbage
– Outlier detection is therefore useful for:
  • Novelty Discovery: is my Nobel prize waiting?
  • Anomaly Detection: is the detector system working?
  • Data Quality Assurance: is the data pipeline working?
– How does one optimally find outliers in 10^3-D parameter space? or in interesting subspaces (in lower dimensions)?
– How do we measure their “interestingness”?
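One simple way to put a number on "interestingness" is the distance from each object to its k-th nearest neighbor: objects far from everything else score high. This is only one of many possible outlier measures, chosen here for illustration; the function name and the default `k` are assumptions, and the brute-force search would not scale to a 10^3-D catalog.

```python
from math import dist  # Python 3.8+


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        neighbor_distances = sorted(
            dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(neighbor_distances[k - 1])
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
# Index of the point with the largest score, i.e. the most isolated one.
most_unusual = max(range(len(points)), key=scores.__getitem__)
```

The score alone cannot tell a discovery from a detector fault; it only ranks candidates for the follow-up questions listed above.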


• The dimension reduction problem:
– Finding correlations and “fundamental planes” of parameters
– Number of attributes can be hundreds or thousands
  • The Curse of High Dimensionality!
– Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
– Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
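For the linear case, the eigenvector question can be sketched with principal component analysis: the leading eigenvector of the covariance matrix is the direction that captures the most variance. The sketch below uses plain power iteration to stay dependency-free; the function name and the toy data are assumptions, and a real analysis would use a linear algebra library and handle non-linear relations separately.

```python
from math import sqrt


def leading_component(rows, iterations=200):
    """Return (unit eigenvector, explained variance fraction) of the covariance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    explained = eigenvalue / sum(cov[a][a] for a in range(d))
    return v, explained


# Two toy attributes that are almost perfectly correlated (y is roughly 2x).
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
v, explained = leading_component(rows)
```

Here a single component explains nearly all of the variance, which is the two-attribute analogue of finding a "fundamental plane" in a higher-dimensional parameter set.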


What’s the common theme?
• Need multi-wavelength data in all use cases!
• VO-accessible ancillary information is essential.

The LSST Data Mining Challenges:

Requirements for success:
• Discovery of distributed data sources
• Access to distributed data sources
• Applying characterization and clustering (data mining) algorithms on distributed data:
  • Unsupervised and Supervised Machine Learning


Data Bottleneck
• Mismatch:
  • Data volumes increase 1000x in 10 years
  • I/O bandwidth improves ~3x in 10 years
• Therefore . . . Distributed Data Mining
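The size of the mismatch follows directly from the two growth factors quoted above. The short calculation below makes it explicit; the variable names are illustrative, and the only inputs are the 1000x and ~3x figures from the slide.

```python
data_growth = 1000.0  # data volume growth over the decade (from the slide)
io_growth = 3.0       # I/O bandwidth growth over the decade (from the slide)
years = 10

# Time to move a full dataset scales as volume / bandwidth,
# so it grows by this factor over the same decade.
gap = data_growth / io_growth

# Equivalent compound annual growth factors.
annual_data = data_growth ** (1 / years)
annual_io = io_growth ** (1 / years)

print(f"transfer-time gap after {years} years: ~{gap:.0f}x")
print(f"annual growth: data ~{annual_data:.2f}x, I/O ~{annual_io:.2f}x")
```

Under these figures, data volume roughly doubles every year while bandwidth gains about 12% per year, so moving all the data to one place falls further behind each year.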

Distributed Data Mining (DDM)
• DDM comes in 2 types:
  1. Mining of Distributed Data (MDD)
  2. Distributed Mining of Data (DMD)
• Type 1 takes many forms, with data being centralized (in whole or in partitions)
• Type 2 requires sophisticated algorithms that operate with data in situ ...
  • Ship the Code to the Data
  • The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until it converges on a solution.
  • This can be pipeline-initiated or scientist end-user-initiated.
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/
• Ultimate goal: Knowledge Discovery through Data Discovery
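The "ship the code to the data" pattern of Type 2 can be sketched with k-means as a stand-in algorithm. Each data node computes per-centroid sums and counts on its own partition, only those partial results travel, and a coordinator merges them and repeats until the centroids stop moving. The choice of k-means, the function names, and the in-process "nodes" (plain lists) are all assumptions for illustration; the slide does not name a specific algorithm.

```python
def local_partial(partition, centroids):
    """Runs at a data node: per-centroid (sums, counts) over local points only."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in partition:
        nearest = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def distributed_kmeans(partitions, centroids, tol=1e-9, max_iter=100):
    """Coordinator: merge partial results and iterate until convergence."""
    for _ in range(max_iter):
        # In a real deployment each call would execute remotely, next to the data.
        partials = [local_partial(p, centroids) for p in partitions]
        updated = []
        for c, old in enumerate(centroids):
            total = sum(sums[c] for sums, _ in partials)
            count = sum(counts[c] for _, counts in partials)
            updated.append(total / count if count else old)
        if max(abs(a - b) for a, b in zip(updated, centroids)) < tol:
            return updated
        centroids = updated
    return centroids


# Three "data nodes", each holding its own 1-D partition (toy values).
partitions = [[1.0, 2.0, 9.0], [3.0, 10.0], [11.0]]
centroids = distributed_kmeans(partitions, centroids=[0.0, 5.0])
# centroids == [2.0, 10.0]
```

Per iteration, each node ships only one sum and one count per centroid, regardless of how many objects it holds, which is what makes the approach attractive when I/O bandwidth is the bottleneck.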
