
Page 1: DDM Kirk. LSST-VAO discussion: Distributed Data Mining (DDM) Kirk Borne George Mason University March 24, 2011


LSST-VAO discussion: Distributed Data Mining (DDM)

Kirk Borne
George Mason University

March 24, 2011


The LSST Data Challenges


The LSST Data Mining Challenges

1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).

2. Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.

3. Massive event stream: knowledge extraction in real time for 100,000 events each night.

• Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.

• Look at #2 and #3 in more detail ...


LSST data mining challenge #2

• Accurately characterize and classify 50 billion objects and 20 trillion source observations

• Requires VO-accessible multi-wavelength data

• Szalay’s Law: Astrophysical discovery potential grows as (number of data sources)^2

Benefits of very large datasets:

• best statistical analysis of “typical” events

• automated search for “rare” events


LSST data mining challenge #3

• Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data:

[Figure: flux versus time for the new sky event]

More data points help!

Characterize first! Then classify.
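
A minimal sketch of the "characterize first" step, assuming an event arrives as a list of (time, flux) pairs. The function name `characterize` and the particular descriptors (amplitude, scatter, time of peak) are illustrative assumptions, not the features the LSST pipeline actually computes. Classification would then operate on these descriptors rather than on the raw samples.

```python
from statistics import mean, pstdev

def characterize(light_curve):
    """Reduce a list of (time, flux) samples to a small feature dictionary.

    The descriptors chosen here are hypothetical examples of features.
    """
    times = [t for t, _ in light_curve]
    fluxes = [f for _, f in light_curve]
    # Index of the brightest sample.
    peak_index = max(range(len(fluxes)), key=fluxes.__getitem__)
    return {
        "n_points": len(light_curve),
        "mean_flux": mean(fluxes),
        "flux_scatter": pstdev(fluxes),          # population standard deviation
        "amplitude": max(fluxes) - min(fluxes),  # peak-to-trough range
        "time_of_peak": times[peak_index],
    }

# Made-up samples for one event, for illustration only.
event = [(0.0, 1.0), (1.0, 1.5), (2.0, 4.0), (3.0, 2.5), (4.0, 1.0)]
features = characterize(event)
```

With more samples per event, descriptors such as the scatter and the time of peak become better constrained, which is the sense in which more data points help.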


Characterization Use Case #1

• Feature detection and extraction:

  – Automated pipelines’ tasks: Characterize!

    • Identify and describe features in the data

    • Extract feature descriptors from the data

    • Curate these features for scientific re-use

  – Human experts’ tasks: Categorize and Classify!

    • Associate features with astrophysical processes

    • Find boundaries between feature sets and label them

  – Example: Star-Galaxy Separation


Characterization Use Case #2

• The clustering problem:

  – Finding clusters of objects within a data set

  – Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors

    • N is >10^10, so what is the most efficient way to sort?

    • Number of dimensions ~ 1000 – therefore, we have an enormous subspace search problem

  – Scientist: determine the significance of the clusters (statistically and scientifically) – categorize!
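
A minimal sketch of friends-of-friends grouping, assuming points are coordinate tuples and that two points are "friends" when they lie within a chosen linking length. The function name `friends_of_friends` and the `linking_length` parameter are illustrative assumptions. The pair search here is brute force, O(N^2), which is exactly what cannot scale to N > 10^10; a real pipeline would need a spatial index or a distributed approach.

```python
from math import dist

def friends_of_friends(points, linking_length):
    """Group point indices into clusters using union-find.

    Two points are friends if they lie within linking_length of each other;
    friends of friends end up in the same cluster.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pair search: fine for a demo, hopeless for billions of objects.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= linking_length:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Made-up 2-D points for illustration.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.1), (9.0, 0.0)]
groups = friends_of_friends(points, linking_length=0.6)
```

Points 0 and 2 are farther apart than the linking length, yet they share a cluster because both are friends of point 1. The isolated point forms a cluster of one, which is where the scientist's question about significance begins.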


Characterization Use Case #3

• Outlier detection: (unknown unknowns)

  – Finding the objects and events that are outside the bounds of our expectations (outside known clusters)

  – These may be real scientific discoveries or garbage

  – Outlier detection is therefore useful for:

    • Novelty Discovery – is my Nobel prize waiting?

    • Anomaly Detection – is the detector system working?

    • Data Quality Assurance – is the data pipeline working?

  – How does one optimally find outliers in 10^3-D parameter space? or in interesting subspaces (in lower dimensions)?

  – How do we measure their “interestingness”?
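
One possible way to put a number on "interestingness", offered only as a sketch: score each object by the distance to its k-th nearest neighbor, so that objects far from any known cluster score highest. This is a swapped-in, generic technique, not a method named in the talk; the function name `knn_outlier_scores` and the parameter `k` are assumptions, and the brute-force search would not survive contact with a 10^3-D catalog of billions of objects.

```python
from math import dist

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores mean the point sits farther from the bulk of the data.
    Assumes len(points) > k.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Made-up data: a tight group of four points plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
scores = knn_outlier_scores(points, k=2)
most_interesting = max(range(len(points)), key=scores.__getitem__)
```

The score ranks candidates but cannot say whether the top one is a discovery, a detector fault, or a pipeline bug; that distinction still needs the follow-up the slide describes.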


Characterization Use Case #4

• The dimension reduction problem:

  – Finding correlations and “fundamental planes” of parameters

  – Number of attributes can be hundreds or thousands

    • The Curse of High Dimensionality!

  – Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?

  – Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
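
A minimal sketch of the eigenvector question for the linear case only, assuming each object is a tuple of numeric parameters. It finds the direction of greatest variance (the first principal component) by power iteration on the covariance matrix. The function name `leading_component` is an assumption, and principal component analysis is just one generic choice of condensed representation; it will not uncover the non-linear combinations the slide also asks about.

```python
def leading_component(rows, iterations=200):
    """Return a unit vector along the direction of greatest variance.

    Uses power iteration on the sample covariance matrix. Assumes the data
    are not all identical and that the start vector is not orthogonal to
    the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Made-up catalog: the second parameter is twice the first, the third is constant.
rows = [(1.0, 2.0, 5.0), (2.0, 4.0, 5.0), (3.0, 6.0, 5.0), (4.0, 8.0, 5.0)]
axis = leading_component(rows)
```

Here three observed parameters collapse onto a single axis proportional to (1, 2, 0): one number per object carries all of the variation, which is the kind of condensed representation the slide is asking for.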


The LSST Data Mining Challenges:

What’s the common theme?

• Need multi-wavelength data in all use cases!

• VO-accessible ancillary information is essential.

Requirements for success:

• Discovery of distributed data sources

• Access to distributed data sources

• Applying characterization and clustering (data mining) algorithms on distributed data:

  • Unsupervised and Supervised Machine Learning


Data Bottleneck

• Mismatch:

  • Data volumes increase 1000x in 10 yrs

  • I/O bandwidth improves ~3x in 10 years

• Therefore . . . Distributed Data Mining
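
The mismatch can be worked through as simple arithmetic, assuming (as a simplification not stated on the slide) that the time to move a whole archive scales as volume divided by bandwidth. Only the two growth factors come from the slide; the variable names are illustrative.

```python
data_growth = 1000.0  # growth factor in data volume over 10 years (from the slide)
io_growth = 3.0       # growth factor in I/O bandwidth over 10 years (from the slide)
years = 10

# If transfer time scales as volume / bandwidth, moving the full archive
# takes this many times longer at the end of the decade than at the start.
relative_transfer_time = data_growth / io_growth

# Equivalent compound annual growth factors.
annual_data_growth = data_growth ** (1 / years)
annual_io_growth = io_growth ** (1 / years)
```

Under that assumption the transfer time grows by a factor of roughly 333 in ten years: data volume about doubles every year while bandwidth gains only about 12% per year. That gap is the argument for moving the computation instead of the data.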


Distributed Data Mining (DDM)

• DDM comes in 2 types:

  1. Mining of Distributed Data (MDD)

  2. Distributed Mining of Data (DMD)

• Type 1 takes many forms, with data being centralized (in whole or in partitions)

• Type 2 requires sophisticated algorithms that operate with data in situ …

  • Ship the Code to the Data

  • The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until a solution is converged upon.

• This can be pipeline-initiated or scientist end-user-initiated.

• References: http://www.cs.umbc.edu/~hillol/DDMBIB/

• Ultimate goal: Knowledge Discovery through Data Discovery
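
The "ship the code to the data" pattern can be sketched with a toy example. The code below uses k-means on one-dimensional values as a stand-in algorithm (an assumption chosen for brevity, not one taken from the talk or its bibliography) and simulates data nodes as in-process lists. The function names `local_partial_sums` and `distributed_kmeans` are hypothetical. Each node computes small per-cluster summaries on its own data, only those summaries are shipped and combined, and the loop repeats until the centers stop moving.

```python
def local_partial_sums(partition, centers):
    """Runs on a data node: assign each local value to its nearest center.

    Returns per-center (sums, counts). Only these small summaries leave
    the node; the raw data stay in place.
    """
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in partition:
        nearest = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts

def distributed_kmeans(partitions, centers, tolerance=1e-9, max_rounds=100):
    """Coordinator: broadcast centers, merge partial results, iterate."""
    for _ in range(max_rounds):
        # In a real deployment each call would execute remotely on its node.
        results = [local_partial_sums(p, centers) for p in partitions]
        new_centers = []
        for c in range(len(centers)):
            total = sum(s[c] for s, _ in results)
            count = sum(n[c] for _, n in results)
            # Keep the old center if no value was assigned to it.
            new_centers.append(total / count if count else centers[c])
        if max(abs(a - b) for a, b in zip(new_centers, centers)) < tolerance:
            return new_centers
        centers = new_centers
    return centers

# Two simulated data nodes holding made-up values.
node_a = [1.0, 1.2, 0.8, 9.9]
node_b = [10.1, 10.0, 1.0]
centers = distributed_kmeans([node_a, node_b], centers=[0.0, 5.0])
```

Per round, each node sends back two short lists whose length depends on the number of clusters, not on the number of objects it holds, which is what makes the approach attractive when I/O bandwidth is the bottleneck.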