LSST-VAO discussion: Distributed Data Mining (DDM)
Kirk Borne, George Mason University
March 24, 2011
The LSST Data Challenges
The LSST Data Mining Challenges
1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for 100,000 events each night.
• Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
• Look at #2 and #3 in more detail ...
LSST data mining challenge #2
• Accurately characterize and classify 50 billion objects and 20 trillion source observations
• Requires VO-accessible multi-wavelength data
• Szalay’s Law: Astrophysical discovery potential grows as (number of data sources)^2
Benefits of very large datasets:
• best statistical analysis of “typical” events
• automated search for “rare” events
LSST data mining challenge #3
• Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data:
[Figure: light curve of a new sky event, flux versus time]
• More data points help!
• Characterize first! Then classify.
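To make "characterize first, then classify" concrete, here is a minimal sketch of the characterization step for a single flux-versus-time series. The function name, the chosen features, and the sample values are illustrative assumptions, not part of the LSST pipeline; a real classifier would consume many more descriptors.

```python
import numpy as np

def characterize_light_curve(time, flux):
    """Return a few descriptive features for one light curve (hypothetical feature set)."""
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    # Slope of a straight-line fit: a crude indicator of rising or fading flux.
    slope, _intercept = np.polyfit(time, flux, 1)
    return {
        "n_points": int(flux.size),
        "median_flux": float(np.median(flux)),
        "amplitude": float(flux.max() - flux.min()),
        "std_flux": float(np.std(flux)),
        "linear_trend": float(slope),
    }

# Made-up sample event: a short brightening around t = 2.
time = [0.0, 1.0, 2.0, 3.0, 4.0]
flux = [10.0, 10.5, 14.0, 11.0, 10.2]
features = characterize_light_curve(time, flux)
```

The feature dictionary, not the raw light curve, is what a later classification step would work from, and each added observation refines these descriptors.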
Characterization Use Case #1
• Feature detection and extraction:
– Automated pipelines’ tasks: Characterize!
  • Identify and describe features in the data
  • Extract feature descriptors from the data
  • Curate these features for scientific re-use
– Human experts’ tasks: Categorize and Classify!
  • Associate features with astrophysical processes
  • Find boundaries between feature sets and label them
– Example: Star-Galaxy Separation
Characterization Use Case #2
• The clustering problem:
– Finding clusters of objects within a data set
– Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors
  • N is > 10^10, so what is the most efficient way to sort?
  • Number of dimensions ~ 1000 – therefore, we have an enormous subspace search problem
– Scientist: determine the significance of the clusters (statistically and scientifically) – categorize!
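As an illustration of the friends-of-friends idea, the sketch below links any two points closer than a chosen linking length and merges linked points into groups with a union-find structure. It is a brute-force O(N^2) toy, so it shows the grouping logic only and is not a candidate for N > 10^10; the function name, the linking length, and the sample points are assumptions.

```python
import numpy as np

def friends_of_friends(points, linking_length):
    """Label points so that points joined by a chain of close 'friends' share a label."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Distances from point i to every later point.
        d = np.linalg.norm(points[i + 1:] - points[i], axis=1)
        for offset in np.nonzero(d < linking_length)[0]:
            j = i + 1 + int(offset)
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri  # merge the two groups

    roots = [find(i) for i in range(n)]
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Three points in a chain plus a separate pair (made-up data).
points = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 5.0], [5.2, 5.1]]
labels = friends_of_friends(points, linking_length=0.6)
```

The first and third points are farther apart than the linking length but still end up in one group, because the second point is a friend of both.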
• Outlier detection: (unknown unknowns)
– Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
– These may be real scientific discoveries or garbage
– Outlier detection is therefore useful for:
  • Novelty Discovery – is my Nobel prize waiting?
  • Anomaly Detection – is the detector system working?
  • Data Quality Assurance – is the data pipeline working?
– How does one optimally find outliers in 10^3-D parameter space? or in interesting subspaces (in lower dimensions)?
– How do we measure their “interestingness”?
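One simple way to put a number on "interestingness" is the distance from each object to its k-th nearest neighbor: objects far from every known cluster get large scores. The sketch below is only one possible measure, chosen for illustration; the function name, the value of k, and the synthetic data are assumptions, and the all-pairs distance matrix would not scale to LSST-sized catalogs.

```python
import numpy as np

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor (larger = more isolated)."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dist, axis=1)[:, k]

# Synthetic data: one compact cluster plus a single far-away candidate.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(50, 3))
candidate = np.array([[8.0, 8.0, 8.0]])
scores = knn_outlier_scores(np.vstack([cluster, candidate]), k=2)
most_isolated = int(np.argmax(scores))
```

A high score does not say whether the object is a discovery, a detector fault, or a pipeline error; it only flags the object for follow-up.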
Characterization Use Case #3
• The dimension reduction problem:
– Finding correlations and “fundamental planes” of parameters
– Number of attributes can be hundreds or thousands
  • The Curse of High Dimensionality!
– Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
– Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
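For the linear case, principal component analysis is one standard way to look for such eigenvectors. The sketch below uses a singular value decomposition of the centered data; the function name and the synthetic 6-attribute data set (built from 2 hidden parameters) are assumptions for illustration, and non-linear combinations would need other methods.

```python
import numpy as np

def principal_components(data, n_components):
    """Return the leading principal axes, their explained-variance fractions, and the projection."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the eigenvectors of the covariance matrix, ordered by variance.
    _u, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    components = vt[:n_components]
    projected = centered @ components.T
    return components, explained[:n_components], projected

# Synthetic catalog: 6 observed attributes driven by 2 hidden parameters plus small noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
observed = latent @ mixing + 0.01 * rng.normal(size=(200, 6))
components, explained, projected = principal_components(observed, n_components=2)
```

When the explained-variance fractions of a few components sum to nearly 1, the attributes lie close to a low-dimensional plane, which is the situation the "fundamental plane" question asks about.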
Characterization Use Case #4
The LSST Data Mining Challenges:
What’s the common theme?
• Need multi-wavelength data in all use cases!
• VO-accessible ancillary information is essential.
Requirements for success:
• Discovery of distributed data sources
• Access to distributed data sources
• Applying characterization and clustering (data mining) algorithms on distributed data:
  • Unsupervised and Supervised Machine Learning
Data Bottleneck
• Mismatch:
  • Data volumes increase 1000x in 10 years
  • I/O bandwidth improves ~3x in 10 years
• Therefore . . . Distributed Data Mining
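The size of the mismatch can be worked out directly from the two growth figures on the slide. The short calculation below assumes both factors are spread evenly over the decade, which is a simplification made here and not a claim from the talk.

```python
# Growth factors over 10 years, taken from the slide.
data_growth = 1000.0
io_growth = 3.0
years = 10

# Time to read the full data set once grows by the ratio of the two factors.
scan_time_factor = data_growth / io_growth

# Equivalent yearly factors, assuming steady compound growth (an assumption).
annual_data_growth = data_growth ** (1 / years)
annual_io_growth = io_growth ** (1 / years)

print(f"Full-scan time grows about {scan_time_factor:.0f}x in {years} years")
print(f"Data: about {annual_data_growth:.2f}x per year; I/O: about {annual_io_growth:.2f}x per year")
```

Under these assumptions a full pass over the data takes roughly 333 times longer after a decade, which is why moving the data to the computation stops being practical.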
Distributed Data Mining (DDM)
• DDM comes in 2 types:
  1. Mining of Distributed Data (MDD)
  2. Distributed Mining of Data (DMD)
• Type 1 takes many forms, with data being centralized (in whole or in partitions)
• Type 2 requires sophisticated algorithms that operate with data in situ …
  • Ship the Code to the Data
  • The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until a solution is converged upon.
  • This can be pipeline-initiated or scientist end-user-initiated.
• References: http://www.cs.umbc.edu/~hillol/DDMBIB/
• Ultimate goal: Knowledge Discovery through Data Discovery
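The "ship the code to the data" pattern can be sketched with a distributed variant of k-means: each node assigns its own points to the current centroids and returns only per-cluster sums and counts, and the merged partial results update the centroids until they stop moving. This is an illustrative example of the pattern, not an algorithm named in the talk; the nodes are simulated as arrays in one process, and all function names and data are assumptions.

```python
import numpy as np

def local_partial_sums(local_data, centroids):
    """Runs at a data node: returns per-cluster sums and counts, never the raw points."""
    dist = np.linalg.norm(local_data[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for c in range(k):
        members = local_data[nearest == c]
        sums[c] = members.sum(axis=0)
        counts[c] = len(members)
    return sums, counts

def distributed_kmeans(nodes, centroids, tol=1e-6, max_iter=100):
    """Iterate: ship centroids out, merge the partial results, stop when centroids settle."""
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        partials = [local_partial_sums(data, centroids) for data in nodes]
        total_sums = sum(p[0] for p in partials)
        total_counts = sum(p[1] for p in partials)
        new_centroids = centroids.copy()
        occupied = total_counts > 0  # leave empty clusters where they are
        new_centroids[occupied] = total_sums[occupied] / total_counts[occupied, None]
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids

# Two simulated data nodes, each holding an uneven share of two clusters.
rng = np.random.default_rng(2)
nodes = [
    np.vstack([rng.normal(0.0, 0.5, (40, 2)), rng.normal(10.0, 0.5, (10, 2))]),
    np.vstack([rng.normal(0.0, 0.5, (15, 2)), rng.normal(10.0, 0.5, (35, 2))]),
]
centroids = distributed_kmeans(nodes, centroids=[[1.0, 1.0], [9.0, 9.0]])
```

Only the small sums-and-counts arrays cross between nodes in each iteration, so the traffic depends on the number of clusters and dimensions, not on the number of objects stored at each site.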