Interactive Data Exploration using Constraints
DESCRIPTION
Interactive Data Exploration using Constraints. Alexander Kalinin, Ugur Cetintemel, Stan Zdonik. CP + DBMS for Data Intensive Exploration. Interactive Data Exploration (IDE): searching for the “interesting” within big data. Where’s Waldo? Where’s Horrible Gelatinous Blob?
TRANSCRIPT
![Page 1: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/1.jpg)
Interactive Data Exploration using Constraints
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik
![Page 2: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/2.jpg)
CP + DBMS for Data Intensive Exploration
![Page 3: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/3.jpg)
Interactive Data Exploration (IDE)
Searching for the “interesting” within big data
• Exploratory analysis: ad-hoc & repetitive
  • Questions are not well defined
  • “Interesting” can be complex
• Human-in-the-loop operation
  • Fast, online results
  • Query refinement
Where’s Waldo? Where’s Horrible Gelatinous Blob?
![Page 4: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/4.jpg)
Exploratory Queries: Some examples
• First-order
  • “Celestial 3-5° by 5-7° regions with brightness > 0.8”
• Higher-order
  • “Pairs of 2° by 2° celestial regions with similarity > 0.5”
• Optimized
  • “Celestial 3° by 7° region with maximum brightness”
Sloan Digital Sky Survey (SDSS)
![Page 5: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/5.jpg)
“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
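The three steps above can be sketched outside SQL as a brute-force scan. This is only an illustration of why full enumeration is expensive, not the query the authors ran: the function name `enumerate_regions`, the in-memory `grid` of per-cell brightness values, and the use of prefix sums are all assumptions made for this example.

```python
def enumerate_regions(grid, w_range, h_range, threshold):
    """Yield (x, y, w, h, avg) for every window whose average exceeds threshold.

    grid[r][c] is assumed to hold one brightness value per 1x1 degree cell
    (step 1, dividing the data into cells, is taken as already done).
    """
    rows, cols = len(grid), len(grid[0])
    # 2D prefix sums: each window sum becomes an O(1) lookup.
    prefix = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            prefix[r + 1][c + 1] = (grid[r][c] + prefix[r][c + 1]
                                    + prefix[r + 1][c] - prefix[r][c])
    # Step 2: enumerate every allowed size at every corner.
    for h in range(h_range[0], h_range[1] + 1):
        for w in range(w_range[0], w_range[1] + 1):
            for r in range(rows - h + 1):
                for c in range(cols - w + 1):
                    total = (prefix[r + h][c + w] - prefix[r][c + w]
                             - prefix[r + h][c] + prefix[r][c])
                    avg = total / (w * h)
                    # Step 3: final filtering.
                    if avg > threshold:
                        yield (c, r, w, h, avg)
```

Even with constant-time window sums, every corner and every size is visited before any filtering happens, so the work grows with the size of the whole search area rather than with the number of interesting regions.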
![Page 6: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/6.jpg)
DBMSs for IDE?
• No native support for exploratory constructs
  • No power set
  • No user-defined objective functions
• No support for interactivity
  • No online results
  • No notion of a “query session”
![Page 7: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/7.jpg)
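Data Exploration as a CP problem
Example query: “Celestial 3-5° by 5-7° regions with average brightness > 0.8”
Decision variables: the left-most corner and the lengths of the region
Constraints: the shape of the region (3-5° by 5-7°) and its average brightness (> 0.8)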
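One way to write the example query as a constraint program is sketched below. The symbols $x, y$ (left-most corner) and $l_x, l_y$ (lengths) are notation chosen here for illustration, and $\mathrm{avg\_br}$ borrows the function name used on the Semantic Windows slide; the bounds of the searched area are left out because the slide does not give them.

```latex
\begin{aligned}
\text{Decision variables:}\quad & x,\ y \ \text{(left-most corner)}, \qquad l_x,\ l_y \ \text{(lengths)} \\
\text{Constraints:}\quad & 3 \le l_x \le 5, \qquad 5 \le l_y \le 7, \\
& \mathrm{avg\_br}(x,\ y,\ l_x,\ l_y) > 0.8
\end{aligned}
```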
![Page 8: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/8.jpg)
CP Solvers
• Large variety of methods for exploring a search space
  • Branch-and-Cut
  • Large Neighborhood Search (LNS)
  • Randomized search with restarts
• Highly extensible: important for ad-hoc exploration!
  • New constraints/functions
  • New search heuristics
• But… compared with DBMSs:
  • In-memory data (CP) vs. efficient disk data handling (DBMS)
  • No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)
![Page 9: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/9.jpg)
SearchLight
• A fusion of CP solvers and DBMSs
  • The DBMS stores and maintains data
  • The CP solver explores the constrained search space
• SearchLight is a mediator
  • Extends CP solvers
  • Provides buffering, prefetching
  • Distributes the search
  • Makes CP solvers cost-aware
[Architecture diagram: an exploration query enters SearchLight (metadata, buffering), which sits between the CP solver (OR-tools, Gecode; constraints/functions, search heuristics) and the DBMS (PostgreSQL, SciDB). Arrow labels: “Data, estimates, decisions”; “Requests, solutions”; “Data, schema info”; “Data requests, constraints”.]
![Page 10: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/10.jpg)
Research Issues
• A cost model for data-intensive CP
  • Each search decision has an I/O cost
• Mediation of data access
  • Meta-data for guiding and optimizing search (annotated trees, samples, etc.)
  • Prefetching
• Distributed search
  • Multi-node parallel branch processing
• CP/DBMS integrated query planning
  • Propagating CP/Schema constraints
![Page 11: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/11.jpg)
Semantic Windows (SW)
• First step towards constraint-based exploration
• Supports first-order queries
  • Exploration via multi-dimensional “windows of interest”
  • Shape-based constraints (“a 3-5° by 5-7° region”)
  • Content-based constraints (“avg_br() > 0.8”)
• Custom distributed cost-aware solver
![Page 12: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/12.jpg)
SQL/CP Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND
       size(ra) = 5 AND
       size(dec) >= 5 AND size(dec) <= 7
![Page 13: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/13.jpg)
Cost-aware Solver
• Best-first search based on the utility
  • Utility = f(benefit, cost)
• Benefit: how close a window is to satisfying the constraints
  • A distance between the constraint’s value and the estimated value
• Cost: how expensive it is to read a window from disk
  • Measured in cells we have to read
  • Adjustments are made for skewed data
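A best-first search driven by a utility can be sketched with a priority queue. The slides do not define f, so the `utility` and benefit formulas below are assumptions picked only to make the example concrete, as are the names `best_first`, `estimate`, and `read_exact`; skew adjustments and dynamic utility updates are left out.

```python
import heapq


def utility(benefit, cost):
    # Assumed form of f(benefit, cost); the slides do not specify it.
    return benefit / (1.0 + cost)


def best_first(windows, estimate, read_exact, target):
    """Read windows in decreasing utility order; yield those with value > target.

    windows    -- iterable of window descriptors
    estimate   -- window -> (estimated_value, cost_in_cells), e.g. from a sample
    read_exact -- window -> exact objective value (the expensive disk read)
    target     -- threshold of the content-based constraint
    """
    heap = []
    for order, w in enumerate(windows):
        est, cost = estimate(w)
        # Benefit is 1 when the estimate already meets the threshold and
        # shrinks with the distance between the threshold and the estimate.
        benefit = 1.0 / (1.0 + max(0.0, target - est))
        # heapq is a min-heap, so the utility is negated; `order` breaks ties.
        heapq.heappush(heap, (-utility(benefit, cost), order, w))
    while heap:
        _, _, w = heapq.heappop(heap)
        value = read_exact(w)
        if value > target:
            yield w, value
```

Because the function is a generator, a caller receives each qualifying window as soon as it is read, which mirrors the online-results goal stated earlier in the talk.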
![Page 14: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/14.jpg)
Optimizations
• Cost and benefit are estimated by sampling
• Objective function values are cached in a cell cache
  • Dynamic utility updates
  • Avoiding re-reads of the same cells
• Constraint-based pruning during the search
• Distributed search
  • Multiple nodes work in parallel
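The cell cache idea can be illustrated with a small memoizing wrapper. This is a sketch under assumptions, not the authors' data structure: `CellCache`, the `read_cell` callback, and the reading of "dynamic utility updates" as "the remaining cost of a window drops as its cells get cached" are all hypothetical.

```python
class CellCache:
    """Caches per-cell objective values so overlapping windows share reads."""

    def __init__(self, read_cell):
        self._read_cell = read_cell  # hypothetical callback: (x, y) -> value, hits disk
        self._values = {}
        self.disk_reads = 0

    def get(self, x, y):
        key = (x, y)
        if key not in self._values:
            self._values[key] = self._read_cell(x, y)
            self.disk_reads += 1
        return self._values[key]

    def uncached_cells(self, x, y, w, h):
        """Cells of the window not yet cached: a cost that shrinks as the cache fills."""
        return sum((cx, cy) not in self._values
                   for cx in range(x, x + w) for cy in range(y, y + h))

    def window_avg(self, x, y, w, h):
        total = sum(self.get(cx, cy)
                    for cx in range(x, x + w) for cy in range(y, y + h))
        return total / (w * h)
```

After a 3 by 5 window at (0, 0) has been read, a 3 by 5 window at (1, 0) needs only the 5 cells of its new column, so its cost estimate can be lowered before it is visited.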
![Page 15: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/15.jpg)
Adaptive Prefetching
• Dispersed reads hurt total performance
• Prefetching: read the neighborhood with every window
• Progress-driven prefetching: how much?
  • Finding new results? Prefetch a small amount
  • No new results? Increase the prefetch exponentially
[Figure: four window reads, numbered 1-4, shown with no prefetching and with prefetching]
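The progress-driven rule above can be sketched as a small controller. The class name `AdaptivePrefetcher` and the parameters `base`, `factor`, and `max_size` are assumptions for illustration; the slides say only that the prefetch is small while results keep arriving and grows exponentially otherwise.

```python
class AdaptivePrefetcher:
    """Tracks how much neighborhood to prefetch with the next window read."""

    def __init__(self, base=1, factor=2, max_size=64):
        self.base = base          # small prefetch used while results keep coming
        self.factor = factor      # exponential growth factor
        self.max_size = max_size  # assumed cap so the prefetch stays bounded
        self.size = base

    def update(self, found_new_results):
        """Update and return the prefetch size after a window read."""
        if found_new_results:
            # Progress is being made: fall back to a small prefetch.
            self.size = self.base
        else:
            # No progress: grow the prefetch exponentially, up to the cap.
            self.size = min(self.size * self.factor, self.max_size)
        return self.size
```

With the defaults, three reads without results followed by one with results give prefetch sizes 2, 4, 8, and then 1.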
![Page 16: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/16.jpg)
Online vs. Total Performance Results
• 35GB data set (part of the SDSS)
• 4GB total memory (1GB shared buffer)
• First results in 10-20 seconds
[Chart: time in seconds (axis 0 to 6000) against % of results returned (20%, 40%, 60%, 80%, 100%, total) for Static, Adaptive, and PostgreSQL]
![Page 17: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/17.jpg)
Conclusions
• Integrate CP and DBMS technologies
• SearchLight: Data-Intensive CP Engine
• Initial implementation: Semantic Windows
  • Cost-aware solver
  • Mediating disk access (sampling, prefetching)
  • Distributed search
• Current work:
  • OR-Tools as the CP solver
  • SciDB as the DBMS
![Page 18: Interactive Data Exploration using Constraints](https://reader035.vdocuments.mx/reader035/viewer/2022070422/56816403550346895dd5a670/html5/thumbnails/18.jpg)
Questions?
Supported by: