Anastasia Ailamaki
![Page 1: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/1.jpg)
Nothing is for granted: Making wise decisions using real-time intelligence
Anastasia Ailamaki
![Page 2: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/2.jpg)
Catching up With an Evolving Landscape
One size fits all:
– Easier administration/development
– Easier usage
– Single point of truth
Modern requirements:
– OLAP, OLTP, streams, federated data, ML, FaaS, …, mixed
– Custom & domain-specific types/operations
– Diverse requirements and priorities
Treat heterogeneity as a first-class citizen
[Figure labels: ODBMS, DSMS, Spatial DBMS, KV-store]
“Useful” time: 9% (mining data for patterns). Source: Forbes
![Page 3: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/3.jpg)
Change Means Trouble
[Figure: three-way tug of war between Hardware, Data, and Workload]
Image: “Three-way tug of war” (https://www.momahler.com/ProArtistManifesto)
Next generation systems must adapt
![Page 4: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/4.jpg)
Change-driven Architectures
– Data & operations hint at the optimal architecture
– Allow operations to shape the DBMS architecture to their size
– Reduce uncertain, preemptive work and fixed components
Each workload runs on its own custom DBMS
[Figure labels: Capabilities, Runtime, Composability, Modularity, Specialization]
![Page 5: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/5.jpg)
Runtime specialization embraces heterogeneity
Data
Heterogeneous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
![Page 6: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/6.jpg)
From Data to Information Veracity
Need timely & informed data preparation
[Figure: over time, TXT files are loaded, cleaned, and the database is tuned before a query can run; labels: data growth, predictability]
![Page 7: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/7.jpg)
Preparation Kills Discovery
1. Load data
2. Clean data
3. Tune database
4. Ask analytical question
5. Plan execution
6. (finally!) run & get answer
$$$$$: Load 30%, Clean 50%, Index 17%
Cost grows with owned – not used! – data
Planning is expensive, often even wrong
[Figure: TXT files are loaded into a database, which then serves the query]
![Page 8: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/8.jpg)
Heterogeneous Data: Convert the Engine
Conservative: pre-load & convert data
– Time-consuming data adaptation to engine
– Transformation: pre-determined order wrt execution
– Wasteful loading of write-only data
Real-time: customize access paths at runtime
– Virtualize data; specialize access paths to queries + data
– On-the-fly access paths and positional caches
– Pay-as-you-go heterogeneous data accesses
Real-time adaptation to data formats → flexibility
[Figure: conservative path loads TXT files before the query; real-time path queries the TXT files directly]
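The idea of on-the-fly access paths with a positional cache can be sketched in a few lines. This is only an illustration under stated assumptions, not the speaker's system: the class `RawCsvAccess`, its method names, and the sample data are hypothetical. The raw file is never loaded or converted; the first lookup scans it once to remember where each row starts, and later lookups seek straight to the row.

```python
import io


class RawCsvAccess:
    """Answer field lookups directly on a raw CSV buffer (no load step)."""

    def __init__(self, raw: bytes, delimiter: str = ","):
        self.raw = raw
        self.delimiter = delimiter
        self.row_offsets = None  # positional cache, built on first access

    def _build_positions(self) -> None:
        # One scan records where each row starts; later lookups seek directly.
        offsets, pos = [], 0
        for line in io.BytesIO(self.raw):
            offsets.append(pos)
            pos += len(line)
        self.row_offsets = offsets

    def field(self, row: int, col: int) -> str:
        if self.row_offsets is None:  # pay only when the data is first touched
            self._build_positions()
        start = self.row_offsets[row]
        end = self.raw.find(b"\n", start)
        if end == -1:
            end = len(self.raw)
        return self.raw[start:end].decode("utf-8").split(self.delimiter)[col]


raw = b"id,city\n1,Lausanne\n2,Athens\n"
access = RawCsvAccess(raw)
print(access.field(2, 1))  # row 0 is the header line
```

The cache here stores only row offsets; a finer-grained variant could also remember field positions for the columns that queries actually touch.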
![Page 9: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/9.jpg)
Proteus: Customized Data Access
Each source as native format, generate special engine/query
– Engine adapts to data
– Plug-in per data source
– Build auxiliary structures
Interpretation Overhead
Generate Operators
✓ Reduce branches
✓ Minimize function calls
✓ Pipelining
Krikellas [ICDE2010], Neumann [VLDB2011]
[Figure: a DBMS transforms CSV, JSON, and .bin sources before the query; Proteus queries CSV, JSON, and .bin directly. Labels: Predicate, Expression Type, Collection Type]
![Page 10: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/10.jpg)
Full Cleaning: A Dirty Solution
Conservative: query cleaned database copy
– Time-consuming transformation: Dirty => Clean DB
– Analysis dependent
– Wasteful effort on unnecessary data
Real-time: Sanitize only interesting data
– Virtualized clean DB
– Query relaxation to guarantee correctness
– Clean touched data as-you-go
Focus on cleaning useful data
[Figure: offline cleaning before the query vs. cleaning applied on the way to the query result]
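A minimal sketch of "clean touched data as-you-go", assuming a single-table filter query. The class `LazilyCleanedTable`, the repair function, and the sample rows are hypothetical and stand in for whatever cleaning operators a real system would use. The query first applies a relaxed predicate so that dirty rows that might qualify are not missed, repairs only those rows, and then applies the exact predicate.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


class LazilyCleanedTable:
    """Repair rows only when a query touches them."""

    def __init__(self, rows: List[Row], fix: Callable[[Row], Row]):
        self.rows = rows
        self.fix = fix
        self.cleaned = set()  # indexes of rows already repaired

    def query(self, relaxed: Callable[[Row], bool],
              exact: Callable[[Row], bool]) -> List[Row]:
        result = []
        for i, row in enumerate(self.rows):
            if not relaxed(row):  # relaxed filter admits possibly-dirty matches
                continue
            if i not in self.cleaned:  # clean each touched row at most once
                row = self.fix(row)
                self.rows[i] = row
                self.cleaned.add(i)
            if exact(row):
                result.append(row)
        return result


table = LazilyCleanedTable(
    rows=[{"city": " lausanne "}, {"city": "Athens"},
          {"city": "Lausanne"}, {"city": "ZURICH "}],
    fix=lambda r: {**r, "city": r["city"].strip().title()},
)
hits = table.query(
    relaxed=lambda r: "laus" in r["city"].lower(),
    exact=lambda r: r["city"] == "Lausanne",
)
print(len(hits), sorted(table.cleaned))
```

Rows the query never touches (here the Athens and Zurich rows) stay dirty, so the cleaning cost follows the data that is actually used.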
![Page 11: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/11.jpg)
Query-Driven Data Sanitization
Clean only useful data with probabilistic fixes
[Figure: queries q1 and q2 expressed in a common algebra together with cleaning operators (Denial Constraints, Duplicate Elimination, Integrity Constraints); offline cleaning vs. cleaning on the way to the query result]
![Page 12: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/12.jpg)
Indexing Decisions: It’s All About Accesses
Conservative: parse before tune
– Based on workload-expectations
– Data duplication
– Coarse-grained access path selection
Real-time: on-the-fly indexing of raw data
– Virtualized (logical) partitioning & online index tuning
– Reuse raw data
– Fine-grained tuning, as-you-go
Online partial indexing, specialized to access pattern
[Figure: conservative path loads TXT files and tunes the database before the query; real-time path interleaves access and tuning while querying the raw TXT files]
![Page 13: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/13.jpg)
Evolving Indexes
– Data skipping
– Fine-grained access path selection
– Choose what to build & when
  - Value-Existence (e.g., Bloom filters)
  - Value-Position (e.g., B+ Trees)
– Build / drop based on budget
On-the-fly virtual indexes to invest in important data
[Figure: for attributes attr1 … attrN and query Qm, costs vs. gains of a Bloom filter (Bf) or a B+ tree (B+): "Should I build or not?"]
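The "should I build or not?" decision can be illustrated with a simple break-even rule. This is an assumption made for the example, not the policy from the talk: a structure is built once the gain it would have delivered on past accesses covers its build cost, and only while the remaining budget allows it. All names (`Candidate`, `OnlineIndexTuner`) and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Candidate:
    name: str
    build_cost: float       # one-off cost of building the structure
    gain_per_access: float  # estimated saving each time a query uses it
    accesses: int = 0
    built: bool = False


class OnlineIndexTuner:
    """Decide per access whether an auxiliary structure is worth building."""

    def __init__(self, budget: float):
        self.budget = budget
        self.candidates: Dict[str, Candidate] = {}

    def register(self, cand: Candidate) -> None:
        self.candidates[cand.name] = cand

    def record_access(self, name: str) -> bool:
        c = self.candidates[name]
        c.accesses += 1
        missed_gain = c.accesses * c.gain_per_access
        # Break-even rule (assumed): build when missed gain covers the cost
        # and the structure still fits in the remaining budget.
        if not c.built and missed_gain >= c.build_cost <= self.budget:
            self.budget -= c.build_cost
            c.built = True
        return c.built


tuner = OnlineIndexTuner(budget=10.0)
tuner.register(Candidate("bloom:attr1", build_cost=2.0, gain_per_access=1.0))
tuner.register(Candidate("btree:attrN", build_cost=10.0, gain_per_access=2.0))
bloom_decisions = [tuner.record_access("bloom:attr1") for _ in range(2)]
btree_decisions = [tuner.record_access("btree:attrN") for _ in range(5)]
print(bloom_decisions, btree_decisions, tuner.budget)
```

In this run the cheap Bloom filter is built on its second access, while the B+ tree is never built because the remaining budget no longer covers it. Dropping structures to reclaim budget is left out of the sketch.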
![Page 14: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/14.jpg)
Runtime specialization embraces heterogeneity
Data
Heterogeneous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
![Page 15: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/15.jpg)
AQP: From Expectations to Adaptation
Conservative: offline vs online AQP
– Tradeoff between performance and flexibility
– Preprocessing cost or reduced gains
– Static sampling user-driven tactic
Real-time: Workload-driven data summarization
– Maintain sample/sketch set specialized to workload history
– Materialize summaries based on gain estimates
– Pay-as-you-go summary creation and storage overheads
Adapt summaries based on workload patterns
[Figure: conservative path selects persistent summaries once, before the queries; real-time path refines summaries between queries]
![Page 16: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/16.jpg)
Tuning Materialized Summaries
– Window-based prediction
– Estimate prospective gains vs. storage cost
– Adapt window size based on quality of predictions
Materialize summaries maximizing gain on past workload
[Figure: useful summaries (S1–S4) for queries Q1–Q6, a window of w = 2, and a summary warehouse in which summaries are materialized and later used]
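Window-based selection can be sketched as a small greedy routine. The function `choose_summaries`, the gain and size figures, and the greedy gain-per-size ranking are assumptions made for illustration; the slide only states that summaries are chosen to maximize gain on the past workload within a window.

```python
from typing import Dict, List, Set


def choose_summaries(history: List[Set[str]], window: int,
                     gains: Dict[str, float], sizes: Dict[str, float],
                     budget: float) -> List[str]:
    """Pick summaries to materialize from the last `window` queries."""
    total_gain: Dict[str, float] = {}
    for useful in history[-window:]:  # each entry: summaries useful to a query
        for s in useful:
            total_gain[s] = total_gain.get(s, 0.0) + gains[s]
    # Greedy (assumed): highest gain per unit of storage first, name breaks ties.
    ranked = sorted(total_gain, key=lambda s: (-total_gain[s] / sizes[s], s))
    chosen, used = [], 0.0
    for s in ranked:
        if used + sizes[s] <= budget:
            chosen.append(s)
            used += sizes[s]
    return chosen


history = [{"S1"}, {"S2"}, {"S2", "S3"}, {"S4"}, {"S4"}, {"S1", "S4"}]
unit = {s: 1.0 for s in ("S1", "S2", "S3", "S4")}
small = choose_summaries(history, window=2, gains=unit, sizes=unit, budget=1.0)
large = choose_summaries(history, window=2, gains=unit, sizes=unit, budget=2.0)
print(small, large)
```

Adapting the window size is not shown; one option would be to grow or shrink `window` depending on how often the chosen summaries are actually hit by the next queries.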
![Page 17: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/17.jpg)
Bypassing Intra-Query Dependencies
Conservative: Query unnesting
– Staged query execution: waiting for unnecessary details
– Sharing prohibited across barriers
– Blocking & limited parallelism due to dependencies
Real-time: Speculate intra-query dependencies
– Virtualize query execution by speculating and repairing
– Exposed sharing and parallelism across barrier
– Verify & repair query results as-you-go
Expect, predict & proceed: specialize to uncertainty
[Figure: conservative path runs the sub-queries before the query; real-time path runs the query and its sub-queries side by side]
![Page 18: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/18.jpg)
Choke Points Reduce Parallelism
– Apply recursively
– Task parallelism and sharing opportunities
– Stable repair propagation
Subquery speculation to create task-independence
[Figure: the outer query consumes approximations (approx2, approx3) of Subquery 2 and Subquery 3; Guards and SVJoin operators apply the corrections Δ(Q1) and Δ(Q2)]
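Speculate-then-repair can be illustrated on a query of the form "rows with a value above the average". The sketch below is an assumption-laden toy, not the operators named on the slide: the outer query runs against an approximate subquery result in parallel with the exact subquery, a guard compares the two, and only the rows between the approximate and exact thresholds are repaired.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import List, Set


def exact_subquery(values: List[float]) -> float:
    return mean(values)


def outer_query(values: List[float], threshold: float) -> Set[int]:
    return {i for i, v in enumerate(values) if v > threshold}


def speculative_execute(values: List[float], sample_size: int = 4) -> Set[int]:
    approx = mean(values[:sample_size])  # cheap speculation from a prefix
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The subquery and the speculated outer query no longer wait on each other.
        exact_future = pool.submit(exact_subquery, values)
        spec_future = pool.submit(outer_query, values, approx)
        exact, result = exact_future.result(), spec_future.result()
    if exact != approx:  # guard: speculation was off, repair the difference
        lo, hi = sorted((approx, exact))
        delta = {i for i, v in enumerate(values) if lo < v <= hi}
        result = result - delta if exact > approx else result | delta
    return result


values = [1, 2, 3, 4, 10, 20, 30]
print(sorted(speculative_execute(values)))
```

The repair step touches only the rows whose outcome depends on the speculation error, which is what keeps the correction cheap when the approximation is close.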
![Page 19: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/19.jpg)
Planning in Multi-Query Execution
Conservative: Opportunistic sharing
– Missed sharing opportunities, to avoid optimization time
– (Multi-) Query & statistic dependent
– Sensitive to query order, plans, correlations
Real-time: Adaptive multi-query optimization
– Specialize to running queries & data at hand
– Inspect actual statistics, observed by minibatches
– Reconfigure execution for pay-as-you-go plans
Learn & explore using running queries → fine-granularity reoptimization
[Figure: conservative path plans and then executes each query separately; real-time path executes the queries together and revises plans during execution]
![Page 20: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/20.jpg)
Concurrent Queries: From Complexity to Shareability
– Reinforcement learning
– Low-overhead adaptation
Query-interoperability-based, on-the-fly batch-reoptimization
[Figure: RouLette processes Q1, Q2, Q3 with a feedback loop; labels: Plan efficiency, Execution efficiency, Query Processing Time; queries execute together and plans are revised during execution]
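The feedback loop of "execute a minibatch, observe, revise the plan" can be sketched with a deliberately simple explore-then-exploit selector. This is a much simpler stand-in for the reinforcement learning mentioned on the slide, not RouLette's algorithm; the class name, plan labels, and per-minibatch costs are all hypothetical.

```python
from typing import Dict, List


class MinibatchPlanSelector:
    """Pick a plan per minibatch from observed costs (explore, then exploit)."""

    def __init__(self, plans: List[str]):
        self.stats: Dict[str, List[float]] = {p: [0.0, 0] for p in plans}

    def choose(self) -> str:
        untried = [p for p, (_, runs) in self.stats.items() if runs == 0]
        if untried:  # try every plan once before trusting the averages
            return untried[0]
        return min(self.stats, key=lambda p: self.stats[p][0] / self.stats[p][1])

    def feedback(self, plan: str, cost: float) -> None:
        self.stats[plan][0] += cost  # accumulated observed cost
        self.stats[plan][1] += 1     # number of minibatches run with this plan


# Hypothetical per-minibatch costs that an execution engine would report.
observed_cost = {"join A-B first": 5.0, "join B-C first": 2.0}
selector = MinibatchPlanSelector(list(observed_cost))
picks = []
for _ in range(6):
    plan = selector.choose()
    selector.feedback(plan, observed_cost[plan])
    picks.append(plan)
print(picks)
```

Because decisions are revisited per minibatch, a wrong initial plan costs one minibatch rather than the whole query; a learning-based policy would additionally keep exploring as statistics drift.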
![Page 21: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/21.jpg)
Runtime specialization embraces heterogeneity
Data
Heterogeneous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
![Page 22: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/22.jpg)
Hardware Acceleration: A Balancing Game
Conservative: Performance – portability tradeoff across devices
– Device specific operator implementations
– Limited or expectation-based load-balancing
– Inefficient hardware use
Real-time: Synergistic CPU-GPU execution
– Virtualize hardware & generate hardware-specific code
– Throughput-based data-flow load balancing
– Exploit Accelerator-Level Parallelism
Use devices based on their relative performance
[Figure: a query running on CPU and GPU, conservative vs. real-time]
![Page 23: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/23.jpg)
Hardware Heterogeneity: Let the Data Flow
– Decouple data- from control-flow
– Encapsulate trait conversions into operators
– Inspect flows to load-balance
Flow inspection to load balance across heterogeneous devices
[Figure: a query plan spanning CPU and GPU built from the operators filter, unpack, pack, aggregate, router, mem-move, cpu2gpu, and gpu2cpu]
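Throughput-based routing can be sketched as follows. The `ThroughputRouter` class, its earliest-finish rule, and the 4:1 throughput figures are assumptions for illustration, not the router operator from the talk. Each block goes to the device expected to finish it first, given the throughput observed so far.

```python
from collections import Counter
from typing import Dict, List


class ThroughputRouter:
    """Send each data block to the device expected to finish it first."""

    def __init__(self, devices: List[str]):
        self.busy_until: Dict[str, float] = {d: 0.0 for d in devices}
        self.throughput: Dict[str, float] = {d: 1.0 for d in devices}  # blocks/s

    def observe(self, device: str, blocks: int, seconds: float) -> None:
        # Flow inspection: update the estimate from measured progress.
        self.throughput[device] = blocks / seconds

    def _finish_time(self, device: str) -> float:
        return self.busy_until[device] + 1.0 / self.throughput[device]

    def route(self) -> str:
        device = min(self.busy_until, key=lambda d: (self._finish_time(d), d))
        self.busy_until[device] = self._finish_time(device)
        return device


router = ThroughputRouter(["cpu", "gpu"])
router.observe("cpu", blocks=1, seconds=1.0)  # hypothetical measurements
router.observe("gpu", blocks=4, seconds=1.0)
placement = Counter(router.route() for _ in range(10))
print(placement)
```

With these figures the ten blocks split 8 to 2 in favor of the GPU, which matches the 4:1 throughput ratio without either device sitting idle.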
![Page 24: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/24.jpg)
Hardware Boundaries → Isolation Mechanisms
Conservative: Device-collocated OLAP & OLTP
– Wasted parallelism and throughput
– Static hardware preferences
– Destructive interference across workloads
Real-time: Dynamic task assignment minimizes interference
– ALP & hardware boundaries as an isolation mechanism
– Fresh-data-rate- & isolation-driven task assignment
– Pay-as-you-access-fresh-data interference
Align isolation requirements with hardware boundaries
[Figure: commands sent to CPU and GPU; the real-time variant adds a Schedule step]
![Page 25: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/25.jpg)
GPU Accesses Fresh Data from CPU Memory
– OLTP generates fresh data on CPU memory
– Data access protected by concurrency control
– OLAP needs to access fresh data
Provide snapshot isolation for OLAP w/o CC overheads
[Figure labels: OLTP, OLAP, Read / Write, Fetch Fresh Data, Storage, Main Memory, DRAM, NVLink, PCIe, CPU, GPU, Command, Schedule]
![Page 26: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/26.jpg)
HTAP: Chasing Freshness Locality
Conservative: Static OLAP-OLTP assignment
– Unnecessary tradeoff between interference and performance
– Pre-determined resource assignment based on workload type
– Wasteful data consolidation and synchronization
Real-time: Adaptive scheduling of HTAP workloads
– Specialize to requirements and data/freshness-rates
– Workload-based resource assignment
– Pay-as-you-go snapshot updates
Task placement based on resource usage
[Figure: four CPU/GPU task-placement configurations]
![Page 27: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/27.jpg)
Workload Isolation & Fresh Data Throughput
Freshness-based: from destructive to constructive interference
[Figure: system states ETL, Isolated, Elastic-Compute, Hybrid-Access, and Colocated, positioned by Performance Isolation and Fresh Data Access Bandwidth; labels: Fresh Data, OLAP, OLTP]
![Page 28: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/28.jpg)
Runtime specialization embraces heterogeneity
Data
Heterogeneous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
Inspect & Reduce
Relax & Repair
Track & Adjust
![Page 29: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/29.jpg)
Runtime specialization embraces heterogeneity
Application landscape changes data processing
Data
Heterogeneous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
Queries
Workload
![Page 30: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/30.jpg)
Complexity of Modern Workloads
Diverse modern data problems
– IoT, OCR, ML, NLP, Medical, Mathematics etc…
DBMS catch-up for popular functionality
– Human effort and big delays
– Oblivious to out-of-DBMS workflows
Vast resource of libraries
– Authored by domain experts, used by everybody
– Loose library-to-data-sources integration and optimization
Need for systems that can “learn” new functionality
![Page 31: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/31.jpg)
Five Old Friends Revisited
Data variety → Operational environment variety
– Unpredictable application requirements
Data veracity → Inter-component veracity
– Heterogeneous data & variable importance
Data volume → Structural volume
– Multi-layered system architectures
Data value → Resource value
– Broader, multi-featured analytics
Data velocity → Technological velocity
– Hardware heterogeneity & volatility
Intelligent systems to catch up with an evolving landscape
![Page 32: Anastasia Ailamaki](https://reader031.vdocuments.mx/reader031/viewer/2022012211/61df421737892c2c907e0306/html5/thumbnails/32.jpg)
Intelligent Real-time Systems
Incorporate change into native design.
Anticipate change and react, learning from errors.
A solution is only as efficient as its least adaptive component.