Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies
Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies. George Huo, Google, Inc. With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown).
![Page 1: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/1.jpg)
Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies
George Huo, Google, Inc.
With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)
![Page 2: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/2.jpg)
Two observations
![Page 3: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/3.jpg)
1. Correlations abound
Attributes tend to encode related info (these are soft functional dependencies)
• Geographic: {zip code, city, state, long/latitude}, e.g. 02116, Boston, MA, 71° 05'W
• {manufacturer, model, year}, e.g. Honda, Civic Hybrid, 2007
• {shipdate, receiptdate}
![Page 4: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/4.jpg)
2. Secondary indexes are often useless for range and aggregation queries
![Page 5: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/5.jpg)
How can we improve the access pattern of a secondary index?
SELECT * FROM lineitem WHERE orderdate = '2009-08-26'
• Clustered access pattern: sorted by orderdate (clustered index on orderdate), one seek
• Unclustered access pattern: sorted by order_id (secondary index on orderdate), many seeks
![Page 6: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/6.jpg)
Our contribution: Exploiting correlations to improve secondary index performance
![Page 7: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/7.jpg)
lineitem access pattern
SELECT * FROM lineitem WHERE orderdate = '2007-01-03'
• Clustered by primary key (uncorrelated)
• Clustered by shipdate (correlated)
![Page 8: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/8.jpg)
Correlation determines index performance
[Figure: query runtime (s) versus number of clustered fragments (1, 4950, 34065; fewer fragments = more correlation) for different sort orders, ranging from very correlated to poorly correlated. Series: Real Runtime, DB Cost Estimate.]
![Page 9: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/9.jpg)
Our system:
1. Cost model with correlations
2. Correlation maps
3. Multi-attribute keys
4. Evaluation
![Page 10: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/10.jpg)
1. Cost model with correlations
SELECT * FROM lineitem WHERE receiptdate IN (i, j)
[Figure: shipdate (clustered) and receiptdate (unclustered), with the lookup values i and j marked]
c_per_u: average number of clustered attribute values per unclustered attribute value
Example terms from the slide: 2 lookups, 3 c_per_u, 10ms, 3 levels, 1ms, 3 pages per shipdate, 20s
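The c_per_u statistic defined above can be sketched in a few lines. This is a minimal illustration, assuming the table is available as (unclustered value, clustered value) pairs; the function name and the sample values are hypothetical and not taken from the talk.

```python
from collections import defaultdict

def c_per_u(rows):
    """Average number of distinct clustered values per distinct unclustered value.

    rows: iterable of (unclustered_value, clustered_value) pairs, one per tuple.
    """
    clustered_by_u = defaultdict(set)
    for u, c in rows:
        clustered_by_u[u].add(c)
    if not clustered_by_u:
        return 0.0
    return sum(len(cs) for cs in clustered_by_u.values()) / len(clustered_by_u)

# Hypothetical (receiptdate, shipdate) pairs: r1 co-occurs with 2 shipdates, r2 with 3.
rows = [
    ("r1", "s1"), ("r1", "s2"), ("r1", "s2"),
    ("r2", "s2"), ("r2", "s3"), ("r2", "s4"),
]
print(c_per_u(rows))  # 2.5
```

A lower c_per_u means each unclustered lookup touches fewer clustered ranges, which is the sense in which the attributes are more correlated.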
![Page 11: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/11.jpg)
2. Correlation Maps
CREATE TABLE Salaries(State string PRIMARY_KEY, City string, Salary integer);
SELECT * FROM Salaries WHERE city = 'Boston';
Clustered Attribute: State
Unclustered Attribute: City
[Figure: Correlation Map alongside the Clustered B+Tree]
![Page 12: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/12.jpg)
CMs: Usage
• Populated using initial scan of the table
• Insertions/deletions: keep a co-occurrence count for each (u, c) pair
• Physically stored as a B+Tree in the DB
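The usage points above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class name, its methods, and the sample rows are hypothetical, and a plain dict stands in for the B+Tree that the slides say stores the CM.

```python
class CorrelationMap:
    """Maps each unclustered value u to the clustered values c it co-occurs with."""

    def __init__(self):
        # counts[u][c] = number of tuples that hold both u and c
        self.counts = {}

    def insert(self, u, c):
        by_c = self.counts.setdefault(u, {})
        by_c[c] = by_c.get(c, 0) + 1

    def delete(self, u, c):
        by_c = self.counts.get(u)
        if by_c is None or c not in by_c:
            raise KeyError((u, c))
        by_c[c] -= 1
        if by_c[c] == 0:
            # The last tuple with this (u, c) pair is gone, so drop the mapping.
            del by_c[c]
            if not by_c:
                del self.counts[u]

    def lookup(self, u):
        """Clustered values that must be scanned to find tuples with value u."""
        return sorted(self.counts.get(u, {}))

# Hypothetical Salaries rows: (State, City, Salary), clustered by State.
table = [("MA", "Boston", 50), ("MA", "Springfield", 40),
         ("NH", "Boston", 45), ("OH", "Toledo", 42)]

cm = CorrelationMap()
for state, city, _ in table:          # initial scan of the table
    cm.insert(city, state)

states = cm.lookup("Boston")          # ['MA', 'NH']
# Scan only those clustered values, then filter out non-matching tuples.
matches = [row for row in table if row[0] in states and row[1] == "Boston"]
print(states, matches)
```

The co-occurrence count is what makes deletions safe: a (u, c) mapping is removed only when the last tuple carrying that pair is deleted.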
![Page 13: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/13.jpg)
CMs: Compression
• CMs typically 10x-1000x smaller than a secondary B+Tree (1KB for a 5GB table)
• Achieves compression by mapping values → values, not values → tuples
• Possible to build many CMs; dedicated CM per query
• Improve performance by reducing buffer pool pressure
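To make the values → values point concrete, the sketch below counts entries for a toy table. It assumes, for illustration only, that a secondary index holds one entry per tuple while a CM holds one entry per distinct (unclustered, clustered) pair; the function name and data are hypothetical, and real on-disk sizes depend on encoding details not covered here.

```python
def index_entry_counts(rows):
    """rows: list of (unclustered_value, clustered_value) pairs, one per tuple."""
    secondary_entries = len(rows)   # value -> tuple: one entry per tuple
    cm_entries = len(set(rows))     # value -> value: one entry per distinct pair
    return secondary_entries, cm_entries

# Hypothetical (City, State) pairs for 1510 tuples.
rows = [("Boston", "MA")] * 1000 + [("Boston", "NH")] * 10 + [("Toledo", "OH")] * 500
print(index_entry_counts(rows))  # (1510, 3)
```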
![Page 14: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/14.jpg)
3. Multi-attribute keys
• Combined attributes may predict the clustered key better than either attr alone
• (longitude, latitude) → zip_code
• Challenges:
– Finding these is non-trivial
– Combining attributes leads to many-valued keys, which lead to large CMs
![Page 15: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/15.jpg)
CM Advisor
• The CM Advisor considers all possible attribute combinations for clustered and unclustered keys given a training set of queries
• Buckets: collapse a range of key values into one
• Bucketing clustered keys
– Leads to longer sequential disk reads
– Boston:MA versus Boston:MA,MI
• Bucketing unclustered keys
– Merging two unclustered buckets may increase disk seeks
– Boston:MA versus Boise,Boston:ID,MA
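The unclustered-key case can be sketched as follows. This is a minimal illustration, assuming a CM held as a dict from unclustered value to a set of clustered values; the function name and the first-letter bucketing rule are hypothetical choices made only to reproduce the Boise/Boston example.

```python
def bucket_unclustered(cm, bucket_of):
    """Merge CM entries whose unclustered keys fall into the same bucket.

    cm: dict mapping unclustered value -> set of clustered values.
    bucket_of: function mapping an unclustered value to its bucket label.
    """
    bucketed = {}
    for u, clustered in cm.items():
        bucketed.setdefault(bucket_of(u), set()).update(clustered)
    return bucketed

cm = {"Boise": {"ID"}, "Boston": {"MA"}, "Toledo": {"OH"}}
by_first_letter = bucket_unclustered(cm, lambda city: city[0])
print(by_first_letter)  # bucket 'B' now maps to both ID and MA
```

After bucketing, a lookup for Boston goes through bucket B and visits both ID and MA, so the map has fewer entries but the query may pay an extra seek and must filter out the Boise tuples.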
![Page 16: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/16.jpg)
4. Experimental evaluation
SELECT … WHERE City IN (Boston, Springf) AND State IN (MA,NH,OH)
SELECT … WHERE City IN (Boston, Springf)
![Page 17: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/17.jpg)
Benefit of correlation
[Figure: elapsed time (s) versus number of shipdate lookups (1, 2, 4, 6, 8, 10, 20). Series: Full Table Scan, B+Tree (Uncorrelated), B+Tree (Correlated), CostModel (correlated).]
SELECT * FROM lineitem WHERE shipdate IN ('2009-01-03', …)
![Page 18: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/18.jpg)
eBay category data
• Hierarchies of products in categories
• antiques→architectural→hardware→locks & keys
• 24,000 categories up to 6 levels deep
• Clustered by catID
• Correlation: catID → price
• Generated unique ItemIDs for 43 million rows (3.5GB)
![Page 19: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/19.jpg)
Maintenance costs: CM vs B+Tree
Index updates fit in memory
Each B+Tree: 1.5GB
Each CM: 300K
![Page 20: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/20.jpg)
Mixed workload performance (5 indexes each)
Selects slow down inserts even more due to buffer pool pressure!
Total B+Tree size: 7.7GB
Total CM size: 1.4MB
![Page 21: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/21.jpg)
SDSS Skyserver data
• Celestial objects and their optical properties
• PhotoObj: right ascension (ra), declination (dec)
• Clustered by fieldID
• Correlation: (ra, dec) → fieldID
• Initial data: 200k tuples
• Copied ra and dec windows 10x to produce 20M tuples, 3GB
![Page 22: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/22.jpg)
Multi-attribute index performance
Correlation: (ra, dec) → fieldID
SELECT COUNT(*) FROM PhotoObj WHERE 193.1 < ra < 194.5 AND 1.41 < dec < 1.55 AND 23 < g+rho < 25
Index Size On-Disk (MB): CM(ra) 0.67, CM(dec) 0.936, CM(ra,dec) 0.699, BTree(ra,dec) 542
SDSS Range Query Runtime (s): CM(ra) 4, CM(dec) 1.7, CM(ra,dec) 0.21, BTree(ra,dec) 1.12
![Page 23: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/23.jpg)
Related ideas
• BHUNT/CORDS
– Similar measure of correlation for query opt.
– Doesn’t discuss indexing, no cost model
• ADC Clustering
– Proposes reclustering, but no cost model/designer
• Microsoft SQL Server: datetime clustering
– Limited to datetime types
• Index compression (Prefix B+Tree, delta encoding, …)
– Compression rates in the range of 2x
![Page 24: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/24.jpg)
Summary
• Correlations between attributes arise naturally in a variety of applications
• Correlations determine the cost of secondary index lookups
• We presented a correlation-aware cost model and advisor to decide when to build CMs
• Multi-attribute CMs capture more correlations; bucketing keeps them tiny
• Experiments show that correlated lookups with CMs are 2-38x faster, and CMs are typically 10-1000x smaller than secondary B+Trees
![Page 25: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/25.jpg)
![Page 26: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/26.jpg)
Model accuracy
SELECT Avg(Price) FROM Ebay WHERE Category=X
![Page 27: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/27.jpg)
Isolated CM performance vs. secondary B+Tree
Slightly slower on isolated query; CM must filter non-matching tuples
B+Tree: 860MB
CM: 900KB
![Page 28: Correlation Maps : A Compressed Access Method for Exploiting Soft Functional Dependencies](https://reader035.vdocuments.mx/reader035/viewer/2022081419/56815983550346895dc6c23b/html5/thumbnails/28.jpg)
Bucketing
Acceptable performance
Smaller size
• Random-sample synopsis from table
• Try unclustered bucket sizes: 2², 2³, …
• Output candidates grouped by size, ordered by c_per_u
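The three steps above can be sketched as a small enumeration loop. This is a rough illustration rather than the advisor from the talk: it assumes integer unclustered keys bucketed by integer division, measures CM size as the number of (bucket, clustered value) pairs, and interprets "grouped by size, ordered by c_per_u" as a sort on (size, c_per_u). All names and parameters are hypothetical.

```python
import random

def bucketing_candidates(rows, max_exp=6, sample_size=1000, seed=0):
    """rows: list of (unclustered_int, clustered_value) pairs, one per tuple."""
    rng = random.Random(seed)
    # Step 1: random-sample synopsis of the table.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    if not sample:
        return []
    candidates = []
    # Step 2: try unclustered bucket widths 2^2, 2^3, ..., 2^max_exp.
    for exp in range(2, max_exp + 1):
        width = 2 ** exp
        cm = {}
        for u, c in sample:
            cm.setdefault(u // width, set()).add(c)
        entries = sum(len(cs) for cs in cm.values())
        candidates.append({"width": width,
                           "cm_entries": entries,
                           "c_per_u": entries / len(cm)})
    # Step 3: order candidates by size, then by c_per_u within a size.
    return sorted(candidates, key=lambda cand: (cand["cm_entries"], cand["c_per_u"]))

# Hypothetical data: each clustered value covers 8 consecutive unclustered keys.
rows = [(u, u // 8) for u in range(64)]
for cand in bucketing_candidates(rows, max_exp=4):
    print(cand)
```

On this toy data a width of 8 lines up with the clustered ranges, so it yields the smallest map without raising c_per_u, while a width of 16 keeps the same size but doubles c_per_u.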