dremel: interactive analysis of web-scale...
TRANSCRIPT
![Page 1: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/1.jpg)
Dremel:
Interactive Analysis of Web-Scale DatabasePresented by Jian Fang
Most parts of these slides are stolen from here: http://bit.ly/HIPzeG
![Page 2: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/2.jpg)
What is Dremel
Trillion-record, multi-terabyte datasets at interactive speed
Scales to thousands of nodes
Fault and straggler tolerant execution
Nested data model
Complex datasets; normalization is prohibitive
Columnar storage and processing
Tree architecture (as in web search)
Interoperates with Google's data management tools
In situ data access (e.g., GFS, Bigtable)
MapReduce pipelines
![Page 3: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/3.jpg)
Widely used inside Google
Analysis of crawled web documents
Tracking install data for applications on Android Market
Spam analysis
Results of tests run on Google's distributed build system
Disk I/O statistics for hundreds of thousands of disks
……
![Page 4: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/4.jpg)
Outline
Nested columnar storage
Query processing
Experiments
Observations
![Page 5: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/5.jpg)
Common Storage Layer
Google File System
Fault tolerance
Fast response time
Data can be manipulated easily
![Page 6: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/6.jpg)
Rows vs Columns
A
B
C D
E
*
*
*
. . .
. . .
r1
r2
r1
r2
r1
r2
r1
r2
Challenge: preserve structure, reconstruct from a subset of fields
Read less,
cheaper
decompression
DocId: 10
Links
Forward: 20
Name
Language
Code: 'en-us'
Country: 'us'
Url: 'http://A'
Name
Url: 'http://B'
r1
![Page 7: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/7.jpg)
Nested Data Model
7
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
r1
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
r2
![Page 8: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/8.jpg)
Repetition and Definition Levels
Values alone do not convey the
structure of a record
Repetition levels
It tells us at what repeated field in the
field’s path the value has repeated
Example: r1, Name.Language.Code
Repetition level: [0,2,1,1]
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Language
Code: NULL
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
r1
R = 0
R = 2
R = 1
R = 1
![Page 9: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/9.jpg)
Repetition and Definition Levels
Definition Levels
Specifying how many fields in a path
that could be undefined are actually
present in the record
Example: Name.Language.Country
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
r1
Missing Country,
d = 2
Missing Language,
d = 1
![Page 10: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/10.jpg)
Column-striped representation
value r d
10 0 0
20 0 0
value r d
20 0 2
40 1 2
60 1 2
80 0 2
value r d
NULL 0 1
10 0 2
30 1 2
DocId
value r d
http://A 0 2
http://B 1 2
NULL 1 1
http://C 0 2
Name.Url
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Code Name.Language.Country
Links.BackwardLinks.Forward
value r d
us 0 3
NULL 2 2
NULL 1 1
gb 1 3
NULL 0 1
![Page 11: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/11.jpg)
Record Assembly FSM
Name.Language.CountryName.Language.Code
Links.Backward Links.Forward
Name.Ur
l
DocId
1
0
1
0
0,1,2
2
0,11
0
0
Transitions labeled with repetition levels
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
![Page 12: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/12.jpg)
Reading two fields
DocId
Name.Language.Country1,2
0
0
DocId: 10
Name
Language
Country: 'us'
Language
Name
Name
Language
Country: 'gb'
DocId: 20
Name
s1
s2
![Page 13: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/13.jpg)
Query Processing
Optimized for select-project-aggregate
Very common class of interactive queries
Single scan
Within-record and cross-record aggregation
Approximations: count(distinct), top-k
Joins, temp tables, UDFs/TVFs, etc.
![Page 14: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/14.jpg)
Serving Tree
storage layer (e.g., GFS)
. . .
. . .
. . .leaf servers
(with local
storage)
intermediate
servers
root server
client
![Page 15: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/15.jpg)
Example: count()
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
SELECT A, SUM(c)
FROM (R11 UNION ALL R110)
GROUP BY A
SELECT A, COUNT(B) AS c
FROM T11 GROUP BY A
T11 = {/gfs/1, …, /gfs/10000}
SELECT A, COUNT(B) AS c
FROM T12 GROUP BY A
T12 = {/gfs/10001, …, /gfs/20000}
SELECT A, COUNT(B) AS c
FROM T31 GROUP BY A
T31 = {/gfs/1}
. . .
0
1
3
R11 R12
Data access ops
. . .
![Page 16: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/16.jpg)
Experiments
1 PB of real data
(uncompressed, non-
replicated)
100K-800K tablets per
table
Experiments run during
business hours
Table
name
Number of
records
Size (unrepl.,
compressed)
Number
of fields
Data
center
Repl.
factor
T1 85 billion 87 TB 270 A 3×
T2 24 billion 13 TB 530 A 3×
T3 4 billion 70 TB 1200 A 3×
T4 1+ trillion 105 TB 50 B 3×
T5 1+ trillion 20 TB 30 B 2×
![Page 17: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/17.jpg)
Read from disk
columnsrecords
objects
from
record
sfr
om
colu
mns
(a) read +decompress
(b) assemblerecords
(c) parse asC++ objects
(d) read +decompress
(e) parse asC++ objects
time (sec)
number of fields
"cold" time on local disk,
averaged over 30 runs
Table partition: 375 MB (compressed), 300K rows, 125 columns
2-4x overhead of
using records
10x speedup
using columnar
storage
![Page 18: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/18.jpg)
MapReduce and Dremel Execution
Sawzall program ran on MR:
num_recs: table sum of int;
num_words: table sum of int;
emit num_recs <- 1;
emit num_words <-
count_words(input.txtField);
execution time (sec) on 3000 nodes
SELECT SUM(count_words(txtField)) / COUNT(*)
FROM T1
Q1:
87 TB 0.5 TB 0.5 TB
MR overheads: launch jobs, schedule 0.5M tasks, assemble records
Avg # of terms in txtField in 85 billion record table T1
![Page 19: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/19.jpg)
Impact of serving tree depthexecution time (sec)
SELECT country, SUM(item.amount) FROM T2
GROUP BY country
SELECT domain, SUM(item.amount) FROM T2
WHERE domain CONTAINS ’.net’
GROUP BY domain
Q2:
Q3:
40 billion nested items
(returns 100s of records) (returns 1M records)
![Page 20: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/20.jpg)
Observations
Possible to analyze large disk-resident datasets interactively on commodity
hardware
1T records, 1000s of nodes
MR can benefit from columnar storage just like a parallel DBMS
But record assembly is expensive
Interactive SQL and MR can be complementary
Parallel DBMSes may benefit from serving tree architecture just like search
engines
![Page 21: Dremel: Interactive Analysis of Web-Scale Databasepavlo/courses/fall2013/static/slides/dremel.pdf · What is Dremel Trillion-record, multi-terabyte datasets at interactive speed Scales](https://reader033.vdocuments.mx/reader033/viewer/2022041712/5e48c995c7d0f317552faa71/html5/thumbnails/21.jpg)
More Information
Big Query: http://code.google.com/apis/bigquery/
Apache Drill: http://incubator.apache.org/drill/index.html
Thank You