myria: analytics-as-a-service for (data) scientists
DESCRIPTION
Talk delivered at High Performance Transaction Processing 2013. Myria is a new Big Data service being developed at the University of Washington. We feature high-level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation. In this talk, we describe the motivation for another big data platform, emphasizing requirements emerging from the physical, life, and social sciences.
TRANSCRIPT
04/11/2023 1
Myria: Analytics-as-a-Service for (Data) Scientists
Bill Howe, University of Washington
Bill Howe, UW
2
“It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
How can we deliver 1000 little SDSSs to anyone who wants one?
R/V Wecoma, April 2007
Armbrust Lab Retreat, 2009 (Biology, Oceanography)
Astronomy Visualization Workshop, 2011
Big Data in the Long Tail Workshop, 2012 (Social Sciences)
Maier’s 2nd Maxim
Working with scientists is like working with 7 year olds:
They think they know everything and they don’t have any money
My Goal: Expose all the world’s science data through declarative query interfaces
Problem
How much time do you spend “handling data” as opposed to “doing science”?
Mode answer: “90%”
Simple Example
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
COGAnnotation_coastal_sample.txt
SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
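A minimal sketch of what this one-line join replaces, using Python's built-in sqlite3 module. The column names other than COG_hit and hit, and all of the rows, are hypothetical stand-ins for the two annotation files.

```python
import sqlite3

# Hypothetical rows standing in for the two uploaded annotation files.
phaeo_genome = [("orf1", "COG0001"), ("orf2", "COG0002"), ("orf3", "COG0003")]
coastal_sample = [("read7", "COG0002"), ("read9", "COG0003"), ("read11", "COG0099")]

conn = sqlite3.connect(":memory:")
# Column names orf and read_id are assumptions; COG_hit and hit come from the query.
conn.execute("CREATE TABLE Phaeo_genome (orf TEXT, COG_hit TEXT)")
conn.execute("CREATE TABLE coastal_sample (read_id TEXT, hit TEXT)")
conn.executemany("INSERT INTO Phaeo_genome VALUES (?, ?)", phaeo_genome)
conn.executemany("INSERT INTO coastal_sample VALUES (?, ?)", coastal_sample)

# The same join as on the slide, with an ORDER BY added so the output is stable.
joined = conn.execute(
    "SELECT * FROM Phaeo_genome p, coastal_sample c "
    "WHERE p.COG_hit = c.hit ORDER BY p.orf"
).fetchall()

for row in joined:
    print(row)
```

Only the annotations that appear in both inputs survive the join; no parsing or matching script is needed.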
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow 43
Maslow’s Needs Hierarchy
A “Needs Hierarchy” of Science Data Management
storage
sharing
query
curation
analytics
A “Needs Hierarchy” of Science Data Management
storage
sharing
semantic integration
query
analytics
Why should you care?
Science == Data Science
QUERY-AS-A-SERVICE
2010 - present
Version 1
1) Upload data “as is”
Cloud-hosted; no need to install or design a database; no pre-defined schema
2) Write SQL
Right in your browser, writing queries on top of queries on top of queries ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them, share with specific colleagues – anyone with access can query
Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
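The query computes "everything that appears somewhere, minus everything that appears everywhere". A small sketch of the same set algebra in Python; the function name and the ids are made up for illustration.

```python
def missing_from_some(*samples):
    """Ids present in at least one sample but absent from at least one other."""
    sets = [set(s) for s in samples]
    # UNION of all samples EXCEPT the INTERSECT of all samples.
    return set.union(*sets) - set.intersection(*sets)

# Hypothetical TIGRFam ids for the three samples.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00001", "TIGR00003"}
combo = {"TIGR00001", "TIGR00002", "TIGR00004"}

missing = missing_from_some(refseq, est, combo)
print(sorted(missing))
```

Here TIGR00001 is the only id found in all three samples, so it is the only one excluded.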
SELECT x.strain, x.chr, x.region as snp_region,
       x.start_bp as snp_start_bp, x.end_bp as snp_end_bp,
       w.start_bp as nc_start_bp, w.end_bp as nc_end_bp,
       w.category as nc_category,
       CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
            WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
            WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
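To make the CASE expression easier to follow, here is a sketch that mirrors its three WHEN branches in Python, in the same order. The function name and the coordinates in the usage lines are hypothetical; the arithmetic is copied from the query as written.

```python
def len_overlap(x_start, x_end, w_start, w_end):
    """Mirror the three WHEN branches of the CASE expression, in order."""
    if x_start >= w_start and x_end <= w_end:   # x lies entirely inside w
        return x_end - x_start + 1
    if x_start <= w_start <= x_end:             # w starts inside x
        return x_end - w_start + 1
    if x_start <= w_end <= x_end:               # w ends inside x
        return w_end - x_start + 1
    return None                                 # no branch matches: SQL CASE yields NULL

# Hypothetical base-pair coordinates.
print(len_overlap(10, 20, 5, 30))    # x inside w
print(len_overlap(10, 20, 15, 30))   # w starts inside x
print(len_overlap(10, 20, 1, 12))    # w ends inside x
print(len_overlap(10, 20, 30, 40))   # disjoint
```

In the query, the WHERE clause repeats the same three conditions, so rows that would produce NULL are filtered out before the CASE is evaluated.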
Non-programmers can write very complex queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
We see thousands of queries written by non-programmers
Howe, et al., CISE 2012
Steven Roberts
SQL as a lab notebook: http://bit.ly/16Xj2JP
Popular service for Bioinformatics Workflows
Halperin, Howe, et al. SSDBM 2013
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.
Previously, we were using huge directory trees and plain text files.
Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White
Andrew White, UW Chemistry
Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted
Scientific data management reduces to sharing views
• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?
– rename columns, apply functions with views
• Attach metadata?
– add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
– hide bad values with views
• Maintain provenance?
– inspect view dependencies
• Propagate updates?
– view maintenance
• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)
SSDBM 2011
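A small sketch of a few of these tasks (renaming, unit conversion, hiding bad values, inspecting dependencies) expressed as views, using Python's built-in sqlite3 module. The table, column names, sentinel value, and readings are all hypothetical; SQLShare itself is not modeled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data uploaded "as is": generic column names and a -999 sentinel for bad readings.
conn.execute("CREATE TABLE raw_upload (col0 TEXT, col1 REAL)")
conn.executemany(
    "INSERT INTO raw_upload VALUES (?, ?)",
    [("station_a", 12.5), ("station_b", -999.0), ("station_c", 14.0)],
)

# Naming conventions and units: rename columns and apply a function in a view.
conn.execute(
    "CREATE VIEW named AS "
    "SELECT col0 AS station, col1 AS temp_c, col1 + 273.15 AS temp_k "
    "FROM raw_upload"
)

# Quality control: hide bad values with a view layered on the first view.
conn.execute("CREATE VIEW clean AS SELECT * FROM named WHERE temp_c <> -999.0")

rows = conn.execute("SELECT station, temp_c FROM clean ORDER BY station").fetchall()
print(rows)

# Provenance: the stored view definitions show what each view depends on.
view_defs = dict(
    conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'view'")
)
print(view_defs["clean"])
```

The raw table is never modified; each step is a named, shareable query on top of the previous one.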
Two Problems with SQLShare
• No help for really big datasets
• No iteration
Myria is…
• A compiler framework for multiple iterative RA-based languages
• A parallel, shared-nothing, iterative execution engine
• A RESTful Query-as-a-Service platform
• prefix meaning “ten thousand” in Greek
Myria Team
Dan Suciu, Magda Balazinska, Bill Howe
Dan Halperin (postdoc, technical lead), Victor Almeida (postdoc), Andrew Whitaker (research scientist)
Students: Paris Koutris, Emad Soroush, Jingjing Wang, ShengLiang Xu, Jennifer Ortiz, Jeremy Hyrkas, Shumo Chu
Myria Architecture
[Figure: architecture diagram. Labeled components: Web UI; MyriaL; Language Parser; Myria Compiler; Logical Optimizer for RA+While; REST Server; Google App Engine; Coordinator with Catalog; json query plan; netty protocols; MyriaDB; three workers, each labeled Worker Catalog, RDBMS, jdbc, and HDFS; C Compiler; Grappa]
A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)
A = LOAD('points.txt', id:int, x:float, y:float)
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]
DO
  I = CROSS(Kmeans, Centroids);
  J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, $distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
  K = [FROM J EMIT id, distance=$min(distance)];
  L = JOIN(J, id, K, id)
  M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
  Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
  Delta = DIFF(Kmeans', Kmeans)
  Kmeans = Kmeans'
  Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
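For readers who do not know MyriaL, the loop above can be sketched in plain Python as follows. This is an illustration of the same iterate-until-nothing-changes structure, not a translation of Myria's execution; the function name, the max_iters guard, and the sample points are assumptions.

```python
import math

def kmeans(points, k, max_iters=100):
    """points: list of (id, x, y). Returns ({id: cluster_id}, centroids)."""
    # Seed centroids from the first k points (the role of LIMIT(A, 4) above).
    centroids = [(x, y) for _, x, y in points[:k]]
    # Every point starts in cluster 0, as in the initial Kmeans relation.
    assignment = {pid: 0 for pid, _, _ in points}

    for _ in range(max_iters):
        # Assign each point to its nearest centroid; ties go to the lowest id.
        new_assignment = {}
        for pid, x, y in points:
            dists = [math.dist((x, y), c) for c in centroids]
            new_assignment[pid] = dists.index(min(dists))

        # Delta: the points whose cluster changed in this iteration.
        delta = {p for p in new_assignment if new_assignment[p] != assignment[p]}
        assignment = new_assignment

        # Recompute each centroid as the mean of its members.
        for cid in range(len(centroids)):
            members = [(x, y) for pid, x, y in points if assignment[pid] == cid]
            if members:
                centroids[cid] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )

        if not delta:  # WHILE Delta != {}
            break
    return assignment, centroids

# Hypothetical input: two obvious clusters.
pts = [(1, 0, 0), (2, 10, 10), (3, 0, 1), (4, 10, 11)]
assignment, centroids = kmeans(pts, k=2)
print(assignment, centroids)
```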
Why Iteration Matters
Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations
Vast majority of reachable tuples discovered by iteration 25
The datalog program continues for almost 200 iterations, each almost as expensive as the early steps
Fewer Iterations: Endgame Problem [Afrati 10]
[Figure: # of tuples discovered (log scale, 1 to 100,000,000) vs. iteration # (0 to 180); series: frontier tuples; previously discovered tuples removed]
Basic Semi-Naïve Evaluation
Join, Dupe-elim
A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)
Reachability from ‘a’ in datalog
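A minimal single-machine sketch of semi-naive evaluation for this program: each iteration joins only the newly discovered tuples with R, then removes anything already seen. The function name and the example graph are illustrative.

```python
def reachable_from(edges, source):
    """Return (set of nodes reachable from source, number of loop iterations)."""
    # Index R on its first column so the join is a lookup.
    index = {}
    for x, y in edges:
        index.setdefault(x, set()).add(y)

    # Base rule: A(y) :- R('a', y)
    reached = set(index.get(source, ()))
    delta = set(reached)
    iterations = 0

    # Recursive rule: A(y) :- A(x), R(x, y), applied to the delta only.
    while delta:
        iterations += 1
        joined = {y for x in delta for y in index.get(x, ())}  # Join
        delta = joined - reached                               # Dupe-elim
        reached |= delta
    return reached, iterations

# Hypothetical graph with a cycle b -> c -> d -> b and an unrelated edge.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("x", "y")]
reached, iterations = reachable_from(edges, "a")
print(sorted(reached), iterations)
```

The last iteration discovers nothing new; it exists only to confirm that the fixpoint has been reached.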
04/11/2023 37
MAYBE JUST USE HADOOP?
(a) R is loop invariant, but gets loaded and shuffled on each iteration
(b) Ai grows slowly and monotonically, but is loaded and shuffled on each iteration. HaLoop’s Reducer Input Cache addressed (a), but did not support the append semantics needed for (b).
Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12
[Figure: MapReduce dataflow for one iteration, with map and reduce stages for Join and Difference over ΔAi-1, R(0), R(1), Ai(0), and Ai(1); labels (a) and (b) mark the inputs discussed above]
VLDB 2010, VLDBJ 2011
Inter-loop caching
Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12
Iteration i = 0: Load a distributed cache
[Figure: map tasks over R(0), R(1), A(0), and A(1)]
Iteration i > 0:
[Figure: map and reduce stages for Join and Difference over ΔAi-1, R(0), R(1), Ai(0), and Ai(1)]
VLDB 2010, VLDBJ 2011
Caching Loop-Invariant Data
First iteration is slow, as the invariant graph is shuffled and cached
[Figure: chart with annotations “failure” and “23X”, alongside the Join/Difference MapReduce dataflow diagram]
Specialize Cache for Query Semantics
MapReduce semantics require that all keys from the cache be extracted and passed to reducers.
But we only care about keys that join.
[Figure: Reducer for Join; inputs: join keys arriving from mappers, all tuples from cache]
Second optimization: Specialization for Equijoin
Index the cache, and only extract keys that join
[Figure: Reducer for Join with indexed cache lookup; input: join keys arriving from mappers; extracted from cache: keys that join]
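The difference between the two reducers can be sketched in a few lines. This is a single-process illustration only; the class and function names are hypothetical and nothing here uses Hadoop or HaLoop APIs.

```python
from collections import defaultdict

def scan_join(r_tuples, keys):
    """Baseline: every cached tuple is read, whether or not its key joins."""
    wanted = set(keys)
    return [(x, y) for x, y in r_tuples if x in wanted]

class IndexedJoinCache:
    """Hypothetical stand-in for a reducer-side cache of the loop-invariant R."""

    def __init__(self, r_tuples):
        # Build the index once; it is reused on every later iteration.
        self.index = defaultdict(list)
        for x, y in r_tuples:
            self.index[x].append(y)
        self.probes = 0

    def join(self, keys):
        """Look up only the keys that arrive from the mappers."""
        out = []
        for k in keys:
            self.probes += 1
            for y in self.index.get(k, ()):
                out.append((k, y))
        return out

# Hypothetical cached relation and incoming join keys.
R = [(1, 2), (1, 3), (2, 4), (5, 6)]
cache = IndexedJoinCache(R)
result = cache.join([1, 5])
print(result, cache.probes)
```

With the index, the work per iteration is proportional to the number of incoming keys rather than to the size of the cached relation.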
Effect of equijoin specialization
[Figure: total time for loop body (s); annotations: “Failure occurred”, “~20%”]
Third Optimization: Extend Cache to Support Duplicate Elimination
The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates.
Extend the cache to support append, and we can use it for Dupe-Elim as well.
[Figure: Reducer for Dupe-elim; indexed cache lookup, with new tuples inserted; input: tuples arriving from mappers; output: unique keys]
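A sketch of the append-capable cache idea, again as a single-process illustration with a hypothetical class name: the cache keeps the accumulated result between iterations, and each lookup that misses also inserts.

```python
class DedupCache:
    """Hypothetical cache of the accumulated result, kept across iterations."""

    def __init__(self):
        self.seen = set()

    def filter_new(self, tuples):
        """Return the tuples not seen before, appending them to the cache."""
        fresh = []
        for t in tuples:
            if t not in self.seen:   # indexed cache lookup
                self.seen.add(t)     # new tuple inserted (append)
                fresh.append(t)
        return fresh

cache = DedupCache()
first = cache.filter_new(["b", "c", "b"])   # iteration 1
second = cache.filter_new(["c", "d"])       # iteration 2
print(first, second)
```

Only the fresh tuples need to flow to the next iteration; the accumulated result itself is never reloaded or reshuffled.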
Effect of Diff Cache
[Figure: total time for loop body (s)]
Failures may be more likely due to extra network traffic
~20% overall improvement
Overall
[Figure: time (s), 0 to 35,000, vs. iteration #, 0 to 250; series: (a) no optimizations, (b) HaLoop, (c) all optimizations, (d) raw Hadoop overhead]
Fewer Iterations: Loop unrolling
Run two joins for every dupe-elim
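One way to read "two joins for every dupe-elim" is sketched below: each loop iteration follows edges twice before deduplicating against the accumulated result. This is an interpretation for illustration, not the implementation used in the experiments; the function name and the example chain are made up.

```python
def reachable_unrolled(edges, source):
    """Reachability with two join steps per duplicate-elimination step."""
    index = {}
    for x, y in edges:
        index.setdefault(x, set()).add(y)

    def join(nodes):
        return {y for x in nodes for y in index.get(x, ())}

    reached = join({source})
    delta = set(reached)
    iterations = 0
    while delta:
        iterations += 1
        one_hop = join(delta)      # first join
        two_hop = join(one_hop)    # second join, run before any dedup
        delta = (one_hop | two_hop) - reached   # single dupe-elim
        reached |= delta
    return reached, iterations

# Hypothetical chain a -> b -> c -> d -> e -> f.
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
reached_u, iterations_u = reachable_unrolled(chain, "a")
print(sorted(reached_u), iterations_u)
```

On this chain the loop runs 3 times, where a loop with one join per dupe-elim would run 5 times. The second join in each iteration works on tuples that have not been deduplicated yet, which is where the extra per-iteration cost comes from.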
half the iterations, but each is more expensive
change strategies
[Figure: # of Newly Discovered Facts (log scale, 1 to 10,000,000) vs. Iteration (1 to 23); series: Greenplum, Myria; annotation: “not much useful work”]
reachable(Y) :- edge(5,Y)
reachable(Y) :- edge(X,Y), reachable(X)
[Figure: Total Time (second), 0 to 700, vs. Iteration (1 to 22); series: Greenplum; Myria; Greenplum, incremental; Greenplum, incremental+index; annotation: “Low per-iteration cost”]
Summary
• Goal: Expose all the world’s science data through declarative query interfaces!
• Motivated by real science
• Data and query model is iterative relational algebra
• Industrial-strength Query-as-a-Service
http://myria-web.appspot.com/
http://db.cs.washington.edu/myria/
• Hypothesis: The performance difference between hand-coded graph algorithms and relational query plans amounts to implementation details
• Can we generate “hand-coded” plans?
[Figure: Datalog Parser, Myria Compiler, Logical Optimizer, Google App Engine, C Compiler, Grappa]
Path-Counting Queries
Ex: Count the number of unique 2-hops
answers = set()
for (x, y1) in edges:
    for (y2, z) in edges:
        if y1 == y2:
            answers.add((x, z))
count = len(answers)
Assume a collection edges
In an RDBMS: “Nested Loops Join”
answers = set()
for (x, y) in edges:
    for z in neighbors[y]:
        answers.add((x, z))
count = len(answers)
Assume a collection edges, but also an index neighbors: vertex -> [vertex]
In an RDBMS: “Hash Join”
answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add((x, z))
count = len(answers)
Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]
In an RDBMS: Still a Hash Join
count = 0
answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add(z)
    count += len(answers)
    answers.clear()
Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]
RDBMS don’t express this, but there’s no reason they couldn’t
stays small
only one value
count = 0
answers = set()
for x in vertices:
    for y in x.neighbors():
        for z in y.neighbors():
            answers.add(z)
    count += len(answers)
    answers.clear()
stays small
only one value, so
Or if you prefer…assume a collection of vertices, where each vertex points directly to its neighbors
Boils down to dereferencing a pointer vs. probing a hash table
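The variants above can be checked against each other on a small graph. The sketch below implements the nested-loops version and the per-vertex version with a cleared answer set; the function names and the example edges are illustrative, and the edge list is assumed to describe a directed graph.

```python
from collections import defaultdict

def two_hops_nested_loops(edges):
    """Compare every edge with every other edge ("Nested Loops Join")."""
    answers = set()
    for x, y1 in edges:
        for y2, z in edges:
            if y1 == y2:
                answers.add((x, z))
    return len(answers)

def build_neighbors(edges):
    """Index: vertex -> list of out-neighbors."""
    neighbors = defaultdict(list)
    for x, y in edges:
        neighbors[x].append(y)
    return neighbors

def two_hops_per_vertex(edges):
    """Count distinct 2-hop endpoints per start vertex, reusing one small set."""
    neighbors = build_neighbors(edges)
    count = 0
    answers = set()
    for x in neighbors:
        for y in neighbors[x]:
            # .get avoids adding keys for vertices with no out-edges.
            for z in neighbors.get(y, ()):
                answers.add(z)
        count += len(answers)   # answers holds endpoints for this x only
        answers.clear()
    return count

# Hypothetical directed graph.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 1)]
print(two_hops_nested_loops(edges), two_hops_per_vertex(edges))
```

Both functions count the same pairs; the second never materializes the full set of (x, z) pairs, so its working set stays small.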
Experiments
• Data sets:
Dataset | # Vertices | # Edges | # Distinct 2-hop Paths | # Triangles
BSN* | 685,230 | 7,600,595 | 78,350,597 | 6,935,709
Twitter 4MEⱡ | 166,317 | 4,532,185 | 1,056,317,985 | 14,912,950
com-livejournal* | 3,997,962 | 34,681,189 | 735,398,579
soc-livejournal* | 4,874,571 | 68,993,773 | 112,319,229
* http://snap.stanford.edu/
ⱡ H. Kwak et al 2010.
Experiments
[Figure: BSN data set and Twitter 4ME data set; series: no dupe elim, dupe elim; single-threaded]
Experiments
• Parallel system performance