myria: analytics-as-a-service for (data) scientists

04/11/2023 1

Myria: Analytics-as-a-Service for (Data) Scientists

Bill Howe, University of Washington

Bill Howe, UW


“It’s a great time to be a data geek.”

-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”

-- Jeff Hammerbacher, co-founder, Cloudera


How can we deliver 1000 little SDSSs to anyone who wants one?

R/V Wecoma, April 2007


Armbrust Lab Retreat, 2009 (Biology, Oceanography)


Astronomy Visualization Workshop, 2011


Big Data in the Long Tail Workshop, 2012 (Social Sciences)

Maier’s 2nd Maxim

Working with scientists is like working with 7 year olds:

They think they know everything and they don’t have any money

My Goal: Expose all the world’s science data through declarative query interfaces

Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”


Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

COGAnnotation_coastal_sample.txt

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit


“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

Maslow’s Needs Hierarchy

A “Needs Hierarchy” of Science Data Management

storage

sharing

query

curation

analytics

A “Needs Hierarchy” of Science Data Management

storage

sharing

semantic integration

query

analytics

Why should you care?

Science == Data Science

QUERY-AS-A-SERVICE

2010 - present

Version 1

1) Upload data “as is”: cloud-hosted; no need to install or design a database; no pre-defined schema

2) Write SQL: right in your browser, writing queries on top of queries on top of queries ...

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC

3) Share the results: make them public, tag them, share with specific colleagues – anyone with access can query
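A minimal sketch of steps 1 and 2, assuming a tab-delimited upload with a header row. Python's built-in sqlite3 stands in for the hosted database here; the helper name `load_as_is` and the sample rows are hypothetical and are not part of the service's API.

```python
import csv
import io
import sqlite3


def load_as_is(conn, table, text):
    """Create a table straight from a delimited file: column names come
    from the header row, and no column types are declared up front."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', body)


conn = sqlite3.connect(":memory:")
# Hypothetical sample data standing in for an uploaded file.
sample = "hit\tscore\nTIGR00001\t10\nTIGR00002\t7\nTIGR00001\t3\n"
load_as_is(conn, "tigrfam_surface", sample)

# The query from step 2, with the count aliased so ORDER BY can use it.
counts = conn.execute(
    "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface "
    "GROUP BY hit ORDER BY cnt DESC"
).fetchall()
```

The point of the sketch is only that a query can run as soon as the file is loaded, without a schema design step.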


Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)

SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
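The same question can be checked with plain set algebra: ids in the union of the samples, minus ids in their intersection. In this sketch the three Python sets stand in for `col0` of the three relations, and the ids are made up for illustration.

```python
def missing_from_some(*samples):
    """Ids present in at least one sample but absent from at least one."""
    in_any = set().union(*samples)
    in_all = set.intersection(*map(set, samples))
    return in_any - in_all


# Hypothetical ids standing in for col0 of each relation.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00001", "TIGR00003"}
combo = {"TIGR00001", "TIGR00002", "TIGR00004"}

missing = missing_from_some(refseq, est, combo)
```

In standard SQL, INTERSECT binds more tightly than UNION and EXCEPT, which is what lets the query be written without parentheses; some engines evaluate these operators differently, so adding explicit parentheses is the safer choice when porting it.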


SELECT x.strain, x.chr, x.region as snp_region,
       x.start_bp as snp_start_bp, x.end_bp as snp_end_bp,
       w.start_bp as nc_start_bp, w.end_bp as nc_end_bp,
       w.category as nc_category,
       CASE
         WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
         WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
         WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries (rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results
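To make the CASE expression in the overlap query easier to follow, here is a small Python sketch of the same three branches, evaluated in the same order. The function name and argument names are illustrative; `None` plays the role of the SQL NULL returned when no branch matches.

```python
def overlap_length(x_start, x_end, w_start, w_end):
    """Mirror the query's CASE expression, branch by branch."""
    # x lies entirely inside w: the overlap is all of x.
    if x_start >= w_start and x_end <= w_end:
        return x_end - x_start + 1
    # w starts inside x: measure from w's start to x's end.
    if x_start <= w_start <= x_end:
        return x_end - w_start + 1
    # w ends inside x: measure from x's start to w's end.
    if x_start <= w_end <= x_end:
        return w_end - x_start + 1
    # No branch matched: the intervals do not overlap.
    return None


inside = overlap_length(5, 10, 1, 20)
left_partial = overlap_length(1, 10, 5, 20)
right_partial = overlap_length(5, 20, 1, 10)
disjoint = overlap_length(1, 2, 5, 6)
```

The sketch follows the query as written. When x fully contains w, the second branch applies and the length runs to the end of x rather than the end of w; that behaviour is preserved here rather than changed.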

We see thousands of queries written by non-programmers

Howe, et al., CISE 2012

Steven Roberts

SQL as a lab notebook: http://bit.ly/16Xj2JP

Popular service for Bioinformatics Workflows

Halperin, Howe, et al. SSDBM 2013


“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.

Previously, we were using huge directory trees and plain text files.

Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White

Andrew White, UW Chemistry

Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted

Scientific data management reduces to sharing views

• Integrate data from multiple sources?
– joins and unions with views

• Standardize on units, apply naming conventions?
– rename columns, apply functions with views

• Attach metadata?
– add new tables with descriptive names, add new columns with views

• Data cleaning, quality control?
– hide bad values with views

• Maintain provenance?
– inspect view dependencies

• Propagate updates?
– view maintenance

• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)

SSDBM 2011
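As one illustration of the list above, the sketch below uses a single view to standardize units, rename a column, and hide bad values, leaving the raw upload untouched. It uses Python's built-in sqlite3 as a stand-in database; the table, the columns, and the -999.0 sentinel are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Raw upload, kept exactly as delivered (hypothetical columns).
    CREATE TABLE raw_casts (station TEXT, depth_ft REAL, temp_c REAL);
    INSERT INTO raw_casts VALUES
        ('A', 10.0, 12.5),
        ('A', 20.0, -999.0),   -- sentinel for a failed sensor reading
        ('B',  5.0, 13.1);

    -- One view: convert feet to metres, rename the column, drop sentinels.
    CREATE VIEW clean_casts AS
        SELECT station, depth_ft * 0.3048 AS depth_m, temp_c
        FROM raw_casts
        WHERE temp_c <> -999.0;
    """
)

rows = conn.execute(
    "SELECT station, depth_m, temp_c FROM clean_casts ORDER BY station"
).fetchall()
```

Collaborators query `clean_casts`, while the view definition itself records what was done to the raw data.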

Two Problems with SQLShare

• No help for really big datasets
• No iteration

Myria is…

• A compiler framework for multiple iterative RA-based languages

• A parallel, shared-nothing, iterative execution engine

• A RESTful Query-as-a-Service platform

• A prefix meaning “ten thousand” in Greek

Myria Team

Dan Suciu, Magda Balazinska, Bill Howe

Dan Halperin (postdoc, technical lead)
Victor Almeida (postdoc)
Andrew Whitaker (research scientist)

Students: Paris Koutris, Emad Soroush, Jingjing Wang, ShengLiang Xu, Jennifer Ortiz, Jeremy Hyrkas, Shumo Chu

Myria Architecture

[Architecture diagram. Labels shown: Web UI; MyriaL; Language Parser; Myria Compiler; Logical Optimizer for RA+While; REST Server; Google App Engine; C Compiler; Grappa; MyriaDB; Coordinator; Catalog; json query plan; netty protocols; workers, each with a Worker Catalog and an RDBMS reached via jdbc; HDFS.]

A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)

A = LOAD('points.txt', id:int, x:float, y:float)
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]

DO
  I = CROSS(Kmeans, Centroids);
  J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, $distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
  K = [FROM J EMIT id, distance=$min(distance)];
  L = JOIN(J, id, K, id)
  M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
  Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
  Delta = DIFF(Kmeans', Kmeans)
  Kmeans = Kmeans'
  Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
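For readers who do not know MyriaL, the loop can be read as ordinary k-means that repeats until no point changes cluster. The following is a single-process Python sketch of that control flow, not Myria code: the function name, the `max_iter` guard, and the sample points are assumptions made for illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """points: list of (id, x, y) tuples. Returns {id: cluster_id}."""
    # Seed the centroids with the first k points, as LIMIT(A, 4) does for k = 4.
    centroids = {cid: (x, y) for cid, (_, x, y) in enumerate(points[:k])}
    # Every point starts in cluster 0, like the initial Kmeans relation.
    assignment = {pid: 0 for pid, _, _ in points}

    for _ in range(max_iter):
        # Assign each point to its nearest centroid; ties go to the
        # smallest cluster id, as $min(cluster_id) does.
        new_assignment = {
            pid: min(centroids, key=lambda c: (math.dist((x, y), centroids[c]), c))
            for pid, x, y in points
        }
        # Delta: the points whose cluster changed in this iteration.
        delta = {p for p in new_assignment if new_assignment[p] != assignment[p]}
        assignment = new_assignment

        # Recompute each centroid as the mean of its members.
        for cid in centroids:
            members = [(x, y) for pid, x, y in points if assignment[pid] == cid]
            if members:
                centroids[cid] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )

        # WHILE Delta != {}: stop once nothing moved.
        if not delta:
            break
    return assignment


# Hypothetical input: two well-separated pairs of points.
pts = [(0, 0.0, 0.0), (1, 10.0, 10.0), (2, 0.5, 0.0), (3, 10.0, 9.5)]
clusters = kmeans(pts, k=2)
```

The body runs at least once before the termination test, matching the DO ... WHILE shape of the MyriaL program.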

Why Iteration Matters

Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations

Vast majority of reachable tuples discovered by iteration 25

The datalog program continues for almost 200 iterations, each almost as expensive as the early steps

Fewer Iterations: Endgame Problem [Afrati 10]

[Plot: # of tuples discovered (log scale, 1 to 100,000,000) vs. iteration # (0 to 180). Legend: frontier tuples; previously discovered tuples removed.]

Basic Semi-Naïve Evaluation

Join, Dupe-elim

A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)

Reachability from ‘a’ in datalog
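A minimal Python sketch of the Join and Dupe-elim loop for these two rules. Each round joins only the facts discovered in the previous round against R, then removes anything already known. The function name and the edge list are illustrative, and this is single-process code rather than Myria's engine.

```python
def reachable_semi_naive(edges, source):
    """Evaluate A(y) :- R(source, y); A(y) :- A(x), R(x, y)."""
    # Index R by its first column so the join becomes a lookup.
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)

    # Base rule: everything one hop from the source.
    reached = set(succ.get(source, ()))
    delta = set(reached)
    iterations = 0

    while delta:
        # Join: extend only the facts that were new last round.
        joined = set()
        for x in delta:
            joined |= succ.get(x, set())
        # Dupe-elim: keep only facts not seen before.
        delta = joined - reached
        reached |= delta
        iterations += 1

    return reached, iterations


# Hypothetical graph: a chain with a cycle, plus an unreachable edge.
graph = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("x", "y")]
facts, rounds = reachable_semi_naive(graph, "a")
```

The final round discovers nothing new, which is how the loop learns it has reached a fixpoint.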

MAYBE JUST USE HADOOP?

(a) R is loop invariant, but gets loaded and shuffled on each iteration

(b) Ai grows slowly and monotonically, but is loaded and shuffled on each iteration. HaLoop’s Reducer Input Cache addressed (a), but did not support the append semantics needed for (b).

Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12

[Diagram: MapReduce dataflow for one iteration. Labels: map, reduce, Join, Difference, ΔAi-1, R(0), R(1), Ai(0), Ai(1); callouts (a) and (b).]

VLDB 2010, VLDBJ 2011

Inter-loop caching

Iteration i = 0: Load a distributed cache

Iteration i > 0:

[Diagram: MapReduce dataflow with inter-loop caching. Labels: map, reduce, Join, Difference, ΔAi-1, R(0), R(1), Ai(0), Ai(1), A(0), A(1).]

Caching Loop-Invariant Data

First iteration is slow, as the invariant graph is shuffled and cached

[Plot annotations: “failure”; “23X”.]

Specialize Cache for Query Semantics

[Diagram: Reducer for Join. Inputs: join keys arriving from mappers; all tuples from cache.]

MapReduce semantics require that all keys from the cache be extracted and passed to reducers.

But we only care about keys that join.

Second optimization: Specialization for Equijoin

Index the cache, and only extract keys that join

[Diagram: Reducer for Join. Labels: join keys arriving from mappers; indexed cache lookup; keys that join.]
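The difference between extracting every cached tuple and probing an indexed cache can be sketched in a few lines of Python. This is an illustration of the idea only, not HaLoop's implementation; the function names and the `touched` counter are assumptions added to make the saved work visible.

```python
from collections import defaultdict


def build_cache_index(r_tuples):
    """Index the loop-invariant relation R by its join key."""
    index = defaultdict(list)
    for x, y in r_tuples:
        index[x].append(y)
    return index


def join_with_full_scan(delta_keys, r_tuples):
    """Every cached tuple is read, whether or not its key joins."""
    keys = set(delta_keys)
    out, touched = set(), 0
    for x, y in r_tuples:
        touched += 1
        if x in keys:
            out.add(y)
    return out, touched


def join_with_index(delta_keys, index):
    """Only tuples whose key arrived from the mappers are read."""
    out, touched = set(), 0
    for x in set(delta_keys):
        for y in index.get(x, ()):
            touched += 1
            out.add(y)
    return out, touched


# Hypothetical cached relation and a single arriving join key.
r = [(1, 2), (1, 3), (2, 4), (5, 6), (7, 8)]
scan_result, scan_touched = join_with_full_scan([1], r)
index_result, index_touched = join_with_index([1], build_cache_index(r))
```

Both versions return the same join result; the indexed version reads far fewer cached tuples when the set of arriving keys is small.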

Effect of equijoin specialization

[Plot: total time for loop body (s). Annotations: “Failure occurred”; “~20%”.]

Third Optimization: Extend Cache to Support Duplicate Elimination

The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates.

Extend the cache to support append, and we can use it for Dupe-Elim as well.

[Diagram: Reducer for Dupe-elim. Labels: tuples arriving from mappers; indexed cache lookup, with new tuples inserted; unique keys.]
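A small Python sketch of a cache with append semantics used for duplicate elimination. The class and method names are hypothetical and do not correspond to HaLoop's interface; the point is that the accumulated result stays in the cache between iterations and each lookup also inserts what is new.

```python
class AppendableCache:
    """Holds the accumulated result across iterations."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, tuples):
        """Return tuples not seen before, appending them to the cache."""
        fresh = []
        for t in tuples:
            if t not in self._seen:
                self._seen.add(t)   # append: the cache grows monotonically
                fresh.append(t)
        return fresh


cache = AppendableCache()
first_round = cache.filter_new([1, 2, 2, 3])   # iteration i
second_round = cache.filter_new([3, 4])        # iteration i + 1
```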

Effect of Diff Cache

[Plot: total time for loop body (s). Annotations: “Failures may be more likely due to extra network traffic”; “~20% overall improvement”.]

Overall

[Plot: time (s), 0 to 35,000, vs. iteration #, 0 to 250. Series: (a) no optimizations; (b) HaLoop; (c) all optimizations; (d) raw Hadoop overhead.]

Fewer Iterations: Loop unrolling

Run two joins for every dupe-elim

half the iterations, but each is more expensive
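A Python sketch of the unrolled loop for graph reachability, with two joins per duplicate-elimination step. The function name and the chain graph are illustrative assumptions, and this is single-process code rather than the system measured in the talk.

```python
def reachable_unrolled(edges, source):
    """Reachability with two joins for every dupe-elim."""
    # Index the edge relation by source vertex.
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)

    reached = set(succ.get(source, ()))
    delta = set(reached)
    iterations = 0

    while delta:
        # First join: one hop from the frontier.
        hop1 = set().union(*(succ.get(x, set()) for x in delta))
        # Second join: another hop, without deduplicating in between.
        hop2 = set().union(*(succ.get(x, set()) for x in hop1))
        # Single dupe-elim covering both joins.
        delta = (hop1 | hop2) - reached
        reached |= delta
        iterations += 1

    return reached, iterations


# Hypothetical chain a -> b -> c -> d -> e -> f.
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
facts, rounds = reachable_unrolled(chain, "a")
```

On this chain, a loop with one join per round needs five rounds to reach the fixpoint, while the unrolled loop needs three. Each unrolled round does more join work, since the second join also processes tuples that a dupe-elim would have discarded.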

change strategies

[Plot: # of newly discovered facts (log scale, 1 to 10,000,000) vs. iteration (1 to 23). Series: Greenplum; Myria. Annotation: “not much useful work”.]

reachable(Y) :- edge(5,Y)
reachable(Y) :- edge(X,Y), reachable(X)

[Plot: total time (seconds), 0 to 700, vs. iteration (1 to 22). Series: Greenplum; Myria; Greenplum, incremental; Greenplum, incremental+index. Annotation: “Low per-iteration cost”.]

Summary

• Goal: Expose all the world’s science data through declarative query interfaces!

• Motivated by real science

• Data and query model is iterative relational algebra

• Industrial-strength Query-as-a-Service

http://myria-web.appspot.com/

http://db.cs.washington.edu/myria/


• Hypothesis: The performance difference between hand-coded graph algorithms and relational query plans amounts to implementation details

• Can we generate “hand-coded” plans?

[Diagram labels: Datalog Parser; Myria Compiler; Logical Optimizer; Google App Engine; C Compiler; Grappa.]

Path-Counting Queries

Ex: Count the number of unique 2-hops

answers = set()
for all (x, y1) in edges:
    for all (y2, z) in edges:
        if y1 == y2:
            answers.insert((x,z))
count = answers.size()

Assume a collection edges

In an RDBMS: “Nested Loops Join”

answers = set()
for all (x, y) in edges:
    for all z in neighbors[y]:
        answers.insert((x,z))
count = answers.size()

Assume a collection edges, but also an index neighbors: vertex -> [vertex]

In an RDBMS: “Hash Join”

answers = set()
for all x in neighbors:
    for all y in neighbors[x]:
        for all z in neighbors[y]:
            answers.insert((x,z))
count = answers.size()

Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]

In an RDBMS: Still a Hash Join

count = 0
answers = set()
for all x in neighbors:
    for all y in neighbors[x]:
        for all z in neighbors[y]:
            answers.insert(z)
    count += answers.size()
    answers.clear()

Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]

RDBMS don’t express this, but there’s no reason they couldn’t

stays small

only one value

count = 0
answers = set()
for all x in neighbors:
    for all y in x.neighbors():
        for all z in y.neighbors():
            answers.insert(z)
    count += answers.size()
    answers.clear()

stays small

only one value, so

Or if you prefer… assume a collection of vertices, where each vertex points directly to its neighbors

Boils down to dereferencing a pointer vs. probing a hash table
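The progression across the preceding slides can be run side by side. The following Python sketch implements the nested-loops version, the version that probes a `neighbors` index, and the per-source version that keeps only a small set of endpoints at a time. Function names and the sample graph are assumptions made for illustration.

```python
from collections import defaultdict


def two_hops_nested_loops(edges):
    """Compare every pair of edges: the nested loops join."""
    answers = set()
    for x, y1 in edges:
        for y2, z in edges:
            if y1 == y2:
                answers.add((x, z))
    return len(answers)


def build_neighbors(edges):
    """The index neighbors: vertex -> [vertex]."""
    neighbors = defaultdict(list)
    for x, y in edges:
        neighbors[x].append(y)
    return neighbors


def two_hops_hash_join(edges, neighbors):
    """Probe the index instead of rescanning edges: the hash join."""
    answers = set()
    for x, y in edges:
        for z in neighbors.get(y, ()):
            answers.add((x, z))
    return len(answers)


def two_hops_per_source(neighbors):
    """Count distinct endpoints one source at a time, so the set stays small."""
    count = 0
    answers = set()
    for x in neighbors:
        for y in neighbors[x]:
            # .get avoids inserting keys into the defaultdict mid-iteration.
            for z in neighbors.get(y, ()):
                answers.add(z)
        count += len(answers)
        answers.clear()
    return count


# Hypothetical graph with two different paths from 1 to 3.
edges = [(1, 2), (2, 3), (1, 4), (4, 3), (3, 5)]
neighbors = build_neighbors(edges)
results = (
    two_hops_nested_loops(edges),
    two_hops_hash_join(edges, neighbors),
    two_hops_per_source(neighbors),
)
```

All three return the same count. The last version never materializes the full set of (x, z) pairs, only the endpoints for the current source.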

Experiments

• Data sets:

Dataset            # Vertices   # Edges      # Distinct 2-hop Paths   # Triangles
BSN*               685,230      7,600,595    78,350,597               6,935,709
Twitter 4MEⱡ       166,317      4,532,185    1,056,317,985            14,912,950
com-livejournal*   3,997,962    34,681,189   735,398,579
soc-livejournal*   4,874,571    68,993,773   112,319,229

* http://snap.stanford.edu/
ⱡ H. Kwak et al 2010.

Experiments

[Plots: BSN data set and Twitter 4ME data set, single-threaded. Series: no dupe elim; dupe elim.]

Experiments

• Parallel system performance
