Myria: Analytics-as-a-Service for (Data) Scientists


Upload: bill-howe

Post on 10-May-2015


DESCRIPTION

Talk delivered at High Performance Transaction Processing 2013. Myria is a new Big Data service being developed at the University of Washington. We feature high-level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation. In this talk, we describe the motivation for another big data platform, emphasizing requirements emerging from the physical, life, and social sciences.

TRANSCRIPT

Page 1: Myria: Analytics-as-a-Service for (Data) Scientists

04/11/2023 1

Myria: Analytics-as-a-Service for (Data) Scientists

Bill Howe
University of Washington

Bill Howe, UW

Page 2: Myria: Analytics-as-a-Service for (Data) Scientists

“It’s a great time to be a data geek.”

-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”

-- Jeff Hammerbacher, co-founder, Cloudera

Page 3: Myria: Analytics-as-a-Service for (Data) Scientists
Page 4: Myria: Analytics-as-a-Service for (Data) Scientists


How can we deliver 1000 little SDSSs to anyone who wants one?

Page 5: Myria: Analytics-as-a-Service for (Data) Scientists

R/V Wecoma, April 2007

Page 6: Myria: Analytics-as-a-Service for (Data) Scientists


Armbrust Lab Retreat, 2009 (Biology, Oceanography)

Page 7: Myria: Analytics-as-a-Service for (Data) Scientists


Astronomy Visualization Workshop, 2011

Page 8: Myria: Analytics-as-a-Service for (Data) Scientists


Big Data in the Long Tail Workshop, 2012 (Social Sciences)

Page 9: Myria: Analytics-as-a-Service for (Data) Scientists

Maier’s 2nd Maxim

Working with scientists is like working with 7-year-olds:

They think they know everything and they don’t have any money

Page 10: Myria: Analytics-as-a-Service for (Data) Scientists

My Goal: Expose all the world’s science data through declarative query interfaces

Page 11: Myria: Analytics-as-a-Service for (Data) Scientists

Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”

Page 12: Myria: Analytics-as-a-Service for (Data) Scientists


Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

COGAnnotation_coastal_sample.txt

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
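To make the join above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column names other than COG_hit and hit, and all of the rows, are invented for illustration; the talk does not show the contents of the two files.

```python
import sqlite3

# In-memory database standing in for the two uploaded annotation files.
# Column names "orf" and "read_id" and every row are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Phaeo_genome (orf TEXT, COG_hit TEXT)")
conn.execute("CREATE TABLE coastal_sample (read_id TEXT, hit TEXT)")
conn.executemany(
    "INSERT INTO Phaeo_genome VALUES (?, ?)",
    [("orf1", "COG0001"), ("orf2", "COG0002")],
)
conn.executemany(
    "INSERT INTO coastal_sample VALUES (?, ?)",
    [("r1", "COG0002"), ("r2", "COG0003")],
)

# The query from the slide, unchanged.
joined = conn.execute(
    "SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit"
).fetchall()
print(joined)
```

Only the annotation shared by both tables (COG0002 in this toy data) survives the join, which is the comparison a scientist would otherwise script by hand over two text files.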

Page 13: Myria: Analytics-as-a-Service for (Data) Scientists


“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

Maslow’s Needs Hierarchy

Page 14: Myria: Analytics-as-a-Service for (Data) Scientists

A “Needs Hierarchy” of Science Data Management

storage
sharing
query
curation
analytics

Page 15: Myria: Analytics-as-a-Service for (Data) Scientists

A “Needs Hierarchy” of Science Data Management

storage
sharing
semantic integration
query
analytics

Page 16: Myria: Analytics-as-a-Service for (Data) Scientists

Why should you care?

Science == Data Science

Page 17: Myria: Analytics-as-a-Service for (Data) Scientists

QUERY-AS-A-SERVICE

2010 - present

Version 1

Page 18: Myria: Analytics-as-a-Service for (Data) Scientists

1) Upload data “as is”
Cloud-hosted; no need to install or design a database; no pre-defined schema

2) Write SQL
Right in your browser, writing queries on top of queries on top of queries ...

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC

3) Share the results
Make them public, tag them, share with specific colleagues – anyone with access can query

Page 19: Myria: Analytics-as-a-Service for (Data) Scientists

Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)

SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]

EXCEPT

SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
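The query above is "union of the three samples minus their intersection". As a rough illustration of the same logic, the sketch below uses Python sets; the ids are made up and the three variable names merely echo the table names.

```python
# Hypothetical TIGRFam ids per sample; real data is not shown in the talk.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00002", "TIGR00003"}
combo = {"TIGR00003", "TIGR00004"}

present_somewhere = refseq | est | combo    # UNION ... UNION
present_everywhere = refseq & est & combo   # INTERSECT ... INTERSECT
missing_from_some = present_somewhere - present_everywhere  # EXCEPT

print(sorted(missing_from_some))
```

In SQL this reading relies on INTERSECT binding more tightly than UNION and EXCEPT, which holds in standard SQL; adding parentheses around the INTERSECT half would make the intent explicit.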

Page 20: Myria: Analytics-as-a-Service for (Data) Scientists

SELECT x.strain, x.chr, x.region as snp_region,
       x.start_bp as snp_start_bp, x.end_bp as snp_end_bp,
       w.start_bp as nc_start_bp, w.end_bp as nc_end_bp,
       w.category as nc_category,
       CASE
         WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
         WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
         WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries (rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results

We see thousands of queries written by non-programmers
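The CASE expression in the query computes the length of the overlap between two inclusive base-pair intervals. A compact way to express that idea is sketched below; the function name is hypothetical, and this closed form is an alternative formulation rather than the query author's method. It agrees with the CASE branches when x lies inside w or the intervals partially overlap, and it differs when w lies strictly inside x, where it returns the length of w.

```python
def overlap_length(x_start, x_end, w_start, w_end):
    """Base pairs shared by two inclusive intervals; 0 if they are disjoint."""
    return max(0, min(x_end, w_end) - max(x_start, w_start) + 1)


print(overlap_length(10, 20, 5, 30))   # x inside w
print(overlap_length(10, 20, 15, 30))  # w starts inside x
print(overlap_length(10, 20, 5, 12))   # w ends inside x
print(overlap_length(10, 20, 25, 30))  # disjoint
```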

Page 21: Myria: Analytics-as-a-Service for (Data) Scientists

Howe, et al., CISE 2012

Page 22: Myria: Analytics-as-a-Service for (Data) Scientists

Steven Roberts

SQL as a lab notebook: http://bit.ly/16Xj2JP

Popular service for Bioinformatics Workflows

Page 23: Myria: Analytics-as-a-Service for (Data) Scientists

Halperin, Howe, et al. SSDBM 2013

Page 24: Myria: Analytics-as-a-Service for (Data) Scientists

“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.

Previously, we were using huge directory trees and plain text files.

Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”

-- Andrew D White

Andrew White, UW Chemistry

Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted

Page 25: Myria: Analytics-as-a-Service for (Data) Scientists

Scientific data management reduces to sharing views

• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?
– rename columns, apply functions with views
• Attach metadata?
– add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
– hide bad values with views
• Maintain provenance?
– inspect view dependencies
• Propagate updates?
– view maintenance
• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)

SSDBM 2011
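A few of the items on this slide can be sketched with plain SQL views. The example below uses Python's sqlite3 module rather than SQLShare itself; the table, the column names, the Fahrenheit readings, and the -999 "bad value" sentinel are all assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An upload "as is": generic column names, one sentinel for a bad reading.
conn.execute("CREATE TABLE raw_upload (col0 TEXT, col1 REAL)")
conn.executemany(
    "INSERT INTO raw_upload VALUES (?, ?)",
    [("s1", 68.0), ("s2", -999.0), ("s3", 50.0)],
)

# Naming conventions and units: rename columns, convert Fahrenheit to Celsius.
conn.execute("""
    CREATE VIEW temps AS
    SELECT col0 AS station,
           col1 AS temp_f,
           (col1 - 32.0) * 5.0 / 9.0 AS temp_c
    FROM raw_upload
""")

# Quality control: a view on top of a view that hides the sentinel rows.
conn.execute("""
    CREATE VIEW clean_temps AS
    SELECT station, temp_c FROM temps WHERE temp_f <> -999.0
""")

rows = conn.execute(
    "SELECT station, temp_c FROM clean_temps ORDER BY station"
).fetchall()

# Provenance: the catalog keeps each view's definition for inspection.
views = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view' ORDER BY name"
).fetchall()
print(rows, views)
```

The raw upload is never modified; each cleanup step is a named, shareable query layered on the previous one, which is the sense in which management tasks reduce to sharing views.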

Page 26: Myria: Analytics-as-a-Service for (Data) Scientists

Two Problems with SQLShare

• No help for really big datasets
• No iteration

Page 27: Myria: Analytics-as-a-Service for (Data) Scientists

Myria is…

• A compiler framework for multiple iterative RA-based languages
• A parallel, shared-nothing, iterative execution engine
• A RESTful Query-as-a-Service platform
• A prefix meaning “ten thousand” in Greek

Page 28: Myria: Analytics-as-a-Service for (Data) Scientists

Myria Team

Dan Suciu, Magda Balazinska, Bill Howe

Dan Halperin (postdoc, technical lead)
Victor Almeida (postdoc)
Andrew Whitaker (research scientist)

Students: Paris Koutris, Emad Soroush, Jingjing Wang, ShengLiang Xu, Jennifer Ortiz, Jeremy Hyrkas, Shumo Chu

Page 29: Myria: Analytics-as-a-Service for (Data) Scientists

Myria Architecture

[Architecture diagram. Labeled components: Web UI; MyriaL; Language Parser; Myria Compiler; Logical Optimizer for RA+While; REST Server; Google App Engine; C Compiler; Grappa; MyriaDB, with a Coordinator and Catalog plus three workers, each shown with a Worker Catalog, an RDBMS reached via jdbc, and HDFS. Connection labels: json query plan; netty protocols.]

Page 30: Myria: Analytics-as-a-Service for (Data) Scientists

A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)

Page 31: Myria: Analytics-as-a-Service for (Data) Scientists

A = LOAD('points.txt', id:int, x:float, y:float)

E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]

DO
  I = CROSS(Kmeans, Centroids);
  J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, $distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];

  K = [FROM J EMIT id, distance=$min(distance)];
  L = JOIN(J, id, K, id)
  M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];

  Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];

  Delta = DIFF(Kmeans', Kmeans)
  Kmeans = Kmeans'

  Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
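For readers unfamiliar with the relational formulation, the loop above can be read as ordinary k-means. The following is a plain-Python sketch of the same steps (assign each point to its nearest centroid, break ties by smallest cluster id, recompute centroids, stop when no assignment changes). It is not MyriaL and not the Myria implementation; the function name and the choice to seed centroids from the first k points are assumptions.

```python
import math


def kmeans(points, k):
    """points: list of (id, x, y). Returns ({id: cluster_id}, {cluster_id: (x, y)})."""
    # Seed centroids from the first k points (the slide seeds from LIMIT(A, 4)).
    centroids = {cid: (x, y) for cid, (_, x, y) in enumerate(points[:k])}
    assignment = {pid: 0 for pid, _, _ in points}  # everyone starts in cluster 0
    while True:
        # CROSS + $distance + $min: nearest centroid, ties go to the lowest id.
        new_assignment = {
            pid: min(centroids, key=lambda c: (math.dist((x, y), centroids[c]), c))
            for pid, x, y in points
        }
        # DIFF: which points changed cluster in this pass?
        delta = {p for p in new_assignment if new_assignment[p] != assignment[p]}
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for cid in centroids:
            members = [(x, y) for pid, x, y in points if assignment[pid] == cid]
            if members:  # keep the old centroid if a cluster is empty
                centroids[cid] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        if not delta:  # WHILE Delta != {}
            return assignment, centroids


pts = [(1, 0.0, 0.0), (2, 10.0, 10.0), (3, 0.0, 1.0), (4, 10.0, 11.0)]
clusters, centers = kmeans(pts, 2)
print(clusters, centers)
```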

Page 32: Myria: Analytics-as-a-Service for (Data) Scientists

Why Iteration Matters

Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations

Page 33: Myria: Analytics-as-a-Service for (Data) Scientists

Why Iteration Matters

Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations

Vast majority of reachable tuples discovered by iteration 25

Page 34: Myria: Analytics-as-a-Service for (Data) Scientists

Why Iteration Matters

Datalog: reachability in a graph with 1.4B unique edges: almost 200 iterations

Vast majority of reachable tuples discovered by iteration 25

The datalog program continues for almost 200 iterations, each almost as expensive as the early steps

Page 35: Myria: Analytics-as-a-Service for (Data) Scientists

Fewer Iterations: Endgame Problem [Afrati 10]

[Plot: # of tuples discovered (log scale, 1 to 100,000,000) vs. iteration # (0 to 180). Legend: frontier tuples; previously discovered tuples removed.]

Page 36: Myria: Analytics-as-a-Service for (Data) Scientists

Basic Semi-Naïve Evaluation

Join Dupe-elim

A(y) :- R(‘a’, y)
A(y) :- A(x), R(x,y)

Reachability from ‘a’ in datalog
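As a rough illustration of semi-naive evaluation for the two rules above, the sketch below keeps a frontier of newly discovered tuples and, on each iteration, joins only that frontier with R and then removes duplicates. The function name and the toy edge list are assumptions; a real engine would run these two operators in parallel over partitioned data.

```python
def reachable_from(edges, source):
    """Semi-naive evaluation of A(y) :- R(source, y); A(y) :- A(x), R(x, y)."""
    # Index R by its first column so the join is a lookup.
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)

    reached = set(neighbors.get(source, ()))  # base rule
    delta = set(reached)                      # tuples discovered last round
    iterations = 0
    while delta:
        # Join: extend only the newly discovered tuples.
        candidates = {y for x in delta for y in neighbors.get(x, ())}
        # Dupe-elim: keep only tuples not seen before.
        delta = candidates - reached
        reached |= delta
        iterations += 1
    return reached, iterations


toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("e", "a")]
result, n_iter = reachable_from(toy_edges, "a")
print(sorted(result), n_iter)
```

The iteration count grows with the longest shortest path from the source, which is why long, thin tails in a graph lead to many nearly empty iterations.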

Page 37: Myria: Analytics-as-a-Service for (Data) Scientists

MAYBE JUST USE HADOOP?

Page 38: Myria: Analytics-as-a-Service for (Data) Scientists


(a) R is loop invariant, but gets loaded and shuffled on each iteration

(b) Ai grows slowly and monotonically, but is loaded and shuffled on each iteration. HaLoop’s Reducer Input Cache addressed (a), but did not support the append semantics needed for (b).

Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12

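[Diagram: MapReduce dataflow for one iteration, with map and reduce stages implementing Join and Difference over the inputs ΔAi-1, R(0), R(1), Ai(0), Ai(1); markers (a) and (b) label the two inputs described in the text.]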

VLDB 2010, VLDBJ 2011

Page 39: Myria: Analytics-as-a-Service for (Data) Scientists

Inter-loop caching

Iteration i = 0: Load a distributed cache

Iteration i > 0:

[Diagram: MapReduce dataflow with map and reduce stages implementing Join and Difference over ΔAi-1, R(0), R(1), Ai(0), Ai(1), alongside cached partitions R(0), R(1), A(0), A(1).]

Bu, Howe, Balazinska, Ernst VLDB10, VLDBJ12, Datalog12

VLDB 2010, VLDBJ 2011

Page 40: Myria: Analytics-as-a-Service for (Data) Scientists

Caching Loop-Invariant Data

First iteration is slow, as the invariant graph is shuffled and cached

[Plot annotations: “failure”; “23X”. Shown next to the MapReduce Join/Difference dataflow diagram.]

Page 41: Myria: Analytics-as-a-Service for (Data) Scientists

Specialize Cache for Query Semantics

[Diagram: MapReduce dataflow with map and reduce stages implementing Join and Difference over ΔAi-1, R(0), R(1), Ai(0), Ai(1).]

join keys arriving from mappers

all tuples from cache

MapReduce semantics require that all keys from the cache be extracted and passed to reducers.

But we only care about keys that join.

Reducer for Join

Page 42: Myria: Analytics-as-a-Service for (Data) Scientists

Second optimization: Specialization for Equijoin

[Diagram: MapReduce dataflow with map and reduce stages implementing Join and Difference over ΔAi-1, R(0), R(1), Ai(0), Ai(1).]

join keys arriving from mappers

keys that join

Index the cache, and only extract keys that join

Reducer for Join

indexed cache lookup
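The idea on this slide, index the cached loop-invariant relation and probe it only with the join keys that actually arrive from the mappers, might look roughly like the sketch below. The class and method names are hypothetical and this is not HaLoop's API; it only illustrates why an indexed cache avoids scanning all cached tuples on every iteration.

```python
from collections import defaultdict


class JoinCache:
    """Loop-invariant relation R, indexed by join key once at iteration 0."""

    def __init__(self, r_tuples):
        self.index = defaultdict(list)
        for x, y in r_tuples:
            self.index[x].append(y)

    def reduce_join(self, delta_keys):
        """Probe the index only for keys arriving from mappers."""
        # .get avoids creating empty entries for keys that do not join.
        return [(x, y) for x in delta_keys for y in self.index.get(x, ())]


cache = JoinCache([(1, 2), (1, 3), (2, 4), (5, 6)])
joined_pairs = cache.reduce_join([1, 7])  # key 7 has no match in R
print(joined_pairs)
```

Cached tuples for keys 2 and 5 are never touched in this call, whereas plain MapReduce semantics would pass every cached key to a reducer.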

Page 43: Myria: Analytics-as-a-Service for (Data) Scientists

Effect of equijoin specialization

[Plot: total time for loop body (s). Annotations: “Failure occurred”; “~20%”.]

Page 44: Myria: Analytics-as-a-Service for (Data) Scientists

Third Optimization: Extend Cache to Support Duplicate Elimination

[Diagram: MapReduce dataflow with map and reduce stages implementing Join and Difference over ΔAi-1, R(0), R(1), Ai(0), Ai(1).]

The accumulated result is not loop-invariant, but it changes relatively slowly, and is needed on every iteration to check for duplicates.

Extend the cache to support append, and we can use it for Dupe-Elim as well.

tuples arriving from mappers

unique keys

Reducer for Dupe-elim

indexed cache lookup, with new tuples inserted
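A cache that supports append can serve duplicate elimination directly: each reducer keeps the tuples it has already emitted, looks up incoming tuples, and inserts the new ones. The sketch below is a single-process illustration with hypothetical names, not HaLoop code.

```python
class DedupCache:
    """Accumulated result kept across iterations; supports lookup and append."""

    def __init__(self):
        self.seen = set()

    def reduce_dedup(self, tuples):
        """Return only tuples never seen before, and remember them."""
        fresh = []
        for t in tuples:
            if t not in self.seen:   # indexed cache lookup
                self.seen.add(t)     # append the new tuple to the cache
                fresh.append(t)
        return fresh


dedup = DedupCache()
first = dedup.reduce_dedup(["b", "c", "b"])
second = dedup.reduce_dedup(["c", "d"])
print(first, second)
```

Because the accumulated result stays resident between iterations, only the newly arriving tuples need to be shipped to the reducer.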

Page 45: Myria: Analytics-as-a-Service for (Data) Scientists

Effect of Diff Cache

[Plot: total time for loop body (s).]

Failures may be more likely due to extra network traffic

~20% overall improvement

Page 46: Myria: Analytics-as-a-Service for (Data) Scientists

Overall

[Plot: time (s), 0 to 35,000, vs. iteration #, 0 to 250. Series: (a) no optimizations; (b) HaLoop; (c) all optimizations; (d) raw Hadoop overhead.]

Page 47: Myria: Analytics-as-a-Service for (Data) Scientists

Fewer Iterations: Loop unrolling

Run two joins for every dupe-elim
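One way to read "two joins for every dupe-elim" is sketched below for the reachability example: each round joins the frontier with the edge relation twice and only then removes previously seen tuples. This is an illustrative single-machine sketch with assumed names, not the system's implementation. Note that the intermediate frontier is a Python set, so it is deduplicated against itself, just not against the accumulated result.

```python
def reachable_unrolled(edges, source):
    """Reachability with the loop body unrolled: two joins per dupe-elim."""
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)

    def join(frontier):
        return {y for x in frontier for y in neighbors.get(x, ())}

    reached = join({source})
    delta = set(reached)
    rounds = 0
    while delta:
        one_hop = join(delta)    # first join
        two_hop = join(one_hop)  # second join, no dupe-elim in between
        delta = (one_hop | two_hop) - reached  # single dupe-elim per round
        reached |= delta
        rounds += 1
    return reached, rounds


chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
found, n_rounds = reachable_unrolled(chain, "a")
print(sorted(found), n_rounds)
```

On this six-node chain the unrolled loop finishes in 3 rounds, where one join per dupe-elim needs 5 iterations; each round does more join work, which is the trade-off the slide points to.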

Page 48: Myria: Analytics-as-a-Service for (Data) Scientists


half the iterations, but each is more expensive

change strategies

Page 49: Myria: Analytics-as-a-Service for (Data) Scientists

reachable(Y) :- edge(5,Y)
reachable(Y) :- edge(X,Y), reachable(X)

[Plot: # of Newly Discovered Facts (log scale, 1 to 10,000,000) vs. Iteration (1 to 23). Series: Greenplum; Myria. Annotation: “not much useful work”.]

Page 50: Myria: Analytics-as-a-Service for (Data) Scientists

[Plot: Total Time (seconds), 0 to 700, vs. Iteration (1 to 22). Series: Greenplum; Myria; Greenplum, incremental; Greenplum, incremental+index. Annotation: “Low per-iteration cost”.]

Page 51: Myria: Analytics-as-a-Service for (Data) Scientists

Summary

• Goal: Expose all the world’s science data through declarative query interfaces!
• Motivated by real science
• Data and query model is iterative relational algebra
• Industrial-strength Query-as-a-Service

http://myria-web.appspot.com/

http://db.cs.washington.edu/myria/

Page 52: Myria: Analytics-as-a-Service for (Data) Scientists


Page 53: Myria: Analytics-as-a-Service for (Data) Scientists

• Hypothesis: The performance difference between hand-coded graph algorithms and relational query plans amounts to implementation details

• Can we generate “hand-coded” plans?

[Diagram: Datalog Parser; Myria Compiler; Logical Optimizer; Google App Engine; C Compiler; Grappa.]

Page 54: Myria: Analytics-as-a-Service for (Data) Scientists

Path-Counting Queries

Ex: Count the number of unique 2-hops

Page 55: Myria: Analytics-as-a-Service for (Data) Scientists

Assume a collection edges

answers = set()
for (x, y1) in edges:
    for (y2, z) in edges:
        if y1 == y2:
            answers.add((x, z))
count = len(answers)

In an RDBMS: “Nested Loops Join”

Page 56: Myria: Analytics-as-a-Service for (Data) Scientists

Assume a collection edges, but also an index neighbors: vertex -> [vertex]

answers = set()
for (x, y) in edges:
    for z in neighbors[y]:
        answers.add((x, z))
count = len(answers)

In an RDBMS: “Hash Join”

Page 57: Myria: Analytics-as-a-Service for (Data) Scientists

Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]

answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add((x, z))
count = len(answers)

In an RDBMS: Still a Hash Join

Page 58: Myria: Analytics-as-a-Service for (Data) Scientists

Just drop the edges collection entirely, leaving only the index neighbors: vertex -> [vertex]

count = 0
answers = set()  # stays small
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add(z)  # only one value
    count += len(answers)
    answers.clear()

RDBMS don’t express this, but there’s no reason they couldn’t

Page 59: Myria: Analytics-as-a-Service for (Data) Scientists

Or if you prefer… assume a collection of vertices, where each vertex points directly to its neighbors

count = 0
answers = set()  # stays small
for x in vertices:
    for y in x.neighbors():
        for z in y.neighbors():
            answers.add(z)  # only one value
    count += len(answers)
    answers.clear()

Boils down to dereferencing a pointer vs. probing a hash table
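The progression on the last few slides can be checked on a toy graph. The sketch below implements three of the variants (nested loops, hash join with a neighbors index, and the per-source version with a small reusable set) and confirms they count the same number of distinct 2-hops. The function names and the five-edge graph are made up for illustration.

```python
from collections import defaultdict

toy_graph = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 1)]


def two_hops_nested_loops(edges):
    """Compare every edge with every edge: quadratic in the number of edges."""
    answers = set()
    for x, y1 in edges:
        for y2, z in edges:
            if y1 == y2:
                answers.add((x, z))
    return len(answers)


def build_index(edges):
    """neighbors: vertex -> [vertex]"""
    neighbors = defaultdict(list)
    for x, y in edges:
        neighbors[x].append(y)
    return neighbors


def two_hops_hash_join(edges, neighbors):
    """Probe the index instead of rescanning the edge list."""
    answers = set()
    for x, y in edges:
        for z in neighbors.get(y, ()):
            answers.add((x, z))
    return len(answers)


def two_hops_per_source(neighbors):
    """Deduplicate per source vertex, so the set holds single values and stays small."""
    count = 0
    answers = set()
    for x in list(neighbors):
        for y in neighbors[x]:
            for z in neighbors.get(y, ()):
                answers.add(z)
        count += len(answers)
        answers.clear()
    return count


index = build_index(toy_graph)
counts = (
    two_hops_nested_loops(toy_graph),
    two_hops_hash_join(toy_graph, index),
    two_hops_per_source(index),
)
print(counts)
```

All three agree; what changes is how much intermediate state each variant keeps and whether finding a vertex's neighbors is a scan, a hash probe, or a pointer dereference.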

Page 60: Myria: Analytics-as-a-Service for (Data) Scientists

Experiments

• Data sets:

Dataset            # Vertices   # Edges      # Distinct 2-hop Paths   # Triangles
BSN*               685,230      7,600,595    78,350,597               6,935,709
Twitter 4MEⱡ       166,317      4,532,185    1,056,317,985            14,912,950
com-livejournal*   3,997,962    34,681,189   735,398,579
soc-livejournal*   4,874,571    68,993,773   112,319,229

* http://snap.stanford.edu/
ⱡ H. Kwak et al 2010.

Page 61: Myria: Analytics-as-a-Service for (Data) Scientists

Experiments

[Plots: BSN data set; Twitter 4ME data set. Labels: no dupe elim; dupe elim; single-threaded.]

Page 62: Myria: Analytics-as-a-Service for (Data) Scientists

Experiments

Page 63: Myria: Analytics-as-a-Service for (Data) Scientists

Experiments

• Parallel system performance