xldb south america keynote: escience institute and myria

Myria: Scalable Analytics as a Service

Bill Howe, PhD, University of Washington

XLDB South America 2014

04/11/2023 2/57

This morning

• UW eScience Institute – A “Data Science Environment”

• SQLShare and High Variety Data

• Myria and “Relational Algorithmics”

Bill Howe, UW


“It’s a great time to be a data geek.”

-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”

-- Jeff Hammerbacher, co-founder, Cloudera

The Fourth Paradigm

1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive

Jim Gray


“All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.”

2005-2008

In other words:
• Data-driven discovery will be ubiquitous
• UW must be a leader in inventing the capabilities
• UW must be a leader in translational activities – in putting these capabilities to work

• It’s about intellectual infrastructure (human capital) and software infrastructure (shared tools and services – digital capital)

A 5-year, US$37.8 million cross-institutional collaboration to create a data science environment


2014


Data Science Kickoff Session: 137 posters from 30+ departments and units

Establish a virtuous cycle

• 6 working groups, each with 3-6 faculty from each institution


UW Data Science Education Efforts


Audiences:
• Students: CS/Informatics (undergrads, grads); Non-Major (undergrads, grads)
• Non-Students: professionals, researchers

Programs:
• UWEO Data Science Certificate
• MOOC Intro to Data Science
• IGERT: Big Data PhD Track
• New CS Courses
• Bootcamps and workshops
• Intro to Data Programming
• Data Science Masters (planned)
• Incubator: hands-on training


Next session begins June 30, 2014

https://www.coursera.org/course/datasci

MOOC Participation numbers

• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9000 (typical attrition for a MOOC)
• “Passed”: 7022
• Forum threads: 4661
• Forum posts: 22,900

Fairly consistent with Coursera data across “hard” courses

Educational transformation: A new generation of “Pi-shaped” scientists

PhD πhD

Educational transformation

Magda Balazinska


Educational transformation

Big Data access and management

Big Data modeling

Big Data analytics

Collaborative Big Data science

Education and Research in Data Science
• Ultimate goal: A new PhD program
– Initial goal: A new certificate based on Big Data tracks in all departments
– Education highlights: data science courses, co-advising, and internships
• End-to-End Research Agenda
– Big Data mgmt, analytics, modeling, & collaboration
• Cyberinfrastructure Development
– Big Data analysis service

The Data Science Studio

• An open collaborative research space
• A resident data science team
– Permanent staff of ~5 data scientists – applied research and development
– ~15-20 data science fellows (research scientists, visitors, postdocs, students)
• How to Engage:
– Drop-in open workspace
– Studio “Office Hours”
– Incubation Program


6th floor Physics Astronomy Building

A partnership among …

• Provost
• UW Libraries
• Physics, Astronomy, Arts & Sciences
• eScience Institute

Estimated Timeline:
• Design Phase: Jan-June
• Construction: June-Sep
• Target: October 1, 2014

The rest of this talk…


How can we deliver 1000 little SDSSs to anyone who wants one?


[Figure: Astronomy and Ocean Sciences projects plotted on two axes, # of bytes and # of data sources]

Astronomy
• LSST (~100PB; images, spectra)
• PanSTARRS (~40PB; images, trajectories)
• SDSS (~400TB; images, spectra, catalogs)

Ocean Sciences
• OOI (~50TB/year; sims, RSN)
• IOOS (~50TB/year; sims, satellite, gliders, AUVs, vessels, more)
• CMOP (~10TB/year; sims, stations, gliders, AUVs, vessels, more)

Data source types shown: telescopes, spectra, n-body sims, models, AUVs, stations, cruises, CTDs, flow cytometry, gliders, ADCP, satellites

3 V’s of Big Data

Volume

Variety

Velocity


How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”


Key question: How can we reduce this “data overhead”?


Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

COGAnnotation_coastal_sample.txt

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
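The point of the example is that once both files are loaded as tables, combining them is a single declarative join. A minimal sketch using Python's built-in sqlite3 module follows; the sample rows and the columns `orf` and `sample_id` are invented for illustration, and only `COG_hit` and `hit` come from the slide.

```python
import sqlite3

# In-memory database standing in for the two uploaded files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phaeo_genome (orf TEXT, COG_hit TEXT);      -- hypothetical columns
    CREATE TABLE coastal_sample (sample_id TEXT, hit TEXT);  -- hypothetical columns
    INSERT INTO Phaeo_genome VALUES ('orf1', 'COG0001'), ('orf2', 'COG0002');
    INSERT INTO coastal_sample VALUES ('s1', 'COG0002'), ('s2', 'COG0003');
""")

# Same join condition as the slide: match genome annotations to sample hits.
rows = conn.execute("""
    SELECT p.orf, c.sample_id, p.COG_hit
    FROM Phaeo_genome p, coastal_sample c
    WHERE p.COG_hit = c.hit
""").fetchall()

print(rows)
```

Only the annotation shared by both tables survives the join, which is the "data handling" step that would otherwise be a hand-written script.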


Data Science Workflow:


1) Preparing to run a model

2) Running the model

3) Interpreting the results

Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging

“80% of the work”

-- Aaron Kimball

“The other 80% of the work”

DB

ML/Stats

Vis

“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used).

In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants)

So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format.

I guess in total [I spent] 6 months [on this project].”

At least 3 months on issues of scale, file handling, and feature engineering.

Martin Kircher, Genome Sciences

Why?

• 3k NSF postdocs in 2010
• $50k / postdoc
• at least 50% overhead

maybe $75M annually at NSF alone?
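The estimate is a back-of-the-envelope product of the three figures on the slide. A short sketch of the arithmetic, with variable names chosen here for illustration:

```python
postdocs = 3_000            # NSF postdocs in 2010 (from the slide)
cost_per_postdoc = 50_000   # US$ per postdoc per year (from the slide)
overhead_fraction = 0.5     # "at least 50%" of time spent handling data

# Annual cost of the time spent on data handling rather than science.
annual_overhead = postdocs * cost_per_postdoc * overhead_fraction

print(f"${annual_overhead / 1e6:.0f}M per year")
```

Because 50% is stated as a lower bound, the resulting figure is a floor rather than a point estimate.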

[Figure: bar chart of Benchmark 1 and Benchmark 2 for “Old system”, “Your system”, and “Our system”; axis from 0 to 120]

A typical Computer Science paper….

slide src: Dan Halperin

[Figure: the same bar chart with a fourth series, “What people use”; axis from 0 to 12500]

The reality of the situation….

slide src: Dan Halperin


A modest goal:

Expose all the world’s science data through declarative query interfaces


QUERY-AS-A-SERVICE


2010 - present

Version 1

1) Upload data “as is”
Cloud-hosted, secure; no need to install or design a database; no pre-defined schema; schema inference; some integration

2) Write Queries
Right in your browser, writing views on top of views on top of views ...

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC

3) Share the results
Make them public, tag them, share with specific colleagues – anyone with access can query

http://sqlshare.escience.washington.edu

SELECT x.strain, x.chr, x.region as snp_region,
       x.start_bp as snp_start_bp, x.end_bp as snp_end_bp,
       w.start_bp as nc_start_bp, w.end_bp as nc_end_bp,
       w.category as nc_category,
       CASE
         WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1
         WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1
         WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [[email protected]].[hotspots_deserts.tab] x
INNER JOIN [[email protected]].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries (rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results
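The CASE expression in that query spells out the overlap length case by case. For inclusive intervals the same quantity can usually be written in closed form; the helper below is a hypothetical sketch for illustration (its name and arguments are not from SQLShare), and it agrees with the three cases the query enumerates.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals, 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# One example per CASE branch, with x as the first interval and w as the second.
inside = overlap_length(10, 20, 5, 30)     # x lies inside w: x.end - x.start + 1
left = overlap_length(5, 15, 10, 30)       # x straddles w.start: x.end - w.start + 1
right = overlap_length(20, 40, 10, 30)     # x straddles w.end: w.end - x.start + 1
disjoint = overlap_length(1, 5, 10, 20)    # no shared positions

print(inside, left, right, disjoint)
```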

We see thousands of queries written by non-programmers

Howe, et al., CISE 2013

Steven Roberts

SQL as a lab notebook: http://bit.ly/16Xj2JP

Popular service for Bioinformatics Workflows


Halperin, Howe, et al. SSDBM 2013

Two Problems with SQLShare

• No help for truly big datasets
• No help for “algorithmics”

Limitations of SQLShare


Relational Algorithmics-as-a-Service

Version 2

http://myria.cs.washington.edu

Myria is…

• MyriaQ: A compiler framework for multiple iterative RA-based languages and multiple big data back ends

• MyriaX: A parallel, shared-nothing, iterative execution engine

• MyriaWeb: A RESTful Analytics-as-a-Service platform and web-based interface

Myria is …

Magda Balazinska, Bill Howe, and Dan Suciu

Dan Halperin (technical lead), Victor Almeida, Andrew Whitaker

PhD Students: Shumo Chu, Eric Gribkoff, Jeremy Hyrkas, Paris Koutris, Ryan Maas, Dominik Moritz, Laurel Orr, Jennifer Ortiz, Emad Soroush, Jingjing Wang, ShengLiang Xu

Undergraduate Students: Lee Lee Choo, Vaspol Ruamviboonsuk

Myria Team

Myria Architecture

[Figure: Myria architecture diagram. Components shown:]
• Front end: Web UI, REST server; input languages Datalog, SQL, MyriaL
• MyriaQ (Python): Language Parser, Logical Optimizer for RA+While, Myria Compiler, C Compiler, Grappa
• MyriaX (Java): Coordinator with Catalog; workers, each with a Worker Catalog and an RDBMS reached over jdbc; HDFS
• Connections labeled: REST, json query plan, netty protocols

[Figure: MyriaQ as middleware. Components shown:]
• Languages: SQL, Datalog, MyriaL, ??
• Core: Relational Algebra + Iteration
• One compiler per back end: SciDB, Spark, Serial C++, Grappa, MyriaX, SQL
• Applications: Oceanography, Astronomy, Biology, Medical Informatics

EX: SeaFlow

Francois Ribalet, Jarred Swalwell, Ginger Armbrust

[Figure: SeaFlow instrument schematic; labels: laser, microscope objective, pinhole lens, nozzle, d1, d2, FSC (forward scatter), orange fluo, red fluo]

Ex: SeaFlow

[Figure: scatter plots with axes d1 / FSC against d2 / FSC and RED fluorescence against FSC; labeled populations: Prochlorococcus, Picoplankton, Ultraplankton, Nanoplankton, IS]

Continuous observations of various phytoplankton groups from 1-20 µm in size

• Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
• Based on ORANGE fluo: Synechococcus, Cryptophytes
• Based on FSC: Coccolithophores

SeaFlow in Myria

• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”

Dan Halperin Sophie Clayton


Why a big data middleware?

1) BD experiments are ridiculously labor-intensive
– N systems x M real-world applications
– Big clusters and big datasets
2) No “one size fits all” solution
– Realistic environments will use more than one system
3) A return to distributed, federated databases
– Erase the distinction between ETL and Analytics

[Figure: timeline, 2008-2014, of big data systems alongside Hadoop, with each system marked as “~100x faster”, “faster”, or “comparable or inconclusive”; the timeline is divided into “The good old days” and “The age of uncertainty”]

Systems shown: Pregel (Malewicz), HaLoop (Bu), Spark (Zaharia), Vertica (Pavlo), SystemML (Ghoting), Hyracks (Borkar), GraphLab (Low), Cumulon (Huang), Giraph (Tian), Dremel (Melnik), SimSQL (Cai), epiC (Jiang), Impala (Cloudera), Shark (Xin), HIVE (Thusoo)


What can we conclude?

Hadoop was probably just pretty bad

The rest of the story not so clear


Relational Algebra is the Calculus of Big Data

• Hadoopspawn: Pig, HIVE, blah
• Hadoop contemporaries: Cascalog, Flume, blah
• Post-Hadoop: Spark/Shark, Dremel, blah
• etc.

[Figure: Google Big Data Systems, timeline 2004-2012]
• Systems shown: MapReduce, BigTable, Pregel, Dremel, Tenzing, Megastore, Spanner, BigQuery, plus the non-Google systems Hadoop and HBase
• Edge legend: non-Google open source implementation; direct influence / shared features; compatible implementation of; SQL-like interface

Relational Algebra is the Calculus of Small Data

• Galaxy – “bioinformatics workflows”: “…Operate on Genomics Intervals -> Join”
• Pandas (Python): merge(left, right, on='key')
• dplyr (R): filter(x), select(x), arrange(x), group_by(x), inner_join(x, y), left_join(x, y), ….
• Manimal, Pyxis/StatusQuo, others
– Extract RA operators implemented manually in Java code

Key Idea: Algebraic Optimization

N = ((z*2)+((z*3)+0))/1

Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2: N = (2+3)*z

two operations instead of five, no division operator

Same idea works with the Relational Algebra!
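To make the rewriting concrete, here is a minimal sketch of a rule-based simplifier over expression trees. The tuple encoding and the function names are assumptions made for this illustration; a real query optimizer applies the same kind of laws to relational operators rather than arithmetic.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(e):
    """Rewrite an expression tree bottom-up using the four laws above."""
    if not isinstance(e, tuple):
        return e  # a variable name or a constant
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    if op == "+" and b == 0:   # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:   # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3 factors out the shared operand; law 4 moves it to the right
        return ("*", simplify(("+", a[2], b[2])), a[1])
    return (op, a, b)


def count_ops(e):
    """Count operator nodes in the tree."""
    return 0 if not isinstance(e, tuple) else 1 + count_ops(e[1]) + count_ops(e[2])


def evaluate(e, env):
    """Evaluate a tree, looking variable names up in env."""
    if isinstance(e, tuple):
        return OPS[e[0]](evaluate(e[1], env), evaluate(e[2], env))
    return env[e] if isinstance(e, str) else e


# N = ((z*2)+((z*3)+0))/1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(original)
print(optimized, count_ops(original), count_ops(optimized))
```

The rewritten tree is cheaper to evaluate yet returns the same value for any `z`, which is the property an optimizer relies on.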

A closer look at an example

ROI(id, start, stop) is a set of “regions of interest”

Read(id, start, stop) is a set of “reads” from sequencer

Task: For each region of interest, count the number of reads it contains

[Figure: two intervals, each marked with its start and stop positions]

SELECT roi.id, count(rd.id)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start
  AND rd.[end] <= roi.[end]
GROUP BY roi.id

As a query

[Figure labels: “region of interest”, sequence “read”]

SELECT roi.id, count(rd.start)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start
  AND rd.[end] <= roi.[end]
GROUP BY roi.id

Why databases get a bad reputation

many minutes

SELECT roi.id, count(rd.start) as cnt
FROM regions_of_interest roi, indexed_reads rd
WHERE roi.start <= rd.start AND rd.start <= roi.[end]
  AND roi.start <= rd.[end] AND rd.[end] <= roi.[end]
GROUP BY roi.id

3 seconds!

[Figure labels: roi, read]

two-sided index scan

one-sided index scan, plus filter
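The faster formulation works by adding predicates that the original condition already implies (given that every read has start <= end), so an engine can bound the index scan on both sides. The sketch below uses Python's built-in sqlite3 module only to check that the two formulations return the same counts; it does not reproduce the timings, and the column names `start_pos` and `end_pos` and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions_of_interest (id INTEGER, start_pos INTEGER, end_pos INTEGER);
    CREATE TABLE reads (id INTEGER, start_pos INTEGER, end_pos INTEGER);
    CREATE INDEX reads_start ON reads (start_pos);
    INSERT INTO regions_of_interest VALUES (1, 0, 100), (2, 50, 150);
    INSERT INTO reads VALUES (10, 5, 40), (11, 60, 90), (12, 95, 120), (13, 140, 150);
""")

# Original containment test: one bound on each column.
one_sided = conn.execute("""
    SELECT roi.id, COUNT(rd.id)
    FROM regions_of_interest roi, reads rd
    WHERE roi.start_pos <= rd.start_pos AND rd.end_pos <= roi.end_pos
    GROUP BY roi.id ORDER BY roi.id
""").fetchall()

# Same test with the implied bounds written out, so each column is
# constrained from both sides.
two_sided = conn.execute("""
    SELECT roi.id, COUNT(rd.id)
    FROM regions_of_interest roi, reads rd
    WHERE roi.start_pos <= rd.start_pos AND rd.start_pos <= roi.end_pos
      AND roi.start_pos <= rd.end_pos   AND rd.end_pos   <= roi.end_pos
    GROUP BY roi.id ORDER BY roi.id
""").fetchall()

print(one_sided, two_sided)
```

Whether a given engine actually chooses the two-sided scan depends on its planner and statistics, which is the gap in the declarative promise that the talk goes on to discuss.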

The broken promise of declarative query…

Lowering barrier to entry

Giving users insight

Shumo Chu Dominik Moritz

Diagnosing problems

[Figure: matrix with axes “Source node” and “Destination node”]

Shumo Chu, Dominik Moritz

A = LOAD('points.txt', id:int, x:float, y:float)
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]

DO
  I = CROSS(Kmeans, Centroids);
  J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, $distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
  K = [FROM J EMIT id, distance=$min(distance)];
  L = JOIN(J, id, K, id)
  M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
  Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
  Delta = DIFF(Kmeans', Kmeans)
  Kmeans = Kmeans'
  Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}

K-Means in the language MyriaL
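For readers unfamiliar with MyriaL, the same loop structure can be sketched in plain Python. This is an illustrative single-machine analogue, not Myria code: the function name, the `max_iter` guard, and the use of squared distance are assumptions made here.

```python
def kmeans(points, k=4, max_iter=100):
    """points is a list of (id, x, y) tuples; returns (assignment, centroids)."""
    # Seed the centroids from the first k points, like LIMIT(A, 4).
    centroids = {cid: (x, y) for cid, (_, x, y) in enumerate(points[:k])}
    # Every point starts in cluster 0, like cluster_id=0 in the MyriaL program.
    assignment = {pid: 0 for pid, _, _ in points}
    for _ in range(max_iter):
        new_assignment = {}
        for pid, x, y in points:
            # Nearest centroid; ties go to the smallest cluster id.
            new_assignment[pid] = min(
                centroids,
                key=lambda c: ((x - centroids[c][0]) ** 2
                               + (y - centroids[c][1]) ** 2, c),
            )
        changed = new_assignment != assignment  # plays the role of Delta
        assignment = new_assignment
        for c in centroids:
            members = [(x, y) for pid, x, y in points if assignment[pid] == c]
            if members:  # keep the old centroid if a cluster emptied out
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
        if not changed:  # stop once an iteration changes nothing
            break
    return assignment, centroids


pts = [(0, 0.0, 0.0), (1, 10.0, 10.0), (2, 0.0, 1.0),
       (3, 1.0, 0.0), (4, 10.0, 11.0), (5, 11.0, 10.0)]
assignment, centroids = kmeans(pts, k=2)
print(assignment)
```

The cross product, the minimum-distance aggregate, and the per-cluster average each correspond to one relational operator, which is why the algorithm fits in a relational algebra plus a loop.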


CurGood = SCAN(public:adhoc:sc_points);

DO
  mean = [FROM CurGood EMIT val=AVG(v)];
  std = [FROM CurGood EMIT val=STDEV(v)];
  NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
  CurGood = CurGood - NewBad;
  continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;

DUMP(CurGood);

Sigma-clipping, V0


CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)]
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
  sum = sum - [FROM NewBad EMIT SUM(val)];
  sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
  cnt = cnt - [FROM NewBad EMIT CNT(*)];
  mean = sum / cnt
  std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
  NewBad = FILTER([ABS(val-mean)>std], CurGood)
  CurGood = CurGood - NewBad
WHILE NewBad != {}

Sigma-clipping, V1: Incremental


Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];

DO
  new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
  aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum, sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
  stats = [FROM aggs EMIT mean=_sum/cnt, std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
  newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
  tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v AND v >= bounds.lower EMIT v=Points.v];
  tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v AND v <= bounds.upper EMIT v=Points.v];
  newBad = UNIONALL(tooLow, tooHigh);
  bounds = newBounds;
  continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;

output = [FROM Points, bounds WHERE Points.v > bounds.lower AND Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);

Sigma-clipping, V2
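The progression from V0 to V2 replaces full re-aggregation with running sums that are adjusted by the rejected points only. A minimal Python sketch of that idea follows, comparing a naive version against an incremental one; the function names, the threshold parameter `k`, and the sample data are assumptions made for illustration.

```python
import math
import statistics


def sigma_clip_naive(values, k=2.0):
    """Recompute mean and standard deviation from scratch on every pass."""
    good = list(values)
    while len(good) > 1:
        mean, std = statistics.mean(good), statistics.stdev(good)
        kept = [v for v in good if abs(v - mean) <= k * std]
        if len(kept) == len(good):
            break
        good = kept
    return good


def sigma_clip_incremental(values, k=2.0):
    """Maintain sum, sum of squares, and count; subtract rejected points."""
    good = list(values)
    total, total_sq, n = sum(good), sum(v * v for v in good), len(good)
    while n > 1:
        mean = total / n
        # Same formula as the std expression in V1 and V2.
        var = (n * total_sq - total * total) / (n * (n - 1))
        std = math.sqrt(max(var, 0.0))  # guard against tiny negative rounding
        bad = [v for v in good if abs(v - mean) > k * std]
        if not bad:
            break
        # Adjust the aggregates by the rejected points only.
        total -= sum(bad)
        total_sq -= sum(v * v for v in bad)
        n -= len(bad)
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


data = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 100.0]
print(sigma_clip_naive(data), sigma_clip_incremental(data))
```

In this single-machine sketch the filter still scans the surviving points on each pass; V2 goes further by tracking bounds so that only the newly excluded band is scanned.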

• Hypothesis: Loops + RA covers everything anyone wants to do
– and it scales, it’s optimizable, and it’s accessible
• We can smooth the ROI curve for novices
– Start with simple queries…
– …end up working on advanced parallel algorithms
• “White Box Analytics”
– Compose queries, inspect plans, monitoring, debugging, “UDRs” (user-defined optimization rules)
• Multiple languages, multiple backends, one data/query model
– Ask me about graph data
– Ask me about array data (or, rather, mesh data)

“Relational Algorithmics”

Takeaways

• We hope to see “Data Science Environments” at universities worldwide
– We try to make our programs and activities reusable
• Software-as-a-service to reach the “long tail” of science
• “Relational Algorithmics”
– The relational algebra is the calculus of big data
– “It’s not just for databases anymore”
– Learn it, use it, teach it
– Myria is a platform for “relational algorithmics”

http://escience.washington.edu/

Maslow’s Needs Hierarchy

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow, 1943

A “Needs Hierarchy” of Science Data Management

• storage
• sharing
• query
• integration
• analytics


A “Needs Hierarchy” of Science Data Management

• storage
• sharing
• integration
• query
• analytics
