BigQuery JavaScript User-Defined Functions by Thomas Park and Felipe Hoffa at Big Data Spain 2014

Post on 12-Jul-2015

Hands-on with BigQuery JavaScript User-Defined Functions

Thomas Park, Software Engineer - Google

Felipe Hoffa (@felipehoffa), Developer Advocate - Google

Agenda

I. Background
II. Example: Cross-row intervals
III. Under the hood
IV. Example: Codebreaking

I. Background

What is BigQuery?

BigQuery: Big Data Analytics in the Cloud

Unrivaled Performance and Scale

● Scan multiple TBs in seconds
● Interactive query performance
● No limits on amount of data

Ease of Use and Adoption

● No administration / provisioning
● Convenience of SQL
● Open interfaces (REST, WebUI, ODBC)
● First 1 TB of data processed per month is free

Advanced “Big Data” Storage

● Familiar database structure
● Easy data management and ACLs
● Fast, atomic imports

Google confidential │ Do not distribute

How many pageviews does Wikipedia have in a month?

SELECT COUNT(*)
FROM [fh-bigquery:wikipedia.wikipedia_views_201308]

https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18

$500 in Cloud Platform credit to launch your idea!

Build. Store. Analyze. On the same infrastructure that powers Google.

Starter Pack Offer Description

1. Go to cloud.google.com/developers/starterpack
2. Click ‘Apply Now’ and complete the application with promo code: bigdata-spain
3. Start building

II. Example: Cross-row intervals

Images by Connie Zhou

Scenario: Door access records from a Very Well-Secured Lab where users must badge in to enter or leave

Image by Tod Kurt

Example: Time-series analysis from discrete user action data

                                        user_id  timestamp
Beep!! 9h: arrive @ lab                 thomas   2014.07.15 09:00
Beep!! 10h: leave to pick up prototype  thomas   2014.07.15 10:00
Beep!! 10h15: return with prototype     thomas   2014.07.15 10:15
Beep!! 12h: out for lunch               thomas   2014.07.15 12:00

How can we find out how much time each user spent in the lab?

...where each scan of the user’s access card is represented as a discrete row?

rownum  user_id  timestamp
1       thomas   2014.07.15 09:00
2       thomas   2014.07.15 10:00
3       thomas   2014.07.15 10:15
4       thomas   2014.07.15 12:00

Rows 1 to 2: 60 minutes
Rows 3 to 4: 105 minutes

Analyzing data in this format with SQL alone is horrid and painful

A BigQuery + JS friendly format: one row per user, holding all of that user's timestamps

user_id  timestamps
thomas   [ 09:00, 10:00, 10:15, 12:00, ... ]
hoffa    [ 08:10, 11:30, 12:00, 12:15, ... ]

Producing this format is trivial in BigQuery...

SELECT user_id, NEST(timestamp) AS timestamps
FROM T
GROUP BY user_id;
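To make the effect of NEST with GROUP BY concrete, here is a minimal local sketch in plain JavaScript. It is not BigQuery code: nestByUser is a hypothetical helper, the sample rows are made up, and timestamps are assumed to be minutes since midnight.

```javascript
// Hypothetical helper: mimics
//   SELECT user_id, NEST(timestamp) AS timestamps FROM T GROUP BY user_id
// on an in-memory array of {user_id, timestamp} rows.
function nestByUser(rows) {
  const byUser = new Map();
  for (const row of rows) {
    if (!byUser.has(row.user_id)) {
      byUser.set(row.user_id, []);
    }
    byUser.get(row.user_id).push(row.timestamp);
  }
  // One output row per user, with all of that user's timestamps in an array.
  return Array.from(byUser, ([user_id, timestamps]) => ({ user_id, timestamps }));
}

// Made-up scans; timestamps are minutes since midnight (an assumption).
const scans = [
  { user_id: "thomas", timestamp: 540 },
  { user_id: "hoffa", timestamp: 490 },
  { user_id: "thomas", timestamp: 600 },
  { user_id: "hoffa", timestamp: 690 },
];

const nested = nestByUser(scans);
console.log(JSON.stringify(nested));
```

Each user ends up with exactly one row, which is the shape a per-row JavaScript function needs in order to see all of a user's scans at once.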

JS: Total time for each user

// This function is called once for each user
// and receives an array of timestamps.
function(record, emit) {
  var total_time = 0;
  // The order of values built by NEST is not guaranteed!
  // Sort numerically to guarantee ascending timestamps.
  // (The comparator must return a number, not a boolean.)
  var ts = record.timestamps.sort(
      function (a, b) { return a - b; });
  // Loop over (enter, leave) timestamp pairs and add up the intervals.
  for (var i = 0; i < ts.length - 1; i += 2) {
    total_time += (ts[i+1] - ts[i]);
  }
  // Emit the total time for this user, using the output schema's field names.
  emit({user_id: record.user_id, tot_time: total_time});
}
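The per-user function can be tried outside BigQuery with a small harness. The sketch below is an assumption-laden stand-in: runUdf is a hypothetical helper that calls a function once per row and collects what it emits, timestamps are taken to be minutes since midnight, and the field names user_id, timestamps and tot_time follow the js() call shown on the slides.

```javascript
// A variant of the per-user function: sum the intervals between
// consecutive (enter, leave) timestamp pairs.
function totalTime(record, emit) {
  let total = 0;
  // Copy before sorting so the input record is left untouched.
  const ts = record.timestamps.slice().sort((a, b) => a - b);
  for (let i = 0; i < ts.length - 1; i += 2) {
    total += ts[i + 1] - ts[i];
  }
  emit({ user_id: record.user_id, tot_time: total });
}

// Hypothetical local stand-in for the query engine: call the UDF once
// per input row and collect every emitted row.
function runUdf(udf, rows) {
  const out = [];
  for (const row of rows) {
    udf(row, (result) => out.push(result));
  }
  return out;
}

// Minutes since midnight, deliberately out of order: 09:00, 10:00, 10:15, 12:00.
const results = runUdf(totalTime, [
  { user_id: "thomas", timestamps: [615, 540, 720, 600] },
]);
console.log(results);
```

With the four scans from the example table, the two intervals are 60 and 105 minutes, so the emitted tot_time is 165.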

SELECT * FROM js(
  // Input table or query.
  [secret-lab:door_scans.201411],
  // Input columns.
  user_id, timestamps,
  // Output schema.
  "[{name: 'user_id', type:'string'},
    {name: 'tot_time', type:'integer'}]",
  // The function.
  "function(r, emit) { ... emit(...); }"
)

Parts of the js() call:

● Input table or subquery
● Input schema (column names only!)
● Output schema (full declaration)
● The JS function
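Because the output schema is a full declaration, the rows a function emits have to line up with it. The following is a hypothetical local check, not part of BigQuery: matchesSchema is an invented helper, the schema is rewritten as a JavaScript array literal, and only the two types used in the example are handled.

```javascript
// Output schema from the js() call, written as a JavaScript array literal.
const outputSchema = [
  { name: "user_id", type: "string" },
  { name: "tot_time", type: "integer" },
];

// Hypothetical local check (not a BigQuery API): does an emitted row have
// exactly the declared fields, with values of the declared types?
function matchesSchema(row, schema) {
  const declared = schema.map((field) => field.name).sort();
  const actual = Object.keys(row).sort();
  if (declared.length !== actual.length ||
      declared.some((name, i) => name !== actual[i])) {
    return false;
  }
  return schema.every((field) => {
    const value = row[field.name];
    if (field.type === "string") return typeof value === "string";
    if (field.type === "integer") return Number.isInteger(value);
    return false; // Types other than the two used here are not handled.
  });
}

const goodRow = matchesSchema({ user_id: "thomas", tot_time: 165 }, outputSchema);
const badRow = matchesSchema({ user: "thomas", total_time: 165 }, outputSchema);
console.log(goodRow, badRow);
```

A row whose field names differ from the declared ones, such as user and total_time, fails the check, which is an easy mistake to make when the function and the schema are written separately.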

III. Under the hood

How BigQuery works

Tree Structured Query Dispatch and Aggregation

Each level gets data from lower levels, filters / joins / transforms, and sends rows up:

Distributed Storage:  SELECT title, requests
Leaf (x4):            WHERE REGEXP_MATCH(title, 'pat.*rn'), SUM(requests) GROUP BY title
Mixer 1 (x2):         SUM(requests) GROUP BY title
Mixer 0:              SUM(requests) GROUP BY title, ORDER BY c DESC, LIMIT 10
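The dispatch tree can be imitated in a few lines of plain JavaScript. This is only an illustrative sketch under stated assumptions: leaf, mixer and root are hypothetical names, the shards are made-up data standing in for distributed storage, and the regular expression is applied as a partial match.

```javascript
// Leaf: scan one shard of storage, filter by title, and pre-aggregate.
function leaf(shard, pattern) {
  const partial = new Map();
  for (const { title, requests } of shard) {
    if (pattern.test(title)) {
      partial.set(title, (partial.get(title) || 0) + requests);
    }
  }
  return partial;
}

// Mixer: merge the partial sums coming up from the level below.
function mixer(partials) {
  const merged = new Map();
  for (const partial of partials) {
    for (const [title, sum] of partial) {
      merged.set(title, (merged.get(title) || 0) + sum);
    }
  }
  return merged;
}

// Root mixer: merge, then order by the sum descending and apply the limit.
function root(partials, limit) {
  return Array.from(mixer(partials))
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([title, c]) => ({ title, c }));
}

// Four made-up shards standing in for distributed storage.
const shards = [
  [{ title: "pattern", requests: 5 }, { title: "other", requests: 9 }],
  [{ title: "pattern", requests: 7 }],
  [{ title: "patron saint return", requests: 3 }],
  [{ title: "pattern", requests: 1 }, { title: "patron saint return", requests: 4 }],
];

const leaves = shards.map((shard) => leaf(shard, /pat.*rn/));
const mixers1 = [mixer(leaves.slice(0, 2)), mixer(leaves.slice(2))];
const top = root(mixers1, 10);
console.log(top);
```

The point of the shape is that SUM is safe to apply at every level: each leaf only ever sees its own shard, and the mixers combine partial sums without rescanning any data.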

Data for each row is calculated and streamed through a “Row Iterator”

[Diagram: Subquery0 feeds Row Iterator 0 and Subquery1 feeds Row Iterator 1; the JOIN of the two feeds Row Iterator 2]

Can insert JavaScript functions wherever we have a Row Iterator

[Diagram: the same Subquery0 / Subquery1 / JOIN tree with Row Iterators 0 to 2, now with UDF0 and UDF1 attached to row iterators]
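One way to picture a function sitting on a row iterator is with JavaScript generators. This is a conceptual sketch only, not how BigQuery is implemented: rowIterator, withUdf and udf0 are hypothetical names and the rows are made up.

```javascript
// A row iterator: yields rows one at a time from some source.
function* rowIterator(rows) {
  for (const row of rows) {
    yield row;
  }
}

// Wrap any row iterator with a UDF. The UDF may emit zero, one, or many
// output rows per input row, and the result is again a row iterator.
function* withUdf(input, udf) {
  for (const row of input) {
    const emitted = [];
    udf(row, (out) => emitted.push(out));
    yield* emitted;
  }
}

// Hypothetical UDF0: upper-case a title and drop rows with an empty title.
function udf0(row, emit) {
  if (row.title) {
    emit({ title: row.title.toUpperCase() });
  }
}

const subquery0 = rowIterator([{ title: "a" }, { title: "" }, { title: "b" }]);
const piped = Array.from(withUdf(subquery0, udf0));
console.log(piped);
```

Because the wrapped result is itself an iterator of rows, it can feed a JOIN or another function in the same way the original subquery could.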

Join order item info with web hits info

[Diagram: SELECT item FROM orders and SELECT query string FROM hits feed Row Iterator 0 and Row Iterator 1; the JOIN produces Row Iterator 2; UDF0 and UDF1 are attached to row iterators]

Example hit: http://www.store.com/?q=7%2e1+Speakers

Extract and decode query term => “7.1 Speakers”
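The extract-and-decode step is the kind of work that is awkward in SQL and short in JavaScript. A minimal sketch, assuming the hit row carries the full URL in a column named url (a hypothetical name) and that the function has the same (row, emit) shape as the other examples:

```javascript
// Hypothetical UDF: extract the "q" parameter from a hit URL and decode it.
function extractQueryTerm(row, emit) {
  const match = /[?&]q=([^&#]*)/.exec(row.url);
  if (!match) {
    return; // No query term: emit nothing for this row.
  }
  // In query strings "+" encodes a space; decodeURIComponent leaves "+"
  // alone, so replace it first, then decode percent escapes such as %2e.
  const term = decodeURIComponent(match[1].replace(/\+/g, " "));
  emit({ term: term });
}

const terms = [];
extractQueryTerm(
  { url: "http://www.store.com/?q=7%2e1+Speakers" },
  (out) => terms.push(out.term)
);
console.log(terms[0]); // "7.1 Speakers"
```

Once the term is decoded, it can be compared against the item names from the orders side of the join.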

UDF execution

[Diagram: the query tree (Subquery0, Subquery1, JOIN) with UDF0, UDF0 and UDF1 labeled as User Code, separated by a process boundary]

IV. Example: Codebreaking

Demos:

Ñ

Image: El Hormiguero (Flickr CC)

http://jsfiddle.net/fhoffa/y4pt9s23/

Image: TheVanCats (Flickr CC)

Questions?

News: reddit.com/r/bigquery
Ask: stackoverflow.com
Share: bigqueri.es

Thomas Park
Felipe Hoffa @felipehoffa +FelipeHoffa

Rate us?

http://goo.gl/k3bzdw

Backup slides / screenshots

17th ~ 18th Nov 2014, Madrid (Spain)
