BigQuery JavaScript User-Defined Functions by Thomas Park and Felipe Hoffa at Big Data Spain 2014
Hands-on with BigQuery JavaScript User-Defined Functions
Thomas Park, Software Engineer - Google
Felipe Hoffa (@felipehoffa), Developer Advocate - Google
Agenda
I. Background
II. Example: Cross-row intervals
III. Under the hood
IV. Example: Codebreaking
I. Background
What is BigQuery?
BigQuery: Big Data Analytics in the Cloud
Unrivaled Performance and Scale
● Scan multiple TBs in seconds
● Interactive query performance
● No limits on amount of data
Ease of Use and Adoption
● No administration / provisioning
● Convenience of SQL
● Open interfaces (REST, WebUI, ODBC)
● First 1 TB of data processed per month is free
Advanced “Big Data” Storage
● Familiar database structure
● Easy data management and ACLs
● Fast, atomic imports
Google confidential │ Do not distribute
How many pageviews does Wikipedia have in a month?
SELECT COUNT(*)
FROM [fh-bigquery:wikipedia.wikipedia_views_201308]
https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
$500 in Cloud Platform credit to launch your idea!
Build. Store. Analyze. On the same infrastructure that powers Google.
Starter Pack
Offer Description
1. Go to cloud.google.com/developers/starterpack
2. Click ‘Apply Now’ and complete the application with promo code: bigdata-spain
3. Start building
II. Example: Cross-row intervals
Scenario: Door access records from a Very Well-Secured Lab where users must badge in to enter or leave
Example: Time-series analysis from discrete user action data
Images by Connie Zhou; image by Tod Kurt
                                         user_id  timestamp
Beep!! 9h: arrive @ lab                  thomas   2014.07.15 09:00
Beep!! 10h: leave to pick up prototype   thomas   2014.07.15 10:00
Beep!! 10h15: return with prototype      thomas   2014.07.15 10:15
Beep!! 12h: out for lunch                thomas   2014.07.15 12:00
How can we find out how much time each user spent in the lab, when each scan of the user’s access card is represented as a discrete row?
rownum  user_id  timestamp
1       thomas   2014.07.15 09:00
2       thomas   2014.07.15 10:00
3       thomas   2014.07.15 10:15
4       thomas   2014.07.15 12:00
Rows 1 to 2: 60 minutes
Rows 3 to 4: 105 minutes
Analyzing data in this format with SQL alone is horrid and painful
A BigQuery + JS friendly format: one row per user, holding all of that user's timestamps
user_id  timestamps
thomas   [ 09:00, 10:00, 10:15, 12:00, ... ]
hoffa    [ 08:10, 11:30, 12:00, 12:15, ... ]
Producing this format is trivial in BigQuery...
SELECT user_id, NEST(timestamp) AS timestamps
FROM T
GROUP BY user_id;
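To make the reshaping concrete, here is a minimal JavaScript sketch of what the NEST ... GROUP BY query does to the rows. It runs outside BigQuery and is only an illustration: the `nestByUser` helper and the sample rows are hypothetical, and timestamps are written as minutes since midnight for simplicity.

```javascript
// Hypothetical helper: mimics SELECT user_id, NEST(timestamp) ... GROUP BY user_id.
function nestByUser(rows) {
  var byUser = Object.create(null); // user_id -> array of timestamps
  var order = [];                   // remembers first-seen order of users
  rows.forEach(function (row) {
    if (!(row.user_id in byUser)) {
      byUser[row.user_id] = [];
      order.push(row.user_id);
    }
    byUser[row.user_id].push(row.timestamp);
  });
  return order.map(function (userId) {
    return { user_id: userId, timestamps: byUser[userId] };
  });
}

// Sample input: one row per badge scan.
// Assumption: timestamps are minutes since midnight (540 = 09:00).
var scans = [
  { user_id: 'thomas', timestamp: 540 },
  { user_id: 'hoffa', timestamp: 490 },
  { user_id: 'thomas', timestamp: 600 },
  { user_id: 'hoffa', timestamp: 690 }
];

var nested = nestByUser(scans);
console.log(JSON.stringify(nested));
```

Each output object corresponds to one row of the nested table, which is the shape a per-user JavaScript function can consume in a single call.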
JS: Total time for each user
// This function is called once for each user
// and receives an array of timestamps.
function(record, emit) {
  var total_time = 0;
  // The order of values built by NEST is not guaranteed!
  // Sort with a numeric comparator to guarantee ascending timestamps.
  var ts = record.timestamps.sort(function (a, b) { return a - b; });
  // Loop over timestamp pairs (enter, leave) and add up each interval.
  for (var i = 0; i < ts.length - 1; i += 2) {
    total_time += (ts[i+1] - ts[i]);
  }
  // Emit total time for this user, using the output schema's field names.
  emit({user_id: record.user_id, tot_time: total_time});
}
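One way to try a function of this shape before running it in BigQuery is to call it locally with a hand-built record and a stub `emit`. The sketch below does that; `totalTimeUdf` and `runLocally` are hypothetical names, the field names follow the output schema used in the js() call, and timestamps are assumed to be numbers (minutes since midnight). A numeric comparator such as `a - b` is used because `Array.prototype.sort` expects a negative, zero, or positive number rather than a boolean.

```javascript
// Per-user function: receives one record with an array of timestamps.
// Assumption: timestamps are numbers (minutes since midnight here).
function totalTimeUdf(record, emit) {
  // Copy before sorting so the caller's array is left untouched.
  var ts = record.timestamps.slice().sort(function (a, b) { return a - b; });
  var total = 0;
  // Pairs are (enter, leave); a trailing unpaired scan is ignored.
  for (var i = 0; i < ts.length - 1; i += 2) {
    total += ts[i + 1] - ts[i];
  }
  emit({ user_id: record.user_id, tot_time: total });
}

// Hypothetical local harness: collects whatever the function emits.
function runLocally(udf, records) {
  var out = [];
  records.forEach(function (record) {
    udf(record, function (row) { out.push(row); });
  });
  return out;
}

var result = runLocally(totalTimeUdf, [
  // 09:00, 10:00, 10:15, 12:00, deliberately out of order.
  { user_id: 'thomas', timestamps: [720, 540, 615, 600] }
]);
console.log(result); // [ { user_id: 'thomas', tot_time: 165 } ]
```

The 165 minutes are the two intervals from the badge table, 60 and 105 minutes, added together.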
SELECT * FROM js(
  // Input table or query.
  [secret-lab:door_scans.201411],
  // Input columns.
  user_id, timestamps,
  // Output schema.
  "[{name: 'user_id', type:'string'}, {name: 'tot_time', type:'integer'}]",
  // The function.
  "function(r, emit) { ... emit(...); }")
The arguments to js(), in order:
● Input table or subquery
● Input schema (column names only!)
● Output schema (full declaration)
● The JS function
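The roles of the four js() arguments can be mimicked locally with a few lines of JavaScript. This is a rough sketch, not BigQuery's implementation: `simulateJs` is a hypothetical helper, and dropping fields that are not declared in the output schema is an assumption made for the illustration only.

```javascript
// Hypothetical local stand-in for js(input, columns..., outputSchema, fn).
function simulateJs(inputRows, inputColumns, outputSchema, fn) {
  var output = [];
  inputRows.forEach(function (row) {
    // Input schema: only the named columns are handed to the function.
    var record = {};
    inputColumns.forEach(function (name) { record[name] = row[name]; });

    fn(record, function emit(result) {
      // Output schema: keep declared fields only (assumption of this sketch).
      var shaped = {};
      outputSchema.forEach(function (field) {
        shaped[field.name] = result[field.name];
      });
      output.push(shaped);
    });
  });
  return output;
}

var rows = [{ user_id: 'thomas', timestamps: [540, 600], badge_color: 'red' }];
var schema = [
  { name: 'user_id', type: 'string' },
  { name: 'tot_time', type: 'integer' }
];

var out = simulateJs(rows, ['user_id', 'timestamps'], schema, function (r, emit) {
  emit({
    user_id: r.user_id,
    tot_time: r.timestamps[1] - r.timestamps[0],
    extra: r.badge_color // not an input column, so undefined; not in the schema either
  });
});
console.log(out); // [ { user_id: 'thomas', tot_time: 60 } ]
```

The input side needs only column names because the types come from the table; the output side needs a full declaration because the function can emit anything.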
III. Under the hood
How BigQuery works
Tree Structured Query Dispatch and Aggregation
Each level gets data from lower levels, filters / joins / transforms it, and sends rows up:
Mixer 0 (root): SUM(requests) GROUP BY title, ORDER BY c DESC, LIMIT 10
Mixer 1 (intermediate): SUM(requests) GROUP BY title
Leaf (many, in parallel): SUM(requests) GROUP BY title, WHERE REGEXP_MATCH(title, 'pat.*rn')
Distributed Storage: SELECT title, requests
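The division of work between leaves and mixers can be sketched in plain JavaScript. This is a toy model under stated assumptions, not how BigQuery is implemented: the `leaf`, `mixer`, and `root` functions and the sample shards are hypothetical, and each shard is just an in-memory array.

```javascript
// Leaf: scan one shard, apply the WHERE filter, compute partial SUM(requests) per title.
function leaf(shardRows, pattern) {
  var partial = Object.create(null);
  shardRows.forEach(function (row) {
    if (pattern.test(row.title)) {
      partial[row.title] = (partial[row.title] || 0) + row.requests;
    }
  });
  return partial;
}

// Mixer: merge the partial sums coming up from the level below.
function mixer(partials) {
  var merged = Object.create(null);
  partials.forEach(function (partial) {
    Object.keys(partial).forEach(function (title) {
      merged[title] = (merged[title] || 0) + partial[title];
    });
  });
  return merged;
}

// Root mixer: merge once more, then ORDER BY c DESC and LIMIT.
function root(partials, limit) {
  var merged = mixer(partials);
  return Object.keys(merged)
    .map(function (title) { return { title: title, c: merged[title] }; })
    .sort(function (a, b) { return b.c - a.c; })
    .slice(0, limit);
}

var pattern = /pat.*rn/;
var shardA = [{ title: 'pattern', requests: 3 }, { title: 'patio', requests: 10 }];
var shardB = [{ title: 'pattern', requests: 4 }, { title: 'paternity', requests: 2 }];
var shardC = [{ title: 'paternity', requests: 9 }];
var shardD = [{ title: 'lantern', requests: 5 }];

var mixer1a = mixer([leaf(shardA, pattern), leaf(shardB, pattern)]);
var mixer1b = mixer([leaf(shardC, pattern), leaf(shardD, pattern)]);
var top = root([mixer1a, mixer1b], 10);
console.log(top); // [ { title: 'paternity', c: 11 }, { title: 'pattern', c: 7 } ]
```

Because SUM can be merged from partial results, every level sends up far fewer rows than it reads, and only the root needs to sort and apply the limit.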
Data for each row is calculated and streamed through a “Row Iterator”
[Diagram: Subquery0 -> Row Iterator 0 and Subquery1 -> Row Iterator 1 feed a JOIN, whose output is Row Iterator 2]
Can insert JavaScript functions wherever we have a Row Iterator
[Diagram: the same plan, with UDF0 and UDF1 inserted at row iterators]
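The idea of attaching a function to any row iterator can be modeled with JavaScript generators. The sketch below is an analogy only, with hypothetical names (`fromArray`, `applyUdf`, `join`) and made-up rows; it assumes a runtime with generator support, such as a current Node.js.

```javascript
// A row iterator: anything that yields rows one at a time.
function* fromArray(rows) {
  for (const row of rows) yield row;
}

// Wrap any row iterator with a UDF. The result is itself a row iterator,
// and the UDF may emit zero, one, or many rows per input row.
function* applyUdf(source, udf) {
  for (const row of source) {
    const emitted = [];
    udf(row, function emit(out) { emitted.push(out); });
    yield* emitted;
  }
}

// A simple nested-loop inner join on one key, also exposed as a row iterator.
function* join(left, right, key) {
  const rightRows = Array.from(right);
  for (const l of left) {
    for (const r of rightRows) {
      if (l[key] === r[key]) yield Object.assign({}, l, r);
    }
  }
}

// UDF0 runs on one input of the join, UDF1 on the join's output.
const udf0 = (row, emit) => emit({ key: row.key.toLowerCase() });
const udf1 = (row, emit) => emit({ key: row.key, n: row.n, big: row.n > 2 });

const subquery0 = fromArray([{ key: 'A' }, { key: 'B' }]);
const subquery1 = fromArray([{ key: 'a', n: 3 }]);

const joined = join(applyUdf(subquery0, udf0), subquery1, 'key');
const rowsOut = Array.from(applyUdf(joined, udf1));
console.log(rowsOut); // [ { key: 'a', n: 3, big: true } ]
```

Since `applyUdf` both consumes and produces a row iterator, it can be placed before a join, after it, or both, without the rest of the plan changing.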
Join order item info with web hits info
[Diagram: "SELECT item FROM orders" -> Row Iterator 0 and "SELECT query string FROM hits" -> Row Iterator 1 feed a JOIN -> Row Iterator 2, with UDF0 and UDF1 inserted at row iterators]
Example hit: http://www.store.com/?q=7%2e1+Speakers
Extract and decode query term => “7.1 Speakers”
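The "extract and decode" step is a natural fit for a JavaScript function. A minimal sketch follows; the function name `extractQueryTerm`, the input field `url`, and the output field `term` are hypothetical, since the slides do not show the actual code. It relies only on the standard `decodeURIComponent` built-in and a regular expression.

```javascript
// Hypothetical UDF: pull the q= parameter out of a URL and decode it.
function extractQueryTerm(row, emit) {
  var match = /[?&]q=([^&#]*)/.exec(row.url);
  if (!match) return; // no q= parameter: emit nothing for this row
  try {
    // '+' stands for a space in query strings; %XX escapes are decoded afterwards.
    var term = decodeURIComponent(match[1].replace(/\+/g, ' '));
    emit({ term: term });
  } catch (e) {
    // Malformed escape sequence (URIError): skip the row rather than fail.
  }
}

// Local try-out with a stub emit.
var terms = [];
extractQueryTerm(
  { url: 'http://www.store.com/?q=7%2e1+Speakers' },
  function (out) { terms.push(out.term); }
);
console.log(terms); // [ '7.1 Speakers' ]
```

Once the term is decoded, it can be compared with the item names on the other side of the join.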
UDF execution
[Diagram: Subquery0, Subquery1, JOIN, UDF0 and UDF1, with a "Process boundary" separating the query plan from "User Code"]
IV. Example: Codebreaking
Demos:
Ñ
Image: El Hormiguero (Flickr CC)
http://jsfiddle.net/fhoffa/y4pt9s23/
Image: TheVanCats (Flickr CC)
Questions?
News: reddit.com/r/bigquery
Ask: stackoverflow.com
Share: bigqueri.es
Thomas Park
Felipe Hoffa @felipehoffa +FelipeHoffa
Rate us?
http://goo.gl/k3bzdw
Backup slides / screenshots
17th ~ 18th Nov 2014, Madrid (Spain)