How to make lean Big Data with six tools from Google Nikolay Novozhilov ([email protected]) June 2014

Upload: nikolay-novozhilov

Post on 12-Jul-2015


Page 1: How to make lean Big Data with six tools from Google

How to make lean Big Data with six tools from Google

Nikolay Novozhilov ([email protected])

June 2014

Page 2: How to make lean Big Data with six tools from Google


One slide about Bubbly

Overview: Leading mobile social media & messaging service across Asia

Offices: Singapore (HQ) + Mumbai, Manila, Jakarta, Tokyo, Hanoi & Bangkok

Investors: Sequoia Capital, SingTel, JAFCO & Comcast

Users: 40M+ users just over three years since launch of Bubbly

Team: Silicon Valley veterans with top engineers from around the world

Board / Advisors:

• Tony Bates, CEO Skype / CSO Microsoft
• Jeff Karras, MD SingTel Innov8
• Dave Williams, former CTO O2, AT&T, and Telefonica
• Jimmy Iovine, Chairman, Interscope Records (Judge on American Idol)
• Gaurav Garg, Sequoia Capital US
• Nikki Han, President, SM Entertainment (Korea)
• Mohit Bhatnagar, Sequoia India

Page 3: How to make lean Big Data with six tools from Google


What do we want from Data Analytics?

Build a dashboard with key metrics

Dive deep into user behavior and A/B testing

Monitor availability and performance

Produce reports for external users

Etc…

Everybody needs the same things

Page 4: How to make lean Big Data with six tools from Google


What did we do?

We tried many things to satisfy our needs.

And found the solution that is optimal for us:

Fast to make and cheap

Flexible and with a lot of functionality

Able to deal with Big Data – we log 60 million events a day

In this presentation we show how it’s done

Page 5: How to make lean Big Data with six tools from Google


Why we didn’t use Mixpanel

Not enough configurability

Once you really care about your data – standard charts are not enough!

Mixpanel export APIs don’t solve all issues

What about extra features – not data mining:

Use results inside your product

Send monitoring alerts the way you want

Give limited access to 3rd parties

Costs a lot! People often send only a sample of their data to Mixpanel.

But what if you need the full data dumped in one place?

There are tons of other cloud solutions that might do some of these tricks, but I don’t trust “small projects”

Page 6: How to make lean Big Data with six tools from Google


Why we didn’t use Hadoop

It is too complicated

Hadoop needs server infrastructure

Even with a hosted Hadoop solution there is a lot to set up

Batch processing – Hadoop is not reactive to your queries. It kills you when you do:

Ad hoc and trial-and-error data analysis

Mistakes in scripts

…I mean – you do it every day!

Hadoop doesn’t give you visualization, monitoring, etc… You still have to build it.

Page 7: How to make lean Big Data with six tools from Google


Why we didn’t use MySQL

We have too much data for MySQL

You still need to host it, build all the functionality, etc…

Already enough reasons!

Page 8: How to make lean Big Data with six tools from Google


What did we do instead?

Google BigQuery: Store all possible events from users

Google Spreadsheets: Query and transform data

Google Charts: Interactive visualization

Google Drive / Google Sites: Host the Dashboard

Google Analytics: Look after Dashboard users

Page 9: How to make lean Big Data with six tools from Google


Why BigQuery?

Solution hosted by Google – ready to use today!

Much cheaper than hosting your own applications on AWS.

Established API – easy to add logging to your code.

Web UI for queries

Our trick to make it “schema-less”:

For every upload, check the current schema in BigQuery

Compare it with the schema of the current upload

If the upload has extra fields, add these fields using the BigQuery API
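The comparison step of this trick can be sketched in a few lines. This is only an illustration, not the code used at Bubbly: the function names are hypothetical, and a schema is assumed to be a list of dicts with "name" and "type" keys. Fetching the current schema and patching the table through the BigQuery API are deliberately left out.

```python
def missing_fields(current_schema, upload_schema):
    """Return fields that the upload has but the table does not (upload order kept)."""
    known = {field["name"] for field in current_schema}
    return [field for field in upload_schema if field["name"] not in known]


def merged_schema(current_schema, upload_schema):
    """Schema to send back to BigQuery: existing fields first, new ones appended."""
    return list(current_schema) + missing_fields(current_schema, upload_schema)


# Hypothetical example: the upload introduces a "country" field.
table = [{"name": "user_id", "type": "STRING"}, {"name": "event", "type": "STRING"}]
upload = [{"name": "event", "type": "STRING"}, {"name": "country", "type": "STRING"}]
print([f["name"] for f in merged_schema(table, upload)])
```

Existing fields are never removed or reordered in this sketch; new fields are only appended, which keeps rows that were uploaded earlier valid.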

Page 10: How to make lean Big Data with six tools from Google


Why Google Spreadsheets?

Nothing is better for analytics than spreadsheets!!!

But why not MS Excel? Several reasons:

Easy to query data from BigQuery (Tutorial from Google)

Cloud hosted solution with cron-like scheduler for scripts

Cross platform solution (Excel VBA scripts fail on Mac)

Security – you can give read-only rights to some users

Already has email functionality for alerts and much more…

Page 11: How to make lean Big Data with six tools from Google


How to use Google Spreadsheets?

Example - link!

The goal was to make it usable for SQL-only people (no coding)

How it works

Our Google Apps Script is triggered periodically

It scans all sheets for the value “SQL” in cell A1

If it finds “SQL”, it takes the SQL query from A2 and pushes it to BigQuery

Results are populated below on the same sheet
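The scan described above can be modelled without any Google services. The sketch below is an assumption-laden illustration in Python, not the actual Apps Script: a spreadsheet is represented as a dict of sheet name to rows, and run_query is a hypothetical callable standing in for the BigQuery call.

```python
def refresh_sheets(sheets, run_query):
    """Re-run the query of every sheet marked with "SQL" in A1.

    sheets: dict of sheet name -> list of rows (each row a list of cell values).
            rows[0][0] is cell A1 and rows[1][0] is cell A2.
    run_query: hypothetical callable taking SQL text and returning result rows.
    Returns the names of the sheets that were refreshed.
    """
    refreshed = []
    for name, rows in list(sheets.items()):
        # Skip sheets that are not marked as query sheets or have no query in A2.
        if len(rows) < 2 or not rows[0] or rows[0][0] != "SQL" or not rows[1]:
            continue
        result_rows = run_query(rows[1][0])
        # Keep A1 and A2, drop stale results, write fresh results below them.
        sheets[name] = rows[:2] + [list(row) for row in result_rows]
        refreshed.append(name)
    return refreshed


# Hypothetical usage with a fake query runner.
book = {"DAU": [["SQL"], ["SELECT day, users FROM t"]], "Notes": [["hello"]]}
refresh_sheets(book, lambda sql: [["day", "users"], ["2014-06-01", 10]])
```

The marker cell keeps ordinary sheets untouched, so people who only know SQL can add a new report by creating a sheet and filling in A1 and A2.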

Page 12: How to make lean Big Data with six tools from Google


Why Google Charts?

Big visualization library, free, done by Google

Integrated with Google Spreadsheets (Google Tutorial)

Interactive controls – business people can explore data too!

Example - link

Page 13: How to make lean Big Data with six tools from Google


Why Google Sites / Google Drive?

Easy to manage access to data for all users (including 3rd parties)

Dropbox gives you only “full-access”

Google Drive has many roles: “owner”, “can edit”, “read only”

After using BigQuery, Spreadsheets and Charts from Google – why not use Google for everything?

Google Drive – host HTML files with Charts. It has a good desktop client, so it is easy to manage charts

Google Sites has WYSIWYG site builder

Page 14: How to make lean Big Data with six tools from Google


Why Google Analytics?

The Dashboard is a product itself. In our case it has about 30 users.

You need data from users to improve your product

You need an analytics tool for it!

I use Google Analytics to watch how users visit my Dashboard on Google Sites

… and punish the ones who are not using it ;)

Page 15: How to make lean Big Data with six tools from Google


What about costs?

In the whole solution only BigQuery costs money!

We never paid more than $200 per month

Real costs come from the time and effort to develop and support. Our solution is smart but lean:

The whole project is done by one analyst/developer

1 month from idea to first live version

Page 16: How to make lean Big Data with six tools from Google


Best practice to optimize costs of BigQuery

BigQuery performs full-table scans

In most queries you care only about recent events

If you store all data in one table, over time you scan a lot of data for nothing, resulting in:

Higher costs

Slower queries

We rotate event tables monthly, creating tables inside one dataset (like events_2014Jan, events_2014Feb,…)

Google Apps Script is ideal for monthly rotation

For queries that require historical data we use meta-SQL that is parsed by the Google Spreadsheets script

• “FROMDATASET dataset” – query all tables in dataset

• “FROMLAST table” – query “table” and “table_2014Jan” (table from last month)
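The rotation naming and the two meta-SQL keywords above can be sketched as a small text expansion. This is a guess at the mechanics, written in Python rather than Apps Script: the function names are hypothetical, tables_in_dataset stands in for a call that lists tables, and the expansion assumes a comma-separated table list in FROM is treated as a union (as in BigQuery's legacy SQL dialect).

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def monthly_table(base, day):
    """Name of the rotated table for the month of `day`, e.g. events_2014Jan."""
    return f"{base}_{day.year}{MONTHS[day.month - 1]}"


def previous_month(day):
    """First day of the month before the one containing `day`."""
    if day.month == 1:
        return date(day.year - 1, 12, 1)
    return date(day.year, day.month - 1, 1)


def expand_meta_sql(sql, today, tables_in_dataset):
    """Rewrite FROMLAST / FROMDATASET into ordinary FROM clauses."""
    def from_last(match):
        table = match.group(1)
        return f"FROM {table}, {monthly_table(table, previous_month(today))}"

    def from_dataset(match):
        dataset = match.group(1)
        names = tables_in_dataset(dataset)  # hypothetical table lister
        return "FROM " + ", ".join(f"{dataset}.{name}" for name in names)

    sql = re.sub(r"\bFROMLAST\s+([\w.]+)", from_last, sql)
    return re.sub(r"\bFROMDATASET\s+(\w+)", from_dataset, sql)


print(expand_meta_sql("SELECT COUNT(*) FROMLAST logs.events",
                      date(2014, 2, 15), lambda dataset: []))
```

Because the expansion happens before the query is sent, analysts keep writing short queries while most of them scan only one or two monthly tables.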

Page 17: How to make lean Big Data with six tools from Google


Example dashboard

Check out this page for an example dashboard with all working source code:

https://sites.google.com/site/leanbigdatawith6tools/