big data on google platform dev fest presentation

Big Data on Google Platform or how we moved our data into cloud Przemysław Pastuszka, Kraków, 08.11.2014

Upload: przemyslaw-pastuszka

Post on 08-Jul-2015


TRANSCRIPT

Page 1: Big data on google platform   dev fest presentation

Big Data on Google Platform

or how we moved our data into cloud

Przemysław Pastuszka, Kraków, 08.11.2014

Page 2: Big data on google platform   dev fest presentation

What do I want to talk about?

● Quick introduction to Ocado
● When a traditional DBMS is not enough
● Journey into cloud(s)
● BigQuery ups and downs

Page 3: Big data on google platform   dev fest presentation

Ocado intro

Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households, shipping over 150,000 orders a week or 1.1M items a day.

Page 4: Big data on google platform   dev fest presentation

Shop

Page 5: Big data on google platform   dev fest presentation

Customer Fulfilment Center

Page 6: Big data on google platform   dev fest presentation

Delivery

Page 7: Big data on google platform   dev fest presentation

The data

● Warehouse & delivery
  ○ operational metrics
  ○ GPS data
  ○ item statuses
● Shop
  ○ clickstream
  ○ page views
  ○ customer events (register, add to basket, checkout, etc.)
● Backend
  ○ server logs
  ○ performance metrics
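To make the "customer events" bullet concrete, here is a minimal sketch of what one such event could look like when serialized as a line of JSON. The slides do not show the actual schema or storage format, so every field name below is an assumption made up for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CustomerEvent:
    # Hypothetical fields; the real Ocado event schema is not shown in the slides.
    event_type: str   # e.g. "register", "add_to_basket", "checkout"
    customer_id: str
    timestamp: str    # ISO 8601 string, assumed to be UTC


def to_json_line(event: CustomerEvent) -> str:
    """Serialize one event as a single JSON line (one record per line)."""
    return json.dumps(asdict(event), sort_keys=True)


# Invented sample event, not real data.
sample = CustomerEvent("add_to_basket", "c-123", "2014-11-08T10:15:00Z")
line = to_json_line(sample)
print(line)
```

One record per line keeps raw event files easy to append to and easy to split across machines.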

Page 8: Big data on google platform   dev fest presentation

Pre-BigData picture

[Diagram: Ocado services, Oracle, Greenplum, JMS]

Page 9: Big data on google platform   dev fest presentation

Why this was not enough

● Oracle
  ○ not scalable
  ○ expensive
  ○ needs DBAs with domain expertise
● Greenplum
  ○ meant to overcome shortcomings of Oracle
  ○ and it did… for some time
  ○ scalability became a problem again

Page 10: Big data on google platform   dev fest presentation

Here’s the idea - let’s do Hadoop

● it’s scalable
  ○ can handle petabytes of data
  ○ you can easily add / remove machines on demand
● huge ecosystem
  ○ Hive, Pig, Giraph, Mahout, ZooKeeper, ...
● widely adopted
  ○ Facebook, Twitter, Spotify, Netflix…
● open source

Page 11: Big data on google platform   dev fest presentation

Bare metal or cloud? Cloud, of course!

● allowed us to spin up the Big Data division with the least effort
● cost effective
  ○ we use machines only when we need them
  ○ amount of resources tailored to our needs
● easy to maintain
  ○ no need for hardware maintenance
  ○ software upgrades are extremely easy
  ○ each user can have her own cluster of machines
● elastic
  ○ add / remove resources on demand
● lots of integrated tools

Page 12: Big data on google platform   dev fest presentation

Why we chose Google Cloud Platform

● cheaper than Amazon
● Amazon is a competitor of Ocado
● oriented towards data analysis
● rapidly evolving
● Google BigQuery
● Google Cloud Dataflow

Page 13: Big data on google platform   dev fest presentation

At last - BigData in da house!

[Diagram: Ocado services, Oracle, Greenplum, JMS, Google Cloud Storage, Compute cluster, User Cluster, Cluster Manager, Transformed ORC files, Raw data]

Page 14: Big data on google platform   dev fest presentation

But, grandma, elephants are clumsy!

● MapReduce is a very restrictive paradigm
● ad-hoc queries? Forget it!
● needs lots of space for temporary data
● many accompanying tools are of poor quality
  ○ incomplete SQL coverage in Hive
  ○ Mahout is buggy and performs poorly
● incompatibilities between tools in the ecosystem
  ○ makes the whole thing hard to maintain
● data governance must be done by hand
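To illustrate why MapReduce feels restrictive, here is a minimal single-process sketch of the paradigm in plain Python. It is not Hadoop code and not taken from the talk. It only mimics the map, shuffle and reduce phases on a few invented clickstream records, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (page, 1) pair for each page-view record, nothing otherwise."""
    if record["event"] == "page_view":
        yield record["page"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Invented sample records, for illustration only.
clickstream = [
    {"event": "page_view", "page": "/home"},
    {"event": "add_to_basket", "page": "/product/1"},
    {"event": "page_view", "page": "/home"},
    {"event": "page_view", "page": "/product/1"},
]

page_views = run_job(clickstream)
print(page_views)
```

Even this simple count has to be phrased as key/value pairs passing through a grouping step. A follow-up question such as "which pages are the most viewed?" would typically need a second job chained after the first, which is part of what makes ad-hoc analysis painful.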

Page 15: Big data on google platform   dev fest presentation

Spark to the rescue!

● much more elastic than MapReduce
● very powerful and clean API
● good SQL support
● very well suited for machine learning
● more coherent than the Hadoop ecosystem
● really gaining momentum lately

Page 16: Big data on google platform   dev fest presentation

BigData in da house (now with Spark!)

[Diagram: Ocado services, Oracle, Greenplum, JMS, Google Cloud Storage, Compute cluster, User Cluster, Cluster Manager, Transformed ORC files, Raw data]

Page 17: Big data on google platform   dev fest presentation

But is everyone happy with this? Hell no!

We need to get back to the drawing board.

Page 18: Big data on google platform   dev fest presentation

Analysts are still frustrated

● long latency before data is available for querying
● ad-hoc queries still not really possible
● no connectors for popular BI tools

Page 19: Big data on google platform   dev fest presentation

BigQuery is the savior

● outstanding performance
  ○ answers on small datasets within seconds
  ○ queries on bigger data take longer, but it’s still beyond the reach of Hadoop or Spark
● integration with BI tools (Excel, Tableau, etc.)
● data available immediately after loading
● super-easy to use
● highly available
● does the data governance for us!

Page 20: Big data on google platform   dev fest presentation

So this is how it looks now

[Diagram: input stream, Event Registry, Event Processors, endpoints, Cluster Manager]

Page 21: Big data on google platform   dev fest presentation

BigQuery is great… but could be even better

● fluctuations in performance
● jobs fail from time to time
● usage restrictions
  ○ constraints on SQL
  ○ no user-defined functions
  ○ no parameterized views
● cannot easily reuse data stored in BigQuery in external systems

Page 22: Big data on google platform   dev fest presentation

Questions?