Big Data on Google Platform: DevFest presentation
Big Data on Google Platform
or how we moved our data into the cloud
Przemysław Pastuszka, Kraków, 08.11.2014
What do I want to talk about?
● Quick introduction to Ocado
● When traditional DBMS is not enough
● Journey into cloud(s)
● BigQuery ups and downs
Ocado intro
Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households, shipping over 150,000 orders a week or 1.1M items a day.
Shop
Customer Fulfilment Center
Delivery
The data
● Warehouse & delivery
  ○ operational metrics
  ○ GPS data
  ○ item statuses
● Shop
  ○ clickstream
  ○ page views
  ○ customer events (register, add to basket, checkout, etc.)
● Backend
  ○ server logs
  ○ performance metrics
Pre-BigData picture
[Diagram: OCADO SERVICES, Oracle, Greenplum, JMS]
Why this was not enough
● Oracle
  ○ not scalable
  ○ expensive
  ○ needs DBAs with domain expertise
● Greenplum
  ○ meant to overcome the shortcomings of Oracle
  ○ and it did… for some time
  ○ scalability became a problem again
Here’s the idea - let’s do Hadoop
● it’s scalable
  ○ can handle petabytes of data
  ○ you can easily add / remove machines on demand
● huge ecosystem
  ○ Hive, Pig, Giraph, Mahout, ZooKeeper, ...
● widely adopted
  ○ Facebook, Twitter, Spotify, Netflix…
● open source
Bare metal or cloud? Cloud, of course!
● allowed us to spin up the Big Data division with the least effort
● cost effective
  ○ we use machines only when we need them
  ○ amount of resources tailored to our needs
● easy to maintain
  ○ no need for hardware maintenance
  ○ software upgrades are extremely easy
  ○ each user can have her own cluster of machines
● elastic
  ○ add / remove resources on demand
● lots of integrated tools
Why we chose Google Cloud Platform
● cheaper than Amazon
● Amazon is a competitor of Ocado
● oriented towards data analysis
● rapidly evolving
● Google BigQuery
● Google Cloud Dataflow
At last - BigData in da house!
[Diagram: OCADO SERVICES, Oracle, Greenplum, JMS, Google Cloud, Storage, Compute cluster, User Cluster, Cluster Manager, Transformed ORC files, Raw data]
But, grandma, elephants are clumsy!
● MapReduce is a very restrictive paradigm
● ad-hoc queries? Forget it!
● needs lots of space for temporary data
● many accompanying tools are of poor quality
  ○ SQL coverage in Hive
  ○ Mahout is buggy and performs poorly
● incompatibilities between tools in the ecosystem
  ○ makes the whole thing hard to maintain
● data governance must be done by hand
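To illustrate why MapReduce feels restrictive, here is a minimal pure-Python sketch of the paradigm (no Hadoop involved). Every computation has to be expressed as a map step that emits key/value pairs, a shuffle that groups them by key, and a reduce step that collapses each group. The log lines and all function names below are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical page-view log lines in the form "<customer_id> <page>".
LOG_LINES = [
    "c1 /home",
    "c2 /basket",
    "c1 /checkout",
    "c3 /home",
    "c1 /home",
]


def map_phase(line):
    """Emit (key, value) pairs; here one (page, 1) pair per log line."""
    _customer, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run one complete map -> shuffle -> reduce pass over the input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


page_views = run_job(LOG_LINES)
print(page_views)
```

Counting page views fits this shape nicely. A question that needs a second aggregation on top of the first (for example, the most visited page per customer) typically means chaining another full job after this one, with the intermediate result stored in between.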
Spark to the rescue!
● much more elastic than MapReduce
● very powerful and clean API
● good SQL support
● very well suited for machine learning
● more coherent than the Hadoop ecosystem
● really gaining momentum lately
BigData in da house (now with Spark!)
[Diagram: OCADO SERVICES, Oracle, Greenplum, JMS, Google Cloud, Storage, Compute cluster, User Cluster, Cluster Manager, Transformed ORC files, Raw data]
We need to go back to the drawing board.
But is everyone happy with this? Hell no!
Analysts are still frustrated
● long latency before data is available for querying
● ad-hoc queries still not really possible
● no connectors for popular BI tools
BigQuery is the savior
● outstanding performance
  ○ answers on small datasets within seconds
  ○ queries on bigger data take longer, but it’s still beyond the reach of Hadoop or Spark
● integration with BI tools (Excel, Tableau, etc.)
● data available immediately after loading
● super-easy to use
● highly available
● does the data governance for us!
So this is how it looks now
[Diagram: INPUT STREAM, Event Registry, Event Processor (×3), ENDPOINTS, Cluster Manager]
BigQuery is great… but could be even better
● fluctuations in performance
● jobs fail from time to time
● usage restrictions
  ○ constraints on SQL
  ○ no user defined functions
  ○ no parametrized views
● cannot easily reuse data stored in BigQuery in external systems
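Since jobs fail from time to time, a common way to cope on the client side is to retry with exponential backoff. The sketch below is not from the talk and does not use the BigQuery API. `TransientJobError`, `run_with_retries` and `flaky_job` are hypothetical names, and a real client would need to decide which errors are actually safe to retry.

```python
import random
import time


class TransientJobError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


def run_with_retries(job, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except TransientJobError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            # Exponential backoff (1x, 2x, 4x, ... base_delay) with random
            # jitter so that many clients do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
            sleep(delay)


# Usage with a stand-in job that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_job():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientJobError("simulated failure")
    return "done"


# The sleep function is replaced with a no-op so the example runs instantly.
result = run_with_retries(flaky_job, sleep=lambda _seconds: None)
print(result, attempts["count"])
```

Passing `sleep` as a parameter keeps the helper easy to exercise without waiting. In production code the default `time.sleep` would be used.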
Questions?