How and why you need to build a Big Data Lab

Why GCP is a pretty cool place to do it

Chris Kernaghan, Principal Consultant

Post on 22-Jan-2018

Data Lab vs Data Factory

Big data Lab – the world’s biggest

• WLCG – Worldwide LHC Computing Grid

• 170 Computing facilities

• 200,000 Cores

• 300GB/s data stream ingestion

• 300MB/s data stream filtered

• 27TB RAW data per day
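As a rough sanity check on these figures, the sketch below works out the reduction ratio of the filter and the daily volume implied by the filtered rate. It assumes decimal units (1 TB = 10^12 bytes) and a continuous 24-hour stream; neither assumption is stated on the slide.

```python
# Back-of-the-envelope check of the WLCG figures quoted above.
# Assumptions (not from the slide): decimal units and a continuous stream.

SECONDS_PER_DAY = 24 * 60 * 60


def reduction_ratio(ingest_bytes_per_s: float, filtered_bytes_per_s: float) -> float:
    """How many bytes arrive for every byte that survives filtering."""
    return ingest_bytes_per_s / filtered_bytes_per_s


def daily_volume_tb(bytes_per_s: float) -> float:
    """Volume accumulated over one day at a constant rate, in terabytes."""
    return bytes_per_s * SECONDS_PER_DAY / 1e12


ingest = 300e9    # 300 GB/s ingested
filtered = 300e6  # 300 MB/s kept after filtering

print("reduction ratio:", reduction_ratio(ingest, filtered))
print("TB per day at filtered rate:", daily_volume_tb(filtered))
```

Under those assumptions the filter keeps about one byte in a thousand, and the filtered rate comes to roughly 26 TB per day, which is in line with the 27 TB quoted.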

Big data Lab – Traditional Home brew

• Based on VMware, VirtualBox or Raspberry Pi

• Mix of hardware

• Limited resources – 6 cores, 128GB space

• Low performance – 1 GHz Processor

• Lots of babysitting

• Equal measures of heartbreak and joy

Big data Lab – Using Cloud

• IaaS and PaaS services

• Mix of applications

• Infinite resources

• High performance

• Access to quality data sets

• Utility billing

• Sharable outcomes

Big data platforms in the Cloud

• AWS

• GCP

• Azure

• SAP

• IBM

Common characteristics of Cloud based platforms

• Streaming Engine

• Data Storage

• Hadoop

• In Memory Engine

• Machine Learning

• Analytics

Why have a lab

• Data is a complex beast, it has several attributes

• Quality – different tasks require different data quality

• Machine Learning & Predictive

• Reporting

• Context – data context is vital for analytics

• Story of the data

• Volume – how much data is there

• Testing requirements for data latency

• Format – data format is not universal

• Different applications have different data types

• Analysis

• What and how to analyse

A lab is essential for testing these items before large-scale factory work is done
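A first lab exercise along these lines could be a quick profile of a candidate data set: how many rows there are (volume), what each column looks like (format), and how much is missing (quality). The sketch below is one minimal way to do that with the Python standard library. The function names and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter


def infer_type(value: str) -> str:
    """Classify a single CSV cell as missing, int, float or text."""
    if value == "":
        return "missing"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"


def profile_csv(text: str) -> dict:
    """Count rows and, per column, how many cells fall into each type."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: Counter() for name in (reader.fieldnames or [])}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            cell = (row.get(name) or "").strip()
            columns[name][infer_type(cell)] += 1
    return {"rows": rows, "columns": {n: dict(c) for n, c in columns.items()}}


# Hypothetical sample data, invented for this example.
sample = (
    "order_id,weight_kg,shipped_on\n"
    "1,12.5,2018-01-22\n"
    "2,,2018-01-23\n"
    "3,abc,\n"
)

for column, counts in profile_csv(sample)["columns"].items():
    print(column, counts)
```

A column that mixes float, text and missing cells, like weight_kg here, is exactly the kind of thing worth finding in the lab rather than in the factory.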

Define your goals

• Achieving the best use of resources is critical

• Cloud based Big Data labs have a direct charge model

• Homebrew Big Data labs have limited resources

• Define what the outcome of the lab work is

• This is no different to a proper science experiment

• Design your lab and define your tools

• You have to use the right tool for the job, not just those you are familiar with

• Define your data set

• Work out what data you need

• Gain permission to use what you need if required
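Because a Cloud based lab is charged directly, it can help to cost an experiment before running it. The sketch below is a minimal budget check. The class, the run names and the hourly rates are all hypothetical and are not real cloud prices.

```python
from dataclasses import dataclass


@dataclass
class LabRun:
    """One planned experiment in the lab."""
    name: str
    nodes: int
    hours: float
    rate_per_node_hour: float  # hypothetical price, not a real cloud tariff

    def cost(self) -> float:
        return self.nodes * self.hours * self.rate_per_node_hour


def total_cost(runs) -> float:
    return sum(run.cost() for run in runs)


def within_budget(runs, budget: float) -> bool:
    """True when all planned runs together fit inside the budget."""
    return total_cost(runs) <= budget


# Hypothetical plan, invented for this example.
plan = [
    LabRun("ingest test", nodes=3, hours=4, rate_per_node_hour=0.5),
    LabRun("model training", nodes=5, hours=2, rate_per_node_hour=1.0),
]

print("planned cost:", total_cost(plan))
print("fits a budget of 20:", within_budget(plan, 20))
```

The same check works for a homebrew lab if the rate is replaced by a count of the cores or hours you actually have.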

Mind the gap and acquire knowledge

Part of the fun of big data labs is working out what you don’t know

• A particular framework

• An algorithm

• A data set

• A visualisation

The next fun part is working out where to fill that knowledge gap

• Online sources

• Kaggle

• MOOCs – Andrew Ng’s Stanford course

• Forums – Stack Overflow

It is also implicit that you share what you have learnt

SAP and Big Data platforms

In-Memory Store

Simplified processing of large volumes of archived data

HANA SDA / Spark Adapter

HANA-Spark Adapter for real-time understanding of current data with historical context

Unified administration using HANA cockpit simplifies system management

[Architecture diagram. Labels: SAP HANA Platform (Application Services, Database Services, Processing Services, Integration Services, HANA Smart Data Access, Dynamic Tiering); Spark API enhancement; Hadoop Cluster (Vora on Spark, YARN, HDFS files, Structured Storage)]

SAP HANA Express Edition

• Fast application development and deployment with essential features

• Free up to 32GB of memory – upgradeable for a fee

• Flexible access from a laptop, desktop, server or Cloud platform

• Pre-packaged with sample code and data

• Downloadable from SAP Developer network

Big data datasets

Companies are really really bad at using external data sets

• There are many public data sets which can be used to complement existing internal data.

• Weather data for logistics companies

• AWS Public Datasets

• Google Public Datasets

• GitHub Public Datasets

• Kaggle Public Datasets

• Data.gov.uk Public Datasets
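To make the weather-for-logistics idea concrete, the sketch below joins internal shipment records to an external weather feed on date and city. Every field name and value here is hypothetical and invented for illustration. A real public data set will have its own schema and will need its own cleaning.

```python
def enrich_with_weather(shipments, weather):
    """Attach rainfall to each shipment by matching on (date, city).

    Shipments without a matching weather record get rain_mm=None,
    so gaps in the external data stay visible.
    """
    lookup = {(w["date"], w["city"]): w for w in weather}
    enriched = []
    for shipment in shipments:
        match = lookup.get((shipment["date"], shipment["city"]))
        rain = match["rain_mm"] if match else None
        enriched.append({**shipment, "rain_mm": rain})
    return enriched


# Hypothetical internal data.
shipments = [
    {"id": "S1", "date": "2018-01-22", "city": "Belfast", "delay_min": 45},
    {"id": "S2", "date": "2018-01-22", "city": "London", "delay_min": 0},
    {"id": "S3", "date": "2018-01-23", "city": "Leeds", "delay_min": 10},
]

# Hypothetical external data, standing in for a public weather data set.
weather = [
    {"date": "2018-01-22", "city": "Belfast", "rain_mm": 18.0},
    {"date": "2018-01-22", "city": "London", "rain_mm": 0.5},
]

for row in enrich_with_weather(shipments, weather):
    print(row["id"], row["delay_min"], row["rain_mm"])
```

Keeping unmatched shipments in the output, rather than dropping them, is a deliberate choice: coverage gaps in the external data are themselves a finding for the lab.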

SAP HANA Express Edition – Deploying in GCP (demo)