google cloud dataproc - easier, faster, more cost-effective spark and hadoop

James Malone, Product Manager

More data. Zero headaches. Making the Spark and Hadoop ecosystem fast, easy, and cost-effective.

Upload: huguk

Post on 19-Jan-2017


TRANSCRIPT

Page 1: Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop

James Malone, Product Manager

More data. Zero headaches. Making the Spark and Hadoop ecosystem fast, easy, and cost-effective.

Page 2:

Cloud Dataproc features and benefits

Page 3:

Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.

Page 4:

Easy, fast, cost-effective

Fast: things take seconds to minutes, not hours or weeks.

Easy: be an expert with your data, not your data infrastructure.

Cost-effective: pay for exactly what you use.

Page 5:

Running Hadoop on Google Cloud

Each option covers the same stack of concerns: custom code, monitoring/health, dev integration, scaling, job submission, GCP connectivity, deployment, and creation.

On premise: every layer is customer managed.

Vendor Hadoop: every layer is customer managed.

bdutil (free OSS toolkit): deployment and creation are scripted for you, scaling is manual, and the rest is customer managed.

Cloud Dataproc (managed Hadoop): Google manages everything in the stack except your custom code.

Page 6:

Cloud Dataproc - integrated

Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform, spanning storage, operations, and data.

Page 7:

Where Cloud Dataproc fits into GCP

Google Bigtable (HBase)
Google BigQuery (analytics, data warehouse)
Stackdriver Logging (logging ops)
Google Cloud Dataflow (batch/stream processing)
Google Cloud Storage (HCFS/HDFS)
Stackdriver Monitoring (monitoring)

Page 8:

Most time can be spent with data, not tooling: more time can be dedicated to examining data for actionable insights.

Less time is spent on clusters, since creating, resizing, and destroying clusters is easily done.

Hands-on with data: Cloud Dataproc setup and customization.

Page 9:

Lift and shift workloads to Cloud Dataproc

1. Copy data to GCS: copy your data to Google Cloud Storage (GCS) by installing the connector or by copying manually.

2. Update file prefix: update the file location prefix in your scripts from hdfs:// to gs:// to access your data in GCS.

3. Use Cloud Dataproc: create a Cloud Dataproc cluster and run your job on the cluster against the data you copied to GCS. Done.
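The three steps above can be sketched with the Cloud SDK. This is a minimal sketch, not from the deck: the bucket, cluster, script name, and hdfs:// path are all hypothetical placeholders.

```shell
# 1. Copy data to Google Cloud Storage (hypothetical bucket name).
gsutil -m cp -r /data/warehouse gs://my-migration-bucket/warehouse

# 2. Point your scripts at GCS by swapping the hdfs:// prefix for gs://.
sed -i 's|hdfs://namenode/warehouse|gs://my-migration-bucket/warehouse|g' my_job.pig

# 3. Create a Dataproc cluster, run the job against the copied data, clean up.
gcloud dataproc clusters create migration-cluster
gcloud dataproc jobs submit pig --cluster=migration-cluster --file=my_job.pig
gcloud dataproc clusters delete migration-cluster --quiet
```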

Page 10:

How does Google Cloud Dataproc help me?

Page 11:

Traditional Spark and Hadoop clusters

Page 12:

Google Cloud Dataproc

Page 13:

Cloud example - slow vs. fast

Traditional clusters: scaling can take hours, days, or weeks to perform. (Chart: capacity needed vs. time needed to obtain new capacity.)

Cloud Dataproc: things take seconds to minutes, not hours or weeks. (Chart: capacity used over time.)

Page 14:

Cloud example - hard vs. easy

Traditional clusters: you need experts to optimize utilization and deployment. (Chart: cluster utilization over time, with long inactive periods.)

Cloud Dataproc: be an expert with your data, not your data infrastructure. (Chart: utilization of short-lived clusters, cluster 1 and cluster 2, matched to demand.)

Page 15:

Cloud example - costly vs. cost-effective

Traditional clusters: you (probably) pay for more capacity than you actually use. (Chart: cost over time.)

Cloud Dataproc: pay for exactly what you use. (Chart: cost over time.)

Page 16:

Google Cloud Dataproc - under the hood

Google Cloud Services

Dataproc Cluster

Cloud Dataproc uses GCP: Compute Engine, Cloud Storage, and the Stackdriver tools.

Page 17:

Google Cloud Dataproc - under the hood

Cloud Dataproc Agent

Google Cloud Services

Dataproc Cluster

Cloud Dataproc clusters have an agent to manage the Cloud Dataproc cluster. Dataproc uses Compute Engine, Cloud Storage, and the Cloud Ops tools.

Page 18:

Google Cloud Dataproc - under the hood

Spark & Hadoop OSS

Spark, Hadoop, Hive, Pig, and other OSS components execute on the cluster.

Cloud Dataproc Agent

Google Cloud Services

Dataproc Cluster

Cloud Dataproc clusters have an agent to manage the Cloud Dataproc cluster. Dataproc uses Compute Engine, Cloud Storage, and the Cloud Ops tools.

Page 19:

Google Cloud Dataproc - under the hood

Spark, PySpark, Spark SQL, MapReduce, Pig, Hive

Spark & Hadoop OSS

Cloud Dataproc Agent

Google Cloud Services

Dataproc Cluster

Dataproc Jobs

Page 20:

Google Cloud Dataproc - under the hood

Applications on the cluster

Dataproc Jobs

GCP Products

Spark, PySpark, Spark SQL, MapReduce, Pig, Hive

Dataproc Cluster

Spark & Hadoop OSS

Cloud Dataproc Agent

Google Cloud Services

Dataproc Jobs features

Data outputs

Page 21:

How can I use Cloud Dataproc?

Page 22:

Google Developers Console: https://console.developers.google.com/

Page 23:

Google Cloud SDK: https://cloud.google.com/sdk/
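With the SDK installed, a cluster's whole lifecycle is a few gcloud commands. A minimal sketch; the cluster name, worker count, machine type, and the SparkPi example job are assumptions for illustration, not from the deck:

```shell
# Create a small cluster (hypothetical name and size).
gcloud dataproc clusters create my-cluster \
    --num-workers=2 --worker-machine-type=n1-standard-4

# Submit a Spark job to it (the SparkPi example shipped with Spark).
gcloud dataproc jobs submit spark --cluster=my-cluster \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

# Tear the cluster down when the job is done.
gcloud dataproc clusters delete my-cluster --quiet
```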

Page 24:

Cloud Dataproc REST API: https://cloud.google.com/dataproc/reference/rest/
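The same operations are available over plain HTTP. A sketch of a cluster-create call against the v1 API; the project and cluster names are hypothetical, and gcloud is used only to mint an access token:

```shell
# Create a cluster via the REST API (hypothetical project and cluster names).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"clusterName": "my-cluster"}' \
  "https://dataproc.googleapis.com/v1/projects/my-project/regions/global/clusters"
```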

Page 25:

Let’s see an example - Cloud Dataproc demo

Page 26:

Google Cloud Dataproc - demo overview

In this demo we are going to do a few things:

1. Create a cluster
2. Query a large set of data stored in Google Cloud Storage
3. Review the output of the queries
4. Delete the cluster
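The demo steps above map onto three gcloud commands. A sketch under assumptions: the cluster name and worker count are hypothetical, and the trips table is presumed already defined over the data in GCS:

```shell
# 1. Create a cluster (hypothetical name and size).
gcloud dataproc clusters create taxi-demo --num-workers=50

# 2. Run one of the demo queries as a Hive job.
gcloud dataproc jobs submit hive --cluster=taxi-demo \
    --execute="SELECT cab_type, count(*) FROM trips GROUP BY cab_type;"

# 3. Delete the cluster once the output has been reviewed.
gcloud dataproc clusters delete taxi-demo --quiet
```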

Page 27:

What just happened?

YARN cores: 1,600
YARN RAM: 4.7 TB
100% Spark & Hadoop
1 click

Page 28:

NYC taxi data

The New York City Taxi & Limousine Commission and Uber released a dataset of trips from 2009-2015.

The original dataset is in CSV format and contains over 20 columns of data and about 1.2 billion trips.

The dataset is roughly 270 gigabytes.

Page 29:

CREATE EXTERNAL TABLE trips (
  trip_id INT,
  vendor_id STRING,
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  store_and_fwd_flag STRING,
  ...(44 other columns)...,
  dropoff_puma STRING)
STORED AS orc
LOCATION 'gs://taxi-nyc-demo/trips/'
TBLPROPERTIES (
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="536870912",
  "orc.row.index.stride"="50000");

Page 30:

SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;

SELECT passenger_count, avg(total_amount)
FROM trips
GROUP BY passenger_count;

SELECT passenger_count, year(pickup_datetime), count(*)
FROM trips
GROUP BY passenger_count, year(pickup_datetime);

SELECT passenger_count, year(pickup_datetime) trip_year,
       round(trip_distance), count(*) trips
FROM trips
GROUP BY passenger_count, year(pickup_datetime), round(trip_distance)
ORDER BY trip_year, trips DESC;

Page 31:

Demo recap

Dataset: 270 GB
Trips: 1.2 B
Queries: 4
100% Apache ecosystem

Page 32:

$12.85 (vs. $77.58 and $41.54)

Page 33:

If you’re processing data, you may also want to consider...

Page 34:

Google Cloud Dataflow & Apache Beam

The Cloud Dataflow SDK, based on Apache Beam, is a collection of SDKs for building streaming data processing pipelines.

Cloud Dataflow is a fully managed (no-ops) and integrated service for executing optimized, parallelized data processing pipelines.

Page 35:

(Diagram: Google internal technologies - MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel - leading to Cloud Dataflow, Cloud Dataproc, and Apache Beam.)

Page 36:

Joining several threads into Beam

(Diagram: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and Millwheel feed into Cloud Dataflow, Cloud Dataproc, and Apache Beam.)

Page 37:

Google BigQuery

Fully managed analytics data warehouse

Virtually unlimited resources, but you only pay for what you use

Highly available, encrypted, durable

Page 38:

Google Cloud Bigtable

Google Cloud Bigtable offers companies a fast, fully managed, infinitely scalable NoSQL database service with an HBase-compliant API. Unlike comparable market offerings, Bigtable is the only fully managed database where organizations don't have to sacrifice speed, scale, or cost-efficiency when they build applications.

Google Cloud Bigtable has been battle-tested at Google for 10 years as the database driving major applications, including Google Analytics, Gmail, and YouTube.

Page 39:

Wrapping things up

Page 40:

Cloud Dataproc - get started today

1. Create a Google Cloud project
2. Open Developers Console
3. Visit Dataproc section
4. Create cluster in 1 click, 90 sec.

Page 41:

If you only remember 3 things...

Cloud Dataproc is easy: it offers a number of tools to easily interact with clusters and jobs so you can be hands-on with your data.

Cloud Dataproc is fast: clusters start in under 90 seconds on average, so you spend less time and money waiting for your clusters.

Cloud Dataproc is cost-effective: it is easy on the pocketbook, with low pricing of just 1c per vCPU per hour and minute-by-minute billing.
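At 1c per vCPU per hour, the Dataproc service premium is simple arithmetic (charged on top of the underlying Compute Engine usage). A sketch for a hypothetical cluster of 40 worker vCPUs running for 45 minutes:

```shell
# 40 vCPUs x $0.01/vCPU/hour x 0.75 hours, billed minute by minute.
awk 'BEGIN { printf "%.2f\n", 40 * 0.01 * 45/60 }'   # prints 0.30
```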

Page 42:

Thank You