Google Cloud Dataproc - Easier, Faster, More Cost-Effective Spark and Hadoop
TRANSCRIPT
James Malone, Product Manager
More data. Zero headaches. Making the Spark and Hadoop ecosystem fast, easy, and cost-effective.
Cloud Dataproc features and benefits
Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
Easy, fast, cost-effective
Fast: things take seconds to minutes, not hours or weeks
Easy: be an expert with your data, not your data infrastructure
Cost-effective: pay for exactly what you use
Running Hadoop on Google Cloud
There are several ways to run Hadoop on Google Cloud. The slide compares them layer by layer - Creation, Deployment, GCP Connectivity, Job Submission, Scaling, Dev Integration, Monitoring/Health, and Custom Code - with each layer marked as either Google Managed or Customer Managed:

On Premise: every layer is customer managed.
Vendor Hadoop: every layer is customer managed.
bdutil (free OSS toolkit): largely customer managed, with manual scaling.
Cloud Dataproc (managed Hadoop): everything except your custom code is Google managed.
Cloud Dataproc - integrated
Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform, spanning storage, operations, and data.
Where Cloud Dataproc fits into GCP
Google Bigtable (HBase)
Google BigQuery (analytics, data warehouse)
Stackdriver Logging (logging ops)
Google Cloud Dataflow (batch/stream processing)
Google Cloud Storage (HCFS/HDFS)
Stackdriver Monitoring (monitoring)
Most time can be spent with data, not tooling. More time can be dedicated to examining data for actionable insights, and less time is spent on clusters, since creating, resizing, and destroying them is easily done.
Hands-on with data: Cloud Dataproc setup and customization
Lift and shift workloads to Cloud Dataproc
1. Copy data to GCS: copy your data to Google Cloud Storage (GCS) by installing the GCS connector or by copying it manually.
2. Update file prefix: update the file location prefix in your scripts from hdfs:// to gs:// to access your data in GCS.
3. Use Cloud Dataproc: create a Cloud Dataproc cluster and run your job on the cluster against the data you copied to GCS. Done.
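A minimal sketch of this flow from a shell with the Cloud SDK installed; the bucket, cluster name, and region are placeholders, and exact flags can vary by SDK version:

  # 1. Copy data to GCS (from an existing cluster with the GCS connector
  #    installed, hadoop distcp hdfs:///data gs://my-bucket/data also works)
  gsutil -m cp -r ./data gs://my-bucket/data

  # 2. Switch the file prefix in job scripts from hdfs:// to gs://
  #    (a simplistic substitution; adjust for your actual path layout)
  sed -i 's|hdfs://|gs://my-bucket/|g' my_query.hql

  # 3. Create a cluster and run the job against the copied data
  gcloud dataproc clusters create my-cluster --region=us-central1
  gsutil cp my_query.hql gs://my-bucket/my_query.hql
  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
      --file=gs://my-bucket/my_query.hql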
How does Google Cloud Dataproc help me?
Traditional Spark and Hadoop clusters
Google Cloud Dataproc
Cloud example - slow vs. fast
Things take seconds to minutes, not hours or weeks.

Traditional clusters: scaling can take hours, days, or weeks to perform, so the capacity you can actually use lags behind the capacity you need.
Cloud Dataproc: new capacity is available in minutes, so capacity used tracks capacity needed.
(Charts: capacity needed and capacity used over time for each.)
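Resizing a running cluster is a single command; a hedged sketch, with cluster name, region, and size as placeholders:

  # Scale out to 10 workers; Dataproc typically reshapes the cluster in minutes
  gcloud dataproc clusters update my-cluster --region=us-central1 --num-workers=10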
Cloud example - hard vs. easy
Be an expert with your data, not your data infrastructure.

Traditional clusters: you need experts to optimize utilization and deployment, and clusters often sit inactive between jobs.
Cloud Dataproc: right-sized clusters (cluster 1, cluster 2, ...) can be created per workload and deleted when the work finishes, keeping utilization high.
(Charts: cluster utilization over time for each.)
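That per-workload pattern looks like the following hedged sketch (names, region, and sizes are placeholders):

  # Create a right-sized cluster for one workload...
  gcloud dataproc clusters create job-cluster --region=us-central1 --num-workers=4
  # ...run the job against it...
  gcloud dataproc jobs submit hive --cluster=job-cluster --region=us-central1 \
      --execute="SHOW TABLES;"
  # ...then delete the cluster so nothing sits idle
  gcloud dataproc clusters delete job-cluster --region=us-central1 --quiet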
Cloud example - costly vs. cost-effective
Pay for exactly what you use.

Traditional clusters: you (probably) pay for more capacity than is actually used.
Cloud Dataproc: cost follows actual usage.
(Charts: cost over time for each.)
Google Cloud Dataproc - under the hood
Cloud Dataproc uses GCP: Compute Engine, Cloud Storage, and Stackdriver tools. (Diagram: a Dataproc cluster built on Google Cloud services.)
Google Cloud Dataproc - under the hood
Cloud Dataproc clusters have an agent to manage the cluster; Dataproc uses Compute Engine, Cloud Storage, and Stackdriver tools. (Diagram adds the Cloud Dataproc agent to the cluster.)
Google Cloud Dataproc - under the hood
Spark, Hadoop, Hive, Pig, and other OSS components execute on the cluster, alongside the Cloud Dataproc agent. (Diagram adds the Spark & Hadoop OSS layer on top of the agent and Google Cloud services.)
Google Cloud Dataproc - under the hood
Dataproc Jobs submit Spark, PySpark, Spark SQL, MapReduce, Pig, and Hive workloads to the cluster. (Diagram adds the Dataproc Jobs layer on top of the Spark & Hadoop OSS components, the Cloud Dataproc agent, and Google Cloud services.)
Google Cloud Dataproc - under the hood
Putting it all together: Dataproc Jobs (Spark, PySpark, Spark SQL, MapReduce, Pig, Hive) run as applications on the cluster, on top of the Spark & Hadoop OSS components, the Cloud Dataproc agent, and Google Cloud services; the cluster connects to GCP products and the jobs produce data outputs.
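Each of those job types maps onto a job-submission command; a hedged sketch, with cluster name, region, file paths, and the example-jar location as placeholder assumptions:

  gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
      --cluster=my-cluster --region=us-central1
  gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
      --class=org.apache.spark.examples.SparkPi \
      --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
      --execute="SHOW TABLES;"
  gcloud dataproc jobs submit pig --cluster=my-cluster --region=us-central1 \
      --execute="fs -ls /;"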
How can I use Cloud Dataproc?
Google Developers Console: https://console.developers.google.com/
Google Cloud SDK: https://cloud.google.com/sdk/
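With the SDK installed, Dataproc is driven from the gcloud CLI; a hedged sketch of the core commands (names and region are placeholders; flags can vary by SDK version):

  gcloud dataproc clusters create my-cluster --region=us-central1 --num-workers=2
  gcloud dataproc clusters list --region=us-central1
  gcloud dataproc jobs list --region=us-central1
  gcloud dataproc clusters delete my-cluster --region=us-central1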
Cloud Dataproc REST API: https://cloud.google.com/dataproc/reference/rest/
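The same operations are exposed over REST; a hedged sketch of creating a cluster with curl (project ID, region, and cluster settings are placeholders):

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"projectId": "my-project", "clusterName": "my-cluster",
         "config": {"workerConfig": {"numInstances": 2}}}' \
    "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters"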
Let’s see an example - Cloud Dataproc demo
Google Cloud Dataproc - demo overview. In this demo we are going to do a few things:
1. Create a cluster
2. Query a large set of data stored in Google Cloud Storage
3. Review the output of the queries
4. Delete the cluster
What just happened?
YARN cores: 1,600
YARN RAM: 4.7 TB
Spark & Hadoop: 100%
Clicks: 1
NYC taxi data
The New York City Taxi & Limousine Commission and Uber released a dataset of trips from 2009-2015.
The original dataset is in CSV format, contains over 20 columns of data and about 1.2 billion trips, and is roughly 270 gigabytes.
CREATE EXTERNAL TABLE trips (
  trip_id INT,
  vendor_id STRING,
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  store_and_fwd_flag STRING,
  ...(44 other columns)...,
  dropoff_puma STRING)
STORED AS orc
LOCATION 'gs://taxi-nyc-demo/trips/'
TBLPROPERTIES (
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="536870912",
  "orc.row.index.stride"="50000");
SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;

SELECT passenger_count, avg(total_amount)
FROM trips
GROUP BY passenger_count;

SELECT passenger_count, year(pickup_datetime), count(*)
FROM trips
GROUP BY passenger_count, year(pickup_datetime);

SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips
FROM trips
GROUP BY passenger_count, year(pickup_datetime), round(trip_distance)
ORDER BY trip_year, trips DESC;
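Queries like these can be submitted to the cluster without SSHing in; a hedged sketch (cluster name and region are placeholders):

  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
      --execute="SELECT cab_type, count(*) FROM trips GROUP BY cab_type;"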
Demo recap
Dataset: 270 GB
Trips: 1.2 B
Queries: 4
Apache ecosystem: 100%
Demo cost: $12.85 (vs. $77.58, $41.54)
If you’re processing data, you may also want to consider...
Google Cloud Dataflow & Apache Beam
The Cloud Dataflow SDK, based on Apache Beam, is a collection of SDKs for building streaming data processing pipelines.
Cloud Dataflow is a fully managed (no-ops) and integrated service for executing optimized, parallelized data processing pipelines.
Joining several threads into Beam
(Diagram: Google-internal systems - MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel - feed into Cloud Dataproc, Cloud Dataflow, and Apache Beam.)
Google BigQuery
Fully managed analytics data warehouse
Virtually unlimited resources, but you only pay for what you use
Highly available, encrypted, durable
Google Cloud Bigtable
Google Cloud Bigtable offers companies a fast, fully managed, infinitely scalable NoSQL database service with an HBase-compliant API. Unlike comparable market offerings, Bigtable is the only fully managed database where organizations don't have to sacrifice speed, scale, or cost-efficiency when they build applications.
Google Cloud Bigtable has been battle-tested at Google for 10 years as the database driving all major applications including Google Analytics, Gmail and YouTube.
Wrapping things up
Cloud Dataproc - get started today
1. Create a Google Cloud project
2. Open the Developers Console
3. Visit the Dataproc section
4. Create a cluster in 1 click, ~90 sec.
If you only remember 3 things...
Cloud Dataproc is easy: it offers a number of tools to easily interact with clusters and jobs, so you can be hands-on with your data.

Cloud Dataproc is fast: clusters start in under 90 seconds on average, so you spend less time and money waiting for your clusters.

Cloud Dataproc is cost-effective: it is easy on the pocketbook, with low pricing of just 1 cent per vCPU per hour and minute-by-minute billing.
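As a rough worked example of that pricing (cluster size and runtime are hypothetical): a cluster exposing 64 vCPUs that runs for 30 minutes incurs a Dataproc charge of 64 x $0.01/hour x 0.5 hours = $0.32, billed minute by minute, on top of the underlying Compute Engine resources.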
Thank You