Google Cloud Dataproc - Easier, Faster, More Cost-Effective Spark and Hadoop
TRANSCRIPT
James Malone, Product Manager
More data. Zero headaches. Making the Spark and Hadoop ecosystem fast, easy, and cost-effective.
Cloud Dataproc features and benefits
Apache Spark and Apache Hadoop should be fast, easy, and cost-effective.
Easy, fast, cost-effective
Fast: things take seconds to minutes, not hours or weeks
Easy: be an expert with your data, not your data infrastructure
Cost-effective: pay for exactly what you use
Running Hadoop on Google Cloud
There are several ways to run Hadoop on Google Cloud. The slide compares them layer by layer - Creation, Deployment, GCP Connectivity, Job Submission, Scaling, Dev Integration, Monitoring/Health, and Custom Code - with each layer marked as either Google Managed or Customer Managed:

On Premise: every layer is customer managed.
Vendor Hadoop: every layer is customer managed.
bdutil (free OSS toolkit): largely customer managed, with manual scaling.
Cloud Dataproc (managed Hadoop): everything except your custom code is Google managed.
Cloud Dataproc - integrated
Cloud Dataproc is natively integrated with several Google Cloud Platform products as part of an integrated data platform, spanning storage, operations, and data.
Where Cloud Dataproc fits into GCP
Google Bigtable (HBase)
Google BigQuery (analytics, data warehouse)
Stackdriver Logging (logging ops)
Google Cloud Dataflow (batch/stream processing)
Google Cloud Storage (HCFS/HDFS)
Stackdriver Monitoring (monitoring)
Most time can be spent with data, not tooling. More time can be dedicated to examining data for actionable insights, and less time is spent on clusters, since creating, resizing, and destroying them is easily done.
Hands-on with data: Cloud Dataproc setup and customization
Lift and shift workloads to Cloud Dataproc
1. Copy data to GCS: copy your data to Google Cloud Storage (GCS) by installing the GCS connector or by copying it manually.
2. Update file prefix: update the file location prefix in your scripts from hdfs:// to gs:// to access your data in GCS.
3. Use Cloud Dataproc: create a Cloud Dataproc cluster and run your job on the cluster against the data you copied to GCS. Done.
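A minimal sketch of this flow from a shell with the Cloud SDK installed; the bucket, cluster name, and region are placeholders, and exact flags can vary by SDK version:

  # 1. Copy data to GCS (from an existing cluster with the GCS connector
  #    installed, hadoop distcp hdfs:///data gs://my-bucket/data also works)
  gsutil -m cp -r ./data gs://my-bucket/data

  # 2. Switch the file prefix in job scripts from hdfs:// to gs://
  #    (a simplistic substitution; adjust for your actual path layout)
  sed -i 's|hdfs://|gs://my-bucket/|g' my_query.hql

  # 3. Create a cluster and run the job against the copied data
  gcloud dataproc clusters create my-cluster --region=us-central1
  gsutil cp my_query.hql gs://my-bucket/my_query.hql
  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
      --file=gs://my-bucket/my_query.hql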
How does Google Cloud Dataproc help me?
Traditional Spark and Hadoop clusters
Google Cloud Dataproc
Cloud example - slow vs. fast
Things take seconds to minutes, not hours or weeks.

Traditional clusters: scaling can take hours, days, or weeks to perform, so the capacity you can actually use lags behind the capacity you need.
Cloud Dataproc: new capacity is available in minutes, so capacity used tracks capacity needed.
(Charts: capacity needed and capacity used over time for each.)
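Resizing a running cluster is a single command; a hedged sketch, with cluster name, region, and size as placeholders:

  # Scale out to 10 workers; Dataproc typically reshapes the cluster in minutes
  gcloud dataproc clusters update my-cluster --region=us-central1 --num-workers=10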
Cloud example - hard vs. easy
Be an expert with your data, not your data infrastructure.

Traditional clusters: you need experts to optimize utilization and deployment, and clusters often sit inactive between jobs.
Cloud Dataproc: right-sized clusters (cluster 1, cluster 2, ...) can be created per workload and deleted when the work finishes, keeping utilization high.
(Charts: cluster utilization over time for each.)
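That per-workload pattern looks like the following hedged sketch (names, region, and sizes are placeholders):

  # Create a right-sized cluster for one workload...
  gcloud dataproc clusters create job-cluster --region=us-central1 --num-workers=4
  # ...run the job against it...
  gcloud dataproc jobs submit hive --cluster=job-cluster --region=us-central1 \
      --execute="SHOW TABLES;"
  # ...then delete the cluster so nothing sits idle
  gcloud dataproc clusters delete job-cluster --region=us-central1 --quiet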
Cloud example - costly vs. cost-effective
Pay for exactly what you use.

Traditional clusters: you (probably) pay for more capacity than is actually used.
Cloud Dataproc: cost follows actual usage.
(Charts: cost over time for each.)
Google Cloud Dataproc - under the hood
Cloud Dataproc uses GCP: Compute Engine, Cloud Storage, and Stackdriver tools. (Diagram: a Dataproc cluster built on Google Cloud services.)
Google Cloud Dataproc - under the hood
Cloud Dataproc clusters have an agent to manage the cluster; Dataproc uses Compute Engine, Cloud Storage, and Stackdriver tools. (Diagram adds the Cloud Dataproc agent to the cluster.)
Google Cloud Dataproc - under the hood
Spark, Hadoop, Hive, Pig, and other OSS components execute on the cluster, alongside the Cloud Dataproc agent. (Diagram adds the Spark & Hadoop OSS layer on top of the agent and Google Cloud services.)
Google Cloud Dataproc - under the hood
Dataproc Jobs submit Spark, PySpark, Spark SQL, MapReduce, Pig, and Hive workloads to the cluster. (Diagram adds the Dataproc Jobs layer on top of the Spark & Hadoop OSS components, the Cloud Dataproc agent, and Google Cloud services.)
Google Cloud Dataproc - under the hood
Putting it all together: Dataproc Jobs (Spark, PySpark, Spark SQL, MapReduce, Pig, Hive) run as applications on the cluster, on top of the Spark & Hadoop OSS components, the Cloud Dataproc agent, and Google Cloud services; the cluster connects to GCP products and the jobs produce data outputs.
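Each of those job types maps onto a job-submission command; a hedged sketch, with cluster name, region, file paths, and the example-jar location as placeholder assumptions:

  gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
      --cluster=my-cluster --region=us-central1
  gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
      --class=org.apache.spark.examples.SparkPi \
      --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
      --execute="SHOW TABLES;"
  gcloud dataproc jobs submit pig --cluster=my-cluster --region=us-central1 \
      --execute="fs -ls /;"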
How can I use Cloud Dataproc?
Google Developers Console: https://console.developers.google.com/
Google Cloud SDK: https://cloud.google.com/sdk/
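With the SDK installed, Dataproc is driven from the gcloud CLI; a hedged sketch of the core commands (names and region are placeholders; flags can vary by SDK version):

  gcloud dataproc clusters create my-cluster --region=us-central1 --num-workers=2
  gcloud dataproc clusters list --region=us-central1
  gcloud dataproc jobs list --region=us-central1
  gcloud dataproc clusters delete my-cluster --region=us-central1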
Cloud Dataproc REST API: https://cloud.google.com/dataproc/reference/rest/
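The same operations are exposed over REST; a hedged sketch of creating a cluster with curl (project ID, region, and cluster settings are placeholders):

  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"projectId": "my-project", "clusterName": "my-cluster",
         "config": {"workerConfig": {"numInstances": 2}}}' \
    "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters"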
Let’s see an example - Cloud Dataproc demo
Google Cloud Dataproc - demo overview. In this demo we are going to do a few things:
1. Create a cluster
2. Query a large set of data stored in Google Cloud Storage
3. Review the output of the queries
4. Delete the cluster
What just happened?
YARN cores: 1,600
YARN RAM: 4.7 TB
Spark & Hadoop: 100%
Clicks: 1
NYC taxi data
The New York City Taxi & Limousine Commission and Uber released a dataset of trips from 2009-2015.
The original dataset is in CSV format, contains over 20 columns of data and about 1.2 billion trips, and is roughly 270 gigabytes.
CREATE EXTERNAL TABLE trips (
  trip_id INT,
  vendor_id STRING,
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  store_and_fwd_flag STRING,
  ...(44 other columns)...,
  dropoff_puma STRING)
STORED AS orc
LOCATION 'gs://taxi-nyc-demo/trips/'
TBLPROPERTIES (
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="536870912",
  "orc.row.index.stride"="50000");
SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;

SELECT passenger_count, avg(total_amount)
FROM trips
GROUP BY passenger_count;

SELECT passenger_count, year(pickup_datetime), count(*)
FROM trips
GROUP BY passenger_count, year(pickup_datetime);

SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips
FROM trips
GROUP BY passenger_count, year(pickup_datetime), round(trip_distance)
ORDER BY trip_year, trips DESC;
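Queries like these can be submitted to the cluster without SSHing in; a hedged sketch (cluster name and region are placeholders):

  gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 \
      --execute="SELECT cab_type, count(*) FROM trips GROUP BY cab_type;"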
Demo recap
Dataset: 270 GB
Trips: 1.2 B
Queries: 4
Apache ecosystem: 100%
Demo cost: $12.85 (vs. $77.58, $41.54)
If you’re processing data, you may also want to consider...
Google Cloud Dataflow & Apache Beam
The Cloud Dataflow SDK, based on Apache Beam, is a collection of SDKs for building streaming data processing pipelines.
Cloud Dataflow is a fully managed (no-ops) and integrated service for executing optimized, parallelized data processing pipelines.
Joining several threads into Beam
(Diagram: Google-internal systems - MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel - feed into Cloud Dataproc, Cloud Dataflow, and Apache Beam.)
Google BigQuery
Fully managed analytics data warehouse
Virtually unlimited resources, but you only pay for what you use
Highly available, encrypted, durable
Google Cloud Bigtable
Google Cloud Bigtable offers companies a fast, fully managed, infinitely scalable NoSQL database service with an HBase-compliant API. Unlike comparable market offerings, Bigtable is the only fully managed database where organizations don't have to sacrifice speed, scale, or cost-efficiency when they build applications.
Google Cloud Bigtable has been battle-tested at Google for 10 years as the database driving all major applications including Google Analytics, Gmail and YouTube.
Wrapping things up
Cloud Dataproc - get started today
1. Create a Google Cloud project
2. Open the Developers Console
3. Visit the Dataproc section
4. Create a cluster in 1 click, ~90 sec.
If you only remember 3 things...
Cloud Dataproc is easy: it offers a number of tools to easily interact with clusters and jobs, so you can be hands-on with your data.

Cloud Dataproc is fast: clusters start in under 90 seconds on average, so you spend less time and money waiting for your clusters.

Cloud Dataproc is cost-effective: it is easy on the pocketbook, with low pricing of just 1 cent per vCPU per hour and minute-by-minute billing.
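As a rough worked example of that pricing (cluster size and runtime are hypothetical): a cluster exposing 64 vCPUs that runs for 30 minutes incurs a Dataproc charge of 64 x $0.01/hour x 0.5 hours = $0.32, billed minute by minute, on top of the underlying Compute Engine resources.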
Thank You