AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Rahul Bhartia, Principal Solutions Architect, APN

Andy Kimbrough, Sr. Mgr. of Engineering, Amazon S3

Paul Scott-Murphy, VP of Product Mgmt., WANdisco

Extending Hadoop and Spark to the AWS Cloud with AWS Technology Partners

What to Expect from the Session

• Learn how to easily and seamlessly transition or extend Hadoop and Spark to AWS
• Patterns for migrating data from Hadoop clusters to Amazon S3
• Learn about solutions offered by AWS Big Data Technology Competency Partners
• Automated deployment of Partner solutions on the AWS Cloud for minimal disruption

Big Data workloads in the Cloud

Reduce Costs
• Optimize infrastructure for the workload
• Decouple compute from storage

Increase Speed
• Launch resources as needed, without long planning cycles
• Provide self-service for users

Innovate
• Test new ideas and new frameworks without any commitment
• Bring new products to market

GE Oil & Gas is migrating 500 applications, and more than 750 TB of data, to the cloud by the end of 2016 as part of a major digital transformation, helping it attain a 52% reduction in TCO and greater speed to market.

Amazon S3 – Storage for Big Data

Durable
• Designed for 11 9s of durability

Available
• Designed for 99.99% availability

High performance (see the sketch below)
• Multipart upload
• Range GET
• Parallelized LIST

Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments

Integrated
• Amazon EMR (Elastic MapReduce)
• Amazon Redshift
• Amazon DynamoDB
• Spark, Hive, Impala, Presto
• Many others

Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notifications
• Lifecycle policies
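The multipart upload and Range GET features above surface directly through the AWS SDKs. Below is a minimal sketch using Python's boto3 with a hypothetical bucket name; upload_file switches to multipart upload automatically for large objects, and get_object accepts an HTTP Range:

import boto3

s3 = boto3.client("s3")

# Large uploads: the transfer manager behind upload_file splits the
# file into parts above the multipart threshold (8 MB by default)
# and uploads them in parallel.
s3.upload_file("logs.tar.gz", "my-example-bucket", "backups/logs.tar.gz")

# Range GET: fetch only the first megabyte of the object.
resp = s3.get_object(
    Bucket="my-example-bucket",
    Key="backups/logs.tar.gz",
    Range="bytes=0-1048575",
)
chunk = resp["Body"].read()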

Amazon S3 – Storage Management

• Cross-region replication
• Lifecycle policies
• Data classification & management
• Event notifications
• CloudWatch metrics
• S3 Inventory
• Audit with CloudTrail data events
• Storage analytics

Amazon S3 with Big Data workloads

• EMRFS with Amazon EMR
• Open-source Hadoop/Spark connector, S3A (see the configuration sketch below)
  • Consistency: S3Guard
  • Performance: lazy seek, connection reuse
  • AWS SDK: multipart transfers
• Other open-source integrations
  • Hue, Alluxio, Presto
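As a rough illustration of the S3A connector above, here is a minimal PySpark sketch with a hypothetical bucket; the fs.s3a.* properties shown are standard Hadoop S3A settings, and the fadvise input policy is how S3A exposes its lazy-seek behavior (in real deployments credentials usually come from instance roles rather than explicit keys):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-example")
         # Widen the S3A connection pool for parallel reads;
         # connections are reused across requests.
         .config("spark.hadoop.fs.s3a.connection.maximum", "100")
         # Input policy controlling seek behavior (random access
         # avoids re-opening the stream on backward seeks).
         .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
         .getOrCreate())

# Read data directly from S3 through the s3a:// scheme.
df = spark.read.json("s3a://my-example-bucket/events/")
df.printSchema()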

Migrating Big Data workloads to the Cloud

[Diagram: two migration patterns, Lift-and-Shift and Burst-or-Extend, showing applications reading input from and writing output to HDFS, with backup and copy flows between the on-premises cluster and the cloud.]

Getting your data to the Cloud (and back …)

• One-time
• Periodic
• Continuous

Patterns for migrating data from Hadoop

• AWS Snowball with HDFS interface
• AWS Import/Export
• Amazon EMR with s3-dist-cp
• Amazon S3 APIs
• AWS Technology Partners
• Amazon Kinesis (Streams and Firehose)
• AWS Database Migration Service (DMS)

AWS Snowball with HDFS Interface (NEW!)

$ snowball cp -n hdfs://HOST:PORT/PATH_TO_FILE_ON_HDFS s3://BUCKET-NAME/DESTINATION-PATH

Distributed Copy (s3DistCp)

Works like a MapReduce job with Amazon S3 as a target. Best for periodic data backups.

s3DistCp --src s3://mybucket/prefix --dest hdfs:///folders --srcPattern pattern

[Diagram: on-premises cluster copying data to an Amazon S3 bucket.]
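On Amazon EMR the same copy can be submitted as a cluster step, using the cluster's own compute. A sketch using boto3 with a hypothetical cluster ID, bucket, and paths:

import boto3

emr = boto3.client("emr")

# Submit s3-dist-cp as a step on a running EMR cluster.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLE12345",  # hypothetical cluster ID
    Steps=[{
        "Name": "copy-hdfs-to-s3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "hdfs:///data/logs",
                "--dest", "s3://my-example-bucket/logs",
            ],
        },
    }],
)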

Distributed Copy

• Workflow management: Apache Falcon
• Connectivity is key: AWS Direct Connect
• Remember:
  • Dealing with Kerberos authentication (across clusters)
  • Needs scheduled workflow management
  • Can easily saturate the bandwidth
  • Needs compute for moving data

WANdisco Fusion for Hadoop

Best for synchronization of data

DEMO

Replication to Amazon S3 with WANdisco

Paul Scott-Murphy

VP of Product Management

WANDisco

WANdisco Fusion for Hadoop

Advantages
• No extra compute on your cluster
• No workflow management required
• Available via AWS Marketplace

Learn more at https://www.wandisco.com/product/amazon-s3-active-migrator

Stop by booth #2524

Data beyond Hadoop

Collect Logs/Events/Streams

Replicate Relational Databases
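For collecting logs, events, and streams, the session points to Amazon Kinesis. A minimal producer sketch in boto3 with a hypothetical stream name and payload:

import json
import boto3

kinesis = boto3.client("kinesis")

# Put one event onto a Kinesis stream; the partition key decides
# which shard receives the record.
kinesis.put_record(
    StreamName="clickstream-events",  # hypothetical stream
    Data=json.dumps({"user": "u-123", "action": "page_view"}),
    PartitionKey="u-123",
)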

Get up and running in AWS

Moving to AWS is more than just “lift and shift”

Big Data solution running in a non-AWS cloud environment:
• Rigid / inflexible
• Low utilization
• High cost

Big Data solution running on AWS:
• Flexible – scale up / down in minutes
• Size to your needs in less than 1 hour
• Constantly optimize cost: price reductions + innovations

Step 1: Migrate Data

Step 2: Process

Step 3: Leverage

Get started easily

AWS Big Data Competency Partners

Hortonworks Data Cloud • AWS Marketplace • AWS Quick Start

Support for a usage-based pricing model with data in Amazon S3. Complement the functionality with managed services or clusters.

Symantec: Provisioning Big Data Platform

http://www.slideshare.net/HadoopSummit/provisioning-big-data-platform-using-cloudbreak-ambari

Leverage Amazon S3 with what you prefer

• Hive/LLAP with Amazon S3 - http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/
• Impala with Amazon S3 - https://www.cloudera.com/documentation/enterprise/latest/topics/impala_s3.html
• Drill with Amazon S3 - https://www.mapr.com/resources/videos/sql-queries-data-amazon-s3-storage-drill-demo
• Databricks File System - https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/10%20Databricks%20File%20System%20-%20DBFS.html
• Vertica External Flex Tables - https://community.dev.hpe.com/t5/Vertica-Blog/Automatic-HP-Vertica-Database-Loader-for-AWS-S3/ba-p/230344
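Across these engines the pattern is the same: define a table over data that stays in S3 and query it in place. A sketch in Spark SQL with a hypothetical table and bucket, assuming a Hive-enabled SparkSession:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("external-table-on-s3")
         .enableHiveSupport()
         .getOrCreate())

# Define an external table over Parquet files in S3; the data is
# scanned in place at query time, nothing is copied into the cluster.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clicks (
        user_id STRING,
        event_time TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://my-example-bucket/clicks/'
""")

spark.sql("SELECT COUNT(*) FROM clicks").show()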

Databricks File System (DBFS)

DBFS is a distributed file system that comes installed on Spark clusters in Databricks. It is a layer over S3 that allows you to:

• Mount S3 buckets to make them available to users in your workspace
• Cache S3 data on the solid-state disks (SSDs) of your worker nodes to speed up access
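A sketch of that mount workflow, with hypothetical bucket and mount-point names; dbutils and display are predefined in Databricks notebooks, and the cluster is assumed to have IAM access to the bucket:

# Mount an S3 bucket into the workspace file system.
dbutils.fs.mount(
    source="s3a://my-example-bucket",
    mount_point="/mnt/example-data",
)

# Files in the bucket are now addressable through the mount point.
df = spark.read.json("/mnt/example-data/events/")
display(df)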

HomeAway

HomeAway replaced its homegrown environment with Databricks to simplify the management of its Spark infrastructure through native access to S3, interactive notebooks, and cluster management capabilities. With Databricks, the productivity of its data science team increased dramatically, allowing the team to spend more time on rapid prototyping and asking more questions of its data.

Thank you!

Remember to complete your evaluations!