AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Bhartia, Principal Solutions Architect, APN
Andy Kimbrough, Sr. Mgr. of Engineering, Amazon S3
Paul Scott-Murphy, VP of Product Mgmt., WANdisco
Extending Hadoop and Spark to the AWS Cloud with AWS Technology Partners
What to Expect from the Session
• Learn how to easily and seamlessly transition or extend Hadoop and Spark in AWS
• Patterns for migrating data from Hadoop clusters to Amazon S3
• Learn about solutions offered by AWS Big Data Technology Competency Partners
• Automated deployment of Partner solutions on the AWS Cloud for minimal disruption
Big Data workloads in the Cloud
Reduce Costs
• Optimize infrastructure for the workload
• Decouple compute from storage
Increase Speed
• Launch resources as needed without any planning
• Provide self-service for users
Innovations
• Test new ideas, new frameworks without any commitment
• Bring new products to market
GE Oil & Gas is migrating 500 applications, and more than 750 TB of data, to the cloud by the end of 2016 as part of a major digital transformation, helping it attain a 52% reduction in TCO and greater speed to market.
Amazon S3 – Storage for Big Data
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance (example below)
• Multipart upload
• Range GET
• Parallelize LIST requests
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon EMR (Elastic MapReduce)
• Amazon Redshift
• Amazon DynamoDB
• Spark, Hive, Impala, Presto
• Many others
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notifications
• Lifecycle policies
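As a hedged illustration of the parallel-access features above (multipart upload and Range GET) through the AWS CLI; the bucket, key, and byte range are placeholders:

$ # Fetch only the first 1 MiB of an object with a Range GET.
$ aws s3api get-object --bucket my-bucket --key data/part-00000 \
    --range bytes=0-1048575 part-00000.head

$ # The high-level CLI copy switches to parallel multipart upload
$ # automatically for large files.
$ aws s3 cp large-dataset.gz s3://my-bucket/data/large-dataset.gz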
Amazon S3 – Storage Management
• Data classification & management
• Lifecycle policies (example below)
• Cross-region replication
• Event notifications
• CloudWatch metrics
• S3 Inventory
• Storage analytics
• Audit with CloudTrail data events
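A minimal sketch of a lifecycle policy that tiers objects under a prefix to Standard-IA after 30 days and expires them after a year; the bucket name, prefix, and day counts are assumptions for illustration:

$ cat lifecycle.json
{
  "Rules": [{
    "ID": "tier-then-expire-logs",
    "Filter": { "Prefix": "logs/" },
    "Status": "Enabled",
    "Transitions": [{ "Days": 30, "StorageClass": "STANDARD_IA" }],
    "Expiration": { "Days": 365 }
  }]
}
$ aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
    --lifecycle-configuration file://lifecycle.json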
Amazon S3 with Big Data workloads
• EMRFS with Amazon EMR
• Open-source Hadoop/Spark connector (S3A) - see the sketch below
  • Consistency - S3Guard
  • Performance - lazy seek, connection re-use
  • AWS SDK - multipart upload
• Other open-source integrations
  • Hue, Alluxio, Presto
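A hedged sketch of using the S3A connector from the command line and from Spark; it assumes S3A and AWS credentials are already configured in core-site.xml, and my_job.py is a placeholder job script:

$ # List S3 data through the S3A connector.
$ hadoop fs -ls s3a://my-bucket/input/
$ # S3A paths can be passed to Spark jobs just like HDFS paths.
$ spark-submit my_job.py s3a://my-bucket/input s3a://my-bucket/output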
Migrating Big Data workloads to the Cloud
[Diagram: two migration patterns for an HDFS application. Lift-and-Shift moves the application's input and output to the cloud in a one-time copy; Burst-or-Extend keeps the on-prem cluster and backs up or copies data periodically or continuously.]
Patterns for migrating data from Hadoop
• AWS Snowball with HDFS interface
• AWS Import/Export
• Amazon EMR with s3-dist-cp
• Amazon S3 APIs
• AWS Technology Partners
• Amazon Kinesis Streams and Firehose
• AWS DMS
AWS Snowball with HDFS Interface! (NEW)
$ snowball cp -n hdfs://HOST:PORT/PATH_TO_FILE_ON_HDFS s3://BUCKET-NAME/DESTINATION-PATH
Distributed Copy (s3DistCp)
• Works like a MapReduce job with Amazon S3 as a target
• Best for periodic data backups

$ s3-dist-cp --src s3://mybucket/prefix --dest hdfs:///folders --srcPattern pattern

[Diagram: copying data between an on-prem cluster and an Amazon S3 bucket]
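On Amazon EMR, s3-dist-cp can also be submitted as a cluster step; a hedged sketch via the AWS CLI, where the cluster ID and paths are placeholders:

$ aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=S3DistCp,Jar=command-runner.jar,Args=[s3-dist-cp,--src,hdfs:///data,--dest,s3://my-bucket/data]'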
Distributed Copy
• Workflow management - Apache Falcon
• Connectivity is the key - AWS Direct Connect
• Remember:
  • Dealing with Kerberos authentication (across clusters)
  • Needs scheduled workflow management
  • Can easily saturate the bandwidth (throttled example below)
  • Needs compute for moving data
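With vanilla Hadoop DistCp the copy can be throttled so it does not saturate the link; a minimal sketch, where the NameNode host, bucket, mapper count, and per-mapper bandwidth cap (MB/s) are placeholders:

$ hadoop distcp -m 20 -bandwidth 50 \
    hdfs://namenode:8020/data/logs s3a://my-bucket/data/logs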
WANdisco Fusion for Hadoop
Advantages
• No extra compute on your cluster
• No management or workflow required
• Available via AWS Marketplace
Learn more at https://www.wandisco.com/product/amazon-s3-active-migrator
Stop by booth #2524
Moving to AWS is more than just “lift and shift”
Big Data solution running in a non-AWS cloud environment:
• Rigid / inflexible
• Low utilization
• High cost
Big Data solution running on AWS:
• Flexible - scale up / down in minutes
• Size to your needs in less than one hour
• Constantly optimize cost: price reductions + innovations
Step 1: Migrate Data
Step 2: Process
Step 3: Leverage
AWS Big Data Competency Partners
Hortonworks Data Cloud • AWS Marketplace • AWS Quick Start
• Support for usage-based pricing models with data in Amazon S3
• Complement the functionality with managed services or clusters
Symantec: Provisioning Big Data Platform
http://www.slideshare.net/HadoopSummit/provisioning-big-data-platform-using-cloudbreak-ambari
Leverage Amazon S3 with what you prefer
• Hive/LLAP with Amazon S3 (see the sketch after this list) - http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/
• Impala with Amazon S3 - https://www.cloudera.com/documentation/enterprise/latest/topics/impala_s3.html
• Drill with Amazon S3 - https://www.mapr.com/resources/videos/sql-queries-data-amazon-s3-storage-drill-demo
• Databricks File System - https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/10%20Databricks%20File%20System%20-%20DBFS.html
• Vertica External Flex Tables - https://community.dev.hpe.com/t5/Vertica-Blog/Automatic-HP-Vertica-Database-Loader-for-AWS-S3/ba-p/230344
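One common way to query S3-resident data in place is a Hive external table over an S3A path; a hedged sketch, where the table name, schema, and bucket are assumptions for illustration:

$ hive -e "
    CREATE EXTERNAL TABLE clicks (user_id STRING, url STRING, ts BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3a://my-bucket/data/clicks/';
    SELECT COUNT(*) FROM clicks;"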
Databricks File System (DBFS)
DBFS is a distributed file system that comes installed on Spark clusters in Databricks. It is a layer over S3 which allows you to:
• Mount S3 buckets to make them available to users in your workspace
• Cache S3 data on the solid-state disks (SSDs) of your worker nodes to speed up access
HomeAway
HomeAway replaced its homegrown environment with Databricks to simplify the management of its Spark infrastructure through native access to S3, interactive notebooks, and cluster management capabilities. With Databricks, the productivity of its data science team increased dramatically, allowing the team to spend more time on rapid prototyping and asking more questions of its data.