bigdata- on - aws cloud -1
BUSI758B
Big Data Analytics On
Amazon Web Services
Yelp saved $55,000 in upfront hardware costs.
Unilever processes genetic sequences 20 times faster.
Swipely generates insights from millions of credit card transactions.
Expedia processes clickstream data from its global network of websites.
The Big Question is: How?
The Answer is:
Some Background on Cloud Computing and AWS
What is Cloud Computing?
"Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." - NIST Definition
This cloud model is composed of five essential characteristics, three service models, and four deployment models.
Essential Characteristics:
- On-demand self-service
- Broad network access
- Resource pooling
- Rapid elasticity
- Measured service
Service Models:
- IaaS providers: AWS, HP Cloud, Rackspace
- PaaS providers: Google App Engine, Heroku, Red Hat OpenShift
- SaaS providers: Salesforce, LinkedIn, Taleo
Deployment Models:
- Public Cloud
- Private Cloud
- Hybrid Cloud
- Community Cloud*
* NIST defines a community cloud as cloud infrastructure provisioned for exclusive use by a specific
community of consumers from organizations that have shared concerns (e.g., mission, security requirements,
policy, and compliance considerations).
Now, a Few Questions:
1. Which service model does AWS fall into?
2. What are the advantages of using a cloud platform for big data?
3. How does AWS leverage those advantages to provide big data analytics?
Advantages of a Cloud Platform
- Ability to scale the infrastructure
- OPEX instead of CAPEX
- Custom solutions as needed
- Easier and faster deployment
- Helps you focus on core business solutions and analytics
So it can safely be said that the cloud platform acts as an enabler of big data
technology.
AWS Big Data Analytics :
Elastic MapReduce(EMR)
Hadoop as a Service
- Amazon Elastic MapReduce supports the Hadoop software ecosystem (Hadoop 1.x, Hadoop 2.x).
- The Amazon EMR control software is responsible for automated arrangement, coordination, and
  management of the Hadoop cluster.
- Amazon Elastic MapReduce also supports MapR, an Apache Hadoop-derived software distribution.
Integrated With Tools
Amazon EMR gives you root access to the cluster.
Any additional software you need can be installed and configured on the cluster before
Hadoop starts by creating a bootstrap action.
* Spark is installed using a bootstrap action.
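As a rough illustration of the bootstrap action idea above, the sketch below builds the kind of entry that boto3's EMR `run_job_flow` call accepts in its `BootstrapActions` list. The bucket name, script path, and arguments are hypothetical placeholders, and nothing is sent to AWS here.

```python
from typing import Dict, List


def bootstrap_action(name: str, script_path: str, args: List[str]) -> Dict:
    """Build one entry for the BootstrapActions list of an EMR cluster request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_path, "Args": list(args)},
    }


# Hypothetical bucket, script, and argument: replace with a script you own.
install_tools = bootstrap_action(
    name="Install extra tools",
    script_path="s3://example-bucket/bootstrap/install-tools.sh",
    args=["--with-monitoring"],
)

# This list would be passed as the BootstrapActions argument when the
# cluster is created; the script then runs on the nodes before Hadoop starts.
bootstrap_actions = [install_tools]
print(bootstrap_actions)
```

Keeping the action as plain data makes it easy to review or version the setup script reference separately from the code that launches the cluster.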
MapReduce Engine
- Job/Task
- Roles of servers:
  a> Master Node
  b> Core Node
  c> Task Node
- Step: unit of work
The MapReduce engine implements the distributed processing
framework of Hadoop.
MapReduce Engine (cont.)
Hadoop       AWS
Name Node    Master Node
Data Node    Core Node
Additional concepts: Task Nodes and Steps
Task Node - Task nodes are optional. You can add task nodes when you start
the cluster, or you can add task groups to a running cluster. Because they do
not store data and can be added to and removed from a cluster, you can use task
nodes to manage the EC2 instance capacity your cluster uses, increasing
capacity to handle peak loads and decreasing it later.
Steps - A step contains one or more Hadoop jobs. It is an instruction to
manipulate data using Hadoop jobs.
The maximum number of pending and active steps allowed in a cluster is 256.
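The node roles and steps described above can be sketched as plain data in the shape boto3's EMR client uses for instance groups and steps. The instance type, counts, bucket, and JAR path are hypothetical, and the 256 limit is the figure quoted in these slides rather than something the code checks against AWS.

```python
from typing import Dict, List

# Limit on pending plus active steps, as quoted in the slides.
MAX_PENDING_AND_ACTIVE_STEPS = 256

VALID_ROLES = ("MASTER", "CORE", "TASK")


def instance_group(role: str, instance_type: str, count: int) -> Dict:
    """Describe one group of EC2 instances that share a cluster role."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role}")
    return {
        "Name": f"{role.lower()} group",
        "InstanceRole": role,
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def hadoop_step(name: str, jar: str, args: List[str]) -> Dict:
    """Describe one step: a named unit of work that runs a Hadoop JAR."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


def check_step_limit(steps: List[Dict]) -> None:
    """Refuse to queue more steps than the quoted limit allows."""
    if len(steps) > MAX_PENDING_AND_ACTIVE_STEPS:
        raise ValueError("too many pending and active steps")


# Hypothetical sizing: one master, two core nodes, four optional task nodes.
groups = [
    instance_group("MASTER", "m3.xlarge", 1),
    instance_group("CORE", "m3.xlarge", 2),
    instance_group("TASK", "m3.xlarge", 4),
]

# Hypothetical JAR and S3 paths.
steps = [
    hadoop_step(
        "word count",
        "s3://example-bucket/jars/wordcount.jar",
        ["s3://example-bucket/input/", "s3://example-bucket/output/"],
    )
]
check_step_limit(steps)
```

Because task nodes hold no data, the TASK group is the one whose count would be raised for peak load and lowered afterwards.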
Massively Parallel
- Virtual instances: much easier to scale.
- Quick and cost-effective scaling.
- Dynamic resizing while the job is running.
- A distributed Hadoop system in the true sense.
- Multiple clusters can access the same data.
Cost-Effective AWS Wrapper
- Spot Instances.
- Pay as you go.
- Automatic cluster termination after job completion.
- Licensed software bundled with the infrastructure.
- Economy of scale.
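A minimal sketch of how two of the cost levers above (Spot Instances and automatic termination) might appear in a cluster request, plus the pay-as-you-go arithmetic. The field names follow boto3's EMR `Instances` structure; the instance type, bid, and hourly prices are made-up numbers for illustration only, not real AWS prices.

```python
from typing import Dict


def spot_task_group(instance_type: str, count: int, bid_price_usd: str) -> Dict:
    """Describe a task group that requests Spot capacity at a maximum bid."""
    return {
        "Name": "spot task group",
        "InstanceRole": "TASK",
        "Market": "SPOT",
        "BidPrice": bid_price_usd,  # the API expects the bid as a string
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def estimated_cost(hourly_price_usd: float, instances: int, hours: float) -> float:
    """Pay-as-you-go estimate: price per instance-hour x instances x hours."""
    return hourly_price_usd * instances * hours


# Only the task group is shown; a real request also needs a master group.
instances_config = {
    "InstanceGroups": [spot_task_group("m3.xlarge", 4, "0.10")],
    # False asks EMR to shut the cluster down once no steps remain.
    "KeepJobFlowAliveWhenNoSteps": False,
}

# Hypothetical prices, chosen only to show the comparison.
on_demand = estimated_cost(0.30, instances=4, hours=3)
spot = estimated_cost(0.10, instances=4, hours=3)
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
```

Spot capacity can be reclaimed, which is why the sketch applies it to task nodes, which store no data.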
Integrated With AWS Services
- Amazon EMR is integrated with other Amazon Web Services such as Amazon EC2,
  Amazon S3, DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline.
- You can easily access data stored in AWS from an EMR cluster and use the
  functionality offered by other Amazon Web Services to manage your cluster and
  store its output.
Compute: EC2
Networking: VPC, ELB, Route 53
Storage: EBS, S3, Glacier
Data Services: RDS, DynamoDB, Redshift
Deployment and Management: AWS Management Console, AWS Command Line Interface, AWS IAM, CloudWatch
Life Cycle of EMR Cluster
How to Launch and Connect to an EMR Cluster - Quick Demo
Click on Create Cluster
- Provide a cluster name for easier identification.
- Set Termination Protection to 'Yes' to prevent accidental
  termination of the cluster.
- Enable logging so that cluster activity is logged automatically.
- Provide an S3 folder location for the logs.
- Enable debugging so that cluster activity can be troubleshot.
- Tags are optional, but using them is encouraged.
- A tag is a key/value pair that is associated with every resource in the cluster.
- Tags help in monitoring and managing cluster resources.
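The console settings above have rough programmatic counterparts. The sketch below gathers the name, termination protection, log location, and tags into the shape boto3's EMR `run_job_flow` call expects. The cluster name, bucket, and tag are hypothetical, the debugging option is not covered, and the request is deliberately incomplete (no instance groups or IAM roles), so it is only printed, not submitted.

```python
from typing import Dict


def cluster_request(
    name: str,
    log_uri: str,
    tags: Dict[str, str],
    termination_protected: bool = True,
) -> Dict:
    """Collect the demo's console choices into a partial cluster request."""
    return {
        "Name": name,
        "LogUri": log_uri,  # S3 folder where cluster logs are written
        "Instances": {"TerminationProtected": termination_protected},
        "Tags": [{"Key": key, "Value": value} for key, value in tags.items()],
    }


# Hypothetical values standing in for what you would type in the console.
request = cluster_request(
    name="demo-cluster",
    log_uri="s3://example-bucket/emr-logs/",
    tags={"project": "big-data-demo"},
)
print(request)

# With instance groups and roles added, and AWS credentials configured,
# the request could be submitted roughly like this:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Expressing the settings as data makes it easy to keep termination protection and tagging consistent across clusters.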