BUSI758B Big Data Analytics On Amazon Web Services

Upload: milind-gunjan

Post on 22-Jul-2015

Page 1: BigData- On - AWS Cloud -1

BUSI758B

Big Data Analytics On

Amazon Web Services

Page 2: BigData- On - AWS Cloud -1

Yelp saved $55,000 in upfront hardware costs.

Unilever processes genetic sequences 20 times faster.

Swipely generates insights from millions of credit card transactions.

Expedia processes clickstream data from its global network of websites.

Page 3: BigData- On - AWS Cloud -1

The big question is:

How?

Page 4: BigData- On - AWS Cloud -1

The answer is:

Page 5: BigData- On - AWS Cloud -1

Some Background on Cloud Computing and AWS

Page 6: BigData- On - AWS Cloud -1

What is Cloud Computing?

"Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." - NIST definition

This cloud model is composed of five essential characteristics, three service models, and four deployment models.

Page 7: BigData- On - AWS Cloud -1

Essential Characteristics:

- On-demand self-service.
- Broad network access.
- Resource pooling.
- Rapid elasticity.
- Measured service.

Page 8: BigData- On - AWS Cloud -1

Service Models:

IaaS providers: AWS, HP Cloud, Rackspace

PaaS providers: Google App Engine, Heroku, Red Hat OpenShift

SaaS providers: Salesforce, LinkedIn, Taleo

Page 9: BigData- On - AWS Cloud -1

Deployment Models:

- Public cloud
- Private cloud
- Hybrid cloud
- Community cloud*

* NIST defines a community cloud as cloud infrastructure provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations).

Page 10: BigData- On - AWS Cloud -1

Now a few questions:

1. Which service model does AWS fall into?
2. What are the advantages of using a cloud platform for Big Data?
3. How does AWS leverage those advantages to provide Big Data analytics?

Page 11: BigData- On - AWS Cloud -1

Advantages of a Cloud Platform

- Ability to scale the infrastructure.
- OPEX instead of CAPEX.
- Custom solutions as needed.
- Easier and faster deployment.
- Helps focus on core business solutions and analytics.

So it can safely be said that the cloud platform acts as an enabler of Big Data technology.

Page 12: BigData- On - AWS Cloud -1

AWS Big Data Analytics:

Page 13: BigData- On - AWS Cloud -1

Elastic MapReduce (EMR)

Page 15: BigData- On - AWS Cloud -1

Hadoop as a Service

- Amazon Elastic MapReduce supports the Hadoop software ecosystem (Hadoop 1.x and Hadoop 2.x).
- The Amazon EMR control software is responsible for the automated arrangement, coordination, and management of the Hadoop cluster.
- Amazon Elastic MapReduce also supports MapR, which is Apache Hadoop-derived software.

Page 16: BigData- On - AWS Cloud -1

Integrated With Tools

Amazon EMR gives you root access to the cluster.

Additional software can be installed and configured on the cluster before Hadoop starts by creating a bootstrap action.

* Spark is installed using a bootstrap action.
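
A bootstrap action can be sketched as a small request fragment. The example below is only a sketch, assuming the AWS SDK for Python (boto3), whose `run_job_flow` call accepts a `BootstrapActions` list of this shape. The helper name `bootstrap_action`, the bucket, and the script path are hypothetical placeholders, and the code makes no AWS call.

```python
def bootstrap_action(name, script_path, args=None):
    """Build one entry for the BootstrapActions list (shape assumed from boto3)."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,       # S3 location of the script run on each node
            "Args": list(args or []),  # arguments passed to that script
        },
    }


# Hypothetical script location; replace it with a script you own.
install_tools = bootstrap_action(
    "Install extra tools",
    "s3://my-example-bucket/bootstrap/install-tools.sh",
    ["--verbose"],
)
print(install_tools)
```

The resulting list would typically be passed as `BootstrapActions=[install_tools]` when the cluster is created, so the script runs before Hadoop starts.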

Page 17: BigData- On - AWS Cloud -1

MapReduce Engine

- Job/Task
- Roles of servers:
  a) Master node
  b) Core node
  c) Task node
- Step: unit of work

The MapReduce engine implements the distributed processing framework of Hadoop.

Page 18: BigData- On - AWS Cloud -1

MapReduce Engine (cont.)

Hadoop       AWS
NameNode     Master node
DataNode     Core node

Additional concepts: task nodes and steps

Task node: Task nodes are optional. You can add task nodes when you start the cluster, or you can add task groups to a running cluster. Because they do not store data and can be added to and removed from a cluster, you can use task nodes to manage the EC2 instance capacity your cluster uses, increasing capacity to handle peak loads and decreasing it later.
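
The three node roles can be illustrated as instance-group definitions. This is a minimal sketch, assuming the `InstanceGroups` entry shape used by boto3's EMR client (`InstanceRole` of `MASTER`, `CORE`, or `TASK`). The helper names and the `m3.xlarge` instance type are illustrative choices, not taken from the slides.

```python
VALID_ROLES = ("MASTER", "CORE", "TASK")


def instance_group(role, instance_type, count):
    """Build one InstanceGroups entry for the given node role."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role}")
    return {
        "Name": f"{role.lower()}-group",
        "InstanceRole": role,
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


def cluster_groups(core_count, task_count=0, instance_type="m3.xlarge"):
    """One master, some core nodes, and an optional task group for peak load."""
    groups = [
        instance_group("MASTER", instance_type, 1),
        instance_group("CORE", instance_type, core_count),
    ]
    if task_count > 0:  # task nodes are optional and hold no HDFS data
        groups.append(instance_group("TASK", instance_type, task_count))
    return groups


normal = cluster_groups(core_count=2)
peak = cluster_groups(core_count=2, task_count=4)
print([g["InstanceRole"] for g in peak])
```

Only the task group differs between `normal` and `peak`, which mirrors the idea that task capacity can be added for peak load and removed later without touching stored data.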

Steps: A step contains one or more Hadoop jobs. A step is an instruction to manipulate data using Hadoop jobs.

The maximum number of pending and active steps allowed in a cluster is 256.
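
A step and the 256-step limit can be sketched as follows. The step dictionary assumes the `Steps` entry shape used by boto3's EMR client; the JAR path, S3 locations, and helper names are hypothetical, and the limit check is a local illustration rather than something the service exposes in this form.

```python
MAX_PENDING_AND_ACTIVE_STEPS = 256  # limit quoted on the slide


def hadoop_step(name, jar, args):
    """Build one Steps entry that runs a Hadoop JAR."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


def can_submit(pending_and_active, new_steps):
    """Return True if adding new_steps stays within the step limit."""
    return pending_and_active + len(new_steps) <= MAX_PENDING_AND_ACTIVE_STEPS


# Hypothetical JAR and data locations.
word_count = hadoop_step(
    "Word count",
    "s3://my-example-bucket/jars/wordcount.jar",
    ["s3://my-example-bucket/input/", "s3://my-example-bucket/output/"],
)
print(can_submit(255, [word_count]))
print(can_submit(256, [word_count]))
```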

Page 19: BigData- On - AWS Cloud -1

Massively Parallel

- Virtual instances are much easier to scale.
- Quick and cost-effective scaling.
- Dynamic resizing while a job is running.
- A distributed Hadoop system in the true sense.
- Multiple clusters can access the same data.

Page 20: BigData- On - AWS Cloud -1

Cost-Effective AWS Wrapper

- Spot Instances.
- Pay as you go.
- Automatic cluster termination after job completion.
- Licensed software bundled with the infrastructure.
- Economies of scale.
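
Two of these cost levers, Spot Instances and automatic termination, can be expressed in the cluster's instance configuration. The sketch below assumes the `Instances` parameter shape of boto3's `run_job_flow` (`Market`, `BidPrice`, `KeepJobFlowAliveWhenNoSteps`). The function name, instance type, and bid value are illustrative assumptions, not real pricing.

```python
def transient_spot_instances(core_count, task_count, bid_price_usd,
                             instance_type="m3.xlarge"):
    """Instances config: on-demand master/core, Spot task nodes, auto-terminate."""
    return {
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": instance_type, "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": instance_type, "InstanceCount": core_count},
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": f"{bid_price_usd:.2f}",  # the bid is sent as a string
             "InstanceType": instance_type, "InstanceCount": task_count},
        ],
        # False means the cluster shuts down once its last step finishes.
        "KeepJobFlowAliveWhenNoSteps": False,
    }


config = transient_spot_instances(core_count=2, task_count=4, bid_price_usd=0.10)
spot_groups = [g for g in config["InstanceGroups"] if g["Market"] == "SPOT"]
print(spot_groups[0]["BidPrice"])
```

Putting only the task group on the Spot market is a design choice for this sketch: task nodes hold no data, so losing them interrupts work without losing stored data.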

Page 21: BigData- On - AWS Cloud -1

Integrated With AWS Services

- Amazon EMR is integrated with other Amazon Web Services such as Amazon EC2, Amazon S3, DynamoDB, Amazon RDS, CloudWatch, and AWS Data Pipeline.
- You can easily access data stored in AWS from an EMR cluster and use the functionality offered by other Amazon Web Services to manage your cluster and store its output.

Compute: EC2

Networking: VPC, ELB, Route 53

Storage: EBS, S3, Glacier

Data services: RDS, DynamoDB, Redshift

Deployment and management: AWS Management Console, AWS Command Line Interface, AWS IAM, CloudWatch

Page 22: BigData- On - AWS Cloud -1

Life Cycle of EMR Cluster

Page 23: BigData- On - AWS Cloud -1

How to Launch and Connect to an EMR Cluster: Quick Demo

Page 25: BigData- On - AWS Cloud -1

Click on Create Cluster

Page 26: BigData- On - AWS Cloud -1

- Provide a cluster name for easier identification.
- Set Termination Protection to 'Yes' to prevent accidental termination of the cluster.
- Enable logging so that cluster activity is logged automatically.
- Provide an S3 folder location for the logs.
- Enable debugging so that problems with cluster activity can be investigated.
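
The console settings above correspond roughly to a few request fields when a cluster is created programmatically. The sketch below assumes the `Name`, `LogUri`, and `Instances.TerminationProtected` parameters of boto3's `run_job_flow`. The helper name, cluster name, and bucket are hypothetical, and the console's debugging option is not modeled here.

```python
def cluster_settings(name, log_bucket, termination_protected=True):
    """Collect the general cluster options as run_job_flow-style fields."""
    return {
        "Name": name,                                # shown in the cluster list
        "LogUri": f"s3://{log_bucket}/emr-logs/",    # S3 folder that receives logs
        "Instances": {
            "TerminationProtected": termination_protected,  # guards against accidental shutdown
        },
    }


settings = cluster_settings("busi758b-demo", "my-example-bucket")
print(settings["LogUri"])
```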

Page 27: BigData- On - AWS Cloud -1

- Tags are optional but always encouraged.
- A tag is a key/value pair that is associated with every resource in the cluster.
- Tags help in monitoring and managing cluster resources easily.
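
Tags as key/value pairs can be sketched like this. The list-of-`Key`/`Value` shape is what boto3's EMR client accepts for its `Tags` parameter; the helper name and the tag names and values are made up for illustration.

```python
def to_emr_tags(tags):
    """Convert a plain dict into the Key/Value list form used for EMR tags."""
    return [{"Key": key, "Value": value} for key, value in sorted(tags.items())]


# Hypothetical tags for identifying and tracking the cluster's resources.
tags = to_emr_tags({"team": "analytics", "course": "BUSI758B"})
print(tags)
```

Sorting by key is only there to make the output deterministic; the order of tags carries no meaning.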
