B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1


Page 1: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Rich's Big Data Analytics Training

1

B. RAMAMURTHY

Cloud Strategy II

7/11/2014

Page 2: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Tableau vs Qlikview vs Spotfire

Source not available: this is from a cached page.

R integration is currently the direction taken by all three.

Page 3: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

The Context: Big-data

• Data mining huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
• We are in a knowledge economy.
  • Data is an important asset to any organization
  • Discovery of knowledge; enabling discovery; annotation of data
  • Complex computational models
  • No single environment is good enough: need elastic, on-demand capacities
• We are also looking at newer
  • Programming models
  • Supporting algorithms and data structures
• We need a rapid prototyping environment for learning these

Page 4: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

The Context: Big-data Strategies

• What are your strategies for tapping into emerging technologies?
• Exploring the emerging technologies before investing in them: make informed decisions
• Training your workforce
• The Cloud infrastructure
  • Inclusion of cloud and computing on the cloud in your environment has become indispensable to keep up with the emerging technologies
• Newer methods and algorithms
  • Maintaining your competitive edge by including newer approaches to data analytics and visualization: R, JS libraries, tapping into APIs provided by social media

Page 5: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Outline for Today

• Amazon cloud hands-on exercises (take 2)
• Google App Engine – concept
• Google App Engine – hands-on exercises
• Map-reduce (MR) algorithm
• Hadoop infrastructure
• MR on Amazon Web Services
• Moving forward
• Summary

Page 6: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Cloud Model & Enabling Technologies

[Diagram labels: 64-bit processor; multi-core architectures; virtualization (bare metal, hypervisor) with VM0, VM1, ..., VMn; web services (WS), SOA, WS standards; services interface; storage models (S3, BigTable, BlobStore, ...); bandwidth; cloud applications (data-intensive, compute-intensive, storage-intensive).]

Page 7: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Common Features of Cloud Providers

[Diagram labels: development environment (IDE, SDK, plugins); production environment; simple storage; table store <key, value>; drives; accessible through web services; management console and monitoring tools & multi-level security.]

Page 8: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Public Cloud vs. Private Cloud

Rationale for Private Cloud:
• Security and privacy of business data was a big concern
• Potential for vendor lock-in
• Service Level Agreements (SLAs) required for real-time performance and reliability
• Cost savings of the shared model achieved because of the multiple projects that the company is actively developing

Page 9: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Cloud Computing for the Enterprise: What Should IT Do

• Revise cost model to utility-based computing: CPU/hour, GB/day etc.
• Include hidden costs for management, training
• Different cloud models for different applications - evaluate
• Use for prototyping applications and learn
• Link it to current strategic plans for Services-Oriented Architecture, Disaster Recovery, etc.
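The utility-based cost model can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name, the rates, and the 20% overhead standing in for hidden management and training costs are all assumptions, not figures from any provider.

```python
def monthly_cost(cpu_hours, gb_days, cpu_rate, storage_rate, overhead=0.0):
    """Estimate a utility-style bill.

    cpu_hours    -- total instance hours used in the month
    gb_days      -- storage used, in gigabyte-days
    cpu_rate     -- price per CPU hour (hypothetical)
    storage_rate -- price per GB per day (hypothetical)
    overhead     -- fraction added for hidden costs such as management and training
    """
    usage = cpu_hours * cpu_rate + gb_days * storage_rate
    return usage * (1.0 + overhead)


# Two instances running all month (2 * 720 hours) and 500 GB kept for 30 days.
estimate = monthly_cost(
    cpu_hours=2 * 720,
    gb_days=500 * 30,
    cpu_rate=0.10,       # assumed price, not a real quote
    storage_rate=0.001,  # assumed price, not a real quote
    overhead=0.20,       # assumed share of hidden costs
)
print(round(estimate, 2))
```

Keeping the overhead as an explicit parameter makes the hidden costs visible when comparing cloud models for different applications.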

Page 10: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Cloud Infrastructure

Essential component of today’s IT
• Running legacy applications: 32-bit, old obsolete languages, older versions of software.
• Launching emerging applications: don’t have the infrastructure for a cluster or large machines; map-reduce cluster, social media data collection cluster.
• Prototyping a setup before investing in it.
• Load balancing: address sudden surge in traffic. Use it when there is wide variability in traffic.
• Temporary setup such as for a conference, meetings, tournaments (Example: US Open, FIFA in South Africa, Ebola camp).
• Establishing IT in places where there is no infrastructure (South Pole, Amazon jungle).
• Rapid prototyping: quickly get something going to address an emergency situation or for disaster mitigation.
• Plain and simple: run your business on the cloud. Success story is Netflix.

Page 11: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

REVIEW MATERIAL FROM SESSION 4: CLOUD STRATEGY I

Amazon Web Services (AWS)

Page 12: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Getting Started with AWS

• Starting point is the excellent documentation in: http://docs.aws.amazon.com/gettingstarted/latest/awsgsg-intro/awsgsg-intro.pdf
• Go through this document before you launch anything on AWS.
• We will use an Amazon Machine Image (AMI) to launch and connect to a Windows machine/instance.
• We will use a Linux AMI to launch a Linux machine and work with it.
• There are many other applications such as data workflows, data pipeline, elastic map reduce, etc.
• We will also deploy a map-reduce application (wordcount) on AWS.

Page 13: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Google App Engine

Page 14: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

What is Google App Engine?

• Google App Engine is a Platform as a Service (PaaS).
• It lets you build and run applications on Google’s infrastructure.
• App Engine applications are easy to build, easy to maintain, and easy to scale as your traffic and data storage needs change.
• With App Engine, there are no servers for you to maintain.
• You simply upload your application and it is ready to go.

Page 15: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

GAE Features

App Engine makes it easy to build and deploy an application that runs reliably even under heavy load and with large amounts of data. It includes the following features:
• Persistent storage with queries, sorting, and transactions.
• Automatic scaling and load balancing.
• Asynchronous task queues for performing work outside the scope of a request.
• Scheduled tasks for triggering events at specified times or regular intervals.
• Integration with other Google cloud services and APIs.

Applications run in a secure, sandboxed environment, allowing App Engine to distribute requests across multiple servers and to scale servers to meet traffic demands.

Page 16: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

How to deploy an application?

• We will work from the Eclipse environment we have already downloaded.
• You need to download the Google App Engine plugin and use that to deploy the application.
• You need a Google App Engine account before you can deploy anything. Please do that before we proceed.
• We will deploy two applications we developed earlier: Hangman and the three.js application that shows a cube and a sphere.

Page 17: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

GAE capacities

• All applications can use up to 1 GB of storage and enough CPU and bandwidth to support an efficient app serving around 5 million page views a month, absolutely free.
• Runs your web applications on Google's infrastructure.
• Google App Engine is a fully integrated development environment.
• You can serve your applications with your own domain (such as http://richs.com), or you can use the App Engine domain for free (just like http://svg2bina.appspot.com).
• You can use server-side languages such as PHP, Java, Python, and Go.

Page 18: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Monitor Your App

Page 19: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Summarizing Google App Engine

• Among the cloud offerings, Google App Engine has the least expensive model for cloud deployment.
• The learning curve for the process of deployment is not that steep.
• Moreover, Eclipse has a plugin to simplify the process.
• But all the infrastructure is hidden from you; you access it only through the API and the services offered by GAE.

Page 20: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Hadoop-MapReduce

Page 21: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Google File System

• The Internet introduced a new challenge in the form of web logs and web crawler data: large scale, “peta scale”.
• But observe that this type of data has a uniquely different characteristic from your transactional or “customer order” data: “write once read many (WORM)”
  • Privacy protected healthcare and patient information
  • Historical financial data
  • Other historical data
  • Transactional data from your sales
  • Manufacturing data for quality control
• Google exploited this WORM characteristic in its Google File System (GFS) for running massive parallel processes/programs.

Page 22: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

What is Hadoop?

• At Google, MapReduce operations are run on a special file system called Google File System (GFS) that is highly optimized for this purpose.
• However, GFS is not open source.
• Doug Cutting and others, with backing from Yahoo!, built an open-source counterpart modeled on the published GFS design and called it Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop.
• This is now open source and distributed by Apache.

Page 23: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Basic Features: HDFS

• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
• HDFS provides a Java API for applications to use.
• It also provides a streaming API for other languages.
• An HTTP browser can be used to browse the files of an HDFS instance.

Page 24: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Fault tolerance

• Failure is the norm rather than the exception in a large network.
• An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data.
• With a huge number of components, each with a non-trivial probability of failure, there is always some component that is non-functional.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
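A rough calculation shows why failure is the norm at this scale. The sketch below assumes components fail independently with the same probability, which is a simplification; the numbers (1000 machines, a 0.1% chance each) are illustrative assumptions, not measurements.

```python
def prob_any_failed(n_components, p_fail):
    """Probability that at least one of n independent components has failed.

    Uses 1 - (1 - p)^n, i.e. one minus the chance that all components are healthy.
    """
    return 1.0 - (1.0 - p_fail) ** n_components


# One machine with a 0.1% chance of being down versus a thousand of them.
single = prob_any_failed(1, 0.001)
cluster = prob_any_failed(1000, 0.001)
print(single, cluster)
```

Even with very reliable individual machines, the chance that something in a thousand-machine cluster is down is well above one half, which is why automatic detection and recovery is built into the design.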

Page 25: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

HDFS Architecture

[Diagram labels: Namenode; metadata (name, replicas, ...), e.g. (/home/foo/data, 6, ...); metadata ops; clients; read; write; block ops; Datanodes and their blocks on Rack1 and Rack2; replication.]

Page 26: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Hadoop Distributed File System

[Diagram labels: application; local file system; master node; name nodes; HDFS client; HDFS server; block size: 2K; block size: 128M, replicated.]
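The large block size has a direct effect on how a file is laid out. The sketch below takes the 128M block size from the diagram; the replication factor of 3 is an assumption (the slide only says "replicated"), and the function names are made up for illustration.

```python
BLOCK_SIZE = 128 * 1024 ** 2  # 128M block size, as in the diagram


def blocks_needed(file_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies; the last block may be partly filled."""
    return -(-file_bytes // block_size)  # ceiling division on integers


def raw_bytes(file_bytes, replication=3):
    """Total bytes stored across the cluster when every block is replicated.

    replication=3 is an assumed factor, not a value stated on the slide.
    """
    return file_bytes * replication


one_gib = 1024 ** 3
print(blocks_needed(one_gib), raw_bytes(one_gib))
```

A 1 GiB file therefore splits into a handful of large blocks rather than hundreds of thousands of 2K blocks, which keeps the amount of per-block metadata small.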

Page 27: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

What is MapReduce?

• MapReduce is a programming model Google has used successfully in processing its “big-data” sets (more than 20 petabytes per day, according to the reference below).
• A map function extracts some intelligence from raw data.
• A reduce function aggregates, according to some guides, the data output by the map.
• Users specify the computation in terms of a map and a reduce function.
• The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
• The underlying system also handles machine failures, efficient communications, and performance issues.

Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.

Page 28: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Classes of problems “mapreducable”

• Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
• Google uses it for wordcount, adwords, pagerank, indexing data.
• Simple algorithms such as grep, text-indexing, reverse indexing
• Bayesian classification: data mining domain
• Any number of applications involving data mining and machine learning
• Facebook uses it for various operations: demographics
• Financial services use it for analytics
• Astronomy: Gaussian analysis for locating extra-terrestrial objects.
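Reverse indexing is a good example of a problem that fits the map and reduce shape. The sketch below runs in a single Python process rather than on a cluster; the document ids, sample texts, and function names are assumptions made up for illustration.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit a (word, doc_id) pair for every word in one document."""
    return [(word.lower(), doc_id) for word in text.split()]


def reduce_postings(word, doc_ids):
    """Reduce step: collapse all doc ids seen for a word into a sorted list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for doc_id, text in docs.items():
        for word, emitted_id in map_doc(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_postings(word, ids) for word, ids in grouped.items())


DOCS = {
    "d1": "cat sits on a roof",
    "d2": "the roof is a tin roof",
    "d3": "cat kicks the can",
}
index = build_index(DOCS)
print(index["roof"], index["cat"])
```

Each document can be mapped independently and each word reduced independently, which is what lets a real runtime spread the work over many machines.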

Page 29: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

MapReduce Example: Mapper

This is a cat
Cat sits on a roof
<this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>

The roof is a tin roof
There is a tin can on the roof
<the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>

Cat kicks the can
It rolls on the roof and falls on the next roof
<cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1> <falls 1> <next 1>

The cat rolls too
It sits on the can
<the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>

Page 30: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

MapReduce Example: Combiner, Reducer

<this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>
<the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>
<cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1> <falls 1> <next 1>
<the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>

Combine the counts of all the same words:
<cat <1,1,1,1>>
<roof <1,1,1,1,1,1>>
<can <1,1,1>>
…

Reduce (sum in this case) the counts:
<cat 4>
<can 3>
<roof 6>
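The mapper, combiner, and reducer steps of this example can be traced in a few lines of Python. This is a single-process sketch, not Hadoop code: the four splits are the sentence pairs from the mapper slide, and the function names are chosen here for illustration. The combiner follows the slide and only groups the 1s per word; a combiner in a real job would often pre-sum them.

```python
from collections import defaultdict

# The four input splits from the mapper slide, two sentences each.
SPLITS = [
    ["This is a cat", "Cat sits on a roof"],
    ["The roof is a tin roof", "There is a tin can on the roof"],
    ["Cat kicks the can", "It rolls on the roof and falls on the next roof"],
    ["The cat rolls too", "It sits on the can"],
]


def mapper(line):
    """Emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def combiner(pairs):
    """Group the 1s emitted within a single split by word."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reducer(word, values):
    """Sum all the counts collected for one word."""
    return word, sum(values)


def word_count(splits):
    shuffled = defaultdict(list)  # word -> all 1s from every split
    for split in splits:
        pairs = [pair for line in split for pair in mapper(line)]
        for word, ones in combiner(pairs).items():
            shuffled[word].extend(ones)
    return dict(reducer(word, ones) for word, ones in shuffled.items())


counts = word_count(SPLITS)
print(counts["cat"], counts["can"], counts["roof"])  # prints: 4 3 6
```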

Page 31: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

[Diagram: large-scale data splits each pass through a parse-hash step; map emits <key, value> pairs of the form <key, 1>; reducers (say, Count) produce partitions P-0000, P-0001, and P-0002 holding count1, count2, and count3.]
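The parse-hash step decides which reducer receives each key, so that all pairs with the same key end up in the same partition. The sketch below illustrates the idea in one process; using `zlib.crc32` as the hash, three reducers, and the `P-0000` style names for the outputs are assumptions for illustration and not a description of Hadoop's own partitioner.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer for a key using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_count(pairs, num_reducers=3):
    """Route each (key, value) pair to its partition and sum values per key."""
    partitions = defaultdict(lambda: defaultdict(int))
    for key, value in pairs:
        name = "P-%04d" % partition_for(key, num_reducers)
        partitions[name][key] += value
    return {name: dict(counts) for name, counts in partitions.items()}


# Map output: one <key, 1> pair per word.
pairs = [(word, 1) for word in "cat sits on a roof the roof is a tin roof".split()]
result = shuffle_and_count(pairs)
for name in sorted(result):
    print(name, result[name])
```

Because the partition depends only on the key, each reducer can compute its counts without talking to the others.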

Page 32: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Putting it all together

• Starting point: question. What do you want to know? Don’t try to fit a question into a technology.
• Exploratory data analysis
  • Use R to do EDA
• Explore on the cloud
  • For emerging technologies
  • For legacy technologies
• Visualize
  • JS and JS libraries: three.js, d3.js
• Analyze, present & make decisions
  • Gephi
  • Tableau/Qlikview
• Question answered?

Page 33: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

The Data Science Process (Review)

[Diagram with stages numbered 1 to 7. Labels: raw data collected; data is processed; data is cleaned; exploratory data analysis; machine learning algorithms, statistical models; build data products; communication, visualization, report findings; make decisions; micro-level data strategy.]

Page 34: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Big Data Training

[Diagram labels: BigData; DataInfrastructure; CloudStrategy; CloudAWS; CloudGAE; Hadoop: HDFS, MapReduce; DataAlgorithms; DataAnalytics; RStudio; R; StatModel; MLAlgs; DataCommunication; HighLevel; ProgLevel; Gephi; Tableau; JS; d3.js; three.js.]

Page 35: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

Summary

• We illustrated cloud concepts and demonstrated the cloud capabilities through simple applications.
• We discussed the features of the Hadoop Distributed File System and MapReduce for handling big-data sets.
• We also explored some real business issues in the adoption of cloud.
• Cloud is indeed an impactful technology that is sure to transform computing in business.

Page 36: B. RAMAMURTHY Cloud Strategy II 7/11/2014 Rich's Big Data Analytics Training 1

References & useful links

• Amazon AWS: http://aws.amazon.com/free/
• Google App Engine (GAE): http://code.google.com/appengine/docs/whatisgoogleappengine.html