b. ramamurthy cloud strategy ii 7/11/2014 rich's big data analytics training 1
TRANSCRIPT
Rich's Big Data Analytics Training
B. RAMAMURTHY
Cloud Strategy II
7/11/2014
Tableau vs Qlikview vs Spotfire
Source not available: this is from a cached page.
R integration is currently the direction taken by all three.
The Context: Big-data
Data mining of huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
We are in a knowledge economy.
• Data is an important asset to any organization
• Discovery of knowledge; enabling discovery; annotation of data
• Complex computational models
• No single environment is good enough: we need elastic, on-demand capacities
We are also looking at newer
• Programming models
• Supporting algorithms and data structures
We need a rapid prototyping environment for learning these.
The Context: Big-data Strategies
What are your strategies for tapping into emerging technologies?
• Exploring emerging technologies before investing in them: make informed decisions
• Training your workforce
The cloud infrastructure
• Inclusion of the cloud and computing on the cloud in your environment has become indispensable to keep up with the emerging technologies
Newer methods and algorithms
• Maintaining your competitive edge by including newer approaches to data analytics and visualization: R, JS libraries, tapping into APIs provided by social media
Outline for Today
Amazon cloud hands-on exercises (take 2)
Google App Engine – concept
Google App Engine – hands-on exercises
Map-reduce (MR) algorithm
Hadoop infrastructure
MR on Amazon Web Services
Moving forward
Summary
Cloud Model & Enabling Technologies
64-bit processor
Multi-core architectures
Virtualization: bare metal, hypervisor, … (VM0, VM1, …, VMn)
Web services, SOA, WS standards
Services interface
Cloud applications: data-intensive, compute-intensive, storage-intensive
Storage models: S3, BigTable, BlobStore, ...
Bandwidth
Common Features of Cloud Providers
Development environment: IDE, SDK, plugins
Production environment
Simple storage
Table store <key, value>
Drives; accessible through Web services
Management console and monitoring tools & multi-level security
Public Cloud vs. Private Cloud
Rationale for Private Cloud:
• Security and privacy of business data was a big concern
• Potential for vendor lock-in
• Service Level Agreements (SLAs) required for real-time performance and reliability
• Cost savings of the shared model achieved because of the multiple projects that the company is actively developing
Cloud Computing for the Enterprise: What Should IT Do?
Revise the cost model to utility-based computing: CPU/hour, GB/day, etc.
Include hidden costs for management and training
Different cloud models for different applications: evaluate
Use it for prototyping applications, and learn
Link it to current strategic plans for Services-Oriented Architecture, Disaster Recovery, etc.
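The utility-based cost model above can be sketched as a small calculation. This is only an illustration: the class and function names are hypothetical, and the prices are made-up numbers, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtilityRates:
    """Per-unit prices; the numbers used below are hypothetical."""
    cpu_hour: float  # price of one CPU-hour
    gb_day: float    # price of storing one GB for one day


def monthly_cost(rates, cpu_hours, gb_stored, days=30, hidden_costs=0.0):
    """Utility-style bill: compute + storage + hidden costs (management, training)."""
    compute = rates.cpu_hour * cpu_hours
    storage = rates.gb_day * gb_stored * days
    return compute + storage + hidden_costs


if __name__ == "__main__":
    rates = UtilityRates(cpu_hour=0.10, gb_day=0.01)  # made-up prices
    total = monthly_cost(rates, cpu_hours=720, gb_stored=100, hidden_costs=500.0)
    print(f"Estimated monthly cost: {total:.2f}")
```

Keeping hidden costs as an explicit argument makes it harder to leave management and training out of the comparison with an in-house setup.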
Cloud Infrastructure
Essential component of today’s IT
Running legacy applications: 32-bit, old obsolete languages, older versions of software.
Launching emerging applications: you don’t have the infrastructure for a cluster or large machines; map-reduce cluster, social media data collection cluster.
Prototyping a setup before investing in it.
Load balancing: address a sudden surge in traffic. Use it when there is wide variability in traffic.
Temporary setup such as for a conference, meetings, tournaments (Example: US Open, FIFA in South Africa, Ebola camp).
Establishing IT in places where there is no infrastructure (South Pole, Amazon jungle).
Rapid prototyping: quickly get something going to address an emergency situation or for disaster mitigation.
Plain and simple: run your business on the cloud. A success story is Netflix.
REVIEW MATERIAL FROM SESSION 4: CLOUD STRATEGY I
Amazon Web Services (AWS)
Getting Started with AWS
Starting point is the excellent documentation in: http://docs.aws.amazon.com/gettingstarted/latest/awsgsg-intro/awsgsg-intro.pdf
Go through this document before you launch anything on AWS.
We will use an Amazon Machine Image (AMI) to launch and connect to a Windows machine/instance.
We will use a Linux AMI to launch a Linux machine and work with it.
There are many other applications such as data workflows, data pipeline, elastic map reduce, etc.
We will also deploy a map-reduce application (wordcount) on AWS.
Google App Engine
What is Google App Engine?
Google App Engine is a Platform as a Service (PaaS).
It lets you build and run applications on Google’s infrastructure.
App Engine applications are easy to build, easy to maintain, and easy to scale as your traffic and data storage needs change.
With App Engine, there are no servers for you to maintain.
You simply upload your application and it is ready to go.
GAE Features
App Engine makes it easy to build and deploy an application that runs reliably even under heavy load and with large amounts of data. It includes the following features:
• Persistent storage with queries, sorting, and transactions.
• Automatic scaling and load balancing.
• Asynchronous task queues for performing work outside the scope of a request.
• Scheduled tasks for triggering events at specified times or regular intervals.
• Integration with other Google cloud services and APIs.
Applications run in a secure, sandboxed environment, allowing App Engine to distribute requests across multiple servers and scale servers to meet traffic demands.
How to deploy an application?
We will work from the Eclipse environment we have already downloaded.
You need to download the Google App Engine plugin and use that to deploy the application.
You need a Google App Engine account before you can deploy anything. Please create one before we proceed.
We will deploy two applications we developed earlier: Hangman and the three.js application that shows a cube and a sphere.
GAE capacities
All applications can use up to 1 GB of storage and enough CPU and bandwidth to support an efficient app serving around 5 million page views a month, absolutely free.
Runs your web applications on Google's infrastructure.
Google App Engine is a fully integrated development environment.
You can serve your applications with your own domain (such as http://richs.com), or you can use the App Engine domain for free (just like http://svg2bina.appspot.com).
You can use server-side languages such as PHP, Java, Python, and Go.
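To get a feel for the free quota of about 5 million page views a month quoted above, it helps to turn it into an average request rate. The sketch below is plain arithmetic, assuming a 30-day month and evenly spread traffic (real traffic is bursty, so peaks will be higher); the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_requests_per_second(page_views_per_month, days_in_month=30):
    """Average rate, assuming traffic is spread evenly over the month."""
    return page_views_per_month / (days_in_month * SECONDS_PER_DAY)


rate = average_requests_per_second(5_000_000)
print(f"{rate:.2f} page views per second on average")  # prints 1.93 ...
```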
Monitor Your App
Summarizing Google App Engine
Among the cloud offerings, Google App Engine has the least expensive model for cloud deployment.
The learning curve for the process of deployment is not that steep.
Moreover, Eclipse has a plugin to simplify the process.
But all the infrastructure is hidden from you; you access it only through the API and the services offered by GAE.
Hadoop-MapReduce
Google File System
The Internet introduced a new challenge in the form of web logs and web crawler data: large-scale, “peta-scale” data.
But observe that this type of data has a uniquely different characteristic from your transactional or “customer order” data: “write once read many (WORM)”
• Privacy-protected healthcare and patient information
• Historical financial data
• Other historical data
• Transactional data from your sales
• Manufacturing data for quality control
Google exploited this WORM characteristic in its Google File System (GFS) for running massively parallel processes/programs.
What is Hadoop?
At Google, MapReduce operations are run on a special file system called Google File System (GFS) that is highly optimized for this purpose.
However, GFS is not open source.
Doug Cutting and others at Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop.
This is now open source and distributed by Apache.
Basic Features: HDFS
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
HDFS provides a Java API for applications to use.
It also provides a streaming API for other languages.
An HTTP browser can be used to browse the files of an HDFS instance.
Fault tolerance
Failure is the norm rather than exception in a large network.
An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data.
Since there is a huge number of components and each component has a non-trivial probability of failure, there is always some component that is non-functional.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
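The claim that some component is almost always down can be checked with a short calculation. The sketch below assumes that components fail independently with the same probability, which is a simplification; the function names and the numbers (1000 machines, a 0.1% chance of failure each) are hypothetical.

```python
def prob_at_least_one_failure(num_components, p_fail):
    """P(at least one of n independent components has failed) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_fail) ** num_components


def expected_failures(num_components, p_fail):
    """Expected number of failed components under the same assumptions."""
    return num_components * p_fail


n, p = 1000, 0.001  # hypothetical cluster size and per-machine failure probability
print(f"P(some machine is down) = {prob_at_least_one_failure(n, p):.3f}")
print(f"Expected machines down  = {expected_failures(n, p):.1f}")
```

Even with a small per-machine probability, the chance that at least one machine is down grows quickly with cluster size, which is why HDFS treats failure as the normal case.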
HDFS Architecture
[Figure: HDFS architecture. Labels: Namenode; Metadata (Name, replicas, ...), e.g. (/home/foo/data, 6, ...); Metadata ops; Block ops; Client (Read); Client (Write); Datanodes holding Blocks on Rack1 and Rack2; Replication.]
Hadoop Distributed File System
[Figure: Hadoop Distributed File System. Labels: Application; Local file system (block size: 2K); HDFS Client; HDFS Server; Master node; Name Nodes; HDFS block size: 128M, replicated.]
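The block sizes on this slide (2K for the local file system, 128M and replicated for HDFS) translate directly into block counts. The sketch below is illustrative arithmetic only: the function names are hypothetical, and the replication factor of 3 is an assumption, since the slide does not give one.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (the last block may be partly filled)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def stored_replicas(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Total block copies kept in the cluster; replication=3 is an assumed value."""
    return block_count(file_size_bytes, block_size_bytes) * replication


one_gb = 1 * GB
print(block_count(one_gb, 2 * KB))    # blocks with a 2K local block size: 524288
print(block_count(one_gb, 128 * MB))  # blocks with a 128M HDFS block size: 8
print(stored_replicas(one_gb))        # block copies with 3-way replication: 24
```

Fewer, larger blocks keep the amount of metadata per file small, which suits the large, streaming reads that HDFS is built for.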
What is MapReduce?
MapReduce is a programming model Google has used successfully in processing its “big-data” sets (~20 petabytes per day).
A map function extracts some intelligence from raw data.
A reduce function aggregates, according to some guides, the data output by the map.
Users specify the computation in terms of a map and a reduce function.
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
The underlying system also handles machine failures, efficient communications, and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
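The division of labor described above (the user supplies a map and a reduce function, the runtime does the rest) can be sketched in a few lines. This is a single-process toy, not Hadoop or Google's implementation: it does no parallelization or failure handling, and the names map_reduce, sale_mapper, sum_reducer, and the sales data are all hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (region, sale amount) records.
sales = [("east", 120), ("west", 80), ("east", 30), ("north", 50), ("west", 20)]


def sale_mapper(record):
    region, amount = record
    yield region, amount


def sum_reducer(key, values):
    return sum(values)


totals = map_reduce(sales, sale_mapper, sum_reducer)
print(totals)  # {'east': 150, 'west': 100, 'north': 50}
```

Only sale_mapper and sum_reducer are problem-specific; in a real system the body of map_reduce is what the framework distributes across machines.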
Classes of problems “mapreducable”
Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
Google uses it for wordcount, adwords, pagerank, indexing data.
Simple algorithms such as grep, text-indexing, reverse indexing
Bayesian classification: data mining domain
Any number of applications involving data mining and machine learning
Facebook uses it for various operations: demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating extra-terrestrial objects.
MapReduce Example: Mapper
This is a cat
Cat sits on a roof
<this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>
The roof is a tin roof
There is a tin can on the roof
<the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>
Cat kicks the can
It rolls on the roof and falls on the next roof
<cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1> <falls 1> <next 1>
The cat rolls too
It sits on the can
<the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>
MapReduce Example: Combiner, Reducer
<this 1> <is 1> <a <1,1>> <cat <1,1>> <sits 1> <on 1> <roof 1>
<the <1,1>> <roof <1,1,1>> <is <1,1>> <a <1,1>> <tin <1,1>> <there 1> <can 1> <on 1>
<cat 1> <kicks 1> <the <1,1,1>> <can 1> <it 1> <rolls 1> <on <1,1>> <roof <1,1>> <and 1> <falls 1> <next 1>
<the <1,1>> <cat 1> <rolls 1> <too 1> <it 1> <sits 1> <on 1> <can 1>
Combine the counts of all the same words:
<cat <1,1,1,1>>
<roof <1,1,1,1,1,1>>
<can <1,1,1>>
…
Reduce (sum in this case) the counts:
<cat 4>
<can 3>
<roof 6>
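The word-count example on these two slides can be run end to end with a short script. This is a sketch in one process, not Hadoop code, and the function names are hypothetical. One difference from the slide is stated plainly: here the combiner sums the counts locally within each split (a common use of a combiner), whereas the slide keeps lists of 1s until the reduce step. The final counts are the same.

```python
from collections import Counter, defaultdict

# The four input splits from the mapper slide, two sentences per split.
SPLITS = [
    "This is a cat Cat sits on a roof",
    "The roof is a tin roof There is a tin can on the roof",
    "Cat kicks the can It rolls on the roof and falls on the next roof",
    "The cat rolls too It sits on the can",
]


def map_split(text):
    """Mapper: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for word in text.split()]


def combine(pairs):
    """Combiner: sum the counts locally, within one split."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local


def reduce_counts(partials):
    """Reducer: group the per-split counts by word and sum them."""
    grouped = defaultdict(list)
    for partial in partials:
        for word, count in partial.items():
            grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}


word_counts = reduce_counts(combine(map_split(s)) for s in SPLITS)
print(word_counts["cat"], word_counts["can"], word_counts["roof"])  # 4 3 6
```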
[Figure: MapReduce data flow. Labels: Large scale data splits; Parse-hash (one per split); Map <key, 1> (<key, value> pair); Reducers (say, Count); output partitions P-0000, P-0001, P-0002 with count1, count2, count3.]
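The parse-hash step in this diagram decides which reducer, and therefore which output partition (P-0000, P-0001, ...), receives each key. A minimal sketch of that idea follows. It is an illustration, not Hadoop's partitioner: the function names are hypothetical, and zlib.crc32 is used only as a convenient hash that stays stable between runs.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Stable hash partitioning: the same key always lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_count(pairs, num_reducers):
    """Route each (key, value) pair to its partition and sum the values per key there."""
    partitions = [defaultdict(int) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)][key] += value
    return [dict(p) for p in partitions]


# Mapper output for a small hypothetical input: one (word, 1) pair per word.
pairs = [(w, 1) for w in "the cat sits on the roof the cat rolls".split()]
for index, counts in enumerate(shuffle_and_count(pairs, num_reducers=3)):
    print(f"P-{index:04d}", counts)
```

Because every occurrence of a key is sent to the same partition, each reducer can finish its counts without talking to the others.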
Putting it all together
Starting point: question. What do you want to know? Don’t try to fit a question into a technology.
Exploratory data analysis
• Use R to do EDA
Explore on the cloud
• For emerging technologies
• For legacy technologies
Visualize
• JS and JS libraries: three.js, d3.js
Analyze, present & make decisions
• Gephi
• Tableau/Qlikview
Question answered?
The Data Science Process (Review)
[Figure: the data science process, with steps numbered 1 to 7. Labels: Raw data collected; Data is processed; Data is cleaned; Exploratory data analysis; Machine learning algorithms, statistical models; Build data products; Communication, visualization, report findings; Make decisions; Micro-level data strategy.]
Big Data Training
[Figure: Big Data training concept map. Labels: BigData; CloudStrategy; RStudio; R; StatModel; MLAlgs; CloudAWS; CloudGAE; DataInfrastructure; Hadoop: HDFS, MapReduce; DataAlgorithms; DataAnalytics; DataCommunication; HighLevel; Gephi; Tableau; JS; d3.js; three.js; ProgLevel.]
Summary
We illustrated cloud concepts and demonstrated the cloud capabilities through simple applications.
We discussed the features of the Hadoop Distributed File System (HDFS) and MapReduce for handling big-data sets.
We also explored some real business issues in the adoption of cloud.
Cloud is indeed an impactful technology that is sure to transform computing in business.
References & useful links
Amazon AWS: http://aws.amazon.com/free/
Google App Engine (GAE): http://code.google.com/appengine/docs/whatisgoogleappengine.html