large-scale data processing on the cloud · • some slides are also taken from “big data and...

35
Large-scale Data Processing on the Cloud MTAT.08.036 Lecture 1: Data analytics in the cloud Satish Srirama [email protected]

Upload: others

Post on 09-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Large-scale Data Processing on the

Cloud

MTAT.08.036

Lecture 1: Data analytics in the cloud

Satish [email protected]

Page 2: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Course Purpose

• Introduce cloud computing concepts

• Introduce data analytics in the cloud

• Introduce – Distributed computing algorithms like MapReduce

– BigData solutions like Pig and Hive– BigData solutions like Pig and Hive

– NoSQL

• Glance of research at Mobile Cloud Lab in cloud computing domain

• http://courses.cs.ut.ee/2013/LDPC/

9/3/2013 Satish Srirama 2/35

Page 3: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Course Schedule

• Lectures:

– Tue. 14.15 - 16.00 (J. Liivi 2 - 122)

• Practice sessions:

– Thur. 10.15 - 12.00 (J. Liivi 2 - 003)– Thur. 10.15 - 12.00 (J. Liivi 2 - 003)

9/3/2013 Satish Srirama 3/35

Page 4: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Related Courses

• MTAT.08.027 Basics of Cloud Computing (3 ECTS)

– Spring 2014

• MTAT.03.280 Mobile and Cloud Computing • MTAT.03.280 Mobile and Cloud Computing Seminar (3 ECTS)

– Fri. 10.15 - 12.00, J. Liivi 2 – 512

• MTAT.03.266 Mobile Application Development Projects (3 ECTS)

– Mon. 10.15 - 12.00, J. Liivi 2 – 611

9/3/2013 Satish Srirama 4/35

Page 5: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Questions

• Is everyone comfortable with data structures?

• How comfortable you are with algorithms?

• How comfortable you are with programming?

– Java ? – Java ?

• External APIs?

– Web programming

9/3/2013 Satish Srirama 5/35

Page 6: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Grading

• Written exam – 50%

• Labs – 45%– 6+ lab exercises

– You must submit lab exercise solutions by Wednesday night of the week after (for full score)• Late submission for a week (80%)• Late submission for a week (80%)

• Late submission for 2 weeks (50%) – Hard deadlines will be mentioned later

• Active participation in the lectures (Max 5%)

• To pass the course– You need to score at least 50% in each of the subsections

– You need to score at least 50% in the total

9/3/2013 Satish Srirama 6/35

Page 7: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Course schedule

• 03.09 Basics of Cloud computing

• 10.09 MapReduce

• 17.09 MapReduce algorithms

• 24.09 Pig and Hive

• 01.10 Pig and Hive - continued• 01.10 Pig and Hive - continued

• 08.10 NoSQL

• 15.10 Case studies

• 22.10 Case studies

• 29.10 Examination 1

9/3/2013 Satish Srirama 7/35

Page 8: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Course schedule - continued

• Labs

05.09 Starting with a cloud

12.09 MapReduce - Basics

19.09 Data analysis with MapReduce19.09 Data analysis with MapReduce

26.09 Working with Pig and Hive

03.10 Working with Pig and Hive

10.10 NoSQL

++

9/3/2013 Satish Srirama 8/35

Page 9: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

DATA ANALYTICS IN THE CLOUD

Lecture 1

9/3/2013 Satish Srirama 9

Page 10: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

WHAT IS CLOUD COMPUTING?

“It’s nothing new”“...we’ve redefined Cloud

Computing to include everything

that we already do... I don’t

understand what we would do

differently ... other than change

the wording of some of our ads.”

“It’s a trap”“It’s worse than stupidity: it’s

marketing hype. Somebody is

saying this is inevitable—and

whenever you hear that, it’s very

likely to be a set of businesses

campaigning to make it true.”WHAT IS CLOUD COMPUTING?

No consistent answer!

Everyone thinks it is something else…

the wording of some of our ads.”

Larry Ellison, CEO, Oracle (Wall

Street Journal, Sept. 26, 2008)

campaigning to make it true.”

Richard Stallman, Founder, Free

Software Foundation (The

Guardian, Sept. 29, 2008)

Slide taken from Professor Anthony D. Joseph’s lecture at RWTH Aachen9/3/2013 Satish Srirama 10

Page 11: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

What is Cloud Computing?

• Computing as a utility– Utility services e.g. water, electricity, gas etc

– Consumers pay based on their usage

• Cloud Computing characteristics – Illusion of infinite resources– Illusion of infinite resources

– No up-front cost

– Fine-grained billing (e.g. hourly)

• Gartner: “Cloud computing is a style of computing where massively scalable IT-related capabilities are provided ‘as a service’ across the Internet to multiple external customers”

9/3/2013 Satish Srirama 11/35

Page 12: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Timeline

9/3/2013 Satish Srirama 12/35

Page 13: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Clouds - Why Now (not then)?

• Experience with very large datacenters

– Unprecedented economies of scale

– Transfer of risk

• Technology factors• Technology factors

– Pervasive broadband Internet

– Maturity in Virtualization Technology

• Business factors

– Minimal capital expenditure

– Pay-as-you-go billing model

9/3/2013 Satish Srirama 13/35

Page 14: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Cloud Computing - Services• Software as a Service – SaaS

– A way to access applications hosted on the web through your web browser

• Platform as a Service – PaaS

– A pay-as-you-go model for IT resources accessed over the

SaaS

Facebook, Flikr, Myspace.com,

Google maps API, Gmail

Level of

Abstraction

resources accessed over the Internet

• Infrastructure as a Service –IaaS

– Use of commodity computers, distributed across Internet, to perform parallel processing, distributed storage, indexing and mining of data

– Virtualization

PaaS

Google App Engine,

Force.com, Hadoop, Azure, Amazon S3, etc

IaaSAmazon EC2, SciCloud,

Joyent Accelerators, Nirvanix Storage Delivery Network, etc.

9/3/2013 Satish Srirama 14/35

Page 15: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Cloud Computing - Themes

• Massively scalable

• On-demand & dynamic

• Only use what you need - Elastic – No upfront commitments, use on short term basis

• Accessible via Internet, location independent• Accessible via Internet, location independent

• Transparent – Complexity concealed from users, virtualized,

abstracted

• Service oriented – Easy to use SLAs

SLA – Service Level Agreement

9/3/2013 Satish Srirama 15/35

Page 16: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Cloud Models

• Internal (private) cloud– Cloud with in an organization

• Community cloud– Cloud infrastructure jointly

owned by several organizations

• Public cloud• Public cloud– Cloud infrastructure owned by

an organization, provided to general public as service

• Hybrid cloud– Composition of two or more

cloud models

9/3/2013 Satish Srirama 16/35

Page 17: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Cloud Application Demand

• Many cloud applications have cyclical demand curves– Daily, weekly, monthly, …

Demand

Resources

• Workload spikes are more frequent and significant– When some event happens like a pop star has expired:

• More # tweets, Wikipedia traffic increases

• 22% of tweets, 20% of Wikipedia traffic when Michael Jackson expired in 2009 – Google thought they are under attack

Demand

Time

Resources

9/3/2013 Satish Srirama 17/35

Page 18: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Economics of Cloud Users

• Pay by use instead of provisioning for peak

Capacity

Resources

Resources

Unused resources

Static data center Data center in the cloud

Demand

Time

Resources

Demand

Capacity

TimeResources

9/3/2013 Satish Srirama 18/35

Page 19: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Unused resources

Economics of Cloud Users - continued

• Risk of over-provisioning: underutilization

– Huge sunk cost in infrastructureCapacity

Static data center

Demand

Time

Resources

9/3/2013 Satish Srirama 19/35

Page 20: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Economics of Cloud Users - continued

• Heavy penalty for under-provisioning

Resources

Demand

Capacity

Resources

Lost revenue

Lost users

Resources

Demand

Capacity

Time (days)

1 2 3

Demand

Time (days)

1 2 3

Resources

Demand

Capacity

Time (days)

1 2 3

9/3/2013 Satish Srirama 20/35

Page 21: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Short Term Implications of Clouds

• Startups and prototyping

– Minimize infrastructure risk

– Lower cost of entry

• Batch jobs• Batch jobs

• One-off tasks

– Washington post, NY Times

• Cost associatively for scientific applications

• Research at scale

9/3/2013 Satish Srirama 21/35

Page 22: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Long Term Implications of clouds

• Application software:

– Cloud & client parts, disconnection tolerance

• Infrastructure software:

– Resource accounting, VM awareness– Resource accounting, VM awareness

• Hardware systems:

– Containers, energy proportionality

9/3/2013 Satish Srirama 22/35

Page 23: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Cloud Computing Progress

Armando Fox, 2010

9/3/2013 Satish Srirama 23/35

Page 24: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Our Data-driven World

• Science– Data bases from astronomy, genomics, environmental data,

Sensors and Smart Devices …

• Humanities and Social Sciences– Scanned books, historical documents, social interactions data, …

• Business & Commerce• Business & Commerce– Corporate sales, stock market transactions, census, airline

traffic, …

• Entertainment– Internet images, MP3 files, …

• Medicine– MRI & CT scans, patient records, …

[Agrawal et al, EDBT 2011 Tutorial]

9/3/2013 Satish Srirama 24/35

Page 25: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

What can we do with this wealth?

� What can we do?� Scientific breakthroughs

� Business process efficiencies

� Realistic special effects� Realistic special effects

� Improve quality-of-life: healthcare, transportation, environmental disasters, daily life, …

� Could We Do More?� YES: but need major

advances in our capability to analyze this data

[Agrawal et al, EDBT 2011 Tutorial]

9/3/2013 Satish Srirama 25

Page 26: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Data Warehousing, Data Analytics &

Decision Support Systems

• Used to manage and control business

• Transactional Data: historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can • Use of the system is loosely defined and can

be ad-hoc

• Used by managers and analysts to understand

the business and make judgments

• Limited scalability[Agrawal et al, EDBT 2011 Tutorial]

9/3/2013 Satish Srirama 26/35

Page 27: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Data Analytics in the Cloud

• Scalability to large data volumes:– Scan 100 TB on 1 node @ 50 MB/sec = 23 days

– Scan on 1000-node cluster = 33 minutes

� Divide-And-Conquer (i.e., data partitioning)– This is where MapReduce jumps in– This is where MapReduce jumps in

– Replicate data and computation (Next Lecture)

• Cost-efficiency:– Commodity nodes (cheap, but unreliable)

– Commodity network

– Automatic fault-tolerance

9/3/2013 Satish Srirama 27/35

Page 28: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

BigData Solutions

• We mainly study Hadoop MapReduce

– Hadoop MapReduce is one of the most used solution

for large scale data analysis.

• However, Hadoop has some troubles• However, Hadoop has some troubles

– Writing low level MapReduce code is slow

– Need a lot of expertise to optimize MapReduce code

– Prototyping requires compilation

– A lot of custom code is required

– Hard to manage more complex MapReduce job chains

9/3/2013 Satish Srirama 28/35

Page 29: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Pig & Hive (Lecture 4 & 5)

• Pig is a high level language on top of Hadoop MapReduce – Models a scripting language (Latin)

• Fast prototyping

– Similar to declarative SQL • Easier to get started • Easier to get started

– In comparison to Hadoop MapReduce: • 5% of the code

• 5% of the time

• Hive™ – A data warehouse infrastructure that provides data

summarization and ad hoc querying.

– Developed by Facebook

9/3/2013 Satish Srirama 29/35

Page 30: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

App App App

Scaling in the Cloud

Load Balancer (Proxy)

App

Client Site

App

Client Site Client Site

App

Server

App

Server

App

ServerApp

Server

MySQL

Master DBMySQL

Slave DB

ReplicationDatabase becomes the

Scalability Bottleneck

Cannot leverage elasticity

App

Server

9/3/2013 Satish Srirama 30

Page 31: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

App App App

Scaling in the Cloud

Load Balancer (Proxy)

App

Client Site

App

Client Site Client Site

App

Server

App

Server

App

ServerApp

Server

MySQL

Master DBMySQL

Slave DB

Replication

App

Server

[Agrawal et al, EDBT 2011 Tutorial]

9/3/2013 Satish Srirama 31

Page 32: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Apache Apache Apache

Scaling in the Cloud

Load Balancer (Proxy)

Apache

Client Site

Apache

Client Site Client Site

Key Value Stores

Apache

+ App

Server

Apache

+ App

Server

Apache

+ App

Server

Apache

+ App

Server

Apache

+ App

Server

Scalable and Elastic,

but limited consistency and

operational flexibility NoSQL

(Lecture 6)9/3/2013 Satish Srirama 32

Page 33: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

This week in Lab (Homework)

• Registration to the cloud & keys

– Firefox plugin & working with Eucatools and API

• Some small programming task to get you

startedstarted

9/3/2013 Satish Srirama 33/35

Page 34: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

Next lecture

• Introduction to MapReduce

9/3/2013 Satish Srirama 34/35

Page 35: Large-scale Data Processing on the Cloud · • Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by

References

• Several of the slides are taken from Prof. Anthony D. Joseph’s lecture at RWTH Aachen (March 2010)

• Some slides are also taken from “Big Data and Cloud Computing: Current State and Future Cloud Computing: Current State and Future Opportunities”, tutorial at EDBT 2011 by D. Agrawal, S. Das and A. Abbadi.

• M. Armbrust et al., “Above the Clouds, A Berkeley View of Cloud Computing”, Technical Report, University of California, Feb, 2009.

9/3/2013 Satish Srirama 35/35