8/13/2019 Lecture1-Introduction to Cloud Computing
Introduction to Cloud Computing
http://net.pku.edu.cn/~course/cs402/2009/
6/30/2009
(Cloud Computing)?
What is Cloud Computing?
1. First, write down your own opinion about cloud computing, whatever comes to mind.
2. Questions: What? Who? Why? How? Pros and cons?
3. The most important question is: what does it have to do with me?
Cloud Computing is
No software; access everywhere over the Internet
Power: large-scale data processing
Appeal for startups: cost efficiency
Software as platform
Cons
Security
Data lock-in
SaaS, PaaS, Utility Computing
Software as a Service (SaaS)
A model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
http://en.wikipedia.org/wiki/Software_deployment
Platform as a Service (PaaS)
Web applications and services, PaaS, Internet, multi-tenant architecture, platform
Utility Computing
Pay-as-you-go; Microsoft; pay less; utility computing; 500; use less, pay less; cloud computing
Key Characteristics
Illusion of infinite computing resources available on demand;
Elimination of an up-front commitment by cloud users;
Ability to pay for use of computing resources on a short-term basis as needed (billing as in utility computing)
Very large datacenters
Large-scale software infrastructure
Operational expertise
Why now?
Very large-scale datacenters
Business: pay-as-you-go computing
Key Players
Amazon Web Services
Google App Engine
Microsoft Windows Azure
Key Applications
Mobile interactive applications (Tim O'Reilly): mobile, datacenter, mashup
Parallel batch processing: cloud computing, MapReduce, Hadoop; Amazon hosts large public datasets for free
The rise of analytics: transaction based, analytics
Extension of compute-intensive desktop applications: MATLAB, Mathematica, cloud computing. Woo~
Cloud Computing = Silver Bullet?
Google37Google
Problem of Data Lock-in
Challenges
Some other Voices
"It's stupidity. It's worse than stupidity: it's a marketing hype campaign. Somebody is saying this is inevitable, and whenever you hear somebody saying that, it's very likely to be a set of businesses campaigning to make it true."
Richard Stallman, quoted in The Guardian, September 29, 2008
"The interesting thing about Cloud Computing is that we've redefined Cloud Computing to include everything that we already do. . . . I don't understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads."
Larry Ellison, quoted in the Wall Street Journal, September 26, 2008
What does it matter to ME?!
What would you want to do with 1,000 PCs, or even 100,000 PCs?
Cloud is coming
Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law
Data Center is a Computer
Parallelism everywhere
Massive, Scalable, Reliable
Resource Management
Data Management
Programming Model & Tools
http://en.wikipedia.org/wiki/Moore%27s_law
Happening everywhere!
Molecular biology (cancer): microarray chips
Particle events (LHC): particle colliders
Simulations (Millennium): microprocessors
Network traffic (spam): fiber optics
300M/day, 1B, 1M/sec
(Photos: Maximilien Brice, CERN)
How much data?
Internet Archive has 2 PB of data + 20 TB/month
Google processes 20 PB a day (2008)
All words ever spoken by human beings ~ 5 EB
CERN's LHC will generate 10-15 PB a year
Sanger anticipates 6 PB of data in 2009
"640K ought to be enough for anybody."
NERSC User George Smoot wins 2006 Nobel Prize in Physics
Smoot and Mather 1992 COBE Experiment showed anisotropy of CMB
Cosmic Microwave Background Radiation (CMB): an image of the universe at 400,000 years
The Current CMB Map
Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization.
Extracting these μKelvin fluctuations from inherently noisy data is a serious computational challenge.
Source: J. Borrill, LBNL
Evolution of CMB Data Sets: Cost > O(Np^3)
Experiment          Nt       Np      Nb      Limiting data  Notes
COBE (1989)         2x10^9   6x10^3  3x10^1  Time           Satellite, Workstation
BOOMERanG (1998)    3x10^8   5x10^5  3x10^1  Pixel          Balloon, 1st HPC/NERSC
(4yr) WMAP (2001)   7x10^10  4x10^7  1x10^3  ?              Satellite, Analysis-bound
Planck (2007)       5x10^11  6x10^8  6x10^3  Time/Pixel     Satellite, Major HPC/DA effort
POLARBEAR (2007)    8x10^12  6x10^6  1x10^3  Time           Ground, NG-multiplexing
CMBPol (~2020)      10^14    10^9    10^4    Time/Pixel     Satellite, Early planning/design
data compression
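To get a feel for what a cost above O(Np^3) means, the sketch below cubes the Np column of the table and compares each experiment with COBE. It ignores constant factors and any lower-order terms, so the numbers are only relative orders of magnitude, not measured run times. The function names are made up for this illustration.

```python
# Pixel counts (Np) copied from the table above.
PIXELS = {
    "COBE (1989)": 6e3,
    "BOOMERanG (1998)": 5e5,
    "WMAP (2001)": 4e7,
    "Planck (2007)": 6e8,
    "POLARBEAR (2007)": 6e6,
    "CMBPol (~2020)": 1e9,
}


def cubic_cost(n_pixels):
    """Operation count of an O(Np^3) step, ignoring constant factors."""
    return n_pixels ** 3


def relative_cost(name, baseline="COBE (1989)"):
    """How many times more expensive `name` is than `baseline`."""
    return cubic_cost(PIXELS[name]) / cubic_cost(PIXELS[baseline])


for name in PIXELS:
    print(f"{name:18s} Np^3 = {cubic_cost(PIXELS[name]):.2e}  "
          f"({relative_cost(name):.1e} x COBE)")
```

Going from COBE to Planck multiplies Np by 10^5, so a cubic step grows by a factor of about 10^15.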
Example: Wikipedia Anthropology
Experiment
Download entire revision history of Wikipedia
4.7 M pages, 58 M revisions, 800 GB
Analyze editing patterns & trends
Computation
Hadoop on 20-machine cluster
Kittur, Suh, Pendleton (UCLA, PARC), "He Says, She Says: Conflict and Coordination in Wikipedia", CHI, 2007
Increasing fraction of edits are for work indirectly related to articles
Example: Scene Completion
Image Database Grouped by Semantic Content
30 different Flickr.com groups
2.3 M images total (396 GB).
Select Candidate Images Most Suitable for Filling Hole
Classify images with gist scene detector [Torralba]
Color similarity
Local context matching
Computation
Index images offline
50 min. scene matching, 20 min. local matching, 4 min. compositing
Reduces to 5 minutes total by using 5 machines
Extension
Flickr.com has over 500 million images
Hays, Efros (CMU), "Scene Completion Using Millions of Photographs", SIGGRAPH, 2007
Example: Web Page Analysis
Experiment
Use web crawler to gather 151M HTML pages weekly, 11 times
Generated 1.2 TB of log information
Analyze page statistics and change frequencies
Systems Challenge
"Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl."
Fetterly, Manasse, Najork, Wiener (Microsoft, HP), "A Large-Scale Study of the Evolution of Web Pages", Software: Practice & Experience, 2004
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT?
Subject
genome
Sequencer
Reads
DNA Sequencing
ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
GGAGTTAGTAAAAGTCCACATTGAG
Genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: ATCG
Bacteria: ~5 million bp
Humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp)
Shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson, et al., 2009)
Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36 bp reads
~144 GB of compressed sequence data
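The read counts above can be turned into total bases and genome coverage with a few multiplications. The sketch below does that for the two cited studies. Equating 144 billion bases with the ~144 GB on the slide assumes roughly one byte per base, which is an assumption of this sketch, not something the slide states.

```python
READ_LENGTH_BP = 36                 # read length quoted on the slide
HUMAN_GENOME_BP = 3_000_000_000     # ~3 billion bp, as quoted on the slide


def total_bases(n_reads, read_length=READ_LENGTH_BP):
    """Total number of sequenced bases across all reads."""
    return n_reads * read_length


def coverage(n_reads, genome_bp=HUMAN_GENOME_BP, read_length=READ_LENGTH_BP):
    """Average number of times each genome position is sequenced."""
    return total_bases(n_reads, read_length) / genome_bp


for study, n_reads in [("Wang et al., 2008", 3_300_000_000),
                       ("Bentley et al., 2008", 4_000_000_000)]:
    print(f"{study}: {total_bases(n_reads) / 1e9:.1f} Gbp, "
          f"about {coverage(n_reads):.1f}x coverage")
```

At 1-2 Gbp per machine per day, producing 144 Gbp takes on the order of 100 machine-days, which is why both the sequencing and the analysis are spread over many machines.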
Reference sequence
CGGTCTAGATGCTTAGCTATGCGGGCCCCTT
Alignment
Subject reads
GCTTATCTAT
TTATCTATGC
ATCTATGCGG
ATCTATGCGG
GCTTATCTAT
TCTAGATGCT
CTATGCGGGC
CTAGATGCTT
ATCTATGCGG
CTATGCGGGC
ATCTATGCGG
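A brute-force sketch of what "alignment" means here: slide each read along the reference and keep the position with the fewest mismatches. This is only an illustration on the sequences from the slide. Real read mappers use indexing and seed-and-extend strategies rather than this exhaustive scan, and the function names and the one-mismatch threshold are choices made for this sketch.

```python
REFERENCE = "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None.

    Every window of len(read) bases in the reference is tried; ties are
    resolved in favour of the leftmost position.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        distance = hamming(reference[pos:pos + len(read)], read)
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (pos, distance)
    return best


for read in ["TCTAGATGCT", "GCTTATCTAT"]:
    print(read, "->", align_read(REFERENCE, read))
```

The read TCTAGATGCT matches the reference exactly, while GCTTATCTAT aligns with one mismatch (T in the read where the reference has G), the kind of difference that may indicate a variant in the subject genome or a sequencing error.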
CGGTCTAGATGCTTATCTATGCGGGCCCCTT
GCTTATCTATTTATCTATGC
ATCTATGCGGATCTATGCGG
GCTTATCTAT GGCCCCTT
GCCCCTTCCTT
CGG
CGGTCCGGTCTCGGTCTAG
TCTAGATGCTCTATGCGGGCCTAGATGCTT
CTT
ATGCGGGCCC
Reference sequence
Subject reads
Example: Bioinformatics
Evaluate running time on local 24-core cluster
Running time increases linearly with the number of reads
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
Example: Data Mining
del.icio.us crawl -> a bipartite graph covering 802,739 webpages and 1,021,107 tags.
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang: PFP: Parallel FP-Growth for Query Recommendation. RecSys 2008: 107-114
An Example
Try it on these collections:
2006: 870 million, 2 TB.
Google, Yahoo: 100+ billion pages
Divide and Conquer
Work -> Partition -> w1, w2, w3 -> worker, worker, worker -> r1, r2, r3 -> Combine -> Result
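The partition / worker / combine picture can be sketched in a few lines. The example below splits a list into three parts, hands each part to a worker, and sums the partial results. The task (sum of squares) and all function names are invented for illustration. A thread pool is used only to keep the sketch short; on a real cluster the workers would be separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split a list into n_parts contiguous slices of nearly equal size."""
    size, extra = divmod(len(work), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(work[start:end])
        start = end
    return parts


def worker(part):
    """Process one partition w_i and return its partial result r_i."""
    return sum(x * x for x in part)


def combine(results):
    """Merge the partial results r1, r2, r3 into the final result."""
    return sum(results)


def divide_and_conquer(work, n_workers=3):
    parts = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(worker, parts))
    return combine(results)


print(divide_and_conquer(list(range(10))))
```

This only works so simply because the partial results are independent and the combine step is associative. The hard questions (how to assign work, what to do when a worker fails, how workers share data) are what frameworks such as MapReduce take care of.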
What's MapReduce?
Parallel/Distributed Computing Programming Model
Input -> split -> shuffle -> output
Typical problem solved by MapReduce
Input: key/value pairs
Map: extract something
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair
Produces a set of intermediate key/value pairs
Shuffle: group by key
Reduce: aggregate, summarize, filter, etc.
reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a key
Produces merged output values (usually just one)
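The two signatures above can be exercised with a tiny single-process driver. This is a minimal sketch of the programming model only, not of Hadoop or Google's implementation: there is no distribution, no fault tolerance, and everything is held in memory. The driver name `run_mapreduce` and the inverted-index example are chosen for this illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce over (in_key, in_value) records."""
    # Map: each input pair yields a list of (out_key, intermediate_value).
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))

    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)

    # Reduce: one call per key, with all values collected for that key.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


def index_map(doc_id, text):
    """Emit (word, doc_id) once for every distinct word in the document."""
    return [(word, doc_id) for word in set(text.lower().split())]


def index_reduce(word, doc_ids):
    """Merge the document ids for one word into a sorted posting list."""
    return sorted(doc_ids)


docs = [("d1", "cloud computing"), ("d2", "cloud storage")]
print(run_mapreduce(docs, index_map, index_reduce))
```

The user writes only `index_map` and `index_reduce`. Everything inside `run_mapreduce` is what the framework provides.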
Word Frequencies in Web pages
Input: one document per record
The map function takes a key/value pair:
key = document URL
value = document contents
Output of map: (potentially many) key/value pairs, one per word in the document
Example continued:
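The MapReduce library gathers all pairs with the same key (shuffle/sort)
The reduce function combines the values for a key
In this case: compute the sum
Output of Reduce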
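The word-frequency example can be written out end to end as follows. Here the shuffle/sort step is done literally, by sorting the intermediate pairs and grouping adjacent equal keys. This is a single-machine sketch with made-up URLs and simple whitespace tokenization. In Hadoop the same two functions would be written as Java Mapper and Reducer classes.

```python
from itertools import groupby
from operator import itemgetter


def map_words(url, contents):
    """key = document URL, value = document contents; emit (word, 1)."""
    for word in contents.lower().split():
        yield (word, 1)


def reduce_sum(word, counts):
    """Combine all the values for one key: here, add them up."""
    return sum(counts)


def word_frequencies(documents):
    # Map phase: one call per (URL, contents) record.
    pairs = [pair for url, contents in documents
             for pair in map_words(url, contents)]
    # Shuffle/sort: sorting brings pairs with the same key together.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct word.
    return {word: reduce_sum(word, [count for _, count in group])
            for word, group in groupby(pairs, key=itemgetter(0))}


documents = [
    ("http://a.example/", "the cloud is the computer"),
    ("http://b.example/", "The data center"),
]
print(word_frequencies(documents))
```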
History of Hadoop
2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team
May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark)
October 2006 - Research cluster reaches 600 nodes
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
January 2007 - Research cluster reaches 900 nodes
April 2007 - Research clusters - 2 clusters of 1000 nodes
Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!
http://jeremy.zawodny.com/blog/archives/006471.html
From Theory to Practice
You <-> Hadoop Cluster
1. scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. scp data from cluster
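The steps above map onto a handful of shell commands. The sketch below only builds and prints them; it does not run anything. `scp`, `hadoop fs -put`, `hadoop fs -get` and `hadoop jar` are standard commands, but the host name, file names, HDFS paths, the jar and its main class are hypothetical placeholders, and the assumption that the job takes an input and an output path as arguments depends on how the job's main method is written.

```python
import shlex


def workflow_commands(cluster, data, job_jar, main_class,
                      hdfs_in, hdfs_out, results):
    """Return (where, argv) pairs for steps 1, 2, 4, 5 and 6.

    Step 3 (develop code locally) has no command; step 4a means
    rebuilding the jar and issuing the step 4 command again.
    """
    return [
        ("local",   ["scp", data, f"{cluster}:{data}"]),
        ("cluster", ["hadoop", "fs", "-put", data, hdfs_in]),
        ("cluster", ["hadoop", "jar", job_jar, main_class, hdfs_in, hdfs_out]),
        ("cluster", ["hadoop", "fs", "-get", hdfs_out, results]),
        ("local",   ["scp", "-r", f"{cluster}:{results}", results]),
    ]


steps = workflow_commands(
    cluster="alice@cluster.example",   # hypothetical login node
    data="pages.txt",
    job_jar="wordcount.jar",           # hypothetical job jar
    main_class="WordCount",            # hypothetical main class
    hdfs_in="/user/alice/in",
    hdfs_out="/user/alice/out",
    results="results",
)
for where, argv in steps:
    print(f"[{where}] {shlex.join(argv)}")
```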
LEC#  TOPICS / ABSTRACT
1     MapReduce
2     MapReduce; Inverted Index
3     PageRank
4     MapReduce; Clustering
5     MapReduce
6
7
8
Grading Policy
30% Assignments
20% Readings
50% Course project
Hw1 - Read: Intro Distributed system; Intro MapReduce Programming.
Hw2 - Read MapReduce [1]
Hw3 - Read GFS [2]
Hw4 - Read Pig Latin [3]
Lab 1 - Introduction to Hadoop, Eclipse
Lab 2 - A Simple Inverted Index
Lab 3 - PageRank over Wikipedia Corpus
Lab 4 - Clustering the Netflix Movie Data
http://code.google.com/edu/parallel/dsd-tutorial.html
http://code.google.com/edu/parallel/mapreduce-tutorial.html
Programming Language
Lots of Java programming practice
Teachers and Resources
http://net.pku.edu.cn/~course/cs402/2009/
http://groups.google.com/group/cs402pku
Hadoop: http://hadoop.apache.org/core/
Resources: http://net.pku.edu.cn/~course/cs402/2008/resource.html
http://net.pku.edu.cn/~yhf/
Homework
http://net.pku.edu.cn/~course/cs402/2009/
3-4 project
Lab 1
Lab 1 - Introduction to Hadoop, Eclipse
HW Reading 1
Intro Distributed system; Intro Parallel Programming.
http://code.google.com/edu/parallel/dsd-tutorial.html
http://code.google.com/edu/parallel/mapreduce-tutorial.html
Summary
Cloud computing brings
The possibility of using unlimited resources on demand, anytime and anywhere
The possibility of constructing and deploying applications that automatically scale to tens of thousands of computers
The possibility of constructing and running programs that deal with prodigious volumes of data
How to make it real?
Distributed File System
Distributed Computing Framework
Q&A
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA: ACM Press, 2003.
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada: ACM, 2008.
Google App Engine
App Engine handles HTTP(S) requests, nothing else
Think RPC: request in, processing, response out
Works well for the web and AJAX; also for other services
App configuration is dead simple
No performance tuning needed
Everything is built to scale
"Infinite" number of apps, requests/sec, storage capacity
APIs are simple, stupid
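The "request in, processing, response out" shape is easy to see in a plain WSGI application, the standard Python interface between web servers and applications. The sketch below uses only the WSGI calling convention and is not App Engine's own API; `call_app` is a hypothetical helper that plays the role of the server so the handler can be exercised without a network.

```python
def application(environ, start_response):
    """Request in (environ), processing, response out (status, headers, body)."""
    path = environ.get("PATH_INFO", "/")
    body = f"Hello from {path}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, path):
    """Invoke a WSGI app for a GET request and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call_app(application, "/hello"))
```

Because the handler keeps no state between calls, any number of copies can serve requests in parallel, which is what lets the platform scale it without tuning. To try it in a browser, the standard library's `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` can host it locally.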
App Engine Architecture
[Diagram: req/resp -> Python VM process (stdlib, app); stateful APIs: memcache, datastore; stateless APIs: mail, images, urlfetch; R/O FS]
Microsoft Windows Azure
Amazon Web Services
Amazon's infrastructure (auto scaling, load balancing)
Elastic Compute Cloud (EC2): scalable virtual private server instances
Simple Storage Service (S3)
Simple Queue Service (SQS): messaging
SimpleDB: database
Flexible Payments Service, Mechanical Turk, CloudFront, etc.
Amazon Web Services
Very flexible, lower-level offering (closer to hardware) = more possibilities, higher performing
Runs platform you provide (machine images)
Supports all major web languages
Industry-standard services (move off AWS easily)
Requires much more work, longer time-to-market
Deployment scripts, configuring images, etc.
Various libraries and GUI plug-ins for AWS do help
Price of Amazon EC2