DESCRIPTION

Presentation to University of Kentucky Computer Science graduate students on high-level Cloud Computing, how MapReduce works, and the current competition for Parallel Processing on a Massive Scale.

TRANSCRIPT

Page 1: Geoff Rothman Presentation on Parallel Processing

Cloud Computing & MapReduce: Parallel Processing on a Massive Scale

Geoff Rothman ([email protected])

March 27, 2010

Page 2: Geoff Rothman Presentation on Parallel Processing

Outline

1. Overview of Cloud Computing – Establish a general definition

2. Overview of Google MapReduce – Parallel programming with Cloud Computing

3. Debate between MapReduce & Parallel DBMS – Is one better than the other, or are they complementary?

Page 3: Geoff Rothman Presentation on Parallel Processing

Overview of Cloud Computing

Page 4: Geoff Rothman Presentation on Parallel Processing

Cloud Computing: What Does It Mean?

• On-demand network access to shared pool of configurable computing resources [1]

(Diagram: service providers connected via the cloud [2])

Presenter
Presentation Notes
A cloud is typically used in drawings to depict the Internet; notice the different service providers connected via the cloud. Details are abstracted away from the end user, who is unaware of where the application or computing power physically resides. (Diagram labels: Network, Servers, Storage, Apps, Services.)
- Differing opinions; it is a huge feat just to define the term.
- Geoff's pick: the National Institute of Standards & Technology definition (10/2009).
With the cloud, imagine using a provider that has unlimited servers or computing power at your disposal.
Grid vs. Cloud Computing:
- Batch job scheduling vs. real-time on-demand allocation.
- The cloud has more control over the system (IaaS) and the types of systems that can be deployed; a server can be booted into full production automatically.
- Virtualization is vital; VMs are not needed in grid computing. The focus is on resources leased through virtualization rather than on scheduling jobs.
Page 5: Geoff Rothman Presentation on Parallel Processing

NIST View of Cloud Computing

• Five characteristics

• Three service models

• Four deployment models

Presenter
Presentation Notes
- Seems to be the first example of a generally agreed-upon classification/taxonomy (Peter Mell and Tim Grance, NIST).
Page 6: Geoff Rothman Presentation on Parallel Processing

Cloud Computing Characteristics

• On-Demand & Automated

• Broad network access

• Resource Pooling

• Rapid Elasticity

• Measured Service

Presenter
Presentation Notes
On-Demand: resources available as needed, automatically, with no human intervention.
Broad network access: available through the network on heterogeneous clients (laptop, cellphone, PDA, etc.).
Resource Pooling: dynamic appropriation of resources (invisible to the user), including storage, processing, memory, network bandwidth, and VMs.
Rapid Elasticity: scale up/down quickly in an automated fashion; capacity appears unlimited to the end user.
Measured Service: just like a utility company, you can meter storage, processing, and bandwidth for charge-backs and reporting.
Page 7: Geoff Rothman Presentation on Parallel Processing

“SPI Model - as a Service”

• Software as a Service (SaaS): application system (Salesforce, WebEx)

• Platform as a Service (PaaS): infrastructure pre-existing; simply code and deploy (Google AppEngine, MS Azure, Force.com)

• Infrastructure as a Service (IaaS): raw infrastructure, servers and storage provided on-demand (Amazon Web Services, GoGrid) [3]

Presenter
Presentation Notes
http://news.cnet.com/8301-19413_3-10140278-240.html?tag=mncol;txt (by James Urquhart, 1/11/2009)
Software as a service (SaaS): applications delivered over the Internet on some form of "on-demand" billing system. Examples include Salesforce.com, Google Docs, WebEx, and Workday.
Platform as a service (PaaS): development platforms and middleware systems hosted by the vendor, allowing developers to code and deploy without worrying about infrastructure. Examples include Google AppEngine, Microsoft Azure, and Force.com.
Infrastructure as a service (IaaS): raw infrastructure, such as servers and storage, provided from vendor premises directly as an on-demand service. Examples include Amazon Web Services, GoGrid, and Flexiscale.
Page 8: Geoff Rothman Presentation on Parallel Processing

(Figure: cloud computing stack taxonomy [4])

Presenter
Presentation Notes
http://rationalsecurity.typepad.com/blog/2009/01/cloud-computing-taxonomy-ontology-please-review.html (Chris Hoff)
Cloud stack, like the OSI network layer model:
- Infrastructure as a Service: network, storage, VMs
- Platform as a Service: database, messaging, etc.
- Software as a Service: PC, mobile; voice, data & video
The lower in the stack, the less established / more insecure it is.
Page 9: Geoff Rothman Presentation on Parallel Processing

(Figure: cloud computing vendor taxonomy [5])

Presenter
Presentation Notes
http://www.opencrowd.com/views/cloud.php/2Security
Cloud computing vendor taxonomy:
- Addition of "cloud centers" or cloud infrastructure providers.
- Half could be gone by next year; the market is rapidly changing.
Page 10: Geoff Rothman Presentation on Parallel Processing

Cloud Deployment Models

• Private – Single tenant, owned and managed by the company or a service provider, either on- or off-premise; consumers are trusted

• Public – Single or multi-tenant (shared), owned by a service provider off-premise; consumers are untrusted

• Managed – Single or multi-tenant (shared), located in the org's datacenter but managed and secured by a service provider; consumers are trusted or untrusted

• Hybrid – Combination of public/private offerings; "cloud burst"; consumers are trusted or untrusted

Presenter
Presentation Notes
Cloud Security Alliance doc, based on the NIST definition.
1. Private Clouds are provided by the organization or an SP and offer a single-tenant (dedicated) operating environment with the elasticity and utility benefits of the Cloud model. Physical infrastructure may be on- or off-premise (SP). Management and security are controlled by the org or SP. The consumers of the service are considered "trusted." Trusted consumers are those who are considered part of an organization's legal/contractual umbrella, including employees, contractors, and business partners. Untrusted consumers are those that may be authorized to consume some or all services but are not logical extensions of the organization (3rd parties).
2. Public Clouds are provided by an SP and may offer either a single-tenant (dedicated) or multi-tenant (shared) operating environment with the elasticity and utility benefits of the Cloud model. The physical infrastructure is generally owned and managed by the SP and located within the provider's datacenters (off-premise). Consumers of Public Cloud services are considered untrusted.
3. Managed Clouds are provided by an SP and may offer either a single-tenant (dedicated) or multi-tenant (shared) operating environment with the elasticity and utility benefits of the Cloud model. The physical infrastructure is owned by and/or physically located in the org's datacenters, with an extension of management and security control planes controlled by the SP. Consumers of Managed Clouds may be trusted or untrusted. [AT&T??]
4. Hybrid Clouds are a combination of public and private cloud offerings that allow for transitive information exchange, and possibly application compatibility and portability, across disparate Cloud service offerings and providers utilizing standard or proprietary methodologies, regardless of ownership or location. Think cloud burst! This model provides for an extension of management and security control planes. Consumers of Hybrid Clouds may be trusted or untrusted.
Page 11: Geoff Rothman Presentation on Parallel Processing

Why use the Cloud? CFO View

• Operational vs Capital Expenditures

• Better Cash Flow

• Limited Financial Risk

• Better Balance Sheet

• Outsource non-core competencies [7]

Presenter
Presentation Notes
Gartner lists cloud computing among its Top 10 strategic technology areas for 2010; Merrill Lynch predicts a market worth $160B by 2011.
- OpEx: can deduct the full amount immediately instead of tracking a depreciating asset.
- Better cash flow: paying monthly allows more projects to be funded because less is due up front.
- Limited financial risk: pay monthly and analyze results, instead of paying everything up front with an uncertain return.
- Balance sheet: related to OpEx; nothing shows, as opposed to SW/HW carried as a long-term capital asset.
- Outsource non-core competencies: focus on critical CSR app issues instead of how to fix obscure issues in MS Exchange.
[Add part about picking a job wisely; don't take a job that could get outsourced.]
Page 12: Geoff Rothman Presentation on Parallel Processing

Why Use the Cloud? CIO View

• Analytics

• Parallel Batch Processing

• Compute-intensive desktop apps [6]

• Mobile Interactive Apps (GUI for mashups) [6]

• Webserver uptime / redundancy

• Accelerate project rollouts

Presenter
Presentation Notes
- Analytics: massive amounts of data to analyze. What are customers buying? Need for targeted ads & relevant search engine results.
- Parallel batch processing (quicker results for the same cost):
  1) The New York Times needed to generate PDF files for 11,000,000 articles (every article from 1851-1980) in the form of images scanned from the original paper. Using Hadoop and 100 EC2 instances at Amazon, in 24 hours the New York Times was able to convert 4 TB of scanned articles to 1.5 TB of PDF documents.
  2) Peter Harkins, a Senior Engineer at The Washington Post, used Oracle, Amazon, and 200 EC2 instances (1,407 server hours) to convert 17,481 pages of Hillary Clinton's travel documents into a form more friendly to use on the WWW within nine hours after they were released.
- Compute-intensive desktop apps (3D rendering): desktop apps extend into the cloud for more resources if needed; option to keep data and computing remote in the cloud and transfer the GUI back to the human user.
- Mobile interactive apps: mobile phone app front ends to data from various sources in the cloud; iPhone apps that combine information from different sources (the Urbanspoon online restaurant directory uses AWS).
- Webserver uptime: utility computing allows for heavy spikes in web traffic and automatic failover for websites.
- Accelerate project rollouts: the infrastructure is already there when starting with a cloud solution.
Page 13: Geoff Rothman Presentation on Parallel Processing

Overview of Google MapReduce

Presenter
Presentation Notes
Cloud computing is often identified with Google MapReduce seeing as Google has become such a large player. Major players like Visa, Facebook and Yahoo are using it as well.
Page 14: Geoff Rothman Presentation on Parallel Processing

Cloud Computing & Parallel Batch Processing: Overview of Map/Reduce

• Developed by Google to perform simple computations on massive amounts of data ( > 1TB) in a substantially reduced amount of time

• Hides details for– Parallelization

– Data distribution

– Load balancing

– Fault tolerance

Presenter
Presentation Notes
- Possibly derived from the functional programming language LISP.
- Programming simplified to allow for easy parallelism.
Page 15: Geoff Rothman Presentation on Parallel Processing

MapReduce Programming Model [8]

Input & Output: each a set of key/value pairs

Code two functions: map & reduce

map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes an input key/value pair
• Produces a set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
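To make the model concrete, here is a minimal single-process sketch in plain Python (my own illustration, not the Google library): it applies a user-supplied map function to every input record, groups the intermediate pairs by key (the shuffle), and applies a user-supplied reduce function per key. The names map_reduce, map_fn, and reduce_fn are assumptions for the example.

from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # records  : iterable of (in_key, in_value) pairs
    # map_fn   : (in_key, in_value) -> list of (out_key, intermediate_value)
    # reduce_fn: (out_key, [intermediate_value, ...]) -> merged output value
    intermediate = defaultdict(list)
    # Map phase: emit intermediate key/value pairs.
    for in_key, in_value in records:
        for out_key, out_value in map_fn(in_key, in_value):
            intermediate[out_key].append(out_value)
    # Grouping by out_key above plays the role of the shuffle phase.
    # Reduce phase: merge all intermediate values for each key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

The word-count, grep, and graph cases on the following slides can all be expressed as map_fn/reduce_fn pairs plugged into a driver of this shape.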

Page 16: Geoff Rothman Presentation on Parallel Processing

Case 1: Word Count

Determine the frequency of words in a file.

Map function (assign a value of 1 to every word):
- input is (file offset, various text)
- output is a key-value pair [(word, 1)]

The MR library's shuffle step takes the map output and groups it by key using a hash function.

Reduce function (total counts per word):
- input is (word, [1, 1, 1])
- output is (word, count)

Presenter
Presentation Notes
-simple, flexible…could do accumulator in map function if you wanted
Page 17: Geoff Rothman Presentation on Parallel Processing

Word Count – Sample Code [9]

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
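The pseudocode above follows the Google paper [9]. As a hedged illustration only, here is the same logic in plain Python, runnable with the toy map_reduce driver sketched after the programming-model slide (the function names are mine, not from the slides):

def wc_map(offset, text):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    # Sum all of the 1s emitted for this word.
    return sum(counts)

records = [(0, "i love to code"), (0, "to code is to love")]
# map_reduce(records, wc_map, wc_reduce)
# -> {'i': 1, 'love': 2, 'to': 3, 'code': 2, 'is': 1}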

Page 18: Geoff Rothman Presentation on Parallel Processing

Word Count

Input:
File 1: "i love to code"
File 2: "to code is to love"

Map tasks:
Map1 (File 1): [(i,1)] [(love,1)] [(to,1)] [(code,1)]
Map2 (File 2): [(to,1)] [(code,1)] [(is,1)] [(to,1)] [(love,1)]

The MR library groups intermediate keys and values in the "Shuffle Phase".

Reduce tasks:
Reducer1 (writes output File 1): (code, [1,1]) -> (code,2); (i, [1]) -> (i,1); (is, [1]) -> (is,1)
Reducer2 (writes output File 2): (love, [1,1]) -> (love,2); (to, [1,1,1]) -> (to,3)

Result: code,2  i,1  is,1  love,2  to,3

* File 2 will have a key-value pair of (to,2) after the map when using MR Combiner functionality.
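As a hedged sketch (my own illustration, not from the slides), the combiner and the hash-based shuffle partitioning mentioned above could look like this in the toy Python setting; hash(key) % R is a typical way a map worker picks which of the R reduce partitions an intermediate pair belongs to.

from collections import Counter

def wc_map_with_combiner(offset, text):
    # Combiner: pre-aggregate counts inside the map task to save network
    # bandwidth, e.g. File 2 emits (to, 2) instead of (to, 1) twice.
    return list(Counter(text.split()).items())

def partition(key, R):
    # Shuffle step: route each intermediate key to one of R reduce partitions.
    return hash(key) % R

# wc_map_with_combiner(0, "to code is to love")
# -> [('to', 2), ('code', 1), ('is', 1), ('love', 1)]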

Page 19: Geoff Rothman Presentation on Parallel Processing

MapReduce Features

• Fault Tolerance

• Redundant Execution

• Locality Optimization

• Skip Bad Records

• Sort before Reduce

• Combiner

Presenter
Presentation Notes
Fault tolerance: periodic ping; re-execute map or reduce tasks as necessary.
Redundant execution: improves time and gets rid of "stragglers" (soft errors, resource overload); near the end of an MR job the master schedules backup copies of tasks, and whichever finishes first wins.
Locality optimization: tries to place input file splits on the same machine or rack as the worker; saves bandwidth.
Skip bad records: after 2 failures on the same record, the master says to skip it; willing to tolerate imperfections (a PDBMS can't skip banking info).
Sorting for reduce: the reduce worker sorts; prevents random data access.
Combiner: saves network bandwidth by aggregating key/values in the map function.
Page 20: Geoff Rothman Presentation on Parallel Processing

MapReduce System Flow [8]

Presenter
Presentation Notes
1) The MapReduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2) One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3) A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.
6) The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. After successful completion, the output of the MapReduce execution is available in the R output files.

To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed when failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
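Purely as an illustrative sketch (my own code, not Google's implementation), the M-way input split, R-way partitioning, shuffle, and per-partition reduce described above can be mimicked in a few lines of single-process Python; M, R, run_job, and the round-robin splitter are assumptions for the example.

from collections import defaultdict

def run_job(records, map_fn, reduce_fn, M=4, R=2):
    # Single-process mimic of the MapReduce system flow (no real cluster).
    # 1) Split the input into M pieces (here: round-robin over records).
    splits = [[] for _ in range(M)]
    for i, record in enumerate(records):
        splits[i % M].append(record)
    # 3-4) Each "map task" runs the user map function and partitions its
    #      intermediate output into R regions with a partitioning function.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for in_key, in_value in split:
            for out_key, out_value in map_fn(in_key, in_value):
                regions[hash(out_key) % R][out_key].append(out_value)
    # 5-6) Each "reduce task" sorts its region by key and applies the user
    #      reduce function, producing one output "file" per partition.
    # 7) The R output "files" are returned to the caller.
    return [{k: reduce_fn(k, v) for k, v in sorted(region.items())}
            for region in regions]

# Example with the word-count functions shown earlier:
# run_job([(0, "i love to code"), (0, "to code is to love")], wc_map, wc_reduce)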
Page 21: Geoff Rothman Presentation on Parallel Processing

MapReduce Function Flow [8]

Presenter
Presentation Notes
Group by Key = Shuffle Phase
Page 22: Geoff Rothman Presentation on Parallel Processing

Map & Reduce Parallel Execution [8]

Page 23: Geoff Rothman Presentation on Parallel Processing

Case 2: Distributed Grep

Counts lines in all files that match a <regex> and displays the counts. Other uses include analyzing web server access logs to find the top requested pages that match a given pattern.

Map function (establish a match):
- input is (file offset, line)
- output is either:
  1. an empty list [] (the line does not match, e.g. neither 'A' nor 'C')
  2. a key-value pair [(line, 1)] (if it matches)

Reduce function (total counts):
- input is (char, [1, 1, ...])
- output is (char, n), where n is the number of 1s in the list.

http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC.ppt
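A hedged Python sketch of this case under the same toy driver (the example pattern and the function names are my own illustration, chosen to match the 'A'/'C' example on the next slide):

import re

PATTERN = re.compile(r"[AC]")  # example pattern: lines containing 'A' or 'C'

def grep_map(offset, line):
    # Emit (line, 1) only if the line matches the pattern, otherwise nothing.
    return [(line, 1)] if PATTERN.search(line) else []

def grep_reduce(line, ones):
    # Count how many matching lines were seen.
    return sum(ones)

# map_reduce([(0, "C"), (2, "B"), (4, "B"), (6, "C"), (0, "C"), (2, "A")],
#            grep_map, grep_reduce)
# -> {'C': 3, 'A': 1}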

Page 24: Geoff Rothman Presentation on Parallel Processing

Distributed Grep

Input:
File 1 (one character per line): C, B, B, C
File 2: C, A

Map tasks:
File 1: (0, C) -> [(C, 1)]; (2, B) -> []; (4, B) -> []; (6, C) -> [(C, 1)]
File 2: (0, C) -> [(C, 1)]; (2, A) -> [(A, 1)]

Reduce tasks:
(A, [1]) -> (A, 1)
(C, [1, 1, 1]) -> (C, 3)

Result: C 3; A 1

Page 25: Geoff Rothman Presentation on Parallel Processing

Case 3: Max Speed Serve

Data analysis needed: for all professional tennis tournaments over the past 3 years, process log files to determine the fastest serve speed each year.

Map function (enumerate speeds for each year):
- input is (file offset, Year Speed)
- output is a key-value pair [(Year, Speed)]

Reduce function (determine max speed each year):
- input is (Year, [speed1, ..., speedN])
- output is (Year, Speed), where Speed is the fastest recorded that year.
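A hedged Python sketch of this case with the toy driver (the log-line format and the function names are assumptions for illustration):

def speed_map(offset, line):
    # Each log line is assumed to look like "2008 136" (year, serve speed).
    year, speed = line.split()
    return [(year, int(speed))]

def speed_reduce(year, speeds):
    # Keep only the fastest serve recorded for that year.
    return max(speeds)

# records = [(0, "2008 136"), (0, "2009 126"), (0, "2009 132"),
#            (0, "2008 134"), (0, "2009 127"), (0, "2010 124")]
# map_reduce(records, speed_map, speed_reduce)
# -> {'2008': 136, '2009': 132, '2010': 124}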

Page 26: Geoff Rothman Presentation on Parallel Processing

Max Speed Serve

Input:
File 1: 2008 136 / 2009 126 / 2009 132
File 2: 2008 134 / 2009 127 / 2010 124

Map tasks:
[(2008,136)] [(2009,126)] [(2009,132)] [(2008,134)] [(2009,127)] [(2010,124)]

Reduce tasks:
(2008, [136, 134]) -> (2008, 136)
(2009, [126, 132, 127]) -> (2009, 132)
(2010, [124]) -> (2010, 124)

Result: 2008 - 136; 2009 - 132; 2010 - 124

* Some values are dropped after the map when using MR Combiner functionality (each map task forwards only its local maximum per year).

Page 27: Geoff Rothman Presentation on Parallel Processing

Case 4: Word Proximity

Find occurrences of pairs of words where word1 is located within 4 words of word2.

Map function (assign a value of 1 to every match):
- input is (file offset, various text)
- output is a key-value pair [(word1|word2, 1)]

Reduce function (total count per match):
- input is (word1|word2, [1, 1, 1])
- output is (word1|word2, count)

Presenter
Presentation Notes
-improves relevancy of search engine findings
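As with the earlier cases, a hedged Python sketch (my own illustration; WORD1, WORD2, and the 4-word window are parameters taken from the slide, and this sketch only looks forward from word1):

WORD1, WORD2, WINDOW = "piece", "pie", 4

def proximity_map(offset, text):
    # Emit ("piece|pie", 1) whenever WORD2 appears within WINDOW words after WORD1.
    words = text.replace(";", " ").split()
    matches = []
    for i, w in enumerate(words):
        if w == WORD1 and WORD2 in words[i + 1:i + 1 + WINDOW]:
            matches.append((WORD1 + "|" + WORD2, 1))
    return matches

def proximity_reduce(pair, ones):
    return sum(ones)

# map_reduce([(0, "i have a piece of the pie"),
#             (0, "it is a piece of cake; it doesn't even look like pie")],
#            proximity_map, proximity_reduce)
# -> {'piece|pie': 1}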
Page 28: Geoff Rothman Presentation on Parallel Processing

Word Proximity

Word1 = "piece", Word2 = "pie"

Input:
File 1: "i have a piece of the pie"
File 2: "it is a piece of cake; it doesn't even look like pie"

Map tasks:
(0, i have a piece of the pie) -> [(piece|pie, 1)]
(0, it is a piece of cake; it doesn't even look like pie) -> []

Reduce tasks:
(piece|pie, [1]) -> (piece|pie, 1)

Result: piece|pie, 1

Presenter
Presentation Notes
-word proximity crucial for accurate results
Page 29: Geoff Rothman Presentation on Parallel Processing

Case 5: Reverse Web-Link Graph

Given a list of website home pages (W1...W4) and every link on each page, point the destination sites back to the original source web site.

Map function:
- input is an adjacency list in the format (source: dest1, dest2, ...)
- output is a key-value pair (dest, source)

Reduce function (create an adjacency list with dest as the key):
- input/output is (dest, [source1, source2, ...])
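A hedged Python sketch of this case under the toy driver (encoding the adjacency list as (source, [destinations]) records is my assumption):

def link_map(source, destinations):
    # Reverse each edge: for every link source -> dest, emit (dest, source).
    return [(dest, source) for dest in destinations]

def link_reduce(dest, sources):
    # Collect every page that links to this destination.
    return sorted(sources)

# graph = [("W1", ["W2", "W4"]), ("W2", ["W1", "W3", "W4"]),
#          ("W3", ["W4"]), ("W4", ["W1", "W3"])]
# map_reduce(graph, link_map, link_reduce)
# -> W1: [W2, W4], W2: [W1], W3: [W2, W4], W4: [W1, W2, W3]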

Page 30: Geoff Rothman Presentation on Parallel Processing

Link Reversal

Input adjacency list (source: destinations):
W1: W2, W4
W2: W1, W3, W4
W3: W4
W4: W1, W3

Map tasks (reverse each edge):
(W1,W2) -> (W2,W1); (W1,W4) -> (W4,W1); (W2,W1) -> (W1,W2); (W2,W3) -> (W3,W2); (W2,W4) -> (W4,W2); (W3,W4) -> (W4,W3); (W4,W1) -> (W1,W4); (W4,W3) -> (W3,W4)

The MR library groups intermediate keys and values in the "Shuffle Phase".

Reduce tasks:
(W1, [W2, W4])
(W2, [W1])
(W3, [W2, W4])
(W4, [W1, W2, W3])

Output (reversed list):
W1: W2, W4
W2: W1
W3: W2, W4
W4: W1, W2, W3

Page 31: Geoff Rothman Presentation on Parallel Processing

Why Use MapReduce?

• Hides messy details of distributed infrastructure

• MapReduce simplifies the programming paradigm to allow for easy parallel execution

• Easily scales to thousands of machines

Presenter
Presentation Notes
- Can be applied not only to word-related functions but also to image manipulation, social networks, and targeted relevant ads.
MapReduce in 4 words: Scale, Reliable, Simple, Affordable.
Page 32: Geoff Rothman Presentation on Parallel Processing

MapReduce Jobs Run @ Google [15]

                                 Aug. '04   Mar. '06   Sep. '07
Number of jobs (1000s)                 29        171      2,217
Avg. completion time (secs)           634        874        395
Machine years used                    217      2,002     11,081
Map input data (TB)                 3,288     52,254    403,152
Map output data (TB)                  758      6,743     34,774
Reduce output data (TB)               193      2,970     14,018
Avg. machines per job                 157        268        394
Unique map implementations            395      1,958      4,083
Unique reduce implementations         269      1,208      2,418

Page 33: Geoff Rothman Presentation on Parallel Processing

Current Debate: MapReduce vs Parallel DBMS

Presenter
Presentation Notes
One of the tasks of the independent study was to understand the current debate between the two.
Page 34: Geoff Rothman Presentation on Parallel Processing

Why Not Use A Parallel DBMS?

• Parallel DBMS:
– multiple CPUs, multiple servers
– classic parallel programming concepts
– HUGE established industry $$$

• Parallel DBMS Vendors – Teradata (NCR), DB2 (IBM), Oracle (via Exadata), Greenplum, Vertica, etc.

Presenter
Presentation Notes
*START HERE*
- Been in existence for 20+ years; proven technology, high-level query language, schemas, etc.
- Same concepts as MR: shared-nothing architecture, partitioning, distributed processing, merge; maybe that was the source of the confusion originally!
Page 35: Geoff Rothman Presentation on Parallel Processing

“MapReduce is a Major Step Backward”

Stonebraker & Dewitt attack on MR (1/17/08) [10,11]

– a step backwards in database access

– a poor implementation

– not novel

– missing features

– incompatible with DBMS tools

Presenter
Presentation Notes
Dave Dewitt: UW-Madison professor, author of 100+ technical reports. Michael Stonebraker: relational database pioneer (Ingres, Postgres at UC Berkeley, Vertica, now H-Store). They challenge the Google MapReduce "hype":
1. MapReduce is a step backwards in database access: no schema, no separation of schema from application, no high-level access language (SQL).
2. MapReduce is a poor implementation: no indexes, pull instead of push on reduce.
3. MapReduce is not novel: partitioning and UDFs supported for decades.
4. MapReduce is missing features: bulk loader, indexing, updates, transactions, integrity constraints, referential integrity, views.
5. MapReduce is incompatible with DBMS tools: report writers, BI, data mining, replication, database design tools.
MR enthusiasts were outraged and accused them of comparing "apples to oranges"; this prompted a follow-up blog posting, and the controversy spilled over into not only the web but the news media as well. http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
Page 36: Geoff Rothman Presentation on Parallel Processing

“Comparison of Approaches to Large-Scale Data Analysis”

Stonebraker & Dewitt comparison of Hadoop MR vs Vertica & DBMS-X (7/2009) [12]

– Hadoop
• easy to install, get up & running
• maintenance of applications is harder
• good for fault tolerance in queries
• slow because of reading the entire file each time and pulling files on the reduce step

– Vertica & DBMS-X
• much faster than Hadoop because of indexes, schema, column orientation, compression & "warm start-up at boot time"

Page 37: Geoff Rothman Presentation on Parallel Processing

“MapReduce and Parallel DBMSs: Friends or Foes?”

Dewitt & Stonebraker update their position (1/2010) [13]

– Hadoop MR and Parallel DBMS are complementary

– Use Hadoop MR for subsets of tasks

– Use Parallel DBMS for all other applications

– Hadoop still needs significant improvements

Presenter
Presentation Notes
- They are complementary.
- Use Hadoop for subsets of tasks: Extract, Transform, Load (ETL); complex analytics (multiple passes); semi-structured data (key/value pairs); "quick and dirty" analyses (quick start-up time); limited budget.
- Use a Parallel DBMS for all other applications.
- Hadoop needs significant improvements: repetitive record parsing, no compression, the hit of writing intermediate files to disk, inferior scheduling, row storage.
Page 38: Geoff Rothman Presentation on Parallel Processing

“MapReduce: A Flexible Data Processing Tool”

Jeffrey Dean & Sanjay Ghemawat (Google) rebuttal (1/2010) [14]

– MR can input data from heterogeneous environments

– MR can use indices as input to MR

– Useful for Complex functions

– “Protocol Buffers” parse much faster

– MR pull model non-negotiable

– Addresses performance concerns

Presenter
Presentation Notes
*Notice they call it a data processing tool.*
- MR can input data in heterogeneous environments: redefine the reader/writer functions that operate on DFS, query results, Google BigTable, etc.; simply redefine the reader/writer functions for a new storage system. It is inefficient to copy/load data into a PDBMS if it is only needed for a couple of queries.
- MR can use indices as input: filenames, a subset of columns from BigTable, etc.
- Useful for complex functions: large-scale image mining/manipulation, link mining, fault-tolerant parallel execution of programs across sets of input data, written in high-level languages such as Pig Latin and Sawzall.
- Use structured data via Google's "Protocol Buffers" messages instead of textual input (strings): protocol buffers use a binary encoding and allow Google programmers to share data types easily; the application doesn't need to be changed for the data. Parsing a string = 1731 ns/record; parsing a protocol buffer = 20 ns/record.
- The MR pull model is crucial for fault tolerance: batching, sorting, and grouping are used to mitigate the pull cost; MR jobs encounter a few failures, and a push model would require re-execution of all map tasks.
- Addresses performance concerns (Hadoop != Google MR).
Page 39: Geoff Rothman Presentation on Parallel Processing

Conclusions

• Hadoop MapReduce is a solid choice for leveraging the power of Cloud Computing when tackling specific parallel data processing tasks; use a PDBMS for all other tasks.

• MR and PDBMS can learn from each other

• Open source Hadoop MR continues to gain ground on performance and efficiency

• Battle of MR vs PDBMS subsiding for now

Presenter
Presentation Notes
- Not going to throw SQL away; Hive was created as a data warehouse for Hadoop MR.
- HadoopDB by Abadi (wanting to improve MR technology).
Page 40: Geoff Rothman Presentation on Parallel Processing

Questions???

Page 41: Geoff Rothman Presentation on Parallel Processing

References

[1] http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc
[2] http://en.wikipedia.org/wiki/File:Cloud_computing.svg
[3] http://news.cnet.com/8301-19413_3-10140278-240.html?tag=mncol;txt
[4] http://rationalsecurity.typepad.com/blog/2009/01/cloud-computing-taxonomy-ontology-please-review.html
[5] http://www.opencrowd.com/views/cloud.php/2Security
[6] http://berkeleyclouds.blogspot.com
[7] Forrester Research, "Talking to Your CFO About Cloud Computing," Ted Schadler; Oct. 29, 2008.
[8] http://code.google.com/edu/parallel/mapreduce-tutorial.html
[9] http://labs.google.com/papers/mapreduce.html
[10] http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
[11] http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
[12] "Comparison of Approaches to Large-Scale Data Analysis," Pavlo, Abadi, Stonebraker, Dewitt, et al. (7/2009)
[13] ACM, "MapReduce and Parallel DBMSs: Friends or Foes?," Stonebraker, Abadi, Dewitt, et al. (1/2010)
[14] ACM, "MapReduce: A Flexible Data Processing Tool," Jeffrey Dean & Sanjay Ghemawat (1/2010)
[15] http://googlesystem.blogspot.com/2008/01/google-reveals-more-mapreduce-stats.html