大规模数据处理 / 云计算 (Large-Scale Data Processing / Cloud Computing), Lecture 1 - Introduction to MapReduce. 彭波 (Peng Bo), 北京大学信息科学技术学院 (School of EECS, Peking University), 4/21/2011.

  • Slide 1
  • 大规模数据处理 / 云计算, Lecture 1 - Introduction to MapReduce. 4/21/2011. http://net.pku.edu.cn/~course/cs402/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. Jimmy Lin, University of Maryland; SEWMGroup.
  • Slide 2
  • What is this course about? Data-intensive information processing; large-data (web-scale) problems; a focus on MapReduce programming. An entry-level course.
  • Slide 3
  • What is MapReduce? A programming model for expressing distributed computations at a massive scale; an execution framework for organizing and performing such computations; and an open-source implementation called Hadoop.
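To make the two halves of that definition concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's org.apache.hadoop.mapreduce API. It follows the standard WordCount example that the course schedule refers to later; treat it as an illustration under those assumptions, not as lecture material.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: (offset, line of text) -> (word, 1) for each token in the line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure and submit the job; args[0] is input, args[1] is output.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper expresses the computation (emit a count per word), the framework groups the pairs by key, and the reducer sums each group; the combiner reuses the reducer map-side to cut shuffle traffic. This division of labor is exactly the "programming model vs. execution framework" split named above.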
  • Slide 4
  • Why Large Data?
  • Slide 5
  • How much data? Google processes 20 PB a day (2008). The Wayback Machine has 3 PB + 100 TB/month (3/2009). Facebook has 2.5 PB of user data + 15 TB/day (4/2009). eBay has 6.5 PB of user data + 50 TB/day (5/2009). CERN's LHC will generate 15 PB a year (??). "640K ought to be enough for anybody."
  • Slide 6
  • Slide 7
  • Happening everywhere! Molecular biology (cancer): microarray chips. Particle events (LHC): particle colliders. Simulations (Millennium): microprocessors. Network traffic (spam): fiber optics. Data rates on the order of 300M/day, 1B, 1M/sec.
  • Slide 8
  • Photo: Maximilien Brice, CERN.
  • Slide 9
  • Slide 10
  • Photo: Maximilien Brice, CERN.
  • Slide 11
  • Photo: Maximilien Brice, CERN.
  • Slide 12
  • No data like more data! (Banko and Brill, ACL 2001; Brants et al., EMNLP 2007) s/knowledge/data/g; How do we get here if we're not Google?
  • Slide 13
  • Example: information extraction. Answering factoid questions: pattern matching on the Web works amazingly well. "Who shot Abraham Lincoln?" becomes a search for the pattern "X shot Abraham Lincoln". Learning relations: start with seed instances, search for patterns on the Web, then use the patterns to find more instances. From seeds like Birthday-of(Mozart, 1756) and Birthday-of(Einstein, 1879), snippets such as "Wolfgang Amadeus Mozart (1756 - 1791)" and "Einstein was born in 1879" yield surface patterns like "PERSON (DATE" and "PERSON was born in DATE". (Brill et al., TREC 2001; Lin, ACM TOIS 2007; Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002)
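As a rough illustration of the pattern-matching step, here is a self-contained Java sketch. The snippets and the two regular expressions are hypothetical stand-ins for web search results and for surface patterns that a real bootstrapping system (e.g., Ravichandran and Hovy's) would learn from the seed instances:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtraction {
  public static void main(String[] args) {
    // Hypothetical snippets standing in for web search results.
    List<String> snippets = Arrays.asList(
        "Wolfgang Amadeus Mozart (1756 - 1791) was a prolific composer.",
        "Einstein was born in 1879 in Ulm, Germany.");

    // Two surface patterns for the Birthday-of(PERSON, DATE) relation;
    // a real system would induce these from seeds such as
    // Birthday-of(Mozart, 1756) rather than hand-code them.
    List<Pattern> patterns = Arrays.asList(
        Pattern.compile("([A-Z][\\w. ]+?) \\((\\d{4}) - \\d{4}\\)"), // PERSON (DATE - DATE)
        Pattern.compile("([A-Z][\\w. ]+?) was born in (\\d{4})"));   // PERSON was born in DATE

    // Apply every pattern to every snippet to harvest new relation instances.
    for (String snippet : snippets) {
      for (Pattern p : patterns) {
        Matcher m = p.matcher(snippet);
        while (m.find()) {
          System.out.println("Birthday-of(" + m.group(1) + ", " + m.group(2) + ")");
        }
      }
    }
  }
}
```

Each newly extracted instance can in turn seed another round of pattern search, which is what makes the approach scale with the amount of web text available.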
  • Slide 14
  • Example: scene completion (Hays and Efros, CMU, "Scene Completion Using Millions of Photographs," SIGGRAPH 2007). Image database grouped by semantic content: 30 different Flickr.com groups, 2.3M images total (396 GB). Select the candidate images most suitable for filling the hole: classify images with the gist scene detector [Torralba], color similarity, local context matching. Computation: index images offline; 50 min. scene matching, 20 min. local matching, 4 min. compositing, reduced to 5 minutes total by using 5 machines. Extension: Flickr.com has over 500 million images.
  • Slide 15
  • More data, more gains? (Statistics from the CNNIC 2010 report.)
  • Slide 16
  • Slide 17
  • Did you know?
  • Slide 18
  • Did you know? We are currently preparing our students for jobs that don't yet exist. It is estimated that a week's worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century. The amount of new technical information is doubling every 2 years. So what does IT ALL MEAN?
  • Slide 19
  • We are living in exponential times.
  • Slide 20
  • Two different views: Gordon Bell (MyLifeBits, record everything) versus Jennifer Widom (a "thrower-awayer", trying to live an efficient life so that one has time to work and be with one's family).
  • Slide 21
  • Information Overload
  • Slide 22
  • What is Cloud Computing?
  • Slide 23
  • The best thing since sliced bread? Before clouds there were grids and vector supercomputers. Cloud computing means many different things: large-data processing; a rebranding of Web 2.0; utility computing; everything as a service.
  • Slide 24
  • Rebranding of Web 2.0: rich, interactive web applications, where "clouds" refer to the servers that run them; AJAX as the de facto standard (for better or worse). Examples: Facebook, YouTube, Gmail, ... "The network is the computer," take two: user data is stored in the clouds; rise of the netbook, smartphones, etc.; the browser is the OS.
  • Slide 25
  • Source: Wikipedia (Electricity meter)
  • Slide 26
  • Utility computing. What? Computing resources as a metered service (pay as you go) and the ability to dynamically provision virtual machines. Why? Cost (capital vs. operating expenses), scalability (infinite capacity), elasticity (scale up or down on demand). Does it make sense? Benefits to cloud users; a business case for cloud providers. "I think there is a world market for about five computers."
  • Slide 27
  • Everything as a service. Utility computing = Infrastructure as a Service (IaaS): why buy machines when you can rent cycles? Examples: Amazon's EC2, Rackspace. Platform as a Service (PaaS): give me a nice API and take care of the maintenance, upgrades, ... Example: Google App Engine. Software as a Service (SaaS): just run it for me! Examples: Gmail, Salesforce.
  • Slide 28
  • Utility computing: pay as you go; use less, pay less (e.g., Microsoft's cloud pricing pitch).
  • Slide 29
  • Platform as a Service (PaaS): web application services delivered over the Internet, built on a multi-tenant architecture platform.
  • Slide 30
  • Software as a Service (SaaS): a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
  • Slide 31
  • Who cares? Ready-made large-data problems: lots of user-generated content and even more user behavior data (examples: Facebook friend suggestions, Google ad placement); business intelligence: gather everything in a data warehouse and run analytics to generate insight. Utility computing: provision Hadoop clusters on demand in the cloud, lowering the barrier to entry for tackling large-data problems. The result is commoditization and democratization of large-data capabilities.
  • Slide 32
  • The Story Behind Hadoop
  • Slide 33
  • Google-IBM Cloud Computing Initiative: October 2007; Google and IBM together with six universities; MapReduce and Hadoop; a cluster on the order of 2,000-2,500 machines.
  • Slide 34
  • Cloud Computing Initiative: Google and IBM team on a cloud computing initiative for universities (2007-2008), providing several hundred computers accessed through the Internet to test parallel programming projects. The idea for the program came from Google senior software engineer Christophe Bisciglia. See also Google Code University.
  • Slide 35
  • The Information Factories: the Googleplex (pre-2008). Servers number 450,000, according to the lowest estimate, with 200 petabytes of hard disk storage and four petabytes of RAM. To handle the current load of 100 million queries a day, input-output bandwidth must be in the neighborhood of 3 petabits per second.
  • Slide 36
  • Google infrastructure: "The Google File System," SOSP 2003 (Bolton Landing, NY, USA: ACM Press); "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004; "Bigtable: A Distributed Storage System for Structured Data" (awarded Best Paper), OSDI 2006.
  • Slide 37
  • Hadoop project: Doug Cutting.
  • Slide 38
  • History of Hadoop:
    2004 - Initial versions of what are now the Hadoop Distributed File System and Map-Reduce, implemented by Doug Cutting & Mike Cafarella.
    December 2005 - Nutch ported to the new framework; Hadoop runs reliably on 20 nodes.
    January 2006 - Doug Cutting joins Yahoo!.
    February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
    March 2006 - Formation of the Yahoo! Hadoop team.
    April 2006 - Sort benchmark run on 188 nodes in 47.9 hours.
    May 2006 - Yahoo! sets up a Hadoop research cluster (300 nodes); sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
    October 2006 - Research cluster reaches 600 nodes.
    December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs.
    January 2007 - Research cluster reaches 900 nodes.
    April 2007 - Research clusters: two clusters of 1,000 nodes.
    September 2008 - Scaling Hadoop to 4,000 nodes at Yahoo!.
  • Slide 39
  • Google Code University: 2008 seminar "Mass Data Processing Technology on Large Scale Clusters," Tsinghua University, with Aaron Kimball.
  • Slide 40
  • Startup: Cloudera. Cloudera is pushing a commercial distribution for Hadoop. People: Mike Olson, Aaron Kimball, Doug Cutting, Christophe Bisciglia, Tom White.
  • Slide 41
  • Course Administrivia
  • Slide 42
  • Textbooks: [Tom] Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009 (Chinese edition available). [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2010 (ebook available).
  • Slide 43
  • Course schedule (ID, topics, contents, reading):
    1. Background: cloud computing. Reading: [Lin] Ch1: Introduction; [Tom] Ch1: Meet Hadoop.
    2. System: MapReduce & runtime system; HDFS. Reading: [Lin] Ch2: MapReduce Basics; [Tom] Ch6: How MapReduce Works; [paper].
    3. Environment: Hadoop; (*) Hadoop environment setup; (*) WordCount. Reading: [Tom] Ch9: Setting Up a Hadoop Cluster; [Tom] Appendix A: Installing Apache Hadoop; [Tom] Ch2: MapReduce.
    4. Algorithm design: MapReduce program development; basic MapReduce algorithm design and design patterns; (*) WordCount (MapReduce library classes). Reading: [Tom] Ch5: Developing a MapReduce Application; [Lin] Ch3: MapReduce Algorithm Design.
    5. Text retrieval: (*) Inverted index. Reading: [Lin] Ch4: Inverted Indexing for Text Retrieval.
    6. Graph algorithms: (*) PageRank (MapReduce features: side data distribution). Reading: [Lin] Ch5: Graph Algorithms.
  • Slide 44
  • Recap: Why large data? Cloud computing and Hadoop. What is MapReduce?
  • Slide 45
  • Q&A? Thank you!