大规模数据处理 / 云计算 (Large-Scale Data Processing / Cloud Computing), Lecture 1 - Introduction to MapReduce. 彭波 (Peng Bo), 北京大学信息科学技术学院 (School of EECS, Peking University), 4/21/2011.

  • Slide 1
  • 大规模数据处理 / 云计算, Lecture 1 - Introduction to MapReduce. 4/21/2011. http://net.pku.edu.cn/~course/cs402/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. Jimmy Lin, University of Maryland; SEWMGroup.
  • Slide 2
  • What is this course about? Data-intensive information processing; large-data (web-scale) problems; a focus on MapReduce programming. An entry-level course.
  • Slide 3
  • What is MapReduce? A programming model for expressing distributed computations at a massive scale; an execution framework for organizing and performing such computations; and an open-source implementation called Hadoop.
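To make the two halves of that definition concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's org.apache.hadoop.mapreduce API. It follows the standard WordCount example that the course schedule refers to later; treat it as an illustration under those assumptions, not as lecture material.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: (offset, line of text) -> (word, 1) for each token in the line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure and submit the job; args[0] is input, args[1] is output.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper expresses the computation (emit a count per word), the framework groups the pairs by key, and the reducer sums each group; the combiner reuses the reducer map-side to cut shuffle traffic. This division of labor is exactly the "programming model vs. execution framework" split named above.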
  • Slide 4
  • Why Large Data?
  • Slide 5
  • How much data? Google processes 20 PB a day (2008). The Wayback Machine has 3 PB + 100 TB/month (3/2009). Facebook has 2.5 PB of user data + 15 TB/day (4/2009). eBay has 6.5 PB of user data + 50 TB/day (5/2009). CERN's LHC will generate 15 PB a year (??). "640K ought to be enough for anybody."
  • Slide 6
  • Slide 7
  • Happening everywhere! Molecular biology (cancer): microarray chips. Particle events (LHC): particle colliders. Simulations (Millennium): microprocessors. Network traffic (spam): fiber optics. Data rates on the order of 300M/day, 1B, 1M/sec.
  • Slide 8
  • Photo: Maximilien Brice, CERN.
  • Slide 9
  • Slide 10
  • Photo: Maximilien Brice, CERN.
  • Slide 11
  • Photo: Maximilien Brice, CERN.
  • Slide 12
  • No data like more data! (Banko and Brill, ACL 2001; Brants et al., EMNLP 2007) s/knowledge/data/g; How do we get here if we're not Google?
  • Slide 13
  • Example: information extraction. Answering factoid questions: pattern matching on the Web works amazingly well. "Who shot Abraham Lincoln?" becomes a search for the pattern "X shot Abraham Lincoln". Learning relations: start with seed instances, search for patterns on the Web, then use the patterns to find more instances. From seeds like Birthday-of(Mozart, 1756) and Birthday-of(Einstein, 1879), snippets such as "Wolfgang Amadeus Mozart (1756 - 1791)" and "Einstein was born in 1879" yield surface patterns like "PERSON (DATE" and "PERSON was born in DATE". (Brill et al., TREC 2001; Lin, ACM TOIS 2007; Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002)
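As a rough illustration of the pattern-matching step, here is a self-contained Java sketch. The snippets and the two regular expressions are hypothetical stand-ins for web search results and for surface patterns that a real bootstrapping system (e.g., Ravichandran and Hovy's) would learn from the seed instances:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtraction {
  public static void main(String[] args) {
    // Hypothetical snippets standing in for web search results.
    List<String> snippets = Arrays.asList(
        "Wolfgang Amadeus Mozart (1756 - 1791) was a prolific composer.",
        "Einstein was born in 1879 in Ulm, Germany.");

    // Two surface patterns for the Birthday-of(PERSON, DATE) relation;
    // a real system would induce these from seeds such as
    // Birthday-of(Mozart, 1756) rather than hand-code them.
    List<Pattern> patterns = Arrays.asList(
        Pattern.compile("([A-Z][\\w. ]+?) \\((\\d{4}) - \\d{4}\\)"), // PERSON (DATE - DATE)
        Pattern.compile("([A-Z][\\w. ]+?) was born in (\\d{4})"));   // PERSON was born in DATE

    // Apply every pattern to every snippet to harvest new relation instances.
    for (String snippet : snippets) {
      for (Pattern p : patterns) {
        Matcher m = p.matcher(snippet);
        while (m.find()) {
          System.out.println("Birthday-of(" + m.group(1) + ", " + m.group(2) + ")");
        }
      }
    }
  }
}
```

Each newly extracted instance can in turn seed another round of pattern search, which is what makes the approach scale with the amount of web text available.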
  • Slide 14
  • Example: scene completion (Hays and Efros, CMU, "Scene Completion Using Millions of Photographs," SIGGRAPH 2007). Image database grouped by semantic content: 30 different Flickr.com groups, 2.3M images total (396 GB). Select the candidate images most suitable for filling the hole: classify images with the gist scene detector [Torralba], color similarity, local context matching. Computation: index images offline; 50 min. scene matching, 20 min. local matching, 4 min. compositing, reduced to 5 minutes total by using 5 machines. Extension: Flickr.com has over 500 million images.
  • Slide 15
  • More data, more gains? (Statistics from the CNNIC 2010 report.)
  • Slide 16
  • Slide 17
  • Did you know?
  • Slide 18
  • Did you know? We are currently preparing our students for jobs that don't yet exist. It is estimated that a week's worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century. The amount of new technical information is doubling every 2 years. So what does IT ALL MEAN?
  • Slide 19
  • We are living in exponential times.
  • Slide 20
  • Two different views: Gordon Bell (MyLifeBits, record everything) versus Jennifer Widom (a "thrower-awayer", trying to live an efficient life so that one has time to work and be with one's family).
  • Slide 21
  • Information Overload
  • Slide 22
  • What is Cloud Computing?
  • Slide 23
  • The best thing since sliced bread? Before clouds there were grids and vector supercomputers. Cloud computing means many different things: large-data processing; a rebranding of Web 2.0; utility computing; everything as a service.
  • Slide 24
  • Rebranding of Web 2.0: rich, interactive web applications, where "clouds" refer to the servers that run them; AJAX as the de facto standard (for better or worse). Examples: Facebook, YouTube, Gmail, ... "The network is the computer," take two: user data is stored in the clouds; rise of the netbook, smartphones, etc.; the browser is the OS.
  • Slide 25
  • Source: Wikipedia (Electricity meter)
  • Slide 26
  • Utility computing. What? Computing resources as a metered service (pay as you go) and the ability to dynamically provision virtual machines. Why? Cost (capital vs. operating expenses), scalability (infinite capacity), elasticity (scale up or down on demand). Does it make sense? Benefits to cloud users; a business case for cloud providers. "I think there is a world market for about five computers."
  • Slide 27
  • Everything as a service. Utility computing = Infrastructure as a Service (IaaS): why buy machines when you can rent cycles? Examples: Amazon's EC2, Rackspace. Platform as a Service (PaaS): give me a nice API and take care of the maintenance, upgrades, ... Example: Google App Engine. Software as a Service (SaaS): just run it for me! Examples: Gmail, Salesforce.
  • Slide 28
  • Utility computing: pay as you go; use less, pay less (e.g., Microsoft's cloud pricing pitch).
  • Slide 29
  • Platform as a Service (PaaS): web application services delivered over the Internet, built on a multi-tenant architecture platform.
  • Slide 30
  • Software as a Service (SaaS): a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
  • Slide 31
  • Who cares? Ready-made large-data problems: lots of user-generated content and even more user behavior data (examples: Facebook friend suggestions, Google ad placement); business intelligence: gather everything in a data warehouse and run analytics to generate insight. Utility computing: provision Hadoop clusters on demand in the cloud, lowering the barrier to entry for tackling large-data problems. The result is commoditization and democratization of large-data capabilities.
  • Slide 32
  • The Story Behind Hadoop
  • Slide 33
  • Google-IBM Cloud Computing Initiative: October 2007; Google and IBM together with six universities; MapReduce and Hadoop; a cluster on the order of 2,000-2,500 machines.
  • Slide 34
  • Cloud Computing Initiative: Google and IBM team on a cloud computing initiative for universities (2007-2008), providing several hundred computers accessed through the Internet to test parallel programming projects. The idea for the program came from Google senior software engineer Christophe Bisciglia. See also Google Code University.
  • Slide 35
  • The Information Factories: the Googleplex (pre-2008). Servers number 450,000, according to the lowest estimate, with 200 petabytes of hard disk storage and four petabytes of RAM. To handle the current load of 100 million queries a day, input-output bandwidth must be in the neighborhood of 3 petabits per second.
  • Slide 36
  • Google infrastructure: "The Google File System," SOSP 2003 (Bolton Landing, NY, USA: ACM Press); "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004; "Bigtable: A Distributed Storage System for Structured Data" (awarded Best Paper), OSDI 2006.
  • Slide 37
  • Hadoop project: Doug Cutting.
  • Slide 38
  • History of Hadoop:
    2004 - Initial versions of what are now the Hadoop Distributed File System and Map-Reduce, implemented by Doug Cutting & Mike Cafarella.
    December 2005 - Nutch ported to the new framework; Hadoop runs reliably on 20 nodes.
    January 2006 - Doug Cutting joins Yahoo!.
    February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
    March 2006 - Formation of the Yahoo! Hadoop team.
    April 2006 - Sort benchmark run on 188 nodes in 47.9 hours.
    May 2006 - Yahoo! sets up a Hadoop research cluster (300 nodes); sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
    October 2006 - Research cluster reaches 600 nodes.
    December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs.
    January 2007 - Research cluster reaches 900 nodes.
    April 2007 - Research clusters: two clusters of 1,000 nodes.
    September 2008 - Scaling Hadoop to 4,000 nodes at Yahoo!.
  • Slide 39
  • Google Code University: 2008 seminar "Mass Data Processing Technology on Large Scale Clusters," Tsinghua University, with Aaron Kimball.
  • Slide 40
  • Startup: Cloudera. Cloudera is pushing a commercial distribution for Hadoop. People: Mike Olson, Aaron Kimball, Doug Cutting, Christophe Bisciglia, Tom White.
  • Slide 41
  • Course Administrivia
  • Slide 42
  • Textbooks: [Tom] Tom White, Hadoop: The Definitive Guide, O'Reilly, 2009 (Chinese edition available). [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2010 (ebook available).
  • Slide 43
  • Course schedule (ID, topics, contents, reading):
    1. Background: cloud computing. Reading: [Lin] Ch1: Introduction; [Tom] Ch1: Meet Hadoop.
    2. System: MapReduce & runtime system; HDFS. Reading: [Lin] Ch2: MapReduce Basics; [Tom] Ch6: How MapReduce Works; [paper].
    3. Environment: Hadoop; (*) Hadoop environment setup; (*) WordCount. Reading: [Tom] Ch9: Setting Up a Hadoop Cluster; [Tom] Appendix A: Installing Apache Hadoop; [Tom] Ch2: MapReduce.
    4. Algorithm design: MapReduce program development; basic MapReduce algorithm design and design patterns; (*) WordCount (MapReduce library classes). Reading: [Tom] Ch5: Developing a MapReduce Application; [Lin] Ch3: MapReduce Algorithm Design.
    5. Text retrieval: (*) Inverted index. Reading: [Lin] Ch4: Inverted Indexing for Text Retrieval.
    6. Graph algorithms: (*) PageRank (MapReduce features: side data distribution). Reading: [Lin] Ch5: Graph Algorithms.
  • Slide 44
  • Recap: Why large data? Cloud computing and Hadoop. What is MapReduce?
  • Slide 45
  • Q&A? Thank you!