Massive Data Processing 01: Introduction

http://net.pku.edu.cn/~course/cs402/2014

Hongfei Yan, School of Electronics Engineering and Computer Science, Peking University

7/1/2014


Contents

• 01 Introduction (1~18)
• 02 MapReduce Basics (19~38)
• 03 Basic MapReduce Algorithm Design (39~64)
• 04 Inverted Indexing for Text Retrieval (65~86)
• 05 Graph Algorithms (87~105)


Buzzwords

• Big data
• Cloud computing
• Utility computing
• IaaS (Infrastructure as a Service)
• PaaS (Platform as a Service)
• SaaS (Software as a Service)
• Hadoop
• Google App Engine
• Amazon EC2
• Urban computing
• Smart city
• Internet of Things
• Social media
• Mobile
• Memex

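Memex

• A personal Memex would record everything a person sees and hears and retrieve it quickly when needed.
• A world Memex would build a "complete collection" of text, music, images, art, and film, answer any question about it, and index and abstract it as quickly and as well as a human expert.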

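Humanity's Public Resources (Kelly, Kevin (May 14, 2006). "Scan This Book!".)

• Estimated to be at least:
– 32 million books, 750 million articles
– 25 million songs, 500,000 movies
– 500 million images, 3 million videos, TV shows, and short films
– 100 billion web pages
• Chaoxing (Superstar) has scanned 1.3 million Chinese books from more than 200 libraries, roughly half of all Chinese-language books published since 1949.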

Chaoxing Resources in Service as of January 2014

• Chaoxing Reading (http://book.chaoxing.com/): nearly 400,000 e-books
• Chaoxing Video (http://video.chaoxing.com/)
– 98,593 episodes; 7,839 topics and courses
– 5,850 distinguished lecturers, 4,320 doctoral supervisors, 285 academicians

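The Google Books Project

• Began in December 2004. The aim is to scan books and magazines and use character recognition software to identify the characters, words, sentences, and paragraphs of the text, turning digitized images into datafied text.
• April 2013: 30 million books scanned
– 2010: an estimated 130 million books exist worldwide
– November 2008: 7 million books digitized
– 2007: 1 million books digitized
• September 2007: "My Library" released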

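Google Books: Fully Automatic Page-Turning High-Speed Book Scanner

• It contains a book cradle that holds books of various sizes and thicknesses. Place a book in the cradle, select the parameters, and it turns the pages and scans automatically, at up to 1,200 pages per hour.
• One operator can easily tend five scanners, for a combined output of up to 5,000 pages per hour.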

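Google Books Engineer Open-Sources the "Linear Book Scanner"

• Google's book-scanning project sets out to digitize nearly 130 million books worldwide. Dany Qumsiyeh and his teammates recently presented and open-sourced a "Linear Book Scanner".
• This automatic page-turning scanner can convert a 1,000-page book to electronic form in 90 minutes, with no human intervention during scanning. The material cost of the device is $1,500.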

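ReCaptcha and Data Reuse (Luis von Ahn)

• Users are asked to read and type two words taken from text-scanning projects that optical character recognition programs could not recognize.
– One of the words has already been identified by other users, so the user's input for it shows whether the registrant is human;
– The other is a new word still waiting to be deciphered. To ensure accuracy, the system sends the same unclear word to five different people and confirms the word only once all of them have entered it correctly.
• Here the primary use of the data is to prove that the user is human, but it also serves a second purpose: deciphering unclear words in digitized text.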
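The two-word check described on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the case-insensitive comparison, and the rule that the five answers must all agree with each other are assumptions made for the example, not the actual ReCaptcha implementation.

```python
from collections import Counter
from typing import List, Optional


def normalize(text: str) -> str:
    """Compare answers case-insensitively and ignore surrounding spaces (assumption)."""
    return text.strip().lower()


def looks_human(control_answer: str, known_control: str) -> bool:
    """The control word is already known, so a match suggests a human typist."""
    return normalize(control_answer) == normalize(known_control)


def resolve_unknown(answers: List[str], required: int = 5) -> Optional[str]:
    """Accept a transcription of the unknown word only when at least
    `required` answers have been collected and they all agree."""
    cleaned = [normalize(a) for a in answers]
    if len(cleaned) < required:
        return None  # not enough independent answers yet
    word, count = Counter(cleaned).most_common(1)[0]
    return word if count == len(cleaned) else None
```

In this sketch the same user input serves both purposes named on the slide: `looks_human` gates the registration, while `resolve_unknown` accumulates answers that eventually transcribe a word the OCR program could not read.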

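Similar Projects

• The Internet Archive digitizes 1,000 books a day and also mirrors digitized books from Google Books and other sources.
– May 2011: 2.8 million books
• Microsoft launched Live Search Books, a project similar to Google Books, at the end of 2006
– Cancelled in May 2008. All of its 300,000 items are stored at the Internet Archive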

Minority Report (film)

• Prediction and punishment: not for what one "did", but for what one "will do".

Tom Cruise

The best thing since sliced bread?

• Before clouds…
– Grids
– Vector supercomputers
– …
• Cloud computing means many different things:
– Large-data processing
– Rebranding of web 2.0
– Utility computing
– Everything as a service


Rebranding of web 2.0

• Rich, interactive web applications
– Clouds refer to the servers that run them
– AJAX as the de facto standard (for better or worse)
– Examples: Facebook, YouTube, Gmail, …
• “The network is the computer”: take two
– User data is stored “in the clouds”
– Rise of the netbook, smartphones, etc.
– Browser is the OS

Source: Wikipedia (Electricity meter)

Utility Computing

• What?
– Computing resources as a metered service (“pay as you go”)
– Ability to dynamically provision virtual machines
• Why?
– Cost: capital vs. operating expenses
– Scalability: “infinite” capacity
– Elasticity: scale up or down on demand
• Does it make sense?
– Benefits to cloud users
– Business case for cloud providers
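The capital versus operating expense question can be made concrete with a break-even calculation. The sketch below is illustrative only: the purchase price and hourly rate are hypothetical figures, not taken from the slides, and the model deliberately ignores power, maintenance, and depreciation.

```python
def breakeven_hours(purchase_cost: float, hourly_rent: float) -> float:
    """Hours of use at which metered renting has cost as much as buying."""
    return purchase_cost / hourly_rent


def cheaper_option(purchase_cost: float, hourly_rent: float, hours_needed: float) -> str:
    """Return "rent" if pay-as-you-go is cheaper for the given usage, else "buy"."""
    rent_total = hourly_rent * hours_needed
    return "rent" if rent_total < purchase_cost else "buy"


# Hypothetical figures: a machine costing 2000 versus renting at 0.25 per hour.
print(breakeven_hours(2000, 0.25))        # 8000.0
print(cheaper_option(2000, 0.25, 1000))   # rent
print(cheaper_option(2000, 0.25, 10000))  # buy
```

Under these assumptions, a workload that needs the machine only occasionally favors renting, while sustained use past the break-even point favors buying. That difference is the user-side half of the "does it make sense?" question.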

I think there is a world market for about five computers.

Everything as a Service

• Utility computing = Infrastructure as a Service (IaaS)
– Why buy machines when you can rent cycles?
– Examples: Amazon’s EC2, Rackspace
• Platform as a Service (PaaS)
– Give me a nice API and take care of the maintenance, upgrades, …
– Example: Google App Engine
• Software as a Service (SaaS)
– Just run it for me!
– Examples: Gmail, Salesforce


von Neumann Model vs. MapReduce Model
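The slide only names the two models. As a rough illustration of the MapReduce side, the sketch below simulates the map, shuffle, and reduce phases in a single Python process, using word counting as the example. The function names are hypothetical and this is not the Hadoop API. In a real system the map and reduce calls would run in parallel across many machines, with the framework performing the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all intermediate values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(docs: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over a collection of documents."""
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

The programmer writes only `map_phase` and `reduce_phase`. Neither function shares state with other calls, which is what lets a framework distribute them, in contrast to a single sequential program operating on one shared memory.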

Google Cloud Platform

• Provides cloud computing services that allow you to build applications and websites, store data, and analyze data on Google’s infrastructure.
– Over 3 million apps deployed to Google Cloud Platform
– App Engine is a platform as a service that uses familiar technologies to build and host applications on the same infrastructure used at Google.

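Amazon Web Services

• Beginning in 2006, Amazon rolled out a series of AWS services one after another, including EC2, S3 (Simple Storage Service), and CloudWatch (a management service).
• In the years that followed, these services laid a solid foundation for Amazon's capture of the cloud IaaS market and made Amazon a company able to offer a large-scale cloud infrastructure platform.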

Cloud Price Comparison

Cloud drives: Baidu 2 TB, Tencent 10 TB

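Smarter Planet, Internet of Things, and Cloud Computing

Diagram labels: Smarter Planet, Internet, Internet of Things, cloud computing

• Cloud computing is the core of the Internet of Things
• Cloud computing promotes the intelligent convergence of the Internet of Things and the Internet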

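Smart Cities

A smarter planet starts with smart cities

Comprehensive sensing, full integration, encouraging innovation, coordinated operation

• Smart transportation: automatic toll collection, ticketing management, transport information management
• Smart healthcare: electronic medical records, home health services, medical cost management
• Smart public safety: crime information warehouse, emergency response, digital surveillance systems
• Smart public utilities: high-speed broadband networks, smart power, building energy-use assessment and monitoring, water treatment / water resource management
• Smart education and technology: open learning, advanced classrooms, smart science and technology parks
• Smart citizen services: unemployment insurance management, employment services, family services, housing information management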

Big Data

Data from the Physical World

• Common characteristics
• Generated (indirectly) by humans
– Representing physical world activities
– Structured
– Privacy risks
• Human knowledge embedded
– Geo-tagged photos: where do you take photographs?
– Location check-in: where do you stay?
– Taxi trajectories: how do taxi drivers navigate in a city?
– BMAC card records: how do you take public transit?

Big Data is commonly characterized by three vectors: volume, variety, and velocity.

• Volume: it is about complexity
– Many small datasets that are considered big data do not consume much physical space but are particularly complex in nature.
– At the same time, large datasets that require significant physical space may not be complex enough to be considered big data.
• Variety refers to its 'polystructured' nature, i.e. a mixture of structured, semi-structured and unstructured data such as text, audio and video;
• Velocity refers to the rate at which it is generated and analyzed, which in some applications needs to be in real time or near real time.
• Veracity
• Value

A Free Large-Scale GPS Dataset

• 17621 trajectories, 1.2 million kilometers, 48000+ hours

The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East

(IDC & EMC, December 2012)

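CNNIC Releases the 33rd Statistical Report on Internet Development in China

• As of the end of December 2013, China had 618 million Internet users, including 500 million mobile Internet users.
• In 2013, China had 302 million online shoppers.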

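The National Press and Publishing Industry in 2012: Basic Facts

• 2012: 414,005 book titles, 9,867 periodicals, and 1,918 newspapers published
– Total printed sheets across the three categories: 307.401 billion, equivalent to 7.1136 million tons of paper.
• Total list price of books: 118.337 billion yuan.
• Total list price of periodicals: 25.268 billion yuan.
• Total list price of newspapers: 43.439 billion yuan.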

Big Data: A Revolution That Will Transform How We Live, Work and Think (book)

• Since Aristotle, we have fought to understand the causes behind everything.

• But this ideology is fading. In the age of big data, we can crunch an incomprehensible amount of information, providing us with invaluable insights about the what rather than the why.

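White House & CCF Initiatives

• March 2012: the U.S. government announced the "Big Data Research and Development Initiative"
– The goal is to improve people's current ability to extract knowledge from massive and complex data,
– thereby accelerating the pace of discovery in science and engineering,
– strengthening national security, and transforming the way people work, learn, and live.
• October 2012: CCF (China Computer Federation) established its Big Data Expert Committee
– To explore the core scientific and technical problems of big data and promote the building and development of big data as a discipline;
– To build a platform for academic exchange, technical cooperation, and data sharing that serves big data across industry, academia, research, and application;
– To give relevant government departments strategic opinions and advice on big data research and applications.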

U.S. Government Open Data

• www.data.gov
• The site grew from 47 datasets in 2009 to 1.05 million in June 2014, covering 228 agencies.

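Datafication Is Not Digitization

• Scanning every page of a book and saving it as a high-resolution image file is digitization.
– Only human reading can turn it into useful information.
• Once the world is datafied, mathematical analysis tools (statistics and algorithms) and the necessary equipment (processors and storage) make it possible to process data in more fields, faster, and at larger scale.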

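The Value of Big Data

Google Spelling and the Amazon Recommender System

• Google collects data on misspellings and uses it to build its spell checker. It also has the in-house technology to mine the value of the data.
• Amazon: recommender system. Collaborative filtering was introduced in 1997.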
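The slide names collaborative filtering without describing it. One common variant, item-based collaborative filtering with cosine similarity, is sketched below as an assumption for illustration. It is not a description of Amazon's production system, and the function names and sample data are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple

Ratings = Dict[str, Dict[str, float]]  # user -> {item: rating}


def item_vectors(ratings: Ratings) -> Dict[str, Dict[str, float]]:
    """Invert user -> item ratings into item -> {user: rating} vectors."""
    items: Dict[str, Dict[str, float]] = {}
    for user, rated in ratings.items():
        for item, score in rated.items():
            items.setdefault(item, {})[user] = score
    return items


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_items(ratings: Ratings, target: str, top_n: int = 3) -> List[Tuple[str, float]]:
    """Rank the other items by how similarly the same users rated them."""
    items = item_vectors(ratings)
    scores = [(other, cosine(items[target], vec))
              for other, vec in items.items() if other != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical ratings: ann and bob rate the same two books, cat rates a third.
sample = {
    "ann": {"book_a": 5, "book_b": 4},
    "bob": {"book_a": 4, "book_b": 5},
    "cat": {"book_c": 5},
}
print(similar_items(sample, "book_a")[0][0])  # book_b
```

The recommendation comes entirely from past user behavior. No description of the books is needed, which is why the volume of collected data matters so much to this kind of system.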

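Barnes & Noble and NOOK Snapshots

• E-book readers capture a large amount of data about literary preferences and readers
– How long it takes to read a page or a section, whether readers skim or give up altogether, whether they underline passages
• Shown to publishers and readers
– Readers' likes, dislikes, and reading patterns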

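Coursera and Other Online Education

• Track students' Web interactions to find the best teaching methods
– About 2,000 students gave a wrong answer to a homework assignment, yet the wrong answers were all the same
– They had swapped two algebraic equations in an algorithm.
– So now, if other students make the same mistake, the system does not simply tell them they are wrong
• Find the forum posts most worth reading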

Coursera

2PM, June 17, 2013

Infer Fine-Grained Air Quality in a City Using Big Data

Goal: We infer the real-time and fine-grained air quality information throughout a city, based on the (historical and real-time) air quality data reported by existing monitor stations and a variety of data sources we observed in the city, such as meteorology, traffic flow, human mobility, structure of road networks, and points of interest (POIs).

Netflix Prize

• The Netflix Prize sought to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences.

• On September 21, 2009 we awarded the $1M Grand Prize to team “BellKor’s Pragmatic Chaos”.

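Risks of Big Data (Big Data Insights)

• In the big data era, the three main privacy-protection strategies (notice and consent, obfuscation, and anonymization) have all stopped working.
• By comparing Netflix's data against other public data,
– researchers at the University of Texas quickly found that the movie ratings given by anonymous users matched those given by named users on the Internet Movie Database (IMDb).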

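First China Big Data Technology, 2013

• Keyword industry classification (Baidu)
• Telecom network paging black-hole analysis
• Building telecom users' contact circles and identifying specific types of users
• Attribution analysis of user purchasing behavior (Miaozhen)
• Location services based on taxi GPS trajectories (Datatang)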

Summary

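• Cloud computing is the foundation of big data,
• and storing and processing big data is an important application of cloud computing.
• The two complement each other, making it possible to uncover more value in big data.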
