Introduction to Pig & ZooKeeper

Yoyo Cheng, ISCAS

Upload: guangyao-cheng

Post on 27-Jun-2015


TRANSCRIPT

Page 1: Introduction to pig&zookeeper

Introduction to Pig & ZooKeeper

Yoyo Cheng, ISCAS

Page 2: Introduction to pig&zookeeper

ZooKeeper

• What is ZooKeeper
• ZooKeeper's operating modes
• How it works
• API

Page 3: Introduction to pig&zookeeper

What is ZooKeeper

• a high-performance coordination service for distributed applications

• common services
– naming
– configuration management
– synchronization
– group services

• used by HBase and, at Yahoo!, by the Yahoo! Message Broker and the Fetch Service of the Yahoo! crawler (similar in role to Google's Chubby, which is based on Paxos)

Page 4: Introduction to pig&zookeeper

Example

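• Suppose there are 20 search servers, each responsible for searching one part of the overall index (15 of them are currently serving searches and 5 are building their index); a master server that sends search requests to these 20 servers and merges the result sets; a backup master that takes over when the master goes down; and a web CGI that sends search requests to the master.

• The 20 search servers change roles frequently: a server that is serving searches stops serving and starts building its index, or a server that has finished building its index becomes available to serve searches again.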

Page 5: Introduction to pig&zookeeper

What can ZooKeeper do?

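• The master server automatically learns how many search servers are currently available and sends search requests to them.

• When the master server goes down, the backup master is brought into service automatically.

• The web CGI automatically learns when the master server's network address changes.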

Page 6: Introduction to pig&zookeeper

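How does ZooKeeper do it?

• The master server automatically learns how many search servers are available and sends search requests to them
– Step 1: Each search server creates an ephemeral znode in ZooKeeper: zk.create("/search/nodes/node1", "hostname".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateFlags.EPHEMERAL);
– Step 2: The master server fetches the list of children of the parent znode and sets a watch on it: zk.getChildren("/search/nodes", true);
– Step 3: The master server iterates over the children and reads each child's data to build the list of available search servers.
– Step 4: When the master server receives a children-changed event, it goes back to Step 2.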
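The four steps can be sketched with a toy in-memory model. This is not the ZooKeeper client: `ToyZk`, `Master`, the host names, and the session ids are all hypothetical stand-ins, chosen only to show ephemeral nodes disappearing with their session and a one-shot child watch being re-registered on every refresh.

```python
class ToyZk:
    """Toy in-memory stand-in for a ZooKeeper ensemble (not the real API)."""

    def __init__(self):
        self.nodes = {}          # path -> (data, owning session)
        self.child_watches = {}  # parent path -> list of one-shot callbacks

    def create_ephemeral(self, path, data, session):
        self.nodes[path] = (data, session)
        self._fire(path.rsplit("/", 1)[0])

    def get_children(self, parent, watch=None):
        if watch is not None:
            self.child_watches.setdefault(parent, []).append(watch)
        prefix = parent + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

    def get_data(self, path):
        return self.nodes[path][0]

    def close_session(self, session):
        # Ephemeral nodes vanish when the session that created them ends.
        gone = [p for p, (_, owner) in self.nodes.items() if owner == session]
        for path in gone:
            del self.nodes[path]
            self._fire(path.rsplit("/", 1)[0])

    def _fire(self, parent):
        # Watches are one-shot: they are removed before being invoked.
        for callback in self.child_watches.pop(parent, []):
            callback(parent)


class Master:
    def __init__(self, zk):
        self.zk = zk
        self.servers = []
        self.refresh("/search/nodes")

    def refresh(self, parent):
        # Step 2: list the children and register the watch again.
        children = self.zk.get_children(parent, watch=self.refresh)
        # Step 3: read each child's data to build the server list.
        self.servers = [self.zk.get_data(f"{parent}/{c}") for c in children]


zk = ToyZk()
master = Master(zk)
zk.create_ephemeral("/search/nodes/node1", "host-a", session=1)  # Step 1
zk.create_ephemeral("/search/nodes/node2", "host-b", session=2)
zk.close_session(1)    # node1 stops serving; Step 4 triggers a refresh
print(master.servers)
```

Each change under /search/nodes calls `refresh` exactly once, which is why the callback has to register itself again before reading the children.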

Page 7: Introduction to pig&zookeeper

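How does ZooKeeper do it?

• When the master server goes down, the backup master is brought into service automatically
– The master server creates an ephemeral node in ZooKeeper: zk.create("/search/master", "hostname".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateFlags.EPHEMERAL);
– The backup master watches the "/search/master" node. When this znode's data changes, it promotes itself to master and writes its own network address into the node.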
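A rough sketch of this failover, again as a toy model rather than the ZooKeeper client. `ToyMasterNode`, the addresses, and the session names are hypothetical. The sketch assumes the change the backup reacts to is the ephemeral node disappearing when the master's session ends.

```python
class ToyMasterNode:
    """Toy model of the ephemeral /search/master znode (not the real API)."""

    def __init__(self):
        self.address = None  # data stored in the node; None while it is absent
        self.owner = None    # session that created the node
        self.watchers = []   # one-shot callbacks

    def create(self, address, session):
        if self.address is not None:
            return False     # node already exists: someone else is master
        self.address, self.owner = address, session
        self._notify()
        return True

    def watch(self, callback):
        self.watchers.append(callback)

    def session_expired(self, session):
        if self.owner == session:
            # The ephemeral node is removed together with its session.
            self.address = self.owner = None
            self._notify()

    def _notify(self):
        pending, self.watchers = self.watchers, []
        for callback in pending:
            callback()


node = ToyMasterNode()
node.create("master-host:9000", session="primary")


def backup_takes_over():
    # The backup tries to claim the node; if it loses, it keeps watching.
    if not node.create("backup-host:9000", session="backup"):
        node.watch(backup_takes_over)


node.watch(backup_takes_over)
node.session_expired("primary")   # the master goes down
print(node.address)
```

Because creation fails while the node exists, only one server can hold the master role at a time in this model.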

Page 8: Introduction to pig&zookeeper

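How does ZooKeeper do it?

• The web CGI automatically learns when the master server's network address changes
– The web CGI reads the master server's network address from the "/search/master" node in ZooKeeper and sends search requests to that address.
– The web CGI watches the "/search/master" node. When this znode's data changes, it reads the master's network address from the node again and updates the address it uses.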

Page 9: Introduction to pig&zookeeper

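Standalone or quorum

• standalone
– a single ZooKeeper server
– convenient for testing
– but cannot guarantee high performance or high reliability

• quorum:
– as long as a majority of the ensemble is working, the service stays available with stable, high performance
– example: in a five-node ensemble, any two nodes can fail and the service keeps working
– principle: every modification to the znode tree is replicated to a majority of the machines in the ensemble
– ZooKeeper uses the Zab protocol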
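The majority rule behind the five-node example can be written down as a small calculation. The function names are illustrative only; the arithmetic is just "more than half".

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5, 6, 7):
    print(f"ensemble={n} quorum={quorum_size(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is one reason odd sizes are the usual choice.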

Page 10: Introduction to pig&zookeeper

two phase commit

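• Phase 1: leader election
– The ensemble elects a distinguished member (one ZooKeeper server), called the leader; the other machines are called followers.

• Phase 2: atomic broadcast
– All write requests are forwarded to the leader, which broadcasts the update to the followers. Once a majority have applied the change, the leader commits the update and the client receives a response that the update succeeded.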
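The commit rule in Phase 2 can be illustrated with a deliberately simplified function. It ignores ordering, retries, and recovery, which the real protocol handles, and the name `leader_commits` is hypothetical.

```python
def leader_commits(follower_acks):
    """Decide whether a broadcast update can be committed.

    follower_acks holds one boolean per follower: True if that follower
    acknowledged the update. The leader counts as one more acknowledgement.
    """
    ensemble = len(follower_acks) + 1
    acks = 1 + sum(1 for ack in follower_acks if ack)
    return acks > ensemble // 2


# Five-server ensemble: the leader plus four followers.
print(leader_commits([True, True, False, False]))   # leader + 2 followers
print(leader_commits([True, False, False, False]))  # leader + 1 follower
```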

Page 11: Introduction to pig&zookeeper

ZooKeeper Service

Page 12: Introduction to pig&zookeeper

ZooKeeper Components

Page 13: Introduction to pig&zookeeper

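• The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and a write is persisted to disk before it is applied to the in-memory database.

• The messaging layer is responsible for replacing a failed leader and synchronizing the followers.
• When the leader receives a write request, it computes what the state of the system will be when the write is applied and transforms the request into a transaction that captures this new state.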
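The "log to disk first, then update memory" rule and the idea of a transaction that records the resulting state can be sketched as follows. `ToyReplicatedDb`, the JSON log format, and the version counter are assumptions made for this illustration and do not reflect ZooKeeper's actual on-disk format.

```python
import json
import os
import tempfile


class ToyReplicatedDb:
    """Toy write-ahead-logged data tree; a teaching model, not ZooKeeper code."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.tree = {}  # path -> {"data": ..., "version": ...}
        # Recovery: replay every logged transaction into memory.
        if os.path.exists(log_path):
            with open(log_path) as log:
                for line in log:
                    self._apply(json.loads(line))

    def set_data(self, path, data):
        # Turn the request into a transaction that records the resulting
        # state (data plus new version), so replaying it is idempotent.
        old = self.tree.get(path, {"version": -1})
        txn = {"path": path, "data": data, "version": old["version"] + 1}
        # Persist to disk first ...
        with open(self.log_path, "a") as log:
            log.write(json.dumps(txn) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # ... and only then change the in-memory tree.
        self._apply(txn)

    def _apply(self, txn):
        self.tree[txn["path"]] = {"data": txn["data"], "version": txn["version"]}


log_file = os.path.join(tempfile.mkdtemp(), "txn.log")
db = ToyReplicatedDb(log_file)
db.set_data("/search/master", "host-a")
db.set_data("/search/master", "host-b")

# A fresh instance rebuilds the same tree from the log alone.
recovered = ToyReplicatedDb(log_file)
print(recovered.tree)
```

Because each logged transaction states the final data and version instead of "increment the version", applying it twice during recovery leaves the tree unchanged.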

Page 14: Introduction to pig&zookeeper

Query

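• Queries read data from the server and do not modify it.
• Every query is answered immediately by the server the client is connected to, without coordination through the leader.
• Every query can register a watcher to track changes at the given path. When the watched data changes (create, delete, modified, children_changed), the server sends a notification that triggers the registered watcher.

• Query commands:
– 1. exists: checks whether a node exists at the given path; returns true if it does, false otherwise (in the Java API, a Stat object or null).
– 2. getData: reads the data of the node at the given path.
– 3. getACL: reads the ACL of the given path.
– 4. getChildren: lists all children of the node at the given path.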

Page 15: Introduction to pig&zookeeper

Modify

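• Modify commands change a node's data, the tree structure, or permission information. Every modify command must be submitted to the leader for coordination and returns only after coordination completes.

• Coordination requires round trips of requests and responses between the leader and the followers. It also involves writing the transaction log and, in the worse case, taking a snapshot, so the process can be relatively slow.

• The defining feature of ZooKeeper's communication is that it is asynchronous. When requests arrive continuously, ZooKeeper processes them together and sends them in batches, and the batch size is bounded. When the request volume is low, requests are sent immediately. This keeps throughput high under heavy load while still giving a degree of timeliness.

• Modify commands:
– 1. createSession: asks the server to create a session
– 2. create: creates a node
– 3. delete: deletes a node
– 4. setData: changes a node's data
– 5. setACL: changes a node's ACL
– 6. closeSession: asks the server to close the session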

Page 16: Introduction to pig&zookeeper

Pig

• What is Pig
• Why use Pig
• Where Pig is used
• How to use Pig

Page 17: Introduction to pig&zookeeper

What is Pig

• A SQL-like language: a high-level query language built on top of MapReduce

Page 18: Introduction to pig&zookeeper

Motivation

• MapReduce is very powerful, but:
– It requires a Java programmer.
– Users have to reinvent the wheel (join, filter, etc.).

Page 19: Introduction to pig&zookeeper

Pig Latin

• Pig provides a higher-level language, Pig Latin, that:
– Increases productivity. In one test:
  • 10 lines of Pig Latin ≈ 200 lines of Java.
  • What took 4 hours to write in Java took 15 minutes in Pig Latin.
– Opens the system to non-Java programmers.
– Provides common operations like join, group, filter, sort.

Page 20: Introduction to pig&zookeeper

Why a New Language?

• Pig Latin is a data flow language.
• User code and existing binaries can be included almost anywhere.
• Metadata is not required, but is used when available.
• Nested types (map, list, collection...) are supported as first-class types.
• Operates on files in HDFS.

Page 21: Introduction to pig&zookeeper

Background

• Yahoo! was the first big adopter of Hadoop.
• Hadoop gained popularity in the company quickly.
• Yahoo! Research developed Pig to address the need for a higher-level language.
• Roughly 30% of Hadoop jobs run at Yahoo! are Pig jobs.

Page 22: Introduction to pig&zookeeper

How Pig is Being Used

• Web log processing
• Data processing for web search platforms
• Ad hoc queries across large data sets
• Rapid prototyping of algorithms for processing large data sets

Page 23: Introduction to pig&zookeeper

Accessing Pig

• Submit a script directly.
• Grunt, the Pig shell.
• PigServer Java class, a JDBC-like interface.
• PigPen, an Eclipse plugin
– Allows textual and graphical scripting.
– Samples data and shows example data flow.

Page 24: Introduction to pig&zookeeper
Page 25: Introduction to pig&zookeeper

Data Types

• Scalar types: int, long, double, chararray, bytearray.

• Complex types:
– map: associative array.
– tuple: ordered list of data; elements may be of any scalar or complex type.
– bag: unordered collection of tuples.

Page 26: Introduction to pig&zookeeper

How to use

• No need to install anything extra on your Hadoop cluster

• Start a terminal and run
$ cd /usr/share/cloudera/pig/
$ bin/pig -x local
Should see a prompt like:
grunt>

Page 27: Introduction to pig&zookeeper

Load Data

• LOAD … AS …
• PigStorage(',') to specify separator

Users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);

users.txt:
John,18
Mary,20
Bob,30

Users:
name  age
John  18
Mary  20
Bob   30
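To make the effect of the separator concrete, here is a plain-Python sketch of what this LOAD statement produces. It only mirrors the semantics on an in-memory list of lines; it is not how Pig reads files, and the `load` helper is hypothetical.

```python
def load(lines, separator, fields):
    """Split each line on the separator and name the resulting fields."""
    return [dict(zip(fields, line.split(separator)))
            for line in lines if line.strip()]


users_txt = ["John,18", "Mary,20", "Bob,30"]
users = load(users_txt, ",", ("name", "age"))
for row in users:
    print(row)
```

No field types are declared in the AS clause, so the sketch keeps every value as the raw text read from the line.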

Page 28: Introduction to pig&zookeeper

Filter

• FILTER … BY …
• constraints can be composite

Fltrd = FILTER Users BY age >= 18 AND age <= 25;

Users:
name  age
John  18
Mary  20
Bob   30

Fltrd:
name  age
John  18
Mary  20

Page 29: Introduction to pig&zookeeper

Generate / Project

• FOREACH … GENERATE

Names = FOREACH Fltrd GENERATE name;

Fltrd:
name  age
John  18
Mary  20

Names:
name
John
Mary
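For readers more used to general-purpose code, the FILTER and FOREACH … GENERATE steps correspond roughly to a selection followed by a projection. The plain-Python sketch below reproduces the slide data; it illustrates the semantics only and says nothing about how Pig compiles these steps to MapReduce.

```python
users = [
    {"name": "John", "age": 18},
    {"name": "Mary", "age": 20},
    {"name": "Bob", "age": 30},
]

# FILTER Users BY age >= 18 AND age <= 25: keep the rows that satisfy the predicate.
fltrd = [row for row in users if 18 <= row["age"] <= 25]

# FOREACH Fltrd GENERATE name: keep only the chosen field of every row.
names = [row["name"] for row in fltrd]

print(fltrd)
print(names)
```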

Page 30: Introduction to pig&zookeeper

Store Data

STORE Names INTO 'names.out';

• STORE … INTO …
• PigStorage(',') to specify separator if multiple fields

Page 31: Introduction to pig&zookeeper

Command - JOIN

Users = LOAD 'users' AS (name, age);
Pages = LOAD 'pages' AS (user, url);
Jnd = JOIN Users BY name, Pages BY user;

Users:
name  age
John  18
Mary  20
Bob   30

Pages:
user  url
John  yaho
Mary  goog
Bob   bing

Jnd:
name  age  user  url
John  18   John  yaho
Mary  20   Mary  goog
Bob   30   Bob   bing
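The JOIN above behaves like an inner join on Users.name = Pages.user. A minimal hash-join sketch in plain Python, using the slide data, shows the shape of the result. The `inner_join` helper is hypothetical and is not meant to describe Pig's own join strategy.

```python
from collections import defaultdict

users = [("John", 18), ("Mary", 20), ("Bob", 30)]               # (name, age)
pages = [("John", "yaho"), ("Mary", "goog"), ("Bob", "bing")]   # (user, url)


def inner_join(left, left_key, right, right_key):
    """Join two lists of tuples on the given field positions."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Each output row is the left tuple followed by the matching right tuple.
    return [l + r for l in left for r in index.get(l[left_key], [])]


jnd = inner_join(users, 0, pages, 0)
for row in jnd:
    print(row)
```

As in the slide, the joined rows keep the fields of both inputs, so the key appears twice (once as name, once as user).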

Page 32: Introduction to pig&zookeeper

Command - GROUP

Grpd = GROUP Jnd BY url;
DESCRIBE Grpd;

Jnd:
name  age  url
John  18   yhoo
Mary  20   goog
Bob   30   bing
Kim   40   bing
Dee   25   yhoo

Grpd:
yhoo  (John, 18, yhoo), (Dee, 25, yhoo)
goog  (Mary, 20, goog)
bing  (Kim, 40, bing), (Bob, 30, bing)
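GROUP collects all tuples that share a key into one bag per key. The sketch below mimics that in plain Python with the slide data; `group_by` is a hypothetical helper, and the order of tuples inside a bag is not significant in Pig, so the list order here is incidental.

```python
from collections import defaultdict

jnd = [
    ("John", 18, "yhoo"),
    ("Mary", 20, "goog"),
    ("Bob", 30, "bing"),
    ("Kim", 40, "bing"),
    ("Dee", 25, "yhoo"),
]


def group_by(rows, key_index):
    """Map each distinct key to the list (bag) of rows that carry it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


grpd = group_by(jnd, 2)
for url, bag in grpd.items():
    print(url, bag)

# Counting the elements of each bag, in the spirit of Pig's COUNT.
counts = {url: len(bag) for url, bag in grpd.items()}
print(counts)
```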

Page 33: Introduction to pig&zookeeper

Other Commands

• ORDER – sort by a field
• COUNT – eval: count #elements
• COGROUP – structured JOIN
• More at http://hadoop.apache.org/pig/

Page 34: Introduction to pig&zookeeper

Reference

• Getting to know ZooKeeper (初识ZooKeeper), http://bbs.hadoopor.com/thread-533-1-1.html
• ZooKeeper distributed installation guide (Zookeeper分布式安装手册), http://bbs.hadoopor.com/thread-1541-1-1.html
• Installing ZooKeeper (安装zookeeper), http://bbs.hadoopor.com/thread-836-1-1.html
• Common use cases of Paxos in large systems (Paxos在大型系统中常见的应用场景), http://timyang.net/tag/zookeeper/
• Introduction to Pig programming, Yiwei Chen, Yahoo Search Engineering, http://www.docstoc.com/docs/27501834/Introduction-to-Pig-programming
• Introduction to Pig, Allen Gates, Yahoo!, http://www.cloudera.com/videos/introduction_to_pig , http://www.cloudera.com/videos/pig_tutorial
• Pig Latin: Language for Large Data Processing, http://www.hadoop.tw/2010/04/pig.html
• Pig installation and configuration tutorial (Pig安装与配置教程), http://www.hadoopor.com/thread-236-1-1.html
• Hadoop study notes 9: running Pig (Hadoop学习-9 Pig执行), http://sunjun041640.blog.163.com/blog/static/2562683220106240117330/
• http://hadoop.apache.org/pig/
• http://hadoop.apache.org/zookeeper/
• http://wiki.apache.org/hadoop/ZooKeeper

Page 35: Introduction to pig&zookeeper

Thank you for your attention