hadoop overview - tistorycfs9.tistory.com/upload_control/download.blog?fhandle=... ·...
TRANSCRIPT
![Page 2: Hadoop Overview - Tistorycfs9.tistory.com/upload_control/download.blog?fhandle=... · 2015-01-22 · Hadoop • History • 2005년 Nutch 오픈소스 검색 엔진의 분산확장](https://reader030.vdocuments.mx/reader030/viewer/2022040515/5e70c64a7c88db6ae913d6d3/html5/thumbnails/2.jpg)
Agenda
• What is Hadoop?
• Hadoop Architecture
• HDFS
• MapReduce
• HBase
• MapReduce Programming
Hadoop
• History
  • 2005: Started from the problem of scaling out the open-source Nutch search engine across machines
    (inspired by Google's GFS, BigTable, and MapReduce)
  • 2006: Received full backing from Yahoo
  • 2008: Promoted to an Apache top-level project
  • Current release: 0.17.0
• Characteristics
  • Java-based
  • Apache License
  • Many components (HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, Mahout, etc.)
Hadoop Usage
• Nutch: Open Source Web Search Software
• Yahoo!
• ~10000 machines running Hadoop
• Porting ~100 webmap applications to MapReduce
• The New York Times: Times Machine
• EC2/S3/Hadoop
• Large TIFF images (405,000), articles (3,300,000), metadata (405,000) -> 810,000 PNG data
Hadoop Architecture
• MapReduce: Distributed programming model
• HBase: Distributed database (BigTable in Google)
• HDFS: Hadoop Distributed FileSystem (GFS in Google)
• All running on a commodity PC cluster
HDFS
• User-level distributed file system
• Non-standard file system interface
• Master/Slave (namenode and datanodes)
• Replication (3 copies)
• Large chunk size (64MB)
• No chunk caching (metadata is cached, however)
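The chunk size and replication figures above translate into simple arithmetic. The following is a minimal sketch, assuming the 64MB chunk size and 3 copies from the slide; the function name `block_layout` is hypothetical, and the sketch assumes the last chunk only occupies the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 64   # chunk size from the slide
REPLICATION = 3      # copies per chunk from the slide


def block_layout(file_size_mb):
    """Return how many chunks a file occupies and the raw storage it uses."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,                      # chunks tracked by the namenode
        "replicas": blocks * REPLICATION,      # chunk copies spread over datanodes
        "raw_storage_mb": file_size_mb * REPLICATION,  # assumes no padding of the last chunk
    }


print(block_layout(1000))
# {'blocks': 16, 'replicas': 48, 'raw_storage_mb': 3000}
```

A large chunk size keeps the number of chunks per file small, which is what lets the namenode hold all metadata in memory.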
Architecture
Write Operation
[Figure: the Client exchanges metadata with the Namenode; the data itself goes through a pipelined transfer across three DataNodes]
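The write path in the figure can be modeled as a toy simulation. This is only a sketch under stated assumptions: the classes `DataNode` and `NameNode` and the function `write_block` are hypothetical names, not Hadoop's API, and target selection simply takes the first three nodes, whereas a real cluster applies its own placement policy.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)

    def receive(self, block_id, data, downstream):
        """Store the block locally, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])


@dataclass
class NameNode:
    datanodes: list
    block_map: dict = field(default_factory=dict)

    def allocate(self, block_id, replication=3):
        """Metadata step: choose target DataNodes and record where the block lives."""
        targets = self.datanodes[:replication]  # simplification: no placement policy
        self.block_map[block_id] = [dn.name for dn in targets]
        return targets


def write_block(namenode, block_id, data):
    pipeline = namenode.allocate(block_id)              # client <-> Namenode: metadata only
    pipeline[0].receive(block_id, data, pipeline[1:])   # client sends the data once
    return namenode.block_map[block_id]


nodes = [DataNode(f"dn{i}") for i in range(1, 5)]
namenode = NameNode(nodes)
print(write_block(namenode, "blk_1", b"hello"))  # ['dn1', 'dn2', 'dn3']
```

The point of the pipeline is that the client uploads each block once; the DataNodes pass it along among themselves, and the Namenode never touches the data.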
HBase
• Distributed database modeled on Bigtable
• Column-oriented store
• Goal of billions of rows x millions of columns
• Petabytes of data across thousands of servers
• Not a SQL database
DataModel
• Table of rows x columns, with a timestamp dimension
• Sparse table
• Column-based DB
DataModel
• Physical storage view: data is stored per column family
• Column name: <Family>:<Label>
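The data model described above (sparse rows, `<Family>:<Label>` column names, timestamped values) can be sketched with nested dictionaries. This is an illustrative toy, not the HBase client API; the class `SparseTable` and its methods are hypothetical names, and the row and column values in the usage lines are made-up sample data.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row -> "family:label" -> timestamp -> value."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        if ":" not in column:
            raise ValueError("column must look like <Family>:<Label>")
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def family(self, row, family):
        """All cells of one column family in a row, mirroring per-family storage."""
        prefix = family + ":"
        return {c: v for c, v in self.rows.get(row, {}).items() if c.startswith(prefix)}


table = SparseTable()
table.put("row1", "anchor:home", "Home page", timestamp=9)
table.put("row1", "contents:html", "<html>v1", timestamp=5)
table.put("row1", "contents:html", "<html>v2", timestamp=6)
print(table.get("row1", "contents:html"))               # <html>v2
print(table.get("row1", "contents:html", timestamp=5))  # <html>v1
```

Cells that were never written take no space at all, which is what "sparse" means here.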
MapReduce
• Programming model and implementation for parallel processing of large data sets
• Automatic parallelism & distribution
• Fault-tolerant
• Clean abstraction for programmers
• map & reduce functions
  map (k1, v1) -> list (k2, v2)
  reduce (k2, list (v2)) -> list (v2)
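The two signatures above are enough to build a single-process driver. The following is a minimal sketch, assuming everything fits in memory; `run_mapreduce` and the parity example are hypothetical illustrations, and none of the distribution or fault tolerance of a real framework is modeled.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """records: iterable of (k1, v1); mapper: (k1, v1) -> iterable of (k2, v2);
    reducer: (k2, list of v2) -> iterable of v2."""
    groups = defaultdict(list)
    for k1, v1 in records:                 # map phase
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)          # shuffle: group values by intermediate key
    # reduce phase: one call per intermediate key
    return {k2: list(reducer(k2, values)) for k2, values in sorted(groups.items())}


def parity_mapper(index, number):
    yield ("even" if number % 2 == 0 else "odd", number)


def sum_reducer(key, values):
    yield sum(values)


print(run_mapreduce(enumerate([1, 2, 3, 4]), parity_mapper, sum_reducer))
# {'even': [6], 'odd': [4]}
```

The programmer writes only the two small functions; grouping by key is the framework's job.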
Example: Word Count
• map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");
• reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
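For comparison, here is a runnable version of the same word count, written as a local sketch in the line-oriented style where a mapper and reducer exchange tab-separated text. It does not invoke Hadoop; the function names are hypothetical, and `sorted()` stands in for the sort the framework performs between the two phases.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted, so lines for the same word are adjacent."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def word_count(lines):
    # sorted() plays the role of the shuffle/sort step between map and reduce
    return list(reduce_lines(sorted(map_lines(lines))))


print(word_count(["the quick fox", "the lazy dog the end"]))
# ['dog\t1', 'end\t1', 'fox\t1', 'lazy\t1', 'quick\t1', 'the\t3']
```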
Data Processing Flow
J Dean, S Ghemawat “MapReduce: Simplified Data Processing on Large Clusters”
Parallelization
J Dean, S Ghemawat “MapReduce: Simplified Data Processing on Large Clusters”
Hadoop MapReduce Architecture
[Figure: a JobClient writes the input to HDFS and submits a job to the JobTracker, which keeps a job queue (job #1 with its input list). Each node (Node #1, #2, #3, ... #n) runs a TaskTracker on top of HDFS; TaskTrackers send heartbeats to the JobTracker and receive task allocations (t1, t2, t3).]
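The interaction in the figure (job submission, job queue, heartbeat, task allocation) can be sketched as a toy scheduler. This is a simplified model under stated assumptions, not Hadoop's implementation: the class and method names are hypothetical, jobs are served strictly first-in first-out, and failures are ignored.

```python
from collections import deque


class JobTracker:
    def __init__(self):
        self.job_queue = deque()   # FIFO queue of (job_id, pending tasks)
        self.assignments = {}      # (job_id, task) -> tracker name

    def submit(self, job_id, input_list):
        """Called by the job client after the input has been written to HDFS."""
        self.job_queue.append((job_id, deque(input_list)))

    def heartbeat(self, tracker_name, free_slots):
        """Handle a TaskTracker heartbeat; reply with up to `free_slots` tasks."""
        tasks = []
        while free_slots > 0 and self.job_queue:
            job_id, pending = self.job_queue[0]
            if not pending:                 # job fully handed out, move to the next one
                self.job_queue.popleft()
                continue
            task = pending.popleft()
            self.assignments[(job_id, task)] = tracker_name
            tasks.append((job_id, task))
            free_slots -= 1
        return tasks


tracker = JobTracker()
tracker.submit("job #1", ["t1", "t2", "t3"])
print(tracker.heartbeat("node1", free_slots=1))  # [('job #1', 't1')]
print(tracker.heartbeat("node2", free_slots=2))  # [('job #1', 't2'), ('job #1', 't3')]
print(tracker.heartbeat("node3", free_slots=1))  # []
```

Note that the JobTracker never pushes work; tasks are handed out only as replies to heartbeats.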
Features
• Mapper locality
• Overlap of maps, shuffle, sort
• Speculative execution
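Of the features above, mapper locality is the easiest to illustrate: when a node asks for work, prefer a map task whose input chunk already has a replica on that node. The sketch below assumes a hypothetical `pending` mapping from input split to the set of hosts holding it; the function name and host names are made up for illustration.

```python
def pick_task(pending, tracker_host):
    """Return (split, is_local). Prefer a split stored on the requesting host."""
    for split, hosts in pending.items():
        if tracker_host in hosts:
            return split, True      # data-local: the mapper reads from its own disk
    for split in pending:
        return split, False         # no local split left: read over the network
    return None, False              # nothing to run


pending = {
    "split-1": {"n1", "n2", "n3"},
    "split-2": {"n2", "n3", "n4"},
}
print(pick_task(pending, "n4"))  # ('split-2', True)
print(pick_task(pending, "n9"))  # ('split-1', False)
```

With 3 copies of every chunk there are usually several hosts where a given map task can run locally, which is part of why replication helps throughput as well as reliability.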
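Pros and Cons
• Not suitable for every job.
• Best suited to work that divides cleanly into independent pieces; hard to apply when the distributed pieces need to communicate or share data with each other.
• Does not guarantee an optimal solution.
• However, most tasks can be implemented, implementation is easy, and reusability is high.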
MapReduce Implementation Examples
Doug Cutting “MapReduce in Nutch”
MapReduce Implementation Examples
Main Nutch Modules
• Inject: converts additional URLs to crawl (seed URLs) into CrawlDB format
• Generate: selects the URLs to fetch from the CrawlDB
• Fetch: retrieves the content of the selected URLs
• Parse: parses the fetched content
• Invert links: finds the inlinks for every URL
• Index: indexing
Doug Cutting “MapReduce in Nutch”
Fetch
• Input: (url, CrawlDatum)
• Map: (url, CrawlDatum) -> (url, (CrawlDatum, Content))
  Fetches the URL using the protocol module.
• Reduce: identity
• Output: (url, CrawlDatum), (url, Content)
Doug Cutting “MapReduce in Nutch”
Invert Links
• For every URL, computes the URLs that point to it (inlinks)
• Input: <srcUrl, ParseData> (ParseData holds the page's outlinks)
• Map: (srcUrl, ParseData) -> (destUrl, inlink)*
  Collects a pair for every destUrl in ParseData; the inlink is srcUrl
• Reduce: (destUrl, inlink*) -> (destUrl, inlinks)
  Merges the inlinks that share the same destUrl
• Output: (destUrl, inlinks)*
Doug Cutting “MapReduce in Nutch”
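The link inversion step can be sketched in a few lines. This is an illustration only, not Nutch code: ParseData is reduced to a plain list of outlinks, the function names are hypothetical, and removing duplicate inlinks in the reducer is a choice made here rather than something stated on the slide.

```python
from collections import defaultdict


def invert_map(src_url, outlinks):
    """(srcUrl, outlinks) -> (destUrl, inlink)*, where the inlink is srcUrl."""
    for dest_url in outlinks:
        yield dest_url, src_url


def invert_reduce(dest_url, inlinks):
    """(destUrl, inlink*) -> (destUrl, inlinks): merge inlinks of one destination."""
    return dest_url, sorted(set(inlinks))


def invert_links(pages):
    grouped = defaultdict(list)
    for src_url, outlinks in pages.items():
        for dest_url, inlink in invert_map(src_url, outlinks):
            grouped[dest_url].append(inlink)       # shuffle: group by destUrl
    return dict(invert_reduce(dest, inlinks) for dest, inlinks in grouped.items())


pages = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(invert_links(pages))
# {'b': ['a'], 'c': ['a', 'b'], 'a': ['c']}
```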
PageRank
• PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...
  PR(B) = PR(C)/L(C) + PR(F)/L(F) + ...
• Map: (url, (PR, outlinks)) -> (outlink, PR/N)
  Divides the page's own PageRank score by its number of outlinks (N) and hands one share to each outlink.
• Reduce: (url, PRs*) -> (url, PR)
  Sums the PageRank shares the page received to obtain its new PR.
• Iterate until convergence (initial state and damping factor).
Michael Kleber, “What is MapReduce?”
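The map and reduce steps above can be iterated in a single process to see the scores settle. The sketch below makes several assumptions that are not on the slide: a damping factor of 0.85 applied as (1 - d)/n + d * sum, a uniform initial state, a link graph held in memory between iterations, every page having at least one outlink, and every outlink pointing to a page in the graph. All function names are hypothetical.

```python
from collections import defaultdict


def pagerank_map(url, pr, outlinks):
    """(url, (PR, outlinks)) -> (outlink, PR/N) for each of the N outlinks."""
    share = pr / len(outlinks)
    for outlink in outlinks:
        yield outlink, share


def pagerank_reduce(url, shares, damping, n_pages):
    """(url, PRs*) -> new PR; the damping term is an assumption of this sketch."""
    return (1 - damping) / n_pages + damping * sum(shares)


def pagerank(graph, damping=0.85, tol=1e-6, max_iter=100):
    n = len(graph)
    ranks = {url: 1.0 / n for url in graph}          # assumed uniform initial state
    for _ in range(max_iter):
        received = defaultdict(list)
        for url, outlinks in graph.items():          # map phase
            for dest, share in pagerank_map(url, ranks[url], outlinks):
                received[dest].append(share)
        new_ranks = {url: pagerank_reduce(url, received[url], damping, n)
                     for url in graph}               # reduce phase
        delta = sum(abs(new_ranks[u] - ranks[u]) for u in graph)
        ranks = new_ranks
        if delta < tol:                              # converged
            break
    return ranks


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for url, score in sorted(pagerank(graph).items()):
    print(url, round(score, 3))
```

Each pass of the loop corresponds to one MapReduce job, so a real run chains many jobs, feeding the output of one as the input of the next.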
Q&A