hadoop overview - tistorycfs9.tistory.com/upload_control/download.blog?fhandle=... ·...

Hadoop Overview

May 30, 2008

이준복 [email protected]

Agenda

• What is Hadoop?

• Hadoop Architecture

• HDFS

• MapReduce

• HBase

• MapReduce Programming

Hadoop

• History

  • 2005: Grew out of the problem of scaling the Nutch open-source search engine across many machines (inspired by Google’s GFS, BigTable, and MapReduce)

  • 2006: Strong backing from Yahoo

  • 2008: Promoted to an Apache top-level project

  • Current release: 0.17.0

• Characteristics

  • Java-based

  • Apache License

  • Many components (HDFS, HBase, MapReduce, Hadoop On Demand (HOD), Streaming, HQL, Hama, Mahout, etc.)

Hadoop Usage

• Nutch: Open Source Web Search Software

• Yahoo!

• ~10000 machines running Hadoop

• Porting ~100 webmap applications to MapReduce

• The New York Times: Times Machine

• EC2/S3/Hadoop

• Large TIFF images (405,000), articles (3,300,000), metadata (405,000) -> 810,000 PNG data

Hadoop Architecture

• MapReduce: Distributed programming model

• HBase: Distributed database (BigTable at Google)

• HDFS: Hadoop Distributed FileSystem (GFS at Google)

• Commodity PC cluster

HDFS

• User-level distributed file system

• Non-standard file system interface

• Master/Slave (namenode and datanodes)

• Replication (3 copies)

• Large chunk size (64MB)

• No chunk caching (metadata is cached, however)
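To make the chunk size and replication figures above concrete, here is a minimal sketch of how a file might be split into 64 MB chunks with 3 replicas each. The function name `plan_chunks`, the datanode names, and the round-robin placement are assumptions made for illustration; this is not HDFS's actual placement policy.

```python
import math

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated on the slide
REPLICATION = 3                # 3 copies, as stated on the slide


def plan_chunks(file_size, datanodes, chunk_size=CHUNK_SIZE, replication=REPLICATION):
    """Return a list of (chunk_index, chunk_length, [datanode, ...]) tuples.

    Placement is plain round-robin, an assumption for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    n_chunks = math.ceil(file_size / chunk_size)
    plan = []
    for i in range(n_chunks):
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - i * chunk_size)
        # Pick `replication` distinct datanodes, starting at a rotating offset.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


# Hypothetical 200 MB file on a hypothetical 4-node cluster.
plan = plan_chunks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for index, length, replicas in plan:
    print(index, length, replicas)
```

A 200 MB file yields three full 64 MB chunks plus one 8 MB chunk, and each chunk is stored on three different datanodes.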

Architecture

Write Operation

[Figure: Client, Namenode (metadata), DataNode x 3; pipelined data transfer]
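The write path in the figure can be sketched roughly as follows: the client asks the namenode for metadata only, sends the data to the first datanode, and each datanode forwards it to the next one. The class and method names (`NameNode.allocate`, `DataNode.receive`, `write_chunk`) are hypothetical, and the in-memory dictionaries stand in for real disks and network transfers.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk_id -> bytes

    def receive(self, chunk_id, data, downstream):
        """Store the chunk, then forward it to the next node in the pipeline."""
        self.chunks[chunk_id] = data
        if downstream:
            downstream[0].receive(chunk_id, data, downstream[1:])


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.locations = {}  # chunk_id -> [datanode name, ...]; metadata only

    def allocate(self, chunk_id):
        # Assumption: simply pick the first `replication` datanodes.
        pipeline = self.datanodes[: self.replication]
        self.locations[chunk_id] = [dn.name for dn in pipeline]
        return pipeline


def write_chunk(namenode, chunk_id, data):
    # Step 1: metadata exchange with the namenode (no file data involved).
    pipeline = namenode.allocate(chunk_id)
    # Step 2: the client sends data to the first datanode only;
    # the datanodes pass it along the pipeline themselves.
    pipeline[0].receive(chunk_id, data, pipeline[1:])


dns = [DataNode(f"dn{i}") for i in range(1, 5)]
nn = NameNode(dns)
write_chunk(nn, "chunk-0", b"hello")
print(nn.locations)
```

The point of the pipeline is that the client uploads each chunk once, while the replicas are produced by datanode-to-datanode forwarding.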

HBase

• Distributed database modeled on Bigtable

• Column-oriented store

• Goal of billions of rows x millions of cells

• Petabytes of data across thousands of servers

• Not a SQL database

Data Model

• Table of rows x columns, versioned by timestamp

• Sparse table

• Column-based DB

Data Model

• Physical storage view: stored by column family

• Column name: <Family>:<Label>
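The data model above can be pictured as a nested map from row key to column name to timestamped values. The sketch below is an in-memory toy, not the HBase client API; the class `SparseTable`, its methods, and the sample row and column names are all hypothetical.

```python
import time
from collections import defaultdict


class SparseTable:
    def __init__(self):
        # row key -> column name "<family>:<label>" -> {timestamp: value}
        # Only cells that were written take up space, so the table is sparse.
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp=None):
        family, sep, _label = column.partition(":")
        if not sep or not family:
            raise ValueError("column must look like '<family>:<label>'")
        ts = time.time() if timestamp is None else timestamp
        self.rows[row][column][ts] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def family(self, row, family):
        """Return the latest value of every column of one family in a row."""
        prefix = family + ":"
        return {col: self.get(row, col)
                for col in self.rows.get(row, {}) if col.startswith(prefix)}


table = SparseTable()
table.put("com.cnn.www", "contents:", "<html>v1", timestamp=5)
table.put("com.cnn.www", "contents:", "<html>v2", timestamp=6)
table.put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=9)
print(table.get("com.cnn.www", "contents:"))
print(table.family("com.cnn.www", "anchor"))
```

Grouping columns by the `<Family>:` prefix mirrors the physical storage view, where the columns of one family are stored together.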

MapReduce

MapReduce

• Programming model and implementation for parallel processing of large data sets

• Automatic parallelism & distribution

• Fault-tolerant

• Clean abstraction for programmers

• map & reduce functions

map (k1, v1) -> list (k2, v2)

reduce (k2, list (v2)) -> list (v2)

Example: Word Count

• map(String input_key, String input_value):

    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

• reduce(String output_key, Iterator intermediate_values):

    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
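The pseudocode above can be run on a single machine with a small driver that plays the role of the framework. This is a minimal sketch: `run_mapreduce` is a hypothetical helper that groups intermediate pairs by key in memory, whereas a real framework would distribute the map, shuffle, and reduce phases across nodes.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    # Sum all the counts emitted for one word.
    yield word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # the "shuffle": collect values per key
    output = []
    for k2 in sorted(groups):      # keys are handed to reduce in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output


docs = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(counts)
```

Only `map_fn` and `reduce_fn` are specific to word counting; the driver is generic, which is the "clean abstraction for programmers" the slide refers to.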

Data Processing Flow

J Dean, S Ghemawat “MapReduce: Simplified Data Processing on Large Clusters”

Parallelization

J Dean, S Ghemawat “MapReduce: Simplified Data Processing on Large Clusters”

Hadoop MapReduce Architecture

[Figure: A JobClient writes its input to HDFS and submits a job to the JobTracker, which keeps a job queue (job #1, input list). TaskTrackers on Node #1 through Node #n, each running alongside HDFS, send heartbeats to the JobTracker and receive task allocations (t1, t2, t3).]

Features

• Mapper locality

• Overlap of maps, shuffle, sort

• Speculative execution
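Mapper locality can be illustrated with a toy scheduler: when a node reports a free slot, the tracker prefers a pending task whose input block is stored on that node and falls back to any pending task otherwise. The class below is a hypothetical sketch, not Hadoop's JobTracker code; the task and node names are made up, and speculative execution is not modeled.

```python
class JobTracker:
    def __init__(self, tasks):
        # tasks: {task_id: set of node names that hold the task's input block}
        self.pending = dict(tasks)
        self.assigned = {}  # task_id -> node name

    def heartbeat(self, node):
        """Called when `node` has a free slot; returns a task id or None."""
        if not self.pending:
            return None
        # Prefer a task whose input is local to the reporting node.
        local = [t for t, hosts in self.pending.items() if node in hosts]
        # Otherwise fall back to the oldest pending task (remote read).
        task = local[0] if local else next(iter(self.pending))
        del self.pending[task]
        self.assigned[task] = node
        return task


tracker = JobTracker({"t1": {"node1"}, "t2": {"node2"}, "t3": {"node3"}})
first = tracker.heartbeat("node2")   # local task available
second = tracker.heartbeat("node4")  # no local data: gets the oldest pending task
third = tracker.heartbeat("node3")   # local task available
fourth = tracker.heartbeat("node1")  # nothing left to run
print(first, second, third, fourth)
```

Running the map task where its input already lives avoids moving the input block over the network.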

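Pros and Cons

• Not suitable for every job.

• Best suited to work that divides cleanly into independent pieces; hard to apply when the distributed pieces need to communicate or share data.

• Does not guarantee an optimal solution.

• Even so, most problems can be expressed in it, it is easy to implement, and the code is highly reusable.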

MapReduce Implementation Examples

Doug Cutting “MapReduce in Nutch”

MapReduce Implementation Examples

Main Modules of Nutch

• Inject: Converts additional URLs to crawl (seed URLs) into CrawlDB format

• Generate: Selects the URLs to fetch from the CrawlDB

• Fetch: Retrieves the content of the selected URLs

• Parse: Parses the fetched content

• Invert links: Finds the inlinks of every URL

• Index: Indexing


Fetch

• Input: (url, CrawlDatum)

• Map: (url, CrawlDatum) -> (url, (CrawlDatum, Content))
  Retrieves the URL using the protocol module.

• Reduce: identity

• Output: (url, CrawlDatum), (url, Content)


Invert Links

• For every URL, computes the URLs that point to it (inlinks)

• Input: <srcUrl, ParseData> (ParseData holds the page's outlinks)

• Map: (srcUrl, ParseData) -> (destUrl, inlink)*
  Collects one pair for every destUrl in ParseData; the inlink is srcUrl.

• Reduce: (destUrl, inlink*) -> (destUrl, inlinks)
  Merges the inlinks that share the same destUrl.

• Output: (destUrl, inlinks)

* Doug Cutting “MapReduce in Nutch”
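The Invert Links job can be sketched in a few lines. This is a simplified single-process illustration, not Nutch's code: ParseData is reduced to a plain list of outlinks, the function names are hypothetical, and removing duplicate inlinks in the reducer is an added assumption.

```python
from collections import defaultdict


def map_fn(src_url, outlinks):
    # For every destination the page links to, emit (destUrl, srcUrl).
    for dest_url in outlinks:
        yield dest_url, src_url


def reduce_fn(dest_url, inlinks):
    # Merge all inlinks collected for one destination URL.
    # Deduplicating and sorting is an assumption, for stable output.
    return dest_url, sorted(set(inlinks))


def invert_links(pages):
    """pages: {srcUrl: [destUrl, ...]}; returns {destUrl: [srcUrl, ...]}."""
    groups = defaultdict(list)
    for src_url, outlinks in pages.items():
        for dest_url, inlink in map_fn(src_url, outlinks):
            groups[dest_url].append(inlink)
    return dict(reduce_fn(dest, ins) for dest, ins in groups.items())


pages = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
inlinks = invert_links(pages)
print(inlinks)
```

The map step turns each outlink around, and grouping by key does the rest, so the reducer only has to merge lists.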

PageRank

• PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ....

  PR(B) = PR(C)/L(C) + PR(F)/L(F) + ..........

• Map: (url, (PR, outlinks)) -> (outlink, PR/N)
  Divides the page's own PageRank score by its number of outlinks (N) and gives one share to each outlink.

• Reduce: (url, PRs*) -> (url, PR)
  Sums the PageRank shares the page received to obtain its new PR.

• Iterate until convergence (initial state and damping factor)

Michael Kleber, “What is MapReduce?”
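The iteration can be sketched as repeated map and reduce passes over a small graph. The slide's formula omits the damping factor it mentions in the last bullet, so this sketch assumes the common damped form PR = (1 - d)/n + d * (sum of received shares) with d = 0.85. It also assumes a uniform initial state, that every page has at least one outlink, and that every outlink is itself a page in the graph. All function names are hypothetical.

```python
from collections import defaultdict


def pagerank_map(url, pr, outlinks):
    # Split this page's score evenly among its outlinks.
    for dest in outlinks:
        yield dest, pr / len(outlinks)


def pagerank_reduce(url, shares, n_pages, damping):
    # Sum the received shares; the damped form is an assumption (see lead-in).
    return (1.0 - damping) / n_pages + damping * sum(shares)


def pagerank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """graph: {url: [outlink, ...]}, every page having at least one outlink."""
    n = len(graph)
    ranks = {url: 1.0 / n for url in graph}  # uniform initial state
    for _ in range(max_iter):
        shares = defaultdict(list)
        for url, outlinks in graph.items():
            for dest, share in pagerank_map(url, ranks[url], outlinks):
                shares[dest].append(share)
        new_ranks = {url: pagerank_reduce(url, shares[url], n, damping)
                     for url in graph}
        delta = sum(abs(new_ranks[u] - ranks[u]) for u in graph)
        ranks = new_ranks
        if delta < tol:  # stop once the scores no longer change noticeably
            break
    return ranks


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)
```

Each pass corresponds to one MapReduce job, so a distributed run would chain jobs until the change between passes falls below the tolerance.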
