MapReduce framework shuffling & sorting. MapReduce example - wordcount


Upload: jewel-hodges

Post on 19-Jan-2016


MapReduce framework

shuffling & sorting

MapReduce example - wordcount

chapter 3. Filtering Patterns

filtering / bloom filtering / top 10 / distinct

University of Seoul, Department of Electrical and Computer Engineering, G201449015 이가희

2015 Winter Lab Seminar

0. Filtering Patterns

[Figure: overview of the four filtering patterns. Legend: M = Mapper, B = Bloom Filter, R = Reducer.
• filtering : file → M (Boolean function) → ? records
• bloom filtering : file → M with B (trained on 100 terms) → ? records
• top ten : file → M (10 records each) → R → 10 records
• distinct : file → M → R → ? records]

1. Filtering : Pattern Description

• intent : discard the records that are not of interest at the filtering stage

• motivation : a large data set has been split up and only the interesting part should go on to follow-up analysis

• applicability : the data can be parsed record by record (classification performance ↑)

[Figure: an evaluation function f is applied to each record. true → keep the record; false → discard the record. Only the records of interest are taken.]
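
The evaluation-function idea above can be sketched in a few lines. This is a plain Python illustration of the map-only logic, not Hadoop code; the record layout (`<id>,<value>`) and the names `filter_mapper` and `value_below_three` are assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator


def filter_mapper(records: Iterable[str], keep: Callable[[str], bool]) -> Iterator[str]:
    """Map-only filter: yield a record only when keep(record) is true."""
    for record in records:
        if keep(record):
            yield record  # the record passes through unchanged


def value_below_three(record: str) -> bool:
    # Hypothetical record layout: "<id>,<value>"
    return int(record.split(",")[1]) < 3


records = ["a,1", "b,5", "c,2", "d,3"]
kept = list(filter_mapper(records, value_below_three))
print(kept)  # ['a,1', 'c,2']
```

The predicate plays the same role as `where value < 3` in SQL: swapping it out changes what is kept without touching the mapper.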

1. Filtering : Pattern Description

• no "reducer" (map-only job)

[Figure: each mapper writes the records that pass with the same key and value.]

1. Filtering : Pattern Description

• consequences : the subset of records that passed the selection criteria

• known uses :
  • closer view of data
  • distributed grep
  • data cleansing
  • simple random sampling

• resemblances :
  • SQL : select * from table where value < 3;
  • Pig : b = filter a by value < 3;

• performance analysis :
  • no reducers
  • data never has to be transmitted between the map and reduce phase
  • most of the map tasks pull data off of their locally attached disks and then write back out to that node
  • both the sort phase and the reduce phase are cut out

1. Filtering : Pattern Examples

• Distributed grep

• extract the lines that match a specific string pattern from the multiple files given as input

1. Filtering : Pattern Examples

• Distributed grep : main code
  • create an instance
  • store the regular expression given as input under "mapregex"
  • create the job instance
  • set the job's final output key and value types
  • pass the input file location to the job, and the location where the output files will be stored
  • pass the job the information it needs (class settings)
  • signal that the job is ready to run

1. Filtering : Pattern Examples

• Distributed grep : mapper code
  • declare the variables needed in map
  • setup : allocate the resources map needs (set the regular expression used to search the files)
  • input : TextInputFormat (default)
  • map : return each line that matches the pattern as (key, record), with a null key
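
A minimal sketch of the mapper described above, written in plain Python rather than against the Hadoop API. The class name `GrepMapper`, the `run` helper and the sample lines are hypothetical; only the configuration key `mapregex` comes from the slides. The sketch uses `re.search`, so a line matches if the pattern occurs anywhere in it, which may differ from the matching rule in the original listing.

```python
import re
from typing import Iterable, Iterator, List, Tuple


class GrepMapper:
    def setup(self, config: dict) -> None:
        # Compile the regex once per task instead of once per record.
        self.pattern = re.compile(config["mapregex"])

    def map(self, key: int, line: str) -> Iterator[Tuple[None, str]]:
        # The input key is ignored; matching lines are emitted with a null key.
        if self.pattern.search(line):
            yield (None, line)


def run(lines: Iterable[str], regex: str) -> List[Tuple[None, str]]:
    mapper = GrepMapper()
    mapper.setup({"mapregex": regex})
    out: List[Tuple[None, str]] = []
    # A line index stands in for the byte offset a text input format would supply.
    for index, line in enumerate(lines):
        out.extend(mapper.map(index, line))
    return out


matches = run(["error: disk full", "ok", "warning", "error: timeout"], r"^error")
print(matches)  # [(None, 'error: disk full'), (None, 'error: timeout')]
```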

1. Filtering : Pattern Examples

• Distributed grep : run

• hadoop jar mrdp.jar mrdp.ch3.DistributedGrep <regex> <input> <output>

1. Filtering : Pattern Examples

• Simple Random Sampling (SRS)

• randomly extract a fixed proportion of the records from each file

1. Filtering : Pattern Examples

• Simple Random Sampling (SRS) : main code
  • read the percentage that was entered (0~100)

1. Filtering : Pattern Examples

• Simple Random Sampling (SRS) : mapper code
  • setup : allocate the resources map needs (set the percentage (%) used to select records)
  • map : return each line that satisfies the condition as (key, record); the random value lies in 0.0 ~ 1.0

※ Configuration conf = new Configuration(); conf.set("filter_percentage", ".5");
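
The sampling rule can be sketched as follows, assuming the mapper keeps a record whenever a uniform random number in 0.0 ~ 1.0 falls below percentage / 100. This is plain Python, not the original mapper; `srs_mapper`, the seed and the sample records are made up for the illustration.

```python
import random
from typing import Iterable, Iterator


def srs_mapper(records: Iterable[str], percentage: float, rng: random.Random) -> Iterator[str]:
    """Keep each record independently with probability percentage / 100."""
    threshold = percentage / 100.0
    for record in records:
        if rng.random() < threshold:  # rng.random() is in [0.0, 1.0)
            yield record


rng = random.Random(42)  # fixed seed so the run is repeatable
records = [f"record-{i}" for i in range(10_000)]
sample = list(srs_mapper(records, 10.0, rng))
print(len(sample))  # close to 1000, but not exactly 1000
```

Because every record is decided independently, the sample size is only approximately the requested proportion of the input.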

1. Filtering : Pattern Examples

• Simple Random Sampling (SRS) : run

• hadoop jar mrdp.jar mrdp.ch3.SimpleRandomSampling <percentage> <input> <output>

2. Bloom Filtering : Pattern Description

• Bloom filter : a probabilistic data structure (is A in this set or not?)

[Figure: bloom filter (m=50, k=3). A hash value of h points to an empty bit, so h is not in the bloom filter!]

※ False negatives never occur.

http://www.jasondavies.com/bloomfilter/

2. Bloom Filtering : Pattern Description

• Bloom filtering

• intent : keep the records that belong to a predefined set of values (hot values)

• motivation : use a bloom filter as the test for whether a piece of data is in the set

• applicability :
  • the data can be separated into records (filtering)
  • a hot value can be extracted from each record (feature extraction)
  • the item set of hot values can be determined in advance
  • false negatives can never occur

[Figure: the hot values are a, b, c, d, e. "a must be kept! f is not needed, throw it away!"]

2. Bloom Filtering : Pattern Description

• structure : training + actual filtering
  • adding an element to the bloom filter (training) : compute k hash values for the data being added and set the corresponding bits to 1
  • testing whether an element is in the bloom filter (filtering) : compute the hash values for the new data and read those bits to check whether the data exists in the bloom filter
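
The add and test operations above can be sketched as a small class. This is an illustrative Python implementation, not Hadoop's own Bloom filter class; deriving the k positions by salting SHA-256 with the hash index is an assumption chosen for brevity, and m=50, k=3 simply mirror the figure on the earlier slide.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m            # number of bits
        self.k = k            # number of hash functions
        self.bits = [0] * m

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Training: set the k bits that correspond to the item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 bit proves the item was never added (no false negatives).
        # All 1 bits only mean "possibly present" (false positives can occur).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=50, k=3)
for hot_value in ["a", "b", "c", "d", "e"]:
    bf.add(hot_value)

print(all(bf.might_contain(v) for v in "abcde"))  # True
```

A value that was never added usually reports `False`, but with only 50 bits it can occasionally report `True`; that is the false-positive case the pattern has to tolerate.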

2. Bloom Filtering : Pattern Description

• consequences :
  • a subset of the records that passed the Bloom filter membership test
  • some false-positive records remain

• known uses :
  • removing most of the non-watched values
  • prefiltering a data set for an expensive set membership check

2. Bloom Filtering : Examples

• hotlist

• a Bloom filter is trained with a hot list of keywords

• 100 terms

2. Bloom Filtering : Examples

• Bloom filter training
  • arguments : inputFile (.gz), numMembers (number of training items), falsePosRate (false positive rate), bfFile (the final bloom filter)
  • create the bloom filter
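
The slide does not show how numMembers and falsePosRate turn into a filter size. A common choice is the standard sizing formulas m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions; whether the original driver uses exactly these, or rounds the same way, is an assumption. The 100 terms come from the hot-list slide, while the 1% rate is made up for the example.

```python
import math


def optimal_bits(num_members: int, false_pos_rate: float) -> int:
    """m = -n * ln(p) / (ln 2)^2, rounded up."""
    return math.ceil(-num_members * math.log(false_pos_rate) / (math.log(2) ** 2))


def optimal_hashes(num_members: int, num_bits: int) -> int:
    """k = (m / n) * ln 2, rounded to the nearest integer and at least 1."""
    return max(1, round(num_bits / num_members * math.log(2)))


m = optimal_bits(100, 0.01)   # 100 hot-list terms, 1% false positives (assumed rate)
k = optimal_hashes(100, m)
print(m, k)  # 959 7
```

Lowering the false positive rate grows the bit vector logarithmically, so a much stricter filter costs only a little more memory.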

2. Bloom Filtering : Examples

• Bloom filter training
  • read the file to train on (hotlist)
  • add the data to the bloom filter (training)
  • write the bloom filter out as a file

2. Bloom Filtering : Examples

• Bloom filter training : run

• hadoop jar mrdp.jar mrdp.appendixA.BloomFilterDriver <inputfile> <nummembers> <falseposrate> <bfoutfile>

2. Bloom Filtering : Examples

• bloom filtering driver :
  • record fields : ID, UserId, Text, …

2. Bloom Filtering : Examples

• bloom filtering driver : main code
  • no reduce tasks are used
  • register the file in the distributed cache
  • pass the input file location to the job, and the location where the output files will be stored

2. Bloom Filtering : Examples

• mapper code
  • create an instance
  • matching is case-sensitive : "the" ≠ "The"
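
A sketch of the filtering step, assuming the mapper splits each record's Text field into words and keeps the record if any word passes the membership test. To stay short, an exact Python set stands in for the trained Bloom filter, so this example cannot show false positives; `bloom_mapper` and the sample records are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator


def bloom_mapper(
    records: Iterable[Dict[str, str]], might_contain: Callable[[str], bool]
) -> Iterator[Dict[str, str]]:
    """Emit a record if any whitespace-separated word passes the membership test."""
    for record in records:
        if any(might_contain(word) for word in record["Text"].split()):
            yield record


# Stand-in for a trained Bloom filter: an exact set (assumption for this sketch).
hot_list = {"the", "hadoop"}
records = [
    {"Id": "1", "UserId": "7", "Text": "The cat"},
    {"Id": "2", "UserId": "8", "Text": "I like hadoop"},
    {"Id": "3", "UserId": "9", "Text": "over the hill"},
]
kept_ids = [r["Id"] for r in bloom_mapper(records, hot_list.__contains__)]
print(kept_ids)  # ['2', '3']
```

Record 1 is dropped because "The" and "the" hash differently; lowercasing both the hot list and the words before the test would remove that distinction.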

2. Bloom Filtering : Examples

• Bloom filtering driver : run

2. Bloom Filtering : Examples

• HBase query using a Bloom filter

• HBase : a Hadoop-based NoSQL store that uses HDFS as its physical storage
  • column-oriented / no schema / no joins or indexes
  • a non-relational database that provides a distributed data storage environment

• before running HBase queries, train a Bloom filter on the data

• unnecessary queries are eliminated in advance!
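
The prefiltering idea can be sketched without HBase: a cheap membership test decides whether the expensive lookup is worth making, and the lookup itself removes any false positives. Everything here is a stand-in: a dict plays the HBase table, a set with one deliberately wrong entry ("u9") plays a Bloom filter with a false positive, and `lookup_with_prefilter` is a hypothetical helper.

```python
from typing import Callable, Dict, Iterable, Tuple


def lookup_with_prefilter(
    user_ids: Iterable[str],
    might_contain: Callable[[str], bool],
    store: Dict[str, dict],
) -> Tuple[Dict[str, dict], int]:
    """Query the store only for ids that pass the cheap prefilter."""
    results: Dict[str, dict] = {}
    queries = 0
    for user_id in user_ids:
        if not might_contain(user_id):
            continue                 # definitely absent: skip the expensive query
        queries += 1
        row = store.get(user_id)     # authoritative check removes false positives
        if row is not None:
            results[user_id] = row
    return results, queries


store = {"u1": {"reputation": 1500}, "u2": {"reputation": 20}}  # stand-in for an HBase table
prefilter = {"u1", "u2", "u9"}                                  # "u9" acts as a false positive
results, queries = lookup_with_prefilter(
    ["u1", "u3", "u9", "u4", "u2"], prefilter.__contains__, store
)
print(sorted(results), queries)  # ['u1', 'u2'] 3
```

Five ids arrive but only three queries reach the store, and the one false positive costs a wasted query rather than a wrong result.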

2. Bloom Filtering : Examples

• mapper code (1/2)

2. Bloom Filtering : Examples

• mapper code (2/2)

3. Top Ten : Pattern Description

• intent : rank the data and retrieve only the top 10 records

• motivation :
  • find only the most interesting records
  • find the records that best match a specific criterion

3. Top Ten : Pattern Description

[Figure: the mappers (setup / map / cleanup) take only the records of interest and find their local top ten. One reducer (setup / reduce) receives 10*M records and produces the final top ten.]
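
The two-stage structure above can be sketched in plain Python: each mapper reduces its split to a local top ten, and a single reducer merges at most 10*M records into the final answer. The sketch uses `heapq.nlargest` instead of the sorted map mentioned on the later mapper slide; the `(reputation, user name)` record layout and the function names are assumptions.

```python
import heapq
from typing import List, Tuple

Record = Tuple[int, str]  # (reputation, user name): hypothetical layout


def top_n_mapper(split: List[Record], n: int = 10) -> List[Record]:
    """Return the local top n of one input split (what cleanup would emit)."""
    return heapq.nlargest(n, split)


def top_n_reducer(local_tops: List[List[Record]], n: int = 10) -> List[Record]:
    """Merge at most n * M records from M mappers into the final top n."""
    merged = [record for top in local_tops for record in top]
    return heapq.nlargest(n, merged)


splits = [
    [(i, f"user{i}") for i in range(0, 50)],
    [(i, f"user{i}") for i in range(50, 100)],
]
local_tops = [top_n_mapper(split) for split in splits]
final = top_n_reducer(local_tops)
print([reputation for reputation, _ in final])
# [99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
```

Only 20 of the 100 records reach the reducer here, which is why a single reducer is affordable for this pattern.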

3. Top Ten : Pattern Description

• consequences : the top K records are returned (10 records)

• known uses :
  • outlier analysis
  • select interesting data
  • catchy dashboards

• resemblances :
  • SQL : select * from table order by col4 desc limit 10;
  • Pig : B = order A by col4 desc; C = limit B 10;

• performance analysis : one single reducer

3. Top Ten : Examples

• top ten users by reputation
  • find the 10 users with the highest Reputation

• Id, Reputation, CreationDate, DisplayName, …

• 2.7MB

3. Top Ten : Examples

• Top Ten : main code
  • number of reducers = 1

3. Top Ten : Examples

• mapper code :
  • create a TreeMap (keeps its entries sorted)
  • a record that falls below 10th place is discarded

3. Top Ten : Examples

• mapper code :
  • cleanup : output the records that finally remain, with a null key

3. Top Ten : Examples

• reducer code : keep adding records by Reputation, then output the result records that meet the condition (null key)

3. Top Ten : Examples

• Top Ten Driver : run

• hadoop jar mrdp.jar mrdp.ch3.TopTenDriver <input> <output>

4. Distinct : Pattern Description

• intent : obtain a set that contains only unique values

• motivation : reducing a data set to a unique set of values has several uses

• structure :
  • output only the value (duplicates are handled)
  • output the key for each value
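
One common way to realise this structure, sketched in plain Python under the assumption that the mapper emits the value of interest as the key with an empty value, so that grouping by key collapses the duplicates and the reducer emits each key once. The `shuffle` helper imitates the framework's grouping step; all names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def distinct_mapper(records: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, None]]:
    for record in records:
        yield (record["UserId"], None)  # the value carries no information


def shuffle(pairs: Iterable[Tuple[str, None]]) -> Dict[str, List[None]]:
    # Stand-in for the framework's group-by-key step.
    groups: Dict[str, List[None]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def distinct_reducer(key: str, values: List[None]) -> Iterator[Tuple[str, None]]:
    yield (key, None)  # one output per key, however many values arrived


records = [{"UserId": u} for u in ["3", "1", "3", "2", "1"]]
groups = shuffle(distinct_mapper(records))
unique = sorted(
    out_key
    for key, values in groups.items()
    for out_key, _ in distinct_reducer(key, values)
)
print(unique)  # ['1', '2', '3']
```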

4. Distinct : Pattern Description

• consequences : unique records

• known uses :
  • deduplicate data
  • getting distinct values
  • protecting from an inner join explosion

• resemblances :
  • SQL : select distinct * from table;
  • Pig : B = distinct A;

• performance analysis : use as many reducers as needed

4. Distinct : Examples

• distinct user IDs

• Id, PostId, Text, UserId, …

• 5.9MB

4. Distinct : Examples

• Distinct : main code
  • combiner : applies the reduce code to the map task output first
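
The effect of the combiner can be illustrated by counting how many records each map task would hand to the shuffle with and without it. This is a plain Python sketch with made-up splits; `map_task` and `combine` are hypothetical helpers, and the combiner simply applies the same deduplication the reducer would.

```python
from typing import List, Tuple

Pair = Tuple[str, None]


def map_task(user_ids: List[str]) -> List[Pair]:
    return [(user_id, None) for user_id in user_ids]


def combine(pairs: List[Pair]) -> List[Pair]:
    """Apply the reducer's dedup logic to one map task's output."""
    seen = dict.fromkeys(key for key, _ in pairs)  # keeps first-seen order
    return [(key, None) for key in seen]


splits = [["1", "1", "2", "1"], ["2", "3", "3", "3"]]
shuffled_without = sum(len(map_task(split)) for split in splits)
shuffled_with = sum(len(combine(map_task(split))) for split in splits)
print(shuffled_without, shuffled_with)  # 8 4
```

The reducer is still needed afterwards: user "2" survives in both splits, and only the reduce step merges those two copies.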

4. Distinct : Examples

• mapper code

4. Distinct : Examples

• reducer code
  • null value

4. Distinct : Examples

• Distinct User Driver : run

• hadoop jar mrdp.jar mrdp.ch3.DistinctUserDriver <input> <output>

5. Viewing Hadoop results and deleting the output folder

• sudo -u gh hadoop fs -cat output/part-r-00000

• sudo -u gh hadoop fs -rmr -skipTrash output