On Large-Scale Retrieval Tasks with Ivory and MapReduce

Tamer Elsayed, Qatar University. Nov 7th, 2012


Page 1: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Tamer Elsayed
Qatar University

On Large-Scale Retrieval Tasks with Ivory and MapReduce

Nov 7th, 2012

Page 2: On Large-Scale Retrieval Tasks with Ivory and MapReduce

My Field …

Information Retrieval (IR) is …
Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections

Quite effective (at some things)
Highly visible (mostly)
Commercially successful (some of them)

Page 3: On Large-Scale Retrieval Tasks with Ivory and MapReduce

IR is not just “Document Retrieval”
● Clustering and Classification
● Question answering
● Filtering, tracking, routing
● Recommender systems
● Leveraging XML and other Metadata
● Text mining
● Novelty identification
● Meta-search (multi-collection searching)
● Summarization
● Cross-language mechanisms
● Evaluation techniques
● Multimedia retrieval
● Social media analysis
● …

Page 4: On Large-Scale Retrieval Tasks with Ivory and MapReduce

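My Research …

[Diagram: large-scale text processing across two kinds of text and user applications: emails (Enron, ~500,000; Identity Resolution) and web pages (ClueWeb, ~1,000,000,000; Web Search). Axis labels: Text, Large-Scale Processing, User Application.]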

Page 5: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Back in 2009 …

Before 2009, only small text collections were available
● Largest: ~1M documents

ClueWeb09
● Crawled by CMU in 2009
● ~1B documents!
● Need to move to cluster environments

MapReduce/Hadoop seemed like a promising framework

Page 6: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Ivory
http://ivory.cc

E2E Search Toolkit using MapReduce
Completely designed for the Hadoop environment
Experimental platform for research
Supports common text collections
● + ClueWeb09
Open source release
Implements state-of-the-art retrieval models

Page 7: On Large-Scale Retrieval Tasks with Ivory and MapReduce

MapReduce Framework

[Diagram: four inputs feed four map tasks; shuffling groups values by key; three reduce tasks write three outputs.]

(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by key
(c) Reduce: (k2, [v2]) → [(k3, v3)]

Framework handles “everything else”!
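The three stages can be imitated in a few lines of single-process Python. This is only a sketch of the programming model, not Hadoop's API: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and word counting is used purely as a familiar example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory stand-in for the framework (not Hadoop itself)."""
    # (a) Map: each (k1, v1) record produces a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) group produces a list of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def wc_map(doc_id, text):
    # Emit (term, 1) for every token in the document.
    for term in text.split():
        yield term, 1

def wc_reduce(term, counts):
    # Sum the partial counts for one term.
    yield term, sum(counts)

docs = [("A", "Clinton Obama Clinton"),
        ("B", "Clinton Romney"),
        ("C", "Clinton Barack Obama")]
counts = run_mapreduce(docs, wc_map, wc_reduce)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; that is the “everything else” the framework takes care of.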

Page 8: On Large-Scale Retrieval Tasks with Ivory and MapReduce

The IR Black Box

[Diagram: Query + Documents → (black box) → Hits]

Page 9: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Inside the IR Black Box

[Diagram:
online: Query → Representation Function → Query Representation
offline: Documents → Representation Function → Document Representation → Index
Query Representation + Index → Comparison Function → Hits]

Page 10: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Indexing

Collection (Documents, IDs):
A: Clinton Obama Clinton
B: Clinton Cheney
C: Clinton Barack Obama

Inverted Index (Terms, Posting Lists):
Cheney: (B, 1)
Barack: (C, 1)
Obama: (A, 1), (C, 1)
Clinton: (A, 2), (B, 1), (C, 1)

Page 11: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Indexing

Collection (Documents, IDs):
A: Clinton Obama Clinton
B: Clinton Romney
C: Clinton Barack Obama

Inverted Index (Terms, Posting Lists):
Romney: (B, 1)
Barack: (C, 1)
Obama: (A, 1), (C, 1)
Clinton: (A, 2), (B, 1), (C, 1)

Page 12: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Indexing
(a) Map (b) Shuffle (c) Reduce

[Diagram: three map tasks read documents A (Clinton Obama Clinton), B (Clinton Romney), and C (Clinton Barack Obama) and emit their terms; shuffling groups the emitted postings by term (Clinton, Obama, Romney, Barack); four reduce tasks write the posting list of each term.]
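The indexing flow in the diagram can be sketched as a mapper that emits one posting per distinct term and a reducer that collects the postings of a term into a list. This is a minimal in-memory illustration, not Ivory's code; `index_map`, `index_reduce`, and `build_index` are hypothetical names.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    # (a) Map: emit (term, (doc_id, tf)) for every distinct term in the document.
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def index_reduce(term, postings):
    # (c) Reduce: the sorted postings form the term's posting list.
    return term, sorted(postings)

def build_index(docs):
    # (b) Shuffle: group postings by term (done by the framework on a cluster).
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)
    return dict(index_reduce(term, postings) for term, postings in grouped.items())

docs = {"A": "Clinton Obama Clinton",
        "B": "Clinton Romney",
        "C": "Clinton Barack Obama"}
index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```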

Page 13: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Retrieval Directly from HDFS!

Cute hack: use Hadoop to launch partition servers
● Embed an HTTP server inside each mapper
● Mappers start up, initialize servers, enter into infinite service loop!

Why do this?
● Unified Hadoop ecosystem
● Simplifies data management issues

[Diagram components: Search Client, Retrieval Broker, Partition Servers, HDFS namenode, HDFS datanodes, Local Disk. Labels: TREC’09, TREC’10.]

Page 14: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Roadmap

Indexing & Retrieval
• Batch Retrieval
• Approx. Pos. Indexes
Pairwise Similarity
• Monolingual
• Cross-Lingual
Pseudo Test Collection
• Training L2R
Iterative Process
• iHadoop

Ivory

Venues: SIGIR 2011, SIGIR 2011, CIKM 2011, ACL 2008, TREC 2009, TREC 2010, CloudCom 2011

Page 15: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Roadmap

Indexing & Retrieval
• Batch Retrieval
• Approx. Pos. Indexes
Pairwise Similarity
• Monolingual
• Cross-Lingual
Pseudo Test Collection
• Training L2R
Iterative Process
• iHadoop

Ivory

Venues: SIGIR 2011, ACL 2008

Page 16: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Abstract Problem

[Figure: a collection of documents is turned into a matrix of pairwise similarity scores (sample values: 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74).]

Applications:
● Clustering
● Coreference resolution
● “more-like-that” queries

Page 17: On Large-Scale Retrieval Tasks with Ivory and MapReduce

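Decomposition

[Equation not recoverable from the transcript; its parts are labeled “map” and “reduce”.]

Each term contributes only if it appears in both documents of a pair.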
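The formula on this slide did not survive extraction. A plausible form, assuming similarity is the inner product of term weights, is sketched below; the symbols $w_{t,d}$ (weight of term $t$ in document $d$) and $V$ (the vocabulary) are assumptions, not taken from the slide.

```latex
\mathrm{sim}(d_i, d_j)
  \;=\; \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  \;=\; \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
```

Under this reading, the per-term products $w_{t,d_i}\, w_{t,d_j}$ would be what the map stage emits from each posting list, and the outer sum would be what the reduce stage computes per document pair.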

Page 18: On Large-Scale Retrieval Tasks with Ivory and MapReduce

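Pairwise Similarity
(a) Generate pairs (b) Group pairs (c) Sum pairs

[Diagram: (a) the posting list of each term (Clinton, Barack, Romney, Obama) generates document pairs with partial scores; (b) partial scores are grouped by document pair; (c) the grouped scores are summed into one similarity score per pair.]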
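The three steps can be sketched over the small inverted index from the indexing slides. This is an in-memory illustration that assumes raw term frequencies as weights; `generate_pairs` and `pairwise_similarity` are hypothetical names, not Ivory functions.

```python
from collections import defaultdict
from itertools import combinations

def generate_pairs(postings):
    # (a) For one term, emit a partial score for every pair of documents
    # that both contain the term: the product of their term weights.
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2

def pairwise_similarity(index):
    # (b) Group partial scores by document pair and (c) sum them.
    scores = defaultdict(int)
    for postings in index.values():
        for pair, partial in generate_pairs(postings):
            scores[pair] += partial
    return dict(scores)

index = {
    "Clinton": [("A", 2), ("B", 1), ("C", 1)],
    "Obama": [("A", 1), ("C", 1)],
    "Romney": [("B", 1)],
    "Barack": [("C", 1)],
}
sims = pairwise_similarity(index)
print(sims)
```

Terms that occur in a single document (Romney, Barack) generate no pairs at all, which is why only shared terms contribute to a score.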

Page 19: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Terms: Zipfian Distribution

[Plot: doc freq (df) vs. term rank]

Each term t contributes O(df_t²) partial results

Very few terms dominate the computations
● most frequent term (“said”): 3%
● most frequent 10 terms: 15%
● most frequent 100 terms: 57%
● most frequent 1000 terms: 95%

~0.1% of total terms (99.9% df-cut)
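To see why dropping a handful of very frequent terms removes so much work, the sketch below counts the document pairs each term would generate, df·(df−1)/2, before and after a df-cut. The document frequencies are synthetic (a rough Zipf-like curve), and `df_cut` and `total_pairs` are hypothetical helpers, so the numbers are illustrative only.

```python
def pairs_for_term(df):
    # A term occurring in df documents generates df*(df-1)/2 document pairs.
    return df * (df - 1) // 2

def total_pairs(dfs):
    return sum(pairs_for_term(df) for df in dfs.values())

def df_cut(dfs, keep_fraction):
    # Keep the `keep_fraction` of terms with the lowest df,
    # i.e. drop the most frequent terms.
    ranked = sorted(dfs.items(), key=lambda kv: kv[1])
    keep = round(len(ranked) * keep_fraction)
    return dict(ranked[:keep])

# Synthetic Zipf-like document frequencies for 1000 terms (assumption).
dfs = {f"t{rank}": max(1, 1000 // rank) for rank in range(1, 1001)}

before = total_pairs(dfs)
after = total_pairs(df_cut(dfs, 0.999))  # 99.9% df-cut drops 1 term here
print(before, after, f"{100 * (before - after) / before:.1f}% of pairs removed")
```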

Page 20: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Efficiency (disk space)

[Plot: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100). Series: no df-cut, df-cut at 99.999%, df-cut at 99.99%, df-cut at 99.9%, df-cut at 99%. Annotations: 8 trillion intermediate pairs; 0.5 trillion intermediate pairs.]

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~906k docs

Page 21: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Effectiveness
Effect of df-cut on effectiveness

Medline04
● 909k abstracts
● Ad-hoc retrieval

[Plot: relative P5 (%, 50 to 100) vs. df-cut (%, 99.00 to 100.00)]

Drop 0.1% of terms
● “Near-linear” growth
● Fit on disk
● Cost 2% in effectiveness

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

ACL’08

Page 22: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Cross-Lingual Pairwise Similarity

Find similar document pairs in different languages
Multilingual text mining, Machine Translation
Application: automatic generation of potential “interwiki” language links
More difficult than monolingual!

Page 23: On Large-Scale Retrieval Tasks with Ivory and MapReduce

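Vocabulary Space Matching

MT approach:
● Doc A (German) → MT translate → Doc A (English) → doc vector A
● Doc B (English) → doc vector B

CLIR approach:
● Doc A (German) → doc vector A → CLIR project → doc vector A (English)
● Doc B (English) → doc vector B

$\mathrm{df}(e) = \sum_{f \in F} p(f \mid e)\,\mathrm{df}(f)$
$\mathrm{tf}(e) = \sum_{f \in F} p(f \mid e)\,\mathrm{tf}(f)$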
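The CLIR projection can be sketched as a weighted sum over a translation table, assuming it takes the form tf(e) = Σ_f p(f|e)·tf(f). The table values, the term pairs, and the `project` helper below are all hypothetical and chosen only for illustration; they are not from the talk.

```python
def project(tf_f, p_f_given_e):
    """Project a German term-frequency vector into English vocabulary space.

    tf_f:        German term -> term frequency in the document
    p_f_given_e: English term -> {German term: translation probability}
                 (hypothetical table; values are made up)
    """
    tf_e = {}
    for e, probs in p_f_given_e.items():
        # Sum the frequencies of German terms, weighted by translation probability.
        score = sum(p * tf_f.get(f, 0) for f, p in probs.items())
        if score > 0:
            tf_e[e] = score
    return tf_e

p_f_given_e = {
    "prize": {"preis": 0.75, "auszeichnung": 0.25},
    "price": {"preis": 1.0},
    "book": {"buch": 1.0},
}
german_doc = {"preis": 2, "buch": 1}
english_vec = project(german_doc, p_f_given_e)
print(english_vec)
```

The same weighted sum applies to document frequencies. The appeal of projecting vectors rather than translating text is that no full machine translation system is needed, only a probability table.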

Page 24: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Locality-Sensitive Hashing (LSH)

Cosine score is a good similarity measure but expensive!
LSH is a method for effectively reducing the search space when looking for similar pairs
Each vector is converted into a compact representation, called a signature
A sliding window-based algorithm uses these signatures to search for similar articles in the collection
Vectors close to each other are likely to have similar signatures
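One common way to build such signatures is random projection: each bit records on which side of a random hyperplane the vector falls, so vectors at a small angle agree on most bits. The sketch below assumes that scheme (the talk also mentions Minhash and Simhash); `make_hyperplanes`, `signature`, and `hamming` are hypothetical names, and the term weights are made up.

```python
import random

def make_hyperplanes(vocab, num_bits, seed=0):
    # One random Gaussian vector per signature bit.
    rng = random.Random(seed)
    return [{t: rng.gauss(0.0, 1.0) for t in vocab} for _ in range(num_bits)]

def signature(vec, hyperplanes):
    # Bit i is 1 if the vector lies on the non-negative side of hyperplane i.
    bits = []
    for plane in hyperplanes:
        dot = sum(w * plane.get(t, 0.0) for t, w in vec.items())
        bits.append(1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    # Number of bit positions where two signatures differ.
    return sum(x != y for x, y in zip(a, b))

vocab = ["nobel", "prize", "book", "football", "goal"]
planes = make_hyperplanes(vocab, num_bits=64)

doc_a = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
doc_b = {"nobel": 0.648, "prize": 0.454, "book": 0.02}   # same direction as doc_a
doc_c = {"football": 0.5, "goal": 0.4}

sig_a, sig_b, sig_c = (signature(d, planes) for d in (doc_a, doc_b, doc_c))
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))
```

Comparing two 64-bit signatures costs far less than a cosine over sparse vectors, at the price of only approximating the angle between them.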

Page 25: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Solution Overview

[Pipeline:
Nf German articles → CLIR projection
Ne English articles → Preprocess
→ Ne+Nf English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …>
→ Signature generation (Random Projection/Minhash/Simhash)
→ Ne+Nf signatures, e.g. 0111000010111100001010
→ Sliding window algorithm
→ Similar article pairs]

Page 26: On Large-Scale Retrieval Tasks with Ivory and MapReduce

MapReduce 1: Table Generation Phase

[Diagram: the signatures are permuted with Q permutations p1 … pQ, giving signature sets S1 … SQ; each set is sorted, giving S1’ … SQ’, the tables.]

Page 27: On Large-Scale Retrieval Tasks with Ivory and MapReduce

MapReduce 2: Detection Phase

[Diagram: the sorted signature tables are processed as table chunks.]
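The two phases can be sketched together on one machine: build several tables by permuting the bits of every signature and sorting, then slide a window over each sorted table and compare only neighbors. This is a simplified illustration under stated assumptions; `sliding_window_pairs` and its parameters (`num_tables`, `window`, `max_dist`) are hypothetical names, and the real system distributes both phases as MapReduce jobs.

```python
import random

def hamming(a, b):
    # Number of positions where two equal-length bit strings differ.
    return sum(x != y for x, y in zip(a, b))

def sliding_window_pairs(signatures, num_tables, window, max_dist, seed=0):
    """signatures: dict doc_id -> bit string (all of equal length)."""
    rng = random.Random(seed)
    num_bits = len(next(iter(signatures.values())))
    found = set()
    for _ in range(num_tables):
        # Table generation: permute bit positions, then sort the permuted signatures.
        perm = list(range(num_bits))
        rng.shuffle(perm)
        table = sorted(("".join(sig[i] for i in perm), doc)
                       for doc, sig in signatures.items())
        # Detection: compare each entry with the next (window - 1) entries only.
        for i, (sig_i, doc_i) in enumerate(table):
            for sig_j, doc_j in table[i + 1:i + window]:
                if hamming(sig_i, sig_j) <= max_dist:
                    found.add(tuple(sorted((doc_i, doc_j))))
    return found

signatures = {
    "d1": "0000000000",
    "d2": "0000000001",
    "d3": "1111111111",
}
pairs = sliding_window_pairs(signatures, num_tables=4, window=3, max_dist=1)
print(pairs)
```

Sorting places signatures that share a long prefix next to each other, and different permutations give different prefixes, so more tables and wider windows raise recall while also raising cost.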

Page 28: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Evaluation

Ground truth:
● Sample 1064 German articles
● Cosine score >= 0.3

Compare sliding window with brute force approach
● Required for exact solution
● Good reference as an upper bound for recall and running time

Page 29: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Evaluation

95% recall at 39% cost

99% recall at 62% cost

No Free Lunch!

Page 30: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Contribution to Wikipedia

Identify links between German and English Wikipedia articles
● “Metadaten” → “Metadata”, “Semantic Web”, “File Format”
● “Pierre Curie” → “Marie Curie”, “Pierre Curie”, “Helene Langevin-Joliot”
● “Kirgisistan” → “Kyrgyzstan”, “Tulip Revolution”, “2010 Kyrgyzstani uprising”, “2010 South Kyrgyzstan riots”, “Uzbekistan”

Bad results when there is a significant difference in length.

SIGIR’11

Page 31: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Roadmap

Indexing & Retrieval
• Batch Retrieval
• Approx. Pos. Indexes
Pairwise Similarity
• Monolingual
• Cross-Lingual
Pseudo Test Collection
• Training L2R
Iterative Process
• iHadoop

Ivory

Venue: CIKM 2011

Page 32: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Approximate Positional Indexes

[Diagram: “Learning to Rank” models learn effective ranking functions from proximity features, which need term positions.
Term positions: large index (X), slow query evaluation (X)
Approximate positions: smaller index (√), faster query evaluation (√)]

Close Enough is Good Enough?

Page 33: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Variable-Width Buckets

5 buckets / document

[Diagram: documents d1 and d2, of different lengths, are each divided into 5 buckets numbered 1 to 5.]

Page 34: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Fixed-Width Buckets

Buckets of length W

[Diagram: d1 is divided into buckets 1 to 5 and the shorter d2 into buckets 1 to 3.]
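Both bucketing schemes replace an exact term position with a bucket number. A minimal sketch, assuming 0-based token positions and 1-based bucket numbers as in the figures; the function names and these conventions are assumptions, not the paper's definitions.

```python
def variable_width_bucket(position, doc_length, num_buckets=5):
    # Every document gets the same number of buckets,
    # so the bucket width grows with document length.
    return position * num_buckets // doc_length + 1

def fixed_width_bucket(position, width):
    # Every bucket covers `width` positions,
    # so longer documents get more buckets.
    return position // width + 1

# A 100-token document and a 10-token document, 5 buckets each.
print(variable_width_bucket(99, 100), variable_width_bucket(9, 10))

# Fixed width W = 20: a 100-token document spans 5 buckets, a 60-token one spans 3.
print(fixed_width_bucket(99, 20), fixed_width_bucket(59, 20))
```

Storing a small bucket number instead of a full position is what shrinks the index; proximity features are then computed from bucket co-occurrence rather than exact distances.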

Page 35: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Effectiveness

CIKM’11

Page 36: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Roadmap

Indexing & Retrieval
• Batch Retrieval
• Approx. Pos. Indexes
Pairwise Similarity
• Monolingual
• Cross-Lingual
Pseudo Test Collections
• Training L2R
• Evaluation
iHadoop

Ivory

Venue: SIGIR ‘11

Page 37: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Test Collections

Documents, queries, and relevance judgments
Important driving force behind IR innovation
Without test collections, it’s impossible to:
● Evaluate search systems
● Tune ranking functions / train models

Traditional
● Exhaustive
● Pooling

Recent Methodologies
● Behavioral logging (query logs, click logs, etc.)
● Minimal test collections
● Crowdsourcing

Page 38: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Web Graph

[Diagram: pages P1 to P7 and Google, connected by links; several links carry the anchor text “web search”.]

SIGIR 2012

Page 39: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Queries and Judgments?

[Diagram: pages P1 to P7, Google, and Bing, connected by links with the anchor text “web search”.]

anchor text lines ≈ pseudo queries
target pages ≈ relevant candidates
noise reduction?

SIGIR 2012
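The idea can be sketched as a grouping step over the link graph: each anchor text becomes a pseudo query and the pages it points to become candidate relevant documents. The noise filter below (keep a query/target pair only if enough distinct pages use it) is an assumed heuristic for illustration, not the method from the talk, and `pseudo_test_collection` and the sample links are hypothetical.

```python
from collections import defaultdict

def pseudo_test_collection(links, min_sources=2):
    """links: iterable of (source_page, anchor_text, target_page)."""
    # Collect the distinct source pages behind each (anchor text, target) pair.
    sources = defaultdict(set)
    for source, anchor, target in links:
        query = " ".join(anchor.lower().split())  # normalize case and whitespace
        sources[(query, target)].add(source)
    # Assumed noise reduction: require several independent sources per pair.
    judgments = defaultdict(set)
    for (query, target), srcs in sources.items():
        if len(srcs) >= min_sources:
            judgments[query].add(target)
    return dict(judgments)

links = [
    ("P1", "web search", "Google"),
    ("P2", "Web Search", "Google"),
    ("P3", "web search", "Google"),
    ("P4", "web  search", "Google"),
    ("P6", "web search", "Bing"),
    ("P7", "click here", "Google"),
]
collection = pseudo_test_collection(links)
print(collection)
```

With a threshold of one source, every target would be kept; the threshold is one possible knob for the noise reduction question raised on the slide.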

Page 40: On Large-Scale Retrieval Tasks with Ivory and MapReduce


SIGIR’11

Page 41: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Roadmap

Indexing & Retrieval
• Batch Retrieval
• Approx. Pos. Indexes
Pairwise Similarity
• Monolingual
• Cross-Lingual
Pseudo Test Collection
• Training L2R
Iterative Process
• iHadoop

Ivory

Venue: CloudCom 2011

Page 42: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Iterative MapReduce Applications

Many machine learning and data mining applications
● PageRank, k-means, HITS, …

Every iteration has to wait until the previous iteration has written its output completely to the DFS (unnecessary waiting time)

Every iteration starts by reading from the DFS what has just been written by the earlier iteration (wastes CPU time, I/O, bandwidth)

MapReduce is not designed to run iterative applications efficiently
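The iteration pattern behind these costs can be sketched with PageRank: every pass consumes the complete output of the previous one. The code is a single-process illustration that assumes every page has at least one out-link and every link target is itself a key of the graph; `pagerank_iteration` and `run` are hypothetical names.

```python
def pagerank_iteration(graph, ranks, damping=0.85):
    # Map-like step: every page sends rank / out-degree to each page it links to.
    contributions = {page: 0.0 for page in graph}
    for page, out_links in graph.items():
        for target in out_links:
            contributions[target] += ranks[page] / len(out_links)
    # Reduce-like step: combine the contributions received by each page.
    n = len(graph)
    return {page: (1 - damping) / n + damping * c
            for page, c in contributions.items()}

def run(graph, iterations):
    ranks = {page: 1.0 / len(graph) for page in graph}
    for _ in range(iterations):
        # On Hadoop each pass is a separate job: its full output is written
        # to the DFS, and the next job starts by reading it back.
        ranks = pagerank_iteration(graph, ranks)
    return ranks

graph = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]}
ranks = run(graph, iterations=20)
print(ranks)
```

In this in-memory loop the hand-off between iterations is a variable assignment; on a cluster it is a full write and read of the data set, which is the waiting time and wasted I/O listed above.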

Page 43: On Large-Scale Retrieval Tasks with Ivory and MapReduce


Goal

Page 44: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Asynchronous Pipeline

CloudCom’11

Page 45: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Conclusion

MapReduce allows large-scale processing over web data

Ivory
● E2E open-source retrieval engine for research
● Completely on Hadoop
• even retrieval: from HDFS

Efficiency-effectiveness tradeoff
● Cross-Lingual Pairwise Similarity
• Efficient implementation using MapReduce
• Efficiency-effectiveness tradeoff
● Approx. Positional Indexes
• Efficient and as effective as exact positions
● Pseudo Test Collections
• Possible!
• Effective for training L2R models

MapReduce is not good for iterative algorithms

http://ivory.cc

Page 46: On Large-Scale Retrieval Tasks with Ivory and MapReduce

Collaborators
● Jimmy Lin
● Don Metzler
● Doug Oard
● Ferhan Ture
● Nima Asadi
● Lidan Wang
● Eslam Elnikety
● Hany Ramadan

Page 47: On Large-Scale Retrieval Tasks with Ivory and MapReduce


Thank You!

Questions?