On Large-Scale Retrieval Tasks with Ivory and MapReduce

Tamer Elsayed, Qatar University

Nov 7th, 2012

My Field …

Information Retrieval (IR) is … finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections.

● Quite effective (at some things)
● Highly visible (mostly)
● Commercially successful (some of them)

http://www.yahoo.com/

IR is not just “Document Retrieval”
● Clustering and Classification
● Question answering
● Filtering, tracking, routing
● Recommender systems
● Leveraging XML and other Metadata
● Text mining
● Novelty identification
● Meta-search (multi-collection searching)
● Summarization
● Cross-language mechanisms
● Evaluation techniques
● Multimedia retrieval
● Social media analysis
● …

My Research …

[Figure: research mapped along three axes: Text, Large-Scale Processing, and User Application.]
● Emails: Enron, ~500,000; Identity Resolution
● + web pages: ClueWeb, ~1,000,000,000; Web Search

Back in 2009 …

Before 2009, only small text collections were available
● Largest: ~1M documents

ClueWeb09
● Crawled by CMU in 2009
● ~1B documents!
● Need to move to cluster environments

MapReduce/Hadoop seemed like a promising framework

Ivory (http://ivory.cc)

● E2E Search Toolkit using MapReduce
● Completely designed for the Hadoop environment
● Experimental platform for research
● Supports common text collections
  ● + ClueWeb09
● Open source release
● Implements state-of-the-art retrieval models

MapReduce Framework

[Figure: inputs feed parallel map tasks; shuffling groups values by key; reduce tasks write the outputs.]

(a) Map: (k1, v1) → [k2, v2]
(b) Shuffle: group values by key → (k2, [v2])
(c) Reduce: (k2, [v2]) → [(k3, v3)]

Framework handles “everything else”!
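The three stages on this slide can be imitated in a few lines of single-process Python. This is only a sketch of the programming model, not of Hadoop: `run_mapreduce`, `count_mapper` and `count_reducer` are hypothetical names, and the three toy documents are borrowed from the indexing slides.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Run one in-memory MapReduce pass: map, shuffle (group by key), reduce."""
    # (a) Map: each (k1, v1) record yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))

    # (b) Shuffle: group values by key, giving (k2, [v2]).
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def count_mapper(doc_id, text):
    # Emit (term, 1) for every token in the document.
    for term in text.split():
        yield term, 1


def count_reducer(term, counts):
    # Sum the counts collected for one term.
    yield term, sum(counts)


docs = [
    ("A", "Clinton Obama Clinton"),
    ("B", "Clinton Romney"),
    ("C", "Clinton Barack Obama"),
]
word_counts = run_mapreduce(docs, count_mapper, count_reducer)
```

Only the mapper and the reducer are problem-specific; grouping, ordering and (in a real cluster) distribution and fault tolerance are the part the framework takes care of.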

The IR Black Box

Query + Documents → Hits

Inside the IR Black Box

● Online: Query → Representation Function → Query Representation
● Offline: Documents → Representation Function → Document Representation → Index
● Comparison Function (Query Representation vs. Index) → Hits

Indexing

Collection (Documents, IDs):
● A: Clinton Obama Clinton
● B: Clinton Cheney
● C: Clinton Barack Obama

Inverted Index (Terms, Posting Lists):
● Clinton: (A, 2), (B, 1), (C, 1)
● Obama: (A, 1), (C, 1)
● Cheney: (B, 1)
● Barack: (C, 1)

Indexing

Collection (Documents, IDs):
● A: Clinton Obama Clinton
● B: Clinton Romney
● C: Clinton Barack Obama

Inverted Index (Terms, Posting Lists):
● Clinton: (A, 2), (B, 1), (C, 1)
● Obama: (A, 1), (C, 1)
● Romney: (B, 1)
● Barack: (C, 1)

Indexing: (a) Map (b) Shuffle (c) Reduce

[Figure: documents A (“Clinton Obama Clinton”), B (“Clinton Romney”) and C (“Clinton Barack Obama”) go through map tasks, the emitted terms are shuffled by term, and reduce tasks write the posting lists for Clinton, Obama, Romney and Barack.]
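A minimal in-memory sketch of the indexing job, using the three toy documents from the slides. It assumes that the map step emits one (term, (doc ID, term frequency)) posting per distinct term and that the reduce step sorts each posting list by document ID; `map_document` and `build_index` are hypothetical names, not Ivory's API.

```python
from collections import Counter, defaultdict


def map_document(doc_id, text):
    """Map: emit (term, (doc_id, tf)) for each distinct term in one document."""
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)


def build_index(collection):
    """Shuffle postings by term, then reduce each group into a posting list."""
    grouped = defaultdict(list)
    for doc_id, text in collection.items():
        for term, posting in map_document(doc_id, text):
            grouped[term].append(posting)  # shuffle: group postings by term
    # Reduce: order each posting list by document ID.
    return {term: sorted(postings) for term, postings in grouped.items()}


collection = {
    "A": "Clinton Obama Clinton",
    "B": "Clinton Romney",
    "C": "Clinton Barack Obama",
}
index = build_index(collection)
```

Here `index["Clinton"]` holds the postings (A, 2), (B, 1), (C, 1), matching the inverted index drawn on the Indexing slide.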

Retrieval Directly from HDFS!

Cute hack: use Hadoop to launch partition servers
● Embed an HTTP server inside each mapper
● Mappers start up, initialize servers, enter into infinite service loop!

Why do this?
● Unified Hadoop ecosystem
● Simplifies data management issues

[Figure components: Search Client, Retrieval Broker, Partition Servers, HDFS datanodes, HDFS namenode, Local Disk.]

TREC’09, TREC’10
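The broker and partition-server roles can be sketched as follows. This is an illustration under strong assumptions: plain in-process objects stand in for the HTTP servers embedded in mappers, scores are precomputed toy numbers, and `PartitionServer` and `RetrievalBroker` are hypothetical classes rather than Ivory code.

```python
import heapq


class PartitionServer:
    """Hypothetical stand-in for one partition server holding part of the index."""

    def __init__(self, scores_by_term):
        # term -> {doc_id: score}; precomputed here purely for illustration.
        self.scores_by_term = scores_by_term

    def search(self, term, k):
        """Return this partition's top-k (score, doc_id) hits for a term."""
        hits = self.scores_by_term.get(term, {})
        return heapq.nlargest(k, ((score, doc) for doc, score in hits.items()))


class RetrievalBroker:
    """Forwards a query to every partition and merges the partial rankings."""

    def __init__(self, servers):
        self.servers = servers

    def search(self, term, k):
        partial = []
        for server in self.servers:
            partial.extend(server.search(term, k))
        # Global top-k over the union of the per-partition top-k lists.
        return [(doc, score) for score, doc in heapq.nlargest(k, partial)]


servers = [
    PartitionServer({"clinton": {"A": 0.9, "B": 0.4}}),
    PartitionServer({"clinton": {"C": 0.7}, "obama": {"C": 0.5}}),
]
broker = RetrievalBroker(servers)
top2 = broker.search("clinton", k=2)
```

Asking each partition for its own top k is enough for the broker to recover the global top k, because every globally top-ranked document is also top-ranked within its partition.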

Roadmap (Ivory)

● Indexing & Retrieval
  • Batch Retrieval (TREC 2009, TREC 2010)
  • Approx. Pos. Indexes (CIKM 2011)
● Pairwise Similarity
  • Monolingual (ACL 2008)
  • Cross-Lingual (SIGIR 2011)
● Pseudo Test Collection
  • Training L2R (SIGIR 2011)
● Iterative Process
  • iHadoop (CloudCom 2011)

Roadmap: Pairwise Similarity (ACL 2008, SIGIR 2011)

Abstract Problem

[Figure: a collection of documents is turned into a matrix of pairwise similarity scores (0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74, …).]

Applications:
● Clustering
● Coreference resolution
● “more-like-that” queries

Decomposition

Each term contributes only if it appears in both documents of a pair.

[Equation figure not recovered; its parts were labeled “map” and “reduce”.]

Pairwise Similarity: (a) Generate pairs (b) Group pairs (c) Sum pairs

[Figure: the posting lists of Clinton, Barack, Romney and Obama generate partial scores for each document pair; the partial scores are grouped by pair and summed.]
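The three steps can be sketched over the toy inverted index from the indexing slides. The sketch assumes that similarity is the inner product of term weights and that the weights are raw term frequencies; both are assumptions made for illustration, and `pairwise_similarity` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_similarity(index):
    """Sum, over all terms, the weight products of every document pair."""
    scores = defaultdict(int)
    for term, postings in index.items():
        # (a) Generate pairs: every two documents sharing this term.
        for (doc_a, w_a), (doc_b, w_b) in combinations(sorted(postings), 2):
            # (b) Group by the document pair and (c) sum the partial products.
            scores[(doc_a, doc_b)] += w_a * w_b
    return dict(scores)


# Posting lists of (doc_id, term frequency), as on the Indexing slide.
index = {
    "Clinton": [("A", 2), ("B", 1), ("C", 1)],
    "Obama": [("A", 1), ("C", 1)],
    "Romney": [("B", 1)],
    "Barack": [("C", 1)],
}
similarities = pairwise_similarity(index)
```

Terms that occur in a single document (Romney, Barack) generate no pairs at all, which is why only shared terms contribute to a pair's score.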

Terms: Zipfian Distribution

[Plot: doc freq (df) vs. term rank]

Each term t contributes O(df_t^2) partial results, so very few terms dominate the computations:
● most frequent term (“said”): 3%
● most frequent 10 terms: 15%
● most frequent 100 terms: 57%
● most frequent 1000 terms: 95%

~0.1% of total terms (99.9% df-cut)
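A small sketch of why a df-cut helps. It assumes a term with document frequency df generates df·(df−1)/2 document pairs (one concrete reading of the O(df²) cost) and uses made-up Zipf-like frequencies; `pairs_generated`, `apply_df_cut` and the toy numbers are all hypothetical.

```python
def pairs_generated(df):
    """Document pairs produced by one term that occurs in df documents."""
    return df * (df - 1) // 2


def apply_df_cut(doc_freqs, cut_percent):
    """Keep the cut_percent% least frequent terms and drop the rest."""
    ranked = sorted(doc_freqs, key=doc_freqs.get)  # ascending document frequency
    keep = round(len(ranked) * cut_percent / 100)
    return {term: doc_freqs[term] for term in ranked[:keep]}


# Toy Zipf-like document frequencies: df of the r-th ranked term is 1000 // r.
doc_freqs = {f"term{rank}": 1000 // rank for rank in range(1, 11)}

kept = apply_df_cut(doc_freqs, cut_percent=90)
total_before = sum(pairs_generated(df) for df in doc_freqs.values())
total_after = sum(pairs_generated(df) for df in kept.values())
share_of_top_term = (total_before - total_after) / total_before
```

In this toy vocabulary, dropping the single most frequent term removes well over half of the intermediate pairs, the same effect the slide reports at a much larger scale.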

Efficiency (disk space)

[Plot: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100), with curves for no df-cut and df-cut at 99.999%, 99.99%, 99.9% and 99%. Annotations: 8 trillion intermediate pairs; 0.5 trillion intermediate pairs.]

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~906k docs

Effectiveness: effect of df-cut on effectiveness

Medline04
● 909k abstracts
● Ad-hoc retrieval

[Plot: relative P5 (%, 50 to 100) vs. df-cut (%, 99.00 to 100.00)]

● Drop 0.1% of terms
● “Near-linear” growth
● Fit on disk
● Cost 2% in effectiveness

ACL’08

Cross-Lingual Pairwise Similarity

● Find similar document pairs in different languages
● Multilingual text mining, Machine Translation
● Application: automatic generation of potential “interwiki” language links

More difficult than monolingual!

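Vocabulary Space Matching

● MT approach: Doc A (German) → MT translate → Doc A (English) → doc vector A; Doc B (English) → doc vector B
● CLIR approach: Doc A (German) → doc vector A → CLIR project → doc vector A (English vocabulary); Doc B (English) → doc vector B

CLIR projection:

$$\mathrm{df}(e) = \sum_{f} p(f \mid e)\,\mathrm{df}(f)$$

$$\mathrm{tf}(e) = \sum_{f} p(f \mid e)\,\mathrm{tf}(f)$$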
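A sketch of the CLIR projection step, assuming it is the usual probabilistic mapping tf(e) = Σ_f p(f|e) · tf(f) from a foreign-language term vector into the English vocabulary. The translation table, its probabilities and the name `project_vector` are hypothetical and serve only as an illustration.

```python
def project_vector(foreign_tf, translation_probs):
    """Project foreign term frequencies into English: tf(e) = sum_f p(f|e) * tf(f)."""
    english_tf = {}
    for english_term, probs in translation_probs.items():
        value = sum(p * foreign_tf.get(f, 0.0) for f, p in probs.items())
        if value > 0:
            english_tf[english_term] = value
    return english_tf


# Hypothetical p(f|e), keyed by the English term e.
translation_probs = {
    "prize": {"preis": 0.75, "auszeichnung": 0.25},
    "book": {"buch": 1.0},
    "nobel": {"nobel": 1.0},
}
german_tf = {"preis": 2, "buch": 1}
english_tf = project_vector(german_tf, translation_probs)
```

The same function can project document frequencies by passing df values instead of tf values, after which an English-side term weighting can be applied to both document vectors.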

Locality-Sensitive Hashing (LSH)

● Cosine score is a good similarity measure but expensive!
● LSH is a method for effectively reducing the search space when looking for similar pairs
● Each vector is converted into a compact representation, called a signature
● A sliding window-based algorithm uses these signatures to search for similar articles in the collection
● Vectors close to each other are likely to have similar signatures
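One common way to build such signatures is random projection: each bit records on which side of a random hyperplane the vector falls. The sketch below assumes that scheme (one of several signature types) with Gaussian hyperplanes; the function names and the toy vectors are hypothetical.

```python
import random


def make_hyperplanes(vocabulary, num_bits, seed=0):
    """Draw one random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [
        {term: rng.gauss(0.0, 1.0) for term in vocabulary}
        for _ in range(num_bits)
    ]


def signature(vector, hyperplanes):
    """Bit i is 1 if the vector lies on the non-negative side of hyperplane i."""
    bits = []
    for plane in hyperplanes:
        dot = sum(weight * plane[term] for term, weight in vector.items())
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


vocabulary = ["nobel", "prize", "book", "war", "peace"]
planes = make_hyperplanes(vocabulary, num_bits=64)

doc = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
scaled = {term: 2 * weight for term, weight in doc.items()}
opposite = {term: -weight for term, weight in doc.items()}

sig_doc = signature(doc, planes)
sig_scaled = signature(scaled, planes)
sig_opposite = signature(opposite, planes)
```

Signatures depend only on direction, so `doc` and `scaled` get identical bits while `opposite` differs in every bit; for vectors in between, the expected Hamming distance grows with the angle between them.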

Solution Overview

● Input: Nf German articles (CLIR projection) and Ne English articles
● Preprocess → Ne+Nf English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …>
● Signature generation (Random Projection/Minhash/Simhash) → Ne+Nf signatures, e.g. 0111000010111100001010
● Sliding window algorithm → similar article pairs

MapReduce 1: Table Generation Phase

[Figure: the signatures are permuted with permutations p1 … pQ to give lists S1 … SQ; each list is sorted, giving the tables S1’ … SQ’.]

MapReduce 2: Detection Phase

[Figure: a sorted table of bit signatures, split into table chunks.]
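The two phases can be sketched in a single process as follows. The sketch assumes that each table applies one random permutation of bit positions to every signature and sorts the results, and that detection compares each entry with the next few entries of its table using Hamming distance. The parameter names (`num_tables`, `window`, `max_distance`) and the toy signatures are hypothetical.

```python
import random


def hamming(sig_a, sig_b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def sliding_window_pairs(signatures, num_tables, window, max_distance, seed=0):
    """Return the set of document-ID pairs whose signatures are close."""
    rng = random.Random(seed)
    num_bits = len(next(iter(signatures.values())))
    found = set()
    for _ in range(num_tables):
        # Table generation: permute the bit positions, then sort the signatures.
        order = list(range(num_bits))
        rng.shuffle(order)
        table = sorted(
            ("".join(sig[i] for i in order), doc_id)
            for doc_id, sig in signatures.items()
        )
        # Detection: compare each entry with the next `window` entries only.
        for i, (sig_i, doc_i) in enumerate(table):
            for sig_j, doc_j in table[i + 1 : i + 1 + window]:
                if hamming(sig_i, sig_j) <= max_distance:
                    found.add(tuple(sorted((doc_i, doc_j))))
    return found


signatures = {
    "de1": "0111000010",
    "en1": "0111000011",
    "en2": "1000111101",
}
pairs = sliding_window_pairs(signatures, num_tables=3, window=2, max_distance=2)
```

Sorting places signatures that agree on their leading (permuted) bits next to each other, so a small window finds most close pairs; using several permutations gives pairs that differ in an early bit of one table another chance to become neighbours.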

Evaluation

Ground truth:
● Sample 1064 German articles
● cosine score >= 0.3

Compare sliding window with brute force approach
● required for exact solution
● good reference as an upper-bound for recall and running time

Evaluation

● 95% recall at 39% cost
● 99% recall at 62% cost

No Free Lunch!

Contribution to Wikipedia

Identify links between German and English Wikipedia articles
● “Metadaten” → “Metadata”, “Semantic Web”, “File Format”
● “Pierre Curie” → “Marie Curie”, “Pierre Curie”, “Helene Langevin-Joliot”
● “Kirgisistan” → “Kyrgyzstan”, “Tulip Revolution”, “2010 Kyrgyzstani uprising”, “2010 South Kyrgyzstan riots”, “Uzbekistan”

Bad results when there is a significant difference in length.

SIGIR’11

Roadmap: Indexing & Retrieval (CIKM 2011)

Approximate Positional Indexes

● “Learning to Rank” models learn effective ranking functions from proximity features, which need term positions
● Term positions: effective (√), but large index (X) and slow query evaluation (X)
● Approximate term positions: smaller index (√), faster query evaluation (√)

Close Enough is Good Enough?

Variable-Width Buckets

5 buckets / document

[Figure: documents d1 and d2 have different lengths; each is divided into buckets 1 to 5.]

Fixed-Width Buckets

Buckets of length W

[Figure: d1 is divided into buckets 1 to 5, the shorter d2 into buckets 1 to 3.]
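The two bucketing schemes amount to two ways of mapping an exact term position to a bucket ID. The sketch below assumes 0-based positions and 1-based bucket IDs; the function names and the document lengths in the example are hypothetical.

```python
def fixed_width_bucket(position, width):
    """Bucket ID for a 0-based position when every bucket spans `width` positions."""
    return position // width + 1


def variable_width_bucket(position, doc_length, num_buckets=5):
    """Bucket ID when each document is split into `num_buckets` equal parts."""
    return position * num_buckets // doc_length + 1


# Hypothetical document lengths: d1 has 100 terms, d2 has 60 terms.
d1_length, d2_length = 100, 60

# Fixed width W = 20: the number of buckets depends on the document length.
d1_last_fixed = fixed_width_bucket(d1_length - 1, width=20)
d2_last_fixed = fixed_width_bucket(d2_length - 1, width=20)

# Variable width: always 5 buckets, so bucket length depends on the document.
d1_last_variable = variable_width_bucket(d1_length - 1, d1_length)
d2_last_variable = variable_width_bucket(d2_length - 1, d2_length)
```

With these lengths the last position of d1 falls in bucket 5 and that of d2 in bucket 3 under fixed-width buckets, while both fall in bucket 5 under variable-width buckets, as in the two figures.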

Effectiveness

CIKM’11

Roadmap: Pseudo Test Collections (SIGIR ’11)
● Training L2R
● Evaluation

Test Collections

● Documents, queries, and relevance judgments
● Important driving force behind IR innovation
● Without test collections, it’s impossible to:
  ● Evaluate search systems
  ● Tune ranking functions / train models

Traditional
● Exhaustive
● Pooling

Recent Methodologies
● Behavioral logging (query logs, click logs, etc.)
● Minimal test collections
● Crowdsourcing

Web Graph

[Figure: pages P1 to P7 linked to one another; several links carry the anchor text “web search”, others “Google”, “Bing” and “SIGIR 2012”.]

Queries and Judgments?
● anchor text lines ≈ pseudo queries
● target pages ≈ relevant candidates
● noise reduction?
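A sketch of the anchor-text idea: group links by anchor text, treat each anchor text as a pseudo query, and treat the pages it points to as relevant candidates. The link triples below are invented, and the noise filter (keep anchors used by at least `min_sources` distinct pages) is a hypothetical placeholder, not the method from the paper.

```python
from collections import defaultdict


def build_pseudo_collection(links, min_sources=2):
    """Map each pseudo query (anchor text) to its candidate relevant pages."""
    sources = defaultdict(set)
    targets = defaultdict(set)
    for source, anchor, target in links:
        query = anchor.strip().lower()  # normalize the anchor text line
        sources[query].add(source)
        targets[query].add(target)
    # Hypothetical noise reduction: drop anchors used by too few pages.
    return {
        query: sorted(targets[query])
        for query in targets
        if len(sources[query]) >= min_sources
    }


# Invented (source page, anchor text, target page) triples for illustration.
links = [
    ("P1", "web search", "P5"),
    ("P2", "web search", "P5"),
    ("P3", "Web Search", "P7"),
    ("P4", "Google", "P5"),
    ("P6", "SIGIR 2012", "P7"),
]
pseudo_collection = build_pseudo_collection(links)
```

Only “web search” survives the filter here, with P5 and P7 as its candidate relevant pages; the resulting query and judgment pairs could then be used as training data.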

SIGIR’11

Roadmap: Iterative Process, iHadoop (CloudCom 2011)

Iterative MapReduce Applications

● Many machine learning and data mining applications
  ● PageRank, k-means, HITS, …
● Every iteration has to wait until the previous iteration has written its output completely to the DFS (unnecessary waiting time)
● Every iteration starts by reading from the DFS what has just been written by the earlier iteration (wastes CPU time, I/O, bandwidth)
● MapReduce is not designed to run iterative applications efficiently
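The write-then-read pattern can be illustrated with PageRank run as a chain of jobs. In this sketch a plain dictionary named `dfs` stands in for the distributed file system, the three-page graph is invented, and every page is assumed to have at least one out-link; none of this is iHadoop code.

```python
from collections import defaultdict


def pagerank_iteration(graph, ranks, damping=0.85):
    """One MapReduce-style PageRank step over a graph without dangling pages."""
    # Map: every page sends an equal share of its rank to each out-link.
    contributions = defaultdict(float)
    for page, out_links in graph.items():
        for target in out_links:
            contributions[target] += ranks[page] / len(out_links)
    # Reduce: combine the shares received by each page.
    n = len(graph)
    return {
        page: (1 - damping) / n + damping * contributions[page]
        for page in graph
    }


def run_iterations(graph, num_iterations):
    """Chain the jobs: each iteration reads what the previous one wrote."""
    dfs = {"iter-0": {page: 1.0 / len(graph) for page in graph}}
    for i in range(1, num_iterations + 1):
        ranks = dfs[f"iter-{i - 1}"]  # read the previous output back in full
        dfs[f"iter-{i}"] = pagerank_iteration(graph, ranks)  # write before the next job starts
    return dfs[f"iter-{num_iterations}"], dfs


graph = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]}
final_ranks, dfs = run_iterations(graph, num_iterations=10)
```

Each loop step is a barrier: iteration i cannot start until iteration i − 1 has written its complete output, which is the waiting and re-reading overhead described on the slide.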

Goal

Asynchronous Pipeline

CloudCom’11

Conclusion

MapReduce allows large-scale processing over web data

Ivory
● E2E open-source IR engine for research
● Completely on Hadoop
  • even retrieval: from HDFS

Efficiency-effectiveness tradeoff
● Cross-Lingual Pairwise Similarity
  • Efficient implementation using MapReduce
  • Efficiency-effectiveness tradeoff
● Approx Positional Indexes
  • Efficient and as effective as exact positions
● Pseudo Test Collections
  • Possible!
  • Effective for training L2R models

MapReduce is not good for iterative algorithms

http://ivory.cc

Collaborators
● Jimmy Lin
● Don Metzler
● Doug Oard
● Ferhan Ture
● Nima Asadi
● Lidan Wang
● Eslam Elnikety
● Hany Ramadan

Thank You!

Questions?