A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies Preslav Nakov and Marti Hearst Computer Science Division and SIMS University of California, Berkeley Supported by NSF DBI-0317510 and a gift from Genentech

Post on 18-Dec-2015

A Study of Using Search Engine Page Hits as a Proxy for n-gram

Frequencies

Preslav Nakov and Marti Hearst

Computer Science Division and SIMS

University of California, Berkeley

Supported by NSF DBI-0317510 and a gift from Genentech

Overview

Web as a corpus
  n-gram frequencies
  Concern: instability of n-gram estimates

Study the impact of the variability of the n-gram estimates for a particular task:
  Across time
  For different search engines
  (Not) using a language filter
  (Not) using inflections

Introduction

(Banko & Brill, 2001) “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, ACL 2001
  Simple task: choose among a set of commonly confused words for a given context, e.g. {principle, principal}
  Data comes for free: assumes correct usage in the raw training text
  Log-linear improvement even up to a billion words

=> Getting more data is better than fine-tuning algorithms.

Today the obvious source of very large data is the Web.

Web as a Corpus

Machine Translation (Grefenstette 98; Resnik 99; Cao & Li 02; Way & Gough 03)
Question Answering (Dumais et al. 02; Soricut & Brill 04)
Word Sense Disambiguation (Mihalcea & Moldovan 99; Rigau et al. 02; Santamaría et al. 03; Zahariev 04)
Extraction of Semantic Relations (Chklovski & Pantel 04; Idan Szpektor & Coppola 04; Shinzato & Torisawa 04)
Anaphora Resolution (Modjeska et al. 03)
Prepositional Phrase Attachment (Volk 01; Calvo & Gelbukh 03; Nakov & Hearst 05)
Language Modeling (Zhu & Rosenfeld 01; Keller & Lapata 03)

Page Hits as a Proxy for n-gram Frequencies

Plausibility: (Keller & Lapata 03) demonstrate a high correlation between:
  page hits and corpus bigram frequencies
  page hits and human plausibility judgments

Web as a baseline (Lapata & Keller 05): machine translation candidate selection, spelling correction, adjective ordering, article generation, noun compound bracketing, noun compound interpretation, countability detection and prepositional phrase attachment

More than a baseline
  State-of-the-art results for noun compound bracketing (Nakov & Hearst 05)

Web Count Problems (1)

Page hits are not really n-gram frequencies
  This may be OK (Keller & Lapata, 2003)

The Web lacks linguistic annotation
  Cannot handle:
    stem cells VERB PREPOSITION brain
    protein synthesis’ inhibition
  Pr(health|care) = #(“health care”) / #(care)
    health: noun
    care: both verb and noun
    the two words can be adjacent by chance
    the two words can come from different sentences

Web Count Problems (2)

Instability of the n-gram counts
  Dynamics over time
  Query inconsistencies
    Indexes spread across multiple machines
    Multiple (inconsistent) index copies
  Search engine “dancing”, tool at: http://www.seochat.com/googledance

Problem: Web experiments are not reproducible.

Web Count Problems (3)

Rounding of page hits
  Exact estimates
    MSN: always
    Google and Yahoo: for small numbers only
  Possible reasons for rounding
    Not necessary for typical users
    Expensive to compute:
      Distributed index
      Constant changes

Under high loads, search engines probably sample from their indexes.

The Task

Problem: What is the impact of n-gram variability (inconsistencies, rounding, etc.)?

Approach
  NOT absolute n-gram variability
  BUT experiments w.r.t. a real task
    noun compound bracketing
    allows for the use of n-grams of different lengths

Our Particular Task:

Noun Compound Bracketing

Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)

(b) [ liver [cell line] ] (right bracketing)

In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.

Related Work

Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
  adjacency model: Pr(w1|w2) vs. Pr(w2|w3)

Lauer (1995)
  dependency model: Pr(w1|w2) vs. Pr(w1|w3)

Keller & Lapata (2004)
  use the Web
  unigrams and bigrams

Girju et al. (2005)
  supervised model
  bracketing in context
  requires WordNet senses to be given

Note: Pr(w1|w2) is the probability that w1 precedes w2.

Adjacency & Dependency (1)

right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1)
    e.g. home health care
  w1 and w2 independently modify w3
    e.g. adult male rat

left bracketing: [[w1 w2] w3]
  only 1 modificational choice possible
    e.g. law enforcement officer

Adjacency & Dependency (2)

right bracketing: [w1 [w2 w3]]
  w2 w3 is a compound (modified by w1)
  w1 and w2 independently modify w3

adjacency model
  Is w2 w3 a compound? (vs. w1 w2 being a compound)

dependency model
  Does w1 modify w3? (vs. w1 modifying w2)
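The two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Pr(wi|wj) is estimated as #("wi wj") / #(wj), the function names `prob_precedes` and `bracket` are hypothetical, and the counts in `TOY_HITS` are invented for the example rather than real page hits.

```python
from typing import Mapping


def prob_precedes(hits: Mapping[str, int], wi: str, wj: str) -> float:
    """Estimate Pr(wi|wj) as #("wi wj") / #(wj); 0.0 if wj has no hits."""
    denominator = hits.get(wj, 0)
    if denominator == 0:
        return 0.0
    return hits.get(f"{wi} {wj}", 0) / denominator


def bracket(hits: Mapping[str, int], w1: str, w2: str, w3: str,
            model: str = "dependency") -> str:
    """Return 'left', 'right' or 'undecided' for the compound w1 w2 w3."""
    # Both models use the same evidence for left bracketing: w1 before w2.
    left_score = prob_precedes(hits, w1, w2)
    if model == "adjacency":
        # Adjacency: is w2 w3 a compound (vs. w1 w2 being a compound)?
        right_score = prob_precedes(hits, w2, w3)
    elif model == "dependency":
        # Dependency: does w1 modify w3 (vs. w1 modifying w2)?
        right_score = prob_precedes(hits, w1, w3)
    else:
        raise ValueError(f"unknown model: {model}")
    if left_score == right_score:
        return "undecided"
    return "left" if left_score > right_score else "right"


# Invented counts, for illustration only (not real search engine output).
TOY_HITS = {
    "law": 1000, "enforcement": 200, "officer": 500,
    "law enforcement": 150, "enforcement officer": 50, "law officer": 10,
}

print(bracket(TOY_HITS, "law", "enforcement", "officer", model="adjacency"))
print(bracket(TOY_HITS, "law", "enforcement", "officer", model="dependency"))
```

With these toy counts both models choose left bracketing, matching the "law enforcement officer" example. Ties are reported as "undecided" so that a caller can track how many compounds a model fails to decide.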

Probabilities: Why?

Why should we use: (a) Pr(w1w2|w2), rather than (b) Pr(w2w1|w1)?

Keller & Lapata (2004) calculate:
  AltaVista queries: (a) 70.49%, (b) 68.85%
  British National Corpus: (a) 63.11%, (b) 65.57%

Association Models: χ² (Chi Squared)

A = #(wi,wj)
B = #(wi) – #(wi,wj)
C = #(wj) – #(wi,wj)
D = N – (A+B+C)

N = 8 trillion (= A+B+C+D)
  8 billion Web pages × 1,000 words
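The formula itself did not survive in this transcript. Assuming the slide used the standard χ² statistic for a 2×2 contingency table, χ² = N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)), the computation from page hits could be sketched as follows. The function name `chi_squared` and its parameter names are hypothetical; A, B, C, D and N follow the definitions above.

```python
N_WORDS = 8 * 10**12  # 8 billion Web pages x 1,000 words, as on the slide


def chi_squared(n_ij: int, n_i: int, n_j: int, n_total: int = N_WORDS) -> float:
    """Standard 2x2 chi-squared association score from raw counts.

    n_ij = #(wi,wj), n_i = #(wi), n_j = #(wj), n_total = N.
    """
    a = n_ij                    # A: wi and wj together
    b = n_i - n_ij              # B: wi without wj
    c = n_j - n_ij              # C: wj without wi
    d = n_total - (a + b + c)   # D: neither wi nor wj
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no association can be measured
    return n_total * (a * d - b * c) ** 2 / denominator


# Tiny sanity checks on a 4-item universe (not Web data):
print(chi_squared(1, 2, 2, n_total=4))  # independent words
print(chi_squared(2, 2, 2, n_total=4))  # words that always co-occur
```

Python integers do not overflow, so the large products that arise with N = 8 trillion are computed exactly before the final division.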

Web-derived Surface Features: Possessive Marker

Attached to the first word
  brain’s stem cell => right
Attached to the second word
  brain stem’s cell => left

We can query directly for possessives
  Search engines drop the possessive marker, but the s is kept.
  Still, we cannot query for “brain stems’ cell”
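As a rough sketch, the two possessive queries for a three-word compound could be generated like this. The helper name `possessive_queries` is hypothetical, and the exact query syntax a given search engine accepts is an assumption.

```python
from typing import Dict


def possessive_queries(w1: str, w2: str, w3: str) -> Dict[str, str]:
    """Map each bracketing to the exact-phrase query that supports it."""
    return {
        # Possessive on the first word: w1 modifies the unit "w2 w3".
        "right": f"\"{w1}'s {w2} {w3}\"",
        # Possessive on the second word: "w1 w2" is a unit modifying w3.
        "left": f"\"{w1} {w2}'s {w3}\"",
    }


for direction, query in possessive_queries("brain", "stem", "cell").items():
    print(direction, query)
```

The page hits for the two queries would then be compared, with the larger count voting for its bracketing.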

Other Web-derived Features: Abbreviation

After the second word
  tumor necrosis (TN) factor => left
After the third word
  tumor necrosis factor (NF) => right

We query for e.g., “tumor necrosis tn factor”

Problems:
  Roman digits: IV, vii
  States: CA
  Short words: me
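A minimal sketch of the abbreviation queries follows. The slides do not say how the problem cases were handled, so the filter here is an assumption: it skips any abbreviation that looks like a Roman numeral or appears in a small hypothetical blocklist (`AMBIGUOUS`). The function names are also hypothetical.

```python
import re
from typing import Dict

# Deliberately over-broad: any string made only of Roman-numeral letters.
ROMAN_LIKE = re.compile(r"^[ivxlcdm]+$")
# Hypothetical blocklist seeded with the slide's examples (a state, a short word).
AMBIGUOUS = {"ca", "me"}


def is_ambiguous(abbreviation: str) -> bool:
    """True if the abbreviation is likely to match unrelated text."""
    abbr = abbreviation.lower()
    return bool(ROMAN_LIKE.match(abbr)) or abbr in AMBIGUOUS


def abbreviation_queries(w1: str, w2: str, w3: str) -> Dict[str, str]:
    """Build exact-phrase queries; parentheses are omitted from the query."""
    queries: Dict[str, str] = {}
    left_abbr = (w1[0] + w2[0]).lower()   # abbreviates "w1 w2"
    right_abbr = (w2[0] + w3[0]).lower()  # abbreviates "w2 w3"
    if not is_ambiguous(left_abbr):
        queries["left"] = f'"{w1} {w2} {left_abbr} {w3}"'
    if not is_ambiguous(right_abbr):
        queries["right"] = f'"{w1} {w2} {w3} {right_abbr}"'
    return queries


print(abbreviation_queries("tumor", "necrosis", "factor"))
```

Dropping an ambiguous query entirely, rather than trusting its count, trades some coverage for fewer spurious votes; whether that trade is worthwhile is a design choice not addressed in the slides.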

Other Web-derived Features: Concatenation

Consider health care reform
  healthcare: 79,500,000
  carereform: 269
  healthreform: 812

Adjacency model
  healthcare vs. carereform
Dependency model
  healthcare vs. healthreform
Triples (adjacency)
  “healthcare reform” vs. “health carereform”
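Using the three counts quoted on the slide, the concatenation comparison can be worked through as below. The function name `concatenation_vote` is hypothetical, and the simple "larger count wins" rule is an assumption about how the comparison is turned into a decision.

```python
from typing import Mapping

# Page hits quoted on the slide for "health care reform".
SLIDE_HITS = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}


def concatenation_vote(hits: Mapping[str, int], w1: str, w2: str, w3: str,
                       model: str) -> str:
    """Compare concatenated word pairs and return the bracketing they favor."""
    left = hits.get(w1 + w2, 0)       # e.g. healthcare
    if model == "adjacency":
        right = hits.get(w2 + w3, 0)  # e.g. carereform
    elif model == "dependency":
        right = hits.get(w1 + w3, 0)  # e.g. healthreform
    else:
        raise ValueError(f"unknown model: {model}")
    if left == right:
        return "undecided"
    return "left" if left > right else "right"


print(concatenation_vote(SLIDE_HITS, "health", "care", "reform", "adjacency"))
print(concatenation_vote(SLIDE_HITS, "health", "care", "reform", "dependency"))
```

Both comparisons favor left bracketing, [[health care] reform], because "healthcare" outnumbers the other two concatenations by several orders of magnitude.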

Other Web-derived Features: Reorder

Reorders for “health care reform”
  “care reform health” => right
  “reform health care” => left
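The reorder feature amounts to moving one word around the pair that is assumed to form a unit. A small sketch, with the hypothetical helper name `reorder_queries`:

```python
from typing import Dict


def reorder_queries(w1: str, w2: str, w3: str) -> Dict[str, str]:
    """Exact-phrase queries for the two reorderings of w1 w2 w3."""
    return {
        # "w2 w3" stays together and w1 moves to the end: supports right.
        "right": f'"{w2} {w3} {w1}"',
        # "w1 w2" stays together and w3 moves to the front: supports left.
        "left": f'"{w3} {w1} {w2}"',
    }


print(reorder_queries("health", "care", "reform"))
```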

Other Web-derived Features: Internal Inflection Variability

First word
  bone mineral density
  bones mineral density => right
Second word
  bone mineral density
  bone minerals density => left
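The inflection queries can be sketched in the same way. The pluralizer below is a crude stand-in that only appends "s"; it is an assumption made to keep the example self-contained and is not the morphological tooling used in the experiments. The names `naive_plural` and `inflection_queries` are hypothetical.

```python
from typing import Dict


def naive_plural(word: str) -> str:
    """Crude placeholder for real morphological generation: append 's'."""
    return word + "s"


def inflection_queries(w1: str, w2: str, w3: str) -> Dict[str, str]:
    """Exact-phrase queries that inflect one internal word at a time."""
    return {
        # Variation on the first word supports right bracketing.
        "right": f'"{naive_plural(w1)} {w2} {w3}"',
        # Variation on the second word supports left bracketing.
        "left": f'"{w1} {naive_plural(w2)} {w3}"',
    }


print(inflection_queries("bone", "mineral", "density"))
```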

Experiments

Experiments

Lauer (95) dataset
  244 noun compounds (NCs) from Grolier’s encyclopedia
  inter-annotator agreement: 81.5%

Exact phrase queries (min freq. 5)

Inflections: Carroll’s morphological tools

Experiments

4 dimensions: time, search engine, language filter, inflected forms

Comparison over time (P): Google

Precision (in %) for any language, no inflections. Average recall is shown in parentheses.

Varying time intervals, in case index changes happen periodically

Comparison over time (P): MSN

Precision (in %) for any language, no inflections. Average recall is shown in parentheses.

Statistically significant

Experiments

4 dimensions: time, search engine, language filter, inflected forms

Comparison by search engine (P): for 6/6/2005

Precision (in %) for any language, no inflections, for 6/6/2005. Average recall is shown in parentheses.

Statistically significant

Comparison by search engine (R): for 6/6/2005

Recall (in %) for any language, no inflections, for 6/6/2005.

Not much variability in recall (but Google has the biggest index)

Experiments

4 dimensions: time, search engine, language filter, inflected forms

Comparison by language (P): any language vs. English

Precision (in %), no inflections, for 6/6/2005.

Minor, inconsistent impact on precision

Comparison by language (R): any language vs. English

Recall (in %), no inflections, for 6/6/2005.

Minor but consistent drop in recall

Experiments

4 dimensions: search engine, time, language filter, inflected forms

Comparison by search engine (P): inflections

Precision (in %), any language, for 6/6/2005.

Minor, inconsistent impact on precision

Comparison by search engine (R): inflections

Recall (in %), any language, for 6/6/2005.

Minor but consistent improvement in recall

Conclusions and Future Work

Good news: n-gram variability does not have a statistically significant impact on the performance (for our task).

Future work
  other NLP tasks
  other languages

The End

Thank you!