Using the web as a source of linguistic data: experiences, problems and perspectives

Marco Baroni
SSLMIT, University of Bologna

ICST/CNR Roma, April 2005
Outline

- Introduction
- Frequency estimates from search engines
  - Web-based Mutual Information
- The “linguists’ friendly” interfaces
- Building your own web corpus
  - Small corpora via search engine queries
  - Thinking Big: The “real” Linguist’s Search Engine
- Enter WaCky!
The Web as Corpus

- Computational/corpus linguists, lexicographers, ontologists and language technologists are constantly hungry for data.
- The web is a huge database of documents, mostly text.
- Kilgarriff: the web is the most exciting thing that has happened to human beings in the last 20 years or so, and it’s all about linguistic communication – we linguists are in a good position to lead the study of it!
The Web as Corpus (cont.)

- Kilgarriff and Grefenstette, Introduction to the Special Issue on the Web as Corpus, Computational Linguistics 2003.
- Estimated web word counts per language:

  English     76,598,718,000
  German       7,035,850,000
  Italian      1,845,026,000
  Finnish        326,379,000
  Esperanto       57,154,000
  Latin           55,943,000
  Basque          55,340,000
  Albanian        10,332,000

- (Obsolete, conservative estimates.)
Some General Problems

- The web is not a balanced corpus.
- More worryingly: if you use a search engine, you have no control over the data.
- It is constantly changing.
- Many languages, and a lot of non-native English.
- Python.
- Google frequency of “colorless green ideas sleep furiously”: 13,000.
- Desperately seeking Blaberus giganteus.
- ...
But still... more data is better data! (Mercer, quoted by Church)

- Banko and Brill 2001 HLT paper.
- Confusion set disambiguation task.
- Choose the correct word in context from a set of words it is typically confused with: affect/effect, principal/principle.
- Even the most naive learning algorithm trained on a 10M word training set outperforms any learner trained on a 1M word training set.
- With a 1 billion word training set, learners have not yet reached their performance asymptote.
- (Learn a language function by a simple algorithm that has access to the full extension of the function.)
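The confusion-set setup above can be sketched with a deliberately naive learner: count the context words seen around each confusable word in training text, then pick the candidate whose training contexts best overlap a test context. The toy corpus and window size below are invented for illustration; Banko and Brill's point is that even a learner this simple keeps improving as the training corpus grows.

```python
from collections import Counter

def train(corpus_tokens, confusion_set, window=2):
    """Count the context words seen around each member of the confusion set."""
    profiles = {w: Counter() for w in confusion_set}
    for i, tok in enumerate(corpus_tokens):
        if tok in profiles:
            lo, hi = max(0, i - window), i + window + 1
            context = corpus_tokens[lo:i] + corpus_tokens[i + 1:hi]
            profiles[tok].update(context)
    return profiles

def disambiguate(context_tokens, profiles):
    """Pick the candidate whose training contexts best overlap the test context."""
    def score(w):
        return sum(profiles[w][c] for c in context_tokens)
    return max(profiles, key=score)

corpus = ("the new tax was levied on imports and the law had a strong "
          "effect on trade while the main principle of the law was fairness").split()
profiles = train(corpus, {"effect", "affect"})
print(disambiguate(["on", "trade"], profiles))  # prints "effect"
```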
More web-data is better data!

- Keller and Lapata 2003, Computational Linguistics.
- Google- and AltaVista-based frequencies of A N, N N and V N bigrams:
  - correlate with BNC and NANTC frequencies;
  - correlate with WordNet-class-based smoothed frequencies;
  - correlate with human plausibility judgments better than corpus-based frequencies do (smoothed or not smoothed).
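The kind of correlation Keller and Lapata report can be sketched with a plain Pearson coefficient over frequency values; the four bigram log-frequencies below are invented for illustration, not their data.

```python
def pearson(xs, ys):
    """Pearson correlation between two sequences of (log) frequencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented log2 frequencies for four hypothetical A N bigrams:
# web hit counts vs. counts from a traditional corpus.
web = [13.1, 10.2, 7.5, 12.0]
bnc = [8.9, 6.1, 3.8, 7.7]
print(round(pearson(web, bnc), 3))
```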
Approaches to Web as Corpus

- Collect (frequency) data directly from commercial search engines (e.g. Turney 2001, many others).
- “Linguists’ friendly” interfaces to commercial search engines: WebCorp, KwicFinder, LSE (Kehoe and Renouf 2002, Fletcher 2002, Resnik and Elkiss 2003).
- Small(-ish), focused crawls of the web to find and retrieve relevant pages (e.g. Ghani et al. 2001, Baroni and Bernardini 2004, Sharoff submitted).
- WaCky!
Web-based Mutual Information
Collecting frequency data from search engines

- Probably the most popular method (Keller and Lapata 2003, Turney 2001, many others).
- A rough approximation to frequency, but:
  - empirically successful;
  - easy: the engine does most of the hard work.
- Web-based mutual information: a typical example of research using search engine-based frequency data.
Web-based Mutual Information (WMI) (Turney 2001)

- (Pointwise) mutual information:

  MI(w1, w2) = log2 [ P(w1, w2) / (P(w1) P(w2)) ] = log2 [ N C(w1, w2) / (C(w1) C(w2)) ]

- WMI: compute the mutual information of word pairs using frequency/cooccurrence frequency data extracted from the web via the AltaVista search engine:

  WMI(w1, w2) = log2 [ N hits(w1 NEAR w2) / (hits(w1) hits(w2)) ]
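A minimal sketch of the WMI computation, with the hit counts and the page total N stubbed in as invented placeholders for what the search engine would return:

```python
import math

# Hypothetical hit counts standing in for AltaVista query results;
# N is an assumed total number of indexed pages (all numbers invented).
N = 350_000_000
hits = {"sail": 1_200_000, "boat": 4_500_000, "sail NEAR boat": 310_000}

def wmi(w1, w2):
    """Web-based pointwise mutual information from hit counts (as in Turney 2001)."""
    joint = hits[f"{w1} NEAR {w2}"]
    return math.log2(N * joint / (hits[w1] * hits[w2]))

print(round(wmi("sail", "boat"), 2))  # prints 4.33
```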
Web-based Mutual Information

- Semantic similarity as direct cooccurrence (vs. occurrence in similar contexts).
- The simplicity of the method is counterbalanced by the size of the database (the WWW).
- Very effective: Turney 2001, Lin et al. 2003, Turney and Littman 2003.
- Most researchers report that WMI outperforms more sophisticated methods based on smaller corpora.
- My own experience with WMI: Baroni and Bisi 2004, Baroni and Vegnaduzzo 2004.
WMI takes the TOEFL (Turney 2001)

- TOEFL synonym match task.
- Target: levied; candidates: imposed, believed, requested, correlated.
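The TOEFL item can be answered by picking the candidate with the highest WMI against the target. All hit counts below are invented placeholders, not Turney's actual query results:

```python
import math

N = 350_000_000  # assumed total number of indexed pages (invented)
# Hypothetical hit counts for the "levied" item (all numbers invented).
hits = {
    "levied": 900_000, "imposed": 5_000_000, "believed": 30_000_000,
    "requested": 25_000_000, "correlated": 4_000_000,
    "levied NEAR imposed": 120_000, "levied NEAR believed": 40_000,
    "levied NEAR requested": 35_000, "levied NEAR correlated": 2_000,
}

def wmi(w1, w2):
    return math.log2(N * hits[f"{w1} NEAR {w2}"] / (hits[w1] * hits[w2]))

def best_synonym(target, candidates):
    """Return the candidate with the highest web-based mutual information."""
    return max(candidates, key=lambda c: wmi(target, c))

print(best_synonym("levied", ["imposed", "believed", "requested", "correlated"]))
# prints "imposed"
```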
WMI takes the TOEFL (cont.)

- Performance on the TOEFL synonym match task:
  - Average foreign test taker: 64.5%
  - Latent Semantic Analysis: 65.4%
  - WMI: 72.5%
WMI and synonym detection in terminology

- Baroni and Bisi 2004 applied the WMI method to a synonym mining task in a technical domain.
- A harder task:
  - technical terms are less frequent than general language terms (potential data sparseness issues);
  - all terms in a domain tend to be semantically related, to some extent.
Materials

- Nautical terminology.
- Terms and relational information from the structured termbase of Bisi (2003).
Task

- Given a list of pairs in any order, rank them so that synonym pairs end up at the top of the list.
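The ranking task amounts to sorting the pairs by any association score, such as WMI. A minimal sketch, with invented scores standing in for real WMI values:

```python
def rank_pairs(pairs, score):
    """Rank candidate pairs so that high-association (synonym) pairs come first."""
    return sorted(pairs, key=lambda p: score(*p), reverse=True)

# Hypothetical association scores (e.g. WMI values) for nautical term pairs;
# all numbers invented for illustration.
scores = {("frames", "ribs"): 5.1, ("bottom", "hull"): 4.2,
          ("decks", "cockpit"): 1.3, ("frames", "hull"): 0.7}
pairs = [("decks", "cockpit"), ("frames", "ribs"),
         ("bottom", "hull"), ("frames", "hull")]
print(rank_pairs(pairs, lambda a, b: scores[(a, b)]))
```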
Task: example

- decks/cockpit
- frames/ribs
- bottom/hull
- ...
- frames/hull
Task: example (cont.)

- frames/ribs
- bottom/hull
- decks/cockpit
- ...
- frames/hull
Task: settings

- Synonym term pairs vs. random term pairs (Exp 1).
- Synonym term pairs vs. other “nymic” pairs (Exp 2).
Cosine Similarity

- Term of comparison.
- Intuition: words with similar patterns of cooccurrence are likely to be similar.
- Correlation of vectors of cooccurrence frequencies of the targets with (almost) all words in the corpus; for length-normalized vectors, the cosine is just the dot product:

  cos(x, y) = x · y = Σ_{i=1}^{n} x_i y_i
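A minimal sketch of the cosine measure over cooccurrence vectors; the toy counts and the choice of context words are invented for illustration:

```python
import math

def cosine(x, y):
    """Cosine of two cooccurrence vectors; reduces to the plain dot
    product when both vectors are length-normalized, as in the slide."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Toy cooccurrence counts for two target terms over the context words
# ("keel", "water", "sail") -- numbers invented for illustration.
frames = [10, 2, 4]
ribs = [8, 1, 5]
print(round(cosine(frames, ribs), 3))  # prints 0.981
```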
Cosine Similarity (cont.)

- Corpora:
  - a 1.2M word specialized corpus manually assembled by a terminologist;
  - a 4.27M word corpus constructed via random nautical term queries to Google.
- Context windows:
  - 2 words to either side of the target;
  - 5 words to either side of the target.
IntroductionFrequency estimates from search engines
The “linguists’ friendly” interfacesBuilding your own web corpus
Enter WaCky!
Web-based Mutual Information
Cosine Similarity (cont.)
I Corpora:I 1.2M word specialized corpus manually assembled by
terminologist;
I 4.27M word corpus constructed via random nauticalterm queries to Google.
I Context windows:
I 2 words to either side of target;I 5 words to either side of target.
IntroductionFrequency estimates from search engines
The “linguists’ friendly” interfacesBuilding your own web corpus
Enter WaCky!
Web-based Mutual Information
Cosine Similarity (cont.)
I Corpora:I 1.2M word specialized corpus manually assembled by
terminologist;I 4.27M word corpus constructed via random nautical
term queries to Google.
I Context windows:
I 2 words to either side of target;I 5 words to either side of target.
IntroductionFrequency estimates from search engines
The “linguists’ friendly” interfacesBuilding your own web corpus
Enter WaCky!
Web-based Mutual Information
Cosine Similarity (cont.)
I Corpora:I 1.2M word specialized corpus manually assembled by
terminologist;I 4.27M word corpus constructed via random nautical
term queries to Google.
I Context windows:
I 2 words to either side of target;I 5 words to either side of target.
IntroductionFrequency estimates from search engines
The “linguists’ friendly” interfacesBuilding your own web corpus
Enter WaCky!
Web-based Mutual Information
Cosine Similarity (cont.)
I Corpora:I 1.2M word specialized corpus manually assembled by
terminologist;I 4.27M word corpus constructed via random nautical
term queries to Google.
I Context windows:I 2 words to either side of target;
I 5 words to either side of target.
IntroductionFrequency estimates from search engines
The “linguists’ friendly” interfacesBuilding your own web corpus
Enter WaCky!
Web-based Mutual Information
Cosine Similarity (cont.)
I Corpora:I 1.2M word specialized corpus manually assembled by
terminologist;I 4.27M word corpus constructed via random nautical
term queries to Google.
I Context windows:I 2 words to either side of target;I 5 words to either side of target.
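The cosine comparison over context windows can be sketched as follows; the tokenization, window handling and toy sentence here are illustrative assumptions, not the original implementation:

```python
import math
from collections import Counter

def context_vector(tokens, target, window):
    """Count words within `window` positions of each occurrence of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(v1, v2):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(n * v2[w] for w, n in v1.items() if w in v2)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

# Toy example: near-synonyms sharing function-word contexts score high.
tokens = "the hull of the boat meets the bottom of the boat".split()
sim = cosine(context_vector(tokens, "hull", 2),
             context_vector(tokens, "bottom", 2))
```

In the experiments the vectors would instead be built from the 1.2M- and 4.27M-word corpora, with window sizes 2 and 5.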
Experiment 1: Data

I 24 synonym pairs (e.g., bottom/hull, frames/ribs, displacement/weight).
I 124 non-synonym pairs:
  I 100 random pairs of nautical terms;
  I 24 recombinations of terms in synonym set.
I 29% of random pairs rated "strongly semantically related" (e.g., awning/stern board, install/hatch, keel/coated).
Experiment 1: Results
Percentage precision at various percentage recall levels

 recall    WMI        Cosine
                      man corp     man corp     web corp     web corp
                      2-word win   5-word win   2-word win   5-word win
  12.5    100.0        100.0         60.0         60.0         42.9
  25.0    100.0         75.0         60.0         46.2         46.2
  37.5     90.0         42.9         39.1         40.9         45.0
  50.0     92.3         17.9         19.4         26.7         25.5
  62.5     88.2         10.8         15.0         19.0         17.6
  75.0     36.7         12.7         12.7         12.7         13.4
  87.5     30.4         14.5         14.5         14.5         14.5
 100.0     16.2         16.2         16.2         16.2         16.2
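Precision-at-recall figures like those in this table are computed from a score-ranked list of candidate pairs; a minimal sketch (the toy ranking is invented for illustration):

```python
def precision_at_recall(ranked_is_synonym, recall_levels):
    """ranked_is_synonym: booleans for candidate pairs sorted by score, best
    first. Returns {recall_pct: precision_pct}, measured at the first rank
    where that fraction of the true synonym pairs has been retrieved."""
    total_pos = sum(ranked_is_synonym)
    out = {}
    hits = 0
    for n, is_pos in enumerate(ranked_is_synonym, start=1):
        hits += is_pos
        for r in recall_levels:
            if r not in out and hits / total_pos >= r / 100:
                out[r] = 100 * hits / n
    return out

# Toy ranking: 4 true synonym pairs among 8 scored candidates.
ranking = [True, True, False, True, False, False, True, False]
levels = precision_at_recall(ranking, [25, 50, 75, 100])
```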
Experiment 2: Data

I Same 24 synonym pairs as above.
I 31 nymic pairs from Bisi termbase added to test set:
  I 19 cohyponym pairs (e.g., Bruce anchor/mushroom anchor);
  I 10 hypo/hypernym pairs (e.g., stern platform/sun deck);
  I 2 antonyms (e.g., ahead/astern).
I 31 randomly selected non-synonym pairs removed from test set (same synonym-to-non-synonym pair ratio as above).
Experiment 2: Results
Percentage precision at various percentage recall levels

 recall    WMI        Cosine
                      man corp     man corp     web corp     web corp
                      2-word win   5-word win   2-word win   5-word win
  12.5     60.0         42.9         37.5         27.3         20.0
  25.0     33.3         46.2         46.2         28.6         27.3
  37.5     36.0         39.1         39.1         29.0         31.0
  50.0     40.0         19.7         21.1         23.1         22.6
  62.5     37.5         10.8         17.4         19.2         18.1
  75.0     26.5         12.7         12.7         12.7         14.1
  87.5     25.6         14.5         14.5         14.5         14.5
 100.0     16.2         16.2         16.2         16.2         16.2
Houston, we have a problem

I On 31 March 2004, AltaVista's parent company Yahoo! replaced AltaVista's engine with Yahoo!'s own engine.
I End of the NEAR operator.
I Change of underlying database.
I WMI without NEAR:

    WMI(w1, w2) = log2 [ N * hits("w1 w2") / ( hits(w1) * hits(w2) ) ]
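A sketch of the NEAR-less WMI computation, where the exact phrase count replaces the proximity count; the hit counts and page total N below are made-up placeholders, since the real figures would come from engine queries:

```python
import math

def wmi(hits_pair, hits_w1, hits_w2, n_pages):
    """Web-based Mutual Information from search-engine counts:
    WMI(w1, w2) = log2( N * hits("w1 w2") / (hits(w1) * hits(w2)) ),
    with hits("w1 w2") the count for the exact phrase query."""
    return math.log2(n_pages * hits_pair / (hits_w1 * hits_w2))

# Illustrative counts only, not real engine figures.
score = wmi(hits_pair=400, hits_w1=120_000, hits_w2=300_000,
            n_pages=8_000_000_000)
```

Pairs that co-occur more often than chance predicts get positive scores, which is the basis for ranking candidate synonym pairs.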
Experiment 1: with and without NEAR
Percentage precision at various percentage recall levels

 recall   AltaVista   AltaVista   Google
          w/ NEAR     w/o NEAR
  12.5     100.0       100.0      100.0
  25.0     100.0       100.0       85.7
  37.5      90.0       100.0       81.8
  50.0      92.3        75.0       85.7
  62.5      88.2        62.5       60.0
  75.0      36.7        45.0       64.3
  87.5      30.4        34.4       45.6
 100.0      16.2        19.3       17.3
Experiment 2: with and without NEAR
Percentage precision at various percentage recall levels

 recall   AltaVista   AltaVista   Google
          w/ NEAR     w/o NEAR
  12.5      60.0        42.8       50.0
  25.0      33.3        50.0       37.5
  37.5      36.0        52.9       45.0
  50.0      40.0        38.7       40.0
  62.5      37.5        32.6       31.9
  75.0      26.5        28.6       34.0
  87.5      25.6        25.6       30.0
 100.0      16.2        18.5       17.0
Pros and cons of search engine frequencies

I The main advantage: it's easy.
I The main problem: we depend on commercial search engines.
I Linguists' satisfaction is obviously not their priority.
A telling anecdote

(Talking to a new acquaintance who works at Google)

Me: So, do you guys have plans to introduce the NEAR operator?

The Google Acquaintance: You are a linguist, right? Only linguists ask about that sort of stuff. . .
Consequences

I Limited query options (not even diacritics and accents), limited research options.
I You must know the words you are looking for.
I No annotation; few, unreliable metadata.
I Automated querying constraints; over-querying strongly discouraged.
I We know very little about the data we get.
I No control over how search engines evolve.
I Brittleness!
Fletcher 2004 saying the same things

Search engines are not research libraries but commercial enterprises targeted at the needs of the general public. The availability and implementation of their services change constantly: features are added or dropped to mimic or outdo the competition; acquisitions and mergers threaten their independence; financial uncertainties and legal battles challenge their very survival. The search sites' quest for revenue can diminish the objectivity of their search results, and various "page ranking" algorithms may lead to results that are not representative of the Web as a whole. Most frustrating is the minimal support for the requirements of serious researchers: current trends lead away from sites like AltaVista supporting sophisticated complex queries (which few users employ) to ones like Google offering only simple search criteria. In short, the search engines' services are useful to investigators by coincidence, not design, and researchers are tolerated on mainstream search sites only as long as their use does not affect site performance adversely.
Worrying data from the Google APIs
Pattern discovered by Luca Onnis

 Query                  APIs      Website    Ratio
 pleasantly           369000       870000     0.42
 awkwardly            124000       292000     0.42
 silent              4610000     11000000     0.42
 pleasantly silent       107          135     0.79
 awkwardly silent        396          566     0.70
A few more things to worry about

I Google inflating its counts (Véronis's blog, 2005).
I Is the * operator still supported?
Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The "linguists' friendly" interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The "real" Linguist's Search Engine

Enter WaCky!
The "linguist's friendly" interfaces

I WebCorp, KwicFinder, Linguist's Search Engine.
I "Wrappers" around Google, AltaVista, etc.
I Nice interfaces, but ultimately inherit all problems of search engines, and perhaps add some more with their filters. . .
I E.g., "spongi*" query in WebCorp (Stefan Evert).
Building special corpora with search engine queries

I By downloading text, more control over data.
I But less work, more targeted data than spidering your own corpus.
I Good for "special purposes" corpora:
  I "minority" languages (CorpusBuilder; Ghani, Jones, Mladenic, CIKM-2001);
  I specialized sub-languages (BootCaT).
The BootCaT tools

I Bootstrapping Corpora and Terms from the web.
I Perl scripts freely available from: http://sslmit.unibo.it/~baroni/bootcat.html
I Original motivation: fast construction of ad-hoc corpora and term lists for translation/interpreting tasks, terminography.
The BootCaT procedure
[flow diagram]

Select initial terms
→ Query Google for random term combinations
→ Retrieve pages and format as text (corpus)
→ Extract new terms via corpus comparison
→ Extract multi-word terms using corpus, uni-terms, distributional patterns and POS templates
Terms and Term Combinations

I 5-20 terms typical of domain.
I Selection: human or automated (e.g. via text/corpus comparison).
I Seed terms randomly combined into tuples to perform Google queries:
  I Longer tuples: better precision;
  I Shorter tuples: better recall.
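Random tuple construction can be sketched like this; the nautical seed terms are invented for illustration, and the real BootCaT scripts are in Perl, so this is only a Python paraphrase:

```python
import random

def make_tuples(seeds, tuple_len, n_tuples, rng=random):
    """Build random seed-term combinations for search-engine queries.
    Longer tuples give more precise (on-topic) hits; shorter tuples cast a
    wider net, trading precision for recall."""
    return [tuple(rng.sample(seeds, tuple_len)) for _ in range(n_tuples)]

seeds = ["hull", "keel", "rudder", "mast", "bilge", "transom"]
queries = make_tuples(seeds, tuple_len=3, n_tuples=10, rng=random.Random(0))
# Each tuple would then be issued as a conjunctive query, e.g. hull keel mast.
```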
Corpus/Term Bootstrapping

I The bootstrap:
  1. Retrieve corpus from web via Google tuple queries;
  2. Extract typical terms through statistical comparison with reference corpus (using Mutual Information, Log-Likelihood Ratio, etc.);
  3. Use found terms as new seeds and build new random tuples;
  4. Go back to 1.
I Retrieved pages formatted as text (character set issues, non-text format issues; in Japanese: tokenization issues).
I Reference corpus: better if balanced, but any corpus on a different topic will usually do (though in Japanese the register of the corpora turns out to be crucial!).
IntroductionFrequency estimates from search engines
The “linguists’ friendly” interfacesBuilding your own web corpus
Enter WaCky!
Small corpora via search engine queriesThinking Big: The “real” Linguist’s Search Engine
Example 1: Pseudo-seizures in English (Baroni and Bernardini 2004)

- Seed terms: dissociative, epilepsy, interventions, posttraumatic, pseudoseizures, ptsd.
- Reference: Brown (1.1M words).
- Corpus comparison via Log Odds Ratio.
- Two iterations.
- 1.4M-word corpus constructed; 1,800 unigram terms extracted.
- 20/30 randomly selected documents from the corpus rated as relevant and informative.
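The Log Odds Ratio comparison can be sketched like this. It is a toy illustration with invented counts, assuming simple add-0.5 smoothing so that zero reference-corpus counts do not blow up; it is not the authors' actual implementation:

```python
import math

def log_odds_ratio(freq_spec, size_spec, freq_ref, size_ref):
    """Log Odds Ratio of a term's frequency in a specialized corpus
    vs. a reference corpus, with add-0.5 smoothing in every cell."""
    a = freq_spec + 0.5            # term in specialized corpus
    b = size_spec - freq_spec + 0.5  # other tokens, specialized
    c = freq_ref + 0.5             # term in reference corpus
    d = size_ref - freq_ref + 0.5  # other tokens, reference
    return math.log((a / b) / (c / d))

# A term frequent in the 1.4M-word web corpus but rare in Brown
# scores high; the reverse pattern scores negative:
print(log_odds_ratio(120, 1_400_000, 1, 1_100_000))
print(log_odds_ratio(1, 1_400_000, 120, 1_100_000))
```

Terms are then ranked by this score, and the top of the list is taken as the typical terminology of the domain.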
Example 2: Hotel terminology in Japanese (Baroni and Ueyama 2004)

- 20 manually selected initial terms.
- 3.5M-word reference corpus built with BootCaT, using random elementary Japanese words as seeds.
- Corpus comparison via MI and Log-Likelihood Ratio.
- Three iterations.
- 1.3M-word corpus constructed; 424 terms extracted.
- 76/90 randomly selected documents assigned the highest relevance/informativeness rating.
- 58.4% of terms rated very relevant; 81.7% rated at least somewhat relevant.
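The Log-Likelihood Ratio used for this comparison is standardly computed as Dunning-style G2 over a word's frequency in the two corpora. A minimal sketch with invented counts (one common two-cell formulation, not necessarily the exact variant used in the study):

```python
import math

def log_likelihood(a, b, c1, c2):
    """Dunning-style log-likelihood (G2) keyness score: frequency a
    in a corpus of size c1 vs. frequency b in a reference corpus of
    size c2. Expected counts assume the word is equally likely in both."""
    e1 = c1 * (a + b) / (c1 + c2)
    e2 = c2 * (a + b) / (c1 + c2)
    g2 = 0.0
    if a:  # a * log(a/e1) -> 0 as a -> 0
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

# A word heavily skewed toward the hotel corpus scores much higher
# than one distributed roughly as expected:
print(log_likelihood(120, 4, 1_300_000, 3_500_000))
print(log_likelihood(10, 30, 1_300_000, 3_500_000))
```

Unlike plain MI, G2 does not over-reward very rare words, which is why the two measures are often used together.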
Applications

- Languages: English, Italian, Japanese, Spanish, German, French, Russian, Chinese, Danish.
- Domains: medical, legal, meteorology, food, nautical terminology, (e-)commerce...
- Uses: technical translation, interpreting tasks, resources for LSP teaching, populating ontologies, expanding a lexicon in systematic ways, general corpus construction (Sharoff, submitted).
Ongoing and planned work

- Special queries.
- Better character-set handling.
- Better PDF/DOC conversion.
- Better integration with UCS and other tools.
- Multi-term extraction.
- Yahoo API?
Pros

- We still rely on a commercial search engine, but less so.
- We only use the most basic query function, which is less likely to change.
- Language filtering and good relevance ranking are crucial characteristics of successful search engines.
- We are less likely to bother the engine by over-querying, since a single query can yield megabytes of data.
- We have full control over the data (e.g. frequency counts, parsing, manual URL filtering) because we download them ourselves.
Cons

- We still rely on a commercial search engine:
  - What happens if Google discontinues the API service?
  - What happens if Google does something too smart or too commercial with the page ranks?
- Good for content-driven corpus building; problems with syntax/style/genre-based filtering.
- Good for building small, targeted corpora (but see Sharoff's work, and possibly Ciaramita's).
- Not for exploiting the vastness of the web-as-corpus directly.
Biting the bullet...

- Crawling, cleaning, annotating, managing and maintaining your own indexed version of the web.
- Obviously the "ideal" solution.
- But obviously a lot of work!
Build your own search engine

- Crawling.
- Post-processing (HTML/boilerplate stripping, language recognition, duplicate detection, "connected prose" recognition...).
- Linguistic processing.
- Categorization, meta-data.
- Indexing.
- Interfaces.
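One common approach to the duplicate-detection step is to fingerprint each page as a set of word n-gram "shingles" and compare sets by Jaccard overlap. The sketch below is a toy illustration of the idea, not the pipeline of any particular crawler; real systems additionally hash the shingles and use sketching tricks (e.g. MinHash) to scale:

```python
def shingles(text, n=5):
    """The set of word n-grams ('shingles') of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the web is a huge database of documents mostly text"
doc2 = "the web is a huge database of documents mostly in text form"
doc3 = "frequency estimates from search engines can be unstable"

print(jaccard(shingles(doc1, 3), shingles(doc2, 3)))  # high: near-duplicates
print(jaccard(shingles(doc1, 3), shingles(doc3, 3)))  # zero: unrelated pages
```

Pairs above some similarity threshold (say 0.8) would be treated as near-duplicates and only one copy kept.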
The huge web corpus of Clarke and collaborators

- Terabyte crawl of the web in 2001.
- From an initial seed set of 2392 (English?) educational URLs.
- No duplicates; not too many pages from the same site.
- No language filtering.
- 53 billion words, 77 million documents.
- (For comparison: the BNC has 100 million words; Google indexes 8 billion documents.)
The TOEFL synonym match test, again

- Target: levied. Candidates: imposed, believed, requested, correlated.
WMI takes the TOEFL again (Terra and Clarke 2003)

- Performance on the TOEFL synonym match task:
  - Average foreign test taker: 64.5%
  - Latent Semantic Analysis: 65.4%
  - WMI: 72.5%
  - Terra & Clarke's WMI: 81.25%
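Web-based MI of this kind reduces to pointwise mutual information over (co-)occurrence counts: the candidate whose co-occurrence with the target is most above chance wins. A sketch with invented counts standing in for corpus hit counts (the numbers are purely illustrative, not Terra and Clarke's data):

```python
import math

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information from counts:
    PMI(x, y) = log( P(x, y) / (P(x) * P(y)) )."""
    return math.log((c_xy / n) / ((c_x / n) * (c_y / n)))

# Toy counts for the TOEFL item above (purely illustrative):
n = 1_000_000
target = "levied"
counts = {"levied": 500, "imposed": 8000, "believed": 90000,
          "requested": 40000, "correlated": 6000}
cooc = {"imposed": 60, "believed": 40, "requested": 25, "correlated": 3}

# Pick the candidate with the highest PMI with the target:
best = max(cooc, key=lambda w: pmi(cooc[w], counts[target], counts[w], n))
print(best)  # -> imposed
```

Note how raw co-occurrence alone would be misled by frequent words like "believed"; dividing by the marginal probabilities is what makes the measure work.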
Pros

- Independence from commercial search engines.
- A precious, multi-purpose resource.
- In principle, you can do what you want with it.
Cons

- A lot of work.
- Resource-intensive.
- In principle, you can do what you want with it...
- In practice, almost anything you want to do with a terabyte corpus will be extremely complicated.
- Forget about the "do it yourself with a Perl script" approach.
Enter WaCky!
I The Web-as-Corpus kool ynitiative.
I http://wacky.sslmit.unibo.it/
I WaCky crowd: Marco, Massi, Silvia Bernardini, Stefan Evert, Bill Fletcher, Adam Kilgarriff. . .
I Yet Another Linguist’s Search Engine proposal (see also: Kilgarriff 2003, Fletcher 2004).
I The WaCky philosophy: try to get something concrete out there very soon, so that others will feel motivated to contribute.
I Three 1-billion-word corpora (English, German, Italian) by spring 2006.
I Web interface(s) and an open source toolkit.
Enter WaCky! (cont.)
I We must learn from IR and massive-dataset studies (e.g., near-duplicate detection, fast retrieval). . .
I but there are important differences, for example:
I We probably want all data, or perhaps random data, or even linguistically interesting data, not necessarily the most relevant data.
I We care about (linguistic) form at least as much as about content.
I A new challenge in computational linguistics: the data are not given.
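Near-duplicate detection, mentioned above as a technique to borrow from IR, is commonly done by comparing sets of word n-grams (“shingles”) via Jaccard similarity. A minimal sketch under that standard approach (function names are illustrative, not from the WaCky toolkit):

```python
def shingles(text, n=3):
    """Set of word n-grams ("shingles") representing a document."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b, n=3):
    """Resemblance of two documents as shingle-set overlap."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the web is a huge database of documents mostly text"
doc2 = "the web is a huge database of documents mostly html text"
print(round(jaccard(doc1, doc2), 2))  # 0.7
```

At web scale one would not compare all document pairs directly; sketching techniques such as MinHash reduce each shingle set to a small fingerprint first.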
Enter WaCky! (cont.)
I Emphasis on:
I Transparency;
I Stability;
I Pre-processing;
I Categorization and annotation;
I (Also) automated access;
I Sophisticated query options.

I Not so important:

I Access speed;
I Updating;
I Size;
I Content-driven relevance.
The WaCkodules: Where We Are At
I Seeding the Crawls: BNC/Google seeding experiments and Massi’s measures of randomness.
I Crawling: with Heritrix, the Internet Archive crawler.
I Post-processing: current focus on duplicate detection.
I Linguistic annotation, meta-data: nothing yet.
I Indexing: Lucene vs. the newly open (!) IMS Corpus Workbench.
I Interfaces: work by Stefan Evert.
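Seeding a crawl in the BootCaT spirit (Baroni and Bernardini 2004, in the references) means sending random tuples of seed words to a search engine and harvesting the result URLs as crawl starting points. A sketch of just the query-building step (`seed_queries` is an illustrative name; the actual WaCky scripts may differ):

```python
import random

def seed_queries(seed_words, n_queries=5, tuple_size=3, rng=None):
    """Build search-engine query strings from random tuples of
    seed words, BootCaT-style. A seeded RNG keeps runs repeatable."""
    rng = rng or random.Random(0)
    return [" ".join(rng.sample(seed_words, tuple_size))
            for _ in range(n_queries)]

seeds = ["corpus", "linguistics", "frequency", "annotation",
         "tagger", "lemma", "concordance", "collocation"]
for q in seed_queries(seeds):
    print(q)
```

The choice of seed words biases the resulting corpus, which is exactly why the BNC-based seeding experiments and measures of randomness mentioned above matter.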
A few references
M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. LREC 2004.
M. Baroni and S. Bisi. 2004. Using cooccurrence statistics and the web to discover synonyms in a specialized language. LREC 2004.
M. Banko and E. Brill. 2001. Scaling to very very large corpora for natural language disambiguation. ACL 2001.
W. Fletcher. 2004. Facilitating the compilation and dissemination of ad-hoc web corpora. Papers from TALC 2002.
R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create minority language corpora. CIKM 2001.
F. Keller and M. Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29.
A. Kilgarriff. 2003. Linguistic search engine. Corpus Linguistics 2003.
E. Terra and C. L. A. Clarke. 2003. Frequency estimates for statistical word similarity measures. HLT-NAACL 2003.
P. Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. ECML 2001.