Better Together: Large Monolingual, Bilingual and Multimodal Corpora in Natural Language Processing
Shane Bergsma, Johns Hopkins University
Fall 2011
Research Vision
Robust processing of human language requires knowledge beyond what’s in small manually-annotated data sets
Many NLP successes exploit web-scale raw data:
• Google Translate
• IBM's Watson
• Things people use every day: spelling correction, speech recognition, etc.
Grammar Correction Task at Microsoft [Banko & Brill, 2001]
More data is better data
This Talk
Derive lots of knowledge from web-scale data and apply it to syntax, semantics, discourse:
1) Raw text on the web (Google N-grams) → Part 1: Non-referential pronouns
2) Bilingual text (words plus their translations) → Part 2: Parsing noun phrases
3) Visual data (labelled online images) → Part 3: Learning the meaning of words
Search Engines for NLP
• Early web work: use an Internet search engine to get data [Keller & Lapata, 2003]
“Britney Spears”: 269,000,000 pages
“Britany Spears”: 693,000 pages
Search Engines
• Search engines for NLP: some objections
– Scientific: not reproducible, unreliable [Kilgarriff, 2007, “Googleology is bad science.”]
– Practical: too slow for millions of queries
N-grams
• Google N-gram Data [Brants & Franz, 2006]
– N words in sequence + their count on the web
– A compressed version of all the text on the web
• 24 GB zipped fits on your hard drive
– Enables better features for a range of tasks [Bergsma et al., ACL 2008, IJCAI 2009, ACL 2010, etc.]
Part 1: Non-Referential Pronouns
E.g. the word “it” in English
• “You can make it in advance.” (referential, 50-75% of cases)
• “You can make it in Hollywood.” (non-referential, 25-50% of cases)
Non-Referential Pronouns
• [Hirst, 1981]: detect non-referential pronouns, “lest precious hours be lost in bootless searches for textual referents.”
• Most existing pronoun/coreference systems just ignore the problem
• A common ambiguity: “it” comprises 1% of English tokens
Non-Ref Detection as Classification
• Input: s = “You can make it in advance”
• Output: is “it” a non-referential pronoun in s?
• Method: train a supervised classifier to make this decision on the basis of some features [Evans, 2001; Boyd et al., 2005; Müller, 2006]
A Machine Learning Approach
h(x) = w ∙ x (predict non-referential if h(x) > 0)
• Typical ‘lexical’ features: binary indicators of context:
x = (previous-word=make, next-word=in, previous-two-words=can+make, …)
• Use training data to learn good values for the weights, w
– The classifier learns, e.g., to give negative weight to PPs immediately preceding ‘it’ (e.g. “… from it”)
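A minimal sketch of this linear classifier over sparse binary context features. The feature names mirror the slide, but the weights below are invented for illustration; a real system learns them from annotated data.

```python
# Sketch of h(x) = w . x over sparse binary context features.
# Weights are invented for illustration, not learned values.

def extract_features(tokens, i):
    """Binary indicator features for the pronoun at position i."""
    feats = set()
    if i > 0:
        feats.add("previous-word=" + tokens[i - 1])
    if i + 1 < len(tokens):
        feats.add("next-word=" + tokens[i + 1])
    if i + 2 < len(tokens):
        feats.add("next-two-words=" + tokens[i + 1] + "+" + tokens[i + 2])
    return feats

def h(feats, w):
    """w . x for sparse binary x: sum the weights of active features."""
    return sum(w.get(f, 0.0) for f in feats)

# Hypothetical weights: "in Hollywood" after "it" suggests the idiomatic,
# non-referential reading; "in advance" the referential one.
w = {"next-two-words=in+advance": -1.0, "next-two-words=in+Hollywood": 1.0}

def classify(sentence):
    tokens = sentence.split()
    feats = extract_features(tokens, tokens.index("it"))
    return "non-referential" if h(feats, w) > 0 else "referential"
```

With these toy weights, "You can make it in advance" comes out referential and "You can make it in Hollywood" non-referential, matching the slide's examples.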
Better: Features from the Web
• Convert the sentence to a context pattern: “make ____ in advance”
• Collect counts from the web:
– “make it/them in advance”: 442 vs. 449 occurrences in the Google N-gram Data
– “make it/them in Hollywood”: 3421 vs. 0 occurrences in the Google N-gram Data
[Bergsma, Lin, Goebel, ACL 2008]
Applying the Web Counts
• How wide should the patterns span?
– We can use all that the Google N-gram Data allows:
“You can make _”, “can make _ in”, “make _ in advance”, “_ in advance .”
– Five 5-grams, four 4-grams, three 3-grams and two bigrams
• What fillers to use? (e.g. it, they/them, any NP?)
Web Count Features
“it”:
• 5-grams: log-cnt(“You can make it in”), log-cnt(“can make it in advance”), log-cnt(“make it in advance .”), …
• 4-grams: log-cnt(“You can make it”), log-cnt(“can make it in”), …
“them”:
• 5-grams: log-cnt(“You can make them in”), …
A Machine Learning Approach Revisited
h(x) = w ∙ x (predict non-ref if h(x) > 0)
• Typical features: binary indicators of context:
x = (previous-word=make, next-word=in, previous-two-words=can+make, …)
• New features: real-valued counts in web text:
x = (log-cnt(“make it in advance”), log-cnt(“make them in advance”), log-cnt(“make * in advance”), …)
• Key conclusion: classifiers with web features are robust on new domains! [Bergsma, Pitler, Lin, ACL 2010]
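The log-count features can be sketched as follows. The counts in the lookup table are the “make it/them” example figures from the earlier slide; a real system would query the full Google N-gram Data.

```python
# Sketch of real-valued web-count features for the filler words.
import math

ngram_counts = {
    "make it in advance": 442,
    "make them in advance": 449,
    "make it in Hollywood": 3421,
    # "make them in Hollywood" is unseen (count 0)
}

def log_cnt(pattern):
    """Log of the web count, with 0.0 for unseen patterns."""
    c = ngram_counts.get(pattern, 0)
    return math.log(c) if c > 0 else 0.0

def filler_features(context, fillers=("it", "them")):
    """context has '_' where the pronoun sits; one feature per filler."""
    return {f: log_cnt(context.replace("_", f)) for f in fillers}

# Referential contexts admit both fillers at similar rates;
# non-referential ones strongly prefer "it".
advance = filler_features("make _ in advance")
hollywood = filler_features("make _ in Hollywood")
```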
NADA
• Non-Anaphoric Detection Algorithm: a system for identifying non-referential pronouns
• Works on raw sentences; no parsing/tagging of the input needed
• Classifies “it” in up to 20,000 sentences/second
• Works well when used out-of-domain, because of its web count features
[Bergsma & Yarowsky, DAARC 2011]
http://code.google.com/p/nada-nonref-pronoun-detector/
Using web counts works great… but is it practical?
All N-grams in the Google N-gram corpus: 93 GB
Extract N-grams of length 4 only: 33 GB
Extract N-grams containing it, they, them only: 500 MB
Lower-case, truncate tokens to four characters, replace special tokens (e.g. named entities, pronouns, digits) with symbols, etc.: 189 MB
Encode tokens (6 bytes) and values (2 bytes), store only changes from previous line: 44 MB
gzip the resulting file: 33 MB
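The “store only changes from the previous line” step can be illustrated with a toy prefix-delta encoding: sorted n-grams share long prefixes, so each entry records only how many leading tokens it shares with its predecessor plus the new suffix. This scheme is an invented simplification, not NADA's actual 6-byte/2-byte binary format.

```python
# Toy prefix-delta encoding for a sorted list of n-grams.

def delta_encode(sorted_ngrams):
    encoded, prev = [], []
    for ngram in sorted_ngrams:
        shared = 0  # leading tokens shared with the previous n-gram
        while (shared < min(len(prev), len(ngram))
               and prev[shared] == ngram[shared]):
            shared += 1
        encoded.append((shared, ngram[shared:]))
        prev = ngram
    return encoded

def delta_decode(encoded):
    decoded, prev = [], []
    for shared, suffix in encoded:
        ngram = prev[:shared] + suffix
        decoded.append(ngram)
        prev = ngram
    return decoded

ngrams = [["can", "make", "it", "in"],
          ["can", "make", "it", "into"],
          ["can", "make", "them", "in"]]
encoded = delta_encode(ngrams)
```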
NADA versus Other Systems
[Chart: precision, recall, F-score and accuracy (35-85% range) for Paice & Husk, Charniak & Elsner, and NADA]
Part 1: Conclusion
• N-gram data is better than search engines
• Classifiers with N-gram counts are very effective, particularly on new domains
• But we needed a large corpus of manually-annotated data to learn how to use the counts
• We'll see now how bilingual data can provide the supervision (for some problems)
Part 2: Coordination Ambiguity in NPs
1) [dairy and meat] production
2) [sustainability] and [meat production]
yes: “dairy production” in (1); no: “sustainability production” in (2)
• New semantic features from raw web text, and a new approach to using bilingual data as soft supervision
[Bergsma, Yarowsky & Church, ACL 2011]
Coordination Ambiguity
• Words whose POS tags match the pattern: [DT|PRP$] (N.*|J.*) and [DT|PRP$] (N.*|J.*) N.*
• Output: decide if one NP or two
• Resolving coordination is a classic hard problem
– The Treebank doesn't annotate NP-internal structure
– Modern parsers thus do very poorly on these decisions (78% for Minipar, 79% for the C&C parser)
– For training/evaluation, we patched the Treebank with Vadas & Curran ’07 NP annotations
One Noun Phrase or Two: A Machine Learning Approach
Input: “dairy and meat production” → features x
x = (…, first-noun=dairy, …, second-noun=meat, …, first+second-noun=dairy+meat, …)
h(x) = w ∙ x (predict one NP if h(x) > 0)
• Set w via training on annotated training data using some machine learning algorithm
Leveraging Web-Derived Knowledge
[dairy and meat] production
• If there is only one NP, then it is implicitly talking about “dairy production”
• Do we see this phrase occurring a lot on the web? [Yes]
sustainability and [meat production]
• If there is only one NP, then it is implicitly talking about “sustainability production”
• Do we see this phrase occurring a lot on the web? [No]
• The classifier has features for these counts
– But the web can give us more!
Features for Explicit Paraphrases
Write the construction as ❶ and ❷ ❸ (here ❶ = dairy or sustainability, ❷ = meat, ❸ = production):
Pattern ❸ of ❶ and ❷:
• high Count(production of dairy and meat)
• low Count(production of sustainability and meat)
Pattern ❷ ❸ and ❶:
• low Count(meat production and dairy)
• high Count(meat production and sustainability)
New paraphrases extending ideas in [Nakov & Hearst, 2005]
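These paraphrase patterns can be turned into count features as sketched below. The counts in the lookup table are invented for illustration; a real system queries the Google N-gram Data.

```python
# Sketch of paraphrase-pattern count features for an "N1 and N2 N3"
# coordination. Counts are invented.

ngram_counts = {
    "production of dairy and meat": 120,
    "meat production and sustainability": 95,
    # the other paraphrases are unseen
}

def paraphrase_features(n1, n2, n3):
    """Counts of paraphrases signalling one NP vs. two NPs."""
    one_np = f"{n3} of {n1} and {n2}"   # pattern: (3) of (1) and (2)
    two_np = f"{n2} {n3} and {n1}"      # pattern: (2) (3) and (1)
    return {
        "one-NP-paraphrase": ngram_counts.get(one_np, 0),
        "two-NP-paraphrase": ngram_counts.get(two_np, 0),
    }

dairy = paraphrase_features("dairy", "meat", "production")
sustain = paraphrase_features("sustainability", "meat", "production")
```

A high one-NP paraphrase count supports the [dairy and meat] production reading; a high two-NP count supports sustainability and [meat production].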
Training Examples
[Diagram: training examples (“conservation and good management”, “motor and heating fuels”, “freedom and security agenda”) are combined with the Google N-gram Data to produce feature vectors x1, x2, x3, x4; machine learning over the small human-annotated data and the HUGE raw data yields the classifier h(x)]
Using Bilingual Data
• Bilingual data: a rich source of paraphrases
dairy and meat production → producción láctea y cárnica
• Build a classifier which uses bilingual features
– Applicable when we know the translation of the NP
Bilingual “Paraphrase” Features
For “dairy and meat production” vs. “sustainability and meat production” (construction ❶ and ❷ ❸):
Pattern ❸ ❶ … ❷ (Spanish): Count(producción láctea y cárnica) vs. unseen
Pattern ❶ … ❸ ❷ (Italian): unseen vs. Count(sostenibilità e la produzione di carne)
Bilingual “Paraphrase” Features (continued)
Pattern ❶- … ❷❸ (Finnish): Count(maidon- ja lihantuotantoon) for “dairy and meat production” vs. unseen for “sustainability and meat production”
Training Examples
[Diagram: the same training examples (“conservation and good management”, “motor and heating fuels”, “freedom and security agenda”) are now combined with translation data to produce feature vectors x1, x2, x3, x4; machine learning over the small human-annotated data and the medium-sized bilingual data yields the classifier h(xb)]
[Diagram: two classifiers trained on the same labelled examples (“insurrection and regime change”, “coal and steel money”, “North and South Carolina”, “business and computer science”, “the Bosporus and Dardanelles straits”, “rocket and mortar attacks”, “the environment and air transport”, “pollution and transport safety”): h(xb) with features from the translation data and h(xm) with features from the Google data; a pool of unlabelled bitext examples feeds both]
[Diagram: the bilingual classifier h(xb)1 labels bitext examples (e.g. “business and computer science”, “the Bosporus and Dardanelles straits”, “the environment and air transport”) and these are added to the training data of the monolingual classifier h(xm)]
[Diagram: in the next round, the retrained classifiers h(xb)1 and h(xm)1 label further bitext examples (e.g. “insurrection and regime change”, “North and South Carolina”, “pollution and transport safety”) for each other, and the two classifiers continue to iterate]
Co-Training: [Yarowsky’95], [Blum & Mitchell’98]
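The iteration above is the classic co-training loop: two classifiers over different feature views alternately label the unlabelled examples they are most confident about for each other. The toy sketch below uses invented one-dimensional threshold “classifiers” as stand-ins for the real monolingual and bilingual linear classifiers.

```python
# Toy co-training sketch. Each example has two feature views (here,
# single numbers); each view gets a trivial midpoint-threshold
# classifier. Illustrative only, not the paper's actual setup.

def train_threshold(examples):
    """Fit a midpoint threshold between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    return 1 if x > threshold else 0

def confidence(threshold, x):
    return abs(x - threshold)

def cotrain(labeled, unlabeled, rounds=3):
    # labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]
    train_a = [(a, y) for (a, b), y in labeled]
    train_b = [(b, y) for (a, b), y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        th_a, th_b = train_threshold(train_a), train_threshold(train_b)
        # Classifier A labels its most confident example for classifier B
        best = max(pool, key=lambda ab: confidence(th_a, ab[0]))
        pool.remove(best)
        train_b.append((best[1], predict(th_a, best[0])))
        # ... and vice versa
        if pool:
            best = max(pool, key=lambda ab: confidence(th_b, ab[1]))
            pool.remove(best)
            train_a.append((best[0], predict(th_b, best[1])))
    return train_threshold(train_a), train_threshold(train_b)

labeled = [((0.9, 0.8), 1), ((0.1, 0.2), 0)]
unlabeled = [(0.95, 0.7), (0.05, 0.3), (0.8, 0.9)]
th_a, th_b = cotrain(labeled, unlabeled)
```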
[Chart: error rate (%) of the co-trained classifiers h(xm)i and h(xb)i over co-training iterations i]
Error rate (%) on Penn Treebank (PTB)
[Chart: error rates (0-20% range) for broad-coverage parsers; Nakov & Hearst (2005), unsupervised; Pitler et al. (2010), 800 PTB training examples; the new supervised mono classifier, 800 PTB training examples; and the co-trained mono classifier h(xm)N, 2 training examples]
Part 2: Conclusion
• Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
– New paraphrase features
• New way to use bilingual data as soft supervision to guide the use of monolingual features
Part 3: Using visual data to learn the meaning of words
• Large volumes of visual data also reveal meaning (semantics), but in a language-universal way
• Humans label their images as they post them online, providing the word-meaning link
• There are lots of images to work with [from Facebook's Twitter feed]
English Web Images ↔ Spanish Web Images
turtle ↔ tortuga
candle ↔ vela
cockatoo ↔ cacatúa
[Bergsma and Van Durme, IJCAI 2011]
Linking bilingual words by web-based visual similarity
Step 1: Retrieve online images via Google Image Search (in each language), 20 images per word
– Google is competitive with “hand-prepared datasets” [Fergus et al., 2005]
Step 2: Create Image Feature Vectors
Color histogram features
Step 2: Create Image Feature Vectors
SIFT keypoint features
Using David Lowe’s software [Lowe, 2004]
Step 3: Compute an Aggregate Similarity for Two Words
[Diagram: vector cosine similarities between image pairs (e.g. 0.33, 0.55, 0.19, 0.46); take the best match for each English image, then average over all English images]
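The aggregation step can be sketched as follows; the toy 2-dimensional vectors stand in for the actual color-histogram and SIFT feature vectors.

```python
# Aggregate visual similarity: for each English image vector, take the
# cosine similarity of its best-matching foreign image, then average
# over all English images. Vectors here are invented stand-ins.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def aggregate_similarity(english_images, foreign_images):
    """Average, over English images, of the best cosine match."""
    best = [max(cosine(e, f) for f in foreign_images)
            for e in english_images]
    return sum(best) / len(best)

english = [[1.0, 0.0], [0.8, 0.6]]
foreign = [[1.0, 0.1], [0.0, 1.0]]
score = aggregate_similarity(english, foreign)
```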
Output: Ranking of Foreign Translations by Aggregate Visual Similarities
English: rosary
Spanish: 1. camándula (0.151), 2. puntaje (0.140), 3. accidentalidad (0.139), …
French: 1. chapelet (0.213), 2. activité (0.153), 3. rosaire (0.150), …
Lots of details in the paper:
• Finding a class of words where this works (physical objects)
• Comparing visual similarity to string similarity (cognate finder)
Task #2: Lexical Semantics from Images
Can you eat “migas”?
Can you eat “carillon”?
Can you eat “mamey”?
Selectional Preference:
Is noun X a plausible object for verb Y?
[Bergsma and Goebel, RANLP 2011]
Conclusion
• Robust NLP needs to look beyond human-annotated data to exploit large corpora
• Size matters:
– Many NLP systems are trained on 1 million words
– We use:
• billions of words in bitexts
• trillions of words of monolingual text
• hundreds of billions of online images (at ~1000 words each, that's 100 trillion words!)
Questions + Thanks
• Gold sponsors
• Platinum sponsors (collaborators): Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)