Better Together: Large Monolingual, Bilingual and Multimodal Corpora in Natural Language Processing
Shane Bergsma, Johns Hopkins University
Fall 2011
Research Vision
Robust processing of human language requires knowledge beyond what’s in small manually-annotated data sets
Many NLP successes exploit web-scale raw data:
• Google Translate
• IBM's Watson
• Things people use every day: spelling correction, speech recognition, etc.
Grammar Correction Task at Microsoft [Banko & Brill, 2001]
More data is better data
This Talk
Derive lots of knowledge from web-scale data and apply it to syntax, semantics, discourse:
1) Raw text on the web (Google N-grams) → Part 1: Non-referential pronouns
2) Bilingual text (words plus their translations) → Part 2: Parsing noun phrases
3) Visual data (labelled online images) → Part 3: Learning the meaning of words
Search Engines for NLP
• Early web work: use an Internet search engine to get data [Keller & Lapata, 2003]
“Britney Spears”: 269,000,000 pages
“Britany Spears”: 693,000 pages
Search Engines
• Search engines for NLP: some objections
– Scientific: not reproducible, unreliable [Kilgarriff, 2007, “Googleology is bad science.”]
– Practical: too slow for millions of queries
N-grams
• Google N-gram Data [Brants & Franz, 2006]
– N words in sequence + their count on the web
– A compressed version of all the text on the web
• 24 GB zipped fits on your hard drive
– Enables better features for a range of tasks [Bergsma et al., ACL 2008, IJCAI 2009, ACL 2010, etc.]
Part 1: Non-Referential Pronouns
E.g. the word “it” in English
• “You can make it in advance.” (referential, 50-75% of cases)
• “You can make it in Hollywood.” (non-referential, 25-50% of cases)
Non-Referential Pronouns
• [Hirst, 1981]: detect non-referential pronouns, “lest precious hours be lost in bootless searches for textual referents.”
• Most existing pronoun/coreference systems just ignore the problem
• A common ambiguity: “it” comprises 1% of English tokens
Non-Ref Detection as Classification
• Input: s = “You can make it in advance”
• Output: is “it” a non-referential pronoun in s?
• Method: train a supervised classifier to make this decision on the basis of some features [Evans, 2001; Boyd et al., 2005; Müller, 2006]
A Machine Learning Approach
h(x) = w ∙ x (predict non-referential if h(x) > 0)
• Typical ‘lexical’ features: binary indicators of context:
x = (previous-word=make, next-word=in, previous-two-words=can+make, …)
• Use training data to learn good values for the weights, w
– The classifier learns, e.g., to give negative weight to PPs immediately preceding ‘it’ (e.g. “… from it”)
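A minimal sketch of this linear classifier over sparse binary context features. The feature names mirror the slide, but the weights below are invented for illustration; a real system learns them from annotated data.

```python
# Sketch of h(x) = w . x over sparse binary context features.
# Weights are invented for illustration, not learned values.

def extract_features(tokens, i):
    """Binary indicator features for the pronoun at position i."""
    feats = set()
    if i > 0:
        feats.add("previous-word=" + tokens[i - 1])
    if i + 1 < len(tokens):
        feats.add("next-word=" + tokens[i + 1])
    if i + 2 < len(tokens):
        feats.add("next-two-words=" + tokens[i + 1] + "+" + tokens[i + 2])
    return feats

def h(feats, w):
    """w . x for sparse binary x: sum the weights of active features."""
    return sum(w.get(f, 0.0) for f in feats)

# Hypothetical weights: "in Hollywood" after "it" suggests the idiomatic,
# non-referential reading; "in advance" the referential one.
w = {"next-two-words=in+advance": -1.0, "next-two-words=in+Hollywood": 1.0}

def classify(sentence):
    tokens = sentence.split()
    feats = extract_features(tokens, tokens.index("it"))
    return "non-referential" if h(feats, w) > 0 else "referential"
```

With these toy weights, "You can make it in advance" comes out referential and "You can make it in Hollywood" non-referential, matching the slide's examples.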
Better: Features from the Web
• Convert the sentence to a context pattern: “make ____ in advance”
• Collect counts from the web:
– “make it/them in advance”: 442 vs. 449 occurrences in the Google N-gram Data
– “make it/them in Hollywood”: 3421 vs. 0 occurrences in the Google N-gram Data
[Bergsma, Lin, Goebel, ACL 2008]
Applying the Web Counts
• How wide should the patterns span?
– We can use all that the Google N-gram Data allows:
“You can make _”, “can make _ in”, “make _ in advance”, “_ in advance .”
– Five 5-grams, four 4-grams, three 3-grams and two bigrams
• What fillers to use? (e.g. it, they/them, any NP?)
Web Count Features
“it”:
• 5-grams: log-cnt(“You can make it in”), log-cnt(“can make it in advance”), log-cnt(“make it in advance .”), …
• 4-grams: log-cnt(“You can make it”), log-cnt(“can make it in”), …
“them”:
• 5-grams: log-cnt(“You can make them in”), …
A Machine Learning Approach Revisited
h(x) = w ∙ x (predict non-ref if h(x) > 0)
• Typical features: binary indicators of context:
x = (previous-word=make, next-word=in, previous-two-words=can+make, …)
• New features: real-valued counts in web text:
x = (log-cnt(“make it in advance”), log-cnt(“make them in advance”), log-cnt(“make * in advance”), …)
• Key conclusion: classifiers with web features are robust on new domains! [Bergsma, Pitler, Lin, ACL 2010]
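The log-count features can be sketched as follows. The counts in the lookup table are the “make it/them” example figures from the earlier slide; a real system would query the full Google N-gram Data.

```python
# Sketch of real-valued web-count features for the filler words.
import math

ngram_counts = {
    "make it in advance": 442,
    "make them in advance": 449,
    "make it in Hollywood": 3421,
    # "make them in Hollywood" is unseen (count 0)
}

def log_cnt(pattern):
    """Log of the web count, with 0.0 for unseen patterns."""
    c = ngram_counts.get(pattern, 0)
    return math.log(c) if c > 0 else 0.0

def filler_features(context, fillers=("it", "them")):
    """context has '_' where the pronoun sits; one feature per filler."""
    return {f: log_cnt(context.replace("_", f)) for f in fillers}

# Referential contexts admit both fillers at similar rates;
# non-referential ones strongly prefer "it".
advance = filler_features("make _ in advance")
hollywood = filler_features("make _ in Hollywood")
```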
NADA
• Non-Anaphoric Detection Algorithm: a system for identifying non-referential pronouns
• Works on raw sentences; no parsing/tagging of the input needed
• Classifies “it” in up to 20,000 sentences/second
• Works well when used out-of-domain, because of its web count features
[Bergsma & Yarowsky, DAARC 2011]
http://code.google.com/p/nada-nonref-pronoun-detector/
Using web counts works great… but is it practical?
All N-grams in the Google N-gram corpus: 93 GB
Extract N-grams of length 4 only: 33 GB
Extract N-grams containing it, they, them only: 500 MB
Lower-case, truncate tokens to four characters, replace special tokens (e.g. named entities, pronouns, digits) with symbols, etc.: 189 MB
Encode tokens (6 bytes) and values (2 bytes), store only changes from previous line: 44 MB
gzip the resulting file: 33 MB
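The “store only changes from the previous line” step can be illustrated with a toy prefix-delta encoding: sorted n-grams share long prefixes, so each entry records only how many leading tokens it shares with its predecessor plus the new suffix. This scheme is an invented simplification, not NADA's actual 6-byte/2-byte binary format.

```python
# Toy prefix-delta encoding for a sorted list of n-grams.

def delta_encode(sorted_ngrams):
    encoded, prev = [], []
    for ngram in sorted_ngrams:
        shared = 0  # leading tokens shared with the previous n-gram
        while (shared < min(len(prev), len(ngram))
               and prev[shared] == ngram[shared]):
            shared += 1
        encoded.append((shared, ngram[shared:]))
        prev = ngram
    return encoded

def delta_decode(encoded):
    decoded, prev = [], []
    for shared, suffix in encoded:
        ngram = prev[:shared] + suffix
        decoded.append(ngram)
        prev = ngram
    return decoded

ngrams = [["can", "make", "it", "in"],
          ["can", "make", "it", "into"],
          ["can", "make", "them", "in"]]
encoded = delta_encode(ngrams)
```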
NADA versus Other Systems
[Chart: precision, recall, F-score and accuracy (35-85% range) for Paice & Husk, Charniak & Elsner, and NADA]
Part 1: Conclusion
• N-gram data is better than search engines
• Classifiers with N-gram counts are very effective, particularly on new domains
• But we needed a large corpus of manually-annotated data to learn how to use the counts
• We'll see now how bilingual data can provide the supervision (for some problems)
Part 2: Coordination Ambiguity in NPs
1) [dairy and meat] production
2) [sustainability] and [meat production]
yes: “dairy production” in (1); no: “sustainability production” in (2)
• New semantic features from raw web text, and a new approach to using bilingual data as soft supervision
[Bergsma, Yarowsky & Church, ACL 2011]
Coordination Ambiguity
• Words whose POS tags match the pattern: [DT|PRP$] (N.*|J.*) and [DT|PRP$] (N.*|J.*) N.*
• Output: decide if one NP or two
• Resolving coordination is a classic hard problem
– The Treebank doesn't annotate NP-internal structure
– Modern parsers thus do very poorly on these decisions (78% for Minipar, 79% for the C&C parser)
– For training/evaluation, we patched the Treebank with Vadas & Curran ’07 NP annotations
One Noun Phrase or Two: A Machine Learning Approach
Input: “dairy and meat production” → features x
x = (…, first-noun=dairy, …, second-noun=meat, …, first+second-noun=dairy+meat, …)
h(x) = w ∙ x (predict one NP if h(x) > 0)
• Set w via training on annotated training data using some machine learning algorithm
Leveraging Web-Derived Knowledge
[dairy and meat] production
• If there is only one NP, then it is implicitly talking about “dairy production”
• Do we see this phrase occurring a lot on the web? [Yes]
sustainability and [meat production]
• If there is only one NP, then it is implicitly talking about “sustainability production”
• Do we see this phrase occurring a lot on the web? [No]
• The classifier has features for these counts
– But the web can give us more!
Features for Explicit Paraphrases
Write the construction as ❶ and ❷ ❸ (here ❶ = dairy or sustainability, ❷ = meat, ❸ = production):
Pattern ❸ of ❶ and ❷:
• high Count(production of dairy and meat)
• low Count(production of sustainability and meat)
Pattern ❷ ❸ and ❶:
• low Count(meat production and dairy)
• high Count(meat production and sustainability)
New paraphrases extending ideas in [Nakov & Hearst, 2005]
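These paraphrase patterns can be turned into count features as sketched below. The counts in the lookup table are invented for illustration; a real system queries the Google N-gram Data.

```python
# Sketch of paraphrase-pattern count features for an "N1 and N2 N3"
# coordination. Counts are invented.

ngram_counts = {
    "production of dairy and meat": 120,
    "meat production and sustainability": 95,
    # the other paraphrases are unseen
}

def paraphrase_features(n1, n2, n3):
    """Counts of paraphrases signalling one NP vs. two NPs."""
    one_np = f"{n3} of {n1} and {n2}"   # pattern: (3) of (1) and (2)
    two_np = f"{n2} {n3} and {n1}"      # pattern: (2) (3) and (1)
    return {
        "one-NP-paraphrase": ngram_counts.get(one_np, 0),
        "two-NP-paraphrase": ngram_counts.get(two_np, 0),
    }

dairy = paraphrase_features("dairy", "meat", "production")
sustain = paraphrase_features("sustainability", "meat", "production")
```

A high one-NP paraphrase count supports the [dairy and meat] production reading; a high two-NP count supports sustainability and [meat production].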
Training Examples
[Diagram: training examples (“conservation and good management”, “motor and heating fuels”, “freedom and security agenda”) are combined with the Google N-gram Data to produce feature vectors x1, x2, x3, x4; machine learning over the small human-annotated data and the HUGE raw data yields the classifier h(x)]
Using Bilingual Data
• Bilingual data: a rich source of paraphrases
dairy and meat production → producción láctea y cárnica
• Build a classifier which uses bilingual features
– Applicable when we know the translation of the NP
Bilingual “Paraphrase” Features
For “dairy and meat production” vs. “sustainability and meat production” (construction ❶ and ❷ ❸):
Pattern ❸ ❶ … ❷ (Spanish): Count(producción láctea y cárnica) vs. unseen
Pattern ❶ … ❸ ❷ (Italian): unseen vs. Count(sostenibilità e la produzione di carne)
Bilingual “Paraphrase” Features (continued)
Pattern ❶- … ❷❸ (Finnish): Count(maidon- ja lihantuotantoon) for “dairy and meat production” vs. unseen for “sustainability and meat production”
Training Examples
[Diagram: the same training examples (“conservation and good management”, “motor and heating fuels”, “freedom and security agenda”) are now combined with translation data to produce feature vectors x1, x2, x3, x4; machine learning over the small human-annotated data and the medium-sized bilingual data yields the classifier h(xb)]
[Diagram: two classifiers trained on the same labelled examples (“insurrection and regime change”, “coal and steel money”, “North and South Carolina”, “business and computer science”, “the Bosporus and Dardanelles straits”, “rocket and mortar attacks”, “the environment and air transport”, “pollution and transport safety”): h(xb) with features from the translation data and h(xm) with features from the Google data; a pool of unlabelled bitext examples feeds both]
[Diagram: the bilingual classifier h(xb)1 labels bitext examples (e.g. “business and computer science”, “the Bosporus and Dardanelles straits”, “the environment and air transport”) and these are added to the training data of the monolingual classifier h(xm)]
[Diagram: in the next round, the retrained classifiers h(xb)1 and h(xm)1 label further bitext examples (e.g. “insurrection and regime change”, “North and South Carolina”, “pollution and transport safety”) for each other, and the two classifiers continue to iterate]
Co-Training: [Yarowsky’95], [Blum & Mitchell’98]
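The iteration above is the classic co-training loop: two classifiers over different feature views alternately label the unlabelled examples they are most confident about for each other. The toy sketch below uses invented one-dimensional threshold “classifiers” as stand-ins for the real monolingual and bilingual linear classifiers.

```python
# Toy co-training sketch. Each example has two feature views (here,
# single numbers); each view gets a trivial midpoint-threshold
# classifier. Illustrative only, not the paper's actual setup.

def train_threshold(examples):
    """Fit a midpoint threshold between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    return 1 if x > threshold else 0

def confidence(threshold, x):
    return abs(x - threshold)

def cotrain(labeled, unlabeled, rounds=3):
    # labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]
    train_a = [(a, y) for (a, b), y in labeled]
    train_b = [(b, y) for (a, b), y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        th_a, th_b = train_threshold(train_a), train_threshold(train_b)
        # Classifier A labels its most confident example for classifier B
        best = max(pool, key=lambda ab: confidence(th_a, ab[0]))
        pool.remove(best)
        train_b.append((best[1], predict(th_a, best[0])))
        # ... and vice versa
        if pool:
            best = max(pool, key=lambda ab: confidence(th_b, ab[1]))
            pool.remove(best)
            train_a.append((best[0], predict(th_b, best[1])))
    return train_threshold(train_a), train_threshold(train_b)

labeled = [((0.9, 0.8), 1), ((0.1, 0.2), 0)]
unlabeled = [(0.95, 0.7), (0.05, 0.3), (0.8, 0.9)]
th_a, th_b = cotrain(labeled, unlabeled)
```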
[Chart: error rate (%) of the co-trained classifiers h(xm)i and h(xb)i over co-training iterations i]
Error rate (%) on Penn Treebank (PTB)
[Chart: error rates (0-20% range) for broad-coverage parsers; Nakov & Hearst (2005), unsupervised; Pitler et al. (2010), 800 PTB training examples; the new supervised mono classifier, 800 PTB training examples; and the co-trained mono classifier h(xm)N, 2 training examples]
Part 2: Conclusion
• Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases
– New paraphrase features
• New way to use bilingual data as soft supervision to guide the use of monolingual features
Part 3: Using visual data to learn the meaning of words
• Large volumes of visual data also reveal meaning (semantics), but in a language-universal way
• Humans label their images as they post them online, providing the word-meaning link
• There are lots of images to work with [from Facebook's Twitter feed]
English Web Images ↔ Spanish Web Images
turtle ↔ tortuga
candle ↔ vela
cockatoo ↔ cacatúa
[Bergsma and Van Durme, IJCAI 2011]
Linking bilingual words by web-based visual similarity
Step 1: Retrieve online images via Google Image Search (in each language), 20 images per word
– Google is competitive with “hand-prepared datasets” [Fergus et al., 2005]
Step 2: Create Image Feature Vectors
Color histogram features
Step 2: Create Image Feature Vectors
SIFT keypoint features
Using David Lowe’s software [Lowe, 2004]
Step 3: Compute an Aggregate Similarity for Two Words
[Diagram: vector cosine similarities between image pairs (e.g. 0.33, 0.55, 0.19, 0.46); take the best match for each English image, then average over all English images]
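The aggregation step can be sketched as follows; the toy 2-dimensional vectors stand in for the actual color-histogram and SIFT feature vectors.

```python
# Aggregate visual similarity: for each English image vector, take the
# cosine similarity of its best-matching foreign image, then average
# over all English images. Vectors here are invented stand-ins.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def aggregate_similarity(english_images, foreign_images):
    """Average, over English images, of the best cosine match."""
    best = [max(cosine(e, f) for f in foreign_images)
            for e in english_images]
    return sum(best) / len(best)

english = [[1.0, 0.0], [0.8, 0.6]]
foreign = [[1.0, 0.1], [0.0, 1.0]]
score = aggregate_similarity(english, foreign)
```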
Output: Ranking of Foreign Translations by Aggregate Visual Similarities
English: rosary
Spanish: 1. camándula (0.151), 2. puntaje (0.140), 3. accidentalidad (0.139), …
French: 1. chapelet (0.213), 2. activité (0.153), 3. rosaire (0.150), …
Lots of details in the paper:
• Finding a class of words where this works (physical objects)
• Comparing visual similarity to string similarity (cognate finder)
Task #2: Lexical Semantics from Images
Can you eat “migas”?
Can you eat “carillon”?
Can you eat “mamey”?
Selectional Preference:
Is noun X a plausible object for verb Y?
[Bergsma and Goebel, RANLP 2011]
Conclusion
• Robust NLP needs to look beyond human-annotated data to exploit large corpora
• Size matters:
– Many NLP systems are trained on 1 million words
– We use:
• billions of words in bitexts
• trillions of words of monolingual text
• hundreds of billions of online images (at ~1000 words each, that's 100 trillion words!)
Questions + Thanks
• Gold sponsors
• Platinum sponsors (collaborators): Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)