text is fun: statistical exploration of large...
TRANSCRIPT
![Page 1: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/1.jpg)
Text is fun: Statistical exploration of large corpora
Siva Reddy
Lexical Computing Ltd, UKhttp://sketchengine.co.uk
IIIT-Hyderabad Advanced School onNatural Language Processing
July 14 2012
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 1 / 30
![Page 2: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/2.jpg)
Acknowledgments
Adam Kilgarriff Michael Rundell
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 2 / 30
![Page 3: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/3.jpg)
What is “meaning”?
Semantics: Study of meaning in language.
Lexical semantics: Study of meaning of words.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 3 / 30
![Page 4: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/4.jpg)
What is “meaning”?
Semantics: Study of meaning in language.
Lexical semantics: Study of meaning of words.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 3 / 30
![Page 5: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/5.jpg)
What is “meaning”?
Semantics: Study of meaning in language.
Lexical semantics: Study of meaning of words.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 3 / 30
![Page 6: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/6.jpg)
What is “meaning”?
Semantics: Study of meaning in language.
Lexical semantics: Study of meaning of words.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 3 / 30
![Page 7: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/7.jpg)
How are dictionaries built in pre-computer era?
James Murray and colleagues: Oxford English Dictionary
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 4 / 30
![Page 8: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/8.jpg)
How are dictionaries built in pre-computer era?
Storage of Evidences
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 5 / 30
![Page 9: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/9.jpg)
How are dictionaries built in pre-computer era?
IndexingSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 6 / 30
![Page 10: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/10.jpg)
Revolution: Internet Era
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 7 / 30
![Page 11: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/11.jpg)
Dictionary building: Requirements
Corpus (Text) Collection
Wordlist
Evidence collection: Words in action.
Word Profiles
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 8 / 30
![Page 12: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/12.jpg)
Dictionary building: Requirements
Corpus (Text) Collection
Wordlist
Evidence collection: Words in action.
Word Profiles
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 8 / 30
![Page 13: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/13.jpg)
Web as Corpus: Challenges
Crawling
Text extraction
Spamming
Duplication
Exercise 1: WebBootCaTCollect corpus from web on a topic of interest.(Baroni et al., 2006; Kilgarriff et al., 2010)
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 9 / 30
![Page 14: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/14.jpg)
Web as Corpus: Challenges
Crawling
Text extraction
Spamming
Duplication
Exercise 1: WebBootCaTCollect corpus from web on a topic of interest.(Baroni et al., 2006; Kilgarriff et al., 2010)
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 9 / 30
![Page 15: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/15.jpg)
Wordlist
Generalized dictionary
Domain-specific dictionary
Exercise 2: Keyword Extraction
Collect keywords from the corpus you collected above.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 10 / 30
![Page 16: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/16.jpg)
Wordlist
Generalized dictionary
Domain-specific dictionary
Exercise 2: Keyword Extraction
Collect keywords from the corpus you collected above.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 10 / 30
![Page 17: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/17.jpg)
Evidence collection
Words in action
Google like searching isn’t enough
Get all the word forms of test?
Words which are at a distance of three from test?
Corpus Query Language: regular expressions
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 11 / 30
![Page 18: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/18.jpg)
Regular expressions
Regular Expression Table:
http://bit.ly/KZT7Kj
Exercise 3: Write regular expressions for . . .
http://sketchengine.co.uk/exercises/regex/
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 12 / 30
![Page 19: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/19.jpg)
CQL: Corpus Query Language
query pattern matching set of tokens
tokens have attributes (word, lemma, tag, lempos, lc)
[attribute="value"] for each token pattern
value is a regular expression
Additional Pointershttp://bit.ly/LPRuju
http://trac.sketchengine.co.uk/wiki/SkE/CorpusQuerying
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 13 / 30
![Page 20: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/20.jpg)
Corpus Processing: Challenges
What are the noun forms of the word test?
Will "test.*" work?
Word Tokenization
Morphological analysis
Part-of-Speech Tagging
CQL: [lemma="treat" & tag="N.*"]
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 14 / 30
![Page 21: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/21.jpg)
Corpus Processing: Challenges
What are the noun forms of the word test?
Will "test.*" work?
Word Tokenization
Morphological analysis
Part-of-Speech Tagging
CQL: [lemma="treat" & tag="N.*"]
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 14 / 30
![Page 22: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/22.jpg)
Collocations (word associations)
When do you say a word A is important to word B?
mouse: laser
mouse: food
Exercise 4: Collocations of the words girl and boy?
Download data from http://sivareddy.in/textisfun.tgz
Rank context words using mutual informationa: P(x ,y)P(x)P(y)
aRemoved log for simplicity
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 15 / 30
![Page 23: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/23.jpg)
Collocations (word associations)
When do you say a word A is important to word B?
mouse: laser
mouse: food
Exercise 4: Collocations of the words girl and boy?
Download data from http://sivareddy.in/textisfun.tgz
Rank context words using mutual informationa: P(x ,y)P(x)P(y)
aRemoved log for simplicity
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 15 / 30
![Page 24: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/24.jpg)
Word Sketch - a profile describing collocations
Word Sketch of write-v http://bit.ly/KUCBFj
The voice of the majority
Sketch Grammar: describes the frequent constructions of words inlanguage
Exercise 5: Objects of eat-v?
Write the Sketch Grammar capturing object relation?
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 16 / 30
![Page 25: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/25.jpg)
My near-dream for Indian languages?
Writing Sketch Grammar is not so time-taking.
Exploit Sketch Grammar to build Syntactic Parser
A parser for every language
Cash the similarities between different languages
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 17 / 30
![Page 26: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/26.jpg)
When do you say two words are similar?
Distributional Hypothesis (Harris, 1954)
The words that occur in similar contexts tend to have similar meaning
e.g: laptop, computer
Backbone for Vector Space Model of Semantics.
Firth (Firth, 1957)
You shall know a person from his friends - Chinese Proverb
You shall know a word from its context - Firth’s Principle
Bag of words hypothesis
Two documents tend to be similar if they have similar distribution of similarwords
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 18 / 30
![Page 27: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/27.jpg)
When do you say two words are similar?
Distributional Hypothesis (Harris, 1954)
The words that occur in similar contexts tend to have similar meaning
e.g: laptop, computer
Backbone for Vector Space Model of Semantics.
Firth (Firth, 1957)
You shall know a person from his friends - Chinese Proverb
You shall know a word from its context - Firth’s Principle
Bag of words hypothesis
Two documents tend to be similar if they have similar distribution of similarwords
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 18 / 30
![Page 28: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/28.jpg)
Vector Space Models (VSMs) of Semantics
Interpret semantics using VSMBackbone: Distributional Hypothesis
Text entity (we are interested in) as a Vector (point) in dimensional space.
Context of the entity as dimensionsExisting methods represent knowledge in VSMs mainly in three types(Turney and Pantel, 2010)
term-documentterm-contextpair-pattern
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 19 / 30
![Page 29: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/29.jpg)
Vector Space Models (VSMs) of Semantics
Interpret semantics using VSMBackbone: Distributional Hypothesis
Text entity (we are interested in) as a Vector (point) in dimensional space.
Context of the entity as dimensionsExisting methods represent knowledge in VSMs mainly in three types(Turney and Pantel, 2010)
term-documentterm-contextpair-pattern
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 19 / 30
![Page 30: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/30.jpg)
Vector Space Models (VSMs) of Semantics
Interpret semantics using VSMBackbone: Distributional Hypothesis
Text entity (we are interested in) as a Vector (point) in dimensional space.
Context of the entity as dimensionsExisting methods represent knowledge in VSMs mainly in three types(Turney and Pantel, 2010)
term-documentterm-contextpair-pattern
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 19 / 30
![Page 31: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/31.jpg)
Term-Document: (Salton et al., 1975)
1
d1: Human machine interface for Lab ABC computer applications
1Image courtesy: (Landauer et al., 1998)Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 20 / 30
![Page 32: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/32.jpg)
Term-Document: (Salton et al., 1975)
2
Document similarity can be found using Cosine similarity
sim(D1,D2) = D1.D2‖D1‖‖D2‖
2Image courtesy: (Salton et al., 1975)Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 21 / 30
![Page 33: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/33.jpg)
Term-Document: (Salton et al., 1975)
2
Document similarity can be found using Cosine similarity
sim(D1,D2) = D1.D2‖D1‖‖D2‖
2Image courtesy: (Salton et al., 1975)Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 21 / 30
![Page 34: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/34.jpg)
Term-Context: Word Space Model
Meaning of a word as a vector (Schütze, 1998)
Meaning of a word is represented as a cooccurrence vector built from a corpus
police-n photon-n speed-n car-n soul-nTraffic 142 0 293 347 1Light 41 29 222 198 50TrafficLight 5 0 13 48 0
Exercise 6: Compute similarity between girl, boy, dog
Hint: Represent words as vectors using mutual information scores of contextwords, and compute Cosine similarity.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 22 / 30
![Page 35: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/35.jpg)
Term-Context: Word Space Model
Meaning of a word as a vector (Schütze, 1998)
Meaning of a word is represented as a cooccurrence vector built from a corpus
police-n photon-n speed-n car-n soul-nTraffic 142 0 293 347 1Light 41 29 222 198 50TrafficLight 5 0 13 48 0
Exercise 6: Compute similarity between girl, boy, dog
Hint: Represent words as vectors using mutual information scores of contextwords, and compute Cosine similarity.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 22 / 30
![Page 36: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/36.jpg)
Word Senses
So far we represented a word with a single word sketch
mouse vs mouse?
Word Sense Disambiguation: collocations are the clue
WordNet have been used extensively
Can we guess the number of senses of a word?
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 23 / 30
![Page 37: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/37.jpg)
Word Sense Induction
Figure: Word Sense Induction in a Graph based setting
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 24 / 30
![Page 38: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/38.jpg)
Semantic Word Sketches
Semantic FramesDemo: http://corpdev.sketchengine.co.uk/run.cgi/first_form?corpname=5dcaa5fe
Exercise 7: abstract entities which modify boy and girl
Use word sense of context words as clue.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 25 / 30
![Page 39: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/39.jpg)
Beyond Words: Compositional Semantics
Given meanings of
couch
roast
potato
Can we interpret the meanings of
couch potato
roast potato
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 26 / 30
![Page 40: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/40.jpg)
Beyond Words: Compositional Semantics
Given meanings of
couch
roast
potato
Can we interpret the meanings of
couch potato
roast potato
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 26 / 30
![Page 41: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/41.jpg)
Couch Potato
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 27 / 30
![Page 42: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/42.jpg)
Roast Potato
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 28 / 30
![Page 43: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/43.jpg)
Bibliography I
Baroni, M., Kilgarriff, A., Pomikalek, J., and Rychly, P. (2006). Webbootcat:Instant domain-specific corpora to support human translators. InProceedings of the 11th Annual Conference of the European Association forMachine Translation (EAMT), Norway.
Firth, J. R. (1957). A Synopsis of Linguistic Theory, 1930-1955. Studies inLinguistic Analysis, pages 1–32.
Harris, Z. S. (1954). Distributional structure. Word, 10:146–162.
Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory formany languages. In Proceedings of the Seventh International Conferenceon Language Resources and Evaluation (LREC’10), Valletta, Malta.
Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latentsemantic analysis. Discourse Processes, 25:259–284.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model forautomatic indexing. Commun. ACM, 18:613–620.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 29 / 30
![Page 44: Text is fun: Statistical exploration of large corporaltrc.iiit.ac.in/iasnlp2012/slides/siva/IASNLP_TexTisFun.pdfSiva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration](https://reader033.vdocuments.mx/reader033/viewer/2022041801/5e5130b7bfbb1e7e771497f6/html5/thumbnails/44.jpg)
Bibliography II
Schütze, H. (1998). Automatic Word Sense Discrimination. ComputationalLinguistics, 24(1):97–123.
Turney, P. D. and Pantel, P. (2010). From frequency to meaning: vector spacemodels of semantics. J. Artif. Int. Res., 37:141–188.
Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 30 / 30