my grad school experience
DESCRIPTION
My Grad School Experience. By Shereen Khoja. Introduction. I will talk about: Research for my M.Sc. Research for my Ph.D. This will involve talking about: Stemming Tagging The Arabic Language. M.Sc. I started my M.Sc at the University of Essex in 1997. University of Essex. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/1.jpg)
My Grad School Experience
By
Shereen Khoja
![Page 2: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/2.jpg)
February 7, 2005
Introduction
• I will talk about:– Research for my M.Sc.– Research for my Ph.D.
• This will involve talking about:– Stemming– Tagging– The Arabic Language
![Page 3: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/3.jpg)
February 7, 2005
M.Sc
• I started my M.Sc at the University of Essex in 1997
![Page 4: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/4.jpg)
February 7, 2005
University of Essex
• The University of Essex is located in Colchester in the south east of England
![Page 5: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/5.jpg)
February 7, 2005
Colchester
• Colchester is Britain’s oldest recorded city
![Page 6: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/6.jpg)
February 7, 2005
The University of Essex
• The university has 9,100 students, 25% in the graduate programs
• The computer science department has 34 faculty members
• There were 120 students on the M.Sc. program
![Page 7: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/7.jpg)
February 7, 2005
M.Sc. Courses
• I completed 7 courses while I was at the University of Essex– Computer Networks– Computer Vision– Expert Systems– Distributed AI & Artificial Life– Machine Learning– Neural Networks– Natural Language Processing
![Page 8: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/8.jpg)
February 7, 2005
M.Sc Research• I decided to do my research in the area of Natural
Language Processing
• Natural Language Processing is a broad area of AI that focuses on how computers process language
• NLP has many sub-areas such as:– Computational Linguistics– Speech Synthesis– Speech Recognition– Information Retrieval
![Page 9: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/9.jpg)
February 7, 2005
M.Sc Research
• My M.Sc research was conducted under the supervision of Professor Anne De Roeck
• She was already supervising a Ph.D student who was working on an Arabic language system
• It was decided that I would work on an Arabic language stemmer
![Page 10: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/10.jpg)
February 7, 2005
Stemming
• What is stemming?– Stemming is the process of removing a
words’ prefixes and suffixes to extract the root or stem
– Computers– Compute– Computing
![Page 11: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/11.jpg)
February 7, 2005
Uses of Stemming
• Compression
• Text Searching
• Spell Checking
• Text Analysis
![Page 12: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/12.jpg)
February 7, 2005
Arabic Language
• Arabic is an old language
• The language hasn’t changed much in 1400 years
• Arabic is written from right to left
![Page 13: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/13.jpg)
February 7, 2005
Arabic Language
• Arabic is a cursive language
• The shape of the letters change depending on whether they are at the beginning, the middle or the end of the word
![Page 14: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/14.jpg)
February 7, 2005
Arabic Language
• 28 consonants in Arabic
• 3 of these are used as long vowels
• A number of short vowels or diacritics– drs– darasa– durisa– dars
![Page 15: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/15.jpg)
February 7, 2005
The Arabic Language
• Arabic is a Semitic language, so words are built up from, and can be analysed down to roots following fixed patterns
• Patterns add prefixes, suffixes and infixes to the roots.
• Examples of words following the pattern Ma12oo3:– ktb maktoob– drs madroos
![Page 16: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/16.jpg)
February 7, 2005
The Arabic Stemmer
• I developed an Arabic stemmer
• The stemmer used language rules
• These rules are based on the Arabic grammar, which hasn’t changed for 1,400 years
• The stemmer was developed in Visual C++
![Page 17: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/17.jpg)
February 7, 2005
Difficulties
• Words that do not have roots
• Root letters that are deleted
• Root letters that change
![Page 18: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/18.jpg)
February 7, 2005
The Arabic Stemmer
• I released the stemmer under the GNU public licence
• The stemmer has been used by the following:– The University of Massachusetts– MitoSystems
![Page 19: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/19.jpg)
February 7, 2005
Ph.D.
• I began my Ph.D. research at Lancaster University in 1998
![Page 20: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/20.jpg)
February 7, 2005
Lancaster University
• Lancaster University is located in the city of Lancaster in the north west of England
![Page 21: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/21.jpg)
February 7, 2005
Lancaster
• Lancaster also has a castle
• This castle is used as a prison
![Page 22: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/22.jpg)
February 7, 2005
Corpus Linguistics
• For the last 25 years, professors at Lancaster University have been conducting research in the area of computer corpus linguistics
![Page 23: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/23.jpg)
February 7, 2005
Computer Corpus Linguistics
• Computer Corpus Linguistics is the sub-discipline of Computational Linguistics that utilises large quantities of texts to heighten the understanding of linguistic phenomena.
• A corpus is a collection of texts that has been assembled for linguistic analysis
![Page 24: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/24.jpg)
February 7, 2005
Annotated Corpora
• Corpora are not much use to linguists in their raw form
• Annotated corpora are richer and more useful
![Page 25: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/25.jpg)
February 7, 2005
Uses of Corpora
• Speech Synthesis
• Speech Recognition
• Lexicography
• Machine-aided Translation
• Information Retrieval
• EFL
![Page 26: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/26.jpg)
February 7, 2005
Types of Annotation
[[S[NP Pierre Vinken],[ADJP[NP 61 years] old,]]will[VP join[NP the board][PP as[NP a nonexecutive director]][NP Nov.29]]].]
• Grammatical Annotation
• Prosodic Annotation
• Syntactic Annotation• Semantic Annotation
The_AT0 cat_NN1 sat_VVD on_PRP the_AT0 chair_NN1
Joanna 231112stubbed 21072-31246out 21072-31246her 0cigarette 2111014with 0unnecessary 317fierceness 227052
231112 Personal Name21072 Physical Activity31246 Temperature0 Low Content Word2111014 Luxury Items317 Causality / Chance227052 Anger
![Page 27: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/27.jpg)
February 7, 2005
Part-of-Speech Tagging
• POS tagging is the process of automatically assigning POS tags to words in running text
• An example:– Where are you from?– Where_AVQ are_VBB you_PNP from_PRP ?_PUN
• POS taggers have been developed for English and other Indo-European with varying degrees of success
![Page 28: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/28.jpg)
February 7, 2005
Techniques Used in POS Tagging• Statistical Techniques
• Rule-Based Techniques
• Machine Learning
• Neural Networks
![Page 29: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/29.jpg)
February 7, 2005
Arabic Corpus Linguistics
• Arabic Corpora– No publicly available Arabic corpora– No Arabic POS tagger
• Arabic POS Taggers– POS taggers are being developed at New
Mexico State University and Alexandria University
![Page 30: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/30.jpg)
February 7, 2005
Developing the Arabic POS Tagger• Developing a manual tagger
– Used to create a training corpus
• Defining an Arabic tagset
• Developing the automatic Arabic POS Tagger
![Page 31: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/31.jpg)
February 7, 2005
The Arabic Tagset
• Follows the tagging system that has been used for fourteen centuries
• Noun, verb and particle
• All subclasses of these three tags inherit properties from the parent class
• Many subclasses are described in Arabic such as nouns of instrument and nouns of place and time
![Page 32: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/32.jpg)
PunctuationResidualParti
cleVerbNoun
Word
![Page 33: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/33.jpg)
AdjectiveCommon Proper Pronoun Numeral
PunctuationResidualParti
cleVerbNoun
Word
![Page 34: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/34.jpg)
DemonstrativePersonal Rel
ative
AdjectiveCommon Proper Pronoun Numeral
PunctuationResidualParti
cleVerbNoun
Word
![Page 35: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/35.jpg)
Specific Common
DemonstrativePersonal Rel
ative
AdjectiveCommon Proper Pronoun Numeral
PunctuationResidualParti
cleVerbNoun
Word
![Page 36: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/36.jpg)
Numerical
AdjectiveCardinal
Ordinal
AdjectiveCommon Proper Pronoun Numeral
PunctuationResidualParti
cleVerbNoun
Word
![Page 37: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/37.jpg)
ImperativePerfect Imperfect
PunctuationResidualParti
cleVerbNoun
Word
![Page 38: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/38.jpg)
Prepositions Explanations AnswersSubordinates Adverbi
al
PunctuationResidualParti
cleVerbNoun
Word
![Page 39: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/39.jpg)
Negatives Exceptions Interjections Conjunctions
PunctuationResidualParti
cleVerbNoun
Word
![Page 40: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/40.jpg)
Numerals Mathematical
Formulae Foreign
PunctuationResidualParti
cleVerbNoun
Word
![Page 41: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/41.jpg)
CommaExclamation
MarkQuestion
Mark
PunctuationResidualParti
cleVerbNoun
Word
![Page 42: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/42.jpg)
February 7, 2005
Arabic POS TaggerPlain Arabi
c Text
ManTagTrainin
g Corpus
Lexicons
Probability Matrix
APTUntagged
Arabic Corpus
Tagged
Corpus
DataExtract
![Page 43: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/43.jpg)
February 7, 2005
DataExtract Process
• Takes in a tagged corpus and extracts various lexicons and the probability matrix– Lexicon that includes all clitics.
• (Sprout, 1992) defines a clitic as “a syntactically separate word that functions phonologically as an affix”
– Lexicon that removes all clitics before adding the word
![Page 44: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/44.jpg)
February 7, 2005
DataExtract Process
• Produces a probability matrix for various levels of the tagset– Lexical probability: probability of a word
having a certain tag– Contextual probability: probability of a tag
following another tag
![Page 45: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/45.jpg)
February 7, 2005
DataExtract Process
N V P No. Pu.
N 0.711 0.065 0.143 0.010 0.071
V 0.926 0.037 0.0 0.008 0.029
P 0.689 0.199 0.085 0.016 0.011
No. 0.509 0.06 0.098 0.009 0.324
Pu. 0.492 0.159 0.152 0.046 0.151
![Page 46: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/46.jpg)
February 7, 2005
Arabic Corpora
• 59,040 words of the Saudi ``al-Jazirah'' newspaper, dated 03/03/1999
• 3,104 words of the Egyptian ``al-Ahram'' newspaper, date 25/01/2000
• 5,811 words of the Qatari ``al-Bayan'' newspaper, date 25/01/2000
• 17,204 words of al-Mishkat, an Egyptian published paper in social science, April 1999
![Page 47: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/47.jpg)
February 7, 2005
APT: Arabic Part-of-speech Tagger
Arabic Words
LexiconLookup
Stemmer
StatisticalComponent
Words with
multiple tags
Words with a unique
tag
![Page 48: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/48.jpg)
February 7, 2005
Result of Lexicon Lookup and Stemmer
• The Arabic word fhm– Would be assigned the tag of conjunction and
personal pronoun by the morphological analyser to mean “so, they”
– Would be found in the lexicon as the noun to mean “understand”
– Would be found in the lexicon as the conjunction and verb meaning “so, he prepared”
![Page 49: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/49.jpg)
February 7, 2005
w0 w1 w2 w3 w4
t0 t11 t21 t31 t4
t12 t22 t32
t33
Statistical Component
Statistical Component. Hidden Markov Models.
The probability of the words w0-w4 been tagged with t0, t11, t21, t31 and t4 is:
P(w0 | t0) * P(t11 | t0….) * P(w1 | t11 ) * P(t21 | t11….) * P(w2 | t21 ) * P(t31 | t21….) * P(w3 | t31 ) * P(t4 | t31….) * P(w4 | t4 )
![Page 50: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/50.jpg)
February 7, 2005
Initial Results11%
3%
22%64%
10%3%
23%64%
17%
1%
22%
60%
21%
2%
18%
59%
Stemmed
Unambiguous
Unknown
Ambiguous
Al-Jazirah
Al-Bayan Al-Mishkat
Al-Ahram
![Page 51: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/51.jpg)
February 7, 2005
Other Findings
63%7%
25%
2%
3%
NounsVerbsParticlesPunctuationResidual
Distribution of Tags
![Page 52: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/52.jpg)
February 7, 2005
Summary
• For my M.Sc. I developed an Arabic language stemmer
• For my Ph.D. I developed an Arabic part-of-speech tagger– The tagger used the stemmer
![Page 53: My Grad School Experience](https://reader035.vdocuments.mx/reader035/viewer/2022062723/56813bf5550346895da53f9e/html5/thumbnails/53.jpg)
February 7, 2005
Advice
• Stay focused
• Do not procrastinate
• Make use of all the resources around you
• Write down everything
• More importantly, organise everything that you write down