MULTILINGUAL ACQUISITION OF
STRUCTURED INFORMATION VIA NOVEL
RELATIONSHIP EXTRACTION MODELS
OVER DIVERSE KNOWLEDGE SOURCES
by
Nikesh Lucky Garera
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.
Baltimore, Maryland
September, 2009
© Nikesh Lucky Garera 2009
All rights reserved
Abstract
This dissertation presents original techniques for a class of problems that can be
collectively referred to as relationship extraction. This machine learning task involves
extracting tuples from free text, the exemplar instantiations of which help model the
target relationship. A wide range of relationships are explored, including semantic
relationships between words, their translation equivalents in different languages and
encyclopedic facts about named entities.
This dissertation explores new relationship extraction models which exploit novel
knowledge sources across a diverse set of relationship types in multiple languages. It
ties together extraction of diverse relationships in the classic seed-based minimally
supervised framework. However, this framework has previously failed to capture in-
formation beyond local context such as transitively-derived information, domain con-
straints and knowledge, correlations among relationships and additional novel knowl-
edge sources. Furthermore, the traditional seed-based learning framework fails to
extract non-overt relationships such as an author’s gender or age when they are not
explicitly stated. In contrast, some of these non-overt relationships can be inferred
with an accuracy exceeding 95% via novel document-wide, discourse-feature-based
and interlocutor-sensitive models. This dissertation presents new relationship extraction methods embedding a wide range of such knowledge sources in the minimally
supervised learning framework.
Collectively, these methods outperform previously published algorithms on a diverse
set of natural language data sources and genres including newswire text, biographical
articles, raw webpages, conversational speech transcripts and email, and on a large
set of languages including Albanian, Arabic, Bulgarian, Czech, Farsi, German, Hindi,
Hungarian, Russian, Slovak, Spanish and Swedish.
Thesis Committee:
David Yarowsky, Professor, Department of Computer Science, Johns Hopkins University
Chris Callison-Burch, Assistant Research Professor, Department of Computer Science, Johns Hopkins University
James Mayfield, Principal Research Scientist, Applied Physics Laboratory, Johns Hopkins University
Acknowledgements
I feel truly fortunate to have pursued my Ph.D. studies to completion in an aca-
demic institution and a center that has some of the best faculty and students in
natural language processing. The faculty, students, staff and the entire research environment here have helped me in numerous ways during the entire course of my doctoral
studies at Johns Hopkins University.
Before starting my program at JHU, I had read and heard about how a student's
advisor can be a critical factor during Ph.D. studies. I feel extremely lucky to have
been advised by David Yarowsky during my time at JHU. David has been a strong
pillar of support and a great mentor. He has an incredible faith in his students
and I feel proud to be counted amongst them. He has always allowed me
to work independently on various research projects of my interest, and at the same
time, he has been extremely accessible to provide close guidance whenever I needed
it. I have learned a lot from observing his practical and systematic approach towards
doing research and contributing towards the scientific community. His guidance and
insightful vision have helped me immensely throughout my time here at JHU. Thank
you David for being such an amazing advisor.
I would like to thank my thesis committee members Chris Callison-Burch and
James Mayfield for providing prompt and valuable feedback on several drafts of the
thesis. They were very supportive throughout my dissertation writing process and
their insightful questions and comments helped improve the shape and content of this
thesis considerably.
I would also like to thank Jason Eisner for the influence he has had on me during
the PhD program. Although he was not my direct advisor, he was very approachable
and open to any kind of discussion. I especially admire the enthusiasm and energy
he brings when discussing and solving research problems.
I would like to thank the National Science Foundation and Johns Hopkins Uni-
versity’s new Center of Excellence in Human Language Technology (COE) for their
financial support of my graduate work. During the last couple of years of my PhD,
I was involved with COE on various research projects. I would like to convey my
thanks to Mark Dredze, Paul McNamee, Christine Piatko, James Mayfield and other
COE members for providing a supportive research environment.
One of the best things that happened to me after coming to Hopkins was to find
the camaraderie of the students here, who are not only among the most brilliant minds
that I know of but are so much fun to hang out with even outside of work! Thanks
to V Balakrishnan, John Blatz, Anoop Deoras, Markus Dreyer, Eliott Drabek, Erin
Fitzgerald, Arnab Ghoshal, Ann Irvine, Sridhar Krishna, Zhifei Li, Gideon Mann,
Carolina Prada, Brock Pytlik, Delip Rao, Ariya Rastrow, David Smith, Noah Smith,
Jason Smith, Charles Schafer, Roy Tromble, Chris White and Omar Zaidan.
I would also like to thank the CS and CLSP staff Desiree Cleves, Debbie DeFord,
Monique Folk, Laura Graham and Cathy Thornton for helping me navigate through
all the administrative procedures.
My family has been a great source of strength and support during my graduate
studies. I want to thank my grandparents Parumal Garera and Jaidevi Garera for
their love and affection. Though my grandfather is not here in body, he was and
still is a constant source of inspiration. My parents Lucky Garera and Madhu Garera
have always valued education as a top priority for their children and their upbringing
and values are the strongest reason why I have been able to reach such heights in
my education. My in-laws Bharat Doshi and Vidya Doshi have provided their much
needed love and support along the way. My brother Deepak Garera has always
supported me in many milestones along the way and is always available to talk to
whenever I feel the need for a fun conversation!
Last, but by no means least, I would like to thank my wife Sujata Garera
for supporting me and encouraging me in just about any goal that I plan to pursue,
no matter how hard the challenge. She has been there with me through all the ups
and downs and no words are sufficient to describe how much her presence has helped
me in my life.
Contents
Abstract ii
Acknowledgements iv
List of Tables xix
List of Figures xxiv
1 Introduction 1
1.1 Types of relationships explored . . . . . . . . . . . . . . . . . . . . . 3
1.2 Basic approach: Relationship
extraction using seed exemplars . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Algorithm (Rapp 1999; Ravichandran and Hovy, 2002) . . . . 7
1.2.2 Context Representations . . . . . . . . . . . . . . . . . . . . . 9
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Internal knowledge sources . . . . . . . . . . . . . . . . . . . . 12
1.3.2 External knowledge sources . . . . . . . . . . . . . . . . . . . 14
1.4 Outline of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Part I: Cross-language relationships . . . . . . . . . . . . . . . 16
1.4.2 Part II: Semantic relationships . . . . . . . . . . . . . . . . . . 16
1.4.3 Part III: Factual relationships . . . . . . . . . . . . . . . . . . 17
I Extracting Cross-language/Translation Relationships 18
2 Part I Literature Review 19
2.1 Using Parallel Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Using Monolingual Corpora and Seed Lexicons . . . . . . . . . . . . . 21
2.3 Using Bridge Languages . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Translating Compounds by Learning Component Gloss Translation
Models via Multiple Languages 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Resources Utilized . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Splitting compound words and gloss
generation with translation lexicon
lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Using cross-language evidence from different bilingual dictionaries 32
3.4.3 Ranking translation candidates . . . . . . . . . . . . . . . . . 32
3.5 Evaluation using Exact-match
Translation Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Comparison of different compound translation models . . . . . . . . . 34
3.6.1 A simple model using literal English gloss concatenation as the
translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6.2 Using bilingual dictionaries . . . . . . . . . . . . . . . . . . . 36
3.6.3 Using forward and backward ordering for English gloss search 38
3.6.4 Increasing coverage by automatically discovering compound
morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.5 Re-ranking using context vector projection . . . . . . . . . . . 42
3.6.6 Using phrase-tables if a parallel corpus is available . . . . . . . 43
3.7 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 45
3.8 Quantifying the Role of
Cross-language Selection
and Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.8.1 Coverage/Accuracy Trade-off . . . . . . . . . . . . . . . . . . 46
3.8.2 Varying the size of bilingual dictionaries . . . . . . . . . . . . 46
3.8.3 Greedy vs Random Selection of Utilized Languages . . . . . . 49
3.8.4 Languages found using Greedy selection . . . . . . . . . . . . 52
3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Improving Translation Lexicon Induction from Monolingual Corpora
via Dependency Contexts and Part-of-Speech Equivalences 55
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Translation by Context Vector
Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Models of Context . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1.1 Baseline model . . . . . . . . . . . . . . . . . . . . . 63
4.3.1.2 Modeling context using dependency trees . . . . . . . 65
4.4 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.2 Evaluation Criterion . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6 Further Extensions: Generalizing to other word types via tagset mapping 74
4.6.1 Mapping Part-of-Speech tagsets in different languages . . . . . 76
4.7 Application to Unrelated Corpora . . . . . . . . . . . . . . . . . . . . 81
4.8 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 81
4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
II Extracting Semantic Relationships 84
5 Part II Literature Review 85
5.1 Extracting relationships in a semantic taxonomy . . . . . . . . . . . . 85
5.1.1 Manually created databases . . . . . . . . . . . . . . . . . . . 86
5.1.2 Hand-crafted Patterns for “is-a” and “part-whole” relationships 87
5.1.3 Weakly supervised approaches . . . . . . . . . . . . . . . . . . 90
5.1.4 Training Supervised Classifiers . . . . . . . . . . . . . . . . . . 91
5.1.5 Clustering Approaches . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Extracting complex semantic
relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6 Minimally Supervised Multilingual Taxonomy and Translation
Lexicon Induction 95
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1 Independently Bootstrapping Lexical
Relationship Models . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.2 A minimally supervised multi-class classifier for identifying dif-
ferent semantic relations . . . . . . . . . . . . . . . . . . . . . 103
6.3.3 Evaluation of the Classification Task . . . . . . . . . . . . . . 104
6.4 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 108
6.5 Improving a partial translation
dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7 Extraction of Semantic Facts from Unlabeled Corpora targeting
Resolution and Generation of Definite Anaphora 114
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3 Models for Lexical Acquisition . . . . . . . . . . . . . . . . . . . . . . 120
7.3.1 TheY-Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.3.2 WordNet-Model (WN) . . . . . . . . . . . . . . . . . . . . . . 123
7.3.3 Combination: TheY+WordNet Model . . . . . . . . . . . . . . 124
7.3.4 OtherY-Modelfreq . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.3.5 OtherY-ModelMI(normalized) . . . . . . . . . . . . . . . . . . 126
7.3.6 Combination: TheY+OtherYMI Model . . . . . . . . . . . . . 127
7.4 Further Anaphora Resolution Results . . . . . . . . . . . . . . . . . . 127
7.5 Generation Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.5.1 Human experiment . . . . . . . . . . . . . . . . . . . . . . . . 131
7.5.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.5.2.1 Individual Models . . . . . . . . . . . . . . . . . . . 133
7.5.2.2 Combining corpus-based approaches and WordNet . 133
7.5.3 Evaluation of Anaphor Generation . . . . . . . . . . . . . . . 135
7.6 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 136
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
III Extracting Factual Relationships 138
8 Part III Literature Review 139
8.1 Literature for Modeling Explicit
Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.1.1 Early MUC approaches: Handcrafted
Lexico-syntactic Patterns . . . . . . . . . . . . . . . . . . . . . 140
8.1.2 Machine Learning Approaches . . . . . . . . . . . . . . . . . . 141
8.1.3 Weakly Supervised Approaches using
Seed-exemplars . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2 Literature for Modeling Latent
Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.2.1 Sociolinguistic Studies . . . . . . . . . . . . . . . . . . . . . . 145
8.2.2 Computational Approaches . . . . . . . . . . . . . . . . . . . 145
9 Structural, Transitive and Correlational Models for Biographic Fact
Extraction 148
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
9.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
9.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.4 Contextual Pattern-Based Model . . . . . . . . . . . . . . . . . . . . 154
9.5 Partially Untethered Templatic
Contextual Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.6 Document-Position-Based Model . . . . . . . . . . . . . . . . . . . . 159
9.6.1 Learning Relative Ordering in the
Position-Based Model . . . . . . . . . . . . . . . . . . . . . . . 161
9.7 Implicit Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.7.1 Extracting Attributes Transitively using Neighboring Person-
Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.7.2 Latent-Attribute Models based on Document-Wide Context
Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.8 Model Combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
9.9 Further Extensions: Reducing False Positives . . . . . . . . . . . . . . 168
9.9.1 Using Inter-Attribute Correlations . . . . . . . . . . . . . . . . 170
9.9.2 Using Age Distribution . . . . . . . . . . . . . . . . . . . . . . 171
9.10 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . . 172
9.11 Extracting factual relationships from noisy sources for a wider range
of attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.11.1 Analysis of pattern learning component for fact extraction . . 174
9.11.2 Manually Filtering Patterns . . . . . . . . . . . . . . . . . . . 174
9.11.3 Filtering Noisy Patterns Automatically . . . . . . . . . . . . . 179
9.11.4 Evaluating automatic pattern filtering
measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.11.5 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.12 Application of Position-based Model to News Data . . . . . . . . . . 182
9.12.1 Corpora Details . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.12.2 Global Position Model of “Occupation”
Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.12.3 Modeling Position with respect to the First Name Mention . . 185
9.12.4 Modeling Position with respect to the
Closest Name Mention . . . . . . . . . . . . . . . . . . . . . . 185
9.12.5 Modeling Position with respect to the
Closest Full or Partial Name Mention . . . . . . . . . . . . . . 188
9.12.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.13 Using Biographical Facts for Name Disambiguation . . . . . . . . . . 190
9.14 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10 Modeling Latent Biographical Attributes in Conversational Genres 195
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
10.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
10.3 Corpus Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.4 Modeling Gender via Ngram
features (Boulis and Ostendorf, 2005) . . . . . . . . . . . . . . . . . . 204
10.4.1 Training Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.4.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.5 Modeling Based on the Partner’s Gender . . . . . . . . . . . . . . . . 208
10.5.1 Oracle Experiment . . . . . . . . . . . . . . . . . . . . . . . . 209
10.5.2 Replacing Oracle by a Homogeneous vs Heterogeneous Classifier 210
10.5.3 Modeling partner via conditional model and whole-conversation
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
10.6 Sociolinguistic Features . . . . . . . . . . . . . . . . . . . . . . . . . . 213
10.7 Gender Classification Results . . . . . . . . . . . . . . . . . . . . . . 215
10.7.1 Aggregating results per speaker via consensus voting . . . . . . 217
10.8 Effect of Self-Reporting Features on Gender Classification . . . . . . . 219
10.9 Application to Arabic Language . . . . . . . . . . . . . . . . . . . . . 221
10.9.1 Corpus Details . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.9.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.9.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.10 Application to Email Genre . . . . . . . . . . . . . . . . . . . . . . 225
10.10.1 Corpus Details . . . . . . . . . . . . . . . . . . . . . . . . . . 225
10.10.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.10.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.10.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
10.11 Modeling Other Attributes . . . . . . . . . . . . . . . . . . . . . . 229
10.11.1 Corpus details for Age and
Native Language . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.11.2 Results for Age and Native/Non-Native . . . . . . . . . . . . . 232
10.11.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.12 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
10.12.1 Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . 237
10.12.2 Baseline Approaches . . . . . . . . . . . . . . . . . . . . . . . 237
10.12.3 Ngram-based regression model . . . . . . . . . . . . . . . . . . 237
10.12.4 Sociolinguistic features . . . . . . . . . . . . . . . . . . . . . . 238
10.12.5 Top Ngram features . . . . . . . . . . . . . . . . . . . . . . . . 238
10.12.6 Multiple Binary Classifiers Across
Different Age Boundaries . . . . . . . . . . . . . . . . . . . . . 240
10.12.7 Stacked Models . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.12.7.1 Linear Combination . . . . . . . . . . . . . . . . . . 242
10.12.7.2 Regression Trees . . . . . . . . . . . . . . . . . . . . 243
10.12.8 Balancing Size of Different Age Groups in Test Set . . . . . . 245
10.13 Effect of Self-Reporting Features on Age Prediction . . . . . . . . . 246
10.14 Statistical Significance of Results . . . . . . . . . . . . . . . . . . . 247
10.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
11 Contributions and Conclusion 250
11.1 Applications and Future Work . . . . . . . . . . . . . . . . . . . . . . 255
Bibliography 258
Vita 284
List of Tables
1.1 A sample of seed pairs for the three major categories of relationships extracted. A representation of context is learned starting with such seed pairs for extracting new pairs describing the relationship as described in Section 1.2.1 . . . 7
3.1 Example lexical resources used in this task and their application to translating compound words in new languages. . . . 27
3.2 Baseline performance using unreordered literal English glosses as translations. The percentages in parentheses indicate what fraction of all the words in the test (entire) vocabulary were detected and translated as compounds. . . . 35
3.3 Coverage and accuracy for the standard model using gloss-to-fluent translation mappings learned from bilingual dictionaries in other languages (in forward order only). . . . 38
3.4 Size of various bilingual dictionaries (with other language as English) . . . 39
3.5 Performance for looking up English gloss via both orderings. The percentages in parentheses are relative improvements from the performance in Table 3.3 . . . 39
3.6 Top 15 middle glues (fillers) and end glues discovered for each language along with their probability values. Glue characters allow for appropriately splitting the compound words into the root forms of the individual components for lookup in a lexicon. . . . 41
3.7 Performance for increasing coverage by including compounding morphology. The percentages in parentheses are relative improvements from the performance in Table 3.5 . . . 42
3.8 Average performance on German and Swedish with and without using context vector similarity from monolingual corpora. . . . 43
3.9 Performance of this work’s BiDict approach compared with and augmented with traditional statistical MT learning from bitext. . . . 44
3.10 Illustrating 3-best cross-languages obtained for each test language (shown in bold). Each row shows the effect of adding the respective cross-language to the set of languages in the rows above it and the corresponding F-scores (Top 1 and Top 10) achieved. . . . 53
4.1 Contrasting context words derived from the adjacent vs dependency models for the above example . . . 65
4.2 Top 10 translation candidates for the Spanish word “camino (way)” and “crecimiento (growth)” for the best adjacent context model (Adjbow) and best dependency context model (Depposn). The bold English terms show the acceptable translations. . . . 70
4.3 Performance of various context-based models learned from monolingual corpora and phrase-table learned from parallel corpora on Noun translation. . . . 70
4.4 List of 20 most confident mappings using the dependency context based model for noun translation along with exact match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the first mapping (senores, gentlemen) is the correct one, it was not present in the lexicon used for evaluation and hence is marked as incorrect. . . . 73
4.5 Performance of dependency context-based model along with addition of part-of-speech mapping model on translating all word-types. . . . 79
4.6 List of 25 most confident mappings using the dependency context with the part-of-speech mapping model translating all word-types along with exact match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the second best mapping in Table 4.4 for noun-translation is for xenophobia with score 0.87, xenophobia is not among the 1000 most frequent words (of all word-types) and thus is not in this test set. . . . 80
6.1 Naive pattern scoring: Hyponymy patterns ranked by their raw corpus frequency scores. . . . 100
6.2 Patterns for hypernymy class re-ranked using evidence from other classes. Patterns distributed fairly evenly across multiple relationship types (e.g. “X and Y”) are deprecated more than patterns focused predominantly on a single relationship type (e.g. “Y such as X”). . . . 102
6.3 A sample of patterns and their relationship type probabilities P(class|pattern) extracted at the end of training phase for English. . . . 105
6.4 A sample of patterns and their class probabilities P(class|pattern) extracted at the end of training phase for Hindi. . . . 105
6.5 A sample of seeds used and model predictions for each class for the taxonomy induction task. For each of the model predictions shown above, its Hyponym/Meronym/Cousin classification was correctly assigned by the model. . . . 105
6.6 Overall accuracy for 4-way classification {hypernym, meronym, cousin, other} using different pattern scoring methods. . . . 107
6.7 Test set coverage and accuracy results for inducing different semantic relationship types. . . . 107
6.8 Confusion matrix for English (left) and Hindi (right) for the four-way classification task . . . 107
6.9 Accuracy on Hindi to English word translation using different transitive hypernym algorithms. The additional model components in the bi-d (bi-directional) plus Other model are only used to rerank the top 20 candidates of the bidirectional model, and are hence limited to its top-20 performance. . . . 111
6.10 A sample of correct and incorrect translations using transitive hypernymy/hyponym word translation induction . . . 112
7.1 A sample of ranked hyponyms proposed for the definite NP The drug by TheY-Model illustrating the differences in weighting methods. . . . 122
7.2 Results using different normalization techniques for the TheY-Model in isolation. (60 million word corpus) . . . 122
7.3 Accuracy and Average Rank showing combined model performance on the antecedent selection task. Corpus Size: 60 million words. . . . 124
7.4 A sample of output from different models on antecedent selection (60 million word corpus). . . . 125
7.5 Accuracy and Average Rank of Models defined in Section 7.3 on the antecedent selection task. . . . 128
7.6 Agreement of different generation models with human judge and with definite NP used in the corpus. . . . 134
7.7 Sample of decisions made by human judge and the best performing model (TheY+OtherY+WN) on the generation task. . . . 135
9.1 A sample of partially untethered and fully tethered patterns along with their precision. For some of the attributes, only 4-5 fully tethered patterns exist, but relaxing the constraint on the <hook> allows extraction of many partially tethered patterns, providing improved performance as shown in Tables 9.5 and 9.6. . . . 155
9.2 A sample of partially untethered and fully tethered patterns along with their precision. For some of the attributes, only 4-5 fully tethered patterns exist, but relaxing the constraint on the <hook> allows extraction of many partially tethered patterns, providing improved performance as shown in Tables 9.5 and 9.6. . . . 156
9.3 Majority rank of the correct attribute value in the Wikipedia pages of the seed names used for learning relative ordering among attributes satisfying the domain model . . . 159
9.4 Sample of occupation weight vectors in English and German learned using the latent-attribute-based model. . . . 165
9.5 Average Performance of different models across all biographic attributes . . . 168
9.6 Performance comparison of all the models across several biographic attributes. Bolded accuracies indicate the top-performing model. . . . 169
9.7 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 175
9.8 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 176
9.9 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 177
9.10 Sample of untethered patterns that were annotated as high quality by human annotators. . . . 178
9.11 Pattern relevance based on presence in high quality pattern list generated by human annotators. Top 5 indicates the fraction of top 5 patterns generated by the algorithm that were marked by annotators as high quality patterns. The results are averaged over all attributes. . . . 180
9.12 Name disambiguation performance for matching first or last name mentions to a Wikipedia person page . . . 192
9.13 Correlation between occupations based on number of people sharing the same occupation . . . 193
10.1 Top 20 ngram features for Gender, ranked by the weights assigned by the linear SVM model . . . 205
10.2 Difference in Gender classification accuracy between mixed gender and same gender conversations using the reference algorithm . . . 209
10.3 Performance for 4-way classification of the entire conversation into (mm, ff, mf, fm) classes using the reference algorithm on Switchboard corpus. . . . 209
10.4 Results showing improvement in accuracy of gender classifier using partner-sensitive model and sociolinguistic features . . . 216
10.5 Aggregate results on a "per-speaker" basis via majority consensus on different conversations for the respective speaker. The results on Switchboard are significantly higher due to more conversations per speaker as compared to the Fisher corpus . . . 219
10.6 Fraction of conversations containing self-reporting features such as "my wife", "my boyfriend", on different corpora. Although Fisher has a significant number of conversations with such features, they have little impact on the overall performance, as shown in Table 10.7 . . . 220
10.7 Self-reporting features for gender such as "my wife", "my boyfriend", etc. have negligible impact on the performance of gender classification. . . . 221
10.8 Gender classification results for a new language (Gulf Arabic) showing consistent improvement gains via the partner-sensitive model and sociolinguistic features. . . . 223
10.9 Application of the ngram model and sociolinguistic features for gender classification in a new genre (Email) . . . 228
10.10 Top 20 ngram features for gender classification in email, ranked by the weights assigned by the linear SVM model. See Section 10.10.4 for more details. . . . 230
10.11 Results showing improvement in the accuracy of age and native language classification using the partner-sensitive model and sociolinguistic features . . . 233
10.12 Top 25 ngram features for Age, ranked by weights assigned by the linear SVM model . . . 234
10.13 Results for age regression using different feature and model combinations. Substantial performance gains were obtained by utilizing binary classifiers across different age boundaries as features in a stacked SVM model. . . . 239
10.14 Top 20 ngram features for Age, ranked by weights assigned by the ngram-based SVM regression model . . . 240
10.15 Results for age regression using different feature and model combinations for an age-wise balanced test set. While the performance of the baseline models degrades due to higher variance, regression models show consistent performance improvements, as in Table 10.13 . . . 245
10.16 Self-reporting features such as "in thirties", "i'm fourty five", etc. have little impact. The performance after deleting such features is similar to the original model containing all ngrams as features. . . . 247
List of Figures
1.1 Thesis overview: Extracting different types of relationships from unstructured text in multiple languages into a structured multilingual knowledge base. . . . 5
3.1 Illustration of using cross-language evidence from bilingual dictionaries of different languages for compound translation. The basic approach is to translate compound words by modeling the mapping of literal component-word glosses (e.g. "iron-path") into fluent English (e.g. "railway") across multiple languages. . . . 30
3.2 Illustration of the problem with generating fluent translation candidates via compositional methods (Grefenstette, 1999; Cao and Li, 2002; Baldwin and Tanaka, 2004) . . . 37
3.3 Illustration of compounding morphology using middle and end glue characters. . . . 40
3.4 Coverage/accuracy trade-off curve obtained by incrementing the minimum number of languages exhibiting a candidate translation for the source word's literal English gloss. Accuracy here is the Top-1 accuracy averaged over all 10 test languages. . . . 47
3.5 F-measure performance given varying sizes of the bilingual dictionaries used for cross-language evidence (as a percentage of words randomly utilized from each dictionary). . . . 48
3.6 Top-1 match F-score performance utilizing K languages for cross-language evidence, for both a random K languages and greedy selection of the most effective K languages (typically the closest or largest dictionaries) . . . 50
3.7 Top-10 match F-score performance utilizing K languages for cross-language evidence, for both a random K languages and greedy selection of the most effective K languages (typically the closest or largest dictionaries) . . . 51
4.1 Illustration of the (Rapp, 1999) model for translating the Spanish word "crecimiento (growth)" via dependency context vectors extracted from respective monolingual corpora, as explained in Section 4.3.1.2 . . . 62
4.2 An illustration of a dependency tree, clearly showing the parent and child nodes. The word marked in bold ("crecimiento") is used as an example source word in the chapter for illustrative purposes, and its adjacent and dependency contexts are shown in Table 4.1. . . . 64
4.3 Precision/recall curve showing the superior performance of the dependency context model as compared to the adjacent context model at different recall points. Precision is the fraction of tested Spanish words with the Top-1 translation correct, and recall is the fraction of the 1000 Spanish words tested upon. . . . 71
4.4 Illustration of using part-of-speech tag mapping to restrict the candidate space of translations. . . . 75
4.5 Illustration of mapping the Spanish part-of-speech tagset to the English tagset. The tagsets vary greatly in notation and in the morphological/syntactic constituents represented, and need to be mapped first, using the algorithm described in Section 4.6.1. . . . 77
4.6 Precision/recall curve showing the superior performance of using part-of-speech equivalences for translating all word types. Precision is the fraction of tested Spanish words with the Top-1 translation correct, and recall is the fraction of the 1000 Spanish words tested upon. . . . 78
5.1 Example of definite anaphora resolution and generation. Both tasks require knowledge of the derived semantic relationship that "pseudoephedrine is-a drug". . . . 93
6.1 Goal: to induce multilingual taxonomy relationships in parallel in multiple languages (such as Hindi and English) for information extraction and machine translation purposes. . . . 96
6.2 Illustration of the models of using induced hyponymy and hypernymy for translation lexicon induction. . . . 109
6.3 Reducing the space of likely translation candidates of the word raaiphala by inducing its hypernym, using a partial dictionary to look up the translation of the hypernym, and generating the candidate translations as induced hyponyms in English space. . . . 110
7.1 Example of definite anaphora resolution and generation. Both tasks require knowledge of the semantic relationship that "pseudoephedrine is-a drug"; however, the resolution task is easier because there is only a limited set of candidates to choose from (shown by circled nouns). . . . 116
7.2 Illustrating the problem with WordNet for definite anaphora generation. The immediate parent and grandparent of "pseudoephedrine", "alkaloid" and "organic compound", do not serve as natural definite anaphors as compared to "drug", which is often observed in corpora. . . . 132
8.1 Illustration of the basic weakly supervised approach of Ravichandran and Hovy (2002) for fact extraction. Using a few seeds of the fact in question, contextual patterns occurring with the seeds are extracted and ranked based on their distribution in monolingual corpora. New pairs exhibiting the given fact (for example, occupation) can then be extracted using co-occurrence with these patterns. . . . 142
9.1 Goal: extracting attribute-value biographic fact pairs from biographic free text . . . 150
9.2 Distribution of the observed document mentions of Deathdate, Nationality and Religion. . . . 160
9.4 Illustration of modeling "occupation" and "nationality" transitively via consensus from the attributes of neighboring names . . . 162
9.3 Empirical distribution of the relative position of the correct (seed) answers among all text phrases satisfying the domain model for "birthplace" and "death date". . . . 163
9.5 Age distribution of famous people on the web (from www.spock.com) . . . 171
9.6 Global position of the "occupation" attribute in the New York Times articles. The position is given as the fraction of the article length on the X-axis, and the Y-axis gives the number of times an "occupation" attribute was found in that fraction. . . . 184
9.7 Distribution of the "occupation" attribute from the first full mention of the name in the New York Times articles. . . . 186
9.8 Distribution of the "occupation" attribute from the closest full mention of the name in the New York Times articles. . . . 187
9.9 Distribution of the "occupation" attribute from the closest full or partial (first name or last name) mention of the name in the New York Times articles. . . . 189
9.10 Application of biographical attributes for name disambiguation: disambiguating a mention of "Phil Collins" to the correct Wikipedia entry using the premodifying occupation "rider". Similarly, other biographical attributes, such as the nationality premodifier "British", can also be used for disambiguation. This can be further improved by using compatible occupations, as shown in Table 9.13. . . . 191
10.1 A snippet of a Fisher telephone transcript between a female (A) and male (B) speaker. The first two fields indicate the start time and stop time, and the third field contains the utterance. . . . 198
10.2 The effect of varying the amount of each conversation side utilized for training, based on the utilized percentage of each conversation, starting from the beginning of the conversation. While one would expect the accuracy to improve linearly with increased training data, the anomaly involving the flat portion in the middle could be due to the fact that Fisher and Switchboard participants were complete strangers. The initial ramp-up in the curve is probably due to the addition of speaker data starting from no data at all, and the flat portion is probably due to the time taken for the speakers to become familiar and speak comfortably with each other, after which the discourse features for speaker attributes become more prominent. Another reason could be that the middle portion corresponds to discussion of a specific topic given to the speakers; after they have spoken enough about the topic, the speakers may move on to more gender-biased topics of their choice. . . . 207
10.3 People use stronger gender-specific discourse properties when speaking to someone of a similar gender. Stacking whole-conversation and partner-conditioned models as shown above allows modeling of such behavior. The common graphic utilized for the individual SVM classifiers first appeared in (Ustun, 2003). . . . 211
10.4 Empirical differences in sociolinguistic features for Gender on the Switchboard corpus . . . 213
10.5 Aggregating results over all the conversations of a given speaker via consensus voting, as explained in Section 10.7.1. One can also utilize other ways of combining evidence, such as length-weighted voting, confidence-weighted voting, stacking, combining all conversations into one single conversation, etc. However, since the speakers were supposed to speak for a fixed time while the data for the Fisher and Switchboard corpora was collected, the conversations in these corpora are of similar length. Thus the above simple combination technique is also appropriate given the approximately equal-length conversations. . . . 218
10.6 Top 20 Arabic ngram features (along with their Roman transliterations) for Gender, ranked by the weights assigned by the linear SVM model. Section 10.9.3 provides translations and insight into why these are appropriate gender indicators. . . . 224
10.7 Example of an email sent by a male sender in the Enron corpus. The header and signature information containing the sender's name is removed, and only the body of the email is used for gender classification. . . . 226
10.8 Empirical differences in sociolinguistic features for Age. Younger speakers tend to use short utterances, pronouns and auxiliaries more often than older speakers. . . . 231
10.9 Age histograms for training and test speakers of the Switchboard corpus, indicating unbalanced age groups among the participating speakers. . . . 236
10.10 Stacking approach for age regression utilizing binary classifiers across different age boundaries and sociolinguistic features as individual components. . . . 244
10.11 Histograms for different age groups in the test set. The horizontal line shows the threshold for balancing the size of the test set across different age groups, retaining a total of 600 examples. . . . 246
Chapter 1
Introduction
The amount of available unstructured data is growing at a rapid rate. The
International Data Corporation (IDC) predicts that in 2011 the amount of digital
information produced in the year will equal nearly 1,800 exabytes, or 10 times the
amount reported in a measurement study in 2006.1 To utilize such vast information
in a meaningful manner, it is essential to organize it into a structured form. However,
converting this information into structured repositories usually requires a slow, manual
annotation process, with most of the information remaining unexamined. Furthermore,
some such manually created resources, including WordNet and DBPedia,
have limited coverage and are available only for a few of the world's languages.
A large amount of information that is still untapped and unorganized exists in a
wide range of genres, including multilingual news articles, blogs,
1http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
emails, conversation transcripts, discussion forums, etc.
A primary step in organizing unstructured text is to identify the relationships between
those general concepts denoted by words or phrases, both within and across different
languages. Techniques for identifying and extracting such relationships provide a ba-
sic relational structure that can then be refined and generalized into a fully-fledged
crosslingual knowledge base.
This dissertation provides new relationship extraction models that exploit novel
knowledge sources, across a diverse set of relationship types in multiple languages. A
wide range of relationships are explored, including semantic relationships between
words, their translation equivalents in different languages and encyclopedic facts
about named entities.
The goal of this dissertation is multi-faceted:
• to tie together extraction of diverse relationships in a common minimally super-
vised framework using seed exemplars for learning typical contexts. Using the
same starting approach, different representations of context can be leveraged
for extracting different relationships.
• to explore novel knowledge sources including social context, correlations among
relationships, cross-language evidence using bilingual dictionaries, and others.
Such novel and multilingual knowledge sources not only help in improving the
performance of relationship extraction but also allow for extracting new rela-
tionship types, such as latent or implicit relationships.
• to develop and evaluate relationship extraction models for various domains such
as conversational speech transcripts, email data, web pages, formal genres, etc.,
and for diverse languages including Arabic, Bulgarian, German, Hindi, Spanish
and Hungarian.
Having described the motivation and goals of this dissertation, the rest of this chapter
is organized as follows: Section 1.1 categorizes the types of relationships explored in
this dissertation into three broad categories. Section 1.2 explains how the extraction
of such relationships can be unified under a basic seed-exemplar-based framework,
along with a discussion of the variants of context representation explored. Section 1.3
summarizes the contributions according to the novel knowledge sources that were used
to build new relationship extraction models. Section 1.4 provides a chapter-wise
outline of the dissertation.
1.1 Types of relationships explored
The types of relationships explored in this thesis fall into the following three broad
categories, as illustrated in Figure 1.1.
• Cross-language/Translation relationships: To leverage a vast amount of
multilingual information, it is necessary to identify how similar concepts are
denoted in different languages. This can often be identified easily with manually
created bilingual dictionaries, such as a Spanish-English or Chinese-English
dictionary. However, the lack of such translation lexicons is a major
bottleneck for low-resource languages. Chapters 3 and 4 of this dissertation
provide several novel methods for inducing such lexicons automatically and
with low annotation effort.
• Semantic relationships: General relationships found in semantic knowledge
bases such as WordNet (hypernymy, synonymy and meronymy) are critical to
many applications that aim at recovering parts of a sentence's meaning,
such as sentiment analysis tools and semantic search engines. Chapters 6 and
7 show how such semantic relationships can be extracted in multiple languages
and also quantify how such information can be applied to downstream tasks.
• Factual relationships: A large number of relationships are domain-specific
and express facts about the world. For example, DBPedia is a knowledge base
that contains factual relationships about people such as “birthplace”, which is
a relationship between phrases denoting person names and locations, or about
organizational relationships such as "founder", a relationship between phrases
denoting a person name and a company name. Such factual relationships
can be explicitly stated in the text and can also often be implicitly derived
when not directly stated. Chapters 9 and 10 provide techniques to extract such
relationships in both an explicit and implicit manner, and in different languages.
[Figure 1.1 graphic: unstructured text in multiple languages (English, Spanish, Hindi, Arabic) is transformed, via categorization, classification, extraction and projection techniques, into a structured multilingual knowledge base of attribute-value pairs (e.g. occupation, birth year and nationality for "Chesley B. Sullenberger" and "A. R. Rehman"), supporting applications such as fine-grained information retrieval, coreference resolution and personalized user assistance. Chapters 3 and 4 cover extraction of cross-language relations (novel models of extracting compound translations; using syntactic structure and part-of-speech equivalences); Chapters 6 and 7 cover extraction of semantic relations (via evidence from multiple relationship types; downstream applications to NLP tasks); Chapters 9 and 10 cover extraction of factual relations (structural, transitive and latent models for the biography domain; latent discourse models for implicit biographic facts in informal genres).]
Figure 1.1: Thesis overview: Extracting different types of relationships from unstruc-
tured text in multiple languages into a structured multilingual knowledge base.
1.2 Basic approach: Relationship
extraction using seed exemplars
A common starting approach in extracting the different kinds of relationships in
this thesis is that of bootstrapping from a small set of seed pairs. This approach is
inspired by the success of application of self-learning approaches in machine learn-
ing (Baum, 1972; Dempster et al., 1977) to natural language processing tasks such
as seed-based approaches for word sense disambiguation (Yarowsky, 1995). Simi-
lar approaches have been used for extracting cross-language relationships such as
translation equivalence (Rapp, 1999), and semantic and factual relationships (Thelen
and Riloff, 2002; Ravichandran and Hovy, 2002). Many variants of this basic seed
exemplar-based approach have been further developed in the literature for specific
relationships. While such seed-based techniques have been independently and specifi-
cally developed for different relationship types, this section shows that the underlying
algorithm remains the same and the context representation varies. Most of the pre-
vious approaches use only local context representations. Sections 1.2.2 and 1.3 show
other context representations and a wide array of novel knowledge sources that can
be embedded under the same framework.
The basic algorithm is outlined below, followed by a description of the different context
representations and additional novel knowledge sources explored in this thesis.
Cross-language (Spanish-English) | Semantic (Hypernymy)       | Factual (Occupation)
(diversidad, diversity)          | (car, vehicle)             | (Seamus Heaney, Poet)
(chipre, cyprus)                 | (copper, metal)            | (Amitabh Bachchan, Actor)
(gobierno, government)           | (gun, weapon)              | (Desmond Dekker, Singer)
(fundamento, certainty)          | (yen, currency)            | (Elfriede Jelinek, Novelist)
(ruego, thank)                   | (dog, animal)              | (John Hume, Politician)
(papel, role)                    | (hammer, tool)             | (Ludwig Wittgenstein, Philosopher)
(fundamento, basis)              | (tennis, sport)            | (Monty Hall, Game show host)
(de, of)                         | (cancer, disease)          | (Robert Boyle, Chemist)
(clave, key)                     | (English, language)        | (Xavier Cugat, Musician)
(entre, between)                 | (passport, legal document) | (Rupert Sheldrake, Biologist)

Table 1.1: A sample of seed pairs for the three major categories of relationships
extracted. A representation of context is learned starting with such seed pairs for
extracting new pairs describing the relationship, as described in Section 1.2.1.
1.2.1 Algorithm (Rapp, 1999; Ravichandran and
Hovy, 2002)
The traditional seed-based learning framework is as follows:
Input: A set of seed pairs exhibiting the relationship of interest; and unlabeled mono-
lingual corpora. Table 1.1 shows examples of seed pairs for the different relationships
explored in this thesis.
Output: New word/phrase pairs exhibiting the relationship of interest.
Method:
1. Extract individual contexts for each seed pair occurrence in monolingual cor-
pora. A context can be as simple as the set of adjacent words surrounding
the seed pairs. More complex versions such as pattern templates, dependency
contexts, and document-wide contexts are explained in Section 1.2.2.
For example, in (Ravichandran and Hovy, 2002), the context is simply the se-
quence of words (also called a pattern template) surrounding the appearance of
a seed pair in a sentence.
2. Aggregate individual contexts into a general context representation. This con-
text representation can be thought of as abstractly representing the relationship
of interest. Thus, there is a single aggregate context representation per rela-
tionship. Some examples of this aggregate representation include a TF.IDF-weighted
bag-of-words context vector, a list of pattern templates ranked according
to a pattern reliability score, and position-specific context vectors.
For example, in (Ravichandran and Hovy, 2002), this is a list of pattern tem-
plates ranked according to their precision score, which is the fraction of times
the pattern appears with the seed pair in an unlabeled corpus.
3. Extract new pairs that occur with the aggregated context representation in
monolingual corpora. This extraction process depends on the type of context
representation. For example, for a bag-of-words context vector representation,
this process involves creating bag-of-words vectors using the adjacent words that
occur alongside the candidate pairs. These pairs are treated as candidates for
the relationship of interest. The candidates are then ranked by different mea-
sures depending on the context representation. For example, in (Ravichandran
and Hovy, 2002), the extracted candidate pairs are ranked according to their
frequency in an unlabeled corpus.
This extraction process and the ranking measures are explained in more
detail in their respective thesis chapters.
Given an initial set of candidate pairs extracted using the above seed-exemplar-based
approach, novel relationship extraction models using diverse knowledge sources can
then be exploited to substantially improve the extraction. These knowledge sources
and their application are briefly described in Section 1.3 of this chapter.
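To make the three steps above concrete, the following is a minimal sketch of the pattern-template variant of the algorithm (in the style of Ravichandran and Hovy, 2002). The toy corpus, seed pair, and function names are illustrative assumptions, not the exact implementation used in later chapters.

```python
from collections import Counter

def learn_patterns(sentences, seeds, max_gap=5):
    """Steps 1-2: collect the infix word sequences (pattern templates) that
    appear between seed pairs, weighted by seed-pair co-occurrence counts."""
    patterns = Counter()
    for toks in sentences:
        for i, x in enumerate(toks):
            for j in range(i + 1, min(i + 1 + max_gap, len(toks))):
                if (x, toks[j]) in seeds:
                    patterns[tuple(toks[i + 1:j])] += 1
    return patterns

def extract_pairs(sentences, patterns):
    """Step 3: propose new pairs that co-occur with a learned infix pattern,
    ranked by their frequency in the unlabeled corpus."""
    candidates = Counter()
    for toks in sentences:
        for i in range(len(toks)):
            for pat in patterns:
                j = i + 1 + len(pat)
                if j < len(toks) and tuple(toks[i + 1:j]) == pat:
                    candidates[(toks[i], toks[j])] += 1
    return candidates
```

For example, starting from the single hypernymy seed (copper, metal) and sentences such as "copper is a metal" and "pseudoephedrine is a drug", the template ("is", "a") is learned and the new pair (pseudoephedrine, drug) is proposed as a candidate.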
1.2.2 Context Representations
Context representation is a key component of the algorithm outlined in Section
1.2.1. The following describes the major variants of context representation used
in this thesis.
1. Narrow bag-of-words context vectors (Rapp, 1999; Schafer and
Yarowsky, 2002; Koehn and Knight, 2002; Haghighi et al., 2008; Gar-
era et al., 2009) : This representation uses words surrounding the seeds over
a fixed window size and forms a vector with the weights computed from the
co-occurrence frequency. It is also possible to record the position at which
contextual words occur, rather than using an unordered bag-of-words representation.
Application: Utilized for cross-language relationships such as translation equiv-
alence and complex semantic relationships such as definite anaphora.
Example: Figure 4.1 in Chapter 4 provides an example of a narrow bag-of-words
context vector.
2. Prefix, infix and suffix pattern templates (Ravichandran and Hovy,
2002; Thelen and Riloff, 2002; Pasca et al., 2006; Garera and
Yarowsky, 2009) : When the relationship is often expressed by words in close
proximity to the words of the seed pair, a very useful context representation
is the pattern template. For example, "X was born in Y" is an infix pattern
template representing one of the “birthplace” relationship contexts. Each con-
text template is assigned a weight computed based on the seed pairs.
Application: Utilized for generic semantic relationships such as “Is-a”, “Part-
of”, etc., explicit factual relationships such as “birthplace” and cross-language
relationships among compound words.
Example: Figure 8.1 in Chapter 8 provides an example of pattern-based context
representation.
3. Document-wide contexts (Garera and Yarowsky, 2009): This representation
is similar to the fixed-window bag-of-words context vector, but the context is
expanded to the entire document and is used for topically modeling a factual
relationship, as described in Chapter 9. (Although such contexts were not utilized
in the previous seed-based approaches described at the beginning of Section 1.2,
they can be naturally embedded under the seed-based approach and are listed
here for completeness.)

Application: Utilized for extracting latent factual relationships that are not
explicitly stated in the text but can be inferred indirectly from the topical
nature of the document.

Example: Table 9.4 in Chapter 9 provides an example of utilizing document-wide
contexts for extracting the "occupation" relationship.

4. Multilingual dependency contexts (Garera et al., 2009): This representation
involves obtaining dependency parses in multiple languages for modeling
long-range dependencies and word ordering as part of the contextual clues.

Application: Utilized for extracting translation relationships via context
projection across languages, as described in Chapter 4.

Example: Figure 4.2 and Table 4.1 in Chapter 4 provide an example of utilizing
dependency contexts for extracting the translation relationship.
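As a concrete illustration of the first representation and its cross-language application (Chapter 4), the sketch below builds narrow bag-of-words context vectors from co-occurrence counts, projects a source-language vector into target-language space through a seed bilingual dictionary, and scores translation candidates by cosine similarity. The toy corpora, dictionary, and function names are hypothetical, not the thesis implementation.

```python
from collections import Counter
import math

def context_vector(word, sentences, window=2):
    """Count words co-occurring with `word` inside a fixed window."""
    vec = Counter()
    for toks in sentences:
        for i, t in enumerate(toks):
            if t == word:
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def project(vec, seed_dictionary):
    """Map a source-language context vector into target-language space
    using a (partial) seed bilingual dictionary."""
    out = Counter()
    for w, c in vec.items():
        if w in seed_dictionary:
            out[seed_dictionary[w]] += c
    return out

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

On toy monolingual corpora, the projected context vector for the Spanish word "crecimiento" scores the candidate "growth" above an unrelated candidate such as "car", since the two words keep similar company in their respective corpora.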
1.3 Contributions
This dissertation makes several novel contributions to the general natural language
learning framework described in Section 1.2. It explores an array of new relationship
extraction techniques, exploiting novel internal and external knowledge sources. In
what follows, I outline the specific contributions of this dissertation. These are orga-
nized according to the knowledge sources that have been used in the development of
new relationship extraction models.
1.3.1 Internal knowledge sources
As defined in this dissertation for the purposes of distinguishing two major classes
of knowledge sources, Internal knowledge sources are directly derived from the corpora
from which relationships are extracted. These include new context representations,
transitive knowledge, and other sources described below:
1. Evidence across different relationship types: Several relationships, especially
within the same category, can benefit from joint modeling. Chapter
6 provides a novel minimal-resource model for the acquisition of multilingual
lexical taxonomies (including hyponymy/hypernymy and meronymy) using ev-
idence from multiple relationship-types.
2. Global document-wide contexts: Latent relationships that are not explic-
itly stated in text are difficult to extract using the basic seed-based approach.
Chapter 9 shows how latent-attribute models of document-wide context, both
monolingually and translingually, can capture facts that are not stated directly
in a text.
3. Transitivity information via co-occurrence statistics: Factual relation-
ships often have similar values for related entities, and co-occurrence statistics
can be used to find entities that are related. Chapter 9 provides a transitive
model that predicts values of factual attributes based on consensus voting via
the extracted attributes of neighboring entities.
4. Structural information: Some relationships, such as "birthdate", tend
to occur in characteristic positions; such positional information can be very
useful when a good context model is not available. Chapter 9 provides the first
known work illustrating a global structural model for factual relationship
extraction, utilizing absolute and relative document-wide positions as opposed
to only modeling local contextual patterns.
5. Social context: For extracting facts about people, social context can play an
important role. This involves modeling of speaker attributes sensitive to partner
speaker attributes, given the differences in lexical usage and discourse style such
as observed between same-gender and mixed-gender conversations. Chapter 10
makes use of such social contexts for improving extraction of implicit factual
relationships from conversation genres.
6. Sociolinguistic information: Chapter 10 also explores a rich variety of novel
sociolinguistic and discourse-based features, including mean utterance length,
passive/active usage, percentage domination of the conversation, speaking rate
and filler word usage.
7. Derived relationships: Successful extraction of simple relationships can be
used as an input for extracting more complicated relationship types. Chap-
ters 6 and 7 show how automatically extracted “is-a” semantic relationships
can be used as a knowledge source for extracting cross-language and anaphoric
relationships.
8. Morphology and sequence information: Chapter 3 illustrates the use of
component-sequence and compound morphology for extracting cross-language
relationships among compound words.
1.3.2 External knowledge sources
These knowledge sources consist of external tools and data such as dependency
parsers, bilingual dictionaries, etc., used alongside the corpora from which relation-
ships are extracted.
1. Bilingual dictionaries for leveraging cross-language evidence: Chapter
3 presents the first known work on non-compositional extraction of compound
word equivalences across different languages, leading to fluent translations. The
key knowledge source that makes this possible is a set of bilingual dictionaries,
which is used for learning multilingual similarities between compound words.
2. Richer contexts via dependency parses: Dependency trees not only help
in modeling richer contexts but are also helpful with respect to modeling long-
distance relationships and word-reordering. Chapter 4 shows novel use of de-
pendency parsers for extracting cross-language relationships.
3. Part-of-speech equivalences: Part-of-speech tags are usually preserved in
cross-language relationships. Chapter 4 shows how the entropy of candidate
translations can be reduced by mapping part-of-speech tagsets. The
chapter also provides a mechanism for learning such a mapping automatically.
4. Domain models: The values of the attribute-value pairs of a factual relationship
can often be modeled based on their domain. Chapter 9 shows how external
gazetteers and syntactic constraints can serve as domain models for extracting
factual relationships.
5. Correlation statistics among instances of different relationship types:
Different attributes describing facts about the same entity are often correlated
and can be used to reduce the entropy of candidate extraction space. Chapter
9 shows how such correlations can be learned and applied using an external
database of factual relationships3.
3 Note that correlations among different relationships can also be derived internally as described in the first point of Section 1.3.1, and in more detail in Chapter 6.
1.4 Outline of this Dissertation
This dissertation is organized into the following chapters:
1.4.1 Part I: Cross-language relationships
• Chapter 2 presents a literature review for extracting translation equivalents
across different languages, with a focus on minimally supervised methods.
• Chapter 3 presents an approach for fluent, non-compositional translation of
compound words by learning component gloss translation models across multi-
ple languages.
• Chapter 4 presents novel improvements to the induction of translation lexicons
from monolingual corpora by incorporating multilingual dependency parses and
part-of-speech equivalences.
1.4.2 Part II: Semantic relationships
• Chapter 5 presents a literature review for extracting semantic relationships
and their downstream application to anaphora resolution.
• Chapter 6 presents a novel algorithm for the acquisition of multilingual se-
mantic taxonomies, using evidence from different semantic relationship types.
• Chapter 7 shows how corpus-based approaches for extracting semantic rela-
tionships can be utilized for resolving and generating definite anaphora.
1.4.3 Part III: Factual relationships
• Chapter 8 presents a literature review for extracting a third category of lexical
relationships, namely domain-specific factual relationships.
• Chapters 9 and 10 present structural, transitive and latent approaches for
fact extraction in the biographic domain, across different genres.
• Chapter 11 concludes this dissertation.
Part I
Extracting
Cross-language/Translation
Relationships
Chapter 2
Part I Literature Review
This chapter covers the literature review for extracting cross-language relation-
ships in the form of translation equivalences. The different lines of previous work can
be classified according to the nature of the resources utilized, namely, parallel cor-
pora (Section 2.1), monolingual corpora with seed lexicons (Section 2.2) and bridge
languages (Section 2.3).
2.1 Using Parallel Corpora
Learning translation equivalence relationships across words in different languages
can be traced back to the first statistical approach to machine translation by Brown
et al. (1990) from parallel text. This was formally defined as a translation model
in Brown et al. (1993) using the word alignments learned via the expectation max-
imization algorithm (Dempster et al., 1977). After obtaining word alignments, the
translation lexicon can be induced using the IBM Model 1 (Brown et al., 1993) as
follows:
t(e|f; e,f) = Σ_(e,f) c(e|f; e,f) / ( Σ_e Σ_(e,f) c(e|f; e,f) )    (2.1)

where t(e|f; e,f) is the translation probability estimated from a sentence-aligned corpus (say,
an English-French corpus denoted by (e,f)), and c(e|f; e,f) denotes the number of times the
English word e aligns with the French word f in the aligned corpus.
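As a concrete illustration of this count-and-normalize estimation, the following is a minimal EM sketch of Model 1 over a toy sentence-aligned corpus. The corpus and the function name are invented for illustration; production aligners add many refinements beyond this sketch.

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """Toy EM training for IBM Model 1.
    bitext: list of (english_words, foreign_words) sentence pairs.
    Returns t[(e, f)], an estimate of the translation probability t(e|f)."""
    e_vocab = {e for es, _ in bitext for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(e|f)
        total = defaultdict(float)  # sum over e of c(e|f), per f
        for es, fs in bitext:
            for e in es:
                z = sum(t[(e, f)] for f in fs)  # normalizer over alignments of e
                for f in fs:
                    c = t[(e, f)] / z           # fractional (expected) count
                    count[(e, f)] += c
                    total[f] += c
        # Re-estimate as in Equation 2.1: t(e|f) = c(e|f) / sum_e c(e|f)
        t = defaultdict(float,
                        {ef: count[ef] / total[ef[1]] for ef in count})
    return t

bitext = [(["the", "house"], ["das", "haus"]),
          (["the", "book"], ["das", "buch"]),
          (["a", "book"], ["ein", "buch"])]
t = ibm_model1(bitext)
```

Even on three sentence pairs, the expected counts concentrate probability on the correct pairings (das/the, haus/house, buch/book) without any observed alignments.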
IBM models 2-5 take into account reordering, fertility and deficiency issues. Since
then there have been several improvements to IBM models including the introduction
of HMM models (Vogel et al., 1996; Toutanova et al., 2002), use of posterior methods
(Kumar and Byrne, 2002) and discriminative training methods (Och and Ney, 2002),
incorporating manual word alignments (Callison-Burch et al., 2004) and using log-
linear model combination of simpler models (Liu et al., 2005). Moving beyond words,
there has also been extensive work in training phrase-based models from parallel text.
These models are rooted in the work by Och and Weber (1998) and Och et al. (1999)
on alignment template models, with many variations in the recent literature that are
beyond the scope of this dissertation. There has also been work in combining parallel
corpora with additional noisy parallel text extracted from monolingual corpora
(Munteanu et al., 2004; Fung and Cheung, 2004) and then applying the word-based or
phrase-based statistical models, treating the noisy text as part of a sentence-aligned
corpus.
The focus of the translation lexicon induction methods discussed in this dissertation
is on methods not requiring a parallel corpus, in order to alleviate the bottleneck of
manual annotation efforts. Towards this goal, several weakly supervised approaches
have been proposed that are detailed in the sections below.
2.2 Using Monolingual Corpora and Seed
Lexicons
The primary idea behind weakly supervised methods is to exploit noisy clues ex-
tracted from the monolingual corpora of source and target languages. The noisy
clues include diverse similarity measures such as contextual, orthographic, frequency
distribution, etc. A highly effective source of similarity often used in this literature
is that of similar contexts (Schafer and Yarowsky, 2002; Koehn and Knight, 2002;
Haghighi et al., 2008). The idea of words with similar meaning having similar con-
texts in the same language comes from the Distributional Hypothesis (Harris, 1954),
and Rapp (1999) was the first to propose using context of a given word as a clue to
its translation.
The algorithm presented by Rapp (1999) shows how an English translation for a German
word can be obtained by first constructing a German context vector by counting
its surrounding words in a monolingual German corpus. Then, using an incomplete
bilingual dictionary, the counts of the German context words with known translations
are projected onto an English vector. The projected vector for the German word is
compared to the vectors constructed for all English words using a monolingual English
corpus. The English words with the highest vector similarity are treated as transla-
tion candidates. Rapp (1999) makes use of the city-block similarity metric; however,
other researchers have found cosine similarity to perform better (Koehn and Knight,
2002; Schafer and Yarowsky, 2002). The original Rapp (1999) work employed a rel-
atively large bilingual dictionary containing approximately 16,000 words and tested
only on a small collection of 100 manually selected nouns. A detailed illustration
of this approach is described in the context of the improvements presented in this
dissertation in Section 4.3 of Chapter 4.
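The project-and-compare pipeline described above can be sketched as follows. This is a toy illustration with invented corpora and a three-entry seed dictionary, using cosine similarity (the metric later work found preferable to Rapp's city-block distance); the function names are not from any published implementation.

```python
import math
from collections import Counter

def context_vector(word, corpus, window=3):
    """Count words within +/-window positions of each occurrence of `word`."""
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(x for x in sent[lo:hi] if x != word)
    return vec

def project(vec, seed_dict):
    """Map context counts into English via a (partial) bilingual seed dictionary."""
    out = Counter()
    for w, c in vec.items():
        if w in seed_dict:            # untranslatable context words are dropped
            out[seed_dict[w]] += c
    return out

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_translations(src_word, src_corpus, tgt_corpus, seed_dict, candidates):
    """Rank candidate translations by similarity of their target-language
    context vectors to the projected source-language context vector."""
    projected = project(context_vector(src_word, src_corpus), seed_dict)
    scores = {c: cosine(projected, context_vector(c, tgt_corpus))
              for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)

german = [["der", "hund", "bellt", "laut"]]
english = [["the", "dog", "barks", "loudly"], ["the", "cat", "sleeps", "quietly"]]
seed = {"der": "the", "bellt": "barks", "laut": "loudly"}
ranked = rank_translations("hund", german, english, seed, ["dog", "cat"])
```

On this toy data, the projected context of German hund overlaps fully with the context of dog and only partially with that of cat.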
Fung (1998) applied a similar approach to Rapp (1999) for Chinese-English, likewise
using a large seed dictionary (20,000 words) to project Chinese and English context
vectors into a common vector space for comparison.
Koehn and Knight (2002) tested this idea on a larger test set consisting of the 1000
most frequent words from a German-English lexicon. They also incorporated clues
such as frequency and orthographic similarity in addition to context. Schafer and
Yarowsky (2002) independently proposed using frequency and orthographic similarity,
and also showed improvements using temporal and word-burstiness similarity measures,
in addition to context. Haghighi et al. (2008) made use of contextual and
orthographic clues for learning a generative model from monolingual corpora and a
seed lexicon.
A key notion in the above class of work is the similarity between the projected context
vector and a candidate translation's context vector. All of the previous literature (Fung,
1998; Rapp, 1999; Koehn and Knight, 2002; Schafer and Yarowsky, 2002; Haghighi
et al., 2008) has made use of naive fixed-window adjacent contexts for the construction
of context vectors. Chapter 4 shows how richer contexts exploiting dependency
information can allow for dynamic context sizes, and account for word reordering in
the source and target languages.
2.3 Using Bridge Languages
Another highly useful weakly supervised method is that of using “bridge lan-
guages”, often also referred to as “pivot languages” (Hajic et al., 2000; Gollins and
Sanderson, 2001). Often, a low-resource source language (say, Romanian) has a
closely related language within its family that has a large translation lexicon for the
target language; for example, Spanish is closely related to Romanian, and a large
Spanish-English dictionary is easily available. The mapping from the low-resource
language to the close language in the family is established by learning statistical
models of cognate surface similarity.
Mann and Yarowsky (2001) introduced the idea of using bridge languages for tran-
sitive lexical translation induction using cognates. A cognate model was developed
consisting of several string distance measures such as raw Levenshtein distance and
trained single-state probabilistic transducers presented in Ristad and Yianilos (1997).
Schafer and Yarowsky (2002) extended this work by investigating a range of trans-
ducer structures for modeling cognates and further by using diverse similarity mea-
sures such as context similarity, temporal and word-burstiness similarity for ranking
the translation candidates when a cognate is found.
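A minimal sketch of the transitive bridge idea follows, with raw Levenshtein distance standing in for the trained transducers of Ristad and Yianilos (1997); the `bridge_translate` helper and the Romanian-Spanish toy lexicon are hypothetical.

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def bridge_translate(src_word, bridge_lexicon, max_dist=3):
    """Hypothesize English translations for a low-resource-language word by
    finding cognates in a related bridge language with a known English lexicon.
    Returns (distance, bridge_word, english) tuples, best match first."""
    hyps = []
    for bridge_word, english in bridge_lexicon.items():
        d = levenshtein(src_word, bridge_word)
        if d <= max_dist:
            hyps.append((d, bridge_word, english))
    return sorted(hyps)
```

For instance, Romanian noapte is within edit distance 3 of Spanish noche, so the Spanish-English entry noche → night yields a translation hypothesis for noapte, while an unrelated entry such as leche (milk) is filtered out.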
Chapter 3 shows another novel use of bridge languages, for compound translation
induction by leveraging cross-language similarity of compound components as a tran-
sitivity bridge. Translating compound words such as “lighthouse”, “fireplace”, etc.,
are often a challenge for corpus-based methods due to their often low frequency and
potentially complex compounding behavior, thus needing special treatment. The rel-
evant literature for compound word translation is provided in Section 3.3 of Chapter
3, all of which are based on surface-based compositional methods. Chapter 3 shows
how the bridge language paradigm can be utilized for generating fluent translation
candidates for compound translation, as opposed to providing glossy translations via
fixed surface-based pattern templates.
Chapter 3
Translating Compounds by
Learning Component Gloss
Translation Models via Multiple
Languages
Summary
This chapter describes an approach to the translation of compound words and
phrases without the need for bilingual training text, by modeling the mapping of
literal component word glosses (e.g. “iron-path”) into fluent English (e.g. “railway”)
across multiple languages. Performance is improved by adding component-sequence
and learned-morphology models along with context similarity from monolingual text
and optional combination with traditional bilingual-text-based translation discovery.
Components of this chapter were originally published by the author of this dissertation
in the forum referenced below1.
3.1 Introduction
Compound words such as lighthouse and fireplace, which are composed of two or
more component words, are often a challenge for machine translation due to their
potentially complex compounding behavior and ambiguous interpretations. Furthermore,
compound words and phrases are often poorly covered in bilingual dictionaries.
Compounds exist in many other languages as well; some examples are shown
below:
• German: “Krankenhaus” (hospital) is a compound word whose individual components
are “Kranken” (sick) and “Haus” (house). Another example is “Regenschirm”
(umbrella), whose individual components are “Regen” (rain) and
“Schirm” (guard).
• Farsi: “mehmankhane2” (hotel) is a compound word whose individual components
are “mehman” (guest) and “khane” (house). Another example is
1Reference: N. Garera and D. Yarowsky. Translating Compounds by Learning Component Gloss Translation Models via Multiple Languages. Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), 2008.
2All the non-Latin-1 languages were represented in Unicode format while performing the experiments reported in this chapter.
Compound       Splitting       English Gloss   Translation
Input: Distilled glosses from German-English dictionary
Krankenhaus    Kranken-Haus    sick-house      hospital
Regenschirm    Regen-Schirm    rain-guard      umbrella
WorterBuch     Worter-Buch     words-book      dictionary
Eisenbahn      Eisen-Bahn      iron-path       railroad
Input: Distilled glosses from Swedish-English dictionary
Sjukhus        Sjuk-Hus        sick-house      hospital
Jarnvag        Jarn-Vag        iron-path       railway
Ordbok         Ord-Bok         words-book      dictionary
Goal: To translate new Albanian compounds
Hekurudhe      Hekur-Udhe      iron-path       ???

Table 3.1: Example lexical resources used in this task and their application to translating compound words in new languages.
“mizetahrir” (desk), whose individual components are “miz” (table), “e” (filler),
and “tahrir” (writing).
For many languages, such words form a significant portion of the lexicon and the
compounding process is further complicated by diverse morphological processes and
the properties of different compound sequences such as Noun-Noun, Adj-Adj, Adj-
Noun, Verb-Verb, etc. Compounds also tend to have a high type frequency but a low
token frequency, which makes their translations difficult to learn using corpus-based
algorithms (Tanaka and Baldwin, 2003). Furthermore, most of the literature on com-
pound translation has been restricted to a few languages dealing with compounding
phenomena specific to the language in question.
With these challenges in mind, the primary goal of this work is to improve the cover-
age of translation lexicons for compounds, as illustrated in Table 3.1 and Figure 3.1,
in multiple new languages. The algorithms presented show how using cross-language
compound evidence obtained from bilingual dictionaries can aid in compound trans-
lation. A primary motivating idea for this work is that the literal component gloss
for a compound word (such as “iron path” for railway) is often replicated in multiple
languages, providing insight into the fluent translation of a similar literal gloss in a
new (often resource-poor) language.
3.2 Resources Utilized
The only resources utilized by the compound translation lexicon algorithm are a
collection of bilingual dictionaries and a small lexicon for the source-target language
pair for translating the individual components3. Bilingual dictionary collections for
50 languages were acquired in electronic form over the Internet or via optical character
recognition (OCR) on paper dictionaries. Note that no parallel or even monolingual
corpora are required; their use, described later in the chapter, is optional.
3.3 Related Work
The compound-translation literature typically deals with these steps: 1) Com-
pound splitting, 2) translation candidate generation and 3) translation candidate
scoring. Compound splitting is generally done using translation lexicon lookup and
3The individual component words are usually common, frequent words, and hence their translations are easier to obtain, either via a native speaker or via corpus-based methods.
allowing for different splitting options based on corpus frequency (Zhang et al., 2000;
Koehn and Knight, 2003).
Translation candidate generation is an important phase and this is where this work
differs significantly from the previous literature. Most of the previous work has been
focused on generating compositional translation candidates, that is, the translation
candidates of the compound words are lexically composed of the component word
translations. This has been done by either just concatenating the translations of
component words to form a candidate (Grefenstette, 1999; Cao and Li, 2002), or
using syntactic templates such as “E2 in E1”, “E1 of E2” to form translation candi-
dates from the translation of the component words E2 and E1 (Baldwin and Tanaka,
2004), or using synsets of the component word translations to include synonyms in
the compositional candidates (Navigli et al., 2003).
The above class of work on compositional-candidate generation fails to translate
compounds such as Krankenhaus (hospital), whose component word translations are
Kranken (sick) and Haus (house); composing sick and house in any order will
syntactic templates is that they are restricted to the specific patterns occurring in
the target language. This chapter describes how one can use the gloss patterns of
compounds in multiple other languages to hypothesize translation candidates that
are not lexically compositional.
[Figure 3.1 here: the Albanian compound hekurudhë is split via a small Albanian-English dictionary into hekur (iron) and udhë (path); looking up words in other languages that yield the gloss “iron path” after splitting (Italian ferrovia, German eisenbahn, Swedish järnväg, Uighur tömüryol, all meaning railroad/railway) produces the ranked candidate translations railroad (0.19), railway (0.14) and rail (0.05).]
Figure 3.1: Illustration of using cross-language evidence using bilingual dictionaries
of different languages for compound translation. The basic approach is to translate
compound words by modeling the mapping of literal component-word glosses (e.g.
“iron-path”) into fluent English (e.g. “railway”) across multiple languages.
3.4 Approach
The approach to compound word translation employed here is illustrated in Figure
3.1.
3.4.1 Splitting compound words and gloss generation with translation lexicon lookup
First, a given source word, such as the Albanian compound hekurudhe, is split
into a set of component-word partitions, such as hekur (iron) and udhe (path). The
initial approach is to consider all possible partitions based on contiguous component
words found in a small dictionary for the language, as in Brown (2002) and Koehn
and Knight (2003)4. For a given split, its English glosses are generated by using all
possible English translations of the component words given in the dictionary of that
language. The algorithm is allowed to generate multiple glosses “iron way,” “iron
road,” etc. based on multiple translations of the component words. Multiple glosses
only add to the number of translation candidates generated.
4In order to avoid inflections as component words, the component-word length is limited to at least three characters.
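The splitting and gloss-generation steps can be sketched as follows for the common two-component case. The function names are invented, and the lexicon format assumed here (word mapped to a list of English translations) is an illustration rather than the system's actual data structure.

```python
from itertools import product

def split_compound(word, lexicon, min_len=3):
    """Enumerate two-way splits of `word` whose parts both appear in the
    lexicon; component length is limited to at least min_len characters."""
    splits = []
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            splits.append((left, right))
    return splits

def candidate_glosses(word, lexicon, min_len=3):
    """Generate English glosses for every valid split, using all listed
    translations of each component (multiple glosses are allowed)."""
    out = set()
    for left, right in split_compound(word, lexicon, min_len):
        for g1, g2 in product(lexicon[left], lexicon[right]):
            out.add(f"{g1} {g2}")
    return out
```

With a toy lexicon {"hekur": ["iron"], "udhe": ["path", "way"]}, the Albanian compound hekurudhe yields the glosses "iron path" and "iron way".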
3.4.2 Using cross-language evidence from different
bilingual dictionaries
For many compound words (especially for borrowings), the compounding process
is identical across several languages and the literal English gloss remains the same
across these languages. For example, the English word railway is translated as a
compound word in many languages, and the English gloss of those compounds is
often “iron path” or a similar literal meaning5. Thus knowing the fluent English
translation of the literal gloss “iron path” in some relatively resource-rich language
provides a vehicle for the translation from all other languages sharing that literal
gloss6.
3.4.3 Ranking translation candidates
The confidence in the correctness of a mapping between a literal gloss (e.g. “iron
path”) and fluent translation (e.g. “railroad”) can be based on the number of distinct
languages exhibiting this association. Thus the candidate translations generated via
different languages are ranked as in Figure 3.1 as follows: For a given target com-
pound word, say fc with a set of English glosses G obtained via multiple splitting
options or multiple component word translations, the translation probability for a
5For the gloss “iron path”, 10 of the 49 other languages contained some compound word whose English gloss, after splitting and component-word translation, was “iron path”.
6A small translation lexicon in the target language is used for translating the individual component words, but these are often higher-frequency words, present either in a basic dictionary or discoverable through corpus-based techniques.
candidate translation can be computed as:
p(ec|fc) = Σ_{g∈G} p(ec, g|fc)               (3.1)
         = Σ_{g∈G} p(g|fc) · p(ec|g, fc)     (3.2)
         ≈ Σ_{g∈G} p(g|fc) · p(ec|g)         (3.3)

where p(g|fc) = p(g1|f1) · p(g2|f2); f1 and f2 are the individual component words of the
compound, and g1 and g2 are their translations from the existing dictionary. For human
dictionaries, p(g|fc) is uniform over all g ∈ G, while variable probabilities can also be
acquired from bitext or other translation-discovery approaches. Also, p(ec|g) can be
estimated using the relative frequency freq(g, ec)/freq(g), where freq(g, ec) is the number of
times a compound word with English gloss g is translated as ec in the bilingual
dictionaries of other languages, and freq(g) is the total number of times the English
gloss g appears in these dictionaries.
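Equation 3.3 translates almost directly into code. In this sketch p(g|fc) is taken as uniform over the candidate glosses (the human-dictionary case), and the gloss-to-translation counts are invented for illustration.

```python
from collections import defaultdict

def rank_candidates(glosses, gloss_to_english_counts):
    """Score fluent translation candidates for a compound, following
    p(ec|fc) ~ sum_g p(g|fc) * p(ec|g)  (Equation 3.3).
    gloss_to_english_counts[g][ec] = number of other-language dictionary
    entries in which a compound glossing as g translates to ec."""
    p_g = 1.0 / len(glosses)                    # uniform p(g|fc)
    scores = defaultdict(float)
    for g in glosses:
        counts = gloss_to_english_counts.get(g, {})
        freq_g = sum(counts.values())           # freq(g)
        for ec, n in counts.items():
            scores[ec] += p_g * (n / freq_g)    # p(ec|g) = freq(g, ec)/freq(g)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

With counts of 6, 3 and 1 for railroad, railway and rail under the gloss "iron path", and two candidate glosses, railroad is ranked first with score 0.5 · 0.6 = 0.3.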
3.5 Evaluation using Exact-match
Translation Accuracy
For evaluation, the performance of the algorithm is tested on the following 10
languages: Albanian, Arabic, Bulgarian, Czech, Farsi, German, Hungarian, Russian,
Slovak and Swedish. The evaluation results show both the average performance for
these 10 languages (Avg10), as well as provide individual performance details on
Albanian, Bulgarian, German and Swedish. For each of the compound translation
models, the evaluation results report coverage (the # of compound words for which
a hypothesis was generated by the algorithm) and Top1/Top10 accuracy. Top1 and
Top10 accuracy are the fractions of words for which a correct translation (listed in
the evaluation dictionary) appears in the top 1 and top 10 translation candidates,
respectively, as ranked by the algorithm. Because evaluation dictionaries are often
missing acceptable translations (e.g. railroad rather than railway), and any devia-
tion from exact-match is scored as incorrect, these measures will be a lower bound
on acceptable translation accuracy. Also, target language models can often select
effectively among such hypothesis lists in context.
3.6 Comparison of different compound
translation models
This section compares the results of various models for compound translation,
starting from the prior work on using compositional methods (Grefenstette, 1999;
Cao and Li, 2002) and then describing the new non-compositional and cross-language
evidence based methods introduced in this chapter.
Language    Compound words translated   Top1 Acc.   Top10 Acc.   Found Acc.
Albanian     4472 (10.11%)              0.001       0.010        0.020
Bulgarian    9093 (12.50%)              0.001       0.015        0.031
German      15731 (29.11%)              0.004       0.079        0.134
Swedish     18316 (31.57%)              0.005       0.068        0.111
Avg10       14228 (17.84%)              0.002       0.030        0.055

Table 3.2: Baseline performance using unreordered literal English glosses as translations. The percentages in parentheses indicate what fraction of all the words in the entire test vocabulary were detected and translated as compounds.
3.6.1 A simple model using literal English gloss
concatenation as the translation
The baseline model is a simple gloss concatenation model for generating compo-
sitional translation candidates on the lines of Grefenstette (1999) and Cao and Li
(2002). The translations of the individual component-words (e.g. for the compound
word hekurudhe, they would be hekur (iron) and udhe (path)) are used for hypothe-
sizing three translation candidate variants: “ironpath”, “iron path” and “iron-path”.
A test instance is scored as correct if any of these translation candidates occur in
the translations of hekurudhe in the bilingual dictionary. This baseline performance
measures how well simple literal glosses serve as translation candidates. In cases such
as the German compound Nußschale (nutshell), which is a simple concatenation of
the individual components Nuß (nut) and Schale (shell), the literal gloss is correct.
For this baseline, if the component words have multiple translations, then each
possible English gloss is ranked randomly. While Grefenstette (1999) and Cao and
Li (2002) proposed re-ranking these candidates using web data, the potential gains
of such ranking are limited: as can be seen in Table 3.2, even the Found Acc. is
very low7; that is, for most cases the correct translation does not appear
anywhere in the set of English glosses. One explanation could be that for only a
small percentage of compound words are the dictionary translations formed
by concatenating their English glosses. Also, Grefenstette (1999) reports much higher
accuracies for German on this model because his 724 German test compounds were
chosen in such a way that their correct translation is a concatenation of the possible
component word translations.
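The baseline's candidate generation and exact-match scoring amount to only a few lines; the helper names below are invented for this sketch.

```python
def baseline_candidates(gloss_pair):
    """The three concatenation variants hypothesized by the baseline,
    e.g. ("iron", "path") -> "ironpath", "iron path", "iron-path"."""
    g1, g2 = gloss_pair
    return [g1 + g2, f"{g1} {g2}", f"{g1}-{g2}"]

def scored_correct(gloss_pair, dictionary_translations):
    """A test instance counts as correct if any variant matches a listed
    translation (the exact-match criterion of Section 3.5)."""
    return any(c in dictionary_translations
               for c in baseline_candidates(gloss_pair))
```

As Table 3.2 reflects, such candidates match a dictionary translation only for simple concatenative compounds like Nußschale, and fail for cases like Krankenhaus, where no arrangement of "sick" and "house" yields "hospital".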
3.6.2 Using bilingual dictionaries
This section describes the results from the model explained in Section 3.4. To
recap, this model attempts to translate every test word such that there is at least one
additional language whose bilingual dictionary supports an equivalent split and literal
English gloss, and bases its translation hypotheses on the consensus fluent transla-
tion(s) corresponding to the literal glosses in these other languages. The performance
is shown in Table 3.3. The substantial increase in accuracy over the baseline indicates
the usefulness of such gloss-to-translation guidance from other languages. The rest
of the sections detail the investigation of improvements to this model.
7Found Acc. is the fraction of examples for which the correct translation appears anywhere in the n-best list.
[Figure 3.2 here: the German compound Nußschale splits into Nuß + Schale with English gloss “nut shell”, yielding compositional candidates nutshell, shellnut, “nut in shell”, “nut of shell”; Krankenhaus splits into Kranken + Haus with gloss “sick house”, yielding sickhouse, housesick, “sick of house”, “house of sick”.]
Figure 3.2: Illustration of the problem with generating fluent translation candi-
dates via compositional methods (Grefenstette, 1999; Cao and Li, 2002; Baldwin
and Tanaka, 2004)
Language    Compound words translated   Top1 Acc.   Top10 Acc.
Albanian     3085 (6.97%)               0.185       0.332
Bulgarian    6719 (9.24%)               0.247       0.416
German      11103 (20.55%)              0.195       0.362
Swedish     12681 (21.86%)              0.188       0.346
Avg10       9320.9 (11.98%)             0.184       0.326

Table 3.3: Coverage and accuracy for the standard model using gloss-to-fluent translation mappings learned from bilingual dictionaries in other languages (in forward order only).
3.6.3 Using forward and backward ordering for
English gloss search
In the standard model, the literal English gloss for a source compound word (for
example, iron path) matches glosses in other language dictionaries only in the identical
order. But given that modifier/head word order often differs between languages,
one can test how searching for both orderings (e.g. “iron path” and “path iron”)
can improve performance, as shown in Table 3.5. The percentages in parentheses
show the relative increase over the performance of the standard model (Table 3.3). A
substantial improvement is seen in both coverage and accuracy.
3.6.4 Increasing coverage by automatically discovering compound morphology
For many languages, the compounding process introduces its own morphology
(Figure 3.3). For example, in German, the word Geschaftsfuhrer (manager) consists
Language      Dictionary Size     Language      Dictionary Size
Afrikaans          11,389         Malay               9,438
Albanian          188,563         Maltese             7,574
Arabic            167,189         Maori              27,967
Azeri             231,891         Mongolian             948
Bangla              1,606         Nepali              6,812
Basque                880         Polish            261,463
Bosnian            18,283         Portuguese            840
Bulgarian         316,631         Punjabi            76,311
Chinese            82,080         Romanian          249,479
Czech             262,690         Russian           423,009
Dutch             233,805         Serbian           168,140
Esperanto           3,001         Slovak            233,093
Farsi             198,605         Somali                230
French            195,627         Spanish           347,441
German            272,230         Swedish           227,849
Greek             160,126         Tagalog           247,662
Hindi              58,179         Tamil             165,004
Hungarian         289,225         Tatar               8,557
Indonesian         67,633         Thai               14,925
Irish                 887         Tibetan            59,083
Italian           166,966         Tigrinya               56
Kapampangan         1,000         Turkish         1,272,881
Kazakh            145,750         Turkmen            91,928
Korean            229,742         Uighur             16,285
Kurdish             9,870         Ukrainian          14,056
Kyrgyz             74,890         Urdu               36,428
Latin              18,884         Uzbek             190,688
Latvian           148,363         Welsh              25,832

Table 3.4: Size of various bilingual dictionaries (with the other language as English)
Language    Compound words translated   Top1 Acc.        Top10 Acc.
Albanian     3229 (+4.67%)              .217 (+17.30%)   .409 (+23.19%)
Bulgarian    6806 (+1.29%)              .255 (+3.24%)    .442 (+6.25%)
German      11346 (+2.19%)              .199 (+2.05%)    .388 (+7.18%)
Swedish     12970 (+2.28%)              .189 (+0.53%)    .361 (+4.34%)
Avg10        9603 (+3.03%)              .193 (+4.89%)    .362 (+11.04%)

Table 3.5: Performance for looking up the English gloss via both orderings. The percentages in parentheses are relative improvements over the performance in Table 3.3.
[Figure 3.3 here: German Geschäftsführer (manager) = Geschäft (business) + s + Führer (guide), with s as middle glue in German; Latin paterfamilias (head of household) = Pater (father) + Familia (family) + s, with s as end glue in Latin.]
Figure 3.3: Illustration of compounding morphology using middle and end glue char-
acters.
of the lexemes Geschaft (business) and Fuhrer (guide) joined by the lexeme -s. For
the purposes of these experiments, such lexemes are called fillers or middle glue
characters. Koehn and Knight (2003) used a fixed set of two known fillers s and
es for handling German compounds. To broaden the applicability of this work to
new languages without linguistic guidance, such fillers are estimated directly from
corpora in different languages. In addition to fillers, compounding can also introduce
morphology at the suffix or prefix of a compound; for example, in Latin,
the lexeme paterfamilias contains the genitive form familias of the lexeme familia
(family), and thus s in this case is referred to as an “end glue” character. To augment
the splitting step outlined in Section 3.4.1, deletion of up to two middle characters
and two end characters is allowed. Then, for each glue candidate (for example es), its
probability is estimated as the relative frequency of unique hypothesized compound
words successfully using that particular glue.
The set of glues is ranked by probability, and the top 10 middle and end glues are
retained for each language. A sample of the glues discovered for several of the
languages is shown in Table 3.6, and the performance of the morphology step is
shown in Table 3.7. The relative percentage improvements are with respect to the
previous Section 3.6.3.

Albanian        Bulgarian       German          Swedish
Top 15 Middle Glue Character(s)
j   0.059       O   0.129       s   0.133       s   0.132
s   0.048       N   0.046       n   0.090       l   0.051
t   0.042       H   0.036       k   0.066       n   0.049
r   0.042       E   0.025       h   0.042       t   0.045
i   0.038       A   0.025       f   0.037       r   0.035
l   0.031       d   0.024       l   0.036       k   0.030
n   0.030       C   0.025       r   0.032       g   0.026
e   0.022       3   0.023       t   0.031       v   0.023
m   0.022       y   0.021       er  0.027       d   0.023
a   0.021       T   0.021       st  0.024       b   0.023
a   0.021       CT  0.020       en  0.022       f   0.020
k   0.020       l   0.020       b   0.022       e   0.020
sh  0.019       CK  0.019       ge  0.019       m   0.019
h   0.016       P   0.019       e   0.016       st  0.017
u   0.015       K   0.018       ch  0.016       p   0.016
Top 15 End Glue Character(s)
m   0.146       T   0.124       n   0.188       a   0.074
t   0.079       EH  0.092       t   0.167       g   0.073
s   0.059       H   0.063       en  0.130       t   0.059
k   0.048       M   0.049       e   0.069       e   0.057
r   0.037       AM  0.047       d   0.043       d   0.057
es  0.037       E   0.046       r   0.041       re  0.046
e   0.034       �   0.037       er  0.040       k   0.041
je  0.027       K N 0.033       g   0.040       n   0.039
i   0.023       A   0.032       ig  0.024       ng  0.037
e   0.023       O   0.030       nd  0.018       l   0.034
es  0.022       CT  0.027       l   0.015       s   0.031
l   0.021       HE  0.025       s   0.014       r   0.029
n   0.020       K   0.025       ch  0.012       sk  0.025
st  0.019       KA  0.025       i   0.011       ra  0.019
te  0.017       NE  0.018       m   0.010       ad  0.018

Table 3.6: Top 15 middle glues (fillers) and end glues discovered for each language, along with their probability values. Glue characters allow for appropriately splitting the compound words into the root forms of the individual components for lookup in a lexicon.

Language    Compound words translated    Top1 Acc.        Top10 Acc.
Albanian    3272 (+1.33%)                .214 (-1.38%)    .407 (-0.49%)
Bulgarian   7211 (+5.95%)                .258 (+1.18%)    .443 (+0.23%)
German      13372 (+17.86%)              .200 (+0.50%)    .391 (+0.77%)
Swedish     15094 (+16.38%)              .190 (+0.53%)    .363 (+0.55%)
Avg10       10273 (+6.98%)               .194 (+0.52%)    .363 (+0.28%)

Table 3.7: Performance for increasing coverage by including compounding morphology. The percentages in parentheses are relative improvements from the performance in Table 3.5.
A statistically significant gain in coverage is observed, as the flexibility of the glue
process allows discovery of more compounds.
3.6.5 Re-ranking using context vector projection
Performance can be further improved by re-ranking candidate translations based
on the goodness of semantic “fit” between two words, as measured by their context
similarity. This can be accomplished as in Rapp (1999) and Schafer and Yarowsky
(2002) by creating bag-of-words context vectors around both the source and tar-
get language words and then projecting the source vectors into the (English) target
space via the current small translation dictionary. Once in the same language space,
source words and their translation hypotheses are compared via cosine similarity us-
ing their surrounding context vectors. This experiment was performed for German
Method                              Top1 avg    Top10 avg
Original ranking                    0.196       0.388
Combined with context similarity    0.201       0.391

Table 3.8: Average performance on German and Swedish with and without using context vector similarity from monolingual corpora.
and Swedish, and average accuracies with and without this addition are reported in
Table 3.8. For monolingual corpora, the German and Swedish sides of the Europarl
corpus (Koehn, 2005) were used, consisting of approximately 15 million and 21 million
words respectively. The context vectors could be projected for an average of 4224.5
words in the two languages among all the possible compound words detected in
Section 3.6.4. The poor Europarl coverage may be due to the fact that compound
words are generally technical words with low Europarl corpus frequency, especially
in parliamentary proceedings. Thus, the small performance gains here can be
attributed to these limitations of the monolingual corpora.
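The projection-and-comparison step described above can be sketched as follows. This is a minimal illustration of the general Rapp-style technique, not the dissertation's implementation; the function names and the toy seed dictionary are hypothetical:

```python
import math

def project(src_vec, seed_dict):
    """Project a source-language context vector into the target space via a
    seed translation dictionary, keeping counts only for translatable words."""
    proj = {}
    for word, count in src_vec.items():
        for trans in seed_dict.get(word, []):
            proj[trans] = proj.get(trans, 0.0) + count
    return proj

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A projected source vector can then be compared against the context vector of each candidate translation, and the candidates re-ranked by the resulting cosine scores.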
3.6.6 Using phrase-tables if a parallel corpus is
available
All previous results presented in this chapter have been for translation lexicon dis-
covery without the need for parallel bilingual text (bitext), which is often in limited
supply for lower-resource languages. However, it is useful to assess how this transla-
tion lexicon discovery work compares with traditional bitext-based lexicon induction
(and how well the approaches can be combined). For this purpose, phrase tables
Method                  # of words translated    Top1 Acc.    Top10 Acc.
German
  BiDict                13372                    0.200        0.391
  Parallel Corpus SMT   3281                     0.423        0.576
  Parallel + BiDict     3281                     0.452        0.579
Czech
  BiDict (thresh=1)     3455                     0.276        0.514
  Parallel Corpus SMT   309                      0.285        0.404
  Parallel + BiDict     309                      0.359        0.599

Table 3.9: Performance of this work’s BiDict approach compared with, and augmented with, traditional statistical MT learning from bitext.
learned by the standard statistical MT Toolkit Moses (Koehn et al., 2007) were used.
The phrase-table accuracy was tested on two languages: one for which a large amount
of parallel data was available (the German-English Europarl corpus, with approximately
15 million words) and one for which relatively little parallel data was available
(the Czech-English news-commentary corpus, with approximately 1 million words).
This was done to see how the amount of available parallel data affects the accuracy
and coverage of compound translation. Table 3.9 shows the performance for this
experiment. For German a significant improvement in accuracy is seen, and for Czech
a small improvement in Top1 accuracy but a decline in Top10 accuracy. Note that
these accuracies are still quite low compared to the general performance of phrase
tables in an end-to-end MT system, because the evaluation measures exact-match
accuracy on a generally more challenging and often lower-frequency lexicon subset. The third row
in Table 3.9 for each language shows that if a parallel corpus is available, its n-best
list can be combined with the n-best list of the bilingual-dictionaries algorithm via
weighted voting to provide much higher consensus accuracy.
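The weighted-voting combination of n-best lists could be sketched as follows. The reciprocal-rank scoring used here is one plausible choice of vote weighting, not necessarily the scheme used in this work, and the function name is hypothetical:

```python
def combine_nbest(lists, weights):
    """Combine several n-best candidate lists by weighted voting: each list
    contributes (its weight) * (reciprocal rank) mass to each candidate, and
    candidates are re-ranked by their total accumulated score."""
    scores = {}
    for nbest, w in zip(lists, weights):
        for rank, cand in enumerate(nbest, start=1):
            scores[cand] = scores.get(cand, 0.0) + w / rank
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, a candidate ranked first by both the phrase table and the BiDict algorithm will outscore candidates proposed by only one of the two systems.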
3.7 Statistical Significance of Results
All results reported in this chapter are based on very large test sets, greater than
9,000 examples on average, and the algorithms presented result in large gains with
respect to the baseline. Using a binomial test with the various sample sizes and
baseline accuracies in the different languages, all improvements with respect to the
baseline results (Table 3.2) are statistically significant with a p-value less than 0.05.
Furthermore, the accuracy improvements with respect to the MOSES system (Koehn
et al., 2007) using parallel corpora in Table 3.9 are also statistically significant for
Czech, and for Top 1 accuracy on German. The Top 10 accuracy on German in this
table is not statistically significant with respect to the phrase-table-based accuracy.
Nevertheless, this does not contradict the claim that the methods presented in this
chapter are most useful for languages, such as Czech, where large amounts of parallel
corpora are not available.
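The binomial test referred to above can be sketched as an exact one-sided tail computation. This is a generic illustration under the null hypothesis that each test item is answered correctly with the baseline accuracy; the dissertation's exact test setup may differ:

```python
from math import comb

def binomial_p_value(successes, n, baseline_acc):
    """One-sided exact binomial test: the probability of observing at least
    `successes` correct answers out of n test items if the true per-item
    accuracy were only `baseline_acc`."""
    return sum(comb(n, k) * baseline_acc**k * (1 - baseline_acc)**(n - k)
               for k in range(successes, n + 1))
```

For example, 60 correct out of 100 against a 50% baseline yields a p-value below 0.05, so such an improvement would be deemed significant at that level.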
3.8 Quantifying the Role of
Cross-language Selection
and Usage
3.8.1 Coverage/Accuracy Trade-off
The number of languages offering a translation hypothesis for a given literal English
gloss is a useful parameter for measuring confidence in the algorithm’s selection.
The more distinct languages exhibiting a translation for the gloss, the higher the
likelihood that the majority translation is correct rather than noise. Varying this
parameter yields the coverage/accuracy trade-off shown in Figure 3.4.
3.8.2 Varying the size of bilingual dictionaries
Figure 3.5 illustrates how the size of the bilingual dictionaries used for providing
cross-language evidence affects translation performance. In order to take both coverage
and accuracy into account, the performance measure used was the F-score, the
harmonic mean of Precision (the accuracy on the subset of words that could be
translated) and Pseudo-recall (the fraction of words translated correctly out of the
total words that could be translated using 100% of the dictionary size). Given the
Precision P (same as accuracy) and Pseudo-recall R as defined above, the F-score
[Figure: Coverage/Accuracy Tradeoff — exact-match accuracy (Avg Top 1 Acc. and Avg Top 10 Acc.) plotted against the number of words translated as compounds, with points labeled by the threshold >= x on the number of languages]
Figure 3.4: Coverage/Accuracy trade-off curve by incrementing the minimum number
of languages exhibiting a candidate translation for the source-word’s literal English
gloss. Accuracy here is the Top1 accuracy averaged over all 10 test languages.
[Figure: F-score (Top 1 and Top 10) plotted against the percentage of the bilingual dictionaries used]
Figure 3.5: F-measure performance given varying sizes of the bilingual dictionaries
used for cross-language evidence (as a percentage of words randomly utilized from
each dictionary).
was computed in the standard manner as:

    F = (2 · P · R) / (P + R)        (3.4)
Figure 3.5 shows that increasing the percentage of dictionary size^8 always helps
without plateauing, suggesting substantial extrapolation potential from larger dictionaries.
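Equation 3.4 translates directly into a small helper (an illustrative function name, not the dissertation's code):

```python
def f_score(precision, pseudo_recall):
    """Harmonic mean of Precision and Pseudo-recall, as in Equation 3.4."""
    if precision + pseudo_recall == 0:
        return 0.0
    return 2 * precision * pseudo_recall / (precision + pseudo_recall)
```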
3.8.3 Greedy vs Random Selection of Utilized
Languages
A natural question for the compound translation algorithm is how the choice
of additional languages affects performance. Results of two experiments on this
question are reported. A simple experiment is to use the bilingual dictionaries of
K randomly selected languages^9 and test the resulting performance, incrementing
K until it reaches the full set of 50 languages. The dashed lines in Figures
3.6 and 3.7 show this trend. The performance is measured by F-score as in Section
3.8.2, where Pseudo-Recall here is the fraction of correct candidates out of the total
candidates that could be translated, had bilingual dictionaries of all the languages
been used. It can be seen that adding random bilingual dictionaries helps improve
the performance in a close to linear fashion.
Furthermore, it can be observed that certain contributing languages are much more
effective than others (e.g. Arabic/Farsi vs. Arabic/Czech). A greedy heuristic is
^8 Each run of choosing a percentage of dictionary size was averaged over 10 runs.
^9 Each run of randomly selecting K languages was averaged over 10 runs.
[Figure: F-score (Top 1) plotted against the number of languages utilized (K), for K-Random and K-Greedy selection]
Figure 3.6: Top-1 match F-score performance utilizing K languages for cross-language
evidence, for both a random K languages and greedy selection of the most effective
K languages (typically the closest or largest dictionaries).
[Figure: F-score (Top 10) plotted against the number of languages utilized (K), for K-Random and K-Greedy selection]
Figure 3.7: Top-10 match F-score performance utilizing K languages for cross-language
evidence, for both a random K languages and greedy selection of the most
effective K languages (typically the closest or largest dictionaries).
used to rank each additional cross-language by the number of test words for which
the correct English translation can be provided by that language’s bilingual dictionary.
Figures 3.6 and 3.7 show that greedy selection of the most effective K utilized
languages using this heuristic substantially accelerates performance. In fact, beyond
the best 10 languages, performance plateaus and then decreases slightly, indicating
that the increased noise outweighs the increased coverage.
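A sketch of greedy language selection under the heuristic above. Choosing, at each step, the language whose dictionary correctly covers the most not-yet-covered test words is one natural reading of the heuristic; the function and its inputs are hypothetical:

```python
def greedy_select(languages, covered_words, k):
    """Greedily pick up to k cross-languages. `covered_words` maps each
    language to the set of test words its bilingual dictionary translates
    correctly; each step adds the language with the largest marginal gain."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max((l for l in languages if l not in chosen),
                   key=lambda l: len(covered_words[l] - covered),
                   default=None)
        if best is None:
            break
        chosen.append(best)
        covered |= covered_words[best]
    return chosen
```

With overlapping coverage sets, the greedy order can differ from a ranking by raw coverage alone, since a smaller but complementary dictionary may add more new words.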
3.8.4 Languages found using Greedy selection
Table 3.10 shows the sets of the most effective three cross-languages per test
language, selected using the greedy heuristic explained in the previous section.
Unsurprisingly, related languages tend to help more than distant languages. For example,
Dutch is most effective for the test language German, and Slovak is most effective
for Czech. Interesting symmetries can also be seen between related languages, for
example: Farsi is the top language used for test language Arabic and vice-versa.
Such symmetries can also be seen for other pairs of related languages such as (Czech,
Slovak) and (Russian, Bulgarian). Thus, related languages are the most helpful, and
they can be related in several ways: etymologically, culturally, or through physical
contact (such as Hungarian contact with the Germanic languages). The second point to note is that
languages having large dictionaries also tend to be especially helpful, even when un-
related. This can be seen by the presence of Hungarian in top three cross-languages
for most of the test languages. This is likely because Hungarian was one of the
Albanian                          Arabic
Russian      0.067  0.116         Farsi        0.051  0.090
+Spanish     0.100  0.169         +Spanish     0.059  0.111
+Bulgarian   0.119  0.201         +French      0.077  0.138

Bulgarian                         Czech
Russian      0.186  0.294         Slovak       0.177  0.289
+Hungarian   0.190  0.319         +Russian     0.222  0.368
+Swedish     0.203  0.339         +Hungarian   0.235  0.407

Farsi                             German
Arabic       0.031  0.047         Dutch        0.130  0.228
+Dutch       0.038  0.070         +Swedish     0.191  0.316
+Spanish     0.044  0.079         +Hungarian   0.204  0.355

Hungarian                         Russian
Swedish      0.073  0.108         Bulgarian    0.185  0.250
+Dutch       0.103  0.158         +Hungarian   0.199  0.292
+German      0.117  0.182         +Swedish     0.216  0.319

Slovak                            Swedish
Czech        0.145  0.218         German       0.120  0.188
+Russian     0.168  0.280         +Hungarian   0.152  0.264
+Hungarian   0.176  0.300         +Dutch       0.182  0.309

Table 3.10: Illustrating the 3-best cross-languages obtained for each test language (shown in bold). Each row shows the effect of adding the respective cross-language to the set of languages in the rows above it, and the corresponding F-scores (Top 1 and Top 10) achieved.
largest dictionaries and hence can provide good coverage for obtaining translation
candidates of rarer or technical compounds, which may have more language-universal
literal glosses. For reference, the sizes of the various bilingual dictionaries are provided
in Table 3.4.
3.9 Conclusion
This chapter presents a successful approach to extracting compound translation
relationships without the need for bilingual training text, by modeling the mapping of
literal component-word glosses (e.g. “iron-path”) into fluent English (e.g. “railway”)
across multiple languages. An interesting property of using such cross-language evidence
is that one does not need to restrict the candidate translations to compositional
(or “glossy”) translations, as the model allows the successful generation of more
fluent non-compositional translations. Performance is further improved by adding
component-sequence and learned-morphology models, along with context similarity
from monolingual text and optional combination with traditional bilingual-text-based
translation discovery. These models show consistent performance gains across 10 diverse
test languages.
Chapter 4
Improving Translation Lexicon
Induction from Monolingual
Corpora via Dependency Contexts
and Part-of-Speech Equivalences
Summary
This chapter presents novel improvements to the induction of translation lexi-
cons from monolingual corpora by incorporating multilingual dependency parses. A
dependency-based context model was introduced that incorporates long-range depen-
dencies, variable context sizes, and reordering. It provides a 16% relative improvement
over the baseline approach that uses a fixed context window of adjacent words. Its
Top 10 accuracy for noun translation is higher than that of a statistical translation
model trained on a Spanish-English parallel corpus containing 100,000 sentence pairs.
The evaluation was generalized to other word types, and it was shown that the
relative gain can be increased to 18% by preserving part-of-speech equivalencies
during translation.
Components of this chapter were originally published by the author of this dissertation
in the forum referenced below.^1
4.1 Introduction
Recent trends in machine translation illustrate that highly accurate word and
phrase translations can be learned automatically given enough parallel training data.
However, large parallel corpora exist for only a small fraction of the world’s languages,
leading to a bottleneck for building translation systems in low resource languages
such as Swahili, Uzbek or Punjabi. While parallel training data is uncommon for
such languages, more readily available resources include small translation dictionaries,
comparable corpora, and large amounts of monolingual data.
The marked difference in the availability of monolingual vs parallel corpora has
led several researchers to develop methods for automatically learning bilingual lex-
^1 Reference: N. Garera, C. Callison-Burch, D. Yarowsky. Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences. To appear in Proceedings of the Conference on Natural Language Learning (CoNLL), 2009.
icons, either by using monolingual corpora (Rapp, 1999; Koehn and Knight, 2002;
Schafer and Yarowsky, 2002; Haghighi et al., 2008) or by exploiting the cross-language
evidence of closely related “bridge” languages that have more resources (Mann and
Yarowsky, 2001).
This chapter investigates new ways of learning translations from monolingual corpora.
The Rapp (1999) model of context-vector projection using a seed lexicon is
extended. It is based on the intuition that translation equivalents will have similar
lexical context, even in unrelated corpora. For example, in order to translate the word
“airplane”, the algorithm builds a context vector which might contain terms such as
“passengers”, “runway”, and “airport”; target-language words whose translations
(obtained via the seed lexicon) appear in similar surrounding contexts can then be
considered likely translations.
The basic approach is extended by formulating a context model that uses depen-
dency parses as an external knowledge source. The use of dependency parses has the
following advantages:
• Long distance dependencies allow associated words to be included in the context
vector even if they fall outside of the fixed-window used in the baseline model.
• Using relationships like parent and child instead of absolute positions like pre-
ceding and following word alleviates problems when projecting vectors between
languages with different word orders.
• It achieves better performance than baseline context models across the board,
and better performance than statistical translation models on Top-10 accuracy
for noun translation when trained on identical data.
It is shown that an extension based on part-of-speech clustering can give similar
accuracy gains for learning translations of all word types, thus deepening the findings
of previous literature which mainly focused on translating nouns (Rapp, 1999; Koehn
and Knight, 2002; Haghighi et al., 2008).
4.2 Related Work
The literature on translation lexicon induction for low-resource languages falls into
two broad categories: 1) effectively utilizing similarity between languages by choosing
a high-resource “bridge” language for translation (Mann and Yarowsky, 2001; Garera
and Yarowsky, 2008), and 2) extracting noisy clues (such as similar context) from
monolingual corpora with the help of a seed lexicon (Rapp, 1999; Koehn and Knight,
2002; Schafer and Yarowsky, 2002; Haghighi et al., 2008). The latter category is
more relevant to this work and is explained in detail below.
The idea that words with similar meaning have similar contexts in the same
language comes from the Distributional Hypothesis (Harris, 1985), and Rapp (1999)
was the first to propose using the context of a given word as a clue to its translation.
Given a German word with an unknown translation, a German context vector is con-
structed by counting its surrounding words in a monolingual German corpus. Using
an incomplete bilingual dictionary, the counts of the German context words with
known translations are projected onto an English vector. The projected vector for
the German word is compared to the vectors constructed for all English words using
a monolingual English corpus. The English words with the highest vector similarity
are treated as translation candidates. The original work employed a relatively large
bilingual dictionary containing approximately 16,000 words and tested only on a small
collection of 100 manually selected nouns.
Koehn and Knight (2002) tested this idea on a larger test set consisting of the
1000 most frequent words from a German-English lexicon. They also incorporated
clues such as frequency and orthographic similarity in addition to context. Schafer
and Yarowsky (2002) independently proposed using frequency and orthographic
similarity, and also showed improvements using temporal and word-burstiness similarity
measures, in addition to context. Haghighi et al. (2008) made use of contextual and
orthographic clues for learning a generative model from monolingual corpora and a
seed lexicon.
All of the aforementioned work defines context similarity in terms of the adjacent
words over a window of some arbitrary size (usually 2 to 4 words), as initially proposed
by Rapp (1999). This work shows that the model for surrounding context can be
improved by using dependency information rather than strictly relying on adjacent
words, based on the success of dependency trees for monolingual clustering tasks (Lin
and Pantel, 2002) and the recent developments in multilingual dependency parsing
literature (Buchholz and Marsi, 2006; Nivre et al., 2007).
This work further includes a second evaluation that examines the accuracy of
translating all word types, rather than just nouns, thus differentiating it from previous
work. While the straightforward application of the context-based model gives a lower
overall accuracy than nouns alone, this work shows how learning a mapping of part-
of-speech tagsets between the source and target language can result in comparable
performance to that of noun translation.
4.3 Translation by Context Vector
Projection
This section details how translations are discovered from monolingual corpora
through context vector projection. Section 4.3.1 defines alternative ways of modeling
context vectors, including the baseline models and the dependency-based model.
The central idea of Rapp’s method for learning translations from unrelated monolingual
corpora is context vector projection and vector similarity. The goodness
of semantic “fit” of candidate translations is measured as the vector similarity
between two words. Those vectors are drawn from two different languages, so the
vector for one word must first be projected onto the language space of the other.
Rapp’s algorithm for creating, projecting and comparing vectors is described below,
and illustrated in Figure 4.1.
Algorithm:
1. Extract context vectors:
Given a word in the source language, say sw, create a vector using the surrounding
context words; call this the reference source vector rs_sw for the source word sw.
The actual composition of this vector varies depending on how the surrounding
context is modeled. The context model is independent of the algorithm, and
various models are explained in later sections.
2. Project reference source vector:
Project all the source vector words contained in the projection dictionary onto
the vector space of the target language, retaining the counts from the source corpus.
This vector now exists in the target-language space and is called the reference
target vector rt_sw. This vector may be sparse, depending on how complete the
bilingual dictionary is, because words without dictionary entries will receive
zero counts in the reference target vector.
3. Rank candidates by vector similarity:

For each word t_wi in the target language, a context vector is created using the
target-language monolingual corpus as in Step 1. Compute a similarity score
between the context vector of t_wi = 〈c_i1, c_i2, ..., c_in〉 and the reference target vector
Figure 4.1: Illustration of the Rapp (1999) model for translating the Spanish word
“crecimiento (growth)” via dependency context vectors extracted from the respective
monolingual corpora, as explained in Section 4.3.1.2.
rt_sw = 〈r_1, r_2, ..., r_n〉. The word with the maximum similarity score, t*_wi, is
chosen as the candidate translation of sw.

The vector similarity can be computed in a number of ways. Cosine similarity
was used in this implementation, and the formula for this similarity
measure is given below:

    t*_wi = argmax_{t_wi} (c_i1 · r_1 + c_i2 · r_2 + ... + c_in · r_n) /
            ( sqrt(c_i1^2 + c_i2^2 + ... + c_in^2) · sqrt(r_1^2 + r_2^2 + ... + r_n^2) )

Rapp (1999) used the l1-norm metric after normalizing the vectors to unit length,
Koehn and Knight (2002) used Spearman rank-order correlation, and Schafer
and Yarowsky (2002) used cosine similarity. Cosine similarity was found to give
the best results in the experimental conditions evaluated. Other similarity measures
may be used equally well.
4.3.1 Models of Context
Several context models are compared in this section. Empirical results for their
ability to find accurate translations are given in Section 4.5.
4.3.1.1 Baseline model
In the baseline model, the context is computed using adjacent words as in
(Rapp, 1999; Koehn and Knight, 2002; Schafer and Yarowsky, 2002; Haghighi et al.,
2008). Given a word in the source language, say sw, count all its immediate context
Figure 4.2: An illustration of a dependency tree, showing the parent and child
nodes. The word marked in bold (“crecimiento”) is used as an example source word
in this chapter for illustrative purposes; its adjacent and dependency contexts are
shown in Table 4.1.
words appearing in a window of four words. The counts are collected separately for
each position by keeping track of four separate vectors for positions -2, -1, +1 and
+2. Thus each vector is a sparse vector, with dimensionality equal to the size
of the source-language vocabulary. In addition to the term frequency, each dimension is
reweighted by multiplying the inverse document frequency (IDF) as in the standard
TF.IDF weighting scheme. Since there were no clear document boundaries in the
corpus, virtual document boundaries for computing the IDF were created by
binning every 1,000 words. These vectors are then concatenated into a single
vector, having dimension four times the size of the vocabulary. This vector is called
the reference source vector rssw for source word sw.
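The baseline positional context model could be sketched as follows. This is a simplified illustration: `positional_vectors` and its parameters are hypothetical, and for brevity the four positional vectors are kept in one dict keyed by (position, word) rather than concatenated:

```python
import math
from collections import Counter, defaultdict

def positional_vectors(tokens, window=(-2, -1, 1, 2), doc_size=1000):
    """Build positional context vectors: for each token, count neighbors at
    relative positions -2, -1, +1, +2 separately, then reweight each count by
    the IDF of the context word, computed over virtual `doc_size`-token docs."""
    # document frequency over virtual fixed-size documents
    n_docs = max(1, math.ceil(len(tokens) / doc_size))
    df = Counter()
    for d in range(n_docs):
        for w in set(tokens[d * doc_size:(d + 1) * doc_size]):
            df[w] += 1
    idf = {w: math.log(n_docs / c) for w, c in df.items()}
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for p in window:
            j = i + p
            if 0 <= j < len(tokens):
                vectors[w][(p, tokens[j])] += 1
    # apply TF.IDF reweighting to each (position, context-word) dimension
    return {w: {k: tf * idf[k[1]] for k, tf in vec.items()}
            for w, vec in vectors.items()}
```

Keying entries by (position, context word) preserves the separate -2/-1/+1/+2 counts while avoiding an explicit four-way vector concatenation.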
Position    Adjacent Context    Dependency Context
-2          para                camino
-1          el                  para
+1          y                   prosperidad, y, el
+2          la                  economica

Table 4.1: Contrasting context words derived from the adjacent vs. dependency models for the above example.
4.3.1.2 Modeling context using dependency trees
Dependency parsing is used to extend the context model. Our context vectors
use contexts derived from head-words linked by dependency trees instead of using the
immediate adjacent lexical words. The use of dependency trees for modeling contexts
has been shown to help in monolingual clustering tasks of finding words with similar
meaning (Lin and Pantel, 2002) and this work shows how they can be effectively used
for translation lexicon induction.
The four vectors for positions -1, +1, -2 and +2 in the baseline model get mapped
to immediate parent (-1), immediate child (+1), grandparent (-2) and grandchild
(+2). An example of using the dependency tree context is shown in Figure 4.2, and
the dependency context is shown in contrast with the adjacent context in Table 4.1,
showing the selection of more salient words by using the dependency tree representa-
tion.
Note that while this approach is limited to four positions in the tree, it does not
imply that at most four context words are selected for a given sentence, since a
word can have multiple immediate children depending on the dependency parse of
the sentence. Hence, this approach allows for a dynamic context size, with the
number of context words varying with the number of children and parents at the
two levels.
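The mapping from a dependency parse to the four tree positions could be sketched as follows. This is a hypothetical helper, with `heads` following the common convention of per-token head indices and -1 marking the root:

```python
def dependency_contexts(heads, tokens):
    """For each token index, collect context words at tree positions:
    grandparent (-2), parent (-1), children (+1) and grandchildren (+2).
    `heads[i]` is the index of token i's head, or -1 for the root."""
    children = {i: [] for i in range(len(tokens))}
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)
    ctx = {}
    for i in range(len(tokens)):
        c = {-2: [], -1: [], 1: [], 2: []}
        if heads[i] >= 0:
            c[-1].append(tokens[heads[i]])
            gp = heads[heads[i]]
            if gp >= 0:
                c[-2].append(tokens[gp])
        for ch in children[i]:
            c[1].append(tokens[ch])
            c[2].extend(tokens[g] for g in children[ch])
        ctx[i] = c
    return ctx
```

For a noun with a determiner and an adjective attached to it, both dependents land in the +1 (child) position, regardless of whether they precede or follow the noun in the surface string.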
Another advantage of this method is that it alleviates the reordering problem,
since tree positions (consisting of head-words) are used rather than the adjacent
positions of the baseline context model. For example, if the source Spanish
word to be translated were “prosperidad”, then in the example shown in Figure 4.2,
under the adjacent-context model the context word “economica” shows up in the +1
position in Spanish but the -1 position in English (as adjectives come before nouns
in English); under the dependency-context model the adjective is the child of the
noun and hence shows up in the +1 position in both languages. Thus, a bag-of-words
model (as in Section 4.3) is not needed merely to avoid learning the explicit mapping
that adjective-noun order in Spanish and English is reversed.
4.4 Experimental Design
For our initial set of experiments several different vector-based context models are
compared:
• Adjbow – A baseline model which used bag of words model with a fixed window
of 4 words, two on either side of the word to be translated.
• Adjposn – A second baseline that used a fixed window of 4 words but which took
position into account.
• Depbow – A dependency model which did not distinguish between grandparent,
parent, child and grandchild relations, analogous to the bag-of-words model.
• Depposn – A dependency model which did include such relationships, and was
analogous to the position-based baseline.
• Depposn + rev – The above Depposn model applied in both directions (Spanish-to-
English and English-to-Spanish) using their sum as the final translation score.
Translation accuracy of the above methods, which use monolingual corpora, is
contrasted with a statistical model trained on bilingual parallel corpora. That model
is referred to as Mosesen-es-100k, because it was trained using the Moses toolkit (Koehn
et al., 2007).
4.4.1 Training Data
All context models were trained on a Spanish corpus containing 100,000 sentences
with 2.13 million words and an English corpus containing 100,000 sentences with 2.07
million words. The Spanish corpus was parsed using the MST dependency parser
(McDonald et al., 2005) trained using dependency trees generated from the English
Penn Treebank (Marcus et al., 1993) and the Spanish CoNLL-X data (Buchholz and
Marsi, 2006).
In order to directly compare against statistical translation models, Spanish and
English monolingual corpora were drawn from the Europarl parallel corpus (Koehn,
2005). The fact that the two monolingual corpora are taken from a parallel corpus
ensures that the assumption that similar contexts are a good indicator of translation
holds. This assumption underlies in all work of translation lexicon induction from
comparable monolingual corpora, and this work maintains a strong bias toward that
assumption. Despite the bias, the comparison of different context models holds, since
all models are trained on the same data.
4.4.2 Evaluation Criterion
The models were evaluated in terms of exact-match translation accuracy of the
1000 most frequent nouns in an English-Spanish dictionary. The accuracy was
calculated by counting how many mappings exactly match one of the entries in the
dictionary. This evaluation criterion is similar to the setup used by Koehn and Knight
(2002). The Top N accuracy was computed in the standard way as the number of
Spanish words whose Top N English translation candidates contain a lexicon trans-
lation entry out of the total number of Spanish words that can be mapped correctly
using the lexicon entries. Thus if “crecimiento, growth” is the correct mapping based
on the lexicon entries, the translation for “crecimiento” will be counted as correct if
“growth” occurs in the Top N English translation candidates for “crecimiento”.
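The Top-N exact-match evaluation described above can be sketched as follows. The helper name and argument shapes are hypothetical; the lexicon maps each source word to its set of acceptable reference translations:

```python
def top_n_accuracy(candidates, lexicon, n):
    """Exact-match Top-N accuracy: the fraction of source words (restricted
    to those with a lexicon entry) whose top-n candidate list contains at
    least one of the lexicon's reference translations."""
    evaluable = [w for w in candidates if w in lexicon]
    if not evaluable:
        return 0.0
    correct = sum(1 for w in evaluable
                  if any(c in lexicon[w] for c in candidates[w][:n]))
    return correct / len(evaluable)
```

So if “growth” appears anywhere in the top n candidates for “crecimiento” and the lexicon lists it, that word counts as correct at that n.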
Note that the exact-match accuracy is a conservative estimate as it is possible
that the algorithm may propose a reasonable translation for the given Spanish word
but is marked incorrect if it does not exist in the lexicon.
Because it would be intractable to compare each projected vector against the
vectors for all possible English words, only the projected vector from each Spanish
word are compared against the vectors for the 1000 most frequent English nouns,
following along the lines of previous work (Koehn and Knight, 2002; Haghighi et al.,
2008).
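The projection-and-ranking evaluation described above can be sketched as follows; the sparse-vector representation and function names are illustrative assumptions, not the dissertation's actual implementation.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_n_accuracy(projected, english_vectors, lexicon, n=10):
    """Fraction of Spanish words whose top-n English candidates (ranked
    by cosine against the projected vector) contain a lexicon entry.
    Only words that can be mapped via lexicon entries are counted,
    mirroring the exact-match criterion in the text."""
    correct = total = 0
    for es_word, vec in projected.items():
        gold = lexicon.get(es_word)
        if not gold:
            continue  # word has no lexicon entry; excluded from the denominator
        total += 1
        ranked = sorted(english_vectors,
                        key=lambda en: cosine(vec, english_vectors[en]),
                        reverse=True)
        if any(en in gold for en in ranked[:n]):
            correct += 1
    return correct / total if total else 0.0
```

In this sketch, `english_vectors` would hold only the 1000 most frequent English nouns, reproducing the restricted candidate space used for tractability.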
4.5 Results
Table 4.3 gives the Top 1 and Top 10 accuracy for each of the models on their
ability to translate Spanish nouns into English. Examples of the top 10 translations
using the best performing of the baseline and dependency-based models are shown in
Table 4.2. The baseline models Adjposn and Adjbow differ in that the latter disregards
the position information in the context vector and simply uses a bag of words instead.
Table 4.3 shows that Adjbow gains from this simplification. A bag of words vector
approach pools counts together, which helps to reduce data sparsity. In the position
based model the vector is four times as long. Additionally, the bag of words model can
help when there is local re-ordering between the two languages. For instance, Spanish
adjectives often follow nouns whereas in English the ordering is reversed. Thus,
one can either learn position mappings, that is, position +1 for adjectives in Spanish
camino:
    Depposn Context Model      Adjbow Context Model
    way         0.124          intentions   0.22
    solution    0.097          way          0.21
    steps       0.094          idea         0.20
    path        0.093          thing        0.20
    debate      0.085          faith        0.18
    account     0.082          steps        0.17
    means       0.080          example      0.17
    work        0.079          news         0.16
    approach    0.074          work         0.16
    issue       0.073          attitude     0.15

crecimiento:
    Depposn Context Model      Adjbow Context Model
    growth      0.27           growth       0.27
    activity    0.14           loss         0.22
    development 0.12           source       0.20
    recovery    0.12           activity     0.15
    integration 0.12           integration  0.15
    prosperity  0.11           savings      0.15
    creation    0.10           coordination 0.13
    cohesion    0.09           taxation     0.13
    order       0.08           prosperity   0.13
    package     0.08           expense      0.12

Table 4.2: Top 10 translation candidates for the Spanish words “camino (way)” and “crecimiento (growth)” for the best adjacent context model (Adjbow) and best dependency context model (Depposn). The bold English terms show the acceptable translations.
Model              AccTop 1   AccTop 10
Adjbow             35.3%      59.8%
Adjposn            20.9%      46.9%
Depbow             41.0%      62.0%
Depposn            41.0%      64.1%
Depposn + rev      42.9%      65.5%
Mosesen-es-100k    56.4%      62.7%

Table 4.3: Performance of various context-based models learned from monolingual corpora and the phrase-table learned from parallel corpora on noun translation.
Figure 4.3: Precision/Recall curve showing superior performance of dependency
context model as compared to adjacent context at different recall points. Precision
is the fraction of tested Spanish words with the Top 1 translation correct and Recall
is the fraction of the 1000 Spanish words tested.
is the same as position -1 in English, or simply add the word counts from different
positions into one common vector, as in the bag of words approach.
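The two baseline context models can be sketched as follows; the ±2 window follows the text's note that the positional vector is four times as long, while the data layout and function name are illustrative assumptions.

```python
from collections import Counter

def adjacency_contexts(tokens, positional=True, window=2):
    """Build a context vector for each word type from its adjacent words.
    positional=True keys counts by (offset, word), as in the Adj_posn
    model; positional=False pools all offsets into one bag of words,
    as in the Adj_bow model."""
    vectors = {}
    for i, word in enumerate(tokens):
        vec = vectors.setdefault(word, Counter())
        for off in range(-window, window + 1):
            if off == 0 or not (0 <= i + off < len(tokens)):
                continue
            ctx = tokens[i + off]
            vec[(off, ctx) if positional else ctx] += 1
    return vectors
```

Pooling means that a Spanish adjective at position +1 and its English counterpart at position -1 fall into the same vector dimension, which is why the bag-of-words variant tolerates local reordering.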
Using dependency trees also alleviates the problem of position mapping between
source and target languages. Table 4.3 shows that the dependency tree based models
outperform the baseline models substantially. Comparing Depbow
to Depposn shows that ignoring the tree depth and treating it as a bag of words does not
increase the performance. This contrasts with the baseline models. The dependency
positions account for re-ordering automatically. The precision-recall curve in Figure
4.3 shows that the dependency-based context performs better than adjacent context
at almost all recall levels.
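As a rough sketch, a dependency-based context can be read directly off head-dependent links. The sketch below collapses tree position to head (+1) versus dependent (-1), which is a simplification of the Dep models described in the text; the head-array encoding is an assumption.

```python
def dependency_contexts(heads, tokens, positional=True):
    """Context vectors from a dependency tree: each word's context is
    its head (offset +1) and its dependents (offset -1). heads[i] is
    the index of token i's head, or -1 for the root."""
    vectors = {t: {} for t in tokens}
    for i, h in enumerate(heads):
        if h < 0:
            continue  # root has no head link
        for word, ctx, off in ((tokens[i], tokens[h], +1),
                               (tokens[h], tokens[i], -1)):
            key = (off, ctx) if positional else ctx
            vectors[word][key] = vectors[word].get(key, 0) + 1
    return vectors
```

Because the keys are tree-relative rather than string-relative, the same head-dependent pair yields the same dimension regardless of surface word order, which is the property that makes dependency contexts robust to reordering.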
The Mosesen-es-100k model shows the performance of the statistical translation
model trained on a bilingual parallel corpus. While the system performs best in
Top 1 accuracy, the dependency context-based model that ignores the sentence
alignments surprisingly performs better on Top 10 accuracy, showing substantial
promise.
When computing the accuracy using the phrase-table learned from parallel
corpora (Mosesen-es-100k), the translation probabilities from both directions (p(es|en) and
p(en|es)) were used to rank the candidates. The monolingual context-based model is
also applied in the reverse direction (from English to Spanish) and the row with label
Depposn + rev in Table 4.3 shows further performance gains using both directions.
Spanish        English        Sim Score   Is present in lexicon
senores        gentlemen      0.99        NO
xenofobia      xenophobia     0.87        YES
diversidad     diversity      0.73        YES
chipre         cyprus         0.66        YES
mujeres        women          0.65        YES
alemania       germany        0.65        YES
explotacion    exploitation   0.63        YES
hombres        men            0.62        YES
republica      republic       0.60        YES
racismo        racism         0.59        YES
comercio       commerce       0.58        YES
continente     continent      0.53        YES
gobierno       government     0.52        YES
israel         israel         0.52        YES
francia        france         0.52        YES
fundamento     certainty      0.51        NO
suecia         sweden         0.50        YES
trafico        space          0.49        NO
television     tv             0.48        YES
francesa       portuguese     0.48        NO

Table 4.4: List of the 20 most confident mappings using the dependency context based model for noun translation, along with exact-match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the first mapping (senores, gentlemen) is the correct one, it was not present in the lexicon used for evaluation and hence is marked as incorrect.
4.6 Further Extensions: Generalizing to
other word types via tagset mapping
Most of the previous literature on this problem focuses on evaluating on nouns
(Rapp, 1999; Koehn and Knight, 2002; Haghighi et al., 2008). However, the vector
projection approach is general and should be applicable to other word types as well.
The models are therefore evaluated on a new test set containing the 1000 most
frequent words (not just nouns) in the English-Spanish lexicon.
The dependency-based context model is used to create translations for this new
set. The row labeled Depposn in Table 4.5 shows that the accuracy on this set is lower
when compared to evaluating only on nouns. The main reason for the lower accuracy is
that closed-class words are often the most frequent and tend to have a wide range of
contexts, so the context model proposes them as reasonable translations for most
words, including open-class words. For instance, the English preposition “to” appears as the most
confident translation for 147 out of the 1000 Spanish test words, and for none (rightly
so) after restricting the translations by part-of-speech categories.
This problem can be greatly reduced by exploiting the intuition that part-of-
speech is often preserved in translation; thus the space of candidate translations
can be largely reduced based on part-of-speech restrictions. For example, a
noun in the source language will usually be translated as a noun in the target language, a deter-
miner as a determiner, and so on. This idea is more clearly illustrated
Figure 4.4: Illustration of using part-of-speech tag mapping to restrict candidate
space of translations.
in Figure 4.4. Rather than imposing a hard restriction, this work computes a rank-
ing based on the conditional probability of the candidate translation's part-of-speech tag
given the source word's tag.
An interesting problem in using part-of-speech restrictions is that corpora in dif-
ferent languages have been tagged using widely different tagsets and the following
subsection explains this problem in detail:
4.6.1 Mapping Part-of-Speech tagsets in different
languages
The English tagset was derived from the Penn Treebank and consists of 53 tags (in-
cluding punctuation markers); the Spanish tagset was derived from the Cast3LB
dataset and consists of 57 tags. There is a large difference in the morphological and
syntactic features marked by the two tagsets. For example, the Spanish tagset has different
tags for masculine and feminine nouns and also a distinct tag for coordinated
nouns, all of which need to be mapped to the singular or plural noun categories available
in the English tagset. Figure 4.5 shows an illustration of the mapping problem between
the Spanish and English POS tags.
An empirical approach for learning the mapping between tagsets, using the English-
Spanish projection dictionary employed in the monolingual context-based translation
models, is now described. Given a small English-Spanish bilingual dictionary and an n-best
list of part-of-speech tags for each word in the dictionary2, the conditional probabil-
ity of translating a source word with POS tag s_{pos_i} to a target word with POS tag t_{pos_j}
is computed as follows:
2The n-best part-of-speech tag list for any word in the dictionary was derived using the relative frequencies in part-of-speech annotated corpora in the respective languages.
Figure 4.5: Illustration of mapping Spanish part-of-speech tagset to English tagset.
The tagsets vary greatly in notation and the morphological/syntactic constituents
represented and need to be mapped first, using the algorithm described in Section
4.6.1.
p(t_{pos_j} \mid s_{pos_i}) = \frac{c(s_{pos_i},\, t_{pos_j})}{c(s_{pos_i})}    (4.1)

= \frac{\sum_{s_w \in S,\, t_w \in T} p(s_{pos_i} \mid s_w) \cdot p(t_{pos_j} \mid t_w) \cdot I_{dict}(s_w, t_w)}{\sum_{s_w \in S} p(s_{pos_i} \mid s_w) \cdot \sum_{t_w \in T} I_{dict}(s_w, t_w)}    (4.2)
where
• S and T are the source and target vocabularies in the seed dictionary, with s_w
and t_w being any of the words in the respective sets.

• p(s_{pos_i} | s_w) and p(t_{pos_j} | t_w) are obtained using relative frequencies in part-of-speech
tagged corpora in the source and target languages respectively, and are used as soft
counts.

• I_{dict}(s_w, t_w) is the indicator function with value 1 if the pair (s_w, t_w) occurs in
the seed dictionary and 0 otherwise.
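Equation 4.2 amounts to accumulating soft counts over the seed dictionary: each dictionary pair (s_w, t_w) contributes p(s_pos|s_w) · p(t_pos|t_w) to the joint count. The following sketch implements that accumulation; the data layouts are illustrative assumptions.

```python
from collections import defaultdict

def tag_mapping_probs(seed_dict, src_tag_probs, tgt_tag_probs):
    """Estimate p(t_pos | s_pos) from a seed dictionary via soft counts,
    following Eq. 4.2. seed_dict: set of (sw, tw) word pairs;
    src_tag_probs / tgt_tag_probs: word -> {tag: probability}."""
    joint = defaultdict(float)     # soft count c(s_pos, t_pos)
    marginal = defaultdict(float)  # soft count c(s_pos)
    for sw, tw in seed_dict:
        for spos, ps in src_tag_probs.get(sw, {}).items():
            marginal[spos] += ps
            for tpos, pt in tgt_tag_probs.get(tw, {}).items():
                joint[(spos, tpos)] += ps * pt
    return {pair: c / marginal[pair[0]] for pair, c in joint.items()}
```

Because each dictionary pair adds p(s_pos|s_w) once to the marginal, the denominator equals the double sum of Equation 4.2 over entries present in the dictionary.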
Figure 4.6: Precision/Recall curve showing superior performance of using part-of-
speech equivalences for translating all word-types. Precision is the fraction of tested
Spanish words with the Top 1 translation correct and Recall is the fraction of the 1000 Spanish
words tested.
Model      AccTop 1   AccTop 10
Depposn    35.1%      62.9%
+ POS      41.3%      66.4%

Table 4.5: Performance of the dependency context-based model, alone and with the addition of the part-of-speech mapping model, on translating all word-types.
In essence, the mapping between tagsets is learned using the known translations
from a small dictionary.
Given a source word s_w to translate, its most likely tag s*_{pos}, and the most likely
mapping of this tag into English t*_{pos} computed as above, only the translation candidates
with part-of-speech tag t*_{pos} are considered for comparison with vector similarity;
the other candidates, with t_{pos_j} ≠ t*_{pos}, are discarded from the candidate space. Figure
4.4 shows an example of restricting the candidate space using part-of-speech tags.
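The restriction step can be sketched as a simple filter over the candidate space; all argument names and layouts here are hypothetical.

```python
def restrict_by_pos(src_word, candidates, src_tag_probs, tag_map, cand_tags):
    """Keep only candidates whose POS tag equals t*_pos, the most likely
    English mapping of the source word's most likely tag.
    src_tag_probs: word -> {tag: prob}; tag_map: (s_pos, t_pos) -> prob
    (e.g. from the learned tagset mapping); cand_tags: candidate -> tag."""
    s_tags = src_tag_probs[src_word]
    s_pos = max(s_tags, key=s_tags.get)        # s*_pos
    mappings = {t: p for (s, t), p in tag_map.items() if s == s_pos}
    t_pos = max(mappings, key=mappings.get)    # t*_pos
    return [c for c in candidates if cand_tags.get(c) == t_pos]
```

This is the hard-filter version; the text's ranking variant would instead weight each candidate by p(t_pos | s_pos) rather than discarding it.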
The row labeled +POS in Table 4.5 shows that the part-of-speech mapping provides a
substantial gain in performance compared to direct application of the dependency context-
based model, and the resulting accuracy is comparable to that obtained when evaluating only on nouns
in Table 4.3.
Spanish        English         Sim Score   Is present in lexicon
senores        gentlemen       0.99        NO
chipre         cyprus          0.66        YES
mujeres        women           0.65        YES
alemania       germany         0.65        YES
hombres        men             0.62        YES
expresar       express         0.60        YES
racismo        racism          0.59        YES
interior       internal        0.55        YES
gobierno       government      0.52        YES
francia        france          0.52        YES
cultural       cultural        0.51        YES
suecia         sweden          0.50        YES
fundamento     basis           0.48        YES
francesa       french          0.48        YES
entre          between         0.47        YES
origen         origin          0.46        YES
trafico        traffic         0.45        YES
de             of              0.44        YES
social         social          0.43        YES
ruego          thank           0.43        NO
energia        energy          0.42        YES
clave          key             0.42        YES
papel          role            0.42        YES
institucional  institutional   0.42        YES
transporte     transport       0.41        YES

Table 4.6: List of the 25 most confident mappings using the dependency context model with the part-of-speech mapping model translating all word-types, along with exact-match evaluation output based on whether the mapping is present as a lexicon entry. Note that although the second best mapping in Table 4.4 for noun translation is for xenophobia with score 0.87, xenophobia is not among the 1000 most frequent words (of all word-types) and thus is not in this test set.
4.7 Application to Unrelated Corpora
One of the challenges in the translation lexicon literature is to apply the monolingual
corpus-based methods to completely unrelated corpora, as opposed to the un-aligned
bitext or comparable corpora that have commonly been utilized. This section pro-
vides an initial foray into studying the effect on performance of utilizing unrelated
monolingual corpora for seed-based translation lexicon induction.
As a sample of unrelated corpora, the English Gigaword Corpus (News) and Span-
ish Europarl Corpus (Parliamentary Proceedings) were utilized as target-side and
source-side monolingual corpora respectively. Since the bag-of-words word-adjacency
context-based model can be applied in a straightforward manner to the new
corpora, the difference in performance for this model was calculated when utilizing
the unrelated corpora. It was found that the Top 10 accuracy drops from 59.8% to
48.8% when using unrelated corpora. A drop of this magnitude is expected, as unrelated
corpora yield less salient context vectors for projection; nevertheless,
the performance is still reasonable compared to the accuracies obtained
using parallel corpora, showing substantial promise for application to diverse corpora.
4.8 Statistical Significance of Results
Using a binomial test of sample size 1000 and the best baseline accuracies of 35.3%
(Top 1) and 59.8% (Top 10) for noun-translations (Table 4.3), any improvements in
accuracy over 37.8% (Top 1) and 62.3% (Top 10) are statistically significant with a
p-value less than 0.05. Furthermore, even the improvement in Top 10 accuracy (row
Depposn + rev) in Table 4.3 with respect to the MOSES system that utilizes parallel
corpora (Koehn et al., 2007) is statistically significant with a p-value less than 0.05.
In Table 4.5, showing the improvements obtained via part-of-speech mapping,
any accuracy improvement over 37.6% (Top 1) and 65.4% (Top 10) is statistically
significant with a p-value less than 0.05. Thus, the results in Table 4.5 are also
statistically significant.
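The text does not state whether an exact or approximate binomial test was used; the following sketch uses the one-sided normal approximation to the binomial, which reproduces the quoted cutoffs to within rounding.

```python
from math import sqrt

def significance_threshold(p0, n=1000, z=1.6449):
    """Smallest accuracy significantly above a baseline accuracy p0 at
    p < 0.05 (one-sided), under the normal approximation to a binomial
    with sample size n. z = 1.6449 is the one-sided 5% critical value."""
    return p0 + z * sqrt(p0 * (1 - p0) / n)

# Baselines from Table 4.3: 35.3% (Top 1) and 59.8% (Top 10).
top1_cutoff = significance_threshold(0.353)   # close to the 37.8% quoted
top10_cutoff = significance_threshold(0.598)  # close to the 62.3% quoted
```

The same calculation with the Table 4.5 baselines yields the 37.6% and 65.4% cutoffs quoted for the all-word-types evaluation.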
4.9 Conclusion
This chapter presents a novel contribution to the standard context models used
when learning translation lexicons from monolingual corpora by vector projection. It
is shown that using contexts based on dependency parses can provide more salient
contexts, allow for dynamic context size, and account for word reordering in the source
and target language. An exact-match evaluation shows 16% relative improvement by
using a dependency-based context model over the standard approach. Furthermore,
it is shown that the introduced model, which is trained only on monolingual corpora,
outperforms the standard statistical MT approach to learning phrase tables when
trained on the same amount of sentence-aligned parallel corpora, when evaluated on
Top 10 accuracy.
As a second contribution, this work goes beyond previous literature, which evalu-
ated only on nouns. It is shown how preserving a word’s part-of-speech in translation
can improve performance. Furthermore, a solution to an interesting sub-problem en-
countered on the way is proposed. Since part-of-speech tagsets are not identical across
two languages, this work proposes a way of learning their mapping automatically.
Restricting the candidate space based on this learned tagset mapping resulted in an 18%
improvement over the direct application of the context-based model to all word-types.
Dependency trees help improve the context for translation substantially, and their
use opens up the question of how the context can be enriched further by making use of
hidden structure that may provide clues for a word’s translation. This work also
strengthens the belief that learning the mapping between tagsets in two different
languages can be useful in general for other NLP tasks that make use of projection of
words and morphological/syntactic properties between languages.
Part II
Extracting Semantic Relationships
Chapter 5
Part II Literature Review
This chapter covers the literature review for extracting semantic relations. Section
5.1 describes previous work for extracting relationships such as “is-a” and “part-of”
present in a semantic taxonomy. Section 5.2 describes previous work for extracting
more complex semantic relationships such as definite anaphora. Both these types
are also inter-related and Chapter 7 shows how improved modeling of taxonomic
relationships can aid in extraction of more complex semantic relationships.
5.1 Extracting relationships in a semantic
taxonomy
There has been a plethora of work on extracting generic semantic relationships
such as “is-a”, “part-of”, etc. Some examples of such generic semantic relations in-
clude:
• Hypernyms/Hyponyms1: These constitute “is-a” relationships such as “(banana,
fruit)”, “(car, vehicle)”, etc.
• Meronyms: These constitute part-of/member-of relationships such as “(wheel,
car)”, “(floor, house)”, etc.
• Synonyms: These constitute pairs that describe the same concept or have similar
meaning within the language, such as “(path, way)”, “(instruct, teach)”, etc.
• Cousins/Siblings: These constitute pairs that share a close common hypernym
such as “(bus, truck)”, “(diabetes, arthritis)”, etc.
The main approaches in the literature for learning such relationships are described
below:
5.1.1 Manually created databases
A major line of research has been on manually creating a semantic taxonomy from
scratch. A popular example of such a database is WordNet (Miller, 1995; Fellbaum,
1998) that has been used widely in natural language processing problems. It contains
over 150,000 unique strings laid out in a taxonomy that identifies the hypernymy,
1Given an “is-a” relationship, the hypernym is the parent and the hyponym is the child. Thus for “(banana, fruit)”, “fruit” is the hypernym and “banana” is the hyponym.
meronymy, synonymy and cousin/sibling relationships. Another popular semantic
resource is CYC (Lenat, 1995). Such a vast manual effort has also been replicated for
other languages leading to creation of Eurowordnet (Vossen, 1998), Hindi WordNet
(Narayan, 2002), Japanese WordNet (Isahara et al., 2008) etc.
However, taxonomy resources such as WordNet are limited or non-existent for most
of the world’s languages. Building a WordNet manually from scratch requires a huge
amount of human effort and for rare languages the required human and linguistic
resources may simply not be available. Hence a major line of research in the com-
putational linguistics literature has been focused on automatically extracting such
semantic relations. These approaches are explained in the following sections.
5.1.2 Hand-crafted Patterns for “is-a” and “part-
whole” relationships
Some of the semantic relationships found in WordNet tend to occur using a few
evocative fixed patterns. Thus given a corpus of the language and a list of hand-
crafted patterns, a large amount of semantic knowledge can be extracted from cor-
pora. This observation was first explored in detail by Hearst (1992) for extracting
hypernymy or “is-a” relationships from unstructured corpora. She observed that the
hypernyms usually co-occur with the following patterns/regular expressions:
1. NP such as NP,* ( or | and ) NP
2. such NP as NP ,* ( or | and ) NP
3. NP, NP* , or other NP
4. NP, NP* , and other NP
5. NP , including NP,* or | and NP
6. NP , especially NP,* or | and NP
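As an illustrative sketch, pattern 1 above can be approximated with a regular expression in which a single word stands in for each NP. This is a deliberate simplification (real systems match chunked noun phrases), and the helper names are assumptions.

```python
import re

# Simplified instance of Hearst pattern 1: "NP such as NP (, NP)* (and|or) NP".
PATTERN = re.compile(
    r"(\w+)\s+such\s+as\s+((?:\w+\s*,\s*)*\w+(?:\s+(?:and|or)\s+\w+)?)")

def extract_hypernym_pairs(text):
    """Return (hyponym, hypernym) pairs matched by the simplified pattern."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        # Split the coordinated list into individual hyponym candidates.
        hyponyms = re.split(r"\s*,\s*|\s+(?:and|or)\s+", m.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs
```

For example, the input "fruits such as bananas, apples and oranges" yields the pairs (bananas, fruits), (apples, fruits), and (oranges, fruits).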
A total of approximately 400 word pairs were extracted using the above patterns
from a corpus of about 20 million words. While the accuracy was not reported, the
extracted pairs were found to exhibit the hypernymy relationship. She also proposed
how such patterns may be learned from a set of seeds but did not perform an empir-
ical evaluation of such an approach. Seed-based pattern induction has since
become one of the mainstream approaches for information extraction and is explained
in Section 5.1.3.
While the above patterns have been widely used for extracting the hypernym relationship
between common nouns, Mann (2002) used the manually crafted pattern “<Common
Noun> <Proper Noun>” for extracting the “is-a” relationship for proper nouns. This
pattern exploits the premodifier position for providing a description of the proper
noun. For example, this pattern will match phrases such as “[the] automaker Mer-
cedes Benz”.
Berland and Charniak (1999) used hand-crafted patterns for extracting meronym
(“part-of”) relationships. The following two patterns were used to extract word pairs
exhibiting meronym relationship:
1. whole-NN ’s part-NN
2. part-NN of the | a whole-NN
Using a corpus of 100,000 words, they were able to extract meronym relations with
55% accuracy. Girju et al. (2003) improved upon their work using a combination
of hand-crafted pattern and supervised learning. They suggested using the following
patterns for extracting meronym pairs:
1. whole-NP ’s part-NP
2. part-NP of whole-NP
3. part-NP VERB whole-NP
After extracting sentences that match the above patterns, they filtered out the bad
examples using decision trees. The decision tree model was trained using features
from the WordNet.
The problem with using a few fixed patterns is their often low coverage; thus
there is a need for discovering additional informative patterns automati-
cally. Approaches for automatically learning such patterns are described in the follow-
ing section.
5.1.3 Weakly supervised approaches
Weakly supervised approaches using seed exemplars have been widely used in the
literature for extracting semantic relations. The basic paradigm is similar to the seed-
based approach for crosslingual relationship extraction explained in earlier chapters.
Given a set of seeds of the relationship of interest, patterns such as “X and other
Ys” (for hypernymy) are learned automatically (Ravichandran and Hovy, 2002). The
pattern-learning approaches for learning semantic relationships have been also ap-
plied successfully for learning factual relationships. A more detailed description of
such approaches is provided in the literature review for factual relationships in Chap-
ter 8.
A major problem often noticed in pattern-learning approaches with re-
spect to learning semantic relationships is the high recall but low precision of the learned pat-
terns2. Furthermore, much of the semantic relation extraction work has focused on
how this problem can be solved by combining evidence from multiple relations in Sec-
tion 6.3.2. Furthermore, Chapter 6 also describes how derived semantic relationships
can be used for extracting cross-lingual relationships.
2also noted by Pantel and Pennacchiotti (2006)
5.1.4 Training Supervised Classifiers
Going beyond the popular pattern-based approaches, several researchers have also
tried training a fully supervised classifier such as logistic regression, using features
derived from the parse tree of the sentence (Snow et al., 2006). However detailed
level annotations and other resources such as parse trees may be difficult to obtain
for other languages. Furthermore, requirements of such resources can also become a
bottleneck when scaling up to large corpora.
5.1.5 Clustering Approaches
The other end of the spectrum involves fully unsupervised clustering-based ap-
proaches. A distinct line of work on inducing taxonomies was based on agglomerative
clustering of words using a notion of word similarity (Caraballo, 1999). Cederberg
and Widdows (2003) use latent semantic analysis and noun co-ordination patterns to
improve the precision and recall of hyponymy extraction. The clustering by commit-
tee (CBC) algorithm (Pantel and Lin, 2002) has also been used in extracting noun
clusters that belong to the same class (Pantel and Ravichandran, 2004).
5.2 Extracting complex semantic
relationships
This section covers literature on extracting more complex semantic relationships
that are difficult to extract via contextual pattern templates. Definite anaphora (see
Figure 5.1) is a typical example of such a relationship where surface local context is
not sufficient. The standard approaches for coreference resolution that are evaluated
on MUC-style (Hirschman and Chinchor, 1997) corpora have been reported to per-
form poorly on resolution of definite anaphors (Connolly et al., 1997; Strube et al.,
2002; Ng and Cardie, 2002; Yang et al., 2003). For instance, the coreference system
for German texts by Strube et al. (2002) reports an F-measure of 33.9% for definite
NPs as compared to 82.8% for personal pronouns.
Definite anaphors are also an interesting case study because they require deriving
simple relationships such as “is-a” for successful anaphora resolution/generation. For
example, in Figure 5.1, determining the antecedent to the definite anaphor “the drug”
in text requires knowledge of what previous noun-phrase candidates could be drugs.
Likewise, generating a definite anaphor for the antecedent “Morphine” in text requires
both knowledge of potential hypernyms (e.g. “the opiate”, “the narcotic”, “the drug”,
and “the substance”), as well as selection of the most appropriate level of generality
along the hypernym tree in context (i.e. the “natural” hypernym anaphor). In or-
der to obtain such “is-a” relationship knowledge for dealing with definite anaphors,
...pseudoephedrine is found in an allergy treatment, which was given to Wilson by a doctor when he attended Blinn junior college in Houston. In a unanimous vote, the Norwegian sports confederation ruled that Wilson had not taken the drug to enhance his performance...
...pseudoephedrine is found in an allergy treatment, which was given to Wilson by a doctor when he attended Blinn junior college in Houston. In a unanimous vote, the Norwegian sports confederation ruled that Wilson had not taken the __?__ to enhance his performance...
Resolution Task
Generation Task
Figure 5.1: Example of definite anaphora resolution and generation. Both the tasks
require the knowledge of a derived semantic relationship that “pseudoephedrine is-a
drug”.
many resolution systems rely on the manually built WordNet database (Poesio et al.,
1997; Meyer and Dale, 2002). WordNet has also been used as an important feature
in machine learning of coreference resolution using supervised training data (Soon et
al., 2001; Ng and Cardie, 2002).
However, there are several disadvantages to using handcrafted ontologies. First,
building, extending and maintaining ontologies by hand is expensive. Second,
some of the anaphoric relationships are context dependent. Hearst (1992) raises the
issue of whether underspecified, context or point-of-view dependent hyponymy rela-
tions should be included in a fixed ontology. For example, “corruption” is referred to
as “the tool” in the corpora utilized for this study. This is a metaphoric usage that
would be difficult to predict unless given the usage sentence and its context. Third,
using all senses of anaphor and potential antecedents in the ontology can result in an
incorrect link due to wrong antecedent selection. Finally, the most significant disad-
vantage is that WordNet has rigid and complicated hierarchy levels. Thus there is
no notion of a “natural” parent, which is essential for definite anaphora generation.
In order to alleviate these problems, corpus-based approaches for automatically deriv-
ing “is-a” relationships for definite anaphora have been used in the literature (Poesio
et al., 2002; Markert and Nissim, 2005). The relevant literature for such corpus-
based approaches and contributions to it are described in more detail in Section 7.2
of Chapter 7.
Chapter 6
Minimally Supervised Multilingual
Taxonomy and Translation
Lexicon Induction
Summary
This chapter presents a novel algorithm for the acquisition of multilingual lex-
ical taxonomies (including hyponymy/hypernymy, meronymy and taxonomic cous-
inhood), from monolingual corpora with minimal supervision in the form of seed
exemplars using discriminative learning across the major WordNet semantic relation-
ships. This capability is also extended robustly and effectively to a second language
(Hindi) via cross-language projection of the various seed exemplars. This chapter also
[Figure 6.1 content: parallel induced hypernym trees. Induced English hypernymy: weapon → {grenade, explosive, bomb, gun}. Induced Hindi hypernymy (with glosses): hathiyaara (weapon) → {haathagolaa (grenade), baaruuda (explosive), bama (bomb), banduuka (gun)}.]
Figure 6.1: Goal: To induce multilingual taxonomy relationships in parallel in mul-
tiple languages (such as Hindi and English) for information extraction and machine
translation purposes.
presents a novel model of translation dictionary induction via multilingual transitive
models of hypernymy and hyponymy, using these induced taxonomies. Candidate
lexical translation probabilities are based on the probability that their induced hy-
ponyms and/or hypernyms are translations of one another. All of the above models
are evaluated on English and Hindi.
Components of this chapter were originally published by the author of this disserta-
tion in the forum referenced below1.
1Reference: N. Garera and D. Yarowsky. Minimally Supervised Multilingual Taxonomy andTranslation Lexicon Induction. Proceedings of International Joint Conference on Natural LanguageProcessing (IJCNLP), 2008.
6.1 Introduction
Taxonomy resources such as WordNet (Miller, 1995; Fellbaum, 1998) are limited
or non-existent for most of the world’s languages. Building a WordNet manually
from scratch requires a huge amount of human effort and for rare languages the
required human and linguistic resources may simply not be available. Most of the
automatic approaches for extracting semantic relations (such as hyponyms) have been
demonstrated for English and some of them rely on various language-specific resources
(such as supervised training data, language-specific lexicosyntactic patterns, shallow
parsers, etc.). This chapter presents a language independent approach for induc-
ing taxonomies such as shown in Figure 6.1 using limited supervision and linguistic
resources. A seed learning based approach for extracting semantic relations (hy-
ponyms, meronyms and cousins) is presented that improves upon existing induction
frameworks by combining evidence from multiple semantic relation types. Using a
joint model for extracting different semantic relations helps to induce more relation-
specific patterns and filter out the generic patterns2. The patterns can then be used
for extracting new word-pairs expressing the relation. Note that the only training
data used in the algorithm are the few seed pairs required to start the bootstrapping
process, which are relatively easy to obtain. The algorithm is evaluated on English
and a second language (Hindi), showing reliable and accurate induction of taxonomies
2The phrase “generic patterns” means patterns that cannot distinguish between different semanticrelations. For example, the pattern “X and Y” is a generic pattern whereas the pattern “Y such asX” is a hyponym-specific pattern.
in two diverse languages.
This chapter further describes how having induced parallel taxonomies in two lan-
guages can be used for augmenting a translation dictionary between those two
languages. The translation algorithm makes use of the automatically induced hy-
ponym/hypernym relations in each language to create a transitive “bridge” for dic-
tionary induction. Specifically, it relies on the key observation that words in two
languages (e.g. English and Hindi) have increased probabilities of being translations
of each other if their hypernyms or hyponyms are translations of one another.
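That key observation can be sketched as a simple scoring function: a candidate pair scores highly when the seed-dictionary translations of one word's induced hypernyms overlap the other word's induced hypernyms. The fraction-based score below is illustrative only; the chapter's actual model is probabilistic.

```python
def bridge_score(src_word, tgt_word, src_hypernyms, tgt_hypernyms, seed_dict):
    """Score (src_word, tgt_word) as translation candidates by the
    fraction of src_word's induced hypernyms whose seed-dictionary
    translation appears among tgt_word's induced hypernyms.
    *_hypernyms: word -> set of hypernyms; seed_dict: word -> set of
    known translations."""
    src_h = src_hypernyms.get(src_word, set())
    tgt_h = tgt_hypernyms.get(tgt_word, set())
    if not src_h:
        return 0.0
    linked = sum(1 for h in src_h if seed_dict.get(h, set()) & tgt_h)
    return linked / len(src_h)
```

Using the Figure 6.1 example, Hindi "bama" and English "bomb" would score highly because the induced hypernym "hathiyaara" translates to "weapon", the induced hypernym of "bomb".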
6.2 Related Work
While manually created WordNets for English (Miller, 1995; Fellbaum, 1998) and
Hindi (Narayan, 2002) have been made available, a lot of time and effort is required
in building such semantic taxonomies from scratch. Hence several automatic corpus
based approaches for acquiring lexical knowledge have been proposed in the litera-
ture. Much of this work has been done for English based on using a few evocative
fixed patterns including “X and other Ys”, “Y such as X”, as in the classic work
by Hearst (1992). The problem with using a few fixed patterns is their often low
coverage; thus there is a need for discovering additional informative
patterns automatically. There has been a plethora of work in the area of informa-
tion extraction using automatically derived contextual patterns for semantic
categories (e.g. companies, locations, time, person-names, etc.) based on bootstrap-
ping from a small set of seed words (Riloff and Jones, 1999; Agichtein and Gravano,
2000; Ravichandran and Hovy, 2002; Pasca et al., 2006). This framework has also been shown to work for extracting semantic relations between entities: Girju et al.
(2003) used 100 seed words from WordNet to extract patterns for part-of relations.
Pantel and Pennacchiotti (2006) use pattern-based approaches to extract is-a, part-of
and other semantic relations. While most of the above pattern induction work has
been shown to work well for specific relations (such as “birthdates, companies, etc.”),
Section 6.3.1 explains why directly applying seed learning for semantic relations can
result in high recall but low precision patterns, a problem also noted by Pantel and
Pennacchiotti (2006). Furthermore, much of the semantic relation extraction work
has focused on extracting a particular relation independently of other relations. This
chapter describes how this problem can be solved by combining evidence from multiple
relations in Section 6.3.2. Snow et al. (2006) also describe a probabilistic framework
for combining evidence using constraints from hyponymy and cousin relations. How-
ever, they use a supervised logistic regression model. Moreover, their features rely on
parsing dependency trees which may not be available for most languages.
The key contribution of this work is the use of evidence from multiple relationship types in the seed-learning framework for inducing these relationships, together with a multilingual evaluation of this approach. Furthermore, the extraction of semantic relations
in multiple languages can serve as a useful tool for improving a dictionary between
Rank  English        Hindi
1     Y, the X       Y aura X (Gloss: Y and X)
2     Y and X        Y va X (Gloss: Y in addition to X)
3     X and other Y  Y ne X (Gloss: Y (case marker) X)
4     X and Y        X ke Y (Gloss: X’s Y)
5     Y, X           Y me.n X (Gloss: Y in X)

Table 6.1: Naive pattern scoring: hyponymy patterns ranked by their raw corpus frequency scores.
those languages.
6.3 Approach
To be able to automatically create taxonomies such as WordNet, it is useful to be able to learn not only hyponymy/hypernymy directly, but also the additional semantic relationships of meronymy and taxonomic cousinhood. Specifically, given a pair of words (X, Y), the task is to answer the following questions:
1. Is X a hyponym of Y (e.g. weapon, gun)?
2. Is X a part/member of Y (e.g. trigger, gun)?
3. Is X a cousin/sibling3 of Y (e.g. gun, missile)?
4. Do none of the above three relations apply, but X is observed in the context of Y (e.g. airplane, accident)?4
Class 4 is referred to as “other” in the rest of the chapter.
3Cousins/siblings are words that share a close common hypernym.
4Note that this does not imply X is unrelated or independent of Y. On the contrary, the required sentential co-occurrence implies a topic similarity. Thus, this is a much harder class to distinguish from classes 1-3 than non-co-occurring unrelatedness (such as gun, protozoa) and hence was included in the evaluation.
6.3.1 Independently Bootstrapping Lexical
Relationship Models
Following the pattern induction framework of Ravichandran and Hovy (2002),
one of the ways of extracting different semantic relations is to learn patterns for each
relation independently using seeds of that relation and extract new pairs using the
learned patterns. For example, to build an independent model of hyponymy using
this framework, approximately 50 seed exemplars of hyponym pairs were used for
extracting all the patterns that match with the seed pairs5. As in Ravichandran
and Hovy (2002), the patterns were ranked by corpus frequency and a frequency
threshold was set to select the final patterns. These patterns were then used to
extract new word pairs expressing the hyponymy relation by finding word pairs that
occur with these patterns in an unlabeled corpus. However, the problem with this
approach is that generic patterns (like “X and Y”) occur many times in a corpus and
thus low-precision patterns may end up with high cumulative scores. This problem
is illustrated more clearly in Table 6.1, which shows a list of top five hyponymy
patterns (ranked by their corpus frequency) using this approach. This problem can be overcome by exploiting the multi-class nature of the task and combining evidence
from multiple relations in order to learn high precision patterns (with high conditional
probabilities) for each relation. The key idea is to weed out the patterns that occur
5A pattern is the n-gram sequence occurring between the seed pair (also called glue text). The length of the pattern was thresholded to 15 words.
Rank  English         Hindi
1     Y like X        X aura anya Y (Gloss: X and other Y)
2     Y such as X     Y, X (Gloss: Y, X)
3     X and other Y   X jaise Y (Gloss: X like Y)
4     Y and X         Y tathaa X (Gloss: Y or X)
5     Y, including X  X va anya Y (Gloss: X and other Y)

Table 6.2: Patterns for the hypernymy class re-ranked using evidence from other classes. Patterns distributed fairly evenly across multiple relationship types (e.g. “X and Y”) are deprecated more than patterns focused predominantly on a single relationship type (e.g. “Y such as X”).
in more than one semantic relation and keep the ones that are relation-specific6, thus
using the relations meronymy, cousins and other as negative evidence for hyponymy
and vice versa. Table 6.2 shows the pattern ranking by using the model developed
in Section 6.3.2 that makes use of evidence from different classes. More hyponymy-specific patterns are ranked at the top7, suggesting the usefulness of this method in finding class-specific patterns.
6In the actual algorithm, the common patterns are not entirely removed; instead, an estimate of the conditional class probability p(class|pattern) is computed for each pattern.
7It is interesting to see in Table 6.2 that the top learned Hindi hyponymy patterns seem to be translations of the English patterns suggested by Hearst (1992). This leads to an interesting future-work question: are the most effective hyponym patterns in other languages usually translations of the English hyponym patterns proposed by Hearst (1992), and what are frequent exceptions?
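As a concrete illustration, the independent pattern-bootstrapping step of Section 6.3.1 can be sketched as below. The whitespace tokenization, the toy sentences and seed pair, and the handling of only one word order are simplifying assumptions for illustration, not the exact implementation used in this dissertation; only the 15-word gap threshold follows footnote 5.

```python
from collections import Counter

def extract_patterns(sentences, seed_pairs, max_gap=15):
    """Collect the 'glue text' between seed-pair members, as in
    Ravichandran and Hovy (2002), and rank patterns by raw corpus
    frequency (the naive scoring illustrated in Table 6.1)."""
    pattern_freq = Counter()
    for sent in sentences:
        tokens = sent.split()
        for x, y in seed_pairs:
            for i, tok in enumerate(tokens):
                if tok != x:
                    continue
                # look for the other seed word within max_gap tokens
                for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
                    if tokens[j] == y:
                        gap = " ".join(tokens[i + 1:j])
                        pattern_freq["X " + gap + " Y"] += 1
    return pattern_freq.most_common()

# Toy usage with a single hypothetical hyponym/hypernym seed pair:
sents = ["a gun and other weapons were found",
         "weapons such as guns are banned"]
seeds = [("gun", "weapons")]
print(extract_patterns(sents, seeds))  # → [('X and other Y', 1)]
```

A full implementation would also match patterns in which Y precedes X (e.g. “Y such as X”); this sketch only shows the glue-text counting itself.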
6.3.2 A minimally supervised multi-class classifier
for identifying different semantic relations
First, a list of patterns is extracted from an unlabeled corpus8 independently for
each relationship type (class) using the seeds9 for the respective class as in Section
6.3.1.10 In order to develop a multi-class probabilistic model, the probability of each
class c given the pattern p is obtained as follows:
P(c|p) = seedfreq(p, c) / Σc′ seedfreq(p, c′) (6.1)
where seedfreq(p, c) is the number of seeds of class c that were found with the pattern p
in an unlabeled corpus. A sample of the P (class|pattern) tables for English and Hindi
are shown in Tables 6.3 and 6.4 respectively. These tables make clear how the occurrence of a pattern in multiple classes can be used to find reliable patterns for a particular class.
For example, in Table 6.3: although the pattern “X and Y” will get a higher seed fre-
quency than the pattern “Y, especially X”, the probability P (“X and Y ”|hyponymy)
is much lower than P (“Y, especially X”|hyponymy), since the pattern “Y, especially
X” is unlikely to occur with seeds of other relations.
Now, instead of using the seedfreq(p, c) as the score for a particular pattern with re-
8Unlabeled monolingual corpora were used for this task; the English corpus was the LDC Gigaword corpus and the Hindi corpus was newswire text extracted from the web, containing a total of 64 million words.
9The numbers of seeds used for the classes {hyponym, meronym, cousin, other} were {48, 40, 49, 50} for English and {32, 58, 31, 35} for Hindi respectively. A sample of the seeds used is shown in Table 6.5.
10Only the patterns that had seed frequency greater than one were retained for extracting new word pairs. The total numbers of retained patterns across all classes for {English, Hindi} were {455, 117} respectively.
spect to a class, the patterns can be rescored using the probabilities P (class|pattern).
Thus the final score for a pattern p with respect to class c is obtained as:
score(p, c) = seedfreq(p, c) · P (c|p) (6.2)
This equation can be viewed as balancing recall and precision, where the first term
is the frequency of the pattern with respect to seeds of class c (representing recall),
and the second term represents the relation-specificity of the pattern with respect to
class c (representing precision). The score for each pattern is recomputed in the above manner to obtain a ranked list of patterns for each of the classes for English and
Hindi. Now, to extract new pairs for each class, all the patterns with a seed frequency
greater than 2 are used to extract word pairs from an unlabeled corpus. The semantic
class for each extracted pair is then predicted using the multi-class classifier as follows:
Given a pair of words (X1, X2), note all the patterns that matched with this pair in the unlabeled corpus and denote this set as P. Choose the predicted class c∗ for this pair
as:
c∗ = argmaxc Σp∈P score(p, c) (6.3)
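A minimal sketch of Equations 6.1-6.3 follows; the dictionary-based data structures and the toy seed-frequency counts are hypothetical, chosen only to make the rescoring concrete.

```python
def class_probs(seedfreq):
    """Eq. 6.1: P(c|p) = seedfreq(p, c) / sum_c' seedfreq(p, c')."""
    probs = {}
    for p, by_class in seedfreq.items():
        total = sum(by_class.values())
        probs[p] = {c: f / total for c, f in by_class.items()}
    return probs

def score(seedfreq, probs, p, c):
    """Eq. 6.2: score(p, c) = seedfreq(p, c) * P(c|p) -- the first factor
    reflects recall, the second the relation-specificity (precision)."""
    return seedfreq[p].get(c, 0) * probs[p].get(c, 0.0)

def predict(patterns, seedfreq, probs, classes):
    """Eq. 6.3: c* = argmax_c sum_{p in P} score(p, c), where P is the
    set of patterns observed with the word pair in the corpus."""
    return max(classes,
               key=lambda c: sum(score(seedfreq, probs, p, c)
                                 for p in patterns))

# Hypothetical counts: "X and Y" is generic, "Y such as X" is hyponym-specific.
seedfreq = {"X and Y": {"hyponym": 5, "cousin": 12, "other": 3},
            "Y such as X": {"hyponym": 8}}
probs = class_probs(seedfreq)
print(predict(["X and Y", "Y such as X"], seedfreq, probs,
              ["hyponym", "cousin", "other"]))  # → hyponym
```

Note how the generic pattern contributes little to the hyponym score once its probability mass is spread across classes, which is exactly the reranking effect shown in Table 6.2.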
6.3.3 Evaluation of the Classification Task
Over 10,000 new word relationship pairs were extracted based on the above algo-
rithm. While it is hard to evaluate all the extracted pairs manually, one can certainly
Pattern          Hyponym  Meronym  Cousin/Sibling  Other
X of the Y       0        0.66     0.04            0.3
Y, especially X  1        0        0               0
Y, whose X       0        1        0               0
X and other Y    0.63     0.08     0.18            0.11
X and Y          0.23     0.3      0.33            0.14

Table 6.3: A sample of patterns and their relationship type probabilities P(class|pattern) extracted at the end of the training phase for English.
Pattern                        Hyponym  Meronym  Cousin/Sibling  Other
X aura anya Y (X and other Y)  1        0        0               0
X aura Y (X and Y)             0.09     0.09     0.71            0.11
X jaise Y (X like Y)           1        0        0               0
X va Y (X and Y)               0.11     0        0.89            0
Y kii X (Y’s X)                0.33     0.67     0               0

Table 6.4: A sample of patterns and their class probabilities P(class|pattern) extracted at the end of the training phase for Hindi.
          English                                     Hindi
          Seed Pairs            Model Predictions     Seed Pairs                          Model Predictions
Hypernym  tool,hammer           weapon,gun            khela,Tenisa (game,tennis)         paarTii,kaa.ngresa (party,congress)
          currency,yen          sport,hockey          appraadha,hatyaa (crime,murder)    kaagajaata,passporTa (document,passport)
          metal,copper          disease,cancer        jaanvara,bhaaga (animal,tiger)     bhaashhaa,a.ngrejii (language,English)
Meronym   wheel,truck           room,hotel            u.ngalii,haatha (finger,hand)      jeba,sharTa (pocket,shirt)
          headline,newspaper    bark,tree             kamaraa,aspataala (room,hospital)  kaptaana,Tiima (captain,team)
          wing,bird             lens,camera           ma.njila,imaarata (floor,building) darvaaja,makaana (door,house)
Cousin    dollar,euro           guitar,drum           bhaajapa,kaa.ngresa (bjp,congress) peTrola,Diijala (petrol,diesel)
          heroin,cocaine        history,geography     Hindii,a.ngrejii (Hindi,English)   Daalara,rupayaa (dollar,rupee)
          helicopter,submarine  diabetes,arthritis    basa,Traka (bus,truck)             talaaba,nadii (pond,river)

Table 6.5: A sample of seeds used and model predictions for each class for the taxonomy induction task. For each of the model predictions shown above, its Hyponym/Meronym/Cousin classification was correctly assigned by the model.
create a representative smaller test set and evaluate performance on that set. The
test set was created by randomly identifying word pairs in WordNet and newswire
corpora and annotating their correct semantic class relationships. Test set construction was done entirely independently of the algorithm application, and hence some
of the test pairs were missed entirely by the learning algorithm, yielding only partial
coverage.
The total numbers of test examples across all classes were 200 and 140 for the English and Hindi test sets respectively. The overall coverage11 on these test sets was 81% and
79% for English and Hindi respectively. Table 6.6 reports the overall accuracy12 for
the 4-way classification using different pattern scoring methods. Baseline 1 scores patterns by their corpus frequency, as in Ravichandran and Hovy (2002); Baseline 2 is another intuitive method that scores patterns by the number of seeds they extract. The third row in Table 6.6 shows the result of rescoring patterns by their class conditional probabilities, which gives the best accuracy.
While this method yields some improvement over other baselines, the main point to
note here is that the pattern-based methods which have been shown to work well for
English also perform reasonably well on Hindi, in spite of the fact that the unlabeled corpus available for Hindi was 15 times smaller than that for English.
Table 6.7 shows detailed accuracy results for each relationship type using the model
11Coverage is defined as the percentage of the test cases that were present in the unlabeled corpus, that is, cases for which an answer was given.
12Accuracy on a particular set of pairs is defined as the percentage of pairs in that set whose class was correctly predicted.
Model                English Accuracy  Hindi Accuracy
Baseline 1 [RH02]    65%               63%
Baseline 2 seedfreq  70%               65%
seedfreq · P(c|p)    73%               66%

Table 6.6: Overall accuracy for the 4-way classification {hypernym, meronym, cousin, other} using different pattern scoring methods.
                English                     Hindi
                Total  Coverage  Accuracy   Total  Coverage  Accuracy
Hyponym         83     74%       97%        59     82%       75%
Meronym         41     81%       88%        33     63%       81%
Cousin/Sibling  42     91%       55%        23     91%       71%
Other           34     85%       31%        25     80%       20%
Overall         200    81%       73%        140    79%       66%

Table 6.7: Test set coverage and accuracy results for inducing different semantic relationship types.
        English                      Hindi
        Hypo.  Mero.  Cous.  Other   Hypo.  Mero.  Cous.  Other
Hypo.   59     1      1      0       36     1      10     1
Mero.   1      28     1      3       0      17     4      0
Cous.   14     3      21     0       6      0      15     0
Other   7      3      10     9       1      4      11     4

Table 6.8: Confusion matrices for English (left) and Hindi (right) for the four-way classification task.
developed in Section 6.3.2. It is also interesting to see in Table 6.8 that most of the
confusion is due to the “other” class being classified as “cousin”, which is expected, as cousin words are only weakly semantically related and use more generic patterns such as “X and Y” that can often be associated with the “other” class as well.
Classes with clear semantics, like hypernymy and meronymy, seem to be well
discriminated as their induced patterns are less likely to occur in other relationship
types.
6.4 Statistical Significance of Results
Using a binomial test of sample sizes 200 (English) and 140 (Hindi), and the
baseline algorithm performance of 65% (English) and 63% (Hindi), any improvement
in accuracy over 70.3% for English and over 70% for Hindi is statistically significant with a p-value of less than 0.05. Thus the final overall accuracy obtained for English (73%) is statistically significant, while that obtained for Hindi (66%) is not.
[Figure 6.2 diagram: the Hindi word hathiyaara links via induced hyponymy to banduuka, bama, baaruuda and haathagolaa, which link via existing dictionary entries or previously induced translations to the English words gun, bomb, explosive and grenade, which in turn link via induced hypernymy to weapon; the goal is to learn the translation hathiyaara → weapon.]
Figure 6.2: Illustration of the models of using induced hyponymy and hypernymy for translation lexicon induction.
6.5 Improving a partial translation
dictionary
In this section, I describe the application of automatically generated multilingual
taxonomies to the task of translation dictionary induction. The hypothesis is that a
pair of words in two languages would have increased probability of being translations
of each other if their hypernyms or hyponyms are translations of one another.
As illustrated in Figure 6.2, the probability that weapon is a translation of the Hindi
word hathiyaara can be decomposed into the sum of the probabilities that their hy-
ponyms in both languages (as induced in Section 6.3.2) are translations of each other.
Thus:
PH→E(WE|WH) = Σi Phyper(WE|Eng(Hi)) Phypo(Hi|WH) (6.4)
[Figure 6.3 diagram: the Hindi word raaiphala links via induced hypernymy to hathiyaara, which links via existing dictionary entries or previously induced translations to the English word weapon, whose induced hyponyms (missile, grenade, bomb, rifle) form the hypothesis space; the goal is to learn the translation of raaiphala.]
Figure 6.3: Reducing the space of likely translation candidates of the word raaiphala by inducing its hypernym, using a partial dictionary to look up the translation of the hypernym and generating the candidate translations as induced hyponyms in English space.
for induced hyponyms Hi of the source word WH , and using an existing (and likely
very incomplete) Hindi-English dictionary to generate Eng(Hi) for these hyponyms,
and the corresponding induced hypernyms of these translations in English.13 A preliminary evaluation of this idea was conducted for obtaining English translations of
a set of 25 Hindi words. The Hindi candidate hyponym space had been pruned of
function words and non-noun words. The likely English translation candidates for
each Hindi word were ranked according to the probability PH−>E(WE|WH).
13One of the challenges of inducing a dictionary via a corpus-based taxonomy is sense disambiguation of the words to be translated. In the current model, the more dominant sense (in terms of corpus frequency of its hyponyms) is likely to be selected by this approach. While the current model can still help in getting translations of the dominant sense, possible future work would be to cluster all the hyponyms according to contextual features such that each cluster represents the hyponyms for a particular sense. The current dictionary induction model could then be applied again using the hyponym clusters to distinguish different senses for translation.
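Equation 6.4 reduces to a sum over the induced Hindi hyponyms and their dictionary translations. A minimal sketch follows; the probability tables mirror Figure 6.2 but their values are invented for illustration, and the dictionary-based data structures are assumptions rather than the actual implementation.

```python
def translation_prob(w_e, w_h, hypo_h, dictionary, hyper_e):
    """Eq. 6.4: P_{H->E}(W_E|W_H) = sum_i P_hyper(W_E|Eng(H_i)) * P_hypo(H_i|W_H)

    hypo_h:     induced Hindi hyponym distributions    P_hypo(H_i|W_H)
    dictionary: partial Hindi->English dictionary      Eng(H_i)
    hyper_e:    induced English hypernym distributions P_hyper(W_E|e)
    """
    total = 0.0
    for h_i, p_hypo in hypo_h.get(w_h, {}).items():
        for e in dictionary.get(h_i, []):  # translations of the hyponym
            total += hyper_e.get(e, {}).get(w_e, 0.0) * p_hypo
    return total

# Toy tables mirroring Figure 6.2 (probabilities are made up):
hypo_h = {"hathiyaara": {"banduuka": 0.6, "bama": 0.4}}
dictionary = {"banduuka": ["gun"], "bama": ["bomb"]}
hyper_e = {"gun": {"weapon": 0.9}, "bomb": {"weapon": 0.8}}
print(translation_prob("weapon", "hathiyaara", hypo_h, dictionary, hyper_e))
```

Each term in the sum is one transitive “bridge” through a hyponym: hathiyaara → banduuka → gun → weapon, and hathiyaara → bama → bomb → weapon.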
        Accuracy (uni-d)  Accuracy (bi-d)  Accuracy (bi-d + Other)
Top 1   20%               36%              36%
Top 5   56%               64%              72%
Top 10  72%               72%              80%
Top 20  84%               84%              84%

Table 6.9: Accuracy on Hindi to English word translation using different transitive hypernym algorithms. The additional model components in the bi-d (bi-directional) plus Other model are only used to rerank the top 20 candidates of the bidirectional model, and are hence limited to its top-20 performance.
The first column of Table 6.9 shows the stand-alone performance for this model on
the dictionary induction task. This standalone model has a reasonably good accuracy
for finding the correct translation in the Top 10 and Top 20 English candidates.
This approach can be further improved by also implementing the above model in the reverse direction, computing P(WH|WEi) for each of the top 20 English candidate translations Ei. The final score for an English candidate translation given a Hindi word was obtained by combining the two directions, that is, by summing P(WEi|WH) + P(WH|WEi).
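The bidirectional combination amounts to reranking candidates by the sum of the two directional scores. A small sketch, with invented candidate scores (the candidate words and values are hypothetical):

```python
def bidirectional_score(p_e_given_h, p_h_given_e):
    """Combine the forward and reverse translation probabilities,
    P(W_Ei|W_H) + P(W_H|W_Ei), as described above."""
    return p_e_given_h + p_h_given_e

# Hypothetical top candidates for one Hindi word: (forward, reverse) scores.
candidates = {"weapon": (0.40, 0.35), "tool": (0.30, 0.10)}
reranked = sorted(candidates,
                  key=lambda e: bidirectional_score(*candidates[e]),
                  reverse=True)
print(reranked)  # → ['weapon', 'tool']
```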
The second column of Table 6.9 shows how this bidirectional approach helps in get-
ting the right translations in Top 1 and Top 5 as compared to the unidirectional
approach. Table 6.10 shows a sample of correct and incorrect translations generated
by the above model. It is interesting to see that the incorrectly translated words tend to be very general ones (like “topic”, “stuff”, etc.), whose hyponym space is very large and diffuse. While the columns
Correctly translated    Incorrectly translated
aujaara (tool)          vishaya (topic)
biimaarii (disease)     saamana (stuff)
hathiyaara (weapon)     dala (group,union)
dastaaveja (documents)  tyohaara (festival)
aparaadha (crime)       jagaha (position,location)

Table 6.10: A sample of correct and incorrect translations using transitive hypernymy/hyponymy word translation induction.
1 and 2 of Table 6.9 show the standalone application of this translation dictionary induction method, it can also be combined with existing work on dictionary induction using other translation induction measures, such as relative frequency similarity in multilingual corpora and cross-language context similarity between word co-occurrence vectors (Schafer and Yarowsky, 2002). These dictionary induction measures were implemented and combined with the taxonomy-based dictionary induction model by simply summing the two scores14. The preliminary results for bidirectional
hypernym/hyponym + other features are shown in column 3 of Table 6.9.
6.6 Conclusion
This chapter presents a novel minimal-resource algorithm for the acquisition of
multilingual lexical taxonomies (including hyponymy/hypernymy and meronymy).
The algorithm is based on cross language projection of various monolingual indica-
tors of these taxonomic relationships in free text and via bootstrapping thereof. Using
only 31-58 seed examples, the algorithm achieves accuracies of 73% and 66% for
14After renormalizing each of the individual scores to be in the range 0 to 1.
English and Hindi respectively on the tasks of hyponymy/meronymy/cousinhood/other
model induction. The robustness of this approach is shown by the fact that the unan-
notated Hindi development corpus was only 1/15th the size of the utilized English
corpus. A novel model of unsupervised translation dictionary induction is also pre-
sented via multilingual transitive models of hypernymy and hyponymy, using these
induced taxonomies and evaluated on Hindi-English. Performance starting from no
multilingual dictionary supervision is quite promising.
Chapter 7
Extraction of Semantic Facts from
Unlabeled Corpora targeting
Resolution and Generation of
Definite Anaphora
Summary
This chapter outlines an original and successful approach for both resolving and
generating definite anaphora. Models for extracting hypernym relations are learned
by mining co-occurrence data of definite NPs and potential antecedents in an un-
labeled corpus. The algorithm outperforms a standard WordNet-based approach to
resolving and generating definite anaphora. It also substantially outperforms recent
related work using pattern-based extraction of such hypernym relations for corefer-
ence resolution.
Components of this chapter were originally published by the author of this disserta-
tion in the forum referenced below1.
7.1 Introduction
Successful resolution and generation of definite anaphora requires knowledge of
hypernym and hyponym relationships. For example, determining the antecedent to
the definite anaphor “the drug” in text requires knowledge of what previous noun-
phrase candidates could be drugs. Likewise, generating a definite anaphor for the
antecedent “Morphine” in text requires both knowledge of potential hypernyms (e.g.
“the opiate”, “the narcotic”, “the drug”, and “the substance”), as well as selection of
the most appropriate level of generality along the hypernym tree in context (i.e. the
“natural” hypernym anaphor). Unfortunately, existing manual hypernym databases
such as WordNet are very incomplete, especially for technical vocabulary and proper
names. WordNets are also limited or non-existent for most of the world’s languages.
Finally, WordNets also do not include notation of the “natural” hypernym level for
anaphora generation, and using the immediate parent performs quite poorly, as quantified in Section 7.5.
1Reference: N. Garera and D. Yarowsky. Resolving and Generating Definite Anaphora by Modeling Hypernymy using Unlabeled Corpora. Proceedings of the Conference on Natural Language Learning (CoNLL), 2006.
[Figure 7.1 diagram: the same pseudoephedrine passage shown twice, once as the resolution task (the anaphor “the drug” present and candidate antecedents circled) and once as the generation task (the anaphor blanked out as “__?__”).]
Figure 7.1: Example of definite anaphora resolution and generation. Both tasks require the knowledge of the semantic relationship “pseudoephedrine is-a drug”; however, the resolution task is easier because there is only a limited set of candidates to choose from (shown by circled nouns).
The first part of this chapter describes a novel approach for resolving definite anaphora
involving hyponymy relations, which performs substantially better than previous ap-
proaches on the task of antecedent selection. In the second part, the same approach
is successfully extended to the problem of generating a natural definite NP given a
specific antecedent.
The following example taken from the LDC Gigaword corpus (Graff et al., 2005) ex-
plains the antecedent selection task for definite anaphora more clearly (see also Figure
7.1):
(1)...pseudoephedrine is found in an allergy treatment, which was given to Wilson
by a doctor when he attended Blinn junior college in Houston. In a unanimous vote,
the Norwegian sports confederation ruled that Wilson had not taken the drug to
enhance his performance...
In the above example, the task is to resolve the definite NP the drug to its correct
antecedent pseudoephedrine, among the potential antecedents <pseudoephedrine, al-
lergy, blinn, college, houston, vote, confederation, wilson>. Only Wilson can be ruled
out on syntactic grounds (Hobbs, 1978). To be able to resolve the correct antecedent
from the remaining potential antecedents, the system requires the knowledge that
pseudoephedrine is a drug. Thus, the problem is to create such a knowledge source
and apply it to this task of antecedent selection. A total of 177 such anaphoric ex-
amples were extracted randomly from the LDC Gigaword corpus and a human judge
identified the correct antecedent for the definite NP in each example (given a context
of previous sentences).2 Two human judges were asked to perform the same task over
the same examples. The agreement between the judges was 92% (of all 177 exam-
ples), indicating a clearly defined task for evaluation purposes.
This chapter describes an unsupervised approach to this task that extracts examples
containing definite NPs from a large corpus, considers all head words appearing be-
fore the definite NP as potential antecedents, and then filters the noisy <antecedent, definite-NP> pairs in Mutual Information space. The co-occurrence statistics of
such pairs can then be used as a mechanism for detecting a hypernym relation be-
tween the definite NP and its potential antecedents. This approach is compared with
a WordNet-based algorithm and with an approach presented by Markert and Nissim
(2005) on resolving definite NP coreference that makes use of lexico-syntactic patterns
such as ’X and Other Ys’ as utilized by Hearst (1992).
7.2 Related work
There is a rich tradition of work using lexical and semantic resources for anaphora
and coreference resolution. Several researchers have used WordNet as a lexical and
semantic resource for certain types of bridging anaphora (Poesio et al., 1997; Meyer
2The test examples were selected as follows: First, all the sentences containing a definite NP “the Y” were extracted from the corpus. Then, the sentences containing instances of anaphoric definite NPs were kept and other cases of definite expressions (like the existential NPs “The White House”, “The weather”) were discarded. From this anaphoric set of sentences, 177 sentence instances covering 13 distinct hypernyms were randomly selected as the test set and annotated for the correct antecedent by human judges.
and Dale, 2002). WordNet has also been used as an important feature in machine
learning of coreference resolution using supervised training data (Soon et al., 2001;
Ng and Cardie, 2002). However, several researchers have reported that knowledge
incorporated via WordNet is still insufficient for definite anaphora resolution. And
of course, WordNet is not available for all languages and is missing large segments of the vocabulary even for covered languages. Hence researchers have investigated the use of corpus-based approaches to build a WordNet-like resource automatically
(Hearst, 1992; Caraballo, 1999; Berland and Charniak, 1999). Poesio et al. (2002)
have proposed extracting lexical knowledge about part-of relations using Hearst-style
patterns and applied it to the task of resolving bridging references. Markert et al.
(2003) have applied relations extracted from lexico-syntactic patterns such as ’X and
other Ys’ for Other-Anaphora (referential NPs with modifiers other or another) and
for bridging involving meronymy.
There has generally been a lack of work in the existing literature on automatically building lexical resources for definite anaphora resolution involving hyponym relations such as that presented in Example (1). However, this issue was recently addressed by
Markert and Nissim (2005) by extending their work on Other-Anaphora using the lexico-syntactic pattern ’X and other Ys’ to antecedent selection for definite NP coreference.
However, the task here is more challenging since the anaphoric definite NPs in the
test set include only hypernym anaphors without including the much simpler cases
of headword repetition and other instances of string matching. For direct evaluation,
their corpus-based approach was also implemented and compared with the models
presented in this chapter on identical test data.
Later in the chapter, a mechanism for combining the knowledge obtained from Word-
Net and the six corpus-based approaches is also presented. The resulting models are
able to overcome the weaknesses of a WordNet-only model and substantially outperform any of the individual models.
7.3 Models for Lexical Acquisition
7.3.1 TheY-Model
The algorithm developed in this section is one of the core contributions of this
chapter. This algorithm is motivated by the observation that in a discourse, the
use of the definite article (“the”) in a non-deictic context is primarily licensed if the
concept has already been mentioned in the text. Hence a sentence such as “The drug
is very expensive” generally implies that either the word drug itself was previously
mentioned (e.g. “He is taking a new drug for his high cholesterol.”) or a hyponym of
drug was previously mentioned (e.g. “He is taking Lipitor for his high cholesterol.”).
Because it is straightforward to filter out the former case by string matching, the
residual instances of the phrase “the drug” (without previous mentions of the word
“drug” in the discourse) are likely to be instances of hypernymic definite anaphora.
One can then determine which nouns earlier in the discourse (e.g. Lipitor) are likely
antecedents by unsupervised statistical co-occurrence modeling aggregated over the
entire corpus. All that is needed is a large corpus without any anaphora annotation
and a basic tool for noun tagging and NP head annotation. The detailed algorithm
is as follows:
1. Find each sentence in the training corpus that contains a definite NP (’the
Y’ ) and does not contain ’a Y’, ’an Y’ or other instantiations of Y appearing
before the definite NP within a fixed window. The window size was set to two sentences; a larger window size of five sentences was also experimented with, and the results obtained were similar. While matching for both ’the Y’ and ’a/an Y’, the algorithm also accounts for nouns modified by other words such as adjectives. Thus ’the Y’ will still match ’the green and big Y’3.
2. In the sentences that pass the above definite NP and a/an test, regard all the
head words (X) occurring in the current sentence before the definite NP and
the ones occurring in previous two sentences as potential antecedents.
3. Count the frequency c(X,Y) for each pair obtained in the above two steps and
pre-store it in a table.4 The frequency table can be modified to give other scores
for pair(X,Y) such as standard TF.IDF and Mutual Information scores.
4. Given a test sentence having an anaphoric definite NP Y, consider the nouns
appearing before Y within a fixed window as potential antecedents. Rank the
3The noun phrase and its head were identified using a simple and noisy heuristic, eliminating the need for parsing the sentences.
4Note that the count c(X,Y) is asymmetric.
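The four steps above can be sketched as follows. Sentence segmentation, noun detection and head-word identification are deliberately simplified here (every token stands in for a head word), so this is an illustrative assumption rather than the implementation used in the dissertation.

```python
from collections import Counter

def collect_cooccurrences(discourses, window=2):
    """Steps 1-3: count c(X, Y) for candidate antecedents X that precede
    an anaphoric definite NP 'the Y' with no earlier mention of Y."""
    c_xy = Counter()
    for sents in discourses:              # each discourse: list of token lists
        for k, toks in enumerate(sents):
            for idx in range(1, len(toks)):
                if toks[idx - 1] != "the":
                    continue
                y = toks[idx]
                context = [t for s in sents[max(0, k - window):k] for t in s]
                context += toks[:idx - 1]
                if y in context:          # Y already mentioned: filter by string match
                    continue
                for x in context:         # all preceding words are candidates
                    c_xy[(x, y)] += 1
    return c_xy

def rank_antecedents(candidates, y, c_xy):
    """Step 4: rank candidates X by P(Y|X) = c(X, Y) / c(X, *),
    which is monotone in MI(X, Y) for a fixed Y."""
    c_x = Counter()
    for (x, _), f in c_xy.items():
        c_x[x] += f
    return sorted(candidates,
                  key=lambda x: c_xy.get((x, y), 0) / max(c_x[x], 1),
                  reverse=True)

# Toy corpus of two hypothetical discourses:
discourses = [[["he", "took", "lipitor"], ["the", "drug", "worked"]],
              [["he", "took", "aspirin"], ["the", "pill", "helped"]]]
counts = collect_cooccurrences(discourses)
print(rank_antecedents(["took", "lipitor"], "drug", counts))  # → ['lipitor', 'took']
```

In this toy example the generic word “took” co-occurs with several definite NPs and is therefore downweighted by the conditional probability, while “lipitor” is specific to “the drug”.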
Rank   Raw freq   TF.IDF        MI
1      today      kilogram      amphetamine
2      police     heroin        cannabis
3      kilogram   police        cocaine
4      year       cocaine       heroin
5      heroin     today         marijuana
6      dollar     trafficker    pill
7      country    officer       hashish
8      official   amphetamine   tablet

Table 7.1: A sample of ranked hyponyms proposed for the definite NP "the drug" by TheY-Model, illustrating the differences in weighting methods.
            Acc     Acc_tag   Av Rank
MI          0.531   0.577     4.82
TF.IDF      0.175   0.190     6.63
Raw Freq    0.113   0.123     7.61

Table 7.2: Results using different normalization techniques for the TheY-Model in isolation (60 million word corpus).
candidates by their pre-computed co-occurrence measures as computed in Step
3.
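The four steps above can be sketched in Python. This is a minimal illustration under simplifying assumptions: input is pre-tokenized and POS-tagged, NP heads are approximated by bare nouns, and add-constant smoothing replaces the exact MI/TF.IDF computations; all names are illustrative, not from the dissertation's implementation.

```python
from collections import Counter
from math import log

def train_they_model(sentences, window=2):
    """Steps 1-3: collect asymmetric counts c(X, Y) for anaphoric 'the Y'.
    sentences: list of sentences, each a list of (token, POS-tag) pairs."""
    c_xy, c_x = Counter(), Counter()
    for i, sent in enumerate(sentences):
        prev = sentences[max(0, i - window):i]
        prior_words = {w.lower() for s in prev for (w, _) in s}
        for j in range(len(sent) - 1):
            det, (y, y_pos) = sent[j][0].lower(), sent[j + 1]
            y = y.lower()
            if det != "the" or not y_pos.startswith("NN"):
                continue
            before = sent[:j]
            # Step 1: keep only definite NPs with no prior mention of Y
            if y in prior_words or y in {w.lower() for (w, _) in before}:
                continue
            # Step 2: preceding nouns in the window are potential antecedents
            cands = {w.lower() for (w, p) in before if p.startswith("NN")}
            cands |= {w.lower() for s in prev for (w, p) in s
                      if p.startswith("NN")}
            for x in cands:                      # Step 3: store c(X, Y)
                c_xy[(x, y)] += 1
                c_x[x] += 1
    return c_xy, c_x

def rank_antecedents(y, candidates, c_xy, c_x):
    """Step 4: rank candidates; with Y fixed, MI reduces to log P(Y | X),
    estimated here with simple add-constant smoothing."""
    score = lambda x: log((c_xy[(x, y)] + 0.5) / (c_x[x] + 1.0))
    return sorted(candidates, key=score, reverse=True)
```

For instance, after training on text in which "Lipitor" precedes an anaphoric use of "the drug", `rank_antecedents("drug", ...)` would place "lipitor" above nouns that never co-occur with that definite NP.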
Since all head words preceding the definite NP are considered as potential correct
antecedents, the raw frequency of the pair (X,Y) can be very noisy. This can be seen
clearly in Table 7.1, where the first column shows the top potential antecedents of
the definite NP "the drug" as given by raw frequency. The raw frequency is normalized using
standard TF.IDF and Pointwise Mutual Information scores to filter the noisy pairs.
Note that MI(X,Y) = \log \frac{P(X,Y)}{P(X)\,P(Y)}, which for a fixed Y is a
monotonically increasing function of P(Y|X) = \frac{P(X,Y)}{P(X)}, since P(Y) is
then constant. Thus, one can simply use this conditional probability during
implementation, since the definite NP Y is fixed for the task of antecedent selection.
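The equivalence used here can be spelled out; since Y is held fixed during antecedent selection, the log P(Y) term is a constant shared by all candidate antecedents X:

```latex
MI(X,Y) \;=\; \log \frac{P(X,Y)}{P(X)\,P(Y)}
       \;=\; \log \frac{P(Y \mid X)}{P(Y)}
       \;=\; \log P(Y \mid X) \;-\; \log P(Y)
```

Ranking candidates by MI(X,Y) is therefore identical to ranking them by the conditional probability P(Y|X) = c(X,Y)/c(X).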
Table 7.2 reports results for antecedent selection using Raw frequency c(X,Y), TF.IDF
5 and MI in isolation. Accuracy is the fraction of total examples that were assigned the
correct antecedent, and Accuracy_tag is the same excluding the examples that had POS
tagging errors for the correct antecedent.6 Av Rank is the rank of the true antecedent
averaged over the number of test examples.7 Based on the above experiment, the rest
of this chapter uses the Mutual Information scoring technique for the TheY-Model.
7.3.2 WordNet-Model (WN)
Because WordNet is considered a standard resource of lexical knowledge and is
often used in coreference tasks, it is useful to know how well corpus-based approaches
perform compared to a standard model based on WordNet (version 2.0). A
simple baseline was also investigated, namely selecting the closest previous headword
as the correct antecedent. This recency-based baseline obtained a low accuracy of
15%, and hence a stronger WordNet-based model was used for comparison purposes.
The algorithm for the WordNet-Model is as follows:
Given a definite NP Y and its potential antecedent X, choose X if it occurs as a
hyponym (via either direct or indirect inheritance) of Y. If multiple potential antecedents
occur in the hierarchy of Y, choose the one that is closest in the hierarchy.
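This selection rule can be sketched as follows, using a small hand-coded hypernym map in place of the real WordNet 2.0 hierarchy; an actual implementation would query WordNet itself, and the toy entries below are purely illustrative.

```python
# Toy hypernym map standing in for WordNet 2.0 (child -> parent, single
# inheritance for brevity); a real implementation would query WordNet.
TOY_HYPERNYMS = {
    "lipitor": "statin", "statin": "drug", "heroin": "drug",
    "drug": "substance", "trumpet": "instrument",
}

def hypernym_distance(x, y, graph):
    """Edges from x up the hierarchy to y, or None if y is not a (direct
    or indirect) hypernym of x."""
    node, d = x, 0
    while node in graph:
        node, d = graph[node], d + 1
        if node == y:
            return d
    return None

def wordnet_model(y, candidates, graph=TOY_HYPERNYMS):
    """Choose the candidate that is a hyponym of Y, preferring the one
    closest to Y in the hierarchy; None if no candidate qualifies."""
    scored = [(hypernym_distance(x, y, graph), x) for x in candidates]
    scored = [(d, x) for d, x in scored if d is not None]
    return min(scored)[1] if scored else None
```

Here "heroin" (a direct hyponym of "drug") would be preferred over "lipitor" (an indirect one two levels down), mirroring the closest-in-hierarchy rule.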
5For the purposes of TF.IDF computation, document frequency df(X) is defined as the number of unique definite NPs for which X appears as an antecedent.
6Since the POS tagging was done automatically, it is possible for any model to miss the correct antecedent because it was not tagged correctly as a noun in the first place. There were 14 such examples in the test set, and none of the model variants can find the correct antecedent in these instances.
7Knowing the average rank can be useful when an n-best ranked list from the coreference task is used as an input to other downstream tasks such as information extraction.
           Acc     Acc_tag   Av Rank
TheY+WN    0.695   0.755     3.37
WordNet    0.593   0.644     3.29
TheY       0.531   0.577     4.82

Table 7.3: Accuracy and Average Rank showing combined model performance on the antecedent selection task. Corpus size: 60 million words.
7.3.3 Combination: TheY+WordNet Model
Most of the literature on using lexical resources for definite anaphora has focused
on using individual models (either corpus-based or manually built resources such as
WordNet) for antecedent selection. Among the difficulties with using WordNet are
its limited coverage and its lack of an empirical ranking model. Thus, a combination
of the TheY-Model and the WordNet-Model is used in order to overcome these problems.
Essentially, the hypotheses found by the WordNet-Model are reranked based on the ranks of
the TheY-Model, or a backoff scheme is used if the WordNet-Model does not return an answer
due to its limited coverage. Given a definite NP Y and a set of potential antecedents
Xs, the detailed algorithm is specified as follows:
1. Rerank with TheY-Model: Rerank the potential antecedents found in the
WordNet-Model table by assigning them the ranks given by TheY-Model. If
TheY-Model does not return a rank for a potential antecedent, use the rank
given by the WordNet-Model. Now pick the top-ranked antecedent after reranking.
2. Backoff: If none of the potential antecedents were found in the WordNet-Model
Summary         Keyword      True         TheY Choice    WordNet Choice  TheY+WN Choice
                (Def. Ana)   Antecedent   (Truth Rank)   (Truth Rank)    (Truth Rank)
Both correct    metal        gold         gold (1)       gold (1)        gold (1)
                sport        soccer       soccer (1)     soccer (1)      soccer (1)
TheY-Model      drug         steroid      steroid (1)    NA              steroid (1)
helps           drug         azt          azt (1)        medication (2)  azt (1)
WN-Model        instrument   trumpet      king (10)      trumpet (1)     trumpet (1)
helps           drug         naltrexone   alcohol (14)   naltrexone (1)  naltrexone (1)
Both            weapon       bomb         artillery (3)  NA              artillery (3)
incorrect       instrument   voice        music (9)      NA              music (9)

Table 7.4: A sample of output from different models on antecedent selection (60 million word corpus).
then pick the correct antecedent from the ranked list of the TheY-Model. If neither
model returns an answer, then assign ranks uniformly at random.
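The rerank-and-backoff combination above can be sketched as follows; the rank functions are assumed to return 1-based ranks, or None when a model has no hypothesis for a candidate, and all names are illustrative.

```python
import random

def combine_they_wordnet(y, candidates, wn_rank, they_rank):
    """Sketch of the TheY+WordNet combination (ranks are 1-based; a rank
    function returns None when its model has no hypothesis for (x, y))."""
    wn_hits = [x for x in candidates if wn_rank(x, y) is not None]
    if wn_hits:
        # Step 1: rerank WordNet's hypotheses by TheY-Model rank, falling
        # back to the WordNet rank when TheY-Model is silent on a candidate
        return min(wn_hits, key=lambda x: they_rank(x, y) or wn_rank(x, y))
    they_hits = [x for x in candidates if they_rank(x, y) is not None]
    if they_hits:
        # Step 2 (backoff): use TheY-Model's ranked list alone
        return min(they_hits, key=lambda x: they_rank(x, y))
    return random.choice(candidates)       # neither model: uniform random
```

On the naltrexone example of Table 7.4, the WordNet filter keeps only "naltrexone", so the combined model recovers from TheY-Model's noisy preference for "alcohol".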
The above algorithm harnesses the strength of the WordNet-Model to identify good
hyponyms and the strength of the TheY-Model to identify which of them are more likely
to be used as an antecedent. Note that this combination algorithm can be applied using
any corpus-based technique to account for the poor-ranking and low-coverage problems of
WordNet, and Sections 7.3.4, 7.3.5 and 7.3.6 will show the results of backing off
to a Hearst-style hypernym model. Table 7.4 shows the decisions made by the
TheY-Model, the WordNet-Model and the combined model for a sample of test examples. It
is interesting to see how the two models mutually complement each other in these
decisions. Table 7.3 shows the results for the models presented so far using a 60 million
word training text from the Gigaword corpus. The combined model results in
substantially better accuracy than the individual WordNet-Model and TheY-Model,
indicating its strong merit for the antecedent selection task.
7.3.4 OtherY-Model_freq
This model is a reimplementation of the corpus-based algorithm proposed by
Markert and Nissim (2005) for the equivalent task of antecedent selection for definite
NP coreference. Their approach of using the lexico-syntactic pattern X and A* other
B* Y{pl} for extracting (X,Y) pairs was replicated. Markert and Nissim (2005) also
report a Web algorithm that makes use of hits from Google for instantiations of
X and other Ys. They also used 'X{sg} OR X{pl}' in their patterns to take both
singular and plural forms into account; the lemmatized form of X was used during test
and unsupervised training.
The A* and B* slots allow for adjectives or other modifiers to be placed inside the
pattern. The model presented in their article uses raw frequency as the criterion
for selecting the antecedent.
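A crude sketch of this pattern matcher is given below; the regex approximates the A*/B* modifier slots with up to two intervening words and de-pluralizes Y by simply dropping a final 's', both simplifications of the cited work (which used POS information and lemmatization).

```python
import re
from collections import Counter

# "X and A* other B* Y{pl}", with A*/B* approximated by up to two
# intervening words and naive de-pluralization (dropping a final 's').
PATTERN = re.compile(r"\b(\w+) and (?:\w+ ){0,2}other (?:\w+ ){0,2}(\w+)s\b")

def extract_other_pairs(corpus):
    """Count hyponym/hypernym pairs (X, Y) from 'X and other Ys' contexts."""
    pairs = Counter()
    for sentence in corpus:
        for x, y in PATTERN.findall(sentence.lower()):
            pairs[(x, y)] += 1
    return pairs
```

A sentence such as "He seized heroin and other illegal drugs" would yield the pair (heroin, drug), with "illegal" absorbed by the B* slot.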
7.3.5 OtherY-Model_MI (normalized)
Normalization of the OtherY-Model is done using the Mutual Information scoring
method. Although Markert and Nissim (2005) report that using Mutual Information
performs similarly to using raw frequency, Table 7.5 shows that using Mutual
Information makes a substantial impact on results for large training corpora relative to
using raw frequency.
7.3.6 Combination: TheY+OtherY_MI Model
The two corpus-based approaches (TheY and OtherY) make use of different linguistic
phenomena, and it is interesting to see whether they are complementary
in nature. A combination algorithm similar to that of Section 7.3.3 was used, with the
WordNet-Model replaced by the OtherY-Model for hypernym filtering and
the noisy TheY-Model used for reranking and backoff. The results for this approach are
shown as the entry TheY+OtherY_MI in Table 7.5. A combination (OtherY+WN)
of the OtherY-Model and the WordNet-Model was also computed by replacing the TheY-Model
with the OtherY-Model in the algorithm described in Section 7.3.3. The respective results
are indicated as the OtherY+WN entry in Table 7.5.
7.4 Further Anaphora Resolution Results
Table 7.5 summarizes results obtained from all the models defined in Section 7.3
on three different sizes of unlabeled training corpora (from the Gigaword corpus). The
models are listed in order of decreasing accuracy. The OtherY-Model
performs particularly poorly on smaller data sizes, where coverage of the Hearst-style
patterns may be limited, as also observed by Berland and Charniak (1999). With
increased corpus sizes, the Markert and Nissim (2005) OtherY-Model and its MI-based
improvement do show substantial relative performance growth, although they still
underperform the basic TheY-Model at all tested corpus sizes. Also, the combination
                   Acc     Acc_tag   Av Rank
60 million words
TheY+WN            0.695   0.755     3.37
OtherY_MI+WN       0.633   0.687     3.04
WordNet            0.593   0.644     3.29
TheY               0.531   0.577     4.82
TheY+OtherY_MI     0.497   0.540     4.96
OtherY_MI          0.356   0.387     5.38
OtherY_freq        0.350   0.380     5.39
230 million words
TheY+WN            0.678   0.736     3.61
OtherY_MI+WN       0.650   0.705     2.99
WordNet            0.593   0.644     3.29
TheY+OtherY_MI     0.559   0.607     4.50
TheY               0.519   0.564     4.64
OtherY_MI          0.503   0.546     4.37
OtherY_freq        0.418   0.454     4.52
380 million words
TheY+WN            0.695   0.755     3.47
OtherY_MI+WN       0.644   0.699     3.03
WordNet            0.593   0.644     3.29
TheY+OtherY_MI     0.554   0.601     4.20
TheY               0.537   0.583     4.26
OtherY_MI          0.525   0.571     4.20
OtherY_freq        0.446   0.485     4.36

Table 7.5: Accuracy and Average Rank of the models defined in Section 7.3 on the antecedent selection task.
of corpus-based models (TheY-Model + OtherY-Model) does indeed perform better
than either of them in isolation. Finally, note that the basic TheY algorithm still
does relatively well by itself on smaller corpus sizes, suggesting its merit for
resource-limited languages with smaller available online text collections and no
available WordNet. The combined models of the WordNet-Model with the two corpus-based
approaches still substantially outperform any of the other individual models.
Also, syntactic coreference candidate filters such as the Hobbs algorithm were not
utilized in this study. To assess the performance implications, the Hobbs algorithm
was applied to a randomly selected 100-instance subset of the test data. Although the
Hobbs algorithm frequently pruned at least one of the coreference candidates, such
candidate filtering changed system output in only 2% of the data. However, since
both of these changes were improvements, it could be worthwhile to utilize Hobbs
filtering in future work, although the gains would likely be modest.
7.5 Generation Task
Having shown positive results for the task of antecedent selection in the first
part, the second part of this chapter presents a more difficult task, namely generating
an anaphoric definite NP given a nominal antecedent. In Example (1), this would
correspond to generating "the drug" as an anaphor knowing that the antecedent is
pseudoephedrine. This task clearly has many applications: current generation systems
often limit their anaphoric usage to pronouns, and thus an automatic system that does
well on hypernymic definite NP generation can be directly helpful. It also has strong
potential application in abstractive summarization, where rewriting a fluent passage
requires a good model of anaphoric usage.
There are many interesting challenges in this problem. First of all, there may be
multiple acceptable choices of definite anaphor for a particular antecedent,
complicating automatic evaluation. Second, when a system generates a definite anaphor,
the space of potential candidates is essentially unbounded, unlike in antecedent
selection, where it is limited to the number of potential antecedents in prior context.
In spite of the complex nature of this problem, the experiments with human
judgments, WordNet and corpus-based approaches show a simple feasible solution. All
the approaches are evaluated based on exact-match agreement with the definite anaphora
actually used in the corpus (accuracy) and also by agreement with definite anaphora
predicted independently by a human judge in the absence of context.
7.5.1 Human experiment
A total of 103 <true antecedent, definite NP> pairs were extracted from the set of
test instances used in the resolution task. Then a human judge (a native
speaker of English) was asked to predict a parent class of the antecedent that could act as a
good definite anaphora choice in general, independent of a particular context. Thus,
the actual corpus sentence containing the antecedent and definite NP, and its context,
was not provided to the judge. The predictions provided by the judge were matched
with the actual definite NPs used in the corpus; the agreement between the corpus
and the human judge was 79%, which can thus be considered an upper bound on
algorithm performance. Table 7.7 shows a sample of decisions made by the human
and how they agree with the definite NPs observed in the corpus. It is interesting
to note the challenge of sense variation and figurative usage. For example,
"corruption" is referred to as a "tool" in the actual corpus anaphora, a metaphoric usage
that would be difficult to predict unless given the usage sentence and its context.
However, a human agreement of 79% indicates that such instances are relatively rare
and the task of predicting a definite anaphor without its context is viable. In general,
it appears from the experiments that humans tend to select from a relatively
small set of parent classes when generating hypernymic definite anaphora. Furthermore,
there appears to be a relatively context-independent concept of the "natural"
level in the hypernym hierarchy for generating anaphors.8 For example, although
8 This is somewhat similar to the notion of "natural kind" in philosophy that describes the notion of a "natural" grouping as opposed to an artificial grouping of things (Quine, 1969).
[Screenshot of a WordNet 2.1 search for "pseudoephedrine", showing its gloss ("poisonous crystalline alkaloid occurring with ephedrine and isomorphic with it") and its inherited hypernym chain: pseudoephedrine → alkaloid → organic compound → compound, chemical compound → substance, matter → physical entity → entity.]
Figure 7.2: Illustrating the problem with WordNet for definite anaphora generation.
The immediate parent and grandparent of "pseudoephedrine", "alkaloid" and "organic
compound", do not serve as natural definite anaphors compared to "the drug" that
is often observed in corpora.
<"alkaloid", "organic compound", "compound", "substance", "entity"> are all hypernyms
of "pseudoephedrine" in WordNet (see Figure 7.2), "the drug" appears to
be the preferred hypernym for definite anaphora in the data, with the other
alternatives being either too specific or too general to be natural. This natural level appears
to be difficult to define by rule. For example, using just the immediate parent
hypernym in the WordNet hierarchy achieves only a 4% match with the corpus data for
definite anaphor generation.
7.5.2 Algorithms
The following sections present the corpus-based algorithms as more effective
alternatives.
7.5.2.1 Individual Models
For the corpus-based approaches, the TheY-Model and OtherY-Model were
trained in the same manner as for the antecedent selection task. The only difference
was that in the generation case, the frequency statistics were reversed to provide
a hypernym given a hyponym.9 Here, raw frequency outperformed both TF.IDF and
Mutual Information and was used for all results in Table 7.6.
The stand-alone WordNet model is also very simple: given an antecedent, its direct
hypernym (using the first sense) is looked up in WordNet and used as the definite
NP, for lack of a better rule for the preferred hypernym location.10
7.5.2.2 Combining corpus-based approaches and WordNet
Each of the corpus-based approaches was combined with WordNet resulting in
two different models as follows: Given an antecedent X, the corpus-based approach
9On this task, it was found that using raw frequency worked better than other scoring techniques.
10The first sense of the antecedent was used to find its location in WordNet. However, using appropriate Word Sense Disambiguation techniques could be very helpful for this task.
                  Agreement        Agreement
                  w/ human judge   w/ corpus
TheY+OtherY+WN    47%              46%
OtherY+WN         43%              43%
TheY+WN           42%              37%
TheY+OtherY       39%              36%
OtherY            39%              36%
WordNet           4%               4%
Human judge       100%             79%
Corpus            79%              100%

Table 7.6: Agreement of different generation models with the human judge and with the definite NP used in the corpus.
looks up the hypernym of X in its table, say Y, and produces Y as
the output only if Y also occurs in WordNet as a hypernym of X. Thus WordNet is used
as a filtering tool for detecting viable hypernyms. This combination resulted in two
models: 'TheY+WN' and 'OtherY+WN'.
The combination of all three approaches, 'TheY', 'OtherY' and WordNet, is
represented as 'TheY+OtherY+WN'. The combination was done as follows: first,
the models 'TheY' and 'OtherY' were combined using a backoff model, where the first
priority is to use the hypernym from the 'OtherY' model, and if none is found, the
hypernym from the 'TheY' model is used. Given a definite NP from the backoff model,
the WordNet filtering technique is applied; specifically, it is chosen as the correct
definite NP if it also occurs as a hypernym in the WordNet hierarchy of the antecedent.
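The three-way combination reduces to a backoff followed by a WordNet filter, sketched below; the lookup tables are illustrative stand-ins for the trained models, not the dissertation's actual data structures.

```python
def generate_definite_np(x, other_y_table, they_table, wn_hypernyms):
    """Sketch of the TheY+OtherY+WN generation model. other_y_table and
    they_table map antecedent -> proposed hypernym; wn_hypernyms maps
    antecedent -> the set of its WordNet hypernyms (all illustrative)."""
    y = other_y_table.get(x) or they_table.get(x)   # backoff: OtherY first
    if y is not None and y in wn_hypernyms.get(x, set()):
        return y                                    # WordNet filter passed
    return None
```

A proposal thus survives only when a corpus model suggests it and WordNet independently confirms it as a hypernym of the antecedent.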
Antecedent     Corpus      Human     TheY+OtherY+WN
               Def. Ana    Choice
racing         sport       sport     sport
azt            drug        drug      drug
missile        weapon      weapon    weapon
alligator      animal      animal    animal
steel          metal       metal     metal
osteoporosis   disease     disease   condition
grenade        device      weapon    device
baikonur       site        city      station
corruption     tool        crime     activity

Table 7.7: Sample of decisions made by the human judge and the best performing model (TheY+OtherY+WN) on the generation task.
7.5.3 Evaluation of Anaphor Generation
The resulting algorithms from Section 7.5.2 were evaluated on the definite NP
prediction task as described earlier. Table 7.6 shows the agreement of the algorithm
predictions with the human judge as well as with the definite NP actually observed in
the corpus. It is interesting to see that WordNet by itself performs very poorly on this
task, since it has no word-specific mechanism to choose the correct level
in the hierarchy or the correct word sense for selecting the hypernym. However,
when combined with the corpus-based approaches, the agreement increases substantially,
indicating that the corpus-based approaches effectively filter the space
of hypernyms that can be used as natural classes. Likewise, WordNet helps to filter
the noisy hypernyms from the corpus predictions. This interplay between the
corpus-based and WordNet algorithms works out nicely, with the best model
being a combination of all three individual models, achieving substantially better
agreement with both the corpus and the human judge than any of the individual
models. Table 7.7 shows decisions made by this algorithm on a sample of test data.
7.6 Statistical Significance of Results
This section analyzes the statistical significance of the results reported in Tables 7.5 and
7.6. Using a binomial test and the best baseline accuracies (obtained via WordNet)
of 59.3% (Acc) and 64.4% (Acc_tag) for all corpus sizes in Table 7.5, any resulting
accuracy over 65.5% (Acc) and over 70.1% (Acc_tag) is statistically significant with a
p-value less than 0.05. Thus, the results obtained with the best model for all corpora
in Table 7.5 are statistically significant.
For the results on the generation task in Table 7.6, WordNet performs poorly, resulting in
only 4% accuracy, and any resulting accuracy over 7.77% is statistically significant
with a p-value less than 0.05. While Markert and Nissim (2005) did not apply their
approach to the generation task, applying it here and using the accuracies
of the "OtherY" row as the baseline, any resulting accuracy over 46.6% (agreement
w/ human judge) and over 43.7% (agreement w/ corpus) is statistically significant.
Thus, the accuracies of the best model in Table 7.6 are also statistically significant
with respect to the Markert and Nissim (2005) model as applied to the generation
task.
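The 7.77% figure can be reproduced with a one-sided exact binomial test over the 103 generation test pairs, taking the 4% WordNet baseline as the null accuracy; a sketch, not the dissertation's code:

```python
from math import comb

def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def significance_threshold(n, p0, alpha=0.05):
    """Smallest number of successes whose one-sided tail falls below alpha."""
    for k in range(n + 1):
        if binomial_tail(n, p0, k) < alpha:
            return k

# n = 103 test pairs, null accuracy p0 = 0.04 (WordNet baseline):
# 9 or more correct is significant, i.e. any accuracy above 8/103 = 7.77%.
k_star = significance_threshold(103, 0.04)
```

The same computation with the 59.3% resolution baseline would recover the 65.5% threshold, given the resolution test-set size.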
7.7 Conclusion
This chapter presents a successful solution to the problem of incomplete lexical
resources for definite anaphora resolution and further demonstrates how the resources
built for resolution can be naturally extended to the less studied task of anaphora
generation. First, a simple and noisy corpus-based approach is presented based on
globally modeling headword co-occurrence around likely anaphoric definite NPs. This
was shown to outperform a recent approach by Markert and Nissim (2005) that makes
use of standard Hearst-style patterns extracting hypernyms for the same task. Even
with a relatively small training corpus, the simple TheY-Model was able to achieve
relatively high accuracy, making it suitable for resource-limited languages where
annotated training corpora and full WordNets are likely not available. Then, several
variants of this algorithm were evaluated based on model combination techniques.
The best combined model was shown to exceed 75% accuracy on the resolution task,
beating any of the individual models. On the much harder anaphora generation task,
where the stand-alone WordNet-based model achieved an accuracy of only 4%, the
corpus-based models achieve 35%-47% accuracy on blind exact-match evaluation,
thus motivating the use of such corpus-based learning approaches on the generation
task as well.
Part III
Extracting Factual Relationships
Chapter 8
Part III Literature Review
This section covers the literature review for extracting factual relations. Most of
the literature in computational linguistics has focused on extracting "explicit" factual
relationships (described in Section 8.1). Often, however, factual properties are latently
expressed, and there has been a plethora of work in the sociolinguistics literature
on extracting such "implicit" or non-overt relationships. Section 8.2 describes the
literature on latent fact extraction.
8.1 Literature for Modeling Explicit
Relationships
The literature for extracting explicitly stated facts can be broadly classified into
hand crafted rules (Section 8.1.1), supervised machine learning approaches (Section
8.1.2) and seed-based approaches (Section 8.1.3).
8.1.1 Early MUC approaches: Handcrafted
Lexico-syntactic Patterns
Factual relationships provide domain-specific information or knowledge about the
properties of a concept or an entity and how it relates to other concepts or entities,
such as "Mozart-birthplace-Salzburg". The main bottleneck in obtaining such knowledge
is the huge amount of manual annotation required to build such structured
databases. Furthermore, such manually built databases are limited to only a few of the
world's languages.
One key property of identifying such relationships automatically is that, regardless of
the type of fact ("birthplace", "occupation", etc.), it is common to observe textual
patterns that tie the concepts together with the relationship type. Fixed lexico-syntactic
patterns were used in the early Message Understanding Conference (MUC)
evaluations, where the goal was to extract segments containing the relevant fact. UMass
CIRCUS (Lehnert et al., 1991) was one of the most successful systems in the MUC-3
evaluation and was based on handcrafted patterns, with SRI's FASTUS (Appelt et
al., 1993) in MUC-4 setting the trend towards pattern-based approaches by showing
that a robust pattern inventory performed better than even the full parsing-based
TACITUS (Hobbs, 1986) system using abductive inference rules.
8.1.2 Machine Learning Approaches
The next direction in the field was towards building supervised models for learning
extraction rules from annotated data, beginning with specialized models such as WHISK
(Soderland, 1999) and then moving towards more general statistical models such as
Hidden Markov Models (Leek, 1997), Conditional Random Fields (Lafferty et al.,
2001; Culotta et al., 2006), Support Vector Machines (Culotta and Sorensen, 2004)
and logistic regression models (Snow et al., 2006).
8.1.3 Weakly Supervised Approaches using
Seed-exemplars
The problem with handcrafted extraction rules, or with annotating data and training
supervised models to learn such rules, was that a lot of manual effort was needed
to annotate the data. In order to overcome this problem, a new direction towards
a bootstrapping framework (Yarowsky, 1995) was investigated, leading to a plethora
of work in the area of extracting new relationships starting from a small set of seed
pairs. The basic seed-based pattern induction work (Brin, 1998; Agichtein and
Gravano, 2000; Ravichandran and Hovy, 2002) consisted of two stages: pattern learning
and extraction of new pairs. For example, to build a model for extracting "occupation"
using this framework, a few seed examples of <Person name, occupation> are
used for extracting all the patterns that match the seed pairs. The patterns
[Figure content: seed pairs (e.g. "Peter Rasmussen, Physicist"; "Alison Wolfe, Singer"), sentences matching them in monolingual corpora (e.g. "...Rasmussen worked as a physicist at..."), extracted patterns with scores (e.g. "worked as a" 0.91, "a well-known" 0.87, "served as a" 0.83, "trained as a" 0.78), and new pairs extracted by iterating (e.g. "Mike Beres, Economist"; "James Young, Social worker").]
Figure 8.1: Illustration of the basic weakly supervised approach of Ravichandran and
Hovy (2002) for fact extraction. Using a few seeds of the fact in question, contextual
patterns occurring with the seeds are extracted and ranked based on their distribu-
tion in the monolingual corpora. New pairs observing the given fact (for example,
occupation) can then be extracted using co-occurrence with these patterns.
are then ranked by corpus frequency and a frequency threshold is set to select the
final patterns. In the extraction stage, these patterns are used to extract occupation
for a new name by finding words or phrases that occur in the occupation slot of the
extracted patterns in an unlabeled corpus.
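One bootstrapping iteration of this two-stage procedure can be sketched as follows, with a deliberately crude pattern template (the token sequence between the name and the occupation value) standing in for the richer patterns of the cited work; all names are illustrative.

```python
from collections import Counter

def bootstrap_step(corpus, seeds):
    """One iteration: learn contextual patterns from seed pairs, then
    extract new (name, occupation) pairs matching those patterns.
    corpus: list of token lists; seeds: list of (name, occupation)."""
    patterns = Counter()
    for tokens in corpus:
        for name, occ in seeds:            # stage 1: pattern learning
            if name in tokens and occ in tokens:
                i, j = tokens.index(name), tokens.index(occ)
                if 0 < j - i <= 4:
                    patterns[tuple(tokens[i + 1:j])] += 1
    new_pairs = set()
    for tokens in corpus:                  # stage 2: new-pair extraction
        for pat in patterns:
            n = len(pat)
            for i in range(1, len(tokens) - n):
                if tuple(tokens[i:i + n]) == pat:
                    new_pairs.add((tokens[i - 1], tokens[i + n]))
    return patterns, new_pairs - set(seeds)
```

A real system would then threshold the patterns by corpus frequency or precision and add the extracted pairs to the seed set for the next iteration.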
More formally, the probability of a relationship r ("occupation"), given the surrounding
context "A1 p A2 q A3", where p and q are <NAME> and <Occupation Value>
respectively, is given by the rote extractor model probability as in (Ravichandran
and Hovy, 2002; Mann and Yarowsky, 2005):

P(r(p,q) \mid A_1 p A_2 q A_3) = \frac{\sum_{x,y \in r} c(A_1 x A_2 y A_3)}{\sum_{x,z} c(A_1 x A_2 z A_3)}    (8.1)
The variations on the above pattern-based learning approach have differed in pattern
representation and in the measures used for ranking the patterns. Thelen and Riloff (2002)
also proposed learning semantic lexicons using collective evidence from a large number
of contextual patterns; their system, called Basilisk, used the RlogF metric for
ranking patterns (Riloff, 1996), shown below. This metric is similar to Ravichandran
and Hovy's (2002) precision metric, except that the precision is multiplied by the log of
the seed frequency of the pattern:

RlogF(A_1 p A_2 q A_3) = \frac{\sum_{x,y \in r} c(A_1 x A_2 y A_3)}{\sum_{x,z} c(A_1 x A_2 z A_3)} \cdot \log_2\Big(\sum_{x,y \in r} c(A_1 x A_2 y A_3)\Big)    (8.2)
The scoring of extracted candidate words was performed using the AvgLog
function, a log sum over the patterns that extracted the candidate
word, whereas Ravichandran and Hovy (2002) selected a weighted sum, weighted by
pattern precision. An illustration of the basic pattern-based approach is shown in
Figure 8.1.
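The ranking metrics discussed above can be contrasted in a few lines. These are simplified forms assuming the counts have already been aggregated per pattern; the AvgLog variant in particular is an approximation of Thelen and Riloff's (2002) formulation, not a verbatim reimplementation.

```python
from math import log2

def precision(seed_hits, total_hits):
    """Ravichandran and Hovy (2002): fraction of pattern matches whose
    filled slots are seed pairs."""
    return seed_hits / total_hits

def rlogf(seed_hits, total_hits):
    """Riloff (1996): precision weighted by log2 of the seed frequency."""
    return (seed_hits / total_hits) * log2(seed_hits)

def avglog(seed_freqs_of_matching_patterns):
    """Basilisk-style candidate-word score: average log of the seed
    frequencies of the patterns that extracted the word (approximate)."""
    freqs = seed_freqs_of_matching_patterns
    return sum(log2(f + 1) for f in freqs) / len(freqs)
```

The log2 weighting in RlogF favors patterns that are both precise and frequent among the seeds, rather than precise but rare.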
Pantel et al. (2004) proposed an approach based on edit distance to learn lexico-POS
patterns for is-a and part-of relations. Pantel and Pennacchiotti (2006) use
pattern-based approaches to extract is-a, part-of and other semantic relations, using
a mutual-information-based reliability measure to rank the patterns. Mann and
Yarowsky (2005) formalized the definition of the basic pattern-based approach, calling
it a rote classifier, and showed how bag-of-words context, using phrase conditional,
naive Bayes and conditional random field models, along with negative examples
automatically annotated using spurious targets, can further aid extraction performance.
8.2 Literature for Modeling Latent
Relationships
While latent relationships are often not dealt with in the standard information
extraction literature, there have been many sociolinguistic studies on identifying properties
of a population sample based on their discourse usage. In particular, biographic facts
such as age, gender, education level, etc., have received a lot of attention in this liter-
ature. Section 8.2.1 describes the salient sociolinguistic approaches and Section 8.2.2
describes some of the computational approaches based on the features identified in
the sociolinguistics literature.
8.2.1 Sociolinguistic Studies
Much attention has been devoted in the sociolinguistics literature to the detection of
age, gender, social class, religion, education, etc. from conversational discourse and
monologues, starting as early as the 1950s, making use of morphological features such
as the choice between the -ing and -in variants of the present participle ending
of the verb (Fischer, 1958), and phonological features such as the pronunciation of
the "r" sound in words such as far, four, cards, etc. (Labov, 1966).
Gender differences have been one of the primary areas of sociolinguistic research.
Coates (1998) and Eckert and McConnell-Ginet (2003) provide detailed overviews of
the sociolinguistic approaches to studying gender differences. Some of the important
features used in these studies, such as the use of pronouns, passive constructions, and
specific n-grams like "well", "yeah" and "I mean", are outlined in Section 10.6.
8.2.2 Computational Approaches
There has also been some work on developing computational models based on
linguistically interesting clues suggested by the sociolinguistic literature for detecting
gender in formal written texts (Koppel et al., 2002; Herring and Paolillo, 2006), but
it has primarily focused on a small number of manually selected features
and a small number of formal written texts. In the model proposed
by Koppel et al. (2002), a manually selected set of 1081 features was
used consisting of 405 function words, 76 part-of-speech tags, 100 most frequent part-
of-speech bigrams and 500 most frequent part-of-speech trigrams. A variant of the
exponential gradient algorithm (Kivinen and Warmuth, 1997) was used for training.
Their paper reports an accuracy of approximately 80% on a set of 566 gender-labeled
documents from the British National Corpus. However, later in this thesis it is shown
that an online system (Gender Genie) based on the algorithm described in that paper
performs poorly on conversational speech transcripts.
Another relevant line of work has been on the blog domain, using a bag-of-words
feature set to discriminate age and gender (Schler et al., 2006; Burger and Henderson,
2006; Nowson and Oberlander, 2006). Schler et al. (2006) used both style-based and
content-based features for gender classification of a blog entry. Style-based features
consisted of selected parts of speech, function words and blog-specific features such as
hyperlinks. Content-based features were the n-grams used in the body of the blog entry,
and the authors show that words such as “linux, gaming, google, economic” are correlated
with “male” gender while words such as “shopping, cute, mom, boyfriend” are cor-
related with “female” gender. They also report that the top content-based features
suggest a pattern of more “personal” writing by female bloggers than male bloggers.
They train a multi-class real winnow model (MCRW) and report an accuracy of 80.1%
using all the features. Another study on blog data by Nowson and Oberlander (2006)
also shows that learning n-gram contexts performs well in predicting the gender of the
blogger. In addition to word-based features, Burger and Henderson (2006) explore
a wide range of non-lexical features for blogger age prediction such as mean post
length, mean number of non-image links per post, language/script in which the blog
is written, location and time of the blog entry, number of blogger friends, etc. On the
applications side, Liu and Mihalcea (2007) study gender preferences for weblogs
based on color, size, time, socialness, affect and cravings. They show how learning
such preferences can be used for improving user interfaces for weblogs and for filtering
gender-specific news data.
While the approaches described above have shed some light on the important features
for extracting latent biographic relationships, most of them have been small-scale
studies. Boulis and Ostendorf (2005) presented the first large-scale study of gender
modeling in conversational speech transcripts, using the Fisher corpus (Cieri et al.,
2004). Chapter 10 describes several novel contributions beyond this state-of-the-art
approach to gender classification. Section 10.2 of Chapter 10 describes the Boulis
and Ostendorf (2005) model and other relevant gender modeling approaches in more
detail.
Chapter 9
Structural, Transitive and
Correlational Models for
Biographic Fact Extraction
Summary
This chapter presents novel approaches to biographic fact extraction that model
structural, transitive and latent properties of biographical data. The ensemble of these
proposed models substantially outperforms standard pattern-based biographic fact
extraction methods and performance is further improved by modeling inter-attribute
correlations and distributions over functions of attributes, achieving an average ex-
traction accuracy of 80% over seven types of biographic attributes.
Components of this chapter were originally published by the author of this disserta-
tion in the forum referenced below1.
9.1 Introduction
Extracting biographic facts such as “Birthdate”, “Occupation”, “Nationality”,
etc. is a critical step for advancing the state of the art in information processing
and retrieval. An important aspect of web search is to be able to narrow down
search results by distinguishing among people with the same name, which has led to
multiple efforts in the literature focusing on web person-name disambiguation (Mann
and Yarowsky, 2003; Artiles et al., 2007; Cucerzan, 2007). While biographic facts
are certainly useful for disambiguating person names, they also allow for automatic
extraction of encyclopedic knowledge that has been limited to manual efforts such as
Britannica, Wikipedia, etc. Such encyclopedic knowledge can advance vertical search
engines such as http://www.spock.com that are focused on people searches where one
can get an enhanced search interface for searching by various biographic attributes.
Biographic facts are also useful for powerful query mechanisms such as finding what
attributes are common between two people (Auer and Lehmann, 2007).
While there are a large quantity of biographic texts available online, there are only a
1Reference: N. Garera and D. Yarowsky. Structural, Transitive and Latent Models for Biographic Fact Extraction. Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), 2009.
Figure 9.1: Goal: extracting attribute-value biographic fact pairs from biographic
free-text
few biographic fact databases available2, and most of them have been created manu-
ally, are incomplete and are available primarily in English.
This chapter presents multiple novel approaches for automatically extracting bio-
graphic facts such as “Birthdate”, “Occupation”, “Nationality”, and “Religion”, mak-
ing use of diverse sources of information present in biographies. In particular, the
following 6 distinct original approaches to this task are evaluated with large collective
empirical gains:
1. An improvement to the Ravichandran and Hovy (2002) algorithm based on
Partially Untethered Contextual Pattern Models
2. Learning a position-based model using absolute and relative positions and se-
quential order of hypotheses that satisfy the domain model. For example,
“Deathdate” very often appears after “Birthdate” in a biography.
3. Using transitive models over attributes via co-occurring entities. For example,
other people mentioned in a person’s biography page tend to have similar attributes
such as occupation (See Figure 9.4).
4. Using latent wide-document-context models to detect attributes that may not be
mentioned directly in the article (e.g. the words “song, hits, album, recorded, ..”
all collectively indicate the occupation of singer or musician in the article).
5. Using inter-attribute correlations, for filtering unlikely biographic attribute com-
2E.g.: http://www.nndb.com, http://www.biography.com, Infoboxes in Wikipedia.
binations. For example, a tuple consisting of <“Nationality” = India, “Religion” =
Hindu> has a higher probability than a tuple consisting of <“Nationality” = France,
“Religion” = Hindu>.
6. Learning distributions over functions of attributes, for example, using an age
distribution to filter tuples containing improbable <deathyear>-<birthyear>
lifespan values.
The rest of the chapter describes and evaluates techniques for exploiting all of the
above classes of information.
9.2 Related Work
The literature for biography extraction falls into two major classes. The first
one deals with identifying and extracting biographical sentences and treats the prob-
lem as a summarization task (Cowie et al., 2000; Schiffman et al., 2001; Zhou et
al., 2004). The second and more closely related class deals with extracting specific
facts such as “birthplace”, “occupation”, etc. For this task, the primary theme of
work in the literature has been to treat the task as a general semantic-class learning
problem where one starts with a few seeds of the semantic relationship of interest
and learns contextual patterns such as “<NAME> was born in <Birthplace>” or
“<NAME> (born <Birthdate>)” (Hearst, 1992; Riloff, 1996; Agichtein and Gra-
vano, 2000; Ravichandran and Hovy, 2002; Mann and Yarowsky, 2003; Mann and
Yarowsky, 2005; Alfonseca et al., 2006; Pasca et al., 2006). There has also been some
work on extracting biographic facts directly from Wikipedia pages. Culotta et al.
(2006) deal with learning contextual patterns for extracting family relationships from
Wikipedia. Ruiz-Casado et al. (2006) learn contextual patterns for biographic facts
and apply them to Wikipedia pages.
While the pattern-learning approach extends well for a few biography classes, some of
the biographic facts like “Gender” and “Religion” do not have consistent contextual
patterns, and only a few of the explicit biographic attributes such as “Birthdate”,
“Birthplace” and “Occupation” have been shown to work well in the pattern-learning
framework.
Secondly, there is a general lack of work that attempts to utilize the typical informa-
tion sequencing within biographic texts for fact extraction, and this chapter illustrates
how the information structure of biographies can be used to improve upon pattern
based models. Furthermore, additional novel models of attribute correlation and age
distribution that aid the extraction process are also presented.
9.3 Approach
First, the standard pattern-based approach is implemented for extracting bio-
graphic facts from the raw prose in Wikipedia people pages. Then, an array of novel
techniques is presented exploiting different classes of information including partially-
tethered contextual patterns, relative attribute position and sequence, transitive at-
tributes of co-occurring entities, broad-context topical profiles, inter-attribute corre-
lations and likely human age distributions.
9.4 Contextual Pattern-Based Model
A standard model for extracting biographic facts is to learn templatic contextual
patterns such as “<NAME> was born in <Birthplace>”. Such templatic patterns
can be learned using seed examples of the attribute in question, and there has been
a plethora of work in the seed-based bootstrapping literature which addresses this
problem (Ravichandran and Hovy, 2002; Mann and Yarowsky, 2005; Alfonseca et al.,
2006; Pasca et al., 2006).
Thus, as a baseline, the standard Ravichandran and Hovy (2002) pattern learning
model was implemented using 100 seed3 examples from an online biographic database
called NNDB (http://www.nndb.com) for each of the biographic attributes: “Birth-
date”, “Birthplace”, “Deathdate”, “Gender”, “Nationality”, “Occupation” and “Re-
ligion”. Given the seed pairs, patterns for each attribute were learned by searching
for seed <Name,Attribute Value> pairs in the Wikipedia page and extracting the left,
middle and right contexts as various contextual patterns. A noisy model of corefer-
ence resolution was implemented by resolving any gender-correct pronoun used in the
3The seed examples were chosen randomly, with a bias against duplicate attribute values to increase training diversity.
Partially Untethered Patterns — Precision

Birthplace:
  <p> born in <birthplace>        1.0
  living in <birthplace>          1.0
  grew up in <birthplace>         1.0
  family in <birthplace>          1.0
  ( born in <birthplace>          1.0
  ...
  to return to <birthplace>       0.80
  returned to <birthplace>        0.79
  ...

Birthdate:
  was born on <birthdate>         1.0
  <p> born on <birthdate>         1.0
  <birthdate> -                   0.94
  ) ( <birthdate>                 0.83
  ...
  <birthdate> , is                0.56
  born <birthdate>                0.41
  ...

Deathdate:
  - <deathdate>                   1.0
  <deathdate> ) was an            0.91
  <deathdate> ) was a             0.89
  <deathdate> ) was               0.62
  <deathdate> ) ,                 0.11
  ; <deathdate>                   0.056
  on <deathdate>                  0.05
  <deathdate> )                   0.04

Fully Tethered Patterns — Precision

Birthplace:
  </p> <p> <name> was born in <birthplace>       1.0
  <p> “ <name> ” ( born <DATE> in <birthplace>   1.0
  <p> “<name>” ( born in <birthplace>            1.0
  <name> was born on <DATE> in <birthplace>      1.0
  <name> was born and raised in <birthplace>     1.0
  <name> returned to <birthplace>                1.0
  <birthplace> where <name> was                  0.75
  <name> left <birthplace>                       0.67
  ...

Birthdate:
  </p> <p> <name> was born on <born>             1.0
  <p> “ <name> ” ( born <born>                   1.0
  <name> “ ( <born> -                            1.0
  “ <name> ” ( born <DATE> in <birthplace>       1.0

Deathdate:
  “ <name> ” ( <DATE> - <died> )                 1.0

Table 9.1: A sample of partially untethered and fully tethered patterns along with their precision. For some of the attributes, only 4-5 fully tethered patterns are learned, but relaxing the constraint on the <hook> allows extraction of many partially tethered patterns, providing improved performance as shown in Tables 9.5 and 9.6.
Partially Untethered Patterns — Precision

Occupation:
  reputation as a <occupation>    1.0
  <occupation> oscar              1.0
  <occupation> in a musical       1.0
  <occupation> and author         1.0
  <occupation> and author .       1.0
  for best <occupation>           1.0
  best supporting <occupation>    1.0
  ...
  a French <occupation>           0.71
  singer and <occupation>         0.70
  young <occupation>              0.53
  as an <occupation>              0.47
  ...
  <occupation> , and              0.06

Nationality:
  ) was an <nationality>          1.0
  ) was a <nationality>           1.0
  native <nationality>            1.0
  <nationality> , where he        1.0
  ...
  to become <nationality>         0.8
  to return to <nationality>      0.78
  founder of <nationality>        0.78
  one of <nationality>            0.75
  ...

Religion:
  <religion> family in            1.0
  <religion> faith .              1.0
  raised as a <religion>          1.0
  <religion> faith                0.86
  as an <religion>                0.58
  <religion> family               0.48
  <religion> , but                0.26
  as a <religion>                 0.23
  , a <religion>                  0.11
  ...

Fully Tethered Patterns — Precision

Occupation:
  an <occupation> , <name>        1.0
  “ <name> : <occupation>         0.67
  <occupation> . <name> was       0.4
  <name> : <occupation>           0.4
  and <occupation> . <name>       0.1
  <occupation> , <name>           0.10
  <occupation> . <name>           0.06

Nationality:
  </p> <p> <name> was the only <nationality>                              1.0
  <p> “ <name> ” ( <nationality>                                          1.0
  , <name> returned to <nationality>                                      1.0
  term <name> dominated <nationality> politics                            1.0
  term of any <nationality> prime minister , and during his second term <name> .....   1.0
  <name> left <nationality>                                               0.50
  of <nationality> . <name>                                               0.43
  <nationality> . <name> was                                              0.33
  ...

Religion:
  <p> <name> was born to a <religion> family     1.0
  <name> was raised as a <religion> , but        1.0

Table 9.2: A sample of partially untethered and fully tethered patterns along with their precision. For some of the attributes, only 4-5 fully tethered patterns are learned, but relaxing the constraint on the <hook> allows extraction of many partially tethered patterns, providing improved performance as shown in Tables 9.5 and 9.6.
Wikipedia page to the title person name of the article4. The probability of a relationship
r(Attribute Name), given the surrounding context $A_1\,p\,A_2\,q\,A_3$, where p
and q are <NAME> and <Attrib Val> respectively, is given by the rote-extractor
model probability as in (Ravichandran and Hovy, 2002; Mann and Yarowsky, 2005):

$$P(r(p,q) \mid A_1 p A_2 q A_3) = \frac{\sum_{x,y \in r} c(A_1 x A_2 y A_3)}{\sum_{x,z} c(A_1 x A_2 z A_3)} \qquad (9.1)$$
Each extracted attribute value q using the given context can thus be ranked according
to the above probability. This approach for extracting values was tested for each of
the above attributes on a test set of 100 held-out names from NNDB, and Precision,
Pseudo-recall and F-score are reported for each attribute, computed in the standard
way as follows for, say, the attribute “Birthplace (bplace)”:

$$\mathrm{Precision}_{bplace} = \frac{\#\text{ people with bplace correctly extracted}}{\#\text{ people with bplace extracted}} \qquad (9.2)$$

$$\mathrm{Pseudo\text{-}rec}_{bplace} = \frac{\#\text{ people with bplace correctly extracted}}{\#\text{ people with bplace in test set}} \qquad (9.3)$$

$$\mathrm{F\text{-}score}_{bplace} = \frac{2 \cdot \mathrm{Precision}_{bplace} \cdot \mathrm{Pseudo\text{-}rec}_{bplace}}{\mathrm{Precision}_{bplace} + \mathrm{Pseudo\text{-}rec}_{bplace}} \qquad (9.4)$$
Since the true values of each attribute are obtained from a cleaner and normalized
person-database (NNDB), not all the attribute values may be present in the Wikipedia
article for a given name. Thus, the evaluation results also report accuracy on the
subset of names for which the value of a given attribute is also explicitly stated in
the article. This is denoted as:
4Gender is also extracted automatically as a biographic attribute.
$$\mathrm{Acc}_{truth\ pres} = \frac{\#\text{ people with bplace correctly extracted}}{\#\text{ people with true bplace stated in article}} \qquad (9.5)$$
A domain model was further applied for each attribute to filter noisy targets extracted
from lexical patterns. The domain models of attributes include lists of acceptable val-
ues (such as lists of places, occupations and religions) and structural constraints such
as possible date formats for “Birthdate” and “Deathdate”. The row with subscript
“RH02” in Table 9.6 shows the performance of this Ravichandran and Hovy (2002)
model with additional attribute domain modeling for each attribute, and Table 9.5
shows the average performance across all attributes.
9.5 Partially Untethered Templatic
Contextual Patterns
The pattern-learning literature for fact extraction often consists of patterns with
a “hook” and “target” (Mann and Yarowsky, 2005). For example, in the pattern
“<Name> was born in <Birthplace>”, “<NAME>” is the hook and “<Birthplace>”
is the target. The disadvantage of this approach is that the intervening dually-
tethered patterns can be quite long and highly variable, such as “<NAME> was
highly influential in his role as <Occupation>”. This problem was overcome by
modeling partially untethered variable-length ngram patterns adjacent to only the
target, with the only constraint being that the hook entity appear somewhere in the
sentence. This constraint is particularly viable in biographic text, which tends to fo-
cus on the properties of a single individual. Examples of these new contextual n-gram
features include “his role as <Occupation>” and “role as <Occupation>”. The pattern
probability model here is essentially the same as in Ravichandran and Hovy (2002);
only the pattern representation is changed. The rows with subscript “RH02imp”
in Tables 9.6 and 9.5 show performance gains using this improved templatic-pattern-
based model, yielding an absolute 21% gain in accuracy.
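The untethered-pattern extraction step can be sketched as follows; the tokenization is deliberately naive, and the sentence, hook and target are hypothetical:

```python
def untethered_patterns(sentence, hook, target, max_len=3):
    """Extract partially untethered patterns: the word n-grams (up to
    max_len words) immediately preceding the target mention, with the only
    tether being that the hook entity occurs somewhere in the sentence."""
    if hook not in sentence:
        return []
    tokens = sentence.split()
    t_tokens = target.split()
    patterns = []
    for i in range(len(tokens) - len(t_tokens) + 1):
        if tokens[i:i + len(t_tokens)] == t_tokens:   # target mention found
            for n in range(1, max_len + 1):
                if i - n >= 0:
                    patterns.append(" ".join(tokens[i - n:i]) + " <target>")
    return patterns

s = "Smith was highly influential in his role as architect"
print(untethered_patterns(s, "Smith", "architect"))
# ['as <target>', 'role as <target>', 'his role as <target>']
```

Note that the long dually-tethered pattern connecting “Smith” to “architect” never has to be matched; only the short right-adjacent contexts are kept.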
Attribute     Best rank in seed set   P(Rank)
Birthplace    1                       0.61
Birthdate     1                       0.98
Deathdate     2                       0.58
Gender        1                       1.0
Occupation    1                       0.70
Nationality   1                       0.83
Religion      1                       0.80

Table 9.3: Majority rank of the correct attribute value in the Wikipedia pages of the seed names, used for learning relative ordering among attributes satisfying the domain model.
9.6 Document-Position-Based Model
One of the properties of biographic genres is that primary biographic attributes 5
tend to appear in characteristic positions, often toward the beginning of the article.
5Hyperlinked phrases were used as potential values for all attributes except “Gender”. For “Gender”, pronouns were used as potential values, ranked according to their distance from the beginning of the page.
Figure 9.2: Distribution of the observed document mentions of Deathdate, Nationality
and Religion.
Thus, the absolute position (in percentage) can be modeled explicitly using a Gaussian
parametric model as follows for choosing the best candidate value $v^*$ for a given
attribute A:

$$v^* = \operatorname*{argmax}_{v \in \mathrm{domain}(A)} f(\mathrm{posn}_v \mid A) \qquad (9.6)$$

where the density $f(\mathrm{posn}_v \mid A)$ is given as

$$f(\mathrm{posn}_v \mid A) = \mathcal{N}(\mathrm{posn}_v;\, \mu_A, \sigma_A^2) = \frac{1}{\sigma_A \sqrt{2\pi}}\, e^{-(\mathrm{posn}_v - \mu_A)^2 / 2\sigma_A^2} \qquad (9.7)$$
In the above equation, $\mathrm{posn}_v$ is the absolute position ratio (position/length) and
$\mu_A$, $\sigma_A^2$ are the sample mean and variance based on the sample of correct position
ratios of attribute values in biographies with attribute A. Figure 9.2, for example,
shows the positional distribution of the seed attribute values for deathdate, nation-
ality and religion in Wikipedia articles, fit to a Gaussian distribution. Combining
this empirically derived position model with a domain model6 of acceptable attribute
values is effective enough to serve as a stand-alone model.
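A minimal sketch of this Gaussian position model (Eqs. 9.6-9.7), with hypothetical seed position ratios for “Birthdate”:

```python
import math

def fit_position_model(position_ratios):
    """Fit the Gaussian of Eq. 9.7 to observed position ratios
    (position/length) of correct seed attribute values."""
    n = len(position_ratios)
    mu = sum(position_ratios) / n
    var = sum((x - mu) ** 2 for x in position_ratios) / n
    return mu, var

def position_score(pos_ratio, mu, var):
    """Gaussian density f(posn_v | A)."""
    return math.exp(-((pos_ratio - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def best_candidate(candidates, mu, var):
    """argmax_v f(posn_v | A) over (value, position_ratio) candidates that
    already satisfy the attribute's domain model (Eq. 9.6)."""
    return max(candidates, key=lambda vp: position_score(vp[1], mu, var))[0]

# Hypothetical seed position ratios for "Birthdate" (early in the article):
mu, var = fit_position_model([0.02, 0.04, 0.03, 0.05])
print(best_candidate([("1887", 0.03), ("1915", 0.60)], mu, var))
# 1887
```

The date near the top of the article wins because its position ratio falls near the learned mean for the attribute.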
9.6.1 Learning Relative Ordering in the
Position-Based Model
In practice, for attributes such as birthdate, the first text pattern satisfying the
domain model is often the correct answer for biographical articles. Deathdate also
tends to occur near the beginning of the article, but almost always some point after
the birthdate. This motivates a second, sequence-based position model based on the
rank of the attribute values among other values in the domain of the attribute, as
follows:
$$v^* = \operatorname*{argmax}_{v \in \mathrm{domain}(A)} P(\mathrm{rank}_v \mid A) \qquad (9.8)$$
where $P(\mathrm{rank}_v \mid A)$ is the fraction of biographies having attribute A with the correct
value occurring at rank $\mathrm{rank}_v$, where rank is measured according to the relative order
in which the values belonging to the attribute domain occur from the beginning of the
article. The seed set was used to learn the relative positions between attributes, that
is, the rank of the correct attribute value in the Wikipedia pages of the seed names.
Table 9.3 shows the most frequent rank of the correct attribute value and Figure 9.3
6The domain model is the same as used in Section 9.4 and remains constant across all the models developed in this chapter.
Figure 9.4: Illustration of modeling “occupation” and “nationality” transitively via
consensus from attributes of neighboring names
shows the distribution of the correct ranks for a sample of attributes. It can be seen
that 61% of the time the first location mentioned in a biography is the individual’s
birthplace, while 58% of the time the second date in the article is the deathdate. Thus,
“Deathdate” often appears as the second date in a Wikipedia page, as expected.
These empirical distributions for the correct rank provide a direct vehicle for scoring
hypotheses, and the rows with “rel. posn” as the subscript in Table 9.6 show the
improvement in performance using the learned relative ordering. Averaging across
different attributes, Table 9.5 shows an absolute 11% average gain in accuracy of the
position-sequence-based models relative to the improved Ravichandran and Hovy-
based results achieved here.
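The rank-based scoring of Eq. 9.8 can be sketched as follows, with hypothetical seed ranks for “Deathdate”:

```python
from collections import Counter

def learn_rank_model(seed_ranks):
    """Estimate P(rank_v | A) as the fraction of seed biographies whose
    correct value appears at each rank among domain-satisfying candidates."""
    counts = Counter(seed_ranks)
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()}

def best_by_rank(candidates_in_order, rank_probs):
    """Eq. 9.8: score each candidate by the learned probability of its
    1-indexed rank of appearance and return the argmax."""
    scored = [(rank_probs.get(i + 1, 0.0), v)
              for i, v in enumerate(candidates_in_order)]
    return max(scored)[1]

# Hypothetical seed ranks for "Deathdate": usually the 2nd date mentioned.
probs = learn_rank_model([2, 2, 2, 1, 2, 3])
dates = ["12 May 1840", "3 June 1893", "1 Jan 1900"]
print(best_by_rank(dates, probs))
# 3 June 1893
```

The second date wins because most seed biographies place the deathdate at rank 2, after the birthdate.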
Figure 9.3: Empirical distribution of the relative position of the correct (seed) answers
among all text phrases satisfying the domain model for “birthplace” and “death date”.
9.7 Implicit Models
Some of the biographic attributes such as “Nationality”, “Occupation” and “Re-
ligion” can be extracted successfully even when the answer is not directly mentioned
in the biographic article. Two such models are presented in the following sections:
9.7.1 Extracting Attributes Transitively using
Neighboring Person-Names
Attributes such as “Occupation” are transitive in nature; that is, the person names
appearing close to the target name tend to have the same occupation as the target
name. Based on this intuition, a transitive model was implemented that predicts
occupation based on consensus voting via the extracted occupations of neighboring
names7 as follows:
$$v^* = \operatorname*{argmax}_{v \in \mathrm{domain}(A)} P(v \mid A, S_{neighbors}) \qquad (9.9)$$

where

$$P(v \mid A, S_{neighbors}) = \frac{\#\text{ neighboring names with attribute value } v}{\#\text{ neighboring names in the article}}$$
7Only the neighboring names whose attribute value can be obtained from an encyclopedic database were used. Furthermore, since this work deals with biographic pages that discuss a single person, all other person-names mentioned in the article whose attributes are present in an encyclopedia were considered for consensus voting.
Occupation    Weight Vector

English
Physicist     <magnetic:32.7, electromagnetic:18.2, wire:18.2, electricity:17.7, optical:14.5, discovered:11.2>
Singer        <song:40, hits:30.5, hit:29.6, reggae:23.6, album:17.1, francis:15.2, music:13.8, recorded:13.6, ...>
Politician    <humphrey:367.4, soviet:97.4, votes:70.6, senate:64.7, democratic:57.2, kennedy:55.9, ...>
Painter       <mural:40.0, diego:14.7, paint:14.5, fresco:10.9, paintings:10.9, museum of modern art:8.83, ...>
Auto racing   <renault:76.3, championship:32.7, schumacher:32.7, race:30.4, pole:29.1, driver:28.1>

German
Physicist     <faraday:25.4, chemie:7.3, vorlesungsserie:7.2, 1846:5.8, entdeckt:4.5, rotation:3.6, ...>
Singer        <song:16.22, jamaikanischen:11.77, platz:7.3, hit:6.7, solokünstler:4.5, album:4.1, widmet:4.0, ...>
Politician    <konservativen:26.5, wahlkreis:26.5, romano:21.8, stimmen:18.6, gewählt:18.4, ...>
Painter       <rivera:32.7, malerin:7.6, wandgemälde:7.3, kunst:6.75, 1940:5.8, maler:5.1, auftrag:4.5, ...>
Auto racing   <team:29.4, mclaren:18.1, teamkollegen:18.1, sieg:11.7, meisterschaft:10.9, gegner:10.9, ...>

Table 9.4: Sample of occupation weight vectors in English and German learned using the latent-attribute-based model.
The set of neighboring names is represented as $S_{neighbors}$, and the best candidate value
for an attribute A is chosen based on the fraction of neighboring names having
the same value for the respective attribute. Candidates are ranked according to this
probability and the row labeled “trans” in Table 9.6 shows that this model helps
in substantially improving the recall of “Occupation” and “Religion”, yielding a 7%
and 3% average improvement in F-measure respectively, on top of the position model
described in Section 9.6.
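A minimal sketch of the consensus vote in Eq. 9.9; the names and the encyclopedic lookup table are hypothetical:

```python
from collections import Counter

def transitive_vote(neighbor_names, attribute_db, attribute):
    """Eq. 9.9: predict an attribute value by consensus over the known
    values of other person-names in the article. `attribute_db` stands in
    for the encyclopedic lookup (name -> {attribute: value})."""
    values = [attribute_db[n][attribute] for n in neighbor_names
              if attribute in attribute_db.get(n, {})]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# Hypothetical neighbors found in a physicist's biography page:
db = {
    "Niels Bohr":  {"Occupation": "Physicist"},
    "Max Planck":  {"Occupation": "Physicist"},
    "Thomas Mann": {"Occupation": "Writer"},
}
print(transitive_vote(["Niels Bohr", "Max Planck", "Thomas Mann"], db, "Occupation"))
# ('Physicist', 0.6666666666666666)
```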
9.7.2 Latent-Attribute Models Based on Document-Wide Context Profiles
In addition to modeling cross-entity attributes transitively, attributes such as “Oc-
cupation” can also be modeled successfully using a document-wide context or topic
model. For example, the distribution of words occurring in a biography of a politi-
cian would be different from that of a scientist. Thus, even if the occupation is not
explicitly mentioned in the article, one can infer it using a bag-of-words topic profile
learned from the seed examples.
Given a value v for an attribute A (for example, v = “Politician” and A = “Occupation”),
a centroid weight vector is learned:

$$C_v = [w_{1,v}, w_{2,v}, \ldots, w_{n,v}] \qquad (9.10)$$

where

$$w_{t,v} = \frac{1}{N}\, tf_{t,v} \cdot \log \frac{|A|}{|t \in A|} \qquad (9.11)$$
$tf_{t,v}$ is the frequency of word t in the articles of people having attribute A = v,
$|A|$ is the total number of values of attribute A,
$|t \in A|$ is the total number of values of attribute A such that the articles of people
having one of those values contain the term t, and
N is the total number of people in the seed set.
Given a biography article of a test name and an attribute in question, a similar
word weight vector $C' = [w'_1, w'_2, \ldots, w'_n]$ is computed for the test name, and its cosine
similarity to the centroid vector of each value of the given attribute is measured.
Thus, the best value $v^*$ is chosen as:
$$v^* = \operatorname*{argmax}_v \frac{w'_1 w_{1,v} + w'_2 w_{2,v} + \cdots + w'_n w_{n,v}}{\sqrt{w'^2_1 + w'^2_2 + \cdots + w'^2_n}\; \sqrt{w^2_{1,v} + w^2_{2,v} + \cdots + w^2_{n,v}}} \qquad (9.12)$$
Tables 9.5 and 9.6 show performance using the latent document-wide-context model.
It can be seen that this model by itself gives the top performance on “Occupation”,
outperforming the best alternative model by 9% absolute accuracy, indicating the
usefulness of implicit attribute modeling via broad-context word frequencies.
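The centroid construction and cosine classification of Eqs. 9.10-9.12 can be sketched as follows, on a tiny hypothetical two-occupation seed set:

```python
import math
from collections import Counter

def build_centroids(articles_by_value):
    """Centroid weight vectors C_v per attribute value (Eqs. 9.10-9.11):
    w_{t,v} = (1/N) * tf_{t,v} * log(|A| / |{values whose articles contain t}|).
    `articles_by_value` maps value -> list of token lists of seed articles."""
    n_values = len(articles_by_value)
    n_people = sum(len(arts) for arts in articles_by_value.values())
    tf = {v: Counter(t for art in arts for t in art)
          for v, arts in articles_by_value.items()}
    df = Counter()            # number of values whose articles contain term t
    for v in tf:
        for t in tf[v]:
            df[t] += 1
    return {v: {t: (c / n_people) * math.log(n_values / df[t])
                for t, c in tf[v].items()}
            for v in tf}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(test_tokens, centroids):
    """Eq. 9.12: pick the value whose centroid is most cosine-similar to
    the test article's term vector."""
    test_vec = Counter(test_tokens)
    return max(centroids, key=lambda v: cosine(test_vec, centroids[v]))

# Tiny hypothetical seed set for the "Occupation" attribute:
seeds = {
    "Singer":     [["song", "album", "hit"], ["album", "recorded"]],
    "Politician": [["votes", "senate"], ["senate", "democratic"]],
}
cents = build_centroids(seeds)
print(classify(["album", "song", "tour"], cents))
# Singer
```

Note that the test article never states the occupation itself; the topical vocabulary alone decides the classification.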
This latent-attribute-based model can be further extended using the multilingual
nature of Wikipedia. The corresponding German pages of the training names were
used to model the German word distributions characterizing each seed occupation.
Table 9.6 shows that English attribute classification can be successful using only the
words in a parallel German article; while underperforming the stand-alone direct
English models, this additional information gives up to a 1% additional gain in
combination. For some attributes, the performance of the latent-attribute-based model
trained cross-lingually (denoted latentCL) is close to that of English, suggesting
potential future work exploiting this multilingual dimension.
It is interesting to note that although both the transitive model and the latent wide-
context model do not rely on the actual “Occupation” being explicitly mentioned in the
article, they still outperform explicit pattern-based and position-based models. This
implicit modeling also helps in improving the recall of less often directly mentioned
attributes such as a person’s “Religion”.
Model                                      Fscore   Acc_truth present
Ravichandran and Hovy, 2002                0.37     0.43
Improved RH02 Model                        0.54     0.64
Position-Based Model                       0.53     0.75
Combined (above 3 + trans + latent + cl)   0.59     0.78
Combined + Age Dist + Corr                 0.62     0.80
                                           (+24%)   (+37%)

Table 9.5: Average performance of different models across all biographic attributes.
9.8 Model Combination
While the pattern-based, position-based, transitive and latent-attribute-based
models are all stand-alone models, they can complement each other in combination as
they provide relatively orthogonal sources of information. To combine these models,
a simple backoff-based combination is used for each attribute based on stand-alone
model performance, and the row with subscript “combined” in Tables 9.5 and 9.6
shows an average 14% absolute performance gain of the combined model relative to
the improved Ravichandran and Hovy (2002) model.
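A minimal sketch of such a per-attribute backoff combination; the model names and priority order are illustrative, not the tuned order used in the experiments:

```python
def backoff_combine(predictions, model_order):
    """Per-attribute backoff combination: return the answer of the first
    model, in a priority order chosen from held-out stand-alone
    performance, that produced any answer at all."""
    for model in model_order:
        if predictions.get(model) is not None:
            return predictions[model]
    return None

# Hypothetical priority for "Occupation", latent model ranked first:
preds = {"latent": None, "trans": "Physicist", "rel_posn": "Writer"}
print(backoff_combine(preds, ["latent", "trans", "rel_posn", "RH02imp"]))
# Physicist
```

Because the component models draw on relatively orthogonal information, even this simple backoff recovers answers that any single model misses.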
9.9 Further Extensions: Reducing False
Positives
Since the position-and-domain-based models will almost always posit an answer,
one of the problems is the high number of false positives yielded by these algorithms.
The following sections introduce further extensions using interesting properties of
Attribute (model)              Precision  Pseudo-recall  Fscore  Acc_truth present

Birthdate (RH02)               0.86       0.38           0.53    0.88
Birthdate (RH02imp)            0.52       0.52           0.52    0.67
Birthdate (rel. posn)          0.42       0.40           0.41    0.93
Birthdate (combined)           0.58       0.58           0.58    0.95
Birthdate (comb+age dist)      0.63       0.60           0.61    1.00

Deathdate (RH02)               0.80       0.19           0.30    0.36
Deathdate (RH02imp)            0.50       0.49           0.49    0.59
Deathdate (rel. posn)          0.46       0.44           0.45    0.86
Deathdate (combined)           0.49       0.49           0.49    0.86
Deathdate (comb+age dist)      0.51       0.49           0.50    0.86

Birthplace (RH02)              0.42       0.38           0.40    0.42
Birthplace (RH02imp)           0.41       0.41           0.41    0.45
Birthplace (rel. posn)         0.47       0.41           0.44    0.48
Birthplace (combined)          0.44       0.44           0.44    0.48
Birthplace (combined+corr)     0.53       0.50           0.51    0.55

Occupation (RH02)              0.54       0.18           0.27    0.26
Occupation (RH02imp)           0.38       0.34           0.36    0.48
Occupation (rel. posn)         0.48       0.35           0.40    0.50
Occupation (trans)             0.49       0.46           0.47    0.50
Occupation (latent)            0.48       0.48           0.48    0.59
Occupation (latentCL)          0.48       0.48           0.48    0.54
Occupation (combined)          0.48       0.48           0.48    0.59

Nationality (RH02)             0.40       0.25           0.31    0.27
Nationality (RH02imp)          0.75       0.75           0.75    0.81
Nationality (rel. posn)        0.73       0.72           0.71    0.78
Nationality (trans)            0.51       0.48           0.49    0.49
Nationality (latent)           0.56       0.56           0.56    0.56
Nationality (latentCL)         0.55       0.48           0.51    0.48
Nationality (combined)         0.75       0.75           0.75    0.81
Nationality (comb+corr)        0.77       0.77           0.77    0.84

Gender (RH02)                  0.76       0.76           0.76    0.76
Gender (RH02imp)               0.99       0.99           0.99    0.99
Gender (rel. posn)             1.00       1.00           1.00    1.00
Gender (trans)                 0.79       0.75           0.77    0.75
Gender (latent)                0.82       0.82           0.82    0.82
Gender (latentCL)              0.83       0.72           0.77    0.72
Gender (combined)              1.00       1.00           1.00    1.00

Religion (RH02)                0.02       0.02           0.04    0.06
Religion (RH02imp)             0.55       0.18           0.27    0.45
Religion (rel. posn)           0.49       0.24           0.32    0.73
Religion (trans)               0.38       0.33           0.35    0.48
Religion (latent)              0.36       0.36           0.36    0.45
Religion (latentCL)            0.30       0.26           0.28    0.22
Religion (combined)            0.41       0.41           0.41    0.76
Religion (combined+corr)       0.44       0.44           0.44    0.79

Table 9.6: Performance comparison of all the models across several biographic attributes. Bolded accuracies indicate the top-performing model.
biographic attributes to reduce the effect of false positives.
9.9.1 Using Inter-Attribute Correlations
One way to reduce false positives is to filter out empirically incompatible
inter-attribute pairings. The motivation here is that the attributes are not
independent of each other when modeled for the same individual. For example,
P(Religion=Hindu | Nationality=India) is higher than P(Religion=Hindu | Nation-
ality=France) and similarly one can find positive and negative correlations among
other attribute pairings. For implementation, all possible 3-tuples of (“Nationality”,
“Birthplace”, “Religion”)8 were considered and searched on NNDB for the presence
of the tuple for any individual in the database (excluding the test data). As an ag-
gressive but effective filter, this model filters the tuples for which no name in NNDB
was found containing the candidate 3-tuples. The rows with label “combined+corr”
in Tables 9.5 and 9.6 show substantial performance gains using inter-attribute
correlations, such as the 7% absolute average gain for Birthplace over the Section 9.8
combined models, and a 3% absolute gain for Nationality and Religion.
8The test of joint presence among these three attributes was used since they are strongly correlated.
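The aggressive attestation filter can be sketched as follows; the database tuples are hypothetical:

```python
def correlation_filter(candidate_tuples, attested_tuples):
    """Aggressive inter-attribute filter: keep only those candidate
    (Nationality, Birthplace, Religion) 3-tuples attested for at least one
    individual in the encyclopedic database (test data excluded)."""
    attested = set(attested_tuples)
    return [t for t in candidate_tuples if t in attested]

# Hypothetical attested tuples from the database:
db = [("India", "Mumbai", "Hindu"), ("France", "Paris", "Catholic")]
cands = [("India", "Mumbai", "Hindu"), ("France", "Paris", "Hindu")]
print(correlation_filter(cands, db))
# [('India', 'Mumbai', 'Hindu')]
```

The unattested France/Hindu combination is dropped even though each value individually satisfies its attribute's domain model.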
Figure 9.5: Age distribution of famous people on the web (from www.spock.com)
9.9.2 Using Age Distribution
Another way to filter out false positives is to consider distributions over meta-
attributes. For example, while age is not explicitly extracted, one can use the fact that
age is a function of two extracted attributes (<Deathyear> − <Birthyear>) and use
the age distribution to filter out false positives for <Birthdate> and <Deathdate>.
Based on the age distribution for famous people9 on the web shown in Figure 9.5, one
can bias against unusual candidate lifespans and filter out completely those outside
the range of 25-100, as most of the probability mass is concentrated in this range.
Rows with the subscript “comb+age dist” in Table 9.6 show the performance gains using
this feature, yielding an average 5% absolute accuracy gain for Birthdate.
9Since all the seed and test examples were drawn from nndb.com, the age distribution of famous people on the web was used: http://blog.spock.com/2008/02/08/age-distribution-of-people-on-the-web/
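The lifespan filter above amounts to rejecting (birthyear, deathyear) pairs whose implied age falls outside 25-100; a minimal sketch with illustrative candidate pairs:

```python
# Sketch of the age-distribution filter (Section 9.9.2): candidate
# (birthyear, deathyear) pairs implying an age outside 25-100 are discarded.

def plausible_lifespan(birthyear, deathyear, lo=25, hi=100):
    """True if deathyear - birthyear falls in the plausible lifespan range."""
    return lo <= deathyear - birthyear <= hi

pairs = [(1920, 1995), (1950, 1960), (1800, 1995)]
kept = [p for p in pairs if plausible_lifespan(*p)]
print(kept)  # only (1920, 1995) survives: ages 10 and 195 are implausible
```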
9.10 Statistical Significance of Results
Using a binomial test of sample size 100 and the baseline accuracy of 43%
(Ravichandran and Hovy, 2002) in Table 9.5, any improvement in accuracy over 51%
is statistically significant with a p-value less than 0.05. With respect to the improved
baseline of 64% accuracy (Improved RH02 Model), any improvement in accuracy over
72% is statistically significant. For results at a per-attribute level presented in Table
9.6, all the accuracies obtained using the best model (reported in bold) developed in
this chapter are statistically significant with respect to the baseline model (RH02),
with a p-value less than 0.05 using the binomial test.
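The significance thresholds quoted above can be verified with an exact one-sided binomial test, sketched here using only the standard library:

```python
from math import comb

def binom_tail(k, n, p):
    """One-sided binomial test: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With n=100 and the 43% RH02 baseline, 52 correct answers (any accuracy
# over 51%) rejects the null hypothesis at the 0.05 level, while 51 does not.
print(binom_tail(52, 100, 0.43), binom_tail(51, 100, 0.43))
```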
9.11 Extracting factual relationships
from noisy sources for a wider range
of attributes
This section provides a foray into extracting factual relationships from non-
biographic genres where the target entity for extraction is not clear. More specifically,
the facts presented in the given document can belong to multiple entities, resulting
in more noisy extractions. As a case study, the attribute specifications from the Text
Analysis Conference (TAC) 2009 Knowledge Base Population (KBP)10 task were utilized for extraction11.
10http://apl.jhu.edu/˜paulmac/kbp.html
The idea behind the TAC KBP task is to explore information extraction of entities
with reference to an external knowledge source. The main challenge here is to popu-
late and expand an existing ontology of entities and their attributes with knowledge
extracted from free text. The slot filling system is given an ontology of entities where
each node contains different attribute-value pairs of the respective entity, and a document collection that may contain mentions of some of the entities in the ontology.
Using the facts present in the ontology as training data, the system should identify
correct mentions of the query entity in a large document collection and augment the
ontology with the facts extracted from these mentions.
Another challenge of this exercise was to test the robustness of current slot filling
approaches on 42 diverse attributes such as “causes of death”, “alternate names”,
etc.
Such a slot filling system has many different components such as document/sentence
selection based on the query entity, pattern-based and domain models for fact ex-
traction, answer ranking, redundancy detection, linking nodes in the ontology, etc.
This section will focus on the pattern-based component for fact extraction given a
query-relevant sentence. The pattern-based approach used in the system is similar to that
used for biographic genres, with the difference that the target entity is ambiguous and
hence partially untethered patterns are used for modeling the target value, as explained
11Thanks to Mark Dredze, Tim Finin, Adam Gerber, James Mayfield, Paul McNamee, Christine Piatko and David Yarowsky for helping in developing different components of the Johns Hopkins University TAC KBP system.
in Section 9.5. Some examples of such partially untethered patterns on this data are
also shown in Tables 9.7, 9.8, 9.9 and 9.10.
9.11.1 Analysis of pattern learning component for
fact extraction
The diverse and ambiguous nature of the document collection led to the generation of
a large number of noisy patterns. While the noisy nature of the patterns hurts the precision
of candidate facts extracted, the domain model component of the slot filling system is
used for filtering noisy candidates. However, some attributes such as “parents”
do not have a strong domain model component, and for these it is essential to filter out noisy
or excessively broad patterns such as “of < A >”12.
9.11.2 Manually Filtering Patterns
One of the lessons learned during the TAC exercise was that pattern lists for
such diverse attribute sets are noisy, and a manual filtering step can significantly
aid in identifying clean attribute-specific patterns. A sample of manually filtered
untethered patterns for a few attributes is shown in Tables 9.7, 9.8, 9.9 and 9.10.
Manually filtering patterns, however, was noted to be a time-consuming process. In
12< A > denotes the attribute value, for example, the name of the query entity’s “parent”.
Title: by the british < A > | american landscape < A > | was a guest < A > | late british < A > | american screenwriter and < A > | modification ; japanese < A > | ) , Chinese < A > | indian model and < A > | , a scottish < A > | French director and < A > | is american < A > | any Russian < A > | peters , american < A > | was american < A > | is a leading < A > | is an awesome < A > | was a pakistani < A > | influential Russian < A > | young mexican < A > | 84 , american < A >
Spouse: former husband , < A > | wife of tsar < A > | married actress < A > | widow , < A > | wife of < A > | he married < A > | ) ; married < A > | < A > spouse of | married to actress < A > | husband of < A > | hubby < A > | < A > says his wife | s wife , < A > | she and husband < A > | < A > married | he and wife < A > | < A > , widow of | her marriage to < A > | was married to < A > | before marrying < A >
Age: < A > -year-old son . | < A > -year-old man | < A > -year-old girl | < A > years old . | < A > years old ) | < A > -year-old son , | < A > , died | < A > -year-old woman | < A > -year-old son | < A > -year-old | < A > ) died at | < A > years old when | < A > , was born | < A > , was arrested | < A > , was married | he was < A > | < A > , was named | < A > , was appointed | < A > -year-old , who | died at < A >
Alternate Name: maiden name , < A > | stage name ” < A > | been known as < A > | better known as < A > | formerly known as < A > | known as < A > | her stage name < A > | popularly known as < A > | born as < A > | well-known as < A > | professionally known as < A > | known as rapper < A > | is best-known as < A > | otherwise known as < A > | reborn as < A > | known as mrs. < A > | stage name < A > | forever known as < A > | universally known as < A > | once known as < A >
Table 9.7: Sample of untethered patterns that were annotated as high quality by human annotators.
Children: < A > , son of | , whose son < A > | son , king < A > | a daughter of < A > | own son , < A > | ’s son , < A > | daughters : < A > | and her son < A > | infant son , < A > | daughter : < A > | father of singer < A > | his daughter < A > | his oldest son < A > | marrying his daughter < A > | one daughter ; < A > | for her son < A > | a daughter , < A > | < A > ’s father , | of the son < A > | and successor , < A >
Other family: < A > grandfather | her cousin , < A > | a grandson of < A > | < A > , grandson of | nephew of < A > | grandchildren , < A > | niece < A > | cousin of < A > | his grandson < A > | , nephew of < A > | < A > great grandchildren | cousins < A > | < A > cousin | < A > grandson of | < A > a niece | his uncle , < A > | aunt < A > | his uncle < A > | grandparents < A > | grandmother of < A >
Table 9.8: Sample of untethered patterns that were annotated as high quality by human annotators.
Cause of Death: from complications of < A > | < A > victim ’ s | dies of < A > | in 1995 of < A > | , dies of < A > | , died of < A > | her death from < A > | died friday of < A > | < A > by hanging . | he died of < A > | he suffered a < A > | < A > -related complications | attack suffered < A > | was diagnosed with < A > | she died of < A > | , died from < A > | complications from < A > | commit mass < A > | after suffering a < A > | of death was < A >
Charges: < A > and sentenced to | < A > trial , | < A > conviction | < A > charges . | < A > and sentenced | convicted of < A > | < A > conviction , | < A > conviction , | < A > case | < A > trial | < A > investigation | < A > case , | < A > charges | tried for < A >
Date of Birth: ) ( b. < A > | : born < A > | b. < A > | , born in < A > | was born < A > | < A > - d. | ( b. < A > | was born on < A > | d. < A > | < A > and died | ( born < A > | born on < A > | born in < A > | < A > ; died | was born in < A > | < A > ) is an | b : < A > | born : < A > | , b. < A > | data : born < A >
Date of Death: died < A > | death date = < A > | ( d. < A > | death in < A > | died in < A > | < A > death | sad : like | was assassinated on < A > | killed in < A > | having died in < A > | < A > - death of | , and died < A > | died on < A > | assassinated on < A > | died c. < A > | ( died < A > | , died < A > | who died < A > | he died < A > | passed away in < A > | death date < A >
Table 9.9: Sample of untethered patterns that were annotated as high quality by human annotators.
Place of Birth: his birthplace , < A > | ’s birthplace in < A > | man born in < A > | : born < A > | birthplace = < A > | ’s born < A > | producer. born in < A > | < A > -born ” | born < A > | is born at < A > | birth country : < A > | < A > , born | < A > born striker | 1981 birth place : < A > | < A > -born former | although born in < A > | birth : < A > | composer born in < A > | < A > .born | his birthday < A >
Place of Death: passed away in < A > | being killed in < A > | just died in < A > | died in a < A > | and killed in < A > | died in his < A > | deathplace = < A > | his death in < A > | < A > till his death | death at < A > | death at her < A > | and died at < A > | died : in < A > | and murdered in < A > | < A > death camp | and death at < A > | killed in < A > | died in < A > | soldiers to < A > | ’ dies in < A >
Schools attended: mater , the < A > | < A > , qb | economics at < A > | doctoral degree from < A > | a doctorate from < A > | he attended < A > | economics at the < A > | physics from < A > | qb , < A > | a former < A > | played at < A > | graduate of the < A > | < A > university years | < A > football | year at < A > | < A > quarterback | campus of the < A > | attending < A > | < A > , hb | college career at < A >
Religion: < A > martyr in scotland | hutchison of the < A > | member of the < A > | < A > and muslim communities | < A > church when | a fundamentalist < A > | < A > church ’s top | < A > vs. sunni | the walnut hill < A > | politicization of < A > | < A > shrines in | < A > church south of | < A > church in memphis | < A > faith | of the methodist < A > | < A > church | rev . | < A > denominations and | < A > church today | < A > fellowship , which | < A > and muslim leaders
Table 9.10: Sample of untethered patterns that were annotated as high quality by human annotators.
the next sections, automated approaches for filtering noisy patterns are discussed.
9.11.3 Filtering Noisy Patterns Automatically
Several corpus statistics can be used for automatically filtering the patterns. For
each of the patterns the following measures were recorded during the pattern learning
step:
1. Token count: This is the total number of correct values extracted for a given
attribute.
2. Type count: This is the unique number of correct values extracted for a given
attribute. It was used to prevent a single overwhelming mention and its attribute
value from artificially inflating the pattern score.
3. Slot count: This is the number of different attributes for which the pattern applies.
For example, patterns such as “of < A >” will extract many different attributes,
while patterns such as “son of < A >” will apply mostly to the “parent” attribute.
The idea here is similar to document frequency: the lower its value, the better.
4. TF.IDF1 = Token count × log(N / Slot count), where N is the number of attributes (or slot types).

5. TF.IDF2 = Type count × log(N / Slot count), where N is the number of attributes (or slot types).
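The five statistics above can be computed in a single pass over the pattern extractions. The following sketch uses a hypothetical toy extraction table, not actual TAC output:

```python
from collections import defaultdict
from math import log

# Sketch of the pattern statistics of Section 9.11.3. `extractions` maps
# (pattern, attribute) pairs to the list of correct values the pattern
# extracted; the data below is an illustrative toy example.

extractions = {
    ("son of < A >", "parents"): ["John Smith", "Mary Smith", "John Smith"],
    ("of < A >", "parents"): ["John Smith"],
    ("of < A >", "birthplace"): ["London", "Paris"],
    ("of < A >", "religion"): ["Hinduism"],
}

def pattern_scores(extractions):
    token = defaultdict(int)   # 1. token count: total correct values
    types = defaultdict(set)   # 2. type count: unique correct values
    slots = defaultdict(set)   # 3. slot count: attributes the pattern serves
    for (pat, attr), values in extractions.items():
        token[pat] += len(values)
        types[pat].update(values)
        slots[pat].add(attr)
    n = len({attr for _, attr in extractions})  # N: number of slot types
    return {
        pat: {
            "tfidf1": token[pat] * log(n / len(slots[pat])),      # measure 4
            "tfidf2": len(types[pat]) * log(n / len(slots[pat])), # measure 5
        }
        for pat in token
    }

scores = pattern_scores(extractions)
# The attribute-specific "son of < A >" outscores the overly broad "of < A >".
```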
9.11.4 Evaluating automatic pattern filtering
measures
In order to evaluate the automatic pattern filtering measures, the manually selected
patterns were used as the gold standard. Each of the pattern scores provides an n-best
ranking of the patterns, and the accuracy for each of the pattern lists using this ranking
is computed. For a given attribute, the accuracy over the top n patterns for that
attribute is computed as the fraction of those n patterns that are present in the manually
selected list. Table 9.11 reports the average accuracy over all the attributes.
Attribute   Token count   Type count   Slot count   TFIDF1   TFIDF2
Top 5       0.385         0.149        0.077        0.133    0.467
Top 10      0.344         0.174        0.079        0.141    0.421
Top 20      0.277         0.141        0.071        0.140    0.383
Top 50      0.208         0.103        0.061        0.124    0.308
Table 9.11: Pattern relevance based on presence in the high-quality pattern list generated by human annotators. “Top 5” indicates the fraction of the top 5 patterns generated by the algorithm that were marked by annotators as high-quality patterns. The results are averaged over all attributes.
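The top-n accuracy computation behind Table 9.11 can be sketched as follows; the ranked patterns and the gold list here are illustrative:

```python
# Sketch of the evaluation of Section 9.11.4: for each attribute, the
# fraction of the top-n automatically ranked patterns that appear in the
# manually selected (gold) pattern list.

def top_n_accuracy(ranked, gold, n):
    """Fraction of the top-n ranked patterns present in the gold set."""
    top = ranked[:n]
    return sum(1 for p in top if p in gold) / len(top)

ranked = ["known as < A >", "of < A >", "stage name < A >", "born < A >"]
gold = {"known as < A >", "stage name < A >"}
print(top_n_accuracy(ranked, gold, 4))  # 0.5
```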
Since the TFIDF2 measure performs best, the unique number of correct values extracted
by a given pattern (type count) is evidently a useful indicator of pattern relevance.
The other component of this measure, the number of different slots in which a pattern
occurs, indicates that slot-specificity is also a useful component for determining
pattern relevance. Additional factors that can be useful for improving pattern relevance
are discussed in the following section.
9.11.5 Error analysis
While Table 9.11 provides some insight into which features may be useful in automatically
filtering the patterns, there is still substantial room for improvement. Following
are some of the lessons learned from error analysis of the generated patterns:
• Numeric attributes: Numeric attributes such as “number of employees” resulted
in poor pattern accuracy due to the wide range of values for the attribute.
However, a syntactic domain model that can identify the range of potential
values was utilized as one of the components in the pipeline to account for
such attributes.
• All the attributes dealing with aliases (“alternate names”) also resulted in poor
pattern accuracy, since most of the values for alternate names are mentioned
in parentheses, resulting in noisy generic patterns that can apply in many
contexts.
• Attributes such as “political and religious affiliations” and “origin” do not naturally
occur in typical contexts and hence are difficult to model using a pattern-based
approach.
• Annotator errors: Given the time constraints, the annotators were only able to
sift through a subset of the patterns in order to select good ones. In some cases
the annotators completely missed good patterns in the initial screening, and in
some cases noisy patterns were incorporated into the final set.
• Related patterns: Since the evaluation reports an “exact-match” accuracy, some
of the patterns that are similar to the ones selected by the annotators were also
marked incorrect. For example, even though the pattern “better known as
< A >” is similar to “also known as < A >”, which is on the list of selected
patterns, the former will be marked as incorrect. A possible direction for
evaluation of patterns is to perform fuzzy matching of patterns using content
words.
9.12 Application of Position-based Model
to News Data
While formal biographies such as Wikipedia articles have a well-defined structure
that can be easily modeled for salient positions of the attributes, such characteristic
positions are much less clear in non-biographic articles such as news articles. This is
due to the ambiguous nature of a news article, as opposed to the monosemous nature
of a Wikipedia article, with respect to the target named entity. A news article may
contain many named entities, and biographic attributes that appear in the article may
belong to any of these entities. This section presents an empirical study on finding
such biographic position indicators in news data using a sample of New York Times
articles.
9.12.1 Corpora Details
The initial set of people names consisted of 58 names along with their occupations.
These names were then queried on the New York Times website to find recent news
about the people in this set. A total of 134 articles were found, covering a subset of
27 of the people names from the initial set. This set of articles was used for modeling
the position of the “occupation” attribute in the article, both globally and with respect
to the name mention.
9.12.2 Global Position Model of “Occupation”
Attribute
Figure 9.6 shows the histogram of the position of the correct “occupation” in the
overall article, without any relation to where the respective name was mentioned in
the article. While such a global position model is useful for formal biographies, as
explained in Section 9.6, Figure 9.6 shows that the distribution of positions has a
large variance, motivating the more localized models explained in the next sections.
Figure 9.6: Global position of the “occupation” attribute in the New York Times articles.
The position is given as the fraction of the article length on the X-axis, and the Y-axis
gives the number of times an “occupation” attribute was found at that fraction.
9.12.3 Modeling Position with respect to the First
Name Mention
Figure 9.7 shows the histogram of the relative distance of the correct “occupation”
with respect to the first full mention of the target name. The first mention of the
name usually licenses the author to provide additional biographic information either
just before it (as a premodifier) or near any of the following coreferent mentions. Figure
9.7 shows that the overwhelming indicator of the correct “occupation” of the name is
the premodifier (-1) position. Furthermore, the majority of the remaining probability
mass is also concentrated in a small window near the name mention, within a distance
of 10 words.
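The statistic behind Figure 9.7, the signed word distance of each occupation string from the first full name mention, can be sketched as follows; the tokenized snippet is a hypothetical toy example:

```python
# Sketch of the relative-position statistic of Section 9.12.3: signed token
# distance of each occupation occurrence from the start of the first full
# name mention (negative distance = premodifier position).

def relative_positions(tokens, name_tokens, attribute_value):
    """Signed word distances of `attribute_value` occurrences from the
    first full mention of the name; empty list if the name never occurs."""
    first = -1
    for i in range(len(tokens) - len(name_tokens) + 1):
        if tokens[i:i + len(name_tokens)] == name_tokens:
            first = i
            break
    if first < 0:
        return []
    return [i - first for i, tok in enumerate(tokens) if tok == attribute_value]

tokens = "British rider Phil Collins won the speedway title".split()
print(relative_positions(tokens, ["Phil", "Collins"], "rider"))  # [-1]
```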
9.12.4 Modeling Position with respect to the
Closest Name Mention
Often there are multiple mentions of the target name in the article, and it is useful
to model the position of the correct occupation relative to the closest target mention,
in order to obtain a better localized model. Figure 9.8 shows the histogram of the
distance of the correct “occupation” from the closest full name mention. One can see
that it looks very similar to the histogram of positions relative to the first mention
(Figure 9.7), which may be due to the fact that full names are usually mentioned only once in
Figure 9.7: Distribution of “occupation” attribute from first full mention of the name
in the New York Times articles.
Figure 9.8: Distribution of “occupation” attribute from the closest full mention of
the name in the New York Times articles.
the article and tend to be the first mention. Later mentions of the name often
use only part of the name, such as the first name or the last name, and the next section
describes taking partial name matches into account for position modeling.
9.12.5 Modeling Position with respect to the
Closest Full or Partial Name Mention
Figure 9.9 shows the histogram of the correct “occupation” attributes with respect
to the closest full or partial name mentions. The partial name mentions were
approximated via the usage of the first name or last name of the target name; a more
exact model would require full coreference chain analysis. One can see in Figure 9.9
that the premodifier position still remains the most salient indicator of the “occupation”
attribute; however, the frequency at the -1 position is higher due to increased coverage
via partial matches. Furthermore, this increased coverage also results in additional
“occupation” matches within a small window of five to six words.
9.12.6 Analysis
The various histograms in Figures 9.6, 9.7, 9.8 and 9.9 motivate an “occupation”
extraction model for news that places a very high prior on the premodifier position of
the target name for extracting correct occupation values. Furthermore, the majority
of the remaining probability mass occurs in a small window centered around any
Figure 9.9: Distribution of “occupation” attribute from the closest full or partial
(first name or last name) mention of the name in the New York Times articles.
mention of the target name. This motivates modeling the “occupation” attribute using a
narrow bag-of-words as candidates, in conjunction with an appropriate domain model
for the “occupation” attribute, as utilized in Section 9.6 for the biographic genre.
9.13 Using Biographical Facts for Name
Disambiguation
This work considers the task of resolving a first name or last name mention
in unstructured text to the correct Wikipedia page. This task is along the lines
of Bunescu and Pasca (2006) and Cucerzan (2007), who make use of the entire text
on the Wikipedia page and the mention page to perform disambiguation. However,
the goal here is to test the effectiveness of biographical attributes in disambiguation,
and this work reports some preliminary results on how often a name can be disambiguated
by just using an “occupation” match13.
A preliminary name disambiguation experiment was performed on a set of 100 first
name or last name mentions. In order to test the “occupation” match, the mentions
were chosen such that an occupation string occurs in the 5-word premodifying
or appositive context14 of the mention; they were chosen randomly from the English
13Biographical features have also been used for cross-document coreference by Mann and Yarowsky (2003) in combination with the full bag-of-words model.
14Nenkova and McKeown (2003) showed in their corpus study that name-external evidence such as “occupation” and “nationality” often occurs in the premodifying or appositive context of the mention.
..... British rider Phil Collins will be among the favorites tonight in the 19th U.S. championship speedway motorcycle races at the Orange County Fairgrounds in ........
Phil Collins (1) (Speedway rider)
Phil Collins (2) (Musician)
Phil Collins (3) (Baseball player)
Phil Collins (4) (Artist, Photographer)
Figure 9.10: Application of biographical attributes for name disambiguation: disambiguating
a mention of “Phil Collins” to the correct Wikipedia entry using the premodifying
occupation “rider”. Similarly, other biographical attributes such as the nationality
premodifier “British” can also be used for disambiguation. This can be further
improved by using compatible occupations as shown in Table 9.13.
ACE-2005 (Walker et al., 2006) training set. An example of using occupation for
disambiguating a mention of the name “Phil Collins” is shown in Figure 9.10.
The baseline model is to simply string-match the mention to the names in
Wikipedia and choose randomly if more than one name is present. The “occupation”
model extracts the occupation present in the premodifying or appositive context of
the mention and selects the candidate Wikipedia page with the matching “occupation”.
The first two rows of Table 9.12 show the performance gain using exact
“occupation” match.
However, using exact “occupation” match was found to be too conservative, as it
would count the mention “musician Yanni” as a mismatch with “Composer” or “Pianist”,
the occupations mentioned on his Wikipedia page. In order to solve this
problem, the correlation among values of “occupation” was measured using the number
of names in a biographical database15 that share those occupations. Table 9.13 shows
a sample of the “occupation” correlations.
Results for Name Disambiguation
Model                                  Accuracy
Name string match                      0.15
+ Exact occupation match               0.46
+ Occupation match with correlation    0.56
Table 9.12: Name disambiguation performance for matching first or last name mentions to a Wikipedia person page
15The biographical database used was Freebase (www.freebase.com), as it is a Wikipedia-centric database.
The “occupation” correlations were used for fuzzy matching in name disambiguation:
the candidate whose “occupation” had the highest correlation with the “occupation”
of the mention was chosen. The third row in Table 9.12 shows the performance
gain using this feature. These preliminary results show promise for using additional
biographical features, and a likely fruitful line of future work is to integrate all the
automatically extracted biographical features presented in this work with a full name
disambiguation/cross-document coreference system.
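The correlation-based fuzzy match can be sketched as a lookup over co-occurrence counts in the style of Table 9.13; the counts and candidate pages below are illustrative stand-ins, not the actual Freebase figures:

```python
# Sketch of correlation-based occupation matching for name disambiguation.
# `cooccur` counts people sharing each occupation pair (Table 9.13 style);
# the numbers here are hypothetical placeholders.

cooccur = {
    frozenset(["Musician", "Composer"]): 900,
    frozenset(["Musician", "Pianist"]): 700,
    frozenset(["Musician", "Baseball player"]): 3,
}

def best_candidate(mention_occ, candidates, cooccur):
    """Pick the candidate (page, occupation) whose occupation correlates
    most strongly with the occupation found in the mention's context."""
    def score(cand_occ):
        if cand_occ == mention_occ:   # exact match dominates fuzzy match
            return float("inf")
        return cooccur.get(frozenset([mention_occ, cand_occ]), 0)
    return max(candidates, key=lambda page: score(page[1]))

candidates = [("Yanni (composer)", "Composer"),
              ("Yanni (ballplayer)", "Baseball player")]
print(best_candidate("Musician", candidates, cooccur))
```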
Correlations for “Occupation” attribute
Occupation Pair                     # of People
Novelist, Writer                    7100
Lawyer, Politician                  3366
Singer, Songwriter                  1552
...
Baseball player, Football player    588
Mathematician, Physicist            366
Film Director, Screenwriter         248
...
Accountant, Baseball player         2
Actor, Store manager                2
Table 9.13: Correlation between occupations based on the number of people sharing the same occupations
9.14 Conclusion
This chapter describes novel approaches to biographic fact extraction using struc-
tural, transitive and latent properties of biographic data. First, an improvement to
the standard Ravichandran and Hovy (2002) model is shown utilizing untethered
contextual pattern models, followed by a document position and sequence-based ap-
proach to attribute modeling. Next, transitive models were presented, exploiting the
tendency for individuals occurring together in an article to have related attribute val-
ues. This chapter also describes how latent-attribute-based models of wide document
context, both monolingually and translingually, can capture facts that are not stated
directly in a text.
Each of these models provides a substantial performance gain, and further gains are
achieved via classifier combination. As an additional source of information, inter-attribute
correlations are modeled to filter unlikely attribute combinations, and
models of functions over attributes, such as deathdate-birthdate distributions, further
constrain the candidate space. These approaches collectively achieve 80% average accuracy
on a test set of 7 biographic attribute types, yielding a 36% absolute accuracy
gain relative to a standard algorithm on the same data.
Chapter 10
Modeling Latent Biographical
Attributes in Conversational
Genres
Summary
This chapter presents and evaluates several original techniques for the latent classification
of biographic attributes such as gender, age and native language, in diverse
genres (conversation transcripts, email) and languages (Arabic, English). First, a
novel partner-sensitive model for extracting biographic attributes in conversations is
presented, motivated by the differences in lexical usage and discourse style observed
between same-gender and mixed-gender conversations. Then, a rich variety of novel
sociolinguistic and discourse-based features is explored, including mean utterance length,
passive/active usage, percentage domination of the conversation, speaking rate and
filler word usage. Cumulatively, up to 20% error reduction is achieved relative
to the standard Boulis and Ostendorf (2005) algorithm for classifying individual
conversations on Switchboard, and accuracy for gender detection on the Switchboard
corpus (aggregate) and Gulf Arabic corpus exceeds 95%.
Components of this chapter were originally published by the author of this dissertation in the forum referenced below1.
10.1 Introduction
Speaker attributes such as gender, age, dialect, native language and educational
level may be (a) stated overtly in metadata, (b) derivable indirectly from metadata
such as a speaker’s phone number or userid, or (c) derivable from acoustic properties
of the speaker, including pitch and f0 contours (Bocklet et al., 2008).
In contrast, the goal of this work is to model and classify such speaker attributes
from only the latent information found in textual transcripts. In particular, this
work is focused on modeling and classifying speaker attributes such as gender and
age based on lexical and discourse factors including lexical choice, mean utterance
length, patterns of participation in the conversation and filler word usage. Furthermore,
a speaker’s lexical choice and discourse style may differ substantially depending
on the gender/age/etc. of the speaker’s interlocutor, and hence improvements may
be achieved via joint conversational dyad modeling or stacked classifiers.
1Reference: N. Garera, D. Yarowsky. Modeling Latent Biographic Attributes in Conversational Genres. Proceedings of the Association for Computational Linguistics (ACL), 2009.
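The dyad/stacking idea can be illustrated with a deliberately simplified sketch; the unigram weights and the partner-adjustment rule below are illustrative assumptions, not the actual classifier developed in this chapter:

```python
# Toy sketch of partner-sensitive stacked classification: each side of a
# conversation is first scored independently, then re-scored using the
# partner's first-pass prediction. Lexicon weights and the adjustment rule
# are hypothetical placeholders for the real learned model.

def base_score(words, lexicon):
    """Sum of per-word weights; positive leans 'F', negative leans 'M'."""
    return sum(lexicon.get(w, 0.0) for w in words)

def stacked_predict(side_a, side_b, lexicon, partner_weight=0.5):
    sa, sb = base_score(side_a, lexicon), base_score(side_b, lexicon)
    # Stacking step: shift each score based on the partner's first-pass
    # label (an assumed mixed-dyad prior, purely for illustration).
    sa2 = sa - partner_weight * (1.0 if sb > 0 else -1.0)
    sb2 = sb - partner_weight * (1.0 if sa > 0 else -1.0)
    return ("F" if sa2 > 0 else "M"), ("F" if sb2 > 0 else "M")

lexicon = {"dude": -1.0, "lovely": 1.0}  # toy gender-indicative weights
print(stacked_predict(["dude", "dude"], ["lovely", "lovely"], lexicon))  # ('M', 'F')
```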
There has been substantial work in the sociolinguistics literature investigating discourse
style differences due to speaker properties such as gender (Coates, 1997; Eckert and
McConnell-Ginet, 2003). While most of the prior work in sociolinguistics has been
approached from a non-computational perspective, Singh (2001) and Koppel et al.
(2002) employed a linear model for gender with manually selected, linguistically
interesting words and part-of-speech features, focusing on a small development
corpus. Another computational study for gender using approximately 30
weblog entries was done by Herring and Paolillo (2006), making use of a logistic
regression model to study the effect of different features. While small-scale sociolinguistic
studies on monologues have shed some light on important features, this work focuses
on modeling attributes from transcripts of spoken conversations, as shown in Figure
10.1, building upon the work of Boulis and Ostendorf (2005) and showing how gender
and other attributes can be accurately predicted in conversations. In addition to spoken
conversations, this work also explores another genre of informal conversation,
namely email2.
2An example email snippet is shown in Figure 10.7.
197
75.06 75.53 A: no
77.52 78.23 B: actually
78.36 79.00 B: um
79.16 79.54 B: [cough]
79.82 80.65 B: my ah
81.79 82.52 B: my wife
83.02 84.46 B: ah and i
84.63 88.48 B: enjoy having dinner together and i have to tell you i really enjoy eating.
89.82 93.14 A: i definitely agree with you on that one [laugh]
94.44 94.99 A: um
96.59 97.83 B: so what's you're favorite?
98.86 99.80 A: um
100.48 103.05 A: that's a hard one, but i would have to go
103.40 104.55 A: i like to make
104.81 106.68 A: like taco salad a lot
106.68 106.97 B: hm
107.31 109.66 A: stuff like that simple stuff but
111.04 113.02 B: [lipsmack] we had taco
Figure 10.1: A snippet of Fisher telephone transcript between a female (A) and male
(B) speaker. The first two fields indicate the start time and stop time and the third
field contains the utterance.
10.1.1 Applications
Analyzing such differences in latent author/speaker attributes is interesting not only from
the sociolinguistic and psycholinguistic point of view of language understanding, but
also from an engineering perspective, for the wide range of applications described below:
• Call routing: A straightforward application of detecting the gender, age, or native
language of the speaker is to re-route the call for personalized assistance in
various phone-based services.
• User authentication and security: An important problem on major blogging or
social networking websites is detecting whether a user account has been
compromised. The posts or comments written by the user can be analyzed to
see whether the author attributes match those of the original profile, helping
determine the identity/authenticity of the user.
• Filling user profiles: Many websites require users to fill in their biographical
data. Such information could be automatically extracted using the content
posted by the users.
• Gender/age conditioned models: Extracting latent properties enables researchers
to build attribute-specific gender/age-conditional models for various
tasks such as language modeling, machine translation and speech recognition.
10.1.2 Contributions
Having motivated the goal of predicting latent biographic attributes, the following
points briefly outline the original contributions of the work described in this chapter:
1. Modeling the Partner Effect: A speaker may adapt his or her conversational style
depending on the partner, and it is shown how conditioning on the predicted
partner class using a stacked model can provide further performance gains in
gender classification.
2. Sociolinguistic features: The chapter explores a rich set of lexical and non-lexical
features motivated by the sociolinguistic literature for gender classification, and
shows how they can effectively augment the standard ngram-based model of
Boulis and Ostendorf (2005).
3. Application to Arabic Language: This work also reports results for application
to the Arabic language, in addition to the English Fisher transcripts used by
Boulis and Ostendorf (2005). It is shown that the ngram model gives reasonably
high accuracy for Arabic as well. Furthermore, the consistent performance gains
due to partner-sensitive models and sociolinguistic features observed in English
are also obtained for Arabic.
4. Application to Email Genre: It is shown how the models explored in this chapter,
developed for the conversational transcript genre, extend to email, demonstrating
the wide applicability of the models due to the use of general text-based features.
5. Application to new attributes: This work shows how the lexical model of Boulis
and Ostendorf (2005) can be extended to Age and Native vs. Non-native pre-
diction, with further improvements gained from using the introduced partner-
sensitive models and novel sociolinguistic features.
10.2 Related Work
Conversational speech presents a challenging domain due to the interaction of
genders, recognition errors and sudden topic shifts. Text-based information extraction
approaches for the speech genre have also been investigated: Jing et al. (2007) present a
supervised framework for extracting biographical facts from a transcribed conversa-
tional speech collection (MALACH) consisting of interviews of Holocaust survivors.
They present new features for co-reference resolution in conversational speech such as
speaker role identification, speaker turns, name patterns, etc. and use a combination
of lexical, contextual and syntactic features for attribute labeling. However, only the
explicitly mentioned attributes were extracted as they treat the attribute extraction
as a labeling problem and train a maximum-entropy classifier for the same.
In contrast, the goal of the work presented in this chapter is to predict such at-
tributes when they are not necessarily explicitly stated in the utterance, along the
lines of Singh (2001) and Boulis and Ostendorf (2005), which use lexical differences
in conversational speech for gender classification. Singh (2001) performed a pilot
study using conversational speech for identifying gender differences based on lexical
richness measures. A total of thirty subjects were recorded and transcribed in a
conversational setting, and the lexical richness measures were based on word frequencies
of word classes such as noun, pronoun, adjective and verb rates per 100 words, type-
token ratio, etc., achieving 90% classification accuracy using discriminant analysis
on this small dataset.
Boulis and Ostendorf (2005) presented the first large-scale study of gender modeling
in conversational speech transcripts using the Fisher corpus (Cieri et al., 2004). Their
model utilized a simple bag-of-ngrams feature vector in an SVM framework for gender
classification, showing how state-of-the-art machine learning approaches utilizing very
high dimensional feature vectors can classify gender with more than 90% accuracy.
While Boulis and Ostendorf (2005) observe that the gender of the partner can have
a substantial effect on their classifier accuracy, given that same-gender conversations
are easier to classify than mixed-gender conversations, they do not utilize this obser-
vation in their work. Section 10.5.3 shows how the predicted gender/age etc. of the
partner/interlocutor can be used to improve overall performance via both joint dyad
modeling and classifier stacking. Boulis and Ostendorf (2005) have also constrained
themselves to lexical n-gram features, while this work shows improvements via the
incorporation of non-lexical features such as the percentage domination of the con-
versation, degree of passive usage, usage of subordinate clauses, speaker rate, usage
profiles for filler words (e.g. “umm”), mean-utterance length, and other such
properties. Finally, this work explores and empirically evaluates original model performance
on additional latent speaker attributes including age and native vs. non-native En-
glish speaking status. The remaining sections describe the approach in detail.
10.3 Corpus Details
Consistent with Boulis and Ostendorf (2005), this work utilized the Fisher tele-
phone conversation corpus (Cieri et al., 2004) and also evaluated performance on the
standard Switchboard conversational corpus (Godfrey et al., 1992), both collected and
annotated by the Linguistic Data Consortium. In both cases, the provided metadata
(including true speaker gender, age, native language, etc.) was utilized only as class
labels for both training and evaluation, but never as features in the classification.
The primary task employed was identical to Boulis and Ostendorf (2005), namely
the classification of gender, etc. of each speaker in an isolated conversation, and also
to evaluate performance when classifying speaker attributes given the combination
of multiple conversations in which the speaker has participated. The Fisher corpus
contains a total of 11971 speakers and each speaker participated in 1-3 conversations,
resulting in a total of 23398 conversation sides (i.e. the transcript of a single speaker
in a single conversation). The preprocessing steps and experimental setup of Boulis
and Ostendorf (2005) were followed as closely as possible given the details presented in
their paper, although some details such as the exact training/test partition were
not currently obtainable from either the paper or personal communication. This re-
sulted in a training set of 9000 speakers with 17587 conversation sides and a test set
of 1000 speakers with 2008 conversation sides.
The Switchboard corpus was much smaller and consisted of 543 speakers, with 443
speakers used for training and 100 speakers used for testing, resulting in a total of
4062 conversation sides for training and 808 conversation sides for testing.
10.4 Modeling Gender via Ngram Features (Boulis and Ostendorf, 2005)
As the reference algorithm, the current state-of-the-art system developed by
Boulis and Ostendorf (2005) was used, with unigram and bigram features in an SVM
framework. This model was reimplemented as the reference standard for gender
classification; further details are given below:
10.4.1 Training Vectors
For each conversation side, a training example was created using unigram and
bigram features with TF.IDF weighting, as done in standard text classification
approaches. However, stopwords were retained in the feature set, as various sociolinguistic
studies have shown that the use of some stopwords, for instance pronouns
and determiners, is correlated with age and gender. Also, only the ngrams with fre-
quency greater than 5 were retained in the feature set, resulting in a total of 227,450
features for the Fisher corpus and 57,914 features for the Switchboard corpus.

Fisher Corpus
  Female          Weight      Male             Weight
  husband         -0.0291     my wife           0.0366
  my husband      -0.0281     wife              0.0328
  oh              -0.0210     uh                0.0284
  laughter        -0.0186     ah                0.0248
  have            -0.0169     er                0.0222
  mhm             -0.0169     i i               0.0201
  so              -0.0163     hey               0.0199
  because         -0.0160     you doing         0.0169
  and             -0.0155     all right         0.0169
  i know          -0.0152     man               0.0160
  hi              -0.0147     pretty            0.0156
  um              -0.0141     i see             0.0141
  boyfriend       -0.0134     yeah i            0.0125
  oh my           -0.0124     my girlfriend     0.0114
  i have          -0.0119     thats thats       0.0109
  but             -0.0118     mike              0.0109
  children        -0.0115     guy               0.0109
  goodness        -0.0114     is that           0.0108
  yes             -0.0106     basically         0.0106
  uh huh          -0.0105     shit              0.0102

Switchboard Corpus
  Female          Weight      Male             Weight
  oh              -0.0122     wife              0.0078
  laughter        -0.0088     my wife           0.0077
  my husband      -0.0077     uh                0.0072
  husband         -0.0072     i i               0.0053
  have            -0.0069     actually          0.0051
  uhhuh           -0.0068     sort of           0.0041
  and i           -0.0050     yeah i            0.0041
  feel            -0.0048     got               0.0039
  umhum           -0.0048     a                 0.0038
  i know          -0.0047     sort              0.0037
  really          -0.0046     yep               0.0036
  women           -0.0043     the               0.0036
  um              -0.0042     stuff             0.0035
  would           -0.0039     yeah              0.0034
  children        -0.0038     pretty            0.0033
  too             -0.0036     that that         0.0032
  but             -0.0035     guess             0.0031
  and             -0.0034     as                0.0029
  wonderful       -0.0032     is                0.0028
  yeah yeah       -0.0031     i guess           0.0028

Table 10.1: Top 20 ngram features for Gender, ranked by the weights assigned by the
linear SVM model
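As a concrete illustration, the feature extraction of Section 10.4.1 (unigram and bigram counts, stopwords retained, a corpus-frequency cutoff, TF.IDF weighting) can be sketched in pure Python. The function names, the toy conversation sides and the low cutoff are illustrative only; the actual cutoff used above is a corpus frequency greater than 5:

```python
import math
from collections import Counter

def ngrams(tokens):
    """Unigrams and bigrams; stopwords are deliberately NOT removed."""
    return list(tokens) + [" ".join(p) for p in zip(tokens, tokens[1:])]

def tfidf_vectors(sides, min_count=1):
    """TF.IDF vectors for conversation sides (each side = a token list),
    keeping only ngrams whose total corpus count exceeds min_count."""
    docs = [Counter(ngrams(side)) for side in sides]
    corpus_count = Counter()
    for doc in docs:
        corpus_count.update(doc)
    vocab = {g for g, c in corpus_count.items() if c > min_count}
    # document frequency: number of sides containing each retained ngram
    df = Counter(g for doc in docs for g in doc if g in vocab)
    n = len(docs)
    return [{g: tf * math.log(n / df[g]) for g, tf in doc.items() if g in vocab}
            for doc in docs]

sides = [["oh", "my", "husband"], ["my", "wife", "uh"], ["oh", "my", "wife"]]
vecs = tfidf_vectors(sides, min_count=1)
```

Note that an ngram occurring in every conversation side (here “my”) receives a zero IDF weight, while rare ngrams fall below the frequency cutoff and are dropped entirely.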
10.4.2 Model
After extracting the ngram features, an SVM model was trained via the SVMlight
toolkit (Joachims, 1999) using the linear kernel with the default toolkit settings.
Table 10.1 shows the most discriminative ngrams for gender based on the weights
assigned by the linear SVM model. The negative and positive signs on the ngram
weights for the two genders are due to the selection of “-1” as the female class and “+1”
as the male class in the SVM model. It is interesting that some of the gender-correlated
words proposed in the sociolinguistic literature are also found by this empirical approach,
including the frequent use of “oh” by females, as well as obvious indicators of gender
such as “my wife” or “my husband”. Also, the named entity “Mike” shows up as a
discriminative unigram; this may be due to self-introductions at the beginning of
the conversations and “Mike” being a common male name. For compatibility with
Boulis and Ostendorf (2005), no special preprocessing for names is performed, and
they are treated just as any other unigrams or bigrams in this particular direct com-
parison4.
4A natural extension of this work, however, would be to do explicit extraction of self introductions and then do table-lookup-based gender classification, although this was not implemented for consis-
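The ranking used to produce Table 10.1 can be sketched as follows, under the sign convention described above (-1 = female, +1 = male). The weights are a small subset of the Fisher-corpus values from Table 10.1; the function name is illustrative:

```python
def top_gender_ngrams(weights, k):
    """Rank ngram features by linear-SVM weight: the most negative
    weights indicate the female class (-1), the most positive the
    male class (+1)."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1])
    return ranked[:k], ranked[-k:][::-1]   # (top female, top male)

# A subset of the Fisher-corpus weights from Table 10.1:
w = {"husband": -0.0291, "my husband": -0.0281, "oh": -0.0210,
     "my wife": 0.0366, "wife": 0.0328, "uh": 0.0284}
female, male = top_gender_ngrams(w, k=2)
```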
Figure 10.2: The effect of varying the amount of each conversation side utilized for
training, based on the utilized % of each conversation, starting from the beginning
of the conversation. While one would expect the accuracy to improve steadily with
increased training data, the anomalous flat portion in the middle could be due
to the fact that Fisher and Switchboard participants were complete strangers. The
initial ramp-up in the curve is probably due to the addition of speaker data starting
from no data at all, and the flat portion is probably due to the time taken for the
speakers to become familiar and speak comfortably with each other, after which the
discourse features for speaker attributes become more prominent. Another reason
could be that the middle portion reflects discussion of a specific topic given to the
speakers; after they have spoken enough about the topic, the speakers may move on
to more gender-biased topics of their choice.
Furthermore, the ngram-based approach scales well with varying the amount of con-
versation utilized in training the model as shown in Figure 10.2.
The “Boulis and Ostendorf, 05” rows in Table 10.4 show the performance of this
reimplemented algorithm on both the Fisher Corpus (90.84%) and Switchboard Cor-
pus (90.22%), under the identical training and test conditions used elsewhere in the
chapter for direct comparison with subsequent results5.
10.5 Modeling Based on the Partner’s Gender
The original contribution in this section is the successful modeling of speaker prop-
erties (e.g. gender/age) based on the prior and joint modeling of the partner speaker’s
gender/age in the same discourse. The motivation for this work is that people tend
to use stronger gender-specific, age-specific or dialect-specific word/phrase usage and
discourse properties when speaking with someone of a similar gender/age/dialect than
when speaking with someone of a different gender/age/dialect, in which case they may
adopt a more neutral speaking style. Also, discourse properties such as relative use of the
passive and percentage of the conversation dominated may vary depending on the
tency with the reference algorithm. The handling of names and potentially self-reporting features isstudied and handled specifically elsewhere in this chapter, however.
5The modest differences with their reported results may be due to unreported details such asthe exact training/test splits or SVM parameterizations, so for the purposes of assessing the relativegain of the subsequent enhancements we base all reported experiments on the internally-consistentconfigurations as (re-)implemented here.
Fisher Corpus
  Same gender conversations     94.01
  Mixed gender conversations    84.06
Switchboard Corpus
  Same gender conversations     93.22
  Mixed gender conversations    86.84

Table 10.2: Difference in Gender classification accuracy between mixed gender and
same gender conversations using the reference algorithm
Classifying speaker’s and partner’s gender simultaneously
  Male-Male        84.80
  Female-Female    81.96
  Male-Female      15.58
  Female-Male      27.46

Table 10.3: Performance for 4-way classification of the entire conversation into (mm,
ff, mf, fm) classes using the reference algorithm on Switchboard corpus.
gender or age relationship with the speaking partner. Several varieties of classifier
stacking and joint modeling were employed to be effectively sensitive to these dif-
ferences. To illustrate the significance of the “partner effect”, Table 10.2 shows the
difference in the standard algorithm performance between same-gender conversations
(when gender-specific style flourishes) and mixed-gender conversations (where the
more neutral styles are harder to classify):
10.5.1 Oracle Experiment
To assess the potential gains from full exploitation of partner-sensitive modeling,
first results are reported from an oracle experiment, where it is assumed that
the algorithm knows whether the conversation is homogeneous (same gender) or
heterogeneous (different gender). In order to effectively utilize this information, both
the test conversation side and the partner side are classified, and if the classifier is
more confident about the partner side then the gender of the test conversation side is
chosen based on the heterogeneous/homogeneous information. The overall accuracy
improves to 96.46% on the Fisher corpus using this oracle (from 90.84%), leading to
the following experiment where the oracle is replaced with a non-oracle SVM model
trained on a subset of training data such that all test conversation sides (of the speaker
and the partner) are excluded from the training set.
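The oracle decision rule described above can be sketched as follows; the signed classifier scores (positive = male, negative = female) and the function name are illustrative conventions, not the original implementation:

```python
def oracle_gender(test_score, partner_score, same_gender):
    """Classify the test side, deferring to the partner side (plus the
    oracle homogeneous/heterogeneous flag) when the classifier is more
    confident about the partner's side."""
    if abs(partner_score) > abs(test_score):
        partner = "male" if partner_score > 0 else "female"
        if same_gender:
            return partner
        return "female" if partner == "male" else "male"
    return "male" if test_score > 0 else "female"
```

For example, a weakly male-scored test side paired with a confidently female-scored partner in a mixed-gender conversation is labeled male.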
10.5.2 Replacing the Oracle with a Homogeneous vs. Heterogeneous Classifier
Given the substantial improvement using the oracle information, initially another
binary classifier was trained for classifying the conversation as mixed or single-gender.
It turns out that this task is much harder than the single-side gender classification
task, achieving only a low accuracy of 68.35% on the Fisher corpus. In-
tuitively, the homogeneous vs. heterogeneous partition results in a much harder
classification task because the two diverse classes of male-male and female-female
conversations are grouped into one class (“homogeneous”), resulting in linearly in-
separable classes6. This led (subsequently) to the creation of two different classifiers for
conversations, namely, male-male vs rest and female-female vs rest, used in a classifier
6Even non-linear kernels were not able to find a good classification boundary.
Figure 10.3: People use stronger gender-specific discourse properties when speaking
to someone of a similar gender. Stacking whole-conversation and partner-conditioned
models as shown above makes it possible to model such behavior. The common graphic
utilized for individual SVM classifiers first appeared in (Ustun, 2003).
combination framework as follows:
10.5.3 Modeling partner via conditional model and whole-conversation model
The following classifiers were trained and each of their scores was used as a feature
in a meta SVM classifier:
1. Male-Male vs Rest: Classifying the entire conversation (using test speaker and
partner’s sides) as male-male or other7.
2. Female-Female vs Rest: Classifying the entire conversation (using test speaker
and partner’s sides) as female-female or other.
3. Conditional model of gender given most likely partner’s gender: Two separate
classifiers were trained for classifying the gender of a given conversation side,
one where the partner is male and the other where the partner is female. Given
a test conversation side, the most likely gender of the partner’s conversation side
is first chosen using the ngram-based model8, and then the gender of the test
conversation side is chosen using the appropriate conditional model.
4. Ngram-based model as explained in Section 10.4.
The stacking approach described above is illustrated in Figure 10.3. The row labeled
“+ Partner Model” in Table 10.4 shows the performance gain obtained via this meta-
classifier incorporating conversation type and partner-conditioned models.
7For classifying the conversations as male-male vs rest or female-female vs rest, all the conversations with either the speaker or the partner present in any of the test conversations were eliminated from the training set, thus creating disjoint training and test conversation partitions.
8All the partner conversation sides of test speakers were removed from the training data and the ngram-based model was retrained on the remaining subset.
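The wiring of the four component scores into the meta SVM of Section 10.5.3 can be sketched as below. The scoring functions are passed in as plain callables with illustrative signatures (the original implementation uses SVMlight decision values):

```python
def meta_features(side, partner_side, mm_vs_rest, ff_vs_rest,
                  given_male, given_female, ngram_model):
    """Build the meta-classifier's 4-feature input from the component
    classifiers of Section 10.5.3."""
    whole = side + partner_side
    f1 = mm_vs_rest(whole)              # 1. male-male vs rest
    f2 = ff_vs_rest(whole)              # 2. female-female vs rest
    # 3. conditional model picked by the partner's most likely gender
    cond = given_male if ngram_model(partner_side) > 0 else given_female
    f3 = cond(side)
    f4 = ngram_model(side)              # 4. plain ngram-based score
    return [f1, f2, f3, f4]
```

The returned vector is what the meta SVM is trained on, one example per conversation side.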
Figure 10.4: Empirical differences in sociolinguistic features for Gender on the Switch-
board corpus
10.6 Sociolinguistic Features
The sociolinguistic literature has shown gender differences for speakers due to
features such as speaking rate, pronoun usage and filler word usage. While ngram
features are able to reasonably predict speaker gender, owing to their high detail and
coverage and the overall importance of lexical choice in gender differences in speech,
the sociolinguistic literature suggests that other non-lexical features can further help
improve performance and, more importantly, advance our understanding of gender
differences in discourse. Thus, on top of the standard Boulis and Ostendorf (2005)
model, the following features were also investigated, motivated by the sociolinguistic
literature on gender differences in discourse (Macaulay, 2005):
1. % of conversation spoken: The speaker’s fraction of conversation spoken was
measured via three features extracted from the transcripts: % of words, utter-
ances and time.
2. Speaker rate: Some studies have shown that males speak faster than females
(Yuan et al., 2006) as can also be observed in Figure 10.4 showing empirical
data obtained from Switchboard corpus. The speaker rate was measured in
words/sec., using starting and ending time-stamps for the discourse.
3. % of pronoun usage: Macaulay (2005) argues that females tend to use more
third-person male/female pronouns (he, she, him, her and his) as compared to
males.
4. % of back-channel responses such as “(laughter)”, “(lipsmacks)” and “(sighs)”.
5. % of passive usage: Passive usage was detected by extracting a list of past-
participle verbs from the Penn Treebank and counting any occurrence of a form
of “to be” followed by a past participle.
6. % of short utterances (<= 3 words).
7. % of modal auxiliaries and subordinate clauses.
8. % of “mm” tokens such as “mhm”, “um”, “uh-huh”, “uh”, “hm”, “hmm”, etc.
9. Type-token ratio: Ratio of number of types divided by the number of tokens in
the conversation side, in order to measure vocabulary richness of the speaker.
10. Mean inter-utterance time: The average time taken between utterances of the
same speaker.
11. % of “yeah” occurrences.
12. % of WH-question words such as “What”, “When”, “Where”, etc.
13. Mean word and utterance length of the speaker.
The above classes resulted in a total of 16 sociolinguistic features, which were added
(based on feature-ablation studies) as features in the meta SVM classifier, along with
the 4 features explained previously in Section 10.5.3.
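A few of the 16 features can be sketched as follows for one conversation side (a list of token lists, one per utterance). The tiny past-participle set stands in for the list the dissertation extracts from the Penn Treebank; all names here are illustrative:

```python
def sociolinguistic_features(utterances):
    """Compute a sample of the sociolinguistic features of Section 10.6."""
    tokens = [t.lower() for utt in utterances for t in utt]
    n = len(tokens)
    be_forms = {"is", "are", "was", "were", "be", "been", "being"}
    participles = {"done", "taken", "given", "made"}   # stand-in list
    # passive heuristic: a form of "to be" followed by a past participle
    passives = sum(1 for a, b in zip(tokens, tokens[1:])
                   if a in be_forms and b in participles)
    mm = {"mhm", "um", "uh-huh", "uh", "hm", "hmm"}
    return {
        "type_token_ratio": len(set(tokens)) / n,
        "pct_short_utts": sum(len(u) <= 3 for u in utterances) / len(utterances),
        "pct_mm_tokens": sum(t in mm for t in tokens) / n,
        "pct_passive": passives / n,
    }

feats = sociolinguistic_features([["um", "that", "was", "done"], ["yeah"]])
```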
The rows in Table 10.4 labeled “+ (any sociolinguistic feature)” show the perfor-
mance gain using the respective features described in this section. Each row indicates
an additive effect in the feature ablation, showing the result of adding the current
sociolinguistic feature to the set of features mentioned in the rows above.
10.7 Gender Classification Results
Table 10.4 combines the results of the experiments reported in the previous sec-
tions, assessed on both the Fisher and Switchboard corpora for gender classification.
The evaluation measure was the standard classifier accuracy, that is, the fraction of
test conversation sides whose gender was correctly predicted. Baseline performance
(always guessing the most frequent gender, female) yields 57.47% and 51.6% on Fisher
Model                                    Acc.     Error Reduction

Fisher Corpus (57.5% of sides are female)
  Gender Genie                           55.63    -384%
  Ngram (Boulis & Ostendorf, 05)         90.84    Ref.
  + Partner Model                        91.28    4.80%
  + % of “yeah”                          91.33
  + % of (laughter)                      91.38
  + % of short utterance                 91.43
  + % of auxiliaries                     91.48
  + % of subord-clauses, “mm”            91.58
  + % of Participation (in utterance)    91.63
  + % of Passive usage                   91.68    9.17%

Switchboard Corpus (51.6% of sides are female)
  Gender Genie                           55.94    -350%
  Ngram (Boulis & Ostendorf, 05)         90.22    Ref.
  + Partner Model                        91.58    13.91%
  + Speaker rate, % of fillers           91.71
  + Mean utterance length, % of Ques.    91.96
  + % of Passive usage                   92.08
  + % of (laughter)                      92.20    20.25%

Table 10.4: Results showing improvement in accuracy of gender classifier using
partner-sensitive model and sociolinguistic features
and Switchboard respectively. As noted before, the standard reference algorithm is
Boulis and Ostendorf (2005), and all cited relative error reductions are based on this
established standard. Also, as a second reference, performance is also cited for the
popular “Gender Genie”, an online gender-detector9, based on the manually weighted
word-level sociolinguistic features discussed in Argamon et al. (2003).
The additional rows in the table are described in Sections 10.4-10.6, and cumulatively
yield substantial improvements over the Boulis and Ostendorf (2005) standard.
10.7.1 Aggregating results per speaker via consensus voting
While Table 10.4 shows results for classifying the gender of the speaker on a per
conversation basis (to be consistent and enable fair comparison with the work reported
by Boulis and Ostendorf (2005)), all of the above models can be easily extended to
per-speaker evaluation by pooling the predictions from multiple conversations of
the same speaker. Table 10.5 shows the result of each model on a per-speaker basis
using a majority vote of the predictions made on the individual conversations of
the respective speaker. The consensus model, when applied to the Switchboard corpus,
shows larger gains, as Switchboard has 9.38 conversations per speaker on average as
compared to 1.95 conversations per speaker on average in Fisher. The results on Switchboard
9http://bookblog.net/gender/genie.php
Figure 10.5: Aggregating results over all the conversations of a given speaker via
consensus voting, as explained in Section 10.7.1. One could also utilize other ways
of combining evidence, such as length-weighted voting, confidence-weighted voting,
stacking, or combining all conversations into one single conversation. However,
since the speakers were asked to speak for a fixed time while collecting the data
for the Fisher and Switchboard corpora, the conversations in these corpora are of
similar length. Thus the simple combination technique above is also appropriate,
given the approximately equal conversation lengths.
Model                               Acc.     Error Reduction

Fisher Corpus
  Ngram (Boulis & Ostendorf, 05)    90.50    Ref.
  + Partner Model                   91.60    11.58%
  + Socioling. Features             91.70    12.63%

Switchboard Corpus
  Ngram (Boulis & Ostendorf, 05)    92.78    Ref.
  + Partner Model                   93.81    14.27%
  + Socioling. Features             96.91    57.20%

Table 10.5: Aggregate results on a “per-speaker” basis via majority consensus on
different conversations for the respective speaker. The results on Switchboard are
significantly higher due to more conversations per speaker as compared to the Fisher
corpus
corpus show a very large reduction in error rate of more than 57% with respect to the
standard algorithm, further indicating the usefulness of the partner-sensitive model
and richer sociolinguistic features when more conversational evidence is available.
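The per-speaker consensus of Section 10.7.1 is a simple majority vote over the speaker's conversation-level predictions; a sketch (with illustrative names) is:

```python
from collections import Counter

def per_speaker_vote(predictions):
    """Pool per-conversation gender predictions into one label per
    speaker by majority vote; appropriate here because conversation
    sides are of roughly equal length."""
    return {speaker: Counter(labels).most_common(1)[0][0]
            for speaker, labels in predictions.items()}
```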
10.8 Effect of Self-Reporting Features on Gender Classification
Some ngrams are very strong indicators of gender in a conversation, to the extent
that the single occurrence of those ngrams can determine the gender of the speaker.
Some examples of such self-reporting features are “my wife”, “my boyfriend”, etc. It
is worthwhile to study the impact of such features in artificially inflating the performance,
from the scientific perspective of studying whether general discourse features
are helpful in determining the gender.

% of conversation sides with self-reporting features

Fisher Corpus
  Training data    25.94%
  Test data        26.69%
Switchboard Corpus
  Training data    3.2%
  Test data        3.09%

Table 10.6: Fraction of conversations containing self-reporting features such as “my
wife”, “my boyfriend”, on different corpora. Although Fisher has a significant fraction
of conversations with such features, they have little impact on the overall performance,
as shown in Table 10.7
Table 10.6 shows the statistics for presence of such features in the Fisher and Switch-
board conversation corpora. While only a small fraction of the conversations (approx-
imately 3%) in the Switchboard corpus contain such features, a significant fraction
(approximately 25%) of the conversations in the Fisher corpus have such features. Nev-
ertheless, a second experiment, retraining the classifier after removing all such
features, shows only a negligible effect on performance. Table 10.7 shows that there is
no change in the accuracy on the Switchboard corpus, as expected, and on the Fisher corpus
the accuracy drops by only 0.25%, thus indicating the robust and successful utilization
of general discourse-based features for gender classification.
Model                                 Accuracy

Fisher Corpus
  Ngram-based model                   90.84
  Removing self-reporting features    90.59
Switchboard Corpus
  Ngram-based model                   90.22
  Removing self-reporting features    90.22

Table 10.7: Self-reporting features for gender such as “my wife”, “my boyfriend”, etc.
have negligible impact on performance of gender classification.
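The ablation of Section 10.8 can be sketched as a filter over the feature set before retraining; the short ngram list here is an illustrative subset of the self-reporting features actually removed:

```python
SELF_REPORTING = {"my wife", "my husband", "my boyfriend", "my girlfriend"}

def strip_self_reporting(features, blocked=SELF_REPORTING):
    """Remove near-deterministic self-reporting ngrams so that any
    remaining accuracy must come from general discourse features."""
    return {g: v for g, v in features.items() if g not in blocked}
```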
10.9 Application to Arabic Language
While differences in discourse based on author attributes have been heavily studied
for English, it would be interesting to see how the ngram-based model along with the
partner-based model and sociolinguistic features would extend to a new language.
Arabic differs from English along many dimensions such as orthography, word order,
capitalization, sound system, etc., making it a good language to test the robustness
of these models. It also allows one to assess the contribution of non-lexical sociolinguistic
features such as mean word length, when language specific sociolinguistic features
(e.g. % of passives) are not available. The following subsections explain the corpus
details and results obtained for Arabic.
10.9.1 Corpus Details
This work utilized the LDC Gulf Arabic telephone conversation corpus (Linguistic Data
Consortium, 2006). The training set consisted of 499 conversations, and the test set
consisted of 200 conversations. Each speaker participated in only one conversation, resulting
in the same number of training/test speakers as conversations. Thus there was no
overlap in speakers/partners between training and test sets, which is also appropriate
for training and evaluation of the partner-based models.
10.9.2 Results
The ngram feature vectors and the partner-sensitive models were trained in a similar
way to the models trained for English. Among the sociolinguistic features, only
the non-lexical features, namely, % of conversation spoken, speaker rate, % of short
utterances, type-token ratio, mean inter-utterance time and mean word length and
utterance length were utilized for the feature-ablation study.
The results for Arabic are shown in Table 10.8. Based on the prior distribution,
always guessing the most likely class for gender (“male”) yielded 52.5% accuracy. It
can be seen that the ngram-based model gives a reasonably high accuracy in Arabic
as well. More importantly, consistent performance gains due to partner modeling
are observed, achieving an accuracy of 95%, indicating the robustness of the partner-sensitive
model. Furthermore, using the sociolinguistic features, it is seen that mean word
length and mean utterance length can provide additional performance gains.
Model                               Acc.     Error Reduction

Gulf Arabic (52.5% of sides are male)
  Ngram (Boulis & Ostendorf, 05)    92.00    Ref.
  + Partner Model                   95.00
  + Mean word length                95.50
  + Mean utterance length           96.00    50.00%

Table 10.8: Gender classification results for a new language (Gulf Arabic) showing
consistent performance gains via partner-sensitive model and sociolinguistic features.
10.9.3 Analysis
In order to understand which of the ngram features are more discriminative for
gender, the ngrams were ranked according to their weight in the SVM model, as shown
in Figure 10.6, which lists each Arabic ngram in Unicode along with its Roman
transliteration and weight. Some examples of the top male ngram features are “Aaxiy
(my brother)”, “yaA (addressing or calling upon, vocative particle)”, “waAll~ah and
waAll~ahi (swear to god, often used by males)”, “Al$~abaAb (the guys)”. It is also
interesting to see the use of questions “kam (how much, how many)”, “kayf (how)”
among male speakers. Some examples of top female ngram features are “<intiy
(pronoun “you” when referring to a female)”, “Hiluw and Hilwap (sweet or nice)”,
“Guwliy (say, when talking to a female)”, “AlHamdi lil~ah (Thank God)”. As noted
in the sociolinguistic studies for English, one can also see gender-specific forms such
as “GaAlat (she said)” among female speakers.
[Figure 10.6 here; the Arabic-script column was damaged in extraction, so only the
Roman transliterations and SVM weights are reproduced below.]

Male (positive weights):
  <inta 4.54, lak 4.37, yaA 2.62, Ainta 2.53, Aaxiy 2.40, yaA_Aaxiy 2.38,
  Ealayk 2.26, Al- 1.86, kam 1.75, <int 1.63, waAll~ah 1.63, Eindak 1.63,
  waAll~ahi 1.43, wayn 1.22, Al$~abaAb 1.18, taEaAl 1.17

Female (negative weights):
  <intiy -4.59, AlHamdi -2.13, Hiluw -2.07, liJ -1.91, <iywaA -1.90,
  mar~ap -1.80, AlHamdi_lil~ah -1.46, lil~ah -1.44, Aintiy -1.41,
  GaAlat -1.41, All~ah -1.39, lik -1.32, maA -1.27, $aAaxbaAriJ -1.24,
  Guwliy -1.14, Hilwap -1.13
Figure 10.6: Top 20 Arabic ngram features (along with their Roman transliterations)
for Gender, ranked by the weights assigned by the linear SVM model. Section 10.9.3
provides translation and insight into why these are appropriate gender indicators.
10.10 Application to Email Genre
A primary motivation for using only the speaker transcripts as compared to also
using acoustic properties of the speaker (Bocklet et al., 2008) was to enable the appli-
cation of the models to other new genres. In order to empirically support this moti-
vation, the performance of the models explored in this chapter was also tested on the
Enron email corpus (Klimt and Yang, 2004). The email genre has also been studied
before for gender classification, primarily from the point of view of computer forensics
for securing user identity. Corney et al. (2002) describe an approach for email gender
classification using structural features such as style markers, email domain features
such as reply-status, number of attachments, use of HTML tags, greeting and farewell
acknowledgments and language features such as number of words ending with “able”,
“ive”, etc. While full-scale modeling of the email domain for gender is definitely
possible, this section describes how the simple ngram features and the sociolinguistic
features used for the conversational speech genre extend to email.
10.10.1 Corpus Details
The Enron corpus consists of email data of mostly senior management of Enron.
The corpus is organized into different folders, and all the emails from the “sent” folder
were used for gender classification. While the dataset does not contain explicit gender
markings, a manual examination and annotation for a subset of users resulted in
John,
Regarding the employment agreement, Mike declined without a counter. Keith said he would sign for $75K cash/$250 equity. I still believe Frank should receive the same signing incentives as Keith.
Figure 10.7: Example of an email sent by a male sender in Enron corpus. The
header and signature information containing the sender’s name are removed and only
the body of the email is used for gender classification.
unambiguous gender labels for 90 users, out of which 54 were male and 30 were female.
Only the emails with more than 20 words were utilized as part of training/test emails.
The emails were further cleaned by removing the header information containing the
sender’s name and other details from the corpus and only the email text was utilized.
Any content that was not written by the sender as a part of a reply-to or forward
message was also removed to a large extent using simple heuristics. Also, for fairness,
the name of the sender at the end of the email was also removed. The resulting
training and test sets after preprocessing consisted of 1579 and 204 emails respectively.
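The "simple heuristics" mentioned above are not enumerated in detail; the sketch below shows the kind of cleaning involved, with the quoted-text marker, forward marker and sign-off patterns all being assumptions rather than the dissertation's exact rules:

```python
import re

def clean_email_body(raw):
    """Heuristically strip quoted/forwarded content and a trailing
    sign-off line from an email body. The specific patterns here are
    assumptions; the dissertation only says 'simple heuristics'."""
    kept = []
    for line in raw.splitlines():
        if line.lstrip().startswith(">"):        # quoted reply line
            continue
        if re.match(r"-+\s*(Original Message|Forwarded by)", line, re.I):
            break                                # everything below is quoted
        kept.append(line)
    while kept and not kept[-1].strip():         # trailing blank lines
        kept.pop()
    if kept and len(kept[-1].split()) <= 2:      # likely a name/initial sign-off
        kept.pop()
    return "\n".join(kept).strip()

cleaned = clean_email_body(
    "Hi team,\nPlease review the draft.\n\nJohn\n"
    "-----Original Message-----\n> old quoted text")
```

On the example above, both the quoted block and the trailing "John" sign-off are removed, leaving only text written by the sender.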
10.10.2 Features
The ngram feature vectors were computed as in Section 10.4 with TF.IDF weighting.10
In addition to the ngram features, a subset of sociolinguistic features that could
be extracted for email was also utilized,11 namely: % of pronoun usage, % of passive
usage, % of modal auxiliaries, % of subordinate clauses, type-token ratio, % of "yeah"
occurrences, % of WH-question words, and mean word length.
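The TF.IDF weighting (with each email treated as a document, per footnote 10) can be sketched as follows. This is a minimal unigram version; the exact TF and IDF variants used in the dissertation are not specified:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF.IDF-weighted unigram vectors, treating each email as a
    document. A minimal sketch, not the dissertation's exact variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()                       # how many emails contain each term
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

vecs = tfidf_vectors(["the meeting agenda", "the quarterly report", "the notes"])
```

A term occurring in every email (here "the") receives zero weight, while rarer, more discriminative terms are up-weighted.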
10.10.3 Results
Table 10.9 shows the performance for gender classification on email data. Based
on the prior distribution, always guessing the most likely class ("male") results in
63.2% accuracy. The Boulis and Ostendorf (2005) model based on all ngram features
yields a reasonable accuracy of 76.78%. It is interesting to see that the accuracy
drops to 74.61% when the names of other persons mentioned in the email are
removed.12 Further adding % of subordinate clauses, mean word length, type-token
ratio and % of pronouns improves performance, yielding 80.5% accuracy.
10. The inverse document frequency was computed treating each email as a document.
11. The partner-based features were not utilized due to data sparsity problems: the receiver's message is often deleted in the reply, the reply may be addressed to a group, the receiver may be unidentified, and extracting the receiver's reply is noisy even when it is available. Furthermore, the collection of cleaned and annotated emails consisted of only 90 authors, resulting in a significant overlap between the receivers of the test messages and the senders of training messages. Nevertheless, even the gains from some of the sociolinguistic features, as seen in Table 10.9, are quite promising.
12. The name of the sender is always removed from the email; this is a consistent preprocessing step in all the experiments on email data.
Model                                         Acc.    Error Reduction
Enron Email Corpus (63.2% of sides are male)
Ngrams with person names removed              74.61   Ref.
All Ngrams                                    76.78   Ref.
+ % of subordinate clauses, mean word
  length, type-token ratio                    80.19
+ % of pronouns                               80.50   16.02%

Table 10.9: Application of the Ngram model and sociolinguistic features for gender
classification in a new genre (Email)
10.10.4 Analysis
The top part of Table 10.10, with the heading "using all ngrams as features", shows
the most discriminative ngram features. Even though the name of the sender was
removed from both the email header and body13, the top ngram features show that
names of other people mentioned in the email are quite discriminative of the gender
of the sender. For instance, common male names such as "jeff, john, jim" occur as
top ngram features for male senders, and common female names such as "kim, susan,
sara" occur as top ngram features for female senders. It is also interesting to see
"m" (and also "j" in the bottom part of the table) as a top male feature, indicating
that male senders tend to use first-name initials as a common signature at the end
of the message.
In order to gain further insight into the differences due to lexical usage, all the names
present in the US Census database were removed from the email and a second ngram-
based SVM model was trained using the remaining ngram features. The bottom part
of Table 10.10 shows the resulting top ngrams after training the SVM model using
13. Usually part of the signature at the end of the body.
this reduced feature set. It can be seen that the gender-neutral pronoun "it" is
common among males and the gender-specific pronoun "she" is more common among
females, as also observed in sociolinguistic studies of discourse. It is also interesting
to see the unigram "hi" as a strong feature for female senders and the short "bt
(best)" signature for male senders.
Note that the removal of person names was performed in an overly aggressive manner
by filtering out all names that occur in the US Census database, resulting in a
conservative approach: some words such as "Best" that are both person names and
common words are removed from the discourse even when not used as named entities
referring to a person. Nevertheless, the results obtained using this approach are still
reasonable (Table 10.9), and a more appropriate model of named entity detection
could further improve performance.
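The aggressive census-based filtering can be sketched as below. The tiny name set stands in for the US Census database, and the example shows how the ambiguous word "best" is dropped along with true names:

```python
def remove_census_names(feature_counts, census_names):
    """Aggressively drop any ngram containing a token found in the
    name list, mirroring the conservative filtering described above.
    The name set here is an illustrative stand-in for the US Census
    database."""
    return {ng: c for ng, c in feature_counts.items()
            if not any(tok in census_names for tok in ng.split())}

names = {"best", "kim", "mark"}  # tiny illustrative subset
feats = {"best": 3, "thanks kim": 2, "mark attached": 1, "regards": 5, "is a": 2}
filtered = remove_census_names(feats, names)
```

Only "regards" and "is a" survive; "best" is filtered even though it is often a common word rather than a name.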
10.11 Modeling Other Attributes
While gender has been studied heavily in the literature, other speaker attributes
such as age and native/non-native status also correlate highly with lexical choice and
other non-lexical features. The ngram-based model of Boulis and Ostendorf (2005)
was applied, along with the improvements from the partner-sensitive model and richer
sociolinguistic features, to binary classification of speaker age and to classifying
native vs. non-native speakers of English.
Using all ngrams as features
Male                          Female
best            1.2735        kim            -1.8831
bt              1.2650        susan          -1.6136
it              1.1549        sara           -1.4759
jeff            1.0164        tana           -1.4612
john            0.9985        guaranty       -1.2856
jim             0.9624        master         -1.2821
m               0.9354        fax            -1.2737
to mark         0.8516        credit         -1.2211
book            0.8472        thanks kim     -1.1441
debra           0.8087        shelley        -1.0857
mark attached   0.7366        lindy          -1.0199
kate            0.7254        mark           -1.0049
market          0.7037        meeting        -0.9949
talked          0.7008        stephanie      -0.9810
andy            0.6885        to tana        -0.9798
barry           0.6840        form           -0.9697
lavo            0.6780        she            -0.9351
is a            0.6726        who            -0.9186
regards         0.6697        carol          -0.8964
set             0.6650        ss             -0.8888

Retraining after removing person names
bt              1.8437        fax            -2.0861
it              1.8297        hi             -1.7447
m               1.4442        guaranty       -1.6986
talked          1.3735        thanks         -1.6003
regards         1.2503        ss             -1.5832
think           1.0448        she            -1.4513
positions       1.0018        tw             -1.3503
i think         1.0008        request        -1.3384
set             0.9963        who            -1.3080
lavo            0.9885        counterparty   -1.2976
this we         0.9693        agreements     -1.2810
j               0.9332        copies         -1.2674
gas             0.9324        fyi            -1.2642
make            0.9224        agreement      -1.2627
that it         0.9111        send           -1.1894
very            0.9029        one            -1.1848
i need          0.8982        since          -1.1829
send a          0.8932        handle         -1.1246
is a            0.8890        with           -1.1073
month           0.8838        meeting        -1.1021

Table 10.10: Top 20 ngram features for gender classification in email, ranked by the
weights assigned by the linear SVM model. See Section 10.10.4 for more details.
Figure 10.8: Empirical differences in sociolinguistic features for Age. Younger speak-
ers tend to use short utterances, pronouns and auxiliaries more often than older
speakers.
10.11.1 Corpus Details for Age and Native Language
For age, the same training and test speakers from the Fisher corpus were used as
explained for gender in Section 10.3, with ages binarized into greater-than-40 vs.
less-than-or-equal-to-40 for a more parallel binary evaluation. For predicting
native/non-native status, the 1156 non-native speakers in the Fisher corpus were
pooled with an equal number of randomly selected native speakers. The training and
test partitions consisted of 2000 and 312 speakers respectively, resulting in 3267
conversation sides for training and 508 conversation sides for testing.
10.11.2 Results for Age and Native/Non-Native
Based on the prior distribution, always guessing the most likely class for age (age
less-than-or-equal-to 40) results in 62.59% accuracy, and always guessing the most
likely class for native language (non-native) yields 50.59% accuracy.
Table 10.11 shows the performance of the models discussed in this chapter for age
and native/non-native speaker status. It can be seen that the ngram-based approach
of Boulis and Ostendorf (2005) for gender also gives reasonable performance on other
speaker attributes, and more importantly, both the partner-sensitive model and so-
ciolinguistic features help in reducing the error rate on age and native language sub-
stantially, indicating their usefulness not just on gender but also on other diverse
Model                                               Accuracy
Age (62.6% of sides have age <= 40)
Ngram Model                                         82.27
+ Partner Model                                     82.77
+ % of passive, mean inter-utterance time,
  % of pronouns                                     83.02
+ % of "yeah"                                       83.43
+ type/token ratio, % of lipsmacks                  83.83
+ % of auxiliaries, % of short utterances           83.98
+ % of "mm"                                         84.03
(Reduction in Error)                                (9.93%)
Native vs Non-native (50.6% of sides are non-native)
Ngram                                               76.97
+ Partner                                           80.31
+ Mean word length                                  80.51
(Reduction in Error)                                (15.37%)

Table 10.11: Results showing improvement in the accuracy of age and native language
classification using the partner-sensitive model and sociolinguistic features
latent attribute classifications.
10.11.3 Analysis
Table 10.12 shows the most discriminative ngram features for binary classification
of age. It is interesting to see the use of "well" right at the top of the list for
older speakers, as also found in sociolinguistic studies of age (Macaulay, 2005). One
can also see that older speakers talk about their children ("my daughter") and younger
speakers talk about their parents ("my mom"); the use of words such as "wow", "kinda"
and "cool" is also common among younger speakers. To give maximal consistency/benefit
to the Boulis and Ostendorf (2005) n-gram-based model, self-reporting n-grams such
as "im forty" and "im thirty" were not filtered, putting the sociolinguistic-literature-based
and discourse-style-based features at a relative disadvantage.

Age >= 40                      Age < 40
well             0.0330        im thirty       -0.0266
im forty         0.0189        actually        -0.0262
thats right      0.0160        definitely      -0.0226
forty            0.0158        like            -0.0223
yeah well        0.0153        wow             -0.0189
uhhuh            0.0148        as well         -0.0183
yeah right       0.0144        exactly         -0.0170
and um           0.0130        oh wow          -0.0143
im fifty         0.0126        everyone        -0.0137
years            0.0126        i mean          -0.0132
anyway           0.0123        oh really       -0.0128
isnt             0.0118        mom             -0.0112
daughter         0.0117        im twenty       -0.0110
well i           0.0116        cool            -0.0108
in fact          0.0116        think that      -0.0107
whether          0.0111        so              -0.0107
my daughter      0.0111        mean            -0.0106
pardon           0.0110        pretty          -0.0106
gee              0.0109        thirty          -0.0105
know laughter    0.0105        hey             -0.0103
this             0.0102        right now       -0.0100
oh               0.0102        cause           -0.0096
young            0.0100        im actually     -0.0096
in               0.0100        my mom          -0.0096
when they        0.0100        kinda           -0.0095

Table 10.12: Top 25 ngram features for Age ranked by weights assigned by the linear
SVM model
Among the sociolinguistic features, adding % of passive usage, mean inter-utterance
time, % of pronouns, % of "yeah", type-token ratio, % of lipsmacks, % of auxiliaries,
% of short utterances and % of "mm" shows improvement in performance. Figure
10.8 shows the empirical distributions of some of the sociolinguistic features for
age. For native vs. non-native classification, mean word length is a strong feature:
native speakers were observed to have a larger mean word length (2.34) compared to
non-native speakers (1.61).
10.12 Regression Models
While the binary model for age described in the previous section gives insight
into which features are indicative of age, it is only a step towards predicting the
real age of the speaker. A better approach is to use a regression framework, which
allows for a greater reduction in entropy when predicting age. The following
sub-sections explain the different regression models and their performance. The same
training and test speakers from the Switchboard corpus were used as explained for
gender in Section 10.3.
Figure 10.9: Age histograms for training and test speakers of Switchboard corpus
indicating unbalanced age groups of the participating speakers.
10.12.1 Evaluation Measures
Table 10.13 reports both the mean absolute error and the mean squared error, defined
as follows:

Mean absolute error (MAE) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|    (10.1)

Mean squared error (MSE) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2    (10.2)

where n is the number of instances, and y_i and \hat{y}_i are the true age and predicted
age of instance i respectively.
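As a concrete reading of equations (10.1) and (10.2), the two measures can be computed directly; the four ages below are invented, scored against a constant median-age predictor:

```python
def mean_absolute_error(y_true, y_pred):
    # Equation (10.1)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    # Equation (10.2)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative ages scored against a constant "median age" baseline
# (cf. the baseline predictors discussed in this section).
ages = [25, 38, 38, 52]
median_pred = [38] * len(ages)
mae = mean_absolute_error(ages, median_pred)   # (13 + 0 + 0 + 14) / 4 = 6.75
mse = mean_squared_error(ages, median_pred)    # (169 + 0 + 0 + 196) / 4 = 91.25
```

Note how MSE penalizes the two large misses much more heavily than MAE does.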
10.12.2 Baseline Approaches
Simple predictors of age based on the prior distribution are the median and average
statistics. As shown in Table 10.13, always predicting the median age (38) yields
a mean absolute error of 8.41 and a mean squared error of 111.98. Using the average
age instead of the median results in a mean absolute error of 8.61 and a mean squared
error of 108.01.
10.12.3 Ngram-based regression model
Based on the promising results using ngrams for binary classification of age, results
are also reported for an SVM-based regression model utilizing all the ngrams,
resulting in 57,914 features. The regression model was based on support vector
machines (Vapnik, 1995) along with the optimizations described in (Joachims, 1999;
Joachims, 2002).
The row labeled "Ngram-based model" in Table 10.13 reports the performance of
this model: a mean absolute error of 7.15 and a mean squared error of 79.80.
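The shape of this pipeline can be sketched end to end. Reproducing the SVM optimizer is out of scope here, so a squared-loss SGD linear regressor over sparse ngram counts stands in for the SVM regression model; the texts and ages below are invented for illustration:

```python
from collections import Counter

def ngram_features(text):
    """Unigram + bigram counts over a whitespace tokenization."""
    toks = text.lower().split()
    feats = Counter(toks)
    feats.update(" ".join(toks[i:i + 2]) for i in range(len(toks) - 1))
    return feats

def train_sgd_regressor(examples, epochs=200, lr=0.01):
    """Squared-loss SGD over sparse feature dicts: a simple stand-in
    for the SVM regression optimizer, not a reimplementation of it."""
    w, bias = {}, 0.0
    for _ in range(epochs):
        for feats, age in examples:
            err = bias + sum(w.get(f, 0.0) * c for f, c in feats.items()) - age
            bias -= lr * err
            for f, c in feats.items():
                w[f] = w.get(f, 0.0) - lr * err * c
    return w, bias

def predict(w, bias, feats):
    return bias + sum(w.get(f, 0.0) * c for f, c in feats.items())

# Invented texts echoing the lexical trends in Table 10.12; ages illustrative.
train = [(ngram_features("well my daughter called"), 55),
         (ngram_features("wow that is so cool"), 24),
         (ngram_features("my daughter is well"), 58),
         (ngram_features("my mom is kinda cool"), 22)]
w, b = train_sgd_regressor(train)
```

After training, utterances with "older" ngrams receive higher predicted ages than utterances with "younger" ngrams.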
10.12.4 Sociolinguistic features
The rows labeled “+ Socioling.” in Table 10.13 show the results for using soci-
olinguistic features along with the score from other models, combined using a meta-
classifier. One can see that using regression trees as a meta-classifier results in an
improved performance (MAE: 79.06) as compared to SVMs. This may be due to
increased robustness of regression trees in dealing with widely varying feature types
for hypothesizing decision boundaries.
10.12.5 Top Ngram features
Since the number of ngram features results in a high-dimensional feature space,
a simple method of feature selection is to use the ngrams with the highest positive
and negative weights assigned by the ngram-based SVM model. The row labeled "Top
40 N-grams Only" in Table 10.13 shows the regression performance based on just these
40 lexical features. These 40 ngram features are shown in Table 10.14; given how the
SVM models are trained, features with higher weights indicate predictiveness of older
speakers and vice versa. This can be seen from the presence of ngrams such as
"grandchildren", "son" and "i see" as features with high positive weights and ngrams
such as "oh really", "my parents", "pretty much", "yeah i", etc. as features with
high negative weights.

Model                                                      MAE    MSE
Median age (38 years)                                      8.41   111.98
Average age (39.86 years)                                  8.61   108.01
Ngram-based model                                          7.15   79.80
Top 40 N-grams Only                                        7.81   91.96
SVM Stacking
Ngram + Socioling.                                         7.15   79.74
Binary Splits                                              6.45   67.32
Binary Splits + Socioling.                                 6.25   63.46
Ngrams + Socioling. + Binary Splits                        7.15   79.74
Ngrams + Socioling. + Binary Splits + Top 40 Ngrams        7.14   79.64
Regression Tree Stacking
Ngram + Socioling.                                         7.06   78.08
Binary Splits                                              7.33   104.24
Binary Splits + Socioling.                                 7.38   104.04
Ngrams + Socioling. + Binary Splits                        7.06   78.08
Ngrams + Socioling. + Binary Splits + Top 40 Ngrams        7.06   78.08

Table 10.13: Results for age regression using different feature and model combinations.
Substantial performance gains were obtained by utilizing binary classifiers across
different age boundaries as features in a stacked SVM model.
While the performance degrades compared to using all 57,914 ngram features, a mean
absolute error of 7.81 is still reasonable by comparison. The primary motivation for
reducing the number of ngram features was to include them directly in the stacked
model and allow prominent ngrams to influence it. The rows labeled "+ Top 40 Ngrams"
report performance when adding these ngrams as 40 additional features in the stacked
model. However, no performance gains were obtained via this combination.

Top +ve features               Top -ve features
umhum umhum      0.1454        oh really       -0.1342
isnt             0.1252        oh no           -0.1070
yeah right       0.1189        definitely      -0.1053
anyhow           0.1103        laughter yeah   -0.0982
and um           0.1091        umhum yeah      -0.0973
you mean         0.0998        agree           -0.0917
dallas           0.0957        also            -0.0902
thats right      0.0953        exactly         -0.0883
uhhuh uhhuh      0.0937        because         -0.0852
son              0.0897        yeah i          -0.0850
right uhhuh      0.0864        um              -0.0810
course           0.0853        pretty much     -0.0802
i think          0.0848        do do           -0.0795
yes i            0.0829        well um         -0.0781
isnt it          0.0814        pretty          -0.0780
grandchildren    0.0800        my parents      -0.0776
i see            0.0800        right           -0.0773
i have           0.0796        research        -0.0767
i say            0.0795        only            -0.0751
i just           0.0786        the school      -0.0748

Table 10.14: Top 20 ngram features for Age ranked by weights assigned by the
ngram-based SVM regression model
10.12.6 Multiple Binary Classifiers Across Different Age Boundaries
Training multiple binary SVM models as compared to full regression models can
be helpful due to the much reduced hypothesis space and improved accuracy of the
individual classifiers.15 This model explores the use of multiple ngram-based binary
classifiers obtained by windowing different age groups. The outputs of these
classifiers are then used as features in the regression model, resulting in a much
lower dimensionality compared to the ngram-based regression model. Based on the
assumption that points closer to the decision boundary are harder to classify, a
windowing-based approach was used, resulting in the following three binary classifiers:
1. age < 30 vs age > 40, with standalone performance of 72.09%.
2. age < 40 vs age > 50, with standalone performance of 74.96%.
3. age < 50 vs age > 60,16 with standalone performance of 95.72%.
Each of these binary classifiers was trained using the ngram features. The rows
labeled "Binary Splits" show the performance of using only these classifier outputs
as features in the stacking models. Using an SVM as the stacking model over these
features results in substantially better performance than the previous models, with
a mean absolute error of 6.45 and a mean squared error of 67.32. Further adding
sociolinguistic features results in the best performance, shown in the rows labeled
"Binary Splits + Socioling." in the SVM stacking model. In particular, with these
added features the mean absolute error is reduced to 6.25 and the mean squared error
to 63.46.
15. The original characterization of SVMs was also with respect to binary classification, making them more suitable for making binary predictions.
16. The next windowing model, age < 60 vs age > 70, was not utilized since there were no instances with age > 70.
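The construction of these windowed training sets can be sketched as follows; the feature dictionaries and ages are invented for illustration:

```python
def windowed_labels(speakers, low, high):
    """Training data for one windowed binary split (age < low vs
    age > high). Speakers with low <= age <= high are dropped, since
    points near the decision boundary are harder to classify."""
    data = []
    for feats, age in speakers:
        if age < low:
            data.append((feats, 0))
        elif age > high:
            data.append((feats, 1))
    return data

# Invented feature dicts and ages for illustration.
speakers = [({"well": 2}, 61), ({"cool": 3}, 25),
            ({"anyway": 1}, 35), ({"wow": 1}, 45)]
split_30_40 = windowed_labels(speakers, 30, 40)   # the age < 30 vs age > 40 split
```

The 35-year-old falls inside the 30-40 window and is excluded from this split, while the remaining speakers receive clean binary labels.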
10.12.7 Stacked Models
Stacking, or stacked generalization (Wolpert, 1992), is an approach for constructing
an ensemble or committee of classifiers. A classifier ensemble or committee is a set
of classifiers whose individual decisions are combined to classify new instances
(Dietterich, 1997). Stacking is a specific instance of ensemble classification where
a higher-level or meta classifier is utilized for combining the outputs of multiple
classifiers. Using the analogy of a committee, the meta classifier is the president
of the committee and the individual classifiers are the committee members. The
motivation for stacking is that different committee members make different
classification errors, and hence the president can learn when to trust each member
depending upon the nature of the instance to be classified. The following sections
describe the use of stacking for age regression using a linear model (an SVM with
linear kernel) and a regression tree as meta-classifiers.
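The committee analogy can be made concrete with a small sketch. A least-squares linear meta-regressor (solved via the normal equations) stands in for the linear-kernel SVM "president"; the component scores and ages below are invented, with one informative member and one nearly constant member:

```python
def solve_linear_system(A, b):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_meta_regressor(component_scores, targets):
    """Least-squares linear meta-regressor via the normal equations;
    a stand-in for the linear-kernel SVM 'president' that learns how
    much to trust each committee member's score."""
    X = [[1.0] + list(s) for s in component_scores]   # bias + member scores
    d, m = len(X[0]), len(X)
    XtX = [[sum(X[r][i] * X[r][j] for r in range(m)) for j in range(d)]
           for i in range(d)]
    Xty = [sum(X[r][i] * targets[r] for r in range(m)) for i in range(d)]
    return solve_linear_system(XtX, Xty)              # [bias, w1, w2, ...]

# Member 1 tracks the true age closely; member 2 is nearly constant noise.
scores = [[25.0, 40.0], [61.0, 40.0], [33.0, 41.0], [55.0, 39.0]]
ages = [26.0, 60.0, 34.0, 54.0]
w = fit_meta_regressor(scores, ages)
preds = [w[0] + w[1] * s[0] + w[2] * s[1] for s in scores]
```

The fitted weight on the informative member lands near 1.0, i.e. the president learns to trust the member whose errors are small.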
10.12.7.1 Linear Combination
This approach uses a linear-kernel SVM as the meta-regression model for combining
the score from the ngram-based lexical model, the sociolinguistic features, the top
ngram features and the scores from the multiple binary classifiers across different
age boundaries. The results for various combinations of these features and classifier
scores are shown under the "SVM Stacking" category in Table 10.13. The row labeled
"Ngram + Socioling." describes the result of using the output of the SVM ngram-based
regression model and the sociolinguistic features in a meta SVM regression model.
The models utilizing "+ Top 40 Ngrams" describe the performance of using the top 40
ngram features described in Section 10.12.5 along with other features/scores in the
meta SVM regression model. Finally, the rows with "+ Binary Splits" describe the
performance of using the output of the multiple binary classifiers across different
age boundaries along with other features/scores.
10.12.7.2 Regression Trees
This approach uses a regression tree as the meta-regression model for combining
the score from the ngram-based lexical model, the sociolinguistic features, the top
ngram features and the scores from the multiple binary classifiers across different
age boundaries. Regression trees have traditionally been used for combining a small
number of signals and hence can be easily utilized as a meta classifier for the
various combinations of stacked models. The REPTree package from Weka17 (Witten and
Frank, 2005) was utilized as the regression tree model. REPTree builds a regression
tree using the information gain criterion and uses reduced-error pruning for
generalizing the tree. The various combinations described under the "Regression Tree
Stacking" category in Table 10.13 are stacked in the same fashion as explained in
the SVM stacking section above.
17. Weka (Witten and Frank, 2005) is a machine learning toolkit written in Java. It contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/
[Figure 10.10 shows a diagram: ngram-based binary classifiers for different age
splits (age < 30 vs age > 40, age < 40 vs age > 50, age < 50 vs age > 60) and
sociolinguistic features (mean utterance length, active/passive usage ratio, % of
modal auxiliaries, type-token ratio, ...) feed into either an SVM regression model
or a regression tree.]
Figure 10.10: Stacking approach for age regression utilizing binary classifiers across
different age boundaries and sociolinguistic features as individual components.
Model                                                      MAE    MSE
Baseline1 (Median: 40.5)                                   9.74   130.29
Baseline2 (Average: 41.19)                                 9.75   129.81
Ngram-based model                                          7.63   88.37
SVM Stacking
Binary Splits                                              7.17   80.27
Binary Splits + Socioling.                                 6.94   75.69
Ngrams + Socioling. + Binary Splits + Top 40 Ngrams        7.63   88.37
Regression Tree Stacking
Binary Splits                                              7.87   116.42
Binary Splits + Socioling.                                 7.95   116.21
Ngrams + Socioling. + Binary Splits + Top 40 Ngrams        7.53   86.49

Table 10.15: Results for age regression using different feature and model combinations
for an age-wise balanced test set. While the performance of the baseline models
degrades due to higher variance, the regression models show consistent performance
improvements as in Table 10.13.
10.12.8 Balancing Size of Different Age Groups in Test Set
The age sample of speakers that participated in the telephone conversation
experiment could be biased due to age limit restrictions, the location of sampling,
etc. The age histograms of the training and test speakers are shown in Figure 10.9,
and one can see the characteristic age clusters leading to an unbalanced age sample.
In order to obtain a fair performance estimate on a more balanced age distribution,
the test set was filtered to create a more balanced distribution across different
age groups using a threshold, as shown in Figure 10.11. The results of various
regression models on this balanced test set are shown in Table 10.15. While the
baselines yield lower performance because the age distribution is more uniform, the
row labeled "Binary Splits + Socioling." still gives the best performance, similar
to that on the original biased sample.

Figure 10.11: Histograms for different age groups in the test set. The horizontal
line shows the threshold for balancing the size of the test set across different age
groups, retaining a total of 600 examples.
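The thresholding can be sketched as a per-group cap on the number of retained test examples; the decade-bucket grouping and the ages below are illustrative, not the dissertation's actual groups:

```python
from collections import defaultdict

def balance_by_group(examples, group_of, cap):
    """Filter a test set so that no group exceeds `cap` examples,
    approximating the histogram thresholding shown in Figure 10.11."""
    taken = defaultdict(int)
    balanced = []
    for ex in examples:
        g = group_of(ex)
        if taken[g] < cap:
            taken[g] += 1
            balanced.append(ex)
    return balanced

# Decade buckets as illustrative age groups.
ages = [23, 25, 27, 29, 34, 36, 52, 58, 24, 31]
balanced = balance_by_group(ages, lambda a: a // 10, cap=2)
```

Over-represented groups (the twenties here) are capped, while sparse groups keep all of their examples.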
10.13 Effect of Self-Reporting Features
on Age Prediction
A speaker may sometimes state his or her age during the conversation, as in "I'm
thirty two". This may lead to artificial inflation of the results and may cause the
regression model to over-depend on such self-reporting features. In order to study
the impact of this effect on performance, all self-reporting ngrams were removed
from the feature set and the regression model was retrained using the remaining
features. The results shown in Table 10.16 indicate that such features have negligible
impact on performance, indicating the robustness of general discourse features in
predicting age.

Model                              Mean Absolute Error    Mean Squared Error
Ngram-based model                  7.15                   79.80
Deleting self-reporting features   7.13                   79.90

Table 10.16: Self-reporting features such as "in thirties", "i'm forty five", etc.
have little impact. The performance after deleting such features is similar to the
original model containing all ngrams as features.
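A feature ablation of this kind can be sketched with a pattern filter; the regular expression below is an illustrative approximation of the self-reporting ngrams, since the dissertation does not give its exact filter list:

```python
import re

# Illustrative approximation of self-reporting ngrams ("im thirty",
# "i'm forty five", ...); the exact filter list is not given.
SELF_REPORT = re.compile(
    r"\b(i'?m|i am)\s+(twenty|thirty|forty|fifty|sixty)"
    r"(\s+(one|two|three|four|five|six|seven|eight|nine))?\b")

def drop_self_report_features(features):
    """Remove self-reporting ngram features before retraining."""
    return {ng: v for ng, v in features.items() if not SELF_REPORT.search(ng)}

feats = {"im thirty": -0.0266, "im forty": 0.0189,
         "my daughter": 0.0111, "cool": -0.0108}
kept = drop_self_report_features(feats)
```

Only the general discourse features survive the filter, mirroring the ablation whose results are reported in Table 10.16.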
10.14 Statistical Significance of Results
This section analyzes the statistical significance of the results reported in Tables
10.4, 10.5, 10.8, 10.9, 10.11 and 10.13. For the per-conversation results reported
in Table 10.4, using a binomial test with sample sizes of 2008 (Fisher) and 808
(Switchboard), and baseline accuracies of 90.84% (Fisher) and 90.22% (Switchboard),
any resulting accuracy over 91.88% (Fisher) and 91.96% (Switchboard) is statistically
significant with a p-value less than 0.05.
For the speaker-wise aggregate results reported in Table 10.5, using a binomial test
with sample sizes of 1000 (Fisher) and 100 (Switchboard) speakers, any resulting
accuracy over 92% (Fisher) and over 97% (Switchboard) is statistically significant
with a p-value less than 0.05.
For Arabic gender classification results in Table 10.8, using a binomial test with sam-
ple size 200 and baseline accuracy 92%, any resulting accuracy over 95% is statistically
significant with p-value less than 0.05.
For gender classification results on Email in Table 10.9, using a binomial test with
sample size 204 and baseline accuracy 76.78%, any resulting accuracy over 81.37% is
statistically significant with a p-value less than 0.05.
For results on binary classification of age reported in Table 10.11, using a binomial
test with sample size 2008 and baseline accuracy of 82.27%, any resulting accuracy
over 83.66% is statistically significant with a p-value less than 0.05.
For results on native vs non-native speaker reported in Table 10.11, using a binomial
test with sample size 508 and baseline accuracy of 76.97%, any resulting accuracy
over 79.92% is statistically significant with a p-value less than 0.05.
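Thresholds of this kind can be recomputed from a one-sided exact binomial tail. The self-contained sketch below works in log space (via `lgamma`) so the larger sample sizes do not underflow; it is a sketch of the test, not the exact procedure used above:

```python
import math

def min_significant_correct(n, p0, alpha=0.05):
    """Smallest number of correct predictions k (out of n) whose
    one-sided binomial tail P(X >= k | n, p0) falls below alpha,
    i.e. the accuracy threshold for significance over baseline p0."""
    def log_pmf(k):
        # log of C(n, k) * p0^k * (1 - p0)^(n - k)
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p0) + (n - k) * math.log(1 - p0))
    tail = 0.0
    # accumulate the upper tail from k = n downwards
    for k in range(n, -1, -1):
        tail += math.exp(log_pmf(k))
        if tail >= alpha:
            return k + 1
    return 0

# e.g. the Table 10.9 setting: n = 204 emails, baseline accuracy 76.78%
k = min_significant_correct(204, 0.7678)
threshold_acc = 100.0 * k / 204
```

The resulting threshold accuracy sits a few points above the 76.78% baseline, in line with the 81.37% figure quoted above for the email results.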
10.15 Conclusion
This chapter has presented and evaluated several original techniques for the la-
tent classification of speaker gender, age and native language in diverse genres and
languages. A novel partner-sensitive model shows performance gains from the joint
modeling of speaker attributes along with partner speaker attributes, given the
differences in lexical usage and discourse style such as those observed between
same-gender and mixed-gender conversations. The robustness of the partner-sensitive
model is
substantially supported based on the consistent performance gains achieved in di-
verse languages and attributes. This chapter has also explored a rich variety of novel
sociolinguistic and discourse-based features, including mean utterance length, pas-
sive/active usage, percentage domination of the conversation, speaking rate and filler
word usage. In addition to these novel models, this work also shows how these mod-
els and the previous work extend to new languages and genres. Cumulatively up to
20% error reduction is achieved relative to the standard Boulis and Ostendorf (2005)
algorithm for classifying individual conversations on Switchboard, and accuracy for
gender detection on the Switchboard corpus (aggregate) and Gulf Arabic exceeds
95%.
Chapter 11
Contributions and Conclusion
This dissertation has presented several scientific contributions to the areas of trans-
lation lexicon induction, semantic knowledge extraction and textual fact extraction.
In particular, the following is a brief summary of some of the distinct scientific con-
tributions contained herein.
Chapter 3
1. Fluent, Non Compositional Translation of Compound Words: This is
the first work on fluent, non-compositional translation of compound words via
cross-language transitivity as opposed to compositional (or “glossy”) translation
methods used in the previous literature. Successful translation of compounds
can be achieved without the need for bilingual training text, by modeling the
mapping of literal component-word glosses (e.g. “iron-path”) into fluent English
(e.g. “railway”) across multiple languages.
2. Modeling Sequence of Compound Components and
Compound Morphology: Performance of compound translation is further
improved by adding component-sequence and learned-morphology models along
with context similarity from monolingual text and optional combination with
traditional bilingual-text-based translation discovery.
3. Application to Diverse World Languages: This is the first known work on
compound translation induction to be evaluated broadly on 10 diverse languages
(Albanian, Arabic, Bulgarian, Czech, Farsi, German, Hungarian, Russian, Slo-
vak and Swedish), showing robust language-independence as compared to pre-
vious literature that has focused on using fixed syntactic patterns modeling the
compounding phenomena of one or two languages in question. The models de-
veloped in this dissertation show consistent performance gains in translation
accuracy across all these languages.
Chapter 4
4. Dependency Contexts for Translation Lexicon Induction: While depen-
dency contexts have been successfully used for monolingual natural language
processing tasks, this is the first work to report their contribution to trans-
lation lexicon induction. In addition to providing empirical gains, this work
clearly shows why such richer contexts are helpful with respect to modeling
long-distance relationships and word-reordering.
5. Reducing Entropy of Candidate Space via Mapping Part-of-speech Tagsets:
This is the first work in the minimally supervised translation lexicon literature
to show how preserving a word's part of speech in translation can improve
performance, and to provide a mechanism for mapping part-of-speech tagsets
automatically. Such a mapping was used to restrict the candidate space (which
can be large depending on the size of the vocabulary), making it possible to
improve monolingual corpus-based methods for translating words of all
part-of-speech categories.
Chapter 6
6. Learning Semantic Taxonomy in Multiple Languages using Information
from Different Relationship-types: This work provides a novel minimal-
resource algorithm for the acquisition of multilingual lexical taxonomies (in-
cluding hyponymy/hypernymy and meronymy) using evidence from multiple
relationship-types.
This is also the first work to show successful application of corpus-based meth-
ods for fact extraction to Hindi and the robustness of this approach is shown by
the fact that the unannotated Hindi development corpus was only 1/15th the
size of the utilized English corpus.
7. Semantic Taxonomy as Transitive Bridge for
Translation Lexicon Induction: This is the first work to present a novel
model of unsupervised translation lexicon induction via multilingual transitive
models of hypernymy and hyponymy, using corpus-based induced taxonomies.
Chapter 7
8. Extracting Natural Parents in the Hypernymy Chain for Definite
Anaphora Resolution: This chapter presents a successful solution to the
problem of identifying natural hypernyms for definite anaphora resolution,
via a simple but noisy corpus-based approach that globally models head-word
co-occurrence around likely anaphoric definite NPs, in contrast to approaches
in the previous literature utilizing standard Hearst-style patterns for
extracting hypernyms to identify likely antecedents.
9. Generation of Definite Anaphors using Natural Parents: This is the first
work in the coreference modeling literature to present a perspective on gen-
erating definite anaphors using natural parents extracted from corpus-based
methods. On this much harder anaphora generation task, where the stand-
alone WordNet-based model only achieved an accuracy of 4%, the corpus-based
models can achieve 35%-47% accuracy on blind exact-match evaluation.
Chapter 9
10. Global Document-level Structural Model: This is the first work illustrating
a global structural model for biographic fact extraction that utilizes absolute
and relative document-wide positions, as opposed to modeling only local
contextual patterns.
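The positional intuition can be sketched roughly as follows, assuming a per-attribute typical document-relative position learned from training biographies; the values here are invented for illustration:

```python
def rank_by_position(candidates, typical_pos):
    """Rank candidate (value, relative_position) pairs for an attribute by
    closeness to the attribute's typical document-relative position,
    where 0.0 is the start of the document and 1.0 the end."""
    return sorted(candidates, key=lambda vp: abs(vp[1] - typical_pos))

# Birth dates tend to appear near the top of a biography (pos ~ 0.05),
# so the year mentioned early outranks one mentioned near the end.
cands = [("1879", 0.04), ("1955", 0.92)]
print(rank_by_position(cands, 0.05)[0][0])  # 1879
```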
11. Transitive Model: Another property exploited in a novel manner in this work
is the tendency of individuals occurring together in an article to have related
attribute values. Based on this intuition, a transitive model was implemented
that predicts attributes by consensus voting over the extracted attributes of
neighboring names.
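A minimal sketch of such consensus voting follows; the attribute values are hypothetical:

```python
from collections import Counter

def vote_attribute(neighbor_attrs):
    """Predict an attribute by majority vote over the extracted attributes
    of names co-occurring in the same article; None marks neighbors for
    which no value could be extracted."""
    counts = Counter(a for a in neighbor_attrs if a is not None)
    return counts.most_common(1)[0][0] if counts else None

# e.g. a person listed alongside mostly Italian painters
print(vote_attribute(["Italian", "Italian", None, "French"]))  # Italian
```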
12. Correlation-based Model: This work also presents the novel use of correla-
tions between attributes for biographic fact extraction, learning compatible and
incompatible inter-attribute pairings. The motivation here is that the attributes
(such as nationality and religion) are not independent of each other when mod-
eled for the same individual, leading to performance gains via exploiting this
correlation.
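One simple way to realize this intuition is to rerank candidate attribute pairs jointly using a learned compatibility table; all scores below are invented for illustration:

```python
def rerank_with_correlation(cands_a, cands_b, compat):
    """Jointly pick a pair of attribute values (e.g. nationality, religion),
    combining each candidate's standalone score with a learned
    inter-attribute compatibility weight."""
    return max(
        ((a, b) for a, _ in cands_a for b, _ in cands_b),
        key=lambda ab: dict(cands_a)[ab[0]] * dict(cands_b)[ab[1]]
                       * compat.get(ab, 0.0),
    )

nationality = [("Saudi", 0.6), ("Indian", 0.4)]
religion = [("Hindu", 0.5), ("Muslim", 0.5)]
compat = {("Saudi", "Muslim"): 0.9, ("Saudi", "Hindu"): 0.01,
          ("Indian", "Hindu"): 0.7, ("Indian", "Muslim"): 0.3}
print(rerank_with_correlation(nationality, religion, compat))
# ('Saudi', 'Muslim')
```

Note how the compatibility term breaks the tie between the two religion candidates, which the standalone scores alone could not.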
Chapter 10
13. Modeling Partner-Effect: This work is the first to show performance gains
from the novel modeling of speaker attributes sensitive to the partner
speaker's attributes, given differences in lexical usage and discourse style
such as those observed between same-gender and mixed-gender conversations.
14. Use of Sociolinguistic Features: In contrast to the lexical n-gram focused
models developed for gender prediction in the computational linguistics
literature, this work explores a rich variety of novel sociolinguistic and
discourse-based features, including mean utterance length, passive/active usage
ratio, percentage domination of the conversation, speaking rate and filler word
usage.
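A few of these features can be computed from one speaker's side of a transcript roughly as follows; the whitespace tokenization and the filler-word list are simplifying assumptions, not the dissertation's exact feature definitions:

```python
def discourse_features(utterances, duration_sec):
    """Compute a few of the sociolinguistic features named above from
    one speaker's utterances in a conversation."""
    fillers = {"um", "uh", "like", "well"}
    tokens = [t.lower() for u in utterances for t in u.split()]
    return {
        "mean_utt_len": len(tokens) / len(utterances),
        "speaking_rate": len(tokens) / (duration_sec / 60.0),  # words/minute
        "filler_ratio": sum(t in fillers for t in tokens) / len(tokens),
    }

feats = discourse_features(["well I um think so", "yes"], duration_sec=30)
print(feats)
```

Features like these feed a standard classifier alongside (or instead of) lexical n-grams.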
15. Application to a Variety of Attributes: This work shows how the lexical
models of gender classification in the previous literature can be extended to
age and native vs. non-native speaker prediction, with further improvements
gained from partner-sensitive models and novel sociolinguistic features.
11.1 Applications and Future Work
The approaches, models and algorithms presented in this dissertation for
cross-language, semantic and factual relationship extraction have broad
potential for use in a number of major applications:
• Fine-grained information retrieval: Building structured knowledge bases
containing a wide range of relationships allows powerful query mechanisms for
search. For example, a relational database of biographic facts can answer
queries such as which attributes two people have in common. General semantic
relationships such as hypernymy can also be helpful for query expansion, and
learned translation lexicons are important for improving query/document term
translation in cross-language information retrieval.
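As a sketch of such a fine-grained query, a small relational table of biographic facts (toy data, using an in-memory SQLite database) supports finding shared attributes with a self-join:

```python
import sqlite3

# toy biographic-fact table: (person, attribute, value) triples
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (person TEXT, attr TEXT, value TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    ("A", "nationality", "German"), ("A", "occupation", "physicist"),
    ("B", "nationality", "German"), ("B", "occupation", "chemist"),
])

# a self-join finds attributes on which two people agree
shared = conn.execute("""
    SELECT f1.attr, f1.value FROM facts f1 JOIN facts f2
    ON f1.attr = f2.attr AND f1.value = f2.value
    WHERE f1.person = 'A' AND f2.person = 'B'
""").fetchall()
print(shared)  # [('nationality', 'German')]
```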
• Machine translation: Given enough parallel bilingual text, current
statistical machine translation systems can learn highly accurate word and
phrase translation lexicons. However, large parallel corpora exist for only a
few of the world's languages, and the methods described in this thesis show
several novel approaches to building translation lexicons without the need for
parallel corpora.
• Disambiguation of concepts and named entities: A major problem in natural
language processing and search systems is that a word can refer to multiple
concepts in different languages, and similarly a named entity can have multiple
referent persons, organizations, etc. The extraction of entity attributes and
relations can provide powerful features for entity disambiguation and linking.
• Personalized services/recommendations: Approaches that can extract
biographic attributes from unstructured user content provide additional
meta-information about the user. Meta-information such as "gender", "age",
"education level", etc., can enable more personalized user assistance, such as
custom news feeds, call routing and book recommendations based on the
extracted attributes.
• Education: Generating structured repositories of the relationships between
words, their translations in different languages, and distilled facts about
entities can provide a better way for students to learn about domains of
current interest. In addition, a system could detect that a student is a
non-native speaker and use the translingual and synonymy relationships to
improve the student's vocabulary.
257
Bibliography
[1] E. Agichtein and L. Gravano. Snowball: extracting relations from large plain-
text collections. In Proceedings of the 5th ACM International Conference on
Digital Libraries, pages 85–94, 2000.
[2] E. Alfonseca, P. Castells, M. Okumura, and M. Ruiz-Casado. A rote extractor
with edit distance-based generalisation and multi-corpora precision calculation.
Proceedings of International Conference on Computational Linguistics and As-
sociation for Computational Linguistics, pages 9–16, 2006.
[3] D.E. Appelt, J.R. Hobbs, J. Bear, D. Israel, and M. Tyson. FASTUS: A finite-
state processor for information extraction from real-world text. In International
Joint Conference on Artificial Intelligence, volume 13, pages 1172–1172, 1993.
[4] S. Argamon, M. Koppel, J. Fine, and A.R. Shimoni. Gender, genre, and writing
style in formal written texts. Text-Interdisciplinary Journal for the Study of
Discourse, 23(3):321–346, 2003.
[5] J. Artiles, J. Gonzalo, and S. Sekine. The semeval-2007 weps evaluation: Estab-
258
lishing a benchmark for the web people search task. In Proceedings of SemEval,
pages 64–69, 2007.
[6] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Ex-
tracting Semantics from Wiki Content. Proceedings of ESWC, pages 503–517,
2007.
[7] A. Bagga and B. Baldwin. Entity-Based Cross-Document Coreferencing Using
the Vector Space Model. In Proceedings of International Conference on Compu-
tational Linguistics and Association for Computational Linguistics, pages 79–
85, 1998.
[8] T. Baldwin and T. Tanaka. Translation by Machine of Complex Nominals: Get-
ting it Right. In Proceedings of the Association for Computational Linguistics
Workshop on Multiword Expressions, pages 24–31, 2004.
[9] M. Berland and E. Charniak. Finding parts in very large corpora. In Proceedings
of the 37th Annual Meeting of the Association for Computational Linguistics,
pages 57–64, 1999.
[10] T. Bocklet, A. Maier, and E. Noth. Age Determination of Children in Preschool
and Primary School Age with GMM and Based Supervectors and Support Vec-
tor Machines/Regression. In Proceedings of Text, Speech and Dialogue; 11th
International Conference, volume 1, pages 253–260, 2008.
259
[11] C. Boulis and M. Ostendorf. A quantitative analysis of lexical differences be-
tween genders in telephone conversations. In Proceedings of Association for
Computational Linguistics, pages 435–442, 2005.
[12] S. Brin. Extracting patterns and relations from the world wide web. In
In WebDB Workshop at 6th International Conference on Extending Database
Technology, EDBT98, 1998.
[13] P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, F. Jelinek, J.D. Lafferty,
R.L. Mercer, and P.S. Roossin. A statistical approach to machine translation.
Computational Linguistics, 16(2):79–85, 1990.
[14] P.F. Brown, V.J. Della Pietra, S.A. Della Pietra, and R.L. Mercer. The mathe-
matics of statistical machine translation: Parameter estimation. Computational
linguistics, 19(2):263–311, 1993.
[15] R.D. Brown. Corpus-driven splitting of compound words. In Proceedings of
TMI, 2002.
[16] S. Buchholz and E. Marsi. Conference on natural language learning-X shared
task on multilingual dependency parsing. In Proceedings of Conference on Nat-
ural Language Learning, pages 189–210, 2006.
[17] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity
260
disambiguation. In Proceedings of European Chapter of the Association for
Computational Linguistics, pages 3–7, 2006.
[18] J.D. Burger and J.C. Henderson. An exploration of observable features related
to blogger age. In Computational Approaches to Analyzing Weblogs: Papers
from the 2006 American Association for Artificial Intelligence Spring Sympo-
sium, pages 15–20, 2006.
[19] M. J. Cafarella, D. Downey, S. Soderland, and O. Etzioni. Knowitnow: Fast,
scalable information extraction from the web. In Proceedings of Empirical Meth-
ods in Natural Language Processing and Human Language Technologies, pages
563–570, 2005.
[20] C. Callison-Burch, D. Talbot, and M. Osborne. Statistical Machine Transla-
tion with Word-and Sentence-Aligned Parallel Corpora. In Proceedings of the
Association for Computational Linguistics, pages 175–182.
[21] Y. Cao and H. Li. Base Noun Phrase translation using web data and the EM
algorithm. In Proceedings of the International Conference on Computational
Linguistics and Volume 1, pages 1–7, 2002.
[22] S. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy
from text. In Proceedings of the 37th Annual Meeting of the Association for
Computational Linguistics, pages 120–126, 1999.
261
[23] B. Carterette, R. Jones, W. Greiner, and C. Barr. N semantic classes are
harder than two. In Proceedings of Association for Computational Linguistics
and International Conference on Computational Linguistics, pages 49–56, 2006.
[24] S. Cederberg and D. Widdows. Using LSA and noun coordination informa-
tion to improve the precision and recall of automatic hyponymy extraction. In
Proceedings of Conference on Natural Language Learning, pages 111–118, 2003.
[25] C. Cieri, D. Miller, and K. Walker. The Fisher Corpus: a resource for the next
generations of speech-to-text. In Proceedings of LREC, 2004.
[26] H. H. Clark. Bridging. In Proceedings of the Conference on Theoretical Issues
in Natural Language Processing, pages 169–174, 1975.
[27] J. Coates. Language and Gender: A Reader. Blackwell Publishers, 1998.
[28] D. Connolly, J. D. Burger, and D. S. Day. A machine learning approach to
anaphoric reference. In Proceedings of the International Conference on New
Methods in Language Processing, pages 133–144, 1997.
[29] M. Corney, O. de Vel, A. Anderson, and G. Mohay. Gender-preferential text
mining of e-mail discourse. In Proceedings of Annual Computer Security Appli-
cations Conference, pages 21–27, 2002.
[30] J. Cowie, S. Nirenburg, and H. Molina-Salgado. Generating personal profiles.
In The International Conference On MT And Multilingual NLP, 2000.
262
[31] S. Cucerzan. Large-scale named entity disambiguation based on wikipedia data.
In Proceedings of Empirical Methods in Natural Language Processing and Con-
ference on Natural Language Learning, pages 708–716, 2007.
[32] A. Culotta, A. McCallum, and J. Betz. Integrating probabilistic extraction
models and data mining to discover relations and patterns in text. In Pro-
ceedings of Human Language Technologies and North American Chapter of the
Association for Computational Linguistics, pages 296–303, 2006.
[33] A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In
Proceedings of Association for Computational Linguistics, 2004.
[34] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from in-
complete data via the EM algorithm. Journal of the Royal Statistical Society.
Series B (Methodological), pages 1–38, 1977.
[35] T.G. Dietterich. An experimental comparison of three methods for constructing
ensembles of decision trees: Bagging, boosting, and randomization. Machine
learning, 40(2):139–157, 2000.
[36] P. Eckert and S. McConnell-Ginet. Language and Gender. Cambridge Univer-
sity Press, 2003.
[37] O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland,
263
D. S. Weld, and A. Yates. Unsupervised named-entity extraction from the web:
an experimental study. Artif. Intell., 165(1):91–134, 2005.
[38] C. Fellbaum. Wordnet: An electronic lexical database. 1998.
[39] E. Filatova and J. Prager. Tell me what you do and Ill tell you what you are:
Learning occupation-related activities for biographies. Proceedings of Human
Language Technologies and Empirical Methods in Natural Language Processing,
pages 113–120, 2005.
[40] J.L. Fischer. Social influences on the choice of a linguistic variant. Word,
14:47–56, 1958.
[41] R. Florian and D. Yarowsky. Modeling consensus: Classifier combination for
word sense disambiguation. In Proceedings of the conference on Empirical meth-
ods in natural language processing, pages 25–32, 2002.
[42] P. Fung. A statistical view on bilingual lexicon extraction: from parallel corpora
to non-parallel corpora. Lecture Notes in Computer Science, 1529:1–17, 1998.
[43] P. Fung and P. Cheung. Multi-level bootstrapping for extracting parallel sen-
tences from a quasi-comparable corpus. In Proceedings of International Con-
ference on Computational Linguistics, pages 1051–1057, 2004.
[44] P. Fung and L.Y. Yee. An IR Approach for Translating New Words from Non-
264
parallel, Comparable Texts. In Proceedings of Association for Computational
Linguistics, volume 36, pages 414–420, 1998.
[45] N. Garera, C. Callison-Burch, and D. Yarowsky. Improving translation lexicon
induction from monolingual corpora via dependency contexts and part-of-speech
equivalences. In Proceedings of the Conference on Computational Natural Lan-
guage Learning, pages 129–137, 2009.
[46] N. Garera and A. I. Rudnicky. Briefing assistant: Learning human summariza-
tion behavior over time. In AAAI Spring Symposium on Persistent Assistants,
2005.
[47] N. Garera and D. Yarowsky. Resolving and generating definite anaphora by
modeling hypernymy using unlabeled corpora. In Proceedings of the Conference
on Natural Language Learning, pages 37–44, 2006.
[48] N. Garera and D. Yarowsky. Minimally supervised multilingual taxonomy and
translation lexicon induction. In Proceedings of the International Joint Confer-
ence on Natural Language Processing, pages 465–472, 2008.
[49] N. Garera and D. Yarowsky. Translating compounds by learning component
gloss translation models via multiple languages. In Proceedings of the Interna-
tional Joint Conference on Natural Language Processing, pages 403–410, 2008.
[50] N. Garera and D. Yarowsky. Modeling latent biographic attributes in conver-
265
sational genres. In Proceedings of the Joint Conference of Association of Com-
putational Linguistics and International Joint Conference on Natural Language
Processing (ACL-IJCNLP), pages 710–718, 2009.
[51] N. Garera and D. Yarowsky. Structural, transitive and latent models for bi-
ographic fact extraction. In Proceedings of the Conference of the European
Chapter of the Association of Computational Linguistics, pages 300–308, 2009.
[52] R. Girju, A. Badulescu, and D. Moldovan. Learning semantic constraints for the
automatic discovery of part-whole relations. In Proceedings of Human Language
Technologies and North American Chapter of the Association for Computational
Linguistics, pages 1–8, 2003.
[53] R. Girju, A. Badulescu, and D. Moldovan. Automatic discovery of part-whole
relations. Computational Linguistics, 21(1):83–135, 2006.
[54] J.J. Godfrey, E.C. Holliman, and J. McDaniel. Switchboard: Telephone speech
corpus for research and development. In Proceedings of ICASSP, volume 1,
1992.
[55] T. Gollins and M. Sanderson. Improving cross language retrieval with triangu-
lated translation. In Proceedings of the 24th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 90–95,
2001.
266
[56] D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword Second Edition.
Linguistic Data Consortium, catalog number LDC2005T12, 2005.
[57] G. Grefenstette. The World Wide Web as a Resource for Example-Based Ma-
chine Translation Tasks. In ASLIB’99 Translating and the Computer 21., 1999.
[58] A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning bilingual
lexicons from monolingual corpora. In Proceedings of Association for Compu-
tational Linguistics and Human Language Technologies, pages 771–779, 2008.
[59] J. Hajic, J. Hric, and V. Kubon. Machine translation of very close languages.
In Proceedings of the sixth conference on Applied natural language processing,
pages 7–12, 2000.
[60] S. Harabagiu, R. Bunescu, and S. J. Maiorano. Text and knowledge mining
for coreference resolution. In Proceedings of the Second Meeting of the North
American Chapter of the Association for Computational Linguistics, pages 55–
62, 2001.
[61] Z. Harris. Distributional structure. Word, 10(23):146–162, 1954.
[62] T. Hasegawa, S. Sekine, and R. Grishman. Discovering relations among named
entities from large corpora. In Proceedings of Association for Computational
Linguistics, pages 415–422, 2004.
[63] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In
267
Proceedings of International Conference on Computational Linguistics, pages
539–545, 1992.
[64] S.C. Herring and J.C. Paolillo. Gender and genre variation in weblogs. Journal
of Sociolinguistics, 10(4):439–459, 2006.
[65] L. Hirschman and N. Chinchor. MUC-7 coreference task definition. In MUC-7
proceedings, 1997.
[66] J Hobbs. Resolving pronoun references. pages 339–352, 1986.
[67] J.R. Hobbs. Overview of the TACITUS Project. Computational Linguistics,
12(3):220–222, 1986.
[68] H. Isahara, F. Bond, K. Uchimoto, M. Utiyama, and K. Kanzaki. Development
of the Japanese WordNet. In Proceedings of the 6th International Conference
on Language Resources and Evaluation (LREC 2008), 2008.
[69] V. Jijkoun, M. de Rijke, and J. Mur. Information extraction for question answer-
ing: improving recall through syntactic patterns. In Proceedings of International
Conference on Computational Linguistics, page 1284, 2004.
[70] H. Jing, N. Kambhatla, and S. Roukos. Extracting social networks and bio-
graphical facts from conversational speech transcripts. In Proceedings of Asso-
ciation for Computational Linguistics, pages 1040–1047, 2007.
268
[71] J. Kivinen and M.K. Warmuth. Exponentiated Gradient versus Gradient De-
scent for Linear Predictors. Information and Computation, 132(1):1–63, 1997.
[72] P. Koehn. In Europarl: A parallel corpus for statistical machine translation,
2005.
[73] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi,
B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open source toolkit for
statistical machine translation. In Proceedings of Association for Computational
Linguistics, companian volume, pages 177–180, 2007.
[74] P. Koehn and K. Knight. Learning a translation lexicon from monolingual
corpora. In Proceedings of Association for Computational Linguistics Workshop
on Unsupervised Lexical Acquisition, pages 9–16, 2002.
[75] P. Koehn and K. Knight. Empirical methods for compound splitting. In Proceed-
ings of the European Chapter of the Association for Computational Linguistics,
Volume 1, pages 187–193, 2003.
[76] M. Koppel, S. Argamon, and A.R. Shimoni. Automatically Categorizing Writ-
ten Texts by Author Gender. Literary and Linguistic Computing, 17(4):401–412,
2002.
[77] M. Kumar, N. Garera, and A. I. Rudnicky. Learning from the report-writing
269
behavior of individuals. In Proceedings of Internation Joint Conference on
Artificial Intelligence, pages 1641–1646, 2007.
[78] S. Kumar and W. Byrne. Minimum Bayes-Risk word alignments of bilingual
texts. In Proceedings of the conference on Empirical methods in natural language
processing, pages 140–147, 2002.
[79] W. Labov. The Social Stratification of English in New York City. Center for
Applied Linguistics, Washington DC, 1966.
[80] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Prob-
abilistic models for segmenting and labeling sequence data. In Proceedings of
International Conference on Machine Learning, pages 282–289, 2001.
[81] T.R. Leek. Information Extraction Using Hidden Markov Models. PhD thesis,
University of California, San Diego, 1997.
[82] W. Lehnert, C. Cardie, D. Fisher, E. Riloff, and R. Williams. University of
Massachusetts: Description of the CIRCUS System as Used for MUC-3. In
Proceedings of the 3rd conference on Message understanding, pages 223–233,
1991.
[83] D. B. Lenat. Cyc: a large-scale investment in knowledge infrastructure. Com-
mun. ACM, 38(11):33–38, 1995.
[84] J.N. Levi. The Syntax and Semantics of Complex Nominals. 1978.
270
[85] D. Lin and P. Pantel. Discovery of inference rules for question-answering. Nat-
ural Language Engineering, 7(04):343–360, 2002.
[86] H. Liu and R. Mihalcea. Of Men, Women, and Computers: Data-Driven Gender
Modeling for Improved User Interfaces. In International Conference on Weblogs
and Social Media, 2007.
[87] Y. Liu, Q. Liu, and S. Lin. Log-linear models for word alignment. In Proceedings
of the 43rd Annual Meeting on Association for Computational Linguistics, pages
459–466, 2005.
[88] R.K.S. Macaulay. Talk that Counts: Age, Gender, and Social Class Differences
in Discourse. Oxford University Press, USA, 2005.
[89] G.S. Mann and D. Yarowsky. Multipath translation lexicon induction via bridge
languages. In Proceedings of North American Chapter of the Association for
Computational Linguistics, pages 151–158, 2001.
[90] G.S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. In
Proceedings of Conference on Natural Language Learning, pages 33–40, 2003.
[91] G.S. Mann and D. Yarowsky. Multi-field information extraction and cross-
document fusion. In Proceedings of Association for Computational Linguistics,
pages 483–490, 2005.
[92] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated
271
corpus of English: the Penn treebank. Computational Linguistics, 19(2):313–
330, 1993.
[93] K. Markert and M. Nissim. Comparing knowledge sources for nominal anaphora
resolution. Computational Linguistics, 31(3):367–402, 2005.
[94] K. Markert, M. Nissim, and N. N. Modjeska. Using the web for nominal
anaphora resolution. In Proceedings of the European Chapter of the Association
for Computational Linguistics Workshop on the Computational Treatment of
Anaphora, pages 39–46, 2003.
[95] R. McDonald, F. Pereira, K. Ribarov, and J. Hajic. Non-projective dependency
parsing using spanning tree algorithms. In Proceedings of Empirical Methods
in Natural Language Processing and Human Language Technologies, pages 523–
530, 2005.
[96] J. Meyer and R. Dale. Mining a corpus to support associative anaphora res-
olution. In Proceedings of the Fourth International Conference on Discourse
Anaphora and Anaphor Resolution, 2002.
[97] G.A. Miller. WordNet: a lexical database for English. 1995.
[98] D.S. Munteanu, A. Fraser, D. Marcu, S. Dumais, D. Marcu, and S. Roukos.
Improved Machine Translation Performance via Parallel Sentence Extraction
from Comparable Corpora. In Proceedings of Human Language Technologies
272
and North American Chapter of the Association for Computational Linguistics,
pages 265–272, 2004.
[99] D.S. Munteanu and D. Marcu. Improving machine translation performance
by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504,
2005.
[100] D. Narayan, D. Chakrabarty, P. Pande, and P. Bhattacharyya. An experience in
building the Indo WordNet-a WordNet for Hindi. In International Conference
on Global WordNet, 2002.
[101] R. Navigli, P. Velardi, and A. Gangemi. Ontology learning and its application
to automated terminology translation. Intelligent Systems, IEEE, 18(1):22–31,
2003.
[102] A. Nenkova and K. McKeown. References to named entities: a corpus study.
Proceedings of Human Language Technologies and North American Chapter of
the Association for Computational Linguistics companion volume, pages 70–72,
2003.
[103] V. Ng and C. Cardie. Improving machine learning approaches to coreference
resolution. In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, pages 104–111, 2002.
[104] J. Nivre, J. Hall, S. Kubler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret.
273
The conference on natural language learning 2007 shared task on dependency
parsing. In Proceedings of the Conference on Natural Language Learning Shared
Task Session of Empirical Methods in Natural Language Processing and Con-
ference on Natural Language Learning, pages 915–932, 2007.
[105] S. Nowson and J. Oberlander. The identity of bloggers: Openness and gender
in personal weblogs. In Proceedings of the American Association for Artifi-
cial Intelligence Spring Symposia on Computational Approaches to Analyzing
Weblogs, 2006.
[106] F.J. Och and H. Ney. Discriminative training and maximum entropy models
for statistical machine translation. In Proc. of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL), volume 8, 2002.
[107] F.J. Och, C. Tillmann, and H. Ney. Improved alignment models for statistical
machine translation. In Proceedings of the conference on on Empirical Methods
in Natural Language Processing and Very Large Corpora, pages 20–28, 1999.
[108] F.J. Och and H. Weber. Improving statistical natural language translation
with categories and rules. In Proceedings of the 17th international conference
on Computational linguistics-Volume 2, pages 985–989, 1998.
[109] M. Pasca, L. Dekang, J. Bigham, A. Lifchits, and A. Jain. Names and similari-
ties on the web: Fact extraction in the fast lane. In Proceedings of Association
274
for Computational Linguistics and International Conference on Computational
Linguistics, pages 809–816, 2006.
[110] M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and searching
the World Wide Web of factsstep one: the one-million fact extraction challenge.
In Proceedings of American Association for Artificial Intelligence, pages 1400–
1405, 2006.
[111] P. Pantel and M. Pennacchiotti. Espresso: Leveraging generic patterns for
automatically harvesting semantic relations. In Proceedings of Association for
Computational Linguistics and International Conference on Computational Lin-
guistics, pages 113–120, 2006.
[112] P. Pantel and D. Ravichandran. Automatically labeling semantic classes. In
Proceedings of Human Language Technologies and North American Chapter of
the Association for Computational Linguistics, pages 321–328, 2004.
[113] P. Pantel, D. Ravichandran, and E. Hovy. Towards terascale knowledge acquisi-
tion. In Proceedings of International Conference on Computational Linguistics,
2004.
[114] M. Pasca, B. V. Durme, and N. Garera. The role of documents vs. queries
in extracting class attributes from text. In Proceedings of the Conference on
Information and Knowledge Management, pages 485–494, 2007.
275
[115] M. Poesio, T. Ishikawa, S. Schulte im Walde, and R. Viera. Acquiring lexical
knowledge for anaphora resolution. In Proccedings of the Third Conference on
Language Resources and Evaluation, pages 1220–1224, 2002.
[116] M. Poesio, R. Mehta, A. Maroudas, and J. Hitzeman. Learning to resolve bridg-
ing references. In Proceedings of the 42nd Annual Meeting of the Association
for Computational Linguistics, pages 143–150, 2004.
[117] M. Poesio, R. Vieira, and S. Teufel. Resolving bridging references in unrestricted
text. In Proceedings of the Association for Computational Linguistics Workshop
on Operational Factors in Robust Anaphora, pages 1–6, 1997.
[118] WV Quine. Natural kinds. Essays in honor of Carl G. Hempel, pages 5–23,
1969.
[119] U. Rackow, I. Dagan, and U. Schwall. Automatic translation of noun com-
pounds. In Proceedings of the International Conference on Computational Lin-
guistics and Volume 4, pages 1249–1253, 1992.
[120] D. Rao, N. Garera, and D. Yarowsky. Jhu1 : An unsupervised approach to
person name disambiguation using web snippets. In Proceedings of the Fourh
International Workshop on Semantic Evaluations (SemEval), pages 199–202,
2007.
[121] R. Rapp. Automatic identification of word translations from unrelated En-
276
glish and German corpora. In Proceedings of Association for Computational
Linguistics, pages 519–526, 1999.
[122] D. Ravichandran and E. Hovy. Learning surface text patterns for a question
answering system. In Proceedings of Association for Computational Linguistics,
pages 41–47, 2002.
[123] Y. Ravin and Z. Kazi. Is Hillary Rodham Clinton the President? Disambiguat-
ing Names across Documents. In Proceedings of Association for Computational
Linguistics, 1999.
[124] M. Remy. Wikipedia: The Free Encyclopedia. Online Information Review Year,
26(6), 2002.
[125] E. Riloff. Automatically Generating Extraction Patterns from Untagged Text.
In Proceedings of American Association for Artificial Intelligence, pages 1044–
1049, 1996.
[126] E. Riloff and R. Jones. Learning dictionaries for information extraction by
multi-level bootstrapping. In Proceedings of American Association for Artificial
Intelligence and Innovative Applications of Artificial Intelligence, pages 474–
479, 1999.
[127] E. Riloff and J. Shepherd. A corpus-based approach for building semantic
lexicons. CoRR, cmp-lg/9706013, 1997.
277
[128] E. S. Ristad and P. N. Yianilos. Learning string edit distance. In Machine
Learning: Proceedings of the Fourteenth International Conference, pages 287–
295, 1997.
[129] M. Ruiz-Casado, E. Alfonseca, and P. Castells. Automatic extraction of seman-
tic relationships for wordnet by means of pattern learning from wikipedia. In
Proceedings of NLDB 2005, 2005.
[130] M. Ruiz-Casado, E. Alfonseca, and P. Castells. From Wikipedia to semantic
relationships: a semiautomated annotation approach. In Proceedings of ESWC,
2006.
[131] A. Sayeed, T. Elsayed, N. Garera, D/ Alexander, T. Xu, D. Oard, D. Yarowsky,
and C. Piatko. Arabic cross-document coreference resolution. In Proceedings
of the Joint Conference of Association of Computational Linguistics and In-
ternational Joint Conference on Natural Language Processing (ACL-IJCNLP),
Conference Short Papers, pages 357–360, 2009.
[132] C. Schafer and D. Yarowsky. Inducing translation lexicons via diverse similarity
measures and bridge languages. In Proceedings of CONLL, pages 146–152, 2002.
[133] C. Schafer and D. Yarowsky. Inducing translation lexicons via diverse similarity
measures and bridge languages. In Proceedings of International Conference on
Computational Linguistics, pages 1–7, 2002.
278
[134] C. Schafer and D. Yarowsky. Exploiting aggregate properties of bilingual dic-
tionaries for distinguishing senses of English words and inducing English sense
clusters. In Proceedings of Association for Computational Linguistics, pages
118–121, 2004.
[135] B. Schiffman, I. Mani, and K.J. Concepcion. Producing biographical sum-
maries: combining linguistic knowledge with corpus statistics. In Proceedings
of Association for Computational Linguistics, pages 458–465, 2001.
[136] J. Schler, M. Koppel, S. Argamon, and J. Pennebaker. Effects of age and gender
on blogging. In Proceedings of the American Association for Artificial Intelli-
gence Spring Symposia on Computational Approaches to Analyzing Weblogs,
2006.
[137] J. Schler, M. Koppel, S. Argamon, and J. Pennebaker. Effects of age and
gender on blogging. In AAAI Spring Symposium on Computational Approaches
to Analyzing Weblogs, 2006.
[138] I. Shafran, M. Riley, and M. Mohri. Voice signatures. In Proceedings of ASRU,
pages 31–36, 2003.
[139] S. Singh. A pilot study on gender differences in conversational speech on lexical
richness measures. Literary and Linguistic Computing, 16(3):251–264, 2001.
[140] R. Snow, D. Jurafsky, and A. Y. Ng. Semantic taxonomy induction from het-
279
erogenous evidence. In Proceedings of Association for Computational Linguis-
tics and International Conference on Computational Linguistics, pages 801–808,
2006.
[141] S. Soderland. Learning information extraction rules for semi-structured and
free text. Machine learning, 34(1):233–272, 1999.
[142] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to
coreference resolution of noun phrases. Computational Linguistics, 27(4):521–
544, 2001.
[143] M. Strube, S. Rapp, and C. Muller. The influence of minimum edit distance
on reference resolution. In Proceedings of the 2002 Conference on Empirical
Methods in Natural Language Processing, pages 312–319, 2002.
[144] I. Szpektor, H. Tanev, I. Dagan, and B. Coppola. Scaling web-based acquisition
of entailment relations. In Proceedings of Empirical Methods in Natural
Language Processing, 2004.
[145] T. Tanaka and T. Baldwin. Noun-Noun Compound Machine Translation: A
Feasibility Study on Shallow Processing. In Proceedings of the Association for
Computational Linguistics Workshop on Multiword Expressions, pages 17–24,
2003.
[146] M. Thelen and E. Riloff. A bootstrapping method for learning semantic lexicons
using extraction pattern contexts. In Proceedings of Empirical Methods in
Natural Language Processing, pages 214–221, 2002.
[147] K. Toutanova, H. T. Ilhan, and C. D. Manning. Extensions to HMM-based
statistical word alignment models. In Proceedings of Empirical Methods in
Natural Language Processing, pages 87–94, 2002.
[148] B. Ustun. A Comparison of Support Vector Machines and Partial Least Squares
Regression on Spectral Data, 2003.
[149] R. Vieira and M. Poesio. An empirically-based system for processing definite
descriptions. Computational Linguistics, 26(4):539–593, 2000.
[150] S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical
translation. In Proceedings of the 16th Conference on Computational Linguistics,
pages 836–841, 1996.
[151] P. Vossen. EuroWordNet: a multilingual database with lexical semantic networks.
Computational Linguistics, 25, 1998.
[152] N. Wacholder, Y. Ravin, and M. Choi. Disambiguation of proper names in text.
In Proceedings of ANLP, pages 202–208, 1997.
[153] C. Walker, S. Strassel, J. Medero, and K. Maeda. ACE 2005 Multilingual
Training Corpus. Linguistic Data Consortium, 2006.
[154] R. Weischedel, J. Xu, and A. Licuanan. A Hybrid Approach to Answering
Biographical Questions. New Directions In Question Answering, pages 59–70,
2004.
[155] M. Wick, A. Culotta, and A. McCallum. Learning field compatibilities to ex-
tract database records from unstructured text. In Proceedings of Empirical
Methods in Natural Language Processing, pages 603–611, 2006.
[156] D. Widdows. Unsupervised methods for developing taxonomies by combining
syntactic and statistical information. In Proceedings of Human Language Tech-
nologies and North American Chapter of the Association for Computational
Linguistics, pages 197–204, 2003.
[157] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. ACM SIGMOD Record, 31(1):76–77,
2002.
[158] D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
[159] X. Yang, G. Zhou, J. Su, and C. L. Tan. Coreference resolution using com-
petition learning approach. In Proceedings of the 41st Annual Meeting of the
Association for Computational Linguistics, pages 176–183, 2003.
[160] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised
methods. In Proceedings of Association for Computational Linguistics, pages
189–196, 1995.
[161] J. Zhang, J. Gao, and M. Zhou. Extraction of Chinese compound words: an
experimental study on a very large corpus. In Proceedings of the Second Workshop
on Chinese Language Processing, pages 132–139, 2000.
[162] L. Zhou, M. Ticrea, and E. Hovy. Multidocument biography summarization.
In Proceedings of Empirical Methods in Natural Language Processing, pages
434–441, 2004.
Vita
Nikesh Garera grew up in Mumbai, India, where he also attended college, obtaining
his bachelor's degree in Computer Engineering from the University of Mumbai in
May 2002. He came to the USA to pursue his graduate studies and earned his Master
of Science in Language Technologies from the School of Computer Science at
Carnegie Mellon University in May 2005. He then moved on to doctoral studies in
the Computer Science Department at Johns Hopkins University, where he earned his
Master of Science in Computer Science in May 2007 and his Doctor of Philosophy in
Computer Science in September 2009.