
TinkerBell: Cross-lingual Cold-Start Knowledge Base Construction

Mohamed Al-Badrashiny1, Jason Bolton5, Arun Tejavsi Chaganty5, Kevin Clark5, Craig Harman3, Lifu Huang4, Matthew Lamm5, Jinhao Lei5, Di Lu4, Xiaoman Pan4, Ashwin Paranjape5, Ellie Pavlick6, Haoruo Peng7, Peng Qi5, Pushpendre Rastogi3, Abigail See5, Kai Sun2, Max Thomas3, Chen-Tse Tsai7, Hao Wu6, Boliang Zhang4, Chris Callison-Burch6, Claire Cardie2, Heng Ji4, Christopher Manning5, Smaranda Muresan1, Owen C. Rambow1, Dan Roth6, Mark Sammons7, Benjamin Van Durme3

1 Columbia University   2 Cornell University
[email protected]   [email protected]
3 Johns Hopkins University   4 Rensselaer Polytechnic Institute
[email protected]   [email protected]
5 Stanford University   6 University of Pennsylvania
[email protected]   [email protected]
7 University of Illinois at Urbana-Champaign
[email protected]

Abstract

In this paper we present TinkerBell, a state-of-the-art end-to-end cold-start knowledge base construction system that extracts entity, relation, event and sentiment knowledge from three languages (English, Chinese and Spanish).

1 Introduction

The TinkerBell team has developed the first end-to-end cold-start knowledge base (KB) construction system for three languages (English, Chinese and Spanish), and achieved top performance in the TAC-KBP2017 evaluation. The overall system architecture is presented in Figure 1.

Using our existing high-performing techniques as building blocks, we have improved each component by developing a series of novel methods, as follows:

• A joint model of name tagging, linking and clustering based on multi-lingual multi-level common space construction.

• Joint transliteration and sub-word alignment for cross-lingual entity linking: using pairs of Wikipedia titles from interlanguage Wikipedia links, we jointly model sub-word alignment and transliteration to compare multi-token name mentions across languages.

Figure 1: TinkerBell System Overview

• Joint inference between entity discovery and linking (EDL) and slot filling (SF): for the first time, we tightly integrated EDL results into SF and achieved significant F-score gains on SF for all three languages.

• Event extraction: we developed a novel dependency-relation-based attention mechanism for event argument extraction.

• Sentiment analysis (BeSt): we used a target-focused method, which we augmented with a polarity chooser and trained for the only-entity-target task.

• Cross-lingual coreference resolution: we developed the first cross-document, cross-lingual joint entity and event coreference resolution component.

In the following sections we present a detailed quantitative and qualitative analysis of each component and discuss future research directions. We also briefly present an entity recommendation demonstration system that searches the tri-lingual KB constructed by TinkerBell and recommends entities for user queries.

2 Entity Discovery and Linking

2.1 English and Chinese EDL

Named Mention Extraction: We treat name tagging as a sequence labeling problem: each token in a sentence is tagged as the Beginning (B), Inside (I) or Outside (O) of a name mention with one of five types: Person (PER), Organization (ORG), Geo-political Entity (GPE), Location (LOC) and Facility (FAC). Predicting the tag for each token needs evidence from both its previous context and its future context in the sentence. Bi-LSTM networks (Graves et al., 2013; Lample et al., 2016) meet this need by processing each sequence in both directions with two separate hidden layers, which are then fed into the same output layer. Moreover, there are strong classification dependencies among the name tags in a sequence; for example, "I-LOC" cannot follow "B-ORG". A CRF model, which is particularly good at jointly modeling tagging decisions, can therefore be built on top of the Bi-LSTM networks. External information such as gazetteers and Brown clustering has proved beneficial for name tagging. We use an additional Bi-LSTM to consume the external feature embeddings of each token and concatenate both Bi-LSTM encodings (of feature embeddings and word embeddings) before the output layer. Our model is depicted in Figure 2.

We set the word input dimension to 100, the word LSTM hidden layer dimension to 100, the character input dimension to 50, the character LSTM hidden layer dimension to 25, and the input dropout rate to 0.5, and we use stochastic gradient descent with learning rate 0.01 for optimization.
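To make the architecture concrete, here is a minimal PyTorch sketch (ours, not the TinkerBell implementation): one Bi-LSTM over word embeddings, a second Bi-LSTM over external feature embeddings, and a concatenation before per-token tag scoring. The character-level encoder and the CRF transition layer are omitted for brevity, and all names and sizes beyond those stated above are illustrative.

```python
# A minimal sketch of the Bi-LSTM tagger with a second Bi-LSTM over external
# feature embeddings; the CRF output layer is replaced by a per-token linear
# scoring layer for brevity.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, feat_vocab_size, num_tags,
                 word_dim=100, word_hidden=100, feat_dim=50, feat_hidden=25):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.feat_emb = nn.Embedding(feat_vocab_size, feat_dim)
        # One Bi-LSTM over word embeddings, one over external feature embeddings.
        self.word_lstm = nn.LSTM(word_dim, word_hidden, bidirectional=True, batch_first=True)
        self.feat_lstm = nn.LSTM(feat_dim, feat_hidden, bidirectional=True, batch_first=True)
        # Both encodings are concatenated before the output layer.
        self.out = nn.Linear(2 * word_hidden + 2 * feat_hidden, num_tags)

    def forward(self, word_ids, feat_ids):
        w, _ = self.word_lstm(self.word_emb(word_ids))   # (B, T, 2*word_hidden)
        f, _ = self.feat_lstm(self.feat_emb(feat_ids))   # (B, T, 2*feat_hidden)
        return self.out(torch.cat([w, f], dim=-1))       # per-token tag scores

# 11 tags: B/I for each of the five entity types, plus O.
tagger = BiLSTMTagger(vocab_size=20000, feat_vocab_size=500, num_tags=11)
scores = tagger(torch.randint(0, 20000, (1, 8)), torch.randint(0, 500, (1, 8)))
print(scores.shape)  # torch.Size([1, 8, 11])
```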

Nominal and Pronominal Mention Extraction: We utilize a deep neural network based entity coreference resolution system (Clark and Manning, 2016) from the Stanford CoreNLP toolkit (Manning et al., 2014) to extract nominal and pronominal mentions.

Name Translation: We translate Chinese mentions into English using name translation dictionaries mined with the approaches described in Ji et al. (2009) and Pan et al. (2017). If a Chinese entity mention cannot be translated, we use Pinyin to transliterate it. In addition, we create a corpus that contains Chinese words and English entities from Chinese Wikipedia by replacing Chinese anchor links with English entity IDs using cross-lingual links. Using this approach, we learn distributed representations of multi-lingual words and English entities to match Chinese mentions against English candidate entities in the KB.

Entity Linking: Given a set of entity mentions M = {m_1, m_2, ..., m_n}, we first generate an initial list of candidate entities E_m = {e_1, e_2, ..., e_n} for each entity mention m, and then rank them to select the candidate entity with the highest score as the appropriate entity for linking.

We adopt a dictionary-based candidate generation approach (Medelyan and Legg, 2008). In order to improve the coverage of the dictionary, we also generate a secondary dictionary by normalizing all keys in the primary dictionary using the NYSIIS phonetic algorithm (Taft, 1970). If an entity mention m is not in the primary dictionary, we use the secondary dictionary to generate candidates.
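A minimal sketch of this two-level lookup, assuming the `jellyfish` package for the NYSIIS encoding; the dictionary contents are toy examples, not the real primary dictionary.

```python
# Sketch of two-level candidate generation: exact lookup in the primary
# dictionary, then fallback to a secondary dictionary keyed by NYSIIS codes.
import jellyfish

primary = {"barack obama": ["Barack_Obama"],
           "obama": ["Barack_Obama", "Michelle_Obama"]}

# Secondary dictionary: every primary key re-indexed by its NYSIIS code.
secondary = {}
for key, ents in primary.items():
    secondary.setdefault(jellyfish.nysiis(key), []).extend(ents)

def generate_candidates(mention):
    m = mention.lower()
    if m in primary:                                  # exact surface-form match first
        return primary[m]
    return secondary.get(jellyfish.nysiis(m), [])     # fall back to phonetic match

# The phonetic fallback may map a misspelling to the same code as the true key.
print(generate_candidates("Barak Obama"))
```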

Then we rank these entity candidates based on three measures: salience, similarity and coherence (Pan et al., 2015).

We utilize Wikipedia anchor links to compute salience based on entity prior:

p_prior(e) = |A_{*,e}| / |A_{*,*}|    (1)

where A_{*,e} is the set of anchor links that point to entity e, and A_{*,*} is the set of all anchor links in Wikipedia. We define the mention-to-entity probability as

p_mention(e|m) = |A_{m,e}| / |A_{m,*}|    (2)

where A_{m,*} is the set of anchor links with the same anchor text m, and A_{m,e} is the subset of A_{m,*} that points to entity e.
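As a small illustration of Equations (1) and (2), both quantities can be computed directly from a list of (anchor text, entity) pairs; the data below is toy data.

```python
# Sketch of the entity-prior and mention-to-entity probabilities of
# Equations (1) and (2), computed from harvested Wikipedia anchor links.
from collections import Counter

anchors = [("obama", "Barack_Obama"), ("obama", "Michelle_Obama"),
           ("obama", "Barack_Obama"), ("chicago", "Chicago")]

entity_count = Counter(e for _, e in anchors)    # |A_{*,e}|
pair_count = Counter(anchors)                    # |A_{m,e}|
mention_count = Counter(m for m, _ in anchors)   # |A_{m,*}|
total = len(anchors)                             # |A_{*,*}|

def p_prior(e):
    return entity_count[e] / total               # Equation (1)

def p_mention(e, m):
    return pair_count[(m, e)] / mention_count[m]  # Equation (2)

print(p_prior("Barack_Obama"), p_mention("Barack_Obama", "obama"))  # 0.5 ~0.667
```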

Then we compute the similarity between a mention and each candidate entity. We first utilize the entity types of mentions extracted by name tagging. For each entity e in the KB, we assign a coarse-grained entity type t (PER, ORG, GPE, LOC, or Miscellaneous (MISC)) using a Maximum Entropy based entity classifier (Pan et al., 2017).



Figure 2: Name Tagging Model with Explicit Linguistic Features.

We incorporate entity types by combining them with the mention-to-entity probability p_mention(e|m) (Ling et al., 2015):

p_type(e|m, t) = p(e|m) / Σ_{e ↦ t} p(e|m)    (3)

where e ↦ t indicates that t is the entity type of e. We also adopt a neural network model that jointly learns distributed representations of words and entities from Wikipedia (Yamada et al., 2017; Cao et al., 2017). Considering all Wikipedia anchor links as entity annotations, a training corpus can be created by replacing anchor links with unique entity IDs. Such a corpus can be used to train the distributed representations of words and entities simultaneously. For each entity mention m, we build the vector representation of its context v_t from the vector representations of the words in the context (excluding the entity mention itself and stop words). Then we compute the cosine similarity between the vector representation v_e of each candidate entity and v_t, which measures the similarity p_sim(m, e) between the mention and the entity.

Following Huang et al. (2017), we construct a weighted undirected graph G = (E, D) from DBpedia, where E is the set of all entities in DBpedia and d_ij ∈ D indicates that entities e_i and e_j share some DBpedia properties. The weight w_ij of d_ij is computed as:

w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|)    (4)

where p_i, p_j are the sets of DBpedia properties of e_i and e_j respectively. After constructing the knowledge graph, we apply the graph embedding framework proposed by Tang et al. (2015) to generate knowledge representations for all entities in the KB. We compute the cosine similarity between the vector representations of two entities to model their coherence coh(e_i, e_j). Given an entity mention m and a candidate entity e, we define the coherence score as:

p_coh(e) = (1 / |C_m|) Σ_{c ∈ C_m} coh(e, c)    (5)

where C_m is the union of entities for coherent mentions of m.
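A minimal sketch of Equations (4) and (5); `coh` is passed in as a function (in our system it is the cosine similarity of the graph embeddings), and all names are illustrative.

```python
# Sketch of Equations (4)-(5): property-overlap edge weights and the coherence
# score of a candidate entity against the entities of coherent mentions.
def edge_weight(props_i, props_j):
    # w_ij = |p_i ∩ p_j| / max(|p_i|, |p_j|)              Equation (4)
    return len(props_i & props_j) / max(len(props_i), len(props_j))

def p_coh(candidate, coherent_entities, coh):
    # p_coh(e) = (1 / |C_m|) * sum over c in C_m of coh(e, c)   Equation (5)
    return sum(coh(candidate, c) for c in coherent_entities) / len(coherent_entities)

print(edge_weight({"party", "office", "spouse"}, {"party", "office", "birthPlace"}))
```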

Finally, we combine these measures and compute the final score for each candidate entity e.

NIL Clustering: For entity mentions that cannot be linked to the KB, we apply the heuristic rules described in Table 1 to cluster these NIL entity mentions. For each cluster, we assign the most frequent entity mention as the document-level canonical mention.

2.2 Spanish EDL

The Spanish Entity Detection and Linking system is based on the Illinois Cross-Lingual Wikifier (Tsai and Roth, 2016; Tsai et al., 2016).

Exact match: Create initial clusters based on mention surface form.
Normalization: Normalize surface forms (e.g., remove designators and stop words) and group mentions with the same normalized surface form.
NYSIIS (Taft, 1970): Obtain the NYSIIS soundex representation of each mention and group mentions with the same representation longer than 4 letters.
Edit distance: Cluster two mentions if the edit distance between their normalized surface forms is equal to or smaller than D, where D = length(mention1)/8 + 1.
Translation: Merge two clusters if they include mentions with the same translation.

Table 1: Heuristic Rules for NIL Clustering.
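As an illustration of the edit-distance rule from Table 1, here is a minimal sketch (names are ours; the normalization and the other rules are omitted):

```python
# Two mentions fall in the same NIL cluster under the edit-distance rule when
# the Levenshtein distance of their normalized surface forms is at most
# D = length(mention1) / 8 + 1.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_distance_rule(m1, m2):
    d = len(m1) // 8 + 1
    return levenshtein(m1, m2) <= d

print(edit_distance_rule("tinkerbel", "tinkerbell"))  # True: distance 1 <= 2
```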

Named Mention Extraction: The Illinois Cross-Lingual Wikifier (XLWikifier) extends the publicly available Illinois Named Entity Recognition (NER) system (Ratinov and Roth, 2009; Redman et al., 2016) to detect named entities in the cross-lingual setting. The cross-lingual NER is language independent, leveraging wikification to yield language-independent features based on Wikipedia categories and Freebase types. Although the main idea of Tsai et al. (2016) is to train a model on one language and apply it directly to another, the authors also show that the newly proposed wikifier features are useful in monolingual models.

For the training data, we use the TAC EDL 2015 Spanish training and evaluation documents, the TAC EDL 2016 Spanish evaluation documents, and the Spanish ERE datasets. The model is trained only on the Spanish training data, so it is a monolingual model.

Nominal and Pronominal Mention Extraction: For Spanish nominal and pronominal detection, we take Illinois NER as the base model and include only the following features: the word itself, the neighboring words, and the Brown cluster paths of these words, since other NER features such as gazetteer features and word-shape features are not useful for identifying nominal or pronominal mentions.

We use the nominal mentions in the TAC EDL 2016 Spanish evaluation data as training examples. We find that including the nominal mentions in the ERE dataset does not improve performance. For the pronominal detector, since there is no gold annotation in the previous TAC shared tasks, we train only on the ERE dataset.

Entity Linking: The next step is to ground the extracted Spanish named entity mentions to the English Wikipedia. We apply the model proposed in Tsai and Roth (2016), which uses cross-lingual word and title embeddings to compute similarities between a foreign mention and English title candidates. If a mention is grounded to a Wikipedia entry, we then obtain the corresponding Freebase ID using the links between Wikipedia titles and Freebase entries.

NIL Clustering: For the named entity mentions that could not be grounded to the knowledge base, we try to group them with other named entities using the NIL clustering algorithm we developed for TAC EDL 2015 (Sammons et al., 2015). The initial clustering is based on the wikification result, where each NIL mention forms a singleton cluster. These initial clusters are sorted by size, and we merge them greedily: a smaller cluster is merged into a larger cluster if any pair of mentions from the two clusters is sufficiently similar. The similarity between two mentions is based on the Jaccard similarity of their surface strings.

Co-reference Between Named Entity Mentions and Other Mentions: The above two steps are performed only on named entity mentions, since the Illinois Cross-Lingual Wikifier focuses on named entities. In the final step, we try to link nominal and pronominal mentions to the named entity mentions, that is, resolving the co-reference problem between nominal/pronominal nouns and named entities.

We apply the following simple heuristic rules to nominal mentions. For each nominal mention, we find the closest mention (either a nominal or a name) to its left that has the same type (PER, ORG, GPE, ...). If this closest mention is a nominal, its surface form is also required to be identical. We then add the nominal mention to the cluster of the closest matching mention. Note that we limit how far back we look with a predefined threshold; if no suitable mention is found within this window, the current nominal mention is discarded, since it could refer to some generic noun rather than a specific entity.
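A minimal sketch of this heuristic, with illustrative attribute names (`type`, `is_nominal`, `surface`, `offset`) that are ours, not the system's:

```python
# Attach a nominal mention to the closest preceding mention of the same type
# within a fixed window; identical surface forms are required when that
# antecedent is itself a nominal.
from collections import namedtuple

Mention = namedtuple("Mention", "type is_nominal surface offset")

def resolve_nominal(nominal, preceding_mentions, window=50):
    # preceding_mentions: mentions before `nominal`, nearest first.
    for cand in preceding_mentions:
        if nominal.offset - cand.offset > window:
            break                      # outside the look-back window: give up
        if cand.type != nominal.type:
            continue
        if cand.is_nominal and cand.surface != nominal.surface:
            continue
        return cand                    # add `nominal` to this mention's cluster
    return None                        # likely a generic noun: discard

prev = [Mention("PER", False, "Barack Obama", 10)]
print(resolve_nominal(Mention("PER", True, "the president", 14), prev))
```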

We apply different rules for different types of pronominal mentions. First and second person pronouns usually appear only in discussion forum documents; we try to resolve them to the authors of posts. We link a first person pronoun to the author mention of the current post; if no author is found, we link it to the previous named entity mention. For a second person pronoun, we link it to the author of the quoted post. If the current post does not quote any previous post (a quoted post would be copied into the current post), we link the second person pronoun to the author of the previous post.

For a third person pronoun, we link it to the previous PER named entity that has the same gender as the pronoun. To determine the gender of a named entity, we count the number of female and male pronouns in the corresponding Wikipedia page: if there are more female pronouns, we classify the entity as female. If no appropriate named entity is found before the target third person pronoun, we simply link it to the previous named entity mention.

3 Slot Filling

Let us now look at the slot filling component of the knowledge base construction system. While the specifics of the slot filling systems for each language (English, Chinese and Spanish) differ, they share the following pipeline: (1) we first construct mention-pair candidates for every type-compatible pair of entities in each sentence of the corpus (Subsection 3.1), (2) we predict a relation for each of these candidates using one of several relation extractors (Subsections 3.2–3.4), and finally (3) we combine the relation predictions from the relation classifiers (Subsection 3.5). We briefly describe each component of the pipeline in this section.

3.1 Mention-pair candidate generation

In the first stage of our pipeline, the entire document corpus is processed using Stanford CoreNLP's annotators (Manning et al., 2014), including a tokenizer, POS tagger, parsers, a coreference system, and a fine-grained named entity recognition (NER) system. While we are able to use linked entities from the EDL systems (Section 2), the coreference system is critical for extracting relations from pronominal mentions, and the fine-grained NER system lets us identify possible slot candidates for string-valued relations such as per:title or org:date_founded.

The fine-grained NER system uses a combination of SUTime (English and Chinese) and HeidelTime (Spanish) to identify date expressions, and a manually constructed gazetteer of TokensRegex patterns for each language.

To generate candidates, we simply consider every pair of entity or slot-value mentions in a sentence that have compatible types (for example, we might consider a PERSON mention and a TITLE mention as a valid pair, but not an ORGANIZATION and a TITLE).
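A minimal sketch of this step; the type-compatibility set below is an illustrative subset, not the full list used by the system.

```python
# Every type-compatible ordered pair of mentions in a sentence becomes a
# relation candidate.
from itertools import permutations

COMPATIBLE = {("PERSON", "TITLE"), ("PERSON", "ORGANIZATION"),
              ("ORGANIZATION", "DATE"), ("ORGANIZATION", "CITY")}

def candidate_pairs(mentions):
    # mentions: list of (text, ner_type) tuples for one sentence.
    return [(s, o) for s, o in permutations(mentions, 2)
            if (s[1], o[1]) in COMPATIBLE]

sent = [("Obama", "PERSON"), ("president", "TITLE"), ("U.S.", "ORGANIZATION")]
print(candidate_pairs(sent))
```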

3.2 Pattern-based systems

We have five rule-based extractors in total.

TokensRegex and Semgrex. The first set comprises a Semgrex pattern system and a TokensRegex (Chang and Manning, 2014) pattern system. TokensRegex patterns search for specific templates (specified via a regular expression) over the word, lemma, POS and NER sequences of a sentence. Semgrex patterns, on the other hand, operate on the dependency graph of a sentence and trigger a relation prediction once a specific predefined dependency pattern is matched between two entities.
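For illustration, hypothetical patterns in the two styles might look as follows; these are our own sketches, not the patterns actually shipped with the system, and the exact syntax may vary across CoreNLP versions.

```python
# Hypothetical examples of the two pattern styles. A TokensRegex rule matches
# over token attributes in sequence; a Semgrex rule matches a configuration in
# the dependency graph.
tokensregex_per_title = (
    # e.g. "Smith , chief executive": a PERSON, a comma, up to two tokens,
    # then a TITLE span.
    "([{ner:PERSON}]+) /,/ []{0,2} [{ner:TITLE}]+"
)

semgrex_per_employee_of = (
    # A PERSON that is the nsubj of "work", where "work" governs an
    # ORGANIZATION via an nmod:for edge.
    "{ner:PERSON}=subject <nsubj ({lemma:work} >/nmod:for/ {ner:ORGANIZATION}=object)"
)
```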

We reused all patterns from Stanford's 2016 KBP system (Zhang et al., 2016), and added a number of new patterns by hill-climbing on the 2016 development set. Of the two, we found that the dependency-based patterns (originally developed for English) were particularly effective at transferring across languages. Of course, it is important to have a high-quality dependency parser to use Semgrex patterns: we used the neural dependency parser of Chen and Manning (2014) for English and Chinese, and that of Dozat and Manning (2016) for Spanish. The output of the two pattern extractors is expected to be fairly precise.

Relation-specific extractors. Next, we have three relation-specific rule-based extractors: altnames, websites and gpe-mentions. altnames infers alternate names of organizations and people from the coreference chains of a document, and websites compares the edit distance between an organization name and a URL to give high-precision predictions of the org:website relation. Both are described in more detail in Angeli et al. (2015). Finally, gpe-mentions uses the occurrence of location names within organization entities (for example, University of [California], [Berkeley]) to identify location-of-headquarters relations.
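A minimal sketch of the idea behind websites, using a string-similarity ratio as a stand-in for the edit-distance comparison; the threshold and names are illustrative.

```python
# Compare an organization name with a URL's domain and emit org:website only
# when the two are sufficiently similar.
from difflib import SequenceMatcher
from urllib.parse import urlparse

def website_match(org_name, url, threshold=0.8):
    domain = urlparse(url).netloc.split(":")[0]
    core = domain.removeprefix("www.").rsplit(".", 1)[0]   # strip "www." and the TLD
    name = org_name.lower().replace(" ", "")
    return SequenceMatcher(None, name, core).ratio() >= threshold

print(website_match("TinkerBell Corp", "http://www.tinkerbellcorp.com"))  # True
```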


Figure 3: The position-aware neural sequence model for relation extraction. The model is shown with an example sentence "Mike and Lisa got married."

3.3 Supervised Logistic Classifier

We reused the self-trained supervised extractor from Angeli et al. (2015). At the core of this system is a traditional logistic-regression classifier with manually crafted features. We first ran the union of our pattern extractors and an Open IE system over the entire corpus. Since both systems have high precision, we collected their positive predictions to form a training dataset. We added this bootstrapped training set, along with a set of presumed negative examples, to a pre-collected supervised training set (Angeli et al., 2014), and used the entire dataset to retrain the classifier. We then took the output as the new training dataset and repeated this process for another iteration. In this way, we trained our statistical models on the output of our own classifiers. We also applied other tricks to avoid class skew and overfitting.

3.4 Neural network

Our neural network-based extractor uses an LSTM with position-aware attention (Zhang et al., 2017), as pictured in Figure 3. The model takes the original sentence as input and generates embedding vectors for each word through a lookup layer, which are then fed into an LSTM layer. We replace the subject and object entities with special <subject> and <object> tokens, and include features for each token that describe its offset from the subject and the object respectively. When predicting a relation, the output layer attends to a combination of the LSTM outputs and the position features.
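A simplified PyTorch sketch of this model follows; it is ours, not the exact architecture of Zhang et al. (2017), and dimensions and names are illustrative.

```python
# Simplified position-aware attention model: token offsets relative to subject
# and object are embedded and combined with the LSTM outputs to compute
# attention weights, whose weighted sum feeds the relation classifier.
import torch
import torch.nn as nn

class PositionAwareRE(nn.Module):
    def __init__(self, vocab, num_rel, max_len=100, d_word=100, d_pos=20, d_hid=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, d_word)
        # Offsets in [-max_len, max_len] are shifted to non-negative indices.
        self.pos_emb = nn.Embedding(2 * max_len + 1, d_pos)
        self.max_len = max_len
        self.lstm = nn.LSTM(d_word, d_hid, batch_first=True)
        self.attn = nn.Linear(d_hid + 2 * d_pos + d_hid, 1)
        self.out = nn.Linear(d_hid, num_rel)

    def forward(self, tokens, subj_off, obj_off):
        h, (q, _) = self.lstm(self.word_emb(tokens))     # h: (B, T, d_hid)
        q = q[-1].unsqueeze(1).expand_as(h)              # sentence summary per token
        ps = self.pos_emb(subj_off + self.max_len)       # subject offset features
        po = self.pos_emb(obj_off + self.max_len)        # object offset features
        a = torch.softmax(self.attn(torch.cat([h, ps, po, q], -1)).squeeze(-1), -1)
        z = (a.unsqueeze(-1) * h).sum(1)                 # attention-weighted sum
        return self.out(z)                               # relation scores

model = PositionAwareRE(vocab=5000, num_rel=42)
toks = torch.randint(0, 5000, (1, 6))
subj = torch.tensor([[0, -1, -2, -3, -4, -5]])           # offsets from subject
obj = torch.tensor([[2, 1, 0, -1, -2, -3]])              # offsets from object
print(model(toks, subj, obj).shape)                      # torch.Size([1, 42])
```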

Our neural network model is trained on a fully supervised dataset constructed from previous years' KBP Slot Filling assessment files and labeled via online crowdsourcing.

3.5 Combining predictions

Once we have relation predictions from each of these systems, we combine their output by simply adding the output probabilities from each system that predicts a relation. Finally, we use a set of heuristic post-processing filters to remove spurious slot fills, e.g., ensuring that city-of and state-of relations agree.
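A minimal sketch of this ensembling step; the relation names and probabilities are toy values.

```python
# Sum each extractor's probability per relation and keep the highest-scoring
# relation; heuristic post-processing filters would be applied afterwards.
from collections import defaultdict

def combine(predictions):
    # predictions: list of dicts {relation: probability}, one per extractor.
    totals = defaultdict(float)
    for pred in predictions:
        for rel, p in pred.items():
            totals[rel] += p
    return max(totals.items(), key=lambda kv: kv[1]) if totals else None

print(combine([{"per:title": 0.9}, {"per:title": 0.6, "per:employee_of": 0.7}]))
# ('per:title', 1.5)
```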

4 Event Extraction and Coreference Resolution

4.1 English and Spanish Event Extraction and Co-reference Resolution

Since the task includes three sub-tasks (event nugget detection with types, realis label assignment, and event co-reference), we use a stage-wise classification approach to extract all events.

We first train a 34-class classifier (33 event subtypes and one non-event class) to detect event nuggets and classify them into types. We then train a realis classifier to decide the realis label for each event nugget. The event type classifier and the realis classifier share the same set of features. During inference, we only keep events of the 18 types that the task guideline requires. For event co-reference, we train a classifier to model the similarity between each event nugget pair. We then implement a greedy inference procedure that examines each detected event nugget from left to right, making co-reference decisions based on the similarity scores between the targeted event nugget and its antecedents (also processed from left to right).

Event Candidate Generation
We use the Illinois SRL (Punyakanok et al., 2008) to pre-process the input text. We treat all verb and noun predicates as event candidates. We analyzed the SRL predicate coverage of event triggers in previous work (Peng et al., 2016); results are shown in Table 2. Here we focus only on recall, since we expect the event nugget classifier to filter out most non-trigger predicates. The results show that SRL predicates provide good coverage of event triggers.

Features for Event Nugget Detection
Both the event type and realis classifiers employ the following set of lexical and semantic features:
1) Lexical features: context (part-of-speech tag and lemma) of tokens in a window of size 5 around the candidate token, plus their conjunctions.
2) Seed features: we use 140 seed terms for event triggers. We consider whether a candidate token is a seed or not, and the conjunction of the matched seed with context seeds.
3) Parse tree features: the path from a candidate token to the root, the number of its right/left siblings and their categories, and paths connecting a candidate token with other seeds or named entities.
4) NER features: named entities and their types within a window of size 20 around a candidate token.
5) SRL features: whether a candidate token is a predicate or an argument (in which case, its role), its conjunction with SRL relation names, and the conjunction of the SRL relation name with the NER types in the context.
6) ESA features: the top 200 ESA concepts.
7) Brown cluster features: Brown cluster vectors of prefix lengths 4, 6, 10 and 20.
8) WordNet features: hypernym, hyponym and entailment words.

Features for Event Co-reference
For event co-reference, we train a classifier to model the similarity between each event nugget pair. The features for this classifier are as follows:
1) Nugget features: all features defined above for event nugget detection, applied to the two evaluated events, and their conjunctions.
2) Argument features: all features defined above for event nugget detection, applied to the SRL arguments (A0 and A1) of the two evaluated events, and their conjunctions.
3) Entity features: all features defined above for event nugget detection, and their conjunctions with nugget features.
4) Pair-wise features: the distance and ESA similarities of the two event nuggets.

Learning and Inference Details
We include several learning and inference details of our implemented event pipeline system here:

1. Choice of Learner: We choose SVM to train all three classifiers. We use L2 loss and tune C on a development set.

2. Output Filtering: During inference, after we get results from the event nugget classifier, we only keep events of the 18 types that the task guideline requires.

3. Training Data: We utilize data from the event nugget tracks of both TAC 2015 and TAC 2016. In addition, we also subsample the ACE2005 data to align with the label distribution of the TAC 2016 data.

Spanish Event Pipeline System
We first translate the Spanish documents into English with Google Translate, and then run the English event system as described above. Finally, we map each identified event nugget back to the original Spanish document using a word translation table built ahead of time. In principle, if the translation system produced word alignment information, we could map event nuggets back to the original document without any ambiguity. However, in our implementation, such word alignment information is not available, so we have to devise a way to choose the correct Spanish token for each detected English event nugget. To achieve this, we build a word-level translation table ahead of time. When we cannot find an exact match in the translation table, we choose the token with the least edit distance.
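A minimal sketch of this mapping step, using a string-similarity ratio as a stand-in for least edit distance; the names and the toy table are ours.

```python
# Map a detected English nugget back to a Spanish token: consult the
# word-translation table first, then fall back to the most similar token.
from difflib import SequenceMatcher

def map_nugget(english_nugget, spanish_tokens, translation_table):
    spanish_word = translation_table.get(english_nugget)
    if spanish_word and spanish_word in spanish_tokens:
        return spanish_word                          # exact table match
    # Fallback: the most string-similar token (stand-in for least edit distance).
    return max(spanish_tokens,
               key=lambda t: SequenceMatcher(None, english_nugget, t).ratio())

tokens = ["el", "ataque", "ocurrió", "ayer"]
print(map_nugget("attack", tokens, {"attack": "ataque"}))  # 'ataque'
```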

4.2 Chinese Event Extraction and Coreference Resolution

Chinese Event Nugget Detection
Event nugget detection remains a challenge due to the difficulty of encoding word semantics and word senses in various contexts. Previous approaches depend heavily on language-specific knowledge; however, compared to English, the resources and tools for Chinese are limited and of lower quality. A more promising approach is to automatically learn effective features from data, without relying on language-specific resources.

We developed a language-independent neural network architecture, Bi-LSTM-CRFs, which captures meaningful sequential information and jointly models nugget type decisions for event nugget detection. This architecture is similar to the ones in (Yu et al., 2016; Feng et al., 2016).

Given a sentence X = (X_1, X_2, ..., X_n) and its corresponding tags Y = (Y_1, Y_2, ..., Y_n), where n is the number of units in the sequence, we initialize each word with a vector by looking up word embeddings. Specifically, we use the Skip-Gram model to pre-train the word embeddings (Mikolov et al., 2013). The sequence of words in each sentence is then fed into the Bi-LSTM to obtain meaningful, contextual features. We feed these features into a CRF and maximize the log-probability of the tag predictions for the sequence.

Chinese Event Nugget Realis Prediction
For realis prediction, previous methods usually rely on hand-crafted features and dictionaries (Hong et al., 2015). Considering the limited resources for other languages, we apply Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012), incorporating both the event nugget and its context to predict the realis type.

The general architecture of the CNNs is similar to the one used in our last year's system (Yu et al., 2016). For each event nugget candidate, we apply a standard CNN to both the left and right context of the nugget candidate, and concatenate the max-pooling outputs of both CNNs, as well as the representation of the nugget candidate, as input to a fully connected Multi-Layer Perceptron (MLP). Finally, we use a Softmax function to predict the realis type.

Chinese Event Argument Extraction
For Chinese event argument extraction, given a sentence, we first adopt the Chinese event nugget detection system to identify candidate triggers, and utilize the Chinese EDL system to recognize all candidate arguments, including Person, Location, Organization, Geo-Political Entity, Time expressions and Money phrases. For each trigger and candidate argument, we select two types of sequence information as input to our neural architecture: the surface context between the trigger word and the candidate argument, and the shortest dependency path, obtained with the Breadth-First Search (BFS) algorithm over the full dependency parse.

Each word in a sequence is assigned a vector, which is concatenated from vectors for the word embedding, position and POS tag. In order to better capture the relatedness between words, we also adopt CNNs to generate a vector for each word based on its character sequence. For each dependency relation, we randomly initialize a vector with the same dimensionality as the word vectors.

We encode these two types of vector sequences with CNNs and Bi-LSTMs respectively. For the surface-context sequence, we utilize a general CNN architecture with max-pooling to obtain a vector representation. For the dependency-path sequence, we adopt a Bi-LSTM with max-pooling to get an overall vector representation. Finally, we concatenate these two vectors and predict the final argument role with a Softmax function.

Training Details
For all three components above, we utilize all the available Chinese annotation data from DEFT Rich ERE and ACE2005 for training.

Chinese Event Coreference Resolution
Our Chinese cross-document event coreference system is composed of two modules: (1) within-document event coreference, and (2) cross-document event coreference.

Within-document Event Coreference: Our within-document event nugget coreference system is based on our last year's system (Yu et al., 2016). We view the event nugget coreference space as an undirected weighted graph in which the nodes represent event nuggets and the edge weights indicate the coreference confidence between two event nuggets. We then apply hierarchical clustering to group event nuggets into event hoppers. To compute the coreference confidence between two events, we train a Maximum Entropy classifier using the features listed in Table 2.

Cross-document Event Coreference: Our cross-document event coreference module is rule-based: we merge the hoppers from two documents if they share more than two event arguments.

Features (EM1: the first event mention, EM2: the second event mention) and remarks:
type subtype match: 1 if the types and subtypes of the event nuggets match
trigger pair exact match: 1 if the spellings of the triggers in EM1 and EM2 match exactly
stem of the trigger match†: 1 if the stems of the triggers in EM1 and EM2 match
similarity of the triggers (WordNet)*: quantized semantic similarity score (0-5) using the WordNet resource
similarity of the triggers (word2vec): quantized semantic similarity score (0-5) using word2vec embeddings
POS match*: 1 if the two sentences have the same NNP CD
token dist: number of tokens between the triggers of EM1 and EM2 (quantized)
realis conflict: 1 if the realis of EM1 and EM2 match exactly
Entity match: number of entities appearing in both the sentence of EM1 and the sentence of EM2
Entity prior: number of entities appearing only in the sentence of EM1
Entity act: number of entities appearing only in the sentence of EM2

Table 2: Features for the event coreference classifier. (∗: English only; †: English and Spanish only.)

5 Belief and Sentiment Extraction

5.1 English and Spanish BeSt

Our system is based on Columbia's belief and sentiment system from TAC 2016 (Rambow et al., 2016).

We extended the system for ColdStart++ in the following ways:

• Data: We used all the data released prior to the 2016 BeSt eval for training, and the 2016 BeSt eval data for development.

• Polarity: Our 2016 BeSt eval system always predicted negative sentiment, as negative sentiment prevails. For ColdStart++ we added a component that identifies sentiment polarity. We chose a system that has high precision on positive sentiment, so that the majority of our predictions remain negative.

• Confidence: We added confidence measures, calculated as a function of the classifier's confidence and the priors of the target by type.

We discuss polarity in more detail. Table 3 shows the polarity distribution over the different entity types for entity mentions as targets of sentiment in the English DF data. The table shows that only 4.82% of the entities are targets of positive sentiment, while 11.46% are targets of negative sentiment, and 83.72% are not targets of sentiment at all. Due to the very low presence of positive polarity in the training data, our system tends to favour negative over positive sentiment. To partially address this problem, we collected the most frequent mention texts with positive polarity into a list, which we later use to modify the polarity of our output BeSt file: if the polarity of a certain entity mention is negative and its corresponding text is found in this positive words list, we change the polarity to positive. Table 4 shows the positive words list we use.

Type   POS     NEG      None
FAC    0.07%   0.21%    2.31%
GPE    0.62%   1.22%    10.51%
LOC    0.08%   0.18%    2.56%
ORG    0.61%   1.71%    8.10%
PER    3.44%   8.15%    60.24%
Total  4.82%   11.46%   83.72%

Table 3: Sentiment polarity distribution over the different entity types.

al sharpton; all of them; all the men in my family; america; britain; british; brother; country; here; hes; i; iraqis; israel; it; jeb; marines; me; mum; my; my family; my grandfather; my mum; my parents; navy; navy seals; our troops; ows; self made man; some hard motherf'ers; that child; the child; the country; the feds; the man; the monarchy

Table 4: Positive words list.


5.2 Chinese BeSt

Our system is based on Cornell's belief and sentiment system from TAC 2016 (Niculae et al., 2016). Specifically, for sentiment, the source extraction component is a rule-based model, which locates source candidates based on post authors and words for reporting speech.

#reported    0     1     2     3     4     5     6     7
c_sentiment  0.00  0.10  0.30  0.50  0.70  0.90  0.95  1.00

Table 5: Chinese BeSt confidence settings. #reported refers to the number of system versions reporting the sentiment.

The target extraction component consists of (a) a neural network that extracts the polarity of a sentence/mention text/trigger, and (b) a rule-based model that outputs the final polarity of a target candidate based on the output of model (a) and a number of high-level features such as target types. The changes made for ColdStart++ are as follows:

• Data: More data is used by our system. In particular, the BeSt 2016 eval data is added for model training, and the dictionaries used by our model are extended with more Chinese slang and idioms.

• Confidence: Our system is optimized with the Fβ measure. Concretely, seven versions of the system with different β are trained (β² = 0.2, 1, 2, 2.5, 5, 10, 50), and the confidence c_sentiment of a sentiment is set according to how many versions report the sentiment (Table 5). In the submitted runs, we use two different ways to calculate the final confidence c_final: one does not take entity confidence into account (i.e., c_final = c_sentiment), while the other does, via c_final = c_sentiment · c_target · c_source.
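A minimal sketch of this confidence computation, with the lookup taken from Table 5:

```python
# c_sentiment is looked up from the number of system versions (out of 7) that
# report the sentiment, optionally multiplied by target and source confidences.
C_SENTIMENT = {0: 0.00, 1: 0.10, 2: 0.30, 3: 0.50, 4: 0.70, 5: 0.90, 6: 0.95, 7: 1.00}

def final_confidence(num_reporting, c_target=None, c_source=None):
    c = C_SENTIMENT[num_reporting]
    if c_target is not None and c_source is not None:
        c *= c_target * c_source        # second submitted-run variant
    return c

print(final_confidence(5), final_confidence(5, 0.9, 0.8))  # 0.9 0.648
```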

6 Cross-lingual Coreference Resolution

Cross-lingual Entity Coreference: If two entity mentions share the same entity type and KB ID, we consider them coreferential.

Cross-lingual Event Coreference: We only perform cross-lingual event coreference for 'life-die' events. For this specific event type, we merge two event hoppers in different languages if they share the same victim argument (i.e., one linking to the same KB entry or having the same NIL cluster ID).

7 Entity Recommendation

Given a set of discovered entities as a query, the Entity Recommendation (ER) task requires a system to find additional entities that are most similar to the queries. We executed this task on the TinkerBell cross-lingual KB, and here discuss the difference between a content-based strategy for ER and one relying solely on relations discovered under the TAC relation schema.

For example, TinkerBell discovered the entities Osama and Baghdadi, and found exactly one entity in common within one hop: Al Qaeda. Using just the extracted relations from TinkerBell, we might then be able to enumerate other entities that had the same relationship with Al Qaeda, e.g., Suleiman Abu Ghaith, Abu Anas al Libi, Belmokhtar, etc. Owing to our prior investigations into the relative sparsity of discovered relations under the TAC schema (with its focus on a small number of high-utility relations), our approach relies exclusively on the entity detection and cross-document linking capabilities of the KBP system, and then builds models of similarity based on the content surrounding each of the entity mentions. In this example, that allows us to discover entities associated with Osama and Baghdadi that did not have an explicit relation with Al Qaeda in the KB, such as Ahmed Khalfan Ghailani, owing to the snippet: "ghailani had been charged with conspiring in al qaidas 1998 bombings of two u s embassies in east africa." (NYT ENG 20131025.0265)

Our ER framework is currently supported by two algorithms: "Bayesian Sets", owing to Ghahramani and Heller (2005), and a novel algorithm based on the deep Variational Auto-Encoder framework applied to documents (Miao et al., 2016).
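For illustration, here is a minimal Bayesian Sets scorer over binary features, following the Beta-Bernoulli posterior-predictive form of Ghahramani and Heller (2005); the hyperparameters and data are toy values.

```python
# Score every item by log p(x | seed set) - log p(x) under independent
# Beta-Bernoulli models per feature; higher means more similar to the seeds.
import numpy as np

def bayesian_sets_scores(X, query_idx, alpha=2.0, beta=2.0):
    # X: (n_items, n_features) binary matrix; query_idx: indices of seed items.
    D = X[query_idx]
    n = len(query_idx)
    a_tilde = alpha + D.sum(axis=0)            # posterior Beta parameters
    b_tilde = beta + n - D.sum(axis=0)
    p = alpha / (alpha + beta)                 # prior predictive per feature
    p_tilde = a_tilde / (a_tilde + b_tilde)    # posterior predictive per feature
    # log score(x) = sum_j x_j*log(p~_j/p_j) + (1-x_j)*log((1-p~_j)/(1-p_j))
    return X @ np.log(p_tilde / p) + (1 - X) @ np.log((1 - p_tilde) / (1 - p))

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]])
print(bayesian_sets_scores(X, [0, 3]))         # seeds are rows 0 and 3
```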

We also provide insight into the workings of the entity recommendation system by displaying the most important token features used by the recommendation system when scoring the entities. (Although our neural variational system does not directly use these features in a linear scoring method, the features are still indicative of the latent space that was constructed and thereby useful for error analysis.) We also include salient mentions associated with each entity, which justify why the entity was relevant in the context of a query. Such justifications can aid an analyst in quickly finding supporting data and gaining insights into the text corpus.

Figure 4: Screenshot of the ER system.

Acknowledgments

This work was supported by DARPA DEFT Nos. FA8750-13-2-{0041, 0017, 0040}. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

Gabor Angeli, Julie Tibshirani, Jean Y. Wu, and Christopher D. Manning. 2014. Combining distant and partial supervision for relation extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).

Gabor Angeli, Victor Zhong, Danqi Chen, Arun Chaganty, Jason Bolton, Melvin Johnson Premkumar, Panupong Pasupat, Sonal Gupta, and Christopher D. Manning. 2015. Bootstrapped self training for knowledge base population. In Text Analysis Conference (TAC) Proceedings 2015.

Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In Proc. the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).

Angel X. Chang and Christopher D. Manning. 2014. TokensRegex: Defining cascaded regular expressions over tokens. Tech. Rep. CSTR 2014-02.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.

Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In Empirical Methods on Natural Language Processing. https://nlp.stanford.edu/pubs/clark2016deep.pdf.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734. http://arxiv.org/abs/1611.01734.

Xiaocheng Feng, Lifu Huang, Duyu Tang, Bing Qin, Heng Ji, and Ting Liu. 2016. A language-independent neural network for event detection. In The 54th Annual Meeting of the Association for Computational Linguistics, page 66.

Zoubin Ghahramani and Katherine A. Heller. 2005. Bayesian sets. In NIPS.

Alan Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

Yu Hong, Di Lu, Dian Yu, Xiaoman Pan, Xiaobin Wang, Yadong Chen, Lifu Huang, and Heng Ji. 2015. RPI BLENDER TAC-KBP2015 system description. In Proc. Text Analysis Conference (TAC 2015).

Lifu Huang, Jonathan May, Xiaoman Pan, Heng Ji, Xiang Ren, Jiawei Han, Lin Zhao, and James A. Hendler. 2017. Liberal entity extraction: Rapid construction of fine-grained entity typing systems. Big Data 5. https://doi.org/10.1089/big.2017.0012.

Heng Ji, Ralph Grishman, Dayne Freitag, Matthias Blume, John Wang, Shahram Khadivi, Richard Zens, and Hermann Ney. 2009. Name extraction and translation for distillation. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Guillaume Lample, Miguel Ballesteros, Kazuya Kawakami, Sandeep Subramanian, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Xiao Ling, Sameer Singh, and Daniel Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics 3.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

O. Medelyan and C. Legg. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Proc. AAAI 2008 Workshop on Wikipedia and Artificial Intelligence.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Vlad Niculae, Kai Sun, Xilun Chen, Yao Cheng, Xinya Du, Esin Durmus, Arzoo Katiyar, and Claire Cardie. 2016. Cornell belief and sentiment system at TAC 2016.

Xiaoman Pan, Taylor Cassidy, Ulf Hermjakob, Heng Ji, and Kevin Knight. 2015. Unsupervised entity linking with Abstract Meaning Representation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proc. the 55th Annual Meeting of the Association for Computational Linguistics.

Haoruo Peng, Yangqiu Song, and Dan Roth. 2016. Event detection and co-reference with minimal supervision. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP). http://cogcomp.cs.illinois.edu/papers/PengSoRo16.pdf.

V. Punyakanok, D. Roth, and W. Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34(2). http://cogcomp.cs.illinois.edu/papers/PunyakanokRoYi07.pdf.

Owen Rambow, Tao Yu, Axinia Radeva, Sardar Hamidian, Alexander Fabbri, Debanjan Ghosh, Christopher Hidey, Tianrui Peng, Mona Diab, Kathleen McKeown, and Smaranda Muresan. 2016. The Columbia-GWU system at the 2016 TAC KBP BeSt evaluation. In Proceedings of the 2016 NIST TAC KBP Workshop.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proc. of the Conference on Computational Natural Language Learning (CoNLL). http://cogcomp.cs.illinois.edu/papers/RatinovRo09.pdf.

Tom Redman, Mark Sammons, and Dan Roth. 2016. Illinois Named Entity Recognizer: Addendum to Ratinov and Roth '09 reporting improved results. Technical report. http://cogcomp.cs.illinois.edu/papers/ner-addendum-2016.pdf.

Mark Sammons, Haoruo Peng, Yangqiu Song, Shyam Upadhyay, Chen-Tse Tsai, Pavankumar Reddy, Subhro Roy, and Dan Roth. 2015. Illinois CCG TAC 2015 event nugget, entity discovery and linking, and slot filler validation systems. In TAC.

Robert L. Taft. 1970. Name Search Techniques. New York State Identification and Intelligence System, Albany, New York, US.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW, pages 1067–1077. International World Wide Web Conferences Steering Committee.

Chen-Tse Tsai, Stephen Mayhew, and Dan Roth. 2016. Cross-lingual named entity recognition via wikification. In Proc. of the Conference on Computational Natural Language Learning (CoNLL). http://cogcomp.cs.illinois.edu/papers/TsaiMaRo16.pdf.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). http://cogcomp.cs.illinois.edu/papers/TsaiRo16b.pdf.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning distributed representations of texts and entities from knowledge base. CoRR abs/1705.02494.

Dian Yu, Xiaoman Pan, Boliang Zhang, Lifu Huang, Di Lu, Spencer Whitehead, and Heng Ji. 2016. RPI BLENDER TAC-KBP2016 system description.

Yuhao Zhang, Arun Chaganty, Ashwin Paranjape, Danqi Chen, Jason Bolton, Peng Qi, and Christopher D. Manning. 2016. Stanford at TAC KBP 2016: Sealing pipeline leaks and understanding Chinese. In Text Analysis Conference (TAC) Proceedings 2016.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45.