bionlp09 winners
Extracting Complex Biological Events with Rich Graph-Based Feature Sets
Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, Tapio Salakoski
BioNLP 2009 Workshop
Farzaneh Sarafraz
18 June 2009
BioNLP'09 Task 1
Events in abstracts
Given: gene and gene products (proteins)
Wanted: events
− type
− trigger
− participant(s)
− cause (if applicable)
Example
"I kappa B/MAD3 masks the nuclear localization signal of NF-kappa B p65 and requires the transactivation domain to inhibit NF-kappa B p65 DNA binding."
Event: negative regulation
Trigger: masks
Theme1: the first p65
Cause: MAD3
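The annotation above can be written down as a small record. This is only a sketch: the class name `Event` and its field names are illustrative and are not part of the shared task's file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """One extracted event (hypothetical container, not the task format)."""
    event_type: str                       # e.g. "Negative regulation"
    trigger: str                          # the trigger word(s) in the sentence
    themes: List[str] = field(default_factory=list)  # participant(s)
    cause: Optional[str] = None           # only present for some events


# The example from the slide, filled in by hand.
example = Event(
    event_type="Negative regulation",
    trigger="masks",
    themes=["p65 (first mention)"],
    cause="MAD3",
)
```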
Event Types
Gene expression, Transcription, Protein catabolism, Localisation, Phosphorylation
Binding, Regulation, Positive regulation, Negative regulation
Training and Test Data
Training data: 800 abstracts
Development data: 150 abstracts
Test data: 260 abstracts
The System
Trigger recognition
− Methods similar to NER
− Classification
Argument detection
− Graph edge selection
− Classification
Semantic post-processing
− Rule-based
Trigger Detection
Token labelling (one class for each event type and one negative class)
92% of triggers are single token
− Adjacent tokens form a trigger if they appear in the training data
Triggers that share a token:
− Combined class: gene expression/pos regulation
A graph node for each trigger
− Not duplicated just yet
Classification SVM
Token features
− Binary: capitalisation, presence of punctuation or numeric characters
− Stem
− Character bigrams and trigrams
− Whether the token is a known trigger in the training data
− All the above for linear and dependency “neighbours”
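A rough sketch of how the token features listed above might be computed for a single token. The function names and feature keys are made up for illustration; the stem feature is left out because it needs an external stemmer, and the neighbour features would repeat the same function on adjacent tokens.

```python
import string


def char_ngrams(text, n):
    """All character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_features(token, known_triggers):
    """Binary and character n-gram features for one token (illustrative names)."""
    feats = {
        "has_capital": any(c.isupper() for c in token),
        "has_punct": any(c in string.punctuation for c in token),
        "has_digit": any(c.isdigit() for c in token),
        # Assumption: trigger lookup is case-insensitive.
        "known_trigger": token.lower() in known_triggers,
    }
    for n in (2, 3):  # character bigrams and trigrams
        for gram in char_ngrams(token.lower(), n):
            feats[f"char{n}={gram}"] = True
    return feats


features = token_features("masks", known_triggers={"masks", "inhibit"})
```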
Classification SVM
Frequency features
− Number of named entities: in the sentence; in a linear window around the token
− Bag-of-words count of token texts in the sentence (?)
Dependency chains
− Up to a depth of 3 from the token are constructed
− At each depth, both token and frequency features
− Plus dependency type and sequence of dependency types in the chain
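The named-entity counts among the frequency features could be computed along these lines. The window size and the feature names are assumptions; the slides do not state them.

```python
def entity_count_features(is_entity, index, window=2):
    """Count named entities in the sentence and near the token at `index`.

    `is_entity` holds one boolean per token; `window` is a hypothetical
    number of tokens to look at on each side.
    """
    lo = max(0, index - window)
    hi = min(len(is_entity), index + window + 1)
    return {
        "entities_in_sentence": sum(is_entity),
        "entities_in_window": sum(is_entity[lo:hi]),
    }


# Six tokens, entities at positions 0 and 5; features for the token at position 1.
counts = entity_count_features([True, False, False, False, False, True], index=1, window=1)
```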
Two SVMs
“Somewhat” different feature sets
Combined weighted results
“This design should be considered an artifact of the time-constrained, experiment-driven development of the system rather than a principled design”
Precision/Recall tradeoff
Undetected trigger → undetected event
All triggers have events in the training data → bias towards reporting an event for all detected triggers
Adjust P/R explicitly
− multiply the classifier output for the negative class by β
− find β experimentally
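The β adjustment can be illustrated with a toy decision rule. This sketch assumes non-negative per-class scores, so that β < 1 makes the negative class less likely to win and raises recall; the label `"neg"` and the function name are placeholders, and the numbers are invented.

```python
def predict_with_beta(scores, beta, negative_label="neg"):
    """Pick the best class after scaling the negative-class score by beta."""
    adjusted = dict(scores)
    adjusted[negative_label] = adjusted[negative_label] * beta
    return max(adjusted, key=adjusted.get)


scores = {"neg": 0.5, "Binding": 0.4}             # invented scores for one token
plain = predict_with_beta(scores, beta=1.0)       # negative class still wins
recall_biased = predict_with_beta(scores, beta=0.6)  # 0.5 * 0.6 = 0.3 < 0.4
```

In practice β would be tuned on held-out data, as the slide says, by trying several values and keeping the one that gives the best overall score.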
Edge Detection
Multiclass SVM
All potential directed edges
− Event node to named entity
− Event node to event node (nested event)
− Labelled as theme, cause, or negative
Each edge is predicted independently
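Generating the candidate edges might look like the following. The node identifiers are invented, and excluding self-loops is an assumption; each returned pair would then be classified on its own as theme, cause, or negative.

```python
def candidate_edges(event_nodes, entity_nodes):
    """All directed edges that start at an event node."""
    edges = []
    for source in event_nodes:
        for target in entity_nodes:      # event -> named entity
            edges.append((source, target))
        for target in event_nodes:       # event -> event (nested event)
            if target != source:         # assumption: no self-loops
                edges.append((source, target))
    return edges


edges = candidate_edges(event_nodes=["E1", "E2"], entity_nodes=["P1"])
```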
Feature Set – Central Concept
Shortest undirected path of syntactic dependencies in the Stanford scheme parse of the sentence.
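A shortest undirected path over dependencies can be found with a breadth-first search. The sketch below uses a tiny hand-written set of dependencies rather than real parser output, and identifies tokens by their text for brevity; a real system would use token indices so that repeated words stay distinct.

```python
from collections import deque


def shortest_undirected_path(dependencies, start, goal):
    """Breadth-first search that ignores dependency direction.

    `dependencies` is a list of (head, dep_type, dependent) triples.
    Returns the list of tokens on the path, or None if unreachable.
    """
    neighbours = {}
    for head, _dep_type, dependent in dependencies:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:      # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hand-written toy dependencies (not actual Stanford parser output).
toy_deps = [
    ("masks", "nsubj", "MAD3"),
    ("masks", "dobj", "signal"),
    ("signal", "prep_of", "p65"),
]
path = shortest_undirected_path(toy_deps, "MAD3", "p65")
```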
Feature Set
Token text, POS, entity/event class, dependency (subject)
N-grams: merging the attributes of 2-4
− Consecutive tokens
− Consecutive dependencies
− Each token and two neighbouring dependencies
− Each dependency and two neighbouring tokens
− One bigram showing direction
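One way to read the n-gram features is as joined attributes of consecutive items along the path. The sketch assumes the range is 2 to 4 items and uses an underscore as the joiner; both choices, and the function name, are illustrative.

```python
def path_ngrams(items, min_n=2, max_n=4):
    """Join the attributes of min_n..max_n consecutive items on a path."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(items) - n + 1):
            grams.append("_".join(items[i:i + n]))
    return grams


# Token texts along a toy path; POS tags or dependency types work the same way.
grams = path_ngrams(["MAD3", "masks", "signal"])
```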
Other Features
Individual component features
Semantic node features
Frequency features
Semantic Post-Processing
Duplicate nodes
− Same class and same trigger
− Combined trigger
Remove improper arguments
Remove directed cycles by removing the weakest link
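Cycle removal could be sketched as follows, assuming "weakest link" means the edge with the lowest classifier confidence. The confidence values and node names are invented, and this is not necessarily how the original system implements the step.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle, or None if there is none."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state = {}  # node -> "active" while on the DFS path, "done" afterwards

    def visit(node, path):
        state[node] = "active"
        for nxt in graph.get(node, []):
            if state.get(nxt) == "active":
                # Back edge: the cycle is the path from nxt round to nxt again.
                nodes = path[path.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if nxt not in state:
                found = visit(nxt, path + [nxt])
                if found:
                    return found
        state[node] = "done"
        return None

    for node in list(graph):
        if node not in state:
            found = visit(node, [node])
            if found:
                return found
    return None


def break_cycles(confidence):
    """Drop the lowest-confidence edge of each cycle until none remain."""
    kept = dict(confidence)
    while True:
        cycle = find_cycle(kept)
        if cycle is None:
            return kept
        weakest = min(cycle, key=kept.get)
        del kept[weakest]


confidence = {("E1", "E2"): 0.9, ("E2", "E3"): 0.8, ("E3", "E1"): 0.3, ("E1", "P1"): 0.7}
acyclic = break_cycles(confidence)
```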
Duplicating Event Nodes
Task restrictions
− Two causes,
− must have theme,
− etc.
Several heuristics
xth first dependency in shortest path from the event for binding
Results
Compared to Us
What Didn't Work/Wasn't Tried
CRF
HMM
Removing strong independence assumption
Coreference resolution (4.8%)
End.