bionlp09 winners

Extracting Complex Biological Events with Rich Graph-Based Feature Sets

Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, Tapio Salakoski
BioNLP 2009 Workshop

Farzaneh Sarafraz
18 June 2009

BioNLP'09 Task 1

Events in abstracts
Given: genes and gene products (proteins)
Wanted: events
− type
− trigger
− participant(s)
− cause (if applicable)

Example

"I kappa B/MAD3 masks the nuclear localization signal of NF-kappa B p65 and requires the transactivation domain to inhibit NF-kappa B p65 DNA binding."

Event: negative regulation

Trigger: masks

Theme1: the first p65

Cause: MAD3
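
As a rough illustration of what the task asks for, the example above can be written down as a small record. This is only a sketch: the class name `Event` and its field names are made up for the slide, and they are not the shared task's annotation format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One extracted event (hypothetical field names, for illustration only)."""
    event_type: str
    trigger: str
    theme: str
    cause: Optional[str] = None  # "if applicable": left as None when absent


# The example from the slide above.
example = Event(
    event_type="Negative regulation",
    trigger="masks",
    theme="p65 (first mention)",
    cause="MAD3",
)
```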

Event Types

Gene expression
Transcription
Protein catabolism
Localisation
Phosphorylation
Binding
Regulation
Positive regulation
Negative regulation

Training and Test Data

Training data: 800 abstracts
Development data: 150 abstracts
Test data: 260 abstracts

The System

Trigger recognition
− Methods similar to NER
− Classification

Argument detection
− Graph edge selection
− Classification

Semantic post-processing
− Rule-based

Trigger Detection

Token labelling (one class for each event type and one negative class)
92% of triggers are a single token
− Adjacent tokens form a trigger if they appear as one in the training data

Triggers that share a token:
− Combined class: gene expression/pos regulation

A graph node for each trigger
− Not duplicated just yet

Classification SVM

Token features
− Binary: capitalisation, presence of punctuation or numeric characters
− Stem
− Character bigrams and trigrams
− Token is a known trigger in the training data
− All of the above for linear and dependency “neighbours”
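
A minimal sketch of the surface-level token features listed above. The function name `token_features` and the feature keys are invented here, and the stem feature is left out because it would need an external stemmer; the real system's feature encoding may well differ.

```python
import string


def token_features(token, known_triggers):
    """Return a dict of features for one token (illustrative names only).

    known_triggers: set of lower-cased trigger words seen in training data.
    """
    lower = token.lower()
    feats = {
        "has_capital": any(c.isupper() for c in token),
        "has_punct": any(c in string.punctuation for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "known_trigger": lower in known_triggers,
    }
    # Character bigrams and trigrams of the lower-cased token.
    for n in (2, 3):
        for i in range(len(lower) - n + 1):
            feats[f"char{n}={lower[i:i + n]}"] = True
    return feats
```

The same function could be applied to each linear or dependency neighbour, with the feature keys prefixed by the neighbour's position.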

Classification SVM

Frequency features
− Number of named entities in the sentence and in a linear window around the token
− Bag-of-words count of token texts in the sentence (?)

Dependency chains
− Constructed up to a depth of 3 from the token
− At each depth, both token and frequency features
− Plus dependency type and the sequence of dependency types in the chain

Two SVMs

“Somewhat” different feature sets
Combined weighted results

“This design should be considered an artifact of the time-constrained, experiment-driven development of the system rather than a principled design”

Precision/Recall tradeoff

Undetected trigger → undetected event
All triggers have events in the training data → bias towards reporting an event for all detected triggers

Adjust P/R explicitly
− multiply the classifier score for the negative class by β
− find β experimentally
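
The β adjustment can be sketched as follows. This assumes each token has one positive-valued score per class and that the negative class is labelled `"neg"`; both the function name `predict_with_beta` and that label are assumptions made for the example.

```python
def predict_with_beta(scores, beta, negative="neg"):
    """Pick the top-scoring class after scaling the negative-class score by beta.

    scores: dict mapping class label -> classifier score (assumed positive).
    beta < 1 makes the negative class less likely to win (higher recall);
    beta > 1 makes it more likely to win (higher precision).
    """
    adjusted = {
        label: score * beta if label == negative else score
        for label, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)


scores = {"neg": 0.5, "Binding": 0.4}
print(predict_with_beta(scores, beta=1.0))  # negative class wins: 0.5 > 0.4
print(predict_with_beta(scores, beta=0.7))  # Binding wins: 0.35 < 0.4
```

Finding β experimentally would then mean trying several values on the development data and keeping the one with the best F-score.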

Edge Detection

Multiclass SVM
All potential directed edges
− Event node to named entity
− Event node to event node (nested event)
− Labelled as theme, cause, or negative

Each edge is predicted independently
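
To make "all potential directed edges" concrete, here is a small sketch that enumerates the candidates for one sentence. The node names are placeholders and `candidate_edges` is a hypothetical helper; each returned pair would then be classified on its own as theme, cause, or negative.

```python
from itertools import product


def candidate_edges(event_nodes, entity_nodes):
    """List every directed edge that starts at an event node.

    Targets are named entities or other event nodes (nested events).
    """
    edges = list(product(event_nodes, entity_nodes))
    # Event-to-event edges, excluding an event pointing at itself.
    edges += [(a, b) for a, b in product(event_nodes, event_nodes) if a != b]
    return edges
```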

Feature Set – Central Concept

Shortest undirected path of syntactic dependencies in the Stanford scheme parse of the sentence.
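
A breadth-first search is one simple way to get such a path. The sketch below treats each dependency as an undirected link between two tokens; the toy dependency list in the usage example is invented for illustration and is not actual Stanford parser output.

```python
from collections import deque


def shortest_path(dependencies, start, goal):
    """Return the shortest token path from start to goal, or None.

    dependencies: iterable of (head, dependent) pairs, used as undirected edges.
    """
    neighbours = {}
    for head, dependent in dependencies:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Toy dependencies (made up): masks -> MAD3, masks -> signal, signal -> p65
toy = [("masks", "MAD3"), ("masks", "signal"), ("signal", "p65")]
print(shortest_path(toy, "MAD3", "p65"))
```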

Feature Set

Token text, POS, entity/event class, dependency (subject)

N-grams: merging the attributes of 2-4
− Consecutive tokens
− Consecutive dependencies
− Each token and two neighbouring dependencies
− Each dependency and two neighbouring tokens
− One bigram showing direction
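
The first n-gram type can be sketched like this, reading the n-gram length as 2 to 4 consecutive items along the shortest path (an assumption about the slide). The function name `path_ngrams` and the underscore-joined encoding are invented for the example.

```python
def path_ngrams(attributes, min_n=2, max_n=4):
    """Merge one attribute (e.g. POS tag) of consecutive path items into n-grams.

    attributes: attribute values in path order, one per token or dependency.
    """
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(attributes) - n + 1):
            ngrams.append("_".join(attributes[i:i + n]))
    return ngrams
```

The same helper could be run once over the token attributes and once over the dependency types of the path.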

Other Features

Individual component features
Semantic node features
Frequency features

Semantic Post-Processing

Duplicate nodes
− Same class and same trigger
− Combined trigger

Remove improper arguments
Remove directed cycles by removing the weakest link
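
One way the cycle-removal step could work is sketched below. It assumes that "weakest link" means the edge with the lowest classifier confidence, which is an interpretation of the slide; the function names and the dict-based edge representation are made up for the example.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle as a list, or None if acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    state = {}  # node -> 1 while on the DFS stack, 2 once finished

    def dfs(node, stack):
        state[node] = 1
        stack.append(node)
        for nxt in graph.get(node, []):
            if state.get(nxt) == 1:  # back edge closes a cycle
                cycle = stack[stack.index(nxt):] + [nxt]
                return list(zip(cycle, cycle[1:]))
            if nxt not in state:
                found = dfs(nxt, stack)
                if found:
                    return found
        stack.pop()
        state[node] = 2
        return None

    for node in list(graph):
        if node not in state:
            found = dfs(node, [])
            if found:
                return found
    return None


def break_cycles(weighted_edges):
    """Drop the lowest-confidence edge of each cycle until none remain.

    weighted_edges: dict mapping (src, dst) -> confidence score.
    """
    kept = dict(weighted_edges)
    while True:
        cycle = find_cycle(kept)
        if cycle is None:
            return kept
        del kept[min(cycle, key=kept.get)]
```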

Duplicating Event Nodes

Task restrictions
− Two causes,
− must have theme,
− etc.

Several heuristics

xth first dependency in shortest path from the event for binding

Results

Compared to Us

What Didn't Work/Wasn't Tried

CRF
HMM
Removing strong independence assumption
Coreference resolution (4.8%)
