Improving Scalability of Support Vector Machines for Biomedical Named Entity Recognition

Ph.D. Thesis Proposal

Presented by Mona Soliman Habib

December 2007

Outline

Objectives
Named Entity Recognition
NER Challenges
Support Vector Machines
SVM Challenges
Research Proposal
Baseline Experiments and Results

Objectives

Explore the scalability problems associated with solving the named entity recognition problem using high-dimensional input space and support vector machines.
Propose a solution that improves SVM scalability for multi-class problems
Propose an NER solution that fosters language and domain independence
Apply the proposed solution to the biomedical domain
(Optional) Present auxiliary issues related to SVM usability and recommend an architecture

Named Entity Recognition

Information extraction task
Identification/classification of words or groups of words denoting a concept or entity
E.g.: person, location, gene, company
Entities may be relevant only to a specific domain, e.g., pneumonia is a disease
Language or domain-specific NER solution may not be useful for other languages or domains

General NER Example

Day 2 of “Oprahpalooza” begins in [ORG SC] . She says state of nation , belief in candidate led to her first endorsement . [ORG Associated Press] .

[LOC COLUMBIA] , [LOC S.C.] - Media mogul [PER Oprah Winfrey] on Sunday told thousands of people in a football stadium in this early voting state to shrug off [PER Barack Obama] 's detractors and help him " seize the opportunity " in his bid for the [LOC White House] . " [LOC South Carolina] — January 26 th is your moment , " [PER Winfrey] said , referring to the state [MISC Democratic] primary date during a campaign stop alongside the [LOC Illinois] senator . " It 's your time to seize the opportunity to support a man who , as the [PER Bible] says , loves mercy and does justly . " [PER Obama] 's campaign said more than 29,000 attended the event at the [ORG University of South Carolina] 's football stadium . It had the feel of a rock concert , with bands playing for early arrivals and campaign supporters yelling " fire it up " to the crowd .

Text Source: http://www.msnbc.msn.com/id/22160762/ 12/09/2007
NE Output from http://l2r.cs.uiuc.edu/~cogcomp/eoh/nedemo.html
Legend: PER person, LOC location, ORG organization, MISC miscellaneous
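The bracketed annotation format in this example can be read back into (label, text) pairs with a few lines of code. This is a minimal sketch, assuming annotations are never nested and only the four labels from the legend occur; the function name `extract_entities` is hypothetical and not part of the demo tool.

```python
import re

# Matches one bracketed annotation such as "[PER Oprah Winfrey]":
# group 1 is the label, group 2 is the entity text.
ANNOTATION = re.compile(r"\[(PER|LOC|ORG|MISC)\s+([^\]]+)\]")

def extract_entities(tagged_text):
    """Return (label, entity_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in ANNOTATION.finditer(tagged_text)]

sample = (
    "[LOC COLUMBIA] , [LOC S.C.] - Media mogul [PER Oprah Winfrey] "
    "on Sunday told thousands of people in a football stadium"
)
entities = extract_entities(sample)
```

Here `entities` holds the two locations followed by the person name, in the order they appear in the text.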

NER Solution Approaches

Statistical, probabilistic, conditional, inference, ...
Machine learning (supervised or unsupervised)
Hidden Markov Model
Maximum entropy approach
Decision trees
Rule-based models
Memory-based approach
Support vector machines
AdaBoost, and other approaches
Combination of different approaches

Language-Specific Tools

Part-of-speech tags
Noun phrase tags, syntactic tags
Grammar rules
Affix information (character n-grams)
Orthographic patterns
Lexical features
Punctuation & parentheses handling
Word triggers, word roots, word variations

Domain-Specific Tools

Specialized dictionaries
Gazetteers (reference information)
Bag of words
Definition of rules describing entities and their possible contexts
Cascaded entities
Other external resources

Language and Domain Independence: Why?

Incorporating language or domain-specific knowledge requires additional pre- and/or post-processing.
Additional tasks, such as part-of-speech tagging or rule definition, are labor and time intensive.
It’s not easy to incorporate new information if/when it becomes available.
Solutions are not easily portable across domains or languages.

NER in Biomedical Domain

Challenging domain for NER
Growing nomenclature, large number of new articles, reports, records, ...
Ambiguity in identifying left boundary of multi-word entities
Strong overlap among different entities
Difficult to annotate training data
Rules definition or inference is difficult
Linguistic information may add no value

Biomedical NER Example

Annotation: GENIA Corpus - Article Source: (Briant et al. 1998) The Journal of Immunology

ERK-1: example of a single-word protein name

extracellular signal-regulated kinase: example of a multi-word protein name

TI - Involvement of Extracellular Signal-Regulated Kinase Module in [HIV]virus - Mediated [CD4]protein Signals Controlling Activation of [Nuclear Factor-kappa B]protein and [AP-1]protein Transcription Factors

AB - Although the molecular mechanisms by which the [HIV-1]virus triggers either [T cell]cell_type activation, anergy, or apoptosis remain poorly understood, it is well established that the interaction of [HIV-1]virus envelope glycoproteins with [cell surface]cell_line [CD4]protein delivers signals to the target cell, resulting in activation of transcription factors such as [NF-kappa B]protein and [AP-1]protein. In this study, we report the first evidence indicating that kinases [MEK-1]protein ([MAP kinase/Erk kinase]protein) and [ERK-1]protein ([extracellular signal-regulated kinase]protein) act as intermediates in the cascade of events that regulate [NF-kappa B]protein and [AP-1]protein activation upon [HIV-1]virus binding to [cell surface]cell_line [CD4]protein.

Biomedical NER Example

Example MEDLINE sentence marked up for molecular biology named entities

Source: (Collier and Takeuchi 2004)

interleukin-1: example of a single-word protein name

IL-2: example of a single-word protein name

IL-2 receptor alpha (IL-2R alpha) gene: example of a multi-word DNA name

We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8-murine T lymphocyte precursors]cell_line.

NER Challenges

Entities may appear in any form
Patterns may be difficult to discover
Discovering boundaries of multi-word entities is challenging
Supervised learning requires labeled training data, not easy to obtain
Positive examples are usually scarce
Unbalanced representation of different classes in the training corpus

Now let’s look into Support Vector Machines as a machine learning solution for Named Entity Recognition

NER Solution? Support Vector Machines

Powerful tool for pattern recognition
Based on Vapnik’s statistical learning theory (Vapnik 1995)
Kernel-based machine learning
Increasingly popular due to its high generalization ability and handling of high-dimensional input space

Linearly Separable Case

[Figure: two linearly separable groups of points, Class 1 and Class 2]

Problem: How to find a “good” decision boundary?

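Maximum Margin Decision Boundary

[Figure: Class 1 and Class 2 separated by the decision boundary $w^T x + b = 0$, with parallel supporting planes $w^T x + b = 1$ and $w^T x + b = -1$; $w$ is the normal vector and $m$ is the margin]

Margin:

$$m = \frac{2}{\|w\|}$$

Solution: Maximize the margin between parallel supporting planes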
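The margin width follows from the distance between the two supporting planes. A short derivation, where $x_+$ and $x_-$ are assumed names (not on the slide) for points lying on the planes $w^T x + b = 1$ and $w^T x + b = -1$:

```latex
\begin{aligned}
w^T x_+ + b &= 1, \qquad w^T x_- + b = -1 \\
w^T (x_+ - x_-) &= 2 \\
m &= \frac{w^T (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}
\end{aligned}
```

The last line projects the difference vector onto the unit normal $w / \|w\|$, so maximizing the margin is the same as minimizing $\|w\|$.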

Non-Linearly Separable Case

Input space is mapped into a higher dimension feature space where classes are linearly separable

[Figure: the mapping $\phi(\cdot)$ takes points from the input space to the feature space]

SVM Optimization Problem

Linearly separable case: Minimize

$$f(w) = \frac{1}{2}\|w\|^2$$

such that

$$y_i (w^T x_i + w_0) \ge 1, \quad i = 1, 2, \ldots, N$$

Non-linearly separable case: Minimize

$$f(w, w_0, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

such that

$$y_i (w^T x_i + w_0) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

where $C$ is a user-defined parameter and $\xi_i$ are the slack variables, or margin errors. Here $w_0$ is the bias term, written $b$ on the other slides.
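To make the soft-margin objective concrete, the sketch below evaluates the slack variables and the objective for a fixed, hand-picked hyperplane. It is only an illustration under stated assumptions: the toy points, the weights, and the value of C are hypothetical, and no optimization is performed.

```python
def slack_variables(w, w0, points, labels):
    """Smallest feasible slack xi_i = max(0, 1 - y_i (w^T x_i + w0)) per point."""
    slacks = []
    for x, y in zip(points, labels):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + w0)
        slacks.append(max(0.0, 1.0 - margin))
    return slacks

def soft_margin_objective(w, w0, points, labels, C):
    """f(w, w0, xi) = 0.5 * ||w||^2 + C * sum(xi_i)."""
    norm_sq = sum(wj * wj for wj in w)
    return 0.5 * norm_sq + C * sum(slack_variables(w, w0, points, labels))

# Hypothetical toy data: the second point falls inside the margin.
points = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
labels = [1, 1, -1]
w, w0 = (1.0, 0.0), 0.0

slacks = slack_variables(w, w0, points, labels)
objective = soft_margin_objective(w, w0, points, labels, C=10.0)
```

Only the point at distance 0.5 from the boundary gets a non-zero slack, and a larger C makes that margin error more expensive relative to the width of the margin.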

The Dual Problem

Solving the optimization problem is equivalent to solving its dual problem: find $\alpha$ that minimizes

$$\frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j \, \phi(x_i)^T \phi(x_j) - \sum_{i} \alpha_i$$

subject to

$$\sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \;\; \forall i$$

The resulting SVM is of the form

$$f(x) = w^T \phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i \, \phi(x_i)^T \phi(x) + b$$

The Kernel “Trick”

There exists a mapping $\phi$ such that

$$\phi(x_i)^T \phi(x_j) = K(x_i, x_j)$$

The dual problem becomes: Minimize

$$\frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i} \alpha_i$$

So using the kernel we do not need to compute the vector dot product in the high-dimensional feature space. The dot products are computed in the lower-dimensional input space instead.
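A small numeric check can make the kernel trick tangible. The sketch below assumes the homogeneous quadratic kernel on 2-D inputs, chosen only because its feature map is easy to write out; the function names are illustrative.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit 3-D feature map for the 2-D homogeneous quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, z):
    """K(x, z) = (x^T z)^2, evaluated entirely in the 2-D input space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))      # dot product taken in the feature space
implicit = quadratic_kernel(x, z)   # same value without ever building phi
```

Both routes give the same number, but the kernel route never constructs the feature-space vectors, which is what matters when the feature space is very large.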

Examples of Kernels

Linear kernel
Polynomial kernel
Radial basis function kernel
Sigmoid function kernel
It’s also possible to use other kernel functions to solve specific problems
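The slide names the kernels without giving formulas. For reference, commonly used parameterizations are shown below; the symbols $\gamma$, $r$, and $d$ are assumed hyperparameter names and do not come from the slides.

```latex
\begin{aligned}
\text{Linear:} \quad & K(x, z) = x^T z \\
\text{Polynomial:} \quad & K(x, z) = (\gamma\, x^T z + r)^d \\
\text{Radial basis function:} \quad & K(x, z) = \exp\!\left(-\gamma \,\|x - z\|^2\right) \\
\text{Sigmoid:} \quad & K(x, z) = \tanh(\gamma\, x^T z + r)
\end{aligned}
```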

Single Class SVM

Binary classification problem, i.e., a point either “belongs” to a class or does not belong to it
Direct application of theory
Two popular implementations: LibSVM and SVM-Light
Useful for applications that look for a yes/no answer (e.g., intrusion detection)

Multi-Class SVM

A given point belongs to “some” class
Finds more separating hyperplanes to identify the different classes
Different multi-class approaches:
– One-against-one
– One-against-all
– Half-against-half
– Solved by building several SVMs and attempting to classify a point by each of them; total time = binary time x n
All-together approach builds one SVM that optimizes all hyperplanes at the same time, a much bigger optimization problem
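How many binary SVMs the decomposition approaches need can be counted directly. The sketch below uses the standard counts for one-against-all and one-against-one; the class list is the set of entity types used in the baseline experiments, and treating each entity type as exactly one class is a simplifying assumption.

```python
def one_against_all_count(num_classes):
    """One binary SVM per class (that class versus the rest)."""
    return num_classes

def one_against_one_count(num_classes):
    """One binary SVM per unordered pair of classes."""
    return num_classes * (num_classes - 1) // 2

classes = ["PROTEIN", "DNA", "RNA", "CELL-TYPE", "CELL-LINE"]
ova = one_against_all_count(len(classes))
ovo = one_against_one_count(len(classes))
```

One-against-one trains more machines, but each one sees only the examples of two classes, so the number of machines alone does not determine total training time.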

Multi-Class Boundaries

[Figure: three panels (One-Against-All, One-Against-One, All-Together) showing the decision boundaries for Class 1, Class 2, and Class 3]

Overlapping areas are unclassifiable regions

SVM Positive Features

Mathematically sound
Geometric intuition
Theoretical guarantees
Optimization algorithms exist
Can be applied to a variety of problems
SVM vs. neural networks or decision trees:
– No problems with local minima
– Fewer learning parameters to select
– Stable and reproducible results

SVM Scalability Issues

Optimization requires O(n^3) time and O(n^2) memory for single class training, where n is input size (depends on algorithm used)
Multi-class performance depends on approach used, worse with more classes
Slow training, especially with non-linear kernels
– Reduce input data size (pruning, chunking, clustering)
– Reduce number of support vectors
– Reduce input features dimensions

How to achieve a practical, scalable, and expandable NER/SVM solution?

Towards a Practical NER/SVM Solution

Research Proposal

Address SVM scalability issues
Special focus on multi-class all-together optimization problem
Apply proposed solution to biomedical named entity recognition
Recommend a framework that promotes future research work through easy expandability and maintainability

Two Phases

Phase One: Baseline Experiments
– Explore the scalability issues through a set of NER/SVM experiments using biomedical abstracts
– Identify auxiliary usability problems
Phase Two: Proposed Research
– Address multi-class scalability issues
– Recommend dynamic architecture to improve SVM usability

Key Ideas for Experiments

Eliminate the use of prior language and domain-specific knowledge
Capitalize on SVM’s ability to handle high-dimensional input space
Generate a very high number of binary orthographic and contextual features
Character and word n-grams do not have to make linguistic sense, e.g., they need not form a meaningful prefix or suffix or a logical sequence of words.
Minimize pre- and post-processing as much as possible
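The idea of generating many binary features from character n-grams, with no linguistic filtering, can be sketched as follows. This is not the jFex feature generator used in the experiments; the function names, the n-gram lengths, and the `ng=` feature prefix are all hypothetical.

```python
def char_ngrams(token, n_min=2, n_max=4):
    """All contiguous character n-grams of the token, for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams

def binary_features(token):
    """Map each n-gram to a binary indicator feature (present = 1)."""
    return {"ng=" + g: 1 for g in char_ngrams(token)}

grams = char_ngrams("IL-2", n_min=2, n_max=3)
features = binary_features("ERK-1")
```

Fragments such as `L-` or `K-1` carry no linguistic meaning on their own, yet they become features like any other; the classifier is left to decide which ones are informative.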

Baseline Experiments

Using the JNLPBA-04 challenge task data (GENIA biomedical abstracts)
Features generated using jFex (Giuliano 2005)
Single class: find PROTEIN names
– Binary classification using SVM-Light (Joachims 2002)
Multi-class: find all classes (PROTEIN, DNA, RNA, CELL-TYPE, CELL-LINE)
– All-together classification using Joachims’ SVM-Multiclass implementation
Precision/Recall/F-score performance results are comparable to published results

The JNLPBA-04 Datasets

| Data set | Protein | DNA | RNA | Cell Type | Cell Line | All Entities |
|---|---|---|---|---|---|---|
| Training Set | 30,269 (15.1) | 9,533 (4.8) | 951 (0.5) | 6,718 (3.4) | 3,830 (1.9) | 51,301 (25.7) |
| Test Set | 5,067 (12.5) | 1,056 (2.6) | 118 (0.3) | 1,921 (4.8) | 500 (1.2) | 8,662 (21.4) |
| 1978-1989 | 609 (5.9) | 112 (1.1) | 1 (0.0) | 392 (3.8) | 176 (1.7) | 1,290 (12.4) |
| 1990-1999 | 1,420 (13.4) | 385 (3.6) | 49 (0.5) | 459 (4.3) | 168 (1.6) | 2,481 (23.4) |
| 2000-2001 | 2,180 (16.8) | 411 (3.2) | 52 (0.4) | 714 (5.5) | 144 (1.1) | 3,501 (26.9) |
| S/1998-2001 | 3,186 (15.5) | 588 (2.9) | 70 (0.3) | 1,138 (5.5) | 170 (0.8) | 5,152 (25.0) |

Values in parentheses are the average number of entities per abstract (e.g., 30,269 / 2,000 = 15.1).

Training Data = 2,000 abstracts (492,551 tokens)
Test Data = 404 abstracts (101,039 tokens)

% Positive Examples in Training Data = 0.2% - 0.6%

Common Architecture

Baseline Experiments Design

Feature Selection

Orthographic features:
– Capitalization: token begins with a capital letter.
– Numeric: token is a numeric value.
– Punctuation: token is a punctuation mark.
– Uppercase: token is all in uppercase.
– Lowercase: token is all in lowercase.
– Single character: token length is equal to one.
– Symbol: token is a special character.
– Includes hyphen: one of the characters is a hyphen.
– Includes slash: one of the characters is a slash.
– Letters and Digits: token is alphanumeric.
– Capitals and digits: token contains caps and digits.
– Includes caps: some characters are in uppercase.
– General regular expression summarizing the word shape.
Contextual features:
– Each word is considered a feature
– Collocation of tokens active over three positions around the token itself
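The feature list above can be sketched in code as follows. This is an illustrative approximation rather than the feature extractor used in the experiments: the feature names, the exact test behind each indicator (for example, reading "letters and digits" as "contains both"), and the word-shape alphabet are assumptions, and the Symbol feature is omitted.

```python
import string

def orthographic_features(token):
    """Binary orthographic indicators for one token (names are illustrative)."""
    has_upper = any(c.isupper() for c in token)
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return {
        "capitalized": token[:1].isupper(),
        "numeric": token.replace(".", "", 1).isdigit(),
        "punctuation": len(token) > 0 and all(c in string.punctuation for c in token),
        "uppercase": token.isupper(),
        "lowercase": token.islower(),
        "single_char": len(token) == 1,
        "has_hyphen": "-" in token,
        "has_slash": "/" in token,
        "letters_and_digits": has_alpha and has_digit,
        "caps_and_digits": has_upper and has_digit,
        "has_caps": has_upper,
    }

def word_shape(token):
    """Coarse shape: A=uppercase run, a=lowercase run, 0=digit run, others kept."""
    shape = []
    for c in token:
        s = "A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else c
        if not shape or shape[-1] != s:
            shape.append(s)
    return "".join(shape)

def context_features(tokens, i, window=3):
    """Collocation features: neighbouring tokens within +/- window positions."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats["w[%+d]=%s" % (offset, tokens[j])] = 1
    return feats
```

For the token `IL-2`, for instance, `word_shape` returns `A-0`, and the capitalization, hyphen, and caps-and-digits indicators are all true. Every distinct neighbouring word becomes its own binary feature, which is what drives the input dimensionality up.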

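Performance Measures

$$\text{Precision} = \frac{\text{\# of correctly classified entities}}{\text{\# of entities found by algorithm}} = \frac{tp}{tp + fp}$$

$$\text{Recall} = \frac{\text{\# of correctly classified entities}}{\text{\# of entities in the corpus}} = \frac{tp}{tp + fn}$$

$$\text{Weighted Mean } F_{\beta} = \frac{(1 + \beta^2)\,(\text{precision} \times \text{recall})}{\beta^2 \times \text{precision} + \text{recall}}$$

$$\text{Weighted Mean } F_{\beta=1} = \frac{2\,(\text{precision} \times \text{recall})}{\text{precision} + \text{recall}}$$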
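These measures are straightforward to compute from the counts. The sketch below uses hypothetical tp/fp/fn counts for illustration, then recomputes one F-score from a pair of values reported in the single-class results (68.06 and 59.29) as a consistency check; F with beta = 1 is symmetric, so the order of the two values does not matter there.

```python
def precision(tp, fp):
    """tp / (tp + fp), or 0.0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """tp / (tp + fn), or 0.0 when the corpus has no entities."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_beta(p, r, beta=1.0):
    """Weighted mean F_beta = (1 + beta^2) * p * r / (beta^2 * p + r)."""
    denom = beta * beta * p + r
    return (1.0 + beta * beta) * p * r / denom if denom else 0.0

# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 60, 40, 20
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f_beta(p, r)

# Consistency check against a reported pair of values (in percent).
reported_f = f_beta(59.29, 68.06)
```

The recomputed value rounds to the 63.37 reported alongside that pair.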

Experimental Results: Single Class (Linear Kernel)

Cells show recall / precision / F-score.

| Boosting | Match | 1978-1989 Set | 1990-1999 Set | 2000-2001 Set | S/1998-2001 Set | Total |
|---|---|---|---|---|---|---|
| No Boosting | Complete | 57.47 / 53.35 / 55.34 | 71.69 / 62.76 / 66.93 | 68.35 / 59.60 / 63.68 | 68.27 / 58.63 / 63.08 | 68.06 / 59.29 / 63.37 |
| No Boosting | Right | 73.56 / 68.29 / 70.83 | 81.76 / 71.58 / 76.33 | 79.04 / 68.92 / 73.63 | 79.47 / 68.25 / 73.43 | 79.30 / 69.09 / 73.84 |
| No Boosting | Left | 59.61 / 55.34 / 57.39 | 77.82 / 68.13 / 72.65 | 76.70 / 66.88 / 71.45 | 76.08 / 65.34 / 70.30 | 75.24 / 65.55 / 70.06 |
| Boost Factor = 2 | Complete | 65.35 / 50.83 / 57.18 | 78.59 / 60.88 / 68.61 | 76.33 / 60.36 / 67.41 | 75.58 / 58.40 / 65.89 | 75.54 / 58.82 / 66.14 |
| Boost Factor = 2 | Right | 80.62 / 62.71 / 70.55 | 87.75 / 67.98 / 76.61 | 86.28 / 68.23 / 76.20 | 86.25 / 66.65 / 75.19 | 86.09 / 67.04 / 75.38 |
| Boost Factor = 2 | Left | 67.98 / 52.87 / 59.48 | 85.70 / 66.39 / 74.82 | 83.67 / 66.16 / 73.89 | 83.21 / 64.30 / 72.54 | 82.57 / 64.30 / 72.30 |
| (boost factor missing in source) | Complete | 71.43 / 48.49 / 57.77 | 81.55 / 59.32 / 68.68 | 78.99 / 59.03 / 67.57 | 78.63 / 57.04 / 66.11 | 78.70 / 57.30 / 66.32 |
| (boost factor missing in source) | Right | 84.73 / 57.53 / 68.53 | 89.93 / 65.42 / 75.74 | 88.58 / 66.20 / 75.77 | 88.64 / 64.30 / 74.53 | 88.55 / 64.46 / 74.61 |
| (boost factor missing in source) | Left | 74.38 / 50.50 / 60.16 | 88.66 / 64.50 / 74.67 | 86.10 / 64.35 / 73.65 | 86.00 / 62.39 / 72.31 | 85.58 / 62.31 / 72.11 |
| (boost factor missing in source) | Complete | 71.43 / 44.52 / 54.85 | 81.41 / 57.14 / 67.15 | 78.76 / 56.91 / 66.08 | 78.34 / 54.74 / 64.45 | 78.49 / 54.87 / 64.59 |
| (boost factor missing in source) | Right | 85.06 / 53.02 / 65.32 | 90.14 / 63.27 / 74.35 | 88.21 / 63.74 / 74.00 | 88.42 / 61.78 / 72.73 | 88.41 / 61.81 / 72.76 |
| (boost factor missing in source) | Left | 73.89 / 46.06 / 56.75 | 88.59 / 62.18 / 73.08 | 86.47 / 62.48 / 72.54 | 86.44 / 60.39 / 71.11 | 85.83 / 60.00 / 70.63 |

Experimental Results: Multi-Class (Linear Kernel)

Cells show recall / precision / F-score.

| Named Entity | 1978-1989 Set | 1990-1999 Set | 2000-2001 Set | S/1998-2001 Set | Total |
|---|---|---|---|---|---|
| protein | 58.62 / 70.83 / 64.15 | 72.68 / 63.43 / 67.74 | 70.83 / 62.03 / 66.14 | 71.28 / 60.19 / 65.27 | 70.37 / 62.00 / 65.92 |
| DNA | 61.61 / 65.71 / 63.59 | 52.21 / 63.01 / 57.10 | 52.55 / 70.59 / 60.25 | 47.11 / 69.60 / 56.19 | 51.00 / 67.64 / 58.16 |
| RNA | 0.00 / 0.00 / 0.00 | 55.10 / 57.45 / 56.25 | 50.00 / 74.29 / 59.77 | 50.00 / 62.50 / 55.56 | 51.16 / 63.77 / 57.02 |
| cell_type | 51.79 / 73.55 / 60.78 | 51.42 / 72.84 / 60.28 | 52.94 / 82.89 / 64.62 | 50.09 / 81.31 / 61.99 | 59.56 / 78.00 / 67.55 |
| cell_line | 32.39 / 67.86 / 43.85 | 50.00 / 50.60 / 50.30 | 56.94 / 55.03 / 55.97 | 53.53 / 43.75 / 48.15 | 47.72 / 51.64 / 49.61 |
| Overall | 53.18 / 70.79 / 60.73 | 63.68 / 63.63 / 63.66 | 64.15 / 65.39 / 64.76 | 62.97 / 63.16 / 63.06 | 62.43 / 64.50 / 63.45 |
| Correct Right | 71.55 / 95.25 / 81.72 | 79.44 / 79.38 / 79.41 | 78.95 / 80.47 / 79.70 | 78.40 / 78.64 / 78.52 | 78.05 / 80.65 / 79.33 |
| Correct Left | 56.12 / 74.72 / 64.10 | 70.33 / 70.28 / 70.31 | 70.95 / 72.31 / 71.63 | 69.68 / 69.90 / 69.79 | 68.76 / 71.06 / 69.89 |

Performance Comparison

Cells show recall / precision / F-score.

| System | 1978-1989 Set | 1990-1999 Set | 2000-2001 Set | S/1998-2001 Set | Total |
|---|---|---|---|---|---|
| Zhou (Zhou and Su 2004) | 75.3 / 69.5 / 72.3 | 77.1 / 69.2 / 72.9 | 75.6 / 71.3 / 73.8 | 75.8 / 69.5 / 72.5 | 76.0 / 69.4 / 72.6 |
| Song (Song et al. 2004) | 60.3 / 66.2 / 63.1 | 71.2 / 65.6 / 68.2 | 69.5 / 65.8 / 67.6 | 68.3 / 64.0 / 66.1 | 67.8 / 64.8 / 66.3 |
| Rössler (Rössler 2004) | 59.2 / 60.3 / 59.8 | 70.3 / 61.8 / 65.8 | 68.4 / 61.5 / 64.8 | 68.3 / 60.4 / 64.1 | 67.4 / 61.0 / 64.0 |
| Habib | 53.2 / 70.8 / 60.7 | 63.7 / 63.6 / 63.7 | 64.2 / 65.4 / 64.8 | 63.0 / 63.2 / 63.1 | 62.4 / 64.5 / 63.5 |
| Park (Park et al. 2004) | 62.8 / 55.9 / 59.2 | 70.3 / 61.4 / 65.6 | 65.1 / 60.4 / 62.7 | 65.9 / 59.7 / 62.7 | 66.5 / 59.8 / 63.0 |
| Lee (Lee, Hwang et al. 2004) | 42.5 / 42.0 / 42.2 | 52.5 / 49.1 / 50.8 | 53.8 / 50.9 / 52.3 | 52.3 / 48.1 / 50.1 | 50.8 / 47.6 / 49.1 |
| Baseline (Kim et al. 2004) | 47.1 / 33.9 / 39.4 | 56.8 / 45.5 / 50.5 | 51.7 / 46.3 / 48.8 | 52.6 / 46.0 / 49.1 | 52.6 / 43.6 / 47.7 |
| Giuliano (Giuliano et al. 2005) | -- | -- | -- | -- | 64.4 / 69.8 / 67.0 |

How Long Was Training Time?

| Test Type | Kernel Type | Training Time | Recall | Precision | F-score |
|---|---|---|---|---|---|
| Protein | Linear | 814.85 sec. (13.58 min.) | 68.13 | 59.33 | 63.43 |
| Protein | Polynomial degree=2 | 390082.24 sec. (6501.37 min.) | 69.93 | 62.23 | 65.86 |
| Protein | Polynomial degree=3 | 78506.33 sec. (1308.44 min.) | 68.47 | 62.44 | 65.32 |
| Protein | Radial basis | 556858.22 sec. (9280.97 min.) | < 0.1 | < 10 | < 0.1 |
| Multiclass | Linear | 353367.03 sec. (5889.45 min.) | 70.09 (P), 63.01 (A) | 61.93 (P), 64.44 (A) | 65.76 (P), 63.71 (A) |

All tests performed on the same machine (Dual Core Xeon 3.6 GHz)

Margin error = 0.1, max. memory = 2GB for all tests
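The reported training times are easier to compare when converted to hours and days and expressed relative to the fastest run. A small sketch, using only the times from the table above; the dictionary keys and the `summarize` helper are hypothetical names.

```python
# Training times in seconds, copied from the training-time table.
training_seconds = {
    "Protein / Linear": 814.85,
    "Protein / Polynomial degree=2": 390082.24,
    "Protein / Polynomial degree=3": 78506.33,
    "Protein / Radial basis": 556858.22,
    "Multiclass / Linear": 353367.03,
}

baseline = training_seconds["Protein / Linear"]

def summarize(seconds, baseline_seconds):
    """Return (hours, days, slowdown factor relative to the baseline run)."""
    hours = seconds / 3600.0
    return hours, hours / 24.0, seconds / baseline_seconds

summary = {name: summarize(s, baseline) for name, s in training_seconds.items()}
for name, (hours, days, factor) in summary.items():
    print("%-32s %8.1f h %6.2f d %7.1fx" % (name, hours, days, factor))
```

The degree-2 polynomial kernel run, for example, works out to roughly 108 hours (about 4.5 days), several hundred times the linear single-class run.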

NER/SVM Scalability Problems

Evidence exists that using high-dimensional orthographic and contextual features leads to good NER classification.
Input vectors are sparse & high-dimensional.
SVM requires O(n^3) time and O(n^2) memory for single class training, where n is input size.
Multi-class training time is much higher, especially for all-together optimization.
SVM is impractical for large input datasets, especially with non-linear kernel functions.
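A back-of-the-envelope estimate shows what O(n^2) memory means at this scale. The sketch assumes one training example per token and a fully materialized, dense kernel matrix of 8-byte values; real solvers avoid this through caching and decomposition, so the figure is an upper bound for illustration, not a measurement.

```python
def kernel_matrix_bytes(n, bytes_per_entry=8):
    """Memory for a dense n x n kernel matrix of double-precision values."""
    return n * n * bytes_per_entry

TRAINING_TOKENS = 492_551           # training set size reported for JNLPBA-04
MEMORY_LIMIT_BYTES = 2 * 1024**3    # the 2 GB cap used in the baseline tests

needed = kernel_matrix_bytes(TRAINING_TOKENS)
needed_tib = needed / 1024**4
ratio = needed / MEMORY_LIMIT_BYTES
```

Under these assumptions the full matrix would need about 1.8 TiB, roughly 900 times the 2 GB limit, which is why the matrix cannot simply be held in memory.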

Other Practical Challenges

Integrated tools are not available
Lack of standardization, incompatible interfaces, need to “reinvent the wheel” to fit pieces together
How to implement new algorithms for partial problems?
How to incorporate optional components into the overall NER/SVM solution?
How to select model parameters?
How to select a kernel function that is suitable for a given problem’s data?
Adding new training data requires restarting the learning process

Proposed Approach

Reduce online memory requirements
– Use a database repository
Reduce training time
– Database-supported algorithms
– Special focus on improving multi-class optimization algorithm

Proposed Architecture

Database-Supported Algorithms

Use DBMS to store input vectors, evolving model, and intermediate training results
Decompose SVM solution into modules that perform specific tasks, sharing data
Input and intermediate data resulting from previous experiments may be reused for others, thereby reducing recomputation
As a by-product, building a growing gazetteer list is facilitated; it may be used to improve performance measures
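One way to picture storing input vectors in a DBMS is a sparse (example, feature, value) table, with kernel evaluations expressed as joins. The sketch below is an assumption-laden illustration, not the proposed design: it uses Python's built-in `sqlite3` as a stand-in for the database, and the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE example (example_id INTEGER PRIMARY KEY, label INTEGER NOT NULL);
    CREATE TABLE feature_value (
        example_id INTEGER NOT NULL REFERENCES example(example_id),
        feature_id INTEGER NOT NULL,
        value      REAL    NOT NULL,
        PRIMARY KEY (example_id, feature_id)
    );
""")

# Sparse input vectors: only non-zero binary features are stored.
conn.executemany("INSERT INTO example VALUES (?, ?)", [(1, 1), (2, -1), (3, 1)])
conn.executemany(
    "INSERT INTO feature_value VALUES (?, ?, ?)",
    [(1, 10, 1.0), (1, 42, 1.0), (2, 42, 1.0), (2, 77, 1.0), (3, 10, 1.0), (3, 42, 1.0)],
)

def linear_kernel(conn, id_a, id_b):
    """Sparse dot product of two stored vectors, computed inside the database."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(a.value * b.value), 0.0)
        FROM feature_value AS a
        JOIN feature_value AS b ON a.feature_id = b.feature_id
        WHERE a.example_id = ? AND b.example_id = ?
        """,
        (id_a, id_b),
    ).fetchone()
    return row[0]

k12 = linear_kernel(conn, 1, 2)   # examples 1 and 2 share one feature
k13 = linear_kernel(conn, 1, 3)   # examples 1 and 3 share two features
```

Because the vectors live in the database rather than in process memory, a training module can fetch or compute only what it needs at each step, and the same stored data can be reused by later experiments.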

Embedded Database Modules

Eliminate communication overhead
Take advantage of DB caching and parallelization ability
Provide a base for a potential SOA for SVM using the DB for data exchange
Extend the DBMS with reusable classification modules
Present a unified interface to the user

Proposed Approach - DBMS

Open source: PostgreSQL or MySQL?
PostgreSQL is selected due to its rich features, adherence to standards, and the flexible options to extend the DBMS via internal or embedded functions.
MySQL had better performance, but the latest versions of PostgreSQL have improved performance and enhanced scalability.

Evaluation Plan

Test using the JNLPBA-04 biomedical data
Repeat experiments with different input sizes and track training time. Compare to training time using traditional SVM.
Evaluate classification performance.
(Optional) Re-train using more data and verify that previously stored model can be augmented with new data.

May need to regenerate features for JNLPBA-04 and repeat baseline tests. Some features were excluded in previous tests due to memory shortage.

Success Criteria

Demonstrate that the database-supported approach requires less online memory and total training time than traditional SVM.
Show that training time remains consistently lower than that of traditional SVM as input data size increases.
Precision/Recall/F-score performance measures remain comparable to those obtained using traditional SVM.
(Optional) Demonstrate the ability to train incrementally without restarting the learning process.
(Auxiliary) Recommend dynamic architecture to improve SVM usability by allowing definition of different solutions.

Potential Contributions

• Improved SVM scalability
• Improved multi-class training
• NER solution that is portable across languages and domains
• (Auxiliary) Recommendation of architecture that promotes future SVM research in focused areas

Thank You!

Thank You for Your Time!

Questions?

References

Bennett, Kristin P., and Colin Campbell. 2000. Support Vector Machines: Hype or Hallelujah? SIGKDD Explor. Newsl. 2 (2):1-13

Collier, Nigel, and Koichi Takeuchi. 2004. Comparison of Character-Level and Part of Speech Features for Name Recognition in Biomedical Texts. Journal of Biomedical Informatics 37 (6):423-435.

Giuliano, Claudio, Alberto Lavelli, et al. Simple Information Extraction (SIE). ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, 2005. Available from http://tcc.itc.it/research/textec/tools-resources/sie/giuliano-sie.pdf

Joachims, Thorsten. 2002. Learning to Classify Text Using Support Vector Machines. Norwell, MA: Kluwer Academic.

———. 2006. Training Linear SVMs in Linear Time. In Proc. of the ACM Conference on Knowledge Discovery and Data Mining (KDD), Aug 20-23.

Kim, Jin-Dong, Tomoko Ohta, et al. 2004. Introduction to the Bio-Entity Recognition Task at JNLPBA. In Proc. of the JNLPBA'2004 Workshop, Aug 28-29, at Geneva, Switzerland.

Lee, K. J., Y. S. Hwang, et al. 2004. Biomedical Named Entity Recognition using Two-Phase Model Based on SVMs. Journal of Biomedical Informatics 37 (6):436-447.

References (ctd.)

Milenova, Boriana L., Joseph S. Yarmus, et al. 2005. SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption of Support Vector Machines. In Proc. of the 31st international conference on Very large data bases, Aug 30 - Sep 2, at Trondheim, Norway.

Park, Kyung-Mi, Seon-Ho Kim, et al. 2004. Incorporating Lexical Knowledge into Biomedical NE Recognition. In Proc. of the JNLPBA'2004 Workshop, Aug 28-29, at Geneva, Switzerland.

Rössler, Marc. 2004. Adapting an NER-System for German to the Biomedical Domain. In Proc. of the JNLPBA'2004 Workshop, Aug 28-29, at Geneva, Switzerland.

Song, Yu, Eunju Kim, et al. 2004. POSBIOTM-NER in the Shared Task of BioNLP/NLPBA 2004. In Proc. of the JNLPBA'2004 Workshop, Aug 28-29, at Geneva, Switzerland.

Sullivan, Keith M., and Sean Luke. 2007. Evolving Kernels for Support Vector Machine Classification. In Proc. of the 9th annual conference on Genetic and evolutionary computation, at London, England.

Vapnik, Vladimir N. 1995. The Nature of Statistical Learning Theory. Springer.

Winters-Hilt, Stephen, Anil Yelundur, et al. 2006. Support Vector Machine Implementations for Classification and Clustering. BMC Bioinformatics 7:S4.

Zhou, GuoDong, and Jian Su. 2004. Exploring Deep Knowledge Resources in Biomedical Name Recognition. In Proc. of the JNLPBA'2004 Workshop, Aug 28-29, at Geneva, Switzerland.

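Multi-Class Formulation (One-Against-All)

Given $l$ training data $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, and $y_i \in \{1, \ldots, k\}$ is the class of $x_i$, the $i$th SVM solves the following problem:

$$
\min_{w^i,\, b^i,\, \xi^i} \quad \frac{1}{2} (w^i)^T w^i + C \sum_{j=1}^{l} \xi^i_j
$$

such that

$$
\begin{aligned}
(w^i)^T \phi(x_j) + b^i &\ge 1 - \xi^i_j, && \text{if } y_j = i, \\
(w^i)^T \phi(x_j) + b^i &\le -1 + \xi^i_j, && \text{if } y_j \ne i, \\
\xi^i_j &\ge 0, && j = 1, \ldots, l.
\end{aligned}
$$

$x$ is in the class which has the largest value of the decision function:

$$
\text{class of } x \equiv \arg\max_{i = 1, \ldots, k} \left( (w^i)^T \phi(x) + b^i \right)
$$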
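The one-against-all decision rule can be illustrated with a short sketch. It assumes a linear kernel, so that $\phi(x) = x$, and that the $k$ weight vectors and biases have already been trained; the function names and the toy weights are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def one_against_all_predict(weights: List[Sequence[float]],
                            biases: Sequence[float],
                            x: Sequence[float]) -> int:
    """Return the class (numbered 1..k) whose decision value w^i . x + b^i is largest."""
    scores = [dot(w, x) + b for w, b in zip(weights, biases)]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1


# Toy example with k = 3 classes in two dimensions (made-up weights).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
print(one_against_all_predict(weights, biases, [2.0, 1.0]))
```

Each of the $k$ decision values is evaluated independently, so classification cost grows linearly with the number of classes.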

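Multi-Class Formulation (One-Against-One)

Given $l$ training data $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, and $y_i \in \{1, \ldots, k\}$ is the class of $x_i$. For training data from the $i$th and the $j$th classes, we solve the following binary classification problem:

$$
\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \quad \frac{1}{2} (w^{ij})^T w^{ij} + C \sum_{t} \xi^{ij}_t
$$

such that

$$
\begin{aligned}
(w^{ij})^T \phi(x_t) + b^{ij} &\ge 1 - \xi^{ij}_t, && \text{if } y_t = i, \\
(w^{ij})^T \phi(x_t) + b^{ij} &\le -1 + \xi^{ij}_t, && \text{if } y_t = j, \\
\xi^{ij}_t &\ge 0.
\end{aligned}
$$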
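In the one-against-one scheme, $k$ classes give $k(k-1)/2$ binary problems, each trained only on the examples of its two classes. The sketch below shows how those per-pair training sets could be assembled; the function name `pairwise_problems` and the `(features, label)` example format are assumptions for illustration.

```python
from itertools import combinations
from typing import Iterator, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, class label in 1..k)


def pairwise_problems(examples: List[Example],
                      k: int) -> Iterator[Tuple[Tuple[int, int], List[Example]]]:
    """Yield ((i, j), subset) for every class pair i < j.

    Each subset keeps only examples of class i or j, relabelled +1 for
    class i and -1 for class j, ready for a binary SVM trainer.
    """
    for i, j in combinations(range(1, k + 1), 2):
        subset = [(x, 1 if y == i else -1) for x, y in examples if y in (i, j)]
        yield (i, j), subset


examples = [([0.0], 1), ([1.0], 2), ([2.0], 3), ([3.0], 1)]
for pair, subset in pairwise_problems(examples, k=3):
    print(pair, subset)
```

Every binary problem sees only a fraction of the data, which is why the individual one-against-one problems are smaller than those of one-against-all.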

Multi-Class Formulation (All-Together)

The idea is similar to the one-against-all approach. It constructs $k$ two-class rules where the $m$th function $w_m^T \phi(x) + b$ separates training vectors of the class $m$ from the other vectors. There are $k$ decision functions but all are obtained by solving one problem. The formulation is as follows:

$$
\min_{w,\, b,\, \xi} \quad \frac{1}{2} \sum_{m=1}^{k} w_m^T w_m + C \sum_{i=1}^{l} \sum_{m \ne y_i} \xi_i^m
$$

such that

$$
\begin{aligned}
w_{y_i}^T \phi(x_i) + b_{y_i} &\ge w_m^T \phi(x_i) + b_m + 2 - \xi_i^m, \\
\xi_i^m &\ge 0, \quad i = 1, \ldots, l, \quad m \in \{1, \ldots, k\} \setminus y_i.
\end{aligned}
$$

Then the decision function is

$$
\arg\max_{m = 1, \ldots, k} \left( w_m^T \phi(x) + b_m \right)
$$

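Multi-Class Formulation (All-Together – Dual Problem)

Like binary SVM, it is easier to solve the dual problem, which is:

$$
\min_{\alpha} \quad \sum_{i,j} \left( \frac{1}{2} c_j^{y_i} A_i A_j - \sum_{m} \alpha_i^m \alpha_j^{y_i} + \frac{1}{2} \sum_{m} \alpha_i^m \alpha_j^m \right) K_{i,j} - 2 \sum_{i,m} \alpha_i^m
$$

such that

$$
\begin{aligned}
&\sum_{i=1}^{l} \alpha_i^m = \sum_{i=1}^{l} c_i^m A_i, \quad m = 1, \ldots, k, \\
&0 \le \alpha_i^m \le C, \quad \alpha_i^{y_i} = 0, \\
&A_i = \sum_{m=1}^{k} \alpha_i^m, \quad
c_j^{y_i} = \begin{cases} 1 & \text{if } y_i = y_j, \\ 0 & \text{if } y_i \ne y_j, \end{cases} \\
&i = 1, \ldots, l, \quad m = 1, \ldots, k,
\end{aligned}
$$

where $K_{i,j} = \phi(x_i)^T \phi(x_j)$ and

$$
w_m = \sum_{i=1}^{l} \left( c_i^m A_i - \alpha_i^m \right) \phi(x_i), \quad m = 1, \ldots, k.
$$

The decision function is:

$$
\arg\max_{m = 1, \ldots, k} \left( \sum_{i=1}^{l} \left( c_i^m A_i - \alpha_i^m \right) K(x_i, x) + b_m \right)
$$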

Multi-Class Formulation (All-Together – Decomposition)

• Main problem for All-Together optimization is the number of variables and constraints
• Existing work attempts to mathematically transform the big problem into several independent sub-problems to minimize the number of variables/constraints in each (see Hsu and Lin 2002, Zanghirati and Zanni 2003, Tsochantaridis et al. 2004, Abe 2005, and Zanni et al. 2006)
• Providing a general solution for multi-class SVM is difficult; it may be good to focus on linear kernels only
• Based on protein experimental results, the performance improvement obtained with a polynomial kernel was also achieved using linear multi-class. Also, previous NER work shows that classes are usually linearly separable

Multi-Class Formulation (All-Together – Decomposition)

• Using a linear kernel, we can focus on decomposing the multi-class optimization into smaller sub-problems that lead to equivalent results
• Another area for potential improvement is to limit the representation of input data for each class to those points that are closer to the separating hyperplane (positive examples far from the hyperplane do not contribute to the solution)
• Reducing the number of relevant data points based on distance from the separating hyperplane is easier to achieve when vectors and distances are stored in the database (e.g., may select the 90th percentile)
• In summary, we can focus on improving multi-class SVM for linear machines, suitable for NER
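The idea of keeping only the points nearest the separating hyperplane can be sketched as follows. This is an in-memory Python illustration rather than the database-side selection the slides envision. It assumes a linear machine with known `w` and `b`, and it reads "90th percentile" as keeping the 90% of points closest to the hyperplane, which is one possible interpretation. All names are hypothetical.

```python
import math
from typing import List, Sequence


def distance_to_hyperplane(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Geometric distance |w . x + b| / ||w|| from point x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def keep_closest(points: List[Sequence[float]], w: Sequence[float], b: float,
                 percentile: int = 90) -> List[Sequence[float]]:
    """Keep the `percentile` percent of points closest to the hyperplane."""
    ranked = sorted(points, key=lambda x: distance_to_hyperplane(w, b, x))
    keep = max(1, (percentile * len(ranked) + 99) // 100)  # integer ceiling
    return ranked[:keep]


points = [[float(i), 0.0] for i in range(1, 11)]
print(keep_closest(points, w=[1.0, 0.0], b=0.0))
```

With distances stored alongside the vectors in a database table, the same selection would reduce to an ordered query with a row limit.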

References (Multi-class)

Abe, Shigeo. 2005. Support Vector Machines for Pattern Classification. Edited by Sameer Singh, Advances in Pattern Recognition. London: Springer-Verlag.

Hsu, Chih-Wei, and Chih-Jen Lin. 2002. A Comparison of Methods for Multi-Class Support Vector Machines. IEEE Transactions on Neural Networks 13:415-425.

Joachims, Thorsten. 2006. Training Linear SVMs in Linear Time. In Proc. of the ACM Conference on Knowledge Discovery and Data Mining (KDD), Aug 20-23.

Tsochantaridis, Ioannis, Thomas Hofmann, et al. 2004. Support Vector Learning for Interdependent and Structured Output Spaces. In Proc. of the 21st International Conference on Machine Learning (ICML), Jul 4-8, at Alberta, Canada.

Zanghirati, Gaetano, and Luca Zanni. 2003. A Parallel Solver for Large Quadratic Programs in Training Support Vector Machines. Parallel Computing 29:535-551.

Zanni, Luca, Thomas Serafini, et al. 2006. Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems. Journal of Machine Learning Research 7:1467-1492.

SVM Usability Issues (Auxiliary)

• Integrated tools are not available
• Lack of standardization, incompatible interfaces, need to “reinvent the wheel” to fit pieces together
• How to incorporate optional components into the overall solution?
• How to implement new algorithms for partial problems?
• Model selection (parameter tuning)
  – Grid search, cross validation, decision trees, heuristics
• Kernel selection
  – Heuristics based on input dimensions, cross validation
• Input data formatting, scaling, normalization
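Of the model selection options listed, grid search is the simplest to illustrate. The sketch below tries every parameter combination and keeps the best-scoring one. The `evaluate` callback is a hypothetical stand-in for a cross-validated score such as an F-score, and the parameter names `C` and `degree` are only examples.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def grid_search(evaluate: Callable[[Dict[str, float]], float],
                grid: Dict[str, Sequence[float]]) -> Tuple[Dict[str, float], float]:
    """Score every combination of parameter values and return the best one."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)               # e.g. a cross-validation score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in scoring function (assumption): peaks at C = 1, degree = 2.
toy_score = lambda p: -(p["C"] - 1) ** 2 - (p["degree"] - 2) ** 2
print(grid_search(toy_score, {"C": [0.1, 1, 10], "degree": [1, 2, 3]}))
```

The number of evaluations is the product of the grid sizes, which is one reason parameter tuning becomes costly when each evaluation means retraining an SVM.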

Recommended Dynamic Service-Oriented Architecture

Addressing Usability (Auxiliary)

• Dynamic architecture
  – Decompose the problem into modules designed using a Service Oriented Architecture (SOA).
• Easily maintainable and expandable through clear inter-module interfaces
• Database schema design supports the architecture and maintains SOA exchange data in addition to machine learning data
• Design promotes future research: no need to rebuild experimentation infrastructure from scratch, focus on specific components

Addressing Usability (Auxiliary)

• DBMS embedded modules
  – Extend the DBMS with the machine learning and other optional modules
• Model and kernel selection
  – Made possible through optional modules
• Alternate algorithm implementations can co-exist and are selectable by user
• Incremental training
  – Allow training to resume when new training data is available

Additional Scalability Modules (Auxiliary)

• Architecture allows inclusion of modules that accomplish optional tasks, e.g.,
  – Reduce input data size
    • Active learning
      – Train the machine with a small chunk of data. Incrementally include relevant data
    • Active clustering
      – Cluster input data in input space or feature space
  – Reduce number of support vectors
  – Feature dimensionality reduction

SVM Classification Definitions

• Input Space: $X \subseteq \mathbb{R}^N$, where $N$ is the number of features
• Binary Output Space: $Y = \{+1, -1\}$
• Multi-Class Output Space: $Y \subseteq I$
• A point, pattern or instance: $x \in X$, $x = (x_1, x_2, \ldots, x_N)$
• Example: $(x, y)$ with $x \in X$, $y \in Y$
• Training Set: a set of $k$ examples generated according to an unknown distribution $P(x, y)$

$$
S = \{(x_1, y_1), \ldots, (x_k, y_k)\} \subseteq (X \times Y)^k
$$

SVM as a Neural Network

First layer: input space. Second layer: (higher dimension) feature space.

Research Timeline - Completed

| Task Name | Estimated Duration | Start Date | Finish Date |
|---|---|---|---|
| NER infrastructure design | 130d | Mon 5/2/05 | Fri 9/23/05 |
| Locate data sources | 15d | Mon 5/2/05 | Fri 5/20/05 |
| Evaluate machine learning options | 15d | Mon 5/23/05 | Fri 6/10/05 |
| Evaluate machine learning packages | 70d | Mon 6/13/05 | Fri 9/16/05 |
| Evaluate SVM implementations | 30d | Mon 9/19/05 | Fri 10/28/05 |
| Evaluate feature extraction software | 30d | Mon 7/18/05 | Fri 8/26/05 |
| Generate features for JNLPBA-04 using jFex | 20d | Mon 8/29/05 | Fri 9/23/05 |
| Discover jFex input format and resolve memory problems | 15d | Mon 8/29/05 | Fri 9/16/05 |
| Partition JNLPBA-04 data and generate features | 5d | Mon 9/19/05 | Fri 9/23/05 |
| Run single class experiments | 25d | Mon 10/31/05 | Fri 12/2/05 |
| Evaluate multi-class SVM options | 20d | Mon 11/14/05 | Fri 12/9/05 |
| Evaluate SVM clustering and parallel SVM options | 130d | Mon 11/28/05 | Fri 5/26/06 |
| Run multi-class experiments (home machines, USAFA, CS machines) | 445d | Mon 12/12/05 | Fri 8/24/07 |
| Reevaluate SVM clustering options | 20d | Mon 11/27/06 | Fri 12/22/06 |
| Document baseline experiments results / Write BIOT-07 paper | 11d | Mon 7/30/07 | Mon 8/13/07 |
| Investigate NER/SVM scalability issues / identify potential solutions | 45d | Mon 3/12/07 | Fri 7/20/07 |

Research Timeline – To Do

| Task Name | Estimated Duration | Start Date | Finish Date |
|---|---|---|---|
| Address NER/SVM scalability issues | 90d | Mon 10/15/07 | Fri 2/15/08 |
| Design proposed solution architecture | 5d | Mon 10/15/07 | Fri 10/19/07 |
| Select database server | 3d | Wed 11/14/07 | Fri 11/16/07 |
| Identify equipment to use | 10d | Mon 11/5/07 | Fri 11/16/07 |
| Setup new equipment at home and repeat baseline | 7d | Fri 11/16/07 | Fri 11/23/07 |
| Design/Implement proposed solution | 65d | Mon 12/3/07 | Fri 2/29/08 |
| Design SVM decomposition | 15d | Mon 12/3/07 | Fri 12/21/07 |
| Design and build database schema | 12d | Mon 12/24/07 | Tue 1/8/08 |
| Design dynamic modules architecture | 15d | Mon 12/3/07 | Fri 12/21/07 |
| Develop input data loading module | 5d | Mon 1/7/08 | Fri 1/11/08 |
| Develop embedded database modules | 30d | Mon 12/24/07 | Fri 2/1/08 |
| (Optional) Design/Implement incremental learning algorithm | 20d | Mon 2/4/08 | Fri 2/29/08 |
| Testing and Evaluation | 60d | Mon 1/7/08 | Fri 4/4/08 |
| Run SVM tests with different input sizes | 35d | Mon 1/7/08 | Fri 2/22/08 |
| Test solution with different input sizes | 25d | Mon 3/3/08 | Fri 4/4/08 |
| Final dissertation writing | 20d | Mon 3/17/08 | Fri 4/11/08 |
| Document experiments' results | 10d | Mon 3/17/08 | Fri 3/28/08 |
| Complete dissertation writing | 20d | Mon 3/17/08 | Fri 4/11/08 |
