Incorporating N-gram Statistics in the Normalization of Clinical Notes, by Bridget Thomson McInnes


Page 1: Incorporating N-gram Statistics in the Normalization of Clinical Notes

By Bridget Thomson McInnes

Page 2: Overview

• Ngrams
• Ngram statistics for spelling correction
• Spelling correction
• Ngram statistics for multi-term identification
• Multi-term identification

Page 3: Ngram

Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient.

Bigrams:
• Her dobutamine
• Dobutamine stress
• Stress echo
• Echo showed
• Showed mild
• Mild aortic
• Aortic stenosis
• Stenosis with
• With a
• A subaortic
• Subaortic gradient

Trigrams:
• her dobutamine stress
• dobutamine stress echo
• stress echo showed
• echo showed mild
• showed mild aortic
• mild aortic stenosis
• aortic stenosis with
• stenosis with a
• with a subaortic
• a subaortic gradient
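
Not part of the original slides: a minimal Python sketch of how the bigram and trigram lists above can be generated from the tokenized sentence.

def ngrams(tokens, n):
    # Return the list of n-grams (as word tuples) in a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
sentence = "Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient"
tokens = sentence.lower().split()
bigrams = ngrams(tokens, 2)   # ('her', 'dobutamine'), ('dobutamine', 'stress'), ...
trigrams = ngrams(tokens, 3)  # ('her', 'dobutamine', 'stress'), ...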

Page 4: Contingency Tables

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

• n11 = the joint frequency of word 1 and word 2
• n12 = the frequency with which word 1 occurs and word 2 does not
• n21 = the frequency with which word 2 occurs and word 1 does not
• n22 = the frequency with which neither word 1 nor word 2 occurs
• npp = the total number of ngrams
• n1p, np1, np2, n2p are the marginal counts

Page 5: Contingency Tables

             echo    !echo
  stress     1       0        1
  !stress    0       10       10
             1       10       11

Bigram counts from the example sentence:
Her dobutamine 1
Dobutamine stress 1
Stress echo 1
Echo showed 1
Showed mild 1
Mild aortic 1
Aortic stenosis 1
Stenosis with 1
With a 1
A subaortic 1
Subaortic gradient 1
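
Not part of the original slides: a Python sketch of how the 2x2 table above can be filled from the bigram counts; the helper names are illustrative, not from the talk.

from collections import Counter
def contingency(bigram_counts, w1, w2):
    # Build the 2x2 table (n11, n12, n21, n22) for the bigram (w1, w2).
    npp = sum(bigram_counts.values())                                  # total number of bigrams
    n11 = bigram_counts[(w1, w2)]                                      # w1 followed by w2
    n1p = sum(c for (a, _), c in bigram_counts.items() if a == w1)     # w1 in first position
    np1 = sum(c for (_, b), c in bigram_counts.items() if b == w2)     # w2 in second position
    n12, n21 = n1p - n11, np1 - n11
    n22 = npp - n1p - np1 + n11                                        # neither word occurs
    return n11, n12, n21, n22
tokens = "her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient".split()
counts = Counter(zip(tokens, tokens[1:]))
print(contingency(counts, "stress", "echo"))   # (1, 0, 0, 10), matching the table above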

Page 6: Contingency Tables: Expected Values

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

Expected values:
• m11 = (np1 * n1p) / npp
• m12 = (np2 * n1p) / npp
• m21 = (np1 * n2p) / npp
• m22 = (np2 * n2p) / npp
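
Not part of the original slides: the same expected-value formulas as a small Python sketch, computing the marginals directly from the observed cells.

def expected_values(n11, n12, n21, n22):
    # Expected cell counts under independence of the two words.
    n1p, n2p = n11 + n12, n21 + n22      # row totals
    np1, np2 = n11 + n21, n12 + n22      # column totals
    npp = n1p + n2p                      # total number of bigrams
    return (np1 * n1p / npp, np2 * n1p / npp,
            np1 * n2p / npp, np2 * n2p / npp)
print(expected_values(1, 0, 0, 10))   # 'stress echo' table: (0.09..., 0.91..., 0.91..., 9.09...)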

Page 7: Contingency Tables

             echo    !echo
  stress     1       0        1
  !stress    0       10       10
             1       10       11

Expected values:
• m11 = ( 1 *  1) / 11 = 0.09
• m12 = ( 1 * 10) / 11 = 0.91
• m21 = ( 1 * 10) / 11 = 0.91
• m22 = (10 * 10) / 11 = 9.09

What is this telling you?

'stress echo' occurs once in our example. The expected frequency of 'stress echo', if the two words were independent, is 0.09 (m11).

Page 8: Ngram Statistics

Measures of association:
• Log Likelihood Ratio
• Chi Squared Test
• Odds Ratio
• Phi Coefficient
• T-Score
• Dice Coefficient
• True Mutual Information

Page 9: Log Likelihood Ratio

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

Log Likelihood = 2 * ∑ij ( nij * log( nij / mij ) )

The log likelihood ratio measures how far the observed values diverge from the expected values: it is twice the sum, over the cells of the table, of each observed count multiplied by the log of the ratio of the observed count to the expected count.
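
Not from the original slides: a minimal Python sketch of the formula above, assuming the natural log (the slides do not state the base) and letting empty cells contribute zero.

import math
def log_likelihood(n11, n12, n21, n22):
    # G2 = 2 * sum over cells of n_ij * log(n_ij / m_ij); cells with n_ij = 0 contribute 0.
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    observed = (n11, n12, n21, n22)
    expected = (np1 * n1p / npp, np2 * n1p / npp, np1 * n2p / npp, np2 * n2p / npp)
    return 2 * sum(n * math.log(n / m) for n, m in zip(observed, expected) if n > 0)
print(log_likelihood(1, 0, 0, 10))   # 'stress echo' table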

Page 10: Chi Squared Test

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

X2 = ∑ij ( (nij - mij)^2 / mij )

The chi squared test also measures how far the observed values diverge from the expected values: it is the sum, over the cells of the table, of the squared difference between the observed and expected counts divided by the expected count.
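
Again not from the slides: the same formula as a short Python sketch.

def chi_squared(n11, n12, n21, n22):
    # X2 = sum over cells of (n_ij - m_ij)^2 / m_ij.
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    observed = (n11, n12, n21, n22)
    expected = (np1 * n1p / npp, np2 * n1p / npp, np1 * n2p / npp, np2 * n2p / npp)
    return sum((n - m) ** 2 / m for n, m in zip(observed, expected))
print(chi_squared(1, 0, 0, 10))   # 'stress echo' table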

Page 11: Odds Ratio

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

Odds Ratio = (n11 * n22) / (n21 * n12)

The odds ratio is the ratio of the odds of an event taking place to the odds of it not taking place. It is the cross-product ratio of the 2x2 contingency table and measures the magnitude of the association between the two words.
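
A minimal sketch, not from the slides; how zero off-diagonal cells are handled here is an assumption.

def odds_ratio(n11, n12, n21, n22):
    # Cross-product ratio of the 2x2 table.
    if n21 == 0 or n12 == 0:
        return float("inf")   # assumed convention when an off-diagonal cell is empty
    return (n11 * n22) / (n21 * n12)
print(odds_ratio(1, 0, 0, 10))   # 'stress echo' table: infinite, since n12 = n21 = 0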

Page 12: Phi Coefficient

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt( np1 * n1p * n2p * np2 )

The bigram is considered positively associated if most of the data lies along the diagonal (that is, if n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.
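
Not from the slides: the same formula as a short Python sketch.

import math
def phi_coefficient(n11, n12, n21, n22):
    # Phi = (n11*n22 - n21*n12) / sqrt(np1 * n1p * n2p * np2).
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    return (n11 * n22 - n21 * n12) / math.sqrt(np1 * n1p * n2p * np2)
print(phi_coefficient(1, 0, 0, 10))   # 1.0: 'stress echo' lies entirely on the diagonal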

Page 13: T Score

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

T Score = ( n11 - m11 ) / sqrt( n11 )

The t-score assesses whether there is a non-random association between the two words. It is the difference between the observed and expected joint frequencies divided by the square root of the observed joint frequency.
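
Not from the slides: the same formula as a short Python sketch.

import math
def t_score(n11, n12, n21, n22):
    # T = (n11 - m11) / sqrt(n11), where m11 is the expected joint frequency.
    n1p, np1 = n11 + n12, n11 + n21
    npp = n11 + n12 + n21 + n22
    m11 = np1 * n1p / npp
    return (n11 - m11) / math.sqrt(n11)
print(t_score(1, 0, 0, 10))   # 'stress echo' table: (1 - 1/11) / 1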

Page 14: Dice Coefficient

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

Dice Coefficient = 2 * n11 / (np1 + n1p)

The dice coefficient depends on the frequency of the events occurring together and on their individual frequencies.
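
Not from the slides: the same formula as a short Python sketch.

def dice_coefficient(n11, n12, n21, n22):
    # Dice = 2 * n11 / (np1 + n1p).
    n1p = n11 + n12   # frequency of word 1
    np1 = n11 + n21   # frequency of word 2
    return 2 * n11 / (np1 + n1p)
print(dice_coefficient(1, 0, 0, 10))   # 1.0 for 'stress echo'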

Page 15: True Mutual Information

             Word 2      !Word 2
  Word 1     n11         n12         n1p
  !Word 1    n21         n22         n2p
             np1         np2         npp

TMI = ∑ij ( (nij / npp) * log( nij / mij ) )

True mutual information measures to what extent the observed frequencies differ from the expected frequencies.
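
Not from the slides: the same formula as a Python sketch, assuming the natural log and letting empty cells contribute zero.

import math
def true_mutual_information(n11, n12, n21, n22):
    # TMI = sum over cells of (n_ij / npp) * log(n_ij / m_ij); cells with n_ij = 0 contribute 0.
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    observed = (n11, n12, n21, n22)
    expected = (np1 * n1p / npp, np2 * n1p / npp, np1 * n2p / npp, np2 * n2p / npp)
    return sum((n / npp) * math.log(n / m) for n, m in zip(observed, expected) if n > 0)
print(true_mutual_information(1, 0, 0, 10))   # 'stress echo' table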

Page 16: Spelling Correction

Using context-sensitive information, through the bigrams, to determine the ranking of a given set of possible spelling corrections for a misspelled word.

Given:
• First content word prior to the misspelled word
• First content word after the misspelled word
• List of possible spelling corrections

Page 17: Spelling Correction Example

Example sentence:
Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient.

List of possible corrections:
• artic
• aortic

Statistical analysis, basic idea:

her  dobutamine  stress  echo  showed
mild  POS  stenosis
with  subaortic  gradient

POS marks the position of the misspelled word; the surrounding content words 'mild' and 'stenosis' supply the context.

Page 18: Spelling Correction Statistics

Possible 1:                        Possible 2:
mild artic          0.40           mild aortic          0.66
artic stenosis      0.03           aortic stenosis      0.30
Weighted average    0.215          Weighted average     0.46

• This allows us to take into consideration finding a bigram with the word prior to the misspelling and with the word after the misspelling
• Each possible word with its score is then returned
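
An illustrative Python sketch, not the talk's implementation: each candidate correction is scored from the bigram associations of (previous content word, candidate) and (candidate, next content word). An unweighted mean of the two scores is assumed here; the slides call it a weighted average without stating the weights.

def score_candidate(assoc, prev_word, candidate, next_word):
    # Combine the left-context and right-context bigram scores for one candidate.
    left = assoc.get((prev_word, candidate), 0.0)
    right = assoc.get((candidate, next_word), 0.0)
    return (left + right) / 2
assoc = {("mild", "artic"): 0.40, ("artic", "stenosis"): 0.03,    # scores from the slide
         ("mild", "aortic"): 0.66, ("aortic", "stenosis"): 0.30}  # (measure unspecified)
ranked = sorted(["artic", "aortic"],
                key=lambda c: score_candidate(assoc, "mild", c, "stenosis"), reverse=True)
print(ranked)   # 'aortic' ranks above 'artic'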

Page 19: Types of Results

• Gspell only
• Context sensitive only
• Hybrid of both Gspell and context (see the sketch after this list)
  • Takes the average of the Gspell and context-sensitive scores
  • Note: this turns into a backoff method when no statistical data is found for any of the possibilities
• Backoff method
  • Use the context-sensitive score unless it does not exist; then revert to the Gspell score
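
Not from the slides: a minimal Python sketch of the hybrid and backoff combinations, assuming the two scores are on comparable scales and that the context score is None when no statistical data was found.

def hybrid_score(gspell_score, context_score):
    if context_score is None:                   # no bigram data: fall back to Gspell alone
        return gspell_score
    return (gspell_score + context_score) / 2   # average of the two scores
def backoff_score(gspell_score, context_score):
    return context_score if context_score is not None else gspell_score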

Page 20: Preliminary Test Set

• Test set: partially scrubbed clinical notes
• Size: 854 words
• Number of misspellings: 82
• Includes abbreviations

Page 21: Preliminary Results

GSPELL results:

GSPELL                   Precision   Recall   F-measure
                         0.5357      0.7317   0.6186

Context sensitive results:

Measure of Association   Precision   Recall   F-measure
PHI                      0.6161      0.8415   0.7113
LL                       0.6071      0.8293   0.7010
TMI                      0.6071      0.8293   0.7010
ODDS                     0.6071      0.8293   0.7010
X2                       0.6161      0.8415   0.7113
TSCORE                   0.5625      0.7683   0.6495
DICE                     0.6339      0.8659   0.7320

Page 22: Preliminary Results

Hybrid method results:

Measure of Association   Precision   Recall   F-measure
PHI                      0.6607      0.9024   0.7629
LL                       0.6339      0.8659   0.7320
TMI                      0.6607      0.9024   0.7629
ODDS                     0.6250      0.8537   0.7216
X2                       0.6339      0.8659   0.7320
TSCORE                   0.6071      0.8293   0.7010
DICE                     0.6696      0.9146   0.7732

Page 23: Notes on Log Likelihood

• Log likelihood is used quite often for context-sensitive spelling correction
• Problem with large sample sizes:
  • The marginal values are very large because of the sample size
  • This inflates the expected values, so the observed values are commonly much lower than the expected values
  • Very independent and very dependent ngrams end up with the same value
• Similar characteristics were noticed with true mutual information

Page 24: Example of Problem

             hip         !hip
  follow     n11         88951        88962
  !follow    65729       69783140     69848869
             65740       69872091     69937831

n11     Log Likelihood
11      145.3647
190     143.4268
86      0.09864

Page 25: Conclusions with Preliminary Results

• The Dice coefficient returns the best results
• The Phi coefficient returns the second best
• Log likelihood and true mutual information should not be used
• The program now needs to be tested on a more extensive test bed, which is in the process of being created

Page 26: Ngram Statistics for Multi-Term Identification

• Cannot use the previous statistics package
  • Memory constraints due to the amount of data
  • Would like to look for longer ngrams
• Alternative: suffix arrays (Church and Yamamoto)
  • Reduces the amount of memory
  • Two arrays
    • One contains the corpus
    • One contains identifiers to the ngrams in the corpus
  • Two stacks
    • One contains the longest common prefix
    • One contains the document frequency
  • Allows ngrams up to the size of the corpus to be found

Page 27: Suffix Arrays

To be or not to be

• to be or not to be
• be or not to be
• or not to be
• not to be
• to be
• be

• Each array element is considered a suffix
• An ngram is from a suffix until the end of the array

Page 28: Suffix Arrays

Suffixes:
to be or not to be
be or not to be
or not to be
not to be
to be
be

Sorted:
[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be

Actual suffix array: 5 1 3 2 4 0
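
Not from the slides: a minimal word-level suffix array for this corpus in Python. Church and Yamamoto's method also maintains the longest-common-prefix and document-frequency stacks, which are omitted in this sketch.

corpus = "to be or not to be".split()
# One identifier per suffix, sorted by the suffix it points to.
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])
print(suffix_array)   # [5, 1, 3, 2, 4, 0], as on the slide
for i in suffix_array:
    print(i, "=>", " ".join(corpus[i:]))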

Page 29: Term Frequency

• Term frequency (tf) is the number of times an ngram occurs in the corpus
• To determine the tf of an ngram:
  • Sort the suffix array
  • tf = j - i + 1, where i is the index of the ngram's first occurrence in the sorted array and j is the index of its last occurrence

[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be
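
Not from the slides: a small Python sketch of tf = j - i + 1 over the sorted suffix array. It scans linearly for clarity; a real implementation would binary-search, since all suffixes beginning with the ngram are adjacent in the sorted array.

corpus = "to be or not to be".split()
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])
def term_frequency(ngram):
    # Positions in the sorted array whose suffix starts with the ngram.
    hits = [pos for pos, i in enumerate(suffix_array)
            if corpus[i:i + len(ngram)] == list(ngram)]
    return (hits[-1] - hits[0] + 1) if hits else 0   # j - i + 1 from the slide
print(term_frequency(("to", "be")))   # 2
print(term_frequency(("be",)))        # 2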

Page 30: Measures of Association

Residual Inverse Document Frequency (RIDF), where tf is the term frequency, df is the document frequency, and D is the number of documents:

RIDF = -log( df / D ) + log( 1 - exp( -tf / D ) )

Compares the distribution of a term over documents to what would be expected for a random term.

Mutual Information (MI):

MI(xYz) = log( ( tf(xYz) * tf(Y) ) / ( tf(xY) * tf(Yz) ) )

Compares the frequency of the whole to the frequency of the parts.
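
Not from the slides: both formulas as a Python sketch. The natural log is assumed (the base is not stated), and the counts in the example calls are made up purely for illustration.

import math
def ridf(tf, df, D):
    # Observed IDF minus the IDF expected for a randomly (Poisson) distributed term.
    return -math.log(df / D) + math.log(1 - math.exp(-tf / D))
def mutual_information(tf_xYz, tf_Y, tf_xY, tf_Yz):
    # MI(xYz) = log( tf(xYz) * tf(Y) / (tf(xY) * tf(Yz)) ) for an ngram xYz with interior Y.
    return math.log((tf_xYz * tf_Y) / (tf_xY * tf_Yz))
print(ridf(tf=25, df=5, D=100))
print(mutual_information(tf_xYz=12, tf_Y=400, tf_xY=30, tf_Yz=40))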

Page 31: Present Work

• Calculated the MI and RIDF for the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX
  • Retrieved the respective text for each heading
• Calculated the RIDF and MI for each possible ngram with a term frequency greater than 10 in the data under each section
• Noticed that different multi-word terms appear for each of the different sections

Page 32: Conclusions

Ngram statistics can be applied directly and indirectly to various problems.

• Directly
  • Spelling correction
  • Compound word identification
  • Term extraction
  • Name identification
• Indirectly
  • Part of speech tagging
  • Information retrieval
  • Data mining

Page 33: Packages

Two statistical packages:
• Contingency table approach
  • Measures for bigrams: Log Likelihood, True Mutual Information, Chi Squared Test, Odds Ratio, Phi Coefficient, T-Score, and Dice Coefficient
  • Measures for trigrams: Log Likelihood and True Mutual Information
• Suffix array approach
  • Measures for all lengths of ngrams: Residual Inverse Document Frequency and Mutual Information