
Page 1:

Natural Language Processing (NLP)

Pradnya Nimkar, ACAS, MAAA

Disclaimer: This presentation is going to be…..wordy!

Page 2:

NLP is everywhere!

Page 3:

Business cases in insurance:

● Lots of unstructured data
  ○ Data that does not follow a predefined pattern
  ○ Accident descriptions, injury descriptions, claim notes, doctor and nurse notes, policy terms, etc.
● Claim triage models
  ○ Analyze claim notes, accident descriptions, and injury descriptions to identify large losses early on
● Risk management practices
  ○ Identify and label areas that need attention
● Fraud models
  ○ Analyze settlement notes and claim notes to identify fraudulent claims
● Underwriting / policy management
  ○ Avoid costly mistakes by pointing underwriters to inconsistencies in tailor-made wordings
● Claims management
  ○ Analyze claims/complaints and direct them to the appropriate claim adjuster
  ○ Speed up the decision-making process by matching claim notes with existing claims

Page 4:

Why should an actuary care about NLP?

ASOP 38: Using Models Outside the Actuary’s Area of Expertise (Property and Casualty)

Page 5:

What is Natural Language Processing (NLP)?

● How to program computers to process and analyze large amounts of data centered around human language
● The focus is to capture the syntactic and semantic meaning of natural language

History of NLP:

● Dates back to the 1950s
● 1950-1980:
  ○ Handwritten rules, lots of if...then... statements
  ○ Hard to maintain
● 1980-now:
  ○ Corpus-based / statistical methods
● Now-future:
  ○ Deep learning methods + statistical methods

Page 6:

Typical workflow with unstructured data:

Page 7:

Preprocessing - some NLP terminology

Corpus (a collection of texts: paragraphs, papers, books). Here, three claim notes:

● WORKER slipped while carrying groceries. Worker fractured his elbow
● worker developed carpal tunnel from repetitive typing
● worker got traumatized from NLP presentation

Vocabulary (the unique list of words observed):

NLP, WORKER, carpal, carrying, developed, elbow, fractured, from, got, groceries, his, presentation, repetitive, slipped, traumatized, tunnel, typing, while, worker
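For illustration, a rough sketch of building a vocabulary from a corpus with naive whitespace tokenization (a real tokenizer would handle punctuation and casing more carefully):

```python
corpus = [
    "WORKER slipped while carrying groceries. Worker fractured his elbow",
    "worker developed carpal tunnel from repetitive typing",
    "worker got traumatized from NLP presentation",
]

# Split each note on whitespace, strip trailing periods, and collect the unique tokens
vocabulary = sorted({token.strip(".") for note in corpus for token in note.split()})
print(vocabulary)   # roughly the word list above (plus "Worker" as its own token here)
```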

Page 8:

Preprocessing - Casing (lowercasing)

Original:    WORKER slipped while carrying groceries. Worker fractured his elbow
Lowercased:  worker slipped while carrying groceries. worker fractured his elbow

Original:    worker developed carpal tunnel from repetitive typing
Lowercased:  worker developed carpal tunnel from repetitive typing

Original:    worker got traumatized from NLP presentation
Lowercased:  worker got traumatized from nlp presentation

Page 9:

Preprocessing - Lemmatization (reduce each word to its canonical form)

Original:    WORKER slipped while carrying groceries. Worker fractured his elbow
Lemmatized:  worker slip while carry grocery. worker fracture his elbow

Original:    worker developed carpal tunnel from repetitive typing
Lemmatized:  worker develop carpal tunnel from repetitive typing

Original:    worker got traumatized from NLP presentation
Lemmatized:  worker get traumatize from nlp presentation

(The individual words produced at this stage are called tokens.)
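A minimal lemmatization sketch using spaCy (assumed here for illustration; the slides do not name the library, the en_core_web_sm model must be installed, and lemmas can differ slightly between models):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

note = "WORKER slipped while carrying groceries. Worker fractured his elbow"
doc = nlp(note.lower())

# token.lemma_ is the canonical (dictionary) form of each token
print(" ".join(token.lemma_ for token in doc))
# e.g. "worker slip while carry grocery . worker fracture his elbow"
```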

Page 10:

Preprocessing - Stemming (removing affixes to get the stem of each word)

Original:  WORKER slipped while carrying groceries. Worker fractured his elbow
Stemmed:   worker slip while carri groceri. worker fractur hi elbow

Original:  worker developed carpal tunnel from repetitive typing
Stemmed:   worker develop carpal tunnel from repetit type

Original:  worker got traumatized from NLP presentation
Stemmed:   worker got traumat from nlp present
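A minimal stemming sketch with NLTK's PorterStemmer (assumed here for illustration; the slides do not name the stemmer, and Snowball or other stemmers produce slightly different stems):

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize   # requires nltk.download("punkt")

stemmer = PorterStemmer()
note = "worker developed carpal tunnel from repetitive typing"

# Stems need not be real words (e.g. "repetit")
stems = [stemmer.stem(token) for token in word_tokenize(note)]
print(" ".join(stems))
# "worker develop carpal tunnel from repetit type"
```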

Page 11:

Other preprocessing steps to be considered:

● Part-of-speech tagging
● Remove stop words such as a, an, the, in, etc.
● Remove special characters
● Expand contractions
● Deal with abbreviations and misspellings

Main take-away: strike a balance between simplification and retention of language nuance, encoding as much information as possible in the most compact way possible.

Page 12:

Time to go to space... Vector Space

● Two words: word vectors!
● Core idea: map the text to mathematical entities (vectors)
● Vector space models (VSMs) are the most common models in NLP
  ○ They translate raw text into vectors
  ○ There are many!
● Popular VSMs
  ○ Sparse representations (do not reduce the vector space):
    ■ Counts (term frequency)
    ■ Absence or presence of a word (one-hot encoding)
    ■ TF-IDF (term frequency - inverse document frequency)
  ○ Dense representations (reduce the space):
    ■ LSI / LDA (dimensionality reduction)
    ■ Word embeddings, e.g. word2vec (neural net), GloVe
    ■ Sentence and document embeddings, e.g. doc2vec, SkipThought

Page 13:

Vector Space Model I: Counts or Term Frequency

● Count the number of times each word occurs
● The order of words does not matter
● Hence the term bag of words (BOW)

Count matrix (documents x terms):

          carpal   carry   develop  elbow   worker
note_1    0        1       0        1       2
note_2    1        0       1        0       1
note_3    0        0       0        0       1

(Figure: the notes plotted on the worker / carpal / develop axes; note_1 maps to [2, 0, 0] and note_2 to [1, 1, 1] in that three-word subspace.)
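A minimal bag-of-words sketch with scikit-learn's CountVectorizer, assuming the lemmatized notes from the preprocessing slides as input:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

notes = [
    "worker slip while carry grocery worker fracture his elbow",   # note_1
    "worker develop carpal tunnel from repetitive typing",         # note_2
    "worker get traumatize from nlp presentation",                 # note_3
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(notes)          # sparse document-term count matrix

counts = pd.DataFrame(X.toarray(),
                      columns=vectorizer.get_feature_names_out(),
                      index=["note_1", "note_2", "note_3"])
print(counts[["carpal", "carry", "develop", "elbow", "worker"]])
# note_1 row: carpal=0, carry=1, develop=0, elbow=1, worker=2
```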

Page 14:

Vector Space Model II: Binary or One-Hot Encoding

● Zipf’s law for word distributions
  ○ Word counts follow a long-tailed distribution
● Record the presence or absence of a word
  ○ 1 = the term occurs at least once
  ○ 0 = the word does not occur
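The same vectorizer can produce the binary (presence/absence) representation; a minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

notes = [
    "worker slip while carry grocery worker fracture his elbow",
    "worker develop carpal tunnel from repetitive typing",
    "worker get traumatize from nlp presentation",
]

# binary=True records only presence/absence, so a repeated word counts once
binary_vectorizer = CountVectorizer(binary=True)
B = binary_vectorizer.fit_transform(notes)
print(B.toarray())   # all entries are 0/1; e.g. "worker" in note_1 is 1, not 2
```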

Page 15:

VSM III: Term Frequency - Inverse Document Frequency (TF-IDF)

● Calculates the importance of a term for a particular document:

  tfidf(t, d) = tf(t, d) * idf(t)

  ○ tf(t, d) is greater when the term is frequent in that particular document
  ○ idf(t) is greater when the term is rare across ALL the documents (the corpus)

● Different weighting schemes exist for the idf part; the most common is logarithmic:

  idf(t) = log( total number of documents / number of documents in which that term appears )

Page 16:

VSM III: TF-IDF Python implementation example

Example calculation: tf(note_1, worker) = 2, N = 3 documents, df(worker) = 3; with the smoothed logarithmic idf and L2 row normalization used by scikit-learn's TfidfVectorizer (computed over the full vocabulary of the notes), the entry for worker in note_1 works out to 0.407.

TF-IDF matrix:

          carpal   carry    develop  elbow    worker
note_1    0        0.3451   0        0.3451   0.407
note_2    0.4107   0        0.4107   0        0.2425
note_3    0        0        0        0        0.2660

Count model (for comparison):

          carpal   carry    develop  elbow    worker
note_1    0        1        0        1        2
note_2    1        0        1        0        1
note_3    0        0        0        0        1
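A minimal sketch that closely reproduces the numbers above with scikit-learn's TfidfVectorizer defaults (smoothed logarithmic idf, L2 row normalization), assuming the lemmatized notes from the preprocessing slides as input:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

notes = [
    "worker slip while carry grocery worker fracture his elbow",   # note_1
    "worker develop carpal tunnel from repetitive typing",         # note_2
    "worker get traumatize from nlp presentation",                 # note_3
]

vectorizer = TfidfVectorizer()               # defaults: smooth_idf=True, norm="l2"
X = vectorizer.fit_transform(notes)

tfidf = pd.DataFrame(X.toarray(),
                     columns=vectorizer.get_feature_names_out(),
                     index=["note_1", "note_2", "note_3"])
print(tfidf[["carpal", "carry", "develop", "elbow", "worker"]].round(4))
# values match the TF-IDF table above, up to rounding in the last digit
```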

Page 17:

VSM III: Other Considerations in TF-IDF

Min df: removes highly infrequent termsmin_df = 0.10 => ignore the terms that occur in less than 1% of the documentsmin_df = 3 => ignore the terms that occur in less than 3 documents

Max df: removes terms that occur too frequentlymax_df = 0.5 => ignore the terms that occur in more than 50% of the documentsmax_df = 5 => ignore the terms that occur in more than 5 documents

Ngrams: continuous sequence of wordstries to captures the context of the sentencebi-grams and tri-grams are common

Bi-grams example: worker developed carpal tunnel from repetitive typing => (worker developed, developed carpal, carpal tunnel, …..repetitive typing)
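A minimal sketch of these options with scikit-learn's TfidfVectorizer (the parameter values here are illustrative, not taken from the slides):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

notes = [
    "worker slip while carry grocery worker fracture his elbow",
    "worker develop carpal tunnel from repetitive typing",
    "worker get traumatize from nlp presentation",
]

vectorizer = TfidfVectorizer(
    min_df=1,             # drop terms appearing in fewer than 1 document (a no-op here)
    max_df=0.9,           # drop terms appearing in more than 90% of documents ("worker")
    ngram_range=(1, 2),   # keep unigrams and bi-grams
)
X = vectorizer.fit_transform(notes)
print(vectorizer.get_feature_names_out())   # includes bi-grams such as "carpal tunnel"
```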

Page 18:

● Advantages:
  ○ Simple but surprisingly effective
  ○ Quick
  ○ Interpretable

● Disadvantages:
  ○ Assume all words are independent or equidistant, which is not the case in the real world
  ○ Very sparse representation (sparse is bad because there are few examples to learn from)

Page 19:

Cosine Similarity

● Any text can be represented as a vector in a V-dimensional vector space
● Cosine similarity is used to measure the similarity between two vectors:
  ○ It measures the cosine of the angle between the two vectors: cos(A, B) = (A · B) / (||A|| ||B||)
  ○ The cosine is bounded by [-1, 1]: 1 being similar, 0 being dissimilar, and -1 being opposite
● Basic fraud model: rank the other claim notes by cosine similarity with respect to the note of a known fraudulent claim

(Figure: claim notes plotted in vector space; one note is associated with a known fraudulent claim, and the claim whose note lies closest to it, shown in red, is flagged for investigation.)
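A minimal sketch of ranking claim notes by cosine similarity against the note of a known fraudulent claim (the fraud note here is hypothetical, for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = [
    "worker slip while carry grocery worker fracture his elbow",
    "worker develop carpal tunnel from repetitive typing",
    "worker get traumatize from nlp presentation",
]
fraud_note = "worker fracture elbow carry grocery"   # hypothetical known-fraud note

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(notes + [fraud_note])

fraud_vec = X[len(notes)]                                   # last row = the fraud note
sims = cosine_similarity(fraud_vec, X[:len(notes)]).ravel()

# Rank the other notes from most to least similar; the top ones get investigated first
order = sims.argsort()[::-1]
for i in order:
    print(f"note_{i + 1}: cosine similarity = {sims[i]:.3f}")
```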

Page 20:

Curse of Dimensionality

● As dimensionality increases, the volume of the space increases so fast that the available data become sparse
● Matrix view
  ○ Sparse: lots of zero values
  ○ The zeros do not provide any additional information
  ○ Arithmetic operations take a lot of time
  ○ The matrix takes up a lot of memory
● Distance calculations
  ○ In a high-dimensional vector space distances grow large, and when a measure such as Euclidean distance is defined using many coordinates there is little difference in the distances between different pairs of samples
● Answer: reduce the dimensions (dense representations)

Page 21:

Dense Representation:

● Use matrix factorization
  ○ Singular value decomposition (Latent Semantic Indexing)
  ○ Non-negative matrix factorization
● Use probabilistic inference
  ○ Bayesian inference / Latent Dirichlet Allocation
● Use a neural network approach
  ○ word2vec (Google)
  ○ GloVe
  ○ fastText (Facebook)
  ○ BlazingText (Amazon)
  ○ Train your own!

Page 22:

Topic Modeling:

Q) Which topics best represent the information in these documents?

● Assumptions:
  ○ Each topic consists of a collection of words
  ○ Each document consists of a mixture of topics
● Uses:
  ○ Unsupervised learning algorithms
  ○ But the topics can also be an input to other, supervised algorithms
  ○ Labels the clusters

Page 23:

Latent Semantic Analysis/Indexing:

● Performs matrix factorization on the document-term matrix
  ○ The factorization is done with singular value decomposition (SVD)
  ○ The term-document matrix is the earlier tf-idf matrix, transposed!
● Singular value decomposition, truncated to k topics:

  A (m x n term-document matrix, m terms by n documents) ≈ U (m x k, term-to-topic weights) * Σ (k x k) * V^T (k x n, the document space)

(Figure: the m x n tf-idf term-document matrix factored into the three smaller k-truncated matrices above.)

Page 24:

LSI parameter

K: the number of dimensions to reduce to

● Depends on the data size
  ○ Old standard: 300
  ○ New standard: 500-1000

Page 25:

LSI Example with k = 2:

tf-idf matrix (terms x documents):

          note_1   note_2   note_3
carpal    0        0.4107   0
carry     0.3451   0        0
develop   0        0.4107   0
elbow     0.3451   0        0
worker    0.407    0.2425   0.266

Word assignment to topics (U):

          topic_1  topic_2
carpal    0.222    -0.17
carry     0.153     0.311
develop   0.222    -0.17
elbow     0.15      0.311
worker    0.358    -0.23

Topic importance (singular values):

          topic_1  topic_2
topic_1   1.12     0
topic_2   0        0.96

Topic distribution across documents (V^T):

          note_1   note_2   note_3
topic_1   0.497    0.607    0.618
topic_2   0.865   -0.398   -0.304

Original space => reduced 2-dimensional space:
note_1 ([0, 0.3451, 0, 0.3451, ...]) ===> note_1 ([0.497, 0.865])
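A minimal LSI sketch with scikit-learn's TruncatedSVD applied to the tf-idf matrix (signs, scaling, and exact values depend on the SVD implementation and on the vocabulary used, so they will not match the slide's numbers exactly):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

notes = [
    "worker slip while carry grocery worker fracture his elbow",
    "worker develop carpal tunnel from repetitive typing",
    "worker get traumatize from nlp presentation",
]

X = TfidfVectorizer().fit_transform(notes)   # documents x terms

svd = TruncatedSVD(n_components=2)           # keep k = 2 latent topics
doc_topics = svd.fit_transform(X)            # each note as a 2-dimensional vector
print(doc_topics)                            # reduced document space (scaled by the singular values)

print(svd.singular_values_)                  # topic importance
print(svd.components_)                       # term-to-topic weights (one row per topic)
```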

Page 26:

Non-Negative Matrix Factorization (NNMF):

● Another matrix factorization method!
● Decomposes the document-term matrix into 2 matrices, instead of 3
● Main advantage over SVD:
  ○ The elements of both factor matrices are non-negative
  ○ The input matrix already has non-negative elements
● Weakness:
  ○ The factorization is not unique
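A minimal sketch with scikit-learn's NMF on the same tf-idf matrix (the number of components is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

notes = [
    "worker slip while carry grocery worker fracture his elbow",
    "worker develop carpal tunnel from repetitive typing",
    "worker get traumatize from nlp presentation",
]

X = TfidfVectorizer().fit_transform(notes)   # non-negative document-term matrix

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)    # documents x topics, all entries >= 0
H = nmf.components_         # topics x terms,  all entries >= 0
print(W)
print(H)
```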

Page 27:

Topic Modeling I: Latent Dirichlet Allocation

● Developed in 2003

Assumptions:

● There are k latent topics according to which the documents are generated
● A distribution of words for each topic
  ○ Each topic is represented by a set of terms
  ○ The model gives the probability of the topics each word belongs to
  ○ The same word can appear in multiple topics
● A mixture of topics within each document

Page 28:

Topic Modeling I: Latent Dirichlet Allocation (example output)

Topic 1: '0.038*"injury" + 0.027*"neck" + 0.024*"whiplash" + 0.017*"sti" + 0.015*"strain" + 0.011*"spin" + 0.010*"cerv" + 0.010*"low" + 0.009*"whiplash injury" + …'

Topic 2: '0.019*"anxy" + 0.013*"disord" + 0.012*"depress" + 0.009*"ptsd" + 0.007*"stress" + 0.007*"adjust" + 0.006*"adjust disord" + 0.006*"traum" + 0.006*"post" + 0.005*"shock" + …'

Topic 3: '0.017*"bru" + 0.015*"rt" + 0.012*"left" + 0.010*"lt" + 0.010*"injury" + 0.009*"abras" + 0.009*"lac" + 0.011*"kne" + 0.007*"cut" + 0.006*"fall" + …'

Topic 4: '0.014*"rt" + 0.013*"kne" + 0.009*"left" + 0.009*"fract" + 0.007*"right" + 0.006*"ankl" + 0.005*"lt" + 0.005*"dist" + 0.005*"tib" + 0.004*"foot" + …'

Document level (a stemmed claim note):

"ip suff whiplash injury and has return to work whiplash injury of the neck musculoliga strain of the back sti of the l should adjust disord w depress mood anxy aggrav pre ex deg chang up low spin whiplash injury of the neck muscul liga strain of the back soft tissu injury of left should"

Topic mixture for this document:

Topic 1   Topic 2   Topic 3   Topic 4
0.411     0.316     0.153     0.120
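The topic strings above are in the format printed by gensim; a minimal sketch of training such a model (the toy corpus and parameter values are illustrative, not the presenter's actual pipeline):

```python
from gensim import corpora
from gensim.models import LdaModel

# Pre-tokenized (e.g. stemmed) claim notes; a real model needs far more data
texts = [
    ["whiplash", "injury", "neck", "strain"],
    ["anxy", "depress", "stress", "disord"],
    ["kne", "fract", "ankl", "left"],
]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]     # bag-of-words counts per document

lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=10, random_state=0)
print(lda.print_topics())     # strings like '0.12*"whiplash" + 0.10*"neck" + ...'
print(lda[bow_corpus[0]])     # topic mixture for the first document
```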

Page 29:

Drawbacks of topic models:

● Sensitive to pre-processing
● Training takes considerably longer and uses more memory

Page 30:

The three families of representations covered: statistical word counts => topic models / grouping words => word embeddings

Page 31:

Dense Representation

● In 2013, a team at Google led by Tomas Mikolov created word2vec
● Motivated by Harris's distributional hypothesis (1954): the intuition that similar words appear in similar contexts, so we know words by the "neighbors they keep"
● Other dense word embedding variants include GloVe, matrix factorization methods, and fastText

Page 32:

Page 33:

Dense Representation Cont.

● Similar words end up close together in the space, and you can do vector operations that "make sense" (e.g. king - man + woman ≈ queen)
● Can capture synonyms, misspellings, etc., and transfer learning can be applied
● Drawbacks: a single vector per word even though meaning is context dependent (one meaning dominates), and the approach is relatively data hungry

(Figure: word vectors for fall/fell and burn/burnt; the verb-tense relationship shows up as a roughly constant offset between each pair.)

Page 34:

Modeling Architecture

Example sentence: "worker slipped on water"

● CBOW (continuous bag of words): predict the center word ("slipped") from its surrounding context words ("worker", "on", "water")
● Skip-gram: predict each of the context words ("worker", "on", "water") from the center word ("slipped")
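A minimal sketch of training word vectors with gensim's Word2Vec (the toy corpus and parameters are illustrative; the sg flag switches between the two architectures above):

```python
from gensim.models import Word2Vec

# Tokenized claim notes; a real model needs far more text than this toy corpus
sentences = [
    ["worker", "slipped", "on", "water"],
    ["worker", "fractured", "his", "elbow"],
    ["worker", "developed", "carpal", "tunnel"],
]

# sg=0 -> CBOW (predict the center word from its context); sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["worker"][:5])                    # first 5 dimensions of the "worker" vector
print(model.wv.most_similar("worker", topn=3))   # nearest neighbours in the vector space
```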

Page 35:

FastText - Dealing with words not in Vocabulary
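fastText represents each word through its character n-grams, so it can compose a vector even for a word it never saw during training (misspellings, rare terms). A minimal gensim sketch (toy corpus, illustrative parameters):

```python
from gensim.models import FastText

sentences = [
    ["worker", "slipped", "on", "water"],
    ["worker", "fractured", "his", "elbow"],
]

# min_n / max_n control the character n-gram sizes used to build word vectors
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=5, epochs=50)

# "fraktured" is not in the training vocabulary, but fastText still composes a
# vector for it from its character n-grams
print(model.wv["fraktured"][:5])
print(model.wv.similarity("fractured", "fraktured"))
```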

Page 36:

Dense Representation Cont.

● You might wonder: OK, we now have vectors for each word… how does that work for sentences? Paragraphs?
● There are many answers (it is its own topic in research and practice):
● Average the word vectors, concatenate them, or use sentence embeddings / document embeddings; a sketch of the averaging baseline follows
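A common baseline, sketched here, is to average a sentence's word vectors (model is assumed to be a trained word2vec/fastText model such as the one in the earlier sketch):

```python
import numpy as np

def sentence_vector(tokens, model):
    """Average the word vectors of the tokens that are in the model's vocabulary."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

# e.g. sent_vec = sentence_vector(["worker", "slipped", "on", "water"], model)
```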

Page 37:

Recap / Summary

● We saw different ways to turn words into numbers: counts, groups, embeddings

● Simplicity + speed vs. complexity + cost
● Implicit: everything is data dependent ("Conservation of Garbage": garbage in, garbage out)

Page 38:

Additional resources:

● Code for generating some of the demonstrated VSMs:
  https://github.com/pradnya-nimkar/CABA-presentation/blob/master/CABA%20NLP%20presentation-May31.ipynb

● LDA paper (Blei, Ng, Jordan):
  https://ai.stanford.edu/~ang/papers/nips01-lda.pdf

● word2vec paper (Mikolov et al.):
  https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

● Stanford NLP group:
  https://nlp.stanford.edu/

Page 39:

Thank you!