Small Data Classification for NLP
Small Data Classification for Natural Language Processing
Michael Thorne
Head of Data Science, CaliberMind
2 | ©2016 CaliberMind
Goals
• Intro
• What Makes NLP Different
• Solutions
• Questions
Obligatory Speaker Bio
Michael Thorne
Head of Data Science, CaliberMind
MS Data Science Program, GalvanizeU
B.S. Physics, Fordham University
NSA Analytic Lead
US Navy Digital Network Intelligence Analyst / Cryptolinguist
CaliberMind
• B2B marketing SaaS
• Persona modeling and personality insights
• Content matching across buyer journey for high-value, complex purchase decisions
• Our core competency is natural language processing
What’s So Special About NLP?
• Not random (Zipf’s Law)
• Huge feature space
• Subjective Criteria
Small Data NLP
Persona Status Quo
• Assumptive Personas
• Qualitative Criteria
• Subjective Labels
• Static Output
Starting Point
Demographics
Psychographics
Firmographics
Let’s Validate the Status Quo
CaliberMind’s Data Challenge
• We match the right message to the right person at the right time
• We operate at the upper limits of human-scale problems (hundreds to tens of thousands of documents)
• Our results weren't as accurate as we expected
Our Friend: The Central Limit Theorem
• This is the theorem that lets us assume our data is well behaved, provided we have enough of it
• Let’s look at a classic example, coin tosses
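The coin-toss example can be sketched in a few lines. This is a minimal simulation (the function and trial counts are illustrative, not from the slides): as the number of flips per trial grows, the distribution of trial means tightens around 0.5, which is the Central Limit Theorem at work.

```python
import random
import statistics

def trial_mean(n_flips):
    """Mean of n fair coin flips (1 = heads, 0 = tails)."""
    return sum(random.randint(0, 1) for _ in range(n_flips)) / n_flips

random.seed(0)
# With one flip per trial, the outcome is just 0 or 1.
# With many flips per trial, the trial means cluster tightly around 0.5.
means_10 = [trial_mean(10) for _ in range(2000)]
means_1000 = [trial_mean(1000) for _ in range(2000)]
print(statistics.stdev(means_10))    # wide spread, roughly 0.16
print(statistics.stdev(means_1000))  # about 10x narrower
```

The standard deviation of the trial mean shrinks like 1/sqrt(n), which is why "enough data" makes the data look well behaved.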
Coin Flip Distribution
[Plots omitted: the distribution after 1 trial vs. after 100 trials]
Example: K-Means
• K-means is a workhorse algorithm for unsupervised learning
• What assumptions do we make when we use k-means?
  • Spherical data
  • Same variance
  • Same prior probability
• It turns out NLP data has none of these properties
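To see where those assumptions enter, here is a minimal sketch of Lloyd's algorithm (the names, seed, and toy blobs are illustrative, not from the slides). The nearest-center assignment by plain Euclidean distance is exactly where the spherical, equal-variance, equal-prior assumptions sneak in; it works on well-behaved blobs like these and degrades on NLP feature spaces.

```python
import random

def kmeans(points, centers, iters=25):
    """Minimal Lloyd's algorithm on 2-D points (a sketch, not production code)."""
    centers = list(centers)
    for _ in range(iters):
        # Assignment: nearest center by squared Euclidean distance.
        # This step implicitly assumes spherical clusters of similar
        # variance and similar size.
        clusters = [[] for _ in centers]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[d.index(min(d))].append((x, y))
        # Update: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(x for x, _ in cl) / len(cl),
                              sum(y for _, y in cl) / len(cl))
    return centers

random.seed(42)
# Two well-behaved spherical blobs: the "happy" case.
blob_a = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(200)]
blob_b = [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(200)]
# Fixed initialization (one seed point per blob) for reproducibility.
found = kmeans(blob_a + blob_b, centers=[blob_a[0], blob_b[0]])
print(sorted(found))  # centers land near (0, 0) and (5, 5)
```

On spiky, high-dimensional, unevenly sized NLP clusters, the same assignment rule happily slices one elongated cluster in half.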
Happy K-Means vs. NLP K-Means
[Plots omitted: k-means on well-behaved spherical clusters vs. on NLP data]
But Wait, It Gets Better
• Our documents tend to be of vastly different sizes within the same corpus
• Unbalanced Classes
• Qualitative Criteria
• Unlabeled data
• Human labeling is time-intensive
Our Solution
Dimensionality Reduction
• Dimensionality was the first thing we tackled
• Manual dictionaries to collapse similar terms
  • mark = [‘growth hacker’, ‘marketer’, ‘demand gen’]
• LSA to remove low-information terms
• Automating the process using word2vec, DBpedia, and skip-gram similarities
• As we aggregate more data, we can do this more effectively
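The manual-dictionary step is the simplest of these to sketch. Assuming a lookup table like the slide's `mark` example (the extra variant and function name are hypothetical), many surface forms collapse into one canonical feature, directly shrinking the feature space:

```python
# Hypothetical synonym table in the spirit of the slide's example:
# several job-title variants collapse into the single token 'mark'.
SYNONYMS = {
    "growth hacker": "mark",
    "marketer": "mark",
    "demand gen": "mark",
    "growth ninja": "mark",  # assumed extra variant for illustration
}

def collapse_terms(phrases, table=SYNONYMS):
    """Map each phrase to its canonical term; unknown phrases pass through."""
    return [table.get(p.lower(), p.lower()) for p in phrases]

titles = ["Growth Hacker", "Marketer", "Sysops"]
print(collapse_terms(titles))  # ['mark', 'mark', 'sysops']
```

The automated version replaces the hand-built table with neighbors found via word2vec or skip-gram similarity, but the collapse step itself looks the same.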
Spiky Data
[Plot omitted]
Metrics Over Raw Scores
• Especially important when comparing data of different sizes
• The number of standard deviations from the mean works better than a raw similarity score
• Pick the best similarity measure for your data (with NLP, it’s not cosine)
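Normalizing to "standard deviations from the mean" is just a z-score over the corpus's raw scores. A minimal sketch (the sample scores are illustrative): after the transform, every corpus has mean 0 and standard deviation 1, so scores from differently sized corpora become comparable.

```python
import statistics

def z_scores(scores):
    """Express each raw similarity as standard deviations from the corpus mean."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [(s - mu) / sigma for s in scores]

# Raw scores from two differently sized corpora aren't directly
# comparable, but their z-scores within each corpus are.
raw = [0.45, 0.11, 0.71, 0.87, 0.41]
print([round(z, 2) for z in z_scores(raw)])
```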
Pretend We Have Labeled Data
• Rules-based scoring algorithm for a first pass
• Take a small subset of high-scoring people as exemplars
• Use a latent semantic analysis of these exemplars to make a template
• Compare remaining data rows against each exemplar cluster
• Assign each row to its highest-scoring exemplar cluster, broadening that cluster’s definition
• Continue until all data rows are assigned
• Any row with a similarity below a set threshold is labeled ‘Unknown’, which indicates additional underlying personas
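One pass of that loop can be sketched as follows. All names here are hypothetical, and a simple Jaccard token overlap stands in for the LSA comparison the slides actually use; the real pipeline also iterates, feeding newly labeled rows back into the exemplar clusters to broaden them.

```python
def jaccard(a, b):
    """Token-overlap similarity -- a stand-in for the LSA comparison."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def label_rows(rows, exemplars, threshold=0.3):
    """One pass: score each row against every exemplar cluster and take
    the best cluster's label, or 'Unknown' below the threshold."""
    labels = {}
    for row in rows:
        best_label, best_score = "Unknown", threshold
        for persona, docs in exemplars.items():
            score = max(jaccard(row, d) for d in docs)
            if score >= best_score:
                best_label, best_score = persona, score
        labels[row] = best_label
    return labels

exemplars = {"Value": ["VP Marketing"], "Security": ["Sysops"]}
rows = ["VP Marketing", "Marketing Analyst", "Growth Ninja"]
print(label_rows(rows, exemplars))
# 'VP Marketing' matches its exemplar exactly, 'Marketing Analyst'
# overlaps partially, and 'Growth Ninja' falls below the threshold.
```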
Round 1 (Rules)

| Name | Title | Similarity Score | Persona |
| --- | --- | --- | --- |
| Luke J | VP Marketing | 1.0 | Value |
| Randy P | Founder | | |
| Lucas M | Growth Ninja | | |
| Bec G | Tech Guru | | |
| Fiona F | Sysops | 1.0 | Security |
| Claude S | Growth Hacker | | |
| Art L | Data Analyst | | |
Round 2 (LSA)

| Name | Title | Similarity Score | Persona |
| --- | --- | --- | --- |
| Luke J | VP Marketing | 1.0 | Value |
| Randy P | Founder | 0.45 | |
| Lucas M | Growth Ninja | 0.11 | |
| Bec G | Tech Guru | 0.71 | Security |
| Fiona F | Sysops | 1.0 | Security |
| Claude S | Growth Hacker | 0.87 | Value |
| Art L | Data Analyst | 0.41 | |
Round 3 (LSA)

| Name | Title | Similarity Score | Persona |
| --- | --- | --- | --- |
| Luke J | VP Marketing | 1.0 | Value |
| Randy P | Founder | 0.68 | Security |
| Lucas M | Growth Ninja | 0.18 | |
| Bec G | Tech Guru | 0.86 | Security |
| Fiona F | Sysops | 1.0 | Security |
| Claude S | Growth Hacker | 0.89 | Value |
| Art L | Data Analyst | 0.72 | Security |
Round 4 (LSA)

| Name | Title | Similarity Score | Persona |
| --- | --- | --- | --- |
| Luke J | VP Marketing | 1.0 | Value |
| Randy P | Founder | 0.71 | Security |
| Lucas M | Growth Ninja | 0.16 | Unknown |
| Bec G | Tech Guru | 0.88 | Security |
| Fiona F | Sysops | 1.0 | Security |
| Claude S | Growth Hacker | 0.91 | Value |
| Art L | Data Analyst | 0.78 | Security |
Examples
[Example screenshots omitted]
Takeaways
• Human-generated data is never really random
• Small data models are hyper-sensitive
• Validate assumptions