![Page 1: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/1.jpg)
![Page 2: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/2.jpg)
Organizing Data: The Step Before Visualization
Nils C. Newman
Director New Business Development at Search Technology
& UNU-MERIT
Dr. Alan L. Porter
Director R&D at Search Technology
& Emeritus Professor, Georgia Tech
![Page 3: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/3.jpg)
The way it was…
• You would read information and filter the data through your mental framework, enabling discovery and synthesis
![Page 4: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/4.jpg)
The way it is now…
• Too much information for you to process readily by reading…
![Page 5: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/5.jpg)
Enter Text Analysis…
• If a computer can organize and present the data to you, then you can absorb more information faster than traditional reading
![Page 6: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/6.jpg)
The challenge…
• How can a computer look at a collection of information and turn those data into something organized - into a framework that you understand?
![Page 7: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/7.jpg)
Two main issues to consider…
• Do you want to impose order on the data?
• Do you want to let the data self-organize?
![Page 8: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/8.jpg)
The choice is important because it drives the math
[Diagram: a spectrum from "Impose Order" to "Self Organize", with methods grouped under "Roots in Statistics" and "Roots in Machine Learning". Methods shown: LSA, PCA, TM, SVM, NLP, AS/PI]
![Page 9: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/9.jpg)
Within Machine Learning – resources impact the decision…
Supervised training
• Requires time and effort by subject matter expert(s)
Unsupervised training
• Requires suitable quantities of training material
• Computationally expensive
![Page 10: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/10.jpg)
Within Statistics – data drive the decision…
Data Signal
• Requires data with sufficiently strong signal and relatively low noise
Data Homogeneity
• Requires that the records be sufficiently consistent (record to record)
![Page 11: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/11.jpg)
Data Quality can help you make the decision…
[Diagram: a Data Quality axis running from High Noise to High Signal, labeled in order: Supervised Machine Learning, Unsupervised Machine Learning, Statistics]
![Page 12: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/12.jpg)
But as with most things, it is never that easy…
![Page 13: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/13.jpg)
Reality is usually an engineered hybrid approach
[Diagram: Impose Order / Self Organize]
![Page 14: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/14.jpg)
But the hybrid approach adds complexity
• The hybrid elements make things somewhat confusing but provide capabilities to address issues:
  ◦ Known noise can be removed
  ◦ Signal can be amplified
  ◦ Steps can be hard-coded to reduce computational variability
• As tool developers, we often hide these tweaks to make tools look simpler than they actually are
![Page 15: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/15.jpg)
A hybrid example…
• A core analytical approach in VantagePoint is a modified version of Principal Components Analysis (PCA)
• We feed phrases created by a Natural Language Processing (NLP) algorithm into the PCA algorithm to self-organize data
• So we are already using a hybrid system
• However, a recently developed Topic Modeling (TM) algorithm looked like it would out-perform our PCA/NLP system
• So we devised a series of tests pitting our PCA/NLP against TM
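As a rough illustration of the PCA/NLP idea, and not VantagePoint's actual implementation, the sketch below takes a small record-by-phrase occurrence matrix and extracts the first principal component by power iteration. The phrases, the records, and the function name `first_principal_component` are all invented for the example.

```python
import math

def first_principal_component(rows, iterations=100):
    """Leading eigenvector of the covariance matrix of `rows` (power iteration)."""
    n, d = len(rows), len(rows[0])
    # Center each column (one column per phrase) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix, d x d.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Start from an uneven vector so it is unlikely to be orthogonal
    # to the leading eigenvector, then multiply and renormalize.
    vec = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical phrases (as an NLP step might produce) and records.
phrases = ["solar cell", "dye sensitizer", "gene expression", "protein folding"]
records = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]

loadings = first_principal_component(records)
for phrase, weight in zip(phrases, loadings):
    print(f"{phrase:16s} {weight:+.2f}")
```

Phrases that tend to appear in the same records load with the same sign on the component, which is what lets the data self-organize into groups. The overall sign of the component is arbitrary.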
![Page 16: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/16.jpg)
Round 1
• In round one, we compared our PCA/NLP approach to TM (Latent Dirichlet Allocation, LDA) by analyzing a set of ~4,000 Dye-Sensitized Solar Cell (DSSC) abstracts
• The LDA approach ran much faster, required less expertise to run, and gave reasonable results
• However, because LDA takes a “bag of words” approach, labeling the resulting clusters requires significant topical expertise
• The PCA/NLP approach required more expertise to run but the results gave clearer answers (and reasonable cluster labels)
• Judges’ Decision - Tie
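One way to see why a “bag of words” result is harder to label: word counts discard word order, so a multi-word phrase cannot be told apart from the same words scrambled. The sketch below uses adjacent word pairs as a crude stand-in for NLP phrases; it is not the NLP algorithm used in these tests, and the sample strings are invented.

```python
from collections import Counter

def bag_of_words(text):
    """Unordered word counts."""
    return Counter(text.lower().split())

def word_pairs(text):
    """Counts of adjacent word pairs, which preserve local word order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

phrase = "dye sensitized solar cell"
scrambled = "cell solar sensitized dye"

same_bag = bag_of_words(phrase) == bag_of_words(scrambled)
same_pairs = word_pairs(phrase) == word_pairs(scrambled)
print(same_bag, same_pairs)
```

The two strings have identical bags of words but different word pairs, so only the phrase-based view keeps a readable term such as "solar cell" available as a candidate label.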
![Page 17: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/17.jpg)
Round 2
• In round two, we compared our PCA/NLP approach to several different TM approaches by analyzing a mixed set containing searches on 7 different topics
• The results were judged on precision and recall
• One particular TM approach worked really well
• It out-performed our PCA approach and all other TM approaches
• Judges’ Decision – TM variant a winner!
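The slides do not say exactly how precision and recall were computed. A common reading, sketched here with invented record IDs, treats the records that truly belong to one of the seven topics as the relevant set and the records a method placed in the matching cluster as the retrieved set.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical record IDs: the topic's true records and one method's cluster.
topic_records = {1, 2, 3, 4, 5}
cluster_records = {1, 2, 3, 4, 9, 10, 11, 12}

precision, recall = precision_recall(cluster_records, topic_records)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the cluster recovers four of the five topic records (high recall) but half of its contents belong elsewhere (lower precision), which is the kind of trade-off such a comparison exposes.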
![Page 18: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/18.jpg)
Round 3
• In round three, we tested the round two winner by analyzing a set of search results on similar topics
• The results were encouraging but not as clear-cut as round two
• Judges’ Decision – TM variant still a winner!
![Page 19: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/19.jpg)
Round 4
• Not to be outdone by the TM team, our PCA team looked at the problem and decided that adding more tuning would be better than changing to TM
• They layered multiple “simple” techniques together to create a new more powerful PCA hybrid
• The super hybrid system includes up to 10 different steps embodied in a single process:
  ◦ Stopword removal
  ◦ Acronym identification
  ◦ Common word removal
  ◦ Term pruning
  ◦ Association rule based removal
  ◦ Term consolidation
  ◦ etc…
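A minimal sketch of how a few of these steps might be chained into a single process. The stopword list, the consolidation map, and the pruning threshold are invented for illustration; the slides do not describe how the actual steps are implemented.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in"}        # hypothetical stopword list
CONSOLIDATE = {"cells": "cell", "dyes": "dye"}     # hypothetical variant map

def clean_terms(records, min_records=2):
    """Apply stopword removal, term consolidation, then term pruning."""
    cleaned = []
    for text in records:
        # Drop stopwords, then map known variants onto one preferred form.
        terms = [CONSOLIDATE.get(w, w) for w in text.lower().split()
                 if w not in STOPWORDS]
        cleaned.append(terms)
    # Prune terms that appear in fewer than `min_records` records.
    doc_freq = Counter(t for terms in cleaned for t in set(terms))
    return [[t for t in terms if doc_freq[t] >= min_records]
            for terms in cleaned]

records = [
    "the efficiency of dye solar cells",
    "a solar cell and the dyes",
    "stability of the electrolyte in solar cells",
]
result = clean_terms(records)
print(result)
```

Each step is simple on its own, but the order matters: consolidating variants before pruning lets "dyes" and "dye" count as one term, so the term survives the threshold.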
![Page 20: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/20.jpg)
The result?
• The fight is still ongoing, but the improved PCA is looking to keep pace with TM while maintaining its dominance in cluster naming
• The VantagePoint “Cluster Suite + PCA” approach is certainly ahead in usability
• We have the next bout scheduled for later this year
Who is ready to Byte?
![Page 21: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/21.jpg)
Why tell you all this?
• I wanted to give you a little insight into how tool developers think
• The recent explosive growth in algorithms means that we have a lot of different approaches from which to choose
• The growth in computing power means we can operate at a scale unheard of a decade ago
• We are driven to make the tools more effective and easier to use
• However, doing so often makes tools more opaque to the user
![Page 22: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/22.jpg)
What does all this mean to you?
• There is no “one size fits all” when it comes to text analytics
• Analytical techniques still need to be matched to your data and your problems
• The state of the art is rapidly evolving
• You need to have a good sense of what is going on “under the hood” of the tools that you use
![Page 23: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/23.jpg)
Why bother?
• Understanding a little about how your tools work is critical BEFORE you confound the situation by adding visualization on top of the analysis
• Otherwise, you have to take it on faith that what we are doing suits your analytical situation
![Page 24: II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554fa023b4c90586258b491e/html5/thumbnails/24.jpg)
Questions?
Thank you!