Text Processing with KNIME
Copyright © 2014 KNIME.com AG
Boston KNIME Users
Text Processing Applications
Kilian Thiel
KNIME
Agenda
• KNIME Crash Course
• Text Mining with KNIME: Mining Tripadvisor Data
• Text Mining with KNIME: Mining Amazon Reviews (Anil Tarachandani)
• Networking Apero
Text Mining with KNIME: Mining Tripadvisor Data
Agenda
• The KNIME Textprocessing Extension
– Preliminaries
– Philosophy & Usage
• Classification of Tripadvisor Reviews
– Tripadvisor data
– Classification of reviews
Resources
http://tech.knime.org/knime-text-processing
• Documentation
• Examples
• Forum
• White Papers
Installation
Requirements
Requirements to import and run demo workflows
• KNIME 2.10
• Textprocessing (labs)
• Distance Matrix (KNIME)
• Palladian (Community)
Tips
• Settings (knime.ini)
– Set maximum memory for KNIME
– -Xmx3G
Demo
Prepare KNIME
• Go to KNIME directory
• Change knime.ini file (optional)
– -Xmx3G
• Start KNIME
• Install Textprocessing Extension
– (or better have it already installed)
Philosophy
[Figure: text is enriched with tags (example: "… perhaps your name is Rumpelstiltskin[Person]? …"), converted to numerical vectors, and then used for visualization, clustering, and classification.]
Additional Data Types
• Document Cell
– Encapsulates a document
   • Title, sentences, terms, words
   • Authors, category, source
   • Generic meta data (key, value pairs)
• Term Cell
– Encapsulates a term
   • Words, tags
Data Table Structures
• Document table
– List of documents
• Bag of words
– Tuples of documents and terms
• Document vectors
– Numerical representations of documents
Philosophy and Data Table Structures
[Figure: the processing stages Enrichment and Preprocessing, with the table structures passed along the pipeline: documents, bag of words (BoW), document vectors.]
Tripadvisor Data
[Figure: a Tripadvisor review with the fields Title, Author, Rating, and Fulltext marked.]
Tripadvisor Data
Reviews of Italian and Chinese restaurants in Boston
• Chinese: 272
• Italian: 268
Tripadvisor Data
Goal:
• Build a classifier to distinguish between Chinese and Italian restaurants, based on their reviews.
Is a review about an Italian or a Chinese restaurant?
1.) Reading
Read/Parse textual data
Demo
Reading
• Read Tripadvisor data (.table file)
• Filter rows with missing restaurant value
• Convert strings to documents
• Filter all but the document column
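The reading steps above are done with KNIME nodes in the demo. As a rough analogy only, they can be sketched in plain Python. The `Document` class, the column names (`title`, `author`, `fulltext`, `restaurant`) and the sample rows are assumptions made for illustration; they are not KNIME's API or the actual Tripadvisor table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a document cell: text plus meta data."""
    title: str
    text: str
    author: str
    category: str


def rows_to_documents(rows):
    """Drop rows without a restaurant value and wrap the rest as documents."""
    documents = []
    for row in rows:
        if not row.get("restaurant"):  # filter rows with a missing restaurant value
            continue
        documents.append(
            Document(
                title=row["title"],
                text=row["fulltext"],
                author=row["author"],
                category=row["restaurant"],
            )
        )
    return documents  # only the document "column" is kept


# Invented sample rows, not real Tripadvisor data.
rows = [
    {"title": "Great pasta", "author": "anna", "rating": 5,
     "fulltext": "The lasagna was excellent.", "restaurant": "italian"},
    {"title": "No label", "author": "bob", "rating": 3,
     "fulltext": "It was fine.", "restaurant": None},
    {"title": "Tasty dumplings", "author": "carl", "rating": 4,
     "fulltext": "We loved the dumplings.", "restaurant": "chinese"},
]

documents = rows_to_documents(rows)
print(len(documents), "documents kept")
```

The restaurant type is stored as the document category so that it can be extracted again as the class value later on.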
2.) Enrichment
Enrich documents with semantic information
Demo
Enrichment / Tagging
• Apply POS Tagger node
• Use Bag of Words node to inspect tagging result
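To illustrate what tagging followed by a bag of words produces, here is a toy Python sketch. The lookup lexicon is a hypothetical stand-in for a real part-of-speech tagger, which would use a trained model; the tags follow the Penn Treebank style, and the two sample reviews are invented.

```python
import re

# Hypothetical mini lexicon; a real POS tagger uses a trained model instead.
LEXICON = {
    "the": "DT", "pizza": "NN", "was": "VBD", "delicious": "JJ",
    "we": "PRP", "loved": "VBD", "dumplings": "NNS",
}


def tag_document(text):
    """Split text into word and punctuation tokens and attach a tag to each."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [(token, LEXICON.get(token.lower(), "UNKNOWN")) for token in tokens]


def bag_of_words(documents):
    """One row per distinct (document, term, tag) combination."""
    rows = []
    for title, text in documents:
        seen = set()
        for term in tag_document(text):
            if term not in seen:
                seen.add(term)
                rows.append((title, term[0], term[1]))
    return rows


# Invented sample reviews.
documents = [
    ("Review 1", "The pizza was delicious."),
    ("Review 2", "We loved the dumplings!"),
]

for row in bag_of_words(documents):
    print(row)
```

Listing the terms per document this way makes it easy to check which tag each word received.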
3.) Preprocessing
Preprocess documents and filter words
Demo
Preprocessing
• Filter
– Numbers
– Punctuation marks
– Stop Words
• Convert to lower case
• Stemming
• Keep only nouns, verbs, adjectives
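The preprocessing chain above can be sketched in Python as a sequence of filters over tagged terms. This is only an illustration under stated assumptions: the stop word list is a tiny invented sample, `crude_stem` is a rough suffix stripper standing in for a real stemmer, and the tags are assumed to be Penn Treebank style.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "was", "is", "of", "to"}  # tiny illustrative list
KEEP_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives (Penn Treebank style)


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the filters in the order listed on the slide."""
    result = []
    for word, tag in tagged_terms:
        if re.fullmatch(r"[\d.,]+", word):        # filter numbers
            continue
        if not any(ch.isalnum() for ch in word):  # filter punctuation marks
            continue
        if word.lower() in STOP_WORDS:            # filter stop words
            continue
        stem = crude_stem(word.lower())           # lower case, then stem
        if not tag.startswith(KEEP_TAG_PREFIXES):  # keep nouns, verbs, adjectives
            continue
        result.append((stem, tag))
    return result


# Invented, already tagged sample sentence.
tagged = [
    ("The", "DT"), ("waiters", "NNS"), ("served", "VBD"), ("12", "CD"),
    ("amazing", "JJ"), ("dumplings", "NNS"), ("!", "."),
]

print(preprocess(tagged))
```

Each filter removes terms that carry little information about the restaurant type, which keeps the later document vectors small.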
4.) Transformation
Creation of numerical representation of documents
Demo
Transformation
• Transform to bag of words
• Compute TF value for terms
• Transform to document vectors
• Extract category (class) value
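A minimal Python sketch of the transformation steps, assuming relative term frequency (occurrences of a term divided by the number of terms in the document). Whether the demo uses relative or absolute TF is not stated in the slides, so that choice is an assumption, as are the sample documents.

```python
from collections import Counter


def term_frequencies(terms):
    """Relative TF: occurrences of a term divided by the document length."""
    counts = Counter(terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def document_vectors(docs):
    """Turn (category, terms) pairs into one numerical vector per document.

    Returns the vocabulary (the vector columns) and a list of
    (vector, category) rows; the category is the class value.
    """
    vocabulary = sorted({term for _, terms in docs for term in terms})
    rows = []
    for category, terms in docs:
        tf = term_frequencies(terms)
        rows.append(([tf.get(term, 0.0) for term in vocabulary], category))
    return vocabulary, rows


# Invented, already preprocessed sample documents.
docs = [
    ("italian", ["pasta", "pizza", "pasta"]),
    ("chinese", ["dumpling", "noodle"]),
]

vocabulary, rows = document_vectors(docs)
print(vocabulary)
for vector, category in rows:
    print(category, vector)
```

Every document ends up with the same columns, one per term in the vocabulary, which is the shape a classifier needs.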
5.) Classification
Training of a model (decision tree) and scoring
Demo
Classification
• Append color based on class
• Partition data into training and test set
• Train decision tree model on training data
• Apply decision tree model on test data
• Score model, measure accuracy
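The partition, train, apply, and score steps can be sketched with a small hand-written decision tree in Python. This is a toy illustration, not the learner used in the demo: the Gini criterion, the depth limit, the 75/25 split, and the two-column document vectors are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Find the (feature, threshold) pair with the lowest weighted Gini impurity."""
    best, best_score = None, gini(y)
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [label for row, label in zip(X, y) if row[feature] <= threshold]
            right = [label for row, label in zip(X, y) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def train(X, y, depth=3):
    """Grow a tree; inner nodes are tuples, leaves are class labels."""
    split = best_split(X, y) if depth > 0 else None
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    feature, threshold = split
    left = [i for i, row in enumerate(X) if row[feature] <= threshold]
    right = [i for i, row in enumerate(X) if row[feature] > threshold]
    return (feature, threshold,
            train([X[i] for i in left], [y[i] for i in left], depth - 1),
            train([X[i] for i in right], [y[i] for i in right], depth - 1))


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree


# Invented document vectors: [TF of "pasta", TF of "dumpling"], plus the class.
data = [
    ([0.5, 0.0], "italian"), ([0.3, 0.0], "italian"),
    ([0.4, 0.0], "italian"), ([0.2, 0.0], "italian"),
    ([0.0, 0.6], "chinese"), ([0.0, 0.2], "chinese"),
    ([0.0, 0.5], "chinese"), ([0.0, 0.3], "chinese"),
]

# Partition into training and test set.
random.Random(42).shuffle(data)
cut = int(0.75 * len(data))
training, test = data[:cut], data[cut:]

# Train on the training set, apply to the test set, measure accuracy.
tree = train([x for x, _ in training], [label for _, label in training])
predictions = [predict(tree, x) for x, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(f"accuracy on {len(test)} held-out reviews: {accuracy:.2f}")
```

Scoring on reviews the model has not seen during training is what makes the accuracy figure meaningful.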
Additional Workflows
• Multi Word Tagging
– Detection of frequent Ngrams
– Creation of dictionary from Ngrams
– Applying Dictionary Tagger
• Classification with Multi Words
• Clustering of documents
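The multi word tagging idea (find frequent n-grams, turn them into a dictionary, then tag matches) can be sketched in Python as follows. The function names, the `MULTIWORD` tag, the frequency threshold, and the sample token lists are hypothetical and only serve the illustration.

```python
from collections import Counter


def frequent_ngrams(token_lists, n=2, min_count=2):
    """Count n-grams over all documents and keep the frequent ones as a dictionary."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {ngram for ngram, count in counts.items() if count >= min_count}


def tag_multiwords(tokens, dictionary, n=2, tag="MULTIWORD"):
    """Merge dictionary n-grams into single tagged terms; other words stay untagged."""
    result, i = [], 0
    while i < len(tokens):
        candidate = tuple(tokens[i:i + n])
        if candidate in dictionary:
            result.append((" ".join(candidate), tag))
            i += n
        else:
            result.append((tokens[i], None))
            i += 1
    return result


# Invented, already tokenized sample reviews.
docs = [
    ["great", "dim", "sum", "here"],
    ["dim", "sum", "was", "cold"],
    ["great", "service"],
]

dictionary = frequent_ngrams(docs)
for tokens in docs:
    print(tag_multiwords(tokens, dictionary))
```

Treating a frequent word pair as one term lets it act as a single feature in classification with multi words.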
Thank You
Questions
• http://tech.knime.org/forum
Follow us
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172
• KNIME Blog: http://www.knime.org/blog