knime user group meeting 2015 text mining workshop · •subset(s) of the large movie review...
TRANSCRIPT
![Page 1: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/1.jpg)
Copyright © 2015 KNIME.com AG
KNIME User Group Meeting 2015Text Mining Workshop
Kilian Thiel
@KNIME
![Page 2: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/2.jpg)
Copyright © 2015 KNIME.com AG
Agenda
2
• Introduction
• Sentiment Analysis
– Single word features
– Ngram features
– Dictionary based
![Page 3: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/3.jpg)
Copyright © 2015 KNIME.com AG
• The KNIME Website (www.knime.org)• Textprocessing (tech.knime.org/knime-text-processing)
• Forum (tech.knime.org/forum/knime-textprocessing)
• Blog for news, tips and tricks(www.knime.org/blog)
• KNIME TV channel on
• KNIME on @KNIME
Resources
3
![Page 4: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/4.jpg)
Copyright © 2015 KNIME.com AG 4
Introduction to KNIME Textprocessing
![Page 5: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/5.jpg)
Copyright © 2015 KNIME.com AG
Installation
5
1.) 2.)
![Page 6: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/6.jpg)
Copyright © 2015 KNIME.com AG
Tip
• Increase maximum memory for KNIME
• Edit knime.ini
– Add “-Xmx3G” as last line of knime.ini file
– Or change existing setting
• Useful additional extensions
– Palladian (community extension)• Web crawling, Text Mining
– XML-Processing (KNIME extension)• Parsing and processing of XML documents
6
![Page 7: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/7.jpg)
Copyright © 2015 KNIME.com AG
Philosophy
7
… perhaps your nameis
Rumpelstiltskin[Person] ? …
… perhaps your nameis
Rumpelstiltskin[Person] ? … Visualization
Cluster-ing
Classifi-cation
1 1 1 0 1 0 0 1 10 1 1 0 0 1 0 0 00 0 1 1 1 0 1 1 0
![Page 8: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/8.jpg)
Copyright © 2015 KNIME.com AG
Data Import
• Node Repository: KNIME Labs/Text Processing/IO
• Available Parser Nodes
– Flat File Document Parser
– PDF Parser
– Word Parser
– Document Grabber
– …
8
![Page 9: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/9.jpg)
Copyright © 2015 KNIME.com AG
Enrichment
• Node Repository:
KNIME Labs/Text Processing/Enrichment
• Available Tagger Nodes
– Standford tagger
– Dictionary (& Wildcard) tagger
– OpenNLP tagger
– Abner tagger
– …
9
![Page 10: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/10.jpg)
Copyright © 2015 KNIME.com AG
Preprocessing
• Node Repository:
KNIME Labs/Text Processing/Preprocessing
• Available Preprocessing Nodes
– Stop word Filter
– Snowball Stemmer
– General Tag Filter
– Case converter
– RegEx Filter
– …
10
![Page 11: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/11.jpg)
Copyright © 2015 KNIME.com AG
Transformation
• Node Repository:
KNIME Labs/Text Processing/Transformation
• Available Transformation Nodes
– BoW Creator
– Document Vector
– Strings to Document
– Sentence Extractor
– Document Data Extractor
– …
11
![Page 12: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/12.jpg)
Copyright © 2015 KNIME.com AG
Additional Data Types
• Document Cell
– Encapsulates a document• Title, sentences, terms, words
• Authors, category, source
• Generic meta data (key, value pairs)
• Term Cell
– Encapsulates a term• Words, tags
12
![Page 13: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/13.jpg)
Copyright © 2015 KNIME.com AG
Data Table Structures
• Document table– List of documents
• Bag of words– Tuples of documents
and terms
• Document vectors– Numerical
representations of documents
13
![Page 14: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/14.jpg)
Copyright © 2015 KNIME.com AG
Import Export Workflows
14
![Page 15: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/15.jpg)
Copyright © 2015 KNIME.com AG
Today‘s example
• Classification of free-text documents is a common task in the field of text mining.
• It is used to categorize documents, i.e. assigning pre-defined topics or for sentiment analysis.
• Today we construct a workflow that reads and preprocesses text documents, transforms them into a numerical representation and build a predictive model to assign pre-defined sentiment labels to documents.
15
![Page 16: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/16.jpg)
Copyright © 2015 KNIME.com AG
Today‘s example
• Subset(s) of the Large Movie Review Dataset v1.0
• The LMR Dataset v1.0 contains 50.000 English movie reviews with their associated sentiment label “positive” or “negative”
• http://ai.stanford.edu/~amaas/data/sentiment/
16
![Page 17: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/17.jpg)
Copyright © 2015 KNIME.com AG
Today‘s example
17
Goal:
• Build classifier to distinguish between positive andnegative movie reviews.
• „Absolute masterpiece of a film! Goodnight Mr.Tomhas swiftly become one of my favourite films of all time. Nobody should miss out on seeing this film, it's just too good! …“
Positive or negative movie review?
![Page 18: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/18.jpg)
Copyright © 2015 KNIME.com AG
Hands On
• Adjust Xmx setting in knime.ini
• Start KNIME
• Import Workflows
18
![Page 19: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/19.jpg)
Copyright © 2015 KNIME.com AG 19
Sentiment Analysis
![Page 20: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/20.jpg)
Copyright © 2015 KNIME.com AG
Classification (Recap)
20
Training
Set
Test
Set
Original
Data Set
Train
Model
Apply
Model
Score
Model
Data Partitioning Training and
Applying Models
Scoring
Strategies
![Page 21: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/21.jpg)
Copyright © 2015 KNIME.com AG
Classification (Recap)
• All data mining models use a Learner-Predictor motif.
• The Learner node trains the model with its input data.
• The Predictor node applies the model to a different subset of data.
21
Training set
Test set
Trained Model
![Page 22: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/22.jpg)
Copyright © 2015 KNIME.com AG
From Strings to Numbers
22
![Page 23: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/23.jpg)
Copyright © 2015 KNIME.com AG
Hands On
• Read movie review data
• Convert to documents
• Preprocess documents
• Filter features based on frequency
• Transform documents into numerical representation
• Train and score predictive model
• View decision tree view
23
![Page 24: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/24.jpg)
Copyright © 2015 KNIME.com AG
Decision Tree Model
24
![Page 25: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/25.jpg)
Copyright © 2015 KNIME.com AG
Negations
• What about negations?
– Not good
– Not bad
• Can not be handled with single word features
• Ngram features (1- and 2grams)
25
![Page 26: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/26.jpg)
Copyright © 2015 KNIME.com AG
NGram Creator
• Allows for creation of
– Character Ngrams
– Word Ngrams
• As
– Bag of words
– With frequencies
26
![Page 27: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/27.jpg)
Copyright © 2015 KNIME.com AG
NGram Features
27
![Page 28: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/28.jpg)
Copyright © 2015 KNIME.com AG
Hands On
• Read movie review data
• Convert to documents
• Preprocess documents
• Create 1- and 2-gram features
• Filter features based on frequency
• Transform documents into numerical representation
• Train and score predictive model
• View decision tree view
28
![Page 29: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/29.jpg)
Copyright © 2015 KNIME.com AG
Decision Tree Model
29
![Page 30: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/30.jpg)
Copyright © 2015 KNIME.com AG
Dictionaries
• Custom dictionaries for sentiment classification
– Dictionary Tagger
– Wildcard Tagger (regular expressions)
30
![Page 31: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/31.jpg)
Copyright © 2015 KNIME.com AG
MPQA Subjectivity Lexicon
• Dictionary with words and related subjectivity clues
– 2718 words with positive polarity
– 4901 words with negative polarity
• http://mpqa.cs.pitt.edu/lexicons/subj_lexicon
31
![Page 32: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/32.jpg)
Copyright © 2015 KNIME.com AG
Dictionary Based
32
![Page 33: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/33.jpg)
Copyright © 2015 KNIME.com AG
Hands On
• Read movie review data
• Convert to documents
• Tag documents
• Create pivot features
• Specify score and rule for label assignment
• Score prediction
33
![Page 34: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/34.jpg)
Copyright © 2015 KNIME.com AG
Appendix
• Palladian and XML nodes to extract move titles from IMDB web pages
• First dataset contains IMDB URLs of related movies
34
![Page 35: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/35.jpg)
Copyright © 2015 KNIME.com AG
Web Data Extraction
35
![Page 36: KNIME User Group Meeting 2015 Text Mining Workshop · •Subset(s) of the Large Movie Review Dataset v1.0 •The LMR Dataset v1.0 contains 50.000 English movie reviews with their](https://reader035.vdocuments.mx/reader035/viewer/2022062607/605a086c5f5e9a48e0750c08/html5/thumbnails/36.jpg)
Copyright © 2015 KNIME.com AG
Hands On
• Read movie review data
• Load webpage content
• Parse HTML
• Extract title field
36