text mining - university of iowa
Text Mining
Joseph Engler
What is Text Mining
Text Mining is the discovery by computer of new, previously unknown information, by
automatically extracting information from different written resources.
- Marti Hearst, UC Berkeley
Document Gathering
• Text Databases
  − United States Patent and Trademark Office
  − Pacific Union College Nelson Memorial Library
• Document Repositories
  − e-Law Document Repository
  − FAO Corporate Document Repository
• World Wide Web
Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in fact relevant to the query.
• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.
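$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$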
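The two measures can be computed directly from sets of document IDs. The following is a minimal sketch; the function name and the document IDs are made up for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, chosen only for illustration.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved.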
Text Retrieval Methods
• Boolean Retrieval Model
  − Document is represented by a set of key words
  − User provides a Boolean expression of key words
    • “innovation” AND “process”
    • “text mining” BUT NOT “boring”
• Document Ranking
  − Uses the query to rank all documents in order of relevance
  − Google’s PageRank algorithm is an extension of this model
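Both retrieval styles can be sketched with plain Python sets. The sample documents, the function names, and the ranking score (a raw count of query-term occurrences) are assumptions for illustration. The phrase “text mining” is simplified to "both words present", and PageRank itself is not shown.

```python
def build_index(docs):
    """Map each key word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def rank(docs, query):
    """Order document IDs by how often the query terms occur (highest first)."""
    terms = query.lower().split()
    scores = {
        doc_id: sum(text.lower().split().count(t) for t in terms)
        for doc_id, text in docs.items()
    }
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical documents, for illustration only.
docs = {
    "a": "innovation in the design process",
    "b": "process control",
    "c": "boring text mining lecture",
    "d": "text mining for innovation",
}
index = build_index(docs)

# "innovation" AND "process" becomes a set intersection.
both = index.get("innovation", set()) & index.get("process", set())

# "text mining" BUT NOT "boring" becomes an intersection followed by a difference.
text_mining = index.get("text", set()) & index.get("mining", set())
not_boring = text_mining - index.get("boring", set())

print(both, not_boring, rank(docs, "text mining innovation"))
```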
Tokenization of Text
• Preprocessing step in text mining
  − Stop Word Removal
    • “the”, “and”, “for”
  − Word Stemming
    • Innovation, Innovate, Innovative
    • “innova”
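A toy version of this preprocessing step is sketched below. The stop-word set uses the three examples from the slide. The suffix list is a hypothetical stand-in picked only to reproduce the “innova” example; it is not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"the", "and", "for"}   # the examples given on the slide
SUFFIXES = ("tion", "tive", "te")    # hypothetical toy suffix list


def stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into alphabetic words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


tokens = tokenize("Innovation, Innovate, and Innovative for the win")
print(tokens)
```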
Model the Document for IR
• Term Frequency Matrix
  − Measures the count of term i in document k
• Inverse Document Frequency
  − Represents the importance of a term t.
  − If a term t is frequent in many documents, its importance is scaled down.
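The sketch below builds a small term frequency matrix and an inverse document frequency for each term. The slides do not give a formula, so the IDF used here, log(N / number of documents containing the term), is an assumption; it is one common variant. The sample documents are made up.

```python
import math
from collections import Counter


def term_frequency_matrix(docs):
    """Return (terms, matrix) where matrix[i][k] is the count of term i in document k."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def inverse_document_frequency(matrix):
    """IDF per term, assuming the variant log(N / document frequency)."""
    n_docs = len(matrix[0])
    return [math.log(n_docs / sum(1 for x in row if x > 0)) for row in matrix]


# Hypothetical documents, for illustration only.
docs = ["text mining text", "mining data", "text data mining"]
terms, matrix = term_frequency_matrix(docs)
idf = inverse_document_frequency(matrix)
for term, row, weight in zip(terms, matrix, idf):
    print(term, row, round(weight, 3))
```

A term such as "mining" that occurs in every document gets an IDF of zero, which is the scaling-down described above.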
Text Mining With Statistica
Statistica Document Retrieval
Statistica Web Crawling
• Non-Focused
• Can filter file types
• Can specify domain to constrain
• Can specify depth of crawl
• Can specify max number of items in crawling tree
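How these crawl options interact can be sketched as a breadth-first crawl. This is not Statistica's implementation: the parameter names are invented, and a hypothetical in-memory link graph (LINKS) stands in for real HTTP fetches so that the example runs offline. The crawl is read as "non-focused" in the sense that page content never influences which links are followed.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for fetched pages.
LINKS = {
    "http://example.org/index.html": [
        "http://example.org/a.html",
        "http://example.org/report.pdf",
        "http://other.net/x.html",
    ],
    "http://example.org/a.html": ["http://example.org/b.html"],
    "http://example.org/b.html": ["http://example.org/c.html"],
}


def crawl(start, domain, extensions, max_depth, max_items):
    """Breadth-first crawl constrained by domain, file type, depth and item count."""
    seen, found = {start}, []
    queue = deque([(start, 0)])
    while queue and len(found) < max_items:
        url, depth = queue.popleft()
        found.append(url)
        if depth == max_depth:
            continue  # do not follow links below the depth limit
        for link in LINKS.get(url, []):
            parsed = urlparse(link)
            if link in seen or parsed.netloc != domain:
                continue  # already queued, or outside the allowed domain
            if not parsed.path.endswith(extensions):
                continue  # file type filtered out
            seen.add(link)
            queue.append((link, depth + 1))
    return found


pages = crawl("http://example.org/index.html", "example.org", (".html",),
              max_depth=2, max_items=10)
print(pages)
```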
Statistica Text Retrieval
Load Documents to Retrieve Text From
Statistica Text Retrieval Cont.
• Allows the number of words to be retrieved to be set (Advanced Tab)
• Allows for multiple filters (Filters Tab)
  − Minimum Word Size
  − Max Word Size
  − Minimum % of files in which word occurs
• Allows for custom lists of stop words and inclusion words (Index Tab)
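The effect of these settings can be approximated in a few lines. The function and parameter names below are hypothetical and do not correspond to Statistica option names, and inclusion-word lists are left out to keep the sketch short.

```python
from collections import Counter


def filter_words(docs, min_size=3, max_size=20, min_pct_files=50.0,
                 stop_words=(), max_words=None):
    """Keep words that pass the size, stop-word and document-percentage filters."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(w for words in tokenized for w in set(words))  # files containing w
    totals = Counter(w for words in tokenized for w in words)         # total occurrences
    kept = [
        w for w in totals
        if min_size <= len(w) <= max_size
        and w not in stop_words
        and 100.0 * doc_freq[w] / len(docs) >= min_pct_files
    ]
    kept.sort(key=lambda w: (-totals[w], w))  # most frequent first, ties alphabetical
    return kept[:max_words]                   # max_words=None keeps everything


# Hypothetical documents, for illustration only.
docs = ["the text mining demo", "text mining of the patents", "a crawler"]
words = filter_words(docs, min_size=4, min_pct_files=50.0, stop_words={"the"})
print(words)
```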
Statistica Text Mining Results
Summary of Word Occurrences
Singular Value Decomposition
Ordered Word Importance after performing SVD in Statistica
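The slides do not say how Statistica derives word importance from the SVD. As a rough illustration only, the sketch below finds the largest singular value of a small term-by-document count matrix by power iteration and orders the words by the size of their loading on that first component. Both the scoring rule and the count matrix are assumptions.

```python
import math


def top_singular_triplet(a, iterations=200):
    """Power iteration for the largest singular value of a (a list of rows)."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0] * n_cols  # start vector; adequate for non-negative count matrices
    for _ in range(iterations):
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = A v
        w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v


# Hypothetical counts: rows are terms, columns are documents.
terms = ["data", "mining", "text"]
counts = [[0, 1, 1], [1, 1, 1], [2, 0, 1]]
sigma, word_loadings, _ = top_singular_triplet(counts)
ranked = sorted(zip(terms, word_loadings), key=lambda pair: -abs(pair[1]))
for term, loading in ranked:
    print(f"{term}: {abs(loading):.3f}")
```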
Statistica Document Clustering
Clustering Cont.
Select either K-Means or EM Clustering and set the number of clusters desired.
Clustering Cont.
Select variables based upon SVD. The variables are continuous in type.
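For readers without Statistica, the K-Means option can be sketched in plain Python on continuous inputs such as SVD scores. The points below are invented two-dimensional scores, the function names are hypothetical, and EM clustering is not shown.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centers, error), where error is the total squared distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
    error = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, error


# Hypothetical SVD scores for six documents.
scores = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centers, error = kmeans(scores, k=2)
print(labels, round(error, 4))
```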
Clustering Results
Training Error is used to specify the number of clusters.
Choose Number of Clusters
Inflection point indicates the point at which little gain occurs when increasing cluster count.
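In the slides this point is read off the chart by eye. One simple way to approximate it numerically, offered here as an assumption rather than as what Statistica does, is to pick the cluster count where the improvement in training error drops most sharply (the largest second difference). The error values below are invented.

```python
def elbow(errors):
    """errors[i] is the training error for i + 1 clusters; return the k at the sharpest bend."""
    # For each interior k, compare the gain from reaching k with the gain from going past it.
    bends = [
        (errors[i - 1] - errors[i]) - (errors[i] - errors[i + 1])
        for i in range(1, len(errors) - 1)
    ]
    return bends.index(max(bends)) + 2  # bends[0] corresponds to k = 2


# Hypothetical training errors for k = 1..6.
training_errors = [100.0, 60.0, 25.0, 22.0, 20.0, 19.0]
best_k = elbow(training_errors)
print(best_k)
```

With these values the error falls steeply up to three clusters and barely changes afterwards, so the sketch selects k = 3.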
Cluster Results Cont.
Save this Worksheet for ANN Classification
Document Classification
• Utilize Artificial Neural Network
  − Output Variable is the Final Cluster Membership
  − Input Variables are those selected to create the clusters
  − Train on 80% of the data
  − Test on 20% of the data
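The 80/20 split can be sketched as a shuffled partition of the saved worksheet rows. The row layout (input scores plus a cluster label) and the function name are assumptions; the neural network itself is left to Statistica.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into (train, test) by the given fraction."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    cut = int(round(len(rows) * train_fraction))
    return rows[:cut], rows[cut:]


# Hypothetical worksheet rows: (input variables, final cluster membership).
worksheet = [((0.1 * i, 0.2 * i), i % 3) for i in range(10)]
train, test = train_test_split(worksheet)
print(len(train), len(test))
```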
Statistica ANN
Statistica ANN Cont.
Set training and testing sample sizes.
One can also include a validation set if desired.
Statistica ANN Results
Click on Predictions to see how we did.
Statistica Results Cont.
Points in red represent misclassifications
Text Mining With Statistica
• Demo