text mining - university of iowa


Text Mining

Joseph Engler

What is Text Mining

Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.

- Marti Hearst, UC Berkeley


Document Gathering

• Text Databases
  − United States Patent and Trademark Office
  − Pacific Union College Nelson Memorial Library
• Document Repositories
  − e-Law Document Repository
  − FAO Corporate Document Repository

• World Wide Web

Basic Measures for Text Retrieval

• Precision: the percentage of retrieved documents that are in fact relevant to the query.

• Recall: the percentage of documents relevant to the query that were in fact retrieved.

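$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$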
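As a minimal sketch of the two measures above, the following treats the relevant and retrieved documents as Python sets. The function name `precision_recall` and the document IDs are hypothetical, chosen only for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 relevant documents exist, 5 were retrieved, 3 overlap.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d4", "d7", "d9"}
precision, recall = precision_recall(relevant_docs, retrieved_docs)
print(precision, recall)
```

Here 3 of the 5 retrieved documents are relevant (precision 3/5) and 3 of the 4 relevant documents were retrieved (recall 3/4).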


Text Retrieval Methods

• Boolean Retrieval Model
  − Document is represented by a set of key words
  − User provides a Boolean expression of key words
    • “innovation” AND “process”
    • “text mining” BUT NOT “boring”
• Document Ranking
  − Uses the query to rank all documents in order of relevance
  − Google’s PageRank algorithm is an extension of this model
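The Boolean retrieval model can be sketched with set operations, assuming each document is stored as a set of key words. The document IDs, key words, and the helper `matching` are hypothetical. The sketch covers only the Boolean model, not document ranking.

```python
# Each document is represented by its set of key words (hypothetical data).
documents = {
    "doc1": {"innovation", "process", "patent"},
    "doc2": {"innovation", "market"},
    "doc3": {"text mining", "clustering"},
    "doc4": {"text mining", "boring"},
}


def matching(keyword):
    """Return the IDs of documents whose key word set contains `keyword`."""
    return {doc_id for doc_id, words in documents.items() if keyword in words}


# "innovation" AND "process" maps to set intersection.
and_result = matching("innovation") & matching("process")

# "text mining" BUT NOT "boring" maps to set difference.
not_result = matching("text mining") - matching("boring")

print(sorted(and_result), sorted(not_result))
```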

Tokenization of Text

• Preprocessing step in text mining
  − Stop Word Removal
    • “the”, “and”, “for”
  − Word Stemming
    • Innovation, Innovate, Innovative
    • “innova”
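The two preprocessing steps above might look like the following sketch. The stop word set holds only the slide's three examples, and `stem` is a deliberately crude suffix stripper whose suffix list is an assumption chosen to reproduce the “innova” example. It is not the Porter algorithm or any other published stemmer.

```python
import re

STOP_WORDS = {"the", "and", "for"}  # the slide's examples; real lists are much longer
SUFFIXES = ("tion", "tive", "te")   # assumption: just enough to cover the slide's example


def stem(word):
    """Crude suffix stripping for illustration; NOT the Porter algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split into alphabetic tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The innovation and the innovative process for innovate")
print(tokens)
```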


Model the Document for IR

• Term Frequency Matrix
  − Measures the count of term i in document k
• Inverse Document Frequency
  − Represents the importance of a term t.
  − If a term t is frequent in many documents, its importance is scaled down.
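A small sketch of both quantities follows. The slide does not give an IDF formula, so the common form log(N / df), where N is the number of documents and df the number of documents containing the term, is an assumption here. The three toy documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
]

vocab = sorted({term for doc in docs for term in doc})
counts = [Counter(doc) for doc in docs]

# Term frequency matrix: tf[i][k] is the count of term i in document k.
tf = [[c[term] for c in counts] for term in vocab]

# Inverse document frequency, assuming idf(t) = log(N / df(t)).
n_docs = len(docs)
idf = {term: math.log(n_docs / sum(1 for c in counts if term in c)) for term in vocab}

# Weighted matrix: terms that occur in many documents are scaled down.
tfidf = [[tf[i][k] * idf[term] for k in range(n_docs)] for i, term in enumerate(vocab)]

for term, row in zip(vocab, tfidf):
    print(term, [round(x, 3) for x in row])
```

With this data, “data” and “retrieval” each occur in one document and receive a higher IDF than “text” and “mining”, which each occur in two.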

Text Mining With Statistica


Statistica Document Retrieval

Statistica Web Crawling

• Non-Focused

• Can filter file types

• Can specify domain to constrain

• Can specify depth of crawl

• Can specify max number of items in crawling tree


Statistica Text Retrieval

Load Documents to Retrieve Text From

Statistica Text Retrieval Cont.

• Allows the number of words to be retrieved to be set (Advanced Tab)

• Allows for multiple filters (Filters Tab)
  − Minimum Word Size
  − Max Word Size
  − Minimum % of files in which word occurs
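To make the three filters concrete outside of Statistica, here is a rough Python sketch. The function `filter_words`, its parameter names, and its default values are hypothetical and do not correspond to Statistica's own settings or API.

```python
def filter_words(docs, min_len=3, max_len=25, min_doc_pct=50.0):
    """docs is a list of token lists. Return the sorted words that pass every filter."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        word
        for word, df in doc_freq.items()
        if min_len <= len(word) <= max_len and 100.0 * df / n_docs >= min_doc_pct
    )


# Hypothetical corpus of three tokenized files.
docs = [
    ["text", "mining", "of", "patents"],
    ["text", "retrieval"],
    ["mining", "of", "data"],
]
kept = filter_words(docs)
print(kept)
```

In this corpus “of” is dropped for being too short, and “patents”, “retrieval”, and “data” are dropped because each appears in only one of the three files.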

• Allows for custom lists of stop words and inclusion words (Index Tab)


Statistica Text Mining Results

Summary of Word Occurrences


Singular Value Decomposition

Ordered word importance after performing SVD in Statistica

Statistica Document Clustering


Clustering Cont.

Select either K-Means or EM Clustering and set the number of clusters desired.

Clustering Cont.

Select variables based upon SVD. The variables are continuous in type.


Clustering Results

Training Error is used to specify the number of clusters.

Choose Number of Clusters

The inflection point (the "elbow" of the error curve) indicates the point at which little gain occurs when increasing cluster count.
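One simple way to locate such a point numerically is to look for the largest second difference in the error curve. This heuristic is an assumption for illustration, not necessarily how Statistica selects the cluster count, and the error values below are hypothetical.

```python
def elbow(errors):
    """errors[i] is the training error with i + 1 clusters.

    Return the cluster count at the sharpest bend of the curve.
    """
    if len(errors) < 3:
        raise ValueError("need errors for at least three cluster counts")
    # Second difference: how sharply the improvement drops off at each interior point.
    bends = [
        errors[i - 1] - 2 * errors[i] + errors[i + 1]
        for i in range(1, len(errors) - 1)
    ]
    best = max(range(len(bends)), key=bends.__getitem__)
    return best + 2  # bends[0] belongs to errors[1], i.e. 2 clusters


# Hypothetical training errors for 1 to 6 clusters.
errors = [100.0, 60.0, 30.0, 27.0, 25.0, 24.0]
best_k = elbow(errors)
print(best_k)
```

For these values the error falls steeply up to 3 clusters and only slightly afterwards, so the sketch reports 3.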


Cluster Results Cont.

Save this Worksheet for ANN Classification

Document Classification

• Utilize Artificial Neural Network
  − Output Variable is the Final Cluster Membership
  − Input Variables are those selected to create the clusters
  − Train on 80% of the data
  − Test on 20% of the data
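The 80/20 partition can be sketched as follows, assuming each row pairs a document's input variables with its final cluster membership. The function `split_80_20`, the fixed seed, and the placeholder rows are hypothetical. Statistica performs this sampling itself.

```python
import random


def split_80_20(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) in an 80/20 split."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    cut = len(rows) * 4 // 5
    return rows[:cut], rows[cut:]


# Hypothetical rows: (document ID, final cluster membership).
rows = [(f"doc{i}", i % 3) for i in range(50)]
train, test = split_80_20(rows)
print(len(train), len(test))
```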


Statistica ANN

Statistica ANN Cont.

Set training and testing sample sizes.

One can also include a validation set if desired.


Statistica ANN Results

Click on Predictions to see how we did.

Statistica Results Cont.

Points in red represent misclassifications


Text Mining With Statistica

• Demo