April 22, 2004 1
Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Doerre, Peter Gerstl, Roland Seiffert
IBM Germany, August 1999
Presenter: Tyler Carr
Outline

- Motivation
- Methodology
- Feature Extraction
- Clustering and Categorizing
- Applications
- Exam Questions

Motivation

90% of a company’s data cannot be examined with standard data mining:

- Customer letters
- E-mail correspondence
- Phone call recordings
- Contracts
- Technical documentation
- Patents
- News articles
- Web pages

Value of Text Mining

- Rapid digestion of large document collections
- Faster than human knowledge brokers
- Objective and customizable analysis
- Automation of tasks

Typical Applications

- Summarizing documents
- Monitoring relations among people, places, and organizations
- Organizing documents by content
- Organizing indices for search and retrieval (keyword finding)
- Retrieving documents by content

Challenges in Text Mining

- Information is in unstructured textual form
- Natural Language (NL) interpretation is years away for computers
- Text Mining deals with huge collections of documents

Two Text Mining Approaches

- Knowledge Discovery
  - Extraction of codified information (features)
- Information Distillation
  - Analysis of the feature distribution

Comparison with Data Mining

- Data Mining
  - Identify data sets
  - Select features manually
  - Prepare data
  - Analyze distribution
- Text Mining
  - Identify documents
  - Extract features
  - Select features by algorithm
  - Prepare data
  - Analyze distribution

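
The text mining steps listed above can be sketched end to end. This is only an illustration, not how Intelligent Miner for Text works internally: it assumes lowercase word tokens as features and a document-frequency filter as the selection algorithm, and every function name and sample document is hypothetical.

```python
import re
from collections import Counter

def extract_features(text):
    """Extract features: here simply lowercase word tokens (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def select_features(docs_tokens, min_df=2, max_df_ratio=0.5):
    """Select features by algorithm: keep terms found in at least min_df
    documents but in no more than max_df_ratio of all documents."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t, c in df.items()
                  if c >= min_df and c / n_docs <= max_df_ratio)

def prepare_data(docs_tokens, vocabulary):
    """Prepare data: one sparse count vector (a dict) per document."""
    vocab = set(vocabulary)
    return [dict(Counter(t for t in tokens if t in vocab))
            for tokens in docs_tokens]

# Identify documents (hypothetical customer messages).
documents = [
    "The printer driver crashes after the update",
    "Update broke the printer driver again",
    "Please cancel my contract and refund the fee",
    "Contract renewal fee is too high",
]
tokens = [extract_features(d) for d in documents]
vocabulary = select_features(tokens)
vectors = prepare_data(tokens, vocabulary)

# Analyze distribution: how often each selected feature occurs overall.
distribution = Counter()
for vector in vectors:
    distribution.update(vector)
```

Storing each vector as a dict keeps only the nonzero entries, which suits the high-dimensional, sparsely populated feature vectors that text produces.
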
Feature Extraction

- “To recognize and classify significant vocabulary items in unrestricted natural language texts.”
- Classes of Vocabulary
  - Proper names
  - Technical phrases
  - Abbreviations and acronyms
  - …

Canonical Forms

- Numbers are converted to a normal form
  - Four ==> 4
- Dates are converted to a normal form
- Inflected forms are converted to a common form
  - Sings, Sang, Sung ==> Sing
- Alternative names are converted to an explicit form
  - Mr. Carr, Tyler, Presenter ==> Tyler Carr

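
A minimal sketch of canonicalization, using the slide's own examples. The lookup tables are tiny and hypothetical, and the choice of ISO `YYYY-MM-DD` as the normal form for dates is an assumption; the slides do not say which form the tool uses.

```python
import re

# Tiny illustrative lookup tables; a real system would need far larger ones.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
INFLECTED_FORMS = {"sings": "sing", "sang": "sing", "sung": "sing"}
NAME_VARIANTS = {"mr. carr": "Tyler Carr", "tyler": "Tyler Carr",
                 "presenter": "Tyler Carr"}
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def canonical_date(text):
    """Convert 'April 22, 2004' to '2004-04-22'; return None if no match."""
    match = re.fullmatch(r"([A-Za-z]+) (\d{1,2}), (\d{4})", text.strip())
    if not match or match.group(1).lower() not in MONTHS:
        return None
    month = MONTHS[match.group(1).lower()]
    return f"{match.group(3)}-{month:02d}-{int(match.group(2)):02d}"

def canonical_form(term):
    """Return the canonical form of a term, or the term itself if unknown."""
    key = term.strip().lower()
    for table in (NUMBER_WORDS, INFLECTED_FORMS, NAME_VARIANTS):
        if key in table:
            return table[key]
    return canonical_date(term) or term.strip()
```

For example, `canonical_form("Sang")` yields `"sing"` and `canonical_form("Mr. Carr")` yields `"Tyler Carr"`, so different surface forms count as the same feature.
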
Feature Extraction Tools

- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information
  - Part-of-speech information (subject, verb)
- Avoid analyzing too deeply (for speed)
  - Does not use huge amounts of lexical information
  - No in-depth syntactic and semantic analysis

Feature Extraction Example

- Disambiguating Proper Names (Nominator Program)
- Apply heuristics to strings, instead of interpreting semantics.
- The unit of context for extraction is a document.
- The heuristics represent English naming conventions.

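
To make the idea of string heuristics concrete, here is a small sketch that links name variants found within one document. It is not Nominator's actual rule set. It assumes two simple English naming conventions: a leading title such as "Mr." can be dropped, and a shorter mention refers to the longest mention in the same document that contains all of its words. The names `TITLES`, `strip_title`, and `resolve_names` are hypothetical.

```python
TITLES = {"mr.", "mrs.", "ms.", "dr.", "prof."}

def strip_title(mention):
    """Drop a leading title such as 'Mr.' and return the remaining words."""
    words = mention.split()
    if words and words[0].lower() in TITLES:
        words = words[1:]
    return words

def resolve_names(mentions):
    """Map each mention from ONE document to its fullest compatible form."""
    cores = {m: strip_title(m) for m in mentions}
    # Longest names first, so shorter variants attach to fuller forms.
    full_forms = sorted({" ".join(c) for c in cores.values() if c},
                        key=lambda name: (-len(name.split()), name))
    resolved = {}
    for mention, words in cores.items():
        resolved[mention] = next(
            (full for full in full_forms
             if words and set(words) <= set(full.split())),
            mention,  # fall back to the mention itself
        )
    return resolved

# Mentions assumed to have been found already in a single document.
mentions = ["Tyler Carr", "Mr. Carr", "Tyler", "Roland Seiffert", "Seiffert"]
resolved = resolve_names(mentions)
```

Only string operations are involved, with no semantic interpretation. A real system would also have to handle ambiguous cases, such as two different people sharing a surname in one document, which this sketch resolves arbitrarily.
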
Feature Extraction Goals

- Very fast processing to deal with huge amounts of data
- Domain independence for general applicability

Clustering

- Also called Knowledge Discovery
- Fully automatic process
- Partitions a given collection into groups of documents with similar content
- Clusters are identifiable by feature vectors
- Provides a set of keywords for each cluster

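
One simple way to derive keywords for a cluster from feature vectors is to sum the vectors of its member documents and keep the highest-weighted features. The sketch below assumes this approach and hypothetical data; the slides do not describe how the tool actually chooses its keywords.

```python
from collections import Counter, defaultdict

def cluster_keywords(vectors, labels, top_k=2):
    """Sum the feature vectors of each cluster and return the top_k
    highest-weighted features (ties broken alphabetically) as keywords."""
    totals = defaultdict(Counter)
    for vector, label in zip(vectors, labels):
        totals[label].update(vector)  # adds the counts feature by feature
    return {
        label: [term for term, _ in
                sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]
        for label, counts in totals.items()
    }

# Hypothetical sparse count vectors and their cluster labels.
vectors = [
    {"printer": 2, "driver": 1},
    {"printer": 1, "driver": 1, "crash": 1},
    {"invoice": 2, "refund": 1},
    {"refund": 2, "invoice": 1, "fee": 1},
]
labels = [0, 0, 1, 1]
keywords = cluster_keywords(vectors, labels)
```
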
Two Clustering Engines

- Hierarchical Clustering tool
  - Orders the clusters into a tree reflecting various levels of similarity.
- Binary Relational Clustering tool
  - Produces a flat clustering together with relationships of different strength between the clusters
  - Relationships reflect inter-cluster similarities

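
The tree-building idea behind hierarchical clustering can be illustrated with a generic agglomerative procedure: repeatedly merge the two most similar clusters until one tree remains. This is a textbook sketch, not the IBM tool's algorithm. Cosine similarity and merging by summed vectors are assumptions, and the sample vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def merge_vectors(a, b):
    """Add two sparse vectors feature by feature."""
    merged = dict(a)
    for t, w in b.items():
        merged[t] = merged.get(t, 0) + w
    return merged

def hierarchical_clustering(vectors):
    """Return a nested-tuple tree over document indices, or None if empty."""
    clusters = [(i, v) for i, v in enumerate(vectors)]  # (tree, summed vector)
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        _, i, j = max(
            ((cosine(clusters[i][1], clusters[j][1]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0],
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  merge_vectors(clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0] if clusters else None

vectors = [
    {"printer": 2, "driver": 1},
    {"printer": 1, "driver": 1, "crash": 1},
    {"invoice": 2, "refund": 1},
    {"refund": 2, "invoice": 1, "fee": 1},
]
tree = hierarchical_clustering(vectors)
```

Documents that merge early sit deep in the tree and are the most similar. Higher levels of the tree group clusters that share less vocabulary.
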
Clustering Model
Categorization

- Also called Information Distillation
- Topic Categorization Tool
  - Assigns documents to pre-existing categories (“topics” or “themes”)
  - Categories are chosen to match the intended use of the collection

Categorization

- Categories are defined by providing a set of sample documents for each category
- Training phase produces a special index, called the categorization schema
- Categorization tool returns a set of category names and confidence levels for each document

Categorization

- If confidence is below some threshold, the document is set aside for a human categorizer
- Tests have shown the Topic Categorization Tool agrees with human categorizers to the same degree as human categorizers agree with one another.

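
The train, categorize, and threshold steps described above can be sketched as follows. This is a simplified stand-in, not the Topic Categorization Tool: the "schema" here is just one summed word-count vector per category, confidence is assumed to be cosine similarity, and the threshold value, function names, and sample documents are all hypothetical.

```python
import math
import re
from collections import Counter

def features(text):
    """Word-count feature vector for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def train(samples):
    """Training phase: build a schema with one summed vector per category."""
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total.update(features(doc))
        schema[category] = total
    return schema

def categorize(schema, text, threshold=0.3):
    """Return (category, confidence) pairs, best first, and a flag saying
    whether the document should be set aside for a human categorizer."""
    vector = features(text)
    ranked = sorted(((category, cosine(vector, centroid))
                     for category, centroid in schema.items()),
                    key=lambda pair: -pair[1])
    needs_human = not ranked or ranked[0][1] < threshold
    return ranked, needs_human

samples = {
    "billing": ["refund the invoice fee", "invoice charged twice refund please"],
    "technical": ["printer driver crashes", "driver update broke the printer"],
}
schema = train(samples)
ranked, needs_human = categorize(schema, "my printer driver stopped working")
unclear_ranked, unclear_needs_human = categorize(schema, "where is the nearest store")
```

The first message shares vocabulary with the technical samples and is categorized automatically. The second matches neither category well, so it falls below the threshold and is routed to a person.
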
Categorization Model
IBM Intelligent Miner for Text

- Software Development Kit (not a full application)
- Contains necessary components for “real text mining”
- Also contains more traditional components:
  - IBM Text Search Engine
  - IBM Web Crawler
  - Drop-in Intranet search solutions

Applications

- Customer Relationship Management application provided by IBM Intelligent Miner for Text, called Customer Relationship Intelligence (CRI)
  - “Help companies better understand what their customers want and what they think about the company itself.”

Customer Intelligence Process

- Take the body of communications with customers as input.
- Cluster the documents to identify issues.
- Characterize the clusters to identify the conditions for problems.
- Assign new messages to the appropriate clusters.

Customer Intelligence Usage

- Knowledge Discovery
  - Clustering used to create a structure that can be interpreted
- Information Distillation
  - Refinement and extension of clustering results
    - Interpreting the results
    - Tuning of the clustering process
    - Selecting meaningful clusters

Exam Question #1

- Name an example of each of the two main classes of applications of text mining.
  - Knowledge Discovery: Discovering a common customer complaint in a large volume of feedback.
  - Information Distillation: Filtering future comments into pre-defined categories.

Exam Question #2

- How does the procedure for text mining differ from the procedure for data mining?
  - Adds feature extraction function
  - Not feasible to have humans select features
  - Highly dimensional, sparsely populated feature vectors

Exam Question #3

- In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
  - Does not perform in-depth syntactic or semantic analysis of texts

Thank You
Any Questions?