Copyrighted material John Tullis

IBM Intelligent Miner for Text

John Tullis
DePaul Instructor
john.d.tullis@us.arthurandersen.com

IBM Intelligent Miner for Text
• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions

NetQuestion Solution

Text Analysis

Text Search Engine

Web Crawler Package

Intelligent Miner for Text

For companies of any size and for different industries: Media, Petroleum, Banking, Education, Government, Insurance

Potential Applications
• Customer complaints analysis
• Newswire analysis
• Intelligent Website
• Opinion survey classification
• Competitive intelligence
• Corporate image analysis

Intelligent Miner for Text: Platforms supported

Columns: (1) Text Analysis Tools, (2) Text Search Engine Server, (3) Text Search Engine Client, (4) Text Search Engine Java GUI / JavaBeans, (5) Web Crawler Package, (6) NetQ Solution

Platform                    (1)  (2)  (3)  (4)  (5)  (6)
AIX 4.3                     Y    Y    Y    Y    Y    Y
Solaris 2.5.1               Y    Y    Y    Y    Y    Y
Win NT 4.0 SP3              Y    Y    Y    Y    Y    Y
OS/390 V2R4, V2R5, V2R6     Y    Y    Y    Y    Y    Y

Reference Customers & Success Stories
• FinanceWise (search engine for financial content on the Internet): www.financewise.com
• IBM web sites (incl. 2000 IBM intranet sites): www.ibm.com
• Sueddeutsche Zeitung (classified ads on the Sueddeutsche Zeitung Web site): www.sueddeutsche.de
• SearchCafe (Business Partner): www.search-cafe.com
• Success stories available at www.software.ibm.com/iminer/fortext

Component: Text Analysis Tools

Functionality
• Language Identification
• Clustering of document collection
  - hierarchical clustering
  - relational clustering
• Categorization/Classification of document collection
• Feature Extraction
• Summarization

Text Analysis Tools

To automate tasks previously done manually:
• automatically identifies the language of a document
• automatically groups related documents based on their content, without requiring predefined classes
• automatically assigns documents to one or more user-defined categories
• automatically recognizes significant items in text, such as names, technical terms, and abbreviations
• automatically extracts sentences from a document to create a document summary

•Text analysis tools are available in a command line format structured to function like common UNIX or DOS command line formats.

•Text analysis tools can be used individually or in a combined mode depending on the required task.

•Configuration files allow document format flexibility and performance tuning for text searches. Command line switches provide additional flexibility by permitting the user to set runtime parameters.

•The documents need to be provided in plain text format. For other formats, conversion tools can be obtained from third parties such as KEYpak (http://www.keypak.com/).

[Diagram: Text Analysis tools: Language Identification, Categorization/Classification, Clustering (2), Feature Extraction, Summarization]

Text Analysis Tools: Feature Extraction
• To recognize significant vocabulary items
• To recognize all names referring to a single entity
• To provide the location of all person names, places and organizations in a text
• To find multi-word terms that have a meaning of their own
• To find abbreviations introduced in a text and link them with their full forms
• To recognize named relationships

Text Analysis Tools: Feature Extraction
• Produces statistics for each vocabulary item.
• Associates terms with canonical forms (e.g. "related" is associated with the term "relate").
• Feature extraction can be used as a preprocessor for the Clustering utility to bias (or control) clustering activities.
• Feature extraction can be run in two modes:
  1) Lookup mode, which refers to a schema generated by a training set and produces statistics for vocabulary items as they relate to the rest of the schema as well as within the document
  2) Exploration mode, which requires no training and yields textual data statistics for vocabulary items as they relate within the scope of the document(s) specified

Several classes of significant vocabulary can be recognized

Names are categorized

Significant concepts are detected automatically

Automatic keywording: the most significant terminology in the document

Feature Extraction - statistics & analysis
• The application shown here illustrates how one can use the statistics and analysis produced by feature extraction.
• Selected items can be highlighted within a document by using the location information in the feature extraction output (all vocabulary terms carry location information for this purpose).
• Selected categories can be filtered upon.
• Feature extraction produces a significance measure for each vocabulary item, which allows prioritization of keywords within the scope of individual documents or entire collections.
• This is a sample application which is not included in the software installation.

"Terms" include multi-word phrases whose

meaning is much more than that of the individual words

Multi-word phrases are the vocabulary in which concepts are expressed

Feature Extraction - statistics & analysis
• Recognizes multi-word phrases by pattern recognition: if a two-word pattern appears with an acceptable frequency, it is included as an extracted vocabulary item in the output.
• More heuristics are applied than mentioned here, but this is generally the textual processing that occurs.
• Concepts can be FORMULATED from the multi-word terms. The feature extraction utility assists in emphasizing prevalent multi-word terms.
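
The frequency rule for two-word patterns can be illustrated with a minimal Python sketch. This is not the product's algorithm: the stop-word list, the `min_count` threshold and the function name `extract_two_word_terms` are assumptions made for illustration, and the additional heuristics the tool applies are not modeled.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the product's own list is not shown in the slides.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "by"}

def extract_two_word_terms(text, min_count=2):
    """Return two-word patterns that occur at least min_count times."""
    counts = Counter()
    # Split on sentence punctuation so a pair never spans a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for first, second in zip(words, words[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                counts[(first, second)] += 1
    # Keep only the pairs that reach the (assumed) frequency threshold.
    return {" ".join(pair): n for pair, n in counts.items() if n >= min_count}

sample = (
    "Text mining finds patterns. Text mining uses feature extraction. "
    "Feature extraction supports clustering."
)
terms = extract_two_word_terms(sample)
print(terms)
```

Raising `min_count` keeps only the most prevalent pairs, which is one simple way to emphasize the multi-word terms from which concepts are formulated.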


Language Identification

Given a document, automatically discover the language(s) in which the document is written.

It can be used to:
• restrict search results by language
• organize the crawls by language
• route documents to language translators

Language Identification
• A 16-language dictionary is shipped with Intelligent Miner for Text for use by the Language Identification utility.
• The Language Identification utility also comes with a utility for adding to the shipped dictionary file to extend language identification. (You can even invent your own language and add it to the dictionary!)
• Documents can be analyzed for language content: using a command line option, Language Identification can report multiple degrees of language content in one pass (e.g. Document ABC is 75% English, 20% German, etc.).
• Allows further document organization by language and brings a degree of internationalization to applications.
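
The idea of dictionary-based identification with percentage output can be sketched in a few lines of Python. The word lists below are tiny hypothetical stand-ins for the shipped dictionary, whose format is not described in these slides, and `language_content` is an illustrative name rather than part of the product.

```python
import re

# Tiny hypothetical dictionaries; extending this mapping mirrors adding a
# language to the dictionary file.
DICTIONARIES = {
    "English": {"the", "and", "is", "of", "document", "this"},
    "German": {"der", "die", "und", "ist", "das", "dokument"},
}

def language_content(text):
    """Return the percentage of recognized words attributed to each language."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = {lang: sum(w in vocab for w in words)
            for lang, vocab in DICTIONARIES.items()}
    total = sum(hits.values())
    if total == 0:
        return {}  # no dictionary word recognized
    return {lang: round(100 * n / total) for lang, n in hits.items() if n}

shares = language_content("This is the document. Das ist")
primary = max(shares, key=shares.get)
print(shares, primary)
```

The per-language shares can then drive the uses listed earlier, for example routing a document to a translator based on its primary language.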


Categorization/Classification
• Given a defined taxonomy, it can assign documents to preexisting categories
• Utilizes feature extraction capabilities to do document comparisons efficiently
• Two stages:
  - training using sample documents
  - category assignment

Categorization/Classification

• Users determine the taxonomy for organizing the documents into topics.

• Users create training sets to define categories and use the supplied training utility.

• Each document is analyzed and a rank value assigned as it relates to each category.

• A command line switch allows the user to display varying numbers of categories with the document's associated rank value.

• REMEMBER: The categories are predefined by the user.
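
The two stages (training, then category assignment with a rank value per category) can be sketched as follows. This is a minimal illustration, assuming term-frequency profiles and cosine similarity as the rank value; the slides do not state how the supplied training utility or the rank values actually work, and `train`, `categorize` and `top_n` are hypothetical names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(training_sets):
    """Stage 1: build one term-frequency profile per user-defined category."""
    return {cat: Counter(w for doc in docs for w in tokenize(doc))
            for cat, docs in training_sets.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def categorize(doc, profiles, top_n=2):
    """Stage 2: rank the document against every category, keep the top_n."""
    vec = Counter(tokenize(doc))
    ranked = sorted(((cosine(vec, p), cat) for cat, p in profiles.items()),
                    reverse=True)
    return [(cat, round(score, 2)) for score, cat in ranked[:top_n]]

training_sets = {
    "banking": ["loan interest account", "bank account balance"],
    "insurance": ["policy claim premium", "insurance claim damage"],
}
profiles = train(training_sets)
result = categorize("claim for damage under my policy", profiles)
print(result)
```

Here `top_n` plays the role of the switch that controls how many categories are displayed with their rank values.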

Categorization: Solution Example


Clustering
• Functions to automatically group related documents based on their content, without requiring predefined classes
• Objects within a group are more similar to each other than to members of any other group
• Two approaches: hierarchical clustering and binary relational clustering

Clustering - Details

Preprocessing steps
• Analyze data input stream and divide it into individual textual components to be used for clustering
• Extract portions of individual textual components to be used for clustering (uses Feature Extraction as a preprocessor)
• Customize stop word list

Hierarchical clustering
• Structure document collection using lexical affinity based on similarity function
• Build clustering tree showing relationships between clusters of documents of varying granularity

Clustering - Details

Slicing
• Customize tree by applying adjustable thresholds to reduce complexity and zoom in on concepts of interest
• Use default threshold values for specific document collection
• Note: slicing allows merging similar clusters into a single cluster.

Clustering Output Formats
• HTML file viewable by browser
• Textual description to be parsed (in the format of a tree)
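
The effect of a threshold on cluster granularity can be illustrated with a small agglomerative sketch in Python. It is a stand-in, not the product's method: Jaccard similarity over word sets replaces the lexical-affinity similarity function, stopping the merge loop at a threshold approximates slicing the tree, and all names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two word sets: shared words over all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold):
    """Merge the most similar clusters until no pair reaches the threshold."""
    # Each cluster is (list of document names, union of their words).
    clusters = [([name], set(text.lower().split())) for name, text in docs.items()]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), jaccard(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(names) for names, _ in clusters)

docs = {
    "d1": "oil price rises",
    "d2": "oil price falls",
    "d3": "bank loan rates",
    "d4": "bank loan terms",
}
fine = cluster(docs, threshold=0.6)
medium = cluster(docs, threshold=0.4)
coarse = cluster(docs, threshold=0.0)
print(fine, medium, coarse)
```

Lowering the threshold merges similar clusters into fewer, larger ones, which is the behavior the slicing note describes.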

Hierarchical Clustering - Visualization Example

Clustering - Details

• This is a sample application which shows the use of the clustering results in an HTML format. This application is not shipped with the software.

• The HTML output can be configured to place actual document paths in the display on the browser so users may easily view the documents which were clustered RIGHT FROM THE BROWSER.

• Each cluster has a label generated from three two-word pairings, which are its most common lexical affinities.

• Similarity values in the application are shown as percentages. They are normalized, since the underlying similarity values range from 0 to 1000.
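
Both display conventions are simple to reproduce. The sketch below assumes that the percentage is the raw 0 to 1000 value divided by 10 and that a label joins the three most frequent two-word pairs; the pair counts are made-up sample data and the function names are hypothetical.

```python
from collections import Counter

def similarity_percent(raw):
    """Map a raw similarity on the 0-1000 scale to a percentage."""
    if not 0 <= raw <= 1000:
        raise ValueError("raw similarity must be between 0 and 1000")
    return raw / 10

def cluster_label(pair_counts, size=3):
    """Join the most frequent two-word pairs into a cluster label."""
    top = Counter(pair_counts).most_common(size)
    return ", ".join(" ".join(pair) for pair, _ in top)

# Made-up lexical affinity counts for one cluster.
pairs = {
    ("oil", "price"): 9,
    ("crude", "oil"): 7,
    ("price", "rise"): 4,
    ("market", "share"): 1,
}
label = cluster_label(pairs)
print(label, similarity_percent(750))
```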

Categorization: Comparison to Clustering

In clustering, document collections are processed and grouped into dynamically generated clusters ...

In categorization, document collections are processed and grouped into predetermined groupings based on a taxonomy generated with training sets ...

[Diagram, clustering: Document Collection, Clustering Utility, Cluster1, Cluster2, Cluster3, Cluster4]

[Diagram, categorization: Category1 Training Collection, Trainer, Document Collection, Categorizer, Cat1, Cat2, Cat3, Cat4]


Summarization

Extracts sentences from a document to create a document summary

Sentence selection is based on document structure and ranking of extracted features
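
Extractive summarization of this kind can be sketched as below. The scoring is an assumption: word frequency stands in for the ranking of extracted features, and a bonus for the first sentence is a crude stand-in for document structure. The stop-word list and the name `summarize` are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "for", "on"}

def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=2):
    """Pick the highest scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    weight = Counter(content_words(text))  # feature weight = document frequency

    def score(index):
        terms = content_words(sentences[index])
        feature_score = sum(weight[w] for w in terms) / (len(terms) or 1)
        position_bonus = 1.0 if index == 0 else 0.0  # assumed structure cue
        return feature_score + position_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]

text = (
    "Text mining extracts features from documents. "
    "The weather was pleasant. "
    "Features drive clustering of documents."
)
summary = summarize(text)
print(summary)
```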

Component: Text Search Engine

Text Search Engine

Fuzzy search

Hybrid queries

Free-text queries

Boolean queries

Synonyms search

Text Search Engine

Search Engine
• offers multiple search paradigms: Boolean, free text, fuzzy, hybrid, etc.
• supports linguistic analysis for documents in 21 languages including Arabic and Hebrew
• features Boolean queries, precise term search and fuzzy search for 4 DBCS languages

Mining Functions
• to extract key features in text
• to cluster result list
• to refine queries

Integrated in IBM DB2 Digital Library and IBM DB2 UDB Text Extender

Text Search Engine

• Users can refine searches, meaning that they can reuse previous search result sets to perform additional searches.

• Multilingual linguistic analysis performed:
  - basic text analysis (recognizing terms, normalizing terms, recognizing sentence boundaries)
  - reducing terms to their base form
  - stop word filtering
  - decomposition (splitting compound terms)

Basic Text Search Engine functions

Included as part of the basic functional set in the Text Search Engine:
• Precise index
• Ngram index
• Linguistic index
• 21 SBCS languages
• 4 DBCS languages
• Relevance ranking
• Boolean queries
• Free text queries
• Fuzzy and phonetic searches
• Thesaurus support

Text Search Engine: Details
• Document support for single byte character set (SBCS) languages
• Document support for double byte character set (DBCS) languages
• Linguistic search:
  - Dictionaries and synonym lists for SBCS languages
  - Terms are reduced to their base form, terms are decomposed, terms are normalized to standard form
• Boolean query: operators AND, NOT, OR
• Natural language query/free text query: to formulate a query in natural language
• Hybrid query: to combine a natural language query with a Boolean search term
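
How Boolean and hybrid queries behave can be shown with a toy inverted index in Python. This is only a sketch of the query types: the index layout, the scoring by number of matched words, and the names `build_index`, `term` and `hybrid` are assumptions, not the engine's API.

```python
import re
from collections import Counter, defaultdict

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

docs = {1: "IBM text search engine", 2: "web crawler for text", 3: "IBM web sites"}
index = build_index(docs)

def term(word):
    return index.get(word.lower(), set())

# Boolean operators map directly onto set operations.
text_and_ibm = term("text") & term("IBM")      # AND
web_or_search = term("web") | term("search")   # OR
text_not_web = term("text") - term("web")      # AND NOT

def hybrid(free_text, required):
    """Rank by matched free-text words, keeping only documents with `required`."""
    scores = Counter()
    for word in re.findall(r"[a-z]+", free_text.lower()):
        for doc_id in term(word) & term(required):
            scores[doc_id] += 1
    return [doc_id for doc_id, _ in scores.most_common()]

ranked = hybrid("search the web for text", "IBM")
print(text_and_ibm, web_or_search, text_not_web, ranked)
```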

Text Search Engine: Details
• Fuzzy query: to find misspelled words: TOYOTA/TOYOTTA, DATABASE/DATABSAE
• Phonetic query:
  - Technique: remove the vowel(s) from the search term and replace them with masking characters; eliminate duplicate consonants
  - To search for similar-sounding words: COLOR/COLOUR, SMITH/SMYTH, JANET/JEANNETTE ...
• Wildcard support for Boolean queries: front, middle and end masking, for word and character masking
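
The slide's word pairs can be checked with two small Python functions. Both are illustrative assumptions: Levenshtein edit distance is used as a stand-in for the engine's fuzzy matching, and the phonetic key simply drops vowels instead of inserting masking characters. Treating Y as a vowel is also an assumption, made so that SMITH and SMYTH match.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

VOWELS = set("AEIOUY")  # assumption: Y counts as a vowel

def phonetic_key(word):
    """Collapse doubled consonants, then drop vowels."""
    key = []
    previous = ""
    for ch in word.upper():
        if ch.isalpha() and ch not in VOWELS and ch != previous:
            key.append(ch)
        previous = ch
    return "".join(key)

print(edit_distance("TOYOTA", "TOYOTTA"), edit_distance("DATABASE", "DATABSAE"))
print(phonetic_key("JANET"), phonetic_key("JEANNETTE"))
```

A fuzzy search could then accept index terms within a small edit distance of the query term, and a phonetic search could match terms whose keys are equal.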

Text Search Engine: Even more details!

Section support
• Able to define a section of a document
• Restrict the search to given sections
• Example: define a section called Summary, then limit the search scope to the Summary section

Thesaurus support
• for all index types and many languages
• ngram index thesaurus (workstation only)
• Synonyms and broader/narrower terms
• DBCS language synonym support
• Not supported for BiDi languages or Russian
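
The Summary example for section support can be sketched as follows, assuming each document is already split into named sections. The data layout and the function `search_sections` are hypothetical; how sections are actually defined in the engine is not shown in these slides.

```python
def search_sections(docs, word, section=None):
    """Return ids of documents containing `word`, optionally in one section only."""
    word = word.lower()
    hits = []
    for doc_id, sections in docs.items():
        # Search one named section, or every section when none is given.
        scope = [sections.get(section, "")] if section else sections.values()
        if any(word in text.lower().split() for text in scope):
            hits.append(doc_id)
    return hits

docs = {
    "a": {"Summary": "crawler overview", "Body": "search engine details"},
    "b": {"Summary": "search engine overview", "Body": "crawler details"},
}
anywhere = search_sections(docs, "crawler")
in_summary = search_sections(docs, "crawler", section="Summary")
print(anywhere, in_summary)
```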

Text Search Engine: Text Mining Functions
• Provides text mining functions for English documents
  - Feature extraction
  - Organize result list
• Supports query refinement method for English documents
  - User assigns value to single documents

Text Search Engine: Query refinement example

Query Refinement Example
• This is a snapshot of the Java GUI which is shipped with Intelligent Miner for Text. The source code and instructions are shipped, and the GUI must be compiled by the end user to be operational.
• Interacts with the TextMiner Java server.
• Composed of JavaBeans which are shipped with Intelligent Miner for Text. The Beans can also be built and integrated into other applications to interact with Intelligent Miner for Text.
• The Java GUI provides a "ready-to-go" search GUI to interact with the Advanced Search engine. Users can perform various levels of queries and even browse the documents themselves by double-clicking in the window.
• Users must use a fully Java-enabled browser to run this pure Java applet.

Where to find the Text Search Engine functions

Basic functions
• S/390 Text Search Download for OS/390 V2.4 - V2.6
• IM4T V2.3 workstations

Extended functions (result list clustering, relevance feedback/query refinement, feature index)
• IM4T V2.3 for OS/390
• IM4T V2.3 for Workstations

Component: Java & JavaBeans

Java Components
• Java Search GUI: fully operational, NLS enabled
• JavaBeans for Rapid Application Development
  - Search
  - Administration
• Source is available and intended to be used as a 'starter kit'
• Works with the Text Search Engine
• GUI Enhancements: enhanced error recovery, help
• Use with Netscape and MS Internet Explorer
  - Internet Explorer 3.02 and 4.0 for NT
  - Internet Explorer 4.0 for Win95/98
  - Netscape Navigator 3.0/4.0 for Win95/98/NT
  - Netscape Navigator 3.0/4.0 Solaris/SPARC
  - Netscape Navigator 3.0 for Solaris/x86
  - Supported via plugin found at http://java.sun.com/products/plugin/1.1.1/index.html
• Sun's HotJava Browser

Java Components - Details

Component: WebCrawler

Web Crawler
• Is a robot used to collect HTML pages for indexing
• Customizable as to which HTML links are to be crawled (include and exclude patterns ...)
• Results are stored:
  - Data objects on AIX/NT file systems
  - Metadata in DB2
• Parallel crawling, results combined
• HTML page change frequency used as revisiting factor
• External subsystems can be notified of web changes detected by the crawler
• Create individual crawler using crawler toolkit

Web Crawler details

• Uses regular expression configuration files to filter or retain crawled URLs.

• The data objects are the actual URLs or documents. The size and type of the URLs to be stored are also configurable using the provided configuration file structure.

• Storage is scalable by mounting disk storage to file system storage locations.

• Multiple crawlers can be run at once. The only known limitations are the physical machine's processing and storage capacities.
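
Include and exclude filtering with regular expressions can be sketched in Python as below. The patterns, the example host and the function `should_crawl` are hypothetical; the crawler's real configuration file syntax is not shown in these slides.

```python
import re

# Hypothetical include/exclude patterns standing in for the configuration files.
INCLUDE = [re.compile(r"^https?://www\.example\.com/")]
EXCLUDE = [re.compile(r"\.(gif|jpg|zip)$"), re.compile(r"/cgi-bin/")]

def should_crawl(url):
    """Keep a URL if it matches an include pattern and no exclude pattern."""
    included = any(p.search(url) for p in INCLUDE)
    excluded = any(p.search(url) for p in EXCLUDE)
    return included and not excluded

candidates = [
    "http://www.example.com/docs/index.html",
    "http://www.example.com/images/logo.gif",
    "http://www.example.com/cgi-bin/search",
    "http://other.example.org/index.html",
]
kept = [url for url in candidates if should_crawl(url)]
print(kept)
```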

Web Crawler details

• Crawlers will dynamically adjust to increase monitoring for pages which change more frequently and vice-versa. This feature is also user configurable.

• A flexible API toolkit is provided for the Web Crawler to assist in tasks such as forwarding of workflow messages.

• API toolkit can also be used to allow the user to build their own crawler using provided components. Sample code is included to assist in the development.
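
One simple way to picture revisit scheduling driven by change frequency is an interval that shrinks when a page has changed and grows when it has not. The halving and doubling rule, the bounds and the name `next_interval` are assumptions for illustration only; the slides do not describe the crawler's actual policy.

```python
def next_interval(current_hours, changed, minimum=1.0, maximum=168.0):
    """Halve the revisit interval after a change, double it otherwise."""
    proposed = current_hours / 2 if changed else current_hours * 2
    # Clamp to assumed, user-configurable bounds.
    return max(minimum, min(maximum, proposed))

interval = 24.0
for changed in [True, True, False]:
    interval = next_interval(interval, changed)
print(interval)
```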

Web Crawler Package

Consists of 2 components:
• A ready-to-run Web Crawler
• A Web Crawler toolkit to build customized Web crawlers

The NetQuestion Solution
• A pre-built, ready-to-use Internet/intranet text-search solution for searching a local Web server
• A multiserver domain solution based on the Text Search Engine and Web Crawler

NetQuestion - Single WebServer Support
• Workstations
  - SBCS Search Forms and CGI script
• S/390
  - SBCS Search Forms and CGI script
  - English Admin Forms and Script

NetQuestion - Multiple WebServer Support
• Drop-in solution with some assumed defaults
• Fully configurable solution
• Spellchecker support

NetQuestion Solution - details

Natural Language Support

NLS Support

IBM Text Search Engine
• 18 SBCS languages: US English, UK English, Catalan, Danish, Dutch, German, Swiss German, Spanish, Finnish, French, Canadian French, Icelandic, Italian, Norwegian Bokmål, Norwegian Nynorsk, Portuguese, Brazilian Portuguese, and Swedish, plus Russian, Hebrew (BiDi), Arabic (BiDi)
• 4 DBCS languages (Japanese, Simplified Chinese, Traditional Chinese, Korean)

Text Analysis Tools
• Language ID can identify 14 languages
• All other tools are English only

EURO support (new code page 8859-15)
• Text Analysis Tools recognize the Euro abbreviation

NLS Support - Messages and GUI
• Fully enabled messages across all platforms
• Ship translations in all Group I languages (English, French, German, Italian, Spanish, Brazilian Portuguese, Simplified Chinese, Traditional Chinese, Japanese, Korean)
• Java Search GUI sample is enabled, not to be translated
• JavaBeans not enabled

NetQ Solution on S/390
• NLS for Search forms and scripts (English, French, German, Italian, Spanish, Brazilian Portuguese, Danish, Swedish, Norwegian, Finnish, Simplified Chinese, Traditional Chinese, Japanese, Korean)
• No NLS of Admin Search forms and scripts

Documentation
• Online documentation in HTML for workstation product
• S/390 relies upon documentation on workstation CD-ROMs
• PDFs are shipped on workstation CD-ROMs
• Online documentation search available for all workstation platforms

Documentation - Details

Title                                             BookMaster  HTML  PDF  Hardcopy  Cmts
Getting Started                                   Y           Y     Y    Y         Translated into Group 1
Text Analysis Tools                               Y           Y     Y    N
IBM Text Search Engine                            Y           Y     Y    N
IBM Text Search Engine Customization and Admin    Y           Y     Y    N
WebCrawler                                        Y           Y     Y    N
Java GUI, Java Beans Search, Java Beans Admin     N           Y     N    N
NetQuestion Solution                              Y           Y     Y    N
Welcome HTML page with search                     N           Y     N    N
Fact Sheet                                        N           N     N    Y
IBM Web Crawler and Toolkit                       Y           Y     Y    N
WWW External Pages                                N           Y     N    N

Presentation Summary

IBM Intelligent Miner for Text
• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions

Text Analysis

Text Search Engine

Web Crawler Package

Platforms
• AIX, Sun Solaris, Windows NT, OS/390

Announcement
• December 8, 1998

General Availability
• Workstation product: December 29, 1998
• Mainframe product: January 29, 1999

Evaluation License
• 60-day trial version for AIX, Windows NT, Sun Solaris
• Order Number: GK2T-0167

Price for workstation product
• 30K$ per server

Platforms Available

Web presence
• Product Features, Downloads, News, Library, Business partners, Case studies, Service, Support, Feedback
• www.software.ibm.com/iminer/fortext

Intelligent Miner for Text
