Post on 22-Dec-2015
page 1
Copyrighted material, John Tullis
IBM Intelligent Miner for Text
John Tullis, DePaul Instructor, john.d.tullis@us.arthurandersen.com
page 2
IBM Intelligent Miner for Text
• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions
[Diagram: NetQuestion Solution, Text Analysis Tools, Text Search Engine, Web Crawler Package]
page 3
Intelligent Miner for Text
• For companies of any size and for different industries
[Diagram: Media, Petroleum, Banking, Education, Government, Insurance]
page 4
Potential Applications
• Customer complaints analysis
• Newswire analysis
• Intelligent Website
• Opinion survey classification
• Competitive intelligence
• Corporate image analysis
page 5
Intelligent Miner for Text: Platforms supported
Platform | Text Analysis Tools | Text Search Engine Server | Text Search Engine Client | Text Search Engine Java GUI / JavaBeans | Web Crawler Package | NetQ Solution
AIX 4.3 | Y | Y | Y | Y | Y | Y
Solaris 2.5.1 | Y | Y | Y | Y | Y | Y
Win NT 4.0 SP3 | Y | Y | Y | Y | Y | Y
OS/390 V2R4, V2R5, V2R6 | Y | Y | Y | Y | Y | Y
page 6
Reference customers & Success stories
• FinanceWise (search engine for financial content on the Internet): www.financewise.com
• IBM web sites (incl. 2000 IBM intranet sites): www.ibm.com
• Sueddeutsche Zeitung (classified ads on the Sueddeutsche Zeitung Web site): www.sueddeutsche.de
• SearchCafe (Business Partner): www.search-cafe.com
• Success stories available at www.software.ibm.com/iminer/fortext
page 7
Component: Text Analysis Tools
page 8
Functionality
• Language Identification
• Clustering of document collection
  - hierarchical clustering
  - relational clustering
• Categorization/Classification of document collection
• Feature Extraction
• Summarization
page 9
Text Analysis Tools
To automate tasks previously done manually:
• automatically identifies the language of a document
• automatically groups related documents based on their content, without requiring predefined classes
• automatically assigns documents to one or more user-defined categories
• automatically recognizes significant items in text, such as names, technical terms, and abbreviations
• automatically extracts sentences from a document to create a document summary
page 10
Text Analysis
• Text analysis tools are available in a command line format structured to function like common UNIX or DOS command line formats.
• Text analysis tools can be used individually or in a combined mode, depending on the required task.
• Configuration files allow document format flexibility and performance tuning for text searches. Command line switches provide additional flexibility by permitting the user to set runtime parameters.
• The documents need to be provided in plain text format. For other formats, conversion tools can be obtained from third parties such as KEYpak (http://www.keypak.com/).
page 11
[Diagram: Text Analysis Tools: Clustering (2), Summarization, Language Identification, Classification, Feature Extraction]
page 12
Text Analysis Tools: Feature Extraction
• To recognize significant vocabulary items
• To recognize all names referring to a single entity
• To provide the location of all person names, places and organizations in a text
• To find multi-word terms that have a meaning of their own
• To find abbreviations introduced in a text and link them with their full forms
• To recognize named relationships
page 13
Text Analysis Tools: Feature Extraction
• Produces statistics for each vocabulary item.
• Associates terms with canonical forms (e.g. "related" is associated with the term "relate").
• Feature extraction can be used as a preprocessor for the Clustering utility to bias (or control) clustering activities.
• Feature extraction can be run in two modes:
  1) Lookup mode, which refers to a schema generated from a training set and produces statistics for vocabulary items as they relate to the rest of the schema as well as within the document
  2) Exploration mode, which requires no training and yields statistics for vocabulary items as they relate within the scope of the document(s) specified
page 14
• Several classes of significant vocabulary can be recognized
• Names are categorized
• Significant concepts are detected automatically
• Automatic keywording: the most significant terminology in the document
page 15
Feature Extraction - statistics & analysis
• The application here shows how one can use the statistics and analysis produced by feature extraction.
• Selected items within a document are highlighted by using the location information in the feature extract (all vocabulary terms have location information to accomplish this).
• Selected categories can be filtered upon.
• A significance measure for each vocabulary item is produced by feature extraction, which allows prioritization of keywords within the scope of individual documents or entire collections.
• This is a sample application which is not included in the software installation.
page 16
• "Terms" include multi-word phrases whose meaning is much more than that of the individual words
• Multi-word phrases are the vocabulary in which concepts are expressed
page 17
Feature Extraction - statistics & analysis
• Recognizes multi-word phrases by pattern recognition: if a two-word pattern appears with an acceptable frequency, it is included as an extracted vocabulary item in the output.
• More heuristics are applied than mentioned here, but generally this is the textual processing which occurs.
• Concepts can be FORMULATED from the multi-word terms. The feature extraction utility assists in emphasizing prevalent multi-word terms.
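The two-word pattern idea above can be sketched in a few lines of Python. This is only an illustration of frequency-based pair counting, not the product's algorithm: the function name, the `min_count` threshold and the tiny stop word set are assumptions, and the additional heuristics the slide mentions are not modeled.

```python
import re
from collections import Counter

# Hypothetical stop words; the real utility applies more heuristics than this.
STOP_WORDS = frozenset({"a", "an", "the", "of"})

def frequent_word_pairs(text, min_count=2, stop_words=STOP_WORDS):
    """Return adjacent two-word patterns that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    pair_counts = Counter(zip(words, words[1:]))
    return {
        " ".join(pair): count
        for pair, count in pair_counts.items()
        if count >= min_count and not (set(pair) & stop_words)
    }

sample = ("Text mining finds patterns. Text mining needs clean input. "
          "A search engine indexes text; a search engine ranks results.")
print(frequent_word_pairs(sample))
```

Without the stop word filter the pair "a search" would also qualify, which hints at why frequency alone is not enough.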
page 18
[Diagram: Text Analysis Tools: Clustering (2), Summarization, Language Identification, Classification, Feature Extraction]
page 19
Language Identification
• Given a document, automatically discover the language(s) in which the document is written
• It can be used to:
  - restrict search results by language
  - organize the crawls by language
  - route documents to language translators
page 20
Language Identification
• A 16-language dictionary is shipped with Intelligent Miner for Text to be used by the Language Identification utility.
• The Language Identification utility also comes with a utility for adding to the shipped dictionary file, which extends language identification. (You can even invent your own language and add it to the dictionary!)
• Documents can be analyzed for language content: using a command line option, Language Identification can report multiple degrees of language content in one pass (e.g. Document ABC is 75% English, 20% German, etc.).
• Allows further document organization by language and adds a degree of internationalization to applications.
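A dictionary-based report of "degrees of language content" might look roughly like the Python sketch below. The two-language mini-dictionary, the function name and the scoring by simple word hits are all assumptions made for illustration; the shipped dictionary and the utility's actual scoring are not described in these slides.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; extend it to add (or invent) further languages.
DICTIONARY = {
    "English": {"the", "and", "is", "of", "in", "this"},
    "German": {"der", "die", "das", "und", "ist", "nicht"},
}

def language_shares(text, dictionary=DICTIONARY):
    """Return the percentage of dictionary hits per language for one document."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = Counter()
    for word in words:
        for language, vocabulary in dictionary.items():
            if word in vocabulary:
                hits[language] += 1
    total = sum(hits.values())
    if total == 0:
        return {}  # no dictionary word found, language unknown
    return {language: round(100 * n / total) for language, n in hits.items()}

document = "This is the house and the garden of Berlin. Das ist gut."
print(language_shares(document))
```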
page 21
[Diagram: Text Analysis Tools: Clustering (2), Summarization, Language Identification, Classification, Feature Extraction]
page 22
Categorization/Classification
• Given a defined taxonomy, it can assign documents to preexisting categories
• Utilizes feature extraction capabilities to do document comparisons efficiently
• Two stages:
  - training using sample documents
  - category assignment
page 23
Categorization/Classification
• Users determine the taxonomy for organizing the documents into topics.
• Users create training sets to define categories and use the supplied training utility.
• Each document is analyzed and a rank value assigned as it relates to each category.
• A command line switch allows the user to display varying numbers of categories with the document's associated rank value.
• REMEMBER: The categories are predefined by the user.
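The two stages (training, then category assignment with a rank value per category) can be sketched as follows. This is a generic term-frequency and cosine-similarity illustration, not the supplied training utility: the function names, the sample categories and the `top_n` parameter (standing in for the command line switch) are assumptions.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Count the lowercase word occurrences in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(training_sets):
    """Stage 1: build one term profile per user-defined category."""
    return {category: term_vector(" ".join(docs)) for category, docs in training_sets.items()}

def categorize(profiles, document, top_n=2):
    """Stage 2: rank every category for the document and return the top_n."""
    vector = term_vector(document)
    ranked = sorted(((cosine(vector, profile), category)
                     for category, profile in profiles.items()), reverse=True)
    return [(category, round(score, 2)) for score, category in ranked[:top_n]]

profiles = train({
    "Banking": ["loan interest rate account", "bank account deposit interest"],
    "Insurance": ["policy claim premium coverage", "insurance claim damage policy"],
})
print(categorize(profiles, "the claim on my insurance policy was denied"))
```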
page 24
Categorization: Solution Example
page 25
[Diagram: Text Analysis Tools: Clustering (2), Summarization, Language Identification, Classification, Feature Extraction]
page 26
Clustering
• Functions to automatically group related documents based on their content, without requiring predefined classes
• Objects within a group are more similar to each other than to members of any other group
• Two approaches: hierarchical clustering and binary relational clustering
page 27
Clustering - Details
• Preprocessing steps
  - Analyze the data input stream and divide it into individual textual components to be used for clustering
  - Extract portions of individual textual components to be used for clustering (uses Feature Extraction as a preprocessor)
  - Customize stop word list
• Hierarchical clustering
  - Structure the document collection using lexical affinity based on a similarity function
  - Build a clustering tree showing relationships between clusters of documents of varying granularity
page 28
Clustering - Details
• Slicing
  - Customize the tree by applying adjustable thresholds to reduce complexity and zoom in on concepts of interest
  - Use default threshold values for a specific document collection
  - Note: slicing allows merging similar clusters into a single cluster.
• Clustering Output Formats
  - HTML file viewable by browser
  - Textual description to be parsed (in the format of a tree)
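To make the effect of a slicing threshold concrete, here is a minimal Python sketch of bottom-up (agglomerative) clustering that stops merging once no pair of clusters is similar enough. It assumes Jaccard overlap of word sets as the similarity function and invents the function names and sample documents; the product's lexical-affinity similarity and tree output are not reproduced.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of terms two clusters have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(documents, threshold=0.2):
    """Merge the most similar clusters until no pair reaches the threshold."""
    clusters = [({name}, set(text.lower().split())) for name, text in documents.items()]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]))
        if jaccard(clusters[i][1], clusters[j][1]) < threshold:
            break  # the tree is "sliced" at this similarity level
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(names) for names, _ in clusters)

docs = {
    "d1": "oil price rises",
    "d2": "oil price falls",
    "d3": "bank loan rates",
    "d4": "bank loan approved",
}
print(cluster(docs, threshold=0.2))  # finer granularity
print(cluster(docs, threshold=0.0))  # everything merged into one cluster
```

Lowering the threshold yields fewer, broader clusters; raising it keeps more, tighter ones.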
page 29
Hierarchical Clustering - Visualization Example
page 30
Clustering - Details
• This is a sample application which shows the use of the clustering results in an HTML format. This application is not shipped with the software.
• The HTML output can be configured to place actual document paths in the display on the browser so users may easily view the documents which were clustered RIGHT FROM THE BROWSER.
• Each cluster has a label generated from three two-word pairings, which are its most common lexical affinities.
• Similarity values in the application are shown as percentages. These are normalized: the underlying similarity values range from 0 to 1000.
page 31
Categorization: Comparison to Clustering
In clustering, document collections are processed and grouped into dynamically generated clusters ...
[Diagram: Document Collection -> Clustering Utility -> Cluster1, Cluster2, Cluster3, Cluster4]
In categorization, document collections are processed and grouped into predetermined groupings based on a taxonomy generated with training sets ...
[Diagram: Category1, Category2, Category3 Training Collections -> Trainer; Document Collection -> Categorizer -> Cat1, Cat2, Cat3, Cat4]
page 32
[Diagram: Text Analysis Tools: Clustering (2), Summarization, Language Identification, Classification, Feature Extraction]
page 33
Summarization
Extracts sentences from a document to create a document summary
Sentence selection is based on document structure and ranking of extracted features
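Sentence extraction of this kind can be illustrated with a short Python sketch. It assumes that "ranking of extracted features" can be approximated by word frequency and that "document structure" can be approximated by a bonus for the opening sentence; both choices, along with the names and the stop word list, are assumptions rather than the tool's documented method.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "for"}

def content_words(text):
    """Lowercase words of a text with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    weight = Counter(content_words(text))  # crude stand-in for feature ranking

    def score(index):
        terms = content_words(sentences[index])
        feature_score = sum(weight[w] for w in terms) / (len(terms) or 1)
        position_bonus = 1.0 if index == 0 else 0.0  # crude stand-in for structure
        return feature_score + position_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:max_sentences]))

text = ("Text mining extracts features. The weather was nice. "
        "Text mining ranks features for text search.")
print(summarize(text))
```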
page 34
Component: Text Search Engine
page 35
Text Search Engine
Fuzzy search
Hybrid queries
Free-text queries
Boolean queries
Synonyms search
page 36
Text Search Engine
• Search Engine
  - offers multiple search paradigms: Boolean, free text, fuzzy, hybrid, etc.
  - supports linguistic analysis for documents in 21 languages including Arabic and Hebrew
  - features Boolean queries, precise term search and fuzzy search for 4 DBCS languages
• Mining Functions
  - to extract key features in text
  - to cluster result list
  - to refine queries
• Integrated in IBM DB2 Digital Library and IBM DB2 UDB Text Extender
page 37
Text Search Engine
• A user can refine searches, meaning that previous search result sets can be reused to perform additional searches.
• Multilingual linguistic analysis performed:
  - basic text analysis (recognizing terms, normalizing terms, recognizing sentence boundaries)
  - reducing terms to their base form
  - stop word filtering
  - decomposition (splitting compound terms)
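The analysis steps listed above can be chained as in the following Python sketch. The lookup tables are toy stand-ins for real dictionaries, and the function name and regular expressions are assumptions; the engine's own linguistic resources are not shown in these slides.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}
# Toy lookups standing in for real linguistic dictionaries.
BASE_FORMS = {"searches": "search", "searching": "search", "documents": "document"}
COMPOUNDS = {"database": ["data", "base"], "newswire": ["news", "wire"]}

def analyze(text):
    """Return one list of index terms per sentence."""
    result = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):  # sentence boundaries
        terms = []
        for token in re.findall(r"[A-Za-z]+", sentence):       # recognize terms
            term = token.lower()                                # normalize
            if term in STOP_WORDS:                              # stop word filtering
                continue
            term = BASE_FORMS.get(term, term)                   # reduce to base form
            terms.extend(COMPOUNDS.get(term, [term]))           # decomposition
        result.append(terms)
    return result

print(analyze("Searching the database. The newswire documents."))
```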
page 38
Basic Text Search Engine functions
Included as part of the basic functional set in the Text Search Engine:
• Precise index
• ngram index
• linguistic index
• 21 SBCS languages
• 4 DBCS languages
• relevance ranking
• Boolean queries
• free text queries
• fuzzy and phonetic searches
• thesaurus support
page 39
Text Search Engine: Details
• Document support for single-byte character set (SBCS) languages
• Document support for double-byte character set (DBCS) languages
• Linguistic search:
  - Dictionaries and synonym lists for SBCS languages
  - Terms are reduced to their base form, decomposed, and normalized to standard form
• Boolean query: operators AND, NOT, OR
• Natural language query/free text query: to formulate a query in natural language
• Hybrid query: to combine a natural language query with a Boolean search term
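How the AND, OR and NOT operators act on an index can be shown with a small inverted-index sketch in Python. The function names and the `required`/`optional`/`excluded` parameters are hypothetical and chosen for illustration; this is not the engine's query syntax.

```python
from collections import defaultdict

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, all_ids, required=(), optional=(), excluded=()):
    """required terms are ANDed, optional terms are ORed, excluded terms are NOTed."""
    result = set(all_ids)
    for term in required:
        result &= index.get(term, set())
    if optional:
        result &= set().union(*(index.get(term, set()) for term in optional))
    for term in excluded:
        result -= index.get(term, set())
    return sorted(result)

docs = {1: "ibm text search", 2: "ibm data mining", 3: "text mining tools"}
index = build_index(docs)
print(search(index, docs, required=["ibm"], excluded=["search"]))  # ibm AND NOT search
print(search(index, docs, optional=["search", "tools"]))           # search OR tools
```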
page 40
Text Search Engine: Details
• Fuzzy query: to find misspelled words: TOYOTA/TOYOTTA, DATABASE/DATABSAE
• Phonetic query: to search for similar-sounding words: COLOR/COLOUR, SMITH/SMYTH, JANET/JEANNETTE ...
  - Technique: remove the vowel(s) from the search term and replace them with masking characters, and eliminate duplicate consonants
• Wildcard support for Boolean queries: front, middle and end masking for word and character masking
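Both query types can be approximated in a few lines of Python. The fuzzy lookup below relies on the standard library's `difflib.get_close_matches` rather than the engine's own matching, and the phonetic key simply drops vowels instead of substituting masking characters, so that two spellings can be compared directly. The function names, the cutoff value and treating Y as a vowel are assumptions.

```python
import difflib

VOWELS = set("AEIOUY")  # Y is treated as a vowel here so SMITH and SMYTH agree

def fuzzy_lookup(term, vocabulary, cutoff=0.8):
    """Return indexed words whose spelling is close to a possibly misspelled term."""
    return difflib.get_close_matches(term.upper(), vocabulary, n=3, cutoff=cutoff)

def sound_key(term):
    """Drop vowels, then collapse repeated consonants."""
    key = []
    for ch in term.upper():
        if ch in VOWELS or not ch.isalpha():
            continue
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)

vocabulary = ["TOYOTA", "DATABASE", "COLOR"]
print(fuzzy_lookup("TOYOTTA", vocabulary))
print(sound_key("JANET"), sound_key("JEANNETTE"))
```

Two terms count as similar-sounding in this sketch when their keys are equal.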
page 41
Text Search Engine: Even more details!
• Section support
  - Able to define a section of a document
  - Restrict the search to given sections
  - Example: define a section called Summary, then limit the search scope to the Summary section
• Thesaurus support
  - for all index types and many languages
  - ngram index thesaurus (workstation only)
  - Synonyms and broader/narrower terms
  - DBCS language synonym support
  - Not supported for BiDi languages or Russian
page 42
Text Search Engine: Text Mining Functions
• Provides text mining functions for English documents
  - Feature extraction
  - Organize result list
• Supports query refinement method for English documents
  - User assigns value to single documents
page 43
Text Search Engine: Query refinement example
page 44
Query Refinement Example
• This is a snapshot of the Java GUI which is shipped with Intelligent Miner for Text. The source code and instructions are shipped, and the GUI must be compiled by the end user to be operational.
• Interacts with the TextMiner Java server.
• Composed of JavaBeans which are shipped with Intelligent Miner for Text. The Beans can also be built and integrated into other applications to interact with Intelligent Miner for Text.
• The Java GUI provides a "ready-to-go" search GUI to interact with the Advanced Search engine. Users can perform various levels of queries and even browse the documents themselves by double-clicking in the window.
• Users must use a fully Java-enabled browser to run this pure Java applet.
page 45
Where to find the Text Search Engine functions
• Basic functions
  - S/390 Text Search Download for OS/390 V2.4 - V2.6
  - IM4T V2.3 workstations
• Extended functions (result list clustering, relevance feedback/query refinement, feature index)
  - IM4T V2.3 for OS/390
  - IM4T V2.3 for Workstations
page 46
Component: Java & JavaBeans
page 47
Java Components
• Java Search GUI: fully operational, NLS enabled
• JavaBeans for Rapid Application Development
  - Search
  - Administration
• Source is available and intended to be used as a 'starter kit'
• Works with the Text Search Engine
page 48
Java Components - Details
• GUI Enhancements: enhanced error recovery, help
• Use with NetScape and MS Internet Explorer
  - Internet Explorer 3.02 and 4.0 for NT
  - Internet Explorer 4.0 for Win95/98
  - NetScape Navigator 3.0/4.0 for Win95/98/NT
  - NetScape Navigator 3.0/4.0 Solaris/SPARC
  - NetScape Navigator 3.0 for Solaris/x86
• Supported via plugin found at http://java.sun.com/products/plugin/1.1.1/index.html
• Sun's HotJava Browser
page 49
Component: WebCrawler
page 50
Web Crawler
• Is a robot used to collect HTML pages for indexing
• Customizable as to which HTML links are to be crawled (include and exclude patterns ...)
• Results are stored:
  - Data objects on AIX/NT file systems
  - Metadata in DB2
• Parallel crawling, results combined
• HTML page change frequency used as revisiting factor
• External subsystems can be notified of web changes detected by the crawler
• Create individual crawler using crawler toolkit
page 51
Web Crawler details
• Uses regular-expression configuration files to filter or retain crawled URLs.
• The data objects are the actual URLs or documents. The size and type of the URLs to be stored are also configurable using the provided configuration file structure.
• Storage is scalable by mounting disk storage to the file system storage locations.
• Multiple crawlers can be run at once. The only known limitation is physical machine processing and storage capacities.
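Include and exclude patterns of this kind can be sketched with Python's `re` module. The patterns, the example host and the function name below are hypothetical; the crawler's real configuration file format is not shown in these slides.

```python
import re

# Hypothetical patterns; a real crawler would read these from its configuration files.
INCLUDE = [re.compile(r"^https?://www\.example\.com/")]
EXCLUDE = [re.compile(r"\.(gif|jpg|zip)$"), re.compile(r"/cgi-bin/")]

def should_crawl(url):
    """Keep a URL only if it matches an include pattern and no exclude pattern."""
    if not any(pattern.search(url) for pattern in INCLUDE):
        return False
    return not any(pattern.search(url) for pattern in EXCLUDE)

for url in ["http://www.example.com/docs/index.html",
            "http://www.example.com/images/logo.gif",
            "http://other.example.org/index.html"]:
    print(url, should_crawl(url))
```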
page 52
Web Crawler details
• Crawlers will dynamically adjust to increase monitoring for pages which change more frequently and vice-versa. This feature is also user configurable.
• A flexible API toolkit is provided for the web crawler to assist in tasks such as forwarding of workflow messages.
• API toolkit can also be used to allow the user to build their own crawler using provided components. Sample code is included to assist in the development.
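One simple way to revisit frequently changing pages more often is to shrink the revisit interval after a detected change and grow it otherwise. The Python sketch below assumes a halve-or-double rule with fixed bounds; the function name, the bounds and the factor are illustrative assumptions, not the crawler's documented policy.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0, factor=2.0):
    """Shorten the revisit interval after a change, lengthen it otherwise, within bounds."""
    proposed = current_hours / factor if changed else current_hours * factor
    return max(min_hours, min(max_hours, proposed))

# Simulate four visits to one page: changed, changed, unchanged, unchanged.
interval = 24.0
for changed in (True, True, False, False):
    interval = next_interval(interval, changed)
    print(interval)
```

Making `min_hours`, `max_hours` and `factor` parameters mirrors the slide's point that the behavior is user configurable.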
page 53
Web Crawler Package
Consists of 2 components:
• A ready-to-run Web Crawler
• A Web Crawler toolkit to build customized Web crawlers
page 54
The NetQuestion Solution
page 55
NetQuestion Solution
A Pre-built ready to use Internet/intranet text-search solution for searching a local Web server
A multiserver domain solution based on the Text Search Engine and Web Crawler
page 56
NetQuestion Solution - details
• NetQuestion - Single WebServer Support
  - Workstations: SBCS Search Forms and CGI script
  - S/390: SBCS Search Forms and CGI script; English Admin Forms and Script
• NetQuestion - Multiple WebServer Support
  - Drop-in solution with some assumed defaults
  - Fully configurable solution
• Spellchecker support
page 57
Natural Language Support
page 58
NLS Support
• IBM Text Search Engine
  - 18 SBCS Languages: US English, UK English, Catalan, Danish, Dutch, German, Swiss German, Spanish, Finnish, French, Canadian French, Icelandic, Italian, Norwegian Bokmål, Norwegian Nynorsk, Portuguese, Brazilian Portuguese, and Swedish, plus Russian, Hebrew (BiDi), Arabic (BiDi)
  - 4 DBCS Languages (Japanese, S Chinese, T Chinese, Korean)
• Text Analysis Tools
  - Language ID can identify 14 languages
  - all other tools are English only
• EURO support (new code page 8859-15)
  - TATools to recognize Euro Abbr
page 59
NLS Support - Messages and GUI
• Fully enabled messages across all platforms
• Ship translations in all Group I languages (English, French, German, Italian, Spanish, Brazilian Portuguese, Simplified Chinese, Traditional Chinese, Japanese, Korean)
• Java Search GUI sample is enabled, not to be translated
• JavaBeans not enabled
• NetQ Solution on S/390
  - NLS for Search forms and scripts (English, French, German, Italian, Spanish, Brazilian Portuguese, Danish, Swedish, Norwegian, Finnish, Simplified Chinese, Traditional Chinese, Japanese, Korean)
  - No NLS of Admin Search forms and scripts
page 60
Documentation
page 61
Documentation
• On-line Documentation in HTML for workstation product
• S/390 relies upon documentation on workstation CD-ROMs
• PDFs are shipped on workstation CD-ROMs
• Online Documentation Search available for all workstation platforms
page 62
Documentation - Details
Title | BookMaster | HTML | PDF | Hardcopy | Comments
Getting Started | Y | Y | Y | Y | Translated into Group 1
Text Analysis Tools | Y | Y | Y | N |
IBM Text Search Engine | Y | Y | Y | N |
IBM Text Search Engine Customization and Admin | Y | Y | Y | N |
WebCrawler | Y | Y | Y | N |
Java GUI, Java Beans Search, Java Beans Admin | N | Y | N | N |
NetQuestion Solution | Y | Y | Y | N |
Welcome HTML page with search | N | Y | N | N |
Fact Sheet | N | N | N | Y |
IBM Web Crawler and Toolkit | Y | Y | Y | N |
WWW External Pages | N | Y | N | N |
page 63
Presentation Summary
page 64
IBM Intelligent Miner for Text
• A knowledge-discovery software development toolkit to build advanced text-mining and text-search applications
• A NetQuestion Solution to construct Internet/intranet text-search solutions
[Diagram: NetQuestion Solution, Text Analysis Tools, Text Search Engine, Web Crawler Package]
page 65
Platforms Available
• Platforms: AIX, Sun Solaris, Windows NT, OS/390
• Announcement: December 8, 1998
• General Availability
  - Workstation product: December 29, 1998
  - Mainframe product: January 29, 1999
• Evaluation License
  - 60-day trial version for AIX, Windows NT, Sun Solaris
  - Order Number: GK2T-0167
• Price for workstation product: 30K$ per server
page 66
Intelligent Miner for Text
• Web presence: Product Features, Downloads, News, Library, Business partners, Case studies, Service, Support, Feedback
• www.software.ibm.com/iminer/fortext