TextMOLE: Text Mining Operations Library and Environment
Daniel B. Waegel and
April Kontostathis, Ph.D.
Ursinus College
Collegeville PA
What?
Advanced application for indexing and searching a text database.
Allows users to quickly analyze a corpus of documents and determine which parameters will provide maximal retrieval performance.
Who?
Instructors – demonstrate information retrieval concepts in the classroom
Students – hands-on exploration of concepts often covered in an introductory course in information retrieval or artificial intelligence
Researchers – ‘quick and dirty’ analysis of an unfamiliar collection
Juniors and Seniors – capstone experiences in computer science
Why?
Students unfamiliar with applications which require manipulation of unstructured text
IR students develop basic IR systems, but do not have time to implement and test a variety of parameters
Existing systems do not tightly integrate indexing and retrieval functions
– R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley/ACM Press, New York, 1999.
– R. K. Belew. Finding Out About. Cambridge University Press, 2000.
– G. Salton. The SMART Retrieval System–Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, New Jersey, 1971.
Time! Students in AI do not even have time to implement a basic IR system.
How?
Overview of the Application
– Indexing
– Single Query Retrieval
– Multiple Query Retrieval
Sample Assignments
– Artificial Intelligence
– Information Retrieval
– Capstone Projects
Indexing
Single Query Specification
Single Query Results
Multiple Query Specification
Multiple Query Results
How?
Overview of the Application
– Indexing
– Single Query Retrieval
– Multiple Query Retrieval
Sample Assignments
– Artificial Intelligence
– Information Retrieval
– Capstone Projects
Information Retrieval Course
Assignment 2
– Assumes Assignment 1 had students develop their own rudimentary IR systems
– Using a corpus provided by the instructor or developed by the student (min. 100 documents)
    Convert to XML format
    Parse with TextMOLE
    Identify a set of standard queries for the collection (truth set not necessary)
    Vary parameters (stemming vs. no stemming, various weighting schemes, various stop lists)
    Decide which set of parameters works best for your collection.
    Write a paper describing your experiments and the results; be sure to defend your conclusions!
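One way to organize the "vary parameters" step is to list every combination before running any experiments, so that no setting is skipped when writing up the results. The sketch below is plain Python and does not call TextMOLE; the option labels are hypothetical names chosen for illustration, not TextMOLE's own settings.

```python
from itertools import product

# Hypothetical option labels for illustration; not TextMOLE's own setting names.
STEMMING = [True, False]
WEIGHTING = ["tf", "tf-idf", "log-entropy"]
STOP_LISTS = ["none", "short", "long"]

def parameter_grid():
    """Return every (stemming, weighting, stop_list) combination to try."""
    return [
        {"stemming": s, "weighting": w, "stop_list": sl}
        for s, w, sl in product(STEMMING, WEIGHTING, STOP_LISTS)
    ]

if __name__ == "__main__":
    # Print a numbered checklist of runs to perform by hand in the tool.
    for run_id, params in enumerate(parameter_grid(), start=1):
        print(run_id, params)
```

With two stemming options, three weighting schemes, and three stop lists, the checklist contains 18 runs; trimming any of the lists shrinks the experiment accordingly.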
Information Retrieval Course
Assignment 3 or 4
– Using the corpus from the previous assignment (minimum of 100 documents)
– Develop a set of standard queries
– Determine which documents are truly relevant to these queries (involves lots of reading and frustration)
– Use the Multiple Query function of TextMOLE to determine precision and recall
Alternate
– Use one or more of the Gold Standard Collections that have a set of standard queries with truth sets (TextMOLE can convert them to XML format)
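For students who want to check the numbers by hand, precision and recall for a query can be computed from the retrieved documents and the truth set. This is a minimal sketch of the standard set-based definitions, not TextMOLE's implementation; the function names and the document and query identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def macro_average(runs, truth):
    """Average precision and recall over every query in the truth set."""
    pairs = [precision_recall(runs.get(q, []), truth[q]) for q in truth]
    if not pairs:
        return 0.0, 0.0
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n

if __name__ == "__main__":
    # Hypothetical results: query id -> documents returned by the system.
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
    # Hypothetical truth set: query id -> documents judged relevant.
    truth = {"q1": ["d1", "d3", "d7"], "q2": ["d5"]}
    print(macro_average(runs, truth))
```

In this example q1 retrieves two of its three relevant documents among four results (precision 0.5, recall 2/3), and q2 retrieves its one relevant document among two results (precision 0.5, recall 1.0).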
Artificial Intelligence Course
IR Assignment
– Instructor provides set of documents in XML format and set of standard queries (with or without result set)
– Instructor provides students with parameters to use (e.g., stemming, log-entropy weighting for both indexing and retrieval)
– Students try to find the ‘best’ stop word list for this collection
– Write brief paper describing experiments and results
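To make the log-entropy weighting mentioned in this assignment concrete, the sketch below implements the commonly cited formulation: a local weight of log(1 + tf) multiplied by a global weight of 1 + Σ p log p / log n, where p is the share of a term's occurrences that fall in a given document and n is the number of documents. It is an illustration under those assumptions, not TextMOLE's code, and TextMOLE may use a different logarithm base or variant.

```python
import math
from collections import Counter

def log_entropy_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    if n < 2:
        raise ValueError("log-entropy weighting needs at least two documents")
    counts = [Counter(doc) for doc in docs]
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)

    # Global weight: 1 + sum_j p_ij * log(p_ij) / log(n), with p_ij = tf_ij / gf_i.
    global_weight = {}
    for term, gf in global_freq.items():
        entropy_sum = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf:
                p = tf / gf
                entropy_sum += p * math.log(p)
        global_weight[term] = 1.0 + entropy_sum / math.log(n)

    # Local weight: log(1 + tf_ij), scaled by the term's global weight.
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]

if __name__ == "__main__":
    print(log_entropy_weights([["the", "cat"], ["the", "dog"]]))
```

A term spread evenly across all documents, such as "the" in this toy corpus, receives a global weight of zero, which is one reason the choice of stop word list interacts with the weighting scheme.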
Capstone Experiences in Computer Science
Migrate TextMOLE to another platform
– OpenGL
– Java
– Web based
– Relational Database
– Library Functions
Add additional parameters to basic Search and Retrieval
– N-grams instead of words
– Noun phrases (using a tool like flex)
– Clustering
– Latent Semantic Indexing
Add additional IR applications
– Emerging trend detection
– Classification
– First Story Detection
– Filtering
– Summarization
Research in Computer Science
– Develop your own weighting scheme
– Identify additional features for indexing
– Develop a new Gold Standard collection
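As a starting point for the "N-grams instead of words" project, the sketch below splits text into overlapping character n-grams that could be indexed in place of word tokens. It assumes character rather than word n-grams, and the function name and normalization choices (lowercasing, collapsing whitespace) are illustrative rather than part of TextMOLE.

```python
def char_ngrams(text, n=3):
    """Return overlapping character n-grams of text, lowercased with whitespace collapsed."""
    normalized = " ".join(text.lower().split())
    # Texts shorter than n characters yield an empty list.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]

if __name__ == "__main__":
    print(char_ngrams("Text  Mole", 3))
```

Because n-grams span word boundaries, the index terms include fragments such as "t m", which a capstone project would need to decide whether to keep or filter.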
Where?
Version 1.0 now available online!
http://webpages.ursinus.edu/akontostathis/TextMOLE
Contact [email protected] with questions and comments