text mining patrick cash outline
TRANSCRIPT
![Page 1: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/1.jpg)
Text Mining
Patrick Cash
![Page 2: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/2.jpg)
Outline• Introduction• Data Mining• Text Mining
– Text Mining Process• Text Mining Applications • Challenges in Text Mining• Conclusion
![Page 3: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/3.jpg)
Introduction• Why Text Mining?
– Massive amount of new information being created• World’s data doubles every 18 months (Jacques Vallee
Ph.D)– 80-90% of all data is held in various unstructured
formats– Useful information can be derived from this
unstructured data
![Page 4: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/4.jpg)
Introduction• Intelligence in text mining is based on NLP
techniques• Can be used as a preprocessing technique to
harvest data and get an initial understanding of the patterns that exist in the data
• Often seen as a special case of data mining but there is an important difference
![Page 5: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/5.jpg)
Data Mining• Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) information (or patterns) from data
• Data Mining: a misnomer?– Knowledge discovery, knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
![Page 6: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/6.jpg)
Data Mining• Descriptive: understanding underlying processes
or behavior – Patterns and trends– Clustering
• Predictive: predict an unseen or unmeasured value – Future projections and missing values– Classification
![Page 7: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/7.jpg)
SEMMA• Search
– Input data source, data sampling, partitioning• Explore
– Patterns, trends, outliers, visualization• Modify
– Clustering, feature reduction• Model
– Regression, tree, network• Assess
– Report, pass to next step in analysis
![Page 8: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/8.jpg)
Search vs. Discover
Data Mining
Text Mining
DataRetrieval
InformationRetrieval
Search(goal-oriented)
Discover(opportunistic)
StructuredData
UnstructuredData (Text)
![Page 9: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/9.jpg)
Text Mining• Many different by similar definitions• Text Mining = Statistical NLP + Data Mining• Text Mining is a process that employs
– Statistical NLP: a set of algorithms for converting unstructured text into structured data objects
– Data Mining: the quantitative methods that analyze these data objects to discover knowledge
![Page 10: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/10.jpg)
Text Mining• Descriptive
– Pattern and trend analysis– Knowledge base creation– Summarization– Visualization
• Predictive– Classification – Question answering– Pattern and trend forecasting
![Page 11: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/11.jpg)
Text Mining Techniques• Information Retrieval
– Indexing and retrieval of textual documents• Information Extraction
– Extraction of partial knowledge in the text• Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building)
• Clustering– Generating collections of similar text documents
![Page 12: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/12.jpg)
Text Mining Process
![Page 13: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/13.jpg)
Text Mining Process• Text Preprocessing
– Syntactic/Semantic text analysis • Features Generation
– Bag of words • Features Selection
– Simple counting– Statistics
• Text/Data Mining– Classification (Supervised) / Clustering (Unsupervised)
• Analyzing results
![Page 14: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/14.jpg)
Text Mining Process• Text preprocessing
– Part Of Speech (POS) tagging• Find the corresponding POS for each word.
– Word sense disambiguation• Context based or proximity based
– Parsing• Generates a parse tree (graph) for each sentence• Each sentence is a stand alone graph
![Page 15: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/15.jpg)
Text Mining Process• Feature Generation
– Text document is represented by the words it contains (and their occurrences)
• Order of words is not that important for certain applications (Bag of words)
– Stemming: identifies a word by its root• Reduce dimensionality
– Stop words: The common words unlikely to help text mining
![Page 16: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/16.jpg)
Text Mining Process• Feature Selection
– Reduce dimensionality• Learners have difficulty addressing tasks with high
dimensionality• Only interested in the information relevant to what is being
analyzed – Irrelevant features
• Not all features help
![Page 17: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/17.jpg)
Text Mining Process• Text Mining: Classification definition
– Given: a collection of labeled records (training set)• Each record contains a set of features (attributes), and
the true class (label)– Find: a model for the class as a function of the
values of the features– Goal: previously unseen records should be assigned
a class as accurately as possible
![Page 18: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/18.jpg)
Text Mining Process• Text Mining: Clustering definition
– Given: a set of documents and a similarity measure among documents
– Find: clusters such that:• Documents in one cluster are more similar to one another• Documents in separate clusters are less similar to one
another– Goal:
• Finding a correct set of documents clusters
![Page 19: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/19.jpg)
Text Mining Process• Supervised learning (classification)
– The training data is labeled indicating the class– New data is classified based on the training set– Correct classification: The known label of test sample is
identical with the class result from the classification model• Unsupervised learning (clustering)
– The class labels of training data are unknown– Establish the existence of classes or clusters in the data– Good clustering method: high intra-cluster similarity
![Page 20: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/20.jpg)
Text Mining Process• Analyzing the results
– Are the results satisfactory?– Does more mining need to be done?– Does a different technique need to be used?– Does another iteration of one or more steps in the
process need to be done?
![Page 21: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/21.jpg)
Text Mining Applications• Bioinformatics
– Genomics research (DNA sequencing)• Medical
– Mining medical records to improve care• Business intelligence
– Risk analysis• Research
– Analyzing research publications• Basically anywhere there is large amount of
unstructured text data
![Page 22: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/22.jpg)
Text Mining Application• Classification (Categorization)
– Spam detection, Document organization• Clustering
– Trend analysis, Topic identification• Web Mining
– Trend analysis, Opinion mining, Ontology creation• Classical NLP
– Text summarization, Question answering, Information extraction
![Page 23: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/23.jpg)
Text Mining Application• Smaller scale applications
– Relationship Analysis• If A is related to B, and B is related to C, there is
potentially a relationship between A and C.– Trend analysis
• Occurrences of A peak in October.– Mixed applications
• Co-occurrence of A together with B peak in November. (Shopping Cart Analysis)
![Page 24: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/24.jpg)
Challenges in Text Mining• Remember
– Text Mining = Statistical NLP + Data Mining• Text mining suffers from the same challenges as
Statistical NLP and Data Mining• Add in the additional difficulties associated with
the data not being structured
![Page 25: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/25.jpg)
Challenges in Text Mining• Statistical NLP
– Ambiguity– Context– Tokenization \ Sentence Detection– Stemming– POS Tagging– Coreference Resolution
![Page 26: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/26.jpg)
Challenges in Text Mining• Data Mining
– Data preprocessing• Ability to process the data• Massive amounts of data• Determining and extracting information of interest
– Availability of NLP tools to work with data mining– Discovery process
• No training data available
![Page 27: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/27.jpg)
Conclusion• Text Mining = Statistical NLP + Data Mining
– Culmination of all the NLP techniques covered in this course
• Growing research area that will be important as information growth (and need to extract knowledge from that information) increases
![Page 28: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/28.jpg)
References• Even-Zohar, Y. Introduction to Text Mining.
Supercomputing, 2002. http://alg.ncsa.uiuc.edu/do/documents/presentations
• Treloar, N AvaQuest Inc. www.knowledgetechnologies.net/proceedings/presentations/treloar/nathantreloar.ppt
• Witte, R. Faculty of Informatics Institute for Program Structures and Data Organization (IPD) http://www.edbt2006.de/edbt-share/IntroductionToTextMining.pdf
![Page 29: Text Mining Patrick Cash Outline](https://reader035.vdocuments.mx/reader035/viewer/2022062703/55506612b4c90574428b55a7/html5/thumbnails/29.jpg)
Questions ?