TRANSCRIPT
ISQS 6347, Data & Text Mining
Lecture Notes 1: Introduction to Data Mining
Zhangxi Lin
ISQS 6347
Texas Tech University
What is Data Mining?
Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns. (Berry and Linoff, 1997, 2000)
– Data Mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. (Gartner Group, 2004)
Data Mining Process
What is Text Mining?
Discover useful and previously unknown “gems” of information in large text collections
Motivation for Text Mining
Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
Information-intensive business processes demand that we move from simple document retrieval to “knowledge” discovery.
[Pie chart: Unstructured or Semi-structured Information 90%; Structured Numerical or Coded Information 10%]
Text Mining Process
Text Preprocessing
– Syntactic/Semantic Text Analysis
Features Generation
– Bag of Words
Feature Selection
– Simple Counting
– Statistics
Text/Data Mining
– Classification: Supervised Learning
– Clustering: Unsupervised Learning
Analyzing Results
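The preprocessing, feature generation, and feature selection steps can be sketched in a few lines of Python. This is only a minimal illustration: the three documents are made up, the tokenizer is a crude stand-in for real syntactic/semantic analysis, and the function names and the min_doc_freq threshold are assumptions, not part of the lecture.

```python
import re
from collections import Counter

# Hypothetical toy corpus (not from the lecture).
documents = [
    "The customer hates the statement format",
    "The customer likes the new statement",
    "Daily balance is missing from the statement",
]

def tokenize(text):
    """Crude text preprocessing: lowercase and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Feature generation: map each word to its count in one document."""
    return Counter(tokenize(text))

def select_features(docs, min_doc_freq=2):
    """Feature selection by simple counting: keep words that appear in
    at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return sorted(w for w, n in doc_freq.items() if n >= min_doc_freq)

vocabulary = select_features(documents)
# One count vector per document, restricted to the selected vocabulary.
vectors = [[bag_of_words(d)[w] for w in vocabulary] for d in documents]
print(vocabulary)
print(vectors)
```

The resulting vectors are what a classifier or clustering method would consume in the Text/Data Mining step.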
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
How much is 12 Exabytes?
1,200,000 Libraries of Congress
– 55% in personal PCs
– 16% in corporate data warehouses
– Internet only 21 TB
– Email 500x more than Internet / year
Emerging data sources
– Medical images: potential 1 EB/year
– Video monitors: potential 100 EB/year
Sources:
• http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
• The Expanding Digital Universe, IDC white paper, March 2007
Data Mining Tasks
Prediction Methods. Use some variables to predict unknown or future values of other variables.
– Classification
– Regression
– Deviation Detection
Description Methods. Find human-interpretable patterns that describe the data.
– Clustering
– Association Rule Discovery
– Sequential Pattern Discovery
Classification: Definition
Given a collection of records (training set)
– Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
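The train/test procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the records are synthetic, the 70/30 split is an arbitrary choice, and the "model" is a majority-class placeholder standing in for whatever classifier is actually learned.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_class(training_set):
    """Placeholder model: always predict the most common training class."""
    labels = [label for _, label in training_set]
    majority = Counter(labels).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of held-out records whose class the model predicts correctly."""
    correct = sum(1 for attrs, label in test_set if model(attrs) == label)
    return correct / len(test_set)

# Synthetic (attributes, class) records; "x" is a hypothetical attribute.
records = [({"x": i}, "Yes" if i % 4 == 0 else "No") for i in range(20)]
train, test = train_test_split(records)
model = learn_majority_class(train)
print(len(train), len(test), accuracy(model, test))
```

The key point is that accuracy is measured only on records the model never saw while it was being built.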
Classification Example
Training Set (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]
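To make the example concrete, the sketch below encodes the two tables and applies a small hand-written set of decision rules. The rules, including the 80K income threshold, are an assumption chosen for illustration because they happen to fit the ten training records; the slide itself does not specify the learned model.

```python
# Training records from the table: (refund, marital_status, income_k, cheat)
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the table; the class is unknown.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def classify(refund, marital_status, income_k):
    """Illustrative hand-written rules (an assumption, not the slide's model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"

# Fraction of training records the rules reproduce.
training_accuracy = sum(
    classify(r, m, i) == c for r, m, i, c in training_set
) / len(training_set)
# Predicted class for each previously unseen record.
predictions = [classify(*row) for row in test_set]
print(training_accuracy, predictions)
```

Fitting the training set perfectly says nothing about accuracy on unseen records, which is why a labeled test set is normally held out.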
Classification: Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:
  • Use the data for a similar product introduced before.
  • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
  • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    – Type of business, where they stay, how much they earn, etc.
  • Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
  • Use credit card transactions and the information on the account-holder as attributes.
    – When does a customer buy, what does he buy, how often does he pay on time, etc.
  • Label past transactions as fraud or fair transactions. This forms the class attribute.
  • Learn a model for the class of the transactions.
  • Use this model to detect fraud by observing credit card transactions on an account.
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
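One common way to find such clusters with Euclidean distance is k-means, sketched below. The lecture does not name a specific algorithm, so treat k-means, the six 3-D points, and the choice of starting centroids as assumptions made for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical 3-D data with two visibly separate groups.
points = [(1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 1.0, 1.0),
          (8.0, 8.0, 9.0), (9.0, 8.0, 8.0), (8.0, 9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Distances within each resulting cluster are small and distances between the two clusters are large, which is the property the definition asks for.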
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
[Figure: Intracluster distances are minimized; intercluster distances are maximized]
Clustering Example
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
  • Collect different attributes of customers based on their geographical and lifestyle related information.
  • Find clusters of similar customers.
  • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection;
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
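How strongly the five transactions back each rule can be checked with the standard support and confidence measures. The slide does not define these measures, so they are brought in here as the usual way to score a rule; the function names are assumptions.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs, rhs),
          "confidence", round(confidence(lhs, rhs), 3))
```

{Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 that contain both Diaper and Milk.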
Association Rule Discovery: Example
Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
Regression
Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics, neural network fields.
Examples:
– Predicting sales amounts of new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
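For the linear case with one input variable, the dependency can be fitted by ordinary least squares, as in the sketch below. The advertising and sales figures are invented for illustration, and fit_line is a hypothetical helper, not something from the course materials.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: advertising expenditure vs. sales amount (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
# Predict sales for an expenditure level not present in the data.
predicted_sales = intercept + slope * 6.0
print(round(slope, 3), round(intercept, 3), round(predicted_sales, 3))
```

Nonlinear dependencies or several input variables need a richer model, but the idea of fitting parameters to observed pairs and then predicting new values stays the same.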
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
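A very simple way to flag a "significant deviation" is to measure how many standard deviations a value lies from the mean (a z-score). This is one basic technique chosen for illustration, not the method of any particular fraud or intrusion system; the spending figures and the threshold of 2.0 are assumptions.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` population
    standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]

# Hypothetical daily card spending; one day is far from normal behavior.
daily_spend = [42, 38, 51, 45, 40, 47, 39, 44, 950, 43]
deviations = find_deviations(daily_spend)
print(deviations)
```

With data at the scale of millions of connections per day, the same idea is usually applied to summary statistics or streaming estimates rather than to a list held in memory.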
Text Mining Tasks
Exploratory Data Analysis
– Using text to form hypotheses about diseases (Swanson and Smalheiser, 1997).
Information Extraction
– (Semi)automatically create (domain specific) knowledge bases, and then use standard data-mining techniques.
  • Bootstrapping methods (Riloff and Jones, 1999).
Text Classification
– Useful intermediary step for information extraction
  • Bootstrapping method using EM (Nigam et al., 2000).
Example: Decision Support using Bank Call Center Data
The Needs:
– Analysis of call records as input into decision-making process of Bank’s management
– Quick answers to important questions
  • Which offices receive the most angry calls?
  • What products have the fewest satisfied customers?
  (“Angry” and “Satisfied” are recognizable sentiments)
– User friendly interface and visualization tools
Example: Decision Support using Bank Call Center Data
The Information Source:
– Call center records
– Example:
  AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, “Mr. Stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the required balance on the account.”
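A record like this mixes structured codes with free text, so a first step is to separate the two and then tag the comment with a sentiment. The sketch below is only an illustration: the curly quotes are replaced with straight quotes so the csv module can parse the line, the meaning of the individual code fields is not given in the lecture and is left uninterpreted, and the two keyword lists are a hypothetical stand-in for a real sentiment model.

```python
import csv
import re

# The example record, with straight quotes around the free-text comment.
record = (
    'AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, '
    '"Mr. Stark has been with the company for about 20 yrs. He hates his '
    'stmt format and wishes that we would show a daily balance to help him '
    'know when he falls below the required balance on the account."'
)

# skipinitialspace lets the parser accept the space after each comma.
fields = next(csv.reader([record], skipinitialspace=True))
*codes, comment = fields  # structured codes, then the free-text comment

# Hypothetical keyword lists; a real system would use a trained model.
ANGRY_WORDS = {"hates", "angry", "furious", "unacceptable"}
SATISFIED_WORDS = {"thanks", "happy", "satisfied", "pleased"}

def sentiment(text):
    """Tag a comment as angry, satisfied, or neutral by keyword lookup."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & ANGRY_WORDS:
        return "angry"
    if words & SATISFIED_WORDS:
        return "satisfied"
    return "neutral"

print(codes)
print(sentiment(comment))
```

Once each call carries a sentiment tag next to its structured fields, questions such as "which offices receive the most angry calls?" reduce to ordinary grouping and counting.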
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Challenges of Text Mining
Very high number of possible “dimensions”
– All possible word and phrase types in the language!!
Unlike data mining:
– records (= docs) are not structurally identical
– records are not statistically independent
Complex and subtle relationships between concepts in text
– “AOL merges with Time-Warner”
– “Time-Warner is bought by AOL”
Ambiguity and context sensitivity
– automobile = car = vehicle = Toyota
– Apple (the company) or apple (the fruit)
SAS Training/Self-taught Courses
Getting Started with SAS® Enterprise Miner 4.3, 132p (EM_GS_7281.PDF)
Getting Started with SAS® 9.1 Text Miner, 60p (EM_TMGS_7693.PDF)
Data Mining - A Case Study Approach, 135p
Text Mining Using SAS® Software, 274p (DMTM.PDF)
Applying Data Mining Techniques Using Enterprise Miner, 308p (ADMT_001.PDF)
Effective Web Mining: Attracting and Keeping Valued Cyber Consumers, 632p (CCWEB_TKIT.PDF)