Data Mining All Summary
TRANSCRIPT
-
7/27/2019 Data Mining All Summary
1/47
An Introduction to Data Mining
Prof. S. Sudarshan
CSE Dept, IIT Bombay
Most slides courtesy: Prof. Sunita Sarawagi
School of IT, IIT Bombay
-
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data mining helps extract such information.
-
Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events, identify fraudulent events
Manufacturing and production:
automatically adjust knobs when process parameters change
-
Applications (continued)
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationships between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for sub-clusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
-
Relationship with other fields
Overlaps with machine learning, statistics, artificial intelligence, databases, visualization
but more stress on scalability in the number of features and instances
stress on algorithms and architectures, whereas foundations of methods and formulations are provided by statistics and machine learning
automation for handling large, heterogeneous data
-
Some basic operations
Predictive:
Regression
Classification
Collaborative Filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
-
Classification
(Supervised learning)
-
Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
[Figure: previous customers, described by Age, Salary, Profession, Location and Customer type, are fed to a classifier that learns decision rules (e.g. Salary > 5 L and Prof. = Exec); the rules then label new applicants' data as good/bad.]
-
Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn)
Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
Nearest neighbour
Decision tree classifier: divide decision space into piecewise constant regions
Probabilistic/generative models
Neural networks: partition by non-linear boundaries
-
Decision trees
Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

[Figure: example tree with root test Salary < 1 M, inner tests Prof = teacher and Age < 30, and leaves labelled Good/Bad.]
-
Decision tree classifiers
Widely used learning method
Easy to interpret: can be re-represented as if-then-else rules
Approximates function by piecewise constant regions
Does not require any prior knowledge of the data distribution; works well on noisy data
Has been applied to:
classify medical patients based on the disease,
equipment malfunction by cause,
loan applicants by likelihood of payment.
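The piecewise-constant, rule-based view above can be sketched as a tiny hand-built tree. The attributes, thresholds, and the "Salary > 5 L and Prof. = Exec" rule are illustrative assumptions echoing the earlier loan example, not rules learned from real data:

```python
# Hand-built decision tree for the loan example.
# Attribute names and thresholds are illustrative assumptions.

def classify(applicant):
    """Walk simple one-attribute decision rules to a good/bad leaf."""
    if applicant["salary"] > 500_000:  # "Salary > 5 L" (5 lakh)
        return "good" if applicant["profession"] == "exec" else "bad"
    return "bad" if applicant["age"] < 30 else "good"

print(classify({"salary": 600_000, "profession": "exec", "age": 40}))     # good
print(classify({"salary": 200_000, "profession": "teacher", "age": 25}))  # bad
```

A learned tree would choose the split attributes and thresholds automatically (e.g. by an impurity measure); hard-coding them here just shows the if-then-else structure the slide describes.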
-
Neural network
Set of nodes connected by directed weighted edges

[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3 producing one output, next to a more typical multi-layer network with hidden nodes and output nodes.]

Basic unit: o(x) = 1 / (1 + e^(-y)), where y = sum over i = 1..n of wi * xi (the sigmoid of the weighted input sum)
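The basic unit's formula can be checked in a few lines of Python; this is a sketch of a single sigmoid unit only, not the multi-layer network:

```python
import math

def sigmoid_unit(x, w):
    """Basic NN unit: y = sum_i w_i * x_i, output o = 1 / (1 + e^(-y))."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-y))

print(sigmoid_unit([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))  # 0.5 (zero weights give y = 0)
print(sigmoid_unit([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]) > 0.99)  # True (large positive y)
```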
-
Neural networks
Useful for learning complex data like handwriting, speech and image recognition

Decision boundaries:
[Figure: boundaries learned by linear regression, a classification tree, and a neural network on the same data.]
-
Pros and Cons of Neural
Network
Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features
Conclusion: use neural nets only if decision trees / nearest-neighbour methods fail.
-
Bayesian learning
Assume a probability model on the generation of data.
Apply Bayes' theorem to find the most likely class.
Naive Bayes: assume attributes are conditionally independent given the class value
Easy to learn probabilities by counting
Useful in some domains, e.g. text
predicted class: c = argmax over cj of p(cj | d) = argmax over cj of p(d | cj) p(cj) / p(d)

Naive Bayes: c = argmax over cj of p(cj) * (product over i = 1..n of p(ai | cj)) / p(d)
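The "learn probabilities by counting" idea can be sketched as a minimal Naive Bayes classifier; the weather-style training data below is invented purely for illustration:

```python
from collections import Counter, defaultdict

# Toy training data: (attribute tuple, class label). Invented for illustration.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("rainy", "mild"), "yes"),
]

class_counts = Counter(c for _, c in train)
# attr_counts[class][attribute position][value] -> count
attr_counts = defaultdict(lambda: defaultdict(Counter))
for attrs, c in train:
    for i, a in enumerate(attrs):
        attr_counts[c][i][a] += 1

def predict(attrs):
    """argmax over c of p(c) * prod_i p(a_i | c), estimated by counting.
    (No smoothing: an unseen attribute value zeroes out a class.)"""
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(train)                 # p(c)
        for i, a in enumerate(attrs):
            score *= attr_counts[c][i][a] / n_c  # p(a_i | c)
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(("rainy", "mild")))  # yes
```

p(d) is omitted since it is the same for every class and does not change the argmax.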
-
Clustering or Unsupervised Learning
-
Clustering
Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment history, such that similar customers fall in the same cluster.
Key requirement: a good measure of similarity between instances.
Identify micro-markets and develop policies for each.
-
Distance functions
Numeric data: Euclidean, Manhattan distances
Categorical data: 0/1 to indicate presence/absence, followed by
Hamming distance (# of dissimilarities)
Jaccard coefficient: # of matching 1s / (# of positions with a 1)
Data-dependent measures: similarity of A and B depends on co-occurrence with C
Combined numeric and categorical data: weighted normalized distance
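The first few measures can be sketched directly, assuming numeric vectors for the first two and 0/1 presence vectors for the last two:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where two 0/1 vectors disagree (dissimilarity)."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Matching 1s divided by positions where either vector has a 1."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return both / either if either else 1.0

print(euclidean([0, 0], [3, 4]))      # 5.0
print(manhattan([0, 0], [3, 4]))      # 7
print(hamming([1, 0, 1], [1, 1, 0]))  # 2
print(jaccard([1, 0, 1], [1, 1, 0]))  # 0.333...
```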
-
Clustering methods
Hierarchical clustering
agglomerative vs. divisive
single link vs. complete link
Partitional clustering
distance-based: k-means
model-based: EM
density-based
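Distance-based k-means, the simplest partitional method listed above, can be sketched as follows (toy 1-D data; a real implementation would also test for convergence instead of running a fixed number of iterations):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain distance-based k-means: repeatedly assign each point to its
    nearest centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        for j, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if its cluster emptied
                centroids[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return centroids

# Toy 1-D data with two obvious groups around 1.0 and 9.0.
pts = [[1.0], [1.2], [0.8], [9.0], [9.5], [8.5]]
print(sorted(c[0] for c in kmeans(pts, 2)))  # two centroids, near 1.0 and 9.0
```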
-
Collaborative Filtering
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)
-
Collaborative recommendation

[Table: a ratings matrix with movies Rangeela, QSQT, 100 Days, Anand, Sholay, Deewar and Vertigo as columns and viewers Smita, Vijay, Mohan, Rajesh and Nina as rows; Nitin's row is all '?', to be predicted.]

Possible approaches:
Average vote along columns [Same prediction for all]
Weight vote based on similarity of likings [GroupLens]
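The similarity-weighted vote can be sketched as below. All ratings are invented for illustration (the slide's actual numbers did not survive), and the prediction follows the GroupLens pattern: the target's mean rating plus the correlation-weighted deviations of the other users' votes:

```python
# Similarity-weighted vote (GroupLens style). Ratings are invented.
ratings = {
    "Smita": {"Rangeela": 5, "QSQT": 4, "Sholay": 1},
    "Vijay": {"Rangeela": 1, "QSQT": 2, "Sholay": 5},
    "Mohan": {"Rangeela": 4, "QSQT": 5, "Sholay": 2},
}

def pearson(u, v):
    """Pearson correlation over the movies both users rated."""
    common = [m for m in u if m in v]
    if len(common) < 2:
        return 0.0
    mu = sum(u[m] for m in common) / len(common)
    mv = sum(v[m] for m in common) / len(common)
    num = sum((u[m] - mu) * (v[m] - mv) for m in common)
    den = (sum((u[m] - mu) ** 2 for m in common) *
           sum((v[m] - mv) ** 2 for m in common)) ** 0.5
    return num / den if den else 0.0

def predict(target, movie):
    """Target's mean rating plus the similarity-weighted average
    deviation of the other users' votes on this movie."""
    t_mean = sum(target.values()) / len(target)
    num = den = 0.0
    for r in ratings.values():
        if movie in r:
            w = pearson(target, r)
            num += w * (r[movie] - sum(r.values()) / len(r))
            den += abs(w)
    return t_mean + (num / den if den else 0.0)

nitin = {"Rangeela": 5, "QSQT": 5, "Sholay": 1}  # Nitin's known ratings
print(round(predict(nitin, "Rangeela"), 2))  # high: Nitin's tastes track Smita's
```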
-
Cluster-based approaches
External attributes of people and movies to cluster:
age, gender of people
actors and directors of movies
[may not be available]
Cluster people based on movie preferences:
misses information about similarity of movies
Repeated clustering:
cluster movies based on people, then people based on movies, and repeat
ad hoc, might smear out groups
-
Example of clustering
[Table: the same ratings matrix shown twice, before and after clustering; rows (viewers) and columns (movies) are reordered so that similar viewers and similar movies become adjacent. Nitin's row remains all '?'.]
-
Model-based approach
People and movies belong to unknown classes
Pk = probability a random person is in class k
Pl = probability a random movie is in class l
Pkl = probability of a class-k person liking a class-l movie
Gibbs sampling: iterate
pick a person or movie at random and assign it to a class with probability proportional to Pk or Pl
estimate new parameters
Need a statistics background to understand the details
-
Association Rules
-
Association rules
Given a set T of groups of items
Example: sets of items purchased
Goal: find all rules on item sets of the form a --> b such that
support of a and b > user threshold s
conditional probability (confidence) of b given a > user threshold c
Example: Milk --> bread
Purchase of product A --> service B

Example set T:
{milk, cereal}
{tea, milk}
{tea, rice, bread}
{cereal}
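Support and confidence over the example set T can be computed directly:

```python
# Example transaction set T from the slide.
T = [
    {"milk", "cereal"},
    {"tea", "milk"},
    {"tea", "rice", "bread"},
    {"cereal"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in T) / len(T)

def confidence(a, b):
    """Confidence of the rule a --> b: support(a | b) / support(a)."""
    return support(a | b) / support(a)

print(support({"milk"}))                 # 0.5  (2 of 4 transactions)
print(confidence({"milk"}, {"cereal"}))  # 0.5  (1 of the 2 milk transactions)
```

So milk --> cereal passes thresholds s = 0.2 and c = 0.4, for example, but not c = 0.6.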
-
Prevalent ≠ Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining's payoff is in finding surprising phenomena

[Cartoon: in 1995, "Milk and cereal sell together!" is a surprise; by 1998 the same rule gets a "Zzzz..."]
-
Applications of fast
itemset counting
Find correlated events:
Applications in medicine: find redundant tests
Cross-selling in retail, banking
Improve predictive capability of classifiers that assume attribute independence
New similarity measures of categorical attributes [Mannila et al, KDD 98]
-
Application Areas
Industry                Application
Finance                 Credit card analysis
Insurance               Claims, fraud analysis
Telecommunication       Call record analysis
Transport               Logistics management
Consumer goods          Promotion analysis
Data service providers  Value-added data
Utilities               Power usage analysis
-
Why Now?
Data is being produced
Data is being warehoused
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
-
Data Mining works with
Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
-
Usage scenarios
Data warehouse mining:
assimilate data from operational sources
mine static data
Mining log data
Continuous mining: example in process control
Stages in mining:
data selection --> pre-processing (cleaning) --> transformation --> mining --> result evaluation --> visualization
-
7/27/2019 Data Mining All Summary
42/47
V ti l i t ti
-
7/27/2019 Data Mining All Summary
43/47
Vertical integration:
Mining on the web
Web log analysis for site design:
what are popular pages,
what links are hard to find.
Electronic store sales enhancements:
recommendations, advertisement:
Collaborative filtering: Net Perception, WiseWire
Inventory control: what was a shopper looking for and could not find.
-
OLAP Mining integration
OLAP (On-Line Analytical Processing):
Fast interactive exploration of multidimensional aggregates.
Heavy reliance on manual operations for analysis:
tedious and error-prone on large multidimensional data
Ideal platform for vertical integration of mining, but needs to be interactive instead of batch.
-
Data Mining in Use
The US Government uses data mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Target Marketing
Holding on to Good Customers
Weeding out Bad Customers
-
Some success stories
Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
Won over the (manual) knowledge engineering approach
http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process
Major US bank: customer attrition prediction
First, segment customers based on financial behavior: found 3 segments
Build attrition models for each of the 3 segments
40-50% of attritions were predicted == a factor of 18 increase
Targeted credit marketing: major US banks
find customer segments based on 13 months of credit balances
build another response model based on surveys