Data Mining All Summary

Upload: vijayakumar2k93983

Post on 02-Apr-2018


  • 7/27/2019 Data Mining All Summary

    1/47

    An Introduction to Data Mining

    Prof. S. Sudarshan

    CSE Dept, IIT Bombay

    Most slides courtesy: Prof. Sunita Sarawagi

    School of IT, IIT Bombay


    Why Data Mining

    Credit ratings/targeted marketing:

    Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

    Identify likely responders to sales promotions

    Fraud detection:

    Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

    Customer relationship management:

    Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

    Data mining helps extract such information


    Applications

    Banking: loan/credit card approval

    predict good customers based on old customers

    Customer relationship management:

    identify those who are likely to leave for a competitor

    Targeted marketing:

    identify likely responders to promotions

    Fraud detection: telecommunications, financial transactions

    from an online stream of events, identify fraudulent events

    Manufacturing and production:

    automatically adjust knobs when process parameters change


    Applications (continued)

    Medicine: disease outcome, effectiveness of treatments

    analyze patient disease history: find relationships between diseases

    Molecular/pharmaceutical: identify new drugs

    Scientific data analysis:

    identify new galaxies by searching for sub-clusters

    Web site/store design and promotion:

    find affinity of visitors to pages and modify layout


    Relationship with other fields

    Overlaps with machine learning, statistics, artificial intelligence, databases, visualization

    but with more stress on:

    scalability in the number of features and instances

    algorithms and architectures, whereas the foundations of the methods and formulations are provided by statistics and machine learning

    automation for handling large, heterogeneous data


    Some basic operations

    Predictive:

    Regression

    Classification

    Collaborative filtering

    Descriptive:

    Clustering / similarity matching

    Association rules and variants

    Deviation detection


    Classification

    (Supervised learning)


    Classification

    Given old data about customers and payments, predict a new applicant's loan eligibility.

    [Figure: previous customers, described by age, salary, profession, location, and customer type, are fed to a classifier that learns decision rules (e.g. "Salary > 5 L", "Prof. = Exec"); new applicants' data is then classified as good/bad.]


    Classification methods

    Goal: predict class Ci = f(x1, x2, ..., xn)

    Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci

    Nearest neighbour

    Decision tree classifier: divide decision space into piecewise-constant regions

    Probabilistic/generative models

    Neural networks: partition by non-linear boundaries


    Decision trees

    A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

    [Figure: example tree with the tests "Salary < 1 M", "Prof = teacher", and "Age < 30" at internal nodes, and leaves labeled Good/Bad.]
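Since a decision tree is just nested if-then-else rules, the example tree can be sketched directly in code. This is a hypothetical illustration: the tests mirror the slide, but the Good/Bad label at each leaf is an assumption, since the original figure is not fully recoverable.

```python
# Decision tree as if-then-else rules (a sketch; leaf labels are assumed).
def classify(salary, profession, age):
    """Classify a loan applicant as 'Good' or 'Bad'."""
    if salary < 1_000_000:            # test: Salary < 1 M
        if profession == "teacher":   # test: Prof = teacher
            return "Bad"
        if age < 30:                  # test: Age < 30
            return "Bad"
        return "Good"
    return "Good"

print(classify(salary=500_000, profession="teacher", age=40))   # Bad
print(classify(salary=2_000_000, profession="exec", age=25))    # Good
```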


    Decision tree classifiers

    Widely used learning method

    Easy to interpret: can be re-represented as if-then-else rules

    Approximates functions by piecewise-constant regions

    Does not require any prior knowledge of the data distribution; works well on noisy data

    Has been applied to classify:

    medical patients based on the disease,

    equipment malfunctions by cause,

    loan applicants by likelihood of payment.


    Neural network

    Set of nodes connected by directed weighted edges

    [Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3 producing output o, alongside a more typical multi-layer network with hidden nodes and output nodes.]

    Basic unit: o = 1 / (1 + e^(-y)), where y = sum_{i=1..n} w_i * x_i
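The basic unit's formula takes only a few lines of code. A minimal sketch (the example inputs and weights are made up):

```python
import math

def unit_output(xs, ws):
    """Output of a basic NN unit: sigmoid of the weighted input sum."""
    y = sum(w * x for w, x in zip(ws, xs))   # y = sum_i w_i * x_i
    return 1.0 / (1.0 + math.exp(-y))        # o = 1 / (1 + e^(-y))

print(unit_output([1.0, 0.5, -1.0], [0.4, 0.2, 0.1]))  # a value in (0, 1)
```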


    Neural networks

    Useful for learning complex data like handwriting, speech, and image recognition

    [Figure: decision boundaries learned by linear regression, a classification tree, and a neural network.]


    Pros and Cons of Neural Networks

    Cons:

    - Slow training time

    - Hard to interpret

    - Hard to implement: trial and error for choosing the number of nodes

    Pros:

    + Can learn more complicated class boundaries

    + Fast application

    + Can handle a large number of features

    Conclusion: use neural nets only if decision trees/NN fail.


    Bayesian learning

    Assume a probability model on the generation of the data.

    Apply Bayes' theorem to find the most likely class:

    predicted class: c = argmax_{c_j} p(c_j | d) = argmax_{c_j} p(d | c_j) p(c_j) / p(d)

    Naive Bayes: assume attributes are conditionally independent given the class value, so

    c = argmax_{c_j} p(c_j) * prod_{i=1..n} p(a_i | c_j) / p(d)

    Easy to learn the probabilities by counting

    Useful in some domains, e.g. text
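"Learn probabilities by counting" can be made concrete with a tiny sketch. The toy weather-style data below is invented for illustration, and the constant p(d) denominator is dropped since it does not change the argmax:

```python
from collections import Counter, defaultdict

# Naive Bayes by counting (a sketch on made-up (attributes, class) rows).
def train(rows):
    class_counts = Counter(c for _, c in rows)
    attr_counts = defaultdict(Counter)   # (position, class) -> value counts
    for attrs, c in rows:
        for i, a in enumerate(attrs):
            attr_counts[(i, c)][a] += 1
    return class_counts, attr_counts

def predict(attrs, class_counts, attr_counts):
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n                       # p(c_j)
        for i, a in enumerate(attrs):    # prod_i p(a_i | c_j)
            p *= attr_counts[(i, c)][a] / cc
        if p > best_p:
            best, best_p = c, p
    return best

rows = [(("sunny", "hot"), "no"), (("rainy", "cool"), "yes"),
        (("sunny", "cool"), "yes"), (("sunny", "hot"), "no")]
model = train(rows)
print(predict(("sunny", "hot"), *model))  # no
```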


    Clustering or Unsupervised Learning


    Clustering

    Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product.

    Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.

    Key requirement: a good measure of similarity between instances.

    Identify micro-markets and develop policies for each.


    Distance functions

    Numeric data: Euclidean, Manhattan distances

    Categorical data: 0/1 to indicate presence/absence, followed by

    Hamming distance (# of dissimilarities)

    Jaccard coefficient: # of shared 1s / # of positions with a 1

    Data-dependent measures: similarity of A and B depends on co-occurrence with C

    Combined numeric and categorical data:

    weighted normalized distance:
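The measures named above, sketched on toy vectors (Euclidean/Manhattan for numeric data, Hamming and Jaccard for 0/1 categorical data):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where the 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(x and y for x, y in zip(a, b))
    any1 = sum(x or y for x, y in zip(a, b))
    return both / any1 if any1 else 0.0

print(euclidean([0, 0], [3, 4]))       # 5.0
print(manhattan([0, 0], [3, 4]))       # 7
print(hamming([1, 0, 1], [1, 1, 0]))   # 2
print(jaccard([1, 0, 1], [1, 1, 0]))   # 0.333...
```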


    Clustering methods

    Hierarchical clustering

    agglomerative vs. divisive

    single link vs. complete link

    Partitional clustering

    distance-based: K-means

    model-based: EM

    density-based
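A minimal distance-based K-means sketch on made-up 1-D points. The naive "first k points" initialization is an assumption for brevity; a real implementation would pick starting centers more carefully:

```python
# K-means: alternate nearest-center assignment and center recomputation.
def kmeans(points, k, iters=20):
    centers = points[:k]                  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                  # assign each point to nearest center
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], k=2)
print(sorted(round(c, 2) for c in centers))  # [1.0, 9.5]
```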


    Collaborative Filtering

    Given a database of user preferences, predict the preferences of a new user

    Example: predict what new movies you will like based on

    your past preferences

    others with similar past preferences

    their preferences for the new movies

    Example: predict what books/CDs a person may want to buy

    (and suggest them, or give discounts to tempt the customer)


    Collaborative recommendation

    [Table: user-movie preference matrix with users Smita, Vijay, Mohan, Rajesh, Nina, and Nitin rating the movies Rangeela, QSQT, 100 Days, Anand, Sholay, Deewar, and Vertigo; Nitin's ratings are all unknown (?).]

    Possible approaches:

    Average vote along columns [same prediction for all]

    Weight vote based on similarity of likings [GroupLens]
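The weighted-vote (GroupLens-style) idea can be sketched as follows. The ratings matrix is made up, None stands in for the "?" cells, and the agreement-based similarity is a simple stand-in for the correlation measure GroupLens actually uses:

```python
# Weighted-vote collaborative filtering sketch: predict a rating as a
# similarity-weighted average of other users' ratings for the same item.
def similarity(u, v):
    """Agreement on co-rated items, in [0, 1] (ratings on a 1..5 scale)."""
    common = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not common:
        return 0.0
    return sum(1.0 - abs(a - b) / 4.0 for a, b in common) / len(common)

def predict(target, others, item):
    num = den = 0.0
    for v in others:
        if v[item] is None:
            continue
        w = similarity(target, v)
        num += w * v[item]
        den += w
    return num / den if den else None

# made-up ratings for 4 movies; the target user has not rated movie 3
smita = [5, 4, None, 2]
vijay = [4, 5, 2, 1]
nitin = [5, 4, 1, None]
print(predict(nitin, [smita, vijay], item=3))  # ≈ 1.57
```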


    Cluster-based approaches

    External attributes of people and movies to cluster:

    age, gender of people

    actors and directors of movies

    [may not be available]

    Cluster people based on movie preferences:

    misses information about the similarity of movies

    Repeated clustering:

    cluster movies based on people, then people based on movies, and repeat

    ad hoc, might smear out groups


    Example of clustering

    [Figure: the user-movie matrix before and after clustering; reordering the rows (users) and columns (movies) groups similar users and similar movies into blocks, with Nitin's ratings still unknown (?).]


    Model-based approach

    People and movies belong to unknown classes

    P_k = probability a random person is in class k

    P_l = probability a random movie is in class l

    P_kl = probability of a class-k person liking a class-l movie

    Gibbs sampling: iterate

    pick a person or movie at random and assign it to a class with probability proportional to P_k or P_l

    estimate new parameters

    Need a statistics background to understand the details


    Association Rules


    Association rules

    Given a set T of groups of items

    Example: T is a set of itemsets purchased, e.g.

    {milk, cereal}

    {tea, milk}

    {tea, rice, bread}

    {cereal}

    Goal: find all rules on itemsets of the form a --> b such that

    support of a and b > user threshold s

    conditional probability (confidence) of b given a > user threshold c

    Example: milk --> bread

    Purchase of product A --> service B
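Support and confidence for the transaction set T above come down to simple counting. A minimal sketch:

```python
# The slide's transaction set T, as Python sets.
T = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in T) / len(T)

def confidence(a, b):
    """Conditional probability of b given a: support(a ∪ b) / support(a)."""
    return support(a | b) / support(a)

print(support({"milk"}))                 # 0.5  (2 of 4 transactions)
print(confidence({"milk"}, {"cereal"}))  # 0.5  (1 of the 2 milk transactions)
```

A rule milk --> cereal would then be reported only if 0.5 exceeds both the support threshold s and the confidence threshold c.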


    Prevalent vs. Interesting

    Analysts already know about prevalent rules

    Interesting rules are those that deviate from prior expectation

    Mining's payoff is in finding surprising phenomena

    [Cartoon: in 1995, "Milk and cereal sell together!" is a surprise; by 1998 the same finding draws "Zzzz..."]


    Applications of fast

    itemset counting

    Find correlated events:

    applications in medicine: find redundant tests

    cross-selling in retail, banking

    Improve the predictive capability of classifiers that assume attribute independence

    New similarity measures for categorical attributes [Mannila et al, KDD 98]


    Application Areas

    Industry                Application

    Finance                 Credit card analysis

    Insurance               Claims, fraud analysis

    Telecommunication       Call record analysis

    Transport               Logistics management

    Consumer goods          Promotion analysis

    Data service providers  Value-added data

    Utilities               Power usage analysis


    Why Now?

    Data is being produced

    Data is being warehoused

    The computing power is available

    The computing power is affordable

    The competitive pressures are strong

    Commercial products are available


    Data Mining works with Warehouse Data

    Data warehousing provides the enterprise with a memory

    Data mining provides the enterprise with intelligence


    Usage scenarios

    Data warehouse mining:

    assimilate data from operational sources

    mine static data

    Mining log data

    Continuous mining: example in process control

    Stages in mining:

    data selection -> pre-processing (cleaning) -> transformation -> mining -> result evaluation -> visualization


    Vertical integration: mining on the web

    Web log analysis for site design:

    what are the popular pages,

    which links are hard to find

    Electronic store sales enhancements:

    recommendations, advertisement:

    collaborative filtering: Net perception, Wisewire

    inventory control: what was a shopper looking for and could not find


    OLAP Mining integration

    OLAP (On-Line Analytical Processing):

    fast interactive exploration of multidimensional aggregates

    Heavy reliance on manual operations for analysis:

    tedious and error-prone on large multidimensional data

    Ideal platform for vertical integration of mining, but needs to be interactive instead of batch


    Data Mining in Use

    The US Government uses Data Mining to track fraud

    A Supermarket becomes an information broker

    Basketball teams use it to track game strategy

    Cross Selling

    Target Marketing

    Holding on to Good Customers

    Weeding out Bad Customers


    Some success stories

    Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data

    won over a (manual) knowledge-engineering approach

    http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good detailed description of the entire process

    Major US bank: customer attrition prediction

    first segment customers based on financial behavior: found 3 segments

    build attrition models for each of the 3 segments

    40-50% of attritions were predicted: a factor-of-18 increase

    Targeted credit marketing: major US banks

    find customer segments based on 13 months of credit balances

    build another response model based on surveys