Knowledge Discovery and Data Mining (KDD)

Upload: jhm1487

Post on 14-Feb-2018

Category:

Documents


TRANSCRIPT

  • 7/23/2019 knowledge discovery and data mining(kdd)

    1/52

    Knowledge Discovery and Data Mining (KDD)

    Knowledge Discovery & Data Mining

    The process of extracting previously unknown, valid, and actionable (understandable) information from large databases

    Data mining is a step in the KDD process: applying data analysis and discovery algorithms

    Machine learning, pattern recognition, statistics, databases, data visualization

    Traditional techniques may be inadequate
      - large data

    Why Mine Data?

    Huge amounts of data being collected and warehoused
      - Walmart records 20 million transactions per day
      - health care transactions: multi-gigabyte databases
      - Mobil Oil: geological data of over 100 terabytes

    Affordable computing

    Competitive pressure
      - gain an edge by providing improved, customized services
      - information as a product in its own right

    Knowledge Discovery Process
      - Data mining: the core of the knowledge discovery process.

    [Figure: process diagram with stages labeled Databases, Data Cleaning, Data Integration, Preprocessed Data, Selection, Task-relevant Data, Data Transformations, Data Mining, Knowledge Interpretation]

    Knowledge Discovery Process

    Goal
      - understanding the application domain, and goals of the KDD effort

    Data selection, acquisition, integration

    Data cleaning
      - noise, missing data, outliers, etc.

    Exploratory data analysis
      - dimensionality reduction, transformations
      - selection of appropriate model for analysis, hypotheses to test

    Data mining
      - selecting an appropriate method that matches the set goals (classification, regression, clustering, etc.)
      - selecting algorithm

    Testing and verification

    Interpretation

    Consolidation and use

    [Figure: bar chart, "Effort for each data-mining process step"; axis scale 0 to 100; steps shown: Business Objective Determination, Data Preparation, Data Mining, Analysis of Results and Knowledge Assimilation]

    Issues and challenges

    Large data
      - number of variables (features), number of cases (examples)
      - multi-gigabyte, terabyte databases
      - efficient algorithms, parallel processing

    High dimensionality
      - large number of features: exponential increase in search space
      - potential for spurious patterns
      - dimensionality reduction

    Overfitting
      - models noise in training data, rather than just the general patterns

    Changing data, missing and noisy data

    Use of domain knowledge
      - utilizing knowledge on complex data relationships, known facts

    Understandability of patterns

    Data Mining

    Prediction Methods
      - using some variables to predict unknown or future values of other variables

    Descriptive Methods
      - finding human-interpretable patterns describing the data

    Data Mining Tasks

    Classification

    Clustering

    Association

    Classification

    Data defined in terms of attributes, one of which is the class

    Find a model for the class attribute as a function of the values of the other (predictor) attributes, such that previously unseen records can be assigned a class as accurately as possible.

    Training data: used to build the model

    Test data: used to validate the model (determine accuracy of the model)

    Given data is usually divided into training and test sets.
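
The training/test division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up, the 25% hold-out fraction is arbitrary, and a 1-nearest-neighbour rule stands in for "the model", which the slides do not specify.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, attributes):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(train, key=lambda rec: squared_distance(rec[0], attributes))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class equals the true class."""
    hits = sum(predict_1nn(train, attrs) == label for attrs, label in test)
    return hits / len(test)

# Hypothetical records: ((attribute_1, attribute_2), class)
records = [
    ((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.1, 0.9), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.1), "B"), ((5.2, 4.9), "B"), ((4.8, 5.0), "B"), ((5.1, 5.3), "B"),
]

train, test = train_test_split(records)
print(len(train), len(test), accuracy(train, test))
```

The model only ever sees `train`; accuracy is measured on the held-out `test` records, which play the role of the previously unseen records.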

    Classification: Example

    Classification: Direct Marketing

    Goal

    Classification: Fraud detection

    Goal: Predict fraudulent cases in credit card transactions.

    Data
      - Use credit card transactions and information on the account holder as input variables
      - Label past transactions as fraud or fair.

    Learn a model for the class of transactions

    Use the model to detect fraud by observing credit card transactions on a given account.

    Clustering

    Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
      - data points in one cluster are more similar to one another
      - data points in separate clusters are less similar to one another.

    Similarity measures
      - Euclidean distance if attributes are continuous
      - Problem-specific measures

    Clustering: Market Segmentation

    Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

    Approach
      - collect different attributes on customers based on geographical and lifestyle-related information
      - identify clusters of similar customers
      - measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

    Association Rule Discovery

    Given a set of records, each of which contains some number of items from a given collection
      - produce dependency rules which will predict the occurrence of an item based on occurrences of other items

    Association Rules: Application

    Marketing and Sales Promotion

    Consider discovered rule

    {Bagels, } --> {Potato Chips}
      - Potato Chips as consequent: can be used to determine what may be done to boost sales
      - Bagels as an antecedent: can be used to see which products may be affected if bagels are discontinued
      - Can be used to see which products should be sold with Bagels to promote sale of Potato Chips

    Association Rules: Application

    Supermarket shelf management

    Goal: to identify items which are bought together (by sufficiently many customers)

    Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items.

    Example
      - If a customer buys Diapers and Milk, then he is very likely to buy Beer
      - so stack six-packs next to diapers?

    Visualization

    Complement to other DM techniques like Segmentation, etc.

    Data Mining in CRM: Customer Life Cycle

    Customer Life Cycle
      - The stages in the relationship between a customer and a business

    Key stages in the customer lifecycle
      - Prospects: people who are not yet customers but are in the target market
      - Responders: prospects who show an interest in a product or service
      - Active Customers: people who are currently using the product or service
      - Former Customers: may be "bad" customers who did not pay their bills or who incurred high costs

    It's important to know life cycle events (e.g. retirement)

    Data Mining in CRM: Customer Life Cycle

    What marketers want: increasing customer revenue and customer profitability
      - Up-sell
      - Cross-sell
      - Keeping the customers for a longer period of time

    Solution: applying data mining

    Data Mining in CRM

    DM helps to
      - Determine the behavior surrounding a particular lifecycle event
      - Find other people in similar life stages and determine which customers are following similar behavior patterns

    Data Mining in CRM (cont.)

    [Figure: diagram with components labeled Data Warehouse, Data Mining, Campaign Management, Customer Profile, Customer Life Cycle Info.]

    Data Mining Techniques

    Descriptive
      - Clustering
      - Association

    Predictive
      - Classification

    Predictive Data Mining

    Honest: Tridas, Vickie, Mike

    Crooked: Barney, Waldo, Wally

    Prediction

    Tridas, Vickie, Mike

    Honest = has round eyes and a smile

    Decision Trees

    Data

    height  hair   eyes   class
    short   blond  blue   A
    tall    blond  brown  B
    tall    red    blue   A
    short   dark   blue   B
    tall    dark   blue   B
    tall    blond  blue   A
    tall    dark   brown  B
    short   blond  brown  B

    Decision Trees (cont.)

    hair
      dark:  short, blue = B;  tall, blue = B;  tall, brown = B
      red:   tall, blue = A
      blond: short, blue = A;  tall, brown = B;  tall, blue = A;  short, brown = B

    Completely classifies dark-haired and red-haired people

    Does not completely classify blonde-haired people. More work is required

    Decision Trees (cont.)

    hair
      dark:  short, blue = B;  tall, blue = B;  tall, brown = B
      red:   tall, blue = A
      blond: short, blue = A;  tall, brown = B;  tall, blue = A;  short, brown = B
        eye
          blue:  short = A;  tall = A
          brown: tall = B;  short = B

    Decision tree is complete because
      1. All 8 cases appear at nodes
      2. At each node, all cases are in the same class (A or B)

    Decision Trees: Learned Predictive Rules

    hair
      dark: B
      red: A
      blond: eyes
        blue: A
        brown: B
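
The learned rules can be written as a small Python function and checked against the eight training cases from the data table. This is a sketch of the finished tree only, not of the procedure that learns it; the function name `classify` is an illustrative choice.

```python
# The eight training cases from the slides: (height, hair, eyes, class)
CASES = [
    ("short", "blond", "blue", "A"),
    ("tall", "blond", "brown", "B"),
    ("tall", "red", "blue", "A"),
    ("short", "dark", "blue", "B"),
    ("tall", "dark", "blue", "B"),
    ("tall", "blond", "blue", "A"),
    ("tall", "dark", "brown", "B"),
    ("short", "blond", "brown", "B"),
]

def classify(height, hair, eyes):
    """Apply the learned rules: split on hair first, then on eyes for blond."""
    if hair == "dark":
        return "B"
    if hair == "red":
        return "A"
    # hair == "blond": hair alone is not enough, so eye colour decides
    return "A" if eyes == "blue" else "B"

correct = sum(classify(h, hair, e) == cls for h, hair, e, cls in CASES)
print(f"{correct} of {len(CASES)} training cases classified correctly")
```

Note that `height` is never consulted: the tree separates all eight cases using hair and eyes alone.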

    Decision Trees: Another Example

    [Figure: decision tree; recoverable labels: Total list, 50% member, 0-1 child, 2-3 child, 20% member, 4 children]

    Rule Induction

    Try to find rules of the form

    IF <left-hand-side> THEN <right-hand-side>
      - This is the reverse of a rule-based agent, where the rules are given and the agent must act. Here the actions are given and we have to discover the rules!

    Prevalence = probability that LHS and

    Association Rules from Market Basket Analysis

    [Dairy-Milk-Refrigerated] → [Soft Drinks Carbonated]
      prevalence = 4.99%, predictability = 22.89%

    [Dry Dinners - Pasta] → [Soup-Canned]
      prevalence = 0.94%, predictability = 28.14%

    [Dry Dinners - Pasta] → [Cereal - Ready to Eat]
      prevalence = 1.36%, predictability = 41.02%

    [Cheese Slices] → [Cereal - Ready to Eat]
      prevalence = 1.16%, predictability = 38.01%
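
The two figures attached to each rule can be computed from raw baskets as in the sketch below. It assumes the usual reading of the terms, which the transcript does not spell out in full: prevalence is the fraction of all baskets containing both sides of the rule (often called support), and predictability is the fraction of LHS baskets that also contain the RHS (often called confidence). The baskets are made up and are not the data behind the percentages above.

```python
def prevalence(transactions, lhs, rhs):
    """Fraction of transactions that contain every item of LHS and RHS."""
    both = set(lhs) | set(rhs)
    return sum(both <= t for t in transactions) / len(transactions)

def predictability(transactions, lhs, rhs):
    """Among transactions containing LHS, the fraction that also contain RHS."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [t for t in transactions if lhs <= t]
    return sum(rhs <= t for t in with_lhs) / len(with_lhs)

# Hypothetical baskets (sets of item names)
baskets = [
    {"milk", "soft drinks"},
    {"milk", "bread"},
    {"milk", "soft drinks", "pasta"},
    {"pasta", "soup"},
    {"bread"},
]

print(prevalence(baskets, ["milk"], ["soft drinks"]))      # 2 of 5 baskets
print(predictability(baskets, ["milk"], ["soft drinks"]))  # 2 of 3 milk baskets
```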

    Use of Rule Associations

    Coupons, discounts
      - Don't give discounts on 2 items that are frequently bought together. Use the discount on 1 to "pull" the other

    Product placement
      - Offer correlated products to the customer at the same time.

    VCR purchasers 2-3 months after VCR purchase

    Discovery of patterns
      - People who bought X, Y and Z (but not any pair) bought W over half the time

    Finding Rule Associations Algorithm

    Example: grocery shopping

    For each item, count the number of occurrences (say out of 100,000)
      apples 1891, caviar 3, ice cream 1088, ...

    Drop the ones that are below a minimum support level
      apples 1891, ice cream 1088, pet food 2451, ...

    Make a table of each item against each other item:

                 apples   ice cream   pet food
    apples       1891     685         24
    ice cream    -----    1088        322
    pet food     -----    -----       2451

    Discard cells below support threshold. Now make a cube for triples, etc. Add 1 dimension for each product on LHS.
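
The first two passes of this procedure (count single items, drop the rare ones, then count pairs of the survivors) can be sketched as follows. The baskets and the support threshold of 2 are illustrative assumptions, and the function name is hypothetical; extending to triples would repeat the same count-and-prune step on larger combinations.

```python
from collections import Counter
from itertools import combinations

def frequent_items_and_pairs(transactions, min_support):
    """Count single items, drop the rare ones, then count surviving pairs."""
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, n in item_counts.items() if n >= min_support}

    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent)  # sorted so each pair has one key
        pair_counts.update(combinations(kept, 2))

    frequent_pairs = {p: n for p, n in pair_counts.items() if n >= min_support}
    return {i: item_counts[i] for i in frequent}, frequent_pairs

# Hypothetical baskets; the threshold of 2 is an arbitrary choice
baskets = [
    ["apples", "ice cream"],
    ["apples", "ice cream", "pet food"],
    ["apples", "pet food"],
    ["ice cream", "caviar"],
    ["apples", "ice cream"],
]

items, pairs = frequent_items_and_pairs(baskets, min_support=2)
print(items)
print(pairs)
```

Here caviar is dropped after the first pass, so no pair involving it is ever counted, which is the saving the pruning step is meant to provide.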

    Clustering

    The art of finding groups in data

    Objective: gather items from a database into sets according to (unknown) common characteristics

    Much more difficult than classification since the classes are not known in advance (no training)

    Technique: unsupervised learning

    The K-Means Clustering Method

    [Figure: scatter plots on 0 to 10 axes showing successive K-means steps for K=2]

    K=2

    Arbitrarily choose K objects as initial cluster centers

    Assign each of the objects to the most similar center

    Update the cluster means

    Reassign

    Update the cluster means

    Reassign
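
The assign/update/reassign loop can be sketched in plain Python as below. Assumptions: Euclidean distance is the similarity measure, the "arbitrary" initial centers are simply the first K points, and the sample points are made up.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Plain K-means on tuples of numbers, using Euclidean distance."""
    centers = list(points[:k])  # "arbitrary" choice: the first k objects
    assignment = None
    for _ in range(max_iterations):
        # Assign each object to the most similar (closest) center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing was reassigned, so the means will not change
        assignment = new_assignment
        # Update each cluster mean from its current members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two visible groups
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

With these points the first assignment is poor, because both initial centers sit in the same group; one update of the means followed by a reassignment is enough to separate the two groups.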

    Opinion Analysis

    Word-of-mouth on the Web

    The Web has dramatically changed the way that consumers express their opinions.

    One can post reviews of products at merchant sites, Web forums, discussion groups, blogs

    Techniques are being developed to exploit these sources.

    Benefits of

    Feature Based Analysis & Summarization

    Extracting product features (called Opinion Features) that have been commented on by customers.

    Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.

    Summarizing and comparing results.

    Sentiment Analysis and Opinion Mining

    Introduction

    Two main types of textual information.
      - Facts and Opinions

    Note: factual statements can imply opinions too.

    Most current text information processing methods (e.g., web search, text mining) work with factual information.

    Sentiment analysis or opinion mining
      - computational study of opinions, sentiments and emotions expressed in text.

    Why opinion mining now? Mainly because of the Web: huge volumes of opinionated text.

    Introduction: user-generated media

    Importance of opinions
      - Opinions are important because whenever we need to make a decision, we want to hear others' opinions.
      - In the past,
          Individuals: opinions from friends and family
          Businesses: surveys, focus groups, consultants ...

    Word-of-mouth on the Web
      - User-generated media: one can express opinions on anything in reviews, forums, discussion groups, blogs ...
      - Opinions of global scale: no longer limited to
          Individuals: one's circle of friends
          Businesses: small-scale surveys, tiny focus groups, etc.

    A Fascinating Problem!

    Intellectually challenging & major applications.
      - A popular research topic in recent years in NLP and Web data mining.
      - 20-60 companies in US alone

    It touches every aspect of NLP and yet is restricted and confined.
      - Little research in NLP/Linguistics in the past.

    Potentially a major technology from NLP.
      - But "not yet" and not easy!
      - Data sourcing and data integration are hard too!

    An Example Review

    "I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, and wanted me to return it to the shop."

    What do we see?
      - Opinions, targets of opinions, and opinion holders

    Target Object (Liu, Web Data Mining book 2006)

    Definition (object): An object o is a product, person, event, organization, or topic. o is represented as
      - a hierarchy of components, sub-components, and so on.
      - Each node represents a component and is associated with a set of attributes of the component.

    An opinion can be expressed on any node or attribute of the node.

    To simplify our discussion, we use the term features to represent both components and attributes.

    What is an Opinion? (Liu, a Ch. in NLP handbook)

    An opinion is a quintuple (o_j, f_jk, so_ijkl, h_i, t_l), where
      - o_j is a target object.
      - f_jk is a feature of the object o_j.
      - so_ijkl is the sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l. so_ijkl is +ve, -ve, or neu, or a more granular rating.
      - h_i is an opinion holder.
      - t_l is the time when the opinion is expressed.

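One way to hold such quintuples in code is a small record type, sketched below. The field names are illustrative choices, and the sample opinions are labelled by hand from the example iPhone review; the holder names and the "unknown" time values are assumptions, since the review gives no dates.

```python
from typing import NamedTuple

class Opinion(NamedTuple):
    """One opinion quintuple (o_j, f_jk, so_ijkl, h_i, t_l)."""
    target: str      # o_j: the object the opinion is about
    feature: str     # f_jk: the feature of that object
    sentiment: str   # so_ijkl: "+ve", "-ve" or "neu"
    holder: str      # h_i: who holds the opinion
    time: str        # t_l: when it was expressed

# Hand-labelled quintuples; holders and times are illustrative assumptions
opinions = [
    Opinion("iPhone", "touch screen", "+ve", "reviewer", "unknown"),
    Opinion("iPhone", "voice quality", "+ve", "reviewer", "unknown"),
    Opinion("iPhone", "battery life", "-ve", "reviewer", "unknown"),
    Opinion("iPhone", "price", "-ve", "reviewer's mother", "unknown"),
]

# Once opinions are structured, questions become simple filters
negative_features = [o.feature for o in opinions if o.sentiment == "-ve"]
print(negative_features)
```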

    Objective: structure the unstructured

    Objective: Given an opinionated document,
      - Discover all quintuples (o_j, f_jk, so_ijkl, h_i, t_l), i.e., mine the five corresponding pieces of information in each quintuple, and
      - Or, solve some simpler problems

    With the quintuples,
      - Unstructured Text → Structured Data
          Traditional data and visualization tools can be used to slice, dice and visualize the results in all kinds of ways
          Enable qualitative and quantitative analysis.

    Sentiment Classification: doc-level (Pang and Lee et al 2002 and Turney 2002)

    Classify a document (e.g., a review) based on the overall sentiment expressed by the opinion holder
      - Classes: Positive, or negative (and neutral)

    In the model (o_j, f_jk, so_ijkl, h_i, t_l), it assumes
      - Each document focuses on a single object and contains opinions from a single opinion holder.
      - It considers opinion on the object, o_j (or o_j = f_jk)

    Subjectivity Analysis (Wiebe et al 2004)

    Sentence-level sentiment analysis has two tasks:
      - Subjectivity classification: subjective or objective.
          Objective: e.g., I bought an iPhone a few days ago.
          Subjective: e.g., It is such a nice phone.
      - Sentiment classification: for subjective sentences or clauses, classify positive or negative.
          Positive: It is such a nice phone.

    However. (Liu, Chapter in NLP handbook)
      - Subjective sentences ≠ +ve or -ve opinions
          E.g., I think he came yesterday.
      - Objective sentence ≠ no opinion
          Imply -ve opinion: My phone broke in the second day.

    Feature-Based Sentiment Analysis

    Sentiment classification at both document and sentence (or clause) levels is not sufficient,
      - they do not tell what people like and/or dislike
      - A positive opinion on an object does not mean that the opinion holder likes everything.
      - A negative opinion on an object does not mean .....

    Objective: Discovering all quintuples (o_j, f_jk, so_ijkl, h_i, t_l)

    With all quintuples, all kinds of analyses become possible.

    Feature-Based Opinion Summary (Hu & Liu, KDD-2004)

    "I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, and wanted me to return it to the shop."

    ....

    Feature Based Summary:

    Feature1: Touch screen
      Positive: 212
        - The touch screen was really cool.
        - The touch screen was so easy to use and can do amazing things.
        ...
      Negative: 6
        - The screen is easily scratched.
        - I have a lot of difficulty in removing finger marks from the touch screen.
        ...

    Feature2: battery life
      ...

    Note: We omit opinion holders
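
A summary of this shape is a grouping of opinion sentences by feature and polarity. The sketch below assumes the sentences have already been labelled (here by hand) and, like the slide, omits opinion holders; the counts it prints describe only these few sample sentences, not the 212 and 6 reported on the slide.

```python
from collections import defaultdict

def summarize(opinions):
    """Group (feature, polarity, sentence) triples into per-feature lists."""
    summary = defaultdict(lambda: {"Positive": [], "Negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return dict(summary)

# Sentences adapted from the slide; the labels are assigned by hand here
opinions = [
    ("touch screen", "Positive", "The touch screen was really cool."),
    ("touch screen", "Negative", "The screen is easily scratched."),
    ("battery life", "Negative", "The battery life was not long."),
]

for feature, groups in summarize(opinions).items():
    print(f"Feature: {feature}")
    for polarity, sentences in groups.items():
        print(f"  {polarity}: {len(sentences)}")
```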

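    Visual Comparison (Liu et al. WWW-2005)

    [Figure: two charts over the features Voice, Screen, Size, Weight, Battery, each with a positive (+) and a negative (-) direction: a summary of reviews of Cell Phone 1, and a comparison of reviews of Cell Phone 1 and Cell Phone 2]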