s1 introduction to course

Upload: sargentshriver

Post on 04-Apr-2018


TRANSCRIPT

  • 7/30/2019 S1 Introduction to Course

    1/102

    S1: Introduction to the Course

    Shawndra Hill

    Spring 2013

    TR 1:30-3pm and 3-4:30

  • 7/30/2019 S1 Introduction to Course

    2/102

    Data

    The amount of data created by each person doubles every 1.5-2 years:

    - after five years: x 10
    - after ten years: x 100
    - after twenty years: x 10,000

    A. Weigend
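The multipliers on this slide follow from compound doubling. A minimal sketch, assuming a fixed doubling time (the function name `growth_factor` is illustrative and not from the course materials):

```python
def growth_factor(years, doubling_time_years):
    """Multiplier after `years` if a quantity doubles every `doubling_time_years`."""
    return 2 ** (years / doubling_time_years)

# Compare the two ends of the slide's "1.5-2 years" range.
for horizon in (5, 10, 20):
    slow = growth_factor(horizon, 2.0)
    fast = growth_factor(horizon, 1.5)
    print(f"{horizon:>2} years: x{slow:,.0f} to x{fast:,.0f}")
```

With a 1.5-year doubling time the factors come out near 10, 100, and 10,000, which is where the slide's round figures seem to come from; a 2-year doubling time gives noticeably smaller multipliers.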

  • 7/30/2019 S1 Introduction to Course

    3/102

    1 billion connected flash

    players

    A. Weigend

  • 7/30/2019 S1 Introduction to Course

    4/102

    40 billion RFID tags worldwide

    A. Weigend

  • 7/30/2019 S1 Introduction to Course

    5/102

    Time Scales

    Biology: ~100k years

    Technology: ~1 year

    A. Weigend

  • 7/30/2019 S1 Introduction to Course

    6/102

    A. Weigend

  • 7/30/2019 S1 Introduction to Course

    7/102

    Social Data = Shared Data

    10 billion pieces of content shared per month

  • 7/30/2019 S1 Introduction to Course

    8/102

    Social Data = Shared Data

    1 billion videos watched per day

  • 7/30/2019 S1 Introduction to Course

    9/102

    Shopping?

    Process of creating and refining product space awareness, only occasionally punctuated by purchases

  • 7/30/2019 S1 Introduction to Course

    10/102

    How do you know people's secret desires?

  • 7/30/2019 S1 Introduction to Course

    11/102

    Instrument for feedback

  • 7/30/2019 S1 Introduction to Course

    12/102

    Data Sources

    - Situation
      - Location
      - Device
    - Attention
      - Transactions
      - Clicks
    - Intention
      - Search

    A. Weigend

  • 7/30/2019 S1 Introduction to Course

    13/102

    What is Data Mining?

    Data Mining, Spring 2013

    Shawndra Hill

  • 7/30/2019 S1 Introduction to Course

    14/102

    What is Data Mining?

    - The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
    - The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).
    - The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro, and Matheus).

  • 7/30/2019 S1 Introduction to Course

    15/102

    What is Data Mining?

    Definition (Fayyad et al.): The nontrivial discovery of novel, valid, comprehensible and potentially useful patterns from data.

    What is a pattern? A relationship in the data. E.g.,

    - On Thursday nights, people who buy diapers also tend to buy beer
    - People with good credit ratings are less likely to have accidents
    - Male consumers, 37+, income bracket 50K-75K, spend between $25-$50 per catalog order
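A pattern such as "people who buy diapers also tend to buy beer" is usually quantified with support and confidence. A minimal sketch, assuming each transaction is a set of item names; the baskets below are invented for illustration and the function names are not from the course:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical Thursday-night baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

print(support(baskets, {"diapers", "beer"}))       # share of baskets with both items
print(confidence(baskets, {"diapers"}, {"beer"}))  # share of diaper baskets that also have beer
```

`confidence` divides by the antecedent's support, so it assumes the antecedent occurs at least once in the data.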

  • 7/30/2019 S1 Introduction to Course

    16/102

    Historical Differences Between Statistics and DM

    Statistics                      Data Mining
    ------------------------------  ----------------------------------------------------
    Confirmative                    Explorative
    Small data sets / file-based    Large data sets / databases
    Small number of variables       Large number of variables
    Deductive                       Inductive
    Numeric data                    Numeric and non-numeric (including text, networks)
    Clean data                      Data cleaning

  • 7/30/2019 S1 Introduction to Course

    17/102

    Data Mining vs. Statistics

    Statistics is known for: well-defined hypotheses used to learn about a specifically chosen population studied using carefully collected data providing inferences with well-known properties.

    Data mining isn't that careful. It is: data-driven discovery of models and patterns from massive and observational data sets.

  • 7/30/2019 S1 Introduction to Course

    18/102

    Data Mining v. Statistics

    Traditional statistics:
    - first hypothesize, then collect data, then analyze
    - often model-oriented (strong parametric models)

    Data mining:
    - few if any a priori hypotheses
    - data is usually already collected a priori
    - analysis is typically data-driven, not hypothesis-driven
    - often algorithm-oriented rather than model-oriented

    Different?
    - Yes, in terms of culture, motivation; however...
    - statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
    - Increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on pioneering work of John Tukey in the 1960s)

  • 7/30/2019 S1 Introduction to Course

    19/102

    Data Mining Enablers

    - Explosion of data
    - Fast and cheap computation and storage
      - Moore's Law: processing doubles every 19 months
      - Disk storage doubles every 9 months
      - Database technology
    - Competitive pressure in business
      - Data has value!
    - New, successful models
      - SVM, boosting
    - Commercial products
      - SAS, SPSS, Insightful, IBM, Oracle
    - Open Source products
      - Weka, R

    [Figure: Disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 with an ExaByte marker. Disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), www.disktrend.com]
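The chart quotes annual growth rates while the bullets quote doubling times. A minimal sketch of the conversion between the two, assuming a constant annual growth rate (the function name is illustrative):

```python
import math

def doubling_time_months(annual_growth_rate):
    """Months needed to double at a constant annual growth rate (1.12 means 112%/y)."""
    return 12 * math.log(2) / math.log(1 + annual_growth_rate)

print(round(doubling_time_months(0.587), 1))  # Moore's Law line on the chart
print(round(doubling_time_months(1.12), 1))   # disk TB line on the chart
```

Under this constant-rate assumption, 58.7%/y works out to roughly 18 months per doubling and 112%/y to roughly 11 months, so the doubling periods quoted in the bullets are probably best read as round figures.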

  • 7/30/2019 S1 Introduction to Course

    20/102

    Data-Driven Discovery

    - Observational data is cheap relative to experimental data
    - Examples:
      - Transaction data archives for retail stores, airlines, etc.
      - Web logs for Amazon, Google, etc.
      - The human/mouse/rat genome
      - Etc., etc.
    - Makes sense to leverage available data
    - Useful (?) information may be hidden in vast archives of data

    What are the perils of observational data?

  • 7/30/2019 S1 Introduction to Course

    21/102

    Data Mining: Confluence of Multiple Disciplines

    [Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

    Different fields have different views of what data mining is (also different terminology!)

  • 7/30/2019 S1 Introduction to Course

    22/102

    Induction vs. Deduction

    The Problem of Deduction: How to demonstrate that an abstract idea applies to nature?

    The Problem of Induction: How to go beyond a collection of facts to new concepts?

  • 7/30/2019 S1 Introduction to Course

    23/102

    Decision Support Systems (DSSs)

    Assist managers in making decisions or choices

    Types of DSSs:

    - Model-driven: Spreadsheets and other optimization-based methods from Operations Management and Finance.
    - Communication-driven: Groupware (e.g. voting/rating), Computer Supported Collaborative Work (CSCW), document sharing, teleconferencing
    - Data-driven: Collect, store, and analyze large data volumes. A.k.a. Business Intelligence (BI) systems, Warehouses, OLAP
    - Knowledge-driven: e.g. expert systems that capture expertise by applying rules elicited from experts. Traditional uses: medical diagnosis (e.g. MYCIN), computer configuration (e.g. XCON), personalization. Knowledge elicitation and knowledge representation problems.

    This course deals mainly with: data-driven DSSs (Part 1) and knowledge-driven DSSs (Part 2).

    We will touch briefly on model-driven DSSs in Part 2 (but see OPIM 101 for more on that).

  • 7/30/2019 S1 Introduction to Course

    24/102

    The Course

  • 7/30/2019 S1 Introduction to Course

    25/102

    Course Objectives

    - Approach business problems data-analytically. Think carefully & systematically about whether & how data can improve business performance.
    - Be able to interact competently on the topic of data mining for business intelligence. Know the basics of data mining processes, algorithms, & systems well enough to interact with CTOs, expert data miners, and business analysts. Be able to envision data mining opportunities.
    - Hands-on experience mining data. Be prepared to follow up on ideas or opportunities that present themselves, e.g., by performing pilot studies.

  • 7/30/2019 S1 Introduction to Course

    26/102

    Our Goals

    Understand the basics of the major Data Mining/Machine Learning techniques:
    - What they do: problems they can solve
    - Who uses them
    - Where they are used
    - When and how to use them
    - How they work (at a high level only)
    - Limitations

    Apply techniques and evaluate the models built

  • 7/30/2019 S1 Introduction to Course

    27/102

    Course Outline

    Introduction to Modeling & Data Mining
    - Fundamental concepts and terminology

    Data Mining methods
    - Classification, decision trees, association rules, clustering and segmentation, collaborative filtering, genetic algorithms, etc.
    - Inner workings
    - Strengths and weaknesses

    Evaluation
    - How to evaluate the results of a data mining solution

    Applications
    - Real-world business problems DM can be applied to

  • 7/30/2019 S1 Introduction to Course

    28/102

    Course Information

    - Teaching style: Lecture/Lab/Guest Speakers (AT&T, IBM, Yahoo!)
    - Student participation/attendance is important
    - Lab sessions: Weka, Gephi, Python
    - Textbook: Various Publicly Available Readings

  • 7/30/2019 S1 Introduction to Course

    29/102

    Course Tools

    - SQL (Microsoft Access)
    - Weka
    - Gephi
    - Python (Version 2.7)

    Start installing now

  • 7/30/2019 S1 Introduction to Course

    30/102

    Course Information

    - Canvas
    - Wordpress class site: http://opim672.wordpress.com
    - Facebook/Twitter
    - Office hours: M 6-7pm, F 2-5pm, or by appointment
    - Email: [email protected]
    - TA: Krishna Choksi ([email protected]), Adrian Benton

  • 7/30/2019 S1 Introduction to Course

    31/102

    Course Information

    - Read material before and after class
    - 8 homework assignments (35 points), groups of 2
    - Data mining project (50 points), groups of 4-6, 10 groups per class
      - Final Report
      - Mid-semester update
      - End-of-semester presentation
      - Project Reviews
    - Class participation (15 points)
    - Dataset competition (optional for extra credit)

    Warning:
    1. This is a hands-on class
    2. A significant portion of deliverables are at the end of the semester.

  • 7/30/2019 S1 Introduction to Course

    32/102

    What is a DSS?

    Decision Support Systems aim at allowing business users to make better decisions faster and take action more easily and more profitably based on this information.

    This is achieved through:
    - Prediction
    - Description
    - Data Dissemination
    - Prescription

  • 7/30/2019 S1 Introduction to Course

    33/102

    Prediction

    Induction: From specific examples (instances) to general rules

    Instances:

    ID        Swims   Color   Type
    Animal1   yes     gray    dolphin
    Animal2   yes     black   dolphin
    Animal3   no      gray    elephant

    Rules:

    IF swims = yes THEN class = dolphin

    - Antecedent/Assumption (Rule Body): swims = yes
    - Consequent/Conclusion (Rule Head): class = dolphin

    Prediction = Determining the class or attribute-value for a new item with some known attributes.
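The induction step on this slide can be sketched in a few lines. This is a minimal single-attribute rule learner (majority class per attribute value), offered as one simple way to get from the three instances to the rule, not as the method used in the course; the function names are illustrative:

```python
from collections import Counter, defaultdict

# The three instances from the slide's table.
instances = [
    {"id": "Animal1", "swims": "yes", "color": "gray", "type": "dolphin"},
    {"id": "Animal2", "swims": "yes", "color": "black", "type": "dolphin"},
    {"id": "Animal3", "swims": "no", "color": "gray", "type": "elephant"},
]

def induce_rules(rows, attribute, target):
    """For each value of `attribute`, pick the most common `target` class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    return {value: classes.most_common(1)[0][0] for value, classes in counts.items()}

def predict(rules, item, attribute):
    """Apply the induced rules to a new item; None if the value was never seen."""
    return rules.get(item[attribute])

rules = induce_rules(instances, "swims", "type")   # IF swims = yes THEN class = dolphin, ...
new_animal = {"swims": "yes", "color": "white"}    # hypothetical new item
print(rules, predict(rules, new_animal, "swims"))
```

The rule body is the attribute test (`swims = yes`) and the rule head is the predicted class; prediction is then a lookup on the new item's known attribute.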

  • 7/30/2019 S1 Introduction to Course

    34/102

    Text Mining

  • 7/30/2019 S1 Introduction to Course

    35/102

    Prediction: Examples from Industry?

    Classifying dolphins and flowers is dull (toy problems often cited in the data mining literature).

    Questions:
    - How do we use data mining/machine learning to generate revenues or reduce costs?
    - How do we monetize DM?!

  • 7/30/2019 S1 Introduction to Course

    36/102

    Examples

    - Mining Medical Discussion Board Data
    - Mining Motley Fool Caps
    - Social Network Based Marketing
    - Social Network Based Fraud Detection
    - Social TV Examples
    - Profit Maximizing Recommendation Engine

  • 7/30/2019 S1 Introduction to Course

    37/102

    Prediction: Examples from Industry?

    - Wachovia: Can I predict if someone will default on their loan?
    - Visa: Can I identify fraudulent credit card transactions?
    - Linens n Things
    - Monster.com
    - The World Bank

  • 7/30/2019 S1 Introduction to Course

    38/102

    Prediction: Examples from Industry?

    - Wachovia: Can I predict if someone will default on their loan?
    - Visa: Can I identify fraudulent credit card transactions?
    - Linens n Things: Predict response to recommendation online?
    - Monster.com: Predict if stock value of company will go up based on employee attrition?
    - The World Bank: Predict if country/organization will default?

  • 7/30/2019 S1 Introduction to Course

    39/102

    Prediction: Examples from Industry?

    - ACNielsen
    - PepsiCo

  • 7/30/2019 S1 Introduction to Course

    40/102

    Prediction: Examples from Industry?

    - ACNielsen: Association rules for market baskets?
    - PepsiCo: Identify business opportunities?

  • 7/30/2019 S1 Introduction to Course

    41/102

    Data Mining as a Core Competency

  • 7/30/2019 S1 Introduction to Course

    42/102

    Examples of Data Mining Successes

    Google is a company built on data mining:
    - PageRank mined the web to build better search
    - Google as spell checker
    - Google as ad placer
    - Google as news aggregator
    - Google as face recognizer

  • 7/30/2019 S1 Introduction to Course

    43/102

    Data Mining as a Core Competency

  • 7/30/2019 S1 Introduction to Course

    44/102

    Data Mining as a Core Competency

  • 7/30/2019 S1 Introduction to Course

    45/102

    Data Data Data

    It's all about the data: where does it come from?
    - WWW
    - NASA
    - Business processes/transactions
    - Telecommunications and networking
    - Medical imagery
    - Government, census, demographics (data.gov!)
    - Sensor networks, RFID tags
    - Sports

  • 7/30/2019 S1 Introduction to Course

    46/102

    Types of Data: Flat File or Vector Data

    - Rows = objects
    - Columns = measurements on objects
    - Represent each row as a p-dimensional vector, where p is the dimensionality
    - In effect, embed our objects in a p-dimensional vector space
    - Often useful, but not always appropriate
    - Both n and p can be very large in data mining
    - Matrix can be quite sparse

    [Figure: n x p data matrix (n rows, p columns) with sample rows 2.3, -1.5, -1.3 and 1.1, 0.1, -0.1]
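The flat-file view can be made concrete with a small matrix. A minimal sketch using plain Python lists; the first two rows are the values visible on the slide, and the third row is invented to show what sparsity means:

```python
# Two rows copied from the slide's matrix, plus one made-up sparse row.
data = [
    [2.3, -1.5, -1.3],
    [1.1, 0.1, -0.1],
    [0.0, 0.0, 4.2],   # hypothetical row, added to show sparsity
]

n = len(data)        # number of objects (rows)
p = len(data[0])     # number of measurements per object (columns)

def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(value == 0 for row in matrix for value in row)
    return zeros / total

print(n, p, round(sparsity(data), 3))
```

Each row is one object embedded as a point in p-dimensional space; when most entries are zero, storing only the non-zero cells becomes worthwhile.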

  • 7/30/2019 S1 Introduction to Course

    47/102

    Text

    Data Mining, Spring 2006

  • 7/30/2019 S1 Introduction to Course

    48/102

    Types of Data: Sparse Matrix (Text) Data

    [Figure: sparse matrix of text documents by word IDs]
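A documents-by-word-IDs matrix is mostly zeros, so it is typically stored sparsely. A minimal sketch, assuming whitespace tokenization and word IDs assigned in order of first appearance; the two documents are invented for illustration:

```python
from collections import Counter

# Hypothetical toy documents, invented for illustration.
documents = [
    "data mining finds patterns in data",
    "patterns in text data",
]

# Assign each distinct word an integer ID in order of first appearance.
word_ids = {}
for doc in documents:
    for word in doc.split():
        word_ids.setdefault(word, len(word_ids))

# Sparse rows: one {word_id: count} dict per document, storing non-zero cells only.
sparse_rows = [
    dict(Counter(word_ids[word] for word in doc.split()))
    for doc in documents
]

print(word_ids)
print(sparse_rows)
```

A word that does not occur in a document simply has no entry in that document's dict, which is what keeps the representation small when the vocabulary is large.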

  • 7/30/2019 S1 Introduction to Course

    49/102

    128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,

    128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,

    128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,

    128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,

    128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,

    128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,

    128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,

    128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,

    128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,

    128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

    Sequence (Web) Data

    Sometimes another representation is more useful

    [Figure: one sequence of symbols per user (User 1 to User 5), e.g. 5115; 1111115151115177777777; 1113333333131113332232]
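One way to get from raw log records to per-user sequences is sketched below. The field positions are assumptions read off the excerpt (first field taken as the client IP, fourteenth as the requested path), and the integer page codes are arbitrary labels, not the coding used in the slide's figure:

```python
from collections import defaultdict

# Three records copied from the slide's log excerpt.
log_lines = [
    "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
    "128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,",
]

CLIENT_IP, PATH = 0, 13   # assumed field positions, not documented in the source

def page_sequences(lines):
    """Map each client IP to the ordered list of page codes it requested."""
    page_codes = {}
    sequences = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        # Give each distinct path the next unused integer code.
        code = page_codes.setdefault(fields[PATH], len(page_codes) + 1)
        sequences[fields[CLIENT_IP]].append(code)
    return page_codes, dict(sequences)

codes, sequences = page_sequences(log_lines)
print(codes)
print(sequences)
```

Treating one IP address as one user is itself a simplification; shared machines and proxies break that assumption.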

  • 7/30/2019 S1 Introduction to Course

    50/102

    128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,

    128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,

    128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,

    128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,

    128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,


    Types of Data: Relational Data

    128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932
    114.12.12.25, Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911

    07911, Chester, NJ, 07954, 34000, , 40.65, -74.12
    07932, Madison, NJ, 56000, 40.642, -74.132

    - Most large data sets are stored in relational databases: Oracle, MSFT, IBM
    - Good open source versions: MySQL, PostgreSQL
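The point of the relational form is that the log, customer, and zip-code tables can be joined on shared keys. A minimal sketch using Python's built-in `sqlite3` module; the table and column names are hypothetical (the slide shows unlabeled rows), and `amount` stands in for the unlabeled numeric column in the zip-code rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (ip TEXT, last_name TEXT, first_name TEXT, city TEXT, zip TEXT);
    CREATE TABLE zip_codes (zip TEXT, city TEXT, state TEXT, amount INTEGER);
    CREATE TABLE hits (ip TEXT, path TEXT);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", [
    ("128.195.36.195", "Doe", "John", "Madison", "07932"),
    ("114.12.12.25", "Trank", "Jill", "Chester", "07911"),
])
conn.executemany("INSERT INTO zip_codes VALUES (?, ?, ?, ?)", [
    ("07911", "Chester", "NJ", 34000),
    ("07932", "Madison", "NJ", 56000),
])
conn.executemany("INSERT INTO hits VALUES (?, ?)", [
    ("128.195.36.195", "/top.html"),
    ("128.195.36.195", "/spt/main.html"),
])

# Join customers to their zip-code row and count their logged page requests.
rows = conn.execute("""
    SELECT c.last_name, z.city, z.amount, COUNT(h.path) AS page_views
    FROM customers AS c
    JOIN zip_codes AS z ON z.zip = c.zip
    LEFT JOIN hits AS h ON h.ip = c.ip
    GROUP BY c.ip, c.last_name, z.city, z.amount
    ORDER BY c.last_name
""").fetchall()
print(rows)
```

The `LEFT JOIN` keeps customers with no logged requests, so they appear with a page-view count of zero instead of dropping out of the result.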

  • 7/30/2019 S1 Introduction to Course

    51/102

    Types of Data: Time Series Data

    [Figure: time series plot, x-axis 0 to 30, y-axis 40 to 160]

    Often many time series, long time series, or multivariate time series

  • 7/30/2019 S1 Introduction to Course

    52/102

    Types of Data: Image Data

  • 7/30/2019 S1 Introduction to Course

    53/102

    Spatio-Temporal Data

    http://senseable.mit.edu/nyte/movies/nyteglobeencounters.movencounters.mov

  • 7/30/2019 S1 Introduction to Course

    54/102

    Network Data

    Algorithms for estimating relative importance in networks. S. White and P. Smyth, ACM SIGKDD, 2003.

  • 7/30/2019 S1 Introduction to Course

    55/102

    HP Labs email network

  • 7/30/2019 S1 Introduction to Course

    56/102

    Data Mining - Columbia University

    HP Labs email network: 500 people, 20k relationships

    Also, temporal networks

  • 7/30/2019 S1 Introduction to Course

    57/102

    Major Application Areas

    - Marketing
      - Customer loyalty/attrition
      - Market basket analysis: on Thursdays, shoppers who buy diapers also buy beer
      - Direct marketing
      - Personalization
      - Market segmentation
    - Fraud Detection (Telecommunication, Credit, Securities)
    - Credit risk
    - Health Care
    - Insurance
      - People with good credit ratings have fewer accidents
    - Text mining: email, documents, and Web analysis
    - Stock Selection
    - Non-business applications: military, bioinformatics, etc.

  • 7/30/2019 S1 Introduction to Course

    58/102

    Examples of Data Mining Successes

    - Market Basket (WalMart)
    - Recommender Systems (Amazon.com)
    - Fraud Detection in Telecommunications (AT&T)
    - Target Marketing/CRM
    - Financial Markets
    - DNA Microarray analysis
    - Biometrics (fingerprinting, handwriting)
    - Web Traffic/Blog analysis

  • 7/30/2019 S1 Introduction to Course

    59/102

    Why Data Mining Now?

    [Diagram: three enablers pointing to DM]
    - Better and cheaper computing power
    - Mature data mining technology
    - Improved data collection, access & storage

  • 7/30/2019 S1 Introduction to Course

    60/102

    Accuracy is King

    Only 15% of mergers and acquisitions succeed

    Stephen Denning, The Leader's Guide to Storytelling, pg xiv

  • 7/30/2019 S1 Introduction to Course

    61/102

    Profit is King (or: It pays to be wrong sometimes)

    - Failure rate of new ventures invested in: 8 out of 10
    - Profit on Google investment: $4 billion (on $25 million)

    Source: http://www.financialnews-us.com/?contentid=534017

  • 7/30/2019 S1 Introduction to Course

    62/102

    Sometimes it pays to be wrong almost all the time

    - Customer Lifetime Value: $2,700
    - Cost per flyer: 7 cents
    - Required hit rate = 7 / 270,000 = 1 in 38,571

    Case: Verizon Wireless
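The required hit rate is a break-even calculation: a flyer pays for itself when the response rate times the customer's lifetime value covers the cost of one flyer. A minimal sketch using the slide's two figures (the function and parameter names are illustrative):

```python
def break_even_hit_rate(cost_per_contact, value_per_success):
    """Smallest response rate at which expected value covers the contact cost."""
    return cost_per_contact / value_per_success

# Slide figures: 7 cents per flyer, $2,700 customer lifetime value.
rate = break_even_hit_rate(cost_per_contact=0.07, value_per_success=2700.00)
print(f"1 in {1 / rate:,.0f}")
```

Both amounts are expressed in dollars here, which is equivalent to the slide's 7 / 270,000 in cents.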

  • 7/30/2019 S1 Introduction to Course

    63/102

    Case: Verizon Wireless (Plain Vanilla DM)

    About Verizon Wireless
    - Largest wireless provider in US
    - Customer base: 30.3 million
    - Covering 90% of US population

    Challenges
    - High customer turnover rate (churn) of 2% per month (600,000 customers disconnect per month)
    - Associated replacement cost in hundreds of millions per year
    - Average cost of new customer acquisition: $320

  • 7/30/2019 S1 Introduction to Course

    64/102

    Case: Verizon Wireless

    Possible solutions
    - Offer incentives to every customer before contracts expire
      - expensive
      - no learning

  • 7/30/2019 S1 Introduction to Course

    65/102

    Data Mining Solution: Prediction

    Build a predictive model:
    - Before contracts expire, use a predictive model to predict which customers are likely to leave (i.e., estimating the probability)

    Then:
    - Offer benefits such as a new phone only to customers most likely to disconnect
    - Develop new plans to fit customer needs
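Once a model produces a churn probability per customer, deciding who gets the offer can be framed as an expected-value test. A minimal sketch under simplifying assumptions: the probabilities and the offer cost are invented, the offer is assumed to always retain the customer, and only the slide's $320 acquisition cost is taken from the source:

```python
# Hypothetical churn probabilities, as a predictive model might output them.
predicted_churn = {"cust_a": 0.62, "cust_b": 0.08, "cust_c": 0.35, "cust_d": 0.02}

OFFER_COST = 100.0        # assumed cost of the retention offer (not from the slides)
REPLACEMENT_COST = 320.0  # slide's average cost of acquiring a new customer

def customers_to_target(probabilities, offer_cost, replacement_cost):
    """Target customers whose expected replacement cost exceeds the offer cost."""
    chosen = [
        customer
        for customer, p_leave in probabilities.items()
        if p_leave * replacement_cost > offer_cost
    ]
    # Most likely churners first.
    return sorted(chosen, key=probabilities.get, reverse=True)

print(customers_to_target(predicted_churn, OFFER_COST, REPLACEMENT_COST))
```

Compared with sending the incentive to everyone, this spends the offer only where the predicted loss outweighs its cost.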

  • 7/30/2019 S1 Introduction to Course

    66/102

    Phases in the DM Process: CRISP-DM

    - Business Understanding
    - Data Understanding
    - Data Preparation
    - Modeling
    - Evaluation
    - Deployment

    www.crisp-dm.org

  • 7/30/2019 S1 Introduction to Course

    67/102

    CRoss Industry Standard Process DM

    - Business Understanding: Understand project objectives and identify the data mining problem
    - Data Understanding: Capture, understand, and explore your data for quality issues
    - Data Preparation: Clean data, merge data, derive attributes, etc.
    - Modeling: Select the data mining techniques, build the model
    - Evaluation: Evaluate the results and approve models
    - Deployment: Put models into practice, monitoring and maintenance

    Case: Verizon Wireless

  • 7/30/2019 S1 Introduction to Course

    68/102

    Understanding The Business Problem and Data

    - IT brought idea to Marketing team and presented it as partnership
    - Marketing learned the modeling process as well as capabilities and weaknesses of modeling
    - IT learned the business processes and direct marketing strategies
    - Marketing recommended additions to attributes to use in building model

    Case: Verizon Wireless

  • 7/30/2019 S1 Introduction to Course

    69/102

    Modeling

    - Data Selection/Preparation
      - Included hundreds of basic attributes
      - Derived and Ratio fields added to enrich the model
    - Use predictive modeling technique to refine relationship between predictors and output of interest
    - Test Model: how will it perform in real life
    - Select the best models (accuracy, profitability, etc.)

    Case: Verizon Wireless

  • 7/30/2019 S1 Introduction to Course

    70/102

    Results: Marketing Campaigns using Predictive Modeling

    - Began with one campaign
      - 40-60K pieces per month
      - Very personalized unique offer
      - Approximately 15% take rate
    - Currently four main campaign types
      - 400,000 pieces/month
      - Up to 35% take rate of high churn risk customers

    Case: Verizon Wireless

  • 7/30/2019 S1 Introduction to Course

    71/102

    Deployment

    - Direct Mail and Telemarketing
      - Customized one-to-one mailings
    - Customer Care Application
      - Customer flagged by offer
      - Used by: Customer Service, Retail Channels
      - To catch customers that:
        - reps were unable to contact
        - call to disconnect

    Case: Verizon Wireless

  • 7/30/2019 S1 Introduction to Course

    72/102

    Benefits

    - Cost Reduction
      - Customers saved: up to 80% more takes
      - Direct Mail budget for same churner mailing reduced by 60%
    - Switched customers from analog to digital
    - Contract Renewals increased
    - Revenue Increase
      - Average monthly revenue increase per bill
      - Monthly usage increased

  • 7/30/2019 S1 Introduction to Course

    73/102

    Descriptive vs. Predictive Data Mining

    Descriptive DM is used to learn about and understand the data.

    Example: Identify and describe groups of customers with common buying behavior (Clustering)

  • 7/30/2019 S1 Introduction to Course

    74/102

    Example for Descriptive (Visualization) DM Using Customer Data

    Find groups of customers with similar buying patterns
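Finding groups of customers with similar buying patterns is a clustering task. A minimal one-dimensional k-means sketch in plain Python, offered as an illustration of the idea and not as the algorithm used for the slide's figure; the spending values and starting centers are invented:

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Keep the old center if a group ends up empty.
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    return centers, groups

# Hypothetical average order values for six customers.
spend = [12, 15, 14, 80, 85, 90]
centers, groups = kmeans_1d(spend, centers=[0.0, 100.0])
print(centers, groups)
```

Real customer data has many attributes per customer, so the distance would be computed over vectors, but the assign-then-average loop is the same.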

  • 7/30/2019 S1 Introduction to Course

    75/102

    Descriptive vs. Predictive Data Mining

    Predictive DM: Aims to build models in order to predict unknown values of interest.

    Examples:
    - A model that, given a customer's characteristics, predicts how much the customer will spend on the next catalog order.
    - A model that classifies credit applicants to determine whether or not an applicant will default on a loan.

    Most predictive models are also descriptive.

    Amount spent on catalog purchase = 0.001*(Annual_Income) + 0.3*(Num_Cards) + (1/Num_Orders)

    Example customer:
    - 35 years
    - Professional, 95K annual income
    - 2 children
    - 2 credit cards
    - 3 orders last year
    - Last purchase: 8 months ago
    - Average spending $30
    - Last purchase: $40

    Next Order: $40-$50
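The catalog-spend formula is a linear model with one reciprocal term, and evaluating it is a one-liner. A minimal sketch that takes the slide's coefficients as written; the function name is illustrative, and the inputs are the example customer's income, card count, and order count:

```python
def predicted_catalog_spend(annual_income, num_cards, num_orders):
    """Evaluate the slide's example formula; coefficients are taken as written."""
    return 0.001 * annual_income + 0.3 * num_cards + 1 / num_orders

amount = predicted_catalog_spend(annual_income=95_000, num_cards=2, num_orders=3)
print(round(amount, 2))
```

The coefficients appear to be illustrative, so the result need not line up with the next-order range shown for the example customer. What makes the model descriptive as well as predictive is that each coefficient can be read directly: income dominates, and more past orders slightly lowers the estimate.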

  • 7/30/2019 S1 Introduction to Course

    76/102

    What Data Mining Can and Cannot Do

    - Not a magic wand
      - No automatic solutions: data mining offers a set of tools and methodologies. Need to know how to utilize them.
      - Like any other powerful tool, can be very dangerous if not used properly.
      - Team work: cannot (always) replace skilled business analysts; needs guidance and validation of output

  • 7/30/2019 S1 Introduction to Course

    77/102

    What Can Go Wrong

    - Problem formulation
      - Need to understand the business well, good formulation of problem
    - Inappropriate use of methods
    - (And/Or) Lack of sufficient/high quality data
    - Computational issues
    - Evaluation
      - Need domain experts throughout the process to provide indispensable input and validate results

  • 7/30/2019 S1 Introduction to Course

    78/102

    What Can Go Wrong?

    - Inability to act upon pattern because of political or ethical reasons
      - Securities trading models
      - Data mining in clinical evaluation
      - Privacy (Insurance & credit, Doubleclick Inc.)
      - Admission interviews

  • 7/30/2019 S1 Introduction to Course

    79/102

    Data Mining v. Privacy

    There is often tension between data mining and personal privacy:

    http://www.aclu.org/pizza/images/screen.swf

  • 7/30/2019 S1 Introduction to Course

    80/102

    Risk v. Reward in Data Mining

    More data about more people in fewer places

  • 7/30/2019 S1 Introduction to Course

    81/102

    The risks of research

    My own personal story, or how a paper published in JCGS leads me to be connected to FBI wiretapping.

    - 2006: (chris v) Published papers on "Communities of Interest" using social networks and "Guilt by association" to catch fraud
    - 9 September 2007: NYT lead story "F.B.I. Data Mining Reached Beyond Initial Targets" discusses FBI techniques COI and GBA
    - 23 October 2007: Blogosphere erupts: "How AT&T Provides the FBI with Terror Suspect Leads"

  • 7/30/2019 S1 Introduction to Course

    82/102

    The risks of research

    Another story:

  • 7/30/2019 S1 Introduction to Course

    83/102


  • 7/30/2019 S1 Introduction to Course

    84/102

  • 7/30/2019 S1 Introduction to Course

    85/102


  • 7/30/2019 S1 Introduction to Course

    86/102

    86

    Wikileaks Visualizations

  • 7/30/2019 S1 Introduction to Course

    87/102


  • 7/30/2019 S1 Introduction to Course

    88/102

    The Good, The Bad, and the Maybe

    The question remains: how do we effectively leverage sensitive personal data for research purposes?

    Three case studies can give insight:
    - Netflix Prize
    - AOL search dataset
    - Barabasi mobile study

  • 7/30/2019 S1 Introduction to Course

    89/102

    Case Study 1: AOL Search Data

    - August 4, 2006: AOL releases 20M search terms by anonymized users for research purposes. Why?
    - Within hours, uproar on the blogs
      - "The utter stupidity of this is staggering" - TechCrunch
    - August 7: AOL removes data, issues apology
      - "this was a screw-up, and we are angry"
      - "an innocent enough attempt to reach out to the research community"
    - August 9: NYT front page story
      - Identifies Thelma Arnold, 62 year old widow

  • 7/30/2019 S1 Introduction to Course

    90/102

    Case Study 1: AOL Search Data

    - What's the big deal?
      - Ego searches make it easy to figure out who you are; combined with porn or illegal queries, they can make for serious privacy violations.
    - What went wrong
      - Not well thought out: risk >> reward
      - Poor internal controls on public data release
      - Lack of understanding of subject matter
      - Lack of understanding of anonymizing data
    - Fallout
      - CTO + at least two others fired
      - Data still out in the public
        - Is it ethical to study?
      - Inspiration for bad drama

    "purple lilac," "happy bunny pictures," "square dancing steps," "cut into your trachea," "pee fetish," "Simpsons incest."

  • 7/30/2019 S1 Introduction to Course

    91/102

    Case Study 2: Netflix Prize

    October 2006: Netflix releases anonymized movie ratings from its customer base

    100M ratings, 500K customers (

  • 7/30/2019 S1 Introduction to Course

    92/102

    Case Study 2: Netflix Prize

    - Narayanan and Shmatikov (2008)
      - "The adversary with a small amount of background knowledge about an individual can identify with high probability that individual's record in the data and learn sensitive attributes"
      - Claim that Netflix data sanitization not relevant
      - Accuse Netflix of violating Video Privacy Protection Act of 1988
    - Details:
      - With aux info on 8 movies, where 2 can be wrong, and dates are known within 14 days: 99% de-anonymization
      - Aux info can be gotten via web sites, water coolers, etc.
      - People might be willing to give away some ratings, but not others

  • 7/30/2019 S1 Introduction to Course

    93/102

    Case Study 2: Netflix Prize

    - Much ado about nothing
      - Although paper is technically correct, dates are key
        - Without dates, you must know 8 movies, all outside of the top 500, to get over 80% chance of de-anonymization
      - Auxiliary data very hard to come by
      - No known cases discovered
    - Netflix did it right
      - Consulted with top machine learning experts
      - 0 < risk

  • 7/30/2019 S1 Introduction to Course

    94/102

    Case Study 3: Barabasi Mobile Study

    - Gonzalez, Hidalgo and Barabasi (2008)
      - Article in Nature outlines study on human mobility patterns
      - 100,000 individuals selected randomly from dataset of 6 million
      - Unidentified country (unclear if the researchers knew)
      - Cell tower location at start of call
      - 206 individuals were pinged every two hours for a week
    - Findings: humans follow simple, reproducible patterns
      - Sample finding: "Nearly three-quarters of those studied mainly stayed within a 20-mile-wide circle for half a year."
      - Results could impact all phenomena driven by human mobility, from epidemic prevention to emergency response and urban planning.

  • 7/30/2019 S1 Introduction to Course

    95/102

    Case Study 3: Barabasi Mobile Study

    - Uproar ensued over "secret tracking" of cell phone users
      - Blowback of negative feedback to Nature and scientists
      - Study would be illegal in the US
      - Approval from ONR review board and Northeastern review board
      - Barabasi did not check with an ethics panel
    - Response
      - Hidalgo: "the data could be misused, but we were not trying to do evil things. We are trying to make the world a little better."
      - Northeastern and Nature backed the research
      - Continues to be referenced as an example of dangerous research
      - Risk and reward both very high

    Research Concepts: Privacy

  • 7/30/2019 S1 Introduction to Course

    96/102

    How do we guarantee that data is private?

    - Quasi-identifiers: combinations of attributes within the data that can be used to identify individuals.
      - E.g. 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code
    - Data sets are k-anonymous when, for any given quasi-identifier, a record is indistinguishable from k-1 others.
    - But, one step further, maybe all k have a given sensitive attribute!
      - The distribution of target values within a group is referred to as l-diversity.
    - Ways to fuzz data to increase anonymity and diversity:
      - Generalize/summarize the data: bin size, aggregate counts
      - Suppress or delete data
      - Perturb data
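Both privacy measures can be checked mechanically by grouping records on the quasi-identifier. A minimal sketch; the records and column names are invented, and l-diversity is taken in its simplest "number of distinct sensitive values per group" reading, which is one of several definitions in use:

```python
from collections import defaultdict

# Hypothetical records, invented for illustration (already generalized to birth year and 3-digit zip).
records = [
    {"gender": "F", "birth_year": 1970, "zip3": "079", "diagnosis": "flu"},
    {"gender": "F", "birth_year": 1970, "zip3": "079", "diagnosis": "asthma"},
    {"gender": "M", "birth_year": 1985, "zip3": "100", "diagnosis": "flu"},
    {"gender": "M", "birth_year": 1985, "zip3": "100", "diagnosis": "flu"},
]

def group_by_quasi_identifier(rows, quasi_identifier):
    """Bucket rows by their values on the quasi-identifier columns."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in quasi_identifier)
        groups[key].append(row)
    return groups

def k_anonymity(rows, quasi_identifier):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = group_by_quasi_identifier(rows, quasi_identifier)
    return min(len(group) for group in groups.values())

def distinct_l_diversity(rows, quasi_identifier, sensitive):
    """Fewest distinct sensitive values found in any quasi-identifier group."""
    groups = group_by_quasi_identifier(rows, quasi_identifier)
    return min(len({row[sensitive] for row in group}) for group in groups.values())

qi = ("gender", "birth_year", "zip3")
print(k_anonymity(records, qi), distinct_l_diversity(records, qi, "diagnosis"))
```

The toy table is 2-anonymous, yet one group shares a single diagnosis, which is exactly the "maybe all k have a given sensitive attribute" problem the slide raises.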

  • 7/30/2019 S1 Introduction to Course

    97/102

    Software

    - Can use any software you like:
      - Preferred: Weka
      - Also: R, SAS, SPSS, Systat, Enterprise Miner, Matlab, SQL Server
      - Maybe: Excel
    - What is R?
      - Open source statistical software grown out of S/S-plus
      - www.r-project.org
      - Packages at CRAN
      - R Tutorials available online (see website and CRAN)
      - Great graphics (with a bit of a learning curve)

  • 7/30/2019 S1 Introduction to Course

    98/102

    Resources

    - Data mining is a new field and as such, does not have authoritative texts (yet).
    - This class draws from many sources; best are:
      - Data Mining Techniques: For Marketing, Sales, and Customer Support, by Michael J. A. Berry, Gordon Linoff, published by John Wiley & Sons, Inc.
      - Elements of Statistical Learning: Hastie, Tibshirani, and Friedman
      - Handbook of Data Mining: Hand, Mannila and Smyth
      - Interactive and Dynamic Graphics for Data Analysis: Cook and Swayne
    - Also good class notes available from other classes:
      - David Madigan, Columbia
      - Di Cook, Iowa State
      - Padhraic Smyth, UC Irvine
      - Jiawei Han, Simon Fraser
      (see class website for pointers to these notes, or just Google them!)

    Assignment 1

  • 7/30/2019 S1 Introduction to Course

    99/102

    - By Monday (01/16/2013) midnight on Canvas
    - Confirm access to Canvas!
    - Required readings
    - Profiles will be posted on Canvas to facilitate group selection ASAP
    - Generate 3 potential classification (prediction) problems/ideas as part of Assignment 1 (start exploring publicly available data sets; projects from last year are available)

  • 7/30/2019 S1 Introduction to Course

    100/102

    Projects From Prior Years

  • 7/30/2019 S1 Introduction to Course

    101/102

    Sources:

    Andreas Weigend, Chris Volinsky

    S1: Introduction to the Course

  • 7/30/2019 S1 Introduction to Course

    102/102

    Shawndra Hill

    Spring 2013

    TR 1:30-3pm and 3-4:30