LIACS Data Mining course


Upload: shayla

Post on 10-Jan-2016



TRANSCRIPT

  • LIACS Data Mining course
    an introduction
    Arno Knobbe, Arne Koopman

  • Course Textbook
    Data Mining: Practical Machine Learning Tools and Techniques, second edition, Morgan Kaufmann, ISBN 0-12-088407-0, by Ian Witten and Eibe Frank

  • Course Information
    New website: http://www.liacs.nl/~akoopman/DaMi/
    Old website discontinued: http://www.liacs.nl/~joost/DM/CollegeDataMining.htm
    New lecturers, new style
    Old book. This may change next year
    Updated slides + some new material
    Practical exercises
    New style of exam
      fewer definitions, more understanding and applying
      old exams should not be used
      exam preparation (Dec 8) important

  • Course Outline
    08-Sep   Knobbe            today
    15-Sep   Knobbe
    22-Sep   Koopman
    29-Sep   Knobbe
    06-Oct   Knobbe            practical exercise
    13-Oct   Koopman
    20-Oct   Koopman
    27-Oct   Knobbe
    03-Nov   Knobbe            practical exercise
    10-Nov   Koopman           practical exercise
    17-Nov   no lecture!
    24-Nov   Koopman           start at 9:00!
    01-Dec   Koopman
    08-Dec   Koopman, Knobbe   exam preparation!
    14-Jan   10:00-13:00       exam

  • Introduction Data Mining
    an overview and some examples

  • Data Mining definitions
    Data Mining:

    the concept of extracting previously unknown and potentially useful information from large sets of data.

    secondary statistics: analyzing data that wasn't originally collected for analysis.

  • Data Mining, the big idea
    Organizations collect large amounts of data
    Often for administrative purposes
    Large body of experience
    Learning from experience

    Goals
      Prediction
      Optimization
      Forecasting
      Diagnostics

  • 2 Streams
    Mining for insight
      Understanding a domain
      Finding regularities between variables
      Goal of Data Mining is mostly undefined
      Interpretable models
      Examples: Medicine, production, maintenance

    Black-box Mining
      Don't care how you do it, just do it well
      Optimization
      Examples: Marketing, forecasting (financial, weather)

  • example: Direct Mail
    Optimize the response to a mailing, by targeting only those that are likely to respond:
      more response
      fewer letters
    [Diagram labels: customer information, response 3%; customer information, test mailing, final mailing, response 30%, remainder]
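
    The targeting idea on this slide can be sketched as follows. This is a minimal illustration, not the course's method: it assumes a response model has already been fitted on the test mailing and assigns every remaining customer a score. The customer ids, scores and responses are made up, and `select_targets` and `response_rate` are hypothetical helper names.

```python
def select_targets(scores, fraction):
    """Return the customer ids with the highest predicted response scores.

    scores: dict mapping customer id -> estimated response probability
            (hypothetical values, e.g. from a model fitted on the test mailing)
    fraction: share of the customers to include in the final mailing
    """
    n_select = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_select]


def response_rate(responded, mailed):
    """Fraction of the mailed customers who responded."""
    return sum(responded[c] for c in mailed) / len(mailed)


# Illustrative, made-up data: 10 customers with scores and actual responses.
scores = {"c01": 0.90, "c02": 0.05, "c03": 0.70, "c04": 0.02, "c05": 0.10,
          "c06": 0.03, "c07": 0.60, "c08": 0.01, "c09": 0.04, "c10": 0.08}
responded = {"c01": 1, "c02": 0, "c03": 1, "c04": 0, "c05": 0,
             "c06": 0, "c07": 0, "c08": 0, "c09": 0, "c10": 0}

targets = select_targets(scores, fraction=0.3)       # mail the top 30% only
baseline = response_rate(responded, list(scores))    # mail everyone
targeted = response_rate(responded, targets)
lift = targeted / baseline                           # > 1 means targeting helps
```

    The ratio of the two response rates is usually called the lift. The figures on the slide (3% untargeted, 30% in the final mailing) correspond to a lift of 10.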

  • example: Bioinformatics
    Find genes involved in disease (Parkinson's, Celiac, Neuroblastoma)
    Measurements from patients (1) and controls (0)
    Gene expression: measurements of 20k genes
    dataset 20,001 x 100

    Challenges
      many variables
      few examples (patients), testing is expensive
      interactions between genes
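
    One simple way to look for candidate genes in such a dataset is to score each gene on its own. The sketch below ranks genes by the absolute difference in mean expression between patients and controls. It is an assumption made for illustration, not the method used in the course; the gene names and measurements are made up, and `rank_genes` is a hypothetical helper.

```python
from statistics import mean


def rank_genes(expression, labels):
    """Rank genes by absolute difference in mean expression between classes.

    expression: dict gene name -> list of measurements, one per sample
    labels: class label per sample (1 = patient, 0 = control)
    Returns the gene names sorted from largest to smallest difference.
    """
    def score(values):
        patients = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        return abs(mean(patients) - mean(controls))

    return sorted(expression, key=lambda g: score(expression[g]), reverse=True)


# Toy, made-up data: 6 samples (3 patients, 3 controls), 4 hypothetical genes.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [5.1, 4.9, 5.0, 5.0, 5.2, 4.8],
    "gene_b": [8.0, 7.5, 8.5, 2.0, 2.5, 1.5],
    "gene_c": [3.0, 3.4, 3.2, 2.9, 3.0, 3.1],
    "gene_d": [1.0, 6.0, 2.0, 4.0, 1.5, 5.0],
}
ranking = rank_genes(expression, labels)
```

    A per-gene score like this ignores interactions between genes, and with about 20k genes and only 100 samples some genes will score high purely by chance. These are the challenges listed on the slide.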

  • Data Mining paradigms
    Classification
      binary class variable
      predict class of future cases
      most popular paradigm
    Clustering
      divide dataset into groups of similar cases
    Regression
      numeric target variable
    Association
      find dependencies between variables
      basket analysis, ...

  • Classification
    Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given).

  • Building (inducing) a decision tree

    Age  Gender  House  Price  Mortgage?
    21   M       Rent   -      No
    30   F       Rent   -      Yes
    40   M       Rent   -      No
    32   F       Buy    300K   No
    30   F       Rent   -      Yes
    55   M       Buy    260K   No
    25   F       Buy    180K   Yes
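
    Tree induction repeatedly picks the attribute that best separates the classes. The sketch below uses information gain, one common splitting criterion; the slides do not say which criterion the course uses, so treat that choice as an assumption. It evaluates the two categorical attributes on the seven examples from this slide. Prices are in thousands, and `entropy` and `information_gain` are illustrative helper names.

```python
from collections import Counter
from math import log2

# The seven training examples from the slide:
# (age, gender, house, price in thousands or None, mortgage).
ROWS = [
    (21, "M", "Rent", None, "No"),
    (30, "F", "Rent", None, "Yes"),
    (40, "M", "Rent", None, "No"),
    (32, "F", "Buy", 300, "No"),
    (30, "F", "Rent", None, "Yes"),
    (55, "M", "Buy", 260, "No"),
    (25, "F", "Buy", 180, "Yes"),
]
COLUMNS = {"Gender": 1, "House": 2}  # categorical attributes only


def entropy(classes):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(classes)
    return -sum((n / total) * log2(n / total) for n in Counter(classes).values())


def information_gain(rows, column):
    """Entropy reduction obtained by splitting the rows on one column."""
    before = entropy([r[-1] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[column], []).append(r[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


gains = {name: information_gain(ROWS, idx) for name, idx in COLUMNS.items()}
best = max(gains, key=gains.get)  # attribute with the largest gain
```

    On these rows Gender gives the larger gain (about 0.52 bits, against about 0.02 for House), because all three male customers fall in the No class. A full tree learner would repeat the same step inside each branch and would also consider thresholds on the numeric attributes Age and Price.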

  • Applying a classifier (decision tree)
    New customer: (House = Rent, Age = 32, ...)
    prediction = Yes

  • Graphical interpretation
    dataset with two variables + 1 class (+/-)
    graphical interpretation of decision tree

  • Graphical interpretation
    dataset with two variables + 1 class (+/-)
    other classifiers
      Support Vector Machine
      Neural Network

  • Applications of DM
    Marketing
      outgoing
      incoming
    Bioinformatics & Medicine
    Fraud detection
    Risk management
    Insurance
    Enterprise resource planning

  • Rhinoplastic surgery

    [Chart: histogram over VAE improvement. Axes: VAE improvement, nr. patients. Series: all patients, pre E1c > 3]

    [Chart: scatter plot pre E1c vs. VAE improvement. Axes: pre E1c, VAE improvement]

  • InfraWatch: monitoring of infrastructure
    Continuous monitoring of a large bridge, the Hollandse Brug
    145 sensors
    time-dependent, at frequencies up to 100 Hz
    multi-modal (sensor, video, different freq.)
    managing large data quantities, >1 Gb per day

  • InfraWatch: monitoring of infrastructure
    34 'geo-phones' (vibration sensors)
    44 embedded strain-gauges, 47 gauges outside
    20 thermometers
    video camera
    weather station
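
    The sensor counts on this slide add up to the 145 sensors quoted earlier. The short calculation below checks that, and derives an upper bound on the number of samples per day under the assumption (not made in the slides) that every sensor samples at the maximum rate of 100 Hz around the clock. The video camera and weather station are left out, and the number of bytes per sample is unknown, so no data volume is derived.

```python
# Sensor counts as listed on the slide.
sensors = {
    "geophones": 34,
    "embedded_strain_gauges": 44,
    "outside_strain_gauges": 47,
    "thermometers": 20,
}
total_sensors = sum(sensors.values())

MAX_RATE_HZ = 100                # "frequencies up to 100 Hz"
SECONDS_PER_DAY = 24 * 60 * 60

# Upper bound only: assumes all sensors run at the maximum rate all day.
max_samples_per_day = total_sensors * MAX_RATE_HZ * SECONDS_PER_DAY
```

    Under that assumption the bound is roughly 1.25 billion samples per day.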

  • InfraWatch sensors

  • Real-world application: Maintenance planning at KLM
    Routine checks of aircraft
    Maintenance requires up to 10k different parts
    Ordering parts incurs delay (costs) but so does stocking
    In theory 10k individual predictions
    Input
      maintenance history
      flight history, Sahara/North Pole
    Only few parts predictable

  • Cashflow Online
    Online personal finance overview
    All bank transactions are loaded into the application
    transactions are classified into different categories
    Data Mining predicts category
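
    Predicting a category from a transaction can be sketched with a very simple text classifier. The example below lets each word of the transaction description vote for the categories it was seen with in labelled transactions. This is an assumption made for illustration and not how Cashflow Online works; the descriptions are made up, the category names are taken from the slides, and `train` and `predict` are hypothetical helpers.

```python
from collections import Counter, defaultdict


def train(transactions):
    """Count how often each description token occurs with each category."""
    token_counts = defaultdict(Counter)
    for description, category in transactions:
        for token in description.lower().split():
            token_counts[token][category] += 1
    return token_counts


def predict(token_counts, description, default="Unknown"):
    """Return the category with the most token votes, or default if none."""
    votes = Counter()
    for token in description.lower().split():
        votes.update(token_counts.get(token, {}))
    return votes.most_common(1)[0][0] if votes else default


# Made-up labelled transactions; the descriptions are hypothetical.
labelled = [
    ("supermarket amsterdam pin", "Boodschappen"),
    ("supermarket leiden pin", "Boodschappen"),
    ("tankstation a4 brandstof", "Brandstof"),
    ("tennisclub contributie 2008", "Contributie"),
]
model = train(labelled)

groceries = predict(model, "supermarket utrecht pin")
fuel = predict(model, "tankstation utrecht")
unseen = predict(model, "onbekend")
```

    A real system would need far more labelled transactions per category and would probably use features beyond the description, such as amount or counterparty.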

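  • 67 Categories
    Gas Water Licht (gas, water, electricity)
    Onderhoud huis en tuin (home and garden maintenance)
    Telefoon + Internet + TV (telephone + internet + TV)
    Contributie (sport-)verenigingen (membership fees for (sports) clubs)
    Levensverzekering / Lijfrente (life insurance / annuity)
    Rente ontvangen (interest received)
    Boodschappen (groceries)
    Hypotheekrente (mortgage interest)
    Naar spaarrekening (to savings account)
    Geldopname/chipknip (cash withdrawal / Chipknip)
    Verzekeringen overig (other insurance)
    Loterijen (lotteries)
    Cadeau's (gifts)
    Interne boeking (internal transfer)
    Vakantie & Recreatie (holidays & recreation)
    Uitgaan, hobby's en sport (going out, hobbies and sports)
    Creditcard (credit card)
    Ziektekostenverzekering (health insurance)
    Brandstof (fuel)
    Woonhuis / Opstalverzekering (home / buildings insurance)
    Huishouden overig (other household)
    School- en Studiekosten (school and study costs)
    Inkomsten overig (other income)
    Kleding & Schoenen (clothing & shoes)
    Lenen (borrowing)
    Openbaar vervoer/Taxi (public transport / taxi)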

  • Fragmented results:
    Boodschappen (groceries)
    Contributie (membership fees)

  • Decision Tree over all categories
    [Diagram labels: true, false]

  • Data Mining at LIACS
    Applications
      bioinformatics (LUMC)
      law enforcement (KLPD, NFI)
      rhinoplastic surgery (NKI)
      Hollandse Brug (Strukton, RWS, Reef Infra)
    Complex data
      graphical data (molecules)
      relational data (criminal careers)
      stream data (sensor-data, click-streams)