LIACS Data Mining course
DESCRIPTION
LIACS Data Mining course, an introduction. Arno Knobbe, Arne Koopman. Course textbook: Data Mining: Practical Machine Learning Tools and Techniques, second edition, Morgan Kaufmann, ISBN 0-12-088407-0, by Ian Witten and Eibe Frank.
TRANSCRIPT
-
LIACS Data Mining course
an introduction
Arno Knobbe, Arne Koopman
-
Course Textbook
Data Mining: Practical Machine Learning Tools and Techniques
second edition, Morgan Kaufmann, ISBN 0-12-088407-0
by Ian Witten and Eibe Frank
-
Course Information
New website: http://www.liacs.nl/~akoopman/DaMi/
Old website discontinued: http://www.liacs.nl/~joost/DM/CollegeDataMining.htm
New lecturers, new style
Old book; this may change next year
Updated slides + some new material
Practical exercises
New style of exam
  fewer definitions, more understanding and applying
  old exams should not be used
  exam preparation (Dec 8) important
-
Course Outline
Date    Lecturer          Notes
08-Sep  Knobbe            today
15-Sep  Knobbe
22-Sep  Koopman
29-Sep  Knobbe
06-Oct  Knobbe            practical exercise
13-Oct  Koopman
20-Oct  Koopman
27-Oct  Knobbe
03-Nov  Knobbe            practical exercise
10-Nov  Koopman           practical exercise
17-Nov                    no lecture!
24-Nov  Koopman           start at 9:00!
01-Dec  Koopman
08-Dec  Koopman, Knobbe   exam preparation!
14-Jan, 10:00-13:00       exam
-
Introduction Data Mining
an overview and some examples
-
Data Mining definitions
Data Mining:
  the concept of extracting previously unknown and potentially useful information from large sets of data.
  secondary statistics: analyzing data that wasn't originally collected for analysis.
-
Data Mining, the big idea
Organizations collect large amounts of data
Often for administrative purposes
Large body of experience
Learning from experience
Goals
  Prediction
  Optimization
  Forecasting
  Diagnostics
-
2 Streams
Mining for insight
  Understanding a domain
  Finding regularities between variables
  Goal of Data Mining is mostly undefined
  Interpretable models
  Examples: Medicine, production, maintenance
Black-box Mining
  Don't care how you do it, just do it well
  Optimization
  Examples: Marketing, forecasting (financial, weather)
-
example: Direct Mail
Optimize the response to a mailing, by targeting only those that are likely to respond:
  more response
  fewer letters
[Diagram labels: customer information, response 3%; customer information, test mailing, final mailing, response 30%, remainder]
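The targeting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten customers, their `score` values (imagined to come from a model trained on the test mailing) and the helper names `response_rate` and `select_targets` are all hypothetical and do not come from the course material.

```python
def response_rate(customers):
    """Fraction of the given customers who responded to the mailing."""
    return sum(c["responded"] for c in customers) / len(customers)


def select_targets(customers, fraction):
    """Keep the top `fraction` of customers, ranked by predicted score."""
    ranked = sorted(customers, key=lambda c: c["score"], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical toy data: `score` stands in for a model's predicted
# likelihood of response; `responded` is the observed outcome.
CUSTOMERS = [
    {"id": 1, "score": 0.90, "responded": True},
    {"id": 2, "score": 0.80, "responded": True},
    {"id": 3, "score": 0.70, "responded": False},
    {"id": 4, "score": 0.60, "responded": True},
    {"id": 5, "score": 0.50, "responded": False},
    {"id": 6, "score": 0.40, "responded": False},
    {"id": 7, "score": 0.30, "responded": False},
    {"id": 8, "score": 0.20, "responded": False},
    {"id": 9, "score": 0.10, "responded": False},
    {"id": 10, "score": 0.05, "responded": False},
]

baseline_rate = response_rate(CUSTOMERS)      # mail everyone
targets = select_targets(CUSTOMERS, 0.3)      # mail only the top 30%
targeted_rate = response_rate(targets)

print(f"baseline: {baseline_rate:.2f}, targeted: {targeted_rate:.2f}, "
      f"letters: {len(targets)} instead of {len(CUSTOMERS)}")
```

Mailing only the highest-scored customers sends fewer letters while raising the response rate among those mailed, which is the trade-off the slide describes.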
-
example: Bioinformatics
Find genes involved in disease (Parkinson's, Celiac, Neuroblastoma)
Measurements from patients (1) and controls (0)
Gene expression: measurements of 20k genes
dataset 20,001 x 100
Challenges
  many variables
  few examples (patients), testing is expensive
  interactions between genes
-
Data Mining paradigms
Classification
  binary class variable
  predict class of future cases
  most popular paradigm
Clustering
  divide dataset into groups of similar cases
Regression
  numeric target variable
Association
  find dependencies between variables
  basket analysis, ...
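For the association paradigm, basket analysis is usually described in terms of the support and confidence of a rule. The sketch below computes both for a rule "bread -> butter" on four made-up shopping baskets; the baskets and the function names are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical shopping baskets, one set of items per transaction.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "butter"}, BASKETS)
rule_confidence = confidence({"bread"}, {"butter"}, BASKETS)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule with high support and high confidence points to a dependency between variables of the kind this paradigm looks for.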
-
Classification
Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given).
-
Building (inducing) a decision tree
Age  Gender  House  Price  Mortgage?
21   M       Rent   -      No
30   F       Rent   -      Yes
40   M       Rent   -      No
32   F       Buy    300K   No
30   F       Rent   -      Yes
55   M       Buy    260K   No
25   F       Buy    180K   Yes
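One common way to induce a tree is to pick, at each node, the attribute with the highest information gain. The sketch below applies that criterion to the seven customers in the table, for the two categorical attributes only. Whether the lecture uses exactly this criterion is an assumption; the function names and the encoding of the rows are illustrative.

```python
from collections import Counter
from math import log2

# The seven training rows from the slide ("-" prices encoded as None).
ROWS = [
    {"Age": 21, "Gender": "M", "House": "Rent", "Price": None, "Mortgage": "No"},
    {"Age": 30, "Gender": "F", "House": "Rent", "Price": None, "Mortgage": "Yes"},
    {"Age": 40, "Gender": "M", "House": "Rent", "Price": None, "Mortgage": "No"},
    {"Age": 32, "Gender": "F", "House": "Buy", "Price": 300, "Mortgage": "No"},
    {"Age": 30, "Gender": "F", "House": "Rent", "Price": None, "Mortgage": "Yes"},
    {"Age": 55, "Gender": "M", "House": "Buy", "Price": 260, "Mortgage": "No"},
    {"Age": 25, "Gender": "F", "House": "Buy", "Price": 180, "Mortgage": "Yes"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="Mortgage"):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {a: information_gain(ROWS, a) for a in ("Gender", "House")}
best_attribute = max(gains, key=gains.get)
print(gains, "-> split first on", best_attribute)
```

Numeric attributes such as Age and Price would additionally need a threshold search, which this sketch leaves out.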
-
Applying a classifier (decision tree)
New customer: (House = Rent, Age = 32, ...)
prediction = Yes
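Applying a tree means walking from the root to a leaf, following the branch that matches the new case at each node. The tree below is hypothetical: it was written by hand so that it agrees with the seven training customers and with the prediction on this slide, and it is not necessarily the tree shown in the lecture. The node encoding and the `predict` function are likewise assumptions made for illustration.

```python
# Node encoding (an assumption of this sketch):
#   "Yes" / "No"                          -> leaf with a class label
#   ("cat", attribute, {value: subtree})  -> split on a categorical attribute
#   ("num", attribute, t, low, high)      -> low if value <= t, else high
TREE = ("cat", "House", {
    "Rent": ("num", "Age", 25, "No", ("num", "Age", 35, "Yes", "No")),
    "Buy": ("num", "Age", 28, "Yes", "No"),
})


def predict(node, customer):
    """Follow the matching branches until a leaf (class label) is reached."""
    while not isinstance(node, str):
        if node[0] == "cat":
            _, attribute, branches = node
            node = branches[customer[attribute]]
        else:
            _, attribute, threshold, low, high = node
            node = low if customer[attribute] <= threshold else high
    return node


new_customer = {"House": "Rent", "Age": 32}
prediction = predict(TREE, new_customer)
print("prediction =", prediction)
```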
-
Graphical interpretation
dataset with two variables + 1 class (+/-)
graphical interpretation of decision tree
-
Graphical interpretation
dataset with two variables + 1 class (+/-)
other classifiers
  Support Vector Machine
  Neural Network
-
Applications of DM
Marketing
  outgoing
  incoming
Bioinformatics & Medicine
Fraud detection
Risk management
Insurance
Enterprise resource planning
-
Rhinoplastic surgery
[Figure: histogram over VAE improvement; x-axis: VAE improvement, y-axis: nr. patients; series: all patients, pre E1c > 3]
[Figure: scatter plot pre E1c vs. VAE improvement]
-
InfraWatch: monitoring of infrastructure
Continuous monitoring of a large bridge, the Hollandse Brug
145 sensors
time-dependent, at frequencies up to 100 Hz
multi-modal (sensor, video, different freq.)
managing large data quantities, >1 Gb per day
-
InfraWatch: monitoring of infrastructure
34 'geo-phones' (vibration sensors)
44 embedded strain-gauges, 47 gauges outside
20 thermometers
video camera
weather station
-
InfraWatch sensors
-
Real-world application: Maintenance planning at KLM
Routine checks of aircraft
Maintenance requires up to 10k different parts
Ordering parts incurs delay (costs) but so does stocking
In theory 10k individual predictions
Input
  maintenance history
  flight history, Sahara/North Pole
Only few parts predictable
-
Cashflow Online
Online personal finance overview
All bank transactions are loaded into the application
transactions are classified into different categories
Data Mining predicts category
-
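67 Categories
Gas Water Licht (gas, water, electricity)
Onderhoud huis en tuin (house and garden maintenance)
Telefoon + Internet + TV (telephone + internet + TV)
Contributie (sport-)verenigingen (membership fees for (sports) clubs)
Levensverzekering / Lijfrente (life insurance / annuity)
Rente ontvangen (interest received)
Boodschappen (groceries)
Hypotheekrente (mortgage interest)
Naar spaarrekening (to savings account)
Geldopname/chipknip (cash withdrawal / Chipknip)
Verzekeringen overig (other insurance)
Loterijen (lotteries)
Cadeau's (gifts)
Interne boeking (internal transfer)
Vakantie & Recreatie (holidays & recreation)
Uitgaan, hobby's en sport (going out, hobbies and sports)
Creditcard (credit card)
Ziektekostenverzekering (health insurance)
Brandstof (fuel)
Woonhuis / Opstalverzekering (home / buildings insurance)
Huishouden overig (other household expenses)
School- en Studiekosten (school and study costs)
Inkomsten overig (other income)
Kleding & Schoenen (clothing & shoes)
Lenen (loans)
Openbaar vervoer/Taxi (public transport / taxi)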
-
Fragmented results:
Boodschappen (groceries)
Contributie (membership fees)
-
Decision Tree over all categories
[Figure: decision tree; branch labels: true, false]
-
Data Mining at LIACS
Applications
  bioinformatics (LUMC)
  law enforcement (KLPD, NFI)
  rhinoplastic surgery (NKI)
  Hollandse Brug (Strukton, RWS, Reef Infra)
Complex data
  graphical data (molecules)
  relational data (criminal careers)
  stream data (sensor-data, click-streams)