machine learning assists the classification of reports by...

20
Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying Mosquitoes Antonio Rodriguez 1 Frederic Bartumeus 2,3,4 Ricard Gavald` a 1 Universitat Polit` ecnica de Catalunya, Barcelona (Spain) Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain) CREAF, Cerdanyola del Vall` es, 08193 Barcelona (Spain) ICREA, Pg Llu´ ıs Companys 23, 08010 Barcelona (Spain) Workshop on Data Science for Social Good, SoGood September 2016 Antonio Rodriguez, Frederic Bartumeus, Ricard Gavald` ecnica de Catalunya, Barcelona (Spain), Centre for Advanced S Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying Mos Workshop on Data Science for Social Good, S / 20

Upload: others

Post on 14-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Machine Learning Assists the Classification of Reports byCitizens on Disease-Carrying Mosquitoes

Antonio Rodriguez1 Frederic Bartumeus2,3,4 Ricard Gavalda1

Universitat Politecnica de Catalunya, Barcelona (Spain)

Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain)

CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain)

ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain)

Workshop on Data Science for Social Good, SoGoodSeptember 2016

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 1

/ 20

Page 2: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Overview

1 Introduction

2 Methodology

3 Project developmentExploratory data analysisData cleaning and pre-processingClassifier training, evaluation and selectionReal-time classification system design

4 Discussion

5 Future work

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 2

/ 20

Page 3: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Introduction - Mosquito Alert

Citizen Science Platform

Mobile application

Growing fast

Various mosquito speciesWorldwide localizations

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 3

/ 20

Page 4: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Introduction - Mobile App

Send breeding site

Send specimen report

Small questionnaire

Geolocated!

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 4

/ 20

Page 5: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Introduction - Mosquito Alert System

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 5

/ 20

Page 6: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Introduction - Classification system

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 6

/ 20

Page 7: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Methodology

1 Exploratory data analysis

2 Data cleaning and pre-processing3 Classifiers

trainingevaluationselection

4 Real-time classification system design

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 7

/ 20

Page 8: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Exploratory data analysis - Raw files

users 16967 observations of 10 variables

userID

userRegistTimeOriginal

userRegistDatetime

userRegistDate

userRegistMonthNum

userRegistMonthString

userRegistWeekdayString

userRegistWeekdayNum

userSyst

userDaysSystRelease

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 8

/ 20

Page 9: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Exploratory data analysis - Raw files

reports 10618 observations of 23+1 variables

reportVersionID

reportVersionNum

userID

reportID

reportType

reportNote

os

hide

reportCreationDatetime

reportCreationDate

reportVersionDatetime

reportVersionDate

reportCreationMonthNum

reportCreationMonthString

reportCreationWeekdayString

reportCreationWeekdayNum

reportLong

reportLat

missionNum

missionName

tiger q1 response

tiger q2 response

tiger q3 response

class

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 9

/ 20

Page 10: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Questionnaire variables

QuestionsIs small, black and has white stripes?

Has a white stripe in both head andthorax?

Has white stripes in both abdomen andlegs?

Response values-1 No

0 Not sure

1 Yes

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 10

/ 20

Page 11: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

The class variable

-2 The report is definitely not a valid specimen.

-1 The report doesn’t seem to be a valid specimen. But it isnot sure.

0 There isn’t enough information to classify the report.

1 The report seems to be a valid specimen. But it is not sure.

2 The report is definitely a valid specimen.

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 11

/ 20

Page 12: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Instance variables

Added

reportNote

reportTimeOfDay.

newUser

userNumReports

userAccuracy

userTimeForFirstReport

userTimeSinceLastReport

userMeanTimeBetweenReports

userNumActionAreas

userMobilityIndex

reports1kmLast* (4)

validReports1kmLast* (4)

Preserved

os

reportMonth

reportQ*Answ (3)

class

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 12

/ 20

Page 13: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Generated instances

2094 instances from usable reports

Class 2 1 −1 −2Frequency 47% 46% 2% 5%

Class-imbalanced problem: positive instances over 7 times as frequentas negative ones.

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 13

/ 20

Page 14: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Studied classifiers

Naive Bayes

k-nearest neighbors

Decision trees

Random Forests

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 14

/ 20

Page 15: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Classifiers - Considerations

Most classifiers have trouble dealing with imbalanced classes

Merged “unsure” (-1,1) classes into “sure” ones (-2,2)

Replication of minority class performed

. . . but testing still on original proportion

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 15

/ 20

Page 16: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Classifiers - Selected classifier

Positive NegativeAccuracy 0,380Precision 0,983 0,086

Recall 0,344 0,912F-measure (F1) 0,51 0,157

Table: Evaluation metrics, Naive Bayes

Naive Bayes

Training conditions:

Aggregated instancesReplicated (x10)negatives in training

High positive Precision

High negative Recall

can detect approximately1 third of the validreports with a precisionnear 98%

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 16

/ 20

Page 17: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

ROC curve and variable importance

Variable name ImportancereportQ2Answ 0.7424reportQ3Answ 0.7038reports1kmLastMonth 0.6623reportQ1Answ 0.6615userNumReports 0.6405userNumActionAreas 0.6348validReports1kmLastMonth 0.6216userTimeForFirstReport 0.6197reports1kmLastWeek 0.6158userAccuracy 0.6085

Table: Variable importance in the NBclassifier. Numbers are the values of themodel coefficients after standarization.

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 17

/ 20

Page 18: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Real-time classification system design

Two subsystems:Instance generation system

Instance creation scriptEnvironment

Classification system

Training scriptClassifierClassification script

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 18

/ 20

Page 19: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Future work

ScalabilityCode modifications

GIS enabled database

Approximately the samecomputational resources

ImprovementsClassifier tuning

Priority system

Another classifier: RandomForest

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 19

/ 20

Page 20: Machine Learning Assists the Classification of Reports by …gavalda/papers/sogood2016-slides.pdf · 2016-09-17 · Machine Learning Assists the Classi cation of Reports by Citizens

Machine Learning Assists the Classification of Reports byCitizens on Disease-Carrying Mosquitoes

Antonio Rodriguez1 Frederic Bartumeus2,3,4 Ricard Gavalda1

Universitat Politecnica de Catalunya, Barcelona (Spain)

Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain)

CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain)

ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain)

Workshop on Data Science for Social Good, SoGoodSeptember 2016

Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain), Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain), CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain), ICREA, Pg Lluıs Companys 23, 08010 Barcelona (Spain) )Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying MosquitoesWorkshop on Data Science for Social Good, SoGood September 2016 20

/ 20