a data mining course using a kaggle competition · i started the competition in the final month...

24
. . . . . . A DATA MINING COURSE USING A KAGGLE COMPETITION Susan Holmes, Nelson Ray © April 24, 2011 ABabcdfghiejkl

Upload: others

Post on 16-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

A DATA MINING COURSEUSING A KAGGLE

COMPETITIONSusan Holmes, Nelson Ray ©

April 24, 2011

ABabcdfghiejkl

Page 2: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

HOW TO GET STARTED IN DATAMINING?

If you have plenty of time:

I Learn statisticsI Learn RI Learn linear algebraI Learn functional

analysis (kernels,..)I Learn data base

management.

Page 3: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

HOW TO GET STARTED IN DATAMINING?

If you have plenty of time:

I Learn statisticsI Learn RI Learn linear algebraI Learn functional

analysis (kernels,..)I Learn data base

management.

Page 4: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

BEST NEURAL NETWORKAVAILABLE

You have it:

Page 5: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

BEST NEURAL NETWORKAVAILABLE

You have it:

Page 6: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

BEST NEURAL NETWORKAVAILABLE

You have it:

Page 7: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

DATAMINING IN ONE COURSE

We need a hands on approach.

Page 8: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

HOW TO GET REALLY STARTED ?

Page 9: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

HOW TO GET REALLY STARTED ?

Page 10: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

AVOID BLACK BOXES

Page 11: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

LINEAR ALGEBRA: THE IGON VALUE PROBLEM

Malcolm Gladwell made an unfortunate error in`What the dog saw' (in his spelling of eigenvalue).Igon Value Problem is one of dilettantism.We have to understand the methods and not treat them asblack boxes.

Page 12: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

START WITH DATA YOU KNOW WELL

I Because you invented it (simulation studies).I It is your domain of expertise.I You have already studied it with other tools.

Page 13: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

START IN THE SUPERVISED CONTEXT

Because there is a gold standard for evaluation.

Page 14: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

FINISH OFF BY DOING A REAL COMPETITION

Data Mining competitions with Kaggle.Started at the middle of the course, students who ranked intop 3 didn't have to take the final.

Page 15: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

PICK AN INTERESTING DATASET

Page 16: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

WINE HAS MANY GREAT STORIES!

I Robert Parker is arguably themost influential wine critic.

I ``The Robert Parker Effect'':wines ranked 90 points costsignificantly more than thoseranked 89 points.

I Each spring since 1994 RobertParker evaluates many wines inBordeaux.

I In 2003, at the request of hisfamily, he didn't make the trip.

I Economists have estimated thatthis depressed wine prices byabout 15%.

Page 17: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

PLENTY OF STRUCTURED INFORMATION

Page 18: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

IN ONE TIDY PACKAGE

Page 19: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

KAGGLE-IN-CLASS

Page 20: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

THE COMPETITION PAGE

Page 21: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

THE RESULTS

Page 22: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

THE WINNERS REVEAL THEIR STRATEGY

Page 23: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

POST-MORTEM

I At least 20 teams from Stats 202 participated.I Teams were limited to three people.I Roughly 160 students in the class.I Contest ran for about a month with a one submission per

day limit.I The top 3 teams averaged more than one submission

every other day.I Received many thank yous from students for the

competition.

Page 24: A DATA MINING COURSE USING A KAGGLE COMPETITION · I Started the competition in the final month and had some integration with the class (displayed the leaderboard at the beginning

. . . . . .

IDEAS FOR IMPROVEMENT

I Started the competition in the final month and had someintegration with the class (displayed the leaderboard atthe beginning of class and had some homeworkassignments involving the data)

I Could start even earlier and provide sample code usingincreasingly sophisticated techniques (i.e. starting withlinear models and moving to basic text mining forfeatures in boosted trees).

I Code would be provided at regular intervals to allow newparticipants to easily catch up and push the leaders evenfurther.

I Have ``progress prizes'' so the top teams don't entirelyneglect studying for the final.