My first attempt at Kaggle - Higgs Machine Learning Challenge: 755th and proud!

Post on 14-Jun-2015


Category:

Data & Analytics


DESCRIPTION

The Higgs Machine Learning Challenge is not only a place for PhDs! As an undergraduate with a student license of MATLAB and a couple of dollars for Amazon AWS, I entered in the last 8 days of the challenge and overtook more than half of the competitors! In this talk, I'll present the challenge, explain my approach, and walk through the code.

TRANSCRIPT

Higgs Challenge
My first attempt at Kaggle: 755th and proud!

@dhianadeva

Err… Kaggle?!
Platform for data science competitions

Machine Learning, Big Data, Statistics, Data mining ...

Community for data scientists
Users, leaderboard, forums …

Sponsors!

$$$ponsored competitions!

We don’t need no PhD!

Yes, we can!
My guilty pleasure:

Student license of MATLAB <3

Open source alternatives:
Python + Scikit + Numpy + …
R + randomForest + e1071 + caret + …
Octave!?

Higgs Challenge

Datasets
Training (labeled):
250k events
30 features
Event id, weight and class (s/b)

Test (unlabeled):
18% Public (500k events)
82% Private

training.csv
EventId , DER_mass_MMC , … , Weight , Class
100000 , 138.47 , … , 0.00265331133733 , s
100001 , 160.937 , … , 2.23358448717 , b
100002 , -999.0 , … , 2.34738894364 , b
100003 , 143.905 , … , 5.44637821192 , b
…

test.csv
EventId , DER_mass_MMC , … , PRI_jet_all_pt
350000 , -999 , … , -0.0
350001 , 106.398 , … , 47.575
350002 , 117.794 , … , 0.0
350003 , 135.861 , … , 0.0
…
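The -999 entries in the sample rows are the dataset's placeholder for values that could not be computed for an event. A rough Python sketch of one way to deal with them, loosely in the spirit of MATLAB's fixunknowns (which appears later in the talk): replace each placeholder with the mean of the known values in its column and append a 0/1 "known" flag column. This is not the author's code; the helper name `fix_unknowns` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

MISSING = -999.0  # placeholder for undefined values in training.csv / test.csv


def fix_unknowns(X, missing=MISSING):
    """Replace placeholders by column means and append 'known' flag columns.

    Hypothetical helper, only loosely modelled on MATLAB's fixunknowns.
    """
    X = np.asarray(X, dtype=float)
    known = X != missing
    filled = X.copy()
    flags = []
    for j in range(X.shape[1]):
        col_known = known[:, j]
        if col_known.all():
            continue  # nothing to fix in this column
        # Mean of the known values (0.0 if the whole column is missing).
        mean = X[col_known, j].mean() if col_known.any() else 0.0
        filled[~col_known, j] = mean
        flags.append(col_known.astype(float))  # 1 = known, 0 = was missing
    if flags:
        filled = np.column_stack([filled] + flags)
    return filled


# Toy matrix: first column has one missing value, second column has none.
X = np.array([[138.47, 1.0],
              [-999.0, 2.0],
              [143.905, 3.0]])
X_fixed = fix_unknowns(X)
print(X_fixed)
```

The flag column lets a model learn that "this value was missing" can itself be informative, rather than hiding that fact behind the imputed mean.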

submission.csv
EventId , RankOrder , Class
350000 , 262328 , b
350001 , 201479 , b
350002 , 212810 , b
350003 , 134945 , b
…
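A submission in this format can be built from classifier scores along the following lines. This is a Python sketch rather than the author's MATLAB code, and it rests on one assumption: that RankOrder is a permutation of 1..N in which the most signal-like event gets the largest rank. The function name `make_submission`, the scores, and the threshold value are illustrative.

```python
import csv
import io

import numpy as np


def make_submission(event_ids, scores, threshold):
    """Return (EventId, RankOrder, Class) rows from classifier scores.

    Assumption: RankOrder is a permutation of 1..N and the most
    signal-like event receives the largest rank.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")  # least signal-like first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    classes = np.where(scores > threshold, "s", "b")
    return [(int(e), int(r), str(c))
            for e, r, c in zip(event_ids, ranks, classes)]


# Made-up scores for the four event ids shown in the sample file.
rows = make_submission([350000, 350001, 350002, 350003],
                       [0.10, 0.95, 0.40, 0.85], threshold=0.8)

# Write to an in-memory buffer; swap in open("submission.csv", "w") for a file.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["EventId", "RankOrder", "Class"])
writer.writerows(rows)
submission_text = buf.getvalue()
print(submission_text)
```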

End-to-end

A little math...

(Approximate Median Significance)
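The formula on this slide did not survive in the transcript. As defined by the challenge, AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s)), where s and b are the summed weights of the selected events that are truly signal and truly background, and b_reg = 10 is a regularization constant. A small Python sketch of that definition follows; the function names and the toy labels and weights are illustrative assumptions.

```python
import math

B_REG = 10.0  # regularization term from the challenge's AMS definition


def ams(s, b, b_reg=B_REG):
    """Approximate Median Significance.

    s: summed weight of selected events that are truly signal
    b: summed weight of selected events that are truly background
    """
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )


def ams_from_predictions(y_true, y_pred, weights):
    """Compute AMS from true labels, predicted labels and event weights."""
    s = sum(w for t, p, w in zip(y_true, y_pred, weights)
            if p == "s" and t == "s")
    b = sum(w for t, p, w in zip(y_true, y_pred, weights)
            if p == "s" and t == "b")
    return ams(s, b)


# Toy example: two events are selected as signal, one of them wrongly.
score = ams_from_predictions(["s", "b", "s", "b"],
                             ["s", "s", "b", "b"],
                             [2.0, 3.0, 1.0, 4.0])
print(score)
```

Only events predicted as signal enter the score, so the metric rewards a selection that is rich in true signal weight and poor in background weight.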

755th/1785 secrets
I entered in the last 8 days of the 127-day challenge and overtook more than half of the competitors using:

MATLAB 2014b (student license)
Neural Network Toolbox
$20 of EC2 at Amazon Web Services
9 code files totaling 674 words

Neural netwhat?!

Neurons
(Diagram: inputs → output)

For now, a black box!
(Diagram: inputs → output)

It trains
(Diagram: inputs → output, compared with the target to give the error)

It runs
(Diagram: inputs → output)
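The train/run distinction can be illustrated with a toy example. The Python sketch below trains a single sigmoid neuron on the logical OR function with plain gradient descent, then runs it on the same inputs. It is a stand-in chosen for brevity, not the network from the talk (which used MATLAB's toolbox and trainlm); all names, the learning rate, and the epoch count are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def run(inputs, weights, bias):
    """'It runs': map inputs to an output between 0 and 1."""
    return sigmoid(inputs @ weights + bias)


def train(inputs, targets, epochs=5000, lr=0.5):
    """'It trains': adjust weights to shrink the error (output - target)."""
    weights = np.zeros(inputs.shape[1])
    bias = 0.0
    for _ in range(epochs):
        output = run(inputs, weights, bias)
        error = output - targets
        # Gradient of the cross-entropy loss for a sigmoid neuron.
        weights -= lr * inputs.T @ error / len(targets)
        bias -= lr * error.mean()
    return weights, bias


# Logical OR: the target is 1 whenever at least one input is 1.
inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
targets = np.array([0.0, 1.0, 1.0, 1.0])

weights, bias = train(inputs, targets)
predictions = (run(inputs, weights, bias) > 0.5).astype(int)
print(predictions)
```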

Moonlighting!

1. nprtool
2. fixunknowns
3. trainlm
4. processpca
5. 0.8 threshold
6. ams threshold pick
7. hidden neurons pick
8. 0.25*targets + amsweights
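Step 6 presumably means replacing the fixed 0.8 cut with the decision threshold that gives the best AMS on labeled data. A minimal Python sketch of that idea follows, under that assumption; the function `pick_threshold`, the candidate thresholds, and the toy scores, labels and weights are all made up for illustration and are not taken from the talk.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate Median Significance with the challenge's b_reg = 10."""
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )


def pick_threshold(scores, labels, weights, candidates):
    """Return (best_threshold, best_ams) over the candidate thresholds."""
    best = (None, -1.0)
    for t in candidates:
        # Events scoring above t are selected as signal.
        s = sum(w for sc, y, w in zip(scores, labels, weights)
                if sc > t and y == "s")
        b = sum(w for sc, y, w in zip(scores, labels, weights)
                if sc > t and y == "b")
        value = ams(s, b)
        if value > best[1]:
            best = (t, value)
    return best


# Toy validation set: network outputs, true classes and event weights.
scores = [0.95, 0.90, 0.70, 0.60, 0.30, 0.10]
labels = ["s", "s", "b", "s", "b", "b"]
weights = [1.0, 1.0, 5.0, 1.0, 5.0, 5.0]
candidates = [0.0, 0.2, 0.5, 0.65, 0.8]

best_t, best_ams = pick_threshold(scores, labels, weights, candidates)
print(best_t, best_ams)
```

On this toy data the scan prefers a threshold other than 0.8, which is the kind of gain a data-driven pick is meant to capture. In practice the scan should run on held-out events so the chosen threshold is not tuned to the training set.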

8 days!

Some stats...

Day 1

Day 2

Day 3

Day 4

Day 5

Day 6

Day 7

Day 8

Oops!
(weighted errors using ams, regularization, mapstd, … nothing worked!)

Lessons learned
+ Optimize self-learning by doing things from scratch (or from a default baseline)

+ Kaggle is way more fun than studying with traditional datasets (iris, cancer, thyroid...)

+ Data science needs good engineering practices!

+ The competition fact sheet was a great way of assessing what I know I know, what I know I don’t know…

Let’s hack?!
Re-considering PCA
PCD?
Dimensionality Reduction
Stop on best AMS (hack nn toolbox!)
Ensemble
Auto-encoder
MATLAB unit tests
MATLAB continuous integration

Thanks! ;)
