Big Data: Predicting Rent in London by Machine Learning Manabu Mukohyoshi


Post on 17-Aug-2015


Page 1: FinalPresentation-GradProject

Big Data: Predicting Rent in London by Machine Learning

Manabu Mukohyoshi

Page 2: FinalPresentation-GradProject

Motivation

• Interested in Machine Learning
• Wide range of Machine Learning applications in use
• Data-driven cities: City slicker - Data are slowly changing the way cities operate (The Economist)

Page 3: FinalPresentation-GradProject

Initial ideas

• Predict fires to dispatch ambulances efficiently
• Predict crimes to dispatch police cars efficiently
• Predict energy consumption (gas, electricity, etc.)
• Predict increase of waste using population
• Predict emission of carbon dioxide
• Predict the rise of rents and house prices using economics and population data
• Map Londoners’ health on to the map of London
• Predict happiness by region
• Predict congestion

Page 4: FinalPresentation-GradProject

Number of Fires by Ward

Page 5: FinalPresentation-GradProject

Number of Fires by Borough

Page 6: FinalPresentation-GradProject

Number of Fires by hour

Page 7: FinalPresentation-GradProject

# of Fires

Page 8: FinalPresentation-GradProject

First Arrival Time

Page 9: FinalPresentation-GradProject

First Arrival Time and Fire Stations

Page 10: FinalPresentation-GradProject

Initial ideas

• London Datastore has a variety of data
  – Mostly statistics
  – Not a lot of individual data
• What to learn?

Page 11: FinalPresentation-GradProject

The Idea

• Rent Prediction in London by Machine Learning
  – Can retrieve individual rent data from Zoopla
  – Rent keeps changing and it is hard to know if the rent is right for the place
• For landlords, it can be a standard to decide rent
• For tenants, it can be a standard to judge rent
• For Zoopla, it can attract more customers

Page 12: FinalPresentation-GradProject

Data Source

• Zoopla (about 45,000 examples)
  – Latitude, Longitude, # of bedrooms, # of bathrooms, # of floors, # of receptions, property type, price
• Walkscore
  – Calculates a score for an address based on how walkable it is (close to grocery stores, restaurants, cafes, etc.)
• MapIt
  – Converts Latitude/Longitude to Ward and Borough code

Page 13: FinalPresentation-GradProject

Data Source

• London Datastore
  – Ward profile
    • Mean Age, Population density, % Not Born in UK, General Fertility Rate, Male life expectancy, Female life expectancy, % children in year 6 who are obese, Rate of All Ambulance Incidents per 1,000 population, Employment rate (16-74), Median House Price, Number of properties sold, % Households Social Rented, % Households Private Rented, % dwellings in council tax bands A or B, % dwellings in council tax bands C, D or E, % dwellings in council tax bands F, G or H, Claimant Rate of Income Support, % with no qualifications, % with Level 4 qualifications and above, Crime rate, Deliberate Fires, Cars per household, Average Public Transport Accessibility score, Turnout at Mayoral election - 2012
  – Borough profile
    • Total carbon emissions, Teenage conception rate, Life satisfaction score, Worthwhileness score, Happiness score, Anxiety score

Page 14: FinalPresentation-GradProject

Steps to solve

1. Collect and combine data
2. Preprocess data
3. Try different algorithms of machine learning on the collected data
4. Tune the parameters of ML algorithms
5. Evaluate the results and algorithms

Page 15: FinalPresentation-GradProject

Step 1: Collect and Combine Data

1. Download listings data using Zoopla API
2. Get Walkscore using the API
3. Convert Longitude/Latitude to ward and borough code using self-hosted MapIt
4. Merge ward and borough profile downloaded from London Datastore to listings data

MapIt: UK
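The merge in step 4 can be sketched as a left join on ward and borough code. This is a minimal illustration only: the field names (`ward_code`, `mean_age`, and so on), the codes, and all values below are invented for the example and are not taken from the project's data.

```python
# Hypothetical listings, each already tagged with ward and borough codes.
listings = [
    {"listing_id": 1, "bedrooms": 2, "ward_code": "W1", "borough_code": "B1"},
    {"listing_id": 2, "bedrooms": 1, "ward_code": "W2", "borough_code": "B1"},
    {"listing_id": 3, "bedrooms": 3, "ward_code": "W9", "borough_code": "B2"},
]

# Hypothetical profile tables keyed by code (values are made up).
ward_profiles = {
    "W1": {"mean_age": 34.1, "crime_rate": 80.2},
    "W2": {"mean_age": 38.7, "crime_rate": 55.0},
}
borough_profiles = {
    "B1": {"happiness_score": 7.2},
}

def merge_profiles(listings, ward_profiles, borough_profiles):
    """Left-join ward and borough profile columns onto each listing."""
    merged = []
    for listing in listings:
        row = dict(listing)  # copy so the input is not modified
        row.update(ward_profiles.get(listing["ward_code"], {}))
        row.update(borough_profiles.get(listing["borough_code"], {}))
        merged.append(row)
    return merged

merged = merge_profiles(listings, ward_profiles, borough_profiles)
```

A listing whose code has no matching profile keeps its own columns and simply lacks the profile columns, which leaves the gap visible for the imputation step.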

Page 16: FinalPresentation-GradProject

Step 2: Preprocess Data

• Scale (bias elimination)
• Encode categorical features
• Impute
  – Replace n/a or space with mean
• Shuffle
• Split into training dataset and test dataset (cross validation)
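The preprocessing steps above can be sketched with the Python standard library. This is an illustration under assumptions, not the project's code: the toy columns are invented, and the project itself used scikit-learn, whose preprocessing utilities would normally replace these helpers.

```python
import random
from statistics import mean, pstdev

def impute_mean(column):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in column if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in column]

def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]

def one_hot(column):
    """Encode a categorical column as one 0/1 indicator list per row."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]

def shuffle_and_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly, then split into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Hypothetical toy columns: bedrooms (one value missing) and property type.
bedrooms = [1, 2, None, 3]
property_type = ["flat", "house", "flat", "studio"]

bedrooms_filled = impute_mean(bedrooms)
bedrooms_scaled = standardize(bedrooms_filled)
type_encoded = one_hot(property_type)  # indicator order: flat, house, studio
rows = [[b] + t for b, t in zip(bedrooms_scaled, type_encoded)]
train_rows, test_rows = shuffle_and_split(rows, test_fraction=0.25)
```

For brevity the sketch computes the mean and standard deviation on the full column. A stricter setup would compute them on the training split only and reuse them on the test split.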

Page 17: FinalPresentation-GradProject

Step 3: Try Different Algorithms

Name                                       Average MSE
1.11.2.1. Random Forests                   0.241214063
1.11.4. Gradient Tree Boosting             0.273875445
1.11.1. Bagging meta-estimator             0.296172365
1.11.2.2. Extremely Randomized Trees       0.296710726
1.6.3. KNeighborsRegressor uniform         0.306133182
1.6.3. KNeighborsRegressor distance        0.319488307
1.10. DecisionTreeRegressor                0.336486662
1.10. ExtraTreeRegressor                   0.40337387
1.4.2 SVR poly                             0.429585937
1.4.2 NuSVR poly                           0.434766842
1.11.3. AdaBoost                           0.443524744
1.4.2 SVR rbf                              0.476364995
1.1.9.1. Bayesian Ridge Regression         0.567228078
1.1.4. Elastic Net                         0.56727658
1.1.2. Ridge Regression                    0.567611415
1.1.1. Ordinary Least Squares              0.567641956
1.1.11. Stochastic Gradient Descent        0.573168168
1.1.8. Orthogonal Matching Pursuit         0.576630178
1.1.14.3. Theil-Sen estimator              0.5875179
1.4.2 SVR linear                           0.642531415
1.4.2 LinearSVR                            0.667162534
1.1.14.2. RANSAC                           0.705499997
1.1.13. Passive Aggressive Algorithms      0.726516853
1.1.3. Lasso                               0.899948627
1.1.7. LARS Lasso                          0.899948627
1.4.2 SVR sigmoid                          0.937398784
1.8. Cross decomposition PLSRegression     1.662293485
1.6.3. NearestCentroid                     1.701974047
1.8. Cross decomposition PLSCanonical      10.72550448

Page 18: FinalPresentation-GradProject

Step 4: Tune Parameters of Algs.

• Grid Search
  – Exhaustively search the possible combinations of parameters
  – Takes too much time on my computer
• Random Search
  – Takes less time
  – Result is similar to grid search
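The cost difference between the two strategies can be sketched as follows. The search space and the `toy_score` function are assumptions made for this illustration: the deck does not list the parameters that were tuned, and `toy_score` is only a stand-in for a cross-validated MSE, not a real model.

```python
import itertools
import random

# Hypothetical search space (not the one used in the project).
param_grid = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 2, 5],
}

def toy_score(params):
    """Stand-in for a cross-validated MSE (lower is better)."""
    depth = params["max_depth"] or 30  # treat "no limit" as depth 30
    return (1.0 / params["n_estimators"]
            + abs(depth - 20) / 100
            + params["min_samples_leaf"] / 50)

def grid_search(grid, score):
    """Evaluate every combination; return (best_params, n_evaluations)."""
    keys = list(grid)
    candidates = [dict(zip(keys, values))
                  for values in itertools.product(*grid.values())]
    return min(candidates, key=score), len(candidates)

def random_search(grid, score, n_iter, seed=0):
    """Evaluate n_iter random combinations; return (best_params, n_evaluations)."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in grid.items()}
                  for _ in range(n_iter)]
    return min(candidates, key=score), len(candidates)

best_grid, n_grid = grid_search(param_grid, toy_score)
best_random, n_random = random_search(param_grid, toy_score, n_iter=10)
```

Grid search here scores all 4 x 4 x 3 = 48 combinations, while random search scores only the 10 it samples. When each evaluation means refitting a model over several folds, that difference dominates the running time.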

Let’s see tuning parameters…

Page 19: FinalPresentation-GradProject

Support Vector Regression

Page 20: FinalPresentation-GradProject
Page 21: FinalPresentation-GradProject

KNN

Page 22: FinalPresentation-GradProject

Step 5: Evaluate

• Feature Importance
• Final MSE for 4 selected algorithms
• Compare rents with Zoopla Estimate

Page 23: FinalPresentation-GradProject

Feature Importance: Random Forest

Page 24: FinalPresentation-GradProject

Feature Importance: GBR

Page 25: FinalPresentation-GradProject

MSE on Cross Validation and new listings data

[Chart. X axis: Cross Validation (folds 1 to 10) / Score on new data. Y axis: MSE (0 to 2). Series: KNN, GBR, RF, SVR, standard deviation.]

Page 26: FinalPresentation-GradProject

Final Result

Algorithm                 Final Result (MSE)   MSE from Step 3   Fitting Time (42003 examples)   Predicting Time (3582 examples)
Random Forest             0.108435602          0.241214063       53.12 sec                       2.03 sec
Gradient Tree Boosting    0.117256254          0.273875445       149.18 sec                      0.45 sec
Support Vector Machine    0.143577993          0.429585937       3192.02 sec                     4.54 sec
K-Nearest Neighbors       0.217186025          0.306133182       3.97 sec                        3.82 sec

Page 27: FinalPresentation-GradProject

Actual rent and predicted rent (Random Forest)

[Chart. Y axis: Rent (£), 0 to 5000. X-axis ticks run from 4 to 3500 (no axis label survives). Series: predicted, actual.]

Page 28: FinalPresentation-GradProject

Compare rents with Zoopla Estimate (1/2)

Zoopla Estimate

Actual Rent

Predicted Rent by Random Forests

£381.5120443 pw = £1653 pcm (pw x 52 / 12 = pcm)
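The conversion from rent per week (pw) to rent per calendar month (pcm) is the formula quoted on the slide, pw x 52 / 12. A small sketch, with a function name chosen only for this example:

```python
def weekly_to_monthly(rent_per_week):
    """Convert rent per week (pw) to rent per calendar month (pcm)."""
    return rent_per_week * 52 / 12

# The two weekly figures quoted on the comparison slides.
pcm_first = round(weekly_to_monthly(381.5120443))
pcm_second = round(weekly_to_monthly(1488.237929))
print(pcm_first, pcm_second)  # 1653 6449
```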

Page 29: FinalPresentation-GradProject

Compare rents with Zoopla Estimate (2/2)

Zoopla Estimate

Actual Rent

Predicted Rent by Random Forests

£1488.237929 pw = £6449 pcm

Page 30: FinalPresentation-GradProject

Conclusion

• Random Forest works the best for this problem

• Data quality influences the prediction results more than the parameters of the machine learning algorithms do

• Could not compare every predicted rent with the Zoopla estimate, but some predictions were closer to the actual rent than the Zoopla estimate was

Page 31: FinalPresentation-GradProject

Future Work

• Add more room-specific information, such as the size of the room and age

• Make an app that predicts rent from an address, # of bedrooms, # of bathrooms, # of floors and property type

Page 32: FinalPresentation-GradProject

Challenges

• Collect Data
  – Time consuming
  – Hard to find a good dataset
• Statistics
  – Possible to use machine learning without knowing math/statistics
  – Need to know them in order to understand in depth what ML algorithms do, or to tune the parameters efficiently

Page 33: FinalPresentation-GradProject

What I learned

• Python
• Scikit-learn / Tableau / Google Maps API / Walkscore API / Coordinate systems (MapIt API)
• How to apply machine learning algorithms
• Collecting a good dataset is more important than the choice of algorithm

Page 34: FinalPresentation-GradProject

References

• Walkscore
  – https://www.walkscore.com
• MapIt
  – http://mapit.poplus.org
• Google Maps API
  – https://developers.google.com/maps/documentation/javascript/
• Scikit-learn
  – http://scikit-learn.org/stable/
• London Datastore
  – http://data.london.gov.uk
• Tableau
  – http://www.tableau.com

Page 35: FinalPresentation-GradProject

References

• Zoopla
  – http://www.zoopla.co.uk
  – Examples from Zoopla
    • http://www.zoopla.co.uk/property/101-greyhound-road/london/n17-6xr/15262720
    • http://www.zoopla.co.uk/to-rent/details/36920785#5yJdKDM4BovT5eu6.97
    • http://www.zoopla.co.uk/property/28-cato-street/london/w1h-5jj/28909969
    • http://www.zoopla.co.uk/to-rent/details/37005409?search_identifier=0f64a06eeb798647935af065dcaf87c4#V6Xmr062sEqY198c.97

Page 37: FinalPresentation-GradProject

MSE

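• MSE (mean squared error) is the average of the squared differences between the actual values and the values predicted by the model.
• RMSE, the square root of MSE, measures the typical distance of a data point from the fitted line (representing predictions made by the model), measured along a vertical line.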
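Both quantities are short to compute by hand. The sketch below uses invented actual and predicted values purely to show the arithmetic:

```python
import math

def mse(actual, predicted):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical values, not taken from the project's data.
actual = [300.0, 450.0, 500.0]
predicted = [310.0, 430.0, 500.0]

error_mse = mse(actual, predicted)    # (100 + 400 + 0) / 3
error_rmse = rmse(actual, predicted)
```

Because the errors are squared before averaging, one large miss raises MSE far more than several small ones. RMSE brings the figure back to the units of the target.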

Page 38: FinalPresentation-GradProject

Cross Validation

https://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
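In k-fold cross validation the data is cut into k parts, and each part is used once for validation while the remaining parts are used for training. A minimal sketch of the index bookkeeping, assuming the examples have already been shuffled (the function name is chosen for this example):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

Averaging the validation MSE over the folds gives a single score per algorithm, and each example is validated on exactly once.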

Page 39: FinalPresentation-GradProject

Random Forests

http://provectus.com/blog/news/research_paper_for_load_forecast

Page 40: FinalPresentation-GradProject

Gradient Tree Boosting

http://provectus.com/blog/news/research_paper_for_load_forecast

Page 41: FinalPresentation-GradProject

Support Vector Machine

http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html

Page 42: FinalPresentation-GradProject

K-Nearest Neighbors

http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/
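The Step 3 results list KNeighborsRegressor twice, with "uniform" and "distance" weights. The difference can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's implementation, and the one-feature toy data is invented:

```python
import math

def knn_predict(train_X, train_y, query, k=3, weights="uniform"):
    """Predict a value for `query` from its k nearest training points."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    if weights == "uniform":
        # Every neighbour counts equally.
        return sum(y for _, y in nearest) / len(nearest)
    # "distance": closer neighbours count more (weight = 1 / distance).
    exact = [y for d, y in nearest if d == 0]
    if exact:
        return sum(exact) / len(exact)  # avoid dividing by zero
    total_weight = sum(1 / d for d, _ in nearest)
    return sum(y / d for d, y in nearest) / total_weight

# Hypothetical one-feature data: feature value -> weekly rent.
train_X = [(1.0,), (2.0,), (4.0,)]
train_y = [300.0, 400.0, 700.0]

uniform_rent = knn_predict(train_X, train_y, (2.5,), k=2, weights="uniform")
distance_rent = knn_predict(train_X, train_y, (2.5,), k=2, weights="distance")
```

With k=2 the neighbours are the points at 2.0 and 1.0. Uniform weighting averages their rents, while distance weighting pulls the prediction toward the closer point.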

Page 43: FinalPresentation-GradProject

What is Machine Learning?

• Supervised learning
  – Classification
  – Regression

[Diagram. Fitting/Training: # of bedrooms, lat/long together with rent. Predicting: # of bedrooms, lat/long gives Predicted Rent.]