PREDICT HOUSE PRICES IN TAICHUNG TO CREATE AN ONLINE SERVICE FOR THE REAL
ESTATE MARKET
Andi M Rizki Guerman Alexei
Thimmaraju Team: 7
Institute Service Science, National Tsing Hua University, 2015
TABLE OF CONTENTS
EXECUTIVE SUMMARY
TECHNICAL SUMMARY
    PROBLEM DESCRIPTION
    DATA DESCRIPTION
    DATA PREPARATION FOR ANALYSIS
    DATA MINING SOLUTION
    CONCLUSIONS
APPENDICES
    APPENDIX A: VARIABLES USED FOR ANALYSIS
    APPENDIX B: EXTERNAL VARIABLES USED FOR ANALYSIS
    APPENDIX C: MODEL OUTPUT
    APPENDIX D: BOXPLOT VALIDATION ERROR
EXECUTIVE SUMMARY
The real estate market has become very popular, and the housing recovery has pushed
up home prices nearly everywhere. Taichung, Taiwan, has been no exception: its real
estate market has expanded over the past couple of years to the point where it is
attracting interest not only from other parts of Taiwan but also from other parts of
the world. An accurate prediction of house prices is important to prospective homeowners,
developers, investors, appraisers, tax assessors and other real estate market participants,
such as mortgage lenders and insurers. People who are looking to buy a new place tend to
be more conservative with their budget.
The goal of our project is to use previous transaction data to predict house prices,
provide consumers with the information they need, help professionals build their
businesses, and create additional value in adjacent markets. The days of calling a local
realtor or hiring an expensive appraiser just to find out what a home is worth are falling
behind. As a first step we obtained data from the Ministry of the Interior of Taiwan.
We then visualized the raw data and examined the relationships between predictors.
We selected variables using domain knowledge and added external data such as
national economic growth, the latitude and longitude of houses, the distance to the
nearest planned MRT station, and other derived variables.
We considered several methods in order to reach the lowest error: first multiple linear
regression (MLR) and the k-nearest neighbors (KNN) algorithm, then Random Trees and
Neural Nets. The first pair showed better performance, so we continued improving those
methods. The low error measure of the MLR showed a promising opportunity for an accurate
prediction system. We believe the model can be further improved with more data. Since
online services are becoming the new business platform, we encourage applying this system
in an automated service with customer interaction.
TECHNICAL SUMMARY
PROBLEM DESCRIPTION
The business goal is to predict house prices in Taichung to create an online service for the
real estate market. The data mining goal is to create a prediction system (a predictive
algorithm) that helps customers, both buyers and sellers, get a fair price close to the
current market price.
DATA DESCRIPTION
The data file contains information on home selling prices in Taichung from 2009 to 2014,
provided by the official government website http://plvr.land.moi.gov.tw/Index. The data
comprise 2000 rows and 27 columns. The purchase price of a house depends on its
characteristics, including its physical properties, such as its size and number of bedrooms,
as well as the type of neighborhood in which the house is located.
• Transaction_land_building
• Building_pattern
• Total_building_area
• Number_of_rooms
• Number_of_bathrooms
• Price_persquare_meter
• Age_of_building
• Floor_bin
• Longtude_latitude
• Distance_to_MRT
• Area/average
DATA PREPARATION FOR ANALYSIS
The data were translated into English, and predictor variables from the different data
sources were merged to fill in missing values and obtain a more robust and complete data
set. Our initial data exploration and the output of certain models prompted us to remove
records with missing data, bringing the overall data set down to 995 records and 11 columns.
In addition, properties that were not typical and might distort the series were also removed
from the data set.
The data were partitioned into 60% training and 40% validation. We used external predictors,
and we note that the data discarded for this analysis could be used to build other models for
the remaining types of transaction (e.g. land, parking space) and types of house (e.g.
undefined, commercial use). We assume the price-per-sq-m predictor is based on a historical
per-area computation and is not a function of the selling price. To avoid reducing the data,
we tried using the log of price (the outcome) so that large values that might otherwise have
been considered outliers could be included, but in the end we used the untransformed price
because it gave lower error. We wrote a Python program to convert addresses to longitude and
latitude, and an R program to calculate the distance from each house to the nearest planned
MRT station (see appendix).
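The 60/40 partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not say which tool or random seed was used, so the function name `partition` and the seed value are hypothetical, and the records are stood in for by the integers 0 to 994.

```python
import random

def partition(records, train_fraction=0.6, seed=1):
    """Shuffle a copy of the records and split it into training and validation parts.

    The seed is an arbitrary illustrative value, not one taken from the report.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 995 cleaned records, represented here by their row numbers
train, valid = partition(range(995))
print(len(train), len(valid))
```

With 995 records this split yields 597 training and 398 validation records.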
DATA MINING SOLUTIONS
The following two models had the lowest overall error. Based on our performance
criteria, they had better accuracy than the naïve rule on the validation set. All variables
selected were available and relevant at the time of prediction.
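As a minimal sketch of how a model can be compared with the naïve rule, the Python below computes the average error and the RMS error for both. It assumes the naïve rule predicts the mean training price for every record and that error is defined as actual minus predicted; the report states neither explicitly. The function names and the prices are made up for illustration.

```python
import math

def average_error(actual, predicted):
    """Mean of (actual - predicted); positive and negative errors cancel."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def rms_error(actual, predicted):
    """Root mean squared error; measures the typical size of an error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def naive_rule(train_prices, n):
    """Assumed naive benchmark: predict the training mean for all n records."""
    mean_price = sum(train_prices) / len(train_prices)
    return [mean_price] * n

# Hypothetical prices, not taken from the report
train_prices = [100, 200, 300]
valid_prices = [150, 250]
model_predictions = [140, 260]

naive_predictions = naive_rule(train_prices, len(valid_prices))
naive_rms = rms_error(valid_prices, naive_predictions)
model_rms = rms_error(valid_prices, model_predictions)
print(naive_rms, model_rms)
```

Average error measures bias, because over- and under-predictions offset each other, so it is worth reading alongside the RMS error when judging a model against the benchmark.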
MULTIPLE LINEAR REGRESSION
This model gave us the lowest average error, USD -24.902, compared to the naïve error of
USD 253,885.5 on the validation data set. The model was run using best subset selection to
choose the significant variables. See the appendix for detailed output.
K NEAREST NEIGHBORS
KNN cannot help us select the important variables, so we used data exploration and the
prior model to select the input variables. This model gave us an error of USD -70.7963,
compared to the naïve error of USD 253,885.5.
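The idea behind the KNN model, and the way k can be chosen by validation error, can be sketched in plain Python. This is not the tool used in the report; it assumes Euclidean distance and an unweighted average of the k nearest prices, and every name and number below is hypothetical.

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Average the prices of the k training records closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def validation_rmse(train_X, train_y, valid_X, valid_y, k):
    """RMS error of the k-nearest-neighbor predictions on the validation set."""
    squared = [(y - knn_predict(train_X, train_y, x, k)) ** 2
               for x, y in zip(valid_X, valid_y)]
    return math.sqrt(sum(squared) / len(squared))

def best_k(train_X, train_y, valid_X, valid_y, candidates):
    """Pick the candidate k with the lowest validation RMS error."""
    return min(candidates,
               key=lambda k: validation_rmse(train_X, train_y, valid_X, valid_y, k))

# Hypothetical one-predictor data, not taken from the report
train_X = [[1], [2], [3], [10], [11], [12]]
train_y = [10, 20, 30, 100, 110, 120]
valid_X = [[2.4], [10.6]]
valid_y = [24, 106]

chosen = best_k(train_X, train_y, valid_X, valid_y, candidates=[1, 2, 3])
print(chosen)
```

Because KNN relies on distances, predictors measured on very different scales are usually rescaled before a model of this kind is fitted; whether that was done here is not stated in the report.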
RECOMMENDATIONS
There are several recommendations for our model:
• Monitor the system. It may be necessary to restate, revise or remove data from the
index.
• Run the model monthly with updated data.
• Create an alternative source of data by giving customers the option to upload
their home information.
• Split the data according to transaction type.
• Try external data to increase accuracy, and automate the system with the online
page.
APPENDIX A VARIABLES

Chinese | English | Description | Type
號 | record # | index | numerical
鄉鎮市區 | District | | categorical
 | Subject of the transaction | type: land, house, parking | categorical
土地區段位置或建物區門牌 | location | address | categorical
土地移轉總面積平方公尺 | total land area | square meters | numerical
都市土地使用分區 | Urban Land Use Zoning | residential, office, farm, factory, other | categorical
交易年月 | Transaction date | | date
交易筆棟數 | Number of subdivisions | quantity of space, room and parking lot | numerical
移轉層次 | floor location | floor # | categorical
總樓層數 | Total number of floors | | numerical
建物型態 | Buildings patterns | biz building, 11+ floor residential building with elevator, 10+ floor simple residential building with elevator, 5 floor apartment building no elevator, other, suite 1 room 1 hall 1 bathroom, shop, house, office building, warehouse, factory, farmhouse | categorical
主要用途 | main purpose | | categorical
主要建材 | main building materials | brick, reinforced brick, concrete, steel | categorical
建築完成年月 | Construction completion date | | date
建物移轉總面積平方公尺 | total building area | square meters | numerical
建物現況格局-房 | number of rooms | | numerical
建物現況格局-廳 | number of halls | | numerical
建物現況格局-衛 | number of bathrooms | | numerical
有無管理組織 | management | Have management organization | binary
總價元 | total price | NT$ | numerical
單價每平方公尺 | Price per square meter | NT$/m2 | numerical
車位類別 | Parking type | ramp, lift machine, first floor | categorical
車位移轉總面積平方公尺 | total area of parking space | square meters | numerical
車位總價元 | total price Parking | NT$ | numerical
APPENDIX B EXTERNAL VARIABLES

Converting address to longitude and latitude

    #!/Users/admin6/anaconda/bin/python
    import geocoder
    import unicodecsv
    import logging

    # Latitudes and longitudes of the successfully geocoded addresses
    lat = []
    lon = []

    with open('taizhong_houses.csv', 'rb') as f:
        reader = unicodecsv.DictReader(f, encoding='utf-8')
        for line in reader:
            address = line['address']
            # Geocode one address string at a time
            g = geocoder.google(address, method='geocode')
            if g.ok:
                # g.latlng is a [latitude, longitude] pair
                lat.append(g.latlng[0])
                lon.append(g.latlng[1])
                logging.info('SUCCESS: %s', address)
            else:
                logging.warning('Geocoding ERROR: %s', address)

    # Write one (lat, lon) row per geocoded address
    fields = ('lat', 'lon')
    with open('/Users/admin6/python/mindis.csv', 'wb') as outfile:
        w = unicodecsv.writer(outfile, encoding='utf-8')
        w.writerow(fields)
        for row in zip(lat, lon):
            w.writerow(row)
Calculate the nearest distance to MRT stations

    # Column 1 of each file is latitude, column 2 is longitude (in degrees)
    x1 = read.csv("zhonghouses.csv", header=F)
    x2 = read.csv("zhongmrt.csv", header=F)

    mindis <- function(x1, x2) {
      deg2rad <- function(deg) return(deg*pi/180)
      R <- 6371   # Earth radius in km
      m <- c()
      for (i in 1:length(x1[,1])) {
        lat1 <- deg2rad(x1[i,1])
        long1 <- deg2rad(x1[i,2])
        d <- c()
        for (j in 1:length(x2[,1])) {
          lat2 <- deg2rad(x2[j,1])
          long2 <- deg2rad(x2[j,2])
          # Great-circle distance (spherical law of cosines), in km
          d[j] <- acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
        }
        # Distance from house i to its nearest station
        m[i] <- min(d)
      }
      return(m)
    }

    mininos <- mindis(x1, x2)
    write.table(mininos, "mininos.txt", sep="\t")
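For readers who prefer Python, the same nearest-station calculation can be sketched as follows. It uses the same spherical law of cosines and Earth radius (6371 km) as the R listing; the function names and the coordinates are hypothetical and only illustrate the idea. The clamp to [-1, 1] is an addition that guards `acos` against rounding when two points coincide.

```python
import math

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (math.sin(p1) * math.sin(p2)
                 + math.cos(p1) * math.cos(p2) * math.cos(l2 - l1))
    # Clamp to the valid domain of acos to absorb floating-point rounding
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM

def nearest_station_km(house, stations):
    """Distance in km from one (lat, lon) house to the closest (lat, lon) station."""
    return min(great_circle_km(house[0], house[1], s[0], s[1]) for s in stations)

# Illustrative coordinates only, not taken from the report's data files
house = (24.15, 120.67)
stations = [(24.16, 120.65), (24.11, 120.62)]
print(nearest_station_km(house, stations))
```

Applying `nearest_station_km` to every house gives the same kind of column as `mininos` in the R listing.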
APPENDIX C MODEL OUTPUT

K NEAREST NEIGHBORS

Validation error log for different k

Value of k | Training RMS Error | Validation RMS Error
1 | 2210.3255 | 2256777.4
2 | 3687.2424 | 2193350.8 <- Best k
3 | 4174.7158 | 2221855.4
4 | 4391.3622 | 2230585.1
5 | 4513.3547 | 2285595.8
6 | 4591.5384 | 2340164.7
7 | 4645.9033 | 2389848.1
8 | 4685.8901 | 2305930.1
9 | 4716.5343 | 2275397.9
10 | 4740.7669 | 2211266

Training Data Scoring - Summary Report (for k = 2)

Total sum of squared errors | RMS Error | Average Error
8116666667 | 3687.2424 | -1.56E-12

Validation Data Scoring - Summary Report (for k = 2)

Total sum of squared errors | RMS Error | Average Error
1.91469E+15 | 2193350.8 | -2123.903 (-70.7968 USD)
MULTIPLE LINEAR REGRESSION USING BEST SUBSET

Training Data Scoring - Summary Report

Total sum of squared errors | RMS Error | Average Error
9.25995E+14 | 1245424 | -3.03642E-05

Validation Data Scoring - Summary Report

Total sum of squared errors | RMS Error | Average Error
5.93783E+14 | 1221441 | -747.0611615 (-24.902 USD)

Input Variables | Coefficient | Std. Error | t-Statistic | P-Value | CI Lower | CI Upper
Intercept | -2561147 | 2469034.688 | -1.0373 | 0.3 | -7411019.472 | 2288724.967
distance_mrt | -119871 | 98568.4806 | -1.2161 | 0.2245 | -313486.9626 | 73744.969
area/avg | 7425829 | 177654.5376 | 41.7993 | 0 | 7076866.121 | 7774792.165
age | 14074.38 | 9286.7429 | 1.5155 | 0.1302 | -4167.3669 | 32316.1353
Price per square meter | 152.4028 | 5.5929 | 27.2495 | 0 | 141.4169 | 163.3887
pinyin_bei3qu1 | -2771531 | 1562715.578 | -1.7735 | 0.0767 | -5841140.393 | 298077.4218
pinyin_bei3tun2qu1 | -2608015 | 1374541.222 | -1.8974 | 0.0583 | -5307997.141 | 91966.916
pinyin_da4jia3ou1 | 994909.2 | 1823252.67 | 0.5457 | 0.5855 | -2586467.121 | 4576285.608
pinyin_da4li3ou1 | -2276106 | 1187920.026 | -1.916 | 0.0559 | -4609512.409 | 57299.5731
pinyin_da4ya3ou1 | -618740.1 | 1640064.308 | -0.3773 | 0.7061 | -3840283.462 | 2602803.277
pinyin_dong1qu1 | -2221918 | 1574174.562 | -1.4115 | 0.1587 | -5314035.949 | 870199.1393
pinyin_dong1shi4ou1 | -57794.04 | 1826395.962 | -0.0316 | 0.9748 | -3645344.708 | 3529756.622
pinyin_feng1yuan2ou1 | -1367368 | 1560643.553 | -0.8762 | 0.3813 | -4432906.418 | 1698171.326
pinyin_long2jin3gou1 | -681842.4 | 1687925.741 | -0.404 | 0.6864 | -3997398.999 | 2633714.125
pinyin_nan2qu1 | -3768672 | 1170992.18 | -3.2184 | 0.0014 | -6068826.964 | -1468517
pinyin_nan2tun2qu1 | -1896256 | 1271070.041 | -1.4919 | 0.1363 | -4392992.285 | 600479.3022
pinyin_sha1lu4ou1 | 41740.12 | 1692849.272 | 0.0247 | 0.9803 | -3283487.628 | 3366967.869
pinyin_tai4ping2qu1 | -2444140 | 1537064.563 | -1.5901 | 0.1124 | -5463363.476 | 575082.8534
pinyin_tan2zi3ou1 | -1793127 | 1459282.338 | -1.2288 | 0.2197 | -4659564.31 | 1073310.087
pinyin_wai4bu4ou1 | 1480040 | 2022187.453 | 0.7319 | 0.4645 | -2492099.984 | 5452179.456
pinyin_wu1ri4ou1 | -1276329 | 1435200.904 | -0.8893 | 0.3742 | -4095463.771 | 1542805.328
pinyin_wu2qi1ou1 | 218781.7 | 1684032.02 | 0.1299 | 0.8967 | -3089126.488 | 3526689.931
pinyin_wu4fen1gou1 | -887091 | 1774947.348 | -0.4998 | 0.6174 | -4373582.24 | 2599400.259
pinyin_xi1qu1 | -2238096 | 1204538.376 | -1.8581 | 0.0637 | -4604144.769 | 127953.3547
pinyin_xi1tun2qu1 | -2404988 | 1507565.885 | -1.5953 | 0.1112 | -5366267.528 | 556291.5534
pinyin_zhong1qu1 | -2938748 | 1756059.335 | -1.6735 | 0.0948 | -6388138.327 | 510641.5304
tran_pin_labu | 528136.9 | 171108.0053 | 3.0866 | 0.0021 | 192033.1402 | 864240.7568
floorbin_1 | -2138095 | 402007.951 | -5.3185 | 0 | -2927750.758 | -1348439.46
floorbin_2 | -2088456 | 414515.3956 | -5.0383 | 0 | -2902679.58 | -1274232.06
floorbin_3 | -1828651 | 443781.4031 | -4.1206 | 0 | -2700361.426 | -956940.721
floorbin_5 | -1350030 | 536380.9898 | -2.5169 | 0.0121 | -2403632.177 | -296428.682
floorbin_6 | -1127701 | 603830.2482 | -1.8676 | 0.0624 | -2313791.893 | 58389.884
patt_pinyin_Apartment | 302097.3 | 419285.1782 | 0.7205 | 0.4715 | -521495.6769 | 1125690.203
patt_pinyin_Landmark | -39326.76 | 367964.1007 | -0.1069 | 0.9149 | -762110.8039 | 683457.2789
patt_pinyin_ResBuild | -348085.9 | 343597.36 | -1.0131 | 0.3115 | -1023006.882 | 326835.0612
number of rooms_2 | -573007.4 | 319882.7061 | -1.7913 | 0.0738 | -1201346.223 | 55331.342
number of rooms_3 | -426779 | 390751.5931 | -1.0922 | 0.2752 | -1194324.016 | 340766.0381
number of rooms_4 | -799902.7 | 434909.5487 | -1.8392 | 0.0664 | -1654186.244 | 54380.8725
number of rooms_5 | -1529880 | 672810.9339 | -2.2739 | 0.0234 | -2851468.629 | -208292.27
number of rooms_6 | -2913212 | 1.4989E+22 | 0 | 1 | -2.94425E+22 | 2.94425E+22
number of bathrooms_1 | -986749.7 | 263062.9511 | -3.751 | 0.0002 | -1503478.673 | -470020.776
number of bathrooms_2 | -1285738 | 266013.4412 | -4.8334 | 0 | -1808262.093 | -763213.026
number of bathrooms_3 | -1441042 | 440034.7213 | -3.2748 | 0.0011 | -2305393.024 | -576691.373
number of bathrooms_4 | -2202877 | 866778.8557 | -2.5415 | 0.0113 | -3905472.087 | -500281.618
number of bathrooms_5 | -2063665 | 805872.8372 | -2.5608 | 0.0107 | -3646623.643 | -480705.962
number of bathrooms_6 | -2913212 | 1.4989E+22 | 0 | 1 | -2.94425E+22 | 2.94425E+22

Residual DF | 551
R² | 0.9525
Adjusted R² | 0.9486
Std. Error Estimate | 1296368.454
RSS | 9.25995E+14
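The summary statistics in the regression output are related by standard least-squares formulas, and a short Python check makes those relations explicit. The inputs are copied from the output in this appendix; the number of training records is not printed there, so it is inferred here as residual degrees of freedom plus the number of coefficients (551 + 46 = 597, which matches 60% of the 995 records). That inference, and the variable names, are assumptions of this sketch.

```python
import math

# Values copied from the regression output in this appendix
rss_train = 9.25995e14        # residual sum of squares on the training data
residual_df = 551
r2 = 0.9525
n_coefficients = 46           # rows in the coefficient table, including the intercept

# Assumption: number of training records = residual DF + number of coefficients
n_train = residual_df + n_coefficients

# Std. Error Estimate = sqrt(RSS / residual DF)
std_error_estimate = math.sqrt(rss_train / residual_df)

# Training RMS Error = sqrt(RSS / n)
training_rms = math.sqrt(rss_train / n_train)

# Adjusted R² = 1 - (1 - R²) * (n - 1) / residual DF
adjusted_r2 = 1 - (1 - r2) * (n_train - 1) / residual_df

# t-statistic = coefficient / standard error (intercept row)
t_intercept = -2561147 / 2469034.688

print(std_error_estimate, training_rms, adjusted_r2, t_intercept)
```

The computed values agree, up to rounding of the printed inputs, with the reported Std. Error Estimate (1296368.454), training RMS Error (1245424), Adjusted R² (0.9486) and intercept t-Statistic (-1.0373).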
APPENDIX D BOX PLOT VALIDATION ERROR