Willhaben Dataset Collector and Price Predictor

TU Graz – Institute of Interactive Systems and Data Science
8010 Graz, Inffeldgasse 16c, Austria, Tel.: +43 316 873-0000
Author: BSc Maris Siljak
Knowledge Discovery and Data Mining 2 (VU 706.715)

Introduction

This project consists of two major components: the Dataset Collector and the Price Predictor.

The Dataset Collector comprises a generic crawler and its storage management. The crawler can crawl any kind of Willhaben product based on a search-criteria query and collect the relevant data, which is afterwards stored in a NoSQL DBMS.

The Price Predictor covers dataset preparation and preprocessing, model training and, finally, prediction of the price feature. The previously acquired data is preprocessed and cleaned in order to bring it into a form the algorithm can use and to remove outliers and erroneous values. Once the data is ready, a linear regression model is trained and used to predict the target feature.

Data Collecting and Storing

The software component used for collecting the relevant data is a web crawler built in Python on top of Scrapy [2], the web-crawling and scraping framework. To start the crawling process, one feeds it a valid Willhaben search link. This delegates the search processing to Willhaben's own search engine, ensures the correctness of the whole process flow, and avoids redundant dependencies. During the crawling process, all products on the result pages are individually crawled and scraped until no products are left. The crawler is a rather advanced one, because it is able to revisit products and record revisions of updated product details. This data is collected for advanced analytics such as:

How many times does a certain user change product details?
Was a change a good or a bad decision (increased views, accelerated sale)?
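
The poster does not include the crawler's source code; the following is a minimal sketch of how such a Scrapy spider could look. The CSS selectors and scraped fields are illustrative assumptions, not Willhaben's actual markup:

    import scrapy

    class WillhabenSpider(scrapy.Spider):
        # Minimal sketch of the product spider; selectors are hypothetical.
        name = "willhaben"

        def __init__(self, search_url, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # The crawler is fed a valid Willhaben search link.
            self.start_urls = [search_url]

        def parse(self, response):
            # Visit every product on the current result page.
            for href in response.css("a.product-link::attr(href)").getall():
                yield response.follow(href, callback=self.parse_product)
            # Continue to the next result page until no products are left.
            next_page = response.css("a.pagination-next::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

        def parse_product(self, response):
            # Scrape the relevant product details (fields are illustrative).
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),
                "price": response.css("span.price::text").get(),
            }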

Product data (excluding price) used in the model looks as shown below.

Product Data [1] (snapshot of one crawled product)

This data can have an arbitrary number of attributes as well as equipment entries, which was one of the main factors that influenced the choice of DBMS. MongoDB [3] is used as storage for this data and is accessed from the project via the pymongo [4] library. One of its main advantages is that the schema of each collection is not fixed, since it is a NoSQL DBMS. That makes it a perfect fit for this case, because Willhaben does not enforce that any product details be filled in, which can result in missing and erroneous values.
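
A minimal sketch of how a scraped product could be stored with pymongo; the connection URI and the database and collection names are assumptions:

    from pymongo import MongoClient

    # Connect to a local MongoDB instance (URI is an assumption).
    client = MongoClient("mongodb://localhost:27017/")
    products = client["willhaben"]["products"]

    # Documents may carry arbitrary attributes and equipment lists,
    # since MongoDB does not enforce a fixed collection schema.
    product = {
        "title": "Naked Bike",
        "attributes": {"make": "Yamaha", "year": 2015, "mileage_km": 24000},
        "equipment": ["ABS", "heated grips"],
    }
    products.insert_one(product)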

Data Preprocessing and Cleaning

The dataset consists of crawled products categorised as "Motorbikes -> Naked Bikes". For obvious time reasons, there are only ~3500 crawled products.

Flat Data / Clean Data (dataset snapshots, before vs. after cleaning)

After successfully cleaning the data and converting all values to numeric form, we are ready to examine the pairwise relations between the target feature, "price", and all other features, including itself.

Figure 1: Pairwise plots (x axis = price)

From the pairwise plot of price against itself, it is clearly visible that the price distribution is not normal. Since we plan to use a linear regression model for the predictions, a normally distributed target would be advantageous for the algorithm. Applying the natural logarithm to the target feature yields the distribution shown in Figure 2. A big advantage of using such a simple function is that its results are easily reversible with the exponential function, as sketched below.

Figure 2: Initial vs. normal price distribution
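
A minimal sketch of the log transform and its inverse, assuming the cleaned data lives in a pandas DataFrame with a numeric price column:

    import numpy as np
    import pandas as pd

    # Toy stand-in for the cleaned, all-numeric product data.
    df = pd.DataFrame({"price": [1200.0, 3500.0, 7800.0]})

    # The natural logarithm pulls the skewed price distribution
    # closer to a normal one.
    df["log_price"] = np.log(df["price"])

    # The transform is easily reversed with the exponential function.
    assert np.allclose(np.exp(df["log_price"]), df["price"])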

Price Prediction

Before any predictions happen, the model is fit with the previously processed data; only then are linear regression predictions made, delivering the following results:

Metrics
Root Mean Squared Error  0.33882925086198107
Mean Squared Error       0.11480526123969131
R2 Score                 0.748883640846617

Figure 3: Actual and predicted price distribution
Figure 4: Scatter plot of test and predicted prices with best-fit line
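
The poster does not name the modelling library; a minimal sketch of the fit/predict step and the reported metrics, using scikit-learn and synthetic stand-in data, could look like this:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-ins for the numeric features X and the
    # log-transformed price y.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.3, size=500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit the model first; only then are predictions made.
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    print("Root Mean Squared Error:", np.sqrt(mse))
    print("Mean Squared Error:", mse)
    print("R2 Score:", r2_score(y_test, y_pred))

    # Predicted log-prices map back to prices via the exponential.
    predicted_prices = np.exp(y_pred)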



References

[1] https://www.willhaben.at/iad/gebrauchtwagen/d/auto/audi-q3-sportback-35-tdi-s-line-exterieur-345431452/
[2] https://scrapy.org
[3] https://www.mongodb.com
[4] https://api.mongodb.com/python/current/
