TRANSCRIPT

Page 1: Final presentation team30

Product Cataloging and Intelligence

Team Members: Shailendra Kumar Joshi, Karan Mangla, Karan Agarwal

Page 2: Final presentation team30

Overview: Product comparison sites scrape data from various e-commerce websites and provide users with a one-stop pricing details page. The goal is to similarly build a product database, extract vendor pricing information, and perform analytics over the data.

Modules:

● Master Catalogue development
● Product dump extraction
● Vendor dump extraction
● Price dump extraction
● Product master creation
● Intelligence to avoid screen scraping blockages
● GUI query interface along with the required backend

Page 3: Final presentation team30

Our Work: We selected amazon.com as our target website and began our work. We created a master dump from which we took the set of URLs for the products. While scraping, the initial seed URL set is maintained in a queue; each URL is dequeued to fetch the product's webpage, and the data is dumped into a CSV file for initial processing.

For this task we used raw Python and BeautifulSoup to create and enrich our product dump. We scraped products by category, including their complete details.
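The slides describe a queue-based loop that fetches product pages and dumps fields into a CSV file; the sketch below illustrates what such a loop might look like. The example URLs, element IDs, and output columns are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of the queue-based scraping loop described above.
import csv
from collections import deque

import requests
from bs4 import BeautifulSoup

# Hypothetical seed URLs taken from the master dump.
seed_urls = deque([
    "https://www.amazon.com/dp/EXAMPLE1",
    "https://www.amazon.com/dp/EXAMPLE2",
])

with open("product_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "price"])

    while seed_urls:
        url = seed_urls.popleft()
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")

        # The element IDs below are placeholders; real pages need
        # category-specific selectors and error handling.
        title = soup.find(id="productTitle")
        price = soup.find(id="priceblock_ourprice")
        writer.writerow([
            url,
            title.get_text(strip=True) if title else "",
            price.get_text(strip=True) if price else "",
        ])
```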

Page 4: Final presentation team30

Challenges Faced: One of the major challenges during scraping is evading the site's anti-bot protections, which can block our IP address, so we wrote a generic script to work around those bot checks and protocols.
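The slides do not spell out what the generic script does; a common approach, sketched here under that assumption, is to rotate User-Agent headers and add random delays between requests.

```python
# Sketch of one common anti-blocking tactic: rotate the User-Agent
# header and sleep a random interval before each request.
import random
import time

import requests

# Illustrative User-Agent strings; a real list would be longer.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11)",
]

def polite_get(url):
    """Fetch a URL with a random User-Agent and a random delay."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    time.sleep(random.uniform(1.0, 3.0))  # space out requests
    return requests.get(url, headers=headers, timeout=10)
```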

We initially used SQLite as our database, but because the number of fields varies across product categories, it was not possible to maintain a single generic schema, so we moved to MongoDB, whose schemaless nature suits this kind of data.
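A small sketch of why the schemaless store helps: products from different categories can carry different specification fields yet live in the same MongoDB collection. The connection string, database, and collection names here are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
products = client["catalogue"]["products"]         # assumed names

# Two products from different categories, with different fields,
# stored side by side without a fixed schema.
products.insert_many([
    {"name": "Example Phone", "price": 199.99,
     "specs": {"ram": "2 GB", "screen": "5 inch"}},
    {"name": "Example Shoe", "price": 49.99,
     "specs": {"size": 9, "material": "leather"}},
])
```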

We also adopted the web framework Flask, which integrates well with a schemaless database and makes data retrieval straightforward. We initially built a Web2py application for phase 1, but faced difficulties integrating MongoDB, so we moved to Flask.

Page 5: Final presentation team30

Continued: After extracting the required details from the dump, we imported them into the database. We built the web interface for the data using Flask and used MongoDB for storage because of its NoSQL nature.

The following details are extracted for each product:

Name, price, ratings, details (including specifications), vendors
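For illustration only, one stored product document combining the fields listed above might look like the following; the exact field names and nesting used in the project may differ.

```python
# Hypothetical shape of a single stored product document.
sample_product = {
    "name": "Example Phone",
    "price": 199.99,
    "ratings": 4.3,
    "details": {"specifications": {"ram": "2 GB", "screen": "5 inch"}},
    "vendors": [
        {"vendor": "Vendor A", "price": 199.99},
        {"vendor": "Vendor B", "price": 189.50},
    ],
}
```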

On the basis of the vendors' data, we performed analytics to determine the best price for each product.

After saving the data to the database, we built a simple Python Flask application to display the results in a web browser.
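A minimal sketch of such a Flask view, assuming product documents shaped like the example above; the route, connection string, and collection names are chosen here for illustration.

```python
from flask import Flask
from pymongo import MongoClient

app = Flask(__name__)
# Connection string and collection names are assumptions for this sketch.
products = MongoClient("mongodb://localhost:27017")["catalogue"]["products"]

@app.route("/product/<name>")
def product(name):
    doc = products.find_one({"name": name})
    if doc is None:
        return "Product not found", 404
    # Pick the cheapest vendor offer from the document's vendor list.
    best = min(doc.get("vendors", []), key=lambda v: v["price"], default=None)
    if best is None:
        return f"{doc['name']}: no vendor offers recorded"
    return f"{doc['name']}: best price {best['price']} from {best['vendor']}"

if __name__ == "__main__":
    app.run(debug=True)
```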

Page 6: Final presentation team30

Workflow Followed

Page 7: Final presentation team30

Here are some screenshots from our work on the project:

Pages 8-13: Final presentation team30 (screenshots of the project)

Future Work:

● Comparison feature between products.
● Location-based analytics of the vendors and heuristics for calculating vendor preference based on a multitude of factors.
● Distributing the task of scraping across multiple machines.

Page 14: Final presentation team30

Links to other deliverables

Code (GitHub): https://github.com/manglakaran/IRE_Project6

Project web page: http://manglakaran.github.io/IRE_Project6/

Video (YouTube): https://www.youtube.com/watch?v=1EH8qq0_9YM&feature=youtu.be

Presentation, report, and video (Dropbox): https://www.dropbox.com/sh/8qx2gxmlz39ty6l/AAAwXJa5gtFwXT-14XkPA_nSa?dl=0

Page 15: Final presentation team30