TRANSCRIPT

Page 1: Final presentation team30

Product Cataloging and Intelligence

Team Members: Shailendra Kumar Joshi, Karan Mangla, Karan Agarwal

Page 2: Final presentation team30

Overview: Product comparison sites scrape data from various e-commerce websites and provide users with a one-stop pricing details page. The goal is to similarly build a product database, extract vendor pricing information, and perform analytics over the data.

Modules:

● Master Catalogue development
● Product dump extraction
● Vendor dump extraction
● Price dump extraction
● Product master creation
● Intelligence to avoid screen scraping blockages
● GUI query interface along with the required backend

Page 3: Final presentation team30

Our Work: We selected amazon.com as our target website and began our work. We created a master dump from which we took the set of URLs for the products. While scraping, the initial seed URL set is maintained in a queue; each URL is dequeued to fetch the product's webpage, and the data is dumped into a CSV file for initial processing.

For this task we used raw Python and BeautifulSoup to create and enrich our product dump. We scraped products by category, including their complete details.
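The slides describe a queue-based loop that fetches product pages and dumps fields into a CSV file; the sketch below illustrates what such a loop might look like. The example URLs, element IDs, and output columns are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of the queue-based scraping loop described above.
import csv
from collections import deque

import requests
from bs4 import BeautifulSoup

# Hypothetical seed URLs taken from the master dump.
seed_urls = deque([
    "https://www.amazon.com/dp/EXAMPLE1",
    "https://www.amazon.com/dp/EXAMPLE2",
])

with open("product_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "price"])

    while seed_urls:
        url = seed_urls.popleft()
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")

        # The element IDs below are placeholders; real pages need
        # category-specific selectors and error handling.
        title = soup.find(id="productTitle")
        price = soup.find(id="priceblock_ourprice")
        writer.writerow([
            url,
            title.get_text(strip=True) if title else "",
            price.get_text(strip=True) if price else "",
        ])
```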

Page 4: Final presentation team30

Challenges Faced: One of the major challenges during scraping is evading the site's anti-bot protections, which can block our IP address, so we wrote a generic script to work around those bot checks and protocols.
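The slides do not spell out what the generic script does; a common approach, sketched here under that assumption, is to rotate User-Agent headers and add random delays between requests.

```python
# Sketch of one common anti-blocking tactic: rotate the User-Agent
# header and sleep a random interval before each request.
import random
import time

import requests

# Illustrative User-Agent strings; a real list would be longer.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11)",
]

def polite_get(url):
    """Fetch a URL with a random User-Agent and a random delay."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    time.sleep(random.uniform(1.0, 3.0))  # space out requests
    return requests.get(url, headers=headers, timeout=10)
```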

We initially used SQLite as our database, but because the number of fields varies across product categories, it was not possible to maintain a single generic schema, so we moved to MongoDB, whose schemaless nature suits this kind of data.
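A small sketch of why the schemaless store helps: products from different categories can carry different specification fields yet live in the same MongoDB collection. The connection string, database, and collection names here are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
products = client["catalogue"]["products"]         # assumed names

# Two products from different categories, with different fields,
# stored side by side without a fixed schema.
products.insert_many([
    {"name": "Example Phone", "price": 199.99,
     "specs": {"ram": "2 GB", "screen": "5 inch"}},
    {"name": "Example Shoe", "price": 49.99,
     "specs": {"size": 9, "material": "leather"}},
])
```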

We also adopted the web framework Flask, which integrates well with a schemaless database and makes data retrieval straightforward. We initially built a Web2py application for phase 1, but faced difficulties integrating MongoDB, so we moved to Flask.

Page 5: Final presentation team30

Continued: After extracting the required details from the dump, we imported them into the database. We built the web interface for the data using Flask and used MongoDB for storage because of its NoSQL nature.

The following details are extracted for each product:

Name, price, ratings, details (including specifications), vendors
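For illustration only, one stored product document combining the fields listed above might look like the following; the exact field names and nesting used in the project may differ.

```python
# Hypothetical shape of a single stored product document.
sample_product = {
    "name": "Example Phone",
    "price": 199.99,
    "ratings": 4.3,
    "details": {"specifications": {"ram": "2 GB", "screen": "5 inch"}},
    "vendors": [
        {"vendor": "Vendor A", "price": 199.99},
        {"vendor": "Vendor B", "price": 189.50},
    ],
}
```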

On the basis of the vendors' data, we performed analytics to determine the best price for each product.

After saving the data to the database, we built a simple Python Flask application to display the results in a web browser.
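A minimal sketch of such a Flask view, assuming product documents shaped like the example above; the route, connection string, and collection names are chosen here for illustration.

```python
from flask import Flask
from pymongo import MongoClient

app = Flask(__name__)
# Connection string and collection names are assumptions for this sketch.
products = MongoClient("mongodb://localhost:27017")["catalogue"]["products"]

@app.route("/product/<name>")
def product(name):
    doc = products.find_one({"name": name})
    if doc is None:
        return "Product not found", 404
    # Pick the cheapest vendor offer from the document's vendor list.
    best = min(doc.get("vendors", []), key=lambda v: v["price"], default=None)
    if best is None:
        return f"{doc['name']}: no vendor offers recorded"
    return f"{doc['name']}: best price {best['price']} from {best['vendor']}"

if __name__ == "__main__":
    app.run(debug=True)
```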

Page 6: Final presentation team30

Workflow Followed

Page 7: Final presentation team30

Here are some screenshots from our work on the project:

Pages 8-13: Final presentation team30 (screenshots of the project)

Future Work:

● Comparison feature between products.
● Location-based analytics of the vendors and heuristics for calculating vendor preference based on a multitude of factors.
● Distributing the task of scraping across multiple machines.

Page 14: Final presentation team30

Links to other deliverables

Code (GitHub): https://github.com/manglakaran/IRE_Project6

Project web page: http://manglakaran.github.io/IRE_Project6/

Video (YouTube): https://www.youtube.com/watch?v=1EH8qq0_9YM&feature=youtu.be

Presentation, report, and video (Dropbox): https://www.dropbox.com/sh/8qx2gxmlz39ty6l/AAAwXJa5gtFwXT-14XkPA_nSa?dl=0

Page 15: Final presentation team30