competitive analysis api report

B.K. BIRLA INSTITUTE OF ENGINEERING &

TECHNOLOGY,PILANI (RAJ.)

SESSION 2015

CERTIFICATE

This is to certify that the project entitled “API to perform competitive analysis of E-commerce sites” has been submitted to the Rajasthan Technical University, Kota, in fulfillment of the requirement for the award of the degree of Bachelor of Technology in “Information Technology” by the following students of final year B.Tech. (Information Technology).

Students:

Gaurav Kumawat (11EBKIT012)

Vishesh Mishra (11EBKIT063)

Mr. Shridhar Dandin (HOD, IT Deptt.)

Guide: Mrs. Sonam Mittal Sahu


CONTENTS

1 Abstract

1.1 Workflow

1.2 Base Language

1.3 Input

1.4 Output

2 Introduction

2.1 Background

2.2 Objectives

2.3 Purpose, Scope & Applicability

3 Tools & Environments used

4 Python

4.1 Python Features

4.2 Statements & Control Flow

4.3 Expressions

4.4 Methods

4.5 Mathematics

4.6 Libraries

4.7 Development Environments

5 Numpy

6 Python Data Analysis Library

7 JSON

8 Difflib

8.1 Sequence Matcher

9 Comma Separated Values (CSV)

10 Design Document

10.1 Modularization Details

11 Source Code

12 Testing

12.1 Unit Testing

12.2 Integration Testing

12.3 System Testing

13 Reports

13.1 Unit Testing

13.2 Integration Testing

13.3 System Testing

14 Input & Output Screens

15 Conclusion

16 Limitations of the Project

17 Future Applications of the Project

18 References


TEAM INFORMATION

MEMBERS :

Gaurav Kumawat (11EBKIT012)

Branch : Information Technology

Vishesh Mishra (11EBKIT063)

Branch : Information Technology


1. ABSTRACT

This API is system software that helps a Business Intelligence team devise market strategies. Its core is the product-matching algorithm, which matches products across various e-commerce platforms and gives analytical insight into the market capture of a product or a company against its competitors. For instance, a well-known company such as Flipkart may want to compare the selling price of one of its products, say 'IPhone 6', with the prices of its competitors. The same product may be listed under different names on different platforms: on Amazon it might be 'i-phone6 black' and on Snapdeal 'white I-phone/6 32gb'. Searching and sorting such listings by hand is messy and time-consuming. Our API is designed to address this problem and reduce the overall effort many times over.

1.1. WORKFLOW

Development Phase: Analyze the problem statement, research and synthesize a theoretical viewpoint, and develop the algorithms.

Implementation Phase: Translate the theoretical concepts into working code.

Evaluation Phase: Test the algorithms and verify the results, measuring recall and precision of the output.

1.2. BASE LANGUAGE : Python.

1.3. INPUT: Web-Crawled Data, Dataset (JSON Format).

1.4. OUTPUT: Business Intelligence Report, CSV Report.


2. INTRODUCTION

2.1. BACKGROUND

Numerous e-commerce sites exist on the web, and many products are common to several of them but sold at different prices. Comparing its product information with that of other companies and performing competitive analysis can therefore be quite time-consuming for an e-commerce company. It would be much easier if the company could compare the prices and other information of products listed on different sites in a single place.

2.2. OBJECTIVES

To develop, implement and evaluate an API to perform competitive analysis on

various E-commerce business platforms.

2.3. PURPOSE, SCOPE AND APPLICABILITY

The purpose of this project is to reduce the user’s workload of comparing the price of a product across different e-commerce sites. There is a lot of scope in this area, since the number of e-commerce sites is expected to surge, with many of them carrying common products. Our API addresses this problem by providing a platform that helps an e-commerce company compare its product data with that of various other e-commerce sites and stay competitive by adjusting its prices accordingly. Currently the API may be somewhat complex to use, since it is still in development and the algorithm is not yet 100% efficient, but it can be improved in future. It is applicable wherever the number of e-commerce sites is large enough that analyzing the product information of common products manually would be difficult.


3. TOOLS AND ENVIRONMENT USED

Environment:

Operating System used-Windows 7/8

Development Environment- Python 2.7.9 IDLE

Tools : Python Modules which include :

Numpy

Pandas

Csv

Json

Difflib

Operator

SequenceMatcher (a class provided by Difflib)

Other tools used : Visual C++ for Python 2.7 (VCForPython27), PythonGUI.

Dependencies

Python 2.7

Csv

operator

Numpy

Pandas

Json

difflib


4. PYTHON (PROGRAMMING LANGUAGE)

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python is designed to be highly readable. It frequently uses English keywords where other languages use punctuation, and it has fewer syntactical constructions than other languages.

Python is Interpreted: Python is processed at runtime by the interpreter. You

do not need to compile your program before executing it. This is similar to

PERL and PHP.

Python is Interactive: You can actually sit at a Python prompt and interact

with the interpreter directly to write your programs.

Python is Object-Oriented: Python supports Object-Oriented style or

technique of programming that encapsulates code within objects.

Python is a Beginner's Language: Python is a great language for beginner-level programmers and supports the development of a wide range of applications, from simple text processing to WWW browsers to games.

4.1. PYTHON FEATURES

Python's features include:

Easy-to-learn: Python has few keywords, simple structure, and a clearly

defined syntax. This allows the student to pick up the language quickly.

Easy-to-read: Python code is more clearly defined and visible to the eyes.

Easy-to-maintain: Python's source code is fairly easy-to-maintain.

A broad standard library: The bulk of Python's library is very portable and cross-platform compatible on UNIX, Windows, and Macintosh.

Interactive Mode: Python has support for an interactive mode which allows interactive testing and debugging of snippets of code.

Portable: Python can run on a wide variety of hardware platforms and has the

same interface on all platforms.


Extendable: You can add low-level modules to the Python interpreter. These

modules enable programmers to add to or customize their tools to be more

efficient.

Databases: Python provides interfaces to all major commercial databases.

GUI Programming: Python supports GUI applications that can be created

and ported to many system calls, libraries and windows systems, such as

Windows MFC, Macintosh, and the X Window system of Unix.

Scalable: Python provides a better structure and support for large programs

than shell scripting.

Apart from the above-mentioned features, Python has many other good features; a few are listed below:

It supports functional and structured programming methods as well as OOP.

It can be used as a scripting language or can be compiled to byte-code for building large applications.

It provides very high-level dynamic data types and supports dynamic type checking.

It supports automatic garbage collection.

It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

4.2. STATEMENTS AND CONTROL FLOW

Python's statements include (among others):

The if statement, which conditionally executes a block of code, along with else and elif (a contraction of else-if).

The for statement, which iterates over an iterable object, capturing each element to a local variable for use by the attached block.

The while statement, which executes a block of code as long as its condition is true.


The try statement, which allows exceptions raised in its attached code block to be

caught and handled by except clauses; it also ensures that clean-up code in a

finally block will always be run regardless of how the block exits.

The class statement, which executes a block of code and attaches its local

namespace to a class, for use in object-oriented programming.

The def statement, which defines a function or method.

The with statement (from Python 2.5), which encloses a code block within a

context manager (for example, acquiring a lock before the block of code is run

and releasing the lock afterwards, or opening a file and then closing it),

allowing RAII-like behavior.

The pass statement, which serves as a NOP. It is syntactically needed to create an

empty code block.

The assert statement , used during debugging to check for conditions that ought to

apply.

The yield statement, which returns a value from a generator function. From

Python 2.5, yield is also an operator. This form is used to implement coroutines.

The import statement, which is used to import modules whose functions or

variables can be used in the current program.

The print statement was changed to the print() function in Python 3.
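The statements listed above can be seen together in a short sketch. The function names (classify, countdown, safe_divide) and the temporary file are hypothetical and chosen only for illustration; they are not taken from the project's source code.

```python
import os
import tempfile


def classify(n):
    """if / elif / else: exactly one branch is executed."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"


def countdown(n):
    """yield turns this function into a generator."""
    while n > 0:              # while: loop as long as the condition is true
        yield n
        n -= 1


def safe_divide(a, b):
    """try / except / finally: handle an error, always run the clean-up."""
    try:
        return a / b
    except ZeroDivisionError:
        return None
    finally:
        pass                  # pass: a no-op standing in for clean-up code


labels = []
for n in (-2, 0, 5):          # for: iterate over any iterable object
    labels.append(classify(n))

ticks = list(countdown(3))

# with: the file is closed automatically when the block exits.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as handle:
    handle.write("hello")
with open(path) as handle:
    content = handle.read()

assert content == "hello"     # assert: check a condition that ought to hold
```

The sketch avoids print so that it behaves the same under Python 2.7 and Python 3.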

4.3. EXPRESSIONS

Python expressions are similar to languages such as C and Java:

Addition, subtraction, and multiplication are the same, but the behavior of

division differs (see Mathematics for details). Python also added the ** operator

for exponentiation.

In Python, == compares by value, in contrast to Java, where it compares by

reference. (Value comparisons in Java use the equals() method.)


Python's is operator may be used to compare object identities (comparison by

reference). Comparisons may be chained, for example a <= b <= c .

Python uses the words and , or , not for its boolean operators rather than the

symbolic && , || , ! used in Java and C.

Python has a type of expression termed a list comprehension. Python 2.4 extended

list comprehensions into a more general expression termed

a generator expression.[40]

Anonymous functions are implemented using lambda expressions; however, these

are limited in that the body can only be a single expression.

Conditional expressions in Python are written as x if c else y [57] (different in order

of operands from the ?: operator common to many other languages).

Python makes a distinction between lists and tuples. Lists are written as [1, 2, 3], are mutable, and cannot be used as the keys of dictionaries (dictionary keys must be immutable in Python). Tuples are written as (1, 2, 3), are immutable and thus can be used as the keys of dictionaries, provided all elements of the tuple are immutable. The parentheses around the tuple are optional in some contexts. Tuples can appear on the left side of an equal sign; hence a statement like x, y = y, x can be used to swap two variables.

Python has a "string format" operator %. This functions analogously to printf format strings in C, e.g. "foo=%s bar=%d" % ("blah", 2) evaluates to "foo=blah bar=2". In Python 3 and 2.6+, this was supplemented by the format() method of the str class, e.g. "foo={0} bar={1}".format("blah", 2).

Python has various kinds of string literals:

Strings delimited by single or double quotation marks. Unlike in Unix

shells, Perl and Perl-influenced languages, single quotation marks and double

quotation marks function identically. Both kinds of string use the backslash

( \ ) as an escape character and there is no implicit string interpolation such

as "$foo" .


Triple-quoted strings, which begin and end with a series of three single or

double quotation marks. They may span multiple lines and function like here

documents in shells, Perl and Ruby.

Raw string varieties, denoted by prefixing the string literal with an r . No

escape sequences are interpreted; hence raw strings are useful where literal

backslashes are common, such as regular expressions and Windows-style

paths. Compare " @ -quoting" in C#.

Python has index and slice expressions on lists, denoted

as a[key] , a[start:stop] or a[start:stop:step] . Indexes are zero-based, and

negative indexes are relative to the end. Slices take elements from the start index

up to, but not including, the stop index. The third slice parameter,

called step or stride, allows elements to be skipped and reversed. Slice indexes

may be omitted, for example a[:] returns a copy of the entire list. Each element of

a slice is a shallow copy.

In Python, a distinction between expressions and statements is rigidly enforced, in

contrast to languages such as Common Lisp, Scheme, or Ruby. This leads to some

duplication of functionality. For example:

List comprehensions vs. for -loops

Conditional expressions vs. if blocks

The eval() vs. exec() built-in functions (in Python 2, exec is a statement); the

former is for expressions, the latter is for statements.
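Several of the expression forms described in this section are illustrated in the following minimal sketch. All variable names are illustrative assumptions; the code runs unchanged under Python 2.7 and Python 3.

```python
# == compares values; `is` compares object identity.
first = [1, 2, 3]
second = [1, 2, 3]
same_value = first == second
same_object = first is second       # two distinct list objects

# A list comprehension and the corresponding generator expression.
squares = [n * n for n in range(5)]
total = sum(n * n for n in range(5))

# Conditional expression: value_if_true if condition else value_if_false.
parity = "even" if total % 2 == 0 else "odd"

# Tuple unpacking swaps two variables without a temporary.
x, y = 1, 2
x, y = y, x

# Both string-formatting styles produce the same text.
old_style = "foo=%s bar=%d" % ("blah", 2)
new_style = "foo={0} bar={1}".format("blah", 2)

# Raw strings keep backslashes literal (no \n or \t escapes here).
windows_path = r"C:\new\table"

# Slices: start is included, stop is excluded, step may be negative.
letters = ["a", "b", "c", "d", "e"]
middle = letters[1:4]
backwards = letters[::-1]
shallow_copy = letters[:]
```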

4.4. METHODS

Methods on objects are functions attached to the object's class; the syntax instance.method(argument) is, for normal methods and functions, syntactic sugar for Class.method(instance, argument). Python methods have an explicit self parameter to access instance data, in contrast to the implicit self (or this) in some other object-oriented programming languages (e.g. C++, Java, Objective-C, or Ruby).
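The equivalence of the two call forms can be checked directly. The Product class below is a hypothetical example written for this illustration, not a class from the project.

```python
class Product(object):
    """A hypothetical class used only for illustration."""

    def __init__(self, name, price):
        self.name = name
        self.price = price

    def discounted(self, percent):
        # `self` is the instance on which the method was called.
        return self.price * (100 - percent) / 100.0


phone = Product("IPhone 6", 50000)

via_instance = phone.discounted(10)          # usual call syntax
via_class = Product.discounted(phone, 10)    # what it is sugar for
```

Both calls pass the same instance as self, so they return the same value.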


4.5. MATHEMATICS

Python has the usual C arithmetic operators (+, -, *, /, %). It also has ** for

exponentiation, e.g. 5**3 == 125 and 9**.5 == 3.0 and a new matrix multiply

operator @ coming in 3.5.[61]

The behavior of division has changed significantly over time.[62]

Python 2.1 and earlier use the C division behavior. The / operator is integer

division if both operands are integers, and floating point division otherwise.

Integer division rounds towards 0, e.g. 7 / 3 == 2 and -7 / 3 == -2 .

Python 2.2 changes integer division to round towards negative infinity, e.g. 7 / 3

== 2 and -7 / 3 == -3 . The floor division // operator is introduced. So 7 // 3 ==

2 , -7 // 3 == -3 , 7.5 // 3 == 2.0 and -7.5 // 3 == -3.0 . Adding from __future__

import division causes a module to use Python 3.0 rules for division (see next).

Python 3.0 changes / to always be floating point division. In Python terms, the

pre-3.0 / is "classic division", the 3.0 / is "real division", and // is "floor division".

Rounding towards negative infinity, though different from most languages, adds

consistency. For instance, it means that the equation (a+b) // b == a // b + 1 is always

true. It also means that the equation b * (a // b) + a % b == a is valid for both positive

and negative values of a . However, maintaining the validity of this equation means

that while the result of a % b is, as expected, in the half-open interval [0,b),

where b is a positive integer, it has to lie in the interval (b,0] when b is negative.[63]

Python provides a round function for rounding floats to integers. Versions before 3 use round-away-from-zero: round(0.5) is 1.0, round(-0.5) is −1.0.[64] Python 3 uses round-to-even: round(1.5) is 2, round(2.5) is 2.[65] The Decimal type/class in module decimal (since version 2.4) provides exact numerical representation and several rounding modes.
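The floor-division rules and the identity b * (a // b) + a % b == a can be verified with a few lines. This is only a sketch: the helper name division_identity_holds is an assumption, and the built-in round is deliberately avoided because its behavior differs between Python 2 and Python 3. The Decimal calls show how a rounding mode can be chosen explicitly instead.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Floor division rounds towards negative infinity.
assert 7 // 3 == 2
assert -7 // 3 == -3
assert 7.5 // 3 == 2.0
assert -7.5 // 3 == -3.0


def division_identity_holds(a, b):
    """Check b * (a // b) + a % b == a for one pair of integers."""
    return b * (a // b) + a % b == a


pairs = [(7, 3), (-7, 3), (7, -3), (-7, -3)]
identity_ok = all(division_identity_holds(a, b) for a, b in pairs)

# The sign of a % b follows the divisor b.
remainders = [a % b for a, b in pairs]

# Decimal lets the caller pick the rounding mode explicitly.
half = Decimal("2.5")
rounded_up = half.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
rounded_even = half.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
```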


Python allows boolean expressions with multiple equality relations in a manner that is

consistent with general usage in mathematics. For example, the expression a < b <

c tests whether a is less than b and b is less than c . C-derived languages interpret

this expression differently: in C, the expression would first evaluate a < b , resulting

in 0 or 1, and that result would then be compared with c .[66][page needed]
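The difference between the two readings shows up when the values are in descending order. In the sketch below, the parenthesized form imitates the C interpretation; the values 3, 2, 1 are chosen only for illustration.

```python
a, b, c = 3, 2, 1

# Python reading: (a < b) and (b < c).
chained = a < b < c

# C-style reading: evaluate a < b first, then compare that result with c.
# Here a < b is False, which compares as 0, and 0 < 1 is true.
c_style = (a < b) < c
```

The chained form is false, as in mathematics, while the C-style reading yields a true result for the same values.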

Due to Python's extensive mathematics library, it is frequently used as a scientific

scripting language to aid in problems such as data processing and manipulation.

4.6. LIBRARIES

Python has a large standard library, commonly cited as one of Python's greatest

strengths,[67] providing tools suited to many tasks. This is deliberate and has been

described as a "batteries included"[26] Python philosophy. For Internet-facing

applications, a large number of standard formats and protocols (such

as MIME and HTTP) are supported. Modules for creating graphical user interfaces,

connecting to relational databases, pseudorandom number generators, arithmetic with

arbitrary precision decimals,[68] manipulating regular expressions, and doing unit

testing are also included.

Some parts of the standard library are covered by specifications (for example,

the WSGI implementation wsgiref follows PEP 333[69]), but the majority of the

modules are not. They are specified by their code, internal documentation, and test

suite (if supplied). However, because most of the standard library is cross-platform

Python code, there are only a few modules that must be altered or completely

rewritten by alternative implementations.

The standard library is not essential to run Python or embed Python within an

application. Blender 2.49, for instance, omits most of the standard library.

As of January 2015, the Python Package Index, the official repository of third-party

software for Python, contains more than 54,000 packages offering a wide range of

functionality, including:



graphical user interfaces, web frameworks, multimedia, databases, networking

and communications

test frameworks, automation and web scraping, documentation tools, system

administration

scientific computing, text processing, image processing

4.7. DEVELOPMENT ENVIRONMENTS

Most Python implementations (including CPython) can function as a command line

interpreter, for which the user enters statements sequentially and receives the results

immediately (REPL). In short, Python acts as a shell.

Other shells add capabilities beyond those in the basic interpreter,

including IDLE and IPython. While generally following the visual style of the Python

shell, they implement features like auto-completion, retention of session state, and

syntax highlighting. In addition to standard desktop Python IDEs (integrated

development environments), there are also browser-based IDEs, Sage (intended for

developing science and math-related Python programs), and a browser-based IDE and

hosting environment, PythonAnywhere.

5. NUMPY

NumPy is the fundamental package for scientific computing with Python. It contains

among other things:

a powerful N-dimensional array object

sophisticated (broadcasting) functions

tools for integrating C/C++ and Fortran code

useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-

dimensional container of generic data. Arbitrary data-types can be defined. This

allows NumPy to seamlessly and speedily integrate with a wide variety of databases.


NumPy is an extension to the Python programming language, adding support for

large, multi-dimensional arrays and matrices, along with a large library of high-

level mathematical functions to operate on these arrays. The ancestor of NumPy,

Numeric, was originally created by Jim Hugunin with contributions from several

other developers. In 2005, Travis Oliphant created NumPy by incorporating features

of the competing Numarray into Numeric, with extensive modifications. NumPy

is open source and has many contributors.

6. PYTHON DATA ANALYSIS LIBRARY

pandas is an open source, BSD-licensed library providing high-performance, easy-to-

use data structures and data analysis tools for the Python programming language.

Library Highlights

A fast and efficient DataFrame object for data manipulation with integrated

indexing;

Tools for reading and writing data between in-memory data structures and

different formats: CSV and text files, Microsoft Excel, SQL databases, and the

fast HDF5 format;

Intelligent data alignment and integrated handling of missing data: gain

automatic label-based alignment in computations and easily manipulate messy

data into an orderly form;

Flexible reshaping and pivoting of data sets;

Intelligent label-based slicing, fancy indexing, and subsetting of large data

sets;

Columns can be inserted and deleted from data structures for size mutability;

Aggregating or transforming data with a powerful group by engine allowing

split-apply-combine operations on data sets;

High performance merging and joining of data sets;

Hierarchical axis indexing provides an intuitive way of working with high-

dimensional data in a lower-dimensional data structure;

Time series-functionality: date range generation and frequency conversion,

moving window statistics, moving window linear regressions, date shifting


and lagging. Even create domain-specific time offsets and join time series

without losing data;

Highly optimized for performance, with critical code paths written

in Cython or C.

Python with pandas is in use in a wide variety of academic and

commercial domains, including Finance, Neuroscience, Economics, Statistics,

Advertising, Web Analytics, and more.

pandas is a Python package providing fast, flexible, and expressive data structures

designed to make working with “relational” or “labeled” data both easy and intuitive.

It aims to be the fundamental high-level building block for doing practical, real

world data analysis in Python. Additionally, it has the broader goal of becoming the

most powerful and flexible open source data analysis / manipulation tool

available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

Tabular data with heterogeneously-typed columns, as in an SQL table

or Excel spreadsheet

Ordered and unordered (not necessarily fixed-frequency) time series

data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with

row and column labels

Any other form of observational / statistical data sets. The data actually

need not be labeled at all to be placed into a pandas data structure

Here are just a few of the things that pandas does well:

Easy handling of missing data (represented as NaN) in floating point

as well as non-floating point data

Size mutability: columns can be inserted and deleted from DataFrame

and higher dimensional objects

Automatic and explicit data alignment: objects can be explicitly

aligned to a set of labels, or the user can simply ignore the labels and


let Series, DataFrame, etc. automatically align the data for you in

computations

Powerful, flexible group by functionality to perform split-apply-

combine operations on data sets, for both aggregating and transforming

data

Make it easy to convert ragged, differently-indexed data in other

Python and NumPy data structures into DataFrame objects

Intelligent label-based slicing, fancy indexing, and subsetting of large

data sets

Intuitive merging and joining data sets

Flexible reshaping and pivoting of data sets

Hierarchical labeling of axes (possible to have multiple labels per

tick)

Robust IO tools for loading data from flat files (CSV and delimited),

Excel files, databases, and saving / loading data from the

ultrafast HDF5 format

Time series-specific functionality: date range generation and

frequency conversion, moving window statistics, moving window

linear regressions, date shifting and lagging, etc.

Many of these principles are here to address the shortcomings frequently experienced

using other languages / scientific research environments. For data scientists, working

with data is typically divided into multiple stages: munging and cleaning data,

analyzing / modeling it, then organizing the results of the analysis into a form suitable

for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes

pandas is fast. Many of the low-level algorithmic bits have been

extensively tweaked in Cython code. However, as with anything else

generalization usually sacrifices performance. So if you focus on one

feature for your application you may be able to create a faster

specialized tool.

pandas is a dependency of statsmodels, making it an important part of

the statistical computing ecosystem in Python.


pandas has been used extensively in production in financial

applications.

7. JSON (JAVASCRIPT OBJECT NOTATION)

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy

for humans to read and write. It is easy for machines to parse and generate. It is based

on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd

Edition - December 1999. JSON is a text format that is completely language

independent but uses conventions that are familiar to programmers of the C-family of

languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others.

These properties make JSON an ideal data-interchange language.

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as

an object, record, struct, dictionary, hash table, keyed list, or associative array.

An ordered list of values. In most languages, this is realized as an array,

vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages

support them in one form or another. It makes sense that a data format that is

interchangeable with programming languages also be based on these structures.

JSON IN PYTHON 2.X

JSON support is provided by the json module; its default behavior is as follows.

Transformation from JSON to Python and back is symmetrical (only formatting is lost).

Transformation from Python to JSON and back is not: for example, byte strings are converted to unicode objects.
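The asymmetry can be observed with json.loads and json.dumps. The record below is a hypothetical crawled product entry; its field names are assumptions, not the project's actual dataset schema. Because the byte-string case is specific to Python 2, the sketch demonstrates two other default conversions that lose information in the same direction: tuples and non-string dictionary keys.

```python
import json

# A hypothetical crawled product record (field names are assumptions).
text = '{"name": "IPhone 6", "price": 50000, "colors": ["black", "white"]}'

record = json.loads(text)                     # JSON text -> Python objects
round_trip = json.loads(json.dumps(record))   # JSON -> Python -> JSON -> Python

# Starting from Python is lossy: tuples come back as lists and
# non-string dictionary keys come back as strings.
original = {1: ("black", "white")}
restored = json.loads(json.dumps(original))
```

Data that started as JSON survives the round trip unchanged, while the Python dictionary does not compare equal to its restored form.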

operator – Functional interface to built-in operators


Purpose: Functional interface to built-in operators.

Functional programming using iterators occasionally requires creating small functions

for simple expressions. Sometimes these can be expressed as lambda functions, but

some operations do not need to be implemented with custom functions at all.

The operator module defines functions that correspond to built-in operations for

arithmetic and comparison, as well as sequence and dictionary operations.
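As an illustration of replacing small lambda functions, the sketch below sorts and sums some offers with operator.itemgetter and operator.add. The (site, price) pairs are invented for the example, and how the project itself uses the operator module is not shown in this section, so this is only an assumed use.

```python
import operator
from functools import reduce

# Hypothetical (site, price) pairs for one product.
offers = [("snapdeal", 51500), ("flipkart", 49999), ("amazon", 50500)]

# operator.itemgetter(1) replaces `lambda offer: offer[1]`.
by_price = sorted(offers, key=operator.itemgetter(1))
cheapest_site = by_price[0][0]

# operator.add replaces `lambda x, y: x + y`.
total = reduce(operator.add, [price for _, price in offers])
```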

8. DIFFLIB

This module provides classes and functions for comparing sequences. It can be used, for example, to compare files, and can produce difference information in various

formats, including HTML and context and unified diffs. For comparing directories

and files,

class difflib.

8.1. SEQUENCE MATCHER

This is a flexible class for comparing pairs of sequences of any type, so long as the

sequence elements are hashable. The basic algorithm predates, and is a little fancier

than, an algorithm published in the late 1980’s by Ratcliff and Obershelp under the

hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous

matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp

algorithm doesn’t address junk). The same idea is then applied recursively to the

pieces of the sequences to the left and to the right of the matching subsequence. This

does not yield minimal edit sequences, but does tend to yield matches that “look

right” to people.
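
The behaviour described above can be tried on two short strings. This is a small sketch for Python 3 that uses only documented SequenceMatcher methods; the strings are arbitrary.

```python
from difflib import SequenceMatcher

a = "abcd"
b = "bcde"
matcher = SequenceMatcher(None, a, b)  # None: no element is treated as junk

# Longest contiguous matching block shared by a and b.
match = matcher.find_longest_match(0, len(a), 0, len(b))
print(a[match.a:match.a + match.size])  # bcd

# ratio() = 2 * (matched elements) / (total elements) = 2 * 3 / 8
print(matcher.ratio())        # 0.75

# quick_ratio() is a cheaper upper bound on ratio().
print(matcher.quick_ratio())  # 0.75
```

quick_ratio() is the method used in the source code of this report; it never returns less than ratio(), so it may accept pairs that ratio() would reject at the same threshold.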


9. COMMA-SEPARATED VALUES (CSV)

A comma-separated values (CSV) (also sometimes called character-separated

values) file stores tabular data (numbers and text) in plain-text form. Plain text means

that the file is a sequence of characters, with no data that has to be interpreted as

binary numbers. A CSV file consists of any number of records, separated by line

breaks of some kind; each record consists of fields, separated by some other character

or string, most commonly a literal comma or tab. Usually, all records have an

identical sequence of fields.

"CSV" refers to any file that:

1. is plain text using a character set such as ASCII, various Unicode character

sets (e.g. UTF-8), EBCDIC, or Shift JIS,

2. consists of records (typically one record per line),

3. with the records divided into fields separated by delimiters (typically a single

reserved character such as comma, semicolon, or tab; sometimes the delimiter

may include optional spaces),

4. where every record has the same sequence of fields.

Within these general constraints, many variations are in use. Therefore "CSV" files

are not entirely portable. Nevertheless, the variations are fairly small, and many

implementations allow users to preview the first few lines of the file (which is feasible

because it is plain text), and then specify the delimiter character(s), quoting rules, etc.

If a particular CSV file's variations fall outside what a particular receiving program

supports, it is often feasible to examine and edit the file by hand or write a script or

program to fix the problem.
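
As an illustration of specifying the delimiter and quoting rules, the sketch below reads a semicolon-delimited text with Python's csv module. The sample data is invented, and the code is written for Python 3.

```python
import csv
import io

# Semicolon-delimited text; one quoted field contains the delimiter itself.
text = 'title;price\n"Case; black";299\nCharger;450\n'

reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
rows = list(reader)

print(rows[0])  # ['title', 'price']
print(rows[1])  # ['Case; black', '299']
```

Every field is returned as a string; numeric columns have to be converted by the caller.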

10. DESIGN DOCUMENT

10.1. MODULARIZATION DETAILS

The source code uses several libraries (modules): numpy, json, csv, difflib and

pandas. The project takes data in JSON format as input and shows the output on the

output screen, i.e. the Python shell. The output is also stored in a CSV file.

INPUT DATA FORMAT : JSON

OUTPUT DATA FORMAT : CSV/Python shell
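
The data flow can be sketched as follows, assuming one JSON object per input line (which is how the source code reads its data file). The field names title, source and mrp are taken from the source code; the record values are invented, an in-memory buffer stands in for the output file, and the code is written for Python 3.

```python
import csv
import io
import json

# Two input records, one JSON object per line (values are invented).
json_lines = (
    '{"title": "Sony Xperia Z", "source": "Flipkart", "mrp": 38990}\n'
    '{"title": "Xperia Z", "source": "SnapDeal", "mrp": 38990}\n'
)

records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

out = io.StringIO()  # stands in for the output CSV file
writer = csv.DictWriter(out, fieldnames=["title", "source", "mrp"])
writer.writeheader()
writer.writerows(records)

print(out.getvalue())
```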

11. SOURCE CODE

'''
Created on 7-May-2015

@author: gaurav & vishesh
'''

import csv
import operator
import numpy as np
from collections import defaultdict
from operator import itemgetter
import json
import pandas as pd
from difflib import SequenceMatcher as SM


class CodingTest:
    '''
    loads the crawled product data and matches products across competitors.
    '''

    def __init__(self):
        '''
        initialization of the variables.
        '''
        self.dataFile = "C:\\Python27\\project\\data.txt"
        self.csvToList = list()
        self.dataList = list()
        self.frame = pd.DataFrame()
        self.SnapDeal = list()
        self.Flipkart = list()
        self.Amazon = list()
        self.gg = {}
        self.values = dict()
        self.competitorList = ['SnapDeal','Flipkart','Amazon']
        self.mylist = list()

    def createDataDict(self,data,a,mylist):
        '''
        function to append one product record to mylist in report format.
        '''
        title = data['title']
        mrp = data['mrp']
        source = data['source']
        url = data['url']
        stock = data['stock']
        selling_price = data['available_price']
        mylist.append({'Title':title,'MRP':mrp,'Source':source, 'URL':url,'Stock_Status':stock,'Selling_Price': selling_price})
        return mylist

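    def filterData(self):
        '''
        function to read the data file and strip whitespace from string values.
        '''
        dataList = list()
        with open(self.dataFile, "r") as loadFile:
            for line in loadFile:
                try:
                    data = json.loads(line)
                    # json.loads returns unicode strings in Python 2, so strip
                    # only the values that are strings.
                    a = dict((k.strip(), v.strip() if isinstance(v, basestring) else v)
                             for k, v in data.items())
                except (ValueError, AttributeError):
                    # skip lines that are not valid JSON objects
                    continue
                print a
                dataList.append(a)
        return dataList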

    def loadDataFile(self):
        '''
        function to load data file.
        '''
        SnapDeal = list()
        Flipkart = list()
        Amazon = list()
        with open(self.dataFile, "r") as loadFile:
            print 'file is ok'
            for line in loadFile:
                try:
                    data = json.loads(line)
                    # group the records by the site they were crawled from
                    if data['source'] == 'SnapDeal':
                        self.createDataDict(data,'SnapDeal',SnapDeal)
                    elif data['source'] == 'Flipkart':
                        self.createDataDict(data,'Flipkart',Flipkart)
                    elif data['source'] == 'Amazon':
                        self.createDataDict(data,'Amazon',Amazon)
                except (ValueError, KeyError, TypeError):
                    # skip malformed lines and records with missing fields
                    continue
        # the lists are returned directly, so they exist even when a site has no records
        return(SnapDeal,Flipkart,Amazon)

    def productMatching(self):
        '''
        function to match SnapDeal titles against Flipkart titles and write
        the matched Flipkart records to a csv file.
        '''
        threshold = .70 #you can adjust the threshold as per your requirement.
        e = list() # best (quotient, SnapDeal title, Flipkart title) per SnapDeal product
        r = list() # Flipkart records that were matched
        Snapdeal,Flipkart,Amazon = self.loadDataFile()
        a = [item["Title"] for item in Snapdeal]
        d = [item["Title"] for item in Flipkart]
        # index the Flipkart records by title for lookup after matching
        biglist1_indexed = dict((item["Title"], item) for item in Flipkart)
        # Amazon titles are loaded but not matched yet
        for i in a:
            c = list()
            for k in d:
                s = SM(None,i,k)
                j = s.quick_ratio()
                if j >= threshold:
                    c.append((j,i,k))
            # keep only the best-scoring candidate for this SnapDeal title
            c.sort(reverse = True)
            if c:
                e.append(c[0])
        print 'List containing the top matching results with respective matching quotient'
        for s in e:
            print s
        for i in biglist1_indexed:
            for t in e:
                if (t[2] == i):
                    r.append(biglist1_indexed[i])
        if not r:
            # nothing matched, so there is no header row to write
            print 'no matches found'
            return
        keys = r[0].keys()
        with open('C:\\Python27\\project\\aa.csv', 'wb') as output_file:
            dict_writer = csv.DictWriter(output_file, keys)
            dict_writer.writeheader()
            dict_writer.writerows(r)
        print 'done'
        # TODO: make a dictionary which contains the attributes of snapdeal also

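    def createCSVReport(self, mydict):
        '''
        function to write the key/value pairs of mydict to the report file.
        '''
        # column headings intended for the report; they are not written yet
        columns = ['Flipkart Product Title','Flipkart Pro. Stock Status','Flipkart Pro MRP','Flipkart Selling Price' ]
        with open('C:\\Python27\\project\\aa.csv', 'w') as f:
            for key, value in mydict.items():
                f.write('{0},{1}\n'.format(key, value))

    def comupteRecallAndPrecision(self):
        # recall and precision are not computed anywhere in this program yet
        raise NotImplementedError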
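
The source code does not compute recall and precision. One possible way to do so is sketched below, assuming that a hand-labelled set of correct title pairs is available. The function name precision_recall and all data are hypothetical and are not part of the project; the code is written for Python 3.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of matched title pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # precision: share of reported pairs that are correct
    precision = true_positives / len(predicted) if predicted else 0.0
    # recall: share of correct pairs that were reported
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: pairs reported by the matcher vs. hand-labelled pairs.
predicted = [("xperia z", "sony xperia z"), ("moto g", "moto e")]
actual = [("xperia z", "sony xperia z"), ("lumia 520", "nokia lumia 520")]

print(precision_recall(predicted, actual))  # (0.5, 0.5)
```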

12. TESTING


12.1. UNIT TESTING

In computer programming, unit testing is a software testing method by which

individual units of source code, sets of one or more computer program modules

together with associated control data, usage procedures, and operating procedures, are

tested to determine whether they are fit for use.

Here we have treated the functions as individual modules or units and executed

them individually in order to test them.
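
A unit test for one such function could look like the sketch below, which uses Python's unittest module. The helper best_match is hypothetical (it is a simplified stand-in for the matching step, not a function of the project), and the code is written for Python 3.

```python
import unittest
from difflib import SequenceMatcher


def best_match(title, candidates, threshold=0.70):
    """Hypothetical helper: return the best-scoring candidate, or None."""
    scored = [(SequenceMatcher(None, title, c).quick_ratio(), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored)[1] if scored else None


class BestMatchTest(unittest.TestCase):
    def test_identical_title_is_matched(self):
        self.assertEqual(best_match("moto g", ["moto g", "lumia 520"]), "moto g")

    def test_unrelated_title_is_rejected(self):
        self.assertIsNone(best_match("moto g", ["zzzz"]))


# Run the test case without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BestMatchTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```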

12.2. INTEGRATION TESTING

Integration testing (sometimes called integration and testing, abbreviated I&T) is

the phase in software testing in which individual software modules are combined

and tested as a group. It occurs after unit testing and before validation testing.

Here we have tested the functions working together, i.e. in integrated form.

12.3. SYSTEM TESTING

System testing of software or hardware is testing conducted on a complete, integrated

system to evaluate the system's compliance with its specified requirements. System

testing falls within the scope of black box testing, and as such, should require no

knowledge of the inner design of the code or logic.

Here we have tested the system as a whole, and the results are as expected and quite

satisfactory.

13. REPORTS

13.1. UNIT TESTING


Tested units: competitorList, createCSVReport, values, mylist, productMatching.

The unit testing report is shown above.

13.2. INTEGRATION TESTING

System integration was tested by running the most important function, productMatching(), which is linked and integrated with the other functions.

productMatching() contains the product-matching algorithm, which is the most important part of this program.

13.3. SYSTEM TESTING

The integrated system was tested as a whole by running it in its

production environment. The system testing report is shown above in the screenshot.

14. INPUT AND OUTPUT SCREENS

14.1. INPUT

This is the input data in JSON format, which was crawled from the internet. The data

contains information on the products of various e-commerce sites (Flipkart,

SnapDeal and Amazon).

14.2. OUTPUT

Output at threshold = 0.70

Using quick_ratio()


Output at threshold = 0.80

Using quick_ratio()


15. CONCLUSION

This API is system software that helps a Business Intelligence team devise

market strategies. Its main task is the product-matching algorithm, which

matches products across various e-commerce platforms and gives analytical

insight into the market capture of a product or a company against its

competitors. The API is designed to address this problem and reduces the overall

effort manyfold. E-commerce companies can use it to perform competitive

analysis, and can use the results generated by this project to plan market

strategies that could help them capture more of the market and thereby

increase profits. Wider use of such software by e-commerce companies should

lead to better strategy and planning, and should encourage software developers

to pay more attention to this area, which will help business organizations

perform data analysis more efficiently and in one place.


16. LIMITATIONS OF THE PROJECT

The software has its own set of limitations. The foremost is the accuracy of the

matching algorithm, which is currently below 100%. Product names vary between

sites: for example, a product listed as sony xperia z on Flipkart may be listed as

xperia z on SnapDeal. In such cases the algorithm may fail to match the products or

may give unexpected results. To address this problem, more advanced mathematical

techniques, such as probability models, could be incorporated into the algorithm.
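
The effect of such name variations can be reproduced with the two titles from the example above. In this sketch the third title, sony xperia z1, is a made-up name for a different product, and the thresholds 0.70 and 0.80 are the two values shown in the output screens; the code is written for Python 3.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Similarity quotient as used in the source code (quick_ratio)."""
    return SequenceMatcher(None, a, b).quick_ratio()


same_product = score("xperia z", "sony xperia z")         # 16/21, about 0.76
other_product = score("sony xperia z1", "sony xperia z")  # 26/27, about 0.96

# The same product under a shorter name passes a 0.70 threshold but not 0.80,
# while the different model passes both.
print(round(same_product, 2), round(other_product, 2))  # 0.76 0.96
```

With a character-based quotient, a different model can therefore outscore the same product listed under a shorter name, so the choice of threshold alone cannot separate the two cases.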

The project does not yet have a GUI and the software is still in its infancy, which

makes it less user friendly. At present it can be operated only by people who have

some knowledge of the internal structure and working of the API.

17. FUTURE APPLICATIONS OF THE PROJECT

The application currently matches products along with their data and outputs the

data in the desired format in one place, but the accuracy of the matching algorithm

is below 100%. In future the algorithm can be improved so that its results approach

100% accuracy.

A proper GUI would make the application more intuitive, interactive and user

friendly.


