
Post on 15-Mar-2016


Page 1: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Effective placement of on-line advertising

Tomáš Kliegr, KIZI

A case study approach

Page 2: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Methodology

• CRISP-DM

Page 3: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

I. Data collection

• Data are collected on the server application layer

• No demands on the tracked website

• ASP.NET must be supported

Page 4: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

UML Sequence diagram

Page 5: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Comparison with log-file based approaches

Advantages
• Works with all browsers with cookies enabled
• Automatic robot filtering
• Storage efficiency
• Easy to integrate and safe to operate

Disadvantages
• Database required
• Hosting must support the .NET Framework

Page 6: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

II. Data preprocessing

Problem: collected clickstreams have varying lengths.

This phase creates a fixed-length visitor profile in a two-step process:

• Segment procedure: classifies pages into a domain-specific taxonomy on several levels of granularity.
• Merge procedure: extracts important and characteristic information from the visitor’s clickstream.

Page 7: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Segment procedure

• Classifies pages into a domain-specific taxonomy on several levels of granularity.

• Assigns Time on page and Score to each page in the visitor’s clickstream.

• Score expresses the absolute weight of a particular page in the user’s clickstream.

S = (ln(o) + 1) * t
o – order of the page in the user’s clickstream
t – time on page
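The score formula can be illustrated with a minimal sketch. It assumes that the order o counts from 1 (so the first page gets weight 1) and that t is measured in seconds; the slides state neither, and the function name `page_score` is hypothetical.

```python
import math


def page_score(order: int, time_on_page: float) -> float:
    """Score S = (ln(o) + 1) * t for one page of a visitor's clickstream.

    Assumption (not stated in the slides): order starts at 1 and
    time_on_page is given in seconds.
    """
    if order < 1:
        raise ValueError("order must be >= 1")
    return (math.log(order) + 1.0) * time_on_page


# Hypothetical clickstream: (order of page, seconds spent on it)
clicks = [(1, 30.0), (2, 12.0), (3, 45.0)]
scores = [page_score(o, t) for o, t in clicks]
```

Because of the logarithm, pages visited later in the session are weighted more, but the weight grows slowly with the order.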

Page 8: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Assigning pages to categories

Inputs:
• Visited pages (URL addresses stored in a database)
• Prespecified taxonomy (tuples ProductID – category, tuples URL pattern – category)

Processing: SQL Server SP Segment

Output: pages classified on several levels of granularity
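The original Segment procedure is a SQL Server stored procedure. The sketch below only illustrates the idea of matching a URL against "URL pattern – category" tuples in Python. The patterns, the category name "Information" and the first-match-wins rule are assumptions made for the example, not details taken from the slides.

```python
import re
from typing import NamedTuple, Optional


class Category(NamedTuple):
    cat: str               # general category
    ecat: str              # extended category
    topic: Optional[str]   # topic, if the page has one


# Hypothetical "URL pattern - category" tuples; the first match wins.
TAXONOMY = [
    (re.compile(r"/hiking-alps/"), Category("Search", "Catalogue", "Alps")),
    (re.compile(r"/search\?"), Category("Search", "Fulltext", None)),
    (re.compile(r"/discounts"), Category("Information", "Discounts", None)),
]


def segment(url: str) -> Optional[Category]:
    """Return the category of the first pattern found in the URL, or None."""
    for pattern, category in TAXONOMY:
        if pattern.search(url):
            return category
    return None


result = segment("www.poznani.cz/hiking-alps/")
```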

Page 9: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Segment – Example output

Page: www.poznani.cz/hiking-alps/
General category (Cat): Search
Extended category (ECat): Catalogue
Topic: Alps

Page 10: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Merge procedure

This procedure creates the visitor profile:

• Basic attributes (6): Total time on web, Number of displayed pages, Day of week, Hour of day, Referring domain (constituted by URL and Cat attributes).

• Important points on the path (12): Entry page, Exit page, Conversion page (each described by Page name, Cat, ECat and S).

• Attributes conceptualizing the path (11): Range of interest, Most favourite topic (Topic, S), Search total (S) and itemized (Fulltext (S), Extended search (S), Catalogue search (S)), General information pages total (S) and itemized (Discounts (S), Insurance (S), About (S)).
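A rough sketch of how a clickstream of any length can be collapsed into a fixed set of attributes. It covers only a handful of the attributes listed above. The field names, the sample data, and the reading of "Range of interest" as the number of distinct topics visited are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, TypedDict


class Click(TypedDict):
    page: str
    cat: str
    topic: Optional[str]
    seconds: float
    score: float


def merge(clicks: List[Click]) -> Dict[str, object]:
    """Collapse a clickstream of any length into a fixed set of attributes."""
    if not clicks:
        raise ValueError("empty clickstream")
    # Sum the scores per topic to find the visitor's favourite topic.
    topic_scores: Dict[str, float] = defaultdict(float)
    for c in clicks:
        if c["topic"] is not None:
            topic_scores[c["topic"]] += c["score"]
    favourite = max(topic_scores, key=topic_scores.get) if topic_scores else None
    return {
        "total_time": sum(c["seconds"] for c in clicks),
        "page_count": len(clicks),
        "entry_page": clicks[0]["page"],
        "exit_page": clicks[-1]["page"],
        # Assumption: range of interest = number of distinct topics visited.
        "range_of_interest": len(topic_scores),
        "favourite_topic": favourite,
        "favourite_topic_score": topic_scores[favourite] if favourite is not None else 0.0,
        "search_total_score": sum(c["score"] for c in clicks if c["cat"] == "Search"),
    }


# Hypothetical three-page visit
stream: List[Click] = [
    {"page": "/", "cat": "Home", "topic": None, "seconds": 10.0, "score": 10.0},
    {"page": "/hiking-alps/", "cat": "Search", "topic": "Alps", "seconds": 40.0, "score": 67.7},
    {"page": "/discounts", "cat": "Information", "topic": None, "seconds": 5.0, "score": 10.5},
]
profile = merge(stream)
```

Every visit yields a dictionary with the same keys regardless of its length, which is what makes the profiles usable as rows of a data matrix.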

Page 11: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Merge – example output

Page 12: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system
Page 13: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

III. Data mining

• Association Rules are the most frequently used approach [Facci, Lanza]

• LISp-Miner system - 4ft-Miner, SD4ft-Miner

• Categories created in LMDataSource

Page 14: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Sample tasks

• Task 1:
– From which referring class of websites do most converted visitors come?
• Task 2:
– What are the visitors’ interests in relation to the referring server?
• Task 3:
– Relation between the provision of information on discounts and insurance, the entrance page, and conversion

Page 15: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Choosing the right quantifier

(a, b, c, d are the frequencies of the four-fold table: a = Ant and Suc, b = Ant and not Suc, c = not Ant and Suc, d = neither.)

• Founded implication
– Support: a, a/(a+b+c+d)
– Confidence: a/(a+b)
– Problem: tight dependencies are rarely found and rarely required in clickstream data
• Above average quantifier
– “Among objects satisfying Ant there are at least 100*p per cent more objects satisfying Suc than there are objects satisfying Suc in the whole data matrix.” (LISp-Miner Help)
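Both quantifiers can be written down as a small sketch over the four-fold table, where a = Ant and Suc, b = Ant and not Suc, c = not Ant and Suc, d = neither. This is not LISp-Miner code. The class and function names are hypothetical, the `base` threshold on a is an assumed extra parameter, and the counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourFoldTable:
    a: int  # Ant and Suc
    b: int  # Ant and not Suc
    c: int  # not Ant and Suc
    d: int  # neither Ant nor Suc

    @property
    def n(self) -> int:
        return self.a + self.b + self.c + self.d


def support(t: FourFoldTable) -> float:
    return t.a / t.n


def confidence(t: FourFoldTable) -> float:
    # Assumes at least one object satisfies Ant (a + b > 0).
    return t.a / (t.a + t.b)


def founded_implication(t: FourFoldTable, p: float, base: int) -> bool:
    """Confidence of at least p, with at least `base` objects in cell a."""
    return confidence(t) >= p and t.a >= base


def above_average(t: FourFoldTable, p: float) -> bool:
    """Suc is at least 100*p per cent more frequent among Ant than overall."""
    return confidence(t) >= (1.0 + p) * (t.a + t.c) / t.n


# Hypothetical counts: confidence 0.3 against an overall Suc rate of 0.1
table = FourFoldTable(a=30, b=70, c=70, d=830)
```

With these counts a founded implication with p = 0.9 fails, while the above average quantifier with p = 1.0 holds. This is the situation described above: the dependency is loose but still well above the base rate.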

Page 16: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

SD4ft-Miner

• Mines for patterns of the form φ ≈ ψ / (α, β, γ)

• This SD4ft-pattern means that the subsets given by Boolean attributes α, β differ in what concerns the relation of Boolean attributes φ, ψ when condition γ is satisfied.

• Which groups of customers α, β (e.g. depending on where they come from) differ remarkably, and under what condition γ, in the probability of conversion?

• We express “the conversion condition” by setting only the succedent (ψ) and we leave the antecedent unset.
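The comparison behind such a pattern can be sketched as follows: with the antecedent left unset, the confidence within each subset is simply the share of its visitors that satisfy the succedent. This is an illustration, not the SD4ft-Miner algorithm. The record layout and all names are hypothetical.

```python
from typing import Callable, List, Mapping

Row = Mapping[str, object]
Predicate = Callable[[Row], bool]


def rate(rows: List[Row], subset: Predicate, succedent: Predicate,
         condition: Predicate = lambda r: True) -> float:
    """Share of rows in the subset (under the condition) satisfying the succedent."""
    selected = [r for r in rows if subset(r) and condition(r)]
    if not selected:
        return 0.0
    return sum(1 for r in selected if succedent(r)) / len(selected)


def confidence_difference(rows: List[Row], first: Predicate, second: Predicate,
                          succedent: Predicate,
                          condition: Predicate = lambda r: True) -> float:
    """Difference of confidences between the two subsets."""
    return (rate(rows, first, succedent, condition)
            - rate(rows, second, succedent, condition))


# Hypothetical visitor profiles
visits = [
    {"referrer": "partner", "converted": True},
    {"referrer": "partner", "converted": False},
    {"referrer": "own", "converted": False},
    {"referrer": "own", "converted": False},
]
diff = confidence_difference(
    visits,
    first=lambda r: r["referrer"] == "partner",
    second=lambda r: r["referrer"] == "own",
    succedent=lambda r: bool(r["converted"]),
)
```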

Page 17: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system
Page 18: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

4ft-Miner vs SD4ft-Miner

4ft-Miner, Above Average Quant.

SD4ft-Miner (neg. gace type for 2nd subset)

The value of the increase in the conversion rate is more suitable for our purposes, as the 2nd set is disjoint from the 1st set. The conversion rate for partner webs is 78 % higher than the average for other referrers.

Conf1/Conf2 = 0.132/0.074 = 1.784
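The arithmetic behind the 78 % figure, written out as a short check. The variable names are illustrative; the two confidence values are the ones quoted above.

```python
conf_partner = 0.132  # conversion rate for partner webs (1st subset)
conf_other = 0.074    # conversion rate for the other referrers (2nd subset)

ratio = conf_partner / conf_other
relative_increase_pct = (ratio - 1.0) * 100.0

print(round(ratio, 3), round(relative_increase_pct))
```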

Page 19: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Solution to Task 1

From which referring class of websites do most converted visitors come?

[Bar chart: Conversion rate (axis from 0 to 0.15) by referrer class: Fulltexts, Catalogues, No referrer, Other, Partner webs, Own webs]

Page 20: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

SD4ft – cont.

• If the output is sorted by the difference of confidence values, the first rule says: the conversion rate for visitors coming from partner websites is 13.2%, while the conversion rate for visitors coming from the company’s own websites is only 4.9%.

Page 21: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Review

• The goal of the second run of the CRISP-DM cycle is to
– improve currently used tools
– increase the quality of current attributes
– add new attributes by involving page texts
– wrap feasible solutions into Ferda modules

Page 22: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

I. Data collection

• Track visitors across visits
– Permanent cookies
• Track real actions, not only page views
– Add parameters
• Stronger normalization
– The database can easily become full under the current implementation

Page 23: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

II. Data preprocessing

• Provide a tool for taxonomy design and matching
– Match pages to taxonomies semi-manually
  • Based on patterns in the URL
  • Based on words in documents
– Automatically cluster pages using information retrieval methods
  • Functionally – repeating content in sidebars, etc.
  • Semantically – use headings, title, em, strong, desc.
– Assumption: commercial content is written for search engines.
– Use WordNet to assign hypernyms to keywords
– Negative use of WordNet could help distinguish product names
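One way the "semantic" use of headings, title, em and strong could look is sketched below with Python's standard `html.parser`. The weights are arbitrary assumptions, the class name is hypothetical, and the meta description ("desc.") is not handled.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; text outside these tags counts with weight 1.0.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "strong": 2.0, "em": 2.0}


class WeightedKeywords(HTMLParser):
    """Collect words from a page, weighted by the tags that enclose them."""

    def __init__(self) -> None:
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag of this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Use the highest weight among the currently open weighted tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for raw in data.lower().split():
            word = raw.strip(".,;:!?")
            if word:
                self.weights[word] += weight


parser = WeightedKeywords()
parser.feed(
    "<html><head><title>Hiking Alps</title></head>"
    "<body><p>Cheap <strong>hiking</strong> tours</p></body></html>"
)
```

The resulting word weights could then serve as input for matching pages to a taxonomy or for clustering.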

Page 24: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

This Boring Headline is Written for Google

• New York Times: “About a year ago, The Sacramento Bee changed online section titles. "Real Estate" became "Homes," "Scene" turned into "Lifestyle," and dining information found in newsprint under "Taste," is online under "Taste/Food."”

Page 25: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Preprocessing cont.

[Decision diagram]

Questions (each answered Yes / No):
• Are the keywords used to find the document on a search engine contained in the document?
• Are there more relevant pages for this keyword?
• Does this keyword occur on some other page of the web?

Outcomes:
• Possible Google bomb / negative reputation
• Possible mistake in SEO
• All is the way it should be

Page 26: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

III. Data mining

– Example DM task 1: Which “classes” of words are most frequently used?

– Example DM task 2: Which two groups of people (e.g. those googling for Africa vs. mountain biking) differ remarkably, and under what condition (e.g. whether they bought something), in the relation between the number of visited pages and the number of visited topics?

Page 27: Clickstream analysis - data collection, preprocessing and mining using LISp-Miner system

Conclusion

• To do:
– Utilize (Euro)WordNet
– Assign different weights based on HTML tags
– Test feasibility of query/document co-occurrences (sample DM tasks)
• If it works:
– Include / write a spider
– Write a taxonomy editor/miner
– Wrap it all as Ferda modules