web mining presentation final

PRESENTATION ON WEB MINING (CONTENT + STRUCTURE + USAGE) Presented By:- Mr. Jagrat Gupta M.Tech. 1 st Year CSE Branch

Upload: er-jagrat-gupta

Post on 11-Jan-2017

TRANSCRIPT

Page 1: Web Mining Presentation Final

PRESENTATION ON

WEB MINING (CONTENT + STRUCTURE + USAGE)

Presented By: Mr. Jagrat Gupta, M.Tech. 1st Year, CSE Branch

Page 2: Web Mining Presentation Final

WEB MINING
• Extraction of knowledge from web data.
• Web data includes:
  - web documents
  - hyperlinks between documents
  - usage logs of web sites, etc.
• A panel organized at ICTAI 1997 (Srivastava and Mobasher 1997) asked the question: “Is there anything distinct about web mining (compared to data mining in general)?”

Page 3: Web Mining Presentation Final

WEB MINING: APPROACHES
• First was a “process-centric view”, which defined web mining as a sequence of tasks (Etzioni 1996):
  - Resource finding.
  - Information selection and preprocessing.
  - Generalization.
  - Analysis.
• Kosala and Blockeel divided the web mining process into the following five subtasks (the list continues on the next slide):
  - Resource finding and retrieving.
  - Information selection and preprocessing.
  - Patterns analysis and recognition.

Page 4: Web Mining Presentation Final

WEB MINING: APPROACHES
  - Validation and interpretation.
  - Visualization.
• Second was a “data-centric view”, which defined web mining in terms of the types of web data used in the mining process (Cooley, Srivastava, and Mobasher 1997). This second definition has become the more widely accepted one.
• In this presentation we follow the data-centric view of web mining, which is defined as follows:

“Web mining is the application of data mining techniques to extract knowledge from web data, i.e. Web content, Web structure, and Web usage data.”

Page 5: Web Mining Presentation Final

WEB MINING TAXONOMY

Page 6: Web Mining Presentation Final

WEB CONTENT MINING
• Mining, extraction, and integration of useful data, information, and knowledge from Web page content.
• Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables.
• Search engines generally do not provide structural information, nor do they categorize, filter, or interpret documents.
• In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents.
• Research is ongoing in information retrieval methods, natural language processing, and computer vision.

Page 7: Web Mining Presentation Final

WEB CONTENT MINING- PROBLEMS

• Data/information extraction: Our focus will be on the extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques are used: machine learning and automatic extraction.

• Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. Identifying or matching semantically similar data is a very important problem with many practical applications.

• Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs, and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking.

Page 8: Web Mining Presentation Final

WEB CONTENT MINING - PROBLEMS
• Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming.

• Segmenting Web pages and detecting noise: In many Web applications, one wants only the main content of a Web page, without advertisements, navigation links, or copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem.
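
As a rough illustration of the noise-detection problem, the sketch below keeps the text of a page while skipping a few tags that usually hold navigation, scripts, or footers. It is only a minimal, assumption-laden example: the tag list `NOISE_TAGS`, the class `MainTextExtractor`, and the sample page are all hypothetical, and real segmentation methods are considerably more elaborate.

```python
from html.parser import HTMLParser

# Assumed set of tags that typically wrap non-content regions.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}

class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any of the NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # how many noise elements we are currently inside
        self.chunks = []       # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)

def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

# Hypothetical sample page.
page = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<script>var x = 1;</script>
<h1>Web Mining</h1>
<p>Main story text.</p>
<footer>Copyright notice</footer>
</body></html>
"""
main_text = extract_main_text(page)
print(main_text)
```

A tag blacklist like this only works for pages that mark up their layout semantically; it is meant to show the shape of the task, not a general solution.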

Page 9: Web Mining Presentation Final

WEB CONTENT MINING - APPROACHES

Page 10: Web Mining Presentation Final

WEB CONTENT MINING - APPROACHES
• The database approaches to Web mining have generally focused on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured, higher-level collections of resources, such as relational databases, and on using standard database querying mechanisms and data mining techniques to access and analyze this information.

• Multilevel databases: The main idea behind these proposals is that the lowest level of the database contains primitive semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from the lower levels and organized in structured collections such as relational or object-oriented databases. For example, the ARANEUS system extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views.

Page 11: Web Mining Presentation Final

WEB CONTENT MINING - APPROACHES
• Web query systems: Many Web-based query systems and languages have been developed recently that attempt to utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing to accommodate the types of queries used in World Wide Web searches. Examples: W3QL, WebLog, Lorel, UnQL, and TSIMMIS.

• The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.

• Intelligent search agents: Several intelligent Web agents have been developed that search for relevant information using characteristics of a particular domain (and possibly a user profile) to organize and interpret the discovered information.

Page 12: Web Mining Presentation Final

WEB CONTENT MINING - APPROACHES
• Examples of intelligent search agents: Harvest, FAQ-Finder, Information Manifold, OCCAM, ParaSite, ShopBot, and ILA.

• Information filtering/categorization: A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. Examples: HyPursuit and BO (Bookmark Organizer).

Page 17: Web Mining Presentation Final

PREDECESSORS AND SUCCESSORS OF A WEB PAGE

[Figure: a web page together with its predecessors (pages that link to it) and its successors (pages it links to).]

Page 18: Web Mining Presentation Final

PAGE RANK
• Simple solution: create a stochastic matrix of the Web.
• Each page i corresponds to row i and column i of the matrix.
• If page j has n successors (links), then the entry in row i, column j of the matrix is:

\[
M_{ij} =
\begin{cases}
1/n & \text{if page } i \text{ is one of the } n \text{ successors of page } j, \\
0 & \text{otherwise.}
\end{cases}
\]
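
The construction above can be sketched in a few lines. This is a minimal illustration, assuming the Web is given as a dictionary that maps each page to the list of its successors; the names `links` and `build_matrix` are hypothetical, and the three-page graph is the one encoded by the matrix on the example slide that follows.

```python
from fractions import Fraction

def build_matrix(links):
    """Return (pages, matrix) where matrix[i][j] = 1/n if page i is one of
    the n successors of page j, and 0 otherwise."""
    pages = sorted(links)
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    matrix = [[Fraction(0)] * size for _ in range(size)]
    for j, page in enumerate(pages):
        successors = links[page]
        for successor in successors:
            # Column j spreads the importance of page j evenly over its links.
            matrix[index[successor]][j] = Fraction(1, len(successors))
    return pages, matrix

# Assumed input format: page -> list of pages it links to.
links = {"A": ["A", "B"], "B": ["A", "C"], "C": ["B"]}
pages, M = build_matrix(links)

for page, row in zip(pages, M):
    print(page, [str(x) for x in row])
```

Each column sums to 1 as long as every page has at least one successor; a page without outgoing links would produce an all-zero column, which this sketch does not handle.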

Page 19: Web Mining Presentation Final

PAGE RANK – EXAMPLE

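Assume that the Web consists of only three pages: A, B, and C. The links among these pages are shown below.

[Figure: link graph of the three pages. As the matrix below encodes, A links to A and B, B links to A and C, and C links to B.]

Let [a, b, c] be the vector of importances for these three pages. The stochastic matrix (column = source page, row = destination page) is:

        A      B      C
  A    1/2    1/2     0
  B    1/2     0      1
  C     0     1/2     0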

Page 20: Web Mining Presentation Final

PAGE RANK – EXAMPLE (CONT.)

The equation describing the asymptotic values of these three variables is:

\[
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
=
\begin{bmatrix}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
\]

We can solve equations like this one by starting with the assumption a = b = c = 1 and repeatedly applying the matrix to the current estimate of these values. The first four iterations give the following estimates:

        start    1      2      3      4     ...   limit
  a =     1      1     5/4    9/8    5/4    ...    6/5
  b =     1     3/2     1    11/8   17/16   ...    6/5
  c =     1     1/2    3/4    1/2   11/16   ...    3/5

Page 21: Web Mining Presentation Final

PAGE RANK – EXAMPLE (CONT.)

In the limit, the solution is a = b = 6/5, c = 3/5. That is, a and b have the same importance, and each is twice as important as c.
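
The iteration in this example can be reproduced with a short script. This is a minimal sketch of plain power iteration on the three-page matrix from the example, using exact fractions for the first four steps; the helper name `step` is illustrative and not part of the presentation.

```python
from fractions import Fraction as F

# Matrix from the example: column = source page, row = destination page.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]

def step(matrix, vector):
    """Apply the matrix once to the current importance estimate."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

# Start from a = b = c = 1 and record the first four iterations exactly.
estimate = [F(1), F(1), F(1)]
history = [estimate]
for _ in range(4):
    estimate = step(M, estimate)
    history.append(estimate)

for name, values in zip("abc", zip(*history)):
    print(name, "=", "  ".join(str(v) for v in values))

# Continue in floating point to approach the limit 6/5, 6/5, 3/5.
approx = [1.0, 1.0, 1.0]
for _ in range(100):
    approx = step(M, approx)
print([round(x, 6) for x in approx])
```

Because every column of the matrix sums to 1, the total importance a + b + c stays equal to 3 at every step, which is why the limit is 6/5, 6/5, 3/5 rather than some other multiple of the same vector.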

Page 26: Web Mining Presentation Final

HITS
• Define a matrix A whose rows and columns correspond to Web pages, with entry A_ij = 1 if page i links to page j, and 0 if not.

• Let a and h be vectors whose ith components correspond to the degrees of authority and hubbiness of the ith page. Then:

  - h = A × a. That is, the hubbiness of each page is the sum of the authorities of all the pages it links to.

  - a = A^T × h. That is, the authority of each page is the sum of the hubbiness of all the pages that link to it (A^T is the transposed matrix).

• Substituting one equation into the other gives:
  a = A^T × A × a
  h = A × A^T × h

Page 27: Web Mining Presentation Final

HUB AND AUTHORITIES - EXAMPLE

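Consider the Web presented below.

[Figure: link graph of three pages. As the matrix A encodes, A links to A, B, and C; B links to C; C links to A and B.]

\[
A = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},
\quad
A^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\]

\[
AA^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix},
\quad
A^TA = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}
\]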

Page 28: Web Mining Presentation Final

HUB AND AUTHORITIES - EXAMPLE

If we assume that the vectors h = [h_a, h_b, h_c] and a = [a_a, a_b, a_c] are each initially [1, 1, 1], the first three iterations of the equations for a and h are the following:

          start    1     2     3
  a_a =     1      5    24   114
  a_b =     1      5    24   114
  a_c =     1      4    18    84
  h_a =     1      6    28   132
  h_b =     1      2     8    36
  h_c =     1      4    20    96
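
The numbers in this example can be checked with a small script. This is a minimal sketch, assuming the link matrix A given in the example; the helpers `transpose` and `matvec` are illustrative names, and the values are left unscaled to match the slide.

```python
# Link matrix from the example: A[i][j] = 1 if page i links to page j.
A = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

At = transpose(A)

authority = [1, 1, 1]
hub = [1, 1, 1]
authority_history = [authority]
hub_history = [hub]
for _ in range(3):
    authority = matvec(At, matvec(A, authority))   # a <- A^T A a
    hub = matvec(A, matvec(At, hub))               # h <- A A^T h
    authority_history.append(authority)
    hub_history.append(hub)

print("authority:", authority_history)
print("hub:      ", hub_history)

# The raw values grow without bound, so scale by the largest component
# when only the relative ranking is of interest.
scaled_authority = [x / max(authority) for x in authority]
scaled_hub = [x / max(hub) for x in hub]
```

Only the ratios between components matter for ranking, which is why implementations usually rescale the vectors after each iteration instead of letting them grow as they do here.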

Page 30: Web Mining Presentation Final

DATA SOURCES

• Server-level collection: the server stores data about the requests made by clients, so the data generally concerns just one site.

• Client-level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript code or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities.

• Proxy-level collection: information is stored at the proxy, so the data covers several Web sites, but only for users whose Web clients pass through the proxy.

Page 32: Web Mining Presentation Final

WEB SERVER LOG

Page 34: Web Mining Presentation Final

THREE PHASES

Page 35: Web Mining Presentation Final

PREPROCESSING

• Converts raw usage data into the data abstractions needed for pattern discovery.

• The most difficult task in Web usage mining, due to the incompleteness of the available data.

Page 37: Web Mining Presentation Final

DATA CLEANING
• Irrelevant records in the web access log are eliminated during data cleaning.
• Since the goal of Web usage mining is to discover users' travel patterns, the following two kinds of records are unnecessary and should be removed:

  - Records of graphics, videos, and format information. These records have filename suffixes such as GIF, JPEG, or CSS, which can be found in the URI field of each record.

  - Records with a failed HTTP status code.
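
The two cleaning rules above might be applied as follows. This is a minimal sketch under stated assumptions: log records are taken to be already parsed into dictionaries with `uri` and `status` keys (a hypothetical layout), the suffix list is extended beyond GIF, JPEG, and CSS with a few common extras, and "failed" is interpreted as any status outside the 2xx and 3xx ranges.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed suffix list: GIF, JPEG, CSS come from the slide, the rest are extras.
IGNORED_SUFFIXES = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js"}

def is_relevant(record):
    """Keep a record only if it is a successful request for a page."""
    path = urlsplit(record["uri"]).path            # drop any query string
    suffix = posixpath.splitext(path)[1].lower()
    succeeded = 200 <= record["status"] < 400      # assumption: 4xx/5xx failed
    return succeeded and suffix not in IGNORED_SUFFIXES

def clean(records):
    return [record for record in records if is_relevant(record)]

# Hypothetical parsed log records.
log = [
    {"uri": "/index.html", "status": 200},
    {"uri": "/img/logo.gif", "status": 200},
    {"uri": "/style/site.css", "status": 200},
    {"uri": "/missing.html", "status": 404},
    {"uri": "/products/", "status": 200},
    {"uri": "/photo.JPEG?size=large", "status": 200},
]
cleaned = clean(log)
print([record["uri"] for record in cleaned])
```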

Page 38: Web Mining Presentation Final

USER & SESSION IDENTIFICATION
• The task of user and session identification is to find the different user sessions in the original web access log.

• User identification determines who accessed the web site and which pages were accessed.

• Session identification divides each user's page accesses into individual sessions.

• This step is made difficult by proxy servers; e.g., different users may have the same IP address in the log.

• A referrer-based method is proposed to solve these problems in this study.

Page 39: Web Mining Presentation Final

USER & SESSION IDENTIFICATION
Rule 1: Different IP addresses distinguish different users.
Rule 2: If the IP addresses are the same, different browsers and operating systems indicate different users.
Rule 3: If the IP address, browser, and operating system are all the same, the referrer information should be taken into account:

“The Refer URI field is checked, and a new user session is identified if the URL in the Refer URI field hasn’t been accessed previously, or there is a large interval (usually more than 10 seconds) between the accessing time of this record and the previous one if the Refer URI field is empty.”
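
One possible reading of these rules is sketched below. It is an illustration under assumptions, not the method of the cited study: records are hypothetical dictionaries with `ip`, `agent`, `time` (in seconds), `uri`, and `referrer` keys, the browser and operating system are represented together by the `agent` string, and "accessed previously" is interpreted as "accessed earlier in the same open session". The 10-second threshold is taken from the quoted rule.

```python
def identify_sessions(records, max_gap=10):
    """Group records into sessions using IP, agent, and referrer heuristics."""
    sessions = []       # each session is a list of records
    open_session = {}   # (ip, agent) -> index of that user's current session
    for record in sorted(records, key=lambda r: r["time"]):
        user = (record["ip"], record["agent"])     # rules 1 and 2
        index = open_session.get(user)
        start_new = index is None
        if not start_new:                          # rule 3
            session = sessions[index]
            seen = {r["uri"] for r in session}
            if record["referrer"]:
                # Referrer present but never visited in this session.
                start_new = record["referrer"] not in seen
            else:
                # No referrer: fall back to the time gap.
                start_new = record["time"] - session[-1]["time"] > max_gap
        if start_new:
            sessions.append([record])
            open_session[user] = len(sessions) - 1
        else:
            sessions[index].append(record)
    return sessions

def rec(time, ip, agent, uri, referrer=""):
    return {"time": time, "ip": ip, "agent": agent, "uri": uri, "referrer": referrer}

# Hypothetical log excerpt.
log = [
    rec(0, "1.1.1.1", "Firefox/Linux", "/index.html"),
    rec(5, "1.1.1.1", "Firefox/Linux", "/products.html", "/index.html"),
    rec(6, "1.1.1.1", "Chrome/Windows", "/index.html"),
    rec(8, "2.2.2.2", "Firefox/Linux", "/index.html"),
    rec(9, "1.1.1.1", "Firefox/Linux", "/about.html", "/news.html"),
    rec(60, "1.1.1.1", "Chrome/Windows", "/index.html"),
]
sessions = identify_sessions(log)
session_uris = [[r["uri"] for r in s] for s in sessions]
print(session_uris)
```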

Page 40: Web Mining Presentation Final

PATH COMPLETION
• A session identified by rule 3 may contain more than one visit by the same user at different times; a time-oriented heuristic is then used to divide the different visits into separate user sessions.

• Path completion should then be used to recover the complete user access path.

• Paths can be incomplete for several reasons: the local cache, agent cache, the “post” technique, and the browser’s “back” button can all cause important accesses to go unrecorded in the access log file, so the number of Uniform Resource Locators (URLs) recorded in the log may be smaller than the number actually visited.

Page 41: Web Mining Presentation Final

PATH COMPLETION
• Local caching and proxy servers also make path completion difficult, because users can access pages held in the local cache or the proxy server’s cache without leaving any record in the server’s access log.

• As a result, user access paths are incompletely preserved in the web access log. To discover a user’s travel pattern, the missing pages in the access path should be appended. This is the purpose of path completion.
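
A common way to fill in such gaps is to use the referrer of each request: if it is not the page just visited but appears earlier in the session, the user probably went back through cached pages. The sketch below illustrates that heuristic only; it is an assumption rather than the procedure of the presentation, and the function name `complete_path` and the sample session are hypothetical.

```python
def complete_path(requests):
    """requests: list of (url, referrer) pairs of one session, in time order.
    Returns the path with presumed cached 'back' steps re-inserted."""
    path = []
    for url, referrer in requests:
        if path and referrer and referrer != path[-1] and referrer in path:
            # Position of the most recent visit to the referrer.
            last = len(path) - 1 - path[::-1].index(referrer)
            # Pages walked back through, ending at the referrer itself.
            backtrack = path[last:len(path) - 1][::-1]
            path.extend(backtrack)
        path.append(url)
    return path

# Hypothetical session: the last request was reached from /index.html,
# although the previously logged page was /item1.html.
session = [
    ("/index.html", ""),
    ("/products.html", "/index.html"),
    ("/item1.html", "/products.html"),
    ("/contact.html", "/index.html"),
]
completed = complete_path(session)
print(completed)
```

If the referrer never appears in the session, this sketch leaves the path as logged; recovering such pages would need knowledge of the site's link structure.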

Page 42: Web Mining Presentation Final

CONTENT PREPROCESSING
• Converts text, images, scripts, and other files such as multimedia into forms that are useful for the Web usage mining process.
• This consists of performing content mining such as classification or clustering (also found in pattern discovery).

Page 43: Web Mining Presentation Final

PATTERN DISCOVERY

Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition.

Page 44: Web Mining Presentation Final

PATTERN DISCOVERY

• Statistics
• Association Rules
• Clustering
• Classification
• Sequential Patterns
• Path Analysis
• etc.

Page 45: Web Mining Presentation Final

PATTERN DISCOVERY (CONT.)
Statistics:

• The most common method.
• This kind of analysis is performed by many tools; its aim is to describe the traffic on a Web site, e.g., most visited pages, average daily hits, etc.

• Useful for improving system performance, enhancing the security of the system, facilitating site modification, etc.
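
Two of the statistics mentioned above, most visited pages and average daily hits, can be computed with simple counting. The sketch below is a minimal illustration on a hypothetical list of `(day, uri)` pairs taken from an already cleaned log.

```python
from collections import Counter

# Hypothetical cleaned log: one (day, requested URI) pair per hit.
hits = [
    ("2017-01-09", "/index.html"),
    ("2017-01-09", "/products/"),
    ("2017-01-09", "/index.html"),
    ("2017-01-10", "/index.html"),
    ("2017-01-10", "/contact.html"),
    ("2017-01-10", "/products/"),
]

page_counts = Counter(uri for _, uri in hits)     # hits per page
daily_counts = Counter(day for day, _ in hits)    # hits per day

most_visited = page_counts.most_common(2)
average_daily_hits = sum(daily_counts.values()) / len(daily_counts)

print("most visited:", most_visited)
print("average daily hits:", average_daily_hits)
```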

Page 46: Web Mining Presentation Final

PATTERN DISCOVERY (CONT.)
Association rules:

• The main idea is to treat every URL requested by a user during a visit as an item of basket data, and to discover relationships between items that meet a minimum support level.

• Discovers correlations among references to various pages of a web site within a single server session.

• Useful for restructuring a web site and as a heuristic for pre-fetching documents to reduce latency.

Page 47: Web Mining Presentation Final

ASSOCIATION RULES (CONT.)
• Discovers affinities among sets of items across transactions.

• Rules have the form X ⇒ Y, where X and Y are sets of items; each rule is qualified by its confidence and support.

Examples:

• 60% of clients who accessed /products/ also accessed /products/software/webminer.htm.

• 30% of clients who accessed /special-offer.html placed an online order in /products/software/.
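
Support and confidence of a rule X ⇒ Y can be computed directly from their definitions. The sketch below is a minimal illustration on hypothetical sessions, each represented as a set of requested URLs; the data is constructed so that the first example rule has 60% confidence, and it does not implement a full rule-mining algorithm such as Apriori.

```python
from fractions import Fraction

def support(sessions, items):
    """Fraction of sessions that contain every item in `items`."""
    items = set(items)
    matching = sum(1 for session in sessions if items <= session)
    return Fraction(matching, len(sessions))

def confidence(sessions, x, y):
    """Among sessions containing X, the fraction that also contain Y.
    Raises ZeroDivisionError if no session contains X."""
    return support(sessions, set(x) | set(y)) / support(sessions, x)

# Hypothetical sessions (sets of URLs requested in one visit).
sessions = [
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/products/software/webminer.htm", "/special-offer.html"},
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/special-offer.html"},
    {"/products/"},
    {"/contact.html"},
]

X = {"/products/"}
Y = {"/products/software/webminer.htm"}
rule_support = support(sessions, X | Y)
rule_confidence = confidence(sessions, X, Y)
print("support:", rule_support, "confidence:", rule_confidence)
```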

Page 48: Web Mining Presentation Final

PATTERN DISCOVERY (CONT.)
Clustering:

• Meaningful clusters of URLs can be created by discovering similar characteristics between them according to users' behaviour.

• Usage clusters: useful for market segmentation in e-commerce or for providing personalized Web content to users.

• Page clusters: useful for Internet search engines and web assistance providers.

Page 49: Web Mining Presentation Final

PATTERN DISCOVERY (CONT.)
Classification:

• Develops a profile of users belonging to a particular class or category.

• Requires extraction and selection of the features that best describe the properties of a given class or category.

Page 50: Web Mining Presentation Final

PATTERN DISCOVERY (CONT.)
Clustering and Classification (examples):

• Clients who often access /products/software/webminer.html tend to be from educational institutions.

• Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.

• 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.

Page 51: Web Mining Presentation Final

Sequential Patterns:

• 30% of clients who visited /products/software/ had done a search in Yahoo using the keyword “software” before their visit.

• 60% of clients who placed an online order for WEBMINER placed another online order for software within 15 days.
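
Statements like these can be checked by counting ordered occurrences in each client's request sequence. The sketch below is a minimal illustration on hypothetical sequences; the function names are illustrative, and time windows such as "within 15 days" are deliberately ignored.

```python
def followed_by(sequence, first, then):
    """True if `then` occurs somewhere after the first occurrence of `first`."""
    if first not in sequence:
        return False
    return then in sequence[sequence.index(first) + 1:]

def sequential_confidence(sequences, first, then):
    """Among sequences containing `first`, the fraction in which `then`
    occurs later. Returns 0.0 if no sequence contains `first`."""
    with_first = [s for s in sequences if first in s]
    if not with_first:
        return 0.0
    matches = sum(1 for s in with_first if followed_by(s, first, then))
    return matches / len(with_first)

# Hypothetical ordered request sequences, one per client.
sequences = [
    ["/search", "/products/software/", "/order"],
    ["/products/software/", "/search"],
    ["/search", "/index.html", "/products/software/"],
    ["/index.html", "/products/software/"],
    ["/search", "/index.html"],
]
ratio = sequential_confidence(sequences, "/search", "/products/software/")
print(ratio)
```

Unlike an association rule, the order of the two items matters here: the second sequence contains both pages but does not count, because the search came after the visit.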

Page 52: Web Mining Presentation Final

PATTERN DISCOVERY (CONT.)
Path Analysis:

Types of path/usage information:
• Most frequent paths traversed by users.
• Entry and exit points.
• Distribution of user session durations.
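
The three kinds of path information listed above can be derived from identified sessions by counting. The sketch below is a minimal illustration on hypothetical sessions, each a time-ordered list of `(seconds, uri)` pairs; "paths" are simplified to pairs of consecutive pages.

```python
from collections import Counter

# Hypothetical sessions: time-ordered (timestamp in seconds, URI) pairs.
sessions = [
    [(0, "/index.html"), (30, "/products/"), (90, "/order.html")],
    [(0, "/index.html"), (20, "/products/")],
    [(0, "/products/"), (45, "/order.html")],
]

# Entry and exit points: first and last page of each session.
entry_points = Counter(session[0][1] for session in sessions)
exit_points = Counter(session[-1][1] for session in sessions)

# Frequent paths, simplified here to consecutive page pairs.
transitions = Counter(
    (session[i][1], session[i + 1][1])
    for session in sessions
    for i in range(len(session) - 1)
)

# Session durations: time between first and last request.
durations = [session[-1][0] - session[0][0] for session in sessions]

print("entry points:", dict(entry_points))
print("exit points: ", dict(exit_points))
print("transitions: ", dict(transitions))
print("durations:   ", durations)
```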

Page 53: Web Mining Presentation Final

PATTERN ANALYSIS
• The challenge of pattern analysis is to filter out uninteresting information and to visualize and interpret the interesting patterns for the user.

• First, delete the less significant rules or models from the store of discovered models.
• Next, use technologies such as OLAP to carry out comprehensive mining and analysis.

• Then, make the discovered data or knowledge visible.
• Finally, provide the resulting tailored services to the e-commerce web site.

Page 54: Web Mining Presentation Final

WEB MINING SOFTWARE
• SPSS Clementine.
• Megaputer PolyAnalyst.
• ClickTracks by Web analytics.
• QL2 by QL2 Software Inc.

Page 55: Web Mining Presentation Final

Thanks…..