web mining presentation final

PRESENTATION ON

WEB MINING (CONTENT + STRUCTURE + USAGE)

Presented By: Mr. Jagrat Gupta

M.Tech. 1st Year

CSE Branch

WEB MINING

• Extraction of knowledge from web data.

• Web data includes:
  • web documents,
  • hyperlinks between documents,
  • usage logs of web sites, etc.

A panel organized at ICTAI 1997 (Srivastava and Mobasher 1997) asked the question “Is there anything distinct about web mining (compared to data mining in general)?”

WEB MINING: APPROACHES

First was a “process-centric view”, which defined web mining as a sequence of tasks (Etzioni 1996):

• Resource finding.
• Information selection and preprocessing.
• Generalization.
• Analysis.

Kosala and Blockeel divided the web mining process into the following five subtasks:

• Resource finding and retrieving.
• Information selection and preprocessing.
• Pattern analysis and recognition.
• Validation and interpretation.
• Visualization.

Second was a “data-centric view”, which defined web mining in terms of the types of web data that were being used in the mining process (Cooley, Srivastava, and Mobasher 1997). The second definition has become the more widely accepted one.

In this presentation we follow the data-centric view of web mining, which is defined as follows:

“Web mining is the application of data mining techniques to extract knowledge from web data, i.e. Web content, Web structure, and Web usage data.”

WEB MINING TAXONOMY

WEB CONTENT MINING

Mining, extraction and integration of useful data, information and knowledge from Web page content.

Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables.

Search Engines do not generally provide structural information nor categorize, filter, or interpret documents.

In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents.

Research is ongoing in information retrieval methods, natural language processing and computer vision.

WEB CONTENT MINING - PROBLEMS

• Data/information extraction: Our focus will be on the extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques are used: machine learning and automatic extraction.

• Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications.

• Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking.

• Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming.

• Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page, without advertisements, navigation links, or copyright notices. Automatically segmenting a Web page to extract its main content is an interesting problem.

WEB CONTENT MINING - APPROACHES

The database approaches to Web mining have generally focused on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources, such as relational databases, and on using standard database querying mechanisms and data mining techniques to access and analyze this information.

• Multilevel databases: The main idea behind these proposals is that the lowest level of the database contains primitive semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from the lower levels and organized in structured collections such as relational or object-oriented databases. The ARANEUS system extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views.

• Web query systems: Many Web-based query systems and languages have been developed recently that attempt to utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing to accommodate the types of queries that are used in World Wide Web searches. Examples: W3QL, WebLog, Lorel, UnQL, TSIMMIS.

The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.

• Intelligent search agents: Several intelligent Web agents have been developed that search for relevant information using characteristics of a particular domain (and possibly a user profile) to organize and interpret the discovered information. Examples: Harvest, FAQ-Finder, Information Manifold, OCCAM, ParaSite, ShopBot, ILA.

• Information filtering/categorization: A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. Examples: HyPursuit, BO (Bookmark Organizer).

PREDECESSORS AND SUCCESSORS OF A WEB PAGE

[Figure: a web page with its predecessors (pages that link to it) and its successors (pages it links to).]

PAGE RANK

• Simple solution: create a stochastic matrix of the Web:
  • Each page i corresponds to row i and column i of the matrix.
  • If page j has n successors (links), then the ijth cell of the matrix is equal to:
    • 1/n, if page i is one of these n successors of page j;
    • 0, otherwise.
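The construction above can be sketched in a few lines of Python. This is a minimal sketch, assuming the Web is given as a dictionary that maps each page to the list of its successors, that no link is listed twice, and that every linked page is itself a key. The function name `stochastic_matrix` and the three-page link graph are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def stochastic_matrix(successors):
    """Return (pages, M) with M[i][j] = 1/n if page i is one of the
    n successors of page j, and 0 otherwise."""
    pages = sorted(successors)
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0)] * size for _ in range(size)]
    for j_page, links in successors.items():
        n = len(links)
        for i_page in links:
            # Column j holds the share of page j's importance passed to page i.
            M[index[i_page]][index[j_page]] = Fraction(1, n)
    return pages, M

# Hypothetical link graph: A links to A and B, B links to A and C, C links to B.
links = {"A": ["A", "B"], "B": ["A", "C"], "C": ["B"]}
pages, M = stochastic_matrix(links)
for page, row in zip(pages, M):
    print(page, [str(x) for x in row])
```

Each column of the resulting matrix sums to 1, which is what makes it stochastic.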

PAGE RANK – EXAMPLE

Assume that the Web consists of only three pages: A, B, and C. The links among these pages are shown below.

[Figure: link graph over pages A, B, and C.]

Let [a, b, c] be the vector of importances for these three pages. The stochastic matrix of this Web is:

       A      B      C
A     1/2    1/2     0
B     1/2     0      1
C      0     1/2     0

PAGE RANK – EXAMPLE (CONT.)

The equation describing the asymptotic values of these three variables is:

\[
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
=
\begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
\]

We can solve equations like this one by starting with the assumption a = b = c = 1 and applying the matrix to the current estimate of these values repeatedly. The first four iterations give the following estimates:

a = 1    1      5/4    9/8     5/4     …   6/5
b = 1    3/2    1      11/8    17/16   …   6/5
c = 1    1/2    3/4    1/2     11/16   …   3/5

PAGE RANK – EXAMPLE (CONT.)

In the limit, the solution is a = b = 6/5, c = 3/5. That is, a and b have the same importance, and each is twice as important as c.
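The iteration in this example can be reproduced with exact fractions. The following is a minimal sketch, not a production PageRank implementation; the helper name `apply_matrix` is an assumption introduced for illustration, and the matrix is the one from the three-page example.

```python
from fractions import Fraction as F

# Stochastic matrix of the three-page example (rows and columns ordered A, B, C).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]

def apply_matrix(matrix, vector):
    """One iteration: multiply the matrix by the current estimate."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

estimate = [F(1), F(1), F(1)]  # starting assumption a = b = c = 1
history = [estimate]
for _ in range(4):
    estimate = apply_matrix(M, estimate)
    history.append(estimate)

for step, (a, b, c) in enumerate(history):
    print(step, a, b, c)

# The limit quoted in the example is a fixed point of the matrix.
limit = [F(6, 5), F(6, 5), F(3, 5)]
assert apply_matrix(M, limit) == limit
```

Using `Fraction` instead of floating point keeps the printed values directly comparable with the fractions in the example.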

HITS

Define a matrix A whose rows and columns correspond to Web pages, with entry Aij = 1 if page i links to page j, and 0 if not.

Let a and h be vectors whose ith components correspond to the degrees of authority and hubbiness of the ith page. Then:

• h = A × a. That is, the hubbiness of each page is the sum of the authorities of all the pages it links to.

• a = A^T × h. That is, the authority of each page is the sum of the hubbiness of all the pages that link to it (A^T is the transposed matrix).

Then, a = A^T × A × a and h = A × A^T × h.

HUB AND AUTHORITIES - EXAMPLE

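Consider the Web presented below.

[Figure: link graph over pages A, B, and C.]

\[
A = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},
\qquad
A^{T} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\]

\[
AA^{T} = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix},
\qquad
A^{T}A = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}
\]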

HUB AND AUTHORITIES - EXAMPLE

If we assume that the vectors h = [ha, hb, hc] and a = [aa, ab, ac] are each initially [1, 1, 1], the first three iterations of the equations for a and h are the following:

aa = 1    5    24    114
ab = 1    5    24    114
ac = 1    4    18     84

ha = 1    6    28    132
hb = 1    2     8     36
hc = 1    4    20     96
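The hub and authority iterations of this example can be checked with a short script. This is a minimal sketch using plain Python lists; the helper names `transpose`, `mat_vec` and `hits` are assumptions introduced for illustration. The vectors are deliberately left unnormalized so that the numbers match the example, whereas practical implementations usually rescale them after each step.

```python
# Link matrix of the example: A[i][j] = 1 if page i links to page j.
A = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def hits(link_matrix, iterations):
    """Return the histories of the authority and hub vectors."""
    n = len(link_matrix)
    a = [1] * n
    h = [1] * n
    At = transpose(link_matrix)
    a_history, h_history = [a], [h]
    for _ in range(iterations):
        a = mat_vec(At, mat_vec(link_matrix, a))  # a = A^T x A x a
        h = mat_vec(link_matrix, mat_vec(At, h))  # h = A x A^T x h
        a_history.append(a)
        h_history.append(h)
    return a_history, h_history

a_history, h_history = hits(A, 3)
for step, (a, h) in enumerate(zip(a_history, h_history)):
    print(step, "authority:", a, "hub:", h)
```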

DATA SOURCES

• Server level collection: the server stores data regarding the requests performed by clients, so the data generally concern just one source.

• Client level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented by using a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities.

• Proxy level collection: information is stored at the proxy side, so the Web data concern several Web sites, but only the users whose Web clients pass through the proxy.

WEB SERVER LOG

THREE PHASES

PREPROCESSING

• Converts raw usage data into the data abstractions needed for pattern discovery.

• The most difficult task in Web usage mining, due to the incompleteness of the available data.

DATA CLEANING

Irrelevant records in the web access log are eliminated during data cleaning.

Since the target of Web usage mining is to get the user's travel patterns, the following two kinds of records are unnecessary and should be removed:

• Records of graphics, videos and format information. These records have filename suffixes such as GIF, JPEG, CSS, and so on, which can be found in the URI field of every record.

• Records with a failed HTTP status code.
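The two cleaning rules can be sketched as a simple filter. This is a minimal sketch, assuming each log record has already been parsed into a dictionary with a `uri` and a `status` field. The record layout, the exact suffix list (including `.jpg`) and the choice to treat status codes of 400 and above as failed are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative suffix list; the slides name GIF, JPEG and CSS "and so on".
IRRELEVANT_SUFFIXES = {".gif", ".jpeg", ".jpg", ".css"}

def is_relevant(record):
    """Keep a record only if it is a successful request for a page."""
    path = urlparse(record["uri"]).path          # drop any query string
    suffix = posixpath.splitext(path)[1].lower()
    if suffix in IRRELEVANT_SUFFIXES:
        return False
    # Assumption: status codes of 400 and above count as failed requests.
    return record["status"] < 400

def clean(records):
    return [record for record in records if is_relevant(record)]

# Hypothetical parsed log records.
log = [
    {"uri": "/index.html", "status": 200},
    {"uri": "/images/logo.GIF", "status": 200},
    {"uri": "/style.css?v=2", "status": 200},
    {"uri": "/missing.html", "status": 404},
    {"uri": "/products/", "status": 200},
]
cleaned = clean(log)
print([record["uri"] for record in cleaned])
```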

USER & SESSION IDENTIFICATION

The task of user and session identification is to find the different user sessions in the original web access log.

• User identification determines who accesses the web site and which pages are accessed.

• Session identification divides the page accesses of each user into individual sessions.

The difficulties in this step are introduced by proxy servers; e.g. different users may have the same IP address in the log.

A referrer-based method is proposed to solve these problems in this study. Its rules are:

1. Different IP addresses distinguish different users.
2. If the IP addresses are the same, different browsers and operating systems indicate different users.
3. If the IP address, browser and operating system are all the same, the referrer information should be taken into account.

“The Refer URI field is checked, and a new user session is identified if the URL in the Refer URI field hasn’t been accessed previously, or there is a large interval (usually more than 10 seconds) between the accessing time of this record and the previous one if the Refer URI field is empty.”
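One possible reading of these rules is sketched below. It is a minimal sketch under stated assumptions, not the method of the cited study: each record is a dictionary with hypothetical keys `ip`, `agent` (browser and operating system), `time` (in seconds), `uri` and `referrer`, and a record with a known referrer is attached to the most recent session of the same user that contains that referrer.

```python
from collections import defaultdict

MAX_GAP_SECONDS = 10  # interval quoted in the slides

def identify_sessions(records, max_gap=MAX_GAP_SECONDS):
    """Group log records into sessions; each session is a list of URIs."""
    sessions_by_user = defaultdict(list)  # (ip, agent) -> list of sessions
    last_time = {}                        # (ip, agent) -> time of previous record
    for rec in sorted(records, key=lambda r: r["time"]):
        user = (rec["ip"], rec["agent"])  # IP first, then browser and OS
        sessions = sessions_by_user[user]
        target = None
        if rec["referrer"]:
            # Continue the latest session in which the referrer was accessed.
            for session in reversed(sessions):
                if rec["referrer"] in session:
                    target = session
                    break
        elif sessions and rec["time"] - last_time[user] <= max_gap:
            # Empty referrer: continue only if the interval is small.
            target = sessions[-1]
        if target is None:
            target = []                   # start a new session
            sessions.append(target)
        target.append(rec["uri"])
        last_time[user] = rec["time"]
    return [s for user_sessions in sessions_by_user.values() for s in user_sessions]

# Hypothetical records.
records = [
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "time": 0, "uri": "/a", "referrer": ""},
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "time": 5, "uri": "/b", "referrer": "/a"},
    {"ip": "1.1.1.1", "agent": "Safari/macOS", "time": 6, "uri": "/a", "referrer": ""},
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "time": 8, "uri": "/x", "referrer": "/z"},
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "time": 100, "uri": "/c", "referrer": ""},
    {"ip": "2.2.2.2", "agent": "Firefox/Linux", "time": 101, "uri": "/a", "referrer": ""},
]
sessions = identify_sessions(records)
print(sessions)
```

In this sample, the request for `/x` starts a new session because its referrer `/z` was never accessed, and the request for `/c` starts another because it arrives 92 seconds after the previous record with an empty referrer.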

PATH COMPLETION

A session identified by rule 3 may contain more than one visit by the same user at different times; a time-oriented heuristic is then used to divide the different visits into different user sessions.

The path completion process should be used to acquire the complete user access path.

There are several reasons for path incompleteness. For instance, the local cache, the agent cache, the “post” technique and the browser’s “back” button can result in some important accesses not being recorded in the access log file, so the number of Uniform Resource Locators (URLs) recorded in the log may be less than the real one.

Local caching and proxy servers also create difficulties for path completion, because users can access pages in the local cache or the proxy server's cache without leaving any record in the server’s access log.

As a result, the user access paths are incompletely preserved in the web access log. To discover the user’s travel pattern, the missing pages in the user access path should be appended. The purpose of path completion is to accomplish this task.
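A common way to append the missing pages is to use the referrer of each request. The sketch below illustrates that idea under stated assumptions; it is not necessarily the procedure used in the presentation. A session is assumed to be a time-ordered list of hypothetical `(uri, referrer)` pairs, and when a request's referrer is an earlier page rather than the previous one, the user is assumed to have pressed the "back" button through cached pages.

```python
def complete_path(requests):
    """Rebuild the access path of one session from (uri, referrer) pairs."""
    path = []
    for uri, referrer in requests:
        if path and referrer and referrer != path[-1] and referrer in path:
            # Position of the most recent visit to the referrer.
            idx = len(path) - 1 - path[::-1].index(referrer)
            # Pages revisited from cache while going back to the referrer.
            backtrack = path[idx:len(path) - 1][::-1]
            path.extend(backtrack)
        path.append(uri)
    return path

# Hypothetical session: D was reached from A, although C was the last logged page.
requests = [("A", ""), ("B", "A"), ("C", "B"), ("D", "A")]
completed = complete_path(requests)
print(completed)
```

Here the logged path A, B, C, D is completed to A, B, C, B, A, D: the two backward steps were served from the cache and never reached the server log.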

CONTENT PREPROCESSING

Converting text, images, scripts, and other files such as multimedia into forms that are useful for the Web usage mining process. This consists of performing content mining such as classification or clustering (also found in pattern discovery).

PATTERN DISCOVERY

Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition.

The main methods are:

• Statistics
• Association rules
• Clustering
• Classification
• Sequential patterns
• Path analysis
• etc.

PATTERN DISCOVERY (CONT.)

Statistics:

• The most common method.

• This kind of analysis is performed by many tools; its aim is to describe the traffic on a Web site, e.g. most visited pages, average daily hits, etc.

• Useful for improving system performance, enhancing the security of the system, facilitating the site modification task, etc.
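The two statistics named above can be computed with the standard library. This is a minimal sketch over a hypothetical cleaned log; the `(date, uri)` record layout and the sample values are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical cleaned log: (date, uri) pairs.
hits = [
    ("2024-01-01", "/index.html"),
    ("2024-01-01", "/products/"),
    ("2024-01-01", "/index.html"),
    ("2024-01-02", "/index.html"),
    ("2024-01-02", "/products/software/"),
    ("2024-01-02", "/products/"),
]

page_counts = Counter(uri for _, uri in hits)
daily_counts = Counter(day for day, _ in hits)

most_visited = page_counts.most_common(2)  # the two most visited pages
average_daily_hits = sum(daily_counts.values()) / len(daily_counts)

print(most_visited)
print(average_daily_hits)
```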

PATTERN DISCOVERY (CONT.)

Association rules:

• The main idea is to consider every URL requested by a user in a visit as basket data (an item) and to discover relationships between them that meet a minimum support level.

• Discovers the correlations among references to various pages of a web site in a single server session.

• Useful for restructuring the web site and as a heuristic for pre-fetching documents to reduce latency.

ASSOCIATION RULES (CONT.)

Discovers affinities among sets of items across transactions:

X ==(confidence, support)==> Y, where X and Y are sets of items.

Examples:

• 60% of clients who accessed /products/ also accessed /products/software/webminer.htm.

• 30% of clients who accessed /special-offer.html placed an online order in /products/software/.
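The support and confidence of a rule X ==> Y can be computed directly from the sessions. This is a minimal sketch using the usual definitions: support is the fraction of sessions containing all the items, and confidence is the fraction of sessions containing X that also contain Y. The sample sessions are hypothetical and constructed so that the first example rule comes out at 60%.

```python
def support(sessions, items):
    """Fraction of sessions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(session) for session in sessions) / len(sessions)

def confidence(sessions, x, y):
    """Fraction of the sessions containing X that also contain Y."""
    return support(sessions, set(x) | set(y)) / support(sessions, x)

# Hypothetical sessions, each a list of requested URLs.
sessions = [
    ["/products/", "/products/software/webminer.htm"],
    ["/products/", "/products/software/webminer.htm", "/index.html"],
    ["/index.html", "/products/", "/products/software/webminer.htm"],
    ["/products/", "/index.html"],
    ["/products/"],
    ["/index.html"],
    ["/special-offer.html"],
    ["/index.html", "/special-offer.html"],
]

x = ["/products/"]
y = ["/products/software/webminer.htm"]
rule_support = support(sessions, x + y)
rule_confidence = confidence(sessions, x, y)
print(rule_support, rule_confidence)
```

A full association rule miner such as Apriori would enumerate candidate item sets and keep those above the minimum support; this sketch only evaluates a single given rule.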

PATTERN DISCOVERY (CONT.)

Clustering:

• Meaningful clusters of URLs can be created by discovering similar characteristics between them according to users' behaviour.

• Usage clusters: useful for market segmentation in e-commerce or for providing personalized Web content to users.

• Page clusters: useful for Internet search engines and web assistance providers.

PATTERN DISCOVERY (CONT.)

Classification:

• Develops a profile of users belonging to a particular class or category.

• Requires extraction and selection of features that best describe the properties of a given class or category.

Clustering and classification examples:

• Clients who often access /products/software/webminer.html tend to be from educational institutions.

• Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.

• 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.

PATTERN DISCOVERY (CONT.)

Sequential patterns:

• 30% of clients who visited /products/software/ had done a search in Yahoo using the keyword “software” before their visit.

• 60% of clients who placed an online order for WEBMINER placed another online order for software within 15 days.
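A sequential pattern of this kind can be measured by checking, for each client, whether one event is followed by another within a time window. This is a minimal sketch; the event labels, the `(day, action)` layout and the client data are hypothetical and constructed so that the ratio comes out at 60%.

```python
def followed_within(events, first, then, max_days):
    """True if some `first` event is followed by a later `then` event
    at most `max_days` days afterwards. `events` is a list of (day, action)."""
    for day_a, action_a in events:
        if action_a != first:
            continue
        for day_b, action_b in events:
            if action_b == then and 0 < day_b - day_a <= max_days:
                return True
    return False

# Hypothetical per-client event histories.
clients = {
    "c1": [(1, "order:webminer"), (10, "order:software")],
    "c2": [(2, "order:webminer"), (30, "order:software")],
    "c3": [(5, "order:webminer"), (6, "order:software")],
    "c4": [(3, "order:webminer")],
    "c5": [(7, "order:webminer"), (20, "order:software")],
    "c6": [(4, "order:software")],
}

# Only clients who placed the first order form the base of the percentage.
base = [e for e in clients.values() if any(a == "order:webminer" for _, a in e)]
matches = [e for e in base if followed_within(e, "order:webminer", "order:software", 15)]
ratio = len(matches) / len(base)
print(ratio)
```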

PATTERN DISCOVERY (CONT.)

Path analysis:

Types of path/usage information:

• Most frequent paths traversed by users.

• Entry and exit points.

• Distribution of user session durations.
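The first two kinds of path information can be derived by counting over the identified sessions. This is a minimal sketch; the sessions are hypothetical, and a "path" is taken here to mean a run of two consecutive pages, which is an assumption made to keep the example short.

```python
from collections import Counter

# Hypothetical sessions, each a time-ordered list of visited pages.
sessions = [
    ["/index.html", "/products/", "/products/software/"],
    ["/index.html", "/products/", "/contact.html"],
    ["/products/", "/products/software/"],
    ["/index.html", "/products/", "/products/software/"],
]

def subpaths(session, length):
    """All runs of `length` consecutive pages in a session."""
    return [tuple(session[i:i + length]) for i in range(len(session) - length + 1)]

path_counts = Counter(p for s in sessions for p in subpaths(s, 2))
entry_points = Counter(s[0] for s in sessions)
exit_points = Counter(s[-1] for s in sessions)

print(path_counts.most_common(2))
print(entry_points)
print(exit_points)
```

The distribution of session durations could be obtained the same way by counting the differences between the first and last timestamps of each session, if timestamps are kept alongside the pages.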

PATTERN ANALYSIS

The challenge of pattern analysis is to filter out uninteresting information and to visualize and interpret the interesting patterns for the user.

• First, delete the less significant rules or models from the repository of interesting models.

• Next, use technologies such as OLAP to carry out comprehensive mining and analysis.

• Then, make the discovered data or knowledge visible.

• Finally, provide the characteristic service to the electronic commerce website.

WEB MINING SOFTWARE

• SPSS Clementine.
• Megaputer PolyAnalyst.
• ClickTracks by Web analytics.
• QL2 by QL2 Software Inc.

Thanks…..
