Discovering Knowledge Using Web Structure Mining

Upload: atul-khanna

Post on 17-Dec-2014

Page 1: Discovering knowledge using web structure mining

Discovering Knowledge Using Web Structure Mining

1. What Is the Web?

1.1 Problems With the Web

Difficulty in finding relevant information

Personalization of information

Learning about consumers or individual users

2. Objectives

i. To survey the area of web mining.

ii. To introduce link mining.

iii. To review the HITS and PageRank algorithms.

3. Web Mining: Definition

The process of discovering potentially useful and previously unknown information or knowledge from web data.

3.1 Web Mining: Subtasks

Resource finding

Information selection and pre-processing

Generalization

Analysis

3.1 Web Mining Categories

Web mining is divided into three categories, each working on a different kind of data:

Web Content Mining: text and multimedia documents

Web Structure Mining: hyperlink structure

Web Usage Mining: web log records

3.1.1 Web Content Mining

Scans the data of a web page to determine how relevant its content is to a search query.

Two approaches to web content mining:

Agent-based approach

Database approach

3.1.2 Web Structure Mining

Identifies relationships between web pages.

Focuses on the following problems:

Reducing irrelevant search results

Helping to index information on the web

3.1.3 Web Usage Mining

Focuses on techniques that predict user behavior while interacting with the WWW.

Web log records are analyzed to discover user access patterns.

The challenges can be divided into three phases:

Pre-processing

Pattern discovery

Pattern analysis
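The three phases can be illustrated with a minimal sketch. The log format ("user timestamp path"), the sample records, and the function names below are assumptions made for illustration only; real web server logs carry more fields and need real session handling.

```python
from collections import Counter

# Hypothetical, simplified log records: "user_id timestamp requested_path".
RAW_LOG = [
    "u1 1 /home",
    "u1 2 /products",
    "u1 3 /cart",
    "u2 1 /home",
    "u2 2 /products",
    "u2 3 /home",
    "bad line",
]

def preprocess(lines):
    """Pre-processing: drop malformed records, group paths per user in time order."""
    sessions = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip records that do not match the assumed format
        user, timestamp, path = parts[0], int(parts[1]), parts[2]
        sessions.setdefault(user, []).append((timestamp, path))
    return {user: [path for _, path in sorted(hits)] for user, hits in sessions.items()}

def discover_patterns(sessions):
    """Pattern discovery: count how often one page is directly followed by another."""
    transitions = Counter()
    for pages in sessions.values():
        transitions.update(zip(pages, pages[1:]))
    return transitions

def analyze(transitions, min_count=2):
    """Pattern analysis: keep only transitions seen at least min_count times."""
    return {pair: count for pair, count in transitions.items() if count >= min_count}

sessions = preprocess(RAW_LOG)
frequent = analyze(discover_patterns(sessions))
print(frequent)
```

In this toy log the only transition made by more than one user is from /home to /products, so that is the only access pattern that survives the analysis phase.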

4. Link Mining

Link mining is located at the intersection of work in:

Link analysis

Hypertext and web mining

Relational learning and inductive logic programming

Graph mining

Some link mining tasks applicable to web structure mining are:

Link-based classification

Link-based cluster analysis

Link type

Link strength

Link cardinality

(i) Link-based Classification

Predicts the category of a web page based on:

Words that occur on the page

Links between pages

Anchor text

HTML tags and other possible attributes of the web page

E.g.: Predicting the category of a paper based on its citations and co-citations.
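As a rough illustration of the citation example, the sketch below predicts a paper's category by majority vote over the labeled papers it is linked to. The graph, the labels, and the function name are hypothetical, and the vote uses only direct citation links (not co-citations, page words, or anchor text), so it is a much simpler stand-in for a real link-based classifier.

```python
from collections import Counter

# Hypothetical citation graph: paper -> papers it cites.
CITES = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": [],
    "p4": ["p1", "p2", "p5"],
    "p5": [],
}

# Hypothetical known categories; p4 is the unlabeled paper.
LABELS = {"p1": "databases", "p2": "databases", "p3": "databases", "p5": "networks"}

def predict_category(paper, cites, labels):
    """Predict a category by majority vote over linked, labeled papers."""
    # Outgoing links: papers this paper cites.
    neighbors = set(cites.get(paper, []))
    # Incoming links: papers that cite this paper.
    neighbors.update(src for src, targets in cites.items() if paper in targets)
    votes = Counter(labels[n] for n in neighbors if n in labels)
    if not votes:
        return None  # no labeled neighbors, so no prediction
    return votes.most_common(1)[0][0]

print(predict_category("p4", CITES, LABELS))
```

Here p4 cites two "databases" papers and one "networks" paper, so the vote assigns it to "databases".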

(ii) Link-based Cluster Analysis

Goal: Finding naturally occurring subclasses.

Data is segmented into groups: similar objects are grouped together, and dissimilar objects are placed in different groups.

Helps in discovering hidden patterns.

E.g.: Finding diseases with similar transmission patterns.
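One very simple way to group objects using links alone is to treat the link graph as undirected and take its connected components as clusters. This is only a stand-in chosen for brevity, not the method of the presentation; practical link-based clustering uses richer similarity measures. The edge list and function name are hypothetical.

```python
def link_clusters(edges):
    """Return clusters as the connected components of the undirected link graph."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    clusters = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects every node reachable from start.
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical links between objects.
EDGES = [("a", "b"), ("b", "c"), ("d", "e")]
clusters = link_clusters(EDGES)
print(clusters)
```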

(iii) Link Type

Predicting the type of link between two entities.

Predicting the purpose of a link, e.g. navigational or advertising.

(iv) Link Strength

Links can be associated with weights.

Strong links: higher weight

Weak links: lower weight

(v) Link Cardinality

Refers to the number of inbound links to a web site.

Link popularity: a combination of factors that weigh the importance of each incoming link.
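The difference between counting inbound links and weighing them can be sketched as follows. The link list and the per-source weights are hypothetical, and using a single weight per source site is an assumption; the presentation does not say which factors go into link popularity.

```python
from collections import Counter

# Hypothetical links as (source_site, target_site) pairs.
LINKS = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c"), ("d", "a")]

def inbound_counts(links):
    """Link cardinality: number of inbound links per site."""
    return Counter(target for _, target in links)

def link_popularity(links, source_weight, default=1.0):
    """Weighted variant: each inbound link counts with the weight of its source."""
    score = {}
    for source, target in links:
        score[target] = score.get(target, 0.0) + source_weight.get(source, default)
    return score

counts = inbound_counts(LINKS)
# Assumed weights: links from "a" are strong, links from "d" are weak.
popularity = link_popularity(LINKS, {"a": 2.0, "d": 0.5})
print(counts["c"], popularity["c"])
```

Site c has three inbound links, but its weighted score is 3.5 because the link from a counts double and the link from d counts half.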

5. Hyperlink-Induced Topic Search (HITS)

A link analysis algorithm that rates pages.

Identifies two kinds of pages from the Web hyperlink structure:

Authorities: Pages that contain valuable information on the subject.

Hubs: Pages that contain useful links to the authoritative pages.

Figure: Hubs are web pages with links to other pages; authorities are web pages with content.

HITS Contd…

Two-step process:

Sampling step: A set of relevant pages is collected.

Iterative step: Hubs and authorities are found using the output of the sampling step.

HITS Contd…

Sampling step:

A query submitted to a search engine yields a root set.

The root set is then expanded into a base set.

Figure: Expanding the root set into the base set
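The expansion can be sketched as below, following the usual description of HITS: the base set contains the root pages, the pages they link to, and a capped number of pages that link to them. The dictionaries, the function name, and the cap parameter max_in_links are illustrative assumptions, not part of the slides.

```python
def expand_to_base_set(root_set, out_links, in_links, max_in_links=50):
    """Expand a root set into a base set using outgoing and incoming links."""
    base = set(root_set)
    for page in root_set:
        # Add every page the root page points to.
        base.update(out_links.get(page, []))
        # Add at most max_in_links pages that point to the root page.
        base.update(list(in_links.get(page, []))[:max_in_links])
    return base

# Hypothetical link data around a two-page root set.
OUT_LINKS = {"r1": ["x"], "r2": []}
IN_LINKS = {"r1": ["y", "z"], "r2": ["w"]}

base_set = expand_to_base_set(["r1", "r2"], OUT_LINKS, IN_LINKS, max_in_links=1)
print(sorted(base_set))
```

The cap keeps a very popular root page from pulling thousands of in-linking pages into the base set.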

HITS Contd…

Iterative step:

Associate a non-negative authority weight x<p> and a non-negative hub weight y<p> with each page p.

Figure: Computing authority weight

Figure: Computing hub weight
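In the standard formulation of HITS, the authority weight of a page is the sum of the hub weights of the pages linking to it, and the hub weight of a page is the sum of the authority weights of the pages it links to; the weights are normalized after each round. The sketch below assumes that formulation. The function names, the fixed iteration count, and the sample links are illustrative choices.

```python
import math

def normalize(weights):
    """Scale weights so that their squares sum to one."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}

def hits(links, iterations=50):
    """Compute authority and hub weights from (source, target) link pairs."""
    links = list(links)
    pages = {page for edge in links for page in edge}
    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of pages that link to the page.
        new_auth = {page: 0.0 for page in pages}
        for source, target in links:
            new_auth[target] += hub[source]
        # Hub update: sum the authority weights of pages the page links to.
        new_hub = {page: 0.0 for page in pages}
        for source, target in links:
            new_hub[source] += new_auth[target]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical base set: h1 links to both content pages, h2 and h3 only to a1.
LINKS = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(LINKS)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this example a1 ends up as the strongest authority because three pages point to it, and h1 as the strongest hub because it points to both authorities.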

Problems With the HITS Algorithm

Some problems with the HITS algorithm are:

Mutually reinforced relationships between hosts

Automatically generated links

Non-relevant nodes

Hubs and authorities

Topic drift

Efficiency

6. PageRank Model

A link analysis algorithm.

Assigns a numeric value that indicates the importance of a web page.

Computes importance from the number of incoming links.

PageRank Contd…

The rank of a page is divided evenly among its out-links to contribute to the ranks of the pages they point to.

PageRanks form a probability distribution over web pages, so the sum of all pages’ PageRanks will be one.

PageRank Contd…

PageRank can be calculated by:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

T1..Tn are the pages that point to page A.

C(A) is defined as the number of links going out of page A.

d is the damping factor, which is usually set to 0.85.

The damping factor is the probability that, at each page, a random surfer follows one of the page's links; with probability 1 - d the surfer gets bored and requests another random page.

In this form the PageRanks sum to the number of pages N when every page has at least one out-link; replacing the (1 - d) term with (1 - d)/N makes them sum to one.
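A minimal iterative sketch of the calculation is shown below. It uses the normalized variant of the formula, with (1 - d)/N in place of (1 - d), so that the ranks form a probability distribution. The function name, the fixed iteration count, the even spreading of rank from pages without out-links, and the sample graph are assumptions made for illustration.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iteratively compute PageRank from a dict of page -> list of linked pages."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the random-jump share (1 - d) / N.
        new_ranks = {page: (1.0 - d) / n for page in pages}
        for page in pages:
            targets = out_links.get(page, [])
            if targets:
                # A page's rank is divided evenly among its out-links.
                share = d * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page without out-links spreads its rank over all pages.
                for target in pages:
                    new_ranks[target] += d * ranks[page] / n
        ranks = new_ranks
    return ranks

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)
print({page: round(rank, 3) for page, rank in sorted(ranks.items())})
```

Page C receives rank from both A and B, so it ends up with the highest value, while B, which only receives half of A's rank, ends up with the lowest.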

Applications

HITS was used in the Clever search engine by IBM.

PageRank is used by Google.

