Authoritative Sources in a Hyperlinked Environment

Presented by Lokesh Chikkakempanna

Agenda
Introduction
Central issue
Queries
Constructing a focused subgraph
Computing hubs and authorities
Extracting authorities and hubs
Similar-page queries
Conclusion

Introduction
The process of discovering pages that are relevant to a particular query.
A hyperlinked environment can be a rich source of information, provided there is an effective means for understanding its structure.
We therefore analyze the link structure of the WWW environment.
The WWW is a hypertext corpus of enormous complexity, and it continues to expand at a very fast rate.
High-level structure can emerge only through analysis of the WWW environment as a whole.

Central Issue
Distillation of broad search topics through the discovery of authoritative information sources.
Link analysis for discovering authoritative pages.
Improving the quality of search methods on the WWW is a rich and interesting problem: any solution must be efficient in both computation and storage.
What should a typical search tool compute, in the extra time it takes, to produce results of greater value to the user?
There is no concretely defined objective function that corresponds to human notions of quality.

Queries
Types of queries:
Specific queries lead to the scarcity problem.
Broad-topic queries lead to the abundance problem.
The goal is to filter a huge set of relevant pages down to a small set of the most authoritative or definitive ones.
Problems in identifying authorities. Example: "harvard". There are over a million pages on the web that use the term "harvard", so term frequency (TF) alone cannot single out the authoritative page.
How do we circumvent this problem? Link analysis.
Human judgment is needed to formulate the notion of authority: if a person includes a link to page q in page p, they have conferred authority on q in some measure.
What are the problems with this?

Links may be created for various reasons: for navigational purposes, as paid advertisements, or even by a bot that keeps adding links to pages. What is the solution?

Link-Based Model for the Conferral of Authority
Identifies relevant, authoritative WWW pages for broad search topics.
Based on the relationship between authorities and hubs.
Exploits the equilibrium between authorities and hubs to develop an algorithm that identifies both types of pages simultaneously.
The algorithm operates on a focused subgraph produced from the results of a text-based search engine (for example, AltaVista).
It produces a small collection of pages likely to contain the most authoritative pages for a given topic.

Constructing a Focused Subgraph of the WWW
We can view any collection V of hyperlinked pages as a directed graph G = (V, E): the nodes correspond to the pages, and an edge (p, q) indicates the presence of a link from p to q.
We construct a subgraph of the WWW on which the algorithm operates.
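As a small illustration, here is one way such a collection could be represented in Python, with pages as nodes and hyperlinks as directed (p, q) edges; the page identifiers are made up for the example:

```python
# A hyperlinked collection viewed as a directed graph G = (V, E).
# V is a set of pages; E is a set of (p, q) pairs meaning "p links to q".
V = {"p1", "p2", "p3"}                          # hypothetical page identifiers
E = {("p1", "p2"), ("p1", "p3"), ("p2", "p3")}

def in_degree(p, E):
    """Number of pages that link to p."""
    return sum(1 for (_, q) in E if q == p)

def out_degree(p, E):
    """Number of pages that p links to."""
    return sum(1 for (q, _) in E if q == p)

print(in_degree("p3", E), out_degree("p1", E))  # -> 2 2
```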

The goal is to focus the computational effort on relevant pages. We want a collection S(σ) of pages with three properties:
(i) S(σ) is relatively small.
(ii) S(σ) is rich in relevant pages.
(iii) S(σ) contains most (or many) of the strongest authorities.
How do we find such a collection of pages?
Take the t highest-ranked pages for the query σ from a text-based search engine. These t pages are referred to as the root set R(σ).
The root set satisfies conditions (i) and (ii), but it is far from satisfying (iii). Why? There are often extremely few links between pages in R(σ), rendering it essentially structureless.
Example: the root set for the query "java" contained only 15 links between pages in different domains, out of 200 * 199 possible links (t = 200).
We can grow the root set R(σ) into a set S(σ) that satisfies all three conditions: a strong authority may not be in R(σ), but it is likely to be pointed to by at least one page in R(σ).
Subgraph(σ, ε, t, d), where σ is a query string, ε is a text-based search engine, and t and d are natural numbers.

S(σ) is obtained by growing R(σ) to include any page pointed to by a page in R(σ) and any page that points to a page in R(σ); a single page in R(σ) brings at most d pages into S(σ). Does this S(σ) contain strong authorities?

Heuristics to Refine S(σ)
There are two types of links:
Transverse links, between pages with different domain names.
Intrinsic links, between pages with the same domain name.
Remove all the intrinsic links to obtain a graph G(σ); this discards links that serve purely navigational purposes.
A large number of pages from a single domain may all point to one page p, often because of advertisements. Allow only m (roughly 4-8) pages from a single domain to point to any given page p.
G(σ) now contains many relevant pages and strong authorities. A sketch of this construction follows.
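A rough Python sketch of the construction, under the assumption of three hypothetical helpers: search_top(sigma, t) returning the t highest-ranked pages for the query, out_links(p) returning the pages p links to, and in_links(p, d) returning up to d pages linking to p. The m = 6 default is simply a value in the 4-8 range mentioned above.

```python
from urllib.parse import urlparse
from collections import defaultdict

def build_focused_subgraph(sigma, t, d, search_top, out_links, in_links, m=6):
    """Grow the root set R(sigma) into the base set S(sigma), then apply the
    link heuristics to obtain the graph G(sigma) on which the algorithm runs."""
    root = set(search_top(sigma, t))           # R(sigma): top t results
    base = set(root)                           # S(sigma)
    for p in root:
        base.update(out_links(p))              # pages pointed to by R(sigma)
        base.update(in_links(p, d))            # at most d pages pointing to p

    domain = lambda url: urlparse(url).netloc
    edges = set()
    per_domain = defaultdict(int)              # (source domain, target page) -> count
    for p in base:
        for q in out_links(p):
            if q not in base or domain(p) == domain(q):
                continue                       # outside S(sigma), or intrinsic link
            if per_domain[(domain(p), q)] >= m:
                continue                       # cap pages per domain pointing to q
            per_domain[(domain(p), q)] += 1
            edges.add((p, q))
    return base, edges                         # G(sigma) as (nodes, directed edges)
```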

Computing Hubs and Authorities
Extracting authorities based on maximum in-degree alone does not work. Example: for the query "java", the pages with the largest in-degree were www.gamelan.com and java.sun.com, together with advertising pages and the home page of Amazon. While the first two are good answers, the others are not relevant.
Authoritative pages relevant to the initial query should not only have large in-degree; since they are all authorities on a common topic, there should also be considerable overlap in the sets of pages that point to them.
Thus, in addition to authorities, we should find what are called hub pages: pages that link to multiple relevant authoritative pages. Hub pages allow us to discard unrelated pages that merely have high in-degree.
Mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.
We must break this circularity to identify hubs and authorities. How?

An Iterative Algorithm
The algorithm maintains and updates numerical weights for each page. Each page p is associated with a non-negative authority weight x^p and a non-negative hub weight y^p. The weights of each type are normalized so that their squares sum to 1. Pages with larger x and y values are considered better authorities and hubs, respectively. Two operations update the weights.

The first operation, I, updates the authority weights: x^p <- the sum of y^q over all pages q that link to p, i.e., over all edges (q, p) in E.

The second operation, O, updates the hub weights: y^p <- the sum of x^q over all pages q that p links to, i.e., over all edges (p, q) in E.

The set of authority weights is represented as a vector x with a coordinate for each page in G(σ), and the set of hub weights as a vector y.
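A minimal Python sketch of the two operations and the normalization step, assuming the graph is given as a set of (p, q) link pairs as in the earlier example; the function names are mine, not the paper's:

```python
import math

def apply_I(x, y, edges):
    """I operation: each page's authority weight becomes the sum of the
    hub weights of the pages that point to it."""
    new_x = {p: 0.0 for p in x}
    for q, p in edges:                 # edge (q, p): q links to p
        new_x[p] += y[q]
    return new_x

def apply_O(x, y, edges):
    """O operation: each page's hub weight becomes the sum of the
    authority weights of the pages it points to."""
    new_y = {p: 0.0 for p in y}
    for p, q in edges:                 # edge (p, q): p links to q
        new_y[p] += x[q]
    return new_y

def normalize(w):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {p: v / norm for p, v in w.items()}
```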

Iterate(G, k)
G: a collection of n linked pages
k: a natural number
Let z denote the vector (1, 1, 1, ..., 1) in R^n.
Set x_0 := z and y_0 := z.
For i = 1, 2, ..., k:
  Apply the I operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i.
  Apply the O operation to (x'_i, y_{i-1}), obtaining new y-weights y'_i.
  Normalize x'_i, obtaining x_i.
  Normalize y'_i, obtaining y_i.
End
Return (x_k, y_k).

To filter out the top c authorities and top c hubs:

Filter(G, k, c)
G: a collection of n linked pages
k, c: natural numbers
(x_k, y_k) := Iterate(G, k).
Report the pages with the c largest coordinates in x_k as authorities.
Report the pages with the c largest coordinates in y_k as hubs.

Filter is applied with G set equal to G(σ) and c in the range 5-10. A sketch of both procedures follows.
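A minimal Python sketch of Iterate and Filter, reusing the apply_I, apply_O, and normalize helpers sketched above; the graph is again assumed to be (pages, edges) as produced by the subgraph construction, and the parameter values in the usage comment are only illustrative:

```python
def iterate(pages, edges, k):
    """Run k rounds of the I and O operations, normalizing after each round."""
    x = {p: 1.0 for p in pages}              # x_0 = z = (1, 1, ..., 1)
    y = {p: 1.0 for p in pages}              # y_0 = z
    for _ in range(k):
        x_new = apply_I(x, y, edges)         # x'_i from (x_{i-1}, y_{i-1})
        y_new = apply_O(x_new, y, edges)     # y'_i from (x'_i, y_{i-1})
        x = normalize(x_new)
        y = normalize(y_new)
    return x, y

def filter_top(pages, edges, k, c):
    """Report the c largest coordinates of x_k as authorities and of y_k as hubs."""
    x, y = iterate(pages, edges, k)
    authorities = sorted(x, key=x.get, reverse=True)[:c]
    hubs = sorted(y, key=y.get, reverse=True)[:c]
    return authorities, hubs

# Example usage on the focused subgraph G(sigma):
#   pages, edges = build_focused_subgraph(...)
#   authorities, hubs = filter_top(pages, edges, k=20, c=10)
```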

With arbitrarily large values of k, the sequences of vectors {x_k} and {y_k} converge to fixed points x* and y*.
What do these fixed points correspond to in R^n? They are eigenvectors: λ is an eigenvalue of an n x n matrix M if Mω = λω for some non-zero vector ω, and ω is then an eigenvector associated with λ. If A is the adjacency matrix of G(σ), x* is a principal eigenvector of A^T A and y* is a principal eigenvector of A A^T.

Similar-Page Queries
The algorithm discussed here can be applied to another type of problem: using the link structure to infer a notion of similarity among pages. We begin with a page p and pose the request "Find t pages pointing to p", then run the same hubs-and-authorities computation on the resulting subgraph.

Conclusion
The approach developed here might be integrated into a study of traffic patterns on the WWW.
Future work could extend the method to queries other than broad-topic queries.
It would be interesting to understand eigenvector-based heuristics more completely in the context of the algorithms presented here.

Thank You!