measurement and modeling of the web and related data sets
DESCRIPTION
 Web Measurement; Self similarity on the web; Extraction of information from large graphs; A word on evolution
TRANSCRIPT
 1. IMA Tutorial (part II): Measurement and modeling of the web and related data sets Andrew Tomkins IBM Almaden Research Center May 5, 2003 Title slide
2. Setup
 This hour: data analysis on the web
 Next hour: probabilistic generative models, particularly focused on models that generate distributions that are power laws in the limit
3. Context
 Data analysis on the web
 as a hyperlinked corpus
 Note: Many areas of document analysis are highly relevant to the web, and should not be ignored (but will be):

 Supervised/unsupervised classification (Jon combinatorial side)

 Machine learning (Jon a little)

 Information retrieval (Jon dimensionality reduction)

 Information extraction

 NLP

 Discourse analysis

 Relationship induction

 etc
4. Focus Areas
 Web Measurement
 Self similarity on the web
 Extraction of information from large graphs
 A word on evolution
5. One view of the Internet: InterDomain Connectivity
 Core: maximal clique of high-degree nodes
 Shells: nodes in 1-neighborhood of core, or of previous shell, with degree > 1
 Legs: degree-1 nodes
[Figure: core and shells 1, 2, 3] [Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]
6. Another view of the web: the hyperlink graph
 Each static html page = a node
 Each hyperlink = a directed edge
 Currently ~10^10 nodes (mostly junk), 10^11 edges
7. Getting started: structure at the hyperlink level
 Measure properties of the link structure of the web.
 Study a sample of the web that contains a reasonable fraction of the entire web.
 Apply tools from graph theory to understand the structure.
[Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]
8. Terminology
 SCC: strongly connected component
 WCC: weakly connected component, i.e., a connected component in the underlying undirected graph
9. Data
 Altavista crawls, up to 500M pages
 Ran strong and weak connected component algorithms
 Ran random directed breadth-first searches from 1000 starting nodes, both forwards and backwards along links
10. Breadth-first search from random starts
 How many vertices are reachable from a random vertex?
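The measurements described on these slides (forward and backward breadth-first search, SCC, WCC) can be sketched on a toy graph. This is a minimal illustration, not the code used on the Altavista crawls; the adjacency-dict representation and the function names `reachable`, `reverse`, `scc_of`, and `wcc_of` are assumptions made for the example.

```python
from collections import deque

def reachable(adj, start):
    """Set of nodes reachable from `start` by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def reverse(adj):
    """Graph with every edge flipped, for searching backwards along links."""
    rev = {}
    for u, outs in adj.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev

def scc_of(adj, node):
    """SCC of `node`: nodes reachable both forwards and backwards."""
    return reachable(adj, node) & reachable(reverse(adj), node)

def wcc_of(adj, node):
    """WCC of `node`: connected component in the underlying undirected graph."""
    undirected = {}
    for u, outs in adj.items():
        for v in outs:
            undirected.setdefault(u, []).append(v)
            undirected.setdefault(v, []).append(u)
    return reachable(undirected, node)

# Toy graph (hypothetical): "in" -> cycle a -> b -> c -> a -> "out"
toy = {"in": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "out"]}
```

On the toy graph, `scc_of(toy, "a")` is the three-node cycle, while a forward search from `"out"` reaches only `"out"` itself, which is the kind of asymmetry the random-start experiment measures.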
11. A Picture of (~200M) pages.
12. Some distance measurements
 Pr[ u reachable from v ] ~ 1/4
 Max distance between 2 SCC nodes: 28
 Max distance between 2 nodes (if there is a path) > 900
 Avg distance between 2 SCC nodes: 16
13. Facts (about the crawl).
 Indegree and Outdegree distributions satisfy the power law. Consistent over time and scale.
 The distribution of indegrees on the web is given by a power law: a heavy-tailed distribution, with many high-indegree pages (e.g., Yahoo)
14. Analysis of power law
 Pr[ page has k inlinks ] ~ k^{-2.1}
 Corollary: Pr[ page has > k inlinks ] ~ 1/k
 Pr[ page has k outlinks ] ~ k^{-2.7}
15. Component sizes.
 Component sizes are distributed by the power law.
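One way to check a claimed exponent on a sample of degrees or component sizes is a maximum-likelihood fit. The sketch below treats the data as continuous, which is an approximation for integer degrees, and tests the estimator on synthetic data; `sample_power_law` and `estimate_exponent` are hypothetical helper names, not part of the original study.

```python
import math
import random

def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n values with density proportional to x^(-alpha) for x >= x_min
    (inverse-transform sampling; requires alpha > 1)."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def estimate_exponent(values, x_min=1.0):
    """Continuous maximum-likelihood estimate of the density exponent alpha."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

# Synthetic check with the indegree exponent quoted on the slides.
values = sample_power_law(20000, 2.1, seed=1)
alpha_hat = estimate_exponent(values)
```

A density exponent of 2.1 corresponds to a tail exponent of 1.1, which is the sense in which Pr[ page has > k inlinks ] is roughly 1/k.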
16. Other observed power laws in the web
 Depths of URLs
 Sizes of sites
 Eigenvalues of adjacency matrix of hyperlink graph [Mihail and Papadimitriou shed some light here]
 Many different traffic measures
 Linkage between hosts and domains
 Many of the above measures on particular subsets of the graph
[Faloutsos, Faloutsos, Faloutsos 99] [Bharat, Chang, Henzinger, Ruhl 02]
17. More Characterization: Self-Similarity
18. Ways to Slice the Web
 Domain (*.it)
 Host (www.ibm.com)
 Geography (pages with a geographical reference in the Western US)
 Content

 Keyword: Math, subdivided by Math Geometry

 Keyword: MP3, subdivided by MP3 Napster
We call these slices Thematically Unified Communities, or TUCs
19. Self-Similarity on the Web
 Pervasive: holds for all reasonable characteristics
 Robust: holds for all reasonable slices
 Theorem:

 TUCs share properties with the web at large

 TUCs are linked by a navigational backbone
20. In particular
 All TUCs have:

 Power laws for degree, SCC, and WCC distributions

 Similar exponents for power laws

 Similar bow tie structure

 Large number of dense subgraphs
21. Is this surprising?
 YES (for downsampling general graphs). Example:
 This graph has 1 SCC containing all nodes
 Remove any nonzero fraction of edges: graph has n components of size 1
 Generally: a random subset of size n^{1/2} in a graph with O(n) edges will have only a constant number of edges
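The last claim follows from a short expectation calculation. As a sketch, assume a graph on n nodes with m = cn edges (c is a constant introduced here for illustration) and a subset S of s = n^{1/2} nodes chosen uniformly at random. An edge survives in the induced subgraph only if both endpoints land in S:

```latex
\Pr[\text{edge survives}] = \frac{s(s-1)}{n(n-1)} \approx \frac{s^2}{n^2} = \frac{1}{n},
\qquad
\mathbb{E}[\#\text{edges in } S] = cn \cdot \frac{s(s-1)}{n(n-1)} \approx cn \cdot \frac{1}{n} = c .
```

So a generic sparse graph loses almost all of its structure under this kind of downsampling, which is why the persistence of bow ties and power laws inside TUCs is notable.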
22. A structural explanation
 Each TUC has a bow tie: how do they relate?
23. The Navigational Backbone
 Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs
24. Information Extraction from Large Graphs
25. Overview
[Figure: WWW, Distill, KB1, KB2, KB3]
 Goal: Create higher-level "knowledge bases" of web information for further processing.
[Kumar, Raghavan, Rajagopalan, Tomkins 1999]
26. Many approaches to this problem
 Databases over the web:

 Web SQL, Lore, ParaSite, etc
 Data mining

 A priori, Query flocks, etc
 Information foraging
 Community extraction

 [Lawrence et al]
 Authoritybased search

 HITS, and variants
27. General approach
 It's hard (though getting easier) to analyze the content of all pages on the web
 It's easier (though still hard) to analyze the graph
 How successfully can we extract useful semantic knowledge (ie, community structure) from links alone?
28. Web Communities
[Figure: two example communities. Fishing: Outdoor Magazine, Bill's Fishing Resources. Linux: Linux Links, LDP]
 Different communities appear to have very different structure.
29. Web Communities
[Figure: the same two communities, Fishing and Linux]
 But both contain a common footprint: two pages that both point to three other pages in common
30. Communities and cores
 Example: K_{2,3}
 Definition: A "core" K_{i,j} consists of i left nodes, j right nodes, and all left->right edges.
 Critical facts:
 1. Almost all communities contain a core [expected]
 2. Almost all cores betoken a community [unexpected]
31. Other footprint structures
 Newsgroup thread
 Web ring
 Corporate partnership
 Intranet fragment
32. Subgraph enumeration
 Goal: Given a graph-theoretic "footprint" for structures of interest, find ALL occurrences of these footprints.
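For the K_{i,j} footprint, enumeration can be sketched on a small graph as degree-based pruning followed by brute force. This is an illustrative simplification under stated assumptions, not the authors' web-scale pipeline: the function names, the dict-of-lists input, and the exhaustive final step are all choices made for the example.

```python
from itertools import combinations

def prune(out_links, i, j):
    """Repeatedly drop edges that cannot lie in any K_{i,j}: the source needs
    out-degree >= j and the destination needs in-degree >= i."""
    edges = {(u, v) for u, outs in out_links.items() for v in outs if u != v}
    while True:
        outdeg, indeg = {}, {}
        for u, v in edges:
            outdeg[u] = outdeg.get(u, 0) + 1
            indeg[v] = indeg.get(v, 0) + 1
        keep = {(u, v) for u, v in edges if outdeg[u] >= j and indeg[v] >= i}
        if keep == edges:
            return edges
        edges = keep

def enumerate_cores(out_links, i, j):
    """List every (left nodes, right nodes) pair forming a K_{i,j}."""
    succ = {}
    for u, v in prune(out_links, i, j):
        succ.setdefault(u, set()).add(v)
    cores = []
    for lefts in combinations(sorted(succ), i):
        # Right nodes must be pointed to by every chosen left node.
        common = set.intersection(*(succ[u] for u in lefts))
        for rights in combinations(sorted(common), j):
            cores.append((lefts, rights))
    return cores

# Hypothetical pages: two fan pages share three targets; "x" is noise.
pages = {"f1": ["a", "b", "c"], "f2": ["a", "b", "c", "d"], "x": ["a"]}
cores = enumerate_cores(pages, 2, 3)
```

Here pruning removes the edges `x -> a` and `f2 -> d` before any combinations are tried, leaving a single K_{2,3} with left nodes `f1`, `f2` and right nodes `a`, `b`, `c`.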
33. Enumerating cores
 Preprocessing: clean data by removing mirrors (true and approximate), empty pages, too-popular pages, nepotistic pages
 Inclusion/Exclusion Pruning: a belongs to a K_{2,3} if and only if some node points to b1, b2, b3 [Figure: node a with links to b1, b2, b3]
 Postprocessing: when no more pruning is possible, finish using database techniques
34. Results for cores
[Figure: number of cores found by Elimination/Generation (in thousands; series i=3, i=4, i=5, i=6) and number of cores found during postprocessing (in thousands; series i=3, i=4)]
35. The cores are interesting
 (1) Implicit communities are defined by cores.
 (2) There are an order of magnitude more of these (10^5+).
 (3) Can grow the core to the community using further processing.
Explicit communities
 Yahoo!, Excite, Infoseek
 webrings
 news groups
 mailing lists
Implicit communities
 Japanese elementary schools
 Turkish student associations
 oil spills off the coast of Japan
 Australian fire brigades
36. Elementary Schools in Japan
 The American School in Japan
 The Link Page
 Kids' Space
 KEIMEI GAKUEN Home Page ( Japanese )
 Shiranuma Home Page
 fuzokues.fukuiu.ac.jp
 welcome to Miasa E&J school
 http://www...p/~m_maru/index.html
 fukui haruyamaes HomePage
 Torisu primary school
 goo
 Yakumo Elementary,Hokkaido,Japan
 FUZOKU Home Page
 Kamishibun Elementary School...
 schools
 LINK Page13
 100 Schools Home Pages (English)
 K12 from Japan 10/...rnet and Education )
 http://www...iglobe.ne.jp/~IKESAN
 Koulutus ja oppilaitokset
 TOYODA HOMEPAGE
 Education
 Cay's Homepage(Japanese)
 UNIVERSITY
 DRAGON97TOP
37. So
 Possible to extract an order of magnitude more communities than currently known.
 Few (4%) of these appear coincidental.
 Entirely automatic extraction.
 Open question: how to use implicit communities?
38. A word on evolution
39. A word on evolution
 Phenomenon to characterize: A topic in a temporal stream occurs in a burst of activity
 Model source as multi-state
 Each state has certain emission properties
 Traversal between states is controlled by a Markov Model
 Determine most likely underlying state sequence over time, given observable output
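The last step, recovering the most likely hidden state sequence, is a Viterbi computation. The sketch below is a simplified two-state version and not Kleinberg's exact formulation: it assumes exponentially distributed gaps between messages, and the rates and switching probability are hypothetical values picked for the example.

```python
import math

def most_likely_states(gaps, rates=(1.0, 10.0), switch_prob=0.1):
    """Viterbi decoding. State s emits the gap x between consecutive events
    with density rates[s] * exp(-rates[s] * x); the chain leaves its current
    state with probability switch_prob at each step."""
    n = len(rates)

    def log_emit(s, x):
        return math.log(rates[s]) - rates[s] * x

    def log_trans(a, b):
        p = 1.0 - switch_prob if a == b else switch_prob / (n - 1)
        return math.log(p)

    score = [math.log(1.0 / n) + log_emit(s, gaps[0]) for s in range(n)]
    back = []
    for x in gaps[1:]:
        prev, score, ptr = score, [], []
        for b in range(n):
            # Best predecessor state for landing in state b at this step.
            a = max(range(n), key=lambda a: prev[a] + log_trans(a, b))
            score.append(prev[a] + log_trans(a, b) + log_emit(b, x))
            ptr.append(a)
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Long gaps, then a run of short gaps (a burst), then long gaps again.
burst = most_likely_states([5, 5, 5, 0.05, 0.05, 0.05, 0.05, 5, 5])
blip = most_likely_states([5, 0.05, 5])
```

With these parameters a run of four short gaps is worth the cost of two state switches, so it is labelled as a burst (state index 1), while a single short gap is not. The switching penalty is what keeps the decoded sequence from flickering.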
[Kleinberg02]
40. Example
[Figure: a conversation over time: "I've been thinking about your idea with the asparagus", "Uh huh", "I think I see", "Uh huh", "Yeah, that's what I'm saying", "So then I said 'Hey, let's give it a try'", "And anyway she said maybe, okay?"]
 State 1: Output rate: very low
 State 2: Output rate: very high
 Transitions between states 1 and 2 labeled 0.005 and 0.01
 Pr[2] ~ 10, 10, 7, 2, 5, 2, 5
 Pr[1] ~ 2, 1, 2, 10, 5, 10, 1
 Most likely hidden sequence: 2 2 2 1 1 1 1
41. More bursts
 Infinite chain of increasingly high-output states
 Allows hierarchical bursts
 Example 1: email messages
 Example 2: conference titles
42. Integrating bursts and graph analysis
 Wired magazine publishes an article on weblogs that impacts the tech community
 Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption
[Figure labels: number of communities identified automatically as exhibiting bursty behavior; measure of cohesiveness of the blogspace; number of blog pages that belong to a community; number of blog communities]
[KNRT03]
43. IMA Tutorial (part III): Generative and probabilistic models of data May 5, 2003 Title slide
44. Probabilistic generative models
 Observation: These distributions have the same form:

 Fraction of laptops that fail catastrophically during tutorials, by city

 Fraction of pairs of shoes that spontaneously desole during periods of stress, by city
 Conclusion: The distribution arises because the same stochastic process is at work, and this process can be understood beyond the context of each example
45. Models for Power Laws
 Power laws arise in many different areas of human endeavor, the hallmark of human activity
 (they also occur in nature)
 Can we find the underlying process (processes?) that accounts for this prevalence?
46. An Introduction to the Power Law
 Definition: a distribution is said to have a power law if Pr[ X >= x ] ~ c x^{-α}
 Normally: 0< =W ] has some form

 Number of words with count >= W has some form

 The frequency of the word with rank r has some form
 The first two forms are clearly identical.
 What about the third?
51. Equivalence of rank versus value formulation
 Given: number of words occurring t times ~ t^{-α}
 Approach:

 Consider single most frequent word, with count T

 Characterize word occurring t times in terms of T

 Approximate rank of words occurring t times by counting number of words occurring at each more frequent count.
 Conclusion: Rank j word occurs ~ (c j + d)^{-1/(α-1)} times (power law)
 But... high ranks correspond to low values: must keep the head and the tail straight
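The approach can be written out as a short calculation. This is a sketch using a continuous approximation; the constant A and the symbols N(s) and r(t) are introduced here for illustration, with N(s) the number of words occurring s times and α > 1.

```latex
N(s) \approx A\, s^{-\alpha}
\;\Longrightarrow\;
r(t) \approx \int_{t}^{T} A\, s^{-\alpha}\, ds
      = \frac{A}{\alpha - 1}\left( t^{-(\alpha-1)} - T^{-(\alpha-1)} \right)

\Longrightarrow\;
t \approx \left( c\, r + d \right)^{-1/(\alpha-1)},
\qquad c = \frac{\alpha - 1}{A}, \quad d = T^{-(\alpha-1)} .
```

Here r(t) is the rank of a word occurring t times, obtained by counting the words at every more frequent count up to T, so the count as a function of rank is again a power law, with a shifted argument.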
[Bookstein90, Adamic99]
52. Early modeling work
 The characterization of power laws is a limiting statement
 Early modeling work showed approaches that provide the correct form of the tail in the limit
 Later work introduced the rate of convergence of a process to its limiting distribution
53. A model of Simon
 Following Simon [1955], described in terms of word frequencies
 Consider a book being written. Initially, the book contains a single word, "the".
 At time t, the book contains t words. The process of Simon generates the (t+1)-st word based on the current book.
54. Constructing a book: snapshot at time t
 When in the course of human events, it becomes necessary
 Current word frequencies:

 Rank     Word        Count
 1        the         1000
 2        of          600
 3        from        300
 ...      ...         ...
 4,791    necessary   5
 ...      ...         ...
 11,325   neccesary   1

 Let f(i,t) be the number of words of count i at time t
55. The Generative Model
 Assumptions:

 Constant probability that a neologism will be introduced at any timestep

 Probability of reusing a word of count i is proportional to i f(i,t), that is, number of occurrences of count-i words.
 Algorithm:

 With probability α a new word is introduced into the text

 With remaining probability, a word with count i is introduced with probability proportional to i f(i,t)
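The algorithm above is short enough to simulate directly. The sketch below assumes integer word ids in place of actual words; `simon_book` and `count_of_counts` are hypothetical names. Picking a uniformly random position in the book selects a word with probability proportional to its count, which gives the i f(i,t) weighting without any bookkeeping.

```python
import random
from collections import Counter

def simon_book(n_words, alpha, seed=0):
    """Generate a book of n_words word ids using Simon's process."""
    rng = random.Random(seed)
    book = [0]       # the book starts with a single word
    next_id = 1
    while len(book) < n_words:
        if rng.random() < alpha:
            book.append(next_id)            # introduce a new word
            next_id += 1
        else:
            book.append(rng.choice(book))   # reuse, proportional to count
    return book

def count_of_counts(book):
    """f(i, t): number of distinct words that occur exactly i times."""
    return Counter(Counter(book).values())

book = simon_book(10000, alpha=0.1, seed=1)
f = count_of_counts(book)
```

Plotting `f[i]` against `i` on log-log axes should show the heavy tail; the sums `sum(i * f[i])` and `sum(f.values())` recover the book length and the vocabulary size.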
56. Constructing a book: snapshot at time t
 Current word frequencies:

 Rank     Word        Count
 1        the         1000
 2        of          600
 3        from        300
 ...      ...         ...
 4,791    necessary   5
 ...      ...         ...
 11,325   neccesary   1

 Let f(i,t) be the number of words of count i at time t
 Pr["the"] = (1 - α) 1000 / K
 Pr["of"] = (1 - α) 600 / K
 Pr[some count-1 word] = (1 - α) 1 * f(1,t) / K
 K = Σ_i i f(i,t)
57. What's going on?
[Figure: buckets 1 2 3 4 5 6. Each ball is one unique word (which occurs 1 or more times). Each word in bucket i occurs i times in the current document.]
58. What's going on?
[Figure: buckets 1 2 3 4 5 6]
 With probability α a new word is introduced into the text
59. What's going on?
[Figure: buckets 1 2 3 4 5 6]
 With probability 1 - α an existing word is reused
 How many times do words in this bucket occur?
60. What's going on?
[Figure: buckets 2 3 4]
 Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t
 Must show: fraction of balls in 3rd bucket approaches some limiting value
61. Models for power laws in the web graph
 Retelling the Simon model: preferential attachment

 Barabasi et al

 Kumar et al
 Other models for the web graph:

 [Aiello, Chung, Lu], [Huberman et al]
62. Why create such a model?
 Evaluate algorithms and heuristics
 Get insight into page creation
 Estimate hardtosample parameters
 Help understand web structure
 Cost modeling for query optimization
 Finding surprises requires understanding what is typical.
63. Random graph models

                 G(n,p)    Web
 indeg > 1000    0         100000
 K_{2,3}'s       0         125000
 4-cliques       0         many

 Traditional random graphs [Bollobas 85] are not like the web!
 Is there a better model?
64. Desiderata for a graph model
 Succinct description
 Insight into page creation
 No a priori set of "topics", but...
 ... topics should emerge naturally
 Reflect structural phenomena
 Dynamic page arrivals
 Should mirror web's "rich get richer" property, and manifest link correlation.
65. Page creation on the web
 Some page creators will link to other sites without regard to existing topics, but
 Most page creators will be drawn to pages covering existing topics they care about, and will link to pages within these topics
Model idea: new pages add links by "copying" them from existing pages
66. Generally, would require
 Separate processes for:

 Node creation

 Node deletion

 Edge creation

 Edge deletion
67. A specific model
 Nodes are created in a sequence of discrete time steps

 e.g. at each time step, a new node is created with d (>= 1) outlinks
 Probabilistic copying

 links go to random nodes with probability α

 copy d links from a random node with probability 1 - α
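A minimal simulation of this copying process is sketched below. It makes the choice independently for each of the d links (uniform destination with probability α, otherwise copy the destination of a uniformly chosen out-edge of a uniformly chosen existing page). The seed node with d self-loops and the function names are assumptions made so that the example is self-contained.

```python
import random

def copying_model(n_nodes, d, alpha, seed=0):
    """Return {node: list of d link destinations} grown by link copying."""
    rng = random.Random(seed)
    out = {0: [0] * d}   # assumed seed node; self-loops make copying possible
    for new in range(1, n_nodes):
        links = []
        for _ in range(d):
            if rng.random() < alpha:
                links.append(rng.randrange(new))        # uniform existing page
            else:
                source = rng.randrange(new)             # uniform existing page
                links.append(rng.choice(out[source]))   # copy one of its links
        out[new] = links
    return out

def indegrees(out):
    """Count incoming links for every node."""
    indeg = {u: 0 for u in out}
    for links in out.values():
        for v in links:
            indeg[v] += 1
    return indeg

graph = copying_model(5000, d=3, alpha=1 / 11, seed=1)
indeg = indegrees(graph)
```

Because a page is copied in proportion to how many out-edges already point at it, early pages accumulate very large indegrees, which is the "rich get richer" effect the model is meant to capture.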

68. Example
 New node arrives
 With probability α, it links to a uniformly chosen page
69. Example
 With probability (1 - α), it decides to copy a link.
 To copy, it first chooses a page uniformly
 Then chooses a uniform outedge from that page
 Then links to the destination of that edge ("copies" the edge)
 Under copying, your rate of getting new inlinks is proportional to your indegree.
70. Degree sequences in this model
 Pr[ page has k inlinks ] ~ k^{-(2-α)/(1-α)} (α = 1/11 matches web)
 Heavy-tailed inverse polynomial degree sequences.
 Pages like netscape and yahoo exist.
 Many cores, cliques, and other dense subgraphs
71. Model extensions
 Component size distributions.
 More complex copying.
 Tighter lower tail bounds.
 More structure results.
72. A model of Mandelbrot
 Key idea: Generate frequencies of English words to maximize information transferred per unit cost
 Approach:

 Say word i occurs with probability p(i)

 Set the transmission cost of word i to be log(i)

 Average information per word: -Σ_i p(i) log(p(i))

 Cost of a word with probability p(j): log(j)

 Average cost per word: Σ_j p(j) log(j)

 Choose probabilities p(i) to maximize information/cost
 Result: p(j) = c j^{-α}
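The result can be recovered with a short first-order calculation. This is a heuristic sketch: H and C are names introduced here for the average information and average cost, and λ is a Lagrange multiplier for the constraint that the probabilities sum to 1.

```latex
H = -\sum_j p(j)\log p(j), \qquad C = \sum_j p(j)\log j

\frac{\partial}{\partial p(j)}\left(\frac{H}{C}\right)
  = -\frac{\log p(j) + 1}{C} - \frac{H \log j}{C^{2}} = \lambda

\Longrightarrow\;
\log p(j) = -\frac{H}{C}\,\log j - 1 - \lambda C
\;\Longrightarrow\;
p(j) = c\, j^{-\alpha}, \qquad \alpha = \frac{H}{C}, \quad c = e^{-1-\lambda C} .
```

The exponent is the optimal information-to-cost ratio itself, which is why the power law emerges without being assumed.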
73. Discussion of Mandelbrot's model
 Tradeoffs between communication cost ( log(p(j)) ) and information.
 Are there other tradeoff-based models that drive similar properties?
74. Heuristically Optimized Tradeoffs
 Goal: construction of trees (note: models to generate trees with power law behavior were first proposed in [Yule26])
 Idea: New nodes must trade off connecting to nearby nodes, and connecting to central nodes.
 Model:

 Points arrive uniformly within the unit square

 New point arrives, and computes two measures for candidate connection points j

 d(j): distance from new node to existing node j (nearness)

 h(j): distance from node j to root of tree (centrality)

 New destination chosen to minimize α d(j) + h(j)
 Result: for a wide variety of values of α, distribution of degrees has a power law
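The tree-growth rule above can be sketched in a few lines. This version assumes that h(j) is the hop count from node j to the root (one of several possible centrality measures) and that d(j) is Euclidean distance; `hot_tree` is a hypothetical name.

```python
import math
import random

def hot_tree(n_points, alpha, seed=0):
    """Grow a tree in the unit square; return the parent index of each node."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random())]   # node 0 is the root
    hops = [0]                                # h(j): hops from j to the root
    parent = [None]
    for _ in range(1, n_points):
        p = (rng.random(), rng.random())
        # Trade off nearness (alpha * d(j)) against centrality (h(j)).
        best = min(range(len(points)),
                   key=lambda j: alpha * math.dist(p, points[j]) + hops[j])
        points.append(p)
        parent.append(best)
        hops.append(hops[best] + 1)
    return parent

star = hot_tree(200, alpha=0.0)      # distance ignored: everyone joins the root
mixed = hot_tree(200, alpha=4.0)     # intermediate tradeoff
```

With `alpha=0.0` only centrality matters and the tree is a star; with very large `alpha` each point attaches to its nearest neighbor. The power-law regime reported on the slide lies between these extremes.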
[Fabrikant, Koutsoupias, Papadimitriou 2002]
75. Monkeys on Typewriters
 Consider a creation model divorced from concerns of information and cost
 Model:

 Monkey types randomly, hits space bar with probability q, character chosen uniformly with remaining probability
 Result:

 Rank j word occurs with probability q j^{log_n(1-q) - 1} = c j^{-α}, where n is the number of characters on the keyboard
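The formula can be checked exactly without any random typing. In the sketch below, n is the number of non-space characters (an assumed parameter name), words are ranked by length because shorter words are more probable, and empty words produced by consecutive spaces are ignored.

```python
import math

def word_probability_by_rank(rank, n, q):
    """Probability of the rank-th most likely word (rank starts at 1).
    A word of length k has probability q * ((1 - q) / n) ** k, and there
    are n ** k such words, so ranks are assigned length by length."""
    length, first = 1, 1
    while rank >= first + n ** length:
        first += n ** length
        length += 1
    return q * ((1 - q) / n) ** length

def power_law_prediction(rank, n, q):
    """q * rank ** (log_n(1 - q) - 1), the form quoted on the slide."""
    return q * rank ** (math.log(1 - q, n) - 1)

exact = [word_probability_by_rank(4 ** k, 4, 0.2) for k in range(1, 6)]
predicted = [power_law_prediction(4 ** k, 4, 0.2) for k in range(1, 6)]
```

The two agree at ranks n, n^2, n^3 and so on; between those ranks the exact probabilities form a staircase that the power law interpolates.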
76. Other Distributions
 Power law means a clean characterization of a particular property on distribution upper tails
 Often used to mean heavy tailed, meaning bounded away from an exponentially decaying distribution
 There are other forms of heavy-tailed distributions
 A commonly occurring example: lognormal distribution
77. Quick characterization of lognormal distributions
 Let X be a normally distributed random variable
 Let Y = e^X (equivalently, ln Y = X)
 Then Y is lognormal
 Properties:

 Often occur in situations of multiplicative growth

 Prop2
 Concern: There is a growing sequence of papers dating back several decades questioning whether certain observed values are best described by power law or lognormal (or other) distributions.
78. One final direction
 The Central Limit Theorem tells us how sums of independent random variables behave in the limit
 Example: X_j = F_j X_{j-1}, so ln X_j = ln X_0 + Σ_{k=1..j} ln F_k
 X_j is well approximated by a lognormal variable
 Thus, lognormal variables arise in situations of multiplicative growth
 Examples in biology, ecology, economics,
 Example: [Huberman et al]: growth of web sites
 Similarly: the same result applies to the product of lognormal variables
 Each of these generative models is evolutionary
 What is the role of time?
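The multiplicative-growth argument on this slide can be checked numerically. The sketch below assumes growth factors drawn uniformly from [0.5, 1.5] (an arbitrary choice for illustration) and looks at the logarithm of the size after many steps, which the Central Limit Theorem says should be approximately normal.

```python
import math
import random
import statistics

def multiplicative_growth(steps, rng, x0=1.0):
    """X_j = F_j * X_{j-1}, with each F_j uniform on [0.5, 1.5] (assumed)."""
    x = x0
    for _ in range(steps):
        x *= rng.uniform(0.5, 1.5)
    return x

def log_sizes(n_samples, steps, seed=0):
    """ln X_j for n_samples independent runs of the growth process."""
    rng = random.Random(seed)
    return [math.log(multiplicative_growth(steps, rng)) for _ in range(n_samples)]

logs = log_sizes(5000, steps=50, seed=1)
mean_log = statistics.mean(logs)
std_log = statistics.stdev(logs)

# E[ln F] for F uniform on [0.5, 1.5], from the antiderivative f*ln(f) - f.
expected_mean = 50 * ((1.5 * math.log(1.5) - 1.5) - (0.5 * math.log(0.5) - 0.5))
```

The sample mean of ln X_j tracks j times E[ln F], and a histogram of `logs` looks close to a bell curve, while the sizes themselves are heavily right-skewed.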