Jure Leskovec and Anand Rajaraman · web.stanford.edu/class/cs345a/slides/08-hits_spam.pdf
TRANSCRIPT
CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University
Instead of generic popularity, can we measure popularity within a topic?
- E.g., computer science, health

Bias the random walk
- When the random walker teleports, he picks a page from a set S of web pages
- S contains only pages that are relevant to the topic
- E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)

For each teleport set S, we get a different rank vector $r_S$
1/28/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
Let:

$$A_{ik} = \begin{cases} \beta M_{ik} + (1-\beta)/|S| & \text{if } i \in S \\ \beta M_{ik} & \text{otherwise} \end{cases}$$

A is stochastic!

We have weighted all pages in the teleport set S equally
- Could also assign different weights to pages
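To see why A is stochastic, one can sum a single column. This short check assumes, as in the standard PageRank setup, that M is column-stochastic ($\sum_i M_{ik} = 1$ for every page k) and that β is the probability that the walker follows a link rather than teleporting:

```latex
\sum_i A_{ik}
  = \beta \sum_i M_{ik} + \sum_{i \in S} \frac{1-\beta}{|S|}
  = \beta + (1-\beta)
  = 1
```

The same argument goes through with unequal teleport weights, as long as the weights over S sum to 1.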
Suppose S = {1}, β = 0.8

[Figure: example graph on nodes 1 to 4, with edge labels 0.2, 0.4, 0.5, 0.8, and 1]

| Node | Iteration 0 | 1 | 2 | … | stable |
|---|---|---|---|---|---|
| 1 | 1.0 | 0.2 | 0.52 | … | 0.294 |
| 2 | 0 | 0.4 | 0.08 | … | 0.118 |
| 3 | 0 | 0.4 | 0.08 | … | 0.327 |
| 4 | 0 | 0 | 0.32 | … | 0.261 |

Note how we initialize the PageRank vector differently from the unbiased PageRank case.
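The iteration in this example can be sketched in a few lines of Python. This is a minimal sketch, not the course's code: the function name `topic_specific_pagerank` is made up for illustration, β = 0.8 is the probability of following a link, and the edge list (1→2, 1→3, 2→1, 3→4, 4→3) is an assumption inferred from the iteration values, since the figure itself did not survive extraction.

```python
def topic_specific_pagerank(out_links, teleport_set, beta=0.8, iterations=100):
    """Power iteration for PageRank with teleports restricted to teleport_set.

    out_links maps each node to the list of nodes it links to.
    Assumes every node has at least one out-link (no dead ends).
    """
    nodes = sorted(out_links)
    # Start with all rank on the teleport set, rather than spreading it
    # uniformly over all nodes as in unbiased PageRank.
    rank = {n: (1.0 / len(teleport_set) if n in teleport_set else 0.0) for n in nodes}
    history = [rank]
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        # Follow a link with probability beta.
        for src in nodes:
            share = beta * rank[src] / len(out_links[src])
            for dst in out_links[src]:
                new_rank[dst] += share
        # Teleport with probability 1 - beta, but only into the teleport set.
        for t in teleport_set:
            new_rank[t] += (1.0 - beta) / len(teleport_set)
        rank = new_rank
        history.append(rank)
    return history


# Edges inferred from the iteration table (an assumption, not taken from the figure).
graph = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
history = topic_specific_pagerank(graph, teleport_set={1}, beta=0.8)
for step in (0, 1, 2, -1):
    print([round(history[step][n], 3) for n in sorted(graph)])
```

Under these assumptions the printed rows agree with the columns for iterations 0, 1, 2 and the stable column of the table.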
Experimental results [Haveliwala 2000]
- Picked 16 topics
  - Teleport sets determined using DMOZ
  - E.g., arts, business, sports, …
- “Blind study” using volunteers
  - 35 test queries
  - Results ranked using PageRank and TSPR (Topic-Specific PageRank) of most closely related topic
  - E.g., bicycling using Sports ranking
  - In most cases volunteers preferred TSPR ranking
- User can pick from a menu
- Use Naïve Bayes to classify query into a topic
- Can use the context of the query
  - E.g., query is launched from a web page talking about a known topic
  - History of queries, e.g., “basketball” followed by “Jordan”
- User context, e.g., user’s My Yahoo settings, bookmarks, …
Goal: Don’t just find newspapers, but also find “experts”: people who link in a coordinated way to many good newspapers

Idea: link voting
- Quality as an expert (hub): total sum of votes of pages pointed to
- Quality as content (authority): total sum of votes of experts
- Principle of repeated improvement

[Figure: example vote counts. NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]
Interesting documents fall into two classes:
1. Authorities are pages containing useful information
   - Newspaper home pages
   - Course home pages
   - Home pages of auto manufacturers
2. Hubs are pages that link to authorities
   - List of newspapers
   - Course bulletin
   - List of US auto manufacturers
- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node:
  - Hub score and Authority score
  - Represented as vectors h and a
Each page i has 2 kinds of scores:
- Hub score: $h_i$
- Authority score: $a_i$

Algorithm:
- Initialize: $a_i = h_i = 1$
- Then keep iterating:
  - Authority: $a_j = \sum_{i \to j} h_i$
  - Hub: $h_i = \sum_{i \to j} a_j$
  - Normalize: $\sum_i a_i = 1$, $\sum_i h_i = 1$
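The per-page updates above can be sketched directly over a list of links. This is an illustrative sketch only: the function name `hits` and the page names `list1`, `list2`, and `blog` are hypothetical, and the hub update here uses the freshly updated authority scores, which is one possible ordering that the slide does not pin down.

```python
def hits(edges, iterations=50):
    """Iterate the authority and hub updates over a list of (src, dst) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: a_j = sum of h_i over all links i -> j.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: h_i = sum of a_j over all links i -> j
        # (uses the new authority scores; an assumed ordering).
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize so that each score vector sums to 1.
        auth_total = sum(new_auth.values())
        hub_total = sum(new_hub.values())
        auth = {n: v / auth_total for n, v in new_auth.items()}
        hub = {n: v / hub_total for n, v in new_hub.items()}
    return hub, auth


# Hypothetical link data: three pages that link to newspapers.
edges = [
    ("list1", "nyt"), ("list1", "cnn"), ("list1", "wsj"),
    ("list2", "nyt"), ("list2", "cnn"),
    ("blog", "nyt"),
]
hub, auth = hits(edges)
print(sorted(hub, key=hub.get, reverse=True)[:3])
print(sorted(auth, key=auth.get, reverse=True)[:3])
```

In this toy graph the page that links to the most newspapers ends up as the best hub, and the newspaper linked from the most hubs ends up as the best authority.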
HITS uses adjacency matrix

$A[i, j] = 1$ if page i links to page j, 0 else

$A^T$, the transpose of A, is similar to the PageRank matrix M, but $A^T$ has 1’s where M has fractions
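[Figure: example graph on Yahoo (y), Amazon (a), and M’soft (m)]

$$A = \begin{array}{c|ccc} & y & a & m \\ \hline y & 1 & 1 & 1 \\ a & 1 & 0 & 1 \\ m & 0 & 1 & 0 \end{array}$$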
Notation:
- Vector $a = (a_1, \dots, a_n)$, $h = (h_1, \dots, h_n)$
- Adjacency matrix (n x n): $A_{ij} = 1$ if $i \to j$

Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j A_{ij} a_j$

So: $h = A a$

Likewise: $a = A^T h$
The hub score of page i is proportional to the sum of the authority scores of the pages it links to: $h = \lambda A a$
- Constant λ is a scale factor, $\lambda = 1/\sum_i h_i$

The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: $a = \mu A^T h$
- Constant μ is a scale factor, $\mu = 1/\sum_i a_i$
The HITS algorithm:
- Initialize h, a to all 1’s
- Repeat:
  - $h = A a$
  - Scale h so that it sums to 1.0
  - $a = A^T h$
  - Scale a so that it sums to 1.0
- Until h, a converge (i.e., change very little)
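[Figure: example graph on Yahoo, Amazon, and M’soft]

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \qquad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

| Score | 0 | 1 | 2 | 3 | … | converged |
|---|---|---|---|---|---|---|
| a(yahoo) | 1 | 1 | 1 | 1 | … | 1 |
| a(amazon) | 1 | 1 | 4/5 | 0.75 | … | 0.732 |
| a(m’soft) | 1 | 1 | 1 | 1 | … | 1 |
| h(yahoo) | 1 | 1 | 1 | 1 | … | 1.000 |
| h(amazon) | 1 | 2/3 | 0.71 | 0.73 | … | 0.732 |
| h(m’soft) | 1 | 1/3 | 0.29 | 0.27 | … | 0.268 |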
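The numbers in this example can be reproduced with a small matrix-form sketch. Two assumptions are worth flagging: the values shown appear to be scaled so that the largest entry of each vector is 1 (rather than so that the entries sum to 1), and the authority vector is updated before the hub vector in each round. The helper names (`mat_vec`, `transpose`, `scale_max`, `hits_matrix`) are illustrative.

```python
def mat_vec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def scale_max(vec):
    """Rescale so that the largest entry equals 1."""
    largest = max(vec)
    return [v / largest for v in vec]


def hits_matrix(adjacency, iterations=50):
    a_t = transpose(adjacency)
    h = [1.0] * len(adjacency)
    a = [1.0] * len(adjacency)
    for _ in range(iterations):
        a = scale_max(mat_vec(a_t, h))        # a = A^T h
        h = scale_max(mat_vec(adjacency, a))  # h = A a
    return h, a


# Rows and columns in the order Yahoo, Amazon, M'soft.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]

h1, a1 = hits_matrix(A, iterations=1)
h, a = hits_matrix(A)
print([round(x, 3) for x in a])
print([round(x, 3) for x in h])
```

With these choices, one round gives h = (1, 2/3, 1/3) and the converged vectors round to a = (1, 0.732, 1) and h = (1, 0.732, 0.268). Scaling by the sum instead would change the magnitudes but not the relative ordering.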
Algorithm (M here is the adjacency matrix, written A on the other slides):
- Set: $a = h = 1^n$ (the all-ones vector)
- Repeat:
  - $h = M a$, $a = M^T h$
  - Normalize

Then: $a = M^T (M a)$, where the inner product $M a$ is the new h and the result is the new a

- a is being updated (in 2 steps): $M^T (M a) = (M^T M) a$
- h is updated (in 2 steps): $M (M^T h) = (M M^T) h$

Thus, in 2k steps:
- $a = (M^T M)^k a$
- $h = (M M^T)^k h$

Repeated matrix powering
- $h = \lambda A a$
- $a = \mu A^T h$
- $h = \lambda \mu A A^T h$
- $a = \lambda \mu A^T A a$

Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:
- h* is the principal eigenvector of matrix $A A^T$
- a* is the principal eigenvector of matrix $A^T A$
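As a sanity check on the eigenvector claim, the sketch below uses the three-page Yahoo/Amazon/M’soft adjacency matrix from the earlier example. It assumes that the limiting authority vector there is (1, √3 − 1, 1), with √3 − 1 ≈ 0.732, verifies that this vector satisfies the eigenvector equation for $A^T A$, and checks that repeated matrix powering from the all-ones vector reaches the same direction. Helper names are illustrative.

```python
import math

# Rows and columns in the order Yahoo, Amazon, M'soft.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_mul(x, y):
    y_cols = transpose(y)
    return [[sum(p * q for p, q in zip(row, col)) for col in y_cols] for row in x]


def mat_vec(m, v):
    return [sum(p * q for p, q in zip(row, v)) for row in m]


AtA = mat_mul(transpose(A), A)

# Candidate a*: assumed closed form of the limit seen in the worked example.
a_star = [1.0, math.sqrt(3) - 1, 1.0]
image = mat_vec(AtA, a_star)
eigenvalue = image[0] / a_star[0]
residual = max(abs(image[i] - eigenvalue * a_star[i]) for i in range(3))

# Repeated matrix powering from the all-ones vector, rescaled each step.
v = [1.0, 1.0, 1.0]
for _ in range(50):
    v = mat_vec(AtA, v)
    v = [x / max(v) for x in v]

print(AtA, round(eigenvalue, 3), residual)
print([round(x, 3) for x in v])
```

Because the powering converges to the eigenvector with the largest eigenvalue (given a starting vector with a component along it), agreement between `v` and `a_star` is consistent with a* being the principal eigenvector of $A^T A$ for this graph.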
[Figure: hubs and authorities, showing the most densely-connected core (primary core) and a less densely-connected core (secondary core)]
A single topic can have many bipartite cores, corresponding to different meanings or points of view:
- abortion: pro-choice, pro-life
- evolution: Darwinian, intelligent design
- jaguar: auto, Mac, NFL team, Panthera onca

How to find such secondary cores?
- Once we find the primary core, we can remove its links from the graph
- Repeat HITS algorithm on residual graph to find the next bipartite core
- Roughly, these correspond to non-primary eigenvectors of $A A^T$ and $A^T A$
We need a well-connected graph of pages for HITS to work well:
PageRank and HITS are two solutions to the same problem:
- What is the value of an in-link from u to v?
- In the PageRank model, the value of the link depends on the links into u
- In the HITS model, it depends on the value of the other links out of u

The destinies of PageRank and HITS post-1998 were very different
- Search is the default gateway to the web
- Very high premium to appear on the first page of search results:
  - e-commerce sites
  - advertising-driven sites
Spamming: any deliberate action to boost a web page’s position in search engine results, incommensurate with page’s real value

Spam: web pages that are the result of spamming

This is a very broad definition
- SEO industry might disagree!
- SEO = search engine optimization

Approximately 10-15% of web pages are spam
The treatment by Gyongyi & Garcia-Molina:
- Boosting techniques
  - Techniques for achieving high relevance/importance for a web page
- Hiding techniques
  - Techniques to hide the use of boosting
  - From humans and web crawlers
Term spamming
- Manipulating the text of web pages in order to appear relevant to queries

Link spamming
- Creating link structures that boost PageRank or hubs and authorities scores
- Repetition: of one or a few specific terms, e.g., free, cheap, viagra
  - Goal is to subvert TF-IDF ranking schemes
- Dumping: of a large number of unrelated terms, e.g., copy entire dictionaries
- Weaving: copy legitimate pages and insert spam terms at random positions
- Phrase stitching: glue together sentences and phrases from different sources
Three kinds of web pages from a spammer’s point of view:
- Inaccessible pages
- Accessible pages:
  - e.g., blog comments pages
  - spammer can post links to his pages
- Own pages:
  - Completely controlled by spammer
  - May span multiple domain names
Spammer’s goal: maximize the PageRank of target page t

Technique:
- Get as many links from accessible pages as possible to target page t
- Construct “link farm” to get PageRank multiplier effect
[Figure: link farm with inaccessible pages, accessible pages, and own pages (target page t and farm pages 1, 2, …, M)]

One of the most common and effective organizations for a link farm
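[Figure: link farm with inaccessible pages, accessible pages, and own pages (target page t and farm pages 1, 2, …, M)]

- N … # pages on the web
- M … # of pages spammer owns

Suppose rank contributed by accessible pages = x

Let PageRank of target page = y

Rank of each “farm” page = $\beta y/M + (1-\beta)/N$

$$y = x + \beta M \left[ \frac{\beta y}{M} + \frac{1-\beta}{N} \right] + \frac{1-\beta}{N} = x + \beta^2 y + \frac{\beta (1-\beta) M}{N} + \frac{1-\beta}{N}$$

The last term, $(1-\beta)/N$, is very small; ignore it.

$$y = \frac{x}{1-\beta^2} + c \frac{M}{N}, \quad \text{where } c = \frac{\beta}{1+\beta}$$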
[Figure: link farm with inaccessible pages, accessible pages, and own pages (target page t and farm pages 1, 2, …, M)]

- N … # pages on the web
- M … # of pages spammer owns

$y = x/(1-\beta^2) + cM/N$, where $c = \beta/(1+\beta)$

For β = 0.85, $1/(1-\beta^2) = 3.6$
- Multiplier effect for “acquired” PageRank
- By making M large, we can make y as large as we want
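The link-farm formula is easy to evaluate numerically. The sketch below compares the closed form against the exact solution of the rank equation that keeps the small (1−β)/N term, where β is the probability of following a link. The function names and the values chosen for x, M, and N are hypothetical and only meant to show the shape of the effect.

```python
def target_rank_exact(x, farm_pages, total_pages, beta=0.85):
    """Solve y = x + beta*M*(beta*y/M + (1-beta)/N) + (1-beta)/N for y."""
    m, n = farm_pages, total_pages
    return (x + beta * (1 - beta) * m / n + (1 - beta) / n) / (1 - beta ** 2)


def target_rank_approx(x, farm_pages, total_pages, beta=0.85):
    """Closed form y = x/(1-beta^2) + c*M/N, dropping the (1-beta)/N term."""
    c = beta / (1 + beta)
    return x / (1 - beta ** 2) + c * farm_pages / total_pages


multiplier = 1 / (1 - 0.85 ** 2)
print(round(multiplier, 2))

# Hypothetical numbers: a web of one billion pages and a small acquired rank x.
N = 1_000_000_000
x = 1e-6
for M in (1_000, 100_000, 10_000_000):
    print(M, target_rank_approx(x, M, N), target_rank_exact(x, M, N))
```

The gap between the two versions is $1/((1+\beta)N)$, which is why the slide can drop the last term, and y grows linearly in the number of farm pages M.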
Term spamming:
- Analyze text using statistical methods, e.g., Naïve Bayes, logistic regression
- Similar to email spam filtering
- Also useful: detecting approximate duplicate pages

Link spamming:
- Open research area
- One approach: TrustRank
Basic principle: approximate isolation
- It is rare for a “good” page to point to a “bad” (spam) page

Sample a set of “seed pages” from the web

Have an oracle (human) identify the good pages and the spam pages in the seed set
- Expensive task
- Must make seed set as small as possible
Call the subset of seed pages that are identified as “good” the “trusted pages”

Set trust of each trusted page to 1

Propagate trust through links:
- Each page gets a trust value between 0 and 1
- Use a threshold value and mark all pages below the trust threshold as spam
Trust attenuation:
- The degree of trust conferred by a trusted page decreases with distance

Trust splitting:
- The larger the number of out-links from a page, the less scrutiny the page author gives each out-link
- Trust is “split” across out-links
Suppose trust of page p is $t_p$
- Set of out-links $o_p$

For each $q \in o_p$, p confers the trust:
- $\beta t_p / |o_p|$ for $0 < \beta < 1$

Trust is additive
- Trust of p is the sum of the trust conferred on p by all its in-linked pages

Note similarity to Topic-Specific PageRank
- Within a scaling factor, TrustRank = PageRank with trusted pages as teleport set
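The propagation rule can be sketched as PageRank with the trusted pages as teleport set, following the similarity noted above. Everything specific here is an assumption made for illustration: the function name `trust_rank`, the toy graph, the choice β = 0.85, the threshold 0.05, and the scaling in which the seed trust is divided evenly among the trusted pages so that values stay between 0 and 1.

```python
def trust_rank(out_links, trusted, beta=0.85, iterations=100):
    """Propagate trust from the trusted seed pages along out-links."""
    nodes = list(out_links)
    seed = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    trust = dict(seed)
    for _ in range(iterations):
        # Trusted pages keep receiving a share of seed trust each round.
        new_trust = {n: (1 - beta) * seed[n] for n in nodes}
        for p in nodes:
            if out_links[p]:
                # Attenuation (factor beta) and splitting (divide by out-degree).
                share = beta * trust[p] / len(out_links[p])
                for q in out_links[p]:
                    new_trust[q] += share  # trust is additive over in-links
        trust = new_trust
    return trust


# Hypothetical web: good pages never link to the two spam pages.
web = {
    "edu": ["news", "wiki"],
    "news": ["wiki", "blog"],
    "wiki": ["edu"],
    "blog": ["news"],
    "spam1": ["spam2", "blog"],
    "spam2": ["spam1"],
}
trust = trust_rank(web, trusted={"edu"})
threshold = 0.05  # assumed cutoff, chosen for this toy graph
flagged = {page for page, value in trust.items() if value < threshold}
print(sorted(flagged))
```

In this toy graph the spam pages receive no trust at all because no trusted or good page links to them, which is the approximate-isolation principle in its idealized form. On real data the threshold would need tuning.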
Two conflicting considerations:
- Human has to inspect each seed page, so seed set must be as small as possible
- Must ensure every “good page” gets adequate trust rank, so we need to make all good pages reachable from seed set by short paths
Suppose we want to pick a seed set of k pages

PageRank:
- Pick the top k pages by PageRank
- Assume high PageRank pages are close to other highly ranked pages
- We care more about high PageRank “good” pages
Pick the pages with the maximum number of outlinks

Can make it recursive:
- Pick pages that link to pages with many out-links

Formalize as “inverse PageRank”:
- Construct graph G’ by reversing edges in G
- PageRank in G’ is inverse PageRank in G
- Pick top k pages by inverse PageRank
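Seed selection by inverse PageRank can be sketched as: reverse every edge, run ordinary PageRank on the reversed graph, and keep the top k pages. The function names, the toy graph, and the handling of dead ends (their rank is spread uniformly over all pages) are assumptions made for this sketch.

```python
def pagerank(out_links, beta=0.85, iterations=100):
    """Plain PageRank with uniform teleport."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1 - beta) / n for v in nodes}
        for src in nodes:
            # Dead ends spread their rank over all pages (an assumption).
            targets = out_links[src] or nodes
            share = beta * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def reverse_graph(out_links):
    """Build G' by reversing every edge of G."""
    reversed_links = {v: [] for v in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            reversed_links[dst].append(src)
    return reversed_links


def pick_seeds(out_links, k):
    """Top k pages by inverse PageRank, i.e. PageRank in the reversed graph."""
    inverse_rank = pagerank(reverse_graph(out_links))
    return sorted(inverse_rank, key=inverse_rank.get, reverse=True)[:k]


# Hypothetical graph: a directory page that links to everything else.
G = {
    "directory": ["a", "b", "c", "d"],
    "a": ["b"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a"],
}
print(pick_seeds(G, 2))
```

In this toy graph the directory page, which has the most out-links, comes out on top. That matches the intuition that a good seed is a page from which many other pages can be reached by short paths.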
In the TrustRank model, we start with good pages and propagate trust

Complementary view: What fraction of a page’s PageRank comes from “spam” pages?

In practice, we don’t know all the spam pages, so we need to estimate
- $r(p)$ = PageRank of page p
- $r^+(p)$ = PageRank of p with teleport into “good” pages only

Then: $r^-(p) = r(p) - r^+(p)$

Spam mass of p = $r^-(p) / r(p)$
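Spam mass can be sketched by running PageRank twice with different teleport vectors. One assumption matters here: $r^+$ is computed with the uniform teleport vector restricted to the good pages and not renormalized, so that $r = r^+ + r^-$ holds exactly and $r^-$ is never negative. The function names, the toy graph, and β = 0.85 are also illustrative, and the graph is assumed to have no dead ends.

```python
def pagerank_with_teleport(out_links, teleport, beta=0.85, iterations=100):
    """Iterate r = beta * M r + (1 - beta) * teleport; assumes no dead ends."""
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {v: (1 - beta) * teleport[v] for v in out_links}
        for src, targets in out_links.items():
            share = beta * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def spam_mass(out_links, good_pages, beta=0.85):
    n = len(out_links)
    uniform = {v: 1.0 / n for v in out_links}
    # Teleport only into good pages, without renormalizing (an assumption).
    good_only = {v: (1.0 / n if v in good_pages else 0.0) for v in out_links}
    r = pagerank_with_teleport(out_links, uniform, beta)
    r_plus = pagerank_with_teleport(out_links, good_only, beta)
    return {v: (r[v] - r_plus[v]) / r[v] for v in out_links}


# Hypothetical graph: good pages g1, g2; target t boosted by farm pages f1, f2.
web = {
    "g1": ["g2"],
    "g2": ["g1", "t"],   # e.g. a link planted on an accessible page
    "t": ["f1", "f2"],
    "f1": ["t"],
    "f2": ["t"],
}
mass = spam_mass(web, good_pages={"g1", "g2"})
for page in web:
    print(page, round(mass[page], 3))
```

In this toy graph the good pages have spam mass 0, while the target and farm pages get most of their PageRank from outside the good set.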
For spam mass, we need a large set of “good” pages:
- Need not be as careful about quality of individual pages as with TrustRank

One reasonable approach:
- .edu sites
- .gov sites
- .mil sites
Backflow from known spam pages:
- Course project from last year’s edition of this course

Still an open area of research…
Project write-up is due Mon, Feb 1 midnight
- What is the problem you are solving?
- What data will you use (where will you get it)?
- How will you do it?
- What algorithms/techniques will you use?
- How will you evaluate, measure success?
- What do you expect to submit at the end of the quarter?

Homework is due on Tue, Feb 2 midnight