Algorithmic Web Spam detection - Matt Peters MozCon
DESCRIPTION
Deep dive into algorithmic web spam detection, presented by Matt Peters at MozCon 2012.
TRANSCRIPT
![Page 1: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/1.jpg)
Web Spam Research: Good Robots vs Bad Robots
Matthew Peters
Scientist
SEOmoz
@mattthemathman
![Page 2: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/2.jpg)
Penguin (and Panda)
Practical SEO considerations
![Page 3: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/3.jpg)
SEOmoz engineering challenges
![Page 4: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/4.jpg)
SEOmoz engineering challenges
Processing (~4 weeks, 40-200 computers)
Mozscape Index
Open Site Explorer, Mozscape API, PRO app
![Page 5: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/5.jpg)
SEOmoz engineering challenges
Due to scale, need an algorithmic approach.
Processing (~4 weeks, 40-200 computers)
Mozscape Index
Open Site Explorer, Mozscape API, PRO app
![Page 6: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/6.jpg)
Goals
![Page 7: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/7.jpg)
Machine Learning 101
Web crawler
![Page 8: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/8.jpg)
Machine Learning 101
Web crawler
“Features”
![Page 9: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/9.jpg)
Machine Learning 101
Web crawler
“Features”
Machine learning algorithm (BLACK BOX)
SPAM / NOT SPAM ??
![Page 10: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/10.jpg)
In-link and on-page features
Spam sites (link farms, fake blogs, etc.)
Legitimate sites that may have some spam in-links
![Page 11: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/11.jpg)
In-link and on-page features
Spam sites (link farms, fake blogs, etc.)
Legitimate sites that may have some spam in-links
A spam site with spam in/out links
![Page 12: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/12.jpg)
In-link and on-page features
Spam sites (link farms, fake blogs, etc.)
Legitimate sites that may have some spam in-links
A legit site with spam in-links
![Page 13: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/13.jpg)
On-page features
Organized research conferences: WEBSPAM-UK2006/7, ECML/PKDD 2010 Discovery Challenge
![Page 14: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/14.jpg)
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW ‘06
Number of words in title
![Page 15: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/15.jpg)
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW ‘06
Number of words in title
Histogram (probability density) of all pages
![Page 16: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/16.jpg)
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW ‘06
Number of words in title
Histogram (probability density) of all pages
Percent of spam for each title length
![Page 17: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/17.jpg)
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW ‘06
Percent of anchor text words
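A minimal sketch of how the two features named on these slides (number of words in the title, percent of anchor text words) might be computed from raw HTML. It uses only Python's standard-library `html.parser`. The class and function names, the `\w+` tokenization, and the choice to leave title, script, and style text out of the word total are assumptions for illustration, not the exact definitions used by Ntoulas et al. or SEOmoz.

```python
import re
from html.parser import HTMLParser


class OnPageFeatureParser(HTMLParser):
    """Collects title words, anchor-text words, and all body words."""

    def __init__(self):
        super().__init__()
        self.title_words = []
        self.anchor_words = []
        self.all_words = []
        self._in_title = False
        self._anchor_depth = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._anchor_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        words = re.findall(r"\w+", data)
        if self._in_title:
            self.title_words.extend(words)
            return  # assumption: title text is not counted as body text
        self.all_words.extend(words)
        if self._anchor_depth:
            self.anchor_words.extend(words)


def on_page_features(html):
    """Return title word count and the fraction of body words inside <a> tags."""
    parser = OnPageFeatureParser()
    parser.feed(html)
    parser.close()
    total = len(parser.all_words)
    return {
        "title_word_count": len(parser.title_words),
        "anchor_word_fraction": len(parser.anchor_words) / total if total else 0.0,
    }
```

Calling `on_page_features(html)` on a crawled page returns a small dictionary that could serve as one row of the feature matrix fed to a classifier.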
![Page 18: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/18.jpg)
These few features are remarkably effective (assuming your model is complex enough)
Erdélyi et al: Web Spam Classification: a Few Features Worth More, WebQuality 2011
![Page 19: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/19.jpg)
On-page features > in-link features
Erdélyi et al: Web Spam Classification: a Few Features Worth More, WebQuality 2011
![Page 20: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/20.jpg)
In-link features (mozTrust)
Gyöngyi et al: Combating Web Spam with TrustRank, 2004
See also: Abernethy et al: Graph regularization methods for Web spam detection, 2010
Seed site
High mozTrust
mozTrust (TrustRank) measures the average distance from a trusted set of “seed” sites
Moderate mozTrust
Moderate mozTrust
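A rough sketch of a TrustRank-style computation, in the spirit of the Gyöngyi et al. paper cited above: trust starts on the seed sites and flows along out-links, so pages few hops from a seed score higher than pages far away. This is a generic biased-PageRank iteration, not SEOmoz's mozTrust implementation. The function name, the `damping` value, and the fixed iteration count are assumptions.

```python
def trust_rank(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links.

    out_links: dict mapping each page to the list of pages it links to.
    seeds: iterable of trusted pages (hypothetical hand-picked set).
    Trust held by pages without out-links is simply dropped in this sketch.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    seed_set = set(seeds) & pages
    if not seed_set:
        raise ValueError("at least one seed must appear in the graph")

    # Teleport distribution: uniform over seeds, zero elsewhere.
    seed_weight = {p: (1.0 / len(seed_set) if p in seed_set else 0.0) for p in pages}
    trust = dict(seed_weight)

    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed_weight[p] for p in pages}
        for page, targets in out_links.items():
            if not targets:
                continue
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust
```

In a chain `seed -> a -> b`, the scores decrease with each hop, and a page that no seed can reach keeps a score of zero.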
![Page 21: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/21.jpg)
Anchor text
Ryan Kent: http://www.seomoz.org/blog/identifying-link-penalties-in-2012
![Page 22: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/22.jpg)
Are these still relevant today?
Banned: manual penalty and removed from index.
Kurtis Bohrnstedt: http://www.seomoz.org/blog/web-directory-submission-danger
![Page 23: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/23.jpg)
Google penalized sites
Penalized sites: algorithmic penalty demoted off first page.
I will group both banned and penalized sites together and call them simply “penalized.”
Kurtis Bohrnstedt: http://www.seomoz.org/blog/web-directory-submission-danger
![Page 24: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/24.jpg)
Data sources
47K sites
Mozscape (200 mil)
Stratified sample by mozRank
Directory + suspected SPAM (3K)
![Page 25: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/25.jpg)
Data sources
47K sites
Mozscape (200 mil)
Wikipedia, SEMRush
5 pages / site
Stratified sample by mozRank
Directory + suspected SPAM (3K)
![Page 26: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/26.jpg)
Data sources
47K sites
Mozscape (200 mil)
Wikipedia, SEMRush
Filter by HTTP 200,
English
22K sites
5 pages / site
Stratified sample by mozRank
Directory + suspected SPAM (3K)
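One way the "stratified sample by mozRank" step could look, as a sketch only: sites are bucketed by the integer part of their mozRank and an equal number is drawn from each bucket, so that rare high-mozRank sites are not swamped by the many low-mozRank ones. The bucketing rule, the function name, and the `per_bucket` parameter are assumptions. The slides do not say how the strata were defined.

```python
import random
from collections import defaultdict


def stratified_sample(sites, per_bucket, seed=0):
    """Draw up to `per_bucket` domains from each integer mozRank bucket.

    sites: iterable of (domain, moz_rank) pairs.
    """
    buckets = defaultdict(list)
    for domain, moz_rank in sites:
        buckets[int(moz_rank)].append(domain)  # assumed stratum: floor(mozRank)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for bucket in sorted(buckets):
        members = buckets[bucket]
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample
```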
![Page 27: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/27.jpg)
![Page 28: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/28.jpg)
Results (show me the graphs!)
![Page 29: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/29.jpg)
Overall results
Overall 17% of sites are penalized, 5% banned
![Page 30: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/30.jpg)
mozTrust
mozTrust is a strong predictor of spam
![Page 31: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/31.jpg)
mozTrust vs mozRank
mozRank is also a strong predictor, although not as good as mozTrust
![Page 32: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/32.jpg)
In-links
In-links increase = Spam decrease, except for some sites with many internal links
![Page 33: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/33.jpg)
The trend in domain size is similar
Domain size
![Page 34: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/34.jpg)
Linking root domains exhibit the same overall trend as linking URLs
Link diversity – Linking root domains
![Page 35: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/35.jpg)
Anchor text
Simple heuristic for branded/organic anchor text:
(1) Strip off all sub-domains, TLD extensions, paths from URLs. Remove white space.
(2) Exact or partial match between the result and the target domain.
(3) Use a specified list for “organic” (click, here, …)
(4) Compute the percentage of unbranded anchor text.
There are some more technical details (certain symbols are removed, another heuristic for acronyms), but this is the main idea.
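The four steps above can be sketched roughly as follows. This is an illustrative reading of the heuristic, not the code used in the study: the suffix list, the organic word list beyond "click" and "here", the three-character minimum for partial matches, and all function names are assumptions, and the acronym and symbol handling mentioned on the slide is left out.

```python
import re

# Hypothetical lists; the slide only names "click" and "here" as organic words.
ORGANIC_WORDS = {"click", "here", "website", "link", "visit", "this", "site"}
SUFFIX_LABELS = {"com", "org", "net", "edu", "gov", "info", "biz", "co", "uk", "de"}


def core_name(url_or_domain):
    """Step 1: strip scheme, path, sub-domains and TLD ('www.seomoz.org' -> 'seomoz')."""
    host = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url_or_domain.strip().lower())
    host = host.split("/")[0].split(":")[0]
    labels = [part for part in host.split(".") if part]
    while len(labels) > 1 and labels[-1] in SUFFIX_LABELS:
        labels.pop()
    return labels[-1] if labels else ""


def classify_anchor(anchor, target_domain):
    """Return 'organic', 'branded' or 'unbranded' for one anchor text."""
    text = anchor.strip().lower()
    words = re.findall(r"[a-z0-9]+", text)
    # Step 3: anchors made only of listed generic words count as organic.
    if not words or all(word in ORGANIC_WORDS for word in words):
        return "organic"
    brand = core_name(target_domain)
    # Step 1 applied to the anchor: URL-like anchors are reduced to their core
    # name, other anchors have white space and symbols removed.
    norm = core_name(text) if re.fullmatch(r"\S+\.\S+", text) else "".join(words)
    # Step 2: exact or partial match against the target's core name.
    if brand and (brand in norm or (len(norm) >= 3 and norm in brand)):
        return "branded"
    return "unbranded"


def percent_unbranded(anchors, target_domain):
    """Step 4: percentage of anchors that are neither branded nor organic."""
    if not anchors:
        return 0.0
    count = sum(classify_anchor(a, target_domain) == "unbranded" for a in anchors)
    return 100.0 * count / len(anchors)
```

For a target of `www.seomoz.org`, anchors such as "SEOmoz" or the bare URL would count as branded, "click here" as organic, and "cheap seo software" as unbranded.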
![Page 36: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/36.jpg)
Anchor text
Large percent of unbranded anchor text is a spam signal.
![Page 37: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/37.jpg)
Anchor text
Large percent of unbranded anchor text is a spam signal.
Mix of branded and unbranded anchor text is best.
![Page 38: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/38.jpg)
Entire in-link profile
![Page 39: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/39.jpg)
Entire in-link profile
![Page 40: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/40.jpg)
On-page features – Anchor text
![Page 41: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/41.jpg)
Internal vs External Anchor text
Higher spam percent for sites without internal anchor text
![Page 42: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/42.jpg)
Internal vs External Anchor text
Higher spam percent for sites without internal anchor text
Spam increases with external anchor text
![Page 43: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/43.jpg)
Title characteristics
Unlike in 2006, the length of the title isn’t very informative.
![Page 44: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/44.jpg)
Number of words
Short documents are much more likely to be spam now.
![Page 45: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/45.jpg)
Visible ratio
Spam percent increases for visible ratio above 25%.
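The slides do not define "visible ratio". The sketch below assumes it means the share of a page's size taken up by visible (non-markup) text, similar to the "fraction of visible content" feature in Ntoulas et al. It counts characters rather than bytes and ignores white space, both of which are simplifications. The names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text outside of <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_ratio(html):
    """Visible non-white-space characters divided by total page length."""
    if not html:
        return 0.0
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    visible = sum(len(word) for word in "".join(parser.chunks).split())
    return visible / len(html)
```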
![Page 46: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/46.jpg)
Plus lots of other features…
![Page 47: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/47.jpg)
Commercial intent
![Page 48: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/48.jpg)
Commercial intent
Idea: measure the “commercial intent” using lists of high CPC and search volume queries
![Page 49: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/49.jpg)
Commercial intent
Unfortunately the results are inconclusive. With hindsight, we need a larger data set.
![Page 50: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/50.jpg)
What features are missing?
![Page 51: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/51.jpg)
What features are missing?
“Clean money provides detailed info concerning online monetary unfold Betting corporations”
Huh?
![Page 52: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/52.jpg)
What features are missing?
0 comments, no shares or tweets.
If this were a real blog, it would have some user interaction
![Page 53: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/53.jpg)
What features are missing?
There’s something strange about these sidebar links…
![Page 54: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/54.jpg)
How well can we model spam with these features?
![Page 55: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/55.jpg)
How well can we model spam with these features?
Quite well!
Using a logistic regression model, we can obtain 86%¹ accuracy and 0.82 AUC using just 32 features (11 in-link features and 21 on-page features).
![Page 56: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/56.jpg)
How well can we model spam with these features?
Quite well!
Using a logistic regression model, we can obtain 86%¹ accuracy and 0.82 AUC using just 32 features (11 in-link features and 21 on-page features).
¹ Well, we can get 83% accuracy by always choosing not-spam, so accuracy isn’t the best measure. The 0.82 AUC is quite good for such a simple model.
Overfitting was controlled with L2 regularization and k-fold cross-validation.
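To make the evaluation concrete, here is a small pure-Python sketch of an L2-regularized logistic regression scored by k-fold cross-validated AUC, plus the "always predict not-spam" accuracy baseline from the footnote. It is not the model from the talk: the gradient-descent settings, the `l2` strength, and pooling the held-out scores across folds before computing AUC are all assumptions chosen to keep the example short.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, l2=0.1, lr=0.5, epochs=300):
    """Batch gradient descent on mean log-loss + (l2 / 2) * ||w||^2 (bias unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (grad_w[j] / n + l2 * wj) for j, wj in enumerate(w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, X):
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(y_true, scores):
    """Chance that a random positive scores above a random negative (ties count half)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X, y, k=5, l2=0.1, seed=0):
    """Score every example with a model that never saw it, then compute one AUC."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = [0.0] * len(X)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_logistic([X[i] for i in train], [y[i] for i in train], l2=l2)
        for i, p in zip(fold, predict(model, [X[i] for i in fold])):
            scores[i] = p
    return auc(y, scores)


def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most common label."""
    ones = sum(y)
    return max(ones, len(y) - ones) / len(y)
```

With labels that are 17% penalized, `majority_baseline_accuracy` returns 0.83, which is why the footnote prefers AUC: it measures how well penalized sites are ranked above clean ones, regardless of class balance.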
![Page 57: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/57.jpg)
More sophisticated modeling
In-link features
Logistic
On-page features
Logistic
Mixture
SPAM / NOT SPAM ??
Can use a mixture of logistic models, one for in-link and one for on-page. Use EM to set parameters.
![Page 58: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/58.jpg)
More sophisticated modeling
90% penalized
50% in-link
50% on-page
![Page 59: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/59.jpg)
More sophisticated modeling
65% penalized
85% in-link
15% on-page
A mixture of logistic models attributes “responsibility” to both the in-link and on-page features and also predicts the likelihood of a penalty.
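A hedged sketch of how such a mixture could produce both numbers on these slides: each logistic component gives its own penalty probability, the mixing weight blends them into an overall probability, and the "responsibility" of each component is its share of that blended probability. This is one common two-component formulation (the quantity an EM E-step would compute for a penalized site). The talk does not give its exact parameterization, and the weights and names below are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mixture_prediction(x_inlink, x_onpage, w_inlink, w_onpage, mix_inlink=0.5):
    """Return (P(penalized), in-link responsibility, on-page responsibility).

    w_inlink / w_onpage: weights of the two logistic components (assumed already fit).
    mix_inlink: prior weight of the in-link component, between 0 and 1.
    """
    p_in = sigmoid(sum(w * x for w, x in zip(w_inlink, x_inlink)))
    p_on = sigmoid(sum(w * x for w, x in zip(w_onpage, x_onpage)))
    joint_in = mix_inlink * p_in
    joint_on = (1.0 - mix_inlink) * p_on
    p_penalized = joint_in + joint_on
    # Responsibilities: each component's share of the penalty probability.
    return p_penalized, joint_in / p_penalized, joint_on / p_penalized
```

When both components agree, the responsibilities split evenly. When only the in-link component signals a penalty, most of the responsibility shifts to the in-link features.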
![Page 60: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/60.jpg)
Takeaways!
![Page 61: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/61.jpg)
“Unnatural” sites or link profiles
With lots of data, “unnatural” sites or link profiles are moderately easy to detect algorithmically.
You are at risk of being penalized if you build obviously low-quality links.
![Page 62: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/62.jpg)
MozTrust Rules!
mozTrust is a good predictor of spam. Be careful if you are building links from low mozTrust sites.
mozTrust, an engineering feat of awesomeness.
![Page 63: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/63.jpg)
SEOmoz Tools Future
We hope to have a spam score of some sort available in Mozscape in the future.
In the nearer term, we plan to repurpose some of this work for improving Freshscape.
![Page 64: Algorithmic Web Spam detection - Matt Peters MozCon](https://reader038.vdocuments.mx/reader038/viewer/2022110306/554ba8c0b4c905b3618b51d2/html5/thumbnails/64.jpg)
Matthew Peters
Scientist
SEOmoz
@mattthemathman