
Page 1: Web Spam Taxonomy - Stanford University · infolab.stanford.edu/~zoltan/presentations/taxonomy... · 2005. 5. 10. · Web Spam Taxonomy, Zoltán Gyöngyi, Hector Garcia-Molina. AIRWeb'05

Web Spam Taxonomy

Zoltán Gyöngyi, Hector Garcia-Molina

AIRWeb'05 • Tokyo, May 10, 2005 2

Roadmap

• Subject
• Observed behavior
  Boosting
  – Term-based
  – Link-based
  Hiding
• Statistics
• Challenges

Subject

So… who does what?

• importance (global)
• relevance (query-dependent)

Spamming: deliberate human action meant to trigger unjustifiably high ranking

Subject

Why?
• Monetization
  Better ranking = higher click-through rate
  Search engine optimization
  Affiliate spam

How?

Techniques / Boosting

• Used to increase ranking
• Hypertext boosting
  Term
  – Relevance (one/many queries)
  – Target: TF-IDF variants
  Link
  – Importance
  – Target: inlink/outlink count, HITS, PageRank
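
The slide names TF-IDF variants as the target of term boosting. A minimal sketch of why that target is attractive, assuming a plain unnormalized TF-IDF with raw term counts and a smoothed logarithmic IDF (one variant among many; the function name `tfidf_score` and the toy corpus are illustrative and not from the talk):

```python
import math
from collections import Counter

def tfidf_score(query, doc, corpus):
    """Sum of tf * idf over the query terms (simple, unnormalized variant)."""
    tf = Counter(doc.lower().split())
    n_docs = len(corpus)
    score = 0.0
    for term in query.lower().split():
        # Document frequency: how many corpus documents contain the term.
        df = sum(1 for d in corpus if term in d.lower().split())
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
        score += tf[term] * idf
    return score

honest = "we sell plush teddy bears"
spammy = "plush bears plush bears plush bears we sell plush teddy bears"
corpus = [honest, spammy, "weather report for tokyo", "textile art on looms"]

honest_score = tfidf_score("plush bears", honest, corpus)
spammy_score = tfidf_score("plush bears", spammy, corpus)
print(honest_score < spammy_score)
```

Because the term frequency is not normalized here, repeating the query terms raises the score directly. Variants that normalize by document length dampen this effect but do not necessarily remove it.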

Techniques / Boosting / Term

Term spamming
• What? (target field): body, title, meta tag, anchor, url
• How? (technique): repetition, dumping, weaving, stitching

Techniques / Boosting / Term

What?

• title, meta tag, body

<html>
  <head>
    <meta name="keywords" content="teddy bears; plush bears; plush animals; gift bears; toy bears; stuffed bears">
    <title>Teddy Bears</title>
  </head>
  <body>
    Our customers agree that we are the best online retailer of plush teddy bears!
    …
  </body>
</html>

• anchor text, url

<html>
  …
  A great <a href="plush.com">stuffed plush bear</a> store.
</html>
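
To make the "what?" fields concrete, the following sketch pulls title, meta keywords, body text, anchor text, and link URLs out of a page with Python's standard `html.parser`. The class name `FieldExtractor` and the field keys are hypothetical, and the page is a condensed version of the slide's example; this is an illustration of where term spam can be placed, not a method from the talk.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collects the text fields named on the slide from one HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "meta": [], "body": [], "anchor": [], "url": []}
        self._open = []  # stack of currently open tags we track

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.fields["meta"].append(attrs.get("content") or "")
        elif tag == "a":
            self.fields["url"].append(attrs.get("href") or "")
        if tag in ("title", "body", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Drop the most recently opened matching tag.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open:
            return
        innermost = self._open[-1]
        key = "anchor" if innermost == "a" else innermost
        self.fields[key].append(text)

page = """<html><head>
<meta name="keywords" content="teddy bears; plush bears">
<title>Teddy Bears</title></head>
<body>A great <a href="plush.com">stuffed plush bear</a> store.</body></html>"""

parser = FieldExtractor()
parser.feed(page)
parser.close()
fields = parser.fields
print(fields)
```

Anchor text is kept separate from body text because it is usually credited to the page the link points to rather than to the page that contains it.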

Techniques / Boosting / Term

How?
• repetition repetition repetition repetition repetition repetition
• dumortierite dumose dumous dump dumpage dumper dumpily dumpiness dumping dumpish dumpishly
• work in weaving three-women teams is an ancient textile art on looms
• please refrain from using the phrase stitching wounds located on the lower limbs

• heuristics
• statistical analysis
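
The slide mentions heuristics and statistical analysis without elaborating. Assuming they refer to ways of spotting term spam, one simple hypothetical heuristic is to measure how much of a text is taken up by its single most frequent term. The function names and the 0.5 threshold below are arbitrary choices for illustration, not values from the talk.

```python
from collections import Counter

def term_spam_signals(text):
    """Return (top_term_share, distinct_ratio) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    top_term_share = counts.most_common(1)[0][1] / len(tokens)
    distinct_ratio = len(counts) / len(tokens)
    return top_term_share, distinct_ratio

def looks_like_repetition(text, max_share=0.5):
    """Flag text dominated by one term; the threshold is an assumption."""
    share, _ = term_spam_signals(text)
    return share > max_share

repetition = "repetition repetition repetition repetition repetition repetition"
dumping = ("dumortierite dumose dumous dump dumpage dumper dumpily "
           "dumpiness dumping dumpish dumpishly")
normal = "our customers agree that we are the best online retailer of plush teddy bears"

for label, text in [("repetition", repetition), ("dumping", dumping), ("normal", normal)]:
    print(label, looks_like_repetition(text))
```

This check catches the repetition example but not dumping, weaving, or stitching, since those texts consist of distinct words. Catching them would need a comparison against how terms are distributed in ordinary pages, which is presumably where statistical analysis comes in.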

Techniques / Boosting / Link

[Diagram: link spam taxonomy, organized by what? and how?; contents not recoverable from the transcript]

Techniques / Boosting / Link

How?
• Directory clones
  Duplicate (parts of) DMOZ
• Comment spam
  Post messages (containing links) to
  – Blogs
  – (Unmoderated) forums
  – Wikis
• Link spam farms
  Increase size
  Increase collusion

[BYCL’05] [BCSU’05] [MCL’05]
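
The effect of a link spam farm on PageRank can be sketched with a small power iteration. The graph, the node names, the damping factor of 0.85, and the helper `with_farm` are all assumptions made for illustration; the farm layout (target links to every farm page, every farm page links back) is one simple colluding structure, not necessarily the one the cited work analyzes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outlinked nodes]}."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            outs = links.get(node, [])
            if outs:
                share = damping * rank[node] / len(outs)
                for target in outs:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

def with_farm(links, target, size):
    """Copy the graph and add `size` hypothetical farm pages linked to and from `target`."""
    farmed = {node: list(outs) for node, outs in links.items()}
    farm_pages = [f"farm{i}" for i in range(size)]
    farmed[target] = farmed.get(target, []) + farm_pages
    for page in farm_pages:
        farmed[page] = [target]
    return farmed

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "target": ["a"]}
before = pagerank(web)["target"]
after = pagerank(with_farm(web, "target", 10))["target"]
print(before < after)
```

In this toy graph the target starts with no inlinks at all, so its score is only the baseline share. Adding farm pages that link to it, and that it links back to, keeps score circulating between target and farm, which is the sense in which a larger and more tightly colluding farm helps the spammer.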

Techniques / Hiding

• Used to conceal boosting

Hiding techniques
• content hiding
  – text, link (via color, script, graphics)
• cloaking
• redirection
  – meta tag, script

Techniques / Hiding

• Content hiding

<style type="text/css">
  body {
    background-color: white;
    color: white;
  }
</style>

<div style="visibility: hidden">You can’t see me!</div>

<a href="…"><img src="1x1.gif"></a>

• Cloaking
  Identify web crawlers
  Serve a different version of the page
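
Some of the content-hiding tricks above leave traces in the markup itself. The sketch below flags elements hidden through an inline `style` attribute and images declared as 1×1 pixels. The class and function names are hypothetical, and the `width`/`height` attributes in the sample are added for the example; the slide's image carries no dimensions, so a real check would have to inspect the image file.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"visibility\s*:\s*hidden|display\s*:\s*none", re.IGNORECASE)

class HiddenContentFinder(HTMLParser):
    """Records tags whose inline attributes suggest hidden text or links."""

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (tag, reason) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        if HIDDEN_STYLE.search(style):
            self.findings.append((tag, "hidden by inline style"))
        if tag == "img" and attrs.get("width") == "1" and attrs.get("height") == "1":
            self.findings.append((tag, "1x1 image"))

def find_hidden(html):
    """Return all findings for one HTML string."""
    finder = HiddenContentFinder()
    finder.feed(html)
    finder.close()
    return finder.findings

sample = (
    '<div style="visibility: hidden">You can\'t see me!</div>'
    '<a href="plush.com"><img src="1x1.gif" width="1" height="1"></a>'
    '<p>Visible text</p>'
)
findings = find_hidden(sample)
print(findings)
```

The white-on-white stylesheet rule is not caught by this sketch, because detecting it requires resolving the effective text and background colors rather than reading a single attribute.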

Techniques / Hiding

• Redirection
  Redirect on load from a heavily spammed page to the true target

<meta http-equiv="refresh" content="0;url=plush.com">

<script type="text/javascript"><!--
  eval(window.location = "plush.com");
//--></script>

[WD’05]
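
Both redirection forms on the slide can be spotted statically in simple cases. The sketch below reports the target of a `refresh` meta tag and of a literal assignment to `window.location` inside a script. The names `RedirectFinder` and `find_redirects` and the regular expression are assumptions for illustration; scripts that build the URL dynamically would slip past it.

```python
import re
from html.parser import HTMLParser

# Matches assignments such as window.location = "..." or location.href = '...'.
JS_REDIRECT = re.compile(
    r"""(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)

class RedirectFinder(HTMLParser):
    """Collects (mechanism, url) pairs for meta-refresh and script redirects."""

    def __init__(self):
        super().__init__()
        self.targets = []
        self._script_chunks = None  # list while inside <script>, else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # content looks like "0;url=plush.com": a delay, then the target.
            _delay, _, rest = (attrs.get("content") or "").partition(";")
            key, _, url = rest.strip().partition("=")
            if key.strip().lower() == "url" and url.strip():
                self.targets.append(("meta refresh", url.strip()))
        elif tag == "script":
            self._script_chunks = []

    def handle_data(self, data):
        if self._script_chunks is not None:
            self._script_chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._script_chunks is not None:
            source = "".join(self._script_chunks)
            for url in JS_REDIRECT.findall(source):
                self.targets.append(("script", url))
            self._script_chunks = None

def find_redirects(html):
    finder = RedirectFinder()
    finder.feed(html)
    finder.close()
    return finder.targets

page = '''<meta http-equiv="refresh" content="0;url=plush.com">
<script type="text/javascript"><!--
eval(window.location = "plush.com");
//--></script>'''
redirects = find_redirects(page)
print(redirects)
```

Not every redirect is spam, so a finding like this is only a signal; the slide's point is that the redirecting page is heavily spammed while the destination is the page the spammer actually wants visited.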

Statistics

• [FMN’04]/1
  Beginning of 2003
  150M total / 751 sample pages
  8.1% spam
• [FMN’04]/2
  Summer of 2002
  429M total / 535 sample pages
  6.9% spam
• [GGMP’04]
  August 2003
  31M total / 748 sample sites
  18% spam
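
Since these percentages come from samples of a few hundred pages or sites, each carries sampling uncertainty. A rough sketch of its size, assuming simple random sampling and a normal approximation to the binomial (the slide does not say how the samples were drawn, so the intervals are illustrative only):

```python
import math

def proportion_interval(p, n, z=1.96):
    """Normal-approximation confidence interval for a sampled proportion."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# (observed spam fraction, sample size) as reported on the slide
samples = {
    "[FMN'04]/1": (0.081, 751),
    "[FMN'04]/2": (0.069, 535),
    "[GGMP'04]": (0.18, 748),
}
for label, (p, n) in samples.items():
    low, high = proportion_interval(p, n)
    print(f"{label}: {p:.1%} spam, roughly {low:.1%} to {high:.1%}")
```

With `z=1.96` the intervals are approximate 95% intervals. For the first sample the half-width is about two percentage points, so the 8.1% and 6.9% figures are not clearly distinguishable on sampling grounds alone.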

Statistics

• PageRank of spam

Challenges

• Spam prevalence statistics
  Per type
  At various levels of granularity
  In index vs. in results
• Spam neutralization
  Spam-proof ranking algorithms (?)
  Better use of human judgment
  – Exploitation of implicit feedback
  – Better semantic separation
  Economy/game-theory + ads

Conclusions

• Spamming techniques
  Term-based or link-based
  Of various complexity/efficiency
• Spam detection techniques
  Wide scale
  Work in progress
• Challenges
  Statistics
• Contact: [email protected]