Page 1: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa

Know your Neighbors: Web Spam Detection Using the Web Topology

Presented By,

SOUMO GORAI

Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa Murdock(1), Fabrizio Silvestri(2)

1. Yahoo! Research Barcelona, Catalunya, Spain

2. ISTI-CNR, Pisa, Italy

ACM SIGIR, 25 July 2007, Amsterdam

Page 2:

Soumo’s Biography

•4th Year CS Major

•Graduating May 2008

•Interesting About Me: Lived in India, Australia, and the U.S.

•CS Interests: Databases, HCI, Web Programming, Networking, Graphics, Gaming

Page 3:

Here’s all that you can find on the web….

Page 4:

Here’s just some of what really is out there…

Page 5:

And more….

Page 6:

Why so many different things…?

There is a fierce competition for your attention!

Publishing is easy, for personal pages as well as for commercial content, advertisements, and economic activity.

…and there’s lots lots lots lots…lots of spam!

Page 7:

What’s Spam?!

Page 8:

Hidden Text

Page 9:

Only hidden text? Here’s a whole fake search engine!!!

Page 10:

Why is Spam bad?

Costs:

• Costs for users: lower precision for some queries

• Costs for search engines: wasted storage space, network resources, and processing cycles

• Costs for the publishers: resources invested in cheating and not in improving their contents

Every undeserved gain in ranking for a spammer is a loss of search precision for the search engine.

Page 11:

How Do We Detect Spam?

•Machine Learning/Training

•Link-based Detection

•Content-based Detection

•Using Links and Contents

•Using Web-based Topology

Page 12:

Machine Learning/Training

Page 13:

ML Challenges

Machine Learning Challenges:

•Instances are not really independent (graph)

•Training set is relatively small

Information Retrieval Challenges:

•It is hard to find out which features are relevant

•It is hard for search engines to provide labeled data

•Even if they do, it will not reflect a consensus on what is Web Spam

Page 14:

Link-based Detection

Single-level farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005]

Page 15:

Why use it?

• Degree-related measures

• PageRank

• TrustRank [Gyöngyi et al., 2004]

• Truncated PageRank [Becchetti et al., 2006]: similar to PageRank, but it limits the contribution of a page to the PageRank score of its close neighbors. The Truncated PageRank score is therefore a useful feature for spam detection, because spam pages generally try to reinforce their PageRank scores by linking to each other.
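The idea behind Truncated PageRank can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `truncated_pagerank`, the dictionary-based graph format, and the parameters `alpha`, `T` and `max_t` are all hypothetical. The sketch sums the damped contribution of link paths longer than `T`, so support that comes only from very close neighbors is ignored.

```python
import numpy as np

def truncated_pagerank(out_links, n, alpha=0.85, T=2, max_t=50):
    """Score nodes using only link paths longer than T (illustrative sketch)."""
    # Build a row-normalized link matrix P.
    # Assumption: nodes without out-links spread their weight uniformly.
    P = np.zeros((n, n))
    for src, targets in out_links.items():
        for dst in targets:
            P[src, dst] = 1.0
    for i in range(n):
        row_sum = P[i].sum()
        P[i] = P[i] / row_sum if row_sum > 0 else np.full(n, 1.0 / n)

    walk = np.full(n, 1.0 / n)   # contribution of paths of length 0
    score = np.zeros(n)
    for t in range(1, max_t + 1):
        walk = alpha * (walk @ P)  # contribution of paths of length t
        if t > T:                  # skip the short paths (close neighbors)
            score += walk
    return score / score.sum()     # normalize so the scores sum to 1
```

Setting `T = 0` keeps every path of length one or more, which ranks nodes much like ordinary PageRank; larger values of `T` discount more of the nearby support.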

Page 16:

Degree-based

Measures are related to in-degree and out-degree

Edge-reciprocity (the number of links that are reciprocal)

Assortativity (the ratio between the degree of a particular page and the average degree of its neighbors)
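As a minimal sketch, the degree-based measures above could be computed from an edge list like this. The function name `degree_features` and the edge-list input are assumptions made for illustration; "degree" is taken here to mean in-degree plus out-degree, and "neighbors" to mean pages linked in either direction, which the slide does not specify.

```python
from collections import defaultdict

def degree_features(edges):
    """Compute per-node degree features from (source, target) link pairs."""
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for src, dst in set(edges):
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    nodes = set(out_nbrs) | set(in_nbrs)

    def degree(node):
        # Assumption: degree = in-degree + out-degree.
        return len(in_nbrs.get(node, ())) + len(out_nbrs.get(node, ()))

    features = {}
    for node in nodes:
        outs = out_nbrs.get(node, set())
        ins = in_nbrs.get(node, set())
        neighbors = outs | ins
        avg_nbr_degree = sum(degree(v) for v in neighbors) / len(neighbors)
        features[node] = {
            "in_degree": len(ins),
            "out_degree": len(outs),
            # Links from this node that are answered by a link back.
            "reciprocal_links": len(outs & ins),
            # Ratio of own degree to the average degree of the neighbors.
            "assortativity": degree(node) / avg_nbr_degree,
        }
    return features
```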

Page 17:

TrustRank / PageRank

TrustRank: an algorithm that starts from a set of known trusted pages and propagates trust along links. A page's TrustRank score reflects how closely it is connected to those trusted pages.

Ratio between TrustRank and PageRank

Number of home pages.

Cons: this alone is not sufficient as there are many false positives.
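The relationship between the two scores can be sketched with a single power-iteration routine. This is an illustrative sketch, not the published algorithms in full: the helper name `biased_pagerank`, the toy four-host graph, and the choice of host 0 as the only trusted seed are assumptions. PageRank uses a uniform teleport vector, while the TrustRank-style score teleports only to the trusted seeds.

```python
import numpy as np

def biased_pagerank(P, teleport, alpha=0.85, iters=100):
    """Power iteration with a custom teleport vector (P is row-stochastic)."""
    rank = teleport.copy()
    for _ in range(iters):
        rank = alpha * (rank @ P) + (1 - alpha) * teleport
    return rank

# Toy graph: hosts 0 and 1 link to each other, hosts 2 and 3 form a
# separate pair that links only to itself (a tiny link farm).
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

pagerank = biased_pagerank(P, np.full(4, 0.25))                  # uniform jumps
trustrank = biased_pagerank(P, np.array([1.0, 0.0, 0.0, 0.0]))   # host 0 trusted
trust_ratio = trustrank / pagerank   # low ratio: rank not backed by trust
```

In this toy graph hosts 2 and 3 get the same PageRank as the others but no trust at all, so their ratio is zero. On real data the ratio is only one feature among many, in line with the caveat about false positives.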

Page 18:

Content-based Detection

Most of the features reported in [Ntoulas et al., 2006]:

• Number of words in the page and title

• Average word length

• Fraction of anchor text

• Fraction of visible text

• Compression rate

• Corpus precision and corpus recall

• Query precision and query recall

• Independent trigram likelihood

• Entropy of trigrams
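A few of these content features are simple enough to sketch directly. The following is a rough illustration, assuming whitespace tokenization, zlib as the compressor, and word-level trigrams; the function name `content_features` is hypothetical and the exact definitions used in the paper may differ.

```python
import math
import zlib
from collections import Counter

def content_features(text):
    """Compute three simple content features for one page of text."""
    words = text.split()  # assumption: plain whitespace tokenization
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0

    # Compression rate taken here as uncompressed size / compressed size.
    raw = text.encode("utf-8")
    compression_rate = len(raw) / len(zlib.compress(raw)) if raw else 0.0

    # Entropy (in bits) of the distribution of word trigrams in the page.
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in trigrams.values()) if total else 0.0

    return {
        "avg_word_length": avg_word_length,
        "compression_rate": compression_rate,
        "trigram_entropy": entropy,
    }
```

Highly repetitive text, such as a keyword repeated hundreds of times, compresses very well and so gets a high compression rate under this definition.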

Page 19:

Corpus and Query

F: set of most frequent terms in the collection

Q: set of most frequent terms in a query log

P: set of terms in a page

Computation Techniques:

corpus precision: the fraction of words (except stopwords) in a page that appear in the set of popular terms of a data collection.

corpus recall: the fraction of popular terms of the data collection that appear in the page.

query precision: the fraction of words in a page that appear in the set of q most popular terms appearing in a query log.

query recall: the fraction of q most popular terms of the query log that appear in the page.
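In terms of the sets above, both pairs of measures reduce to the same set arithmetic. A minimal sketch, assuming pages and term lists are already tokenized with stopwords removed, and that distinct terms are counted rather than word occurrences (the function name `precision_recall` is hypothetical):

```python
def precision_recall(page_terms, popular_terms):
    """Return (precision, recall) of a page against a set of popular terms."""
    page = set(page_terms)
    popular = set(popular_terms)
    overlap = page & popular
    # Precision: share of the page's terms that are popular terms.
    precision = len(overlap) / len(page) if page else 0.0
    # Recall: share of the popular terms that appear in the page.
    recall = len(overlap) / len(popular) if popular else 0.0
    return precision, recall

# Corpus measures use F (frequent collection terms) as the popular set;
# query measures use Q (frequent query-log terms) instead.
P = {"cheap", "flights", "hotel", "paris"}
F = {"cheap", "hotel", "news"}
corpus_precision, corpus_recall = precision_recall(P, F)
```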

Page 20:

Visual Clues

Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.

Figure: Histogram of the corpus precision in non-spam vs. spam pages.

Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.

Page 21:

Links AND Contents Detection

Why Both?:

Page 22:

Web Topology Detection

• Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.

• Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]

•Spam tends to be clustered on the Web (black on figure)

Page 23:

Topological dependencies: out-links and in-links

Let SOUT(x) be the fraction of spam hosts linked by host x out of all labeled hosts linked by host x. This figure shows the histogram of SOUT for spam and non-spam hosts. We see that almost all non-spam hosts link mostly to non-spam hosts.

Let SIN(x) be the fraction of spam hosts that link to host x out of all labeled hosts that link to x. This figure shows the histograms of SIN for spam and non-spam hosts. In this case there is a clear separation between spam and non-spam hosts.
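Both SOUT(x) and SIN(x) are the same computation applied to different neighbor sets. A small sketch, assuming hypothetical adjacency dictionaries `out_links` and `in_links` and a `labels` dictionary that covers only the hosts that were labeled:

```python
def spam_fraction(neighbors, labels):
    """Fraction of spam among the labeled hosts in `neighbors`."""
    labeled = [labels[h] for h in neighbors if h in labels]
    if not labeled:
        return None  # undefined when no neighbor has a label
    return sum(1 for label in labeled if label == "spam") / len(labeled)

def s_out(host, out_links, labels):
    """SOUT(x): spam fraction among labeled hosts that x links to."""
    return spam_fraction(out_links.get(host, ()), labels)

def s_in(host, in_links, labels):
    """SIN(x): spam fraction among labeled hosts that link to x."""
    return spam_fraction(in_links.get(host, ()), labels)

# Toy data: host "x" links to three hosts, one of which is unlabeled.
labels = {"a": "spam", "b": "nonspam", "d": "spam"}
out_links = {"x": ["a", "b", "c"]}
in_links = {"x": ["a", "d"]}
```

Here `s_out("x", out_links, labels)` is 0.5, because the unlabeled host "c" is left out of the denominator, and `s_in("x", in_links, labels)` is 1.0.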

Page 24:

Clustering: if the majority of a cluster is predicted to be spam, then we change the prediction for all hosts in the cluster to spam. Conversely, if the majority is predicted to be non-spam, all hosts in the cluster are relabeled as non-spam.
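The majority rule described on this slide could be sketched as follows. The function name `smooth_by_cluster` and the input dictionaries are hypothetical, and leaving tied clusters unchanged is an assumption, since the slide does not say what happens on a tie.

```python
from collections import defaultdict

def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host with the majority prediction of its cluster."""
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(predictions[h] == "spam" for h in hosts)
        if spam_votes * 2 > len(hosts):
            label = "spam"
        elif spam_votes * 2 < len(hosts):
            label = "nonspam"
        else:
            continue  # assumption: a tie leaves the cluster untouched
        for h in hosts:
            smoothed[h] = label
    return smoothed

predictions = {"h1": "spam", "h2": "spam", "h3": "nonspam",
               "h4": "nonspam", "h5": "spam"}
cluster_of = {"h1": 0, "h2": 0, "h3": 0, "h4": 1, "h5": 1}
smoothed = smooth_by_cluster(predictions, cluster_of)
```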

Page 25:

Article Critique

Pros:

•Has detailed descriptions of various detection mechanisms.

•Integrates link and content attributes for building a system to detect Web spam

Cons:

•Lacks statistics and success rates for other content-based detection techniques.

•Some graphs had axis labels missing.

Extension:

Combine the regularization methods at hand (regularization being any method of preventing a model from overfitting the data) in order to improve the overall accuracy.

Page 26:

Summary

How Do We Detect Spam?

•Machine Learning/Training

•Link-based Detection

•Content-based Detection

•Using Links and Contents

•Using Web-based Topology

Why is Spam bad?

Costs:

•Costs for users: lower precision for some queries

•Costs for search engines: wasted storage space, network resources, and processing cycles

•Costs for the publishers: resources invested in cheating and not in improving their contents

Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.