Web spamming
Detecting Spam Web Pages through Content Analysis
Alexandros Ntoulas et al, 2006, International World Wide Web Conference
• link stuffing: for link-based ranking, black hat SEO techniques include the creation of extraneous pages which link to a target page
• keyword stuffing: the content of other pages may be “engineered” so as to appear relevant to popular searches
Figure 1: An example spam page; although it contains popular keywords, the overall content is useless to a human user
Web spam
• The practices of crafting web pages for the sole purpose of increasing the ranking of these or some affiliated pages, without improving the utility to the viewer, are called “web spam”.
Why do spammers engage in web spamming?
• First, for financial gain: getting search engines to rank spam sites highly draws web searchers to those sites.
• Second, as an attack on the search engine itself: exposing spam sites in results makes users distrust the engine's quality.
• Finally, spam pages cause a search engine to waste storage space, processing time, and network resources. – 1/7 of English-language pages are spam
Importance of detecting web spam
• Creating an effective spam detection method is a challenging problem. – Given the size of the web, such a method has to be automated.
– However, while detecting spam, we have to ensure that we identify spam pages alone, and that we do not mistakenly consider legitimate pages to be spam.
– At the same time, it is most useful if we can detect that a page is spam as early as possible, and certainly prior to query processing. In this way, we can allocate our crawling, processing, and indexing efforts to non-spam pages, thus making more efficient use of our resources.
Web spamming techniques
• Web Spam Taxonomy. By Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005
Term Spamming
• p: page, q: query terms
• TF(t) = the number of occurrences of term t in the document
• IDF(t) = the inverse document frequency of term t; it shrinks as the number of documents containing t grows
• Term spamming targets search engines whose ranking algorithms are based on TFIDF scores
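A minimal sketch of why term repetition pays off under TFIDF-style ranking; the scoring function and the tiny corpus are made up for illustration, not any engine's actual formula:

```python
import math

def tfidf_score(query_terms, doc, corpus):
    """Toy TFIDF ranking: sum over query terms of TF(t) * IDF(t).
    TF(t)  = occurrences of t in the document
    IDF(t) = log(|corpus| / number of documents containing t)"""
    score = 0.0
    words = doc.lower().split()
    for t in query_terms:
        tf = words.count(t)
        df = sum(1 for d in corpus if t in d.lower().split())
        if tf and df:
            score += tf * math.log(len(corpus) / df)
    return score

corpus = [
    "cheap cameras and lens accessories",
    "photography tips for beginners",
    "cheap cheap cheap cameras cameras cheap deals",  # keyword-stuffed page
]
query = ["cheap", "cameras"]
honest = tfidf_score(query, corpus[0], corpus)
spam = tfidf_score(query, corpus[2], corpus)
# Repeating query terms drives TF up, so the stuffed page outscores the honest one.
```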
Term Spamming
• Body / title / meta tag / anchor text
<meta name="keywords" content="buy, cheap, cameras, lens, accessories, nikon, canon">
<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>
• URL spam: buy-canon-rebel-20d-lens-case.camerasx.com, buy-nikon-d100-d70-lens-case.camerasx.com,
How Term Spamming Is Done
• Repetition of one or a few specific terms
• Dumping of a large number of unrelated terms
• Weaving of spam terms into copied contents
• Phrase stitching is also used by spammers to create content quickly
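The "weaving" technique above can be sketched as follows; the `weave` helper and its inputs are hypothetical, purely to show how spam terms get spliced into otherwise legitimate copied text:

```python
import random

def weave(copied_text, spam_terms, every=3, seed=1):
    """Weaving: splice spam terms into legitimate copied content
    every few words, keeping the page superficially readable."""
    rng = random.Random(seed)
    words = copied_text.split()
    out = []
    for i, w in enumerate(words, 1):
        out.append(w)
        if i % every == 0:
            out.append(rng.choice(spam_terms))
    return " ".join(out)

print(weave("the quick brown fox jumps over the lazy dog",
            ["cheap", "cameras", "free"]))
```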
Link Spamming
• A technique that manipulates outgoing and incoming links by exploiting how the PageRank algorithm works
Outgoing links
• A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score.
• At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning: one can find on the World Wide Web a number of directory sites, some larger and better known (e.g., the DMOZ Open Directory, dmoz.org, or the Yahoo! directory, dir.yahoo.com); spammers replicate such directory pages on their own sites, instantly gaining a large set of outgoing links.
Incoming links
• Create a honey pot, a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have (hidden) links to the target spam page(s).
• Post links on blogs, unmoderated message boards, guest books, or wikis: spammers may include URLs to their spam pages as part of the seemingly innocent comments/messages they post.
Hiding Techniques-Content Hiding
Hiding Techniques-Cloaking
If spammers can clearly identify web crawler clients, they can adopt the following strategy, called cloaking: given a URL, spam web servers return one specific HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the ultimately intended content to the web users (without traces of spam on the page), and, at the same time, send a spammed document to the search engine for indexing.
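The cloaking decision logic can be sketched in a few lines; the user-agent signatures and page bodies below are made up for illustration:

```python
# Hypothetical crawler signatures a cloaking server might match against.
CRAWLER_AGENTS = ("googlebot", "bingbot", "msnbot")

USER_PAGE = "<html><body>Welcome! Normal-looking content.</body></html>"
SPAM_PAGE = "<html><body>cheap cheap cameras free deals</body></html>"

def serve(user_agent: str) -> str:
    """Cloaking: return the spammed document to identified crawlers,
    and the clean document to everyone else."""
    ua = user_agent.lower()
    if any(bot in ua for bot in CRAWLER_AGENTS):
        return SPAM_PAGE   # indexed by the search engine
    return USER_PAGE       # seen by human visitors
```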
Hiding Techniques-Redirection
Spam occurrence per top-level domain
• 105,484,446 web pages, collected by the MSN Search crawler during August 2004.
Spam occurrence per language in our data set.
Prevalence of spam - number of words on page
Prevalence of spam - number of words in title
Prevalence of spam - average word-length of page
Prevalence of spam - visible content on page
Prevalence of spam - compressibility of page
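The compressibility feature can be approximated with any standard compressor; here is a sketch using zlib (a stand-in, not necessarily the compressor the authors used), showing why keyword-stuffed pages stand out:

```python
import zlib

def compression_ratio(page_text: str) -> float:
    """Compressibility feature: raw page size divided by compressed size.
    Highly repetitive (keyword-stuffed) pages compress unusually well,
    so a high ratio is a spam signal."""
    raw = page_text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

normal = "Here is a page with ordinary varied prose about several topics."
stuffed = "cheap cameras " * 200  # repetition typical of keyword stuffing
# The stuffed page compresses far better than the normal one.
```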
Classification model to detect spam
• Given the training set DS, we generate N training sets by sampling n random items with replacement.
• For each of the N training sets, we now create a classifier, thus obtaining N classifiers.
• In order to classify a page, we have each of the N classifiers provide a class prediction, which is considered as a vote for that particular class.
• The eventual class of the page is the class with the majority of the votes
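The steps above (bootstrap sampling, N classifiers, majority vote) can be sketched as follows; the "classifier" is a trivial stand-in, not the learner the paper actually uses:

```python
import random
from collections import Counter

def bagging_predict(training_labels, n_models=11, seed=0):
    """Bagging as described above: draw N bootstrap samples (n items,
    with replacement), train one classifier per sample, then let the
    N classifiers vote and return the majority class.
    The 'classifier' here is a trivial stand-in that just predicts the
    majority label of its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(training_labels)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(training_labels) for _ in range(n)]
        votes.append(Counter(sample).most_common(1)[0][0])  # one vote per model
    return Counter(votes).most_common(1)[0][0]  # majority of the N votes

labels = ["spam"] * 2 + ["non-spam"] * 8
prediction = bagging_predict(labels)
```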
Bagging & Boosting
Confusion matrix (rows: actual class, columns: predicted class):

                    Predicted
                    Spam    Non-spam
Actual  Spam          A        B
        Non-spam      C        D
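From a confusion matrix with cells A-D laid out as above, the usual evaluation metrics follow directly; the counts in the example call are made up:

```python
def spam_metrics(A, B, C, D):
    """A = spam classified as spam,   B = spam classified as non-spam,
    C = non-spam classified as spam,  D = non-spam classified as non-spam."""
    recall = A / (A + B)              # fraction of actual spam we caught
    precision = A / (A + C)           # fraction of flagged pages that are spam
    accuracy = (A + D) / (A + B + C + D)
    return recall, precision, accuracy

r, p, acc = spam_metrics(A=80, B=20, C=10, D=890)  # illustrative counts
```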
Challenges in Web Information Retrieval
Mehran Sahami, Vibhu Mittal, Shumeet Baluja, Henry Rowley
Google Inc.
Information Retrieval on the Web
• Goal: identify which pages are of high quality and relevance to a user's query. – PageRank, HITS
• Two challenges – Adversarial classification: detecting Web spamming – Evaluating search results
PageRank
• Assume four web pages: A, B, C, and D.
• The initial values of PageRank: – PR(A) = PR(B) = PR(C) = PR(D) = 0.25
• PageRank for any page u (simplified, without a damping factor): PR(u) = Σ_{v ∈ B_u} PR(v) / N_v
• B_u = { v | v links to page u }
• N_v = the number of links from page v.
PR(A) = PR(C)/1
PR(B) = PR(A)/2
PR(C) = PR(A)/2 + PR(B)/1+PR(D)/1
PR(D) = 0
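The four-page example can be iterated to a fixed point with a few lines of code; this is the simplified, damping-free formula from the slide, not the full PageRank used in practice:

```python
def pagerank(links, iterations=100):
    """Simplified PageRank matching the formulas above:
    PR(u) = sum over v in B_u of PR(v) / N_v, no damping factor."""
    pages = sorted(links)
    pr = {p: 1.0 / len(pages) for p in pages}  # initial value 0.25 each
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += pr[v] / len(outs)  # N_v = number of links from v
        pr = new
    return pr

# The example graph: A->B, A->C, B->C, C->A, and D->C (nothing links to D).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
# PR(D) stays 0, matching PR(D) = 0 above: no page links to D.
```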
Determining the relatedness of fragments of text
• e.g.: – “Captain Kirk” & “Star Trek” are more closely related than – “Captain Kirk” & “Fried Chicken”.
• How to measure the closeness between two phrases?
• K(x,y) =
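One way to realize such a kernel is cosine similarity between context vectors of the two phrases; this sketch builds the vectors from a small local corpus, which is only an assumption standing in for the paper's approach of expanding each phrase with search-result text:

```python
import math
from collections import Counter

def context_vector(phrase, corpus):
    """Hypothetical stand-in for query expansion: represent a phrase by
    the word counts of corpus documents that contain it."""
    vec = Counter()
    for doc in corpus:
        if phrase.lower() in doc.lower():
            vec.update(doc.lower().split())
    return vec

def kernel(x, y, corpus):
    """K(x, y): cosine similarity between the two context vectors."""
    vx, vy = context_vector(x, corpus), context_vector(y, corpus)
    dot = sum(vx[w] * vy[w] for w in vx)
    nx = math.sqrt(sum(c * c for c in vx.values()))
    ny = math.sqrt(sum(c * c for c in vy.values()))
    return dot / (nx * ny) if nx and ny else 0.0

corpus = [
    "Captain Kirk commands the Enterprise in Star Trek",
    "Star Trek is a science fiction series",
    "Fried chicken is a popular dish",
]
# "Captain Kirk" shares context with "Star Trek" but not "Fried Chicken".
```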
Retrieval of UseNet Articles
• at least 800 million documents
Retrieval of Images and Sounds
• non-textual “documents” – from digital still and video cameras, camera phones, audio recording devices, and mp3 music.