Web spamming
Detecting Spam Web Pages through Content Analysis
Alexandros Ntoulas et al, 2006, International World Wide Web Conference
• link stuffing: for link-based ranking, black hat SEO techniques include the creation of extraneous pages which link to a target page
• keyword stuffing: the content of other pages may be “engineered” so as to appear relevant to popular searches
Figure 1: An example spam page; although it contains popular keywords, the overall content is useless to a human user
Web spam
• The practices of crafting web pages for the sole purpose of increasing the ranking of these or some affiliated pages, without improving the utility to the viewer, are called “web spam”.
Why do spammers engage in web spamming?
• First, for financial gain: getting search engines to rank spam sites highly draws web searchers to those sites.
• Second, as an attack on the search engine itself: exposing spam sites in results makes users distrust the engine's quality.
• Finally, spam pages cause a search engine to waste storage space, processing time, and network resources. – 1/7 of English-language pages are spam
Importance of detecting web spam
• Creating an effective spam detection method is a challenging problem. – Given the size of the web, such a method has to be automated.
– However, while detecting spam, we have to ensure that we identify spam pages alone, and that we do not mistakenly consider legitimate pages to be spam.
– At the same time, it is most useful if we can detect that a page is spam as early as possible, and certainly prior to query processing. In this way, we can allocate our crawling, processing, and indexing efforts to non-spam pages, thus making more efficient use of our resources.
Web spamming techniques
• Web Spam Taxonomy. By Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005
Term Spamming
• p: page, q: query terms
• TF(t) = the number of occurrences of term t in the document
• IDF(t) = the inverse document frequency of term t; it shrinks as the number of documents containing t grows
• Term spamming targets search engines whose ranking algorithms are based on TFIDF scores
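A minimal sketch of why term repetition pays off under TFIDF-style ranking; the scoring function and the tiny corpus are made up for illustration, not any engine's actual formula:

```python
import math

def tfidf_score(query_terms, doc, corpus):
    """Toy TFIDF ranking: sum over query terms of TF(t) * IDF(t).
    TF(t)  = occurrences of t in the document
    IDF(t) = log(|corpus| / number of documents containing t)"""
    score = 0.0
    words = doc.lower().split()
    for t in query_terms:
        tf = words.count(t)
        df = sum(1 for d in corpus if t in d.lower().split())
        if tf and df:
            score += tf * math.log(len(corpus) / df)
    return score

corpus = [
    "cheap cameras and lens accessories",
    "photography tips for beginners",
    "cheap cheap cheap cameras cameras cheap deals",  # keyword-stuffed page
]
query = ["cheap", "cameras"]
honest = tfidf_score(query, corpus[0], corpus)
spam = tfidf_score(query, corpus[2], corpus)
# Repeating query terms drives TF up, so the stuffed page outscores the honest one.
```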
Term Spamming
• Body / title / meta tag / anchor text
<meta name="keywords" content="buy, cheap, cameras, lens, accessories, nikon, canon">
<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>
• URL spam: buy-canon-rebel-20d-lens-case.camerasx.com, buy-nikon-d100-d70-lens-case.camerasx.com,
How Term Spamming Is Done
• Repetition of one or a few specific terms
• Dumping of a large number of unrelated terms
• Weaving of spam terms into copied contents
• Phrase stitching is also used by spammers to create content quickly
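The "weaving" technique above can be sketched as follows; the `weave` helper and its inputs are hypothetical, purely to show how spam terms get spliced into otherwise legitimate copied text:

```python
import random

def weave(copied_text, spam_terms, every=3, seed=1):
    """Weaving: splice spam terms into legitimate copied content
    every few words, keeping the page superficially readable."""
    rng = random.Random(seed)
    words = copied_text.split()
    out = []
    for i, w in enumerate(words, 1):
        out.append(w)
        if i % every == 0:
            out.append(rng.choice(spam_terms))
    return " ".join(out)

print(weave("the quick brown fox jumps over the lazy dog",
            ["cheap", "cameras", "free"]))
```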
Link Spamming
• A technique that manipulates outgoing and incoming links by exploiting how the PageRank algorithm works
Outgoing links
• A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score.
• At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning: one can find on the World Wide Web a number of directory sites, some larger and better known (e.g., the DMOZ Open Directory, dmoz.org, or the Yahoo! directory, dir.yahoo.com); spammers replicate such directory pages on their own sites, instantly gaining a large set of outgoing links.
Incoming links
• Create a honey pot, a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have (hidden) links to the target spam page(s).
• Post links on blogs, unmoderated message boards, guest books, or wikis: spammers may include URLs to their spam pages as part of the seemingly innocent comments/messages they post.
Hiding Techniques-Content Hiding
Hiding Techniques-Cloaking
If spammers can clearly identify web crawler clients, they can adopt the following strategy, called cloaking: given a URL, spam web servers return one specific HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the ultimately intended content to the web users (without traces of spam on the page), and, at the same time, send a spammed document to the search engine for indexing.
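The cloaking decision logic can be sketched in a few lines; the user-agent signatures and page bodies below are made up for illustration:

```python
# Hypothetical crawler signatures a cloaking server might match against.
CRAWLER_AGENTS = ("googlebot", "bingbot", "msnbot")

USER_PAGE = "<html><body>Welcome! Normal-looking content.</body></html>"
SPAM_PAGE = "<html><body>cheap cheap cameras free deals</body></html>"

def serve(user_agent: str) -> str:
    """Cloaking: return the spammed document to identified crawlers,
    and the clean document to everyone else."""
    ua = user_agent.lower()
    if any(bot in ua for bot in CRAWLER_AGENTS):
        return SPAM_PAGE   # indexed by the search engine
    return USER_PAGE       # seen by human visitors
```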
Hiding Techniques-Redirection
Spam occurrence per top-level domain
• 105,484,446 web pages, collected by the MSN Search crawler during August 2004.
Spam occurrence per language in our data set.
Prevalence of spam - number of words on page
Prevalence of spam - number of words in title
Prevalence of spam - average word-length of page
Prevalence of spam - visible content on page
Prevalence of spam - compressibility of page
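The compressibility feature can be approximated with any standard compressor; here is a sketch using zlib (a stand-in, not necessarily the compressor the authors used), showing why keyword-stuffed pages stand out:

```python
import zlib

def compression_ratio(page_text: str) -> float:
    """Compressibility feature: raw page size divided by compressed size.
    Highly repetitive (keyword-stuffed) pages compress unusually well,
    so a high ratio is a spam signal."""
    raw = page_text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

normal = "Here is a page with ordinary varied prose about several topics."
stuffed = "cheap cameras " * 200  # repetition typical of keyword stuffing
# The stuffed page compresses far better than the normal one.
```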
Classification model to detect spam
• Given the training set DS, we generate N training sets by sampling n random items with replacement.
• For each of the N training sets, we now create a classifier, thus obtaining N classifiers.
• In order to classify a page, we have each of the N classifiers provide a class prediction, which is considered as a vote for that particular class.
• The eventual class of the page is the class with the majority of the votes
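The steps above (bootstrap sampling, N classifiers, majority vote) can be sketched as follows; the "classifier" is a trivial stand-in, not the learner the paper actually uses:

```python
import random
from collections import Counter

def bagging_predict(training_labels, n_models=11, seed=0):
    """Bagging as described above: draw N bootstrap samples (n items,
    with replacement), train one classifier per sample, then let the
    N classifiers vote and return the majority class.
    The 'classifier' here is a trivial stand-in that just predicts the
    majority label of its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(training_labels)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(training_labels) for _ in range(n)]
        votes.append(Counter(sample).most_common(1)[0][0])  # one vote per model
    return Counter(votes).most_common(1)[0][0]  # majority of the N votes

labels = ["spam"] * 2 + ["non-spam"] * 8
prediction = bagging_predict(labels)
```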
Bagging & Boosting
Confusion matrix (rows: actual class, columns: predicted class):

                    Predicted
                    Spam    Non-spam
Actual  Spam          A        B
        Non-spam      C        D
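From a confusion matrix with cells A-D laid out as above, the usual evaluation metrics follow directly; the counts in the example call are made up:

```python
def spam_metrics(A, B, C, D):
    """A = spam classified as spam,   B = spam classified as non-spam,
    C = non-spam classified as spam,  D = non-spam classified as non-spam."""
    recall = A / (A + B)              # fraction of actual spam we caught
    precision = A / (A + C)           # fraction of flagged pages that are spam
    accuracy = (A + D) / (A + B + C + D)
    return recall, precision, accuracy

r, p, acc = spam_metrics(A=80, B=20, C=10, D=890)  # illustrative counts
```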
Challenges in Web Information Retrieval
Mehran Sahami, Vibhu Mittal, Shumeet Baluja, Henry Rowley
Google Inc.
Information Retrieval on the Web
• Goal: identify which pages are of high quality and relevance to a user's query. – PageRank, HITS
• Two challenges – Adversarial classification: detecting Web spamming – Evaluating search results
PageRank
• Assume four web pages: A, B, C, and D.
• The initial values of PageRank: – PR(A) = PR(B) = PR(C) = PR(D) = 0.25
• PageRank for any page u (simplified, without a damping factor): PR(u) = Σ_{v ∈ B_u} PR(v) / N_v
• B_u = { v | v links to page u }
• N_v = the number of links from page v.
PR(A) = PR(C)/1
PR(B) = PR(A)/2
PR(C) = PR(A)/2 + PR(B)/1+PR(D)/1
PR(D) = 0
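The four-page example can be iterated to a fixed point with a few lines of code; this is the simplified, damping-free formula from the slide, not the full PageRank used in practice:

```python
def pagerank(links, iterations=100):
    """Simplified PageRank matching the formulas above:
    PR(u) = sum over v in B_u of PR(v) / N_v, no damping factor."""
    pages = sorted(links)
    pr = {p: 1.0 / len(pages) for p in pages}  # initial value 0.25 each
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += pr[v] / len(outs)  # N_v = number of links from v
        pr = new
    return pr

# The example graph: A->B, A->C, B->C, C->A, and D->C (nothing links to D).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
# PR(D) stays 0, matching PR(D) = 0 above: no page links to D.
```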
Determining the relatedness of fragments of text
• e.g.: – “Captain Kirk” & “Star Trek” are more closely related than – “Captain Kirk” & “Fried Chicken”.
• How to measure the closeness between two phrases?
• K(x,y) =
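One way to realize such a kernel is cosine similarity between context vectors of the two phrases; this sketch builds the vectors from a small local corpus, which is only an assumption standing in for the paper's approach of expanding each phrase with search-result text:

```python
import math
from collections import Counter

def context_vector(phrase, corpus):
    """Hypothetical stand-in for query expansion: represent a phrase by
    the word counts of corpus documents that contain it."""
    vec = Counter()
    for doc in corpus:
        if phrase.lower() in doc.lower():
            vec.update(doc.lower().split())
    return vec

def kernel(x, y, corpus):
    """K(x, y): cosine similarity between the two context vectors."""
    vx, vy = context_vector(x, corpus), context_vector(y, corpus)
    dot = sum(vx[w] * vy[w] for w in vx)
    nx = math.sqrt(sum(c * c for c in vx.values()))
    ny = math.sqrt(sum(c * c for c in vy.values()))
    return dot / (nx * ny) if nx and ny else 0.0

corpus = [
    "Captain Kirk commands the Enterprise in Star Trek",
    "Star Trek is a science fiction series",
    "Fried chicken is a popular dish",
]
# "Captain Kirk" shares context with "Star Trek" but not "Fried Chicken".
```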
Retrieval of UseNet Articles
• at least 800 million documents
Retrieval of Images and Sounds
• non-textual “documents” – from digital still and video cameras, camera phones, audio recording devices, and mp3 music.