1
A Quantitative Study of Forum Spamming Using Context-Based Analysis
Yi-Min Wang^, Ming Ma^
Yuan Niu*, Hao Chen*, Francis Hsu*
*UC Davis, ^Microsoft Research
2
A Look at the Web
[Figure labels: User, Spammer]
3
Why do we care about spam?
Users want to:
  Look at quality pages on the web
  Interact without the trouble of moderation
  Surf safely
Search engines want to:
  Provide good search results
  Profit from ads
We want to investigate the landscape of the problem
  Popular battleground: web forums
4
Why Web Forums?
Open communities: wiki, forums, blogs
Increasingly easy to contribute
5
Why Web Forums?
6
How Spammers Operate
[Diagram. Nodes: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain. Arrow labels: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; Returns; 4. Sends User to Doorway URL; 5. Redirects User]
7
How to deal with the problem?
Content-based approach
  Constrained by content retrieved
  May be deceived by tricks like cloaking and redirection
We propose: context-based analysis
8
Context-based Analysis
Consists of redirection analysis and cloaking analysis
  See dynamic content not served to crawlers
Use the Strider URL Tracer
  Flag large numbers of doorway pages that lead to spam domains
Based on the intuition that:
  Publishing links is necessary to increase popularity
  We must see the destination URL eventually
9
Doorways & Redirections
Google search: Coach handbag
10
Redirection Analysis
Fed URLs to Strider URL Tracer, which records all pages visited
Ranked top third-party domains by redirections
Seeded with known spammer domains
Identified doorway pages based on association with spammer domains
Manually investigated unknown domains to expand the blacklist
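
The ranking and flagging steps above can be sketched roughly as follows. This is a minimal illustration, not the Strider URL Tracer itself: the `traces` mapping (starting URL to the list of URLs visited while loading it), the function names, and the `.example` hosts are all assumptions made for the example, and hostnames stand in for registered domains.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Return the lowercase hostname of a URL ('' if it has none, e.g. no scheme)."""
    return (urlsplit(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, for each third-party host, how many starting URLs led to it."""
    counts = Counter()
    for start_url, visited in traces.items():
        first_party = host(start_url)
        third_party = {host(u) for u in visited} - {first_party, ""}
        counts.update(third_party)
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Return starting URLs whose trace touches a blacklisted host."""
    return sorted(
        start for start, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )


# Hypothetical trace data in the assumed format.
traces = {
    "http://blog-a.example/post1": [
        "http://blog-a.example/post1",
        "http://spam-ads.example/landing",
    ],
    "http://blog-b.example/post2": [
        "http://spam-ads.example/landing",
        "http://tracker.example/c",
    ],
    "http://blog-c.example/post3": ["http://blog-c.example/post3"],
}

ranking = rank_third_party_domains(traces)
doorways = flag_doorways(traces, {"spam-ads.example"})
```

Hosts near the top of `ranking` that are not yet on the blacklist are the candidates for the manual investigation step.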
11
Cloaking Analysis
Diff-based check
  Run URL twice: once with anti-cloaking, once without
Crawler-browser cloaking (User-agent, scripting-on/off)
Click-through cloaking (Referer)
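
A diff-based check of this kind might look like the sketch below. It is an assumption-laden simplification: it only varies the User-Agent and Referer request headers (plain HTTP fetches cannot emulate scripting on/off), the header values, the 0.9 threshold, and all function names are hypothetical, and the fetcher is injectable so the comparison logic can be exercised without network access.

```python
import difflib
import urllib.request

# Hypothetical identities for the two visits.
CRAWLER_HEADERS = {"User-Agent": "ExampleCrawler/1.0"}
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (hypothetical browser)",
    "Referer": "https://www.google.com/search?q=example",  # simulates a click-through
}


def fetch(url, headers):
    """Fetch a page body as text using the given request headers."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def similarity(a, b):
    """Similarity ratio in [0, 1] between two page bodies."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def looks_cloaked(url, fetcher=fetch, threshold=0.9):
    """Visit the URL twice and flag it when the two views differ substantially."""
    crawler_view = fetcher(url, CRAWLER_HEADERS)
    browser_view = fetcher(url, BROWSER_HEADERS)
    return similarity(crawler_view, browser_view) < threshold


def fake_cloaking_fetcher(url, headers):
    """Stand-in server that serves different content to click-through visitors."""
    if "Referer" in headers:
        return "XXXXXXXXXXXXXXXXXXXX"
    return "keyword keyword keyword"


def fake_honest_fetcher(url, headers):
    """Stand-in server that serves the same content to everyone."""
    return "same page for everyone"
```

For example, `looks_cloaked("http://doorway.example/", fetcher=fake_cloaking_fetcher)` flags the page, while the honest stand-in does not. Pages with legitimately dynamic content would need a more tolerant comparison than a single ratio.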
12
Crawler-Browser Cloaking
Google Search: ringtones download
www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript disabled)
www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript enabled)
13
Crawler-Browser Cloaking
14
Click-Through Cloaking
[Screenshots, two examples. Panels: Cached page / Scripting off / Crawler View; Advertising Page from Click-throughs; Directly Visiting the Page]
15
Three Perspectives
[Diagram: the spam pipeline from "How Spammers Operate" (Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain; arrows 1. Creates, 2. Writes Splog URLs, 3. Propagates Splog URL, Returns, 4. Sends User to Doorway URL, 5. Redirects User), with the added labels Search User and Webhost]
16
Search User
17
Search User
Chose 9 popular forum software packages, written in Perl/PHP, hosted/unhosted
  WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, phpBB, Phorum, and vBulletin
Compiled popular tags and common spam terms: list of 190 keywords
  “Myspace, jewelry, casino, shopping, baseball…”
Searched for all <keyword, forum-software> pairs in Google & MSN
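
As a rough illustration of the pairing step, the sketch below builds every <keyword, forum-software> combination. Only the five keywords named on the slide are included (the full list of 190 is not reproduced here), and the query string format is an assumption, not the one used in the study.

```python
from itertools import product

# The nine forum packages listed on the slide.
FORUM_SOFTWARE = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "phpBB", "Phorum", "vBulletin",
]

# Sample only: the five keywords quoted on the slide.
SAMPLE_KEYWORDS = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(keywords, forum_software):
    """Return one search query per <keyword, forum-software> pair."""
    # Hypothetical query format: the keyword plus the quoted package name.
    return [f'{kw} "{sw}"' for kw, sw in product(keywords, forum_software)]


queries = build_queries(SAMPLE_KEYWORDS, FORUM_SOFTWARE)

# With the full keyword list, each search engine would receive this many queries.
full_query_count = 190 * len(FORUM_SOFTWARE)
```

With 190 keywords and 9 packages, the cross product comes to 1,710 queries per search engine.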
18
Search User
Search terms returned spammed forums in top 20 results from both Google and MSN
  Only exception is “palm-texas-holdem-game”
Top 5 most spammed forums:
Forum                                                  Pages   Keywords
http://fs.fed.us/...mm/get/mmforumA.html               175     102
http://www.comm.fsu.edu/interactive/forum/             134     82
http://www.usra.edu/phorum                             119     94
http://classicauthors.net/messageboard/list.php?f=1    117     97
http://samba.eecs.umich.edu/phorum/list.php?2          105     79
19
Honeyblogs
Spammers:
  Create their own doorway pages, and
  Promote the doorways by posting to other people’s pages
Honeyblogs lure the spammer in:
  No moderation, default accept-all policy
  Pinged blog aggregators with every post
  Abandoned within three months
20
Honeyblogs
41,100 comments collected over 339 days
19,297 comments received in the last month
  Ilium: 930/1432
  Litlog: 3734/5714
Spammer activity got me kicked off my hosting server
21
Honeyblog Activity
[Chart: Accumulated Comment Totals by Day. x-axis: Day (0 to 325); y-axis: Total Comments Accumulated (0 to 35,000); series: ilium_total, litlog_total, yabi_total]
22
Honeyblog Activity
[Chart: Comments Received By Day. x-axis: Day (0 to 325); y-axis: Comments Received (0 to 3,200); series: Ilium, Litlog, Yabi; labeled data point: 3142]
23
Webhost Perspective
Focus on splog doorways
Blog Host     Examined URLs   Spam URLs       URLs Using Cloaking
Blogspot      13,389          1,091 (8.1%)    652
Blogspoint    4,714           3,535 (75%)     131
Blogstudio    369             198 (54%)       0
Blogsharing   99              82 (83%)        0
• Above numbers are lower bounds
• Consider only pages using cloaking & redirection
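
The percentages in the table can be re-derived from the raw counts. The short sketch below does that arithmetic; the dictionary layout and function names are illustrative choices, while the numbers are copied from the table.

```python
# (examined URLs, spam URLs, spam URLs using cloaking) per blog host, from the table.
HOSTS = {
    "Blogspot": (13389, 1091, 652),
    "Blogspoint": (4714, 3535, 131),
    "Blogstudio": (369, 198, 0),
    "Blogsharing": (99, 82, 0),
}


def spam_share(host):
    """Percentage of examined URLs that were spam."""
    examined, spam, _ = HOSTS[host]
    return 100 * spam / examined


def cloaking_share(host):
    """Percentage of spam URLs that used cloaking."""
    _, spam, cloaked = HOSTS[host]
    return 100 * cloaked / spam


for name in HOSTS:
    print(f"{name}: {spam_share(name):.1f}% spam, "
          f"{cloaking_share(name):.1f}% of spam cloaked")
```

The Blogspot cloaking share works out to roughly 60%, which matches the figure quoted for Blogspot splogs.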
24
Webhost Perspective
Blogspot: 1,091 splogs
  Most popular
  Randomly sampled 1% of profile pages created in July and extracted all blog links: 13,389
  60% of splogs used cloaking
  24% of splogs redirected to filldirect.com
25
Webhost Perspective
Blogspoint: 3,535 splogs
  2,166 redirected to finance-web-search.com
  917 redirected to casino-web-search.com
Blogstudio: 198 splogs
  130 redirected to finance-web-search.com
  54 redirected to casino-web-search.com
Blogsharing: 82 splogs
  Plumber-related link spamming in splogs
26
Also of note… Malicious URLs
Previous work by MSR (Strider HoneyMonkey) [1] discovered sites that actively exploit browser vulnerabilities
  We tested 8 known malicious URLs for presence on the web
  Found 5 spammed in forums, 2 in link farms, 1 in referrer logs
Universal redirectors
  Redirect the user to any URL (sometimes the destination is obfuscated):
    www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=[your url here]
    http://tinyurl.com/3c7twl
    http://www.canadianpharmacyltd.com/group.php?id=59&aid=860
  Could be used to serve malicious URLs, particularly those on .edu and .gov sites
[1] Yi-Min Wang, et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.
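
One simple way to spot the unobfuscated form of a universal redirector is to look for query parameters whose value is itself an absolute URL. The sketch below is a heuristic written for illustration only; the function name and the `target.example` destination are hypothetical, and obfuscated or shortened destinations (as in the other two examples) would not be caught this way.

```python
from urllib.parse import parse_qsl, urlsplit


def embedded_targets(url):
    """Return (parameter, value) pairs whose value is an absolute http(s) URL."""
    query = urlsplit(url).query
    return [
        (name, value)
        for name, value in parse_qsl(query)  # parse_qsl also percent-decodes values
        if value.lower().startswith(("http://", "https://"))
    ]


# Redirector pattern from the slide, with a hypothetical destination filled in.
redirector = (
    "http://www.rit.edu/~ksa/cgi-bin/splinks/click.cgi"
    "?num=2&url=http%3A%2F%2Ftarget.example%2F"
)
targets = embedded_targets(redirector)

# No destination is visible in the query string of this one.
opaque = embedded_targets(
    "http://www.canadianpharmacyltd.com/group.php?id=59&aid=860"
)
```

Here `targets` holds the `url` parameter with its decoded destination, while `opaque` is empty because the destination is not exposed in the query string.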
27
Related Work (Part 1)
Diff-based cloaking
  Wu & Davison: diff-based cloaking combined with content-based analysis
  Our approach detects click-through cloaking
Content-based approaches
  Fetterly, Manasse and Najork: URL properties, clustering pages of similar content
  Mishne, Carmel, Lempel: compared statistical models of comments & target pages against post content
  Kolari, Finin and Joshi: meta tag text, anchor text, URLs
  Our approach is complementary to content-based approaches
28
Related Work (Part 2)
Measurements of Trust
  Metaxas et al.: defined trust neighborhoods
  Benczur et al.: SpamRank, which identifies outliers by looking at the PageRank of the site and its “supporters”
  Similarly, our approach propagates distrust by following redirections
Plugins to aid moderating forums/blogs
  Akismet, Bad Behavior, Spam Karma
  Our approach does not require cooperation from forum owners
29
Conclusions
Context-based approach successfully detects advanced cloaking- and redirection-based spam
Spammers are pervasive
  189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN
  Same spammer redirecting to two domains on blogspoint and blogstudio
30
Future work
There is hope!
  Economic solution
  Identifies middlemen in online advertising
Read our WWW07 paper [1]
http://wwwcsif.cs.ucdavis.edu/~niu
http://research.microsoft.com/csm/strider/
[1] Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.