A Quantitative Study of Forum Spamming Using Context-Based Analysis

Yi-Min Wang^, Ming Ma^, Yuan Niu*, Hao Chen*, Francis Hsu*

*UC Davis, ^Microsoft Research

A Look at the Web

[Figure labels: User, Spammer]

Why do we care about spam?

• Users want to
  • Look at quality pages on the web
  • Interact without the trouble of moderation
  • Surf safely
• Search engines want to
  • Provide good search results
  • Profit from ads
• We want to investigate the landscape of the problem
  • Popular battleground: web forums

Why Web Forums?

• Open communities: wikis, forums, blogs
• Increasingly easy to contribute

How Spammers Operate

[Diagram. Nodes: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain. Edge labels: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User; Returns.]

How to deal with the problem?

• Content-based approach
  • Constrained by content retrieved
  • May be deceived by tricks like cloaking and redirection
• We propose: context-based analysis

Context-based Analysis

• Consisting of
  • Redirection analysis
  • Cloaking analysis
• See dynamic content not served to crawlers
  • Use the Strider URL Tracer
• Flag large numbers of doorway pages leading to spam domains
• Based on the intuition that:
  • Publishing links is necessary to increase popularity
  • We must see the destination URL eventually

Doorways & Redirections

Google search: Coach handbag

Redirection Analysis

• Fed URLs to Strider URL Tracer, which records all pages visited
  • Ranked top 3rd-party domains by redirections
• Seeded with known spammer domains
  • Identified doorway pages based on association with spammer domains
  • Manually investigated unknown domains to expand the blacklist
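
The ranking and flagging steps above can be sketched roughly as follows. This is an illustrative sketch, not the Strider URL Tracer itself: the `traces` structure (doorway URL mapped to the list of URLs visited while loading it), the function names, and the use of the hostname as a stand-in for "domain" are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Lower-cased hostname of a URL (used here as a stand-in for its domain)."""
    return (urlsplit(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, for each third-party host, how many traced URLs reached it.

    `traces` is a hypothetical mapping: traced URL -> list of URLs visited.
    """
    counts = Counter()
    for doorway, visited in traces.items():
        first_party = host(doorway)
        third_party = {host(u) for u in visited} - {first_party, ""}
        counts.update(third_party)  # each traced URL counts once per host
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Traced URLs whose visits touched any host on the seeded blacklist."""
    return sorted(
        doorway
        for doorway, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )
```

Hosts that rank high but are not yet on the blacklist would be the candidates for the manual investigation step.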

Cloaking Analysis

• Diff-based check
  • Run URL twice: once with anti-cloaking, once without
• Crawler-browser cloaking (User-agent, scripting on/off)
• Click-through cloaking (Referer)
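
A minimal sketch of a diff-based check, assuming that the "anti-cloaking" request can be approximated by sending browser-like `User-Agent` and `Referer` headers. The header values, the `looks_cloaked` name, and the 0.9 similarity threshold are hypothetical choices for illustration. Plain HTTP fetches do not execute scripts, so the scripting on/off comparison would need a real browser and is not covered here.

```python
import difflib
import urllib.request

# Hypothetical header sets: one crawler-like, one imitating a search click-through.
CRAWLER_VIEW = {"User-Agent": "ExampleCrawler/1.0"}
BROWSER_VIEW = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://search.example/results?q=ringtones+download",
}


def fetch(url, headers):
    """Fetch a page body with the given request headers."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def looks_cloaked(url, fetcher=fetch, threshold=0.9):
    """Fetch the URL twice and report whether the two views differ substantially."""
    crawler_page = fetcher(url, CRAWLER_VIEW)
    browser_page = fetcher(url, BROWSER_VIEW)
    matcher = difflib.SequenceMatcher(None, crawler_page, browser_page, autojunk=False)
    return matcher.ratio() < threshold
```

Passing the fetcher as a parameter keeps the comparison logic testable without network access. Legitimately dynamic pages also differ between fetches, so a flag from this check is a candidate for inspection rather than a verdict.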

Crawler-Browser Cloaking

Google search: ringtones download

• www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript disabled)
• www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript enabled)

Click-Through Cloaking

[Screenshots compared: Cached page / Scripting off / Crawler view; Directly visiting the page; Advertising page from click-throughs]

Three Perspectives

[Diagram: the spammer workflow from "How Spammers Operate" (edge labels 1 to 5 and Returns), with the labels Search User and Webhost added]

Search User

• Chose 9 popular forum software packages, written in Perl/PHP, hosted/unhosted
  • WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin
• Compiled popular tags and common spam terms into a list of 190 keywords
  • “Myspace, jewelry, casino, shopping, baseball…”
• Searched for all <keyword, forum-software> pairs in Google & MSN
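
To give a sense of scale, the query set can be sketched as a Cartesian product of keywords and forum software names: 190 keywords times 9 packages is 1,710 queries per search engine. The query string format below (keyword followed by the quoted software name) is an assumption, and only the five keywords quoted on the slide are listed because the full list is not given.

```python
from itertools import product

FORUM_SOFTWARE = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]

# Only the keywords quoted on the slide; the full 190-keyword list is not shown.
SAMPLE_KEYWORDS = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(keywords, software=FORUM_SOFTWARE):
    """One query string per <keyword, forum-software> pair (format is hypothetical)."""
    return [f'{keyword} "{name}"' for keyword, name in product(keywords, software)]


if __name__ == "__main__":
    queries = build_queries(SAMPLE_KEYWORDS)
    print(len(queries), "queries, e.g.", queries[0])
```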

Search User

• Search terms returned spammed forums in the top 20 results from both Google and MSN
  • Only exception is “palm-texas-holdem-game”
• Top 5 most spammed forums:

Forum                                                  Pages   Keywords
http://fs.fed.us/...mm/get/mmforumA.html               175     102
http://www.comm.fsu.edu/interactive/forum/             134     82
http://www.usra.edu/phorum                             119     94
http://classicauthors.net/messageboard/list.php?f=1    117     97
http://samba.eecs.umich.edu/phorum/list.php?2          105     79
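
A table like the one above could be produced by aggregating spam hits per forum. The sketch below assumes that "Pages" means distinct spammed pages and "Keywords" means distinct search keywords that surfaced the forum; the slide does not define the columns, so that reading, the `hits` tuple layout, and the function name are assumptions.

```python
from collections import defaultdict


def summarize_spammed_forums(hits):
    """Aggregate (forum, page_url, keyword) hits into (forum, pages, keywords) rows.

    Rows are sorted by the number of distinct spammed pages, highest first.
    """
    pages = defaultdict(set)
    keywords = defaultdict(set)
    for forum, page_url, keyword in hits:
        pages[forum].add(page_url)
        keywords[forum].add(keyword)
    rows = [(forum, len(pages[forum]), len(keywords[forum])) for forum in pages]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```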

Honeyblogs

• Spammers:
  • Create their own doorway pages, and
  • Promote the doorways by posting to other people’s pages
• Honeyblogs lure the spammer in:
  • No moderation, default accept-all policy
  • Pinged blog aggregators with every post
  • Abandoned within three months

Honeyblogs

• 41,100 comments collected over 339 days
  • 19,297 comments received in the last month
  • Ilium: 930/1432
  • Litlog: 3734/5714
• Spammer activity got me kicked off my hosting server

Honeyblog Activity

[Chart: Accumulated Comment Totals by Day. X-axis: Day (0 to 325). Y-axis: Total Comments Accumulated (0 to 35,000). Series: ilium_total, litlog_total, yabi_total.]

Honeyblog Activity

[Chart: Comments Received By Day. X-axis: Day (0 to 325). Y-axis: Comments Received (0 to 3,200). Series: Ilium, Litlog, Yabi. Labeled value: 3142.]

Webhost Perspective

• Focus on splog doorways

Blog Host      Examined URLs   Spam URLs       URLs Using Cloaking
Blogspot       13,389          1,091 (8.1%)    652
Blogspoint     4,714           3,535 (75%)     131
Blogstudio     369             198 (54%)       0
Blogsharing    99              82 (83%)        0

• Above numbers are lower bounds
• Consider only pages using cloaking & redirection
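
The percentages in the table can be rechecked from the raw counts with a few lines of arithmetic. The sketch below only recomputes shares from the numbers shown; the second column it prints (cloaking as a share of spam URLs) is a derived figure that the table does not list.

```python
# (blog host, examined URLs, spam URLs, URLs using cloaking), copied from the table.
ROWS = [
    ("Blogspot", 13389, 1091, 652),
    ("Blogspoint", 4714, 3535, 131),
    ("Blogstudio", 369, 198, 0),
    ("Blogsharing", 99, 82, 0),
]


def pct(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


for name, examined, spam, cloaked in ROWS:
    print(
        f"{name:<12} spam share {pct(spam, examined):5.1f}%   "
        f"cloaking among spam {pct(cloaked, spam):5.1f}%"
    )
```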

Webhost Perspective

• Blogspot: 1,091 splogs
  • Most popular
  • Randomly sampled 1% of profile pages created in July and extracted all blog links: 13,389
  • 60% of splogs used cloaking
  • 24% of splogs redirected to filldirect.com

Webhost Perspective

• Blogspoint: 3,535 splogs
  • 2,166 redirected to finance-web-search.com
  • 917 redirected to casino-web-search.com
• Blogstudio: 198 splogs
  • 130 redirected to finance-web-search.com
  • 54 redirected to casino-web-search.com
• Blogsharing: 82 splogs
  • Plumber-related link spamming in splogs

Also of note… Malicious URLs

• Previous work by MSR (Strider HoneyMonkey)¹ discovered sites that actively exploit browser vulnerabilities
  • We tested 8 known malicious URLs for presence on the web
  • Found 5 spammed in forums, 2 in link farms, 1 in referrer logs
• Universal redirectors
  • Redirect the user to any URL (sometimes the destination is obfuscated):
    • www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=[your url here]
    • http://tinyurl.com/3c7twl
    • http://www.canadianpharmacyltd.com/group.php?id=59&aid=860
  • Could be used to serve malicious URLs, particularly those on .edu and .gov sites
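
One simple way to spot the unobfuscated kind of universal redirector is to look for a query parameter whose value is itself an absolute URL. This is a heuristic sketch only, and the function name and example hosts are hypothetical. It does not catch obfuscated destinations or short-link services, where the target only becomes visible by following the redirect.

```python
from urllib.parse import parse_qsl, urlsplit


def embedded_targets(url):
    """Return (parameter, value) pairs whose value looks like an absolute URL."""
    targets = []
    for name, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        parts = urlsplit(value)
        if parts.scheme in ("http", "https") and parts.netloc:
            targets.append((name, value))
    return targets


if __name__ == "__main__":
    # Hypothetical redirector link with a percent-encoded destination.
    link = "http://redirector.example/click.cgi?num=2&url=http%3A%2F%2Fdestination.example%2F"
    print(embedded_targets(link))
```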

¹ Yi-Min Wang et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.

Related Work (Part 1)

• Diff-based cloaking
  • Wu & Davison: diff-based cloaking combined with content-based analysis
  • Our approach detects click-through cloaking
• Content-based approaches
  • Fetterly, Manasse and Najork: URL properties, clustering pages of similar content
  • Mishne, Carmel, Lempel: compared statistical models of comments & target pages against post content
  • Kolari, Finin and Joshi: meta tag text, anchor text, URLs
  • Our approach is complementary to content-based approaches

Related Work (Part 2)

• Measurements of trust
  • Metaxas et al.: defined trust neighborhoods
  • Benczur et al.: SpamRank, which identifies outliers by looking at the PageRank of the site and its “supporters”
  • Similarly, our approach propagates distrust by following redirections
• Plugins to aid moderating forums/blogs
  • Akismet, Bad Behavior, Spam Karma
  • Our approach does not require cooperation from forum owners

Conclusions

• Context-based approach successfully detects advanced cloaking- and redirection-based spam
• Spammers are pervasive
  • 189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN
  • Same spammer redirecting to two domains on Blogspoint and Blogstudio

Future work

• There is hope!
  • Economic solution
  • Identifies middlemen in online advertising
• Read our WWW07 paper¹

http://wwwcsif.cs.ucdavis.edu/~niu
http://research.microsoft.com/csm/strider/

¹ Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.