SDP-MARCH-Talk: 恶意任务检测 (Malicious Task Detection), 姚大海 (Yao Dahai), 2013/11/24


Page 1:

SDP-MARCH-Talk

恶意任务检测 (Malicious Task Detection)

姚大海 (Yao Dahai), 2013/11/24

Page 2:

papers

• Characterizing and Detecting Malicious Crowdsourcing

• Detecting Deceptive Opinion Spam Using Human Computation

• SmartNotes: Application of Crowdsourcing to the Detection of Web Threats


Page 4:

outline

• malicious crowdsourcing

• measured datasets

• some initial results

Page 5:

malicious crowdsourcing

• increasing secrecy

– tracking jobs is more difficult, making them harder to detect

– details of a task are only revealed to workers that take it on

– worker accounts require association with phone numbers or bank accounts

Page 6:

malicious crowdsourcing

• behavioral signatures

– output from crowdturfing tasks is likely to display specific patterns that distinguish it from "organically" generated content

– signatures
• worker account (their behavior)
• content (bursts of content generation when tasks are first posted)

Page 7:

malicious crowdsourcing

• our methodology

– we limit our scope to campaigns that target microblogging platforms (Sina Weibo)

– first, we gather "ground truth" content generated by crowdturfers and "organic" content generated by normal users

– second, we compare and contrast these datasets

– our end goal is to develop detectors by testing them against new crowdturfing campaigns as they arrive

Page 8:

measured datasets

• crowdturf accounts on Weibo

– downloaded full user profiles of 28,947 Weibo account IDs

• crowdturf campaigns

– crawled tweets, retweets and comments of 18,335 campaigns

– 61.5 million tweets, 118 million comments and 86 million retweets (2012.11~2013.1)

Page 9:

some initial results

turkers tend to straddle the line between malicious and normal users.

crowdturfing campaigns have a higher ratio of repeated users.
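The "repeated users" ratio above can be computed directly from campaign membership. A minimal sketch (not the paper's code; the campaign data below is invented for illustration):

```python
from collections import Counter

def repeated_user_ratio(campaigns):
    """campaigns: iterable of sets of worker IDs, one set per campaign.
    Returns the fraction of distinct workers appearing in 2+ campaigns."""
    counts = Counter(user for campaign in campaigns for user in campaign)
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)

# toy example: workers "b" and "c" each appear in two campaigns
ratio = repeated_user_ratio([{"a", "b", "c"}, {"b", "c", "d"}, {"e"}])  # 2 of 5
```

Comparing this ratio between crowdturfing campaigns and organic discussion threads is one way to operationalize the observation on this slide.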

Page 10:

papers

• Characterizing and Detecting Malicious Crowdsourcing

• Detecting Deceptive Opinion Spam Using Human Computation

• SmartNotes: Application of Crowdsourcing to the Detection of Web Threats

Page 11:

outline

• introduction

• data preparation

• human assessor measurements

• writing style measurements

• classifier measurements

• hybrid measurements

• conclusion

Page 12:

introduction

• review spam

– hype spam: positive reviews

– defaming spam: negative reviews

• limitation of related work
– focus on hype spam

Page 13:

data preparation

• truthful reviews (for each of 8 products)

– 25 highly-rated reviews (HT)

– 25 low-rated reviews (LT)

• fake reviews (created on AMT)

– 25 highly-rated reviews (HD)

– 25 low-rated reviews (LD)

Page 14:

human assessor measurements

• balanced: 5 truthful and 5 deceptive reviews

• random: n deceptive reviews and (10-n) truthful reviews

1. students performed better than the crowd, but not significantly

2. detecting highly-rated reviews is easier than detecting low-rated reviews

an assessor has a "default" belief that a review must be true

Page 15:

writing style measurements

• three linguistic qualities (linguistic metrics)

– polarity

– sentiment (sentiment API at text-processing.com)

– readability: ARI

ARI = 4.71 * (C / W) + 0.5 * (W / S) - 21.43

where C = #characters, W = #words, S = #sentences
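The ARI readability score is straightforward to compute. A minimal sketch, assuming simple whitespace tokenization and punctuation-based sentence splitting (the slides do not specify a tokenizer):

```python
import re

def ari(text):
    """Automated Readability Index:
    4.71*(C/W) + 0.5*(W/S) - 21.43,
    with C = characters, W = words, S = sentences."""
    words = text.split()
    W = len(words)
    # strip trailing/leading punctuation so C counts letters and digits
    C = sum(len(w.strip(".,!?;:")) for w in words)
    S = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * (C / W) + 0.5 * (W / S) - 21.43
```

Higher scores correspond to text requiring a higher (US grade-level) reading ability, which is how the following slide compares review groups.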

Page 16:

writing style measurements

truthful reviews show higher readability

highly-rated reviews show higher readability

Page 17:

classifier measurements

• QuickLM language model toolkit

• language model score, sentiment score, and ARI as the feature set input to an SVM

our classifier outperformed our human and crowd assessors.
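A minimal sketch (not the authors' code) of this setup: each review is reduced to three numbers (language-model score, sentiment score, ARI) and fed to an SVM. scikit-learn's SVC stands in for whatever SVM implementation the paper used, and all feature values below are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# columns: [language-model score, sentiment score, ARI]; rows: reviews
X = np.array([
    [-4.2, 0.90, 8.1],  # truthful (higher readability, per the slides)
    [-4.0, 0.80, 7.5],  # truthful
    [-6.1, 0.95, 3.2],  # deceptive (lower readability)
    [-5.8, 0.90, 2.9],  # deceptive
])
y = np.array([0, 0, 1, 1])  # 0 = truthful, 1 = deceptive

clf = SVC(kernel="linear").fit(X, y)
label = clf.predict([[-4.1, 0.85, 7.8]])[0]  # features resemble the truthful rows
```

With only three dense features per review, the interesting work is in the feature extractors (the language model and sentiment API), not the classifier itself.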

Page 18:

hybrid measurements

• providing students and the crowd with additional measurement data: sentiment scores and ARI scores

providing assessors with meaningful metrics is likely to improve the quality of assessment.

Page 19:

conclusion

• outlook: if we used crowd workers who are familiar with the relevant domain, would the results beat automatic classification?

• question: the SVM performs better than the hybrid approach, so why use the hybrid approach at all?

Page 20:

papers

• Characterizing and Detecting Malicious Crowdsourcing

• Detecting Deceptive Opinion Spam Using Human Computation

• SmartNotes: Application of Crowdsourcing to the Detection of Web Threats

Page 21:

outline

• introduction

• related work

• design of SmartNotes

• web scam detection technique

Page 22:

introduction

• two types of cybersecurity threats

– threats created by factors outside the end user's control, such as security flaws in applications and protocols

– threats caused by the user's actions, such as phishing

• ways to identify these websites
– statistics
– blacklists

Page 23:

introduction

• our crowdsourcing approach
– users report security threats
– machine learning to integrate their responses

• features
– combining data from multiple sources
– combining social bookmarking with question answering
– applying machine learning and natural-language processing

Page 24:

related work

• social bookmarking
– sharing bookmarks among users

• question answering
– post questions and answer questions posed by others

• safe browsing: browser extensions

• web scam detection
– closely related to spam email detection
– content based

Page 25:

design of SmartNotes

• user interface
– Chrome browser extension
– post a comment or ask a question
– share your notes and questions with others
– analyze the current website

Page 26:

design of SmartNotes

[architecture diagram: a JavaScript front end built on the Chrome extension API lets users read and write notes and manage accounts; a back end runs machine learning algorithms, collecting 43 features from 11 sources]

Page 27:

web scam detection technique

• We need a training set of websites labeled scam or non-scam to apply our supervised machine learning technique.

• approaches to constructing a training set

• 1. scam queries (random)
– selected 100 domain names from each query and submitted them to AMT

• 2. Web of Trust (scam)
– 200 most recent discussion threads

Page 28:

web scam detection technique

• 3. spam emails (scam)
– 1,551 spam emails from a corporate email system

• 4. hpHosts (scam)
– top 100 most recently reported websites on the blacklist

• 5. popular websites (non-scam)
– top 100 websites according to the ranking on alexa.com

Page 29:

validation & result

F-score: harmonic mean of the precision and the recall

AUC: the area under the ROC curve
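Both validation metrics can be stated precisely in a few lines. A small sketch, assuming binary labels and real-valued classifier scores (AUC is computed via the equivalent Mann-Whitney rank statistic):

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(labels, scores):
    """Area under the ROC curve: the probability that a random
    positive is scored above a random negative (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike the F-score, the AUC is threshold-free, which is why papers often report both.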

Page 30:

Q&A