Jhih-sin Jheng 2009/09/01 Machine Learning and Bioinformatics Laboratory


Reference

Measurement and Classification of Humans and Bots in Internet Chat
Steven Gianvecchio, Mengjun Xie, Zhenyu Wu, and Haining Wang
Department of Computer Science, The College of William and Mary
USENIX Security, 2008

Outline
  Background
  Measurement
  Classification System
  Experimental Evaluation
  Conclusion


Chat Bots vs. BotNets
BotNets – networks of compromised machines
  some use chat systems (IRC) for C&C, others use P2P, HTTP, etc.
  abuse various systems
Chat Bots – automated chat programs
  some are helpful, e.g., chat loggers
  can abuse chat systems and their users
  send spam, spread malicious software, mount phishing attacks
Our focus is on the Yahoo! Chat system.


Measurement
August-November 2007 – we collect data
August 2007 – Yahoo! adds CAPTCHA
  very few chat bots
October 2007 – bots are back

Measurement
August and November 2007
  many chat bots
  1,440 hours of chat logs
  147 chat logs
  21 chat rooms

Measurement
To create our dataset, we read and label the chat users as human, bot, or ambiguous
In total, we recognized 14 different types of chat bots
  different triggering mechanisms
  different text generation techniques

Types of Chat Bots
Periodic Bots – send messages based on periodic timers
Random Bots – send messages based on random timers
Responder Bots – respond to messages of other users
Replay Bots – replay messages of other users

Humans
inter-message delay – evidence of heavy tail
message size – well fit by Exponential (λ=0.034)
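To make the exponential fit concrete: an exponential distribution with rate λ has mean 1/λ, so λ=0.034 corresponds to a mean message size of roughly 29. The slide does not state the unit, so treating sizes as bytes or characters is an assumption. The sketch below uses hypothetical function names and estimates λ from observed sizes by maximum likelihood (one over the sample mean); it is an illustration, not the fitting procedure used in the study.

```python
import math


def fit_exponential_rate(sizes):
    """Maximum-likelihood estimate of the exponential rate: 1 / sample mean."""
    if not sizes:
        raise ValueError("need at least one message size")
    mean = sum(sizes) / len(sizes)
    if mean <= 0:
        raise ValueError("mean message size must be positive")
    return 1.0 / mean


def exponential_tail(rate, x):
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)


HUMAN_RATE = 0.034  # value quoted on the slide

# Mean message size implied by the quoted rate.
mean_size = 1.0 / HUMAN_RATE

# Share of messages expected to exceed size 100 under this fit.
share_over_100 = exponential_tail(HUMAN_RATE, 100)

# Estimating the rate from a toy sample (made-up numbers, mean 25).
toy_rate = fit_exponential_rate([10, 20, 30, 40])
```

Under this fit only a few percent of human messages would exceed size 100, which is what a light exponential tail implies.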

Periodic Bots
inter-message delay – several clusters with high probabilities
message size – messages built from templates approximate a normal distribution

Random Bots
inter-message delay – Equilikely distribution at 40, 64, and 88; Uniform distribution 45-125
message size – messages selected from a small database
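The two random timers described above can be sketched as follows. The values 40, 64, 88 and the range 45-125 come from the slide; treating them as seconds, drawing the uniform delay as a continuous value, and all function names are assumptions made for illustration.

```python
import random

EQUILIKELY_DELAYS = (40, 64, 88)  # values quoted on the slide
UNIFORM_RANGE = (45, 125)         # range quoted on the slide


def equilikely_delay(rng):
    """Pick one of the fixed delays with equal probability."""
    return rng.choice(EQUILIKELY_DELAYS)


def uniform_delay(rng):
    """Pick a delay uniformly from the quoted range (continuous, by assumption)."""
    low, high = UNIFORM_RANGE
    return rng.uniform(low, high)


def send_times(delay_fn, count, rng):
    """Cumulative send times for `count` messages, starting from time 0."""
    now = 0.0
    times = []
    for _ in range(count):
        now += delay_fn(rng)
        times.append(now)
    return times


rng = random.Random(0)  # fixed seed so the run is repeatable
equilikely_times = send_times(equilikely_delay, 5, rng)
uniform_times = send_times(uniform_delay, 5, rng)
```

Even though both timers are random, the equilikely bot only ever produces three distinct delays, which is the kind of regularity an entropy measure can pick up.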

Responder Bots
inter-message delay – human-like timing
message size – multiple templates of different lengths

Replay Bots
inter-message delay – cluster with high probabilities (replay bots are periodic)
message size – human-like size, well fit by Exponential (λ=0.028)


Classification System
Entropy Classifier
  detects abnormal behavior
  based on message sizes and inter-message delays
  accurate but slow
Machine Learning Classifier
  detects “learned” patterns
  based on message content
  fast but must be trained

Entropy Classifier
Observation – chat bots are less complex than humans, and thus, lower in entropy
  exploits the low entropy of chat bots
Corrected Conditional Entropy Test (CCE)
  estimates higher-order entropy
Entropy Test (EN)
  estimates first-order entropy
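The slide names the two tests without defining them. The following is a minimal sketch under the commonly used definitions, which are an assumption here: EN is the Shannon entropy of single binned values, and CCE at order m is the conditional entropy of length-m patterns plus a correction term perc(m) · EN, where perc(m) is the fraction of length-m patterns seen exactly once, with the test statistic taken as the minimum over m. The bin count, the quantile binning, the maximum order, the clamping of negative estimates, and every function name are illustrative choices, not values from the slides.

```python
import math
from bisect import bisect_right
from collections import Counter


def quantize(values, bins=5):
    """Map values to bin indices using quantile edges (equal values share a bin)."""
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(i * n) // bins] for i in range(1, bins)]
    return [bisect_right(edges, v) for v in values]


def pattern_counts(symbols, m):
    """Count every run of m consecutive symbols."""
    return Counter(tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1))


def entropy(counts):
    """Shannon entropy, in bits, of a Counter of observations."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def first_order_entropy(symbols):
    """EN: entropy of individual symbols."""
    return entropy(pattern_counts(symbols, 1))


def corrected_conditional_entropy(symbols, max_order=5):
    """CCE: minimum over m of CE(m) + perc(m) * EN."""
    en1 = first_order_entropy(symbols)
    best = en1
    prev_h = 0.0
    for m in range(1, max_order + 1):
        counts = pattern_counts(symbols, m)
        if not counts:
            break
        h = entropy(counts)
        # Conditional entropy estimate; clamp small negative edge effects to 0.
        ce = max(0.0, h - prev_h)
        unique = sum(1 for c in counts.values() if c == 1)
        perc = unique / sum(counts.values())
        best = min(best, ce + perc * en1)
        prev_h = h
    return best


# A constant delay collapses into one bin, so its first-order entropy is zero.
periodic_en = first_order_entropy(quantize([30.0] * 100))

# An alternating sequence looks random to EN but is fully predictable to CCE.
alternating = [0, 1] * 50
alternating_en = first_order_entropy(alternating)
alternating_cce = corrected_conditional_entropy(alternating)
```

The alternating sequence illustrates why a higher-order test is useful: its first-order entropy is a full bit, yet each symbol is determined by the previous one, so the corrected conditional entropy drops to zero.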

Machine Learning Classifier
Observation – chat spam, like email spam, is a text classification problem
  exploits message content of chat bots
CRM114
  a powerful text classification system
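To show what "text classification on message content" means in practice, here is a small stand-in classifier. It is a plain multinomial naive Bayes with add-one smoothing, which is deliberately not CRM114 and does not use any CRM114 API; the class name, labels, and training sentences are all made up for illustration.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Tiny two-class text classifier (a stand-in, not CRM114)."""

    def __init__(self):
        self.word_counts = {"bot": Counter(), "human": Counter()}
        self.doc_counts = Counter()

    def train(self, label, text):
        """Add one labeled message to the corresponding corpus."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        """Return the label with the higher log-probability score."""
        vocab = set(self.word_counts["bot"]) | set(self.word_counts["human"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Smoothed class prior over the two labels.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in text.lower().split():
                # Add-one smoothing; the extra +1 reserves mass for unseen words.
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        return max(scores, key=scores.get)


clf = NaiveBayesTextClassifier()
clf.train("bot", "visit my webcam at this link")
clf.train("bot", "free pics click this link now")
clf.train("human", "hey how are you doing today")
clf.train("human", "lol that movie was great")

spam_label = clf.classify("click this link for free pics")
chat_label = clf.classify("how are you today")
```

A classifier of this kind is fast once trained, but it can only recognize wording that resembles its training corpora, which matches the trade-off stated on the Classification System slide.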

Hybrid Classification System
entropy classifier builds and maintains the bot corpus
machine learning classifier uses the bot and human corpora

[Diagram: INPUT, ENTROPY CLASSIFIER, MACHINE LEARNING CLASSIFIER, BOT CORPUS, HUMAN CORPUS, CLASSIFY AS CHAT BOT, CLASSIFY AS HUMAN]
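One plausible reading of the hybrid design can be sketched as a control flow: the content-based classifier handles wording it already knows, and anything flagged by the entropy classifier is added to the bot corpus so that similar messages are caught by content next time. The order of the two checks is an assumption, and both helper functions are crude placeholders (identical message sizes as a low-entropy proxy, word overlap as a content match), not the classifiers from the slides.

```python
def looks_low_entropy(messages):
    """Placeholder for the entropy classifier: flags identical message sizes."""
    return len(messages) > 1 and len({len(m) for m in messages}) == 1


def matches_bot_corpus(messages, bot_corpus, human_corpus):
    """Placeholder for the ML classifier: compares word overlap with each corpus."""
    words = set(" ".join(messages).lower().split())
    bot_words = set(" ".join(bot_corpus).lower().split())
    human_words = set(" ".join(human_corpus).lower().split())
    return len(words & bot_words) > len(words & human_words)


def hybrid_classify(messages, bot_corpus, human_corpus):
    """Classify a user's messages and grow the bot corpus on entropy hits."""
    if matches_bot_corpus(messages, bot_corpus, human_corpus):
        return "bot"
    if looks_low_entropy(messages):
        # The entropy classifier builds and maintains the bot corpus.
        bot_corpus.extend(messages)
        return "bot"
    return "human"


bot_corpus = []
human_corpus = ["hey how are you", "not much just chatting"]

# Unknown bot: caught by the entropy placeholder, then added to the corpus.
first = hybrid_classify(["check out my pics", "check out my site"], bot_corpus, human_corpus)

# Same bot family later: caught by content alone, from a single message.
second = hybrid_classify(["check out my new pics now"], bot_corpus, human_corpus)

# Ordinary conversation: neither check fires.
third = hybrid_classify(["how are you", "just chatting here"], bot_corpus, human_corpus)
```

The point of the arrangement is that the slow, training-free check supplies labeled examples to the fast, trained check, so no manual relabeling is needed when new bots appear.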


Experimental Evaluation
Types of Chat Bots
  Periodic Bots
  Random Bots
  Responder Bots
  Replay Bots
Classifiers
  entropy classifier – 100 messages
  machine learning classifier – 25 messages

Experimental Evaluation
Classification Tests
  Ent – entropy classifier
  SupML – fully-supervised ML classifier, trained on AUG BOTS
  SupMLre – fully-supervised ML classifier, retrained on NOV BOTS
  EntML – entropy-trained ML on AUG BOTS

Entropy Classifier
  EN – entropy
  CCE – corrected conditional entropy
  (imd) – inter-message delay
  (ms) – message size

          AUG BOTS                    NOV BOTS
test      periodic  random  respond   periodic  random   replay  human
          TP        TP      TP        TP        TP       TP      FP
EN(imd)   121/121   68/68   1/30      51/51     109/109  40/40   7/1713
CCE(imd)  121/121   49/68   4/30      51/51     109/109  40/40   11/1713
EN(ms)    92/121    7/68    8/30      46/51     34/109   0/40    7/1713
CCE(ms)   77/121    8/68    30/30     51/51     6/109    0/40    11/1713
OVERALL   121/121   68/68   30/30     51/51     109/109  40/40   17/1713

EN(imd) and CCE(imd)
  problems against responder bots
  detect most other chat bots

EN(ms) and CCE(ms)
  problems against random and replay bots
  detect most other chat bots

OVERALL
  detects all chat bots
  false positive rate is ~0.01
  100 messages

Entropy and Machine Learning Classifiers
  Ent – entropy classifier (from last slide)
  SupML – fully-supervised ML classifier, trained on AUG BOTS
  SupMLre – fully-supervised ML classifier, retrained on NOV BOTS
  EntML – entropy-trained ML on AUG BOTS

          AUG BOTS                    NOV BOTS
test      periodic  random  respond   periodic  random   replay  human
          TP        TP      TP        TP        TP       TP      FP
Ent       121/121   68/68   30/30     51/51     109/109  40/40   17/1713
SupML     121/121   68/68   30/30     14/51     104/109  1/40    0/1713
SupMLre   121/121   68/68   30/30     51/51     109/109  40/40   0/1713
EntML     121/121   68/68   30/30     51/51     109/109  40/40   1/1713

Ent
  OVERALL results of the entropy classifier

SupML
  has problems against November bots
  needs to be retrained for new bots
SupMLre
  detects all bots

EntML
  false positive rate is ~0.0005 (Ent is ~0.01)
  25 messages
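The quoted rates can be checked directly from the counts in the results table. The short calculation below uses only numbers that appear in the slides; the variable names are illustrative. Note that 1/1713 works out to slightly under 0.0006, the same order of magnitude as the ~0.0005 quoted above.

```python
HUMANS = 1713  # human users in the test set

# False positives per classifier, from the human (FP) column.
FALSE_POSITIVES = {"Ent": 17, "SupML": 0, "SupMLre": 0, "EntML": 1}

# November bots in the test set and how many SupML detected.
NOV_BOTS = {"periodic": 51, "random": 109, "replay": 40}
SUPML_NOV_DETECTED = {"periodic": 14, "random": 104, "replay": 1}

# False positive rate = flagged humans / all humans.
fp_rates = {name: fp / HUMANS for name, fp in FALSE_POSITIVES.items()}

# Share of November bots caught by the classifier trained only on August bots.
supml_nov_rate = sum(SUPML_NOV_DETECTED.values()) / sum(NOV_BOTS.values())

for name, rate in fp_rates.items():
    print(f"{name}: false positive rate {rate:.4f}")
print(f"SupML on November bots: {supml_nov_rate:.3f}")
```

SupML catches 119 of the 200 November bots, which is the gap that retraining (SupMLre) or entropy-driven training (EntML) closes.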


Conclusion
Measurements
  overall, chat bots are less complex than humans
  some chat bots are more human-like
Classification System
  exploits benefits of both classifiers
  quickly classifies known chat bots
  accurately classifies unknown chat bots

Thank you!