Graph-based KNN Algorithm for Spam SMS Detection

Tran Phuc Ho, Ho-Seok Kang, Sung-Ryul Kim. Journal of Universal Computer Science, vol. 19, no. 16 (2013)

* Spam SMS: advertisements by commercial companies, and hacking messages for cheating and stealing personal information.

* Content-based approach: graph-based text representation + KNN algorithm, classifying each message as spam or normal

* Labeled small message groups: 5 messages per group (in real time, only 1 message)

* Tokenize them by white space and punctuation
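The tokenization step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tokenize`, the lowercasing, and the treatment of `_` as a separator are assumptions made for the example.

```python
import re

def tokenize(message: str) -> list[str]:
    """Split a message on white space and punctuation, dropping empty tokens."""
    # [\W_]+ matches runs of non-word characters (spaces, punctuation) and underscores.
    # Lowercasing is an assumption of this sketch, not something the slides state.
    return [tok.lower() for tok in re.split(r"[\W_]+", message) if tok]

tokens = tokenize("FREE entry!! Call 0800-123 now, win a prize.")
print(tokens)
```

Note that splitting on punctuation also breaks up phone numbers such as `0800-123`, which may or may not be desirable for spam detection.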

* Remove the noisy features and select the good ones, using Mutual Information (MI) and the χ²-statistic (CHI)

* MI: the dependence between a word (t) and a type of message (c)

t: token (word or phrase)

c: class (type of message: spam or ham)

P(t ∧ c): the probability that t and c co-occur

P(t | c): the conditional probability of t in c

P(t): probability of t
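The formula itself did not survive on this slide. A commonly used form of the mutual information criterion for text feature selection, consistent with the three quantities labeled above, is given below. It is an assumption that this is the exact expression the authors use.

```latex
MI(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)} = \log \frac{P(t \mid c)}{P(t)}
```

The second equality follows from P(t ∧ c) = P(t | c) P(c). A value near zero means t carries little information about c.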

* CHI: the lack of independence between a word (t) and a type of message (c)

t: token (word or phrase)

c: class (type of message: spam or ham)

P(t): probability of t

P(t, c): the probability that t and c co-occur

P(c): probability that the text belongs to c
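The formula for this slide is also missing. A standard probability form of the χ² statistic for a token t and a class c is sketched below, where N is the total number of messages and a bar denotes absence (t̄: the token does not occur, c̄: the message is not of class c). Whether the authors use exactly this form is an assumption.

```latex
\chi^2(t, c) = \frac{N \left[ P(t, c)\,P(\bar{t}, \bar{c}) - P(t, \bar{c})\,P(\bar{t}, c) \right]^2}{P(t)\,P(\bar{t})\,P(c)\,P(\bar{c})}
```

The statistic is zero when t and c are independent and grows as the dependence between them becomes stronger.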

* Calculate the weight of each feature

* Use the highly weighted words for constructing the graphs

Weighting methods: CHI (χ²-statistic), MI (Mutual Information)
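As a rough illustration of how such feature weights could be computed from message counts, the sketch below uses the standard count-based forms of MI and χ². The function name `mi_and_chi` and its parameters are hypothetical, and the paper's own estimation details may differ.

```python
import math

def mi_and_chi(n_tc: int, n_t: int, n_c: int, n: int) -> tuple[float, float]:
    """Return (MI, CHI) for a token t and class c.

    n_tc: messages of class c that contain t
    n_t:  messages that contain t
    n_c:  messages of class c
    n:    all messages
    """
    p_t, p_c, p_tc = n_t / n, n_c / n, n_tc / n
    # MI = log( P(t and c) / (P(t) P(c)) ); undefined when t never occurs in c.
    mi = math.log(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")

    # 2x2 contingency table of token presence against class membership.
    a = n_tc             # t present, class c
    b = n_t - n_tc       # t present, other class
    c = n_c - n_tc       # t absent, class c
    d = n - n_t - c      # t absent, other class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return mi, chi

# Toy numbers: 100 messages, 20 spam, token in 10 messages, 8 of them spam.
mi, chi = mi_and_chi(n_tc=8, n_t=10, n_c=20, n=100)
print(mi, chi)
```

Tokens would then be ranked by one of the two scores, and only the highest-weighted ones kept as feature words.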

* G = (V, E, FWN)

V: set of nodes (each node is a unique word: a token selected by feature selection)

E: set of weighted edges linking the nodes (the order and co-occurrence relationship between two feature words; if two feature words co-occur within a step length, assign an edge)

FWN: feature weight matrix (weights of edges and nodes)
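The edge rule above can be sketched as follows. The name `cooccurrence_edges` is hypothetical, and the sketch assumes the step length is counted over the selected feature words only, after non-feature tokens are dropped; the slides do not say how the distance is measured.

```python
def cooccurrence_edges(tokens, features, step=1):
    """Return the set of directed edges (first_word, second_word)."""
    # Keep only tokens that survived feature selection (assumption: distance
    # is measured on this filtered sequence).
    kept = [tok for tok in tokens if tok in features]
    edges = set()
    for pos, first in enumerate(kept):
        # Words that follow `first` within the step length.
        for second in kept[pos + 1 : pos + 1 + step]:
            if second != first:
                edges.add((first, second))  # direction preserves word order
    return edges

edges = cooccurrence_edges(
    ["read", "the", "scientific", "paper"],
    features={"read", "scientific", "paper"},
    step=1,
)
print(edges)
```

Because the edges are directed, "scientific paper" and "paper scientific" produce different edges.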

* FWN (feature weight matrix): weights of the edges, and probabilities of the tokens represented by the nodes

W_ij: co-occurrence frequency of two feature words f_i and f_j within a step length

Only the weights W_ij with i > j are calculated, ex) "scientific paper"; the opposite order is zero, ex) "paper scientific"

Node weights: frequency of single words
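A small sketch of such a matrix is given below. The helper name `feature_weight_matrix` is hypothetical. The index order that the slide's i > j restriction refers to is not recoverable here, so the sketch simply indexes words alphabetically and stores the count for each ordered pair (f_i occurring before f_j) at position [i][j], with single-word frequencies on the diagonal.

```python
def feature_weight_matrix(kept, step=1):
    """Build (vocab, matrix) from a sequence of already-selected feature words."""
    vocab = sorted(set(kept))
    index = {word: k for k, word in enumerate(vocab)}
    fwn = [[0] * len(vocab) for _ in vocab]
    for pos, first in enumerate(kept):
        i = index[first]
        fwn[i][i] += 1  # diagonal: frequency of the single word
        for second in kept[pos + 1 : pos + 1 + step]:
            if second != first:
                fwn[i][index[second]] += 1  # ordered co-occurrence count
    return vocab, fwn

vocab, fwn = feature_weight_matrix(["scientific", "paper"])
print(vocab)
print(fwn)
```

For the slide's example, the entry for "scientific" followed by "paper" is 1, while the entry for the reverse order stays 0.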

* KNN: among the K nearest neighbors of the text T to be classified, the class of T is the most frequently appearing class in this collection

1. Build sample graphs (elements)

2. New message comes in

3. Build a testing graph

Similarity of two graphs -> Feature Weight (FW): weights of the edges + weight of the edge itself (for edges that appear in the two graphs)
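The wording on the slide leaves the exact sum ambiguous. The sketch below shows one possible reading, stated as an assumption: every edge present in both graphs contributes its own weight plus the weights of its two end nodes, all taken from the sample graph. The function name `feature_weight` and the dictionary layout are hypothetical.

```python
def feature_weight(test_edges, sample_nodes, sample_edges):
    """Similarity of a test graph and a sample graph under the assumed reading.

    test_edges:   dict or set of (u, v) edges in the test graph
    sample_nodes: dict word -> node weight in the sample graph
    sample_edges: dict (u, v) -> edge weight in the sample graph
    """
    total = 0.0
    for (u, v), weight in sample_edges.items():
        if (u, v) in test_edges:  # the edge appears in both graphs
            total += weight + sample_nodes.get(u, 0.0) + sample_nodes.get(v, 0.0)
    return total

sample_nodes = {"free": 2, "prize": 1, "call": 1}
sample_edges = {("free", "prize"): 1, ("prize", "call"): 2}
test_edges = {("free", "prize"): 1, ("win", "free"): 1}
print(feature_weight(test_edges, sample_nodes, sample_edges))
```

Only the shared edge ("free", "prize") contributes in this toy example.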

* Testing graph (tg) vs. sample graphs (sg_1, sg_2, …, sg_n)

Nfp: how many nodes in the sample graph with their weights larger than 0 also appear in the test graph

If Nfp > threshold, calculate FW(tg,sg1)

List (RL):

1  FW(tg,sg1)=2  Spam
2  FW(tg,sg2)=3  Spam
…  FW(tg,sg3)=4  Normal
K  FW(tg,sg4)=5  Spam

* Testing graph (tg) vs. sample graphs (sg_1, sg_2, …, sg_n)

If Nfp > threshold, calculate FW(tg,sg5) = 6

List (RL):

1  FW(tg,sg5)=6  Spam
2  FW(tg,sg2)=3  Spam
…  FW(tg,sg3)=4  Normal
K  FW(tg,sg4)=5  Spam

Result: spam message
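The classification loop shown on these two slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the name `knn_classify`, the dictionary layout of the graphs, and the pluggable `similarity` callable are all hypothetical, and the list RL is modeled simply as the K highest-scoring sample graphs.

```python
import heapq
from collections import Counter

def knn_classify(test_graph, samples, similarity, k=4, nfp_threshold=0):
    """Return the majority label among the K most similar sample graphs."""
    scored = []
    for sample in samples:
        # Nfp: sample nodes with weight > 0 that also appear in the test graph.
        nfp = sum(1 for word, weight in sample["nodes"].items()
                  if weight > 0 and word in test_graph["nodes"])
        if nfp > nfp_threshold:  # only then is the feature weight computed
            scored.append((similarity(test_graph, sample), sample["label"]))
    # RL: the K entries with the largest feature weight.
    top_k = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    if not top_k:
        return None  # no sample graph shares enough nodes with the test graph
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Toy data mirroring the FW values on the slides (precomputed scores).
samples = [
    {"nodes": {"free": 1}, "label": "spam", "fw": 2},
    {"nodes": {"free": 1}, "label": "spam", "fw": 3},
    {"nodes": {"free": 1}, "label": "normal", "fw": 4},
    {"nodes": {"free": 1}, "label": "spam", "fw": 5},
    {"nodes": {"free": 1}, "label": "spam", "fw": 6},
]
test_graph = {"nodes": {"free": 1}}
print(knn_classify(test_graph, samples, similarity=lambda tg, sg: sg["fw"], k=4))
```

With K = 4, the entry with FW = 2 drops out of the list once the sample with FW = 6 is scored, and the majority class of the remaining entries is spam.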

* NUS SMS Corpus (5,574 messages): 4,827 normal (86.6%), 747 spam (13.4%)

* [Uysal and Yildiz] SMS collection (875 messages): 450 normal, 425 spam

[Figure: experimental results, reported in (%) and in (seconds)]

* Spam SMS messages are evolving, so keywords are hard to capture.

* ex) obfuscated Korean keywords such as 대★출 (대출, "loan"), 이ㅈr (이자, "interest"), <<통>> / <<장>> (통장, "bankbook"); no spaces or punctuation; no specific keyword; the same content with other phone numbers; no words, only an image …

* Graph patterns of communication between sender and receiver should be added to the content-based approach.
