Graph-based KNN Algorithm for Spam SMS Detection
Tran Phuc Ho, Ho-Seok Kang, Sung-Ryul Kim
Journal of Universal Computer Science, vol. 19, no. 16 (2013)
*
* Spam SMS: advertisements from commercial companies, and malicious messages meant to cheat recipients and steal personal information.
* Content-based approach: graph-based text representation + KNN algorithm, classifying each message as spam or normal
* Labeled small message groups: 5 messages per group (in real time, only 1 message)
* Tokenize them by white space and punctuation
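The tokenization step above can be sketched as follows. This is only an illustration: the function name `tokenize`, the lowercasing, and the exact set of characters treated as punctuation are assumptions, not details given in the slides.

```python
import re


def tokenize(message: str) -> list[str]:
    """Split a message on runs of white space and punctuation.

    Assumption: tokens are lowercased, and every character that is not a
    letter or digit (including the underscore) counts as a separator.
    """
    return [tok for tok in re.split(r"[\W_]+", message.lower()) if tok]


print(tokenize("FREE entry!! Text WIN to 80082, now."))
```

For a message group, the same function would simply be applied to each of the messages in the group.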
*
* Feature selection: remove the noisy features and select the good ones, using Mutual Information (MI) or the χ² statistic (CHI)
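* Mutual Information (MI): the dependence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
$MI(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)} = \log \frac{P(t \mid c)}{P(t)}$
$P(t \wedge c)$: the probability that t and c co-occur
$P(t \mid c)$: the conditional probability of t in c
$P(t)$: the probability of t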
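As a rough sketch, the dependence between a token and a class can be estimated from message counts using the usual mutual information definition, log of P(t and c) over P(t) times P(c). The function name, the count-based estimates, and the example numbers are hypothetical and not taken from the paper.

```python
import math


def mutual_information(n_tc: int, n_t: int, n_c: int, n: int) -> float:
    """Estimate MI(t, c) = log( P(t and c) / (P(t) * P(c)) ) from counts.

    n_tc: messages of class c that contain t
    n_t:  messages that contain t
    n_c:  messages of class c
    n:    all messages
    """
    if n_tc == 0:
        return float("-inf")  # t never occurs in class c
    # (n_tc / n) / ((n_t / n) * (n_c / n)) simplifies to the ratio below.
    return math.log(n_tc * n / (n_t * n_c))


# Hypothetical counts: 100 messages, 20 spam, "free" in 10 messages, 8 of them spam.
print(mutual_information(n_tc=8, n_t=10, n_c=20, n=100))
```

A value near zero means the token tells us little about the class; larger values indicate a stronger association.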
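* χ² statistic (CHI): the lack of independence between a word (t) and a type of message (c)
t: token (word or phrase)
c: class (type of message: spam or ham)
$\chi^2(t, c) = \frac{N \left[ P(t \wedge c)\,P(\bar{t} \wedge \bar{c}) - P(t \wedge \bar{c})\,P(\bar{t} \wedge c) \right]^2}{P(t)\,P(\bar{t})\,P(c)\,P(\bar{c})}$
$N$: the total number of messages
$P(t)$: the probability of t
$P(t \wedge c)$: the probability that t and c co-occur
$P(\bar{t})$: the probability that t does not occur
$P(c)$: the probability that the text belongs to c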
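A minimal sketch of the χ² statistic, computed from a 2×2 contingency table of message counts in the standard count form N(ad - cb)² / ((a+c)(b+d)(a+b)(c+d)). The function name, argument names, and example counts are assumptions made for illustration.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of token t and class c from a 2x2 table.

    a: messages of class c that contain t
    b: messages of other classes that contain t
    c: messages of class c that do not contain t
    d: messages of other classes that do not contain t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no evidence either way
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts: 100 messages, 20 spam, "free" in 10 messages, 8 of them spam.
print(chi_square(a=8, b=2, c=12, d=78))
```

When the token is distributed independently of the class, `a * d` equals `c * b` and the statistic is zero.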
*
* Calculate the weight of each feature
* Use the highest-weighted words to construct the graphs
CHI (χ² statistic)
MI (Mutual Information)
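Selecting the highest-weighted words could look like the sketch below. The function name, the cutoff `k`, and the example weights are hypothetical; the slides do not say how many features are kept.

```python
def select_features(weights: dict[str, float], k: int) -> list[str]:
    """Return the k words with the highest feature weight (CHI or MI).

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(weights, key=lambda word: (-weights[word], word))[:k]


# Hypothetical feature weights.
weights = {"free": 25.0, "win": 18.2, "the": 0.3, "call": 9.1}
print(select_features(weights, k=2))
```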
*
G = (V, E, FWN)
V: set of nodes; each node is a unique word (a token selected by feature selection)
E: set of weighted edges linking the nodes; an edge represents the order and co-occurrence relationship between two feature words (if feature words co-occur within a step length, assign an edge)
FWN: feature weight matrix (weights of edges and nodes)
*
Feature weight matrix (FWN): weights of edges, and probability of tokens represented by nodes
W_ij: co-occurrence frequency of two feature words f_i and f_j within a step length
Only the weights W_ij with i > j are calculated, ex) "scientific paper"
The weight for the reverse order is zero, ex) "paper scientific"
Node weights: frequency of single words
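The graph construction described above can be sketched as follows. This is an illustration under stated assumptions: node weights are raw frequencies, the step length is measured in token positions of the original message, edges are keyed by the ordered pair (earlier word, later word), and the name `build_graph` is hypothetical.

```python
from collections import Counter


def build_graph(tokens: list[str], features: set[str], step: int = 2):
    """Build (node_weights, edge_weights) for one message or message group.

    node_weights[w]      : frequency of feature word w
    edge_weights[(a, b)] : how often b follows a within `step` positions
    """
    nodes = Counter(tok for tok in tokens if tok in features)
    edges = Counter()
    for i, first in enumerate(tokens):
        if first not in features:
            continue
        # Look ahead at most `step` positions; order matters.
        for second in tokens[i + 1 : i + 1 + step]:
            if second in features and second != first:
                edges[(first, second)] += 1
    return nodes, edges


tokens = ["scientific", "paper", "on", "scientific", "paper"]
nodes, edges = build_graph(tokens, {"scientific", "paper"}, step=1)
print(nodes)
print(edges[("scientific", "paper")], edges[("paper", "scientific")])
```

With a step length of 1, "scientific paper" is counted twice while the reverse order "paper scientific" stays at zero, mirroring the example on the slide.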
*
KNN: among the K nearest neighbors of the text T to be classified, the class of T is the class that appears most frequently
1. Build sample graphs (elements)
2. A new message comes in
3. Build a testing graph
Similarity of two graphs -> Feature Weight (FW): weights of the edges + weight of the edge itself, for edges that appear in the two graphs
*
Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)
List (RL)
Rank  Feature weight   Class
1     FW(tg,sg1)=2     Spam
2     FW(tg,sg2)=3     Spam
…     FW(tg,sg3)=4     Normal
K     FW(tg,sg4)=5     Spam
Nfp: how many nodes in the sample graph with weights larger than 0 also appear in the testing graph
If Nfp > threshold, calculate FW(tg,sg1)
0.0001
3
*
Testing graph (tg), sample graphs (sg_1, sg_2, … sg_n)
List (RL)
Rank  Feature weight   Class
1     FW(tg,sg5)=6     Spam
2     FW(tg,sg2)=3     Spam
…     FW(tg,sg3)=4     Normal
K     FW(tg,sg4)=5     Spam
If Nfp > threshold, calculate FW(tg,sg5) = 6
Result: spam message
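The classification walk-through above can be sketched in a few lines. Several details here are assumptions rather than the authors' exact method: a graph is a pair (node weights, edge weights), FW is taken as the sum, over edges present in both graphs, of the edge's weight in each graph, and the names `shared_nodes`, `feature_weight`, `classify`, and `nfp_threshold` are hypothetical.

```python
from collections import Counter


def shared_nodes(tg, sg) -> int:
    """Nfp: nodes of the sample graph with weight > 0 that also appear in tg."""
    return sum(1 for node, w in sg[0].items() if w > 0 and node in tg[0])


def feature_weight(tg, sg) -> float:
    """Assumed FW: for each edge in both graphs, add its weight in tg and in sg."""
    return sum(w + sg[1][edge] for edge, w in tg[1].items() if edge in sg[1])


def classify(tg, samples, k: int = 3, nfp_threshold: int = 1):
    """Vote among the k sample graphs with the highest FW against tg."""
    ranked = []
    for sg, label in samples:
        # Only compute FW for sample graphs that share enough nodes with tg.
        if shared_nodes(tg, sg) > nfp_threshold:
            ranked.append((feature_weight(tg, sg), label))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical graphs: (node weights, edge weights).
tg = ({"free": 1, "win": 1, "prize": 1},
      {("free", "win"): 1, ("win", "prize"): 1})
samples = [
    (({"free": 2, "win": 1, "prize": 1},
      {("free", "win"): 2, ("win", "prize"): 1}), "spam"),
    (({"free": 1, "win": 1, "call": 1}, {("free", "win"): 1}), "spam"),
    (({"lunch": 1, "free": 1, "today": 1}, {("free", "today"): 1}), "normal"),
]
print(classify(tg, samples))
```

The Nfp check acts as a cheap filter: sample graphs that share too few nodes with the testing graph are skipped before any edge comparison is done.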
*
NUS SMS Corpus (5,574 messages): 4,827 normal (86.6%), 747 spam (13.4%)
[Uysal and Yildiz] SMS collection (875 messages): 450 normal, 425 spam
*
[Charts: experimental results, axes in (%) and (seconds)]
*
* Spam SMS messages keep evolving, which makes keywords hard to capture.
* ex) obfuscated keywords such as 대★출 ("loan"), 이ㅈr ("interest"), <<통>> / <<장>> ("bank account"); no spaces or punctuation; no specific keyword; the same content sent from other phone numbers; no words, only an image …
* Graph patterns of communication between sender and receiver should be combined with the content-based approach.