Deep Belief Networks for Spam Filtering

Motivation

• Text mining + Network effect

[Figure: spam filtering system. Labels: SMS corpus, content analysis, network analysis, spam score, spam / ham]

Many users’ data are needed

Deep Belief Networks (DBNs)

• What is a DBN (for classification)?
– A feedforward neural network with a deep architecture (many hidden layers)
– Consists of: visible (input) units, hidden units, output units (for classification, one for each class)

• Parameters of a DBN
– W(j): weights between the units of layers j-1 and j
– b(j): biases of layer j (no biases in the input layer)
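As a rough illustration of how W(j) and b(j) are used, the sketch below runs one forward pass through a small network. The sigmoid activation, the toy layer sizes and the function names are assumptions made for the example; the slides do not specify them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(x, W, b):
    # W[i][k] connects unit i of the lower layer to unit k of the upper layer;
    # b[k] is the bias of upper-layer unit k.
    return [sigmoid(b[k] + sum(x[i] * W[i][k] for i in range(len(x))))
            for k in range(len(b))]

def forward(x, weights, biases):
    # weights[j] and biases[j] play the role of W(j+1) and b(j+1) on the slide.
    activation = x
    for W, b in zip(weights, biases):
        activation = layer_output(activation, W, b)
    return activation

# Toy 3-2-2 network with all-zero parameters (illustrative values only).
weights = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
           [[0.0, 0.0], [0.0, 0.0]]]
biases = [[0.0, 0.0], [0.0, 0.0]]
output = forward([1, 0, 1], weights, biases)
```

With all parameters at zero every unit outputs 0.5, which makes the toy result easy to check by hand.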

Training a DBN

• Conventional approach: gradient-based optimization
– Random initialization of weights and biases
– Adjustment by backpropagation

Optimization algorithms get stuck in poor solutions due to random initialization

• Solution
– Hinton et al. [2006] proposed a greedy layer-wise unsupervised algorithm for initializing the DBN parameters
– Initialization phase: initialize each layer by treating it as a Restricted Boltzmann Machine (RBM)

Restricted Boltzmann Machines (RBMs)

• An RBM is a two-layer neural network
– Binary inputs (visible units) are connected to binary outputs (hidden units) using symmetrically weighted connections

• Parameters of an RBM
– W: weights between the two layers
– b, c: biases for the visible and hidden layers respectively

• Layer-to-layer conditional distributions
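The formulas for these conditionals did not survive on the slide. For a binary RBM with the parameters named above (weights W, visible biases b, hidden biases c), the standard form is given below. The index convention W_{ij} (visible unit i, hidden unit j) is an assumption.

```latex
P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i W_{ij}\, v_i\Big), \qquad
P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij}\, h_j\Big), \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```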

[Figure: two-layer RBM. Label: bidirectional connections]

RBM Training

• For every training example
1. Propagate it from visible to hidden units
2. Sample from the conditional P(h|v)
3. Propagate the sample in the opposite direction using P(v|h) ⇒ confabulation of the original data
4. Update the hidden units once more using the confabulation

• Update the RBM parameters
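The four steps above match what is commonly called one-step contrastive divergence (CD-1). A minimal sketch follows. The learning rate, the use of hidden probabilities rather than samples in the update, and all function names are assumptions, not details taken from the slides.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    # P(h_j = 1 | v) for every hidden unit j
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    # P(v_i = 1 | h) for every visible unit i
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    # Draw a binary state for each unit from its activation probability.
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One CD-1 step on a single binary example v0; updates W, b, c in place."""
    ph0 = hidden_probs(v0, W, c)                # 1. visible -> hidden
    h0 = sample(ph0, rng)                       # 2. sample the hidden units
    v1 = sample(visible_probs(h0, W, b), rng)   # 3. confabulation
    ph1 = hidden_probs(v1, W, c)                # 4. hidden units from the confabulation
    for i in range(len(v0)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(b)):
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])
    return v1

rng = random.Random(0)
W = [[0.0] * 2 for _ in range(4)]   # 4 visible units, 2 hidden units
b = [0.0] * 4
c = [0.0] * 2
for _ in range(20):
    confabulation = cd1_update([1, 0, 1, 0], W, b, c, rng)
```

No class labels appear anywhere in the update, which is the sense in which RBM training is unsupervised.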

[Figure: RBM training loop. Labels: data vector v, sample, sample, repeat]

Remember that RBM training is unsupervised

DBN Training

1. Train the first layer RBM

2. Stack another hidden layer on top of the first RBM & train W(2) as a second RBM

3. Continue to stack layers on top of the network, training each new layer as in the previous step
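The stacking procedure can be sketched as a loop in which each new RBM is trained on the hidden activations of the layers below it. This is an illustrative sketch only: the compact CD-1 trainer, the choice to pass hidden probabilities (not samples) upward, and the hyperparameters are all assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def train_rbm(data, n_hidden, rng, epochs=5, lr=0.1):
    """Train one RBM with CD-1 and return (W, hidden biases c)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    b, c = [0.0] * n_visible, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, c)
            h0 = [1 if rng.random() < p else 0 for p in ph0]
            pv1 = [sigmoid(b[i] + sum(h0[j] * W[i][j] for j in range(n_hidden)))
                   for i in range(n_visible)]
            v1 = [1 if rng.random() < p else 0 for p in pv1]
            ph1 = hidden_probs(v1, W, c)
            for i in range(n_visible):
                b[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            for j in range(n_hidden):
                c[j] += lr * (ph0[j] - ph1[j])
    return W, c

def pretrain_dbn(data, hidden_sizes, rng):
    """Greedy layer-wise pretraining: one RBM per hidden layer, bottom up."""
    params, layer_input = [], data
    for n_hidden in hidden_sizes:
        W, c = train_rbm(layer_input, n_hidden, rng)
        params.append((W, c))   # these become W(j), b(j) of the DBN
        # Assumption: hidden probabilities are the input of the next RBM.
        layer_input = [hidden_probs(v, W, c) for v in layer_input]
    return params

rng = random.Random(0)
data = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]] * 5
params = pretrain_dbn(data, [4, 3], rng)
shapes = [(len(W), len(W[0])) for W, _ in params]
```

The returned parameters would serve only as the starting point for the subsequent supervised fine-tuning, which is not shown here.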

[Figure: the stacked network. W(1), b(1) through W(L), b(L) come from RBM pretraining (good initializations are obtained); the output weights W(L+1) are random]

Fine-tune the whole network with a standard supervised criterion (mean squared error, cross-entropy); they used conjugate gradients

Dataset

• LingSpam, SpamAssassin, EnronSpam

Performance Measures

• Accuracy: percentage of correctly classified messages

• Ham / spam recall: percentage of ham (respectively spam) messages that are correctly classified

• Ham / spam precision: percentage of messages classified as ham (respectively spam) that are indeed ham (respectively spam)
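The three measures can be computed from true and predicted labels as in the sketch below. The function name and the toy labels are made up for illustration.

```python
def evaluate(true_labels, predicted_labels, positive):
    """Return (accuracy, recall, precision), treating `positive` as the class of interest."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in true_labels)
    predicted_pos = sum(p == positive for p in predicted_labels)
    accuracy = correct / len(pairs)
    recall = true_pos / actual_pos if actual_pos else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, recall, precision

# Toy labels, not results from the experiments.
truth = ["spam", "spam", "ham", "ham", "ham"]
preds = ["spam", "ham", "ham", "ham", "spam"]
spam_scores = evaluate(truth, preds, "spam")   # accuracy, spam recall, spam precision
ham_scores = evaluate(truth, preds, "ham")     # accuracy, ham recall, ham precision
```

Calling the function once per class gives the ham and spam variants of recall and precision; accuracy is the same in both calls.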

Experimental Setup

• Message representation: x=[x1, x2, …, xm]
– Each attribute of the message vector corresponds to a distinct word from the corpus
– The attribute value is the frequency of the corresponding word

• Attribute selection
– Stop words and words appearing in fewer than 2 messages were removed
– The remaining words were ranked by information gain score and the top m kept (m=1500 for LingSpam, m=1000 for SpamAssassin and EnronSpam)

• All experiments were performed using 10-fold cross validation
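The message representation and attribute selection described above can be sketched as follows. Tokenized messages, the stop-word set, and computing information gain on word presence or absence are assumptions for the example; the slides do not give these details.

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(word, messages, labels):
    """Information gain of the binary feature 'word occurs in the message'."""
    classes = sorted(set(labels))
    gain = entropy([labels.count(c) for c in classes])
    for present in (True, False):
        subset = [l for m, l in zip(messages, labels) if (word in m) == present]
        if subset:
            gain -= len(subset) / len(labels) * entropy([subset.count(c) for c in classes])
    return gain

def select_vocabulary(messages, labels, stop_words, m):
    # Document frequency: in how many messages each word appears.
    doc_freq = Counter(w for msg in messages for w in set(msg))
    candidates = [w for w, df in doc_freq.items()
                  if df >= 2 and w not in stop_words]
    # Highest information gain first; ties broken alphabetically.
    candidates.sort(key=lambda w: (-information_gain(w, messages, labels), w))
    return candidates[:m]

def to_vector(message, vocabulary):
    # x_i is the frequency of the i-th vocabulary word in the message.
    counts = Counter(message)
    return [counts[w] for w in vocabulary]

messages = [["win", "cash", "now"], ["win", "prize", "now"],
            ["meeting", "at", "noon"], ["lunch", "at", "noon"]]
labels = ["spam", "spam", "ham", "ham"]
vocabulary = select_vocabulary(messages, labels, stop_words={"at"}, m=2)
vector = to_vector(["win", "now", "now"], vocabulary)
```

In this toy corpus "cash", "prize", "meeting" and "lunch" are dropped because each appears in only one message, and "at" is dropped as a stop word.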

Experimental Setup

• SVM configuration
– Cosine kernel (the usual trend in text classification)
– The cost parameter C must be determined a priori
– Tried many values for C and kept the best

• DBN configuration
– Use of an m-50-50-200-2 DBN architecture (3 hidden layers)
– RBM training was performed using binary vectors for message representation (the presence or absence of a word in a message)
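Two of the configuration details above translate directly into code: the cosine kernel and the binary input vectors for the m-50-50-200-2 architecture. The sketch below uses made-up helper names and returns 0 for zero vectors, which is an assumption.

```python
import math

def cosine_kernel(x, y):
    # k(x, y) = <x, y> / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0   # assumption: empty messages are similar to nothing
    return dot / (norm_x * norm_y)

def binarize(freq_vector):
    # Presence (1) or absence (0) of each word, as used for RBM training.
    return [1 if count > 0 else 0 for count in freq_vector]

def dbn_layer_sizes(m):
    # m inputs, hidden layers of 50, 50 and 200 units, 2 output units.
    return [m, 50, 50, 200, 2]

same_direction = cosine_kernel([1, 2, 0], [2, 4, 0])
orthogonal = cosine_kernel([1, 0], [0, 3])
binary = binarize([0, 3, 1, 0])
sizes = dbn_layer_sizes(1500)
weight_shapes = list(zip(sizes[:-1], sizes[1:]))
```

The cosine kernel depends only on the direction of the frequency vectors, so a long and a short message with the same word proportions are treated as identical.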

Experimental Results

The DBN achieves higher accuracy on all datasets

The DBN beats the SVM on all measures on SpamAssassin

The DBN proved robust to variations in the number of units in each layer

DBN training is much slower than SVM training

Conclusions

• The effectiveness of the initialization method was demonstrated in practice

• DBNs constitute a new viable solution to e-mail filtering

• The selection of the DBN architecture needs to be addressed in a more systematic way
– Number of layers
– Number of units in each layer

Challenges

• One example from the SpamAssassin dataset (email spam)

Hi there,

To be removed please visit: http://www.supersitescentral.com/rl/remove.html

BIG News...

Visit http://www.supersitescentral.com/rl/x601001.html for full details.

We have discovered a secret to generating a fortune over the Internet and are looking for a few good people to share it with.

This could finally be your chance to get that brand new car and go on that dream vacation you have always wanted. This is THE BIG ONE! So pay real close attention...

Literally thousands of people are making obscene amounts of money from the Internet and ecommerce. We found an Internet giant who markets 11 million products with HUGE demand in every country around the globe.

You can sit in the comfort of your home making money hand over fist with a HUGE global market at your fingertips. Most people never get the opportunity like this to join *BEFORE* the masses come in.

Consider this:

* Debt Free Multi-Million Dollar Company
* International - in over 180 Countries
* A 100 Billion Dollar Industry
* 3 Year Proven Track Record
* eCommerce Shopping Giant
* Online Marketing Tools
* Phenomenal Support Systems
* Automated Recruiting Systems
* Proprietary Back Office Technology
* Huge Compensation Plan
* Lifetime Residual income

Go to the web site below to get all the details.

http://www.supersitescentral.com/rl/x601001.html

Isn't it your turn to make a fortune over the Internet? Don't drag your feet on this one. It could be the one you have been waiting for all your life.

Talk to you soon,
Mark

iNet Marketing Services

Challenges

[Web발신]- ＮＨ. 금 융 -더쉽고, 더안전하게~7.8 ％ 로 7000사.용.하.실.수있습니다

[Web발신]크♥사[ㅏ리1.95로♡ㅂㅔ당+0.05스♥ㅂL셀1.65OK♡레알1.49추쳐닌ck77time-pr콤

[Web발신]사-용-중-인

체_크_카_드

빌-려-주-면

월-４-５-０

당-일-진-행

바_로_결_제

[Web발신]KB국민카드 김소연님08/18KB국민카드결제금액3,500원.잔여포인트리230(08/06기준)

<공학인증>2014-1학기미상담 시 성적확인 및 수강신청제약!! 학기 중 상담 필수~!!

• What about Korean spam SMS?

1. Look at the distribution of words and special characters in spam and ham messages.

2. The input vector of the DBN can be ‘number of special characters’ or ‘how grammatically correct the message is’, instead of ‘number of spam words’
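One way to turn the ‘number of special characters’ idea into an input feature is sketched below. Treating every character that is neither alphanumeric nor whitespace as special is an assumption, as are the feature names.

```python
def special_char_features(message):
    """Count characters that are neither letters, digits nor whitespace."""
    special = sum(1 for ch in message
                  if not (ch.isalnum() or ch.isspace()))
    length = len(message)
    return {
        "special_count": special,
        "special_ratio": special / length if length else 0.0,
    }

# Strings taken from the example messages above.
obfuscated = special_char_features("사-용-중-인")
plain = special_char_features("KB국민카드 결제금액")
```

Hangul syllables count as letters under `str.isalnum`, so only the separators inserted between them are picked up as special characters.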

Challenges

• How to handle MMS spam that contains images?

• Extract text from image

• Image clustering

• Input vector of DBN can be image vector
