Topic Models Based Personalized Spam Filter

ISCF - 2006

Sudarsun. S, Director – R & D, Checktronix India Pvt Ltd, Chennai
Venkatesh Prabhu. G, Research Associate, Checktronix India Pvt Ltd, Chennai
Valarmathi B, Professor, SKP Engineering College, Thiruvannamalai

Upload: sudarsun-santhiappan

Post on 15-Jan-2015

DESCRIPTION

Spam filtering poses a critical problem in text categorization because the features of the text change continuously. Spam evolves continuously, which makes it difficult for a filter to classify new feature patterns designed to evade it. Because most practical applications rely on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper presents a system for the automatic detection and filtering of unsolicited electronic messages. We have developed a content-based classifier that uses two topic models, LSI and PLSA, complemented by a natural language approach based on text pattern matching. By combining these statistical and NLP techniques we obtained a parallel content-based spam filter that performs filtering in two stages. In the first stage each model generates its own prediction; in the second stage the predictions are combined by a voting mechanism.

TRANSCRIPT

Page 1: Topic Models Based Personalized Spam Filter

ISCF - 2006

Topic Models Based Personalized Spam Filter

Sudarsun. S, Director – R & D, Checktronix India Pvt Ltd, Chennai

Venkatesh Prabhu. G, Research Associate, Checktronix India Pvt Ltd, Chennai

Valarmathi B, Professor, SKP Engineering College, Thiruvannamalai

Page 2: Topic Models Based Personalized Spam Filter

What is Spam?
Unsolicited, unwanted email

What is Spam Filtering?
Detection/filtering of unsolicited content

What is Personalized Spam Filtering?
The definition of “unsolicited” becomes personal

Approaches
Origin-Based Filtering [Generic]
Content-Based Filtering [Personalized]

Page 3: Topic Models Based Personalized Spam Filter

Content Based Filtering

What does the message contain?
Images, text, URLs

Is it “irrelevant” to my preferences?
How is relevancy defined?
How does the system understand relevancy?

Supervised Learning
Teach the system what I like and what I don’t

Unsupervised Learning
Decisions are made using latent patterns

Page 4: Topic Models Based Personalized Spam Filter

Content-Based Filtering -- Methods

Bayesian Spam Filtering
Simplest design / lower computation cost
Based on keyword distribution
Cannot work on contexts
Accuracy is around 60%

Topic Models based Text Mining
Based on the distribution of n-grams (key phrases)
Addresses synonymy and polysemy
Run-time computation cost is low
Unsupervised technique

Rule based Filtering
Supervised technique based on hand-written rules
Best accuracy for known cases
Cannot adapt to new patterns

Page 5: Topic Models Based Personalized Spam Filter

Topic Models
Treat every word as a feature
Represent the corpus as a higher-dimensional distribution
SVD: Decomposes the higher-dimensional data into a small reduced subspace containing only the dominant feature vectors
PLSA: Documents can be understood as a mixture of topics

Rule Based Approaches
N-Grams: language model approach
The more n-grams two texts have in common, the closer their patterns are.

Page 6: Topic Models Based Personalized Spam Filter

LSA Model, In Brief

Describes the underlying structure in text.
Computes similarities between texts.
Represents documents in a high-dimensional semantic space (term-document matrix, TDM).
The high-dimensional space is approximated by a low-dimensional space using Singular Value Decomposition (SVD).
SVD decomposes the higher-dimensional TDM into the U, S and V matrices.
U: left singular vectors (reduced word vectors)
V: right singular vectors (reduced document vectors)
S: array of singular values (variances or scaling factors)
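The decomposition described on this slide can be sketched with NumPy. This is only an illustration under stated assumptions: the toy term-document matrix, the term list, the rank of 2 and the helper names `lsa` and `cosine` are invented for the example and do not come from the slides.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
terms = ["cheap", "pills", "offer", "meeting", "agenda"]
tdm = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

def lsa(tdm, rank):
    """Truncated SVD: keep only the `rank` largest singular triplets."""
    U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :].T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

U, s, V = lsa(tdm, rank=2)      # U: reduced word vectors, V: reduced document vectors
doc_vectors = V * s             # scale each reduced dimension by its singular value
sim_spam = cosine(doc_vectors[0], doc_vectors[1])   # two documents with similar wording
sim_cross = cosine(doc_vectors[0], doc_vectors[2])  # two documents with unrelated wording
```

In this toy setup the two documents that share vocabulary end up much closer in the reduced space than the unrelated pair, which is the similarity computation the slide refers to.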

Page 7: Topic Models Based Personalized Spam Filter

PLSA Model

In the PLSA model, a document is a mixture of topics, and topics generate words.

The probabilistic latent factor model can be described by the following generative model:

Select a document $d_i$ from $D$ with probability $\Pr(d_i)$.

Pick a latent factor $z_k$ with probability $\Pr(z_k \mid d_i)$.

Generate a word $w_j$ from $W$ with probability $\Pr(w_j \mid z_k)$.

$$\Pr(d_i, w_j) = \Pr(d_i)\,\Pr(w_j \mid d_i),$$

where

$$\Pr(w_j \mid d_i) = \sum_{k=1}^{l} \Pr(w_j \mid z_k)\,\Pr(z_k \mid d_i)$$

The aspect model parameters are computed using the EM algorithm.
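The EM fitting mentioned above can be sketched as follows. This is a minimal, untuned illustration of the standard PLSA updates for Pr(w|z) and Pr(z|d), not the authors' implementation; the count matrix, the number of topics, the iteration count and the function name `plsa` are assumptions made for the example.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit Pr(w|z) and Pr(z|d) by EM on a document-by-word count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior Pr(z | d, w), shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both tables from expected counts n(d, w) * Pr(z | d, w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Hypothetical toy counts: rows are documents, columns are words.
counts = np.array([
    [4, 3, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 5, 3],
    [0, 1, 3, 5],
], dtype=float)

p_w_z, p_z_d = plsa(counts, n_topics=2)
p_w_d = p_z_d @ p_w_z   # Pr(w_j | d_i) = sum_k Pr(w_j | z_k) Pr(z_k | d_i)
```

The last line is the mixture formula from the slide written as a matrix product; each row of `p_w_d` is a word distribution for one document.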

Page 8: Topic Models Based Personalized Spam Filter

N-Gram Approach

Language model approach
Looks for repeated patterns
Each word depends probabilistically on the n-1 preceding words:

$$P(w_1 \ldots w_n) = \prod_{i} P(w_i \mid w_{i-n+1} \ldots w_{i-1})$$

N-gram profiles are calculated and compared.
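The slide does not say how N-gram profiles are built or compared. As one possible reading, the sketch below counts word bigrams and compares two profiles with a weighted Jaccard overlap; the whitespace tokenization, the choice of n = 2, the overlap measure and the function names are all assumptions.

```python
from collections import Counter

def ngram_profile(text, n=2):
    """Count the word n-grams that occur in a text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def profile_overlap(a, b):
    """Weighted Jaccard overlap of two profiles, between 0.0 and 1.0."""
    shared = sum((a & b).values())   # per n-gram minimum count
    total = sum((a | b).values())    # per n-gram maximum count
    return shared / total if total else 0.0

known_spam = ngram_profile("buy cheap pills now")
incoming = ngram_profile("buy cheap pills today")
known_ham = ngram_profile("see you at lunch")

spam_score = profile_overlap(incoming, known_spam)
ham_score = profile_overlap(incoming, known_ham)
```

Here the incoming mail shares two of its bigrams with the spam profile and none with the ham profile, so it would be scored as closer to spam.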

Page 9: Topic Models Based Personalized Spam Filter

Overall System Architecture

[Diagram: Training Mails and Test Mail pass through the Preprocessor to the LSA Model, PLSA Model, N-Gram and other classifiers (….); their outputs go to the Combiner, which produces the Final Result.]

Page 10: Topic Models Based Personalized Spam Filter

Preprocessing

Feature Extraction
Tokenizing

Feature Selection
Pruning
Stemming
Weighting

Feature Representation
Term Document Matrix Generation

Sub Spacing
LSA / PLSA Model Projection

Feature Reduction
Principal Component Analysis
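The first stages of this pipeline (tokenizing, pruning, stemming, weighting and term-document matrix generation) might look like the sketch below. The stop-word list, the naive suffix stripper standing in for a real stemmer, and the TF-IDF weighting are assumptions; the slides do not specify which pruning, stemming or weighting schemes were used.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "to", "is", "and", "of"}  # hypothetical pruning list

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, prune stop words, then stem."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

def term_document_matrix(docs):
    """Return (vocabulary, matrix) with TF-IDF weights; rows are terms."""
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    matrix = []
    for term in vocab:
        idf = math.log(len(docs) / doc_freq[term])
        matrix.append([tokens.count(term) * idf for tokens in token_lists])
    return vocab, matrix

docs = ["Buying cheap pills", "The meeting is starting", "Cheap meeting rooms"]
vocab, matrix = term_document_matrix(docs)
```

The resulting matrix is what the sub-spacing step would then project with the LSA or PLSA model.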

Page 11: Topic Models Based Personalized Spam Filter

Principal Component Analysis - PCA

Data reduction: ignore the features of lesser significance

Given N data vectors in k dimensions, find c <= k orthogonal vectors that best represent the data

The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)

Detects structure in the relationships between variables, which is then used to classify data.
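A compact way to see the reduction from k dimensions to c principal components is to compute PCA through an SVD of the centered data. The random data, the values N = 20, k = 5, c = 2 and the function names below are illustrative assumptions.

```python
import numpy as np

def pca_fit(X, c):
    """Return the mean and the top-c principal axes of X (N rows, k columns)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:c].T            # components: k x c with orthonormal columns

def pca_transform(X, mean, components):
    """Project data onto the principal components."""
    return (X - mean) @ components   # N x c

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))         # hypothetical data: N = 20 vectors, k = 5 dimensions
mean, components = pca_fit(X, c=2)
reduced = pca_transform(X, mean, components)
```

The same `mean` and `components` learned on the training vectors would be reused to reduce a test vector.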

Page 12: Topic Models Based Personalized Spam Filter

LSA Classification

[Diagram: Input Mails → Token List → LSA Model (M x R; M: vocab size, R: rank) → Vector 1 x R → PCA (R x R’; R: input variable size, R’: output variable size) → Vector 1 x R’ → BPN → Score]

Page 13: Topic Models Based Personalized Spam Filter

PLSA Classification

[Diagram: Input Mails → Token List → PLSA Model (M x Z; M: vocab size, Z: aspects count) → Vector 1 x Z → PCA (Z x Z’; Z: input variable size, Z’: output variable size) → Vector 1 x Z’ → BPN → Score]

Page 14: Topic Models Based Personalized Spam Filter

(P)LSA Classification

Model Training
Build the global (P)LSA model using the training mails.
Vectorize the training mails using the LSI/PLSA model.
Reduce the dimensionality of the matrix of pseudo-vectors of the training documents using PCA.
Feed the reduced matrix into the neural network for learning.

Model Testing
The test mail is fed to (P)LSA for vectorization.
The vector is reduced using the PCA model.
The reduced vector is fed into the BPN neural network.
The BPN network emits its prediction with a confidence score.
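The training and testing flow can be sketched end to end on toy data. Several things here are assumptions rather than the authors' design: the term-document matrix and labels are invented, the test mail is vectorized with the usual LSA fold-in (term vector times U, divided by the singular values), the PCA step is skipped because the toy rank is already 2, and `TinyBPN` is a hypothetical one-hidden-layer backpropagation network.

```python
import numpy as np

def fold_in(term_vector, U, s):
    """Project a raw term vector (length M) into the R-dimensional LSA space."""
    return term_vector @ U / s

class TinyBPN:
    """One-hidden-layer network trained with plain backpropagation."""

    def __init__(self, n_in, n_hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.5, size=n_hidden)
        self.b2 = 0.0

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def predict(self, X):
        """Return a spam score in (0, 1) for one vector or a batch."""
        hidden = self._sigmoid(X @ self.W1 + self.b1)
        return self._sigmoid(hidden @ self.W2 + self.b2)

    def train(self, X, y, epochs=5000, lr=1.0):
        for _ in range(epochs):
            hidden = self._sigmoid(X @ self.W1 + self.b1)
            out = self._sigmoid(hidden @ self.W2 + self.b2)
            d_out = out - y   # gradient of cross-entropy loss at the output
            d_hidden = np.outer(d_out, self.W2) * hidden * (1 - hidden)
            self.W2 -= lr * hidden.T @ d_out / len(y)
            self.b2 -= lr * d_out.mean()
            self.W1 -= lr * X.T @ d_hidden / len(y)
            self.b1 -= lr * d_hidden.mean(axis=0)

# Hypothetical training data: rows are terms, columns are mails; 1 = spam, 0 = ham.
tdm = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [1, 1, 0, 1],
                [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
labels = np.array([1.0, 1.0, 0.0, 0.0])

U, s, _ = np.linalg.svd(tdm, full_matrices=False)
U, s = U[:, :2], s[:2]                      # global LSA model of rank 2
train_vectors = fold_in(tdm.T, U, s)        # pseudo-vectors of the training mails

net = TinyBPN(n_in=2)
net.train(train_vectors, labels)

spam_like = fold_in(np.array([1.0, 1.0, 0.0, 0.0, 0.0]), U, s)
ham_like = fold_in(np.array([0.0, 0.0, 0.0, 1.0, 1.0]), U, s)
spam_score = net.predict(spam_like)
ham_score = net.predict(ham_like)
```

The network output plays the role of the confidence score that is later passed to the combiner.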

Page 15: Topic Models Based Personalized Spam Filter

N-Gram method

Construct an N-gram tree from the training docs
Documents make the leaves
Nodes make the identified N-grams from the docs
Weight of an N-gram = number of children
A higher-order N-gram implies more weight
Weight update: Wt = Wt * S / (S + L)
P: total number of docs sharing an N-gram
S: number of SPAM docs sharing the N-gram
L: P - S

Page 16: Topic Models Based Personalized Spam Filter

An Example N-Gram Tree

[Figure: example N-gram tree with nodes T1 to T5 and N1 to N4, with labels 1st, 2nd and 3rd]

Page 17: Topic Models Based Personalized Spam Filter

Combiner

Mixture of Experts

Get Predictions from all the Experts

Use the maximum common prediction

Use the prediction with maximum confidence score
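Both combination strategies listed above can be sketched in a few lines. Representing each expert's output as a `(label, confidence)` pair and falling back to the most confident expert when the vote is tied are assumptions; the slides do not say how the two rules interact.

```python
from collections import Counter

def max_confidence(predictions):
    """Return the label of the expert with the highest confidence score."""
    return max(predictions, key=lambda p: p[1])[0]

def combine(predictions):
    """Majority vote over labels; on a tie, trust the most confident expert."""
    counts = Counter(label for label, _ in predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return max_confidence(predictions)
    return counts[0][0]

# Hypothetical outputs from the LSA, PLSA and N-gram experts.
three_experts = [("spam", 0.9), ("ham", 0.6), ("spam", 0.7)]
tied_experts = [("spam", 0.55), ("ham", 0.8)]

majority_result = combine(three_experts)
tie_result = combine(tied_experts)
```

With three experts the majority label wins; with a split vote the label of the more confident expert is returned.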

Page 18: Topic Models Based Personalized Spam Filter

Conclusion

The objective is to filter mail messages based on the preferences of an individual

Classification performance increases with increased (incremental) training

Initial learning is not necessary for LSA, PLSA & N-Gram.

Performs unsupervised filtering

Prediction is fast, although background training is a relatively slow process

Page 20: Topic Models Based Personalized Spam Filter

Any Queries?

You can post your queries to [email protected]