ppm based spam filtering in sewm2008
Post on 20-Jan-2016
PPM based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
llx_2008@yahoo.com.cn, xucongfu@zju.edu.cn, billpengpeng@sohu.com, oillgz@gmail.com
College of Computer Science, Zhejiang University
April 10, 2008
Outline
PPM (Prediction by Partial Matching)
Email Pre-processing
Train PPM Model
Model Classification
PPM
Data Compression
PPM Framework
Email Pre-processing
Source alphabet
Merge consecutive spaces
Truncate long messages
Email Pre-processing
Raw Data: Abcd_= - Af?/[]=+ safj =ab fe addfe
Sample:
Alphabet: {a,b,c,d,e,f,_,=, }
Replace char: ?
Truncate length: 20
After Replace: abcd_= ? Af????=? ?af? =ab fe addfe
After Merge Blank: abcd_= ? Af????=? ?af? =ab fe addfe
After Truncate: abcd_= ? Af????=? ?a
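The three pre-processing steps can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `preprocess` and its parameter names are hypothetical, the alphabet, replace character and truncate length are taken from the sample above, and the input is an all-lowercase variant of the raw data with doubled spaces assumed (the original spacing is not recoverable from the slide text).

```python
import re

# Values from the slide's sample; the names are illustrative assumptions.
ALPHABET = set("abcdef_= ")
REPLACE_CHAR = "?"
MAX_LEN = 20


def preprocess(raw, alphabet=ALPHABET, replace_char=REPLACE_CHAR, max_len=MAX_LEN):
    """Map a raw message onto the source alphabet, merge spaces, truncate."""
    # Step 1: every character outside the source alphabet becomes replace_char.
    replaced = "".join(c if c in alphabet else replace_char for c in raw)
    # Step 2: runs of two or more spaces collapse to a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # Step 3: keep only the first max_len characters.
    return merged[:max_len]


print(preprocess("abcd_= -  af?/[]=+ safj  =ab fe addfe"))
# prints: abcd_= ? af????=? ?a
```

No decoding or tokenization is involved, so the PPM model sees the message as a plain character stream.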
Train PPM Model
Use order-6 PPM* model
Use Method D escape estimation
Train two PPM models
  HAM model
  SPAM model
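To make the training step concrete, here is a simplified sketch of a bounded order-6 PPM model with Method D escape estimation. It is an illustration under stated assumptions, not the authors' implementation: it is plain bounded-order PPM rather than PPM*, it omits symbol exclusion, and the class name `PPMModel`, its methods and the 256-symbol fallback alphabet are hypothetical. Method D gives a symbol seen c times in a context with n total counts the probability (2c - 1) / (2n), and gives the escape the probability d / (2n), where d is the number of distinct symbols seen in that context.

```python
import math
from collections import defaultdict


class PPMModel:
    """Simplified order-k PPM with Method D escapes (no exclusion)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # counts[context][symbol] = times symbol followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        # Record each symbol under every context of length 0..order.
        for i, sym in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][sym] += 1

    def prob(self, history, sym):
        # Try the longest context first and escape to shorter ones.
        p = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if not table:
                continue  # context never seen: fall back at no cost
            n = sum(table.values())
            c = table.get(sym, 0)
            if c > 0:
                return p * (2 * c - 1) / (2 * n)  # Method D symbol estimate
            p *= len(table) / (2 * n)  # Method D escape estimate
        return p / self.alphabet_size  # order -1: uniform over the alphabet

    def bits(self, text):
        # Code length of text in bits under the (static) trained model.
        return -sum(
            math.log2(self.prob(text[max(0, i - self.order):i], sym))
            for i, sym in enumerate(text)
        )


# One model per class, trained on toy data for illustration only.
spam_model = PPMModel(order=6)
spam_model.train("buy cheap pills buy cheap pills now")
ham_model = PPMModel(order=6)
ham_model.train("meeting agenda attached for review")
print(spam_model.bits("cheap pills") < ham_model.bits("cheap pills"))
# prints: True
```

A message costs fewer bits under the model whose training text it resembles, which is what the classification step relies on.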
Model Classification
MCE (Minimum Cross-Entropy)
MDL (Minimum Description Length)
Spam Score
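The sketch below illustrates how the two criteria are commonly read in the compression-based filtering literature: MCE scores a message with the trained models left unchanged, while MDL lets a copy of each model keep adapting while it codes the message. Everything here is an assumption made for illustration: `UnigramModel` is a hypothetical stand-in for a PPM model (any object with `bits` and `update` would do), and the `spam_score` formula is one plausible normalization, since the slide does not define the score.

```python
import copy
import math
from collections import Counter


class UnigramModel:
    """Hypothetical stand-in for a PPM model: add-one smoothed char counts."""

    def __init__(self, alphabet_size=256):
        self.counts = Counter()
        self.total = 0
        self.alphabet_size = alphabet_size

    def update(self, text):
        self.counts.update(text)
        self.total += len(text)

    def bits(self, text):
        # Code length of text with the model held fixed.
        denom = self.total + self.alphabet_size
        return sum(-math.log2((self.counts[c] + 1) / denom) for c in text)


def static_bits(model, text):
    """MCE-style cost: the model is not changed while coding the message."""
    return model.bits(text)


def adaptive_bits(model, text):
    """MDL-style cost: a copy of the model learns each symbol after coding it."""
    m = copy.deepcopy(model)
    total = 0.0
    for c in text:
        total += m.bits(c)
        m.update(c)
    return total


def spam_score(message, ham_model, spam_model, cost=static_bits):
    """Assumed score in [0, 1]; values above 0.5 lean towards spam."""
    ham_cost = cost(ham_model, message)
    spam_cost = cost(spam_model, message)
    if ham_cost + spam_cost == 0:
        return 0.5  # empty message: no evidence either way
    return ham_cost / (ham_cost + spam_cost)


def classify(message, ham_model, spam_model, cost=static_bits):
    """Pick the class whose model codes the message in fewer bits."""
    if cost(spam_model, message) < cost(ham_model, message):
        return "spam"
    return "ham"
```

Passing `cost=adaptive_bits` switches both functions from the MCE-style to the MDL-style criterion without touching the trained models.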
Advantage
Simple pre-processing
No decoding needed (robust to obfuscation)
Highly self-adaptive
Low false positive rate
Reference
"Spam Filtering Using Statistical Data Compression Models"
"Unbounded Length Contexts for PPM"
Question
Delay
Index ham, Ham and HAM
Active learning 10000
Deliver the filter
Thanks for your attention!
Q&A