PPM based Spam Filtering in SEWM2008
![Page 1: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/1.jpg)
PPM based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
[email protected], [email protected], [email protected], [email protected]
College of Computer Science, Zhejiang University
April 10, 2008
![Page 2: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/2.jpg)
Outline
PPM (prediction by partial matching)
Email Pre-processing
Train PPM Model
Model Classification
![Page 3: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/3.jpg)
PPM
Data Compression
![Page 4: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/4.jpg)
PPM Framework
![Page 5: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/5.jpg)
Email Pre-processing
Source alphabet
Merge consecutive spaces
Truncate long messages
![Page 6: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/6.jpg)
Email Pre-processing
Raw Data: Abcd_= - Af?/[]=+ safj =ab fe addfe
Sample:
Alphabet: {a,b,c,d,e,f,_,=, }
Replace char: ?
Truncate length: 20
After Replace: abcd_= ? Af????=? ?af? =ab fe addfe
After Merge Blank: abcd_= ? Af????=? ?af? =ab fe addfe
After Truncate: abcd_= ? Af????=? ?a
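
The replace, merge, and truncate steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `preprocess`, its parameters, and the input string are hypothetical and not taken from the slides, and the sketch merges runs of plain spaces only. Case handling is left out because the slides do not specify it.

```python
import re

def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Hypothetical helper: replace, merge blanks, then truncate."""
    # Step 1 (replace): map every character outside the source alphabet
    # to the replacement character.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # Step 2 (merge blank): collapse each run of spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # Step 3 (truncate): keep at most max_len characters.
    return merged[:max_len]

# Same settings as the sample slide: alphabet {a,b,c,d,e,f,_,=,space},
# replace char '?', truncate length 20. The input text is made up.
ALPHABET = set("abcdef_= ")
print(preprocess("abc   xyz==  fed_cab  and more text", ALPHABET))
```

Truncating after merging means the length limit applies to the cleaned text, so long runs of blanks do not use up the budget.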
![Page 7: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/7.jpg)
Train PPM Model
Use an order-6 PPM* model
Use Method D escape estimation
Train two PPM models: a HAM model and a SPAM model
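
A minimal sketch of how Method D escape estimation works may help here. In a context seen n times with d distinct following symbols, Method D assigns a symbol seen c times the probability (2c - 1) / (2n) and assigns the escape event d / (2n). The class below is a simplified illustration, not the filter from the slides: the name `SimplePPMD` is hypothetical, contexts are bounded by `order` rather than unbounded as in PPM*, symbol exclusion is omitted, and an order -1 fallback over 256 symbols is assumed.

```python
import math
from collections import Counter, defaultdict

class SimplePPMD:
    """Simplified bounded-order PPM with Method D escapes (no exclusion)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the order -1 fallback
        # contexts[k][ctx] counts the symbols that followed the length-k context ctx.
        self.contexts = [defaultdict(Counter) for _ in range(order + 1)]

    def train(self, text):
        for i, sym in enumerate(text):
            for k in range(min(self.order, i) + 1):
                self.contexts[k][text[i - k:i]][sym] += 1

    def prob(self, history, sym):
        p_escape = 1.0
        # Try the longest available context first, then back off.
        for k in range(min(self.order, len(history)), -1, -1):
            counts = self.contexts[k].get(history[len(history) - k:])
            if not counts:
                continue  # context never seen: back off at no cost
            n = sum(counts.values())
            d = len(counts)
            if sym in counts:
                return p_escape * (2 * counts[sym] - 1) / (2 * n)
            p_escape *= d / (2 * n)  # Method D escape probability
        return p_escape / self.alphabet_size  # uniform order -1 model

    def bits(self, text):
        """Code length of text in bits under the trained (static) model."""
        return sum(
            -math.log2(self.prob(text[max(0, i - self.order):i], sym))
            for i, sym in enumerate(text)
        )

# Two separate models, one per class, trained on made-up examples.
ham_model = SimplePPMD(order=3)
ham_model.train("see you at the meeting")
spam_model = SimplePPMD(order=3)
spam_model.train("buy cheap pills now")
print(ham_model.bits("cheap pills"), spam_model.bits("cheap pills"))
```

Text that resembles a model's training data is coded in fewer bits, which is what the classification step relies on.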
![Page 8: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/8.jpg)
Model Classification
MCE (Minimum Cross-entropy)
MDL (Minimum Description Length)
Spam Score
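
The two criteria can be sketched as follows, assuming (as in the compression-model literature the slides reference) that MCE scores a message with a static model while MDL lets the model keep adapting as it reads the message. The slides do not give the spam score formula, so the difference of code lengths used here is an assumption. `CharModel` is a hypothetical order-0 stand-in for a trained PPM model, used only to keep the example short.

```python
import copy
import math
from collections import Counter

class CharModel:
    """Stand-in for a PPM model: order-0 counts, add-one smoothing, 256 symbols assumed."""

    def __init__(self, text=""):
        self.counts = Counter()
        self.total = 0
        for ch in text:
            self.update(ch)

    def update(self, ch):
        self.counts[ch] += 1
        self.total += 1

    def bits(self, ch):
        return -math.log2((self.counts[ch] + 1) / (self.total + 256))

def code_length(model, message, adaptive):
    """Bits needed to code message. adaptive=False is the MCE-style static
    scoring; adaptive=True is the MDL-style scoring on a private copy."""
    m = copy.deepcopy(model) if adaptive else model
    total = 0.0
    for ch in message:
        total += m.bits(ch)
        if adaptive:
            m.update(ch)  # the model learns from the message as it goes
    return total

def spam_score(ham, spam, message, adaptive=False):
    # Assumed score: positive when the spam model compresses the message better.
    return code_length(ham, message, adaptive) - code_length(spam, message, adaptive)

ham = CharModel("see you at the meeting tomorrow")
spam = CharModel("buy cheap pills now buy now")
print(spam_score(ham, spam, "cheap pills"))
print(spam_score(ham, spam, "see you", adaptive=True))
```

A message would then be labeled spam when the score exceeds a chosen threshold, with 0 as the natural default.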
![Page 9: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/9.jpg)
Advantage
Simple pre-processing
No decoding (avoids obfuscation)
Highly self-adaptive
Low false positive rate
![Page 10: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/10.jpg)
Reference
"Spam Filtering Using Statistical Data Compression Models"
"Unbounded Length Contexts for PPM"
![Page 11: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/11.jpg)
Question
Delay
Index ham, Ham and HAM
Active learning 10000
Deliver the filter
![Page 12: PPM based Spam Filtering in SEWM2008](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56815003550346895dbdcf65/html5/thumbnails/12.jpg)
Thanks for your attention!
Q&A