Bayesian Filtering

Team Glyph

Debbie Bridygham, Pravesvuth Uparanukraw, Ronald Ko, Rihui Luo, Thuong Luu

Background

•A strong need exists to identify “bad” items in a population and remove them. Examples: spam, unsolicited IMs, etc.

•Filtering often results in an “arms race” that requires rapid response

•An “arms race” favors inherently adaptive methods over others

Benefits of Filters

•Less unwanted traffic, thus less wasted space on clients & servers

•Greater use of internet services due to reduced customer frustration

•Provide some protection against dangerous traffic: scams, phishing attacks, viruses, etc.

Downsides of Filtering

•Excluding even one legitimate item (a false positive) is less desirable than letting 10 or more illegitimate items pass.

•Reducing the percentage of undesirable traffic often causes legitimate traffic to be excluded as well.
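
One way to make this trade-off concrete is a simple expected-cost argument. The symbols here are introduced for illustration and are not from the slides: p is the filter's estimated probability that a message is unwanted, and λ is how many times worse a false positive is than a false negative. Blocking the message has expected cost λ(1 - p), passing it has expected cost p, so blocking is worthwhile only when:

```latex
\lambda\,(1 - p) < p
\quad\Longleftrightarrow\quad
p > \frac{\lambda}{1 + \lambda}
```

With λ = 10, as in the first bullet, this gives p > 10/11 ≈ 0.91, so under these assumptions a message should be rejected only when the filter is more than about 91% sure it is unwanted.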

Cost of Filtering

•Manual filtering has become prohibitive

•Maintenance of static filters costs time & money

•Time spent maintaining keywords or updating software delays response

•The “arms race” often results in ever-escalating costs

Methodologies

•Manual filtering is prohibitively time-consuming

•Static filtering based on heuristics and keywords does not adapt except via manual updates

•Bayesian filtering is dynamic, adapting with each new item scanned and/or marked

What is Bayesian Filtering?

•Uses a naïve Bayes classifier, which is built on Bayes' theorem

•The classifier categorizes items adaptively using probabilities and has a low rate of false positives

•Best known for its use in spam filtering, an application often credited to Paul Graham's “A Plan for Spam” (2002)

Naïve Bayes Classifier

•Applies Bayes' theorem under the assumption that the individual probabilities (e.g., of each word occurring) are independent of one another, which is rarely true; hence “naïve”

•The classifier can start with initial assumptions, i.e., probabilities that words occur in legitimate or illegitimate messages

•It is trained over time and adapts. If the final probability for an item reaches some threshold, the item is rejected. This approach is superior to keyword filtering.
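
The bullets above can be illustrated with a minimal sketch, not a production filter. The word probabilities, the prior of 0.5, and the 0.9 rejection threshold are all hypothetical values chosen for illustration; the slides do not specify any of them.

```python
import math

# Hypothetical initial assumptions: (P(word | spam), P(word | legitimate)).
WORD_PROBS = {
    "guarantee": (0.25, 0.005),
    "free": (0.30, 0.02),
    "meeting": (0.01, 0.15),
}
PRIOR_SPAM = 0.5   # assumed prior probability that a message is spam
THRESHOLD = 0.9    # assumed rejection threshold


def spam_probability(words):
    """Naive Bayes posterior P(spam | words), accumulated in log space."""
    log_spam = math.log(PRIOR_SPAM)
    log_ham = math.log(1.0 - PRIOR_SPAM)
    for word in words:
        if word not in WORD_PROBS:
            continue  # unknown words carry no evidence in this sketch
        p_spam, p_ham = WORD_PROBS[word]
        # The "naive" step: treat each word as independent evidence.
        log_spam += math.log(p_spam)
        log_ham += math.log(p_ham)
    # Normalize the two scores into a probability.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def is_rejected(words):
    """Reject the item once the final probability reaches the threshold."""
    return spam_probability(words) >= THRESHOLD
```

With these made-up numbers, `is_rejected(["guarantee", "free"])` is true while `is_rejected(["meeting"])` is false. Working in log space is a common design choice that avoids underflow when many small probabilities are multiplied.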

Bayes' Theorem

•First presented in 1763, based on work by the mathematician Thomas Bayes

•Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)

•Specifies relationships between conditional probabilities

•Currently has practical use in many fields
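
For filtering, the theorem can be read with A as "the message is spam" and B as "the message contains a given word". The symbols S (spam), H (legitimate) and W (word) are introduced here for illustration and do not appear in the slides. Expanding Pr(W) over the two classes gives:

```latex
\Pr(S \mid W)
  = \frac{\Pr(W \mid S)\,\Pr(S)}
         {\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}
```

If both classes are assumed equally likely beforehand, Pr(S) = Pr(H) = 1/2, the priors cancel and the expression reduces to Pr(W|S) / (Pr(W|S) + Pr(W|H)), which has the same shape as the per-word calculation in the Example slide.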

Bayesian Filtering Usage

•Uses user input to develop individual statistics

•Probability matrix changes over time based on scanned messages and user decisions

•Matrix is used to calculate probability a message is unwanted

•The matrix adapts quickly to new input, producing surprisingly good results
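
A rough sketch of how such a per-user matrix might be maintained is shown below. The class name `WordStats`, its methods, and the choice to return 0.5 for words with no evidence are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter


class WordStats:
    """Hypothetical per-user word statistics, updated on each user decision."""

    def __init__(self):
        self.spam_counts = Counter()  # spam messages containing each word
        self.ham_counts = Counter()   # legitimate messages containing each word
        self.spam_total = 0
        self.ham_total = 0

    def mark(self, words, is_spam):
        """Record one scanned message and the user's verdict on it."""
        unique = set(words)  # count each word once per message
        if is_spam:
            self.spam_counts.update(unique)
            self.spam_total += 1
        else:
            self.ham_counts.update(unique)
            self.ham_total += 1

    def word_spam_probability(self, word):
        """Estimate P(spam | word) with equal priors; 0.5 when there is no evidence."""
        if self.spam_total == 0 or self.ham_total == 0:
            return 0.5
        spam_rate = self.spam_counts[word] / self.spam_total
        ham_rate = self.ham_counts[word] / self.ham_total
        if spam_rate + ham_rate == 0:
            return 0.5
        return spam_rate / (spam_rate + ham_rate)


stats = WordStats()
stats.mark(["cheap", "pills"], is_spam=True)
stats.mark(["lunch", "tomorrow"], is_spam=False)
before = stats.word_spam_probability("pills")
# The user later marks a pharmacy reminder as legitimate; the estimate adapts.
stats.mark(["pills", "refill", "reminder"], is_spam=False)
after = stats.word_spam_probability("pills")
```

In this toy run the estimate for "pills" drops from 1.0 to about 0.67 after a single user decision, which is the kind of per-user adaptation the bullets describe.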

Example Matrix

(Figure: example probability matrix)

Example

•Suppose the word “guarantee” occurs in 500 of 2000 spam emails, but in only 5 of 1000 non-spam emails

•The probability of spam for this word is then (500 / 2000) / ((500 / 2000) + (5 / 1000)) = 0.25 / 0.255 ≈ 0.98

•This probability is combined with those of other words in the message to compute the probability that the entire message is spam.
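
The arithmetic above can be reproduced in a few lines. The slides do not say how the per-word probabilities are combined, so `combine` below uses one common rule, the product formula described in Graham's “A Plan for Spam”, which assumes independent words; the function names and the second probability of 0.2 are hypothetical.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Per-word spam probability from occurrence counts, assuming equal priors."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    return spam_rate / (spam_rate + ham_rate)


def combine(probabilities):
    """Combine per-word probabilities, treating the words as independent."""
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probabilities:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


p_guarantee = word_spam_probability(500, 2000, 5, 1000)
print(round(p_guarantee, 2))  # 0.98

# Combine with a hypothetical second word that leans legitimate (0.2).
p_message = combine([p_guarantee, 0.2])
```

Under these assumptions the single strongly spammy word still dominates: `p_message` comes out near 0.93 even though the second word points the other way.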

Bayesian Poisoning

•Attempts to fool Bayesian filtering (BF) systems by adding irrelevant words (often hidden) to messages

•Type I attacks attempt to get messages through the filter. They can be active or passive; active attacks return feedback to the sender via a “web bug” or other means

•Type II attacks attempt to cause false positives, i.e., force desirable messages to be rejected
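
To see why padding a message with irrelevant words can work as a Type I attack, consider the toy calculation below. It assumes the filter combines per-word probabilities with the independent-product rule, and every probability in it is made up for illustration.

```python
def combine(probabilities):
    """Combine per-word spam probabilities, treating the words as independent."""
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probabilities:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


spam_words = [0.98, 0.95]      # hypothetical scores of the real spam content
padding = [0.10, 0.10, 0.10]   # hypothetical innocent-looking words added by the sender

original = combine(spam_words)
poisoned = combine(spam_words + padding)
```

Here `original` is above 0.99 while `poisoned` falls to roughly 0.56, which would slip under a rejection threshold of, say, 0.9. The attack only works if the sender guesses words that this particular user's filter treats as legitimate.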

Poisoning Effectiveness

•Passive attacks are rarely effective, because each filter is individual and the sender gets no feedback

•Active attacks can be highly effective at first, if the recipient's system loads the “web bugs”

•All attacks lose effectiveness as the filter adjusts to incoming traffic

Products that use Bayesian Filtering

•AlienCamel

•DSPAM

•Eudora

•eXpurgate

•Junk-Out

•Mozilla

•Pegasus Mail

•POPFile

•Postini

•SeaMonkey

•SpamAssassin

•SpamBayes

•SpamProbe

•Thunderbird

Summary

•BF adapts to individual needs

•BF is highly effective

•BF adapts more quickly than other solutions

•BF is resistant to “poisoning”

References

•[1] Sahami, M., et al. “A Bayesian Approach to Filtering Junk E-Mail”, 1998

•[2] Graham, Paul. “A Plan for Spam”, 2002

•[3] Graham-Cumming, John. “Does Bayesian poisoning exist?”, 2006

References, cont.

•[4] Naive Bayes Classifier, Wikipedia, 2007

•[5] Bayes Theorem, Wikipedia, 2007
