Spam Detection Technique
TRANSCRIPT
-
7/28/2019 Spam Detection Technique
Koushik Mandal
Jadavpur University
4/13/2013
-
What is spam?
Spam is irrelevant information or messages that we receive through email or in search engine results.
Email spam: junk email, i.e. unsolicited commercial email carrying advertisements and offers.
Web spam: a page created for the sole purpose of attracting search engine referrals (to this page or to some other target page).
-
Problem of Spam
Users don't want spam:
Lost productivity
Offensive, embarrassing content
Legitimate messages get lost in the sea of spam
Spam isn't going away:
People buy from spammers
Legislation has not been effective
The SMTP protocol is inadequate: it allows spammers to forge message information
Spam is difficult to detect:
Spammers learn how to get past filters
Legitimate messages WILL be lost
-
Spam Categories
-
Spam Detection: Email Spam
Automated spam filtering:
An instance of the document classification problem
First document set: predefined class labels (spam or legitimate), used as the training set
Second document set: no class labels, used for testing
-
Problem of Document Classification
-
Naïve Bayesian Approach
Based on Bayes' theorem and the theorem of total probability.
The probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.
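The statement above is Bayes' theorem, P(spam | words) = P(words | spam) P(spam) / P(words), where the denominator can be expanded with the theorem of total probability. A minimal Python sketch of that arithmetic follows; the function name and all of the numbers are illustrative assumptions, not values from the slides.

```python
def posterior_spam(p_words_given_spam, p_words_given_legit, p_spam):
    """Return P(spam | words) using Bayes' theorem.

    The denominator P(words) is expanded with the theorem of total
    probability over the two classes, spam and legitimate.
    """
    p_legit = 1.0 - p_spam
    p_words = p_words_given_spam * p_spam + p_words_given_legit * p_legit
    if p_words <= 0:
        raise ValueError("the words must have non-zero probability")
    return p_words_given_spam * p_spam / p_words


# Hypothetical numbers: the words occur in 40% of spam and in 2.5% of
# legitimate mail, and 20% of all mail is spam.
example_posterior = posterior_spam(0.40, 0.025, 0.20)
print(round(example_posterior, 2))
```

With these made-up inputs, the words occur in 10% of all mail, so an email containing them is far more likely to be spam than the 20% prior alone would suggest.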
-
The Naïve Bayesian classifier is based on Bayes' theorem and the theorem of total probability. For an email instance with a vector of words X = (x1, x2, x3, ..., xN), the probability that it belongs to class Ci is

$$P(C_i \mid X) = \frac{P(C_i)\,P(X \mid C_i)}{\sum_{j} P(C_j)\,P(X \mid C_j)}$$

where j ranges over the classes {Spam, Legitimate}. In practice, the probabilities P(X|Ci) are impossible to estimate without simplifying assumptions, because the possible values of X are too many.
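The usual simplifying assumption, which gives the classifier its name, is that the words are conditionally independent given the class, so P(X|Ci) is approximated by the product of the per-word probabilities P(xk|Ci). The sketch below is a minimal illustration under that assumption. The function names, the toy training set, and the use of Laplace (add-one) smoothing are choices made for this example and do not come from the slides. Because the denominator is the same for every class, the sketch compares only the numerators, in log space.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Count documents per class and words per class.

    labeled_docs is an iterable of (list_of_words, label) pairs.
    """
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(words, model):
    """Return the class with the largest log P(C) + sum of log P(x_k | C)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior P(C)
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            smoothed = word_counts[label][w] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training set (hypothetical data, for illustration only).
training_set = [
    (["cheap", "pills", "offer"], "spam"),
    (["win", "cash", "offer"], "spam"),
    (["meeting", "agenda", "monday"], "legitimate"),
    (["project", "report", "monday"], "legitimate"),
]
model = train(training_set)
print(classify(["cheap", "offer"], model))
print(classify(["monday", "report"], model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.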
-
Spam Detection: Web Spam
Types of Spamming Techniques
Term spamming: manipulating the text of web pages in order to appear relevant to queries
Link spamming: creating link structures that boost PageRank or hubs-and-authorities scores
-
Link Spam
Three kinds of web pages from a spammer's point of view:
Inaccessible pages
Accessible pages: e.g., web log comment pages, where the spammer can post links to his pages
Own pages: completely controlled by the spammer; may span multiple domain names
-
Detecting Spam
Term spamming:
Analyze text using statistical methods, e.g., Naïve Bayes classifiers
Similar to email spam filtering
Also useful: detecting approximate duplicate pages
Link spamming:
Open research area
One approach: TrustRank
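The slide does not say how approximate duplicate pages are detected. One common way, shown here purely as an assumption, is to compare the sets of word shingles (overlapping runs of k words) of two pages with the Jaccard similarity. The function names, the shingle length k=3, and the sample texts are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical pages that differ in a single word.
page_1 = "buy cheap watches online today best price"
page_2 = "buy cheap watches online today lowest price"
similarity = jaccard(shingles(page_1), shingles(page_2))
print(round(similarity, 3))
```

Pages whose similarity exceeds a chosen cutoff would be treated as near duplicates. The cutoff itself has to be tuned on real data.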
-
TrustRank
Basic principle: approximate isolation
It is rare for a good page to point to a bad (spam) page
Sample a set of seed pages from the web
Identify the trusted (good) pages among the seeds and set the trust of each one to 1
Propagate trust through links
Each page gets a trust value between 0 and 1
Use a threshold value and mark all pages below the trust threshold as spam
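The steps above can be sketched in Python as a simple iterative propagation. This is a simplified illustration, not the published algorithm, which is formulated as a biased PageRank. The function name, the `damping` factor, the even split of trust across out-links, the clamp to 1, the threshold, and the toy graph are all assumptions made for this example.

```python
def trust_rank(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from trusted seed pages along outgoing links.

    out_links maps each page to the list of pages it links to.
    Seeds stay pinned at trust 1; every other page ends up in [0, 1].
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    trust = {p: (1.0 if p in trusted_seeds else 0.0) for p in pages}
    for _ in range(iterations):
        received = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            if not targets:
                continue
            # A page splits its (damped) trust evenly among its out-links.
            share = damping * trust[p] / len(targets)
            for q in targets:
                received[q] += share
        trust = {
            p: 1.0 if p in trusted_seeds else min(1.0, received[p])
            for p in pages
        }
    return trust


# Hypothetical web graph: "news" is the only trusted seed.
web_graph = {
    "news": ["blog", "wiki"],
    "blog": ["wiki"],
    "wiki": [],
    "farm": ["target"],
    "target": ["farm"],
}
trust = trust_rank(web_graph, trusted_seeds={"news"})
threshold = 0.1  # assumed cutoff
low_trust_pages = {p for p, t in trust.items() if t < threshold}
print(sorted(low_trust_pages))
```

In this toy graph no trusted page links to "farm" or "target", so they receive no trust and fall below the threshold, which is the approximate isolation principle at work.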
-
Anti-Trust Approach
Broadly based on the same approximate isolation principle.
This principle also implies that the pages pointing to spam pages are very likely to be spam pages themselves.
Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages.
A page can be classified as a spam page if it has an Anti-Trust Rank value greater than a chosen threshold value.
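A minimal Python sketch of this reverse propagation follows. It is a simplified illustration, not the published algorithm. The function name, the `damping` factor, the even split of Anti-Trust across a page's in-links, the clamp to 1, the threshold, and the toy graph are all assumptions made for this example.

```python
def anti_trust_rank(out_links, spam_seeds, damping=0.85, iterations=50):
    """Propagate Anti-Trust backwards, from spam seeds to pages linking to them.

    out_links maps each page to the list of pages it links to.
    Spam seeds stay pinned at 1; every other page ends up in [0, 1].
    """
    # Build the reversed graph: in_links[q] lists the pages that link to q.
    pages = set(out_links)
    in_links = {}
    for p, targets in out_links.items():
        for q in targets:
            pages.add(q)
            in_links.setdefault(q, []).append(p)
    distrust = {p: (1.0 if p in spam_seeds else 0.0) for p in pages}
    for _ in range(iterations):
        received = {p: 0.0 for p in pages}
        for q, sources in in_links.items():
            # A page splits its (damped) Anti-Trust among the pages linking to it.
            share = damping * distrust[q] / len(sources)
            for p in sources:
                received[p] += share
        distrust = {
            p: 1.0 if p in spam_seeds else min(1.0, received[p])
            for p in pages
        }
    return distrust


# Hypothetical web graph: "spamsite" is the only known spam seed.
link_graph = {
    "farm1": ["spamsite"],
    "farm2": ["spamsite", "farm1"],
    "news": ["blog"],
    "blog": [],
}
distrust = anti_trust_rank(link_graph, spam_seeds={"spamsite"})
spam_threshold = 0.3  # assumed cutoff
flagged = {p for p, d in distrust.items() if d > spam_threshold}
print(sorted(flagged))
```

Here "farm1" and "farm2" are flagged because they point, directly or indirectly, to the spam seed, while "news" and "blog" receive no Anti-Trust at all.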
-
Q & A
-
Thank you