Anti-spam Algorithm


Upload: flyingsheep

Post on 12-May-2015


DESCRIPTION

Two algorithms for anti-spam

TRANSCRIPT

Page 1: Anti Spam Algorithm

Anti-spam Algorithm: TrustRank, Hilltop

954203041 林裕得
954203057 蔡繼正

Page 2: Anti Spam Algorithm

Outline
• Introduction
• Comparison of PageRank, TrustRank, and Hilltop
• TrustRank
  – Combating Web Spam with TrustRank
• Hilltop
  – Hilltop: A Search Engine based on Expert Documents
• Evaluation

Page 3: Anti Spam Algorithm

Introduction (1/3)
• PageRank
• Current problem
  – Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results.

[Figure: example link graph with node scores 0.4, 0.4, and 0.2]

Page 4: Anti Spam Algorithm

Introduction (2/3)
• Types of web spam
  – Content spam
    • Hidden or invisible text
    • Keyword stuffing
    • Meta tag stuffing
  – Link spam
    • Link farms
    • Hidden links
  – Other types
    • Mirror websites
    • URL redirection

Page 5: Anti Spam Algorithm

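Introduction (3/3)
• How to combat web spam
  – TrustRank or Hilltop: identify pages that are definitely not spam
  – BadRank or SpamRank: identify pages that are definitely spam
  – Sandbox: cannot reliably tell spam pages from non-spam pages, but the practice effectively suppresses the SEO market
  – Manual reporting and specific anti-spam methods: help build a more comprehensive spam pool
    • http://www.google.com/contact/spamreport.html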

Page 6: Anti Spam Algorithm

Comparison of PageRank, TrustRank, and Hilltop (1/3)
• All three are connectivity-based algorithms: they assume that the number and quality of the sources referring to a page are a good measure of the page's quality.

Page 7: Anti Spam Algorithm

Comparison of PageRank, TrustRank, and Hilltop (2/3)
• Basic assumptions
  – PageRank
    • A good page has many important inlinks.
  – TrustRank
    • Good pages point to good pages.
  – Hilltop
    • Only links from expert pages count as evidence that a page is good.

Page 8: Anti Spam Algorithm

Comparison of PageRank, TrustRank, and Hilltop (3/3)

                 PageRank     TrustRank    Hilltop
Inlink source    All pages    All pages    Expert pages
Initial score    Average      1 or 0       Algorithm

[Figure: example graphs of initial scores. Values shown: 0.16 on each of six nodes; 0.33 on three nodes and 0 on three nodes; 0.5, 0.2, and 0.3]

Page 9: Anti Spam Algorithm

TrustRank (1/7)

[Figure: two example matrices]

0    0    0    …
1    0    1    …
0    0.5  0    …
0    0.5  0    …
…

0    0.5  0    …
0    0    0.5  …
0    0.5  0    …
…

Page 10: Anti Spam Algorithm


Page 11: Anti Spam Algorithm

TrustRank (2/7)
• Step 1: Evaluate the seed desirability of pages by inverse PageRank

Page 12: Anti Spam Algorithm

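s = α * U * s + (1 - α) * (1/N) * 1_N, repeated for M iterations

(s: seed-desirability scores; U: inverse transition matrix; 1_N: vector of N ones)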
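Inverse PageRank can be sketched as ordinary PageRank run on the graph with every link reversed, so a page scores highly when many pages can be reached from it. The block below is a minimal illustration, not the authors' implementation: the function name, the edge-list input, the uniform starting vector, and the defaults `alpha=0.85` and `iterations=20` are all assumptions made for this example.

```python
def inverse_pagerank(edges, n, alpha=0.85, iterations=20):
    """Seed-desirability scores for pages numbered 0..n-1.

    edges is a list of (source, target) links in the original graph.
    """
    # In-degree of every page in the original graph.
    indegree = [0] * n
    for _, target in edges:
        indegree[target] += 1

    s = [1.0 / n] * n  # assumed starting vector
    for _ in range(iterations):
        nxt = [(1 - alpha) / n] * n
        for source, target in edges:
            # Score flows backwards along each link: a page splits its
            # score evenly among the pages that link to it.
            nxt[source] += alpha * s[target] / indegree[target]
        s = nxt
    return s


# Page 0 links to 1 and 2, page 1 links to 2, page 2 links nowhere.
scores = inverse_pagerank([(0, 1), (0, 2), (1, 2)], n=3)
```

In this toy graph page 0 reaches the most pages and page 2 reaches none, so the scores decrease from page 0 to page 2.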

Page 13: Anti Spam Algorithm

TrustRank (3/7)
• Step 2: Generate good seeds

Page 14: Anti Spam Algorithm

TrustRank (4/7)
• Step 3: Select good seeds

Example: with L = 3, the seed set is {2, 4, 5}.

Page 15: Anti Spam Algorithm

TrustRank (5/7)
• Step 4: Normalize the static score distribution vector
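Steps 2 to 4 can be sketched together: rank pages by seed desirability, inspect the top L, keep the ones judged good, and normalize the resulting vector so it sums to 1. This is a hedged sketch under assumptions: the human judgment is modeled by a hypothetical callback `is_good`, pages are numbered from 0, and the function name is invented for the example.

```python
def select_seeds(desirability, limit, is_good):
    """Return the normalized static score distribution vector d."""
    n = len(desirability)
    # Rank pages from most to least desirable.
    order = sorted(range(n), key=lambda page: desirability[page], reverse=True)

    d = [0.0] * n
    for page in order[:limit]:
        # Only the top `limit` pages are shown to the oracle.
        if is_good(page):
            d[page] = 1.0

    total = sum(d)
    if total > 0:
        d = [value / total for value in d]  # entries now sum to 1
    return d


good_pages = {1, 4}  # hypothetical oracle answers
d = select_seeds([0.1, 0.3, 0.05, 0.25, 0.2, 0.1], limit=3,
                 is_good=lambda page: page in good_pages)
```

Here the three most desirable pages are 1, 3, and 4; page 3 is rejected by the oracle, so the trust mass is split evenly between pages 1 and 4.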

Page 16: Anti Spam Algorithm

TrustRank (6/7)
• Step 5: Compute TrustRank scores

t* = α * T * t* + (1 - α) * d, repeated for M iterations
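The propagation step can be sketched as a biased PageRank iteration in which the random jump returns only to the seed pages. The code is an illustration under assumptions: `d` is the normalized seed vector from the previous step, and the function name and the defaults `alpha=0.85` and `iterations=20` are choices made for this example.

```python
def trustrank(edges, d, alpha=0.85, iterations=20):
    """Propagate trust from the seed distribution d along the links.

    edges is a list of (source, target) links; pages are 0..len(d)-1.
    """
    n = len(d)
    outdegree = [0] * n
    for source, _ in edges:
        outdegree[source] += 1

    t = list(d)
    for _ in range(iterations):
        # The (1 - alpha) share always goes back to the seed pages.
        nxt = [(1 - alpha) * d[page] for page in range(n)]
        for source, target in edges:
            # A page splits its trust evenly among its outlinks.
            nxt[target] += alpha * t[source] / outdegree[source]
        t = nxt
    return t


# Good seed 0 links to page 1; pages 2 and 3 form a separate component.
scores = trustrank([(0, 1), (2, 3)], d=[1.0, 0.0, 0.0, 0.0])
```

Pages that cannot be reached from a seed, such as 2 and 3 here, end with a trust score of 0.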

Page 17: Anti Spam Algorithm

TrustRank (7/7)
• Conclusion

Page 18: Anti Spam Algorithm

Hilltop (1/9)
• Expert page
  – A page that is about a certain topic and has links to many non-affiliated pages on that topic.
• Non-affiliated
  – Two pages are conceptually non-affiliated if they are authored by authors from non-affiliated organizations.

Page 19: Anti Spam Algorithm

Hilltop (2/9)
• Step 1: Expert lookup
  – Detecting host affiliation
  – Selecting the experts
  – Indexing the experts

[Figure: an expert page linking to non-affiliated pages; the key phrases of the expert are indexed]

Page 20: Anti Spam Algorithm

Hilltop (3/9)
• Detecting host affiliation
  – Rules: two hosts are affiliated if one or both of the following are true
    • They share the same first 3 octets of the IP address.
    • The rightmost non-generic token in the hostname is the same.
      Example: "www.ibm.com" and "ibm.co.mx"
  – The affiliation relation is transitive: if A and B are affiliated and B and C are affiliated, then A and C are taken to be affiliated.

Page 21: Anti Spam Algorithm

Hilltop (4/9)
• Selecting the experts
  – Consider every page with out-degree greater than a threshold k (e.g., k = 5) and test whether its URLs point to k distinct non-affiliated hosts. Every such page is considered an expert page.

[Figure: an expert page linking to non-affiliated pages]
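Expert selection depends on the host-affiliation test, so the sketch below covers both: hosts are grouped with a union-find structure (which gives transitivity for free), and a page qualifies as an expert when its outlinks cover at least k distinct groups. Everything named here is an assumption for illustration: the `GENERIC_TOKENS` list is a small made-up sample, the IP addresses are documentation addresses, and the function names are not from the original system.

```python
# Illustrative sample only; a real system would derive generic suffix
# tokens from how often they occur across many hosts.
GENERIC_TOKENS = {"com", "co", "org", "net", "edu", "uk", "mx", "tw"}


def rightmost_non_generic_token(hostname):
    for token in reversed(hostname.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return None


def affiliation_groups(hosts):
    """Map each hostname to a group id; hosts maps hostname -> IPv4 string."""
    parent = {host: host for host in hosts}

    def find(host):
        while parent[host] != host:
            parent[host] = parent[parent[host]]  # path halving
            host = parent[host]
        return host

    first_seen = {}
    for host, ip in hosts.items():
        keys = [("ip", ".".join(ip.split(".")[:3])),
                ("token", rightmost_non_generic_token(host))]
        for key in keys:
            if key[1] is None:
                continue
            if key in first_seen:
                # Merging groups makes the relation transitive.
                parent[find(host)] = find(first_seen[key])
            else:
                first_seen[key] = host
    return {host: find(host) for host in hosts}


def is_expert(linked_hosts, group_of, k=5):
    """linked_hosts has one entry per outgoing URL of the page."""
    if len(linked_hosts) <= k:  # out-degree must be greater than k
        return False
    return len({group_of[host] for host in linked_hosts}) >= k


hosts = {"www.ibm.com": "192.0.2.10", "ibm.co.mx": "198.51.100.7",
         "example.org": "198.51.100.99", "other.net": "203.0.113.5"}
groups = affiliation_groups(hosts)
```

In this sample, "www.ibm.com" and "ibm.co.mx" share the token "ibm", and "example.org" shares an IP prefix with "ibm.co.mx", so all three fall into one group while "other.net" stays separate.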

Page 22: Anti Spam Algorithm

Hilltop (5/9)
• Indexing the experts
  – Index the text contained within the "key phrases" of the expert. The following are considered key phrases:
    • title
    • headings (e.g., <H1> </H1> tags)
    • anchor text
  – A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text.
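To make the notion of scope concrete, here is a small sketch that walks an HTML page and records which key phrases qualify each link. It assumes a simplified scoping rule: the title qualifies every link, an `<h1>` heading qualifies the links up to the next `<h1>`, and anchor text qualifies only its own link. The class and function names are hypothetical, and only `<title>`, `<h1>`, and `<a>` are handled.

```python
from html.parser import HTMLParser


class KeyPhraseParser(HTMLParser):
    """Records, for each link, the key phrases whose scope contains it."""

    def __init__(self):
        super().__init__()
        self.title = None     # qualifies every link in the page
        self.heading = None   # most recent <h1> text
        self.links = []       # one dict per <a> element
        self._open = None     # key-phrase tag currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "a"):
            self._open = tag
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        text = "".join(self._text).strip()
        if tag == "title":
            self.title = text
        elif tag == "h1":
            self.heading = text
        else:
            # Anchor text qualifies only its own link.
            self.links.append({"anchor": text, "heading": self.heading,
                               "title": self.title})
        self._open = None


def qualified_counts(html):
    """Count how many links each title or heading phrase qualifies."""
    parser = KeyPhraseParser()
    parser.feed(html)
    parser.close()
    counts = {}
    for link in parser.links:
        for level in ("title", "heading"):
            key = (level, link[level])
            counts[key] = counts.get(key, 0) + 1
    return counts


page = ("<title>Central University</title>"
        "<h1>Information Management</h1><a>001</a><a>002</a>"
        "<h1>Business Administration</h1><a>001</a><a>002</a>")
counts = qualified_counts(page)
```

For this page the title qualifies all four links and each heading qualifies the two links that follow it.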

Page 23: Anti Spam Algorithm

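Hilltop (6/9)
• Example
  – The title qualifies 4 URLs
  – Each heading qualifies 2 URLs
  – Each anchor qualifies 1 URL

<title> Central University </title>

<h1> Department of Information Management </h1> <A> 001 </A> <A> 002 </A>

<h1> Department of Business Administration </h1> <A> 001 </A> <A> 002 </A>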

Page 24: Anti Spam Algorithm

Hilltop (7/9)
• Step 2: Target ranking
  – Computing the expert score
  – Computing the target score

[Figure: the top N = 200 expert pages point to target pages; at least 2 experts must point to a target]

Page 25: Anti Spam Algorithm

Hilltop (8/9)
• Computing the expert score
  – The expert score reflects the number and importance of the key phrases that contain the query keywords.

Page 26: Anti Spam Algorithm

Computing the Expert Score (1/2)

For a query q with k keywords:
S0: total score of the key phrases that contain all k keywords
S1: total score of the key phrases that contain k-1 keywords
S2: total score of the key phrases that contain k-2 keywords

Si = SUM(key phrases p with k-i query terms) LevelScore(p) * FullnessFactor(p,q)

LevelScore: 16 for a title, 6 for a heading, 1 for anchor text

plen is the length of p, and m is the number of terms in p that are not in q.
If m <= 2, FullnessFactor(p,q) = 1
If m > 2, FullnessFactor(p,q) = 1 - (m-2) / plen

Example
Query: A B
Title: A B C
H1: A

S0 = 16*1 = 16 (the title contains both query terms)
S1 = 6*1 = 6 (the heading contains one query term)
S2 = 0

Page 27: Anti Spam Algorithm

Computing the Expert Score (2/2)

S0: total score of the key phrases that contain all k keywords
S1: total score of the key phrases that contain k-1 keywords
S2: total score of the key phrases that contain k-2 keywords

Expert_Score = (2^32 * S0) + (2^16 * S1) + S2
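The expert score can be sketched in a few lines. This is an illustration under stated assumptions rather than the reference implementation: each key phrase is counted under exactly one component according to how many distinct query terms it contains, phrases containing no query term are skipped, and the function names and the `(level, terms)` input format are invented for the example.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    # m counts the phrase terms that do not appear in the query.
    m = sum(1 for term in phrase_terms if term not in query_terms)
    if m <= 2:
        return 1.0
    return 1.0 - (m - 2) / len(phrase_terms)


def expert_score(key_phrases, query_terms):
    """key_phrases is a list of (level, [terms]); returns ([S0, S1, S2], score)."""
    query = set(query_terms)
    k = len(query)
    s = [0.0, 0.0, 0.0]
    for level, terms in key_phrases:
        matched = len(query & set(terms))
        i = k - matched
        # Assumption: a phrase with no query term contributes nothing.
        if matched > 0 and i <= 2:
            s[i] += LEVEL_SCORE[level] * fullness_factor(terms, query)
    return s, 2**32 * s[0] + 2**16 * s[1] + s[2]


components, score = expert_score(
    [("title", ["A", "B", "C"]), ("heading", ["A"])],
    query_terms=["A", "B"],
)
```

The large multipliers mean that any match on all query terms outranks every combination of partial matches, which is why S0 dominates the final score.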

Page 28: Anti Spam Algorithm

Hilltop (9/9)
• Computing the target score
  – The target score reflects both the number and relevance of the experts pointing to the target, and the relevance of the phrases qualifying the links.

Page 29: Anti Spam Algorithm

Computing the Target Score (1/2)

occ(w,T) is the number of distinct key phrases in expert E that contain keyword w and qualify the edge (E,T).

If occ(w,T) is 0 for any query keyword, then Edge_Score(E,T) = 0.

Otherwise,

Edge_Score(E,T) = Expert_Score(E) * SUM(query keywords w) occ(w,T)

[Figure: an edge from expert E to target T]
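The edge score rule can be written out directly. In this sketch the key phrases of expert E that qualify the link to T are passed in as lists of terms; that input format and the function name are assumptions for illustration.

```python
def edge_score(expert_score, qualifying_phrases, query_terms):
    """Score of the edge (E, T) given the key phrases of E that qualify it."""
    # occ(w, T) counts distinct phrases, so duplicates are dropped first.
    distinct = {tuple(phrase) for phrase in qualifying_phrases}
    total = 0
    for word in set(query_terms):
        occ = sum(1 for phrase in distinct if word in phrase)
        if occ == 0:
            # Every query keyword must appear in some qualifying phrase.
            return 0
        total += occ
    return expert_score * total


phrases = [["A", "B", "C"], ["A"]]
score_ab = edge_score(10, phrases, ["A", "B"])  # occ(A) = 2, occ(B) = 1
score_ad = edge_score(10, phrases, ["A", "D"])  # D never occurs
```

With an expert score of 10, the first query gives 10 * (2 + 1) = 30, and the second gives 0 because one keyword is missing.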

Page 30: Anti Spam Algorithm

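Computing the Target Score (2/2)

Target_Score = SUM(non-affiliated E) Edge_Score(E,T)

[Figure: experts E1, E2, and E3 pointing to target T]

Example: E2 and E3 are affiliated, and ES(E2,T) > ES(E3,T), so only the higher edge score is kept: Target_Score = ES(E1,T) + ES(E2,T).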
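Summing over non-affiliated experts can be sketched by keeping only the best edge score within each affiliation group. The group mapping and the function name are assumptions for this example; experts missing from the mapping are treated as their own group.

```python
def target_score(edge_scores, group_of):
    """edge_scores is a list of (expert, score) pairs for a single target."""
    best = {}
    for expert, score in edge_scores:
        group = group_of.get(expert, expert)
        # Among affiliated experts, only the highest edge score counts.
        best[group] = max(best.get(group, 0), score)
    return sum(best.values())


# E2 and E3 are affiliated; E1 is independent.
total = target_score([("E1", 5), ("E2", 8), ("E3", 3)],
                     group_of={"E2": "g", "E3": "g"})
```

Here the lower score from E3 is dropped, so the target receives 5 + 8 = 13.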

Page 31: Anti Spam Algorithm

Evaluation: TrustRank (1/3)

Page 32: Anti Spam Algorithm

Evaluation: TrustRank (2/3)

• Pairwise orderness
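Pairwise orderness measures how often a trust function orders two pages consistently with their true quality. The sketch below assumes the definition used in the TrustRank paper, where a pair is violated when the page that is not good receives at least as much trust as the good one, and it evaluates all unordered pairs rather than a sample. The function name and dictionary inputs are hypothetical.

```python
from itertools import combinations


def pairwise_orderness(trust, is_good):
    """Fraction of page pairs whose trust order does not contradict is_good.

    trust maps page -> score; is_good maps page -> 1 (good) or 0 (bad).
    """
    pairs = list(combinations(trust, 2))
    if not pairs:
        return 1.0
    violations = 0
    for p, q in pairs:
        if trust[p] >= trust[q] and is_good[p] < is_good[q]:
            violations += 1
        elif trust[p] <= trust[q] and is_good[p] > is_good[q]:
            violations += 1
    return (len(pairs) - violations) / len(pairs)


labels = {"a": 1, "b": 0, "c": 1}
ordered = pairwise_orderness({"a": 0.9, "b": 0.1, "c": 0.5}, labels)
misordered = pairwise_orderness({"a": 0.1, "b": 0.9, "c": 0.5}, labels)
```

A value of 1 means no pair is misordered; pairs in which both pages have the same label can never count as violations.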

Page 33: Anti Spam Algorithm

Evaluation: TrustRank (3/3)

• Precision

• Recall
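Precision and recall for a trust function can be sketched with a simple threshold: pages scoring above it are predicted good. The threshold scheme, the function name, and the sample values are assumptions for illustration, not figures from the evaluation.

```python
def precision_recall(trust, is_good, threshold):
    """trust maps page -> score; is_good maps page -> True or False."""
    predicted_good = [page for page, score in trust.items() if score > threshold]
    actually_good = [page for page in trust if is_good[page]]
    hits = sum(1 for page in predicted_good if is_good[page])

    # Precision: share of predicted-good pages that really are good.
    precision = hits / len(predicted_good) if predicted_good else 0.0
    # Recall: share of good pages that were predicted good.
    recall = hits / len(actually_good) if actually_good else 0.0
    return precision, recall


trust = {"a": 0.9, "b": 0.6, "c": 0.1, "d": 0.7}
labels = {"a": True, "b": False, "c": True, "d": True}
precision, recall = precision_recall(trust, labels, threshold=0.5)
```

In this sample three pages pass the threshold and two of them are good, while one good page is missed, so both values come out to two thirds.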

Page 34: Anti Spam Algorithm

Evaluation: Hilltop (1/2)
• Precision

Page 35: Anti Spam Algorithm

Evaluation: Hilltop (2/2)
• Recall

Page 36: Anti Spam Algorithm

References
• Combating Web Spam with TrustRank
  – http://www.vldb.org/conf/2004/RS15P3.PDF
• Hilltop: A Search Engine based on Expert Documents
  – http://www.cs.toronto.edu/~georgem/hilltop/
• Types of web spam
  – http://en.wikipedia.org/wiki/Spamdexing

Page 37: Anti Spam Algorithm

Q&A