Master's Thesis Defense
December 16, 2010
Database and Multimedia Lab, Korea Advanced Institute of Science and Technology (KAIST)
Improving the Quality of Web Spam Filtering by Using Seed Refinement
Presenter: Qureshi, Muhammad Atif
Advisor: Whang, Kyu-Young
Contents
Introduction
Related Work
Web Spam Filtering Using Seed Refinement: Algorithms and Strategy
Performance Evaluation
Conclusion
Web Search Engine
Definition [BP98]: A system that retrieves relevant web pages for users' queries from the World Wide Web (WWW).
Examples: Google, Yahoo!, MS Live Search, Naver.
Introduction
Web Page Ranking
Motivation
User queries return a huge number of relevant web pages, but users want to browse only the most important ones.
Note: Relevance means that a web page matches the user's query.
Concept: Ordering the relevant web pages according to their importance [GMT04].
Note: Importance represents the interest of a user in the relevant web pages.
Methods [ACG01]
Link-based method: exploits the link structure of the web to order the search results.
Content-based method: exploits the contents of web pages to order the search results.
We focus on link-based methods since they are prevalent in popular search engines such as Google and Yahoo! [BP98, CDG07, YUT08].
Link Structure of the Web [GGP04]
Concept: The web can be modeled as a graph G(V, E), where V is a set of vertices representing web nodes and E is a set of edges representing directed links between the nodes. The link structure is called the web graph.
Note: A web node represents either a web page or a web domain. Links are classified into two classes as follows:
Example
V = {A, B, C}
E = {AB, BC}
Inlink: the incoming link to a web node. Outlink: the outgoing link from a web node.
AB is an outlink of the web node A; BC is an outlink of the web node B.
AB is an inlink of the web node B; BC is an inlink of the web node C.
Fig. 1: An example of a web graph (A → B → C).
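The web graph of Fig. 1 can be represented directly as adjacency lists; a minimal sketch (variable names are illustrative):

```python
# The example web graph G(V, E) from Fig. 1, stored as outlink
# adjacency lists; inlink lists are derived by inverting each edge.
from collections import defaultdict

V = {"A", "B", "C"}
E = [("A", "B"), ("B", "C")]  # the directed links AB and BC

outlinks = defaultdict(list)
inlinks = defaultdict(list)
for src, dst in E:
    outlinks[src].append(dst)  # AB is an outlink of A
    inlinks[dst].append(src)   # AB is an inlink of B

print(outlinks["A"])  # ['B']
print(inlinks["B"])   # ['A']
```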
Web Page Ranking by Using the Link-based Methods
Concept [BP98]
A web node is more important if it receives more inlinks.
Popular method: PageRank [BP98]
PR[p] = d × Σ_{q : (q, p) ∈ E} PR[q] / Noutlink(q) + (1 − d) × v[p]

PR[p]: PageRank value of the web node p
Noutlink(q): the number of outlinks of the web node q
d: damping factor (probability of following an outlink)
v[p]: the probability of a random jump from the web node p to any arbitrary web node
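The PageRank formula above can be sketched as a power iteration; a minimal, illustrative implementation with a uniform jump vector (dangling nodes simply lose mass in this sketch, which suffices for the example):

```python
# Power-iteration sketch of PageRank:
# PR[p] = d * sum over inlinks (q,p) of PR[q]/Noutlink(q) + (1-d) * v[p].
def pagerank(nodes, edges, d=0.85, iters=50):
    outlinks = {p: [] for p in nodes}
    for q, p in edges:
        outlinks[q].append(p)
    v = {p: 1.0 / len(nodes) for p in nodes}   # uniform random-jump vector
    pr = dict(v)                               # start from the jump vector
    for _ in range(iters):
        nxt = {p: (1 - d) * v[p] for p in nodes}
        for q in nodes:
            for p in outlinks[q]:
                nxt[p] += d * pr[q] / len(outlinks[q])
        pr = nxt
    return pr

scores = pagerank({"A", "B", "C"}, [("A", "B"), ("B", "C")])
# B outranks A since B receives an inlink from A.
```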
Web Spam [HMS02, GG05]
Concept: Any deliberate action to boost a web node's rank without improving its real merit.
Link spam: web spam against link-based methods; an action that changes the link structure of the web in order to boost a web node's ranking.
Example
The web nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes. The web nodes N3 through Nx are involved in link spam, so they are called spam nodes: an actor creates the web nodes N3 to Nx in order to boost the rank of the web node N3.
Fig. 2: An example of link spam.
Web Spam Filtering Algorithm Overview
Web spam filtering algorithms output spam nodes to be filtered out [GBG06]. In order to identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as input [GGP04, KR06, GBG06, WD05].
Spam input seed set: the input seed set containing spam nodes.
Non-spam input seed set: the input seed set containing non-spam nodes.
The input seed set can be used as the basis for grading the degree to which web nodes are spam or non-spam [GGP04, KR06, GBG06].
Observation
The output quality of web spam filtering algorithms depends on the quality of the input seed sets. The output of one web spam filtering algorithm can be used as the input of another; thus, the algorithms may support one another if placed in appropriate succession.
Motivation and Goal
Motivation
There is no well-known study addressing the refinement of the input seed sets for web spam filtering algorithms, and no well-known study on successions among web spam filtering algorithms.
Goal
Improving the quality of web spam filtering by using seed refinement, and by finding the appropriate succession among web spam filtering algorithms.
Contributions
We propose modified algorithms that apply seed refinement techniques, using both spam and non-spam input seed sets, to well-known web spam filtering algorithms.
We propose a strategy that finds the best succession of the modified algorithms.
We conduct extensive experiments to show the quality improvement of our work: we compare the original (i.e., well-known) algorithms with the respective modified algorithms, and we evaluate the best succession among our modified algorithms.
Related Work
There are two research directions related to web spam:
1. Evaluating either the goodness or badness of web nodes [GGP05, KR06]. TrustRank and Anti-TrustRank are well-known algorithms; these two can be used for refining input seed sets.
2. Detecting spam nodes [GBG06, WD05]. Spam Mass and Link Farm Spam are well-known algorithms; these two can be used for identifying web spam.
We classify web spam filtering algorithms into two types: seed refinement algorithms (e.g., TrustRank and Anti-TrustRank) and spam detection algorithms (e.g., Spam Mass and Link Farm Spam).
Note: Existing work exploits a web graph whose web nodes represent domains [GBG06, WD05].
Related Work
TrustRank Overview [GGP04]
Trusted domains (e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains via their outlinks. Trust scores are propagated through the outlinks of trusted domains. Domains having high trust scores (≥ threshold) at the end of propagation are declared non-spam domains.
Example
Observation: Trust scores can propagate to spam domains if a trusted domain outlinks to the spam domains.
t(i): the trust score of domain i. Domains 1 and 2 are seed non-spam domains with t(1) = t(2) = 1. The domain 3, being considered, gets trust scores from the domains 1 and 2 (t(3) = 5/6); domain 4 gets t(4) = 1/3.
Fig. 3: An example for explaining TrustRank.
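TrustRank's propagation amounts to a PageRank iteration biased toward the trusted seed set. A minimal sketch under that assumption (the example graph mirrors the shape of Fig. 3, not its exact scores; names are illustrative):

```python
# TrustRank sketch: PageRank with the jump vector concentrated on the
# trusted seeds, so trust flows along outlinks of trusted domains.
def trustrank(nodes, edges, seeds, damp=0.85, iters=50):
    outlinks = {p: [] for p in nodes}
    for q, p in edges:
        outlinks[q].append(p)
    # Initial trust: uniform over the seed set, zero elsewhere.
    t0 = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in nodes}
    t = dict(t0)
    for _ in range(iters):
        nxt = {p: (1 - damp) * t0[p] for p in nodes}
        for q in nodes:
            for p in outlinks[q]:
                nxt[p] += damp * t[q] / len(outlinks[q])
        t = nxt
    return t

# Seeds 1 and 2 both link to domain 3; domain 3 links to domain 4.
t = trustrank({1, 2, 3, 4}, [(1, 3), (2, 3), (3, 4)], seeds={1, 2})
# Domain 3 accumulates trust from both seeds; domain 4 gets a damped share.
```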
Anti-TrustRank Overview [KR06]
Anti-trusted domains (e.g., well-known spam domains) are usually pointed to by spam domains via inlinks. Anti-trust scores are propagated through the inlinks of anti-trusted domains. Domains having high anti-trust scores (≥ threshold) at the end of propagation are declared spam domains.
Example
Observation: Anti-trust scores can propagate to non-spam domains if a non-spam domain outlinks to a spam domain.
at(i): the anti-trust score of domain i. Domains 1 and 2 are seed spam domains with at(1) = at(2) = 1. The domain 3, being considered, gets anti-trust scores from the domains 1 and 2 (at(3) = 5/6); domain 4 gets at(4) = 1/3.
Fig. 4: An example for explaining Anti-TrustRank.
Spam Mass Overview [GBG06]
A domain is spam if it has an excessively high spam score. The spam score is estimated by subtracting a non-spam score from the PageRank score; the non-spam score is estimated as a trust score computed by TrustRank.
Example
Observation: Since Spam Mass uses TrustRank, it inherits the same problem as TrustRank.
The domain 5, being considered, receives many inlinks but only one indirect inlink from a seed non-spam domain.
Fig. 5: An example for explaining Spam Mass.
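The Spam Mass decision rule described above can be sketched as follows, assuming the PageRank and trust vectors are already computed (function and threshold names are illustrative):

```python
# Spam Mass sketch: the spam score is PageRank minus the non-spam
# (trust) score; a domain whose spam score is an excessive fraction
# of its PageRank is flagged as a spam candidate.
def spam_mass_flags(pagerank, trust, relative_mass=0.9):
    flagged = set()
    for d, pr in pagerank.items():
        if pr > 0:
            spam_score = pr - trust.get(d, 0.0)
            if spam_score / pr >= relative_mass:
                flagged.add(d)
    return flagged

# Domain "x" has high PageRank but almost no trust, so it is flagged;
# domain "y" earns most of its PageRank as trust, so it is not.
flags = spam_mass_flags({"x": 0.4, "y": 0.3}, {"x": 0.01, "y": 0.25})
```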
Link Farm Spam Overview [WD05]
A domain is spam if it has many bidirectional links with other domains, or if it has many outlinks pointing to spam domains.
Example
Observation: Link Farm Spam does not take any input seed set; however, a domain can have many bidirectional links with trusted domains as well.
The domains 1, 3, and 4 have bidirectional links with the domain being considered.
Fig. 6: An example for explaining Link Farm Spam.
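The first Link Farm Spam rule, counting bidirectional links, can be sketched as follows (a link d↔q is bidirectional when both (d, q) and (q, d) are in E; names are illustrative):

```python
# Count the bidirectional links of one domain: an outlink (d, q)
# is bidirectional when the reverse edge (q, d) also exists.
def bidirectional_count(domain, edges):
    edge_set = set(edges)
    return sum(1 for (d, q) in edge_set
               if d == domain and (q, d) in edge_set)

# Domain 3 has bidirectional links with 1 and 4, but only a
# one-way outlink to 5.
E = [(1, 3), (3, 1), (3, 4), (4, 3), (3, 5)]
count = bidirectional_count(3, E)  # 2
```

A domain whose count meets the limitBL threshold would be declared spam under this rule.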
Web Spam Filtering Using Seed Refinement
Objectives
Decrease the number of domains incorrectly detected as belonging to the class of non-spam domains (called false positives). Increase the number of domains correctly detected as belonging to the class of spam domains (called true positives).
Our approaches
We modify the spam filtering algorithms to use both spam and non-spam domains in order to decrease false positives: we use non-spam domains so that their goodness does not propagate to spam domains, and we use spam domains so that their badness does not propagate to non-spam domains.
We make a succession of these algorithms in order to increase true positives: the seed refinement algorithm is followed by the spam detection algorithm, so that the spam detection algorithm uses the refined input seed sets produced by the seed refinement algorithm.
Modified TrustRank
Modification: Trust scores should not propagate to spam domains.
Example
t(i): the trust score of domain i; t(1) = t(2) = 1, t(3) = 5/6, t(4) = 1/3. The domains 5 and 6 are seed spam domains involved in web spam. Without the modification, trust scores would propagate to them from the trusted domains (t(5) = 5/12 + …, t(6) = 5/12 + …); Modified TrustRank blocks this propagation.
Fig. 7: An example explaining Modified TrustRank.
Modified Anti-TrustRank
Modification: Anti-trust scores should not propagate to non-spam domains.
Example
at(i): the anti-trust score of domain i; at(1) = at(2) = 1, at(3) = 5/6, at(4) = 1/3. The domains 5, 6, and 7 are non-spam domains (domain 7 is a seed non-spam domain). Without the modification, anti-trust scores would propagate to them from the spam domains (at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …); Modified Anti-TrustRank blocks this propagation.
Fig. 8: An example explaining Modified Anti-TrustRank.
Modified Spam Mass
Modification: Use Modified TrustRank in place of TrustRank.
Example
As in Fig. 5, the domain 5, being considered, receives many inlinks but only one indirect inlink from a seed non-spam domain; here a seed spam domain is also present, and Modified TrustRank is used to compute the non-spam scores.
Fig. 9: An example explaining Modified Spam Mass.
Modified Link Farm Spam
Modification: Use two types of input seed sets (i.e., spam and non-spam domains). A domain having many bidirectional links with only trusted domains is not detected as a spam domain.
Example
The domains 1, 3, and 4 have bidirectional links among themselves; the domains 6, 7, and 8 are seed non-spam domains.
Fig. 10: An example explaining Modified Link Farm Spam.
Strategy to Make a Succession of Modified Algorithms
Overview
We make a succession of the seed refinement algorithms (simply, Seed Refiner) followed by the spam detection algorithms (simply, Spam Detector). We also consider the execution order of the algorithms belonging to Seed Refiner and Spam Detector, respectively.
Data flow: manually labeled spam and non-spam domains → Seed Refiner → refined spam and non-spam domains → Spam Detector → detected spam domains.
Fig. 11: The strategy of succession.
Strategy
Consideration of the execution order in Seed Refiner.
Modified TrustRank followed by Modified Anti-TrustRank.
Modified Anti-TrustRank followed by Modified TrustRank.
Consideration of the execution order in Spam Detector.
Modified Spam Mass followed by Modified Link Farm Spam.
Modified Link Farm Spam followed by Modified Spam Mass.
Performance Evaluation
Purpose
Show the effect of seed refinement on the quality of web spam filtering, and show the effect of succession on the quality of web spam filtering.
Experiments
We conduct two sets of experiments according to the two purposes mentioned above.
Performance Evaluation
Table 1: Summary of the experiments.
Set 1: Comparisons for showing the effect of refining seeds.
Exp. 1: Comparison between TR (TrustRank) and MTR (Modified TrustRank). Parameters: cutoffTr 0%-300%; ratioTop 10%, 50%, 100%; damp 0.85.
Exp. 2: Comparison between ATR (Anti-TrustRank) and MATR (Modified Anti-TrustRank). Parameters: cutoffATr 0%-300%; ratioTop 10%, 50%, 100%; damp 0.85.
Exp. 3: Comparison between SM (Spam Mass) and MSM (Modified Spam Mass). Parameters: relativeMass 0.7-1.0; topPR 10%, 50%, 100%; damp 0.85.
Exp. 4: Comparison between LFS (Link Farm Spam) and MLFS (Modified Link Farm Spam). Parameters: limitBL 2-7; limitOL 2-7.
Set 2: Comparisons for showing the effect of ordering executions.
Exp. 5: Finding the best succession for the seed refiner. Parameters: cutoffTr 50%, 75%, 100%; cutoffATr 100%; damp 0.85.
Exp. 6: Finding the best succession for the spam detector. Parameters: relativeMass 0.8-0.99; topPR 100%; limitBL 7; limitOL 7; damp 0.85.
Exp. 7: Comparison among the best succession, the best known algorithm, and the best modified algorithm. Parameters: relativeMass 0.8-0.99; topPR 100%; limitBL 7; limitOL 7; damp 0.85.
Experimental Parameters
Table 2: Parameters used in the experiments.
damp: used in TR, MTR, ATR, and MATR; the probability of following an outlink.
ratioTop: the ratio for determining the input seed sets in TR, MTR, ATR, and MATR. Specifically, from the spam (or non-spam) seed set, we retrieve the domains whose PageRank scores are greater than or equal to the PageRank score of the top-ratioTop% domain among all domains, and use those domains as the input seed set.
cutoffTr: the cutoff threshold in TR and MTR for declaring the number of non-spam domains. In this thesis, its value is set proportional to the size of the non-spam input seed set.
cutoffATr: the cutoff threshold in ATR and MATR for declaring the number of spam domains. In this thesis, its value is set proportional to the size of the spam input seed set.
relativeMass: a threshold used in SM and MSM for deciding that a domain is spam: if the domain's spam score is excessively high compared to its non-spam score, the domain is a candidate for web spam.
topPR: a threshold used in SM and MSM for deciding the candidates for being spam domains: a domain qualifies if its PageRank score is within the top topPR% of the PageRank scores.
limitBL: a threshold used in LFS and MLFS for declaring a domain spam if its number of bidirectional links is greater than or equal to this threshold.
limitOL: a threshold used in LFS and MLFS for declaring a domain spam if the number of its outlinks pointing to spam domains is greater than or equal to this threshold.
Experimental Data [BCD08, CDB06, CDG07]
Table 3: Characteristics of the data set in terms of domains and web pages.
Labeled spam domains: 1,924
Labeled non-spam domains: 5,549
Unlabeled (unknown) domains: 3,929
Total domains: 11,402; total web pages: 77.9 million.
Table 4: Classification of the data set into Seed Set and Test Set.
Labeled spam domains: 674 (Seed Set), 1,250 (Test Set)
Labeled non-spam domains: 4,948 (Seed Set), 601 (Test Set)
Experimental Measures
Table 5: Description of the measures.
True positives: the number of domains correctly labeled as belonging to the class (i.e., spam or non-spam) [BCD08].
False positives: the number of domains incorrectly labeled as belonging to the class (i.e., spam or non-spam) [BCD08].
F-measure: the combined representation of precision and recall. Precision, recall [SM86], and F-measure are defined as follows:
precision = true positives / (true positives + false positives)
recall = true positives / (true positives + false negatives¹)
F-measure = 2 × precision × recall / (precision + recall)
¹False negatives: the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
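The measures in Table 5 can be computed directly from the counts; a quick numeric sketch:

```python
# Precision, recall, and F-measure from raw counts, as in Table 5.
def measures(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# With 80 true positives, 20 false positives, and 20 false negatives,
# precision and recall are both 0.8, so the F-measure is also 0.8.
p, r, f = measures(tp=80, fp=20, fn=20)
```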
Comparison between Original and Modified Algorithms (1/3)
Experiment 1: Comparison between TR and MTR
MTR performs comparably to or slightly better than TR in terms of both true positives and false positives.
We find cutoffTr effective up to the 100% mark; beyond 100%, detection becomes unstable in terms of false positives. For later experiments, we fix the cutoffTr range up to 100%.
Experiment 2: Comparison between ATR and MATR
MATR generally performs better than ATR in terms of true positives.
We find cutoffATr effective up to the 180% mark; beyond that, detection becomes unstable in terms of false positives. For later experiments, we fix cutoffATr at 100% to ensure high precision.
Comparison between Original and Modified Algorithms (2/3)
Experiment 3: Comparison between SM and MSM
MSM performs slightly better than SM in terms of true positives and comparably in terms of false positives.
We find relativeMass effective in the range 0.95 to 0.99 for maximizing true positives and minimizing false positives. For later experiments, we keep 0.8 to 0.99 as the effective range of relativeMass.
Experiment 4: Comparison between LFS and MLFS
MLFS performs better than LFS in terms of false positives, at some expense of true positives.
We find limitBL = 7 and limitOL = 7 highly effective in minimizing false positives. For later experiments, we keep limitBL = 7 and limitOL = 7.
Comparison between Original and Modified Algorithms (3/3)
Summary
We have found that all modified algorithms provide better quality than the respective original algorithms.
We found SM to be the best original web spam detection algorithm among ATR, SM, and LFS due to its high true positives and relatively few false positives.
We also found MSM to be the best modified web spam detection algorithm among MATR, MSM, and MLFS for the same reason.
The Best Succession for the Seed Refiner
Table 6: Comparison for the seed refiner. For finding refined non-spam domains, both successions show identical true positives and false positives. For finding refined spam domains, true positives are identical for both successions, but MATR-MTR shows better false positives than MTR-MATR.
Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.
The Best Succession for the Spam Detector
Comparison
We pick relativeMass = 0.99 since false positives are minimal at this value while true positives are almost comparable across all values of relativeMass.
We observe that MLFS fails to detect a considerable number of spam domains.
We obtain precisions of 0.86, 0.86, 0.93, and 0.87 for MLFS-MSM, MSM-MLFS, MLFS, and MSM, respectively, and recalls of 0.80, 0.80, 0.33, and 0.76.
Since MLFS-MSM and MSM-MLFS are best and identical in performance, we choose MLFS-MSM as the best spam detector without loss of generality.
Fig. 12: Comparison for the spam detector.
Comparison among the Best Succession, the Best Known Algorithm, and the Best Modified Algorithm
Comparison
We pick relativeMass = 0.99 since false positives are minimal at this value while true positives are almost comparable across all values of relativeMass.
We observe that MATR-MTR-MLFS-MSM finds more true positives at the cost of some more false positives.
We obtain precisions of 0.85, 0.86, and 0.86 for SM, MSM, and MATR-MTR-MLFS-MSM, respectively, and recalls of 0.64, 0.70, and 0.80.
Fig. 13: Comparison among MATR-MTR-MLFS-MSM, SM, and MSM.
Therefore, MATR-MTR-MLFS-MSM is more effective.
Conclusions
We have improved the quality of web spam filtering by using seed refinement: we have proposed modifications to four well-known web spam filtering algorithms.
We have proposed a strategy of succession of the modified algorithms: Seed Refiner defines the execution order of the seed refinement algorithms, and Spam Detector defines the execution order of the spam detection algorithms.
We have conducted extensive experiments to show the effect of seed refinement on the quality of web spam filtering. We find that every modified algorithm performs better than the respective original algorithm, and that the best succession is MATR followed by MTR, MLFS, and MSM (i.e., MATR-MTR-MLFS-MSM). This succession outperforms the best original algorithm, i.e., SM, by up to 1.25 times in recall and is comparable in terms of precision.
References (1/2)
[ACG01] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S., "Searching the Web," ACM Transactions on Internet Technology (TOIT), Vol. 1, No. 1, pp. 2-43, Aug. 2001.
[BP98] Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine," In Proc. 7th Int'l Conf. on World Wide Web (WWW), pp. 107-117, Brisbane, Australia, Apr. 1998.
[BCD08] Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., and Leonardi, S., "Link Analysis for Web Spam Detection," ACM Transactions on the Web (TWEB), Vol. 2, No. 1, pp. 1-42, Mar. 2008.
[CDB06] Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S., "A Reference Collection for Web Spam," SIGIR Forum, Vol. 40, No. 2, pp. 11-24, Dec. 2006.
[CDG07] Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F., "Know Your Neighbors: Web Spam Detection Using the Web Topology," In Proc. 30th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 423-430, Amsterdam, The Netherlands, July 2007.
[GG05] Gyongyi, Z., Berkhin, P., and Garcia-Molina, H., "Web Spam Taxonomy," In Proc. 1st Int'l Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 39-47, Chiba, Japan, May 2005.
[GBG06] Gyongyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J., "Link Spam Detection Based on Mass Estimation," In Proc. 32nd Int'l Conf. on Very Large Data Bases (VLDB), pp. 439-450, Seoul, Korea, Sept. 2006.
[GGP04] Gyongyi, Z., Garcia-Molina, H., and Pedersen, J., "Combating Web Spam with TrustRank," In Proc. 30th Int'l Conf. on Very Large Data Bases (VLDB), pp. 576-587, Toronto, Canada, Aug. 2004.
References (2/2)
[KR06] Krishnan, V. and Raj, R., "Web Spam Detection with Anti-TrustRank," In Proc. 2nd Int'l Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 37-40, Washington, USA, Aug. 2006.
[WD05] Wu, B. and Davison, B., "Identifying Link Farm Spam Pages," In Proc. Special Interest Tracks and Posters of the 14th Int'l Conf. on World Wide Web (WWW), pp. 820-829, Chiba, Japan, May 2005.
[SM86] Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, 1986.
[YUT08] Yoshida, Y., Ueda, T., Tashiro, T., Hirate, Y., and Yamana, H., "What's Going on in Search Engine Rankings," In Proc. 22nd Int'l Conf. on Advanced Information Networking and Applications (AINAW), pp. 1199-1204, Okinawa, Japan, Mar. 2008.
THANK YOU VERY MUCH!
Supplement
MTR Algorithm
Input:
A seed set of non-spam domains N
A seed set of spam domains S
The threshold cutoff
The difference threshold ε
Web graph G = (V, E)
Output:
A set of non-spam domains ON
Trust score vector of all domains
Algorithm:
1. for each d ∈ V
2.   if d ∈ N then
3.     T0[d] = 1 / size(N)
4.   else
5.     T0[d] = 0
6. i = 0
7. do
8.   for each d ∈ V
9.     for each (d, q) ∈ E
10.      if q ∉ S then
11.        Ti+1[q] = Ti+1[q] + damp × Ti[d] / Noutlink(d)
12.  for each d ∈ V
13.    Ti+1[d] = Ti+1[d] + (1 − damp) × T0[d]
14.  Δ = | Ti+1 − Ti |
15.  i = i + 1
16. until Δ < ε
17. Tordered = order Ti+1 by trust scores in descending order
18. ON = the highest-trust-score domains within cutoff
19. return ON, Tordered
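The iterative core of MTR can be transliterated into runnable Python; a minimal sketch that omits the cutoff/ordering step and returns only the trust vector (variable names are illustrative):

```python
# Modified TrustRank sketch: standard TrustRank propagation, except
# that trust is never pushed into a domain in the spam seed set S.
def modified_trustrank(V, E, N, S, damp=0.85, eps=1e-8, max_iters=100):
    outlinks = {d: [] for d in V}
    for d, q in E:
        outlinks[d].append(q)
    t0 = {d: (1.0 / len(N) if d in N else 0.0) for d in V}
    t = dict(t0)
    for _ in range(max_iters):
        nxt = {d: (1 - damp) * t0[d] for d in V}
        for d in V:
            for q in outlinks[d]:
                if q not in S:  # the modification: block flow into spam seeds
                    nxt[q] += damp * t[d] / len(outlinks[d])
        delta = sum(abs(nxt[d] - t[d]) for d in V)
        t = nxt
        if delta < eps:
            break
    return t

# Trust from seed 1 reaches domain 2 but never leaks into spam seed 9.
t = modified_trustrank({1, 2, 9}, [(1, 2), (1, 9)], N={1}, S={9})
```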
MATR Algorithm
Input:
A seed set of non-spam domains N
A seed set of spam domains S
The threshold cutoff
The difference threshold ε
Web graph G = (V, E)
Output:
A set of spam domains OS
Anti-trust score vector of all domains
Algorithm:
1. for each d ∈ V
2.   if d ∈ S then
3.     AT0[d] = 1 / size(S)
4.   else
5.     AT0[d] = 0
6. i = 0
7. do
8.   for each d ∈ V
9.     for each (q, d) ∈ E
10.      if q ∉ N then
11.        ATi+1[q] = ATi+1[q] + damp × ATi[d] / Ninlink(d)
12.  for each d ∈ V
13.    ATi+1[d] = ATi+1[d] + (1 − damp) × AT0[d]
14.  Δ = | ATi+1 − ATi |
15.  i = i + 1
16. until Δ < ε
17. ATordered = order ATi+1 by anti-trust scores in descending order
18. OS = the highest-anti-trust-score domains within cutoff
19. return OS, ATordered
MSM Algorithm
Input:
A seed set of non-spam domains N
A seed set of spam domains S
The threshold cutoff
The threshold topPR
The threshold relativeMass
The difference threshold ε
Web graph G = (V, E)
Output:
A set of spam domains OS
Algorithm:
1. ON, T = Modified TrustRank(N, S, cutoff, ε, G)
2. P = PageRank(ε, G)
3. for each d ∈ V
4.   if P[d] is within the top topPR% of PageRank scores then
5.     if (P[d] − T[d]) / P[d] ≥ relativeMass then
6.       OS ← OS ⋃ {d}
7. return OS
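The MSM decision loop can be sketched in Python, assuming the trust vector T (from Modified TrustRank) and the PageRank vector P are already computed; for simplicity this sketch takes a precomputed score cutoff in place of the topPR percentile (names are illustrative):

```python
# MSM decision loop: a high-PageRank domain is flagged as spam when
# its relative spam mass (P[d] - T[d]) / P[d] meets relativeMass.
def msm(P, T, topPR_score, relativeMass):
    OS = set()
    for d, p in P.items():
        if p >= topPR_score:  # candidate: sufficiently high PageRank
            if (p - T.get(d, 0.0)) / p >= relativeMass:
                OS.add(d)
    return OS

# "a" earns most of its PageRank as trust, so it survives;
# "b" has no trust, so it is flagged; "c" is below the cutoff.
spam = msm({"a": 0.5, "b": 0.5, "c": 0.1},
           {"a": 0.45, "b": 0.0},
           topPR_score=0.3, relativeMass=0.9)
```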
MLFS Algorithm
Input:
A seed set of non-spam domains N
A seed set of spam domains S
The threshold limitBL
The threshold limitOL
Web graph G = (V, E)
Output:
A set of spam domains OS
Algorithm:
1. OS ← S
2. for each d ∈ V
3.   if d ∉ N then
4.     I = inDomain(d) − N − {d}
5.     O = outDomain(d) − N − {d}
6.     if size(I ∩ O) ≥ limitBL
7.       OS ← OS ⋃ {d}
8. do
9.   Oold ← OS
10.  for each d ∈ V
11.    if d ∉ N then
12.      O = outDomain(d) ∩ OS
13.      if size(O) ≥ limitOL
14.        OS ← OS ⋃ {d}
15. until size(OS) = size(Oold)
16. return OS
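The MLFS pseudocode can be transliterated into runnable Python; a minimal sketch where inDomain/outDomain are derived from E and the loop repeats until no new domain is added (names are illustrative):

```python
# Modified Link Farm Spam sketch: flag domains with many bidirectional
# links (limitBL), excluding the non-spam seeds N, then iteratively
# flag domains with many outlinks into the spam set (limitOL).
def mlfs(V, E, N, S, limitBL, limitOL):
    ins = {d: set() for d in V}
    outs = {d: set() for d in V}
    for d, q in E:
        outs[d].add(q)
        ins[q].add(d)
    OS = set(S)
    for d in V:
        if d not in N:
            I = ins[d] - N - {d}
            O = outs[d] - N - {d}
            if len(I & O) >= limitBL:   # many bidirectional links
                OS.add(d)
    while True:
        before = len(OS)
        for d in V:
            if d not in N and len(outs[d] & OS) >= limitOL:
                OS.add(d)               # points to many spam domains
        if len(OS) == before:
            break                       # no new spam domains found
    return OS

# Domain 3 is flagged for its two bidirectional links (with 1 and 4);
# domain 5 is then flagged for pointing to two spam domains (3 and 8).
OS = mlfs({1, 3, 4, 5, 8},
          [(1, 3), (3, 1), (3, 4), (4, 3), (5, 3), (5, 8)],
          N=set(), S={8}, limitBL=2, limitOL=2)
```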
TR vs. MTR
Panels (a)-(f) show the comparison results for RatioTop = 10%, 50%, and 100%.
ATR vs. MATR
Panels (a)-(f) show the comparison results for RatioTop = 10%, 50%, and 100%.
SM vs. MSM
Panels (a)-(f) show the comparison results for topPR = 70%, 85%, and 100%.
LFS vs. MLFS
Panels (a) and (b) show the comparison results.
The Best Succession for the Spam Detector
MSM performs better than the rest due to its minimization of false positives while remaining almost comparable to the best in terms of true positives.
Fig x: Comparison for the spam detector.
The winner is MSM for the Spam Detector.
Comparison among the Best Succession, the Best Known Algorithm, and the Best Modified Algorithm
MATR-MTR-MSM performs better than both SM and MSM: it finds more true positives than these two algorithms with comparable false positives.
MATR-MTR-MSM is very effective compared to the best known algorithm.
Possible Combinations for the Seed Refinement Module
Succession 1 (MATR-MTR): manually labeled spam and non-spam seed domains → MATR → manually labeled non-spam domains and refined spam domains → MTR → refined spam and non-spam seed domains.
Succession 2 (MTR-MATR): manually labeled spam and non-spam seed domains → MTR → manually labeled spam domains and refined non-spam domains → MATR → refined spam and non-spam seed domains.
Possible Combinations for the Spam Detection Module
Combinations of algorithms: MLFS-MSM and MSM-MLFS; single algorithms: MLFS and MSM.
Succession 1 (MLFS-MSM): refined spam/non-spam seed domains → MLFS → spam domains and refined non-spam domains → MSM → detected spam domains.
Succession 2 (MSM-MLFS): refined spam/non-spam seed domains → MSM → spam domains and refined non-spam domains → MLFS → detected spam domains.
TR and ATR Problem
This supplement repeats the examples of Figs. 7 and 8. In TR, trust scores leak from the trusted domains into the spam domains 5 and 6 (t(5) = 5/12 + …, t(6) = 5/12 + …). In ATR, anti-trust scores leak into the non-spam domains 5, 6, and 7 (at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …).