Trustworthiness assessment (on web pages)
Task 3.3
19/04/23 Planet Data - Madrid
Credibility assessment on web pages
Introduction
• The number of available data sources keeps increasing at a fast pace
  • Sensors embedded in mobile phones, websites, blogs, …
• Data becomes more valuable when combined from different sources
• What about the trustworthiness of this aggregated data?
  • Unknown data sources
  • No standard way to evaluate trustworthiness
  • Subjectivity of the consumer of the data
  • Strong economic incentive to lie
• The WWW is an interesting case
  • Web credibility assessment
What is the problem of web credibility?
• Non-credible websites represent a large percentage of the web
• Credibility is seen as an aggregation of objective and subjective components (Fogg)
  • Credibility = trustworthiness AND expertise
• Web users can be naïve or lazy and won’t try to verify information
• Focus on domains where expertise is hard to evaluate for ordinary users
  • Medical treatments
  • Trading operations
  • Ideological assertions
• Economic / political interests are at stake
Background
• Trustworthiness components in the context of web credibility:
  • Y. Yamamoto and K. Tanaka. Enhancing credibility judgment of web search results.
    • Accuracy: referential importance
    • Authority: social reputation
    • Objectivity: content typicality
    • Currency: update frequency
    • Coverage: coverage of topic
  • M. J. Metzger. Making sense of credibility on the web: Models for evaluating online information and recommendations for future research.
    • Credentials
    • Advertisements
    • Design
Credibility assessment as a classification problem
• Use historical information on evaluations for future credibility assessment
• A machine learning approach
  • Binary classification
    • Users evaluate pages as credible or non-credible
  • Content-based features
    • Extracted programmatically from web pages
  • Training set and test set
    • Leave-one-out cross-validation
    • Tested by category
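A minimal sketch of leave-one-out cross-validation run separately per category, as the bullets above describe. The page layout (`category`, `features`, `label` keys) and the `majority_label` stand-in classifier are assumptions made for illustration; they are not taken from the project.

```python
from collections import defaultdict


def leave_one_out_by_category(pages, predict):
    """Run leave-one-out cross-validation inside each category.

    pages: list of dicts with 'category', 'features' and 'label' (+1 or -1).
    predict(train, features) -> +1 or -1.
    Returns {category: (precision, recall)} for the credible class (+1).
    """
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)

    scores = {}
    for category, group in by_category.items():
        tp = fp = fn = 0
        for i, held_out in enumerate(group):
            # Train on every page of the category except the held-out one.
            train = group[:i] + group[i + 1:]
            guess = predict(train, held_out["features"])
            if guess == 1 and held_out["label"] == 1:
                tp += 1
            elif guess == 1 and held_out["label"] == -1:
                fp += 1
            elif guess == -1 and held_out["label"] == 1:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[category] = (precision, recall)
    return scores


def majority_label(train, features):
    """Hypothetical stand-in classifier: predict the most common training label."""
    total = sum(page["label"] for page in train)
    return 1 if total >= 0 else -1
```

Any classifier with the same `predict(train, features)` shape could be plugged in instead of `majority_label`.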
Feature selection
• Categories
  • Act as a filter: only pages from the same category are tested for similarity
• Keywords and entities in the document
  • Reflect the topic of the web page at a finer grain
• Sentiment analysis
  • Computed at the word level
  • Used in conjunction with keywords and entities
• Part of speech
  • Extra feature reflecting the overall structure of the web page
• Number of ads displayed (in progress)
  • Ads distract users from their activity, and the page loses credibility
• Complexity of the CSS files (not included yet)
  • Pages with no structure tend to lose credibility
• PageRank
  • Google’s metric, which includes a credibility measure
Experimental setup
• Two machine learning algorithms
  • kNN item-item algorithm
    • Computes a similarity between pages
    • Takes into account only the most similar pages
  • C4.5 decision tree
    • Performs well in general
    • However, not suitable for multivalued features (keywords, entities)
    • Used as a baseline
• Microsoft corpus
  • 1000 pages evaluated for credibility by experts and regular users
  • Divided into 5 topics
    • Top 40 pages retrieved by search engines for 5 queries
  • Rescaled from Likert scale [0;5] to binary scale {-1;1}
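The rescaling from the Likert scale to the binary scale could look like the sketch below. The slides do not give the cut-off, so the `threshold` value of 3 is purely an assumption, as is the function name.

```python
def likert_to_binary(rating, threshold=3):
    """Map a Likert rating in [0, 5] to +1 (credible) or -1 (non-credible).

    `threshold` is an assumed cut-off; the slides do not state the one used.
    """
    if not 0 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return 1 if rating >= threshold else -1


# Hypothetical ratings, for illustration only.
ratings = [0, 2, 3, 5]
labels = [likert_to_binary(r) for r in ratings]
```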
Content-based rating
• kNN item-item algorithm
  • Based on similarity between pages rated by the user
• Aggregated similarities
  • Based on the similarity of page features
  • Cosine similarity for monovalued features (POS, PageRank, …)
  • Jaccard similarity for multivalued features (keywords, entities)
  • Only positive similarities are taken into account
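$$\hat{r}_{u,i} = \frac{\sum_{j \in \text{similarItems}} s_{i,j} \, r_{u,j}}{\sum_{j \in \text{similarItems}} s_{i,j}}$$
where $r_{u,j}$ is the rating given by user $u$ to page $j$ and $s_{i,j}$ is the aggregated similarity between pages $i$ and $j$.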
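The item-item prediction described on this slide can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the page layout (`numeric` vector, `keywords` set), the plain mean used to aggregate the per-feature similarities, and the default `k=3` are all hypothetical choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def page_similarity(p, q):
    """Aggregate the per-feature similarities (a plain mean is an assumption)."""
    parts = [
        cosine(p["numeric"], q["numeric"]),      # monovalued features
        jaccard(p["keywords"], q["keywords"]),   # multivalued features
    ]
    return sum(parts) / len(parts)


def predict_rating(target, rated_pages, k=3):
    """Similarity-weighted average of the user's ratings on the k nearest pages.

    rated_pages: list of (page, rating) pairs already rated by the user.
    Only strictly positive similarities are kept.
    """
    sims = [(page_similarity(target, page), rating) for page, rating in rated_pages]
    positive = [pair for pair in sims if pair[0] > 0]
    neighbours = sorted(positive, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbours)
    if weight == 0:
        return 0.0  # no similar page: no evidence either way
    return sum(sim * rating for sim, rating in neighbours) / weight
```

The sign of the returned value can then be read as the credible / non-credible decision on the {-1;1} scale.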
Results
• Mixed results
  • Precision ~ 0.7, recall ~ 0.8
  • Impossible to predict the credibility accurately
  • Biased by the distribution of ratings over classes
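To see how a skewed distribution of ratings can inflate these metrics, the sketch below computes precision and recall for a trivial predictor that labels every page credible. The sample is made up for illustration and does not reflect the actual corpus.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical skewed sample: 8 credible pages, 2 non-credible ones.
y_true = [1] * 8 + [-1] * 2
always_credible = [1] * 10
baseline = precision_recall(y_true, always_credible)
```

On such a sample the trivial predictor already scores well on both metrics, so classifier scores need to be compared against this kind of baseline.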
Results
[Figure: bar chart of kNN precision, kNN recall, ML precision and ML recall (scale 0 to 1) per topic: celebrities, environment, health, personal finance, politics]
• Tests on keywords + entities + sentiment
  • Similar results (Precision ~ 0.7, Recall ~ 0.8)
Results
[Figure: bar chart of kNN precision, kNN recall, ML precision and ML recall (scale 0 to 1) per topic: celebrities, environment, health, personal finance, politics]
Mixed results among classes
• Tests on all features (POS + keywords + entities + sentiment)
  • Similar results (Precision ~ 0.7, Recall ~ 0.8)
Future work
• Semantic distances
  • Pages seen as sets of concepts
  • Definition of a distance between two sets in the concept space
    • Similarity using a path distance in a concept hierarchy
• Social referrals
  • Use evaluations by other people
  • Weights based on their trustworthiness
  • Estimate page credibility based on beta reputation
• Combine reputation with classification approaches into an aggregated metric
  • To get a better estimate of credibility than either component alone
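As an illustration of the social referral idea, the sketch below scores a page with the expected value of a Beta distribution, (r + 1) / (r + s + 2), where r and s are the accumulated positive and negative evaluations. Weighting each vote by the rater's trustworthiness is one possible reading of the bullets above; the function name and input layout are hypothetical.

```python
def beta_reputation(evaluations):
    """Expected credibility of a page under a beta reputation model.

    evaluations: list of (vote, weight) pairs, with vote in {+1, -1} and
    weight in [0, 1] standing for the rater's trustworthiness (assumed).
    Returns the mean of Beta(r + 1, s + 1), a value in (0, 1).
    """
    r = sum(weight for vote, weight in evaluations if vote == 1)
    s = sum(weight for vote, weight in evaluations if vote == -1)
    return (r + 1) / (r + s + 2)


# Hypothetical votes: two trusted raters say credible, one untrusted says not.
score = beta_reputation([(1, 1.0), (1, 1.0), (-1, 0.0)])
```

With no evaluations the score is 0.5, which matches the intuition that nothing is known about the page yet.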
Conclusion
• Project based on content-based aspects
• Results are promising, although there is room for improvement
  • Accuracy of the prediction
  • Time complexity of the implementation
• Several features remain unimplemented
  • Local extraction of features
  • Integration of new page features
  • Semantic aspect of web pages