trustworthiness assessment (on web pages)

16/03/22 Jean-Eudes Ranvier Planet Data - Madrid Trustworthiness assessment (on web pages) Task 3.3


19/04/23 Jean-Eudes Ranvier, Planet Data - Madrid

Trustworthiness assessment (on web pages)

Task 3.3

19/04/23 Planet Data - Madrid 2

Credibility assessment on web pages

Introduction

• The number of available data sources keeps increasing at a fast pace
  • Sensors embedded in mobile phones, websites, blogs, …
• Data becomes more valuable when combined from different sources
• What about the trustworthiness of this aggregated data?
  • Unknown data sources
  • No standard way to evaluate trustworthiness
  • Subjectivity of the consumer of the data
  • Important economic incentive to lie
• Interesting case of the WWW
  • Web credibility assessment


What is the problem of web credibility?

• Non-credible websites represent a significant share of the web
• Credibility is seen as an aggregation of objective and subjective components (Fogg)
• Credibility = trustworthiness AND expertise
• Web users can be naïve or lazy and won't try to verify information
• Focus on domains where expertise is hard for ordinary users to evaluate
  • Medical treatments
  • Trading operations
  • Ideological assertions
• Economic and political interests are at stake


Background

• Trustworthiness components in the context of web credibility:
  • Y. Yamamoto and K. Tanaka. Enhancing credibility judgment of web search results.
    • Accuracy: referential importance
    • Authority: social reputation
    • Objectivity: content typicality
    • Currency: update frequency
    • Coverage: coverage of topic
  • M. J. Metzger. Making sense of credibility on the web: Models for evaluating online information and recommendations for future research.
    • Credentials
    • Advertisements
    • Design



Credibility assessment as a classification problem

• Use historical information on evaluations for future credibility assessment
• A machine learning approach
  • Binary classification
    • Users evaluate pages as credible or non-credible
  • Content-based features
    • Extracted programmatically from web pages
  • Training set and test set
    • Leave-one-out cross-validation
    • Tested by category
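The evaluation protocol on this slide (leave-one-out cross-validation, tested by category) can be sketched as follows. This is a minimal illustration, not the project's code: the page dictionary layout, the `train_and_predict` callback, and the `majority_label` stand-in classifier are all assumptions made for the example.

```python
from collections import defaultdict


def leave_one_out_by_category(pages, train_and_predict):
    """Run leave-one-out cross-validation separately inside each category.

    pages: list of dicts with keys 'category', 'features', 'label' (+1 or -1).
    train_and_predict: hypothetical callback (training_pages, features) -> label.
    Returns a dict mapping each category to its accuracy.
    """
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)

    accuracy = {}
    for category, group in by_category.items():
        correct = 0
        for k, held_out in enumerate(group):
            # Train on every page of the category except the held-out one.
            training = group[:k] + group[k + 1:]
            predicted = train_and_predict(training, held_out["features"])
            correct += int(predicted == held_out["label"])
        accuracy[category] = correct / len(group)
    return accuracy


def majority_label(training, _features):
    """Stand-in classifier (assumption): predict the majority training label."""
    total = sum(page["label"] for page in training)
    return 1 if total >= 0 else -1


# Made-up toy data, only to show the call shape.
pages = [
    {"category": "health", "features": {}, "label": 1},
    {"category": "health", "features": {}, "label": 1},
    {"category": "health", "features": {}, "label": -1},
    {"category": "politics", "features": {}, "label": -1},
    {"category": "politics", "features": {}, "label": -1},
]
result = leave_one_out_by_category(pages, majority_label)
```

Any classifier with the same call shape could replace `majority_label`; grouping by category first mirrors the "tested by category" bullet.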


Feature selection

• Categories
  • Act as a filter: only pages from the same category are tested for similarity
• Keywords and entities in the document
  • Reflect the topic of the web page at a finer grain
• Sentiment analysis
  • Computed at the word level
  • Used in conjunction with keywords & entities
• Part of speech
  • Extra feature reflecting the overall structure of the web page
• Number of ads displayed (in progress)
  • They distract users from their activity, and the page loses credibility
• Complexity of the CSS files (not included yet)
  • Pages with no structure tend to lose credibility
• PageRank
  • Google's metric, which includes a credibility measure


Experimental setup

• Two machine learning algorithms
  • kNN item-item algorithm
    • Compute a similarity between pages
    • Take into account only the most similar pages
  • C4.5 decision tree
    • Has good performance in general
    • However, not suitable for multivalued features (keywords, entities)
    • Defined as a baseline
• Microsoft corpus
  • 1000 pages evaluated for credibility by experts and regular users
  • Divided into 5 topics
    • Top 40 pages retrieved by search engines for 5 queries
  • Rescaled from Likert scale [0;5] to binary scale {-1;1}
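The rescaling from the Likert scale to the binary scale could look like the sketch below. The slide does not state where the cut-off lies, so the midpoint threshold of 2.5 is an assumption, as is the function name `likert_to_binary`.

```python
def likert_to_binary(rating, threshold=2.5):
    """Map a Likert rating in [0, 5] to the binary scale {-1, 1}.

    The default threshold (scale midpoint) is an assumption; the slide
    only says that ratings were rescaled, not how.
    """
    if not 0 <= rating <= 5:
        raise ValueError("rating must lie in [0, 5]")
    return 1 if rating > threshold else -1


# Made-up ratings, only to show the mapping.
labels = [likert_to_binary(r) for r in (0, 1, 4, 5)]
```

Moving the threshold changes the balance between the credible and non-credible classes, which matters because the results are reported as biased by the ratings distribution.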


Content-based rating

• kNN item-item algorithm

• Based on similarity between pages rated by the user

• Aggregated similarities

• Based on the similarity of page features

• Cosine similarity for monovalued features (POS, PageRank, …)

• Jaccard similarity for multivalued features (keywords, entities)

• Only positive similarities are taken into account

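$$
\hat{r}_{u,i} = \frac{\sum_{j \in \mathrm{similarItems}} s_{i,j} \, r_{u,j}}{\sum_{j \in \mathrm{similarItems}} s_{i,j}}
$$

where $r_{u,j}$ is the rating given by user $u$ to page $j$ and $s_{i,j}$ is the similarity between pages $i$ and $j$.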
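The item-item scheme on this slide can be sketched in a few lines: similarities between the target page and the pages the user has already rated act as weights on those ratings, and only positive similarities among the k most similar pages are kept. The page layout (`numeric`, `keywords`), the equal-weight averaging in `page_similarity`, and the default `k=3` are assumptions made for the example, not details taken from the slides.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Jaccard similarity between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def page_similarity(p, q):
    """Aggregate similarity; the plain average is an assumption."""
    return 0.5 * cosine_similarity(p["numeric"], q["numeric"]) + \
        0.5 * jaccard_similarity(p["keywords"], q["keywords"])


def predict_rating(target, rated_pages, k=3):
    """Similarity-weighted average of the user's ratings on the k nearest pages.

    rated_pages: list of (page, rating) pairs with rating in {-1, 1}.
    Returns 0.0 when no rated page has a positive similarity.
    """
    scored = [(page_similarity(target, page), rating) for page, rating in rated_pages]
    # Keep only positive similarities, then the k largest.
    neighbours = sorted((s for s in scored if s[0] > 0),
                        key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0.0
    return sum(sim * rating for sim, rating in neighbours) / total


# Made-up pages, only to show the call shape.
target = {"numeric": [1.0, 0.0], "keywords": {"vaccine", "trial"}}
rated = [
    ({"numeric": [1.0, 0.0], "keywords": {"vaccine", "trial"}}, 1),
    ({"numeric": [0.0, 1.0], "keywords": {"miracle", "cure"}}, -1),
]
prediction = predict_rating(target, rated)
```

The sign of the returned value can then be read as the binary credible / non-credible prediction.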


Evaluation

Preliminary results


Results

• Mixed results
  • Precision ~ 0.7, recall ~ 0.8
  • Credibility cannot be predicted accurately
  • Biased by the distribution of ratings over classes
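To see why a skewed ratings distribution can inflate these figures, consider the sketch below. It computes precision and recall for a trivial classifier that labels every page as credible. The 70/30 class split is made up for illustration and is not the distribution of the corpus.

```python
def precision_recall(true_labels, predicted_labels, positive=1):
    """Precision and recall for the positive (credible) class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced sample: 7 credible pages, 3 non-credible pages.
true_labels = [1] * 7 + [-1] * 3
always_credible = [1] * 10

precision, recall = precision_recall(true_labels, always_credible)
```

On this toy sample the trivial classifier reaches a precision of 0.7 and a recall of 1.0 without looking at any feature, so scores in this range need to be read against the class distribution.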


Results

[Figure: bar chart of kNN precision, kNN recall, ML precision, and ML recall (y-axis from 0 to 1) for the categories celebrities, environment, health, personal finance, and politics]

• Tests on keywords + entities + sentiment
  • Similar results (Precision ~ 0.7, Recall ~ 0.8)


Results

[Figure: bar chart of kNN precision, kNN recall, ML precision, and ML recall (y-axis from 0 to 1) for the categories celebrities, environment, health, personal finance, and politics]

Mixed results among classes

• Tests on all features (POS + keywords + entities + sentiment)
  • Similar results (Precision ~ 0.7, Recall ~ 0.8)


Future work

• Semantic distances
  • Pages seen as sets of concepts
  • Definition of a distance between two sets in the concept space
    • Similarity using a path distance in a concept hierarchy
• Social referrals
  • Use evaluations from other people
  • Weights based on their trustworthiness
  • Estimate page credibility based on beta reputation
• Combine reputation with classification approaches into an aggregated metric
  • To get a better estimate of credibility than either component gives separately
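A beta-reputation estimate of page credibility could be sketched as below. The first function is the standard expected value of a Beta distribution after observing positive and negative evaluations, (r + 1) / (r + s + 2). Weighting each evaluation by the rater's trustworthiness is one possible reading of the "weights" bullet and is an assumption, as are the function names.

```python
def beta_reputation(positive, negative):
    """Expected value of Beta(positive + 1, negative + 1).

    With no evidence at all the estimate is the neutral value 0.5.
    """
    return (positive + 1) / (positive + negative + 2)


def weighted_beta_reputation(evaluations):
    """Beta reputation where each evaluation counts with the rater's weight.

    evaluations: list of (is_credible, rater_weight) pairs, with
    rater_weight in [0, 1]. The weighting scheme is an assumption.
    """
    positive = sum(weight for credible, weight in evaluations if credible)
    negative = sum(weight for credible, weight in evaluations if not credible)
    return beta_reputation(positive, negative)


# Made-up evaluations: two raters say credible, one says non-credible.
score = weighted_beta_reputation([(True, 1.0), (True, 0.5), (False, 0.5)])
```

A score above 0.5 leans towards credible; such a value could then be combined with the classifier output to form the aggregated metric mentioned on this slide.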


Conclusion

• Project based on content-based aspects
• Results promising, although there is room for improvement
  • Accuracy of the prediction
  • Time complexity of the implementation
• Several features remain unimplemented
  • Local extraction of features
  • Integration of new page features
  • Semantic aspect of web pages
