(presentation chris) crowdsourcing & semantic web: dagstuhl 2014

How to Measure Quality with Disagreement?

or the Three Sides of CrowdTruth

Lora Aroyo & Chris Welty

CrowdTruth

Annotator disagreement is signal, not noise.

It is indicative of the variation in human semantic interpretation of signs.

It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.

CrowdTruth Dependencies

worker metrics for detecting spam → quality of sentences → quality of the target semantics

Worker quality metrics can improve significantly when the quality of these other aspects of semantic interpretation is considered.

The Three Sides of CrowdTruth

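Representation: Worker Vector

[Figure: one worker's vector for a sentence, with a 1 under each relation that worker selected]

Representation: Sentence Vector

[Figure: the individual worker vectors shown as rows of 1s above the resulting sentence vector]

0 1 1 0 0 4 3 0 0 5 1 0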
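The representation above can be sketched in a few lines. This is a minimal illustration, assuming each worker vector is binary and the sentence vector is the element-wise sum of the worker vectors for that sentence. The relation list, its ordering, and the function names are hypothetical and chosen only for the example.

```python
# Hypothetical relation set and ordering, for illustration only.
RELATIONS = ["diagnose", "location", "symptom", "cause", "part_of", "other"]

def worker_vector(selected, relations=RELATIONS):
    """Binary vector: 1 where the worker picked the relation, else 0."""
    chosen = set(selected)
    return [1 if r in chosen else 0 for r in relations]

def sentence_vector(worker_vectors):
    """Element-wise sum of all worker vectors for one sentence."""
    return [sum(column) for column in zip(*worker_vectors)]

# Three made-up workers annotating the same sentence.
workers = [
    worker_vector(["symptom", "cause"]),
    worker_vector(["symptom"]),
    worker_vector(["symptom", "other"]),
]
sent = sentence_vector(workers)
print(sent)
```

Each position of the sentence vector then counts how many workers chose that relation for the sentence.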

Feeling the way the CHEST expands (PALPATION), can identify areas of the lung that are full of fluid.

Is CHEST related to PALPATION?

Relation labels: diagnose, location, associated with, is_a, part_of, other

0 0 02 3 0 0 0 1 0 0 44 1

Disagreement for Sentence Clarity

Unclear relationship between the two arguments reflected in the disagreement

Is HYPERAEMIA related to CONJUNCTIVITIS?

0 0 0 1 0 0 0 013 0 0 0 0 0

Relation labels: symptom, cause

Redness (HYPERAEMIA), irritation (chemosis) and watering (epiphora) of the eyes are symptoms common to all forms of CONJUNCTIVITIS.

Disagreement for Sentence Clarity

Clearly expressed relation between the two arguments reflected in the agreement

Sentence-Relation Score

Measures how clearly a sentence expresses a relation

Sentence Vector: 0 1 1 0 0 4 3 0 0 5 1 0

Cosine of the sentence vector with the unit vector for relation R6 = .55
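The .55 on this slide can be checked by hand. The worked equation below assumes that R6 is the sixth component of the sentence vector (counting from 1) and that e_6 denotes its unit vector. Both the indexing and the symbol names are assumptions made for this illustration.

```latex
\[
s = (0,1,1,0,0,4,3,0,0,5,1,0), \qquad e_6 = (0,0,0,0,0,1,0,0,0,0,0,0)
\]
\[
\cos(s, e_6) = \frac{s \cdot e_6}{\lVert s \rVert \, \lVert e_6 \rVert}
= \frac{4}{\sqrt{1^2 + 1^2 + 4^2 + 3^2 + 5^2 + 1^2}}
= \frac{4}{\sqrt{53}} \approx 0.55
\]
```

Because e_6 has a single non-zero entry, the score is simply the count for R6 divided by the length of the sentence vector.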

Worker Disagreement

Measured per worker

Worker-sentence disagreement

[Figure: the worker's sentence vector compared with the sentence vector]

Sentence Vector: 0 1 1 0 0 4 3 0 0 5 1 0

AVG (Cosine)

Worker Metrics

•  How much a worker disagrees with the crowd per sentence → the average of all cosine distances between each worker's sentence vector and the full sentence vector (minus that worker)

•  Are there consistently like-minded workers? → a pairwise metric, averaged for a particular worker → there may be communities of thought that consistently disagree with others but agree among themselves

•  Low-quality workers generally have high scores in both

•  Average relations per sentence → per worker, the number of relations he/she chooses per sentence, averaged over all sentences he/she annotates. A high score here can help indicate low-quality workers.
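Two of the worker metrics described above can be sketched as follows. This is an illustration under stated assumptions, not the CrowdTruth implementation: cosine distance is taken as 1 minus cosine similarity, annotations are stored as a nested dict (a hypothetical layout), and sentences with no other workers are skipped. All function names and the toy data are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 for an all-zero vector (a convention chosen here)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def worker_sentence_disagreement(worker_id, annotations):
    """Average cosine distance between the worker's vector and the
    sentence vector built from all *other* workers, over the worker's sentences.
    annotations: {sentence_id: {worker_id: binary relation vector}}"""
    distances = []
    for by_worker in annotations.values():
        others = [v for w, v in by_worker.items() if w != worker_id]
        if worker_id not in by_worker or not others:
            continue
        rest = [sum(column) for column in zip(*others)]  # sentence vector minus this worker
        distances.append(1.0 - cosine(by_worker[worker_id], rest))
    return sum(distances) / len(distances) if distances else 0.0

def avg_relations_per_sentence(worker_id, annotations):
    """Mean number of relations the worker selected per annotated sentence."""
    counts = [sum(bw[worker_id]) for bw in annotations.values() if worker_id in bw]
    return sum(counts) / len(counts) if counts else 0.0

# Toy data: w3 selects every relation on every sentence.
annotations = {
    "s1": {"w1": [0, 1, 0], "w2": [0, 1, 0], "w3": [1, 1, 1]},
    "s2": {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [1, 1, 1]},
}
for w in ("w1", "w3"):
    print(w, round(worker_sentence_disagreement(w, annotations), 3),
          avg_relations_per_sentence(w, annotations))
```

In this toy data the worker who ticks every relation scores higher on both metrics, which matches the pattern the slide describes for low-quality workers.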

Sentence Metrics

•  Sentence-relation score → the core CrowdTruth metric for relation extraction → measured for each relation on each sentence as the cosine of the unit vector for the relation with the sentence vector, indicating whether a relation is clearly or vaguely expressed

•  Sentence clarity → defined for each sentence as the max relation score for that sentence, indicating a clear, ambiguous, or confusing sentence
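A minimal sketch of the two sentence metrics, using the 12-element sentence vector from the Sentence-Relation Score slide. The function names are hypothetical, and treating R6 as index 5 (zero-based) is an assumption. Since a relation's unit vector has a single 1, its cosine with the sentence vector reduces to that relation's count divided by the vector's Euclidean norm.

```python
import math

def sentence_relation_scores(sentence_vector):
    """Cosine of the sentence vector with each relation's unit vector."""
    norm = math.sqrt(sum(c * c for c in sentence_vector))
    if norm == 0:
        return [0.0] * len(sentence_vector)  # no annotations at all
    return [c / norm for c in sentence_vector]

def sentence_clarity(sentence_vector):
    """Max sentence-relation score over all relations."""
    return max(sentence_relation_scores(sentence_vector))

vec = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
scores = sentence_relation_scores(vec)
print(round(scores[5], 2))               # score for R6: 0.55
print(round(sentence_clarity(vec), 2))   # clarity: 0.69
```

The clarity of this sentence comes from the relation with 5 votes rather than from R6, so a sentence can score moderately on one relation while its clarity is set by another.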

Relation Metrics

•  Relation similarity → the causal power (pairwise conditional probability). A high similarity score indicates the relations are confusable to workers.

•  Relation ambiguity → defined for each relation as the max relation similarity for the relation. If a relation is clear, it will have a low score.

•  Relation clarity → defined for each relation as the max sentence-relation score for the relation over all sentences. A high clarity score means that it is at least possible to express the relation clearly.

•  Relation frequency → the number of sentences in which the relation is annotated at least once
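The four relation metrics can be sketched over a list of sentence vectors. This is a simplified illustration with loudly stated assumptions: similarity is implemented as a plain conditional probability counted over sentences (the share of sentences annotated with R_i that are also annotated with R_j), which may differ from the exact causal-power formula the authors use. The function names and toy data are hypothetical.

```python
import math

def relation_scores(vec):
    """Sentence-relation scores: each count divided by the vector norm."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm if norm else 0.0 for c in vec]

def relation_similarity(sentence_vectors, i, j):
    """Assumed form: P(R_j annotated | R_i annotated), counted over sentences."""
    with_i = [v for v in sentence_vectors if v[i] > 0]
    if not with_i:
        return 0.0
    return sum(1 for v in with_i if v[j] > 0) / len(with_i)

def relation_ambiguity(sentence_vectors, i):
    """Max similarity of relation i to any other relation."""
    n = len(sentence_vectors[0])
    return max((relation_similarity(sentence_vectors, i, j)
                for j in range(n) if j != i), default=0.0)

def relation_clarity(sentence_vectors, i):
    """Max sentence-relation score for relation i over all sentences."""
    return max(relation_scores(v)[i] for v in sentence_vectors)

def relation_frequency(sentence_vectors, i):
    """Number of sentences in which relation i is annotated at least once."""
    return sum(1 for v in sentence_vectors if v[i] > 0)

# Toy sentence vectors over three relations (made-up counts).
sentences = [[5, 4, 0], [3, 3, 0], [0, 0, 6], [1, 0, 5]]
for i in range(3):
    print(i, relation_frequency(sentences, i),
          round(relation_ambiguity(sentences, i), 2),
          round(relation_clarity(sentences, i), 2))
```

In the toy data, relations 0 and 1 co-occur often and so come out as ambiguous, while relation 2 reaches a clarity of 1.0 because one sentence expresses it with full agreement.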

Impact of Dependencies

Impact of Sentence Quality on Worker Quality

(a) The space with no filtering of sentences or relations: a single line cannot separate the spammers from the non-spammers.

(b) The space after sentence filtering; (c) after relation filtering; (d) after both sentence and relation filtering. Sentence filtering makes the classes linearly separable, and the separation between the classes increases in the subsequent figures.

Impact of Relation Quality on Worker Quality

(a) The space with no filtering of sentences or relations: a single line cannot separate the spammers from the non-spammers.

(c) The space after relation filtering.

Relation filtering defines the space much more clearly, with a large separation between positive and negative instances. The pairwise improvements to the worker scores are significant with p < .001, which is better than the improvements from sentence clarity filtering.

Combining Sentence & Relation Filtering

•  first filtering out low clarity sentences

•  then filtering vague and ambiguous relations

•  worker metrics were computed on these new sentences and vectors

•  this proves to separate the space even further, and the pairwise improvement in worker scores from the baseline (unfiltered) is significant with p < .0005

•  the improvement over sentence filtering alone is also significant (p < .01)

•  the improvement over relation filtering alone is only significant with p < .05
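The filtering order in the list above can be sketched as a small preprocessing step. This is only an illustration: the clarity threshold, the set of dropped relation indices, and the data layout are all hypothetical, since the slides do not state the cut-offs that were used. The output is the reduced annotation set on which worker metrics would then be recomputed.

```python
import math

def sentence_vector(by_worker):
    """Element-wise sum of the worker vectors for one sentence."""
    return [sum(column) for column in zip(*by_worker.values())]

def clarity(vec):
    """Max sentence-relation score (largest count divided by the vector norm)."""
    norm = math.sqrt(sum(c * c for c in vec))
    return max(vec) / norm if norm else 0.0

def filter_annotations(annotations, min_clarity, dropped_relations):
    """Step 1: drop low-clarity sentences. Step 2: drop vague/ambiguous
    relation columns from every remaining worker vector."""
    kept = {}
    for sid, by_worker in annotations.items():
        if clarity(sentence_vector(by_worker)) < min_clarity:
            continue  # sentence too unclear to judge workers on
        kept[sid] = {
            w: [c for i, c in enumerate(v) if i not in dropped_relations]
            for w, v in by_worker.items()
        }
    return kept

# Toy data; threshold 0.8 and dropped relation index 2 are made-up values.
annotations = {
    "clear":   {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [1, 0, 1]},
    "unclear": {"w1": [1, 0, 0], "w2": [0, 1, 0], "w3": [0, 0, 1]},
}
filtered = filter_annotations(annotations, min_clarity=0.8, dropped_relations={2})
print(filtered)
```

Keeping the two filters in one pass mirrors the order on the slide: sentence clarity is judged on the full vectors before any relation columns are removed.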

Quality measures in semantic interpretation tasks are inter-dependent

•  higher accuracy can be achieved by considering the impact of sentence quality & relation quality on worker quality measurements

•  significant improvement in worker quality metrics with respect to known spammers by incorporating the quality of the individual sentences & target relations

•  relationships between the different corners of the triangle of reference, e.g. the impact of relation & worker quality on sentence measures, and the impact of worker & sentence quality on relation measures

crowdtruth.org
