(presentation chris) crowdsourcing & semantic web: dagstuhl 2014
DESCRIPTION
How to Measure Quality with Disagreement? or the Three Sides of CrowdTruth
TRANSCRIPT
How to Measure Quality with Disagreement?
or the Three Sides of CrowdTruth
Lora Aroyo & Chris Welty
CrowdTruth
Annotator disagreement is signal, not noise.
It is indicative of the variation in human semantic interpretation of signs.
It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
CrowdTruth Dependencies
worker metrics for detecting spam → quality of sentences → quality of the target semantics
Worker quality metrics can improve significantly when the quality of these other aspects of semantic interpretation is considered.
The Three Sides of CrowdTruth
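Representation
Worker Vector
[Figure: one worker's annotation of a sentence, shown as a vector with a 1 under each relation the worker selected]
Representation
Sentence Vector
[Figure: the worker vectors for one sentence stacked as rows, with the per-relation totals in the bottom row]
Sentence vector: 0 1 1 0 0 4 3 0 0 5 1 0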
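The representation above can be sketched in a few lines. This is a minimal illustration, assuming (as the figure suggests) that the sentence vector is the element-wise sum of the worker vectors for that sentence. The worker ids and their annotations are invented for the example and are not data from the talk.

```python
# Hypothetical annotations for one sentence: each worker vector has a 1
# for every relation that worker selected (values invented for illustration).
worker_vectors = {
    "w1": [0, 1, 0, 0, 1, 0],
    "w2": [0, 1, 0, 0, 0, 0],
    "w3": [1, 0, 0, 0, 1, 0],
}


def sentence_vector(vectors):
    """Sum the worker vectors element-wise, giving one count per relation."""
    return [sum(column) for column in zip(*vectors)]


sv = sentence_vector(worker_vectors.values())
print(sv)  # [1, 2, 0, 0, 2, 0]
```

Each position in the result counts how many workers chose that relation, which is the form of the sentence vector shown on the slide.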
Disagreement for Sentence Clarity
Unclear relationship between the two arguments, reflected in the disagreement
Sentence: Feeling the way the CHEST expands (PALPATION), can identify areas of the lung that are full of fluid.
Question: Is CHEST related to PALPATION?
Relations shown: diagnose, location, associated with, is_a, other, part_of
Sentence vector: 0 0 0 2 3 0 0 0 1 0 0 4 4 1
Disagreement for Sentence Clarity
Clearly expressed relation between the two arguments, reflected in the agreement
Sentence: Redness (HYPERAEMIA), irritation (chemosis) and watering (epiphora) of the eyes are symptoms common to all forms of CONJUNCTIVITIS.
Question: Is HYPERAEMIA related to CONJUNCTIVITIS?
Relations shown: symptom, cause
Sentence vector: 0 0 0 1 0 0 0 0 13 0 0 0 0 0
Sentence-Relation Score
Measures how clearly a sentence expresses a relation
Sentence Vector: 0 1 1 0 0 4 3 0 0 5 1 0
Unit vector for relation R6
Cosine = .55
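The value on this slide can be reproduced with a short sketch. It assumes that R6 is the sixth position of the sentence vector (counting from 1) and that the unit vector has a 1 in that position and 0 elsewhere. The function names are illustrative and not part of any CrowdTruth code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def unit_vector(index, size):
    """Vector with a 1 at the given 0-based index and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(size)]


sentence_vec = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
r6 = unit_vector(5, len(sentence_vec))  # assumption: R6 is the 6th position
score = cosine(sentence_vec, r6)
print(round(score, 2))  # 0.55
```

The score is 4 divided by the length of the sentence vector (the square root of 53), so a relation chosen by most workers scores close to 1 and a rarely chosen one scores close to 0.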
Worker Disagreement
Measured per worker
Worker-sentence disagreement
Sentence Vector: 0 1 1 0 0 4 3 0 0 5 1 0
Worker’s sentence vector
AVG (Cosine)
Worker Metrics
• how much A WORKER disagrees with THE CROWD per sentence → the avg of all cosine distances between each worker’s sentence vector & the full sentence vector (minus that worker)
• are there consistently like-minded workers? → pairwise metric, avg for a particular worker → there may be communities of thought that consistently disagree with others, but agree within themselves
• low quality workers generally have high scores in both
• avg relations per sentence → per worker, the number of relations he/she chooses per sentence, averaged over all sentences he/she annotates. A high score here can help indicate low quality workers.
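A minimal sketch of two of these worker metrics follows. It assumes that cosine distance means 1 minus cosine similarity and that "minus that worker" means subtracting the worker's own vector from the sentence vector before comparing. The annotation data, worker ids, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def worker_sentence_disagreement(worker, annotations):
    """Average cosine distance between a worker's vector and the rest of the crowd."""
    distances = []
    for workers in annotations.values():
        if worker not in workers:
            continue
        own = workers[worker]
        totals = [sum(column) for column in zip(*workers.values())]
        rest = [t - mine for t, mine in zip(totals, own)]  # crowd minus this worker
        distances.append(1.0 - cosine(own, rest))
    return sum(distances) / len(distances) if distances else 0.0


def avg_relations_per_sentence(worker, annotations):
    """Mean number of relations the worker selected per annotated sentence."""
    counts = [sum(ws[worker]) for ws in annotations.values() if worker in ws]
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical data: sentence id -> worker id -> binary relation vector.
annotations = {
    "s1": {"a": [1, 0, 0], "b": [1, 0, 0], "spam": [0, 1, 1]},
    "s2": {"a": [0, 1, 0], "b": [0, 1, 0], "spam": [1, 1, 1]},
}

for w in ("a", "spam"):
    print(w,
          round(worker_sentence_disagreement(w, annotations), 3),
          avg_relations_per_sentence(w, annotations))
```

In this toy data the worker labelled "spam" picks many relations and rarely matches the others, so both scores come out higher than for worker "a", which is the pattern the slide describes for low quality workers.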
Sentence Metrics
• Sentence-relation score → core CrowdTruth metric for relation extraction → measured for each relation on each sentence as the cosine of the unit vector for the relation with the sentence vector, indicating that a relation is clearly or vaguely expressed
• Sentence clarity → defined for each sentence as the max relation score for that sentence, indicating a clear or an ambiguous or confusing sentence
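Sentence clarity can be sketched directly from this definition. The two vectors below are one plausible reading of the counts in the two "Disagreement for Sentence Clarity" examples (the extracted digits are partly fused, so treat them as assumptions), and the function names are illustrative.

```python
import math


def sentence_relation_scores(sentence_vec):
    """Cosine of the sentence vector with each relation's unit vector.

    Against a unit vector the cosine reduces to count / vector length.
    """
    norm = math.sqrt(sum(x * x for x in sentence_vec))
    if norm == 0:
        return [0.0] * len(sentence_vec)
    return [x / norm for x in sentence_vec]


def sentence_clarity(sentence_vec):
    """Max sentence-relation score over all relations."""
    return max(sentence_relation_scores(sentence_vec))


# Assumed readings of the two example vectors from the slides.
unclear = [0, 0, 0, 2, 3, 0, 0, 0, 1, 0, 0, 4, 4, 1]   # CHEST / PALPATION
clear = [0, 0, 0, 1, 0, 0, 0, 0, 13, 0, 0, 0, 0, 0]    # HYPERAEMIA / CONJUNCTIVITIS

print(round(sentence_clarity(unclear), 3))  # 0.583
print(round(sentence_clarity(clear), 3))    # 0.997
```

Under these assumed counts, the sentence where workers spread over six relations gets a much lower clarity than the one where 13 workers chose the same relation.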
Relation Metrics
• Relation similarity → the causal power (pairwise conditional probability). A high similarity score indicates the relations are confusable to workers.
• Relation ambiguity → defined for each relation as the max relation similarity for the relation. If a relation is clear, then it will have a low score.
• Relation clarity → defined for each relation as the max sentence-relation score for the relation over all sentences. If a relation has a high clarity score, it means that it is at least possible to express the relation clearly.
• Relation frequency → the number of sentences in which the relation is annotated at least once
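The four relation metrics can be sketched as follows. The slide does not spell out how the conditional probability is estimated, so this sketch assumes it is computed over sentences: the share of sentences containing relation Ri that also contain Rj. The sentence vectors are invented and the function names are hypothetical.

```python
import math


def relation_frequency(r, sentence_vectors):
    """Number of sentences in which relation r is annotated at least once."""
    return sum(1 for v in sentence_vectors if v[r] > 0)


def relation_similarity(ri, rj, sentence_vectors):
    """Assumed estimate of P(rj annotated | ri annotated), counted over sentences."""
    with_ri = [v for v in sentence_vectors if v[ri] > 0]
    if not with_ri:
        return 0.0
    return sum(1 for v in with_ri if v[rj] > 0) / len(with_ri)


def relation_ambiguity(ri, sentence_vectors):
    """Max similarity between ri and any other relation."""
    size = len(sentence_vectors[0])
    return max(relation_similarity(ri, rj, sentence_vectors)
               for rj in range(size) if rj != ri)


def relation_clarity(r, sentence_vectors):
    """Max sentence-relation score (count / vector length) for r over all sentences."""
    best = 0.0
    for v in sentence_vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm:
            best = max(best, v[r] / norm)
    return best


# Hypothetical sentence vectors over three relations.
vectors = [[5, 0, 0], [4, 1, 0], [0, 3, 3], [0, 2, 4]]

print(relation_frequency(1, vectors))      # 3
print(relation_similarity(2, 1, vectors))  # 1.0
print(relation_ambiguity(0, vectors))      # 0.5
print(relation_clarity(0, vectors))        # 1.0
```

In this toy data relation 2 never appears without relation 1, so its similarity to relation 1 is 1.0 and it would count as highly ambiguous, while relation 0 is expressed with full agreement in one sentence and so has clarity 1.0.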
Impact of Dependencies
Impact of Sentence Quality on Worker Quality
(a) The space with no filtering of sentences or relations: a single line cannot separate the spammers from the non-spammers.
(b) The space after sentence filtering, (c) after relation filtering, and (d) after both sentence and relation filtering. Sentence filtering makes the classes linearly separable, and the separation between the classes increases in the subsequent figures.
Impact of Relation Quality on Worker Quality
(a) The space with no filtering of sentences or relations: a single line cannot separate the spammers from the non-spammers. (c) The space after relation filtering.
Relation filtering defines the space much more clearly, with a large separation between positive and negative instances. The pairwise improvements to the worker scores are significant with p < .001, which is better than the sentence clarity improvements.
Combining Sentence & Relation Filtering
• first filtering out low clarity sentences
• then filtering vague and ambiguous relations
• worker metrics were computed on these new sentences and vectors
• This separates the space even further, and the pairwise improvement in worker scores from the baseline (unfiltered) is significant with p < .0005.
• The improvement over sentence filtering alone is also significant (p < .01)
• The improvement over relation filtering alone is only significant with p < .05.
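The order of the steps above can be sketched as a small pipeline. The talk does not give the thresholds or say which relations were filtered, so `min_clarity` and the set of dropped relation positions are hypothetical, as are the sentence vectors.

```python
import math


def sentence_clarity(vec):
    """Max sentence-relation score: largest count divided by vector length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return max(vec) / norm if norm else 0.0


def filter_sentences(sentence_vectors, min_clarity):
    """Step 1: keep only sentences whose clarity reaches the threshold."""
    return {sid: v for sid, v in sentence_vectors.items()
            if sentence_clarity(v) >= min_clarity}


def drop_relations(vec, dropped):
    """Step 2: remove the positions of relations judged vague or ambiguous."""
    return [x for i, x in enumerate(vec) if i not in dropped]


# Hypothetical sentence vectors over four relations.
sentence_vectors = {
    "s1": [0, 0, 6, 1],
    "s2": [2, 2, 2, 1],
    "s3": [5, 0, 0, 2],
}

kept = filter_sentences(sentence_vectors, min_clarity=0.8)  # assumed threshold
filtered = {sid: drop_relations(v, dropped={3}) for sid, v in kept.items()}
print(filtered)  # {'s1': [0, 0, 6], 's3': [5, 0, 0]}
```

Worker metrics would then be recomputed on the remaining sentences, with the same relation positions removed from each worker vector.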
Quality measures in semantic interpretation tasks are inter-dependent
• higher accuracy can be achieved by considering the impact of sentence quality & relation quality on worker quality measurements
• significant improvement in worker quality metrics with respect to known spammers by incorporating the quality of the individual sentences & target relations
• relationships between the different corners of the triangle of reference, e.g.
→ the impact of relation & worker quality on sentence measures
→ the impact of worker & sentence quality on relation measures
crowdtruth.org