#crowdtruth: biomedical data mining, modeling & semantic integration (bdm2i 2015) @iswc2015

Anca Dumitrache, Lora Aroyo, Chris Welty http://CrowdTruth.org

Achieving Expert-Level Annotation Quality with the Crowd

The Case of Medical Relation Extraction

Biomedical Data Mining, Modeling & Semantic Integration @ ISWC2015

#CrowdTruth @anouk_anca @laroyo @cawelty #BDM2I

•  Annotator disagreement is signal, not noise.

•  It is indicative of the variation in human semantic interpretation of signs

•  It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality

CrowdTruth http://CrowdTruth.org

•  Goals: collect a relation extraction gold standard; improve the performance of a relation extraction classifier

•  Approach: crowdsource 900 medical sentences; measure disagreement with CrowdTruth metrics; train & evaluate classifier with CrowdTruth score

CrowdTruth for medical relation extraction


RelEx TASK in CrowdFlower

Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1

Is ACUTE FEVER – related to → INFLUENZA AH1N1?


[Figure: Worker Vectors and the Sentence Vector]

Sentence Vector: 0 1 1 0 0 4 3 0 0 5 1 0
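A minimal sketch of how worker vectors could be combined into a sentence vector and turned into a per-relation score. It assumes the sentence vector is the element-wise sum of binary worker vectors and that the sentence-relation score is the cosine between the sentence vector and the unit vector of one relation; neither definition is spelled out on the slides. The function names and the toy worker vectors are hypothetical; only `slide_vector` is taken from the slide.

```python
import math

def sentence_vector(worker_vectors):
    """Element-wise sum of the binary worker vectors for one sentence (assumed aggregation)."""
    return [sum(column) for column in zip(*worker_vectors)]

def sentence_relation_score(sent_vec, relation_index):
    """Cosine between the sentence vector and the unit vector of one relation (assumed definition)."""
    norm = math.sqrt(sum(v * v for v in sent_vec))
    return sent_vec[relation_index] / norm if norm else 0.0

# Hypothetical toy input: three workers, four candidate relations.
toy_workers = [
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]
toy_sentence = sentence_vector(toy_workers)

# Sentence vector as shown on the slide (12 candidate relations).
slide_vector = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
scores = [sentence_relation_score(slide_vector, i) for i in range(len(slide_vector))]

for index, score in enumerate(scores):
    print(index, round(score, 3))
```

Under these assumptions, a relation picked by many workers scores close to 1 only when few other relations were picked for the same sentence, so the score reflects disagreement as well as popularity.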


Annotation Quality of Expert vs. Crowd Annotations

[Figure: annotation quality F1 by threshold; crowd 0.907 (p = 0.007), expert 0.844]

In the [0.6 - 0.8] threshold interval, the crowd significantly outperforms the expert, with a maximum F1 of 0.907 at the 0.7 threshold
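The threshold comparison can be illustrated with a short sketch: crowd scores are turned into positive/negative labels at a given threshold and compared with reference labels using F1. The scores and reference labels below are hypothetical and are not data from the experiments; `f1_at_threshold` is an illustrative helper, not part of any CrowdTruth tooling.

```python
def f1_at_threshold(scores, gold, threshold):
    """F1 of the labels obtained by marking every score >= threshold as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum((not p) and g for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentence-relation scores and reference labels (illustration only).
crowd_scores = [0.95, 0.80, 0.72, 0.65, 0.40, 0.30, 0.10]
gold_labels = [True, True, True, False, True, False, False]

# Sweep a few thresholds to see where the crowd labels agree best with the reference.
for threshold in (0.3, 0.5, 0.7, 0.9):
    print(threshold, round(f1_at_threshold(crowd_scores, gold_labels, threshold), 3))
```

Sweeping the threshold this way is what makes it possible to report a best-performing interval rather than a single fixed cut-off.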


RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations

[Figure: classifier F1; crowd 0.642 (p = 0.016), expert 0.638]

Crowd provides training data that is at least as good as, if not better than, that of experts

(crowd with pos./neg. threshold at 0.5)


Learning Curves

above 400 sent.: crowd consistently over baseline & single

above 600 sent.: crowd outperforms experts


Learning Curves Extended

crowd consistently performs better than baseline

# of Workers: Impact on Sentence-Relation Score


# of Workers: Impact on Annotation Quality

only 54 sent. had 15 or more workers


Experts vs. Crowd in Human Annotation: Overall Comparison

•  91% of expert annotations covered by the crowd

•  expert annotators reach agreement in only 30% of cases

•  most popular crowd vote covers 95% of this expert annotation agreement


Expert vs. Crowd in Human Annotation: Cost Comparison

                    F1      Cost per sentence
CrowdTruth          0.642   $0.66
Expert Annotator    0.638   $2.00
Single Annotator    0.492   $0.08
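As a rough worked example of the cost comparison, the per-sentence costs can be scaled to a full annotation run. The sketch assumes every method annotates all 900 sentences named in the approach, which the slides do not state explicitly; the variable names are illustrative.

```python
SENTENCES = 900  # assumed run size, taken from the approach (900 medical sentences)

# (F1, cost per sentence in USD) as listed in the cost comparison.
methods = {
    "CrowdTruth": (0.642, 0.66),
    "Expert Annotator": (0.638, 2.00),
    "Single Annotator": (0.492, 0.08),
}

# Total cost of annotating the assumed run with each method.
totals = {name: round(cost * SENTENCES, 2) for name, (f1, cost) in methods.items()}

# How many times more expensive the expert is than the crowd, per sentence.
expert_to_crowd = methods["Expert Annotator"][1] / methods["CrowdTruth"][1]

for name, (f1, cost) in methods.items():
    print(f"{name}: F1 {f1}, total ${totals[name]:.2f}")
print(f"expert / crowd cost ratio: {expert_to_crowd:.2f}")
```

Under this assumption the expert costs about three times as much as the crowd for a comparable F1, while the single annotator is cheapest but scores clearly lower.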

Experiments proved that:

•  crowd performs just as well as medical experts

•  crowd is also cheaper

•  crowd is always available

•  using only a few annotators for ground truth is faulty

•  min 10 workers/sentence are needed for highest quality annotations

•  CrowdTruth = a solution to Clinical NLP Challenge: lack of ground truth for training & benchmarking


#CrowdTruth @anouk_anca @laroyo @cawelty #BDM2I #ISWC2015

CrowdTruth.org

http://data.CrowdTruth.org/medical-relex
