TRANSCRIPT
INCEpTION workshop
Active Learning for Text Annotation
March 12th, 2018 | TU Darmstadt | Computer Science Department | UKP Lab | Ji-Ung Lee
Outline
- Motivation
- Active Learning in a Nutshell
- Active Learning Scenarios
- Sampling Strategies, Advantages and Disadvantages
- Conclusion
Motivation
A Supervised Machine Learning (ML) Approach
1. Annotate documents
2. Train a model
3. Evaluate it
Supervised Machine Learning:
[Figure: supervised machine learning as a static process. Human annotators (1) label documents, the ML model (2) is trained on them, and (3) it is evaluated.]
Motivation
A Supervised Machine Learning (ML) Approach
What if the model performs poorly?
1. Try out different models
   - Tune hyper-parameters
   - Use different features
   - ...
2. Annotate more data for training
   - Resource consuming
   - Not necessarily helpful
Motivation
Example – WSD (bass)
Task: Classify sentences containing bass into their correct senses (word sense disambiguation, WSD)
I like playing the bass guitar. → bass (instrument)
I caught a big bass yesterday. → bass (fish)
Turn down the bass. → bass (tone)
Motivation
Example – WSD (bass)
Perfect Model:
[Figure: a perfect model assigns each sentence its correct sense: bass (fish), bass (tone), or bass (instrument).]
Motivation
Example – WSD (bass)
Imperfect Model:
[Figure: an imperfect model, with the same sentences but some incorrect sense assignments among bass (fish), bass (tone), and bass (instrument).]
Motivation
Example – WSD (bass)
- True labels are unknown before annotation
- More annotated sentences for bass (instrument) may not help, since the model is already good for bass (instrument)
Annotating more data:
[Figure: data is sampled randomly and sent to human annotators, then to the ML model; whether the new annotations (e.g. more bass (instrument)) actually help is unknown.]
How to assess the helpfulness of unlabeled data?
- Active Learning
Active Learning in a Nutshell
Active learning hypothesis: Machine Learning (ML) algorithms can learn faster (and better) if they may choose the training data themselves [1]
1. Sample the most informative example(s)
2. Query an oracle (human annotator) to label those example(s)
3. Improve the model iteratively
Active Learning:
[Figure: active learning as an iterative process. Sampling (1) selects informative examples, human annotators (2) label them, and re-training & evaluation (3) update the ML model until the best model is reached.]
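The loop above can be made concrete in a few lines. Below is a minimal sketch of pool-based active learning with least-confidence sampling, assuming a scikit-learn-style classifier; `label_by_oracle` is a hypothetical stand-in for the human annotator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence(probs):
    """Uncertainty score: 1 - confidence of the predicted class."""
    return 1.0 - probs.max(axis=1)

def active_learning_loop(X_labeled, y_labeled, X_pool, label_by_oracle,
                         n_rounds=10, batch_size=1):
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)                 # (re-)train
        scores = least_confidence(model.predict_proba(X_pool))
        idx = np.argsort(scores)[-batch_size:]          # most uncertain
        y_new = label_by_oracle(X_pool[idx])            # query the oracle
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, idx, axis=0)         # remove from pool
    return model
```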
Active Learning Scenarios
Pool-based Sampling Scenario
- Pool-based Sampling Scenario [8]
  - Small pool of labeled data, large pool of unlabeled data
  - The AL model samples the examples assumed to be most helpful
  - A fitting scenario for annotating large corpora
Active Learning Scenarios
Stream-based Sampling Scenario
- Stream-based Sampling Scenario [3]
  - Continuous stream of unlabeled data
  - The AL model decides whether or not to sample each incoming example
  - Useful for online learning setups (see the sketch below)
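A minimal sketch of the stream-based decision rule, assuming a trained scikit-learn-style model; the margin threshold is an arbitrary illustrative choice.

```python
def should_query(model, x, threshold=0.2):
    # Gap between the two most confident classes for this example.
    probs = sorted(model.predict_proba([x])[0], reverse=True)
    margin = probs[0] - probs[1]
    return margin < threshold   # query the oracle only when uncertain
```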
Active Learning Scenarios
Membership Query Synthesis
- Membership Query Synthesis [2]
  - The AL model constructs the examples to be labeled itself
  - May lead to nonsensical data
  - Less suited for textual data
Sampling Strategies
How to determine the usefulness of unlabeled data?
- Uncertainty Sampling [8]
- Query-by-Committee [4]
- Expected Error Reduction [7]
- Variance Reduction [6]
Sampling Strategies
Uncertainty Sampling
Idea: Sample the example the model is most uncertain about

Measure uncertainty by:
- Prediction confidence
- Margin
- Entropy

For binary classification, all three are equivalent.
Sampling Strategies
Prediction confidence
Input | Tone | Instrument | Fish | Confidence
I sing bass in our choir. | 0.8 | 0.15 | 0.05 | 0.8
I like playing the bass guitar. | 0.49 | 0.36 | 0.15 | 0.49
I caught a big bass yesterday. | 0.5 | 0.45 | 0.05 | 0.5
Turn down the bass. | 0.5 | 0.25 | 0.25 | 0.5
- Sample the sentence with the lowest prediction confidence (sketched below)
- Only takes the confidence of the predicted class (e.g. tone) into account
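A small sketch of least-confidence sampling in plain Python; the probabilities are copied from the table above.

```python
probs = {
    "I sing bass in our choir.":       [0.8, 0.15, 0.05],
    "I like playing the bass guitar.": [0.49, 0.36, 0.15],
    "I caught a big bass yesterday.":  [0.5, 0.45, 0.05],
    "Turn down the bass.":             [0.5, 0.25, 0.25],
}
# Confidence = probability of the predicted (most likely) class.
confidence = {s: max(p) for s, p in probs.items()}
chosen = min(confidence, key=confidence.get)
print(chosen)  # "I like playing the bass guitar." (confidence 0.49)
```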
Sampling Strategies
Margin
Input | Tone | Instrument | Fish | Margin
I sing bass in our choir. | 0.8 | 0.15 | 0.05 | 0.65
I like playing the bass guitar. | 0.49 | 0.36 | 0.15 | 0.13
I caught a big bass yesterday. | 0.5 | 0.45 | 0.05 | 0.05
Turn down the bass. | 0.5 | 0.25 | 0.25 | 0.25
- Sample the sentence with the smallest margin between the most confident and the second most confident prediction (sketched below)
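The same idea as a sketch, reusing the `probs` dictionary from the least-confidence snippet above.

```python
def margin(p):
    # Difference between the two most confident classes.
    top, second = sorted(p, reverse=True)[:2]
    return top - second

margins = {s: margin(p) for s, p in probs.items()}
chosen = min(margins, key=margins.get)
print(chosen)  # "I caught a big bass yesterday." (margin 0.05)
```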
Sampling Strategies
Entropy
Input | Tone | Instrument | Fish | Entropy
I sing bass in our choir. | 0.8 | 0.15 | 0.05 | 0.61
I like playing the bass guitar. | 0.49 | 0.36 | 0.15 | 1.00
I caught a big bass yesterday. | 0.5 | 0.45 | 0.05 | 0.85
Turn down the bass. | 0.5 | 0.25 | 0.25 | 1.04
- Entropy measures the amount of disorder, somewhat similar to measuring the uncertainty over all classes
- Sample the sentence with the highest entropy (sketched below)
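A sketch of entropy-based sampling, again reusing the `probs` dictionary from above; the natural log reproduces the table values.

```python
import math

def entropy(p):
    # Shannon entropy in nats (natural log), as in the table.
    return -sum(q * math.log(q) for q in p if q > 0)

entropies = {s: entropy(p) for s, p in probs.items()}
chosen = max(entropies, key=entropies.get)
print(chosen)  # "Turn down the bass." (entropy ~1.04)
```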
Sampling Strategies
Query-by-Committee (QbC)
Idea: Learn a set of classifiers with different hypotheses
- Every classifier predicts (votes) for an unlabeled candidate example
- Sample the example with the most disagreement
- Popular measurements of disagreement:
  - Vote entropy
  - KL divergence
- Can be seen as a search through the hypothesis space
Sampling Strategies
(Soft) Vote Entropy
Prediction probabilities of two different models for [tone, instrument, fish]
Input | Model 1 | Model 2 | Entropy
I sing bass in our choir. | [0.8, 0.1, 0.1] | [0.6, 0.3, 0.1] | 0.80
I like playing the bass guitar. | [0.2, 0.7, 0.1] | [0.2, 0.6, 0.2] | 0.89
I caught a big bass yesterday. | [0.5, 0.3, 0.2] | [0.1, 0.1, 0.8] | 1.03
Turn down the bass. | [0.4, 0.1, 0.5] | [0.3, 0.6, 0.1] | 1.10
- QbC generalization of entropy-based uncertainty sampling
- Compute the entropy over the averaged prediction confidences
- Sample the sentence with the highest vote entropy (sketched below)
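A sketch of soft vote entropy for one committee, matching the last row of the table (natural log, as before).

```python
import math

def vote_entropy(distributions):
    # Entropy of the committee's averaged (consensus) distribution.
    k = len(distributions)
    consensus = [sum(d[i] for d in distributions) / k
                 for i in range(len(distributions[0]))]
    return -sum(p * math.log(p) for p in consensus if p > 0)

# "Turn down the bass.": consensus [0.35, 0.35, 0.3]
print(vote_entropy([[0.4, 0.1, 0.5], [0.3, 0.6, 0.1]]))  # ~1.10
```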
Sampling Strategies
KL Divergence (KLD)
Prediction probabilities of two different models for [tone, instrument, fish]
Input | Model 1 | Model 2 | KLD
I sing bass in our choir. | [0.8, 0.1, 0.1] | [0.6, 0.3, 0.1] | 0.033
I like playing the bass guitar. | [0.2, 0.7, 0.1] | [0.2, 0.6, 0.2] | 0.010
I caught a big bass yesterday. | [0.5, 0.3, 0.2] | [0.1, 0.1, 0.8] | 0.195
Turn down the bass. | [0.4, 0.1, 0.5] | [0.3, 0.6, 0.1] | 0.175
- Kullback–Leibler divergence (relative entropy) compares probability distributions; here it is the mean divergence of each model's prediction from the committee's consensus (average) distribution
- Sample the sentence with the highest KL divergence (sketched below)
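A sketch matching the table values, assuming disagreement is measured as the mean KL divergence of each member to the consensus.

```python
import math

def kl(p, q):
    # KL divergence D(p || q), skipping zero-probability terms.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl_to_consensus(distributions):
    k = len(distributions)
    consensus = [sum(d[i] for d in distributions) / k
                 for i in range(len(distributions[0]))]
    return sum(kl(d, consensus) for d in distributions) / k

# "I caught a big bass yesterday." shows the largest disagreement:
print(mean_kl_to_consensus([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]))  # ~0.195
```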
Sampling Strategies
Expected Error / Variance Reduction
Uncertainty Sampling:
- Assumes the most uncertain example gives the most improvement in prediction performance
- This is not necessarily true

Expected Error Reduction [7]:
- Minimize the expected future error directly

Variance Reduction [6]:
- Computing the expected future error is costly
- Minimize it indirectly by minimizing the output variance
Expected Error Reduction
Algorithm 1 Expected Error Reduction
Require: model M, labeled data L, unlabeled data X, labels Y, expected loss E(M)
for x ∈ X do
    for y ∈ Y do
        L̂ ← L ∪ {(x, y)}
        M̂ ← train(L̂)
        loss_{x,y} ← E(M̂)
    end for
    loss_x ← avg_y(loss_{x,y})
end for
x̂ ← argmin_x loss_x
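A direct transcription of Algorithm 1 as a Python sketch; `train` and `expected_loss` are hypothetical stand-ins for the slide's train(·) and E(·). Retraining once per candidate/label pair is exactly what makes the method expensive.

```python
def expected_error_reduction(labeled, unlabeled, labels, train, expected_loss):
    best_x, best_loss = None, float("inf")
    for x in unlabeled:
        losses = []
        for y in labels:
            model = train(labeled + [(x, y)])   # retrain with (x, y) added
            losses.append(expected_loss(model))
        avg_loss = sum(losses) / len(losses)    # average over possible labels
        if avg_loss < best_loss:                # keep the candidate that
            best_x, best_loss = x, avg_loss     # minimizes expected loss
    return best_x
```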
Advantages and Disadvantages
Uncertainty Sampling
Pros
- Simple, fast
- Easy to implement
- Usable with any probabilistic model

Cons
- Does not account for outliers
- Confidently wrong predictions may never get sampled:
Input | Tone | Instrument | Fish
I sing bass in our choir. | 0.1 | 0.1 | 0.8
Advantages and Disadvantages
Query-by-Committee
Pros
- Simple
- Usable with any learning algorithm, or with sets of different algorithms

Cons
- Difficult to train
- Difficult to maintain

If using different algorithms (see the sketch below):
- Make sure to normalize their outputs, if necessary
- Consider using weighted voting to account for different model performances
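A sketch of weighted soft voting for a heterogeneous committee; the weights (e.g. held-out accuracies) are illustrative assumptions.

```python
import numpy as np

def weighted_consensus(prob_outputs, weights):
    probs = np.array(prob_outputs, dtype=float)
    probs /= probs.sum(axis=1, keepdims=True)        # normalize each model
    w = np.array(weights, dtype=float)
    return (w[:, None] * probs).sum(axis=0) / w.sum()  # weighted average

# Model 2 (accuracy 0.9) counts more than model 1 (accuracy 0.6):
print(weighted_consensus([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]], [0.6, 0.9]))
```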
Advantages and Disadvantages
Expected Error / Variance Reduction
Pros
- Directly minimizes the expected error / variance

Cons
- Computationally expensive
- Difficult to implement
- Limited to the pool-based sampling scenario
- Variance reduction is limited to regression models
Conclusion
Active Learning for Text Annotation
In general:
- Allows iterative training of a model while requiring less training data
- Gives a good estimate of how models may perform later on
- May sample data that is hard to annotate (increasing annotation time):
Input | Tone | Instrument | Fish
Turn down the bass. | 0.4 | 0.3 | 0.3
Watch out for:
- Skewed label distributions (QbC can help)
- Unreliable oracles, e.g. crowd-sourcing (estimate annotator performance)
- Outliers (use cluster-based extensions of active learning)
Thank you for your attention!
Other Query Strategies
Cluster-based Approaches:
- Density Weighting (sketched below)
- Hierarchical Sampling

Advantages, Disadvantages:
- Pros: model the actual input distribution, less prone to outliers
- Cons: the actual input distribution may not relate to the actual labels
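A sketch of density weighting, assuming scikit-learn feature vectors: scale an uncertainty score by each candidate's average similarity to the pool, so that isolated outliers score low even when the model is uncertain about them.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def density_weighted_scores(uncertainty, X_pool, beta=1.0):
    # Average cosine similarity of each candidate to the whole pool.
    density = cosine_similarity(X_pool).mean(axis=1)
    return uncertainty * density ** beta   # down-weight outliers
```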
References
[1] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report, University of Wisconsin–Madison, 2010.
[2] Dana Angluin. Queries and Concept Learning. Machine Learning, 2(4):319–342, April 1988. doi:10.1023/A:1022821128753.
[3] Les E. Atlas, David A. Cohn, and Richard E. Ladner. Training Connectionist Networks with Queries and Selective Sampling. In Advances in Neural Information Processing Systems 2, pages 566–573. Morgan Kaufmann, 1990.
[4] H. S. Seung, M. Opper, and H. Sompolinsky. Query by Committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92), Pittsburgh, Pennsylvania, USA, pages 287–294. ACM, 1992. doi:10.1145/130385.130417.
[5] Burr Settles, Mark Craven, and Soumya Ray. Multiple-Instance Active Learning. In Advances in Neural Information Processing Systems 20, pages 1289–1296. Curran Associates Inc., 2008.
[6] David A. Cohn. Neural Network Exploration Using Optimal Experiment Design. In Advances in Neural Information Processing Systems 6, pages 679–686. Morgan Kaufmann, 1994.
[7] Nicholas Roy and Andrew McCallum. Toward Optimal Active Learning Through Sampling Estimation of Error Reduction. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 441–448. Morgan Kaufmann, 2001.
[8] David D. Lewis and William A. Gale. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '94), Dublin, Ireland, pages 3–12. Springer-Verlag New York Inc., 1994.