TRANSCRIPT
INCEpTION workshop
Active Learning for Text Annotation
March 12th, 2018 | TU Darmstadt | Computer Science Department | UKP Lab | Ji-Ung Lee
Outline
- Motivation
- Active Learning in a Nutshell
- Active Learning Scenarios
- Sampling Strategies, Advantages and Disadvantages
- Conclusion
Motivation
A Supervised Machine Learning (ML) Approach
1. Annotate documents
2. Train a model
3. Evaluate it
Supervised Machine Learning:
[Figure: supervised machine learning as a static process. Human annotators (1) label documents, the ML model (2) is trained on them, and (3) it is evaluated.]
Motivation
A Supervised Machine Learning (ML) Approach
What if the model performs poorly?
1. Try out different models
   - Tune hyper-parameters
   - Use different features
   - ...
2. Annotate more data for training
   - Resource consuming
   - Not necessarily helpful
Motivation
Example – WSD (bass)
Task: Classify sentences containing bass into their correct senses (word sense disambiguation, WSD)
I like playing the bass guitar. → bass (instrument)
I caught a big bass yesterday. → bass (fish)
Turn down the bass. → bass (tone)
Motivation
Example – WSD (bass)
Perfect Model:
[Figure: a perfect model assigns each sentence its correct sense: bass (fish), bass (tone), or bass (instrument).]
Motivation
Example – WSD (bass)
Imperfect Model:
[Figure: an imperfect model, with the same sentences but some incorrect sense assignments among bass (fish), bass (tone), and bass (instrument).]
Motivation
Example – WSD (bass)
- True labels are unknown before annotation
- More annotated sentences for bass (instrument) may not help, since the model is already good for bass (instrument)
Annotating more data:
[Figure: data is sampled randomly and sent to human annotators, then to the ML model; whether the new annotations (e.g. more bass (instrument)) actually help is unknown.]
How to assess the helpfulness of unlabeled data?
- Active Learning
Active Learning in a Nutshell
Active learning hypothesis: Machine Learning (ML) algorithms can learn faster (and better) if they may choose the training data themselves [1]
1. Sample the most informative example(s)
2. Query an oracle (human annotator) to label those example(s)
3. Improve the model iteratively
Active Learning:
[Figure: active learning as an iterative process. Sampling (1) selects informative examples, human annotators (2) label them, and re-training & evaluation (3) update the ML model until the best model is reached.]
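The loop above can be made concrete in a few lines. Below is a minimal sketch of pool-based active learning with least-confidence sampling, assuming a scikit-learn-style classifier; `label_by_oracle` is a hypothetical stand-in for the human annotator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence(probs):
    """Uncertainty score: 1 - confidence of the predicted class."""
    return 1.0 - probs.max(axis=1)

def active_learning_loop(X_labeled, y_labeled, X_pool, label_by_oracle,
                         n_rounds=10, batch_size=1):
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_labeled, y_labeled)                 # (re-)train
        scores = least_confidence(model.predict_proba(X_pool))
        idx = np.argsort(scores)[-batch_size:]          # most uncertain
        y_new = label_by_oracle(X_pool[idx])            # query the oracle
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, idx, axis=0)         # remove from pool
    return model
```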
Active Learning Scenarios
Pool-based Sampling Scenario
- Pool-based Sampling Scenario [8]
  - Small pool of labeled data, large pool of unlabeled data
  - The AL model samples the examples assumed to be most helpful
  - A fitting scenario for annotating large corpora
Active Learning Scenarios
Stream-based Sampling Scenario
- Stream-based Sampling Scenario [3]
  - Continuous stream of unlabeled data
  - The AL model decides whether or not to sample each incoming example
  - Useful for online learning setups (see the sketch below)
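A minimal sketch of the stream-based decision rule, assuming a trained scikit-learn-style model; the margin threshold is an arbitrary illustrative choice.

```python
def should_query(model, x, threshold=0.2):
    # Gap between the two most confident classes for this example.
    probs = sorted(model.predict_proba([x])[0], reverse=True)
    margin = probs[0] - probs[1]
    return margin < threshold   # query the oracle only when uncertain
```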
Active Learning Scenarios
Membership Query Synthesis
- Membership Query Synthesis [2]
  - The AL model constructs the examples to be labeled itself
  - May lead to nonsensical data
  - Less suited for textual data
Sampling Strategies
How to determine the usefulness of unlabeled data?
- Uncertainty Sampling [8]
- Query-by-Committee [4]
- Expected Error Reduction [7]
- Variance Reduction [6]
Sampling Strategies
Uncertainty Sampling
Idea: Sample the example the model is most uncertain about

Measure uncertainty by:
- Prediction confidence
- Margin
- Entropy

For binary classification, all three are equivalent.
Sampling Strategies
Prediction confidence
Input | Tone | Instrument | Fish | Confidence
I sing bass in our choir. | 0.8 | 0.15 | 0.05 | 0.8
I like playing the bass guitar. | 0.49 | 0.36 | 0.15 | 0.49
I caught a big bass yesterday. | 0.5 | 0.45 | 0.05 | 0.5
Turn down the bass. | 0.5 | 0.25 | 0.25 | 0.5
- Sample the sentence with the lowest prediction confidence (sketched below)
- Only takes the confidence of the predicted class (e.g. tone) into account
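A small sketch of least-confidence sampling in plain Python; the probabilities are copied from the table above.

```python
probs = {
    "I sing bass in our choir.":       [0.8, 0.15, 0.05],
    "I like playing the bass guitar.": [0.49, 0.36, 0.15],
    "I caught a big bass yesterday.":  [0.5, 0.45, 0.05],
    "Turn down the bass.":             [0.5, 0.25, 0.25],
}
# Confidence = probability of the predicted (most likely) class.
confidence = {s: max(p) for s, p in probs.items()}
chosen = min(confidence, key=confidence.get)
print(chosen)  # "I like playing the bass guitar." (confidence 0.49)
```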
Sampling Strategies
Margin
Input | Tone | Instrument | Fish | Margin
I sing bass in our choir. | 0.8 | 0.15 | 0.05 | 0.65
I like playing the bass guitar. | 0.49 | 0.36 | 0.15 | 0.13
I caught a big bass yesterday. | 0.5 | 0.45 | 0.05 | 0.05
Turn down the bass. | 0.5 | 0.25 | 0.25 | 0.25
- Sample the sentence with the smallest margin between the most confident and the second most confident prediction (sketched below)
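The same idea as a sketch, reusing the `probs` dictionary from the least-confidence snippet above.

```python
def margin(p):
    # Difference between the two most confident classes.
    top, second = sorted(p, reverse=True)[:2]
    return top - second

margins = {s: margin(p) for s, p in probs.items()}
chosen = min(margins, key=margins.get)
print(chosen)  # "I caught a big bass yesterday." (margin 0.05)
```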
Sampling Strategies
Entropy
Input | Tone | Instrument | Fish | Entropy
I sing bass in our choir. | 0.8 | 0.15 | 0.05 | 0.61
I like playing the bass guitar. | 0.49 | 0.36 | 0.15 | 1.00
I caught a big bass yesterday. | 0.5 | 0.45 | 0.05 | 0.85
Turn down the bass. | 0.5 | 0.25 | 0.25 | 1.04
- Entropy measures the amount of disorder, somewhat similar to measuring the uncertainty over all classes
- Sample the sentence with the highest entropy (sketched below)
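A sketch of entropy-based sampling, again reusing the `probs` dictionary from above; the natural log reproduces the table values.

```python
import math

def entropy(p):
    # Shannon entropy in nats (natural log), as in the table.
    return -sum(q * math.log(q) for q in p if q > 0)

entropies = {s: entropy(p) for s, p in probs.items()}
chosen = max(entropies, key=entropies.get)
print(chosen)  # "Turn down the bass." (entropy ~1.04)
```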
Sampling Strategies
Query-by-Committee (QbC)
Idea: Learn a set of classifiers with different hypotheses
- Every classifier predicts (votes) for an unlabeled candidate example
- Sample the example with the most disagreement
- Popular measurements of disagreement:
  - Vote entropy
  - KL divergence
- Can be seen as a search through the hypothesis space
Sampling Strategies
(Soft) Vote Entropy
Prediction probabilities of two different models for [tone, instrument, fish]
Input | Model 1 | Model 2 | Entropy
I sing bass in our choir. | [0.8, 0.1, 0.1] | [0.6, 0.3, 0.1] | 0.80
I like playing the bass guitar. | [0.2, 0.7, 0.1] | [0.2, 0.6, 0.2] | 0.89
I caught a big bass yesterday. | [0.5, 0.3, 0.2] | [0.1, 0.1, 0.8] | 1.03
Turn down the bass. | [0.4, 0.1, 0.5] | [0.3, 0.6, 0.1] | 1.10
- QbC generalization of entropy-based uncertainty sampling
- Compute the entropy over the averaged prediction confidences
- Sample the sentence with the highest vote entropy (sketched below)
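A sketch of soft vote entropy for one committee, matching the last row of the table (natural log, as before).

```python
import math

def vote_entropy(distributions):
    # Entropy of the committee's averaged (consensus) distribution.
    k = len(distributions)
    consensus = [sum(d[i] for d in distributions) / k
                 for i in range(len(distributions[0]))]
    return -sum(p * math.log(p) for p in consensus if p > 0)

# "Turn down the bass.": consensus [0.35, 0.35, 0.3]
print(vote_entropy([[0.4, 0.1, 0.5], [0.3, 0.6, 0.1]]))  # ~1.10
```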
Sampling Strategies
KL Divergence (KLD)
Prediction probabilities of two different models for [tone, instrument, fish]
Input | Model 1 | Model 2 | KLD
I sing bass in our choir. | [0.8, 0.1, 0.1] | [0.6, 0.3, 0.1] | 0.033
I like playing the bass guitar. | [0.2, 0.7, 0.1] | [0.2, 0.6, 0.2] | 0.010
I caught a big bass yesterday. | [0.5, 0.3, 0.2] | [0.1, 0.1, 0.8] | 0.195
Turn down the bass. | [0.4, 0.1, 0.5] | [0.3, 0.6, 0.1] | 0.175
- Kullback–Leibler divergence (relative entropy) compares probability distributions; here it is the mean divergence of each model's prediction from the committee's consensus (average) distribution
- Sample the sentence with the highest KL divergence (sketched below)
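A sketch matching the table values, assuming disagreement is measured as the mean KL divergence of each member to the consensus.

```python
import math

def kl(p, q):
    # KL divergence D(p || q), skipping zero-probability terms.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kl_to_consensus(distributions):
    k = len(distributions)
    consensus = [sum(d[i] for d in distributions) / k
                 for i in range(len(distributions[0]))]
    return sum(kl(d, consensus) for d in distributions) / k

# "I caught a big bass yesterday." shows the largest disagreement:
print(mean_kl_to_consensus([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]))  # ~0.195
```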
Sampling Strategies
Expected Error / Variance Reduction
Uncertainty Sampling:
- Assumes the most uncertain example gives the most improvement in prediction performance
- This is not necessarily true

Expected Error Reduction [7]:
- Minimize the expected future error directly

Variance Reduction [6]:
- Computing the expected future error is costly
- Minimize it indirectly by minimizing the output variance
Expected Error Reduction
Algorithm 1 Expected Error Reduction
Require: model M, labeled data L, unlabeled data X, labels Y, expected loss E(M)
for x ∈ X do
    for y ∈ Y do
        L̂ ← L ∪ {(x, y)}
        M̂ ← train(L̂)
        loss_{x,y} ← E(M̂)
    end for
    loss_x ← avg_y(loss_{x,y})
end for
x̂ ← argmin_x loss_x
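A direct transcription of Algorithm 1 as a Python sketch; `train` and `expected_loss` are hypothetical stand-ins for the slide's train(·) and E(·). Retraining once per candidate/label pair is exactly what makes the method expensive.

```python
def expected_error_reduction(labeled, unlabeled, labels, train, expected_loss):
    best_x, best_loss = None, float("inf")
    for x in unlabeled:
        losses = []
        for y in labels:
            model = train(labeled + [(x, y)])   # retrain with (x, y) added
            losses.append(expected_loss(model))
        avg_loss = sum(losses) / len(losses)    # average over possible labels
        if avg_loss < best_loss:                # keep the candidate that
            best_x, best_loss = x, avg_loss     # minimizes expected loss
    return best_x
```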
Advantages and Disadvantages
Uncertainty Sampling
Pros
- Simple, fast
- Easy to implement
- Usable with any probabilistic model

Cons
- Does not account for outliers
- Confidently wrong predictions may never get sampled:
Input | Tone | Instrument | Fish
I sing bass in our choir. | 0.1 | 0.1 | 0.8
Advantages and Disadvantages
Query-by-Committee
Pros
- Simple
- Usable with any learning algorithm, or with sets of different algorithms

Cons
- Difficult to train
- Difficult to maintain

If using different algorithms (see the sketch below):
- Make sure to normalize their outputs, if necessary
- Consider using weighted voting to account for different model performances
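A sketch of weighted soft voting for a heterogeneous committee; the weights (e.g. held-out accuracies) are illustrative assumptions.

```python
import numpy as np

def weighted_consensus(prob_outputs, weights):
    probs = np.array(prob_outputs, dtype=float)
    probs /= probs.sum(axis=1, keepdims=True)        # normalize each model
    w = np.array(weights, dtype=float)
    return (w[:, None] * probs).sum(axis=0) / w.sum()  # weighted average

# Model 2 (accuracy 0.9) counts more than model 1 (accuracy 0.6):
print(weighted_consensus([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]], [0.6, 0.9]))
```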
Advantages and Disadvantages
Expected Error / Variance Reduction
Pros
- Directly minimizes the expected error / variance

Cons
- Computationally expensive
- Difficult to implement
- Limited to the pool-based sampling scenario
- Variance reduction is limited to regression models
Conclusion
Active Learning for Text Annotation
In general:
- Allows iterative training of a model while requiring less training data
- Gives a good estimate of how models may perform later on
- May sample data that is hard to annotate (increasing annotation time):
Input | Tone | Instrument | Fish
Turn down the bass. | 0.4 | 0.3 | 0.3
Watch out for:
- Skewed label distributions (QbC can help)
- Unreliable oracles, e.g. crowd-sourcing (estimate annotator performance)
- Outliers (use cluster-based extensions of active learning)
Thank you for your attention!
Other Query Strategies
Cluster-based Approaches:
- Density Weighting (sketched below)
- Hierarchical Sampling

Advantages, Disadvantages:
- Pros: model the actual input distribution, less prone to outliers
- Cons: the actual input distribution may not relate to the actual labels
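A sketch of density weighting, assuming scikit-learn feature vectors: scale an uncertainty score by each candidate's average similarity to the pool, so that isolated outliers score low even when the model is uncertain about them.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def density_weighted_scores(uncertainty, X_pool, beta=1.0):
    # Average cosine similarity of each candidate to the whole pool.
    density = cosine_similarity(X_pool).mean(axis=1)
    return uncertainty * density ** beta   # down-weight outliers
```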
References
[1] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report, University of Wisconsin–Madison, 2010.
[2] Dana Angluin. Queries and Concept Learning. Machine Learning, 2(4):319–342, April 1988. doi:10.1023/A:1022821128753.
[3] Les E. Atlas, David A. Cohn, and Richard E. Ladner. Training Connectionist Networks with Queries and Selective Sampling. In Advances in Neural Information Processing Systems 2, pages 566–573. Morgan Kaufmann, 1990.
[4] H. S. Seung, M. Opper, and H. Sompolinsky. Query by Committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92), Pittsburgh, Pennsylvania, USA, pages 287–294. ACM, 1992. doi:10.1145/130385.130417.
[5] Burr Settles, Mark Craven, and Soumya Ray. Multiple-Instance Active Learning. In Advances in Neural Information Processing Systems 20, pages 1289–1296. Curran Associates Inc., 2008.
[6] David A. Cohn. Neural Network Exploration Using Optimal Experiment Design. In Advances in Neural Information Processing Systems 6, pages 679–686. Morgan Kaufmann, 1994.
[7] Nicholas Roy and Andrew McCallum. Toward Optimal Active Learning Through Sampling Estimation of Error Reduction. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 441–448. Morgan Kaufmann, 2001.
[8] David D. Lewis and William A. Gale. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '94), Dublin, Ireland, pages 3–12. Springer-Verlag New York Inc., 1994.