
Co-training

LING 572

Fei Xia

02/21/06

Overview

• Proposed by Blum and Mitchell (1998)

• Important work:
  – (Nigam and Ghani, 2000)
  – (Goldman and Zhou, 2000)
  – (Abney, 2002)
  – (Sarkar, 2002)
  – …

• Used in document classification, parsing, etc.

Outline

• Basic concept: (Blum and Mitchell, 1998)

• Relation with other SSL algorithms: (Nigam and Ghani, 2000)

An example

• Web-page classification: e.g., find homepages of faculty members.
  – Page text: words occurring on that page
    e.g., “research interest”, “teaching”
  – Hyperlink text: words occurring in hyperlinks that point to that page
    e.g., “my advisor”

Two views

• Features can be split into two sets:
  – The instance space: $X = X_1 \times X_2$
  – Each example: $x = (x_1, x_2)$

• $D$: the distribution over $X$

• $C_1$: the set of target functions over $X_1$, with $f_1 \in C_1$

• $C_2$: the set of target functions over $X_2$, with $f_2 \in C_2$

Assumption #1: compatibility

• The instance distribution D is compatible with the target function f=(f1, f2) if for any x=(x1, x2) with non-zero prob, f(x)=f1(x1)=f2(x2).

• The compatibility of f with D:

$p = 1 - \Pr_{D}\big[(x_1, x_2) : f_1(x_1) \neq f_2(x_2)\big]$

Each set of features is sufficient for classification

Assumption #2: conditional independence
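In Blum and Mitchell (1998), this assumption states that the two views of an example are conditionally independent given its label $y$; written out (a paraphrase of the paper's setting, not the slide's wording):

$$P(x_1, x_2 \mid y) \;=\; P(x_1 \mid y)\, P(x_2 \mid y)$$

That is, once the label is known, one view carries no further information about the other.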

Co-training algorithm
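A minimal Python sketch of the Blum and Mitchell (1998) co-training loop described on this and the next slide, using scikit-learn's MultinomialNB as a stand-in for the Naïve Bayes learner. The bookkeeping details (index handling, tie-breaking when both classifiers pick the same example) are illustrative choices; p, n, the pool size |U'|, and the number of rounds follow the values reported in the experiments below.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u,
             p=1, n=3, pool_size=75, rounds=30, seed=0):
    """Sketch of Blum & Mitchell (1998) co-training.

    X1_*/X2_* hold the two views of each example (e.g., page words vs.
    hyperlink words); p/n are the numbers of positive/negative examples
    each classifier labels per round; pool_size is |U'|.
    Assumes dense count matrices and binary labels {0, 1}.
    """
    rng = np.random.default_rng(seed)
    unlab = list(rng.permutation(len(X1_u)))                 # indices still in U
    pool = [unlab.pop() for _ in range(min(pool_size, len(unlab)))]  # pool U'

    L1, L2, y = list(X1_l), list(X2_l), list(y_l)

    for _ in range(rounds):
        h1 = MultinomialNB().fit(np.array(L1), y)            # view-1 classifier
        h2 = MultinomialNB().fit(np.array(L2), y)            # view-2 classifier

        newly_labeled = {}                    # pool index -> predicted label
        for h, X_u in ((h1, X1_u), (h2, X2_u)):
            proba = h.predict_proba(np.array([X_u[i] for i in pool]))
            pos_col = list(h.classes_).index(1)
            by_pos = np.argsort(-proba[:, pos_col])   # most confident positives first
            by_neg = np.argsort(proba[:, pos_col])    # most confident negatives first
            for j in by_pos[:p]:
                newly_labeled.setdefault(pool[j], 1)
            for j in by_neg[:n]:
                newly_labeled.setdefault(pool[j], 0)

        # Move chosen examples from U' into the labeled set with their
        # predicted labels, then replenish U' from U.
        for i, label in newly_labeled.items():
            L1.append(X1_u[i])
            L2.append(X2_u[i])
            y.append(label)
            pool.remove(i)
        while len(pool) < pool_size and unlab:
            pool.append(unlab.pop())
        if not pool:
            break

    return h1, h2
```

Here h1 sees only view-1 features and h2 only view-2 features; each round, the most confidently labeled examples from the pool U' are added to the shared labeled set used by both classifiers.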

Co-training algorithm (cont)

• Why use U’, in addition to U?
  – Using U’ yields better results.
  – Possible explanation: it forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U.

• Choosing p and n: the ratio p/n should match the ratio of positive to negative examples in D.

• Choosing the number of iterations and the size of U’.

Intuition behind the co-training algorithm

• h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa.

• If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.

Experiments: setting

• 1051 web pages from 4 CS departments
  – 263 pages (25%) as test data
  – The remaining 75% of pages:
    • Labeled data: 3 positive and 9 negative examples
    • Unlabeled data: the rest (776 pages)

• Manually labeled into a number of categories: e.g., “course home page”.

• Two views:
  – View #1 (page-based): words in the page
  – View #2 (hyperlink-based): words in the hyperlinks

• Learner: Naïve Bayes

Naïve Bayes classifier (Nigam and Ghani, 2000)
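As a reminder of the base learner, a standard statement of the multinomial Naïve Bayes decision rule for a document $d = (w_1, \dots, w_n)$ (general background, not copied from the slide) is:

$$\hat{c} \;=\; \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

where $P(c)$ and $P(w_i \mid c)$ are estimated from smoothed counts over the training documents; in the EM-style methods below they are re-estimated from probabilistically labeled documents as well.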

Experiment: results

Test error rate (%):

                        Page-based     Hyperlink-based    Combined
                        classifier     classifier         classifier
Supervised training        12.9            12.4              11.1
Co-training                 6.2            11.6               5.0

Settings: p=1, n=3, # of iterations: 30, |U’| = 75

Questions

• Can co-training algorithms be applied to datasets without natural feature divisions?

• How sensitive are the co-training algorithms to the correctness of the assumptions?

• What is the relation between co-training and other SSL methods (e.g., self-training)?

(Nigam and Ghani, 2000)

EM

• Pool the features together.

• Use initial labeled data to get initial parameter estimates.

• In each iteration use all the data (labeled and unlabeled) to re-estimate the parameters.

• Repeat until convergence (see the sketch below).
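A minimal sketch of this procedure with a pooled-feature Naïve Bayes model, assuming dense count matrices and integer labels 0..K−1. Re-estimating the parameters from probabilistically labeled data is implemented here by replicating each unlabeled example once per class with its posterior probability as a sample weight; this is one standard way to realize the M-step, not necessarily the exact implementation in Nigam and Ghani (2000).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_ssl(X_lab, y_lab, X_unlab, n_classes=2, rounds=10):
    """Semi-supervised EM with a pooled-feature Naive Bayes model.

    Assumes dense count matrices and integer labels 0..n_classes-1,
    all of which appear in y_lab.
    """
    # Initial parameter estimates from the labeled data only.
    model = MultinomialNB().fit(X_lab, y_lab)

    for _ in range(rounds):          # or loop until the labels stop changing
        # E-step: probabilistically label the unlabeled data.
        post = model.predict_proba(X_unlab)        # shape (|U|, n_classes)

        # M-step: re-estimate parameters from ALL data.  Each unlabeled
        # example is replicated once per class, weighted by its posterior;
        # labeled examples keep weight 1 and their true label.
        X_all = np.vstack([X_lab] + [X_unlab] * n_classes)
        y_all = np.concatenate(
            [y_lab] + [np.full(len(X_unlab), c) for c in range(n_classes)])
        w_all = np.concatenate(
            [np.ones(len(X_lab))] + [post[:, c] for c in range(n_classes)])
        model = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

    return model
```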

Experimental results: WebKB course database

EM performs better than co-training. Both are close to the supervised method when trained on more labeled data.

Another experiment: The News 2*2 dataset

• A semi-artificial dataset

• Conditional independence assumption holds.

Co-training outperforms EM and the “oracle” result.

Co-training vs. EM

• Co-training splits features, EM does not.

• Co-training incrementally uses the unlabeled data.

• EM probabilistically labels all the data at each round; EM iteratively uses the unlabeled data.

Co-EM: EM with feature split

• Repeat until convergence (a sketch follows below):
  – Train an A-feature-set classifier using the labeled data and the unlabeled data with B’s labels.
  – Use classifier A to probabilistically label all the unlabeled data.
  – Train a B-feature-set classifier using the labeled data and the unlabeled data with A’s labels.
  – B re-labels the data for use by A.
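A compact sketch of this alternation under the same assumptions as the EM sketch above (dense matrices, labels 0..K−1, scikit-learn MultinomialNB). Bootstrapping the first set of probabilistic labels by training the A-view classifier on the labeled data alone is my assumption about how the loop starts.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def fit_weighted_nb(X_lab, y_lab, X_unlab, post, n_classes):
    """Fit NB on labeled data plus probabilistically labeled unlabeled data."""
    X_all = np.vstack([X_lab] + [X_unlab] * n_classes)
    y_all = np.concatenate(
        [y_lab] + [np.full(len(X_unlab), c) for c in range(n_classes)])
    w_all = np.concatenate(
        [np.ones(len(X_lab))] + [post[:, c] for c in range(n_classes)])
    return MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

def co_em(XA_lab, XB_lab, y_lab, XA_unlab, XB_unlab, n_classes=2, rounds=10):
    """Co-EM: EM-style probabilistic labeling, but with the feature split."""
    # Bootstrap: train the A-view classifier on the labeled data alone and
    # let it assign initial probabilistic labels to the unlabeled data.
    clf_A = MultinomialNB().fit(XA_lab, y_lab)
    post_A = clf_A.predict_proba(XA_unlab)

    for _ in range(rounds):          # or loop until the labels stabilize
        # B learns from the labels assigned by A, then re-labels the data ...
        clf_B = fit_weighted_nb(XB_lab, y_lab, XB_unlab, post_A, n_classes)
        post_B = clf_B.predict_proba(XB_unlab)
        # ... and A learns from the labels assigned by B, and re-labels.
        clf_A = fit_weighted_nb(XA_lab, y_lab, XA_unlab, post_B, n_classes)
        post_A = clf_A.predict_proba(XA_unlab)

    return clf_A, clf_B
```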

Four SSL methods

Results on the News 2*2 dataset

Random feature split

                 Natural split    Random split
Co-training          3.7%             5.5%
Co-EM                3.3%             5.1%

When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.

Assumptions

• Assumptions made by the underlying classifier (supervised learner):
  – Naïve Bayes: words occur independently of each other, given the class of the document.
  – Co-training uses the classifier to rank the unlabeled examples by confidence.
  – EM uses the classifier to assign probabilities to each unlabeled example.

• Assumptions made by the SSL method:
  – Co-training: conditional independence assumption.
  – EM: maximizing likelihood correlates with reducing classification errors.

Summary of (Nigam and Ghani, 2000)

• Comparison of four SSL methods: self-training, co-training, EM, co-EM.

• The performance of the SSL methods depends on how well the underlying assumptions are met.

• Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.

Variations of co-training

• Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set.

• Zhou and Li (2005) use three learners. If two agree, the data is used to teach the third learner.

• Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.

An alternative?

• L → L1, L → L2
• U → U1, U → U2
• Repeat:
  – Train h1 using L1 on Feat Set 1
  – Train h2 using L2 on Feat Set 2
  – Classify U2 with h1 and let U2’ be the subset with the most confident scores; L2 + U2’ → L2, U2 − U2’ → U2
  – Classify U1 with h2 and let U1’ be the subset with the most confident scores; L1 + U1’ → L1, U1 − U1’ → U1

(A Python sketch of this loop follows.)
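A sketch of this variant under the same assumptions as the earlier co-training sketch (scikit-learn MultinomialNB, dense matrices, binary labels {0, 1}). Unlike Blum and Mitchell's algorithm there is no small pool U' and each view keeps its own labeled set; the per-round batch size k is an illustrative parameter, since the slide only says "the subset with the most confident scores".

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train_variant(X1_l, X2_l, y_l, X1_u, X2_u, k=5, rounds=30, seed=0):
    """Alternative loop: L and U are each split in two, and every round each
    classifier hands its k most confident examples to the other view's
    labeled set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X1_u))
    U1, U2 = list(idx[: len(idx) // 2]), list(idx[len(idx) // 2:])  # U -> U1, U2

    L1_X, L1_y = list(X1_l), list(y_l)      # L -> L1 (view-1 features)
    L2_X, L2_y = list(X2_l), list(y_l)      # L -> L2 (view-2 features)

    for _ in range(rounds):
        h1 = MultinomialNB().fit(np.array(L1_X), L1_y)   # trained on Feat Set 1
        h2 = MultinomialNB().fit(np.array(L2_X), L2_y)   # trained on Feat Set 2

        # Classify U2 with h1; move its most confident examples into L2.
        if U2:
            proba = h1.predict_proba(np.array([X1_u[i] for i in U2]))
            top = np.argsort(-proba.max(axis=1))[:k]
            for j in sorted(top, reverse=True):          # pop high indices first
                i = U2.pop(j)
                L2_X.append(X2_u[i])
                L2_y.append(int(h1.classes_[np.argmax(proba[j])]))

        # Classify U1 with h2; move its most confident examples into L1.
        if U1:
            proba = h2.predict_proba(np.array([X2_u[i] for i in U1]))
            top = np.argsort(-proba.max(axis=1))[:k]
            for j in sorted(top, reverse=True):
                i = U1.pop(j)
                L1_X.append(X1_u[i])
                L1_y.append(int(h2.classes_[np.argmax(proba[j])]))

    return h1, h2
```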

Yarowsky’s algorithm

• one-sense-per-discourse

View #1: the ID of the document that a word is in

• one-sense-per-collocation

View #2: local context of the word in the document

• Yarowsky’s algorithm is a special case of co-training (Blum & Mitchell, 1998)

• Is this correct? No, according to (Abney, 2002).

Summary of co-training

• The original paper: (Blum and Mitchell, 1998)
  – Two “independent” views: split the features into two sets.
  – Train a classifier on each view.
  – Each classifier labels data that can be used to train the other classifier.

• Extensions:
  – Relax the conditional independence assumption.
  – Instead of using two views, use two or more classifiers trained on the whole feature set.

Summary of SSL

• Goal: use both labeled and unlabeled data.

• Many algorithms: EM, co-EM, self-training, co-training, …

• Each algorithm is based on some assumptions.

• SSL works well when the assumptions are satisfied.

Additional slides

Rule independence

• H1 (H2) consists of rules that are functions of X1 (X2, resp) only.

• EM: the data is generated according to some simple known parametric model.
  – Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around the point