Course Review #2 and Project Parts 3-6
LING 572, Fei Xia
02/14/06
Outline
• Supervised learning
  – Learning algorithms
  – Resampling: bootstrap
  – System combination
• Semi-supervised learning
• Unsupervised learning
Supervised Learning
Machine learning problems
• Input x: a sentence, a set of attributes, …
• Input domain X: the set of all possible inputs
• Output y: a class, a real number, a tag, a tag sequence, a parse tree, a cluster
• Output domain Y: the set of all possible outputs
• Training data t: a set of (x, y) pairs; in supervised learning, y is known.
Machine learning problems (cont)
• Predictor f: a function from X to Y.
• Learner: a function from T to F
  – T: the set of all possible training data
  – F: the set of all possible predictors
• Types of ML problems:
  – Y is a finite set: classification
  – Y is R: regression
  – Y is of other types: parsing, clustering, …
The standard setting for binary classification problems
• Input x:
  – There is a finite set of attributes: a1, …, an
  – x is a vector: x = (x1, …, xn)
• Output y:
  – Binary-class: Y has only two members
  – Multi-class: Y has k members
Converting to the standard setting
• Multi-class → binary (e.g., for Boosting):
  – Train one classifier: (x, y) → ((x, 1), 0), …, ((x, y), 1), …, ((x, k), 0)
  – Train k classifiers, one for each class: for class j, (x, y) → (x, (y == j))
• Y is not a pre-defined finite set
  – Ex: POS tagging, parsing
  – Convert Y to a sequence of decisions.
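The "train k classifiers" reduction above can be sketched as follows; the toy multi-class data and the helper name are invented for illustration:

```python
# Sketch of the one-classifier-per-class reduction:
# for class j, each training pair (x, y) becomes (x, y == j).

def one_vs_rest_data(data, classes):
    """Turn multi-class (x, y) pairs into one binary dataset per class."""
    return {j: [(x, int(y == j)) for (x, y) in data] for j in classes}

# Made-up POS-style data: attribute vectors with class labels "N" / "V".
data = [((1, 0), "N"), ((0, 1), "V"), ((1, 1), "N")]
binary = one_vs_rest_data(data, {"N", "V"})
# binary["N"] labels the two "N" examples 1 and the "V" example 0.
```

Each of the k binary datasets is then handed to the standard-setting learner.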
Converting to the standard setting (cont)
• x is not a vector (x1, …, xn)
  – Define a set of input attributes: a1, …, an
  – Convert x to a vector
  – Ex: use boosting for POS tagging
Classification algorithms
• DT, DL, TBL, Boosting, MaxEnt
• Comparison:
  – Representation
  – Training: iterative approach
    • Feature selection
    • Weight setting
    • Data processing
  – Decoding
Representation
• DT: a tree
  – Each internal node is a test on an input attribute
• DL: an ordered list of rules (fi, vi)
  – Each fi is a test on one or more attributes.
• TBL: an ordered list of transformations (fi, vi → vi')
  – Each fi is a test on one or more attributes.
• Boosting: a list of weighted weak classifiers
  – Often a classifier tests one or more attributes.
• MaxEnt: a list of weighted features
  – A feature is a binary function: f(x, y) = 0 or 1
Training: “feature” selection
• DT: the test with max entropy reduction
  – Test: attr == val
• DL: the decision rule with max entropy reduction
  – Rule: if (attr1=val1 && … && attr_i=val_i) then y=c
• TBL: the transformation with max error reduction
  – Transformation: if (attr1=val1 && … && attr_i=val_i) then y=c1 → y=c2
  – Equivalently: if (attr1=val1 && … && attr_i=val_i && y=c1) then y=c2
• Boosting: the classifier chosen by the weak learner
  – Classifier: if (attr1=val1 && … && attr_i=val_i) then y=c1 else y=not(c1)
• MaxEnt: the features with max increase in the log-likelihood of the training data
  – Feature: if (attr1=val1 && … && attr_i=val_i && y=c) then 1 else 0
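For DT, "max entropy reduction" is the usual information gain of a split. A minimal sketch (the four-example dataset is invented; attribute 0 perfectly predicts the label, attribute 1 is useless):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) over a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr):
    """Entropy reduction from splitting (x, y) pairs on attribute index attr."""
    base = entropy([y for _, y in data])
    split = {}
    for x, y in data:
        split.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(data) * entropy(ys) for ys in split.values())
    return base - remainder

data = [((0, 1), "a"), ((0, 0), "a"), ((1, 1), "b"), ((1, 0), "b")]
# info_gain(data, 0) is 1.0 (perfect split); info_gain(data, 1) is 0.0.
```

The DT learner picks the attribute test with the largest gain at each node.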
Training: weight setting
• Boosting: weights that minimize the upper bound on the training error.
• MaxEnt: weights that maximize the entropy subject to the feature-expectation constraints (equivalently, maximize the likelihood of the training data).
Training: data processing
• DT: split data
• DL: split data (optional)
• TBL: apply transformations to reset cur_y
  – Original data: (x, y)
  – Used data: ((x, cur_y), y)
• Boosting: re-weight the examples (x, y)
• MaxEnt: none
Decoding for static problems: a single decision
• DT: find the unique path from the root to a leaf node in the decision tree
• DL: find the 1st rule that fires
• TBL: find the sequence of rules that fire
• Boosting: sum up the weighted decisions of the multiple classifiers:
  f(x) = sign(Σ_j α_j f_j(x))
• MaxEnt: find the y that maximizes p(y | x):
  f(x) = argmax_y p(y | x)
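The MaxEnt decoding rule f(x) = argmax_y p(y | x), with p(y | x) proportional to exp(Σ_i w_i f_i(x, y)), can be sketched as follows; the two binary feature functions and their weights are invented examples of the "if (condition && y=c) then 1 else 0" form from the feature-selection slide:

```python
import math

def maxent_decode(x, classes, features, weights):
    """f(x) = argmax_y p(y|x), with p(y|x) proportional to exp(sum_i w_i * f_i(x, y))."""
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in classes}
    z = sum(scores.values())  # normalizer; it cancels in the argmax
    return max(classes, key=lambda y: scores[y] / z)

# Hypothetical binary features: fire on one attribute value and one class.
features = [lambda x, y: int(x[0] == 1 and y == "N"),
            lambda x, y: int(x[1] == 1 and y == "V")]
weights = [2.0, 1.0]
# For x = (1, 1) both features fire, but the "N" feature carries more weight.
```

Dividing by z is redundant for the argmax but shows where p(y | x) comes from.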
Decoding for dynamic problems: a decision sequence
• TBL: it can handle dynamic problems directly.
• Beam search:
  – Decode from left to right.
  – A feature should not refer to future decisions.
  – Keep top-N at each position.
  – Easy to implement for MaxEnt.
  – Need to add weights (e.g., probabilities, costs, confidence scores) to DT, DL, TBL, and boosting.
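The beam-search decoding described above (left to right, keep the top-N partial sequences, features may only look at past decisions) can be sketched as follows; `score` stands in for any per-position weight, e.g. a MaxEnt log p(tag | word, previous tags), and the toy scorer is invented:

```python
def beam_search(words, tags, score, beam_size):
    """Left-to-right decoding, keeping the top-N tag sequences at each position.
    score(words, i, prev_tags, tag) returns a log-probability-like weight and
    may only inspect decisions to the left (prev_tags)."""
    beam = [((), 0.0)]  # (partial tag sequence, accumulated score)
    for i in range(len(words)):
        candidates = [(seq + (t,), s + score(words, i, seq, t))
                      for seq, s in beam for t in tags]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_size]  # prune to top-N
    return beam[0][0]  # best complete sequence

# Invented scorer: prefer "V" after "N", otherwise prefer "N".
def score(words, i, prev, tag):
    if prev and prev[-1] == "N":
        return 1.0 if tag == "V" else 0.0
    return 1.0 if tag == "N" else 0.0
```

With beam_size = 1 this degenerates to greedy decoding; larger beams keep alternative histories alive.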
Comparison of learners

|                 | DT         | DL                    | TBL                             | Boosting                     | MaxEnt                    |
|-----------------|------------|-----------------------|---------------------------------|------------------------------|---------------------------|
| Probabilistic   | SDT        | SDL                   | TBL-DT                          | confidence                   | Y                         |
| Parametric      | N          | N                     | N                               | N                            | Y                         |
| Representation  | tree       | ordered list of rules | ordered list of transformations | list of weighted classifiers | list of weighted features |
| Each iteration  | attribute  | rule                  | transformation                  | classifier & weight          | feature & weight          |
| Data processing | split data | split data*           | change cur_y                    | reweight (x, y)              | none                      |
| Decoding        | path       | 1st rule              | sequence of rules               | calc f(x)                    | calc f(x)                 |

* optional
Evaluation of learners
• Accuracy: F-measure, error rate, …
• Cost:
  – The types and amount of resources: tools and training data
  – The cost of errors
• Complexity:
  – Computational complexity of the algorithm (training time, decoding time)
  – Complexity of the model: # of parameters
• Stability
• Bias
Stability of a learner L
• Given two samples t1 and t2 from the same distribution D_{X×Y}, let f1 = L(t1) and f2 = L(t2). If L is stable, f1 and f2 should agree most of the time.

  agreement(f1, f2) = P_{D_X}(f1(x) = f2(x))
  stability(L) = E_{D_{X×Y}}[agreement(f1, f2)]
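Both quantities can be estimated empirically: draw pairs of training samples, run the learner on each, and measure how often the two predictors agree on held-out inputs. A sketch, with bootstrap resampling standing in for fresh draws from the distribution and an invented constant learner as the maximally stable case:

```python
import random

def agreement(f1, f2, xs):
    """Empirical P(f1(x) == f2(x)) over a sample of inputs xs."""
    return sum(f1(x) == f2(x) for x in xs) / len(xs)

def stability(learner, data, xs, trials=20, rng=None):
    """Empirical E[agreement(L(t1), L(t2))] over resampled training sets."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        t1 = rng.choices(data, k=len(data))  # sample with replacement
        t2 = rng.choices(data, k=len(data))
        total += agreement(learner(t1), learner(t2), xs)
    return total / trials

# A maximally stable "learner": it ignores t and returns a constant predictor.
constant_learner = lambda t: (lambda x: 0)
```

A learner that always outputs the same predictor scores stability 1.0; a learner that is sensitive to the sample scores lower.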
Bias
• Utgoff (1986):
  – Strong/weak bias: one that focuses the learner on a relatively small (resp. large) number of hypotheses.
  – Correct/incorrect bias: one that allows (resp. does not allow) the learner to select the target concept.
Bias (cont)
• Rendell (1986): based on the learner's behavior
  – Exclusive bias: the learner does not consider any of the candidates in a class.
  – Preferential bias: the learner prefers one class of concepts over another class.
• Others: based on the learner's design
  – Representational bias: certain concepts cannot be considered because they cannot be expressed.
  – Procedural bias:
    • Ex: pruning in C4.5 is a procedural bias that results in a preference for smaller DTs.
Resampling
Bagging
• Sample with replacement from the training data to create B bags.
• Run the learner (ML) on each bag to produce predictors f1, f2, …, fB.
• Combine f1, …, fB into a single predictor f.
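The bagging diagram can be sketched as: draw B bootstrap bags by sampling with replacement, train one predictor per bag, and combine them by majority vote. The one-rule learner below is a hypothetical stand-in for a real ML component:

```python
import random
from collections import Counter

def bagging(learner, data, B, rng=None):
    """Train B predictors f1..fB on bootstrap bags; f majority-votes over them."""
    rng = rng or random.Random(0)
    predictors = [learner(rng.choices(data, k=len(data))) for _ in range(B)]
    def f(x):
        votes = Counter(p(x) for p in predictors)
        return votes.most_common(1)[0][0]
    return f

# Hypothetical learner: predict the majority label of its bag, for any x.
def majority_learner(bag):
    label = Counter(y for _, y in bag).most_common(1)[0][0]
    return lambda x: label
```

Majority voting is one choice of combiner; the system-combination methods below apply here as well.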
System combination
• Combine the outputs f1, f2, …, fB of several systems into a single predictor f.
• This can be seen as a special kind of ML problem, so we can use any learner.
Methods
• As an ML problem:
  – Input: attribute vector (f1(x), …, fn(x))
  – The goal: f(x)
• Strategies:
  – Switching: for x, f(x) is equal to some fi(x)
  – Hybridization: create a new value.
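The switching strategy, in its simplest untrained form, picks the output proposed by the most component systems (majority vote). A sketch with three invented component taggers:

```python
from collections import Counter

def switch_by_vote(systems, x):
    """Switching: f(x) is the fi(x) proposed by the most systems."""
    outputs = [f(x) for f in systems]
    return Counter(outputs).most_common(1)[0][0]

# Three hypothetical taggers that disagree on some inputs.
f1 = lambda x: "N"
f2 = lambda x: "N" if x == "time" else "V"
f3 = lambda x: "V"
```

A trained combiner instead learns f from the attribute vectors (f1(x), …, fn(x)) paired with gold labels, as in Task 1 below.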
Project Part 3
Tasks
• Understand the algorithm
• Run the tagger on four sets of training data.

|                | 1K | 5K | 10K | 40K |
|----------------|----|----|-----|-----|
| Accuracy       |    |    |     |     |
| Training time  |    |    |     |     |
| # of features  |    |    |     |     |
The MaxEnt core
• What is the format of the training data?
• What is the format of the test data?
• How does GIS work?
• How does L-BFGS work?
• What is Gaussian prior smoothing? And how is it calculated?
• How are events and features represented internally?
• During the decoding stage, how does the code find the top-N classes for a new instance?
The MaxEnt tagger: features
• Where are feature templates defined?
• List the feature templates used by the tagger.
• If you want to add a new feature template, what do you need to do? Which piece of code do you need to modify?
• Given the feature templates, how are (instantiated) features selected and filtered?
The MaxEnt tagger: trainer
• What's the format of the training sentences?
• How does the trainer convert a training sentence into a list of events?
• How does the trainer treat rare words? What additional features do rare words produce?
• How many files are created by the trainer in each experiment? How are they created? And what are their usages?
The MaxEnt tagger: decoder
• What’s the format of the test data?
• How are unknown words handled by the decoder?
• Which function performs the beam search? (Just provide the function name and file name.)
Project Part 4
Task 1: System combination
• Try three methods.
• The methods can come from existing work (e.g., (Henderson and Brill, 1999)), or be totally new.
• At least one of them is trained:
  – Create training data:
    • Split S into (S1, S2)
    • Train each of the three POS taggers using S1
    • Tag instances in S2 (sys1, sys2, sys3, gold)
  – Train the combiner with the training data
Task 1 (cont)

|         | 1K  | 5K | 10K | 40K |
|---------|-----|----|-----|-----|
| Trigram | a/b | …  |     |     |
| TBL     | a/b | …  |     |     |
| MaxEnt  | …   |    |     |     |
| Comb1   | …   |    |     |     |
| Comb2   | …   |    |     |     |
| Comb3   | …   |    |     |     |

a: tagging result with the whole training data
b: tagging result with part of the training data
Task 2: bagging
• B = 10: use 10 bags
• Training data: 1K, 5K, and 10K; 40K is optional.
• One combination method.
Task 2 (cont)

|         | 1K    | 5K | 10K | 40K (optional) |
|---------|-------|----|-----|----------------|
| Trigram | a/b/c | …  |     |                |
| TBL     | …     |    |     |                |
| MaxEnt  | …     |    |     |                |
| Comb1   | a/b/c | …  |     |                |

a: no bagging
b: one bag
c: 10 bags
Task 3: boosting
• Software: boostexter
• Main tasks:
  – Handling unknown words
  – Format conversion: pay attention to special characters, e.g., "," in "2,300"
  – Feature templates
  – Choosing the number of rounds: N
  – Train and decode
Task 3 (cont)

|                | 1K  | 5K | 10K | 40K (optional) |
|----------------|-----|----|-----|----------------|
| Iteration num1 | a/b | …  |     |                |
| Iteration num2 | …   |    |     |                |
| …              | …   |    |     |                |
| Iteration num5 | …   |    |     |                |

a: true tags for neighboring words
b: most frequent tags for neighboring words
Task 4: semi-supervised learning
• Select one or more taggers.
• Choose the SSL method: self-training, co-training, or something else.
• Decide on strategies for adding data.
• Show the results with and without unlabeled data.
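One round of the self-training option can be sketched as: train on the labeled data, tag the unlabeled data, and add the most confident outputs back as new labeled examples. The `train` function, its (tags, confidence) output format, and the threshold are all hypothetical stand-ins for a real tagger:

```python
def self_training_round(train, labeled, unlabeled, threshold):
    """One self-training iteration: tag the unlabeled sentences, keep the
    confident ones as new labeled data, and retrain.
    train(labeled) returns a tagger mapping a sentence to (tags, confidence)."""
    tagger = train(labeled)
    added = []
    for sent in unlabeled:
        tags, conf = tagger(sent)
        if conf >= threshold:  # strategy for adding data: confidence cutoff
            added.append((sent, tags))
    return train(labeled + added), added

# Toy stand-in: the "tagger" emits all-"N" tags and is confident only on
# short sentences, so only those get added.
def train(labeled):
    return lambda sent: (["N"] * len(sent), 1.0 if len(sent) < 3 else 0.5)
```

Co-training follows the same loop but lets two taggers with different views label data for each other.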
Task 4 (cont)

|                    | 1K labeled data | 5K labeled data |
|--------------------|-----------------|-----------------|
| No unlabeled data  | a/b             | …               |
| 15K unlabeled data | …               |                 |
| 25K unlabeled data |                 |                 |
| 35K unlabeled data |                 |                 |

a: tagging accuracy
b: the number of sentences added to the labeled data
Project Parts 5-6
Part 5: Presentation
• Presentation: 10 minutes + Q&A
• Email me the slides by 6am on 3/9 and bring a copy to class.
• Focus:
  – Tagging results: tables, figures
  – How TBL and MaxEnt work
  – Project Part 4
Part 6: Final report
• Email me the file by 6am on 3/14.
• It should include the major results and observations from Project Parts 1-5.
• Thoughts about ML algorithms
• Thoughts about the course, project, etc.
Due date
• 6am on 3/7/06: Parts 3-4
  – ESubmit the following:
    • code for Part 4
    • reports for Parts 3 and 4
  – Bring a hardcopy of the report to class.
• 6am on 3/9/06: Part 5
  – Email me your presentation slides.
  – Bring a hardcopy of your slides to class (4 slides per page).
• 6am on 3/14/06: Part 6
  – Email me the final report.