Course Review #2 and Project Parts 3-6
LING 572, Fei Xia
02/14/06
Outline
• Supervised learning
  – Learning algorithms
  – Resampling: bootstrap
  – System combination
• Semi-supervised learning
• Unsupervised learning
Supervised Learning
Machine learning problems
• Input x: a sentence, a set of attributes, …
• Input domain X: the set of all possible inputs
• Output y: a class, a real number, a tag, a tag sequence, a parse tree, a cluster
• Output domain Y: the set of all possible outputs
• Training data t: a set of (x, y) pairs; in supervised learning, y is known.
Machine learning problems (cont)
• Predictor f: a function from X to Y.
• Learner: a function from T to F
  – T: the set of all possible training data
  – F: the set of all possible predictors
• Types of ML problems:
  – Y is a finite set: classification
  – Y is R: regression
  – Y is of other types: parsing, clustering, …
The standard setting for binary classification problems
• Input x:
  – There is a finite set of attributes: a1, …, an
  – x is a vector: x = (x1, …, xn)
• Output y:
  – Binary-class: Y has only two members
  – Multi-class: Y has k members
Converting to the standard setting
• Multi-class → binary (e.g., for Boosting):
  – Train one classifier: (x, y) → ((x, 1), 0), …, ((x, y), 1), …, ((x, k), 0)
  – Train k classifiers, one for each class: for class j, (x, y) → (x, (y == j))
• Y is not a pre-defined finite set
  – Ex: POS tagging, parsing
  – Convert Y to a sequence of decisions.
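The "train k classifiers" reduction above can be sketched as follows; the toy multi-class data and the helper name are invented for illustration:

```python
# Sketch of the one-classifier-per-class reduction:
# for class j, each training pair (x, y) becomes (x, y == j).

def one_vs_rest_data(data, classes):
    """Turn multi-class (x, y) pairs into one binary dataset per class."""
    return {j: [(x, int(y == j)) for (x, y) in data] for j in classes}

# Made-up POS-style data: attribute vectors with class labels "N" / "V".
data = [((1, 0), "N"), ((0, 1), "V"), ((1, 1), "N")]
binary = one_vs_rest_data(data, {"N", "V"})
# binary["N"] labels the two "N" examples 1 and the "V" example 0.
```

Each of the k binary datasets is then handed to the standard-setting learner.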
Converting to the standard setting (cont)
• x is not a vector (x1, …, xn)
  – Define a set of input attributes: a1, …, an
  – Convert x to a vector
  – Ex: use boosting for POS tagging
Classification algorithms
• DT, DL, TBL, Boosting, MaxEnt
• Comparison:
  – Representation
  – Training: iterative approach
    • Feature selection
    • Weight setting
    • Data processing
  – Decoding
Representation
• DT: a tree
  – Each internal node is a test on an input attribute
• DL: an ordered list of rules (fi, vi)
  – Each fi is a test on one or more attributes.
• TBL: an ordered list of transformations (fi, vi → vi')
  – Each fi is a test on one or more attributes.
• Boosting: a list of weighted weak classifiers
  – Often a classifier tests one or more attributes.
• MaxEnt: a list of weighted features
  – A feature is a binary function: f(x, y) = 0 or 1
Training: “feature” selection
• DT: the test with max entropy reduction
  – Test: attr == val
• DL: the decision rule with max entropy reduction
  – Rule: if (attr1=val1 && … && attr_i=val_i) then y=c
• TBL: the transformation with max error reduction
  – Transformation: if (attr1=val1 && … && attr_i=val_i) then y=c1 → y=c2
  – Equivalently: if (attr1=val1 && … && attr_i=val_i && y=c1) then y=c2
• Boosting: the classifier chosen by the weak learner
  – Classifier: if (attr1=val1 && … && attr_i=val_i) then y=c1 else y=not(c1)
• MaxEnt: the features with max increase in the log-likelihood of the training data
  – Feature: if (attr1=val1 && … && attr_i=val_i && y=c) then 1 else 0
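For DT, "max entropy reduction" is the usual information gain of a split. A minimal sketch (the four-example dataset is invented; attribute 0 perfectly predicts the label, attribute 1 is useless):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) over a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attr):
    """Entropy reduction from splitting (x, y) pairs on attribute index attr."""
    base = entropy([y for _, y in data])
    split = {}
    for x, y in data:
        split.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(data) * entropy(ys) for ys in split.values())
    return base - remainder

data = [((0, 1), "a"), ((0, 0), "a"), ((1, 1), "b"), ((1, 0), "b")]
# info_gain(data, 0) is 1.0 (perfect split); info_gain(data, 1) is 0.0.
```

The DT learner picks the attribute test with the largest gain at each node.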
Training: weight setting
• Boosting: weights that minimize the upper bound on the training error.
• MaxEnt: weights that maximize the entropy subject to the feature-expectation constraints (equivalently, maximize the likelihood of the training data).
Training: data processing
• DT: split data
• DL: split data (optional)
• TBL: apply transformations to reset cur_y
  – Original data: (x, y)
  – Used data: ((x, cur_y), y)
• Boosting: re-weight the examples (x, y)
• MaxEnt: none
Decoding for static problems: a single decision
• DT: find the unique path from the root to a leaf node in the decision tree
• DL: find the 1st rule that fires
• TBL: find the sequence of rules that fire
• Boosting: sum up the weighted decisions of the multiple classifiers:
  f(x) = sign(Σ_j α_j f_j(x))
• MaxEnt: find the y that maximizes p(y | x):
  f(x) = argmax_y p(y | x)
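The MaxEnt decoding rule f(x) = argmax_y p(y | x), with p(y | x) proportional to exp(Σ_i w_i f_i(x, y)), can be sketched as follows; the two binary feature functions and their weights are invented examples of the "if (condition && y=c) then 1 else 0" form from the feature-selection slide:

```python
import math

def maxent_decode(x, classes, features, weights):
    """f(x) = argmax_y p(y|x), with p(y|x) proportional to exp(sum_i w_i * f_i(x, y))."""
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in classes}
    z = sum(scores.values())  # normalizer; it cancels in the argmax
    return max(classes, key=lambda y: scores[y] / z)

# Hypothetical binary features: fire on one attribute value and one class.
features = [lambda x, y: int(x[0] == 1 and y == "N"),
            lambda x, y: int(x[1] == 1 and y == "V")]
weights = [2.0, 1.0]
# For x = (1, 1) both features fire, but the "N" feature carries more weight.
```

Dividing by z is redundant for the argmax but shows where p(y | x) comes from.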
Decoding for dynamic problems: a decision sequence
• TBL: it can handle dynamic problems directly.
• Beam search:
  – Decode from left to right.
  – A feature should not refer to future decisions.
  – Keep top-N at each position.
  – Easy to implement for MaxEnt.
  – Need to add weights (e.g., probabilities, costs, confidence scores) to DT, DL, TBL, and boosting.
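The beam-search decoding described above (left to right, keep the top-N partial sequences, features may only look at past decisions) can be sketched as follows; `score` stands in for any per-position weight, e.g. a MaxEnt log p(tag | word, previous tags), and the toy scorer is invented:

```python
def beam_search(words, tags, score, beam_size):
    """Left-to-right decoding, keeping the top-N tag sequences at each position.
    score(words, i, prev_tags, tag) returns a log-probability-like weight and
    may only inspect decisions to the left (prev_tags)."""
    beam = [((), 0.0)]  # (partial tag sequence, accumulated score)
    for i in range(len(words)):
        candidates = [(seq + (t,), s + score(words, i, seq, t))
                      for seq, s in beam for t in tags]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_size]  # prune to top-N
    return beam[0][0]  # best complete sequence

# Invented scorer: prefer "V" after "N", otherwise prefer "N".
def score(words, i, prev, tag):
    if prev and prev[-1] == "N":
        return 1.0 if tag == "V" else 0.0
    return 1.0 if tag == "N" else 0.0
```

With beam_size = 1 this degenerates to greedy decoding; larger beams keep alternative histories alive.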
Comparison of learners

|                 | DT         | DL                    | TBL                             | Boosting                     | MaxEnt                    |
|-----------------|------------|-----------------------|---------------------------------|------------------------------|---------------------------|
| Probabilistic   | SDT        | SDL                   | TBL-DT                          | confidence                   | Y                         |
| Parametric      | N          | N                     | N                               | N                            | Y                         |
| Representation  | tree       | ordered list of rules | ordered list of transformations | list of weighted classifiers | list of weighted features |
| Each iteration  | attribute  | rule                  | transformation                  | classifier & weight          | feature & weight          |
| Data processing | split data | split data*           | change cur_y                    | reweight (x, y)              | none                      |
| Decoding        | path       | 1st rule              | sequence of rules               | calc f(x)                    | calc f(x)                 |

* optional
Evaluation of learners
• Accuracy: F-measure, error rate, …
• Cost:
  – The types and amount of resources: tools and training data
  – The cost of errors
• Complexity:
  – Computational complexity of the algorithm (training time, decoding time)
  – Complexity of the model: # of parameters
• Stability
• Bias
Stability of a learner L
• Given two samples t1 and t2 from the same distribution D_{X×Y}, let f1 = L(t1) and f2 = L(t2). If L is stable, f1 and f2 should agree most of the time.

  agreement(f1, f2) = P_{D_X}(f1(x) = f2(x))
  stability(L) = E_{D_{X×Y}}[agreement(f1, f2)]
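Both quantities can be estimated empirically: draw pairs of training samples, run the learner on each, and measure how often the two predictors agree on held-out inputs. A sketch, with bootstrap resampling standing in for fresh draws from the distribution and an invented constant learner as the maximally stable case:

```python
import random

def agreement(f1, f2, xs):
    """Empirical P(f1(x) == f2(x)) over a sample of inputs xs."""
    return sum(f1(x) == f2(x) for x in xs) / len(xs)

def stability(learner, data, xs, trials=20, rng=None):
    """Empirical E[agreement(L(t1), L(t2))] over resampled training sets."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        t1 = rng.choices(data, k=len(data))  # sample with replacement
        t2 = rng.choices(data, k=len(data))
        total += agreement(learner(t1), learner(t2), xs)
    return total / trials

# A maximally stable "learner": it ignores t and returns a constant predictor.
constant_learner = lambda t: (lambda x: 0)
```

A learner that always outputs the same predictor scores stability 1.0; a learner that is sensitive to the sample scores lower.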
Bias
• Utgoff (1986):
  – Strong/weak bias: one that focuses the learner on a relatively small (resp. large) number of hypotheses.
  – Correct/incorrect bias: one that allows (resp. does not allow) the learner to select the target concept.
Bias (cont)
• Rendell (1986): based on the learner's behavior
  – Exclusive bias: the learner does not consider any of the candidates in a class.
  – Preferential bias: the learner prefers one class of concepts over another class.
• Others: based on the learner's design
  – Representational bias: certain concepts cannot be considered because they cannot be expressed.
  – Procedural bias:
    • Ex: pruning in C4.5 is a procedural bias that results in a preference for smaller DTs.
Resampling
Bagging
• Sample with replacement from the training data to create B bags.
• Run the learner (ML) on each bag to produce predictors f1, f2, …, fB.
• Combine f1, …, fB into a single predictor f.
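The bagging diagram can be sketched as: draw B bootstrap bags by sampling with replacement, train one predictor per bag, and combine them by majority vote. The one-rule learner below is a hypothetical stand-in for a real ML component:

```python
import random
from collections import Counter

def bagging(learner, data, B, rng=None):
    """Train B predictors f1..fB on bootstrap bags; f majority-votes over them."""
    rng = rng or random.Random(0)
    predictors = [learner(rng.choices(data, k=len(data))) for _ in range(B)]
    def f(x):
        votes = Counter(p(x) for p in predictors)
        return votes.most_common(1)[0][0]
    return f

# Hypothetical learner: predict the majority label of its bag, for any x.
def majority_learner(bag):
    label = Counter(y for _, y in bag).most_common(1)[0][0]
    return lambda x: label
```

Majority voting is one choice of combiner; the system-combination methods below apply here as well.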
System combination
• Combine the outputs f1, f2, …, fB of several systems into a single predictor f.
• This can be seen as a special kind of ML problem, so we can use any learner.
Methods
• As an ML problem:
  – Input: attribute vector (f1(x), …, fn(x))
  – The goal: f(x)
• Strategies:
  – Switching: for x, f(x) is equal to some fi(x)
  – Hybridization: create a new value.
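The switching strategy, in its simplest untrained form, picks the output proposed by the most component systems (majority vote). A sketch with three invented component taggers:

```python
from collections import Counter

def switch_by_vote(systems, x):
    """Switching: f(x) is the fi(x) proposed by the most systems."""
    outputs = [f(x) for f in systems]
    return Counter(outputs).most_common(1)[0][0]

# Three hypothetical taggers that disagree on some inputs.
f1 = lambda x: "N"
f2 = lambda x: "N" if x == "time" else "V"
f3 = lambda x: "V"
```

A trained combiner instead learns f from the attribute vectors (f1(x), …, fn(x)) paired with gold labels, as in Task 1 below.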
Project Part 3
Tasks
• Understand the algorithm
• Run the tagger on four sets of training data.

|                | 1K | 5K | 10K | 40K |
|----------------|----|----|-----|-----|
| Accuracy       |    |    |     |     |
| Training time  |    |    |     |     |
| # of features  |    |    |     |     |
The MaxEnt core
• What is the format of the training data?
• What is the format of the test data?
• How does GIS work?
• How does L-BFGS work?
• What is Gaussian prior smoothing? And how is it calculated?
• How are events and features represented internally?
• During the decoding stage, how does the code find the top-N classes for a new instance?
The MaxEnt tagger: features
• Where are feature templates defined?
• List the feature templates used by the tagger.
• If you want to add a new feature template, what do you need to do? Which piece of code do you need to modify?
• Given the feature templates, how are (instantiated) features selected and filtered?
The MaxEnt tagger: trainer
• What's the format of the training sentences?
• How does the trainer convert a training sentence into a list of events?
• How does the trainer treat rare words? What additional features do rare words produce?
• How many files are created by the trainer in each experiment? How are they created? And what are their usages?
The MaxEnt tagger: decoder
• What’s the format of the test data?
• How are unknown words handled by the decoder?
• Which function performs the beam search? (Just provide the function name and file name.)
Project Part 4
Task 1: System combination
• Try three methods.
• The methods can come from existing work (e.g., (Henderson and Brill, 1999)), or be totally new.
• At least one of them is trained:
  – Create training data:
    • Split S into (S1, S2)
    • Train each of the three POS taggers using S1
    • Tag instances in S2 (sys1, sys2, sys3, gold)
  – Train the combiner with the training data
Task 1 (cont)

|         | 1K  | 5K | 10K | 40K |
|---------|-----|----|-----|-----|
| Trigram | a/b | …  |     |     |
| TBL     | a/b | …  |     |     |
| MaxEnt  | …   |    |     |     |
| Comb1   | …   |    |     |     |
| Comb2   | …   |    |     |     |
| Comb3   | …   |    |     |     |

a: tagging result with the whole training data
b: tagging result with part of the training data
Task 2: bagging
• B = 10: use 10 bags
• Training data: 1K, 5K, and 10K; 40K is optional.
• One combination method.
Task 2 (cont)

|         | 1K    | 5K | 10K | 40K (optional) |
|---------|-------|----|-----|----------------|
| Trigram | a/b/c | …  |     |                |
| TBL     | …     |    |     |                |
| MaxEnt  | …     |    |     |                |
| Comb1   | a/b/c | …  |     |                |

a: no bagging
b: one bag
c: 10 bags
Task 3: boosting
• Software: boostexter
• Main tasks:
  – Handling unknown words
  – Format conversion: pay attention to special characters, e.g., "," in "2,300"
  – Feature templates
  – Choosing the number of rounds: N
  – Train and decode
Task 3 (cont)

|                | 1K  | 5K | 10K | 40K (optional) |
|----------------|-----|----|-----|----------------|
| Iteration num1 | a/b | …  |     |                |
| Iteration num2 | …   |    |     |                |
| …              | …   |    |     |                |
| Iteration num5 | …   |    |     |                |

a: true tags for neighboring words
b: most frequent tags for neighboring words
Task 4: semi-supervised learning
• Select one or more taggers.
• Choose the SSL method: self-training, co-training, or something else.
• Decide on strategies for adding data.
• Show the results with and without unlabeled data.
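One round of the self-training option can be sketched as: train on the labeled data, tag the unlabeled data, and add the most confident outputs back as new labeled examples. The `train` function, its (tags, confidence) output format, and the threshold are all hypothetical stand-ins for a real tagger:

```python
def self_training_round(train, labeled, unlabeled, threshold):
    """One self-training iteration: tag the unlabeled sentences, keep the
    confident ones as new labeled data, and retrain.
    train(labeled) returns a tagger mapping a sentence to (tags, confidence)."""
    tagger = train(labeled)
    added = []
    for sent in unlabeled:
        tags, conf = tagger(sent)
        if conf >= threshold:  # strategy for adding data: confidence cutoff
            added.append((sent, tags))
    return train(labeled + added), added

# Toy stand-in: the "tagger" emits all-"N" tags and is confident only on
# short sentences, so only those get added.
def train(labeled):
    return lambda sent: (["N"] * len(sent), 1.0 if len(sent) < 3 else 0.5)
```

Co-training follows the same loop but lets two taggers with different views label data for each other.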
Task 4 (cont)

|                    | 1K labeled data | 5K labeled data |
|--------------------|-----------------|-----------------|
| No unlabeled data  | a/b             | …               |
| 15K unlabeled data | …               |                 |
| 25K unlabeled data |                 |                 |
| 35K unlabeled data |                 |                 |

a: tagging accuracy
b: the number of sentences added to the labeled data
Project Parts 5-6
Part 5: Presentation
• Presentation: 10 minutes + Q&A
• Email me the slides by 6am on 3/9 and bring a copy to class.
• Focus:
  – Tagging results: tables, figures
  – How TBL and MaxEnt work
  – Project Part 4
Part 6: Final report
• Email me the file by 6am on 3/14.
• It should include the major results and observations from Project Parts 1-5.
• Thoughts about ML algorithms
• Thoughts about the course, project, etc.
Due date
• 6am on 3/7/06: Parts 3-4
  – ESubmit the following:
    • code for Part 4
    • reports for Parts 3 and 4
  – Bring a hardcopy of the report to class.
• 6am on 3/9/06: Part 5
  – Email me your presentation slides.
  – Bring a hardcopy of your slides to class (4 slides per page).
• 6am on 3/14/06: Part 6
  – Email me the final report.