The classification problem (recap from LING570). LING 572, Fei Xia, Dan Jinguji. Week 1: 1/10/08
Outline
• Probability theory
• The classification task
=> Both were covered in LING570 and are therefore part of the prerequisites.
Three types of probability
• Joint prob: P(x,y)= prob of x and y happening together
• Conditional prob: P(x|y) = prob of x given a specific value of y
• Marginal prob: P(x) = prob of x, obtained by summing P(x,y) over all possible values of y
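The three kinds of probability above can all be read off one joint table of counts. A minimal sketch in Python, assuming an invented toy corpus of (x, y) pairs (none of the names below come from the slides):

```python
from collections import Counter

# Toy corpus of (x, y) pairs; counts stand in for a probability table.
pairs = [("rain", "cold"), ("rain", "cold"), ("sun", "warm"),
         ("sun", "cold"), ("sun", "warm"), ("rain", "warm")]

joint = Counter(pairs)   # counts of each (x, y) pair
n = len(pairs)

def p_joint(x, y):       # P(x, y): x and y happening together
    return joint[(x, y)] / n

def p_marginal_x(x):     # P(x): sum P(x, y) over all values of y
    return sum(c for (xi, _), c in joint.items() if xi == x) / n

def p_cond(x, y):        # P(x | y) = P(x, y) / P(y)
    p_y = sum(c for (_, yi), c in joint.items() if yi == y) / n
    return p_joint(x, y) / p_y

print(p_joint("rain", "cold"))    # 2/6
print(p_marginal_x("rain"))       # 3/6
print(p_cond("rain", "cold"))     # (2/6) / (3/6) = 2/3
```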
Common tricks (III): Bayes rule

P(B|A) = P(A,B) / P(A) = P(A|B) P(B) / P(A)

y* = argmax_y P(y|x) = argmax_y P(x|y) P(y) / P(x) = argmax_y P(x|y) P(y)
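The argmax trick on this slide drops P(x) because it is the same for every candidate y. A minimal sketch, with invented priors P(y) and likelihoods P(x|y):

```python
# Pick y* = argmax_y P(x|y) P(y); the denominator P(x) is constant in y,
# so it can be ignored. All probability values here are illustrative.
priors = {"c1": 0.7, "c2": 0.3}                       # P(y)
likelihoods = {("w", "c1"): 0.1, ("w", "c2"): 0.8}    # P(x|y)

def classify(x):
    # max over classes of the unnormalized posterior P(x|y) * P(y)
    return max(priors, key=lambda y: likelihoods[(x, y)] * priors[y])

print(classify("w"))  # c2, since 0.8 * 0.3 = 0.24 > 0.1 * 0.7 = 0.07
```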
Common tricks (IV):Independence assumption
)|(
),...|(),...,(
11
111
1
ii
i
ii
in
AAP
AAAPAAP
8
A and B are conditionally independent given C:
P(A|B,C) = P(A|C)
P(A,B|C) = P(A|C) P(B|C)
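The chain rule plus the independence assumption is exactly how a bigram model scores a sequence: each item depends only on its predecessor. A minimal sketch, with made-up bigram probabilities (the words and values are invented):

```python
# P(A1,...,An) = prod_i P(Ai | A1..Ai-1) is approximated by
# prod_i P(Ai | Ai-1) under the independence (Markov) assumption.
bigram = {("<s>", "the"): 0.4, ("the", "cat"): 0.1, ("cat", "sat"): 0.3}

def p_seq(words):
    """Approximate joint probability of a sequence under the bigram assumption."""
    prev = "<s>"   # sentence-start symbol
    p = 1.0
    for w in words:
        p *= bigram[(prev, w)]   # P(w | prev) only, not the full history
        prev = w
    return p

print(p_seq(["the", "cat", "sat"]))  # 0.4 * 0.1 * 0.3 = 0.012
```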
Definition of classification problem
• Task:
  – C = {c1, c2, ..., cm} is a finite set of pre-defined classes (a.k.a. labels or categories).
  – Given an input x, decide on its category y.
• Multi-label vs. single-label problem
  – Single-label: each x is assigned exactly one class.
  – Multi-label: an x can have multiple labels.
• Multi-class vs. binary classification problem
  – Binary: |C| = 2.
  – Multi-class: |C| > 2.
Conversion to single-label binary problem
• Multi-label → single-label
  – If labels are unrelated, we can convert a multi-label problem into |C| binary problems: does x have label c1? Does it have label c2? ... Does it have label cm?
• Multi-class → binary problem
  – We can convert a multi-class problem into several binary problems. We will discuss this in Week #6.
=> We will focus on the single-label binary classification problem.
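The multi-label → binary conversion above can be sketched directly: one binary training set per class, asking "does x have label ci?". The instances and labels below are invented for illustration:

```python
# Turn a multi-label problem into |C| binary problems (one per class).
C = ["c1", "c2", "c3"]
data = [("x1", {"c1", "c3"}), ("x2", {"c2"})]   # multi-label instances

def binarize(data, c):
    """Binary training set for class c: target is +1 if c is among
    the instance's labels, -1 otherwise."""
    return [(x, +1 if c in labels else -1) for x, labels in data]

for c in C:
    print(c, binarize(data, c))
```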
Examples of classification tasks
• Text classification
• Document filtering
• Language/Author/Speaker id
• WSD (word sense disambiguation)
• PP (prepositional phrase) attachment
• Automatic essay grading
• …
Sequence labeling tasks
• Tokenization / word segmentation
• POS tagging
• NE detection
• NP chunking
• Parsing
• Reference resolution
• …
We can use classification algorithms + beam search
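The "classification + beam search" idea: at each position, a classifier scores every tag given the previous decision, and only the top-k partial sequences are kept. A minimal sketch, where score() is an invented stand-in for a real classifier:

```python
# Beam search over tag sequences using a local classifier score.
def score(prev_tag, word, tag):
    """Toy local score for P(tag | prev_tag, word); a trained
    classifier would go here. Values are invented."""
    return 0.9 if (word, tag) in {("dog", "N"), ("runs", "V")} else 0.1

def beam_search(words, tags, k=2):
    beam = [([], 1.0)]   # list of (partial tag sequence, score)
    for w in words:
        # Expand every hypothesis with every tag, then prune to top k.
        expanded = [(seq + [t], p * score(seq[-1] if seq else "<s>", w, t))
                    for seq, p in beam for t in tags]
        beam = sorted(expanded, key=lambda sp: -sp[1])[:k]
    return beam[0][0]    # best full sequence

print(beam_search(["dog", "runs"], ["N", "V"]))  # ['N', 'V']
```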
Steps for solving a classification problem
• Split data into training/test/validation
• Data preparation
• Training
• Decoding
• Postprocessing
• Evaluation
The three main steps
• Data preparation: represent the data as feature vectors.
• Training: a trainer takes the training data as input and outputs a classifier.
• Decoding: a decoder takes a classifier and test data as input and outputs classification results.
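The trainer/decoder interface above can be sketched with a placeholder learner. A minimal sketch, assuming a toy majority-class "trainer" (not one of the course's algorithms):

```python
from collections import Counter

def train(labeled_data):
    """Trainer: training data in, classifier out. Here the 'classifier'
    just predicts the majority class seen in training."""
    majority = Counter(y for _, y in labeled_data).most_common(1)[0][0]
    return lambda x: majority

def decode(classifier, test_xs):
    """Decoder: apply the classifier to each test instance."""
    return [classifier(x) for x in test_xs]

clf = train([("d1", "c1"), ("d2", "c1"), ("d3", "c2")])
print(decode(clf, ["d4", "d5"]))  # ['c1', 'c1']
```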
Data
• An instance: (x, y)
• Labeled data: y is known
• Unlabeled data: y is unknown
• Training/test data: a set of instances.
Data preparation: creating attribute-value table
        f1    f2   …    fK      Target
  d1    yes   1    no   -1000   c2
  d2
  d3
  …
  dn
Attribute-value table
• Each row corresponds to an instance.
• Each column corresponds to a feature.
• A feature type (a.k.a. a feature template): w-1
• A feature: w-1=book
• Binary feature vs. non-binary feature
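Instantiating feature types like w-1 into concrete features such as w-1=book, and then into rows of the attribute-value table, can be sketched as follows. The instances and vocabulary are invented; only the w-1 naming comes from the slide:

```python
# Build binary feature vectors from (previous word, current word) instances.
def features(prev_word, cur_word):
    """One instance as a dict of feature=value pairs (binary features).
    "w-1=..." instantiates the feature type w-1."""
    return {f"w-1={prev_word}": 1, f"w0={cur_word}": 1}

# Collect a fixed feature inventory (the table's columns) from the data.
feat_names = sorted({f for inst in [("the", "book"), ("a", "cat")]
                     for f in features(*inst)})

def to_vector(inst):
    """Fixed-length row of the attribute-value table for one instance."""
    fs = features(*inst)
    return [fs.get(name, 0) for name in feat_names]

print(feat_names)
print(to_vector(("the", "book")))
```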
The training stage
• Three types of learning
  – Supervised learning: the training data is labeled.
  – Unsupervised learning: the training data is unlabeled.
  – Semi-supervised learning: the training data consists of both.
• We will focus on supervised learning in LING572.
The decoding stage
• A classifier is a function f: f(x) = {(ci, scorei)}.
• Given the test data, a classifier “fills out” a decision matrix:

        d1    d2    d3   …
  c1    0.1   0.4   0    …
  c2    0.9   0.1   0    …
  c3
  …
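Filling out the decision matrix just means calling f(x) = {(ci, scorei)} on every test instance. A sketch with a hand-built scorer whose invented numbers mirror the example matrix:

```python
# Fill a decision matrix (classes x test instances) from a classifier.
def classify(x):
    """Toy classifier: return a score for every class. The scores are
    invented placeholders, not from any trained model."""
    return {"c1": 0.1, "c2": 0.9} if x == "d1" else {"c1": 0.4, "c2": 0.1}

test = ["d1", "d2"]
matrix = {c: [classify(d)[c] for d in test] for c in ["c1", "c2"]}
print(matrix)  # {'c1': [0.1, 0.4], 'c2': [0.9, 0.1]}
```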
Important tasks (for you) in LING 572
• Understand various learning algorithms.
• Apply the algorithms to different tasks:
  – Convert the data into an attribute-value table
    • Define feature types
    • Feature selection
    • Convert an instance into a feature vector
  – Choose an appropriate learning algorithm.