Exercise IR




Exercise 13.5 Consider the following frequencies for the class coffee for four terms in the first 100,000 documents of Reuters-RCV1:

Term        N00      N01    N10    N11
brazil      98,012   102    1835   51
council     96,322   133    3525   20
producers   98,524   119    1118   34
roasted     99,824   143    23     10

Select two of these four terms based on: i. χ², ii. mutual information, iii. frequency.
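A minimal Python sketch of one way to work Exercise 13.5: it applies the standard χ² and expected-mutual-information formulas for a 2x2 term/class contingency table to the counts copied from the table above, and ranks by N11 for frequency-based selection (the number of coffee documents containing the term). The dictionary layout and function names are illustrative choices, not prescribed by the exercise.

import math

# Document counts for the class "coffee", copied from the table above.
# N11: in class, term present; N10: not in class, term present;
# N01: in class, term absent;  N00: not in class, term absent.
counts = {
    "brazil":    {"N00": 98012, "N01": 102, "N10": 1835, "N11": 51},
    "council":   {"N00": 96322, "N01": 133, "N10": 3525, "N11": 20},
    "producers": {"N00": 98524, "N01": 119, "N10": 1118, "N11": 34},
    "roasted":   {"N00": 99824, "N01": 143, "N10": 23,   "N11": 10},
}

def chi_square(c):
    # X^2 = N (N11*N00 - N10*N01)^2 / ((N11+N01)(N11+N10)(N10+N00)(N01+N00))
    n00, n01, n10, n11 = c["N00"], c["N01"], c["N10"], c["N11"]
    n = n00 + n01 + n10 + n11
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den

def mutual_information(c):
    # Expected mutual information between term occurrence and class membership.
    n00, n01, n10, n11 = c["N00"], c["N01"], c["N10"], c["N11"]
    n = n00 + n01 + n10 + n11
    n1_, n0_ = n11 + n10, n01 + n00      # term present / term absent
    n_1, n_0 = n11 + n01, n10 + n00      # in class / not in class
    mi = 0.0
    for nij, n_term, n_class in [(n11, n1_, n_1), (n10, n1_, n_0),
                                 (n01, n0_, n_1), (n00, n0_, n_0)]:
        if nij > 0:
            mi += (nij / n) * math.log2(n * nij / (n_term * n_class))
    return mi

criteria = [("chi-square", chi_square),
            ("mutual information", mutual_information),
            ("frequency (N11)", lambda c: c["N11"])]
for name, score in criteria:
    ranked = sorted(counts, key=lambda t: score(counts[t]), reverse=True)
    print(f"{name}: top two terms = {ranked[:2]}")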

Exercise 13.10 Your task is to classify words as English or not English. Words are generated by a source with the following distribution:

Event   Word   English?   Probability
1       ozb    no         4/9
2       uzu    no         4/9
3       zoo    yes        1/18
4       bun    yes        1/18

i. Compute the parameters (priors and conditionals) of a multinomial NB classifier that uses the letters b, n, o, u, and z as features. Assume a training set that reflects the probability distribution of the source perfectly. Make the same independence assumptions that are usually made for a multinomial classifier that uses terms as features for text classification. Compute parameters using smoothing, in which computed-zero probabilities are smoothed into probability 0.01, and computed-nonzero probabilities are untouched. (This simplistic smoothing may cause P(A) + P(Ā) > 1. Solutions are not required to correct this.)

ii. How does the classifier classify the word zoo?

iii. Classify the word zoo using a multinomial classifier as in part (i), but do not make the assumption of positional independence. That is, estimate separate parameters for each position in a word. You only need to compute the parameters you need for classifying zoo.
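A Python sketch of one way to work Exercise 13.10 under its stated assumptions: a hypothetical training set mirrors the source distribution exactly (8 ozb, 8 uzu, 1 zoo, 1 bun out of 18 words), priors and letter conditionals are estimated with the exercise's simplistic smoothing (zero estimates become 0.01, nonzero estimates untouched), and zoo is then classified both with positional independence (part ii) and with separate per-position parameters (part iii). The 18-word scaling and all function names are illustrative.

from collections import Counter

LETTERS = "bnouz"
# Hypothetical training set mirroring the source distribution
# (4/9, 4/9, 1/18, 1/18 scaled to 18 words).
TRAINING = [("ozb", "no")] * 8 + [("uzu", "no")] * 8 + [("zoo", "yes"), ("bun", "yes")]

def smooth(p):
    # The exercise's simplistic smoothing: zero estimates become 0.01.
    return p if p > 0 else 0.01

def estimate_params(data):
    # Part i: priors and per-letter conditionals of a multinomial NB classifier.
    priors, cond = {}, {}
    for c in ("yes", "no"):
        words = [w for w, cls in data if cls == c]
        priors[c] = len(words) / len(data)
        letter_counts = Counter("".join(words))
        total = sum(letter_counts[l] for l in LETTERS)
        cond[c] = {l: smooth(letter_counts[l] / total) for l in LETTERS}
    return priors, cond

def classify_multinomial(word, priors, cond):
    # Part ii: score = prior * product of one conditional per letter occurrence.
    scores = {c: priors[c] for c in priors}
    for c in priors:
        for letter in word:
            scores[c] *= cond[c][letter]
    return max(scores, key=scores.get), scores

def classify_positional(word, data):
    # Part iii: drop positional independence -- separate conditionals per position.
    scores = {}
    for c in ("yes", "no"):
        words = [w for w, cls in data if cls == c]
        score = len(words) / len(data)                      # prior
        for pos, letter in enumerate(word):
            count = sum(1 for w in words if w[pos] == letter)
            score *= smooth(count / len(words))
        scores[c] = score
    return max(scores, key=scores.get), scores

priors, cond = estimate_params(TRAINING)
print(priors)                                       # class priors
print(cond)                                         # P(letter | class)
print(classify_multinomial("zoo", priors, cond))    # part ii
print(classify_positional("zoo", TRAINING))         # part iii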