
Indian Institute of Technology Kharagpur
End-Spring Semester 2017-18

Date of Examination: April 24, 2018    Session: AN    Duration: 3 hours    Full Marks: 80
Subject No.: CS60050    Subject: Machine Learning
Department: Computer Science and Engineering
Specific charts, graph paper, log book etc., required: NA

Instructions:
Attempt all questions. All parts of the same question must be answered together.
For True/False and MCQ questions, no credit will be given if the explanation is incorrect even if the answer is correct.
No clarifications can be provided during the exam. Make reasonable assumptions if necessary, and state any assumptions made.
All workings must be shown. You can use calculators.

1. State whether the following statements are true or false. Explain briefly with reasons. [2 x 7 = 14]

(i) Classifiers having lower bias have a higher variance.

(ii) Complete linkage clustering is computationally cheaper compared to single linkage clustering.

(iii) Complexity of learning from a hypothesis class decreases as the VC dimension of the hypothesis class increases.

(iv) Given two classifiers A and B, if A has a lower VC-dimension than B, then A almost certainly will perform better on a test set.

(v) If you are given m data points, and you use half the points for training and the other half for testing, the difference between training error and test error decreases as m increases.

(vi) The support vectors are expected to remain the same in general, as we move from a linear kernel to higher order polynomial kernels.

(vii) The VC dimension of a Perceptron is smaller than the VC dimension of a simple linear SVM.

2. Answer the following Multiple Choice Questions. Each question may have any number of correct answers, including zero. List all choices that you believe to be correct, with brief explanations. There are 2 marks for each correct answer, -1 for each incorrect answer.

(i) Let there be K hypothesis sets H_1, H_2, ..., H_K, where each H_k has a finite, positive VC dimension d_VC(H_k). Let H = ∩_{k=1}^{K} H_k be the intersection of the hypothesis sets. Which among the following statements are correct about the VC dimension of H? (The VC dimension of an empty set or a singleton set is taken as zero.)
(a) 0 ≤ d_VC(H) ≤ Σ_{k=1}^{K} d_VC(H_k)
(b) min{d_VC(H_k)}_{k=1}^{K} ≤ d_VC(H) ≤ max{d_VC(H_k)}_{k=1}^{K}
(c) 0 ≤ d_VC(H) ≤ min{d_VC(H_k)}_{k=1}^{K}
(d) min{d_VC(H_k)}_{k=1}^{K} ≤ d_VC(H) ≤ Σ_{k=1}^{K} d_VC(H_k)
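[An added aside, not part of the original paper: since H is the intersection of the H_k, every dichotomy that H realizes on a point set is also realized by each H_k, which pins down the upper bound appearing in two of the options:]

```latex
H = \bigcap_{k=1}^{K} H_k \subseteq H_k \quad \text{for every } k
\;\Longrightarrow\;
d_{\mathrm{VC}}(H) \le \min_{1 \le k \le K} d_{\mathrm{VC}}(H_k)
```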

(ii) As the number of training examples goes to infinity, an ML model trained on that data will have:
(a) Lower variance (b) Higher variance (c) Same variance
(d) Lower bias (e) Higher bias (f) Same bias


(iii) Suppose we have one input variable x and one output variable y. We use two models (1) f_1(x, a) = sign(x + a), and (2) f_2(x, b) = sign(bx + 1), to classify the input into two classes (according to whether the sign is positive or negative). Which of the following statements are correct about the VC dimension of f_1 and f_2?
(a) d_VC(f_1) = 1, d_VC(f_2) = 1
(b) d_VC(f_1) = 1, d_VC(f_2) = 2
(c) d_VC(f_1) = 2, d_VC(f_2) = 1
(d) d_VC(f_1) = 2, d_VC(f_2) = 2

(iv) For polynomial regression, which one of these structural assumptions most affects the trade-off between underfitting and overfitting?
(a) The method used to learn the weights
(b) The assumed variance of the noise
(c) The polynomial degree
(d) The use of a constant-term unit input
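[An added illustration of why the degree dominates this trade-off; the synthetic data and degree choices below are my own, not from the paper:]

```python
import numpy as np

rng = np.random.default_rng(0)
# Noisy samples of a sine curve: a small training set, a larger test set.
x_train = rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# Low degree underfits (both errors high); high degree overfits
# (train error near zero, test error grows).
```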

3. (a) Consider a hypothetical SVM model that has the following values of Lagrange multipliers α_k and support vectors. Suppose that the linear kernel is used. Compute the output y of this SVM model when the input feature vector is (0.3, 0.8, 0.6).

α_k    Support vector    Class value
1      (1, -1, 1)        +1
0.5    (0, 2, -1)        -1
1      (-1, 0, 2)        -1
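[An added sketch of the arithmetic for part (a), assuming the bias term b is zero since none is given; with a linear kernel the SVM output is y = sign(Σ_k α_k y_k (x_k · x)):]

```python
import numpy as np

# Lagrange multipliers, class values y_k, and support vectors from the table.
alphas = np.array([1.0, 0.5, 1.0])
ys = np.array([+1, -1, -1])
svs = np.array([[1, -1, 1],
                [0, 2, -1],
                [-1, 0, 2]], dtype=float)

x = np.array([0.3, 0.8, 0.6])

# Linear kernel: K(x_k, x) = x_k . x ; bias assumed 0 (not given in the question).
score = np.sum(alphas * ys * (svs @ x))
print(score, np.sign(score))  # 0.1 - 0.5 - 0.9 = -1.3 -> class -1
```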

(b) You have trained an SVM on a training dataset, and you suspect that the SVM has overfit the data. Can you estimate whether overfitting has occurred from the support vectors?

(c) Mention at least two ways to avoid overfitting of an SVM on a training dataset. [5 + 2 + 3 = 10]

4. (a) Consider the following two-variable function. Draw a perceptron and label it with suitable weights so that it correctly classifies all inputs for this function.

x1    x2    output
0     0     0
0     1     1
1     0     0
1     1     1
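[An added check of one possible answer, not the only one: the table above is simply output = x2, so any weights with w1 = 0, w2 > 0 and -w2 < w0 < 0 work. The sketch below verifies the candidate (w0, w1, w2) = (-0.5, 0, 1):]

```python
# Candidate perceptron: fires iff w0 + w1*x1 + w2*x2 > 0.
def perceptron(x1, x2, w0=-0.5, w1=0.0, w2=1.0):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))  # matches the table: 0, 1, 0, 1
```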

(b) Consider the XOR function with the two inputs x1, x2, and output y. Implement the XOR function with a neural network, such that the output y ≥ 1/2 when x1 ≠ x2 and y < 1/2 when x1 = x2. The neural network should be a fully connected network, where every neuron has three inputs (including a bias input of x0 = 1) and one output, and every neuron implements the sigmoid activation function σ(z) = 1/(1 + e^(-z)). (Hint: a three-neuron network is sufficient.)
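[An added sketch of one weight setting that meets these constraints (many others work): hidden neuron n1 approximates OR, n2 approximates AND, and the output neuron computes roughly "OR and not AND". The neuron names and weights here are illustrative, not from the paper:]

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    h1 = sigmoid(-10 + 20 * x1 + 20 * x2)    # ~ OR(x1, x2)
    h2 = sigmoid(-30 + 20 * x1 + 20 * x2)    # ~ AND(x1, x2)
    return sigmoid(-10 + 20 * h1 - 20 * h2)  # ~ OR and not AND = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        y = xor_net(x1, x2)
        print(x1, x2, round(y, 4), y >= 0.5)  # >= 0.5 exactly when x1 != x2
```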

(c) Consider the following neural network. Assume that each neuron implements the tanh activation function. Note that the derivative of tanh(x) is 1 - tanh^2(x). The network is trained with the instance (x1, x2, y), where x1, x2 are the inputs and y is the output according to the training set. Assume that the backpropagation algorithm is used to minimize the squared error. Let o1, o2, o3, o4 be the outputs of the neurons n1, n2, n3, n4 respectively. Let δ1, δ2, δ3, δ4 be the values backpropagated by the neurons n1, n2, n3, n4 respectively. Compute expressions for δ1, δ2, δ3, δ4 in terms of y, the outputs o1, o2, o3, o4, and the weights as shown in the figure. [4 + 6 + 5 = 15]
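[The figure is not reproduced in this transcript, so the exact wiring and weight labels are unavailable. As an added reference point only: for tanh units under squared error E = (1/2)(y - o4)^2, assuming n4 is the output neuron and writing w_{ij} for the (hypothetical) weight from n_i into n_j, the generic delta pattern is:]

```latex
\delta_4 = (y - o_4)\left(1 - o_4^2\right),
\qquad
\delta_i = \left(1 - o_i^2\right) \sum_j w_{ij}\,\delta_j
\quad \text{for hidden } n_i\text{, summing over the } n_j \text{ that } n_i \text{ feeds.}
```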


5. Consider the dataset given below, for determining whether a person A finds a particular type of food appealing, based on three features: the temperature, taste, and size of the food.

Appealing    Temperature    Taste    Size
no           hot            salty    small
no           cold           sweet    large
no           cold           sweet    large
yes          cold           sour     small
yes          hot            sour     small
no           hot            salty    large
yes          hot            sour     large
yes          cold           sweet    small
yes          cold           sweet    small
no           hot            salty    large

(a) What is the initial entropy of Appealing?
(b) Assume that the attribute Taste is chosen as the root of the decision tree. What is the information gain associated with this attribute?
(c) Draw the full decision tree learned for this data (without any pruning). [2 + 3 + 5 = 10]
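[An added sketch of the arithmetic behind parts (a) and (b), computed directly from the table above:]

```python
import math
from collections import Counter

data = [  # (appealing, temperature, taste, size) from the table
    ("no", "hot", "salty", "small"), ("no", "cold", "sweet", "large"),
    ("no", "cold", "sweet", "large"), ("yes", "cold", "sour", "small"),
    ("yes", "hot", "sour", "small"), ("no", "hot", "salty", "large"),
    ("yes", "hot", "sour", "large"), ("yes", "cold", "sweet", "small"),
    ("yes", "cold", "sweet", "small"), ("no", "hot", "salty", "large"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[0] for row in data]
h0 = entropy(labels)  # 5 yes / 5 no -> 1 bit

# Information gain of splitting on Taste (column index 2).
gain = h0
for value in {row[2] for row in data}:
    subset = [row[0] for row in data if row[2] == value]
    gain -= len(subset) / len(data) * entropy(subset)
print(h0, gain)  # salty and sour subsets are pure, sweet is 2/2 -> gain 0.6
```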

6. Consider the hypothesis set H that is the set of all circles in the 2-dimensional plane, such that, for a particular hypothesis, the points inside the chosen circle are classified as 1, and points outside the circle are classified as 0. What is the VC dimension of this hypothesis set H? Justify with reasons. Note that no credit will be given without proper justification. [Hint: If you want to show that the VC dimension of H is k, you have to show (i) some set of k points can be shattered by H, and (ii) no set of k + 1 points can be shattered by H.] [10]

7. (a) Define (i) Disparate Treatment, (ii) Disparate Impact, and (iii) Disparate Mistreatment with respect to fairness. How are they different from each other?
(b) A recidivism prediction tool predicts whether a defendant is going to re-commit a crime within two years. A defendant is said to 'survive' if he/she does not re-commit a crime, and is said to 'recidivate' if he/she re-commits a crime. The following contingency table shows the predictions made by a tool, and the corresponding ground truths. Is the tool biased? Justify numerically. [6 + 5 = 11]

Black defendants
             Low    High
Survived     990    805
Recidivated  532    1369

White defendants
             Low    High
Survived     1139   349
Recidivated  461    505
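[An added sketch of the numerical comparison part (b) asks for: disparate mistreatment shows up as unequal false-positive and false-negative rates across the two groups. Treating a 'High' rating as a positive prediction and recidivism as the positive class:]

```python
# Contingency counts from the tables:
# {group: (survived_low, survived_high, recidivated_low, recidivated_high)}
counts = {
    "black": (990, 805, 532, 1369),
    "white": (1139, 349, 461, 505),
}

for group, (sl, sh, rl, rh) in counts.items():
    fpr = sh / (sl + sh)  # survived, but rated high risk
    fnr = rl / (rl + rh)  # recidivated, but rated low risk
    print(f"{group}: FPR = {fpr:.1%}, FNR = {fnr:.1%}")
# black: FPR ~ 44.8%, FNR ~ 28.0%; white: FPR ~ 23.5%, FNR ~ 47.7%
```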
