
Page 1: Applied Machine Learning Assignment 1 - EPFLlasa.epfl.ch/teaching/lectures/ML_Msc/practicals/2012-2013/AML-TP1.pdf · Applied Machine Learning Assignment 1 Professor: Aude Billard

Applied Machine Learning

Assignment 1

Professor: Aude Billard
Assistants: Basilio Noris, Nicolas Sommer

contacts: aude.billard@epfl.ch, basilio.noris@epfl.ch, n.sommer@epfl.ch

Winter Semester 2012

1 Goals

Assignment 1 covers 4 machine learning techniques that perform linear transformations on the data for dimensionality reduction, signal separation, and clustering. This assignment will run through 3 practical sessions dedicated respectively to:

• Principal Component Analysis

• Independent Component Analysis (this part is optional and counts as 10% bonus)

• Clustering (K-Means, Soft-K Means and GMM)

This assignment will be graded through a report which you must hand in no later than November 9th at 18h00. Please send it to your assistants by email. The report should be named "AML-2011 Name1-Name2.pdf", where NameX corresponds to the family names of the group participants. Delays will be penalized: 1 point will be subtracted for each day of delay. The first day late counts starting one hour after the deadline. This report counts for 25% of the total grade of the course. Practicals are conducted in teams of three. Unless told otherwise, we assume that the work has been shared equally by the members of the team and hence all members will be given the same grade. More information on the assignment and on the way the report should be written is given below.

Structure of the practicals

Sections 4, 5 and 6 describe the work to be conducted during the practical sessions held on October 5, 12 and 26, respectively, through step-by-step instructions. You must answer each of the questions specified in these sections. Each question corresponds to a specific percentage of the overall grade. This percentage is indicated next to each question.

Note that the two main sections, namely Sections 4 (PCA) and 5 (Clustering), correspond to two 3-hour sessions, whereas Section 6 (ICA) will only be given a one-hour session (this part of the practical on ICA counts as a bonus for the report).


2 Advice before starting the practical sessions

The practical sessions revolve around the evaluation of machine learning algorithms on real data. The goal of the practicals is to get you acquainted with various stages of preprocessing one can perform on datasets of different sizes and dimensionality.

Your task will be to determine how the different parameters of each method affect the results. Most of your effort should be spent on understanding why modifying a given parameter affects the result in a given manner. You should ask yourself: "Is this what I was expecting?"; "Is this surprising?"; "Can I relate the mathematical formulations of these methods to what I am seeing on the screen?". Take the opportunity to discuss the effects and results you obtain within your team. In case of doubt, do not hesitate to ask the assistants for help during the practical hours dedicated to this.

3 Datasets

Throughout the practicals, you will be working on synthetic and real data. Synthetic data can be created by using the drawing tools in the MLDemos interface. They are very useful for visualizing how changes in the data affect the results of a learning algorithm. However, synthetic data seldom help to grasp some of the issues arising when using realistic and hence noisy data. While you are advised to play with synthetic data during the practical sessions, the report should focus solely on results obtained when using real datasets. In these practicals, we ask that you use two such realistic datasets.

• The first dataset must be generated by your team (and must hence differ from those generated by other teams) and must be composed of real images. You can create a dataset composed of images using the Face/Object interface of MLDemos, see instructions below.

• The second dataset will be assigned to your team by your assistants and is drawn from the Standard Benchmark datasets of the UCI Machine Learning Database [1].

3.1 Face / Object datasets

Face and object images present an interesting source of data as they live in a high-dimensional space (images often have several thousand dimensions). Groups should be using original images that they have either generated using the camera on their computers (we provide webcams during the practicals), or that they have gathered through other media (personal photos, etc). To collect these images, follow the steps provided here:

1. Launch MLDemos and select Plugins > Input / Output > PCA Faces from the menu.

2. An interface should pop up (see Figure 1). If you have a camera attached to your computer, it should open up on the left-hand side of the interface and allow you to select a region of the image that can be captured multiple times (e.g. different faces or different facial expressions). Alternatively, an image can be loaded and sub-regions of that image selected as samples.

3. Use the button marked with >> to add the selected regions to the dataset.

[1] http://archive.ics.uci.edu/ml/datasets.html


4. Once you have selected enough samples (all samples are gathered on the right-hand side of the interface), you can assign labels to each sample by left/right-clicking on them. You can left/right-Shift-click a sample to change the class label of all samples below. Ctrl+clicking on a sample will remove it from the dataset. Save the dataset once you're satisfied with the results. The dataset is saved as an image which you can open and edit with any imaging software (and which you could for example include in your report).

When creating the image dataset, keep in mind that you will have to split each dataset into different parts later on in the practicals; therefore make sure to have enough samples for each object or face. The minimal size for a dataset should be 50-60 samples (i.e. 25-30 samples per class), but you will realize that a bigger dataset can help your understanding of what's going on. The system should be able to process up to a couple of thousand samples.

Figure 1: PCAFaces plugin GUI for creating image-based datasets and projecting the results into MLDemos.

3.2 UCI Machine Learning Database

The UCI Machine Learning Database (http://archive.ics.uci.edu/ml/datasets.html) gathers a large number of datasets for different problems such as classification, regression and clustering. The description of each dataset includes its use (e.g. classification), its size (number of samples for each class) and its dimensionality (number of dimensions for each sample). To ensure that each group produces original work, each group will work with a separate dataset from the UCI database. The dataset assigned to each team is listed in Table 1 below.

To download the dataset, follow the links on the website and select Data Folder. There, download the file *.data containing the actual data. In the same folder, the *.names file describes the contents of the dataset; you are advised to read it to better understand the data you are using.

To import the *.data file in MLDemos, drag and drop the file into MLDemos or use File > Import > Data (csv,txt) from the menus. Select the class column to indicate which column of the data contains the class of the datapoints. This column will then be used to color each datapoint depending on its class (otherwise it will be used as one extra dimension for your data).
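If you want to inspect a *.data file outside MLDemos, the idea of a class column can be sketched in a few lines of Python. This is only an illustration, not what MLDemos does internally: the function name `load_labeled_csv`, the sample rows and the choice of column 0 as the class column are all made up for the example.

```python
import csv
import io


def load_labeled_csv(text, class_column):
    """Split comma-separated rows into feature vectors and class labels."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        # The class column becomes the label, every other column a feature.
        labels.append(row[class_column].strip())
        features.append(
            [float(value) for i, value in enumerate(row) if i != class_column]
        )
    return features, labels


# Made-up rows laid out like a *.data file, with the class in column 0.
sample = "1,14.2,1.7\n1,13.2,1.8\n2,12.4,0.9\n"
X, y = load_labeled_csv(sample, class_column=0)
print(X)
print(y)
```

Some UCI files mark missing values with a placeholder such as "?"; this sketch would fail on those entries, so check the *.names file first.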


Group #   Dataset
1         Letter recognition
2         Pima indian diabetes
3         Statlog (Image segmentation)
4         Waveform v1
5         Zoo
6         Echocardiogram
7         Wine
8         Hepatitis
9         Glass
10        Dermatology
11        Vertebral Column (3C)
12        Ionosphere
13        Congressional voting records
14        Libras movement
15        Credit approval
16        SPECT (change *.train to *.data)
17        SPECTF (change *.train to *.data)
18        Post-operative patient

Table 1: Dataset assigned to each group

4 Part I: Principal Component Analysis (50%)

To help you determine the relative importance of each part of the work, we provide indicative percentages. These correspond to the percentage of points of the total grade for the report.

4.1 Goals

For this first practical you will focus on choosing a suitable projection of the data through Principal Component Analysis. Such a projection aims at improving the separability of the data and at reducing the dimensionality of the dataset.

Thus, the first step of this practical (collecting data) will require you to select or create a dataset. Be aware that these sets will be used throughout the other practicals of assignment 1 and must therefore be chosen very carefully; see also the recommendation given in Section 1.3 of the Lecture Notes.

The second step will require you to investigate different PCA projections (analyzing with PCA) and to determine which projection is best suited for the purpose at hand. You will then be asked to discuss the influence those particular choices may have in improving or degrading performance of the classification or clustering process.

Next, we give step by step instructions to complete the practical.

4.2 Getting started

I) Download MLDemos. The practicals will primarily use the MLDemos software. The software (downloadable at http://mldemos.epfl.ch) provides a graphical interface for visualizing the data and algorithms you will use throughout this year.

If you are using an EPFL computer, it is advised to decompress the MLDemos zip file in the desktop folder to avoid folder/file path issues.


II) Collect data. You must collect one dataset of images and one dataset from the UCI database:

1) Images. Follow the instructions in Section 3.1 on how to create and import your dataset.

2) UCI database. Follow the instructions given in Section 3.2 to import the data from the UCI database. If you have other data you would like to use in the practicals, instead of the UCI dataset assigned to your group (i.e. data from your own experiments or a different source than UCI), please ask your assistants if it is suitable for the practicals and for help to import your data in MLDemos.

III) Load and Project your data

1) Images. When you load your image dataset in the PCAFaces plugin, PCAFaces automatically performs PCA and projects the samples on the two eigenvectors currently selected (lower-right corner of the PCAFaces interface). You can display both the graph of the reconstruction error with increasing numbers of components and the eigenvectors in two separate windows by clicking on the Eigenvectors button of the PCAFaces window.

2) UCI database. When you load your *.data file (drag and drop the file into MLDemos), the data is displayed in its original space. To project it with PCA, click on the Algorithms button (Figure 2) and go to the Projections tab. Select Principal Component Analysis and click on Project. Your data is now projected onto its eigenvectors. Note that you can project your data back to its original space by clicking the Revert button.

Figure 2: How to project your data: open the Algorithms window (1), select the Projections tab (2) and choose PCA (3). Click on Project (4).
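To connect the Project button with the underlying mathematics, here is a minimal sketch of PCA restricted to two-dimensional data, where the eigenvectors of the covariance matrix have a closed form. It is an illustration under stated assumptions, not the MLDemos implementation: the function name `pca_2d` and the sample points are invented, and real datasets with more dimensions need a general eigen-decomposition.

```python
import math


def pca_2d(points):
    """Return both principal axes of 2-D data and the projected samples."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Entries of the symmetric covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Orientation of the axis of largest variance (first eigenvector).
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    e1 = (math.cos(theta), math.sin(theta))
    e2 = (-math.sin(theta), math.cos(theta))

    # Coordinates of every centered sample along e1 and e2.
    projected = [
        (x * e1[0] + y * e1[1], x * e2[0] + y * e2[1]) for x, y in centered
    ]
    return e1, e2, projected


# Made-up, roughly diagonal point cloud.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
e1, e2, projected = pca_2d(points)
var1 = sum(u * u for u, _ in projected) / (len(points) - 1)
var2 = sum(v * v for _, v in projected) / (len(points) - 1)
print("first eigenvector:", e1)
print("variance along e1 and e2:", var1, var2)
```

After the projection the two coordinates are uncorrelated, and the variance along the first axis is at least as large as along the second.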

The graph of the reconstruction error (Figure 3) with increasing numbers of components and the table with each component's variance and percentage of variance can be useful to get an idea of how much information is stored in each component. The Eigen button will display the eigenvectors in a new window.


Figure 3: Reconstruction error and component variance (1). Eigen button (2) to display the eigenvectors in a new window.
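The quantities shown in Figure 3 can be related to the eigenvalues of the covariance matrix: each eigenvalue is the variance along its component, and the mean squared reconstruction error when keeping the first k components equals the sum of the discarded eigenvalues. The sketch below tabulates both as percentages. The eigenvalues are made up, the function name `variance_report` is hypothetical, and MLDemos may scale its error curve differently.

```python
def variance_report(eigenvalues):
    """List (k, variance, % of variance, % residual error) per component."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rows = []
    kept = 0.0
    for k, variance in enumerate(ordered, start=1):
        kept += variance
        percent = 100.0 * variance / total
        # Variance not captured when only the first k components are kept.
        residual = 100.0 * (total - kept) / total
        rows.append((k, variance, percent, residual))
    return rows


# Made-up eigenvalues of a 4-dimensional dataset.
for k, variance, percent, residual in variance_report([4.0, 2.0, 1.0, 1.0]):
    print(f"{k} components: variance={variance:.2f} "
          f"({percent:.1f}%), residual error={residual:.1f}%")
```

A sharp drop in the residual after the first few components suggests that the remaining ones carry little information.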

IV) Browse your projected data

1) Images. In the PCAFaces window, you select the eigenvectors onto which your data is projected in the bottom right of the window.

Figure 4: Selecting the eigenvectors in the PCAFaces window will determine onto which two eigenvectors the data is projected in the main window.

2) UCI database. By default, the data is projected on the first two dimensions (either in the original space or in the projected space). You can change this by selecting the dimensions you want to display your data on: select the x and y axis dimensions on the bottom left of the main window (Figure 5, (1)).

Figure 5: Choose the dimensions your data are displayed on (1) and the way they are displayed (2).

There are other ways of plotting your data to display more than two dimensions at the same time. Choose from the list (Figure 5, (2)) between Scatterplots (plots every pairwise combination of dimensions next to each other; expect some slowdown if you have a big dataset), Parallel coordinates (each datapoint is a line passing through each dimension) and Bubble plots (display a third dimension by varying the size of each datapoint).

Figure 6: Different data displays are available. From left to right: Standard, Scatterplots, Parallel coordinates and Bubble plots methods.

4.3 Questions: Studying Principal Projections

Assume here that PCA is a preprocessing step, which you may perform before doing classification (PCA does not do classification; classification would then be performed using another algorithm which we will see later on in class). Answer the following questions:

1. Can you find one or more projections of the data that would make the classes separable? If this is the case, can you decipher which feature of the data was extracted by the projection and whether these features correspond to your expectations? If you did not manage to find a suitable pair or group of projections to separate the data, discuss why this is the case. (20%)

2. What happens if you do not use all samples to train PCA? (You can do this by right/left-clicking on the samples in the dataset window: non-colored samples will not be used for computing PCA.) Repeat this process 3 times by selecting different subgroups of images, and discuss how the choice of training set affects the computation of PCA and the separability of the data. (15%)

3. Choose representative figures in support of your responses. (15%)


5 Part II: Clustering (50%)

5.1 Goals

In this practical you will study three unsupervised methods for clustering data. You will compare K-Means, Soft K-Means and Gaussian Mixture Models in terms of clustering quality, cluster/region boundaries and performance.

You will study K-Means first, and look at the way boundaries are created between the clusters. You will see the sensitivity to initialization of the system and the way clustering relates to the actual classes in your dataset. You will play with different metrics and study how they affect the cluster boundaries.
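As a reminder of what the algorithm does at each iteration, here is a minimal K-Means sketch in which the distance used for the assignment step can be switched between Manhattan (power 1), Euclidean (power 2) and infinity norm. It is a simplified illustration, not the MLDemos code: the `power` parameter, the function names and the toy points are assumptions, and the centers are always updated as arithmetic means, whichever metric is used for the assignment.

```python
def minkowski(p, q, power):
    """Minkowski distance; power=1 Manhattan, 2 Euclidean, inf max-norm."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if power == float("inf"):
        return max(diffs)
    return sum(d ** power for d in diffs) ** (1.0 / power)


def kmeans(points, centers, power=2, iterations=20):
    """Alternate assignment and mean-update steps from given initial centers."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its closest center.
        assignment = [
            min(range(len(centers)),
                key=lambda k: minkowski(p, centers[k], power))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Made-up data: two well-separated groups, fixed initial centers.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
for power in (1, 2, float("inf")):
    centers, assignment = kmeans(points, [(0, 0), (5, 5)], power=power)
    print(power, centers, assignment)
```

On such an easy dataset every metric gives the same partition; differences in region boundaries and stability show up when clusters overlap or when the initial centers change from run to run.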

You will then focus on Soft K-Means and Gaussian Mixture Models and study these methods much in the same way as simple K-Means.
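To get a feeling for the stiffness parameter before experimenting, the sketch below computes soft assignments of one point to several cluster centers. It assumes the common textbook formulation in which the responsibility of a cluster is proportional to exp(-stiffness * squared distance); whether MLDemos uses exactly this form is an assumption, and the function name and numbers are invented for the example.

```python
import math


def responsibilities(point, centers, stiffness):
    """Soft assignment of one point to each center (values sum to 1)."""
    squared = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    # Subtracting the smallest distance avoids underflow for large stiffness.
    smallest = min(squared)
    weights = [math.exp(-stiffness * (d - smallest)) for d in squared]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up point and centers; the point is closer to the first center.
point = (1.0, 0.0)
centers = [(0.0, 0.0), (3.0, 0.0)]
for stiffness in (0.0, 0.5, 10.0):
    print(stiffness, responsibilities(point, centers, stiffness))
```

With zero stiffness every cluster receives the same share of the point; as the stiffness grows, the assignment approaches the hard assignment of K-Means.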

All these methods are related to one another in mathematical terms; you will explain why and how.

5.2 Getting started

I) Load your data Using the PCAFaces plugin or by Drag and Drop.

II) Cluster your data. After loading your data, open the Algorithms window and select the Clustering tab. Choose the method and parameters (e.g. metric power, stiffness) and hit the Cluster button.

Figure 7: Select the Clustering tab (1). Choose the method (2) and parameters (3) (e.g. metric power, stiffness) and hit the Cluster button (4).

III) Optimize the number of clusters. Besides choosing the number of clusters manually, you can let MLDemos find an optimal number according to several possible criteria: AIC, BIC, F-measure and RSS (Figure 8).

Beware: AIC and BIC are based on the number of parameters of the algorithm's model and on the likelihood of your data under that model (internal evaluation); they therefore do not require datapoint class labels, whereas F-measure uses only the datapoints' labels to evaluate the algorithm (external evaluation).
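The two internal criteria can be written down in a few lines using their standard definitions, AIC = 2p - 2 ln L and BIC = p ln(n) - 2 ln L, where p is the number of free parameters, n the number of samples and L the likelihood. The sketch below applies them to a Gaussian mixture with full covariance matrices. The log-likelihood values are invented for illustration, the function names are hypothetical, and the way MLDemos counts parameters may differ.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def gmm_free_parameters(n_clusters, n_dims):
    """Free parameters of a GMM with one full covariance per cluster."""
    means = n_clusters * n_dims
    covariances = n_clusters * n_dims * (n_dims + 1) // 2
    weights = n_clusters - 1  # mixing weights sum to one
    return means + covariances + weights


# Made-up log-likelihoods for 1 to 4 clusters on 100 two-dimensional points.
log_likelihoods = {1: -250.0, 2: -180.0, 3: -176.0, 4: -174.0}
scores = {
    k: bic(ll, gmm_free_parameters(k, 2), 100)
    for k, ll in log_likelihoods.items()
}
best_k = min(scores, key=scores.get)
print(scores)
print("number of clusters with the lowest BIC:", best_k)
```

The likelihood alone keeps improving as clusters are added; the penalty term is what makes the criterion prefer a smaller model once the improvement becomes marginal.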

5.3 Questions

1. K-Means
Find the proper number of clusters to model your data. This depends on the dataset you are using, but can also be changed to study how K-Means responds to different numbers of clusters. Can you identify a good 'measure' of how good a clustering is? Try changing the type of metric (Euclidean, Manhattan, Infinite and Polynomial), and study the effect of these choices in terms of the positions of the cluster means and the shape of region boundaries. Also study the stability of the metrics: is there a metric that makes the system unstable? (Does the system converge to the same solution at each run?) (15%)

Figure 8: Select the optimization criterion (1), the range of clusters to search and hit Optimize Clusters (2). The results are displayed on a graph (3) and numerically (4) for the optimal number found.

2. Soft K-Means Stiffness
Play with the stiffness parameter of the Soft K-Means boundaries and describe how the clustering is affected by this. Is the system still stable? (10%)

3. Gaussian Mixture Models
Study the way GMM clusters the points. Does this method always converge to the same solution? More than the other methods? From a theoretical standpoint, GMM, Soft K-Means and K-Means are linked together. Explain why and where. (10%)

4. GMM with Bayesian Information Criterion
The Bayesian Information Criterion (BIC) provides an estimation of the quality of a clustering. You can apply it to GMM to estimate the optimal number of clusters. (Select BIC in the Optimize by menu on the right-hand side of the Algorithms window, and hit Optimize Clusters, see Figure 8.) Do this for your different datasets. Do you always obtain optimal results? Find and discuss an example in which the BIC fails to provide a good solution. Note that clustering is often used as a preparatory step for more complex problems (e.g. classification, regression); discuss the potential issues that arise when using BIC (from a computational standpoint). (15%)
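When examining convergence in the GMM questions, it can help to have a small reference implementation to compare against. The sketch below runs Expectation-Maximization for a one-dimensional Gaussian mixture from two different initializations and records the log-likelihood at each iteration. It is a toy illustration under stated assumptions: the data are made up, the function names are hypothetical, variances start at 1 with equal weights, and a small floor on the variance is added for numerical safety.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    history = []  # log-likelihood before each update
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        log_likelihood = 0.0
        for x in data:
            joint = [w * gaussian(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            log_likelihood += math.log(total)
            resp.append([j / total for j in joint])
        history.append(log_likelihood)
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsing component
            weights[j] = nj / len(data)
    return means, variances, weights, history


# Made-up samples drawn around 1 and around 5.
data = [0.8, 1.0, 1.2, 1.1, 0.9, 4.9, 5.1, 5.0, 5.2, 4.8]
for init in ([0.0, 6.0], [2.0, 3.0]):
    means, variances, weights, history = em_gmm_1d(data, init)
    print(init, [round(m, 3) for m in means], round(history[-1], 3))
```

The log-likelihood never decreases from one iteration to the next, but which local optimum is reached can depend on the initialization, which is worth checking on less well-separated data.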


6 Bonus part: Independent Component Analysis (10% bonus)

6.1 Goals

In this practical you will study Independent Component Analysis. You will study how the algorithm manages the separation of two mixed images in an iterative manner, and you will also study how it decorrelates data (this part is not included in the grading).

6.2 What to do

The practical will be split into two parts:

• Image separation (optional, bonus points for the report)

• Data decorrelation (optional, not graded)

6.3 Getting started

I) First part: Matlab (Bonus). Launch Matlab and open the ica image mix.m file. Hit F5 to execute the script. The script loads two images, mixes them together by means of a random mixture, and then tries to separate the signals again using ICA. You should replace the input images by changing the name of the images loaded at the beginning of the file, as well as changing the random seed for the generation of the mixed signals. Alternatively, you can input the mixing matrix manually.

II) Second part: MLDemos (not graded). Load your dataset and select the Projections tab in the Algorithms window. Select Independent Component Analysis as projection method and hit the Project button (leave Jade as the ICA method). The projected data will be drawn in the main window, while you can read the mixing coefficients in the Algorithms window.

As for PCA, you can come back to the original space by using the Revert button.
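The mixing step performed by the Matlab script can be pictured with a two-by-two mixing matrix applied to two source signals. The sketch below only shows this forward model and its exact inverse when the mixing matrix is known; it is not an ICA implementation, since ICA has to estimate the unmixing matrix from the mixed signals alone. The pixel values, the matrix entries and the function names are made up for the example.

```python
def mix(sources, mixing):
    """Apply a 2x2 mixing matrix to two equally long signals."""
    s1, s2 = sources
    (a, b), (c, d) = mixing
    x1 = [a * u + b * v for u, v in zip(s1, s2)]
    x2 = [c * u + d * v for u, v in zip(s1, s2)]
    return x1, x2


def invert_2x2(matrix):
    """Inverse of a 2x2 matrix, or an error if it is singular."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular, sources cannot be recovered")
    return ((d / det, -b / det), (-c / det, a / det))


# Two made-up rows of grey-level pixels standing in for two images.
image_1 = [10.0, 200.0, 35.0, 90.0]
image_2 = [120.0, 15.0, 240.0, 60.0]
mixing = ((0.6, 0.4), (0.3, 0.7))  # arbitrary example mixing matrix

mixed = mix((image_1, image_2), mixing)
recovered = mix(mixed, invert_2x2(mixing))
print(mixed)
print(recovered)
```

Entering different mixing matrices in the script changes how strongly each image appears in the two mixtures, which is a convenient way to probe when the separation becomes difficult.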

6.4 Questions

1. Find a pair of photos which you like and run the algorithm within Matlab with different mixing matrices. Discuss the various solutions you obtain. Discuss whether you observe some of the drawbacks of ICA (inversion of the image, reordering) presented in class. Illustrate major findings with at most two pairs of pictures. (5%)

2. Discuss the following: What would happen if half of the image was identical in both pictures and the other half of the images was different? What would happen if we were to scramble all the pixels inside both images, and then mix the two scrambled images: will the algorithm still be able to converge to the original scrambled images? (5%)

7 Report

Write a report of maximum 10 pages in PDF format (pages beyond the tenth one will be ignored). The best way to write the report is to fill it in as you go during the practical session. Just jotting down some quick notes while you experiment will save you hours once you work on the report itself.


7.1 Format

In this first report, we expect solely a qualitative assessment of the performance and behavior of the system:

A qualitative evaluation should contain images (e.g. screenshots) which exemplify the concepts you want to explain (e.g. an image of a good projection and an image of a bad one). Make sure to plot only a subset of all the plots you may have visualized during the practical. Choose the ones that are the most representative. Make sure that there is no redundancy in the information conveyed by the graphs and thus that each graph presents a different concept.

Each graph/image should be accompanied by a caption that explains the content of the image. Bad captions are captions that contain solely the figure number! An example of a good caption would typically read as follows: Figure 2: The left plot shows the e1 and e2 projection of 10 images of human faces, typical of those shown in Figure 1. In the main text, refer to all figures using their figure numbers.
