The Government of the Russian Federation
The Federal State Autonomous Institution of Higher Education
"National Research University - Higher School of Economics"
Faculty of Business Informatics
Department of Innovation and Business in Information Technology
Course Title “Applied Machine Learning”
Master’s Program 38.04.05 “Big Data Systems”
Author:
Dr. Sci., Prof. Andrey Dmitriev, [email protected]
Moscow, 2014
This document may not be reproduced or redistributed by other Departments of the University without
permission of the Authors.
Field of Application and Regulations
The course "Applied Machine Learning" syllabus lays down minimum requirements for student’s
knowledge and skills; it also provides description of both contents and forms of training and assessment in
use. The course isoffered to students of the Master’s Program "Big Data Systems" (area code 080500.68) in
the Faculty of Business Informatics of the National Research University "Higher School of Economics".
The course is a part of the curriculum pool of elective courses (1st year, M.2.B.1. Optional courses, M.2
Courses required by the M’s program of the 2014-2015 academic year’s curriculum), and it is a two-
module course (3rd
module and 4th
module). The duration of the course amounts to 48 class periods (both
lecture and practices) divided into 24 lecture hours and 24 practice hours. Besides, 96 academic hours are
set aside to students for self-studying activity.
The syllabus is intended for teachers responsible for the course (or closely related disciplines), teaching assistants, students enrolled in the course "Applied Machine Learning", as well as experts and statutory bodies carrying out assigned or regular accreditations, in accordance with the educational standards of the National Research University – Higher School of Economics and the curriculum ("Business Informatics", area code 38.04.05), Big Data Systems specialization, 1st year, 2014-2015 academic year.
1 Course Objectives
The main objective of the course is to present, examine and discuss with students the fundamentals and principles of machine learning. The course focuses on understanding the role of machine learning in big data analysis.
Generally, the objective of the course can be thought of as a combination of the following constituents:
familiarity with the peculiarities of supervised learning, parametric and multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, kernel machines and Bayesian estimation as applied areas of big data analysis,
understanding of the main notions of machine learning theory,
command of the framework of machine learning as one of the most significant areas of big data analysis,
understanding of the role of machine learning in big data analysis,
obtaining skills in utilizing machine learning in big data analysis.
2 Students' Competencies to be Developed by the Course
While mastering the course material, the student will
know the main notions of supervised learning, parametric and multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, kernel machines and Bayesian estimation,
acquire skills of big data analysis,
gain experience in big data analysis using the main notions of supervised learning, parametric and multivariate methods, dimensionality reduction, clustering, nonparametric methods, decision trees, linear discrimination, kernel machines and Bayesian estimation.
In short, the course contributes to the development of the following professional competencies:
Competence | FSES/HSE code | Descriptors – main mastering features (indicators of result achievement) | Training forms and methods contributing to the formation and development of the competence
Ability to offer concepts and models, invent and test methods and tools for professional work | SC-2 | Demonstrates | Lectures, practices, home tasks
Ability to apply the methods of system analysis and modeling to assess, design and develop the strategy of enterprise architecture | PC-13 | Owns and uses | Lectures, practices, home tasks
Ability to develop and implement economic and mathematical models to justify project solutions in the field of information and computer technology | PC-14 | Owns and uses | Lectures, practices, home tasks
Ability to organize individual and collective research work in the enterprise and manage it | PC-16 | Demonstrates | Lectures, practices, home tasks
3 The Course within the Program’s Framework
The course "Applied Machine Learning" syllabus lays down minimum requirements for student’s
knowledge and skills; it also provides description of both contents and forms of training and assessment in
use. The course is offered to students of the Master’s Program "Big Data Systems" (area code 080500.68)
in the Faculty of Business Informatics of the National Research University "Higher School of Economics".
The course is a part of the curriculum pool of required courses (1st year, M.2.B.1. Optional courses, M.2
Courses required by the M’s program of the 2014-2015 academic year’s curriculum), and it is a two-
module course (3rd
module and 4th
module). The duration of the course amounts to 48 class periods (both
lecture and practices) divided into 24 lecture hours and 24 practice hours. Besides, 96 academic hours are
set aside to students for self-studying activity.
Academic control forms include
2 home tasks done by students individually; each student has to prepare an electronic report (PDF format solely); all reports have to be submitted in the LMS and are checked and graded by the instructor on a ten-point scale by the end of the 3rd and the 4th modules,
a pass-fail examination, which implies a written test and computer-based problem solving.
The course is based on knowledge and skills acquired in the following courses:
Calculus
Linear Algebra
Probability Theory and Mathematical Statistics
Data Analysis
Economic and Mathematical Modeling
Discrete Mathematics
The course requires the following students' competencies and knowledge:
main definitions, theorems and properties from Calculus, Linear Algebra, Probability Theory and Mathematical Statistics, Data Analysis, Economic and Mathematical Modeling and Discrete Mathematics,
ability to communicate both orally and in writing in English,
ability to search for, process and analyze information from a variety of sources.
The main provisions of the course support the subsequent study of the following courses:
Risk analysis based on big data
Predictive Modeling
Marketing analytics based on big data
4 Thematic Course Contents
№ | Title of the topic / lecture | Hours (total) | Lectures | Practices | Independent work
3rd Module
1 | Supervised Learning | 12 | 2 | 2 | 8
2 | Bayesian Decision Theory | 12 | 2 | 2 | 8
3 | Parametric and Multivariate Methods | 12 | 2 | 2 | 8
4 | Dimensionality Reduction | 12 | 2 | 2 | 8
5 | Clustering | 12 | 2 | 2 | 8
3rd Module TOTAL | 60 | 10 | 10 | 40
4th Module
6 | Nonparametric Methods | 12 | 2 | 2 | 8
7 | Decision Trees | 12 | 2 | 2 | 8
8 | Linear Discrimination | 12 | 2 | 2 | 8
9 | Multilayer Perceptrons | 12 | 2 | 2 | 8
10 | Kernel Machines | 12 | 2 | 2 | 8
11 | Bayesian Estimation | 12 | 2 | 2 | 8
12 | Design and Analysis of Machine Learning Experiments | 12 | 2 | 2 | 8
4th Module TOTAL | 84 | 14 | 14 | 56
TOTAL | 144 | 24 | 24 | 96
5 Forms and Types of Testing
Type of control | Form of control | 1st year (week) | Department | Parameters
Current | Home task 1 | week 29 | Innovation and Business in Information Technology | problem solving, written report (paper)
Current | Home task 2 | week 40 | Innovation and Business in Information Technology | problem solving, written report (paper)
Resultant | Pass-fail exam | week 41 | Innovation and Business in Information Technology | written test (paper) and computer-based problem solving
Evaluation Criteria
Current and resultant grades are made up of the following components:
2 home tasks done by students individually; each student has to prepare an electronic report (PDF format solely). All reports have to be submitted in the LMS and are checked and graded by the instructor on a ten-point scale by the end of the 3rd and the 4th modules. All home tasks (HT) are assessed on the ten-point scale,
a pass-fail examination, which implies a written test (WT) and computer-based problem solving (CS).
Finally, the total course grade on the ten-point scale is obtained as
O(Total) = 0.6 * O(HT) + 0.1 * O(WT) + 0.3 * O(CS).
A grade of 4 or higher means successful completion of the course ("pass"), while a grade of 3 or lower means an unsuccessful result ("fail"). The concluding rounded grade O(Total) is then converted to a five-point scale grade.
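For illustration (the scores below are hypothetical, not taken from the syllabus): with O(HT) = 8, O(WT) = 6 and O(CS) = 7, the total is O(Total) = 0.6 * 8 + 0.1 * 6 + 0.3 * 7 = 4.8 + 0.6 + 2.1 = 7.5, which rounds to 8 and therefore counts as a "pass".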
6 Detailed Course Contents
Lecture 1. Supervised Learning
Examples of Machine Learning Applications. Learning Associations: Classification, Regression, Unsupervised Learning, Reinforcement Learning. Learning a Class from Examples. Vapnik-Chervonenkis (VC) Dimension. Probably Approximately Correct (PAC) Learning. Noise. Learning Multiple Classes. Regression. Model Selection and Generalization. Dimensions of a Supervised Machine Learning Algorithm.
Practice 1. Probably Approximately Correct (PAC) Learning. Noise. Learning
Multiple Classes. Regression. Model Selection and Generalization. Dimensions of a
Supervised Machine Learning Algorithm.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Angluin, D. 1988. “Queries and Concept Learning.” Machine Learning 2: 319–342.
2. Blumer, A., A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. 1989. “Learnability and the Vapnik-Chervonenkis Dimension.” Journal of the ACM 36: 929–965.
3. Dietterich, T. G. 2003. “Machine Learning.” In Nature Encyclopedia of Cognitive Science. London: Macmillan.
4. Hirsh, H. 1990. Incremental Version Space Merging: A General Framework for Concept Learning. Boston: Kluwer.
Lecture 2. Bayesian Decision Theory
Introduction. Classification. Losses and Risks. Discriminant Functions. Utility Theory. Association Rules.
Practice 2. Classification. Losses and Risks. Discriminant Functions. Utility
Theory. Association Rules.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Agrawal, R., H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. 1996. “Fast Discovery of Association Rules.” In Advances in Knowledge Discovery and Data Mining, ed. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307–328. Cambridge, MA: MIT Press.
2. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
3. Li, J. 2006. “On Optimal Rule Discovery.” IEEE Transactions on Knowledge and Data Engineering 18: 460–471.
4. Newman, J. R., ed. 1988. The World of Mathematics. Redmond, WA: Tempus.
5. Omiecinski, E. R. 2003. “Alternative Interest Measures for Mining Associations in Databases.” IEEE Transactions on Knowledge and Data Engineering 15: 57–69.
6. Russell, S., and P. Norvig. 1995. Artificial Intelligence: A Modern Approach. New York: Prentice Hall.
7. Shafer, G., and J. Pearl, eds. 1990. Readings in Uncertain Reasoning. San Mateo, CA: Morgan Kaufmann.
8. Zhang, C., and S. Zhang. 2002. Association Rule Mining: Models and Algorithms. New York: Springer.
Lecture 3. Parametric and Multivariate Methods
Maximum Likelihood Estimation: Bernoulli Density, Multinomial Density, Gaussian (Normal) Density.
Evaluating an Estimator: Bias and Variance. The Bayes’ Estimator. Parametric Classification. Regression.
Tuning Model Complexity: Bias/Variance Dilemma. Model Selection Procedures.
Multivariate Data. Parameter Estimation. Estimation of Missing Values. Multivariate Normal Distribution.
Multivariate Classification. Tuning Complexity. Discrete Features. Multivariate Regression.
Practice 3. Maximum Likelihood Estimation. Multivariate Classification. Tuning Complexity. Discrete Features. Multivariate Regression.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
2. Friedman, J. H. 1989. “Regularized Discriminant Analysis.” Journal of the American Statistical Association 84: 165–175.
3. Harville, D. A. 1997. Matrix Algebra from a Statistician’s Perspective. New York: Springer.
4. Manning, C. D., and H. Schutze. 1999. Foundations of Statistical Natural Language Processing.
Cambridge, MA: MIT Press.
5. McLachlan, G. J. 1992. Discriminant Analysis and Statistical Pattern Recognition. New York:
Wiley.
6. Rencher, A. C. 1995. Methods of Multivariate Analysis. New York: Wiley.
7. Strang, G. 1988. Linear Algebra and its Applications, 3rd ed. New York: Harcourt Brace Jovanovich.
Lecture 4. Dimensionality Reduction
Subset Selection. Principal Components Analysis. Factor Analysis. Multidimensional Scaling. Linear Discriminant Analysis. Isomap. Locally Linear Embedding.
Practice 4. Principal Components Analysis. Factor Analysis. Multidimensional
Scaling. Linear Discriminant Analysis.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Balasubramanian, M., E. L. Schwartz, J. B. Tenenbaum, V. de Silva, and J. C. Langford. 2002.
“The Isomap Algorithm and Topological Stability.” Science 295: 7.
2. Chatfield, C., and A. J. Collins. 1980. Introduction to Multivariate Analysis. London: Chapman
and Hall.
3. Cox, T. F., and M. A. A. Cox. 1994. Multidimensional Scaling. London: Chapman and Hall.
4. Devijer, P. A., and J. Kittler. 1982. Pattern Recognition: A Statistical Approach. New York:
Prentice-Hall.
5. Flury, B. 1988. Common Principal Components and Related Multivariate Models. New York:
Wiley.
6. Fukunaga, K., and P. M. Narendra. 1977. “A Branch and Bound Algorithm for Feature Subset
Selection.” IEEE Transactions on Computers C-26: 917–922.
Lecture 5. Clustering
Mixture Densities. k-Means Clustering. Expectation-Maximization Algorithm. Mixtures of Latent Variable
Models. Supervised Learning after Clustering. Hierarchical Clustering. Choosing the Number of Clusters.
Practice 5. Mixture Densities. k-Means Clustering. Expectation-Maximization
Algorithm. Mixtures of Latent Variable Models.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Alpaydın, E. 1998. “Soft Vector Quantization and the EM Algorithm.” Neural Networks 11: 467–477.
2. Barrow, H. B. 1989. “Unsupervised Learning.” Neural Computation 1: 295–311.
3. Bezdek, J. C., and N. R. Pal. 1995. “Two Soft Relatives of Learning Vector Quantization.” Neural Networks 8: 729–743.
4. Bishop, C. M. 1999. “Latent Variable Models.” In Learning in Graphical Models, ed. M. I. Jordan, 371–403. Cambridge, MA: MIT Press.
5. Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society B 39: 1–38.
6. Gersho, A., and R. M. Gray. 1992. Vector Quantization and Signal Compression. Boston:
Kluwer.
7. Ghahramani, Z., and G. E. Hinton. 1997. The EM Algorithm for Mixtures of Factor Analyzers. Technical Report CRG TR-96-1, Department of Computer Science, University of Toronto.
Lecture 6. Nonparametric Methods
Nonparametric Density Estimation. Histogram Estimator: Kernel Estimator, k-Nearest Neighbor Estimator. Generalization to Multivariate Data. Nonparametric Classification. Condensed Nearest Neighbor. Nonparametric Regression: Smoothing Models, Running Mean Smoother, Kernel Smoother, Running Line Smoother. How to Choose the Smoothing Parameter.
Practice 6. Nonparametric Density Estimation. Nonparametric Regression.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Aha, D. W., ed. 1997. Special Issue on Lazy Learning, Artificial Intelligence Review 11(1–5): 7–
423.
2. Aha, D. W., D. Kibler, and M. K. Albert. 1991. “Instance-Based Learning Algorithms.” Machine Learning 6: 37–66.
3. Atkeson, C. G., A. W. Moore, and S. Schaal. 1997. “Locally Weighted Learning.” Artificial Intelligence Review 11: 11–73.
4. Cover, T. M., and P. E. Hart. 1967. “Nearest Neighbor Pattern Classification.” IEEE Transactions on Information Theory 13: 21–27.
5. Dasarathy, B. V. 1991. Nearest Neighbor Norms: NN Pattern Classification Techniques. Los
Alamitos, CA: IEEE Computer Society Press.
6. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
7. Geman, S., E. Bienenstock, and R. Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4: 1–58.
Lecture 7. Linear Discrimination
Generalizing the Linear Model. Geometry of the Linear Discriminant: Two Classes, Multiple Classes.
Pairwise Separation. Parametric Discrimination Revisited. Gradient Descent. Logistic Discrimination: Two
Classes, Multiple Classes. Discrimination by Regression.
Practice 7. Generalizing the Linear Model. Geometry of the Linear Discriminant. Logistic Discrimination.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Aizerman, M. A., E. M. Braverman, and L. I. Rozonoer. 1964. “Theoretical Foundations
of the Potential Function Method in Pattern Recognition Learning.” Automation and Remote Control 25:
821–837.
2. Anderson, J. A. 1982. “Logistic Discrimination.” In Handbook of Statistics, Vol. 2, Classification, Pattern Recognition and Reduction of Dimensionality, ed. P. R. Krishnaiah and L. N. Kanal, 169–191. Amsterdam: North Holland.
3. Bridle, J. S. 1990. “Probabilistic Interpretation of Feedforward Classification Network Outputs
with Relationships to Statistical Pattern Recognition.” In Neurocomputing: Algorithms, Architectures and
Applications, ed. F. Fogelman-Soulie and J. Herault, 227–236. Berlin: Springer.
4. Duda, R. O., P. E. Hart, and D. G. Stork. 2001. Pattern Classification, 2nd ed. New York: Wiley.
5. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. London: Chapman and Hall.
Lecture 8. Multilayer Perceptrons
Introduction: Understanding the Brain, Neural Networks as a Paradigm for Parallel Processing. The Perceptron. Training a Perceptron. Learning Boolean Functions. Multilayer Perceptrons. MLP as a Universal Approximator. Backpropagation Algorithm: Nonlinear Regression, Two-Class Discrimination, Multiclass Discrimination, Multiple Hidden Layers. Training Procedures: Improving Convergence, Overtraining, Structuring the Network, Hints. Tuning the Network Size. Bayesian View of Learning. Dimensionality Reduction. Learning Time. Time Delay Neural Networks. Recurrent Networks.
Practice 8. Backpropagation Algorithm. Training Procedures.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Abu-Mostafa, Y. 1995. “Hints.” Neural Computation 7: 639–671.
2. Aran, O., O. T. Yıldız, and E. Alpaydın. 2009. “An Incremental Framework Based on Cross-
Validation for Estimating the Architecture of a Multilayer Perceptron.” International Journal of Pattern
Recognition and Artificial Intelligence 23: 159–190.
3. Ash, T. 1989. “Dynamic Node Creation in Backpropagation Networks.” Connection Science 1:
365–375.
4. Battiti, R. 1992. “First- and Second-Order Methods for Learning: Between Steepest Descent and
Newton’s Method.” Neural Computation 4: 141–166.
5. Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
6. Bourlard, H., and Y. Kamp. 1988. “Auto-Association by Multilayer Perceptrons and Singular Value Decomposition.” Biological Cybernetics 59: 291–294.
Lecture 10. Kernel Machines
Optimal Separating Hyperplane. The Nonseparable Case: Soft Margin Hyperplane. ν-SVM. Kernel Trick. Vectorial Kernels. Defining Kernels. Multiple Kernel Learning. Multiclass Kernel Machines. Kernel Machines for Regression. One-Class Kernel Machines. Kernel Dimensionality Reduction.
Practice 10. The Nonseparable Case: Soft Margin Hyperplane. ν-SVM. Multiclass Kernel Machines. Kernel Machines for Regression.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Allwein, E. L., R. E. Schapire, and Y. Singer. 2000. “Reducing Multiclass to Binary: A Unifying
Approach for Margin Classifiers.” Journal of Machine Learning Research 1: 113–141.
2. Burges, C. J. C. 1998. “A Tutorial on Support Vector Machines for Pattern Recognition.” Data
Mining and Knowledge Discovery 2: 121–167.
3. Chang, C.-C., and C.-J. Lin. 2008. LIBSVM: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
4. Cherkassky, V., and F. Mulier. 1998. Learning from Data: Concepts, Theory, and Methods. New
York: Wiley.
5. Cortes, C., and V. Vapnik. 1995. “Support Vector Networks.” Machine Learning 20: 273–297.
6. Dietterich, T. G., and G. Bakiri. 1995. “Solving Multiclass Learning Problems via Error-
Correcting Output Codes.” Journal of Artificial Intelligence Research 2: 263–286.
7. Gonen, M., and E. Alpaydın. 2008. “Localized Multiple Kernel Learning.” In 25th International
Conference on Machine Learning, ed. A. McCallum and S. Roweis, 352–359. Madison, WI: Omnipress.
Lecture 11. Bayesian Estimation
Estimating the Parameter of a Distribution: Discrete Variables, Continuous Variables. Bayesian Estimation
of the Parameters of a Function: Regression, The Use of Basis/Kernel Functions, Bayesian Classification.
Gaussian Processes.
Practice 11. Estimating the Parameter of a Distribution. Bayesian Estimation of
the Parameters of a Function. Gaussian Processes.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Bishop, C. M. 2006. Pattern Recognition and Machine Learning. New York: Springer.
2. Figueiredo, M. A. T. 2003. “Adaptive Sparseness for Supervised Learning.” IEEE Transactions
on Pattern Analysis and Machine Intelligence 25: 1150–1159.
3. Gelman, A. 2008. “Objections to Bayesian statistics.” Bayesian Statistics 3: 445–450.
4. MacKay, D. J. C. 1998. “Introduction to Gaussian Processes.” In Neural Networks and Machine
Learning, ed. C. M. Bishop, 133–166. Berlin: Springer.
5. MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge,
UK: Cambridge University Press.
6. Rasmussen, C. E., and C. K. I. Williams. 2006. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.
7. Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal
Statistical Society B 58: 267–288.
Lecture 12. Design and Analysis of Machine Learning Experiments
Factors, Response, and Strategy of Experimentation. Response Surface Design. Randomization, Replication, and Blocking. Guidelines for Machine Learning Experiments. Cross-Validation and Resampling Methods: K-Fold Cross-Validation, 5×2 Cross-Validation, Bootstrapping. Measuring Classifier Performance. Interval Estimation. Hypothesis Testing. Assessing a Classification Algorithm’s Performance: Binomial Test, Approximate Normal Test, t Test. Comparing Two Classification Algorithms: McNemar’s Test, K-Fold Cross-Validated Paired t Test, 5 × 2 cv Paired t Test, 5 × 2 cv Paired F Test. Comparing Multiple Algorithms: Analysis of Variance. Comparison over Multiple Datasets: Comparing Two Algorithms, Multiple Algorithms.
Practice 12. Cross-Validation and Resampling Methods. Assessing a Classification Algorithm’s Performance.
Materials required
1. Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
Recommended readings
1. Alpaydın, E. 1999. “Combined 5 × 2 cv F Test for Comparing Supervised Classification Learning Algorithms.” Neural Computation 11: 1885–1892.
2. Bouckaert, R. R. 2003. “Choosing between Two Learning Algorithms based on Calibrated
Tests.” In Twentieth International Conference on Machine Learning, ed. T. Fawcett and N. Mishra, 51–58.
Menlo Park, CA: AAAI Press.
3. Demsar, J. 2006. “Statistical Comparison of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research 7: 1–30.
4. Dietterich, T. G. 1998. “Approximate Statistical Tests for Comparing Supervised Classification
Learning Algorithms.” Neural Computation 10: 1895–1923.
5. Fawcett, T. 2006. “An Introduction to ROC Analysis.” Pattern Recognition Letters 27: 861–874.
6. Montgomery, D. C. 2005. Design and Analysis of Experiments, 6th ed. New York: Wiley.
7. Ross, S. M. 1987. Introduction to Probability and Statistics for Engineers and Scientists. New
York: Wiley.
7 Educational Technology
During classes, various types of active methods are used: analysis of practical problems, group work, computer simulations in the computational software program Mathematica 10.0, and distance learning with use of the LMS.
8 Methods and Materials for Current Control and Attestation
8.1 Example of Problems for Home Tasks
Problem 1. Imagine you have two possibilities: you can fax a document, that is, send the image, or you can use an optical character reader (OCR) and send the text file. Discuss the advantages and disadvantages of the two approaches in a comparative manner. When would one be preferable over the other?
Problem 2. Somebody tosses a fair coin and if the result is heads, you get nothing; otherwise you get $5.
How much would you pay to play this game? What if the win is $500 instead of $5?
Problem 3. Show that as we move an item from the consequent to the antecedent, confidence can never decrease: confidence(ABC → D) ≥ confidence(AB → CD).
Problem 4. Write the code that generates a normal sample with given μ and σ, and the code that calculates
m and s from the sample. Do the same using the Bayes’ estimator assuming a prior distribution for μ.
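A minimal sketch of a possible solution path for Problem 4. The course software is Mathematica 10.0 (section 9.5); Python with NumPy is used here purely for illustration, and all parameter values (mu, sigma, N, mu0, sigma0) are arbitrary assumptions, not values prescribed by the task:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 2.0, 1.5, 1000          # arbitrary illustrative values
x = rng.normal(mu, sigma, N)           # generate the normal sample

m = x.mean()                           # maximum likelihood estimate of mu
s = x.std()                            # maximum likelihood estimate of sigma

# Bayes' estimator of mu under a N(mu0, sigma0^2) prior with sigma known:
# the posterior mean is a precision-weighted average of the sample mean and mu0.
mu0, sigma0 = 0.0, 1.0                 # arbitrary prior parameters
w = (N / sigma**2) / (N / sigma**2 + 1 / sigma0**2)
mu_bayes = w * m + (1 - w) * mu0

print(m, s, mu_bayes)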
Problem 5. In Isomap, instead of using Euclidean distance, we can also use Mahalanobis distance between
neighboring points. What are the advantages and disadvantages of this approach, if any?
Problem 6. In image compression, k-means can be used as follows: the image is divided into nonoverlapping c×c windows, and these c²-dimensional vectors make up the sample. For a given k, which is generally a power of two, we do k-means clustering. The reference vectors and the indices for each window are sent over the communication line. At the receiving end, the image is then reconstructed by reading from the table of reference vectors using the indices. Write a computer program that does this for different values of k and c. For each case, calculate the reconstruction error and the compression rate.
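A sketch of Problem 6 under the same caveat (Python with NumPy and scikit-learn instead of the course's Mathematica; the input file name image.npy and the values of c and k are hypothetical choices):

import numpy as np
from sklearn.cluster import KMeans

img = np.load("image.npy").astype(float)    # hypothetical H x W grayscale image
c, k = 4, 16                                # window size and codebook size (arbitrary)

H, W = (img.shape[0] // c) * c, (img.shape[1] // c) * c
windows = (img[:H, :W]
           .reshape(H // c, c, W // c, c)
           .swapaxes(1, 2)
           .reshape(-1, c * c))             # nonoverlapping c x c windows as c^2-vectors

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(windows)
codebook, idx = km.cluster_centers_, km.labels_      # what the sender transmits

recon = (codebook[idx]                      # receiver reconstructs by table lookup
         .reshape(H // c, W // c, c, c)
         .swapaxes(1, 2)
         .reshape(H, W))

mse = np.mean((img[:H, :W] - recon) ** 2)                          # reconstruction error
rate = (idx.size * np.log2(k) + codebook.size * 8) / (H * W * 8)   # compressed / raw bits
print(mse, rate)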
Problem 7. In the running smoother, we can fit a constant, a line, or a higher-degree polynomial at a test
point. How can we choose between them?
Problem 8. What is the implication of the use of a single η for all xj in gradient descent?
Problem 9. Consider an MLP architecture with one hidden layer where there are also direct weights from the inputs to the output units. Explain when such a structure would be helpful and how it can be trained.
Problem 10. Incremental learning of the structure of an MLP can be viewed as a state space search. What are the operators? What is the goodness function? What type of search strategies are appropriate? Define these in such a way that dynamic node creation and cascade-correlation are special instantiations.
8.2 Questions for the Pass-Fail Examination
Theoretical Questions
1. Examples of Machine Learning Applications.
2. Learning Associations: Classification, Regression, Unsupervised Learning, Reinforcement Learning. Learning a Class from Examples. Vapnik-Chervonenkis (VC) Dimension. Probably Approximately Correct (PAC) Learning. Noise.
3. Learning Multiple Classes. Regression. Model Selection and Generalization. Dimensions of a Supervised Machine Learning Algorithm.
4. Introduction. Classification. Losses and Risks. Discriminant Functions. Utility Theory. Association
Rules.
5. Maximum Likelihood Estimation: Bernoulli Density, Multinomial Density, Gaussian (Normal)
Density.
6. Evaluating an Estimator: Bias and Variance. The Bayes’ Estimator. Parametric Classification. Regression.
7. Tuning Model Complexity: Bias/Variance Dilemma.
8. Model Selection Procedures.
9. Multivariate Data. Parameter Estimation. Estimation of Missing Values. Multivariate Normal Distribution. Multivariate Classification. Tuning Complexity. Discrete Features. Multivariate Regression.
10. Subset Selection. Principal Components Analysis.
11. Factor Analysis.
12. Multidimensional Scaling.
13. Linear Discriminant Analysis. Isomap. Locally Linear Embedding.
14. Mixture Densities. k-Means Clustering. Expectation-Maximization Algorithm.
15. Mixtures of Latent Variable Models.
16. Supervised Learning after Clustering. Hierarchical Clustering. Choosing the Number of Clusters.
17. Nonparametric Density Estimation. Histogram Estimator: Kernel Estimator, k-Nearest Neighbor
Estimator.
18. Generalization to Multivariate Data. Nonparametric Classification. Condensed Nearest Neighbor.
19. Nonparametric Regression: Smoothing Models, Running Mean Smoother, Kernel Smoother, Running Line Smoother. How to Choose the Smoothing Parameter.
20. Generalizing the Linear Model. Geometry of the Linear Discriminant: Two Classes, Multiple Classes.
21. Pairwise Separation. Parametric Discrimination Revisited. Gradient Descent. Logistic Discrimination: Two Classes, Multiple Classes. Discrimination by Regression.
22. Understanding the Brain, Neural Networks as a Paradigm for Parallel Processing.
23. The Perceptron. Training a Perceptron. Learning Boolean Functions. Multilayer Perceptrons. MLP
as a Universal Approximator.
24. Backpropagation Algorithm: Nonlinear Regression, Two-Class Discrimination, Multiclass Discrimination, Multiple Hidden Layers.
25. Training Procedures: Improving Convergence, Overtraining, Structuring the Network, Hints.
26. Tuning the Network Size. Bayesian View of Learning. Dimensionality Reduction.
27. Learning Time. Time Delay Neural Networks. Recurrent Networks.
28. Optimal Separating Hyperplane.
29. The Nonseparable Case: Soft Margin Hyperplane, ν-SVM.
30. Kernel Trick. Vectorial Kernels. Defining Kernels.
31. Multiple Kernel Learning. Multiclass Kernel Machines.
32. Kernel Machines for Regression. One-Class Kernel Machines.
33. Kernel Dimensionality Reduction.
34. Estimating the Parameter of a Distribution: Discrete Variables, Continuous Variables.
35. Bayesian Estimation of the Parameters of a Function: Regression, The Use of Basis/Kernel Functions, Bayesian Classification. Gaussian Processes.
36. Factors, Response, and Strategy of Experimentation. Response Surface Design. Randomization,
Replication, and Blocking.
37. Guidelines for Machine Learning Experiments.
38. Cross-Validation and Resampling Methods: K-Fold Cross-Validation, 5×2 Cross-Validation, Bootstrapping.
39. Measuring Classifier Performance. Interval Estimation. Hypothesis Testing. Assessing a Classification Algorithm’s Performance: Binomial Test, Approximate Normal Test, t Test.
40. Comparing Two Classification Algorithms: McNemar’s Test, K-Fold Cross-Validated Paired t Test,
5 × 2 cv Paired t Test, 5 × 2 cv Paired F Test.
41. Comparing Multiple Algorithms: Analysis of Variance. Comparison over Multiple Datasets: Comparing Two Algorithms, Multiple Algorithms.
Examples of Problems
Problem 1. In a two-class problem, let us say we have the loss matrix where λ11 = λ22 = 0, λ21 = 1 and λ12 = α. Determine the threshold of decision as a function of α.
Problem 2. The K-fold cross-validated t test only tests for the equality of error rates. If the test rejects, we do not know which classification algorithm has the lower error rate. How can we test whether the first classification algorithm does not have a higher error rate than the second one? Hint: We have to test H0: μ ≤ 0 vs. H1: μ > 0.
Problem 3. If we have two variants of algorithm A and three variants of algorithm B, how can we compare
the overall accuracies of A and B taking all their variants into account?
9 Teaching Methods and Information Provision
9.1 Core Textbook
Alpaydin, E. 2010. Introduction to Machine Learning, 2nd ed. Cambridge, MA: MIT Press.
9.2 Required Reading
Han, J., and M. Kamber. 2006. Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan
Kaufmann.
Leahey, T. H., and R. J. Harris. 1997. Learning and Cognition, 4th ed. New York: Prentice Hall.
Witten, I. H., and E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann.
9.3 Supplementary Reading
Dietterich, T. G. 2003. “Machine Learning.” In Nature Encyclopedia of Cognitive Science. London: Macmillan.
Hirsh, H. 1990. Incremental Version Space Merging: A General Framework for Concept Learning. Boston: Kluwer.
Kearns, M. J., and U. V. Vazirani. 1994. An Introduction to Computational Learning Theory. Cambridge,
MA: MIT Press.
Mitchell, T. 1997. Machine Learning. New York: McGraw-Hill.
Valiant, L. 1984. “A Theory of the Learnable.” Communications of the ACM 27: 1134–1142.
Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. New York: Springer.
Winston, P. H. 1975. “Learning Structural Descriptions from Examples.” In The Psychology of Computer
Vision, ed. P. H. Winston, 157–209. New York: McGraw-Hill.
9.4 Handbooks
Handbook of Statistics, Vol. 2, Classification, Pattern Recognition and Reduction of Dimensionality,
ed. P. R. Krishnaiah and L. N. Kanal, 1982, Amsterdam: North Holland.
9.5 Software
Mathematica v. 10.0
9.6 Distance Learning
MIT OpenCourseWare (Machine Learning)
HSE Learning Management System
10 Technical Provision
Computer, projector (for lectures or practice), computer class