
Statistical Machine Learning
Lecture 1 – Introduction

Johan Wågberg
Division of Systems and Control
Department of Information Technology
Uppsala University

[email protected]
www.it.uu.se/katalog/johwa152


What is the course about?


Machine learning

"Machine learning is about learning, reasoningand acting based on data."

“It is one of today’s most rapidly growing technical fields, lying at theintersection of computer science and statistics, and at the core of

artificial intelligence and data science.”

Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature 521:452-459, 2015.

Jordan, M. I. and Mitchell, T. M. Machine Learning: Trends, perspectives and prospects. Science, 349(6245):255-260, 2015.


This course

What is this course about? Supervised machine learning

In one sentence:

Methods for automatically learning (training, estimating, ...) a model for the relationship between
• the input x, and
• the output y

from observed training data

$\mathcal{T} \overset{\text{def}}{=} \{(y_1, \mathbf{x}_1), (y_2, \mathbf{x}_2), \ldots, (y_n, \mathbf{x}_n)\}.$

Seems dull...?! Can this be useful?
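As a small preview, here is a minimal fit/predict sketch of exactly this setup in scikit-learn: a model is learned from observed (x, y) pairs and then used to predict the output for a new input. The data below is synthetic and purely illustrative.

```python
# Minimal supervised-learning sketch: learn a model of y from x on
# synthetic training data T = {(y_i, x_i)}_{i=1}^n, then predict.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))               # inputs x, one feature per row
y = 2.0 * x[:, 0] + 1.0 + rng.normal(0, 0.5, 100)   # outputs y, with noise

model = LinearRegression()
model.fit(x, y)                      # "learning" = estimating the model from T
print(model.predict([[5.0]]))        # prediction for a previously unseen input
```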


ex) Cancer diagnosis

Systems for detecting cell divisions (mitosis) in histology images can be used to improve (or automate) cancer diagnosis.

[Slide shows Fig. 1 and an excerpt from Cireşan et al.: a 4-megapixel high-power-field histology image with detected mitoses circled (true and false positives, plus missed mitoses), together with an overview of their detection approach: a convolutional neural network classifies a square window of RGB values around each pixel as mitosis or non-mitosis, with rotated and mirrored windows added as extra training instances.]

• Learn a model with
  input x = RGB histology image (pixel values)
  output y = number and locations (in the image) of mitosis detections
• Training data: Histology images labeled by experts.
• Uses Deep Learning to model f (Lectures 8–9)

D. C. Cireşan, A. Giusti, L. M. Gambardella and J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks. In Medical Image Computing and Computer Assisted Intervention, 411-418, 2013.


ex) Safe aging

SPHERE is a large UK-based research project for helping elderly people to live safely at home.

Machine learning problem: Non-intrusive activity recognition.

• Learn a model with
  input x = accelerometer, IR & positioning sensor data
  output y = “sitting”, “walking”, “ascending stairs”, etc.
• Boosting (Lecture 7) among the most successful methods.

N. Twomey, T. Diethe, M. Kull, H. Song, M. Camplani, S. Hannuna, X. Fafoutis, N. Zhu, P. Woznowski, P. Flach, and I. Craddock. The SPHERE Challenge: Activity Recognition with Multimodal Sensor Data. arXiv:1603.00797, 2016.
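As a rough illustration of how such an activity classifier could look in code, here is a minimal boosting sketch in scikit-learn. The sensor features, labels and model choice (GradientBoostingClassifier) are stand-ins for illustration, not the SPHERE challenge setup.

```python
# Hypothetical sketch: a boosted classifier for activity recognition.
# Synthetic 6-dimensional "sensor feature" vectors stand in for real
# accelerometer/IR/positioning features; labels are activity classes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600
X = rng.normal(size=(n, 6))                                  # x: per-window sensor features
y = np.where(X[:, 0] + X[:, 1] > 0, "walking", "sitting")    # y: activity label (made up)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)           # boosting (Lecture 7)
print("held-out accuracy:", clf.score(X_te, y_te))
```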


ex) Colorization (I/II)

[Slide shows Fig. 2 and an excerpt from Zhang et al.: a CNN (blocks of conv + ReLU layers followed by BatchNorm, no pooling) maps a grayscale lightness channel to a distribution over quantized color values in CIE Lab space. The excerpt explains why the plain Euclidean loss $L_2(\hat{Y}, Y) = \tfrac{1}{2}\sum_{h,w}\lVert \hat{Y}_{h,w} - Y_{h,w}\rVert_2^2$ between predicted and true ab channels is not robust to the multimodal nature of colorization and favours grayish, desaturated results.]

Task: Colorize gray-scale photos.
• Learn a model with
  input x = gray-scale pixel values.
  output y = color pixel values (Lab).

• Typical task for deep learning (Lectures 8–9).

R. Zhang, P. Isola, and A. A. Efros. Colorful Image Colorization. ECCV, 2016.


ex) Colorization (II/II)

[Slide shows Fig. 8 from Zhang et al.: the method applied to legacy black-and-white photos (a 1936 photo by David Fleay of a thylacine, an Ansel Adams photo of Yosemite, an amateur family photo from 1956, and Dorothea Lange's Migrant Mother, 1936), together with the paper's observation that the model colorizes real legacy photographs well even though it was trained on “fake” grayscale images made by stripping the ab channels from modern color photos.]

Model applied to legacy grayscale photos.


ex) Predicting conflicts

• Predicting the risk of violent conflicts across Africa
• Learn a model with
  input x = conflict history, protests, population, economic indicators, geography, ...
  output y = risk of violent conflict

• Random forests (Lecture 6) used

https://www.pcr.uu.se/research/views
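A minimal sketch of the idea, with made-up features standing in for the real conflict indicators; a random forest's predict_proba gives the kind of risk estimate the slide describes.

```python
# Hypothetical sketch: a random forest that outputs a conflict-risk
# probability from a feature vector (conflict history, population,
# economic indicators, ...). All data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))                 # x: region/month features (invented)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 1).astype(int)  # y: conflict (1) or not (0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
x_new = rng.normal(size=(1, 5))
print("estimated risk of conflict:", forest.predict_proba(x_new)[0, 1])  # P(y = 1 | x)
```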


ex) Autonomous driving

• Recognizing objects in a street view
• Learn a model with
  input x = the RGB values of the image
  output y = the location of pedestrians in the image

• Deep learning (Lectures 8–9)


ex) Higgs Machine Learning Challenge

HiggsML: Crowd-sourcing initiative by CERN (hosted at Kaggle)

• Separate H → ττ from background noise.
• Learn a model with
  - input x = 30-dimensional vector of “features” recorded during the experiment.
  - output y = “signal” or “background”
• Bagging (Lecture 6), Boosting (Lecture 7) and Neural Networks (Lectures 8–9) among the winning methods.

C. Adam-Bourdarios, G. Cowan, C. Germain, I. Guyon, B. Kégl and D. Rousseau. The Higgs boson machine learning challenge. NIPS 2014 Workshop on High-energy Physics and Machine Learning, 2014.


Statistical machine learning

Why the word “statistical” in the course title?

• Probability theory is used to define the models.
• Statistical tools are used to learn the models from training data.

Allows us to reason about the uncertainties in the data, models, predictions, etc.!


Numerical and categorical variables

Both input variables (x) and output variables (y) can be either numerical or categorical.

• Numerical (quantitative) variables represent numbers (real numbers, integer values, ...).
• Categorical (qualitative) variables take on values in one of K distinct classes, e.g. “true or false”, “disease type A, B or C”.

We will mostly use integer coding (like {1, 2, 3, 4} or {0, 1}), but the coding (or labeling) of qualitative variables is arbitrary and unordered.
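For instance, scikit-learn's LabelEncoder assigns such an arbitrary integer coding automatically; a tiny sketch with invented class names:

```python
# Sketch: integer coding of a categorical (qualitative) variable.
# The particular integers are arbitrary and carry no ordering.
from sklearn.preprocessing import LabelEncoder

y = ["disease A", "disease C", "disease B", "disease A"]
enc = LabelEncoder().fit(y)
print(enc.transform(y))        # e.g. [0 2 1 0] -- the coding is arbitrary
print(list(enc.classes_))      # ['disease A', 'disease B', 'disease C']
```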


Regression vs. classification

We will distinguish between two types of problems: regression and classification.

Regression is when the output y is numerical, e.g.
• Climate models (y = “increase in global temperature”)
• Economic models (y = “change in GDP”)

Classification is when the output y is categorical, e.g.
• Spam filters (y ∈ {spam, good email})
• Diagnosis systems (y ∈ {ALL, AML, CLL, CML, no leukemia})
• Fingerprint verification (y ∈ {match, no match})
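The distinction shows up directly in code: the same fit/predict pattern is used, but a regression model returns a number while a classification model returns a class label. A minimal sketch with made-up data:

```python
# Sketch: the same fit/predict interface covers both problem types.
# Regression: numerical output y.  Classification: categorical output y.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

y_num = np.array([1.2, 1.9, 3.1, 3.9])               # numerical output -> regression
reg = LinearRegression().fit(X, y_num)
print(reg.predict([[2.5]]))                           # a real-valued prediction

y_cat = np.array(["spam", "spam", "good", "good"])    # categorical output -> classification
clf = LogisticRegression().fit(X, y_cat)
print(clf.predict([[2.5]]))                           # a class label
```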


Aim of the course

What will we learn in this course?
• Use various methods for solving regression and classification problems, ranging from fundamental (linear regression) to state-of-the-art (deep learning, boosting, ...)
• Identify and apply suitable methods to a given problem
• Evaluate the performance of a method and rationally choose between different competing models and methods
• Work with real data, reason about data representations and how data is used in realistic machine learning applications


Course information


Lecture outline

1. Introduction
2. Linear regression, regularization
   - Introduction to Python & scikit-learn
3. Classification, logistic regression
4. Classification, LDA, QDA, k-NN
5. Bias-variance trade-off, cross validation
6. Tree-based methods, bagging
7. Boosting
8. Deep learning I
9. Deep learning II
10. Summary and guest lecture by SEB

“Warm-up videos” for each lecture linked from the home page. Watch before coming to the lecture (the night before, or so).


Course elements

• 10 lectures + 1 introductory lecture to Python
• 10 problem solving sessions
• 1 mini project (3–4 students, written report)
• 1 computer lab (4 h, no report)
• Complete course information (including lecture slides) is available from the course home page: www.it.uu.se/edu/course/homepage/sml


Problem solving sessions

10 problem solving sessions:
• Solve problems, discuss and ask questions! (“räknestuga”)
• 5 pen-and-paper sessions
• 5 computer-based sessions (using Python)
• Feel free to use your own laptop – Python is freely available
• Exercises available via the homepage

Each session is given at two times, with two groups in parallel. The computer sessions are scheduled with three groups in a normal classroom and one in a computer room.

A great opportunity to discuss and ask questions!


Examination

Mini project:
• Solved in groups of 3 or 4 students (sign-up open now)
• Written report (deadline: see home page)
• Peer review: read and review another group’s report (anonymously)
• Material most relevant for the mini project presented at lectures 3–7, but you can start working on the solution after lecture 3
• Graded U/G/VG. Grade VG is for projects of notable quality.

Lab:
• 4 h computer lab, solved in groups of 2 students, graded U/G
• 5 sessions available – sign up for one of these
• Solve the preparatory exercises before the lab session!

Written exam:
• Written pen-and-paper exam, graded as U, 3, 4, or 5
• You can bring one (1) page with your own notes
• Old exams are on the course home page


Examination

Grading:

• To pass the course you need to pass the mini-project, lab and exam.

• If you pass mini-project and lab, your exam grade will also be your course grade.

• Grade VG on the mini-project will increase your course grade by one step, but not from U to 3.

Course grade as a function of exam grade and mini-project grade:

Mini-project \ Exam grade    U    3    4    5
  U                          U    U    U    U
  G                          U    3    4    5
  VG                         U    4    5    5


Course literature

Supervised Machine Learning
Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön

Book draft (∼140 pages), will be published in 2021, available as a PDF via smlbook.org. It is possible to give feedback on the book’s GitHub page.

Additional reading suggestions: extensive list on the home page.


Teachers

Teachers involved in the course (in approximate order of appearance):

Johan Wågberg, Room: 2323
Thomas Schön, Room: 2209
Niklas Wahlström, Room: 2319
Carl Andersson, Room: 2353
David Widmann, Room: 2303
Carmen Lee

All room numbers are at ITC Polacksbacken. You can reach us by email: <firstname.lastname>@it.uu.se.


Statistical Machine Learning


Probability theory refresher

• If $z$ is a random variable with PDF $p(z)$, then the expected value or mean of $z$ is given by
  $\mathrm{E}[z] = \int z\, p(z)\, \mathrm{d}z$.
  More generally, $\mathrm{E}[g(z)] = \int g(z)\, p(z)\, \mathrm{d}z$.

• Let $\mu = \mathrm{E}[z]$. The variance of $z$ is defined as
  $\mathrm{Var}[z] = \mathrm{E}[(z - \mu)^2] = \mathrm{E}[z^2] - \mu^2$.

• If $z_1$ and $z_2$ are independent random variables, then $p(z_1, z_2) = p(z_1)\, p(z_2)$ and $\mathrm{E}[z_1 z_2] = \mathrm{E}[z_1]\, \mathrm{E}[z_2]$.
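These properties are easy to sanity-check numerically. A short Monte Carlo sketch with NumPy (sample sizes and distributions chosen arbitrarily):

```python
# Sketch: checking E[z], Var[z] and E[z1*z2] = E[z1]E[z2] (independence)
# by Monte Carlo simulation.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=1_000_000)      # z with mean 2, std 1.5

print(z.mean())                                # ~ E[z] = 2
print(z.var(), (z**2).mean() - z.mean()**2)    # ~ Var[z] = 2.25, computed both ways

z1 = rng.normal(size=1_000_000)                # independent of z2
z2 = rng.normal(loc=3.0, size=1_000_000)
print((z1 * z2).mean(), z1.mean() * z2.mean()) # both ~ E[z1]E[z2] = 0
```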


Gaussian (Normal) distribution

Probability density function (PDF) for the scalar Gaussian distribution

$\mathcal{N}(z \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\dfrac{(z - \mu)^2}{2\sigma^2}\right)$

• $\mu$ is the mean (expected value of the distribution)
• $\sigma$ is the standard deviation
• $\sigma^2$ is the variance

[Plot: the PDFs of three Gaussians, with $\mu = -1, \sigma^2 = 0.5$; $\mu = 7, \sigma^2 = 0.2$; and $\mu = 4, \sigma^2 = 4$.]

$z \sim \mathcal{N}(\mu, \sigma^2)$ means that $z$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. $\sim$ reads “distributed according to”.
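A small sketch of the same density in code, checked against scipy.stats.norm (assuming SciPy is available); the parameter values are taken from the plot above.

```python
# Sketch: the scalar Gaussian PDF, written out as in the formula above
# and compared with scipy.stats.norm.
import numpy as np
from scipy.stats import norm

def gaussian_pdf(z, mu, sigma2):
    """N(z | mu, sigma^2) = exp(-(z - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(z - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

z = np.linspace(-4, 10, 5)
print(gaussian_pdf(z, mu=4.0, sigma2=4.0))
print(norm.pdf(z, loc=4.0, scale=np.sqrt(4.0)))                 # same values

sample = norm.rvs(loc=4.0, scale=2.0, size=5, random_state=0)   # z ~ N(4, 4)
print(sample)
```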


Workflow for machine learning

1. Formulate the problem as a machine learning problem
2. Collect training data
3. Pre-process the data
4. Choose which model to use
5. Learn/train/estimate/fit/... the model from training data
6. Feed a new input x⋆ into the model to make a prediction y⋆
7. Evaluate the prediction, and thereby the usefulness of the model
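A compact sketch of steps 2–7 on a bundled scikit-learn dataset; the dataset and model choices here are arbitrary examples, not course material.

```python
# Sketch of the workflow, with the steps above marked as comments.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                     # 2. collect training data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_tr)                             # 3. pre-process the data
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

model = LogisticRegression(max_iter=1000)                       # 4. choose a model
model.fit(X_tr, y_tr)                                           # 5. train it

y_pred = model.predict(X_te)                                    # 6. predict on new inputs
print("accuracy:", accuracy_score(y_te, y_pred))                # 7. evaluate
```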


Training a model

When training a model on training data

$\mathcal{T} \overset{\text{def}}{=} \{(y_i, \mathbf{x}_i)\}_{i=1}^{n},$

the model is somehow adapted to the training data. The most basic learning is fitting a straight line to data:

[Plot: data points with sugar consumption (50–150 g/day) on the horizontal axis and happiness index (3–7) on the vertical axis.]


Training a model

[Plot: the same data with a fitted straight line; sugar consumption (50–150 g/day) on the horizontal axis and happiness index (3–7) on the vertical axis.]

This is linear regression (next lecture). A good entrance into the world of supervised machine learning.

We will often understand training from a statistical maximum likelihood perspective – adapting the model such that the data is as likely as possible to have been observed.
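A small numerical illustration of this connection: for a straight-line model with Gaussian noise, numerically maximizing the likelihood recovers the same line as ordinary least squares. The sugar/happiness numbers below are made up.

```python
# Sketch: maximum likelihood fit vs. least-squares fit of a straight line
# under a Gaussian noise assumption. Synthetic data for illustration only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(50, 150, 30)                     # "sugar consumption" (g/day), invented
y = 2.0 + 0.03 * x + rng.normal(0, 0.4, 30)      # "happiness index", invented

theta_ls = np.polyfit(x, y, deg=1)               # least-squares straight line

def neg_log_lik(theta):
    """Negative Gaussian log-likelihood of the data (constants dropped)."""
    slope, intercept, log_sigma = theta
    sigma = np.exp(log_sigma)
    resid = y - (intercept + slope * x)
    return 0.5 * np.sum(resid**2) / sigma**2 + len(x) * np.log(sigma)

theta_ml = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0]).x
print("least squares (slope, intercept):", theta_ls)
print("maximum likelihood (slope, intercept):", theta_ml[:2])   # ~ the same line
```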


Randomness of the learned model

We use training data $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ to learn a model.

If the training data $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ used to learn the model is random, then so is the learned model!

We will use statistical tools and probability theory to reason about the properties of the learned model: bias–variance trade-off in Lecture 5.
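A quick simulation of this randomness: re-drawing a (synthetic) training set and re-fitting a straight line many times shows that the fitted slope varies from one training set to the next.

```python
# Sketch: the learned model is itself random. Each re-drawn training set
# gives a slightly different fitted line; the spread of the fitted slope
# previews the bias-variance discussion in Lecture 5.
import numpy as np

rng = np.random.default_rng(0)

def draw_training_data(n=20):
    x = rng.uniform(0, 10, n)
    y = 1.0 + 0.5 * x + rng.normal(0, 1.0, n)    # true slope 0.5 (made up)
    return x, y

slopes = []
for _ in range(1000):
    x, y = draw_training_data()
    slope, intercept = np.polyfit(x, y, deg=1)   # train on this particular T
    slopes.append(slope)

print("mean fitted slope:", np.mean(slopes))     # close to the true 0.5
print("std of fitted slope:", np.std(slopes))    # nonzero: the learned model is random
```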


A few concepts to summarize lecture 1

Machine Learning: Deals with learning, reasoning and acting based on data.

Regression: Learning problem where the output is quantitative.

Classification: Learning problem where the output is qualitative.

Training data: The dataset that is used to learn a model from data. (The model should not be evaluated on this dataset.)

Maximum likelihood: Learning objective based on probability theory.
