data11002 introduction to machine learning · i you should have signed up at oodi. i we use moodle...

35
DATA11002 Introduction to Machine Learning Lecturer: Kai Puolam¨ aki TAs: Moritz Lange and Henri Suominen Department of Computer Science University of Helsinki (based in part on material by Antti Ukkonen, Patrik Hoyer, Jyrki Kivinen, and Teemu Roos) 31 October – 13 December 2019 1,

Upload: others

Post on 11-Aug-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

DATA11002

Introduction to Machine Learning

Lecturer: Kai PuolamakiTAs: Moritz Lange and Henri Suominen

Department of Computer ScienceUniversity of Helsinki

(based in part on material by Antti Ukkonen, Patrik Hoyer,Jyrki Kivinen, and Teemu Roos)

31 October – 13 December 2019

1 ,

Page 2: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Introduction

I Practical details of the courseI Lectures

I Exercises

I Exam

I Grading

I Course outline

I What is machine learning? Motivation & examplesI Definition

I Relation to other fields

I Examples

2 ,

Page 3: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (1)

I Lectures:I 31 October (today) – 13 December

I Wednesdays at 12–14 and Fridays at 10–12 (Exactum B123)

I Lecturer: Kai Puolamaki(Exactum A342, [email protected])

I Language: English

I Based on the course textbook (next slide)

3 ,

Page 4: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (2)

I Textbook:I authors: Gareth James, Daniela Witten,

Trevor Hastie and Robert Tibshirani

I title: An Introduction to StatisticalLearning – with Applications in R

I publisher: Springer (2013, first edition)

I web page:http://www.statlearning.com/

I we’ll cover the whole book except splinesand generalized additive models (GAMs) –and include some additional Bayesian stuff

4 ,

Page 5: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (3)

I Lecture materialI this set of slides (by

Hoyer/Kivinen/Roos/Ukkonen/Puolamaki) is intended for useas part of the actual lectures, together with the blackboard etc.

I we will cover some topics in more detail than the textbook(and some less)

I in particular some additional detail is needed for homeworkproblems

I both the selected parts of the textbook as well as additionalmaterial indicated on the course homepage are requiredmaterial for the exam

5 ,

Page 6: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (4a)

I Exercise Set 0: Prerequisite KnowledgeI Submit your answer via Moodle as a pdf file by 5 November

I To be completed individually

I Not discussed at the exercise sessions

I Exercise Set 1–6:I Two kinds:

I mathematical exercises (pencil-and-paper)I computer exercises (support given in R but Python is a good

choice too)

I Problem set handed out every Friday, focusing on topics fromthat week’s lectures

I Solutions returned at the exercise sessions

I NB: You get points only by attending your own exercise group(not group 99)

6 ,

Page 7: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (4b)

I Exercise Set 1–6:I Solutions can be returned by email to [email protected]

only under the following circumstances:I reason for the email submission must be valid and verifiableI “valid reason” is such that would entitle you be absent from

work (e.g., illness)I non-valid reasons include: travel, being busy at work, other

studiesI attach your answer to the email as a pdf fileI in the body of email, clearly indicate your (i) name and (ii)

student number, (iii) the numbers of problems you havesolved, and (iv) the reason for not attending the problemsession

I we must receive your email before the problem sessionI notice that need to complete 80% of the problems to get full

exercises points (i.e., you can get a full points even if you missone exercise session)

I Language of exercise sessions: English

I Exercise points make up 40% of your total grade, must get atleast half the points to be eligible for the course exam.

7 ,

Page 8: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (5)

I Course exam (these can sometimes change with short notice!):I 17 December 2019 at 9:00 (NB: not 9:15) (preliminary date!)

I Makes 60% of your course grade

I Must get a minimum of half the points of the exam to passthe course

I Pencil-and-paper problems, similar style as in exercises (also‘essay’ or ‘explain’ problems)

I You may answer exam problems also in Finnish or Swedish.

I You may bring a hand-written cheat sheet (one A4)!

I (Note: To be eligible to take a ‘separate exam’ you need to first complete

a project work. These will be available on the course web page a bit later.

However since you are here at the lecture, this probably does not concern

you.)

8 ,

Page 9: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (6)

I Prerequisites:I Mathematics: Basics of probability theory and statistics, linear

algebra (i.e., vectors and matrices) and real analysis (i.e.,derivatives, etc.)

I Computer science: Good programming skills (but no previousfamiliarity with R necessary)

I Please do the prerequisite knowledge test (Exercise Set 0) andreturn it via Moodle by Tuesday 5 November 23:59

I You need to pass the Exercise Set 0 to pass the course and tobe eligible for the course exam!

9 ,

Page 10: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Related courses

I Various advanced Data Science courses:I Advanced Course in Machine Learning (period IV)

I Statistical Data Science (period II)

I Computational Statistics I-II (periods I-II)

I Probabilistic Graphical Models (period III)

I Introduction to Data Science (period I)

I Deep Learning (period II)

I Many seminars also have a strong machine learning flavour!

I Lots of courses at Aalto as well!

10 ,

Page 11: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Practical details (8)

I Course material:I Webpage (public information about the course):

https://courses.helsinki.fi/fi/data11002/129803083

I You should have signed up at Oodi.

I We use Moodle at least to send announcements (e.g., lecturecancelled due to illness)I You will be added automatically to the Moodle area when you

sign up at Oodi. If not and if you have @helsinki.fi email, youcan ask us ([email protected]) to add you there manually.

I Help?I Ask the assistants/lecturer at exercises/lectures

I Contact assistants/lecturer

I Use the course mailbox [email protected] for allcourse-related emails!

11 ,

Page 12: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Course outline

I Introduction

I Ingredients of machine learningI task, models, data

I evaluation and model selection

I Supervised learningI classification

I regression

I Unsupervised learningI clustering

I dimension reduction

12 ,

Page 13: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

What is machine learning?

I Definition:

machine = computer, computer program (in this course)

learning = improving performance on a given task, basedon experience / examples

I In other words

I instead of the programmer writing explicit rules for how tosolve a given problem, the programmer instructs the computerhow to learn from examples

I in many cases the computer program can even become betterat the task than the programmer is!

13 ,

Page 14: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 1: tic-tac-toe

I How to program the computer to play tic-tac-toe?

I Option A: The programmer writes explicit rules, e.g., ‘if theopponent has two in a row, and the third position is free,place your mark there’, etc (lots of work, difficult, not at allscalable!)

I Option B: Go through the game tree, choose optimally (fornon-trivial games, must be combined with some heuristics torestrict tree size)

I Option C: Let the computer try out various strategies byplaying against itself and others, and noting which strategieslead to winning and which to losing (=‘machine learning’)

14 ,

Page 15: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

I Arthur Samuel (50’s and 60’s):

I Computer program that learns to play checkers

I Program plays against itself thousands of times, learns whichpositions are good and which are bad (i.e., which lead towinning and which to losing)

I The computer program eventually becomes much better thanthe programmer.

15 ,

Page 16: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 2: spam filter

I Programmer writes rules: “If it contains ‘viagra’ then it isspam.” (difficult, not user-adaptive)

I The user marks which mails are spam, which are legit, and thecomputer learns itself what words are predictive

Example 2: spam filter

� X is the set of all possible emails (strings)

� Y is the set { spam, non-spam }

From: [email protected]

Subject: viagra

cheap meds...spam

From: [email protected]

Subject: important information

here’s how to ace the test...non-spam

......

From: [email protected]

Subject: you need to see this

how to win $1,000,000...?

6 , 16 ,

Page 17: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Problem setup

I One definition of machine learning: A computer programimproves its performance on a given task with experience(i.e., examples, data).

I So we need to separate

I Task: What is the problem that the program is solving?

I Performance measure: How is the performance of the program(when solving the given task) evaluated?

I Experience: What is the data (examples, features) that theprogram is using to improve its performance?

17 ,

Page 18: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Related scientific disciplines (1)

I Artificial Intelligence (AI)I Machine learning can be seen as ‘one approach’ towards

implementing ‘intelligent’ machines (or at least machines thatbehave in a seemingly intelligent way).

I Artificial neural networks, computational neuroscienceI Inspired by and trying to mimic the function of biological

brains, in order to make computers that learn from experience.Modern machine learning really grew out of the neuralnetworks boom in the 1980’s and early 1990’s.

I Pattern recognitionI Recognizing objects and identifying people in controlled or

uncontrolled settings, from images, audio, etc. Such taskstypically require machine learning techniques.

18 ,

Page 19: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Deep learning?

I Is a “family” of machine learning methods.

I Has become incredibly popular in the past few years.

I Yields very good results, e.g., in computer vision and speechrecognition tasks.

I Heavily based on classical work on artificial neural networkmethods. (Fundamentally not really that novel.)

I Basic principles of ML covered in this course also apply todeep learning!

19 ,

Page 20: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Availability of data

I These days it is very easy toI collect data (sensors are cheap, much information digital)I store data (hard drives are big and cheap)I transmit data (essentially free on the internet).

I The result? Everybody is collecting large quantities of data.I Businesses: shops (market-basket data), search engines (web

pages and user queries), financial sector (asset prices,transaction metadata, etc), manufacturing (sensors of allkinds), social networking sites (user activity, links), anybodywith a web server (hits, user activity)

I Science: genomes sequenced, gene expression data,experiments in high-energy physics, images of remote galaxies,global ecosystem monitoring data, drug research anddevelopment, public health data

I But how to benefit from it? Analysis is becoming key!

20 ,

Page 21: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Big Data

I one definition: data of a very large size, typically to the extentthat its manipulation and management present significantlogistical challenges (Oxford English Dictionary)

I 3V: volume, velocity, and variety (Doug Laney, 2001)

I a database may be able to handle a lot of data, but you can’treally implement a machine learning algorithm as an SQLquery

I on this course we do not consider technical issues relating toextremely large data sets (check out courses Introduction toBig Data Management and Big Data Frameworks)

I basic principles of machine learning still apply, but manyalgorithms may be difficult to implement efficiently

21 ,

Page 22: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Related scientific disciplines (2)

I Data miningI Trying to identify interesting and useful associations and

patterns in huge datasets

I Focus on scalable algorithms

I Example: shopping basket analysis (frequent itemsets)

I StatisticsI historically, introductory courses on statistics tend to focus on

hypothesis testing and some other basic problemsI however there’s a lot more to statistics than hypothesis testingI there is a lot of interaction between research in machine

learning, data mining and statistics

22 ,

Page 23: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 3

I Prediction of search queries

I The programmer provides a standard dictionary (words andexpressions change!)

I Previous search queries are used as examples!

23 ,

Page 24: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 4

I Ranking search results:I Various criteria for

ranking results

I What do users click onafter a given search?Search engines canlearn what users arelooking for bycollecting queries andthe resulting clicks.

24 ,

Page 25: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 5

I Detecting credit card fraudI Credit card companies typically end up

paying for fraud (stolen cards, stolen cardnumbers)

I It’s thus useful to screen transactionsautomatically!

I Important to be adaptive to the behaviorsof customers, i.e., learn from existing datahow users normally behave, and try todetect ‘unusual’ transactions (anomalydetection)

25 ,

Page 26: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 6

I Self-driving cars:I Sensors (radars,

cameras) superior tohumans

I How to make thecomputer reactappropriately to thesensor data?

I Note: The sensors canbe broken and deliverincorrect/broken data.

I Adversarial attacks oncomputer visionsystems!

26 ,

Page 27: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 7

I Character recognition:I Automatically sorting

mail (handwrittencharacters)

I Digitizing old booksand newspapers intoeasily searchableformat (printedcharacters)

27 ,

Page 28: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 8

I Recommendation systems(‘collaborative filtering’):I Amazon: “Customers

who bought X alsobought Y ”...

I Netflix: “Based onyour movie ratings, youmight enjoy...”

I Spotify: “Discoverweekly” playlists

4 5 5 1 2

3 4 3

1 4 1 5 1

? 4 1 ?2 1 1 5

1 1 4 5

4 5 5

2 3 3

Fargo

Seven

Leon

Aliens

Avatar

Bill

Jack

Linda

John

Lucy

28 ,

Page 29: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 9

I Machine translation:I Traditionally: statistical machine translation based on example

data

I Most recently: neural machine translation based on deepsequence-to-sequence models

29 ,

Page 30: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 10

I Online store websiteoptimization:I What items to present,

what layout?

I What colors to use?

I Can significantly affectsales volume

I Experiment, andanalyze the results!(lots of decisions onhow exactly toexperiment and how toensure meaningfulresults)

30 ,

Page 31: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 11

I Mining chat and discussion forumsI Breaking news

I Detecting outbreaks of infectious disease

I Tracking consumer sentiment about companies / products

31 ,

Page 32: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 12

I Real-time sales andinventory managementI Picking up quickly on

new trends (what’s hotat the moment?)

I Deciding on what toproduce or order

32 ,

Page 33: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

Example 13

I Prediction of friends in Facebook, or prediction of who you’dlike to follow on Twitter.

33 ,

Page 34: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

What about privacy?

I Users are surprisingly willing to sacrifice privacy to obtainuseful services and benefits

I Regardless of what position you take on this issue, it isimportant to know what can and what cannot be done withvarious types information (i.e., what the dangers are)

I ‘Privacy-preserving data mining’I What type of statistics/data can be released without exposing

sensitive personal information? (e.g., government statistics)

I Developing data mining algorithms that limit exposure of userdata (e.g., ‘Collaborative filtering with privacy’, Canny 2002)

34 ,

Page 35: DATA11002 Introduction to Machine Learning · I You should have signed up at Oodi. I We use Moodle at least to send announcements (e.g., lecture cancelled due to illness) I You will

What’s next

I Do the Exercise Set 0: Prerequisite KnowledgeI submit your answer via Moodle on 5 November at latest

I Next lecture (1 Nov):I basic components of a machine learning scenario

I generalization

35 ,