Optimization Methods for Machine Learning (OMML) 1st lecture (1 slot) Prof. L. Palagi 25/09/2017 1


Optimization Methods for Machine Learning (OMML)

1st lecture (1 slot)

Prof. L. Palagi

25/09/2017 1


Course at a glance

6 ECTS

You can find all info on the web site

http://www.dis.uniroma1.it/~palagi

following the path (not yet)

didatticaaa-2017-18optimization-methods-machine-learning

Assistant Professor: Ing. Tommaso Colombo

Lecture schedule on the website

Join the Google Group “OMML_2017-18”


Master students in…

Management Engineering (ingegneria gestionale)

• Strong background in optimization
• Poor background in Data Management
• Medium background in Programming

Data Science

• Unknown background in optimization
• Strong background in Data Management
• Medium background in Programming

Others from engineering (e.g. Engineering in Artificial Intelligence and Robotics)

• Basic background in optimization
• Good background in Data Management
• Strong background in Programming


Exams

Attending students

3 homeworks, one every two weeks (75%)

You must turn in all the homeworks in order to be admitted to the final exam.

Midterm and Final Exam (10% & 15%)

Grading

– Homework (75%, 3 assignments: 15, 25, 35 % respectively)

– Midterm (10%)

– Final (15%)

– Oral (on demand) ± 2

Non-attending students

Project, multiple choice exam and final oral exam


Syllabus at a glance

• Introduction to statistical learning theory (“learning from data”)

• Supervised Learning:

– Deep Learning: Feedforward Neural Networks (NN)

– Kernel methods: Support Vector Machines (SVM)

• Unsupervised Learning: Clustering

• Use of standard software (R, LIBSVM, TensorFlow, Sklearn)

FOCUS: optimization models, algorithms


What does «automatic learning» mean ?

Arthur Samuel (1901-1990)

“programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning”, in Some Studies in Machine Learning Using the Game of Checkers, 1959

Tom Mitchell (1997) http://www.cs.cmu.edu/~tom/

“Machine Learning is the study of computer algorithms that improve automatically through experience”, in Machine Learning, Tom Mitchell, McGraw Hill, 1997.


Human brain versus automatic learning

1. 10 billion neurons

2. 60 trillion synapses

3. Distributed processing

4. Nonlinear processing

5. Parallel Computation

1. ??


To be more precise… (T. Mitchell)

“We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.”

http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf


An everyday example: SPAM detection

Assume that your e-mail program decides which mail should be classified as «spam» or «not spam» and needs to learn how to improve the anti-spam filter

T (Task): classify mail as «spam» or «not spam»

P (Performance measure): the number (or %) of correctly classified mails

E (Experience): your own classification of e-mails as «spam» or «not spam»
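The (T, P, E) triple can be made concrete with a toy sketch. The keyword rule, the `learn` helper and the four mails below are invented for illustration and are not part of the course material. Note that P is measured here on the same mails used as experience, which is optimistic.

```python
# Toy illustration of the (T, P, E) triple for spam detection.
# The keyword-based filter and the sample mails are hypothetical.

def classify(subject, spam_words):
    """Task T: label a mail as spam (1) or not spam (0)."""
    words = subject.lower().split()
    return 1 if any(w in spam_words for w in words) else 0

def accuracy(mails, spam_words):
    """Performance P: fraction of correctly classified mails."""
    correct = sum(classify(s, spam_words) == label for s, label in mails)
    return correct / len(mails)

def learn(mails):
    """Use experience E: keep words seen in spam but never in non-spam."""
    spam, ham = set(), set()
    for subject, label in mails:
        (spam if label == 1 else ham).update(subject.lower().split())
    return spam - ham

# Experience E: mails the user has already marked as spam (1) or not (0).
experience = [
    ("win a free prize now", 1),
    ("meeting moved to monday", 0),
    ("free lottery ticket inside", 1),
    ("lecture notes attached", 0),
]

before = accuracy(experience, spam_words=set())      # filter knows nothing
after = accuracy(experience, learn(experience))      # filter after using E
print(before, after)
```

The filter "learns" in Mitchell's sense if `after` is reliably larger than `before`.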


Learning from examples

It is the process of finding the analytic description of an unknown relationship between the «measures» of some «objects» and the properties of such «objects».

• The measures are the «input variables» and we assume that they are available for all the objects under study (this is not always true)

• The properties of the objects are known as «output variables» and usually they are known only on a subset of objects, which represent the “examples”

• Estimating the input-output dependence is useful to predict the properties of all the possible objects (not only the examples)


An everyday example: SPAM detection

The measures (input/features) can be

• Sender

• Subject

• Body

The property (output) is the classification as spam or not spam (1 or 0)

A problem like this, where the property (output) can assume only a finite number of (discrete) values, is called a classification problem


Classification

• Classification establishes the belonging of an element to a class.

In a classification problem, the output is categorical, namely it takes only a finite number of values: {Yes, No}, {High, Medium, Low}, etc.

As a first example, consider the problem of predicting whether a consumer is likely to buy a new product or accept a new commercial offer.


Medical Diagnosis (from T. Mitchell)

Predict if a pregnancy will end with a cesarean section or a natural childbirth

[Figure: cases labeled Cesarean-S or Natural, plotted against age]

[Figure: cases labeled Cesarean-S or Natural, plotted against age and weight]


Postural diseases detection*

▪ Formetric allows the spinal column of a patient to be digitally reconstructed

▪ It works by taking sequences of images and computing average measurements

After cleaning, 42 input features

▪ Trunk length (mm)

▪ Anteroposterior curve (degree)

▪ Kyphosis peak (mm)

▪ Inversion point (mm)

▪ Lordotic angle (degree)

▪ Lateral deviation (mm)

▪ ……..

Output

Healthy/scoliosis

*Joint project DIAG (Data Mining aspects) and the Physical and Rehabilitation department (patient selection, postural evaluation and rasterstereography) - Sapienza


Handwritten digit recognition

Input variables are the pictures of a given character


Handwritten digit recognition

• Each input element is a picture with p×p pixels (28×28, 256×256) and hence is represented by a real vector of dimension $p^2$ (= 784, 65536) whose components are the grey levels of the pixels (0 = white, 1 = black), each of which can be represented with 8 bits

• The property (output) is the “character”, namely one of the elements of the finite set {0, 1, 2, …, 9}

• The examples (E) are handwritten digits

• The aim (T) is to recognize handwritten digits, telling each one apart from the others

• The difficulty lies in the high variability of the shapes and the huge number of possible elements ($2^{28 \times 28 \times 8}$, $2^{256 \times 256 \times 8}$)
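The sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function name `image_space` is made up here, and it assumes, as on the slide, 8 bits per pixel.

```python
import math

def image_space(p, bits=8):
    """Return (vector dimension, exponent e, decimal digits of 2**e)
    for p x p images with `bits` bits per pixel."""
    dimension = p * p                  # one vector component per pixel
    exponent = dimension * bits        # number of distinct images is 2**exponent
    decimal_digits = math.floor(exponent * math.log10(2)) + 1
    return dimension, exponent, decimal_digits

for p in (28, 256):
    dimension, exponent, digits = image_space(p)
    print(f"p={p}: dimension={dimension}, images=2^{exponent} "
          f"(a number with {digits} decimal digits)")
```

Even for 28×28 pictures the number of possible inputs is far too large to enumerate, which is why the relationship has to be learned from a comparatively small set of examples.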


Approximation/regression

In many learning problems the output takes a numerical value in a continuous range: approximation/regression problem.

[Figure: price plotted against area]


Approximation/regression

Training data are pairs of real-valued vectors (x, t), and we assume that a linear or nonlinear model of dependency exists, represented by the unknown function t = f(x)

The output t may assume an infinite number of values. Such outputs are often referred to as continuous variables even when they are not continuous in the mathematical sense (e.g. people’s age)

We look for a function that fits the data “at best”

Input data may contain a given (low) level of noise. Noiseless problems are referred to as “approximation” problems; in the presence of noise we are tackling a regression problem.

As an example, consider the learning problem of predicting the earnings that a customer will generate in a given time period.
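One common way to make “at best” precise is least squares. The sketch below fits a straight line t = w0 + w1·x to noisy pairs; the choice of a linear model, the helper name `fit_line` and the (area, price) numbers are assumptions made for illustration, not data from the course.

```python
# Least-squares fit of a line t = w0 + w1 * x, in pure Python.
# The (area, price) pairs are invented for illustration only.

def fit_line(xs, ts):
    """Return (w0, w1) minimizing sum_i (t_i - w0 - w1 * x_i)**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    w0 = mean_t - w1 * mean_x
    return w0, w1

areas = [50.0, 70.0, 90.0, 110.0]        # hypothetical areas
prices = [155.0, 205.0, 265.0, 315.0]    # hypothetical noisy prices

w0, w1 = fit_line(areas, prices)

def predict(x):
    """Use the fitted model on an area that is not among the examples."""
    return w0 + w1 * x

print(w0, w1, predict(100.0))
```

On noiseless data generated by a line the fit recovers the line exactly (approximation); with noise it returns the line closest to the data in the squared-error sense (regression).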


Organic Rankine Cycle system for waste heat recovery*

The highly nonlinear thermodynamic model of an ORC is determined by a FNN

Input Features

- Working fluid mass flow rate

- Bottom pressure of the ORC cycle

- Top pressure of the ORC cycle

- Super-heating rate

- Degree of regeneration

Output

Power generated

*Joint project Dept. of Computer, Control, and Management Engin. and Dept. of Mechanical and Aerospace Engin.


Machine Learning and Statistics

Statistical Inference (V. Vapnik)

Given a collection of empirical data originating from some functional dependency, infer this dependency

There are two main approaches

– parametric (particular) inference, which aims to create simple methods of inference to be used to solve specific real-life problems

– general inference which aims to create one (induction) method for any problem of statistical inference


Parametric Inference

Beginning 1930. Golden age ’30-’60

– Assume that the problem is known, e.g.

• the physical law that generates the stochastic properties of the data

• and the function to be found, up to a finite number of parameters.

– The essence of the inference problem lies in estimating the parameters and using the data to verify the reliability of the estimate

– To find these parameters, using information about the statistical law and the target function, one adopts the maximum likelihood method


Parametric Inference

Inference models are quite simple, and they were suitable for the computational resources available in the sixties.

These models are based on three main principles

• The Weierstrass Theorem: any continuous function on a closed finite interval can be approximated by a polynomial (i.e. a function that is linear in the parameters) to any degree of accuracy

• The central limit theorem: (roughly) the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution.

• The maximum likelihood method is a good tool to estimate the parameters
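As a sketch of how the three principles combine, assume a polynomial model of degree m in a scalar input and independent Gaussian noise with variance σ² (the symbols m, n, w and σ are introduced here for illustration and do not appear on the slide). Under these assumptions, maximizing the likelihood is the same as minimizing the sum of squared errors:

```latex
t_i = \sum_{j=0}^{m} w_j x_i^{j} + \varepsilon_i ,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \dots, n

\log L(w) = -\frac{n}{2} \log\!\left(2\pi\sigma^2\right)
            - \frac{1}{2\sigma^2} \sum_{i=1}^{n}
              \Big( t_i - \sum_{j=0}^{m} w_j x_i^{j} \Big)^{2}

\hat{w} = \arg\max_{w} \log L(w)
        = \arg\min_{w} \sum_{i=1}^{n}
          \Big( t_i - \sum_{j=0}^{m} w_j x_i^{j} \Big)^{2}
```

Because the model is linear in the parameters w, this minimization is a linear least-squares problem, which is what made the approach tractable with limited computational resources.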


The end of parametric inference

- Curse of dimensionality (R. Bellman, ca. 1960): increasing the number of factors to be taken into account requires an exponentially increasing amount of computational resources. For example, if the function is not sufficiently smooth, then to obtain a given degree of accuracy one needs an exponential number of terms in the polynomial (and hence of variables)

- (Tukey, ca. 1960) the statistical components of real-life problems cannot be described by classical distribution functions

- The maximum likelihood method may not be a good one even for very simple cases (James and Stein)
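The growth behind the curse of dimensionality mentioned above can be seen by counting polynomial terms. The sketch below uses two standard counts; the function names are made up for this example.

```python
import math

def n_terms_per_variable(n_vars, degree):
    """Terms of a polynomial whose degree in EACH variable is <= degree:
    (degree + 1) ** n_vars, i.e. exponential in the number of variables."""
    return (degree + 1) ** n_vars

def n_terms_total_degree(n_vars, degree):
    """Monomials of TOTAL degree <= degree in n_vars variables:
    the binomial coefficient C(n_vars + degree, degree)."""
    return math.comb(n_vars + degree, degree)

for n_vars in (1, 2, 5, 10, 20):
    print(n_vars,
          n_terms_per_variable(n_vars, degree=3),
          n_terms_total_degree(n_vars, degree=3))
```

Each term carries one parameter to estimate, so the number of unknowns quickly outgrows any realistic number of examples as the number of input factors increases.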


Beyond parametric inference

• General statistical inference: one does not have a priori information about the statistical law underlying the problem or about the function to be approximated.

– Look for a method that infers an approximating function from examples (inductive method)

– Data are used to define the model itself

– Models that are nonlinear in the parameters

data analysis/data mining


Data Mining and Big Data

• The subject of data mining is the extraction of patterns and knowledge from large amounts of data using automatic or semi-automatic methods, and the operative use of this information.

• Exponential growth of tools and techniques to collect and store huge amounts of data


Report| McKinsey Global Institute

Big data: The next frontier for innovation, competition, and productivity - May 2011

“The amount of data in our world has been exploding, and analyzing large data sets—so-called big data—will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office……. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; …… sophisticated analytics can substantially improve decision-making…….”http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation


History

– In 1958 Rosenblatt (a physiologist) proposed a learning machine (namely a program) called the Perceptron to solve a simple classification problem. The perceptron reproduced a neurophysiological learning model. The perceptron was able to generalize (it learns!).

– 1958-1992: Feedforward Neural Networks (shallow)

– (1992- ) back to general statistical inference: other learning machines have been proposed which do not have any similarity with the biological neuron.

Does an inductive inference principle common to all these machines exist?

– (2010- ) Deep Learning (deep FNN) and beyond


What is Data Mining?

The core of Knowledge Discovery in Databases (KDD)

The term KDD denotes the whole process of discovering knowledge from data, namely the techniques that help decision makers extract knowledge in a clever and automatic way. The KDD process includes

• Formulation of the problems

• Data collection

• Data Cleaning and preprocessing

• Data mining

• Analysis of the results produced by the model

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


Rule for a “safe use”

Not everything is foreseeable or can be learned

Some processes are «intrinsically chaotic», e.g. social/economic phenomena, which are characterized by unpredictability and by personal choices

These are cases when the mathematical model produces chaos.

In a dynamic system a small perturbation of the initial condition may lead to a totally different final state.
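Sensitivity to the initial condition can be observed numerically. The sketch below uses the logistic map x ← r·x·(1 − x) with r = 4, a textbook chaotic system chosen here only as an illustration; it is not a model taken from the slides.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate the logistic map x <- r * x * (1 - x) starting from x0."""
    x = x0
    orbit = [x]
    for _ in range(steps):
        x = r * x * (1.0 - x)
        orbit.append(x)
    return orbit

# Two initial conditions that differ by a tiny perturbation.
orbit_a = logistic_orbit(0.2, 60)
orbit_b = logistic_orbit(0.2 + 1e-10, 60)

# Distance between the two trajectories at every step.
gaps = [abs(a - b) for a, b in zip(orbit_a, orbit_b)]
print("initial gap:", gaps[0])
print("largest gap over 60 steps:", max(gaps))
```

The model is fully deterministic, yet the gap grows until the two trajectories are unrelated, so long-range prediction would require knowing the initial state with unattainable precision.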


In some cases, developing refined mathematical models and/or increasing the reliability of the tools may make it possible to predict phenomena which are not predictable nowadays; in other cases, although the phenomenon is deterministic, no tool or model, however refined, can produce a good prediction


What is (not) Data Mining? By Namwar Rizvi

- Ad hoc query: ad hoc queries just examine the current data set and give you a result based on that. This means you can check what the maximum price of a product is, but you cannot predict what the maximum price of that product will be in the near future.

- Event notification: you can set different alerts based on some threshold values, which will inform you as soon as a threshold is reached by the actual transactional data, but you cannot predict when that threshold will be reached.

- Multidimensional analysis: you can find the value of an item based on different dimensions like Time, Area, Color, but you cannot predict what the value of the item will be when its Color is Blue, its Area is UK and its Time is the first quarter of the year.

- Statistics: statistics can tell you the history of price changes, moving averages, maximum values, minimum values etc., but it cannot tell you how the price will change if you start selling another product in the same season.


Data Mining tasks addressed in the course

• Classification [Predictive]

• Regression [Predictive]

• Clustering [Descriptive]

Define a learning model to be used in

• Prediction

– Use data to predict unknown or future values of some variables.

• Description

– Find human-interpretable patterns that describe the data.


Learning model

The main focus of the course is on optimization tools for machine learning. In order to study the learning problem mathematically, we need to define it formally.

Keep in mind:

1. A learning model should be rich enough to capture important aspects of the problem, but simple enough to be tackled mathematically.

2. As usual in mathematical modelling, simplifying assumptions are unavoidable.

3. A learning model should answer several questions:

• How is the data being generated?

• How is the data presented to the learner?

• What is the goal of learning in this model?


What are Data?

• A collection of objects (examples) and their attributes

• An attribute is a property or characteristic of an object

– Examples: eye color of a person, temperature, etc.

– Attribute is also known as variable, field, characteristic, or feature

– Attributes are encoded as vectors in some vector space

• A collection of attributes describes an object (also called record, point, case, sample, entity, or instance)


Learning paradigm

- Supervised learning: there is a “teacher”, namely one knows the right answer (label, output) on the training instances

- The training set is made up of attributes in pairs (feature – label) or (input – output)

- Unsupervised learning: no “teacher”

- Output values are not known in advance. One wants to find similarity classes and to assign instances to the correct class. The training set is made up of features only


[Figure: supervised learning, data set (features, label); the attributes are the features plus the label]

[Figure: unsupervised learning, data set (features); the attributes are the features only]


Learning process

• In a learning process we have two main phases

– learning, using a set of available data

– use (prediction/description): the capability of giving the “right answer” on new instances (generalization).


Learning process

Data may have a twofold role

– Training set: data used for the learning phase

• incremental (on-line) learning: data are obtained incrementally during the training process

• batch (off-line) learning: the data of the training set are available in advance, before entering the training process

– Test set: data used in the 2nd phase for checking the accuracy

– Validation set: data used for testing during the learning phase
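A minimal sketch of how a batch of available data might be divided into the three sets described above. The fractions, the fixed seed and the function name `split_data` are arbitrary choices made for illustration.

```python
import random

def split_data(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the data and split it into training, validation and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)     # reproducible shuffle
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test_set = items[:n_test]                       # kept aside for the 2nd phase
    validation_set = items[n_test:n_test + n_val]   # used during learning
    training_set = items[n_test + n_val:]           # used to fit the model
    return training_set, validation_set, test_set

training_set, validation_set, test_set = split_data(range(10))
print(len(training_set), len(validation_set), len(test_set))
```

The three sets are disjoint, so the accuracy measured on the test set is not biased by data already seen during the learning phase.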


If data is plentiful, then one can use some of the available data as training set and a second set of independent data, called validation set, to check the predictive performance.

K-fold cross validation

In order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution is to use cross-validation.

The available data are partitioned into k groups. Then k − 1 of the groups are used as training set and the remaining group as validation. This procedure is then repeated for all k possible choices for the held-out group. The performance scores from the k runs are then averaged.

Example of 4-fold cross validation
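The procedure can be sketched in a few lines of plain Python. The helper names, the trivial “predict the training mean” model and the toy data are assumptions made for this example; any learning method and performance score could be plugged in instead.

```python
def k_fold_splits(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when k does not divide n_samples.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

def cross_validate(xs, ts, k, fit, score):
    """Average the validation score over the k held-out groups."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ts[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in val_idx],
                            [ts[i] for i in val_idx]))
    return sum(scores) / k

def fit_mean(xs, ts):
    """Hypothetical model: always predict the mean training output."""
    return sum(ts) / len(ts)

def mean_squared_error(model, xs, ts):
    return sum((t - model) ** 2 for t in ts) / len(ts)

xs = list(range(8))
ts = [2.0 * x for x in xs]
folds = list(k_fold_splits(len(xs), 4))
average_score = cross_validate(xs, ts, 4, fit_mean, mean_squared_error)
print(folds[0], average_score)
```

With k = 4 and 8 samples, each run trains on 6 samples and validates on the remaining 2, and every sample is used for validation exactly once.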