TRANSCRIPT
Optimization Methods for Machine Learning (OMML)
1st lecture (1 slot)
Prof. L. Palagi
25/09/2017 1
Course at a glance: 6 ECTS. You can find all info on the web site
http://www.dis.uniroma1.it/~palagi
following the path (not yet)
didattica > aa-2017-18 > optimization-methods-machine-learning
Assistant Professor: Ing. Tommaso Colombo. Lectures schedule on the website. Join the Google Group “OMML_2017-18”
Master students in…
- Management Engineering (ingegneria gestionale): strong background in optimization, poor background in Data Management, medium background in Programming
- Data Science: unknown background in optimization, strong background in Data Management, medium background in Programming
- Others from engineering (e.g. Engineering in Artificial Intelligence and Robotics): basic background in optimization, good background in Data Management, strong background in Programming
Exams
Attending students: 3 homeworks, one every two weeks (75%).
You must turn in all the homeworks in order to be admitted to the final term.
Midterm and Final Exam (10% & 15%). Grading:
– Homework (75%; 3 assignments worth 15, 25 and 35% respectively)
– Midterm (10%)
– Final (15%)
– Oral (on demand) ± 2
Not attending students: project, multiple-choice exam and final oral exam
Syllabus at a glance
• Introduction to statistical learning theory (“learning from data”)
Supervised Learning:
Deep Learning: Feedforward Neural Networks (NN)
Kernel methods: Support Vector Machines (SVM)
Unsupervised Learning: Clustering
Use of standard software (R, LIBSVM, TensorFlow, Sklearn)
FOCUS: optimization models, algorithms
What does «automatic learning» mean?
Arthur Samuel (1901-1990): “programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning”, in Some Studies in Machine Learning Using the Game of Checkers, 1959.
Tom Mitchell (1997) http://www.cs.cmu.edu/~tom/
“Machine Learning is the study of computer algorithms that improve automatically through experience”, in Machine Learning, Tom Mitchell, McGraw Hill, 1997.
Human brain versus automatic learning
1. 10 billion neurons
2. 60 trillion synapses
3. Distributed processing
4. Nonlinear processing
5. Parallel computation
1. ??
To be more precise… (T. Mitchell)
“We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.”
http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
An everyday example: SPAM detection
Assume that your e-mail program checks which mail should be classified as «spam» or «not spam» and needs to learn how to improve the anti-spam filter.
T (task): classify mail as «spam» or «not spam»
P (performance measure): the number (or %) of correctly classified mails
E (experience): your e-mail classification as «spam» or «not spam»
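Mitchell's (T, P, E) framing can be made concrete in a few lines of code. This is a minimal sketch, not a real anti-spam filter: the keyword list and the labelled mails are made-up illustrations, and P is computed as plain accuracy.

```python
# A toy spam classifier illustrating the (T, P, E) framing.
# SPAM_WORDS and the labelled mails below are hypothetical examples.

SPAM_WORDS = {"winner", "free", "offer"}   # assumed spam indicators

def classify(mail: str) -> str:
    """Task T: label a mail as 'spam' or 'not spam'."""
    return "spam" if set(mail.lower().split()) & SPAM_WORDS else "not spam"

# Experience E: mails the user has already labelled by hand
experience = [
    ("free offer just for you", "spam"),
    ("meeting moved to 3pm", "not spam"),
    ("you are a winner", "spam"),
    ("lecture notes attached", "not spam"),
]

# Performance measure P: fraction of correctly classified mails
P = sum(classify(m) == label for m, label in experience) / len(experience)
print(P)
```

A real filter would learn the indicative words (or weights on features) from E instead of fixing them by hand; that learning step is exactly what the rest of the course formalizes.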
Learning from examples
• The measures are the «input variables», and we assume they are available for all the objects under study (this is not always true)
• The properties of the objects are known as «output variables», and usually they are known only on a subset of objects, which represent the “examples”
• Estimating the dependence between input and output will be useful to predict the properties of all the possible objects (not only the examples)
It is the process of finding an analytic description of an unknown relationship between the «measures» of some «objects» and the properties of such «objects».
An everyday example: SPAM detection
The measures (input/features) can be: sender, subject, body.
The property (output): classification as spam or not spam (1 or 0).
Such a problem, where the properties (output) can assume only a finite number of values (discrete), is called a classification problem.
Classification
• Classification establishes the belonging of an element to a class.
In a classification problem the output is categorical, namely there is only a finite number of values: {Yes, No}, {High, Medium, Low}, etc.
As a first example, consider the problem of predicting whether a consumer is likely to buy a new product or accept a new commercial offer.
Medical Diagnosis (from T. Mitchell)
Predict if a pregnancy will end with a cesarean section or a natural childbirth.
[Plot: training examples plotted against age, each labelled Cesarean-S or Natural.]
Medical Diagnosis (from T. Mitchell)
Predict if a pregnancy will end with a cesarean section or a natural childbirth.
[Plot: training examples plotted against age and weight, each labelled Cesarean-S or Natural.]
Medical Diagnosis (from T. Mitchell)
Postural diseases detection*
▪ Formetric allows one to digitally reconstruct the spinal column of a patient
▪ It works by taking sequences of images and then computing average measurements
After cleaning, 42 input features
▪ Trunk length (mm)
▪ Anteroposterior curve (degree)
▪ Kyphosis peak (mm)
▪ Inversion point (mm)
▪ Lordotic angle (degree)
▪ Lateral deviation (mm)
▪ ……..
Output
Healthy / scoliosis
*Joint project of DIAG (Data Mining aspects) and the Physical and Rehabilitation department (patient selection, postural evaluation and rasterstereography) - Sapienza
Handwritten digit recognition
Input variables are the pictures of a given character
Handwritten digit recognition
• Each input element is a picture with p×p (28×28, 256×256) pixels and hence is represented by a real vector of dimension p² (= 784, 65536), whose components give the grey level (0 = white, 1 = black), each representable with 8 bits
• The property (output) is the “character”, namely one of the elements of the finite set {0, 1, 2, …, 9}
• The examples (E) are handwritten digits
• The aim (T) is to recognize handwritten digits and distinguish them from one another
• The difficulty lies in the high variability of the shapes and the huge number of possible inputs (2^(28×28×8), 2^(256×256×8))
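The vector representation described above can be sketched in a few lines. The 4×4 "image" below is a made-up toy, not MNIST data; the point is only that a p×p grid of grey levels flattens into a vector in R^(p²).

```python
# Flattening a p x p grey-level image (0 = white, 1 = black) into a
# feature vector of dimension p^2. The tiny image is a toy illustration.

p = 4
image = [[0.0] * p for _ in range(p)]
image[1][1] = 1.0
image[2][2] = 1.0          # a short diagonal "stroke"

# flatten row by row into a vector of length p^2
x = [pixel for row in image for pixel in row]

print(len(x))              # p^2
```

For MNIST-sized inputs (p = 28) the same flattening yields vectors of dimension 784.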
Approximation/regression
In many learning problems the output takes a numerical value in a continuous range: this is an approximation/regression problem.
[Plot: price versus area.]
Approximation/regression
Training data are pairs of real-valued vectors (x, t), and we assume that a linear or nonlinear model of dependency exists, represented by the unknown function t = f(x).
The output t may assume an infinite number of values. Often such outputs are referred to as continuous variables, even when they are not continuous in the mathematical sense (e.g. people’s age).
We look for a function that approaches the data “at best”.
Input data may contain a given (low) level of noise. Noiseless problems are referred to as “approximation” problems; in the presence of noise we are tackling a regression problem.
As an example, consider the learning problem of predicting the earnings that a customer will generate in a given time period.
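A regression problem of this kind can be sketched end to end with ordinary least squares. Here the "unknown" dependency t = f(x) is chosen by us (a line with slope 2 and intercept 1, plus Gaussian noise) so we can check that the fit recovers it; all numbers are illustrative.

```python
# Least-squares fit of a line to noisy training pairs (x, t).
import random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                            # the dependency t = f(x)
xs = [i / 10 for i in range(20)]
data = [(x, f(x) + random.gauss(0, 0.1)) for x in xs]  # noisy training pairs

# closed-form least-squares estimates of slope and intercept
n = len(data)
sx = sum(x for x, _ in data)
st = sum(t for _, t in data)
sxx = sum(x * x for x, _ in data)
sxt = sum(x * t for x, t in data)
slope = (n * sxt - sx * st) / (n * sxx - sx * sx)
intercept = (st - slope * sx) / n

print(round(slope, 1), round(intercept, 1))
```

With little noise the fit lands close to the generating line; with no noise at all this would be a pure approximation problem in the terminology above.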
Organic Rankine Cycle system for waste heat recovery*
The highly nonlinear thermodynamic model of an ORC is approximated by an FNN (feedforward neural network)
*Joint project of the Dept. of Computer, Control, and Management Engineering and the Dept. of Mechanical and Aerospace Engineering
Input features:
- Working fluid mass flow rate
- Bottom pressure of the ORC cycle
- Top pressure of the ORC cycle
- Super-heating rate
- Degree of regeneration
Output: power generated
Machine Learning and Statistics
Statistical Inference (V. Vapnik)
Given a collection of empirical data originating from some functional dependency, infer this dependency
There are two main approaches:
– parametric (particular) inference, which aims to create simple methods of inference to be used to solve specific real-life problems
– general inference, which aims to create one (inductive) method for any problem of statistical inference
Parametric Inference: beginning ca. 1930, golden age ’30s-’60s
– One assumes the problem is known, e.g.
• the physical law that generates the stochastic properties of the data
• and the function to be found, up to a finite number of parameters.
– The essence of the inference problem lies in estimating the parameters and using the data to verify their reliability
– To find these parameters, using information about the statistical law and the target function, one adopts the maximum likelihood method
Parametric Inference
Inference models are quite simple, and they were suitable for the computational resources available in the sixties.
These models are based on three main principles
The Weierstrass theorem: any continuous function on a finite interval can be approximated by a polynomial (i.e. a function linear in the parameters) to any degree of accuracy
The central limit theorem: (roughly) the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution.
the maximum likelihood method is a good tool to estimate the parameters
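The maximum likelihood idea can be illustrated on the simplest parametric model: data assumed Gaussian with unknown mean. For this model the ML estimate of the mean is the sample mean; the data below are synthetic.

```python
# Maximum likelihood estimation of a Gaussian mean (variance fixed to 1).
import math
import random

random.seed(1)
true_mu = 5.0
sample = [random.gauss(true_mu, 1.0) for _ in range(1000)]

# For a Gaussian, the sample mean maximises the likelihood of the data.
mu_hat = sum(sample) / len(sample)

def log_likelihood(m: float) -> float:
    """Gaussian log-likelihood of the sample for a candidate mean m."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - m) ** 2
               for x in sample)

# mu_hat scores better than any shifted candidate
assert log_likelihood(mu_hat) > log_likelihood(mu_hat + 0.5)
print(round(mu_hat, 1))
```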
The end of parametric inference
- Curse of dimensionality (R. Bellman, ca. 1960): increasing the number of factors to be taken into account requires exponentially increasing amounts of computational resources. For example: if the function is not sufficiently smooth, then to obtain a given degree of accuracy one needs an exponential number of terms in the polynomial (and hence of variables)
- (Tukey, ca. 1960) the statistical components of real-life problems cannot be described by classical distribution functions
- the maximum likelihood method may not be a good one even for very simple cases (James and Stein)
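Bellman's curse of dimensionality is easy to quantify for the polynomial models above: a polynomial of total degree at most d in n variables has C(n + d, d) monomials, so the number of parameters to estimate explodes as n grows.

```python
# Counting the monomials of a polynomial of total degree <= d in n variables.
from math import comb

def n_terms(n_vars: int, degree: int) -> int:
    """Number of monomials of total degree <= degree in n_vars variables."""
    return comb(n_vars + degree, degree)

for n in (2, 10, 100):
    print(n, n_terms(n, 3))   # already ~n^3 for d = 3; far worse for larger d
```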
Beyond parametric inference
• General statistical inference: one does not have a priori information about the statistical law underlying the problem or about the function to be approximated.
– Look for a method that infers an approximating function from examples (inductive method)
– the data are used to define the model itself
– nonlinear models in the parameters
data analysis/data mining
Data Mining and Big Data
• The subject of data mining is the extraction of patterns and knowledge from large amounts of data using automatic or semi-automatic methods, and the operative use of this information.
• Exponential growth of tools and techniques to collect and store huge amounts of data
Report| McKinsey Global Institute
Big data: The next frontier for innovation, competition, and productivity - May 2011
“The amount of data in our world has been exploding, and analyzing large data sets—so-called big data—will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office. … Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; … sophisticated analytics can substantially improve decision-making. …”
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
Story
– In 1958 Rosenblatt (a physiologist) proposed a learning machine (namely a program) called the Perceptron to solve a simple classification problem. The perceptron reproduced a neurophysiological learning model, and it was able to generalize (it learns!).
– 1958-1992: Feedforward Neural Networks (shallow)
– (1992- ) back to general statistical inference: other learning machines have been proposed which do not have any similarity with the biological neuron.
Does an inductive inference principle exist in common to all these machines?
- (2010- ) Deep Learning (deep FNN) and beyond
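Rosenblatt's perceptron is simple enough to code directly: a linear threshold unit trained with the classical error-correction rule. The OR-gate training set below is a toy stand-in for his classification problem.

```python
# Rosenblatt's perceptron trained on the (linearly separable) OR function.

def step(z: float) -> int:
    """Threshold activation: fire (1) iff the weighted input is >= 0."""
    return 1 if z >= 0 else 0

examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(20):                         # a few passes over the examples
    for (x1, x2), t in examples:
        y = step(w[0] * x1 + w[1] * x2 + b)
        w[0] += lr * (t - y) * x1           # error-correction update
        w[1] += lr * (t - y) * x2
        b += lr * (t - y)

predictions = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in examples]
print(predictions)   # [0, 1, 1, 1]
```

Since OR is linearly separable, the perceptron convergence theorem guarantees the updates stop after finitely many corrections; on non-separable data (e.g. XOR) the rule never settles, one of the limits that ended the first neural-network era.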
What is Data Mining? The core of Knowledge Discovery in Databases (KDD)
The term KDD denotes the full process of discovering knowledge from data, namely the techniques that help decision makers extract knowledge in a clever and automatic way. The KDD process includes:
• Formulation of the problems
• Data collection
• Data Cleaning and preprocessing
• Data mining
• Analysis of the results produced by the model
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
In a dynamic system a small perturbation of the initial conditions may lead to a totally different final state.
Rule for a “safe use”: not everything is foreseeable or can be learned.
Some processes are «intrinsically chaotic», e.g. social/economic phenomena, which are characterized by unpredictability and by personal choices.
These are cases when the mathematical model produces chaos.
In some cases, developing refined mathematical models and/or increasing the tools’ reliability may lead to predicting phenomena which are not predictable nowadays; in other cases, although deterministic, no refined tool or model may produce a good prediction.
What is (not) Data Mining? (by Namwar Rizvi)
- Ad hoc query: an ad hoc query just examines the current data set and gives you a result based on that. This means you can check the current maximum price of a product, but you cannot predict what the maximum price of that product will be in the near future.
- Event notification: you can set alerts based on threshold values which will inform you as soon as a threshold is reached by actual transactional data, but you cannot predict when that threshold will be reached.
- Multidimensional analysis: you can find the value of an item along different dimensions such as Time, Area, Colour, but you cannot predict what the value of the item will be when its colour is Blue, the area is the UK and the time is the first quarter of the year.
- Statistics: statistics can tell you the history of price changes, moving averages, maximum values, minimum values, etc., but it cannot tell you how the price will change if you start selling another product in the same season.
Data Mining tasks...
• Classification [Predictive]
• Regression [Predictive]
• Clustering [Descriptive]
…addressed in the course
• Prediction
– Use data to predict unknown or future values of some variables.
• Description
– Find human-interpretable patterns that describe the data;
Learning model
The main focus of the course is on optimization tools for machine learning. In order to study it mathematically, we need to formally define the learning problem.
Keep in mind:
1. A learning model should be rich enough to capture important aspects of the problem, but simple enough to be tackled mathematically.
2. As usual in mathematical modelling, simplifying assumptions are unavoidable.
3. A learning model should answer several questions:
• How is the data being generated?
• How is the data presented to the learner?
• What is the goal of learning in this model?
What are data?
• A collection of objects (examples) and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
– Attributes are encoded as vectors in some vector space
• A collection of attributes describes an object (also called record, point, case, sample, entity, or instance)
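Encoding attributes as vectors typically means one-hot encoding categorical attributes and keeping numeric ones as components. The attributes below (eye colour, temperature) follow the slide's own example; the specific domain and values are made up for illustration.

```python
# Encoding one object's attributes as a feature vector: categorical
# attributes become one-hot sub-vectors, numeric ones stay as numbers.

EYE_COLOURS = ["brown", "blue", "green"]    # assumed categorical domain

def encode(eye_colour: str, temperature: float) -> list:
    one_hot = [1.0 if c == eye_colour else 0.0 for c in EYE_COLOURS]
    return one_hot + [temperature]

x = encode("blue", 36.6)
print(x)   # [0.0, 1.0, 0.0, 36.6]
```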
Learning paradigm
- Supervised learning: there is a “teacher”, namely one knows the right answer (label, output) on the training instances. The training set is made up of attribute pairs (feature - label), or (input - output)
- Unsupervised learning: no “teacher”. Output values are not known in advance; one wants to find similarity classes and assign instances to the correct class. The training set is made up of features only
[Illustration: in supervised learning the data set consists of (features, label) pairs, i.e. the attributes split into features and a label; in unsupervised learning the data set consists of features only (attributes = features).]
Learning process
• In a learning process we have two main phases:
– learning, using a set of available data
– use (prediction/description): the capability of giving the “right answer” on new instances (generalization)
Learning process
Data may play a twofold role:
– Training set: data used for the learning phase
• incremental (on-line) learning: data are obtained incrementally during the training process
• batch (off-line) learning: data of the training set are available in advance, before entering the training process
– Test set: data used in the 2nd phase for checking the accuracy
– Validation set: data used for testing within the learning phase
If data is plentiful, then one can use some of the available data as training set and a second set of independent data, called validation set, to check the predictive performance.
K-fold cross validation
In order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution is to use cross-validation.
The available data are partitioned into k groups. Then k − 1 of the groups are used as training set and the remaining group as validation. This procedure is then repeated for all k possible choices for the held-out group. The performance scores from the k runs are then averaged.
Example of 4-fold cross validation
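The 4-fold procedure just described can be sketched on a toy dataset. "Training" here is only computing a mean, as a stand-in for a real learning algorithm, and the score is negative mean squared error; both choices are illustrative.

```python
# k-fold cross-validation: partition into k groups, hold each group out
# once, train on the remaining k-1 groups, and average the k scores.

def k_fold_splits(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        held_out = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, held_out

data = [float(i) for i in range(8)]          # 8 toy samples -> 4 folds of 2

scores = []
for train, held_out in k_fold_splits(len(data), k=4):
    model = sum(data[i] for i in train) / len(train)            # "training"
    mse = sum((data[i] - model) ** 2 for i in held_out) / len(held_out)
    scores.append(-mse)              # higher (less negative) = better

avg_score = sum(scores) / len(scores)
print(len(scores))                   # 4 runs, one per held-out fold
```

Averaging over the k runs uses every sample for validation exactly once, which is why cross-validation gives a less noisy estimate than a single small validation set.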