TRANSCRIPT
Optimization Methods for Machine Learning (OMML)
1st lecture (1 slot)
Prof. L. Palagi
25/09/2017 1
Course at a glance: 6 ECTS. You can find all info on the web site
http://www.dis.uniroma1.it/~palagi
following the path (not yet)
didattica > aa-2017-18 > optimization-methods-machine-learning
Assistant Professor: Ing. Tommaso Colombo. Lectures schedule on the website. Join the Google Group “OMML_2017-18”
Master students in…
- Management Engineering (ingegneria gestionale): strong background in optimization, poor background in Data Management, medium background in Programming
- Data Science: unknown background in optimization, strong background in Data Management, medium background in Programming
- Others from engineering (e.g. Engineering in Artificial Intelligence and Robotics): basic background in optimization, good background in Data Management, strong background in Programming
Exams
Attending students: 3 homeworks, one every two weeks (75%).
You must turn in all the homeworks in order to be admitted to the final term.
Midterm and Final Exam (10% & 15%). Grading:
– Homework (75%; 3 assignments worth 15, 25 and 35% respectively)
– Midterm (10%)
– Final (15%)
– Oral (on demand) ± 2
Not attending students: project, multiple-choice exam and final oral exam
Syllabus at a glance
• Introduction to statistical learning theory (“learning from data”)
Supervised Learning:
Deep Learning: Feedforward Neural Networks (NN)
Kernel methods: Support Vector Machines (SVM)
Unsupervised Learning: Clustering
Use of standard software (R, LIBSVM, TensorFlow, Sklearn)
FOCUS: optimization models, algorithms
What does «automatic learning» mean?
Arthur Samuel (1901-1990): “programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning”, in Some Studies in Machine Learning Using the Game of Checkers, 1959.
Tom Mitchell (1997) http://www.cs.cmu.edu/~tom/
“Machine Learning is the study of computer algorithms that improve automatically through experience”, in Machine Learning, Tom Mitchell, McGraw Hill, 1997.
Human brain versus automatic learning
1. 10 billion neurons
2. 60 trillion synapses
3. Distributed processing
4. Nonlinear processing
5. Parallel computation
1. ??
To be more precise… (T. Mitchell)
“We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.”
http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
An everyday example: SPAM detection
Assume that your e-mail program checks which mail should be classified as «spam» or «not spam» and needs to learn how to improve the anti-spam filter.
T (task): classify mail as «spam» or «not spam»
P (performance measure): the number (or %) of correctly classified mails
E (experience): your e-mail classification as «spam» or «not spam»
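Mitchell's (T, P, E) framing can be made concrete in a few lines of code. This is a minimal sketch, not a real anti-spam filter: the keyword list and the labelled mails are made-up illustrations, and P is computed as plain accuracy.

```python
# A toy spam classifier illustrating the (T, P, E) framing.
# SPAM_WORDS and the labelled mails below are hypothetical examples.

SPAM_WORDS = {"winner", "free", "offer"}   # assumed spam indicators

def classify(mail: str) -> str:
    """Task T: label a mail as 'spam' or 'not spam'."""
    return "spam" if set(mail.lower().split()) & SPAM_WORDS else "not spam"

# Experience E: mails the user has already labelled by hand
experience = [
    ("free offer just for you", "spam"),
    ("meeting moved to 3pm", "not spam"),
    ("you are a winner", "spam"),
    ("lecture notes attached", "not spam"),
]

# Performance measure P: fraction of correctly classified mails
P = sum(classify(m) == label for m, label in experience) / len(experience)
print(P)
```

A real filter would learn the indicative words (or weights on features) from E instead of fixing them by hand; that learning step is exactly what the rest of the course formalizes.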
Learning from examples
• The measures are the «input variables», and we assume they are available for all the objects under study (this is not always true)
• The properties of the objects are known as «output variables», and usually they are known only on a subset of objects, which represent the “examples”
• Estimating the dependence between input and output will be useful to predict the properties of all the possible objects (not only the examples)
It is the process of finding an analytic description of an unknown relationship between the «measures» of some «objects» and the properties of such «objects».
An everyday example: SPAM detection
The measures (input/features) can be: sender, subject, body.
The property (output): classification as spam or not spam (1 or 0).
Such a problem, where the properties (output) can assume only a finite number of values (discrete), is called a classification problem.
Classification
• Classification establishes the belonging of an element to a class.
In a classification problem the output is categorical, namely there is only a finite number of values: {Yes, No}, {High, Medium, Low}, etc.
As a first example, consider the problem of predicting whether a consumer is likely to buy a new product or accept a new commercial offer.
Medical Diagnosis (from T. Mitchell)
Predict if a pregnancy will end with a cesarean section or a natural childbirth.
[Plot: training examples plotted against age, each labelled Cesarean-S or Natural.]
Medical Diagnosis (from T. Mitchell)
Predict if a pregnancy will end with a cesarean section or a natural childbirth.
[Plot: training examples plotted against age and weight, each labelled Cesarean-S or Natural.]
Medical Diagnosis (from T. Mitchell)
Postural diseases detection*
▪ Formetric allows one to digitally reconstruct the spinal column of a patient
▪ It works by taking sequences of images and then computing average measurements
After cleaning, 42 input features
▪ Trunk length (mm)
▪ Anteroposterior curve (degree)
▪ Kyphosis peak (mm)
▪ Inversion point (mm)
▪ Lordotic angle (degree)
▪ Lateral deviation (mm)
▪ ……..
Output
Healthy / scoliosis
*Joint project of DIAG (Data Mining aspects) and the Physical and Rehabilitation department (patient selection, postural evaluation and rasterstereography) - Sapienza
Handwritten digit recognition
Input variables are the pictures of a given character
Handwritten digit recognition
• Each input element is a picture with p×p (28×28, 256×256) pixels and hence is represented by a real vector of dimension p² (= 784, 65536), whose components give the grey level (0 = white, 1 = black), each representable with 8 bits
• The property (output) is the “character”, namely one of the elements of the finite set {0, 1, 2, …, 9}
• The examples (E) are handwritten digits
• The aim (T) is to recognize handwritten digits and distinguish them from one another
• The difficulty lies in the high variability of the shapes and the huge number of possible inputs (2^(28×28×8), 2^(256×256×8))
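The vector representation described above can be sketched in a few lines. The 4×4 "image" below is a made-up toy, not MNIST data; the point is only that a p×p grid of grey levels flattens into a vector in R^(p²).

```python
# Flattening a p x p grey-level image (0 = white, 1 = black) into a
# feature vector of dimension p^2. The tiny image is a toy illustration.

p = 4
image = [[0.0] * p for _ in range(p)]
image[1][1] = 1.0
image[2][2] = 1.0          # a short diagonal "stroke"

# flatten row by row into a vector of length p^2
x = [pixel for row in image for pixel in row]

print(len(x))              # p^2
```

For MNIST-sized inputs (p = 28) the same flattening yields vectors of dimension 784.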
Approximation/regression
In many learning problems the output takes a numerical value in a continuous range: this is an approximation/regression problem.
[Plot: price versus area.]
Approximation/regression
Training data are pairs of real-valued vectors (x, t), and we assume that a linear or nonlinear model of dependency exists, represented by the unknown function t = f(x).
The output t may assume an infinite number of values. Often such outputs are referred to as continuous variables, even when they are not continuous in the mathematical sense (e.g. people’s age).
We look for a function that approaches the data “at best”.
Input data may contain a given (low) level of noise. Noiseless problems are referred to as “approximation” problems; in the presence of noise we are tackling a regression problem.
As an example, consider the learning problem of predicting the earnings that a customer will generate in a given time period.
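A regression problem of this kind can be sketched end to end with ordinary least squares. Here the "unknown" dependency t = f(x) is chosen by us (a line with slope 2 and intercept 1, plus Gaussian noise) so we can check that the fit recovers it; all numbers are illustrative.

```python
# Least-squares fit of a line to noisy training pairs (x, t).
import random

random.seed(0)
f = lambda x: 2.0 * x + 1.0                            # the dependency t = f(x)
xs = [i / 10 for i in range(20)]
data = [(x, f(x) + random.gauss(0, 0.1)) for x in xs]  # noisy training pairs

# closed-form least-squares estimates of slope and intercept
n = len(data)
sx = sum(x for x, _ in data)
st = sum(t for _, t in data)
sxx = sum(x * x for x, _ in data)
sxt = sum(x * t for x, t in data)
slope = (n * sxt - sx * st) / (n * sxx - sx * sx)
intercept = (st - slope * sx) / n

print(round(slope, 1), round(intercept, 1))
```

With little noise the fit lands close to the generating line; with no noise at all this would be a pure approximation problem in the terminology above.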
Organic Rankine Cycle system for waste heat recovery*
The highly nonlinear thermodynamic model of an ORC is approximated by an FNN (feedforward neural network)
*Joint project of the Dept. of Computer, Control, and Management Engineering and the Dept. of Mechanical and Aerospace Engineering
Input features:
- Working fluid mass flow rate
- Bottom pressure of the ORC cycle
- Top pressure of the ORC cycle
- Super-heating rate
- Degree of regeneration
Output: power generated
Machine Learning and Statistics
Statistical Inference (V. Vapnik)
Given a collection of empirical data originating from some functional dependency, infer this dependency
There are two main approaches:
– parametric (particular) inference, which aims to create simple methods of inference to be used to solve specific real-life problems
– general inference, which aims to create one (inductive) method for any problem of statistical inference
Parametric Inference: beginning ca. 1930, golden age ’30s-’60s
– One assumes the problem is known, e.g.
• the physical law that generates the stochastic properties of the data
• and the function to be found, up to a finite number of parameters.
– The essence of the inference problem lies in estimating the parameters and using the data to verify their reliability
– To find these parameters, using information about the statistical law and the target function, one adopts the maximum likelihood method
Parametric Inference
Inference models are quite simple, and they were suitable for the computational resources available in the sixties.
These models are based on three main principles
The Weierstrass theorem: any continuous function on a finite interval can be approximated by a polynomial (i.e. a function linear in the parameters) to any degree of accuracy
The central limit theorem: (roughly) the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution.
the maximum likelihood method is a good tool to estimate the parameters
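The maximum likelihood idea can be illustrated on the simplest parametric model: data assumed Gaussian with unknown mean. For this model the ML estimate of the mean is the sample mean; the data below are synthetic.

```python
# Maximum likelihood estimation of a Gaussian mean (variance fixed to 1).
import math
import random

random.seed(1)
true_mu = 5.0
sample = [random.gauss(true_mu, 1.0) for _ in range(1000)]

# For a Gaussian, the sample mean maximises the likelihood of the data.
mu_hat = sum(sample) / len(sample)

def log_likelihood(m: float) -> float:
    """Gaussian log-likelihood of the sample for a candidate mean m."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - m) ** 2
               for x in sample)

# mu_hat scores better than any shifted candidate
assert log_likelihood(mu_hat) > log_likelihood(mu_hat + 0.5)
print(round(mu_hat, 1))
```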
The end of parametric inference
- Curse of dimensionality (R. Bellman, ca. 1960): increasing the number of factors to be taken into account requires exponentially increasing amounts of computational resources. For example: if the function is not sufficiently smooth, then to obtain a given degree of accuracy one needs an exponential number of terms in the polynomial (and hence of variables)
- (Tukey, ca. 1960) the statistical components of real-life problems cannot be described by classical distribution functions
- the maximum likelihood method may not be a good one even for very simple cases (James and Stein)
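Bellman's curse of dimensionality is easy to quantify for the polynomial models above: a polynomial of total degree at most d in n variables has C(n + d, d) monomials, so the number of parameters to estimate explodes as n grows.

```python
# Counting the monomials of a polynomial of total degree <= d in n variables.
from math import comb

def n_terms(n_vars: int, degree: int) -> int:
    """Number of monomials of total degree <= degree in n_vars variables."""
    return comb(n_vars + degree, degree)

for n in (2, 10, 100):
    print(n, n_terms(n, 3))   # already ~n^3 for d = 3; far worse for larger d
```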
Beyond parametric inference
• General statistical inference: one does not have a priori information about the statistical law underlying the problem or about the function to be approximated.
– Look for a method that infers an approximating function from examples (inductive method)
– the data are used to define the model itself
– nonlinear models in the parameters
data analysis/data mining
Data Mining and Big Data
• The subject of data mining is the extraction of patterns and knowledge from large amounts of data using automatic or semi-automatic methods, and the operative use of this information.
• Exponential growth of tools and techniques to collect and store huge amounts of data
Report| McKinsey Global Institute
Big data: The next frontier for innovation, competition, and productivity - May 2011
“The amount of data in our world has been exploding, and analyzing large data sets—so-called big data—will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey's Business Technology Office. … Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; … sophisticated analytics can substantially improve decision-making. …”
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
Story
– In 1958 Rosenblatt (a physiologist) proposed a learning machine (namely a program) called the Perceptron to solve a simple classification problem. The perceptron reproduced a neurophysiological learning model, and it was able to generalize (it learns!).
– 1958-1992: Feedforward Neural Networks (shallow)
– (1992- ) back to general statistical inference: other learning machines have been proposed which do not have any similarity with the biological neuron.
Does an inductive inference principle exist in common to all these machines?
- (2010- ) Deep Learning (deep FNN) and beyond
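Rosenblatt's perceptron is simple enough to code directly: a linear threshold unit trained with the classical error-correction rule. The OR-gate training set below is a toy stand-in for his classification problem.

```python
# Rosenblatt's perceptron trained on the (linearly separable) OR function.

def step(z: float) -> int:
    """Threshold activation: fire (1) iff the weighted input is >= 0."""
    return 1 if z >= 0 else 0

examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(20):                         # a few passes over the examples
    for (x1, x2), t in examples:
        y = step(w[0] * x1 + w[1] * x2 + b)
        w[0] += lr * (t - y) * x1           # error-correction update
        w[1] += lr * (t - y) * x2
        b += lr * (t - y)

predictions = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in examples]
print(predictions)   # [0, 1, 1, 1]
```

Since OR is linearly separable, the perceptron convergence theorem guarantees the updates stop after finitely many corrections; on non-separable data (e.g. XOR) the rule never settles, one of the limits that ended the first neural-network era.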
What is Data Mining? The core of Knowledge Discovery in Databases (KDD)
The term KDD denotes the full process of discovering knowledge from data, namely the techniques that help decision makers extract knowledge in a clever and automatic way. The KDD process includes:
• Formulation of the problems
• Data collection
• Data Cleaning and preprocessing
• Data mining
• Analysis of the results produced by the model
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
In a dynamic system a small perturbation of the initial conditions may lead to a totally different final state.
Rule for a “safe use”: not everything is foreseeable or can be learned.
Some processes are «intrinsically chaotic», e.g. social/economic phenomena, which are characterized by unpredictability and by personal choices.
These are cases when the mathematical model produces chaos.
In some cases, developing refined mathematical models and/or increasing the tools’ reliability may lead to predicting phenomena which are not predictable nowadays; in other cases, although deterministic, no refined tool or model may produce a good prediction.
What is (not) Data Mining? (by Namwar Rizvi)
- Ad hoc query: an ad hoc query just examines the current data set and gives you a result based on that. This means you can check the current maximum price of a product, but you cannot predict what the maximum price of that product will be in the near future.
- Event notification: you can set alerts based on threshold values which will inform you as soon as a threshold is reached by actual transactional data, but you cannot predict when that threshold will be reached.
- Multidimensional analysis: you can find the value of an item along different dimensions such as Time, Area, Colour, but you cannot predict what the value of the item will be when its colour is Blue, the area is the UK and the time is the first quarter of the year.
- Statistics: statistics can tell you the history of price changes, moving averages, maximum values, minimum values, etc., but it cannot tell you how the price will change if you start selling another product in the same season.
Data Mining tasks...
• Classification [Predictive]
• Regression [Predictive]
• Clustering [Descriptive]
…addressed in the course
• Prediction
– Use data to predict unknown or future values of some variables.
• Description
– Find human-interpretable patterns that describe the data;
Learning model
The main focus of the course is on optimization tools for machine learning. In order to study it mathematically, we need to formally define the learning problem.
Keep in mind:
1. A learning model should be rich enough to capture important aspects of the problem, but simple enough to be tackled mathematically.
2. As usual in mathematical modelling, simplifying assumptions are unavoidable.
3. A learning model should answer several questions:
• How is the data being generated?
• How is the data presented to the learner?
• What is the goal of learning in this model?
What are data?
• A collection of objects (examples) and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
– Attributes are encoded as vectors in some vector space
• A collection of attributes describes an object (also called record, point, case, sample, entity, or instance)
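Encoding attributes as vectors typically means one-hot encoding categorical attributes and keeping numeric ones as components. The attributes below (eye colour, temperature) follow the slide's own example; the specific domain and values are made up for illustration.

```python
# Encoding one object's attributes as a feature vector: categorical
# attributes become one-hot sub-vectors, numeric ones stay as numbers.

EYE_COLOURS = ["brown", "blue", "green"]    # assumed categorical domain

def encode(eye_colour: str, temperature: float) -> list:
    one_hot = [1.0 if c == eye_colour else 0.0 for c in EYE_COLOURS]
    return one_hot + [temperature]

x = encode("blue", 36.6)
print(x)   # [0.0, 1.0, 0.0, 36.6]
```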
Learning paradigm
- Supervised learning: there is a “teacher”, namely one knows the right answer (label, output) on the training instances. The training set is made up of attribute pairs (feature - label), or (input - output)
- Unsupervised learning: no “teacher”. Output values are not known in advance; one wants to find similarity classes and assign instances to the correct class. The training set is made up of features only
[Illustration: in supervised learning the data set consists of (features, label) pairs, i.e. the attributes split into features and a label; in unsupervised learning the data set consists of features only (attributes = features).]
Learning process
• In a learning process we have two main phases:
– learning, using a set of available data
– use (prediction/description): the capability of giving the “right answer” on new instances (generalization)
Learning process
Data may play a twofold role:
– Training set: data used for the learning phase
• incremental (on-line) learning: data are obtained incrementally during the training process
• batch (off-line) learning: data of the training set are available in advance, before entering the training process
– Test set: data used in the 2nd phase for checking the accuracy
– Validation set: data used for testing within the learning phase
If data is plentiful, then one can use some of the available data as training set and a second set of independent data, called validation set, to check the predictive performance.
K-fold cross validation
In order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution is to use cross-validation.
The available data are partitioned into k groups. Then k − 1 of the groups are used as training set and the remaining group as validation. This procedure is then repeated for all k possible choices for the held-out group. The performance scores from the k runs are then averaged.
Example of 4-fold cross validation
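The 4-fold procedure just described can be sketched on a toy dataset. "Training" here is only computing a mean, as a stand-in for a real learning algorithm, and the score is negative mean squared error; both choices are illustrative.

```python
# k-fold cross-validation: partition into k groups, hold each group out
# once, train on the remaining k-1 groups, and average the k scores.

def k_fold_splits(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        held_out = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, held_out

data = [float(i) for i in range(8)]          # 8 toy samples -> 4 folds of 2

scores = []
for train, held_out in k_fold_splits(len(data), k=4):
    model = sum(data[i] for i in train) / len(train)            # "training"
    mse = sum((data[i] - model) ** 2 for i in held_out) / len(held_out)
    scores.append(-mse)              # higher (less negative) = better

avg_score = sum(scores) / len(scores)
print(len(scores))                   # 4 runs, one per held-out fold
```

Averaging over the k runs uses every sample for validation exactly once, which is why cross-validation gives a less noisy estimate than a single small validation set.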