pattern recognition – basic concepts

Post on 23-Feb-2016


Input Attributes                                          Output Attribute

Inst.  Sepal Length  Sepal Width  Petal Length  Petal Width  Species
1      5.1           3.5          1.4           0.2          setosa
2      4.9           3.0          1.4           0.2          setosa
3      4.7           3.2          1.3           0.2          setosa
4      4.6           3.1          1.5           0.2          setosa
5      5.0           3.6          1.4           0.2          setosa

Sample

• input attribute, attribute, feature, input variable, independent variable

• class, output variable, dependent variable

• sample

Handwritten digits

• each digit is 28 × 28 pixels
  – so each digit can be represented by a vector x comprising 784 real numbers
• goal:
  – build a machine that will take x as input and will produce the identity of the digit 0 … 9 as the output
  – a non-trivial problem due to the wide variability of handwriting
  – could be tackled by using rules for distinguishing the digits based on the shapes of the strokes
  – in practice such an approach leads to a proliferation of rules and of exceptions to the rules and so on, and invariably gives poor results
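The 784-dimensional representation above can be sketched in a few lines (a minimal illustration assuming NumPy; the image here is random stand-in data, not a real digit):

```python
import numpy as np

# stand-in for one 28 x 28 grayscale digit image (hypothetical data)
image = np.random.rand(28, 28)

# flatten row by row into the vector x of 784 real numbers
x = image.reshape(-1)
print(x.shape)  # (784,)
```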

• better way – adopt a machine learning algorithm (i.e. use some adaptive model)

[Diagram: input → model; the internal parameters influencing the behavior of the model must be adjusted]

• tune the parameters of the model using the training set
  – the training set is a data set of N digits {x1, …, xN}
• the categories of the digits in the training set are known in advance (inspected manually); they form a target vector t, one for each digit

0 0 0 1 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9

(target vector t for the digit 3: a single 1 at the digit's position, 0 elsewhere)
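This 1-of-10 target encoding can be sketched as follows (assuming NumPy; the function name `one_hot` is illustrative, not from the slides):

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Target vector t: 1 at the digit's position, 0 elsewhere."""
    t = np.zeros(num_classes)
    t[digit] = 1.0
    return t

print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```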

The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and generates an output vector y, encoded in the same way as the target vectors.

[Figure: input vector (x1, x2, …, x784) → y(x) → output vector y]

• The precise form of the function y(x) is determined during the training phase (learning phase).
  – The model adapts its parameters (i.e. learns) on the basis of the training data {x1, …, xN}.

• The trained model can then determine the identity of new, previously unseen digit images, which are said to comprise a test set.

• The ability to categorize correctly new examples that differ from those used for training is known as generalization.

For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve.


Preprocessing

• Translate and scale the images of the digits so that each digit is contained within a box of a fixed size.

• This greatly reduces the variability within each digit class, because the location and scale of all the digits are now the same.

• This pre-processing stage is sometimes also called feature extraction.

• Test data must be pre-processed using the same steps as the training data.
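The translate-and-scale step above can be sketched as follows (a minimal illustration assuming NumPy; the 20 × 20 box size, the nearest-neighbour resizing, and the synthetic "digit" are all assumptions, not details from the slides — a real pipeline would use a proper image-resampling routine):

```python
import numpy as np

def resize_nn(img, size):
    """Nearest-neighbour resize to size x size (crude stand-in for
    proper image rescaling)."""
    r = (np.arange(size) * img.shape[0] / size).astype(int)
    c = (np.arange(size) * img.shape[1] / size).astype(int)
    return img[np.ix_(r, c)]

def fit_to_box(image, box=20):
    """Crop a digit to the ink's bounding box, then rescale it so that
    every digit occupies the same fixed-size box."""
    rows = np.where(image.any(axis=1))[0]
    cols = np.where(image.any(axis=0))[0]
    cropped = image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    return resize_nn(cropped, box)

# a synthetic "digit": a small blob of ink somewhere in a 28 x 28 image
img = np.zeros((28, 28))
img[5:10, 8:11] = 1.0
print(fit_to_box(img).shape)  # (20, 20)
```

After this step the location and scale of every digit are the same, which is exactly the variability reduction the slide describes.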

Feature selection and feature extraction

[Figure: feature selection keeps a subset of the original features, e.g. (x1, x2, …, x784) → (x1, x5, x103, x456); feature extraction computes new features from all of them, e.g. (x1, x2, …, x784) → (x*1, x*2, …, x*784), of which a few are kept, e.g. (x*18, x*152, x*309, x*666)]
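The distinction in the figure can be sketched as follows (assuming NumPy; the data is random, and a fixed random projection stands in for a learned extraction such as PCA, which would fit the projection matrix to the data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 784))    # 100 samples, 784 original features

# feature selection: keep a subset of the original columns unchanged
selected = X[:, [0, 4, 102, 455]]  # e.g. x1, x5, x103, x456

# feature extraction: derive new features as combinations of ALL columns
W = rng.normal(size=(784, 4))      # projection matrix (PCA would learn
Z = X @ W                          # W from the data instead)

print(selected.shape, Z.shape)  # (100, 4) (100, 4)
```

Both routes end with 4 features, but selection preserves original, interpretable attributes, while extraction produces new derived ones.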

Dimensionality reduction

• We want to reduce the number of dimensions because:
  – Efficiency
    • measurement costs
    • storage costs
    • computation costs
  – The problem may be solved more easily in the new space
  – Improved classification performance
  – Ease of interpretation

Curse of dimensionality

Bishop, Pattern Recognition and Machine Learning

Supervised learning

• training data comprises examples of the input vectors along with their corresponding target vectors (e.g. digit recognition)
  – classification – aim: assign an input vector to one of a finite number of discrete categories
  – regression (data/curve fitting) – desired output consists of one or more continuous variables

Unsupervised learning

• training data consists of a set of input vectors x without any corresponding target values

• goals:
  – discover groups of similar examples within the data – clustering
  – project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization
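The clustering goal can be sketched in miniature (assuming NumPy; the two groups, the fixed centers, and the single assignment step are all illustrative — a full k-means would also re-estimate the centers iteratively):

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated groups of 2-D input vectors, with no target values
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(3.0, 0.3, (20, 2))])

# one k-means-style assignment step: label each point by the nearer
# of two candidate cluster centers
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
labels = d.argmin(axis=1)
```

No target vectors are used anywhere: the group structure is discovered from the inputs alone.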

Polynomial curve fitting

• regression problem, supervised
• we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t
• artificial example – sin(2πxn) + random noise
• training set: N observations of x written as x = (x1, …, xN)T + corresponding observations of the values of t: t = (t1, …, tN)T

[Figure: training data set {x, t} – N = 10 points, x1 → t1, x2 → t2, etc., sampled from sin(2πxn) plus random noise]

adapted from Bishop, Pattern Recognition and Machine Learning
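Such a training set can be generated as follows (a minimal sketch assuming NumPy; the noise level 0.3 and the random seed are assumptions — the slides only say "random noise"):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.sort(rng.uniform(0.0, 1.0, size=N))            # inputs x1, ..., xN
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)   # noisy targets t1, ..., tN
```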

• goal: exploit the training set in order to make a prediction of the target variable t′ for a new value x′ of the input variable
• this is generally difficult, as
  – we have to generalize from the finite data set
  – the data are corrupted by the noise, so for the given x′ there is uncertainty in the value of t′

[Figure: prediction of t′ at a new input value x′; adapted from Bishop, Pattern Recognition and Machine Learning]

• decision: which method to use?
• from the plethora of possibilities (which you do not know about yet) I chose a simple one – the data will be fitted using this polynomial function
• the polynomial coefficients w0, …, wM form a vector w
  – they represent parameters of the model that must be set in the training phase
  – the polynomial model is still linear regression!!! (it is linear in the parameters wj)

y(x, w) = w0 + w1·x + w2·x^2 + … + wM·x^M = Σ_{j=0}^{M} wj·x^j

• The values of coefficients will be determined by minimizing the error function

• It measures the misfit between the function y(x, w) and the training set data points

• one simple choice: the sum of squared errors (SSE) between the predictions y(xn, w) for each data point xn and the corresponding values tn

SSE = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2

[Figure: the fitted polynomial y(x, w) and the displacement of each data point tn from the function value y(xn, w)]

Bishop, Pattern Recognition and Machine Learning

SSE = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2

• solving the curve fitting problem means choosing the value of w for which SSE(w) is as small as possible … w* → y(x, w*)
  – "as small as possible" means finding the minimum of SSE(w) (i.e. setting its derivatives with respect to the coefficients wj to zero)
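Because the model is linear in w, minimizing SSE(w) has a closed-form least-squares solution. A minimal sketch (assuming NumPy; M = 3, the noise level, and the seed are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

# design matrix: column j holds x^j for j = 0 ... M
X = np.vander(x, M + 1, increasing=True)

# w* minimizes SSE(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
w_star, *_ = np.linalg.lstsq(X, t, rcond=None)

def y(x_new, w):
    """The fitted polynomial y(x, w) = sum_j w_j * x^j."""
    return sum(w_j * x_new ** j for j, w_j in enumerate(w))

sse = 0.5 * np.sum((X @ w_star - t) ** 2)
prediction = y(0.5, w_star)   # t' for a new input x' = 0.5
```

`np.linalg.lstsq` solves the same normal equations one would obtain by setting the derivatives of SSE(w) to zero.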
