1

Simple Interval Calculation (SIC-method): theory and applications

Rodionova Oxana

Semenov Institute of Chemical Physics RAS & Russian Chemometric Society

Moscow

2

Plan

1. Introduction

2. Main Features of SIC-method

3. Treatment of Parameter β

4. SIC-object status classification

5. Conclusions

3

First Question.

Why do we think about some other methods?

Classical statistical methods

Chemometric approach & projection methods

SIC-method

4

Second Question.

Why did we give the method this name?

Simple interval calculation (SIC-method)

Simple:

1. a simple idea lies in the background

2. well-known mathematical methods are used for its implementation

Interval: the method gives the result of the prediction directly in interval form

5

Main Assumption of SIC-method

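All errors are bounded:

$\mathrm{Prob}\{|\varepsilon| > \beta\} = 0, \quad \beta > 0$

Normal distribution: errors range over $(-\infty, +\infty)$

Finite distributions: errors range over $(-\beta, +\beta)$

The value $\beta$ is the Maximum Error Deviation (MED)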

6

The Region of Possible Values (RPV)

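Exact (errorless) model: $y = Xa$

Inexact (real) model: $y = Xa + \varepsilon$

X is an n × p matrix: n samples, p variables

β is known

Let $(x_i, y_i)$, $i = 1, \dots, n$, be a calibration sample (an object). Because $|\varepsilon_i| \le \beta$,

$x_i^t a - \beta \le y_i \le x_i^t a + \beta$    (1)

$y_i - \beta \le x_i^t a \le y_i + \beta$    (2)

All vectors a that satisfy (2) form a strip $S(x_i, y_i) \subset R^p$

The RPV is the set A of vectors a that lie in every strip: $A = \bigcap_{i=1}^{n} S(x_i, y_i)$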
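The strip condition (2) can be checked directly for any candidate parameter vector. The following is a minimal sketch, not the author's implementation: the function names `in_strip` and `in_rpv` and the sample data (an identity design matrix with β = 0.1) are assumptions made only for illustration.

```python
def in_strip(x, y, a, beta):
    """True if a satisfies y - beta <= x.a <= y + beta (condition (2))."""
    prediction = sum(xj * aj for xj, aj in zip(x, a))
    return y - beta <= prediction <= y + beta


def in_rpv(X, y, a, beta):
    """True if a lies in every strip, i.e. in the Region of Possible Values."""
    return all(in_strip(xi, yi, a, beta) for xi, yi in zip(X, y))


# Illustrative data (assumed, not taken from the slides).
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 1.0]
beta = 0.1

print(in_rpv(X, y, [1.05, 0.95], beta))  # both strips are satisfied
print(in_rpv(X, y, [1.20, 1.00], beta))  # a1 violates the first strip
```

With this data each sample constrains one coordinate of a, so the RPV is the square 0.9 ≤ a1, a2 ≤ 1.1.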

7

The Simplest Example of RPV

$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}$

[Figure: the samples in the (X1, X2) plane, both axes from -1.2 to 1.2, followed by panels numbered 1 to 5 showing the RPV in the (a1, a2) plane, both axes from 0.9 to 1.1.]

8

Properties of the RPV A

9

SIC Prediction

[Figure: V-prediction intervals and U-test intervals for test samples 1 to 4, on a vertical scale from 0 to 5.]
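As a rough illustration of how a prediction interval can be obtained, the sketch below takes v⁻ and v⁺ to be the minimum and maximum of x^t a over the RPV, which is an assumption here because the defining formula did not survive on this slide. Finding such extremes over linear constraints is a linear programming problem; to stay dependency-free, this sketch handles only two parameters and a bounded RPV, and enumerates the RPV vertices instead. The names `rpv_vertices` and `sic_interval` and the data are hypothetical.

```python
from itertools import combinations


def rpv_vertices(X, y, beta, tol=1e-9):
    """Vertices of the 2-D RPV {a : |y_i - x_i.a| <= beta for all i}."""
    # Each sample contributes two boundary lines: x_i.a = y_i - beta, y_i + beta.
    lines = [(xi, yi + s * beta) for xi, yi in zip(X, y) for s in (-1.0, 1.0)]
    vertices = []
    for (p, c1), (q, c2) in combinations(lines, 2):
        det = p[0] * q[1] - p[1] * q[0]
        if abs(det) < tol:  # parallel lines have no single intersection
            continue
        # Cramer's rule for the intersection of the two lines.
        a = ((c1 * q[1] - c2 * p[1]) / det, (p[0] * c2 - q[0] * c1) / det)
        # Keep the point only if it satisfies every strip.
        if all(abs(yi - (xi[0] * a[0] + xi[1] * a[1])) <= beta + tol
               for xi, yi in zip(X, y)):
            vertices.append(a)
    return vertices


def sic_interval(x_new, X, y, beta):
    """Range (v_minus, v_plus) of x_new.a over a bounded 2-D RPV."""
    values = [x_new[0] * a[0] + x_new[1] * a[1]
              for a in rpv_vertices(X, y, beta)]
    if not values:
        raise ValueError("the RPV is empty for this beta")
    return min(values), max(values)


# Illustrative data (assumed): the RPV is the square 0.9 <= a1, a2 <= 1.1.
X = [(1.0, 0.0), (0.0, 1.0)]
y = [1.0, 1.0]
beta = 0.1
v_minus, v_plus = sic_interval((1.0, 1.0), X, y, beta)
print(v_minus, v_plus)  # close to 1.8 and 2.2
```

Vertex enumeration works because a linear function on a bounded polygon reaches its extremes at vertices; for more than two parameters a linear programming solver would be the natural choice.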

10

Example of SIC prediction

[Figure: calibration samples C1 to C11 and test samples T1 to T4 in the PC1-PC2 score plane (PC1 from -40 to 40, PC2 from -10 to 10).]

[Figure: the RPV in the (a1, a2) plane, shown in several build-up panels. Boundary lines are labelled C2-, C4-, C6-, C10-, C11-, C2+, C3+, C4+ and C11+; samples C5 to C9 are listed separately; points A to E and the PCR estimate are marked. The values 36.69 and 6.63 appear as labels.]

11

Treatment of Parameter β

• β is known a priori

• β is an unknown parameter of the error distribution

• β is a parameter of the method, and it is unknown

12

Unknown β. How to Find It?

A value b is used instead of β.

The RPV A depends on b, and A(b) expands monotonically as b increases.

There exists a minimum b such that $A(b) \neq \emptyset$. This minimum value, $b_{min}$, may be taken as an estimator of the parameter β.
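For a model with a single parameter, y = a·x with all x positive, each strip is an interval for a, and the smallest b with a non-empty A(b) has a closed form. The sketch below is an illustration under that simplifying assumption only; the function names and the data are hypothetical, and the general multi-parameter case needs an optimisation routine.

```python
def b_min_single(x, y):
    """Smallest b with non-empty A(b) for y = a*x, assuming every x_i > 0."""
    if any(xi <= 0 for xi in x):
        raise ValueError("this sketch assumes x_i > 0")
    n = len(x)
    # The intervals of samples i and j overlap iff
    # b >= (y_i*x_j - y_j*x_i) / (x_i + x_j); take the largest such bound.
    return max((y[i] * x[j] - y[j] * x[i]) / (x[i] + x[j])
               for i in range(n) for j in range(n))


def rpv_single(x, y, b):
    """A(b) as an interval (low, high) for a, or None if it is empty."""
    low = max((yi - b) / xi for xi, yi in zip(x, y))
    high = min((yi + b) / xi for xi, yi in zip(x, y))
    return (low, high) if low <= high else None


# Illustrative data (assumed).
x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]

print(b_min_single(x, y))     # close to 0.14
print(rpv_single(x, y, 0.1))  # None: b is below b_min, so A(b) is empty
print(rpv_single(x, y, 0.2))  # a non-empty interval
print(rpv_single(x, y, 0.3))  # a wider interval containing the previous one
```

The last two calls show the monotone expansion of A(b): the interval for b = 0.3 contains the one for b = 0.2.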

13

β, the Unknown Parameter of the Error Distribution

The accuracy of the estimate depends on:

1. the number of objects in the calibration set (N): $b \to \beta$ as $N \to \infty$

2. the form of the error distribution
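The dependence on N can be seen in a small simulation. This is a sketch under assumptions that are not in the slides: a one-parameter model y = a·x with positive x, uniformly distributed errors bounded by β = 1, and nested calibration sets built from the first N objects. The helper `b_min_single` is a hypothetical name for the smallest b that keeps A(b) non-empty.

```python
import random


def b_min_single(x, y):
    """Smallest b with non-empty A(b) for y = a*x, assuming every x_i > 0."""
    n = len(x)
    return max((y[i] * x[j] - y[j] * x[i]) / (x[i] + x[j])
               for i in range(n) for j in range(n))


random.seed(0)
beta, a_true = 1.0, 2.0  # assumed true error bound and slope
x = [random.uniform(1.0, 2.0) for _ in range(250)]
y = [a_true * xi + random.uniform(-beta, beta) for xi in x]

sizes = (10, 20, 50, 75, 100, 250)
estimates = [b_min_single(x[:n], y[:n]) for n in sizes]
for n, b in zip(sizes, estimates):
    print(n, round(b, 3))
```

Because the true parameter always lies in A(β), the estimate never exceeds β, and with nested sets it can only grow as objects are added, so it approaches β from below. How quickly it does so depends on how much probability the error distribution places near ±β.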

14

Statistical Simulation

[Figure: curves labelled 0.3, 1 and 2, plotted over the range -1.2 to 1.2 on a vertical scale from 0 to 0.9.]

Number of objects in the calibration set, N: 10, 20, 50, 75, 100, 250

k: 0.3, 0.5, 1, 1.5, 2, 2.5, 3

Number of repeated series: m = 500 at each (N, k)

15

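$b_{sic}$ Calculation

$b_{sic} = b_{reg} \cdot C(N, s)$

N = 100 (fixed), k = 0.3, …, 3

3500 points

[Table of s against k, only partly recoverable. Surviving entries: k = 0.3 and 1.5; s = 0.5738861, 0.53956, 0.4950982, 0.4398133, 0.328859.]

[Figure, panel "initial": $b_{reg}$ (0.4 to 1.6) against s (0.2 to 0.7), with the fitted line y = 0.5471x + 0.7263.]

[Figure, panel "corrected": $b_{conf}$ (0.4 to 1.6) against s (0.2 to 0.7).]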

16

Octane Rating Example

X-predictors are NIR measurements (absorbance spectra) over 226 wavelengths.

Y-response is the reference measurement of octane number.

Training set = 26 samples: short training set (1-24), long training set (1-26)

Test set = 13 samples: short test set (A-I), long test set (A-M)

[Figure: spectral data, plotted against wavelength from 1100 to 1550 on a vertical scale from 0 to 0.6; samples 25, 26, J, K, L and M are marked.]

[Figure: geometrical shape of the RPV for 3 PCs, short training set.]

17

Octane Rating Example

PCR & SIC prediction for PCs = 3

[Figure: octane number for test samples A to M. One vertical axis covers samples A-I (86 to 93), the other covers samples J-M (60 to 120). Panels: short test set; test set with outliers.]

The plot shows test values with error bars, PCR estimates (points), SIC intervals (bars) and the borders of the PCR confidence intervals (curves).

s = 0.475, C = 1.12

18

Quality of Calibration

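RMSEC

[Figure: curves labelled 0.3, 1 and 2, plotted over the range -1.2 to 1.2 on a vertical scale from 0 to 0.9.]

$b_{sic} \sim 1.7 \cdot \mathrm{RMSEC}$

$b_{sic} \sim 1.9 \cdot \mathrm{RMSEC}$

$b_{sic} \sim 2.3 \cdot \mathrm{RMSEC}$

$b_{sic} \sim \frac{1}{s} \cdot \mathrm{RMSEC}$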

19

Quality of Prediction

[Figure: the RPV in the (a1, a2) plane, with labelled boundary lines, points A to E and the PCR estimate.]

New object (x, y)?

20

SIC Object Status Theory

21

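SIC-leverage / SIC-residual

[Figure: the test interval U = [$u^-$, $u^+$], of width 2β around y, shown against the prediction interval V = [$v^-$, $v^+$], of width 2d.]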
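The formulas on this slide did not survive, so the sketch below rests on assumptions drawn from the general SIC literature rather than from this transcript: the SIC-residual is taken as the distance from y to the centre of the prediction interval divided by β, the SIC-leverage as the half-width of the prediction interval divided by β, and a sample is called an insider when its prediction interval lies within its test interval and an absolute outsider when the two intervals do not overlap. The function name `sic_status` is hypothetical.

```python
def sic_status(y, v_minus, v_plus, beta):
    """Return (residual, leverage, status) under the assumed definitions."""
    centre = (v_plus + v_minus) / 2.0
    half_width = (v_plus - v_minus) / 2.0
    residual = (y - centre) / beta   # assumed SIC-residual
    leverage = half_width / beta     # assumed SIC-leverage h(x)
    if abs(residual) > 1.0 + leverage:
        status = "absolute outsider"  # intervals U and V do not overlap
    elif abs(residual) > 1.0 - leverage:
        status = "outsider"           # V is not contained in U
    else:
        status = "insider"            # V lies inside U
    return residual, leverage, status


# Illustrative values (assumed), with beta = 1.
print(sic_status(5.0, 4.8, 5.4, 1.0)[2])  # insider
print(sic_status(5.0, 5.5, 6.5, 1.0)[2])  # outsider
print(sic_status(5.0, 6.5, 7.0, 1.0)[2])  # absolute outsider
```

On a plot of residual against leverage these rules correspond to straight-line borders: residual = ±(1 - h) for insiders and residual = ±(1 + h) for absolute outsiders.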

22

SIC Object Status Map

Vertical axis: SIC-residual, a function of (x, y). Horizontal axis: SIC-leverage h(x).

[Figure: the object status map in four build-up panels, with the SIC-residual from -2.3 to 2.2 against h(x) from 0 to 1.5.]

23

Octane Rating Example

$b_{sic}$ = 0.66, 3 PCs

24 calibration samples, 10 boundary samples

[Figure: SIC object status maps for the octane data, shown in several build-up panels. Test samples A to I are plotted on an h(x) axis from 0 to 1.5; test samples J to M are plotted on an h(x) axis from 12 to 20. The SIC-residual axis runs from -2.3 to 2.2.]

[Figure: the PCR & SIC prediction plot of octane number for test samples A to M, repeated for comparison.]

24

Wheat Quality Monitoring

X-predictors are NIR measurements (log-values of absorbance spectra) at 20 wavelengths.

Y-response is the reference measurement of protein content.

Training set = 165 (3 × 55) wheat samples

Standard error of the reference method = 0.09

PLS model with 7 PCs

Sample 35 is an outlier

[Figure: spectra plotted against wavelength from 1440 to 2240 on a vertical scale from 0.3 to 0.7.]

25

Wheat Quality Monitoring

$b_{min}$ = 0.147

$b_{sic}$ = 0.241

18 boundary samples

[Figure: SIC object status maps for the wheat data, with the SIC-residual against h(x). The plotted series include the lines y = x + 1 and y = -x - 1. Sample No. 35 is marked.]

26

Main rules

1. Is β known a priori?

• YES: check that $A(\beta) \neq \emptyset$. If it is empty, there is an error of modelling.

• NO: calculate $b_{min}$ and $b_{sic}$.

2. Calculate prediction intervals for the test samples.

3. Decide whether each new sample is an absolute outsider or not.

• A sample inside the model gives a reliable prediction.

• An absolute outsider differs from the calibration samples.

27

The Main Features of the SIC-method

SIC-METHOD

• gives the result of prediction directly in interval form

• calculates the prediction interval irrespective of the sample's position relative to the model

• summarizes and processes all the errors involved in bi-linear modelling together, and estimates the Maximum Error Deviation for the model

• provides wide possibilities for sample classification and outlier detection