
  • Multiple linear regression: model description and application

    Markus Kunze

    Institut für Meteorologie, Freie Universität Berlin

    Ausgewählte Probleme der mittleren Atmosphäre (Selected Problems of the Middle Atmosphere), WS 2009/10

  • Outline

    1 Theory
      Multiple linear regression model
      Residuals
      Uncertainties

    2 Model description
      The linear regression model
      Basis functions
      Namelists


  • Multiple linear regression model I

    • Model: the response y_t may be related to p = k + 1 regressor variables (basis functions, counting the constant offset):

      y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + \varepsilon_t

      y_t = \beta_0 + \sum_{j=1}^{k} \beta_j x_{tj} + \varepsilon_t , \qquad t = 1, 2, \ldots, n    (1)

    • Task: find the unknown regression coefficients \beta_j , j = 0, 1, \ldots, k. The number of observations n must be greater than the number of unknown coefficients: n > k + 1.

    • The model is a linear model because it is a linear function of the regression coefficients.


  • Multiple linear regression model II

    • Matrix notation of the multiple linear regression model:

      y = X\beta + \varepsilon    (2)

    • with the following matrices (n: number of observations, p = k + 1: number of regressor variables, i.e. basis functions):

      y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}  (n \times 1 matrix),
      X = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1k} \\ 1 & x_{21} & x_{22} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nk} \end{bmatrix}  (n \times p matrix)

      \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}  (p \times 1 matrix),
      \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}  (n \times 1 matrix)


  • The least-squares function

    • As for simple linear regression, one looks for the parameters that minimise the least-squares function S(\beta_0, \beta_1, \ldots, \beta_k):

      S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{t=1}^{n} \varepsilon_t^2
        = \sum_{t=1}^{n} \Bigl( y_t - \beta_0 - \sum_{j=1}^{k} \beta_j x_{tj} \Bigr)^2    (3)


  • The normal equations

    • The k + 1 partial derivatives of the least-squares function lead to k + 1 normal equations:

      \frac{\partial S}{\partial \beta_0} = -2 \sum_{t=1}^{n} \Bigl( y_t - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{tj} \Bigr) = 0

      \frac{\partial S}{\partial \beta_j} = -2 \sum_{t=1}^{n} \Bigl( y_t - \hat{\beta}_0 - \sum_{l=1}^{k} \hat{\beta}_l x_{tl} \Bigr) x_{tj} = 0 , \qquad j = 1, 2, \ldots, k


  • The normal equations II

    • The least-squares function in matrix notation:

      S(\beta) = \sum_{t=1}^{n} \varepsilon_t^2 = \varepsilon^T \varepsilon
               = (y - X\beta)^T (y - X\beta)
               = y^T y - \beta^T X^T y - y^T X \beta + \beta^T X^T X \beta
               = y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta    (4)

    • The least-squares normal equations in matrix notation:

      \frac{\partial S}{\partial \beta} = -2 X^T y + 2 X^T X \hat{\beta} = 0    (5)

      X^T X \hat{\beta} = X^T y    (6)


  • The solutions for the normal equations

    • The least-squares estimator of \beta:

      \hat{\beta} = (X^T X)^{-1} X^T y    (7)

    • The solution can be found when the covariance matrix (X^T X)^{-1} exists.

    • That is the case if the regressor variables x_j are linearly independent, i.e. no column of the matrix X can be written as a linear combination of the other columns.

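    A minimal numerical sketch of Eq. (7) in Python/NumPy (the regression program itself is Fortran; the data, coefficients, and sizes below are invented for illustration):

      import numpy as np

      # toy data: n = 120 observations, k = 2 regressors (values are invented)
      rng = np.random.default_rng(0)
      n, k = 120, 2
      x = rng.standard_normal((n, k))
      y = 1.5 + 0.8 * x[:, 0] - 0.3 * x[:, 1] + 0.1 * rng.standard_normal(n)

      # design matrix X (n x p, p = k + 1): a column of ones plus the regressors
      X = np.column_stack([np.ones(n), x])

      # least-squares estimator, Eq. (7); lstsq solves the normal equations
      # X^T X beta = X^T y in a numerically stable way
      beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
      residuals = y - X @ beta_hat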


  • Autocorrelated errors I

    • Fundamental assumptions in linear regression:

      E(\varepsilon_t) = 0 :  the \varepsilon_t have zero mean.
      Var(\varepsilon_t) = \sigma^2 :  the \varepsilon_t have constant variance.
      E(\varepsilon_t \varepsilon_{t-1}) = 0 :  the \varepsilon_t are uncorrelated.

    • The assumption of uncorrelated or independent errors is often not appropriate for time series data:

      E(\varepsilon_t \varepsilon_{t-1}) \neq 0 .    (8)

    • Sources of autocorrelation:
      • The regression model is not complete: one or more important regressors are not included.


  • Autocorrelated errors II: effects of autocorrelated errors

    • If all assumptions are valid, the estimated regression coefficients are unbiased, efficient, and consistent.

    • If not, the regression coefficients are no longer minimum-variance estimates: the estimates are inefficient.

    • The residual mean square may seriously underestimate \sigma^2.

    • The standard errors of the regression coefficients may be too small.

    • The confidence intervals are shorter than they really should be.

    • Misleading test statistics: they indicate significance for insignificant results.


  • Autoregressive model

    • If there are autocorrelations in the residuals, the linear regression model has to be transformed.

    • An autoregressive model is applied to estimate the degree of autocorrelation.

    • Second-order autoregressive model for the residuals \varepsilon_t at time t:

      \varepsilon_t = \rho_1 \varepsilon_{t-1} + \rho_2 \varepsilon_{t-2} + a_t ,    (9)

      where a_t is a random variable and \rho_1 and \rho_2 are the autocorrelation parameters, subject to the stationarity conditions

      \rho_1 + \rho_2 < 1 , \qquad \rho_2 - \rho_1 < 1 , \qquad -1 < \rho_2 < 1 .

    • \rho_1 and \rho_2 are used to transform the model according to Tiao et al. (1990).

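    A sketch of how \rho_1 and \rho_2 can be estimated from the residuals via the Yule-Walker equations (one common estimator; the slides do not state which one the program uses):

      import numpy as np

      def ar2_coefficients(eps):
          """Estimate AR(2) parameters rho1, rho2 from residuals eps
          using the Yule-Walker equations, Eq. (9)."""
          eps = eps - eps.mean()
          c0 = np.mean(eps * eps)                  # lag-0 autocovariance
          r1 = np.mean(eps[1:] * eps[:-1]) / c0    # lag-1 autocorrelation
          r2 = np.mean(eps[2:] * eps[:-2]) / c0    # lag-2 autocorrelation
          # Yule-Walker: r1 = rho1 + rho2*r1,  r2 = rho1*r1 + rho2
          rho1, rho2 = np.linalg.solve([[1.0, r1], [r1, 1.0]], [r1, r2])
          return rho1, rho2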

  • Transformation of the model (see Tiao et al. 1990, Appendix A)

    • Transformation of the response variable:

      y'_t = y_t - \rho_1 y_{t-1} - \rho_2 y_{t-2}    (10)

    • The transformed model (for a simple example with two regressors):

      y'_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \varepsilon_t
             - \rho_1 (\beta_0 + \beta_1 x_{(t-1)1} + \beta_2 x_{(t-1)2} + \varepsilon_{t-1})
             - \rho_2 (\beta_0 + \beta_1 x_{(t-2)1} + \beta_2 x_{(t-2)2} + \varepsilon_{t-2})

           = \beta_0 (1 - \rho_1 - \rho_2)
             + \beta_1 (x_{t1} - \rho_1 x_{(t-1)1} - \rho_2 x_{(t-2)1})
             + \beta_2 (x_{t2} - \rho_1 x_{(t-1)2} - \rho_2 x_{(t-2)2})
             + \varepsilon_t - \rho_1 \varepsilon_{t-1} - \rho_2 \varepsilon_{t-2}

      y'_t = \beta'_0 + \beta'_1 x'_{t1} + \beta'_2 x'_{t2} + a_t    (11)

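    A sketch of the transformation applied to the response and, column by column, to the design matrix (the first two time steps are dropped here because they have no lagged values; the slides do not specify how the series edges are handled):

      import numpy as np

      def ar2_transform(y, X, rho1, rho2):
          """Apply Eq. (10) to y and to each column of X. The transformed
          ones-column becomes (1 - rho1 - rho2), matching Eq. (11)."""
          yp = y[2:] - rho1 * y[1:-1] - rho2 * y[:-2]
          Xp = X[2:] - rho1 * X[1:-1] - rho2 * X[:-2]
          return yp, Xp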

  • Transformation of the model II

    • The uncertainties: for the first run of the least-squares regression (lsqr) the uncertainties are set to 1: \sigma_t = 1. They are used prior to the lsqr to normalise the matrix X and the response variable y:

      X_{tj} = x_{tj} / \sigma_t , \qquad Y_t = y_t / \sigma_t

    • Update the uncertainties (Box and Jenkins, 1970):

      \sigma_t = \sqrt{ \frac{(1 - \rho_2) \, \sigma_\varepsilon^2}{(1 + \rho_2) \left[ (1 - \rho_2)^2 - \rho_1^2 \right]} }    (12)

      where \sigma_\varepsilon, the standard deviation of the residuals, is derived for each individual month.

    • After the transformation the lsqr is run again.

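    Eq. (12) as a one-line helper (a sketch; sigma_eps would be the residual standard deviation of the corresponding month):

      import numpy as np

      def sigma_update(rho1, rho2, sigma_eps):
          """Uncertainty update of Eq. (12), Box and Jenkins (1970)."""
          return np.sqrt((1.0 - rho2) * sigma_eps**2 /
                         ((1.0 + rho2) * ((1.0 - rho2)**2 - rho1**2)))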


  • The uncertainties of the regression coefficients

    • The uncertainties of the regression parameters are given by the diagonal elements of the covariance matrix (X^T X)^{-1}:

      (X^T X)^{-1} = C = \begin{bmatrix} C_{11} & \ldots & \ldots & \ldots \\ \ldots & C_{22} & \ldots & \ldots \\ \vdots & \vdots & \ddots & \vdots \\ \ldots & \ldots & \ldots & C_{kk} \end{bmatrix}

    • The values for \sqrt{C_{jj}} are stored in an extra output file.

  • The t test statistics

    • With a Student's t test the following hypotheses H_0 and H_1 are tested:

      H_0 : \beta_j = 0
      H_1 : \beta_j \neq 0

      If H_0 is rejected, basis function x_j has a significant influence on y.

    • The t test statistics are calculated from the uncertainties and the estimated regression coefficients:

      t_{0j} = \hat{\beta}_j / \sqrt{C_{jj}}
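    A sketch of the coefficient uncertainties and t statistics in NumPy/SciPy; this assumes the data were already normalised by \sigma_t, so that \hat{\beta}_j / \sqrt{C_{jj}} is the usual t statistic (the p-values correspond to the "probabilities of t test" in the program output):

      import numpy as np
      from scipy import stats

      def t_statistics(X, y):
          beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
          C = np.linalg.inv(X.T @ X)            # covariance matrix
          t0 = beta_hat / np.sqrt(np.diag(C))   # t_0j = beta_j / sqrt(C_jj)
          # two-sided probabilities for H0: beta_j = 0
          dof = X.shape[0] - X.shape[1]
          prob = 2.0 * stats.t.sf(np.abs(t0), dof)
          return t0, prob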


  • The linear regression model

    • A variable y_t at time t can be modelled in the following way:

      y(t) = \beta_{offs} \cdot offset (N=m)
           + \beta_{tr} \cdot trend(t) (N=m)
           + \beta_{qbo} \cdot QBO(t) (N=m)
           + \beta_{qbo\_or} \cdot QBO\_orthog(t) (N=m)
           + \beta_{sfl} \cdot solar(t) (N=0)
           + \beta_{ens} \cdot ENSO(t) (N=n)
           + \beta_{vol} \cdot Volcano(t) (N=m)
           + \cdots + \varepsilon(t) , \qquad t = 1, \ldots, n    (13)

    • The model can easily be expanded with more basis functions.


  • The seasonal variability

    • The basis functions can be expanded into N = m pairs of sine and cosine functions to account for the seasonal variability (see the sketch after this slide):

      \beta_j x_{tj} = \beta_{j0} x_{tj} + \sum_{k=1}^{m} [ \beta_{j(2k-1)} \sin(2\pi k t / 365.25) + \beta_{j(2k)} \cos(2\pi k t / 365.25) ] x_{tj}

    • For example, the trend with m = 1:

      \beta_{tr} tr_{tj} = \beta_{tr0} tr_{tj} + [ \beta_{tr1} \sin(2\pi t / 365.25) + \beta_{tr2} \cos(2\pi t / 365.25) ] tr_{tj}

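    A sketch of the Fourier-pair expansion of a single basis function (t in days; each expanded basis function contributes 1 + 2m columns to the design matrix X):

      import numpy as np

      def fourier_expand(x, t, m):
          """Expand basis function x(t) into the columns
          x, x*sin(2*pi*k*t/365.25), x*cos(2*pi*k*t/365.25), k = 1..m."""
          cols = [x]
          for k in range(1, m + 1):
              arg = 2.0 * np.pi * k * t / 365.25
              cols += [x * np.sin(arg), x * np.cos(arg)]
          return np.column_stack(cols)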


  • Treatment of the basis functions

    • If a basis function has a mean and a trend > 0:
      • the offset basis function does not account for all of the offset,
      • the trend basis function does not account for all of the trend.

    • The mean, or the mean and the trend, can be removed (see the sketch after this slide):

      x_t = x_t - \bar{x} ,          TREATMENT='RemoveMean'
      x_t = x_t - b t ,              TREATMENT='RemoveTrend'
      x_t = x_t - b t - a ,          TREATMENT='RemoveTrendAndMean'

    • Remove the seasonal cycle:

      x_t = x_t - \bar{x}_{t,ltm} ,  TREATMENT='Deseason'

    • Whether or not a basis function should be treated depends on the question to be answered with the linear regression model (Bodeker et al., 1998).

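    A sketch of the treatments for monthly data ('Deseason' subtracts the long-term mean of each calendar month; a + b t is a least-squares straight-line fit):

      import numpy as np

      def treat(x, t, treatment):
          if treatment == 'RemoveMean':
              return x - x.mean()
          if treatment in ('RemoveTrend', 'RemoveTrendAndMean'):
              b, a = np.polyfit(t, x, 1)   # slope b, intercept a
              offset = a if treatment == 'RemoveTrendAndMean' else 0.0
              return x - b * t - offset
          if treatment == 'Deseason':      # monthly long-term means
              out = x.astype(float)
              for month in range(12):
                  out[month::12] -= out[month::12].mean()
              return out
          return x                          # 'None'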

  • Fill matrix X with treated basis functions

    • Create the design matrix X from the treated basis functions (regressors).

    • Each column represents a time series t = 1, 2, \ldots, n.

    • The number of columns depends on the number of basis functions and on the number of Fourier pair expansions used for them.

      X = \begin{bmatrix} off1 & off2_1 & off3_1 & qbo1_1 & \ldots & tr3_1 \\ off1 & off2_2 & off3_2 & qbo1_2 & \ldots & tr3_2 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ off1 & off2_n & off3_n & qbo1_n & \ldots & tr3_n \end{bmatrix}


  • Main tasks

    • Fill the design matrix X with treated basis functions.
    • Perform the first run of the least-squares regression subroutine.
    • Analyse the residuals and transform the time series (see the sketch after this slide):
      • Run the second-order autoregressive model on the residuals.
      • Transform the model according to the autocorrelation coefficients \rho_1 and \rho_2.
      • Update the standard deviation \sigma_t.
    • Perform the second run of the least-squares regression subroutine.

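    An end-to-end sketch of the two-pass scheme, reusing the helper sketches above (ar2_coefficients, ar2_transform, sigma_update, treat are assumed to be defined); all data below are invented toy values, not the program's actual inputs:

      import numpy as np

      rng = np.random.default_rng(1)
      n = 240                                  # 20 years of monthly data
      t = np.arange(n) * 30.4375               # time in days
      qbo = np.sin(2.0 * np.pi * t / 850.0)    # toy QBO-like regressor
      y = 0.5 + 0.01 * t / 365.25 + 0.8 * qbo + rng.standard_normal(n)

      # 1. fill the design matrix: offset, trend, treated QBO (m = 0 here)
      X = np.column_stack([np.ones(n), t / 365.25,
                           treat(qbo, t, 'RemoveMean')])
      # 2. first lsqr run
      beta1, *_ = np.linalg.lstsq(X, y, rcond=None)
      eps = y - X @ beta1
      # 3. AR(2) fit, model transformation, uncertainty update
      rho1, rho2 = ar2_coefficients(eps)
      yp, Xp = ar2_transform(y, X, rho1, rho2)
      sigma = sigma_update(rho1, rho2, eps.std())
      # 4. second lsqr run on the normalised, transformed model
      beta2, *_ = np.linalg.lstsq(Xp / sigma, yp / sigma, rcond=None)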

  • Orthogonal version of the QBO

    • To account for the different phases of the QBO at lower and higher levels of the tropical stratosphere, two QBO basis functions are used.

    • An orthogonal version is created from the time series of the QBO. QBO_orthog is orthogonal to QBO if the dot product vanishes:

      qbo \cdot qbo\_orthog = \sum_{t=1}^{n} qbo_t \, qbo\_orthog_t = 0

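    One way to construct such a series is a Gram-Schmidt projection (a sketch; the slides do not state which construction the program uses):

      import numpy as np

      def orthogonalize(x, qbo):
          """Remove from x its projection onto qbo, so the
          result is orthogonal to qbo: result . qbo = 0."""
          return x - (qbo @ x) / (qbo @ qbo) * qbo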

  • Volcanic basis functions

    • A rapid initial perturbation followed by an exponential relaxation (Bodeker et al., 1998):

      y_t = e^{(Onset - DecimalDate)} \left( 1 - e^{6 (Onset - DecimalDate)} \right)
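    A sketch of this shape function, set to zero before the eruption onset (an assumption; the pre-onset behaviour is not shown on the slide):

      import numpy as np

      def volcano_basis(decimal_date, onset):
          """Zero before the onset; afterwards a rapid rise followed
          by an exponential relaxation back towards zero."""
          d = onset - np.asarray(decimal_date, dtype=float)
          out = np.zeros_like(d)
          past = d < 0.0                     # dates after the eruption
          out[past] = np.exp(d[past]) * (1.0 - np.exp(6.0 * d[past]))
          return out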


  • Main namelist to control model behaviour:

    &MLREG_INP
      OFILE       = 'mlreg-output', ! -> output name
      START_DATE  = '19800101',     ! -> start date of data processing
      END_DATE    = '20051231',     ! -> end date of data processing
      TLAB        = 'ALL',          ! -> use complete time series
      LAUTO       = T,              ! -> account for autocorrelation
                                    !    of the residuals
      ICOEFF      = 1,              ! -> write out 1st regression coeff.
      LOFFSET     = T,              ! -> use offset
      LTREND      = T,              ! -> use trend
      LQBO        = T,              ! -> use QBO
      LQBO_ORTHOG = T,              ! -> use orthogonal QBO
      LENSO       = T,              ! -> use ENSO
      LPINATUBO   = T,              ! -> use Pinatubo volcano
      LELCHICHON  = T,              ! -> use El Chichon volcano
      LAGUNG      = F,              ! -> use Agung volcano
      LNAO        = F,              ! -> use NAO
      LLAG        = F,              ! -> use time lagged input
      LNAM        = F,              ! -> use Northern annular mode
      LSAM        = F,              ! -> use Southern annular mode
      LEXTRA1     = F,              ! -> \
      LEXTRA2     = F,              ! -> use extra defined basis functions
      LEXTRA3     = F               ! -> /
    /

  • Namelist to describe the input data time series:

    &DATA_INP
      LAREA     = T,     ! -> create an area weighted mean over
                         !    latitudes, specified by LAT_S, LAT_E
      LZM       = F,     ! -> create zonal mean prior to regression
      LINP_OUT  = T,     ! -> write out the used input data
      LFIT_OUT  = T,     ! -> write out fitted data for each single
                         !    basis function
      MISSING   = ,      ! -> value for missing data
      TFILTER   = F, 'RMEAN', 5, 0.,0.,0., ! -> characterise time filter
      LON_S     = 25.,   ! -> start longitude; default: all longitudes
      LON_E     = -25.,  ! -> end longitude
      LAT_S     = 25.,   ! -> start latitude; default: all latitudes
      LAT_E     = -25.,  ! -> end latitude
      SDATE     = '',    ! -> start date of input data file
      EDATE     = '',    ! -> end date of input data file
      LEVEL     = 10.,   ! -> requested pressure level(s);
                         !    default: all pressure levels
      CODE      = '',    ! -> code of the variable in the netCDF file
      IFILE     = '',    ! -> netCDF input data file
      LDESEASON = F      ! -> deseasonalise the input time series
                         !    prior to regression
    /

  • Filter characterisation:

    The namelist entry TFILTER characterises the time filtering of the basis function. TFILTER is a derived TYPE with the following elements:

    TYPE filter
       ! lfilter = .TRUE. - apply a time filter
       LOGICAL :: lfilter
       ! fkind = 'rmean'       - simple running mean with equal weights
       ! fkind = 'rmean_gauss' - running mean with Gaussian weights
       ! fkind = 'butterworth' - second-order Butterworth filter
       CHARACTER(len=16) :: fkind
       INTEGER  :: nrmean    ! number of time steps for running mean
       REAL(sp) :: variance  ! variance used for Gaussian weights
       ! The cutoff frequencies of the Butterworth filter are
       ! calculated from the time periods lowt and uppt:
       !   lowf = 2*Pi/lowt - lower cutoff frequency
       !   uppf = 2*Pi/uppt - upper cutoff frequency
       REAL(sp) :: lowt
       REAL(sp) :: uppt
    END TYPE filter

  • Namelist to describe the offset and trend basis functions:

    &OFFSET_INP
      NFOUR     = 0,     ! -> number of Fourier pair expansions
      NSPH      = 0,     ! -> number of spherical harmonic expansions
      TREATMENT = 'None' ! -> always 'None' for the offset basis function
    /
    &TREND_INP
      NFOUR     = 0,
      NSPH      = 0,
      TREATMENT = 'RemoveMean'
    /

    Possible treatments for other basis functions:

    TREATMENT = 'RemoveMean'
    TREATMENT = 'RemoveTrend'
    TREATMENT = 'RemoveTrendAndMean'
    TREATMENT = 'Deseason'
    TREATMENT = 'DeseasonAndRemoveTrend'
    TREATMENT = 'DeseasonAndRemoveMean'
    TREATMENT = 'DeseasonAndRemoveTrendAndMean'

  • Namelist to describe the volcanic basis function:

    &PINATUBO_INP
      NFOUR     = 0,    ! -> number of Fourier pair expansions
      NSPH      = 0,    ! -> number of spherical harmonic expansions
      TSHIFT    = 0.,   ! -> time shift; default: t = -9.99E30
      TREATMENT = 'None'
    /

    If the parameter TSHIFT is not specified, the optimal shift is estimated within the program. This can slow down the whole process.

  • Namelist to describe the QBO basis function:

    The standard QBO is given by monthly mean radiosonde-derived winds (Naujokat, 1986). The input is given as an unformatted binary file. All information about the structure of the file must be given in the namelist.

    &QBO_INP
      NFOUR     = 2,
      NSPH      = 0,
      TREATMENT = 'RemoveTrendAndMean',
      TFILTER   = F, 'RMEAN', 5, 0.,0.,0.,
      SDATE     = '19530101',   ! -> start date of the data file
      EDATE     = '20081231',   ! -> end date of the data file
      MISSING   = -99999.,      ! -> value for missing data
      LEVEL     = 100., 70., 50., 40., 30., 20., 15., 10.,
      WORK      = 50.,          ! -> pressure level WORK for QBO definition
      IFILE     = 'qbo-195301-200812.dat' ! -> unformatted binary file
    /

  • Namelist to describe the QBO basis function: netCDF input

    It is possible to use a netCDF file as QBO input file. This is important when model data are to be analysed.

    &QBO_INP
      NFOUR     = 2,
      NSPH      = 0,
      TREATMENT = 'RemoveTrendAndMean',
      TFILTER   = F, 'RMEAN', 5, 0.,0.,0.,
      MISSING   = -9.999E30, ! -> value for missing data
      LAT_S     = 5.,        ! -> calculate area weighted mean from
      LAT_E     = -5.,       !    LAT_S to LAT_E, to define the QBO
      WORK      = 50.,       ! -> use pressure level WORK for QBO definition
      CODE      = 'ua',      ! -> name of the variable within the netCDF file
      IFILE     = ""         ! -> netCDF input file
    /

  • Namelist to describe the SFLUX, ENSO, and NAO basis functions:

    The namelist structure is the same for the SFLUX, ENSO, and NAO basis functions. Binary or ASCII input data are possible. Example for ENSO:

    &ENSO_INP
      NFOUR     = 2,
      NSPH      = 0,
      TREATMENT = 'RemoveTrendAndMean',
      TFILTER   = F, 'RMEAN', 5, 0.,0.,0., ! -> characterise time filter
      SDATE     = '18710101',
      EDATE     = '20081201',
      MISSING   = -99.9,
      DTYPE     = 'BIN',  ! -> unformatted binary input
      IFILE     = 'data/ENSO/nino34/nino34_index.dat'
    /

    &ENSO_INP
      NFOUR     = 2,
      NSPH      = 0,
      TREATMENT = 'RemoveTrendAndMean',
      TFILTER   = F, 'RMEAN', 5, 0.,0.,0., ! -> characterise time filter
      MISSING   = -99.9,
      DTYPE     = 'TAB',  ! -> ASCII input
      IFILE     = 'data/ENSO/nino34/nino34_index_tab.dat'
    /

  • Namelist to describe the EXTRA basis functions:

    There is an additional entry LAB to label the basis function. Again, binary or ASCII input data are possible. Example for the QBO according to Randel and Wu:

    &EXTRA_INP
      LAB       = 'qbo',       ! -> label the basis function
      NFOUR     = 0,
      NSPH      = 0,
      TREATMENT = 'DeseasonAndRemoveMean',
      TFILTER   = F, 'RMEAN', 5, 0.,0.,0.,
      MISSING   = -99999.,
      DTYPE     = 'TAB',
      IFILE     = '/home/kunze/data/mlreg/data/QBO/qbo1_tab.dat'
    /

    &EXTRA_INP
      LAB       = 'qbo_orth',  ! -> label the basis function
      NFOUR     = 0,
      NSPH      = 0,
      TREATMENT = 'DeseasonAndRemoveMean',
      TFILTER   = F, 'RMEAN', 5, 0.,0.,0.,
      MISSING   = -99999.,
      DTYPE     = 'TAB',
      IFILE     = '/home/kunze/data/mlreg/data/QBO/qbo2_tab.dat'
    /

  • Format of the ASCII input file:

    The standard ASCII input file consists of two columns:
    • column 1: decimal date
    • column 2: floating point data value

    For example, data/SFLUX/solar_flux_monthly_tab.dat:

    1947.0416  -99.99
    1947.125   202.7
    1947.2084  235.7
    1947.2916  264.1
    1947.375   261.2
    1947.4584  226.6
    1947.5416  215.2
    1947.625   231.2
    1947.7084  199.7
    1947.7916  209.0
    1947.875   179.8
    1947.9584  176.40001
    1948.0416  155.7
    1948.125   134.3
    1948.2084  135.5
    1948.2916  208.1
    1948.375   226.5
    ....

  • Command line arguments:

    All namelists are stored in one single namelist file. The name of the namelist file is given on the command line:

    Syntax: mlreg -nl <namelist file>

      -nl   execute with the given namelist file
      -wnl  create a default namelist file

  • Program output:

    All results are stored in netCDF data files. A GrADS control file is provided for each netCDF file.

    ta_EMAC_1_196001_200012_coeff.ctl    ! -> regression coefficients
    ta_EMAC_1_196001_200012_coeff.nc
    ta_EMAC_1_196001_200012_coeff_ts.ctl ! -> time resolved coeff.
    ta_EMAC_1_196001_200012_coeff_ts.nc
    ta_EMAC_1_196001_200012_inp.ctl      ! -> input data and data fit
    ta_EMAC_1_196001_200012_inp.nc
    ta_EMAC_1_196001_200012_prob.ctl     ! -> probabilities of t test
    ta_EMAC_1_196001_200012_prob.nc
    ta_EMAC_1_196001_200012_prob_ts.ctl
    ta_EMAC_1_196001_200012_prob_ts.nc
    ta_EMAC_1_196001_200012_regr.ctl     ! -> basis functions
    ta_EMAC_1_196001_200012_regr.nc
    ta_EMAC_1_196001_200012_tval.ctl     ! -> t test statistics
    ta_EMAC_1_196001_200012_tval.nc
    ta_EMAC_1_196001_200012_tval_ts.ctl
    ta_EMAC_1_196001_200012_tval_ts.nc
    ta_EMAC_1_196001_200012_unc.ctl      ! -> uncertainties
    ta_EMAC_1_196001_200012_unc.nc
    ta_EMAC_1_196001_200012_unc_ts.ctl
    ta_EMAC_1_196001_200012_unc_ts.nc

  • Expand the results in time

    • The regression coefficients can be expanded in time to get the seasonal variation of the specific influence:

      y_{tj} = \beta_{j0} + \sum_{k=1}^{m} [ \beta_{j(2k-1)} \sin(2\pi k t / 365.25) + \beta_{j(2k)} \cos(2\pi k t / 365.25) ]

  • Program output:

    File: ta_EMAC_1_196001_200012_coeff.ctl

    dset ^ta_EMAC_1_196001_200012_coeff.nc
    dtype netcdf
    .....
    vars 49
    off1=>off1 31 t,z,y,x [offset]
    off2=>off2 31 t,z,y,x [offset]*sin( 2*Pi*Decimal Year )
    off3=>off3 31 t,z,y,x [offset]*cos( 2*Pi*Decimal Year )
    off4=>off4 31 t,z,y,x [offset]*sin( 4*Pi*Decimal Year )
    off5=>off5 31 t,z,y,x [offset]*cos( 4*Pi*Decimal Year )
    off6=>off6 31 t,z,y,x [offset]*sin( 6*Pi*Decimal Year )
    off7=>off7 31 t,z,y,x [offset]*cos( 6*Pi*Decimal Year )
    tr1=>tr1 31 t,z,y,x [trend]
    tr2=>tr2 31 t,z,y,x [trend]*sin( 2*Pi*Decimal Year )
    tr3=>tr3 31 t,z,y,x [trend]*cos( 2*Pi*Decimal Year )
    tr4=>tr4 31 t,z,y,x [trend]*sin( 4*Pi*Decimal Year )
    tr5=>tr5 31 t,z,y,x [trend]*cos( 4*Pi*Decimal Year )
    tr6=>tr6 31 t,z,y,x [trend]*sin( 6*Pi*Decimal Year )
    tr7=>tr7 31 t,z,y,x [trend]*cos( 6*Pi*Decimal Year )
    ......

  • Program output:

    File: ta_EMAC_1_196001_200012_inp.nc

    data=>data   31 t,z,y,x input data time series
    dfit1=>dfit1 31 t,z,y,x complete fit of the regression model
    dfit2=>dfit2 31 t,z,y,x second complete fit of the regression model
    dfit3=>dfit3 31 t,z,y,x third complete fit of the transformed data
    off1=>off1   31 t,z,y,x [offset]
    off2=>off2   31 t,z,y,x [offset]*sin( 2*Pi*Decimal Year )
    off3=>off3   31 t,z,y,x [offset]*cos( 2*Pi*Decimal Year )
    ......
    res1=>res1   31 t,z,y,x residuals
    res2=>res2   31 t,z,y,x residuals after accounting for autocorrelation

  • References

    D. C. Montgomery, E. A. Peck, and G. G. Vining,
    Introduction to Linear Regression Analysis,
    John Wiley & Sons, 2001.

    G. E. Bodeker, I. S. Boyd, and W. A. Matthews,
    Trends and variability in vertical ozone and temperature profiles measured by ozonesondes at Lauder, New Zealand: 1986-1996,
    J. Geophys. Res., 103, D22, 28661-28681, 1998.

    G. E. P. Box and G. M. Jenkins,
    Time Series Analysis: Forecasting and Control,
    Holden-Day, Merrifield, Va., 1970.

    G. C. Tiao et al.,
    Effects of autocorrelation and temporal sampling schemes on estimates of trend and spatial correlation,
    J. Geophys. Res., 95, 20507-20517, 1990.
