exercises 1: execute step-by-step the r code at

22
Exercises 1: Execute step-by-step the R code at http://folk.uio.no/trondr/nve_course_3/hoelen1.R Take a look at the yearly discharge means at Hølen. This code should give answers to the following: a) Take a look at the histogram together with the normal distribution having the same mean(expectancy) and standard deviation. Do the data look approximately normally distributed? b) Make a QQ plot to further investigate whether the data is approximately normally distributed. c) Do a and b over again, but this time using the lognormal distribution. (Use the mean and standard deviations of the logarithmic values to adapt the lognormal distribution to the data). d) Repeat a-c for daily discharges also (found at http://folk.uio.no/trondr/nve_course_3/TrendDog nHoelen.txt ). What conclusions change and why?

Upload: fayola

Post on 24-Feb-2016

45 views

Category:

Documents


0 download

DESCRIPTION

Exercises 1: Execute step-by-step the R code at http://folk.uio.no/trondr/nve_course_3/hoelen1.R Take a look at the yearly discharge means at Hølen. This code should give answers to the following: - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Exercises 1: Execute step-by-step the R code at

Exercises 1: Execute step-by-step the R code at http://folk.uio.no/trondr/nve_course_3/hoelen1.R

Take a look at the yearly discharge means at Hølen. This code should give answers to the following:

a) Take a look at the histogram together with the normal distribution having the same mean(expectancy) and standard deviation. Do the data look approximately normally distributed?

b) Make a QQ plot to further investigate whether the data is approximately normally distributed.

c) Do a and b over again, but this time using the lognormal distribution. (Use the mean and standard deviations of the logarithmic values to adapt the lognormal distribution to the data).

d) Repeat a-c for daily discharges also (found at http://folk.uio.no/trondr/nve_course_3/TrendDognHoelen.txt). What conclusions change and why?

Page 2: Exercises 1: Execute step-by-step the R code at

Exercise 2: Is the expectancy of the yearly mean discharges at Hølen equal to 10m3/s?

http://folk.uio.no/trondr/nve_course_3/hoelen2.Ra) Estimate the expectancy.b) Check the assumption expectancy=10m3/s using a t-test (takes the

variance uncertainty into account). Use preferably a significance level of 5% (confidence level of 95%).

c) Show the data together with the confidence interval. Should we worry that most of the data is outside this interval? Is it 95% probability that the real expectancy lies within this specific interval?

d) Would we be justified in using the techniques in a-c for daily data also?

e) Do the same analysis using the lognormal distribution also. Perform a bootstrap analysis in order to make a 95% confidence interval. What does this say about the assumption expectancy=10m3/s?

Page 3: Exercises 1: Execute step-by-step the R code at

Extra exercise 1Will illustrate the law of large numbers and the central limit theoremCode: http://folk.uio.no/trondr/nve_course_3/lawoflargenumbers.Ra) Draw n=10000 samples from the Poisson distribution with rate (and

thus expectancy) =0.3. Calculate the mean and see how much or little this differs from the expectancy. Repeat a couple of times. Why are the results as they are?

b) Look at the histogram and compare with the probability distribution. c) Look at the distribution of means of n=10000 samples from the

Poisson distribution. Sample N=1000 such means. Look at the histogram and compare with the normal distribution which the central limit theorem says the distribution of the mean should start to resample. Check with a QQ plot.

d) What happens with the mean when only a few samples, n=10, are used. Increase N to 100000. Is there a noticeable difference between the distribution of the means and the normal distribution?

)/,( nN

Page 4: Exercises 1: Execute step-by-step the R code at

Extra exercise 2: Outcome of dice throwsOn a cubic die, there is a 1/6th probability for each outcome from 1 to 6. When you have two dice, let’s assume the outcomes on the two dice are independent.

a) What is the probability for a specific outcome on the two dice, for instance the probability of getting 5 on the first die and 2 on the second die?

b) What is the probability of getting sum=2 one the two dice? Repeat for sum=3, sum=4, sum=5, sum=6 , sum=7, sum=8.

c) What is the probability for sum<=4? d) What is the probability for equal outcome on the two dice?e) What is the probability of getting equal outcomes and sum<=4? f) What is the probability for either getting sum<=4 or two equal dice?

You can use the answers from c, d and e or you can count the possibilities.

g) Both from the rule for conditional probability and from the list of outcomes where sum<=4, what is the probability of equal dice given that sum<=4

h) Calculate the probability from sum<=4 given two equal dice, both from a list of possible outcomes and from Bayes formula.

Page 5: Exercises 1: Execute step-by-step the R code at

Extra exercise 3:At Blindern, there is a 33.9% chance of rain any given day, given that it rained yesterday. There is a 12.9% chance of rain given that it didn’t rain the previous day. PS: Assume stationarity, that is that all probabilities do not depend of which day it is, when the circumstances are the same.

a) What is the probability that it rains any given day? (I.e. what is the marginal probability for rain?) Tip: The law of total probability gives Pr(rain today)= Pr(rain today|rain yesterday)Pr(rain yesterday)+ P(rain today| no rain yesterday)Pr(no rain yesterday).

b) Why is the chance of rain yesterday given that it rained today also 33.9% (Tip: Bayes formula)

Page 6: Exercises 1: Execute step-by-step the R code at

Extra exercise 4 – conditional probabilities

The Hobbit Shire council has decided to expand the hobbit settlements westwards. Unfortunately, these lands are infested with dragons!

Of the 10kmx10km areas studied so far, 70% were dragon-infested.

A standard protocol for examining and area was made. A standardized test area of lesser size is examined, where a field hobbit biologist goes through the area in detail. The Shire Biology Institute has determined that there is a 50% chance of detecting a dragon in an area with this procedure, if there is a dragon in that area.

If there is no dragon in the area, there is of course no probability of detecting one either!

?

Here be dragonsNo dragons

?

Dragon Hobbit

Page 7: Exercises 1: Execute step-by-step the R code at

What is the (marginal) probability of detecting a dragon, when you don’t know whether the area is infected or not?

(Hint: The law of total probability)

Show using Bayes formula that the probability of there being a dragon in the area if you actually detected one, is 100%.

Find the probability of there being a dragon in the area given that you did not detect one. Could you have expected the probability to decrease even without knowing the detection rate?

Dragon in the areaDragon detected

Dragon detected and in the area

Dragon in the areaDragon detected

Dragon in the area

or

No dragon

No dragonNo

dragon

Extra exercise 4 contd.

Page 8: Exercises 1: Execute step-by-step the R code at

Exercise 3: Analysis of repeat period for a given threshold value in continuous time.

Will look at the danger for the discharge to go above a fixed threshold. The station Gryta has discharge>1.5m3/s 27(=y) times in the course of t=44 years. Assume such events to be independent in time, i.e. a Poisson process. Instead of rate, we will use the expected repeat period, T=1/ (where is the rate). When one has data over a time period, t, the probability for y events will then be (Poisson distribution):

The likelihood is maximal at . (For those with analytical optimization experience, show this). a)What is the ML-estimate for T in this particular case?b)From asymptotic theory, one has that:

Construct a 95% confidence interval using and that the normal distribution has 95% of it’s probability mass within +/-1.96 standard deviations from the expectancy (Insert the ML estimate for T).

Tty

eyTtTPTL /

!)/()| t in time eventsy ()(

ytT ˆ

matrix.n informatio sFisher' is )()( where))I(,N(~ˆ2

21-

lEI

3/)( TtTI

Page 9: Exercises 1: Execute step-by-step the R code at

Exercise 4: Independence, Markov chains and precipitation at Blindern

Want to test whether thresholded precipitation (rain or no rain) is time dependent or not. The alternative model is a Markov chain. The rain state (0 or 1) is called xi, where i denotes the day index. Since we need to condition on the previous state, we set aside the first day, x0. The number of days except this, is denoted n.Zero hypothesis: The rain status (0 or 1) is independent from day to day (Bernoulli process). We can thus specify the model with one parameter, p=probability for rain a given day. The probability for a series of data is then: where k=number of days with rain. This will thus be the likelihood-function of this model. The ML estimate of p is then

Alternative hypothesis: The rain status (0 or 1) is a Markov chain. There is a transition probability from no rain to rain pNR and a transition probability for rain given rain the previous day also, pRR. (The two other probabilities are given as the negation of the two that have been specified). The probability of a given dataset is then:

where kR and nR is the number of rainy days and total days when it rained the previous day and kN and nN are similarly defined for cases where it didn’t rain the previous day. The ML estimate for the two parameters then become:

knkn ppxxP )1(),,( 1

R

N

pRR

pNR1-pRR

1-pNR

nkp /ˆ

NNNRRR knNR

kNR

knRR

kRRn ppppxxP )1()1(),,( 1

NNNRRRRR nkpnkp /ˆ ,/ˆ

Page 10: Exercises 1: Execute step-by-step the R code at

Exercise 4: Independence, Markov chains and precipitation at Blindern (contd).

Check if the rain status at Blindern is time dependent or not. Code: http://folk.uio.no/trondr/nve_course_3/blindern.R. Data: http://folk.uio.no/trondr/nve_course_3/blindern_terkslet.txt.a)For the Blindern data, what is the estimate for p (rain probability in the zero hypothesis)

and pNR and pRR (alternative hypothesis). From the last, what seems to be the nature of the rain status one day and the next?

b)From extra exercise 2a, one has that the marginal probability of rain under the alternative hypothesis p’=pNR/(1-pRR+pNR). Compare the estimate of this with the estimate of p from the zero hypothesis.

c)The likelihood ratio test check a zero hypothesis against an alternative (more advanced) hypothesis by using that under the zero hypothesis, where la and l0 is the logarithm of the maximal likelihood for alternative and zero hypothesis respectively and is the so-called chi-square distribution (the distribution of the square of something that is standard normal) and k is the difference in the number of parameters. Calculate the p value and check if one can reject time independence with 95% confidence. i.e. check whether the rain status can be independent from day to day.

d)Assume that the ML estimators for pNR and pRR are approximately normally distributed (asymptotic theory) and and similarly for pNR (and p). What is the 95% confidence interval for these two parameters?

e)Run a parametric bootstrap to look at the uncertainty (95% confidence interval) in p (zero hypothesis) and p’ (Markov chain). Compare the 95% confidence intervals for p and p’ from bootstrap and also the methodology in d as applied to p.

R

N

pRR

pNR1-pRR

1-pNR

20 ~)(2 ka ll

2

RRRRRRR nppp /)1()ˆvar(

Page 11: Exercises 1: Execute step-by-step the R code at

Exercise 5: Medical example translated to the language of Bayesian statistics. (PS: probabilities instead of probability densities and sums instead of integrals). Reminder: 0.1% of the population has a certain sickness. There is a test that always tests positive when the subject has the sickness and only tests positive 1% of the time when the subject is healthy.

a) Identify what is data outcomes and what is the thing we want to do inference on (the parameter, if you like).

b) What is the prior distribution? (Tip: What is it we want to do inference on?) c) What is the likelihood? (Tip: How is it the data is produced conditioned on what we want

to do inference on)d) Describe and calculate (using the law of total probability) the prior prediction distribution

(marginal data distribution).e) What is the posterior distribution, given a positive or a negative test? f) A little risk analysis: The cost (C) for society (patient time as well as medical staff and

equipment) of a medical operation is 10000kr. The benefit (B) is 100000kr if the subject has the sickness and 0kr if not. Assume you have tested positive. Find the expected benefit (=-risk): E(B-C|positive test). If you test positive, will the operation pay off in total?

g) Assume the first test to be positive. What is the posterior prediction distribution for a new test, now?

h) A new test is performed and again gives a positive outcome. What is now the posterior probability that you have the sickness? Will an operation now pay off in expectancy?

Page 12: Exercises 1: Execute step-by-step the R code at

Exercise 6: Expectation value for mean discharge at Hølen – Bayesian analysishttp://folk.uio.no/trondr/nve_course_3/hoelen3.RAssume that the data is normally distributed, Then the overall mean will also be normally distributed(this will approximately be the case even if the data itself is not normally distributed).

We have a vague normal prior for the data expectancy, : where 0==10. Assume we know that the standard deviation of the individual values is =2.83.It can be shown that the posterior distribution is:

)/,(~),|( 22 nNxf

),(~)( 20 Nf

))(),(())(),((/

/,/

/~)|( 22

22

22

20

2

DDNxxNnn

nnxNxf

a) What are the expectancy and standard deviation of the posterior distribution of in this case? Can you from this give an estimate of the data expectancy, ? If so, is that estimate far from the estimate you got from exercise 2?

b) Make a 95% credibility band for the discharge expectancy. (Tip: 95% of the probability of a normal distribution is less than 1.96 standard deviations from the expectancy). Was this much different from the 95% confidence interval in exercise 2? Can you from this conclude concerning the assumption =10m3/s?

),(~),|( Nxf i

Page 13: Exercises 1: Execute step-by-step the R code at

Exercise 6 – contd: Expectation value for mean discharge at Hølen – Bayesian analysis

The prior predictive (marginal) distribution is: c) We will now test the assumption =10. Compare the prior predictive

distribution under our model (free-ranging ), with the prior predictive distribution when =10 (see the expression for the likelihood). What does this suggest?

d) Calculate the posterior model probability for model 0 (=10) and model 1 (free-ranging ).

Use

and assume the prior model probability for each model to be 50%. What is the conclusion?

e) Make a plot of the prior predictive distribution given different data outcomes and compare that to model 0. What data outcomes (if any) would be evidence for model 0, i.e. for =10?

)'Pr()'|()Pr()|()|Pr(MMDfMMDfDM

)/,(~)( 220 nNxf

Page 14: Exercises 1: Execute step-by-step the R code at

Exercise 7: Bayesian analysis of repeat period for a given threshold value in continuous time.

Will look at the danger for the discharge to go above a fixed threshold. The station Gryta has discharge>1.5m3/s y=27 times in the course of t=44 years. Assume such events to be independent in time, i.e. a Poisson process. Instead of rate, we will use the expected repeat period, T=1/ (where is the rate). When one has data over a time period, t, the probability for y events will then be (Poisson distribution):

Assume an inverse gamma distribution (which is mathematically convenient) for the parameter, T:

The prior predictive (marginal) data distribution is then:

(this is the so-called negative binomial distribution).

Tty

eyTtTtyP /

!)/()| in timeevent (

TeTTf /1

)()(

ttppp

yy

tyP y where)1(1

) in events (

Page 15: Exercises 1: Execute step-by-step the R code at

Exercise 7 (contd.): R code is found at http://folk.uio.no/trondr/nve_course_3/gryta_extreme.R

The station Gryta has had discharge>1.5m3/s y=27 times in the course of t=44 years.a)Plot the prior distribution for T and the prior predictive data distribution while using ==1 in the prior.b)Use Bayes formula to express the posterior distribution for T. Plot this for Gryta when ==1. Compare to the

prior distribution. Try also ==0.5 and even ==0 (non-informative) . Were there large differences in the posterior probability distribution? Compare these results with the frequentist result: TML=t/y=1.63 years.

c)Find the 95% credibility interval by calculating the 2.5% and 97.5% quantile of the posterior distribution, when the prior is specified by ==1. (R-tip: The 2.5% quantile of the inverse gamma distribution is one over the 97.5% quantile of the gamma distribution and vice versa.) Compare with the 95% confidence interval in the frequentist analysis.

d)Can you find the posterior prediction distribution for new floods at Gryta the next 100 years? If so, plot that. Compare to the Poisson distribution with the ML estimate inserted. Why is that distribution sharper than the Bayesian posterior prediction distribution?

e)Do a simple MCMC run to fetch samples from the posterior distribution for T. Find how many samples are need before the process stabilizes (burn-in) and approximately how many samples are need before you get independent sample (spacing).

f)Fetch 1000 approximately independent samples after burn-in. Compare this result with the theoretical posterior distribution (histogram and QQ plot).

g)Perform an new MCMC sampling, this time with the prior distribution f(T)=lognormal(=0,=2). (This cannot be solved analytically). Compare with the samples you got in f.

Tty

eyTtTtyP /

!)/()| in timeevent ( TeTTf /1

)()(

ttppp

yy

tyP y where)1(1

) in time events (

Page 16: Exercises 1: Execute step-by-step the R code at

Exercise 8: Perform a power-law regression on discharge as a function of

stage at station Gryta, that is do a linear regression of log(discharge) against log(stage). (Zero-plane, h0=0)

Code: http://folk.uio.no/trondr/nve_course_3/gryta.RData: http://folk.uio.no/trondr/nve_course_3/gryta.txt

a) Plot the data, both on the original scale and on log-scale.b) Run a linear regression on log(discharge) against log(stage).

Interpret the result. Is there a significant relationship between stage and discharge?

c) What is the formula for discharge vs stage on the original scale? Plot this.

d) Check if anything is wrong with the residuals (trend or non-normality).

e) Extra: Perform a linear regression on the original scale and make a plot of this also. (PS: R code not included).

Page 17: Exercises 1: Execute step-by-step the R code at

Exercise 9: The dataset http://folk.uio.no/trondr/nve_course_3/Env-Eco.csv contains species counts in various lakes together with some environmental variables, namely height, substrate diameter, water velocity, dissolved oxygen and water temperature. We want to see if the species richness is connected to any of the environmental variables according to this dataset. A species count can range from 0 but does not have an upper limit. Thus we will use Poisson regression (glm(family=poisson()) in R). The model looks like this:

where yi is the response (number of species) for site number i and xj,i is covariate number j, site number i. Will log-transform the covariates diameter, velocity and dissolved oxygen, since these are strictly positive of nature.Code: http://folk.uio.no/trondr/nve_course_3/env_eco.R

a) Fetch the data and take a glance at it.b) Run a Poisson fit with only a constant term (no covariates). Compare the result with the

mean species number. c) Try each covariate in term. Are any statistically significant (according to the score test

reported for each individual covariate)? If so, which is most significant (has the lowest p-value)? We have now started a stepwise-up approach to model building.

d) Use the best covariate and then use that as a basis to see if there are any more significant variables (significance level 5%). If so add the most significant and see if any more can be added. If not, stop there. What does the resulting model say?

e) Now start with a large model containing all possible covariates. If any covariates are not significant, remove the most insignificant and start over again. What model do you end up with now? If there are any differences between the results now and for the stepwise-up approach, why do you think that happened?

ikkiiii xxxxpoissony ,,110))(log( where))((~

Page 18: Exercises 1: Execute step-by-step the R code at

Exercise 10: Perform a power-law regression on discharge as a function of stage at

station Gryta, that is do a linear regression of log(discharge) against log(stage-zero plane), now with unknown zero-plane. I.e. do regression on the model

Code: http://folk.uio.no/trondr/nve_course_3/gryta2.RData: http://folk.uio.no/trondr/nve_course_3/gryta.txta) Perform a linear regression of log-discharge on log(stage-h0)

for a set of candidate values for h0. Look at the likelihood as a function of these candidate values. What is the best estimate for h0? What is the estimate for a, b, C and ?

b) A test called the likelihood-ratio test says that the zero-hypothesis is rejected with 95% confidence when 2(lfull-l0)>3.84 (PS: for one parameter). Test if h0=0.

a0

0

eC ),,0(log~ *)( you will ifor

)N(0,~ )log()log(

NEEhhCQ

hhbaQq

iiii

iiiii

Page 19: Exercises 1: Execute step-by-step the R code at

Extra exercise 5: Extreme value analysis at Bulken (about 120 years of data).Code: http://folk.uio.no/trondr/nve_course_3/bulken_extreme.RData: : http://folk.uio.no/trondr/nve_course_3/bulken_max.txtWill use the Gumbel distribution as model for the yearly extreme values (floods):

a) Make an extreme value plot, by ordering the yearly maximal discharges and plot them against expected repeat time interval where n=number of years and i is an index running from 1 to n

b) Fit the Gumbel distribution to the data using the two first l-moments (see the code), 1 and 2. (Extra: Compare with what you get from DAGUT). The parameters relate to the l-moments as = 2/log(2), = 1-0.57721. Calculate 1 and 2 and as

c) Make an extreme value plot (extreme value against repetition time) given the l-moment estimates together with the data.

d) Do numerical ML estimation of the parameters.e) Make an extreme value plot given the ML estimates.f) Perform a Bayesian analysis with flat prior. Sample 1000 MCMC samples (burn-in=1000,

spacing=10). Compare estimated parameters with previous results. Look at the uncertainty of the parameters .Also look at the MCMC samples themselves. Has the chain of samples stabilized (is the burn-in long enough) and are the dependencies in the samples (spacing).

g) Use also the posterior prediction distribution to make an extreme value plot (with the parameter uncertainty taken into account). Compare with previous results.

44.012.0

inti

/)(/)(1),|(xexexf

)(1

21

1 ))()1(()1(

1ˆ ,1ˆj

n

j

n

jj xjnj

nnxx

n

Sorted data

Page 20: Exercises 1: Execute step-by-step the R code at

Extra exercise 6:Will de-trend daily data from Hølen using linear regression. Then a relatively

naive ARMA time series analysis will be performed.Code: http://folk.uio.no/trondr/nve_course_3/hoelen_arima.RData: http://folk.uio.no/trondr/nve_course_3/TrendDognHoelen.txta) Plot the data.b) De-trend the data using linear regression on time+trigonometric

functions over time (to catch the seasonality). c) Take a look at the residuals (i.e. the de-trended data). Is there auto-

correlations there? Can we trust model selection using linear regression here?

d) Take a look at the partial auto-correlation also. Would an AR- or MA-model be expected to perform best?

e) Fit an AR(1) model (PS: pacf suggests AR(2) is better). Check whether the estimated parameter is approximately the same as the first term in the autocorrelation?

f) Make analytic plots of the AR(1) residuals. What does these suggest?g) Also use an ARMA(1,1) model. Take another look at the residuals. What

do they seem to say now?

Page 21: Exercises 1: Execute step-by-step the R code at

Extra exercise 7:Use the Kalman filter to interpolate over holes at the station Farstad, year 1993. Will use the simplest

type of continuous time Markov chain, namely the Wiener process (random walk). The data has varying time resolution and will be log transformed before use.

The system (updating) equation is:The observation equation is: In total, only two parameters, and .Code: http://folk.uio.no/trondr/nve_course_3/farstad.R Data: http://folk.uio.no/trondr/nve_course_3/farstad_1993.txta) Plot the data. The time series has an unmarked hole between May and June which needs to be

marked. In addition, we will make artificial holes: 1993-02-01 00:00:00 - 1993-02-03 00:00:00

1993-03-01 12:00:00 - 1993-03-15 12:00:00 1993-08-01 00:00:00 - 1993-08-01 18:00:00 1993-09-01 00:00:00 - 1993-09-02 00:00:00 1993-10-01 00:00:00 - 1993-11-01 00:00:00 Plot the data in the holes together with the rest of the data.b) Take a glance at the Kalman filter to see if you recognize what the various pieces are doing.c) Run an ML optimization and check the results. What does this say about the observational noise

parameter, ?d) Perform a Kalman smoothing Kalman using the ML estimated parameters. Look at the end result.e) Check how the interpolation has functioned in the holes. What kind of interpolation is this (on the

log-scale)? What do you in addition get? f) Criticize the model and see if you understand how the fit and uncertainty is as it is. If you feel

creative, try to come up with a better model.

),(~| 111 iiiii ttxNxx ),(~| iii xNxy

Page 22: Exercises 1: Execute step-by-step the R code at

Extra exercise 8:Use the Kalman filter to interpolate over holes at the stations Etna and Hølervatn, 2010-2011. The data

has equi-distant time resolution (days) and will be log-transformed before the analysis. Use a 2-dimensional AR(1) process (thus stationary), correlated noise (so that completition is possible), equal auto-correlation, variance and seasonal trends but individual expectancy.

The system equation then becomes:

The observational equation is In total eight parameters: a,e,h,,,S1,C1 and .Code: http://folk.uio.no/trondr/nve_course_3/kompletering.R Data: http://folk.uio.no/trondr/nve_course_3/etna_hoelervatn.csva) Plot the data. Make artificial holes (specified in the R code). Plot the data in the holes together with

the remaining data.b) Perform an ML optimization (using the included Kalman filter to calculate the likelihood) and check

the resulting parameter estimates. What does the auto-correlation means here? What does the cross-correlation imply?

c) Perform a Kalman smoothing with the ML estimated parameters. Take a look at the full dataset.d) Check how the completion worked in and around the holes. What happens when both stations are

missing data at the same period? A couple of completions worked less well than the others. Why? e) Test if the cross-correlation is zero (thus no reason to use one station to complete the other), using

the likelihood ratio test.

22

22

11

11

11

and )365/2cos()365/2sin()365/2cos()365/2sin(

)( where

)),()1((~|

tCtStCtS

t

taxaNxx

h

e

iii

),(~|/,/,/, hei

xNxy heihei