statistical models for defective count data - stat - home · statistical models for defective count...

43
Statistical Models for Defective Count Data Gerhard Neubauer a , Gordana - Duraˇ s a , and Herwig Friedl b a Statistical Applications, Joanneum Research, Graz, Austria b Institute of Statistics, University of Technology, Graz, Austria Statistik Tage 2011, Graz, September 7 th -9 th

Upload: hanhu

Post on 13-May-2019

220 views

Category:

Documents


0 download

TRANSCRIPT

Statistical Models for Defective Count Data

Gerhard Neubauera, Gordana -Durasa, and Herwig Friedlb

aStatistical Applications, Joanneum Research, Graz, AustriabInstitute of Statistics, University of Technology, Graz, Austria

Statistik Tage 2011, Graz, September 7th-9th

Introduction

Statistik Tage Graz, September 7-9, 2011 -1-

Defective Count Data

Defective: too much and/or too few is counted, recorded,

reported ...

Most prominent example: Criminology

• Crimes associated with shame (sexual offences, domestic

violence)

• Theft of low-value goods

Likely to be not reported to the police, official numbers are too

low

→ Under-Reporting

Statistik Tage Graz, September 7-9, 2011 -2-

Defective Count Data

Less considered:

Insurance companies suspect that theft may be reported, but

only to make an insurance claim

Most popular with bicycles, skiing equipment

Likely to be reported to the police, official numbers are too high

→ Over-Reporting

More precisely:

→ Under-Reporting fraud

→ Over-Reporting theft

Statistik Tage Graz, September 7-9, 2011 -3-

Criminology: Bicycle Theft Data

Weekly counts of bicycle theft in an Austrian region

0 20 40 60 80 100

020

4060

8010

0

Week

Statistik Tage Graz, September 7-9, 2011 -4-

Bicycle Theft ?

Statistik Tage Graz, September 7-9, 2011 -5-

Defective Count Data

Health data

Registers for infectious diseases (HIV), chronic diseases

(diabetes)

Cause of death registers (heart attack)

Any miss-classification will cause both errors

→ Under-Reporting in the right category

→ Over-Reporting in the wrong category

Statistik Tage Graz, September 7-9, 2011 -6-

Health: Heart Attack Data

Monthly counts of heart attack discharges from Styrian hospitals

0 20 40 60 80

050

100

150

200

250

Month

Cou

nt

Statistik Tage Graz, September 7-9, 2011 -7-

Defective Count Data

• Criminology

• Health data

• Insurance data: traffic accidents with minor damage

• Production: number of defective goods in production

• Warranty: number of goods sent back for warranty claim

• ...

Any sample of count data may be defective due to recording

failures

Statistik Tage Graz, September 7-9, 2011 -8-

Under-Reporting

How to estimate the total number of cases?

Statistik Tage Graz, September 7-9, 2011 -9-

Bernoulli Sampling

One observation λ observations

Case reported Case reported

Yes No Yes No

R 1− R Y D

R ∼ Bernoulli(π) Y =

λ∑

i=1

Ri ∼ Binomial(λ, π)

Y . . . (random) number of reported cases

D . . . (random) number of unreported cases

λ . . . total number of cases

π . . . reporting probability

Statistik Tage Graz, September 7-9, 2011 -10-

Binomial Model

Case reported

Yes No

Y D

Y =λ∑

i=1

Ri ∼ Binomial(λ, π)

E(Y ) = µ = λπ

var(Y ) = σ2 = λπ(1− π)

Both λ and π have to be estimated

Statistik Tage Graz, September 7-9, 2011 -11-

Binomial Model

Case reported

Yes No

Y D

Y =

λ∑

i=1

Ri ∼ Binomial(λ, π)

E(Y ) = µ = λπ

var(Y ) = σ2

= λπ(1− π)

Both λ and π have to be estimated

No longer a member of 1-parameter exponential family

Limitation to data with s2 < y

Statistik Tage Graz, September 7-9, 2011 -12-

Is iid Assumption Appropriate?

0 20 40 60 80 100

020

4060

8010

0

Week

0 20 40 60 80

050

100

150

200

250

Month

Cou

nt

Trend/Seasonality ⇒ Regression approach

Statistik Tage Graz, September 7-9, 2011 -13-

Regression Approach

Consider Ytind∼ Binomial(λt, π), t = 1, . . . , T ,

where

λt = exp(x′tβ) , β ∈ Rd ,

π =exp(α)

1 + exp(α), α ∈ R ,

to ensure λt > 0 and 0 < π < 1

Likelihood contribution of yt

L(α, β|yt) =(

λt

yt

)πyt(1− π)λt−yt

Statistik Tage Graz, September 7-9, 2011 -14-

Maximum Likelihood Estimation

Sample Log-Likelihood

`(α, β|y) =T∑

t=1

log L(α, β|yt)

=T∑

t=1

log[(

λt

yt

)πyt(1− π)λt−yt

]

Score and observed Information are analytically evaluated

Newton-Raphson procedure till convergence

Statistik Tage Graz, September 7-9, 2011 -15-

Maximum Likelihood Estimation

Sample Log-Likelihood

`(α, β|y) =T∑

t=1

log L(α, β|yt)

=T∑

t=1

log[(

λt

yt

)πyt(1− π)λt−yt

]

Score and observed Information are analytically evaluated

Newton-Raphson procedure till convergence

Still problems with overdispersion!

Statistik Tage Graz, September 7-9, 2011 -16-

Randomized ParameterModels

Statistik Tage Graz, September 7-9, 2011 -17-

Beta-Binomial Model

Randomize reporting probability π

Yt|Pt ∼ Binomial(λt, Pt)

Pt ∼ Beta(γ, δ)

Yt ∼ Beta-Binomial(λt, γ, δ)

We use the parameterization

θ = γ + δ and π =γ

γ + δ= E(Pt)

Hence E(Yt) = µt = λtπ and var(Yt) = µt(1− π)φ,

where φ = λ+γ+δ1+γ+δ ≥ 1

Statistik Tage Graz, September 7-9, 2011 -18-

Poisson Models

Randomize total number of cases λ

Yt|Lt ∼ Binomial(Lt, π)

Lt ∼ Poisson(λt)

Yt ∼ Poisson(λtπ)

Model not identified

Winkelmann (2000): Pogit Model

Yt ∼ Poisson(λtπt)

λt = exp(x′tβ) and πt =exp(z′tα)

1 + exp(z′tα)with two (disjoint) sets of regressors xt and zt

Statistik Tage Graz, September 7-9, 2011 -19-

Negative Binomial Model

Randomize total number of cases λ

Yt|Lt ∼ Binomial(Lt, π)

Lt|Kt ∼ Poisson(Ktλt)

Kt ∼ Gamma(ωt, ωt)

Yt ∼ Negative Binomial(ωt, 1− π)

where ωt = λt(1− π) is the expected number of unreported cases

E(Yt) = µt = λtπ var(Yt) = µt +µ2

t

ωt

Statistik Tage Graz, September 7-9, 2011 -20-

Beta-Poisson Model

Randomize π and λ

Yt|Lt, Pt ∼ Binomial(Lt, Pt)

Lt ∼ Poisson(λt)

Pt ∼ Beta(γ, δ)

Yt ∼ Beta-Poisson(λt, γ, δ)

We use the parameterization

θ = γ + δ and π =γ

γ + δ= E(Pt)

Hence E(Yt) = µt = λtπ and var(Yt) = µtφt,

where φt = 1 + λt(1−π)1+γ+δ ≥ 1

Statistik Tage Graz, September 7-9, 2011 -21-

Estimation and Inference

Statistik Tage Graz, September 7-9, 2011 -22-

Estimation

We use Maximum Likelihood (ML), where the solution of score equations∂

∂αlog

∏t

L (ΩM |yt,xt) = 0 and∂

∂βlog

∏t

L (ΩM |yt,xt) = 0

ΩM the parameter vector of given model

is found by the Newton-Raphson algorithm

For the Beta-mixtures we also use the Hybrid algorithm

Cycle between

1. ML for α, β given θ, and

2. MoM for θ given α, β

Choice of starting values sometimes crucial

Statistik Tage Graz, September 7-9, 2011 -23-

Inference

Normality of parameter estimates

Simulation studies: for reasonable settings α and β approximately normal

Problems when

π → 0, i.e. Poisson limit

π → 1, perfect reporting system

Inference within distribution - usual model selection methods

e.g. t-values, LRT,...

Statistik Tage Graz, September 7-9, 2011 -24-

Inference

Inference between distributions - Non-nested testing (Allcroft andGlasbey, 2003)

Comparison of various models M = (Mk), k = 1, . . . , K

1. find θk for data y = (yt) by optimizing GoF criterion c = (ck(y))

2. simulate samples y(s)(θk), s = 1, . . . , S, (e.g. S = 100) and obtain all Smatrices C(s) of dimension K ×K

3. obtain mean C of S matrices and compare c to the k-th column ck bymultivariate normality assumption

If Dk ∼ χ2K, Dk = (c− ck)′V −1

k (c− ck) ⇒ kth model correct

Statistik Tage Graz, September 7-9, 2011 -25-

Real Data Applications

Statistik Tage Graz, September 7-9, 2011 -26-

Bicycle Theft

Distribution log L Pearson BIC D p(D) π

Negative-Binomial -728.91 206.49 1521.63 1.92 0.59 0.61

Beta-Binomial -732.87 204.79 1529.57 9.59 0.02 0.32

Beta-Poisson -735.39 197.97 1534.60 9.85 0.02 0.63

Model selected: Negative Binomial

π = 0.61, p(D) = 0.59

Statistik Tage Graz, September 7-9, 2011 -27-

Bicycle Theft Data

0 20 40 60 80 100

020

4060

8010

0

Week

Statistik Tage Graz, September 7-9, 2011 -28-

Bicycle Theft Model

0 20 40 60 80 100

020

4060

8010

0

Week

Estimated mean (solid), estimated total number of thefts with confidence interval (dashed)

Statistik Tage Graz, September 7-9, 2011 -29-

Heart Attack Data

Distribution log L Pearson BIC D p(D) π

Negative Binomial -419.23 95.56 870.41 1.75 0.63 0.55

Beta-Binomial -417.64 95.60 871.79 77.61 <e-03 0.62

Beta-Poisson -419.82 88.00 876.16 36.17 <e-03 0.82

Model selected: Negative Binomial

π = 0.55, p(D) = 0.63

Statistik Tage Graz, September 7-9, 2011 -30-

Heart Attack Data

0 20 40 60 80

010

020

030

040

0

Month

Cou

nt

Estimated mean (solid), estimated total number of heart attacks with confidence interval

(dashed)

Statistik Tage Graz, September 7-9, 2011 -31-

Heart Attack Counts + Cause of Death Counts

0 20 40 60 80

010

020

030

040

0

Month

Cou

nt

Estimated mean (black solid), estimated total number of heart attacks with confidence

interval (red dashed)

Statistik Tage Graz, September 7-9, 2011 -32-

Over-Reporting

How to estimate the correct number of cases?

Statistik Tage Graz, September 7-9, 2011 -33-

Cameron & Trivedi, 1998

Model dependence between the two Bernoulli variables

E = Event and R = Recording

Bivariate Binomial Random Variable

R = 0 R = 1

E = 0 T0 O N0

E = 1 U T1 N1

N − Y Y N

T0: true not reportedT1: true reportedO: over-reportedU : under-reportedY : observed count

Statistik Tage Graz, September 7-9, 2011 -34-

Independent Sampling

In Criminology T0 does not exist. Therefore we consider independentsampling.

R = 0 R = 1

E = 0 ××× R0

E = 1 U R1 N

andY = R0 + R1

Specify under-reporting model for R1 and a count model for R0

Model Y by convolution of both

Statistik Tage Graz, September 7-9, 2011 -35-

A Family of Poisson Convolutions

Assume that R1 is generated by under-reporting, i.e.

R1|N, P ∼ Binomial(N,P ),

andR0 ∼ Poisson(α)

Then

p(Y = y = r0 + r1) =r0∑

j=0

p0(j)p1(y − j) =r1∑

j=0

p0(y − j)p1(j)

with

E(Y ) = µ0 + µ1 = α + λπ

var(Y ) = σ20 + σ2

1 = α + var(R1)

Statistik Tage Graz, September 7-9, 2011 -36-

Negative-Binomial-Poisson Convolution

For R1 ∼ Negative Binomial(ω, π)

Y ∼ Delaporte(λ, π, α), λ = ω/(1− π)

with

p(Y = y = r1 + r0) =(1− π)ωαy exp(−α)

Γ(ω)S

where

S =r1∑

j=0

Γ(j + ω)Γ(j + 1)Γ(y − j + 1)

α

)j

and

E(Y ) = µ0 + µ1 = α + λπ

var(Y ) = σ20 + σ2

1 = α + λπ(1− π)−1

Statistik Tage Graz, September 7-9, 2011 -37-

Application to bicycle theft data

0 20 40 60 80 100

020

4060

80

Data and Mean in grey

Week

Statistik Tage Graz, September 7-9, 2011 -38-

Application to bicycle theft data

0 20 40 60 80 100

020

4060

80

Data and Mean in grey, E(R_0) in red, E(R_1) in green, E(U) in blue

Week

Statistik Tage Graz, September 7-9, 2011 -39-

Application to bicycle theft data

Averaged Results

R = 0 R = 1

E = 0 ××× E(R0) = 14.35

E = 1 E(U) = 5.08 E(R1) = 21.18 E(N) = 26.26

andE(Y ) = E(R0 + R1) = 35.53

Reporting probability in the under-reporting model: π = 0.77Fraud rate: 14.35/35.53 = 0.40

Statistik Tage Graz, September 7-9, 2011 -40-

Conclusion

• highly relevant methodology for wide-spread applications

• models based on Bernoulli sampling

• conditional binomial models suitable for cases when var(Y ) > µ

• estimation relies on ML and Hybrid ML

• non-nested technique for model selection

Limitations

• π → 0, i.e. Poisson limit

• π → 1, perfect reporting system

Statistik Tage Graz, September 7-9, 2011 -41-

References

Allcroft, D.J. and Glasby, Ch.A. (2003). A simulation-based method formodel evaluation, Statistical Modelling, 3, 1-14.

Cameron, A.C. and Trivedi, P.K. (1998). Regression Analysis of CountData. Cambridge: Cambridge University Press.

Neubauer, G., Djuras, G. and Friedl, H. (2011). Models for Underreporting:A Bernoulli Sampling Approach for Reported Counts. Austrian Journal ofStatistics, 40, 85-92.

Winkelmann, R. (2000). Econometric Analysis of Count Data. Berlin:Springer.

Statistik Tage Graz, September 7-9, 2011 -42-