semicontinuous regression models with skew distributions

12
Semicontinuous regression models with Skew distributions Anna Gottard, Elena Stanghellini and Rosa Capobianco Abstract In applied studies, researchers are often confronted with semicontinu- ous response models. These are models with a semicontinuous response variable, i.e. a continuous variable that has a lower bound, that we here consider to be zero, and such that a sizable fraction of the observations takes value on this boundary. Semicontinuous response models are common in pharmacovigilance, pharmacoepi- demiological and pharmacoeconomic studies, where it can be sometimes useful to evaluate and monitor the doses of a certain drug substance consumed by the general population. The preponderant number of observations taking value zero corresponds to the part of the population which is not actually consuming the medicine, either because they do not need it or because, even if they need it, they are not taking it in a given interval of time. Another interesting field of application concerns goods or drugs consumption to be studied for economic or social purposes. We here explore the use of several asymmetric distributions to address the fact that the continuous part of the data distribution shows skewness in most cases. As an illustration, the proposal is applied to model alcohol expenditure in Italian households. Anna Gottard Dipartimento di Statistica, Universit` a di Firenze, Viale Morgagni 59, 50121 Firenze, Italy e-mail: [email protected] Elena Stanghellini Dipartimento di Economia, Finanza e Statistica, Universit` a di Perugia, Via Pascoli, 1, 06100 Peru- gia, Italy e-mail: [email protected] Rosa Capobianco Dipartimento di Studi dei Processi Formativi, Culturali e Interculturali nella Societ` a Contem- poranea, Universit` a degli Studi Roma Tre, Via Manin 53, 00185 Roma, Italy e-mail: rcapo- [email protected] 1

Upload: independent

Post on 13-May-2023

1 views

Category:

Documents


0 download

TRANSCRIPT

Semicontinuous regression models with Skewdistributions

Anna Gottard, Elena Stanghellini and Rosa Capobianco

Abstract In applied studies, researchers are often confronted with semicontinu-ous response models. These are models with a semicontinuousresponse variable,i.e. a continuous variable that has a lower bound, that we here consider to be zero,and such that a sizable fraction of the observations takes value on this boundary.Semicontinuous response models are common in pharmacovigilance, pharmacoepi-demiological and pharmacoeconomic studies, where it can besometimes useful toevaluate and monitor the doses of a certain drug substance consumed by the generalpopulation. The preponderant number of observations taking value zero correspondsto the part of the population which is not actually consumingthe medicine, eitherbecause they do not need it or because, even if they need it, they are not taking it ina given interval of time. Another interesting field of application concerns goods ordrugs consumption to be studied for economic or social purposes. We here explorethe use of several asymmetric distributions to address the fact that the continuouspart of the data distribution shows skewness in most cases. As an illustration, theproposal is applied to model alcohol expenditure in Italianhouseholds.

Anna GottardDipartimento di Statistica, Universita di Firenze, VialeMorgagni 59, 50121 Firenze, Italy e-mail:[email protected]

Elena StanghelliniDipartimento di Economia, Finanza e Statistica, Universita di Perugia, Via Pascoli, 1, 06100 Peru-gia, Italy e-mail: [email protected]

Rosa CapobiancoDipartimento di Studi dei Processi Formativi, Culturali e Interculturali nella Societa Contem-poranea, Universita degli Studi Roma Tre, Via Manin 53, 00185 Roma, Italy e-mail: [email protected]

1

2 Gottard, Stanghellini & Capobianco

1 Introduction

In many situations it can be of interest to model a response variable having an un-expected, with respect to a given distribution, number of zero values. For countresponses, a situation of excess of zeroes with respect to a standard distributionalassumption, such as Poisson or Negative Binomial distribution, is usually handledin the literature by adopting a so-called zero-inflated model (Lambert (1992)) or ahurdle model (Mullahy (1986)). An ordinary Poisson or Negative Binomial regres-sion model would be inadequate to study these kinds of data. See Min and Agresti(2002) for a survey on these classes of models.

Continuous positive value data with a preponderance of zeroobservations canoccur in a variety of applications. These kinds of variablesare sometimes calledsemicontinuous (Olsen and Schafer (2001)), to allow a probability mass on a spe-cific value. Semicontinuous data are common in many fields. For example, in phar-macovigilance, pharmacoepidemiological and pharmacoeconomic studies, it can besometimes useful to evaluate and monitor the doses of a certain drug consumed inthe general population, outside of a clinical trial. Also, in many therapeutic fields, asnonsteroidal anti-inflammatory drugs, oral contraceptives, antihypertensive drugs,antidepressants, and so on, the actual drug consumption is an important item ofthe total public expenditure. The preponderant presence ofzeroes observed in theseframeworks corresponds to the part of the population which is not actually consum-ing the medicine, either because they do not need it or because they are not takingit in the interval time of monitoring, even if needed. Similarly, in meteorologicalstudies, the daily quantity of rainfall or snowfall distribution can present a clumpat zero, due to not rainy/snowy days. However, due to some measurement errors,small quantity of rain in rainy days may not be detected. A further, interesting fieldof application is measuring goods, food or drug consumptionor expenditure. In thepresent paper, the case of alcohol expenditure in Italian households is considered.The distribution of alcohol consumption presents a clump atzero, due to both tee-totalers and occasional drinkers having no alcohol during the monitoring period.

The first model for semicontinuous data is the so–calledTobit model, due toTobin (1958). The Tobit model assumes an underlying normally distributed latentvariable determining when the outcome of interest is positive or zero. Since the as-sumption of normality of the underlying latent variable is too restrictive in manyapplications, several extension of the Tobit model have been proposed in the litera-ture, especially in the direction of distribution-free methods. See Powell (1994) fora summary. Also, as noted in several applied studies, the continuous part of the datadistribution may present a rather substantial skewness. Toaddress this issue, Hut-ton and Stanghellini (2011) assume a Skew-Normal model (Azzalini and Capitanio(1999)) for the latent variable that governs the outcome of interest, for the case ofclumping due to right censoring in bounded health scores.

The two–part model (Cragg (1971); Duanet al. (1983)) extends the Tobit modelto the case when the probability of zero and positive values depends on differentsets of parameters and explanatory variables. The likelihood function of this classof models factorizes into two parts, that can be maximized separately. According to

Semicontinuous regression models 3

Min and Agresti (2002), the two-part model is sometimes preferable as it addressesthe data in their original form, it is simple to fit and to interpret. It can be seen as thecontinuous counterpart of the hurdle model.

A different generalization of the Tobit model, used in microeconometric stud-ies, is calledDouble-Hurdle model (see, for example, Cragg (1971); Blundell andMeghir (1987); Jones (1989)). This kind of model has been used to study individualbehaviour on commodity consumptions as two subsequent decisions. Therefore, twoseparate hurdles have to be cleared in order to obtain a positive level of consump-tion. The Double-Hurdle model allows a dependence between the two components.

The two-part model has been extended in several ways (see forexample, Chai andBailey (2008); Moulton and Halsey (1996); Suet al. (2009)). As for the Tobit model,the continuous part of the data can be skewed, as shown in several studies. This issueis generally addressed by taking the log of the response variable. However, this canlead to a normalization of only lightly skewed situations. In particular, Chai andBailey (2008) propose a probit/log-skew-normal distribution to analyze continuousdata with a discrete component at zero.

In the present paper, we are considering a two-part model combining severalmodels for both the discrete component and the continuous component with skewdistributions. This kind of model is able to account for several forms of asymmetry.In particular, the Skew-Normal and the Skewt distributions for the continuous partcan accommodate asymmetry in a very flexible way. The first model reduces to thenormal model when a particular parameter, denoted byα, equals zero. For this case,the score functions have been derived and it is shown that, whenα = 0, the scorefunctions are linearly dependent so that the information matrix is singular.

The paper is organized as follows. The proposed two-part model is presented inSection 2. Some properties of the proposed class of models are given in Section 3.The analysis of alcohol consumption data is in Section 4. Section 5 concludes withsome remarks.

2 Semicontinuous response models using skew distributions

Let Y1 andY2 be two independent latent random variables. AssumeY1 follows aBernoulli distribution, with

P(Y1 = 1) = π1 and P(Y1 = 0) = π0 = 1−π1.

Let Y ∗2 be a further latent random variable with support onℜ. DefineY2 to have

a left censored at zero distribution with support on[0,+∞). We suppose that theprobability distribution ofY2 is linked to that ofY ∗

2 as

P(Y2 = 0) = P(Y ∗2 ≤ 0) while P(Y2) = P(Y ∗

2 ) when Y ∗2 > 0.

Finally, defineY as the observable random variable with the clump at zero and acontinuous positive part. We assume:

4 Gottard, Stanghellini & Capobianco

Y =

{

0 if Y1 = 0Y2 if Y1 = 1 with Y2 ≥ 0

(1)

Therefore,

P(Y = 0) = P(Y1 = 0)+P(Y1 = 1,Y2 = 0) = P(Y1 = 0)+P(Y1 = 1)P(Y ∗2 ≤ 0)

asY1 andY2 are assumed independent andP(Y2 = 0) = P(Y ∗2 ≤ 0). It follows that

the density function of the random variableY , say fY , can be written as

fY (y) = [π0+π1P(Y ∗2 ≤ 0)] I(y = 0) + π1 fY2(y) I(y > 0). (2)

Note that two different mechanisms may give rise toY = 0. One is driven by thelatent variableY1, while the other is driven by the truncation of the latent variableY2. In the pharmacoeconomic studies theY1 variable describes the need or not totake a given drug, whileY2 drives the actual consumption for those in need to takethe drug. In an actual study,Y ∗

2 may be zero, especially if the monitoring time isshort.

Moreover, notice that, ifπ0 = 0, the model is a left-censored-at-zero model.Wheneverπ0 > 0, the proposed model assumes an extra source of zero values due tothe discrete componentY1. Furthermore, ifP(Y ∗

2 ≤ 0) = 0, and consequentlyY2 hasa strictly positive support, then the resulting model is a two-part model. Otherwise,the model can be viewed as a zero-inflated truncated continuous model. See foralternative proposals Aristei and Pieroni (2008) and Couturier and Victoria-Feser(2010). For Gaussian specification of the continuous part ofthe model, our proposalcoincides with the Double-Hurdle model with independent components.

As already mentioned, we may want the distribution of the continuous part ofthe data to allow for skewness. Several possible choices areavailable. Here we shallconsider two possible choices: (a) the Skew-Normal distribution and (b) the Skewtdistribution.

3 The specification of the model

As illustrated by Lemma 1 of Azzalini (2005), a skew distribution may be generatedby perturbation of a symmetric distribution, as follows:

g(z) = 2 f0(z)G0(wz) (3)

where f0 is a one-dimensional density function, being symmetric about 0, andG0

is a one-dimensional distribution function such thatG0 exists and has a densitysymmetric about 0, whilew(·) is an odd function.

Under choice (a), whenf0 = φ(z) andG0 = Φ(αz), whereφ(·) andΦ(·) arerespectively the density and the distribution function of astandard univariate normal

Semicontinuous regression models 5

distribution, (3) is the density function of a Skew-Normal random variable, withskewness parameterα, Z ∼ SN(α).

In the model in (2) under choice (a), we shall imposeY ∗2 = µ +σZ. In this case,

Y ∗2 ∼ SN(µ ,σ2,α). Notice that forα = 0 the model reduces toY ∗

2 ∼ N(µ ,σ2). Theeffect of covariates on the parameterµ can also be investigated. A possible choiceis to assume for thei-th generic observation:

µi = β ′2X2i (4)

whereX2i denotes a set of explanatory variables forY ∗2 andβ2 is a vector of regres-

sion coefficients.Choice (a) parallels what done by Chai and Bailey (2008) in which a log transfor-

mation is assumed instead, so that the continuous part has strictly positive support.As shown in the Appendix, a test forα = 0, which corresponds to testing for theabsence of skewness in the data, may be problematic. This is also true for rightcensored data (see Capobiancoet al. (2010)).

Under choice (b) of a continuous component with Skewt distribution, we letf0 = t(z;ν) and

G0 = T

[

ασ−1(y− µ)(

ν +1Q+ν

)1/2

; ν +1

]

whereQ = [(y− µ)/σ ]2, t andT are, respectively, the density and the distributionfunction of a Studentt distribution withν andν +1 degrees of freedom. In a similarmanner as in (4), we shall impose thatY ∗

2 = µ + σZ to account for the effect ofcovariates onµ .

Focusing now on assumptions onY1, we may consider the effect of covariatesalso onπ1, by the insertion of a suitable link function. Classical choices of linkfunction include logit or Probit model.

To exemplify, under choice (a) of a Skew-Normal distribution forY ∗2 , and a logit

link for Y1, the contribution to the likelihood function of a generic observationi is

LL.SNi =

[

11+expβ ′

1x1i+

expβ ′1x1i

1+expβ ′1x1i

FSN(

0;β ′2x2i,σ2,α

)

]I[yi=0]

·[

fSN(

yi;β ′2x2i,σ2,α

)

]I[yi=1]

where fSN(·) andFSN(·) are the density and the distribution function of the Skew-Normal variable. Assuming a probit link, under choice (a), the contribution to thelikelihood function of a generic observationi is instead

LP.SNi =

[

(

1−Φ(β ′1x1i)

)

+Φ(β ′1x1i)FSN

(

0;β ′2x2i,σ2,α

)

]I[yi=0]

·[

fSN(

yi;β ′2x2i,σ2,α

)

]I[yi=1]

As an example, we report in Figure 1 the histogram of two data sets randomlygenerated by model (2) under choice (a) and (b) with the probit link. The data are

6 Gottard, Stanghellini & Capobianco

Continuous component: Skew−Normal

Y

Fre

quen

cy

0 5 10 15 20 25 30

010

0020

0030

0040

0050

0060

0070

00

Continuous component: Skew t

Y

Fre

quen

cy

0 5 10 15 20 25 30

010

0020

0030

0040

0050

0060

0070

00

Fig. 1 Simulated data for variableY , under choice (a) and (b) with a Probit link

generated assuming for both the distributionsπ1 = 0.6 and the location, scale andskewness parameters equal to 6, 3 and 0.5 respectively. For the Skewt distributionthe number of degrees of freedom was settled at 3. It can be seen that the Skewtchoice yields to a heavier right tail, giving the possibility of adequately model thepresence of kurtosis in the data. Notice that, even if the parameterπ1 is the same, theamount of zero values in the Skewt case is lightly higher, due to truncation of theleft tail, heavier than that of the Skew-Normal distribution. However, the distributionunder choice (b) tends to that under choice (a) as the number of degrees of freedomtends to infinity.

4 Analysis of the alcohol consumption data

As an illustration, we analyze here a set of data on alcohol consumption comingfrom the 2002 Italian Household Budget Survey (IHBS) conducted by Istat. Thesample consists of 27499 italian households. Expenditure on a wide range on non-durable goods and services over 1 week period is recorded, and subsequently ex-pressed on a monthly basis. Expenditure variables are deflated via regional priceindices produced by Istat. Data refer to the households rather than individuals, there-fore a per-equivalent adult measure is obtained by adjusting for family size using the

Semicontinuous regression models 7

Histogram of Y

Y

Fre

quen

cy

0 1 2 3 4 5

050

010

0015

0020

00

Fig. 2 Observed values for variableY , monthly expenditure for alcohol

modified OECD equivalence scale, which assigns weight 1 to the first adult, 0.5 toeach subsequent adult and 0.3 to each child under 14 years of age.

Here the interest is in the determinants of monthly expenditure for alcohol. Ex-planatory variables to take into account for socio-economic factors of the householdare recorded. Also the amount of expenditure in tobacco is recorded, a variable thatemerges as significant in our analysis. In order to amplify the differences betweenthe continuous and the discrete part of the distribution, the response variable hasbeen transformed usingY = log(1+X). This is usually done in studies on semicon-tinuous response models (see e.g. Chai and Bailey, 2008). The histogram of the sotransformed response variable, in Figure 2, shows that there is a moderate evidenceof skewness in the positive data.

The probit link has been chosen for the discrete component. For the continuouspart, we adopted the Skew-Normal distribution (choice (a) in the previous sections).The model has been selected starting from a large model. Included covariates are:age (age), with a linear and quadratic effect, percentage of adult male members ofthe household (perc males), income proxied by per-equivalent adult householdtotal expenditure scaled by 100 (income), per-equivalent tobacco consumption cat-egorized in tertiles (ftab), a binary variable taking value 1 if there is at least onechild aged 0-14 years (Child014) and list of binary variables taking value 1 ifthe household’s head: is male (MaleH), has a higher education (HighEdu), has awhite collar job position (Whitecollar), owns the house (OwnerOcc), is singlewithout children (Single).

The first model included all variables both in the Probit linkand in the continuouspart of the model (with the exception that the variable income has been categorized

8 Gottard, Stanghellini & Capobianco

in four levels (fincome), using quartiles as cut-points, for the continuous com-ponent to avoid instability of estimates and allow for nonlinear effects). Using abackward selection procedure, we proceeded by removing variables with non sig-nificant effect. The estimates of the final model are in Table 1. The likelihood ratiotest between the extended and the reduced model was 11.13 with 10 degrees offreedom.

Table 1 Parameter estimates for the selected model

Variables Estimate s.e. Confidence intervals (95%)

Model for the binary componentIntercept -1.3686 0.1118 -1.5878 -1.1494eta 0.0168 0.0021 0.0128 0.0209eta2 -0.0005 0.0002 -0.0008 -0.0002perc admales 0.7036 0.1017 0.5042 0.9029income 0.0104 0.0019 0.0066 0.0141MaleHH 0.3176 0.0658 0.1887 0.4466whitecollar -0.1323 0.0409 -0.2124 -0.0523ftab(0.1,20] 0.4700 0.0530 0.3661 0.5739ftab(20,160] 0.3610 0.0516 0.2599 0.4622

Model for the continuous componentIntercept 1.8089 0.0582 1.6949 1.9229perc admales 0.2618 0.0762 0.1125 0.4112fincome(6.2,9.08] 0.3198 0.0500 0.2217 0.4179fincome(9.08,13.3] 0.4741 0.0499 0.3762 0.5720fincome(13.3,167] 0.6797 0.0509 0.5799 0.7794single 0.3400 0.0539 0.2343 0.4458child014 -0.1197 0.0360 -0.1903 -0.0491ftab(20,160] 0.1466 0.0410 0.0663 0.2270s.d. 0.8120 0.0122 0.7881 0.8358skewness 0.1766 0.0587 0.0614 0.2917

Variables have an effect in the expected direction. Tobaccoconsumption plays arole both in the decision model (Probit equation) and in the level model (Skew-Normal equation). If household’s head has a white collar jobthe probability ofdrinking alcohol (binary component) reduces, while it increases if the household’shead is male. These binary covariates, however, do not affect the continuous com-ponent. The main negative effect on the level of alcohol expenditure is the presencein the family of children aged less than fourteen. Finally, income positively affectsalcohol expenditure in both components.

Notice that the derivation in the Appendix implies that asymptotic distribution ofthe likelihood ratio test to compare normal and skew Normal models is non standard.However, the skewness parameter is significant, even if small. Parameter estimatesand Hessian matrix have been obtained by a numerical algorithm in theoptimfunction implemented in R (R Development Core Team, 2011) Attempts to use theSkewt distribution have been made, however the algorithms for themaximization of

Semicontinuous regression models 9

0 2 4 6 8 10 12

02

46

810

12

Percentiles of chi−square distribution

Mah

alan

obis

dis

tanc

es

QQ−plot for normal distribution

residuals

0 2 4 6 8 10 12

02

46

810

12

Percentiles of chi−square distribution

Mah

alan

obis

dis

tanc

es

QQ−plot for skew−normal distribution

residuals

Fig. 3 Healy’s plot of the residuals of the non-zero values correctly identified against (left panel)the normal distribution and (right panel) the Skew-Normal distribution

the log likelihood did not converge. As far as it concerns thealgorithm for likelihoodmaximization in the Skew-Normal case, the more stable results have been obtainedusing simulated annealing. This algorithm is a global optimization method. It isquite inefficient compared to the classical Newton-Raphson-like methods but it isable to distinguish among several local optima. This characteristic makes it quiteconvenient in case of difficult log-likelihood functions.

To evaluate the model fitting, it is interesting to first compare fitted and observedvalues in terms of the binary componentY1. For this purpose, for each uniti weset y1(i) = 0 whenπ1(i) < π0(i). (The notation, used here and in the Appendix,πk(i),k = 0,1 instead ofπk is due to the presence of explanatory variables in thebinary component.) The percentage of zero and non-zero values correctly identifiedis 59.38 (58.36 for the normal model). In particular, the model underestimates thenumber of zero values in the data (as only 35.1% of the zeroes are correctly iden-tified). This can be due mainly to the absence among the measured covariates ofvariables describing personal and psychological factors that can influence the deci-sion to consume alcohol. Limiting our investigation to the 79.1% of non-zero valuescorrectly identified, Figure 3 reports the Healy’s plot (Healy, 1968) of the residualsagainst either the normal (left panel) or the Skew-Normal (right panel) distribution.The Figure exhibits a moderate improvement of the second model.

10 Gottard, Stanghellini & Capobianco

5 Conclusions

In this paper, the possibility of adopting skew distributions for modeling semicontin-uous data is analyzed. In particular, the Skew-Normal and the Skewt distributionsare contemplated. The resulting classes of models are useful whenever a non nega-tive continuous variable shows an anomalous clump in the zero value and asymme-try and/or kurtosis on positive values. This situation is quite common in many fieldsof application and can be easily carried out visually through a careful analysis ofdata histogram. For this particular type of data, the proposed models avoid the useof Box-Cox transformation that, when the clump at zero is toohigh, does not leadto acceptable results.

The semicontinuous regression models presented here can also be viewed as non-standard mixture models, being a mixture of two components,one binary and onewith positive support. Moreover, the clump at zero is supposed to have two possi-ble sources: one is due to a binary latent component, and the other to a continuouslatent component, whereby this takes negative values. Thistwofold source of zerovalues is appealing, for instance, in the study of alcohol expenditure from the ItalianHousehold Budget Survey, as presented in the paper. In fact,it allows to accountfor people having by chance no alcohol expenditure in the short period of observa-tion. For this data set we adopted a probit/Skew-Normal model, which showed to fitbetter than the probit/normal model.

The results obtained show some evidence that the use of alcohol is related to thatof tobacco, both in terms of drinking/not drinking (binary component) and in termsof levels (continuous component). Moreover, a white collarjob position seems toreduce the probability of drinking alcohol (binary component), but does not affectthe continuous component. The main negative effect on the level of alcohol expen-diture seems to be the presence in the family of children agedless than fourteen.Finally, income positively affects alcohol expenditure inboth components. In termsof model fitting, the proposed model seems to underestimate the number of zerovalues, probably due to some missing explanatory variableson the binary part.

Estimates are here obtained by maximum likelihood, via simulated annealing.As these models are in fact nonstandard mixture models, estimates could be alterna-tively obtained using the EM algorithm. The R functions for the likelihood functionof the proposal are available upon request from the authors.

References

1. Aristei, D., & Pieroni, L.: A double-hurdle approach to modelling tobacco consumption inItaly. Applied Economics,40, 2463–2476 (2008)

2. Azzalini, A., & Capitanio, A.: Statistical applicationsof the multivariate skew normal dis-tribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology),61,579–602 (1999)

3. Blundell, R., & Meghir, C.: Bivariate Alternatives to theTobit Model. Journal of Economet-rics,34, 179–200 (1987)

Semicontinuous regression models 11

4. Capobianco, R, Hutton, J., & Stanghellini, E.: Modellingcensored data with the skew-normaldistribution. In: 25th International Workshop on Statistical Modelling (IWSM2010) Proceed-ings. Glasgow, UK, 119–122 (2010)

5. Chai, H., & Bailey, K.: Use of log-skew-normal distribution in analysis of continuous datawith a discrete component at zero. Statistics in medicine,27, 3643–3655 (2008)

6. Couturier, D., & Victoria-Feser, M.: Zero-inflated truncated generalized Pareto distributionfor the analysis of radio audience data. The Annals of Applied Statistics,4, 1824–1846(2010)

7. Cragg, J.: Some statistical models for limited dependentvariables with application to thedemand for durable goods. Econometrica: Journal of the Econometric Society,39, 829–844(1971)

8. Duan, N., Manning, W., Morris, C., & Newhouse, J.: A comparison of alternative models forthe demand for medical care. Journal of Business & Economic Statistics,1, 115–126 (1983)

9. Healy MJR. Multivariate normal plotting. Applied Statistics17, 157–161 (1968)10. Hutton, J., & Stanghellini, E.: Modelling bounded health scores with censored skew-normal

distributions. Statistics in medicine,30, 368–376 (2011)11. Jones, A.: A double-hurdle model of cigarette consumption. Journal of Applied Economet-

rics,4, 23–39 (1989)12. Lambert, D.: Zero-inflated poisson regression, with an application to defects in manufactur-

ing. Technometrics,34, 1–14 (1992)13. Min, Y., & Agresti, A.: Modeling nonnegative data with clumping at zero: a survey. Journal

of the Iranian Statistical Society,1, 7–33 (2002)14. Moulton, L., Halsey, N.: A mixed gamma model for regression analyses of quantitative assay

data. Vaccine,14, 1154–1158 (1996)15. Mullahy, J.: Specification and testing of some modified count data models. Journal of Econo-

metrics,33, 341–365 (1986)16. Olsen, M., & Schafer, J.: A two-part random-effects model for semicontinuous longitudinal

data. Journal of the American Statistical Association,96, 730–745 (2001)17. Powell, J.: Estimation of Semiparametric Models. In: Robert F. Engle and Daniel L. Mc-

Fadden, eds. Handbook of econometrics, vol. IV, pp. 2443–521. Amsterdam: North Holland(1994)

18. R Development Core Team: R: A Language and Environment for Statistical Computing. Vi-enna, Austria: R Foundation for Statistical Computing. (2011)

19. Su, L., Tom, B., & Farewell, V.: Bias in 2-part mixed models for longitudinal semicontinuousdata. Biostatistics,10 , 374 (2009)

20. Tobin, J.: Estimation of relationships for limited dependent variables. Econometrica: Journalof the Econometric Society,26, 24–36 (1958)

Appendix

Under the assumption thatY ∗2 is Skew-Normal, the likelihood function for uniti is:

Li = [π0(i)+π1(i)P(Y ∗2 (i)≤ 0)]I[yi=0] · [π1(i) fSN(yi;µi,σ ,α)]I[yi>0]

= [π0(i)+π1(i)FSN(0;µi,σ ,α)]I[yi=0] · [π1(i) fSN(yi;µi,σ ,α)]I[yi>0]

=[

π0(i)+π1(i)(

1−2Φ2(0,−µiσ ;−ρ)

)]I[yi=0] ·

·[

π1(i)2σ φ( yi−µi

σ )Φ(α

σ (yi − µi))]I[yi>0]

12 Gottard, Stanghellini & Capobianco

HereΦ2(·;−ρ) denotes the distribution function of a standard bivariate normal ran-dom variable with−ρ as correlation coefficient, whereρ = α√

1+α2. The log likeli-

hood for uniti is then

ℓi = I[yi = 0] · log[π0(i)+π1(i)FSN(0;µi,σ ,α)]+

+I[yi > 0] log[π1(i) fSN(yi;µi,σ ,α)] .

Let us denote:zi = yi−µi

σ

ui = − µiσ

w(αzi) =φ(αzi)Φ(αzi)

Ai = 1π0+π1FSN(0,µi,σ ,α)

Then, the score functions for uniti with respect to (some of) the parameters of theSkew-Normal distribution are:

S(µi) = I[yi = 0] ·Ai · ∂∂ µi

FSN(0,µi,σ ,α)+ I[yi > 0] · 1fSN(yi,µi,σ ,α) ·

∂∂ µi

fSN(yi,µi,σ ,α)

S(αi) = I[yi = 0] ·Ai · ∂∂αi

FSN(0,µi,σ ,α)+ I[yi > 0] · 1fSN(yi,µi,σ ,α) ·

∂∂αi

fSN(yi,µi,σ ,α)

where

∂∂ µi

FSN(0,µi,σ ,α) = φ(ui)Φ(αui)σΦ2(0,ui)

∂∂ µi

fSN(yi,µi,σ ,α) = ziσ − α

σ w(αzi)

∂∂α FSN(0,µi,σ ,α) = φ(

√1+α2ui)

(1+α2)√

2πΦ2(0,ui)∂

∂α fSN(yi,µi,σ ,α) = ziw(αzi)

We can verify that whenα = 0, then

S(α)|α=0 = σ√

S(µ)|α=0

In fact, whenα = 0, thenAi = 1/(π0+π1Φ(yi)) = Ai|α=0 and

S(µ)|α=0 = I[yi = 0] Ai|α=0 · φ(ui)σΦ(ui)

+ I[yi > 0] · ziφ(yi)

S(α)|α=0 = I[yi = 0] Ai|α=0 · φ(ui)σΦ(ui)

σ√

2π + I[yi > 0] · zi

φ(zi)σ√