Nonparametric estimation of non-response distribution in the Israeli Social Survey

Yury Gubman, Dmitri Romanov

JSM 2009, Washington DC, 4/8/2009


Outline

1. Missing data generating mechanisms

2. Sharp bounds for conditional mean

3. Tests for MCAR and MAR assumptions

4. Empirical results - Israeli Social Survey 2006

5. Conclusions

Missing data generating mechanisms

Missing Completely At Random (MCAR): the non-respondents' data are ignorable:

$P(i_m \mid Y, Y_{nr}) = P(i_m)$   (1)

Missing At Random (MAR): conditionally on some set of covariates, the non-respondents' data are ignorable.

1. The non-respondents' data provide no additional information about the conditional distribution of y:

$P(y \mid x, z=0) = P(y \mid x, z=1)$   (2)

2. Given a set of survey design covariates X, the probability that y is missing does not depend on y:

$P(i_m \mid y, y_{nr}, X) = P(i_m \mid X)$   (3)

The MAR assumption cannot be tested statistically using the survey data only, because the non-respondents' data are not available. We overcome this difficulty by conditioning on fully known administrative covariates of census type.
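To make definition (2) concrete, the following minimal sketch builds a hypothetical population in which the response probability depends on the covariate x only. This is an assumed MAR mechanism, and all counts and probabilities are invented for illustration. Within each value of x the respondents' and non-respondents' means of y coincide, while the unconditional respondents' mean differs from the population mean.

```python
from fractions import Fraction

# Hypothetical population: counts[(x, y)] = number of persons with covariate x and binary y.
counts = {(0, 0): 60, (0, 1): 40, (1, 0): 20, (1, 1): 80}
# Assumed MAR mechanism: the response probability depends on x only, not on y.
p_respond = {0: Fraction(9, 10), 1: Fraction(1, 2)}

# Expected numbers of respondents (z=1) and non-respondents (z=0) in each cell.
respondents = {cell: n * p_respond[cell[0]] for cell, n in counts.items()}
nonrespondents = {cell: n * (1 - p_respond[cell[0]]) for cell, n in counts.items()}

def mean_y(weights, x=None):
    """Weighted mean of y, optionally restricted to one value of x."""
    cells = {c: w for c, w in weights.items() if x is None or c[0] == x}
    ones = sum(w for (_, y), w in cells.items() if y == 1)
    return Fraction(ones) / Fraction(sum(cells.values()))

for x in (0, 1):
    # Equation (2): E(y | x, z=0) = E(y | x, z=1) under the assumed mechanism.
    print(x, mean_y(respondents, x), mean_y(nonrespondents, x))

# Ignoring x, the respondents' mean differs from the population mean.
print(mean_y(counts), mean_y(respondents))
```

In this toy population the respondents over-represent x=0, so the unconditional comparison is biased even though nothing is lost within each x.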

Sharp bounds for conditional mean

Let z=1 for interview respondents, and 0 otherwise. Let w=1 for item respondents, and 0 otherwise. Let $s \in S$ be the survey strata, y a survey variable and x a covariate.

By the Law of Iterated Expectations:

$E(y \mid x) = E(y \mid x, z=1) P(z=1 \mid x) + E(y \mid x, z=0) P(z=0 \mid x)$

$= \sum_{s \in S} \{ E(y \mid x, s, z=1) P(s \mid x, z=1) \} P(z=1 \mid x) + E(y \mid x, z=0) P(z=0 \mid x)$   (4)

By Bayes' theorem:

$P(z=1 \mid x) = P(x, z=1) / P(x) = \sum_{s \in S} \{ P(x, z=1 \mid s) P(s) \} / P(x)$   (5)

Using covariates from administrative sources of census type allows us to assume:

1. There is no item non-response on the covariates x.

2. P(x), the overall population distribution of x, is known.
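Equation (5) can be sketched numerically as follows. The function name, the two strata and all probabilities are hypothetical. The only point is that P(z=1|x) follows from stratum-level quantities once P(x) is known from the administrative source.

```python
from fractions import Fraction

def response_prob_given_x(p_xz1_given_s, p_s, p_x):
    """Equation (5): P(z=1|x) = sum_s P(x, z=1 | s) P(s) / P(x)."""
    joint = sum(p_xz1_given_s[s] * p_s[s] for s in p_s)  # P(x, z=1)
    return joint / p_x

# Hypothetical inputs for one covariate value x and two strata.
p_s = {"A": Fraction(3, 5), "B": Fraction(2, 5)}             # stratum shares P(s)
p_xz1_given_s = {"A": Fraction(1, 5), "B": Fraction(1, 10)}  # P(x, z=1 | s)
p_x = Fraction(1, 4)                                         # P(x), assumed known

p_z1 = response_prob_given_x(p_xz1_given_s, p_s, p_x)
print(p_z1)      # P(z=1 | x)
print(1 - p_z1)  # P(z=0 | x)
```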

Sharp bounds for conditional mean

In addition:

$E(y \mid x, s, z=1) = E(y \mid x, s, z=1, w=1) P(w=1 \mid x, s, z=1) + E(y \mid x, s, z=1, w=0) P(w=0 \mid x, s, z=1)$   (6)

Combine (4), (5) and (6). The data reveal nothing at all about $E(y \mid x, z=0)$ and $E(y \mid x, s, z=1, w=0)$. The lower and upper bounds are obtained by minimizing and maximizing, respectively, the resulting expression with respect to all unobserved values.

The minimum and maximum exist because all survey variables are bounded.
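Substituting (6) into (4) makes the two unobserved quantities explicit. The display below is a sketch of that step. The symbols $y_{\min}$ and $y_{\max}$ for the limits of y are introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
E(y \mid x) &= \sum_{s \in S} \Big[ E(y \mid x,s,z{=}1,w{=}1)\,P(w{=}1 \mid x,s,z{=}1)
   + \underbrace{E(y \mid x,s,z{=}1,w{=}0)}_{\text{unobserved}}\,P(w{=}0 \mid x,s,z{=}1) \Big]
   P(s \mid x,z{=}1)\,P(z{=}1 \mid x) \\
&\quad + \underbrace{E(y \mid x,z{=}0)}_{\text{unobserved}}\,P(z{=}0 \mid x).
\end{align*}
```

Because $y_{\min} \le y \le y_{\max}$, replacing both unobserved means by $y_{\min}$ gives the lower bound, and replacing them by $y_{\max}$ gives the upper bound.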

Sharp bounds for conditional mean

For the full survey data: $LB_f \le E(y \mid x) \le UB_f$, where:

$LB_f = \sum_{s \in S} E(y \mid x, s, z=1, w=1) P(w=1 \mid x, s, z=1) P(s \mid x, z=1) P(z=1 \mid x)$,

$UB_f = \sum_{s \in S} [ E(y \mid x, s, z=1, w=1) P(w=1 \mid x, s, z=1) + P(w=0 \mid x, s, z=1) ] P(s \mid x, z=1) P(z=1 \mid x) + P(z=0 \mid x)$

The width of the interval between the bounds reflects both item and survey non-response.
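A minimal sketch of the full-sample bounds for one covariate value x, assuming y is scaled to [0, 1], so that unobserved means are replaced by 0 in the lower bound and by 1 in the upper bound. The function name and the stratum-level inputs are hypothetical.

```python
def full_sample_bounds(strata, p_interview):
    """Bounds LB_f <= E(y|x) <= UB_f for one covariate value x.

    strata: list of (mean_obs, p_item, p_stratum) with
        mean_obs  = E(y | x, s, z=1, w=1)
        p_item    = P(w=1 | x, s, z=1)
        p_stratum = P(s | x, z=1)
    p_interview: P(z=1 | x)
    Assumes y is scaled to [0, 1].
    """
    lower = sum(m * pw * ps for m, pw, ps in strata) * p_interview
    upper = (sum((m * pw + (1 - pw)) * ps for m, pw, ps in strata) * p_interview
             + (1 - p_interview))
    return lower, upper

# Hypothetical estimates for two strata.
strata = [(0.6, 0.9, 0.5), (0.4, 0.8, 0.5)]
lb, ub = full_sample_bounds(strata, p_interview=0.8)
print(round(lb, 3), round(ub, 3), round(ub - lb, 3))
```

The observed mean cancels in the difference ub - lb, so the width depends only on the non-response probabilities.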

Sharp bounds for conditional mean

For item non-response analysis, only the respondents' data are used. In this case, the formula for the sharp bounds simplifies to:

$LB_r = \sum_{s \in S} E(y \mid x, s, w=1) P(w=1 \mid x, s) P(s \mid x)$

$UB_r = \sum_{s \in S} [ E(y \mid x, s, w=1) P(w=1 \mid x, s) + P(w=0 \mid x, s) ] P(s \mid x)$

The width of the interval between the bounds reflects item non-response only.

Nothing was assumed about the true missing data generating mechanism.
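The item non-response bounds can be estimated directly from the respondents' records. The sketch below assumes a binary y, unweighted sample shares as plug-in estimates, and a hypothetical record layout of (stratum, y) with `None` marking a missing item. None of these choices comes from the slides.

```python
from collections import defaultdict
from fractions import Fraction

def item_bounds(records):
    """LB_r and UB_r for one covariate value x from interview respondents' records.

    records: list of (stratum, y) with y in {0, 1}, or None when the item is missing.
    Plug-in estimates use unweighted sample shares (an assumption of this sketch).
    """
    by_stratum = defaultdict(list)
    for stratum, y in records:
        by_stratum[stratum].append(y)
    n = len(records)
    lower = upper = Fraction(0)
    for values in by_stratum.values():
        observed = [y for y in values if y is not None]
        p_s = Fraction(len(values), n)               # P(s | x)
        p_w1 = Fraction(len(observed), len(values))  # P(w=1 | x, s)
        mean = Fraction(sum(observed), len(observed)) if observed else Fraction(0)
        lower += mean * p_w1 * p_s
        upper += (mean * p_w1 + (1 - p_w1)) * p_s
    return lower, upper

records = [("A", 1), ("A", 0), ("A", None), ("A", 1),
           ("B", 1), ("B", None), ("B", None), ("B", 1)]
lb, ub = item_bounds(records)
print(lb, ub)
```

With unweighted shares the stratum terms telescope: the lower bound equals the mean of y with every missing item set to 0, and the upper bound equals the mean with every missing item set to 1.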

Testing MCAR and MAR

Define:

$I_f(y \mid x) = UB_f - LB_f$

$I_r(y \mid x) = UB_r - LB_r$

The explicit expression for $I_f$ is given by:

$I_f(y \mid x) = \sum_{s \in S} P(w=0 \mid x, s, z=1) P(s \mid x, z=1) P(z=1 \mid x) + P(z=0 \mid x)$

and for $I_r$ by:

$I_r(y \mid x) = \sum_{s \in S} P(w=0 \mid x, s) P(s \mid x)$

$I_f$ and $I_r$ are asymptotically normal, and do not depend on $E(y \mid x)$ or on the sample size.

Their standard deviations can be estimated using the bootstrap. A t-test for equal means of two populations with unknown and different variances is used to check the null hypothesis in the following cases.
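Summing the stratum terms in the expression for the full-sample width gives P(z=1, w=0 | x) + P(z=0 | x), that is, the probability that y is unobserved given x. The sketch below estimates this share for one hypothetical cell and its bootstrap standard deviation. Simple with-replacement resampling of units is an assumption here; a design-based bootstrap would resample within strata.

```python
import random

def width(statuses):
    """Share of units in the cell whose y is unobserved (interview or item non-response)."""
    return sum(s != "observed" for s in statuses) / len(statuses)

def bootstrap_sd(statuses, n_boot=500, seed=0):
    """Standard deviation of the width over simple with-replacement resamples."""
    rng = random.Random(seed)
    n = len(statuses)
    reps = [width(rng.choices(statuses, k=n)) for _ in range(n_boot)]
    mean = sum(reps) / n_boot
    return (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5

# Hypothetical cell of 200 sampled units for one covariate value x.
cell = ["observed"] * 150 + ["item_missing"] * 20 + ["non_respondent"] * 30
print(width(cell), bootstrap_sd(cell))
```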

Testing MCAR

H0 for testing overall non-response is given by:

$H_0^f: I_f(y \mid x = x_i) = I_f(y \mid x = x_j), \quad i, j \in [1, k]$

and for item non-response:

$H_0^r: I_r(y \mid x = x_i) = I_r(y \mid x = x_j), \quad i, j \in [1, k]$

If H0 is rejected for some i, j, we conclude that the probability of being a non-respondent depends on x. In particular, the MCAR assumption is violated.
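The pairwise comparison over the k values of x can be sketched as follows. The widths and standard deviations are hypothetical. The p-value uses a normal approximation in place of the t distribution, which leans on the asymptotic normality of the widths and is a simplification of this sketch.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

def compare_widths(est_i, sd_i, est_j, sd_j):
    """Unequal-variance statistic for H0: I(y|x=x_i) = I(y|x=x_j), with a normal-approximation p-value."""
    stat = (est_i - est_j) / sqrt(sd_i ** 2 + sd_j ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value

# Hypothetical widths and bootstrap standard deviations for k = 3 values of x.
widths = {"x1": (0.30, 0.03), "x2": (0.20, 0.04), "x3": (0.22, 0.03)}
for a, b in combinations(widths, 2):
    stat, p = compare_widths(*widths[a], *widths[b])
    print(a, b, round(stat, 2), round(p, 4))
```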

Testing MAR (1)

Let $y^*$ be a variable from the administrative database which is strongly correlated with the key survey variable y and/or with the survey topic. In such a case, $y^*$ may be treated as a survey variable, and it is known for all sampled units.

Under MAR, the respondents' data are sufficient to estimate the conditional population distributions of all survey variables, and in particular of $y^*$.

The null hypothesis is given by:

$E(y^* \mid x, z=0) = E(y^* \mid x, z=1)$

If the null hypothesis is rejected, the survey non-response distribution depends on the survey topic and/or on the survey variable, and this contradicts MAR.
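Since the administrative variable is known for respondents and non-respondents alike, the two conditional means can be compared directly within a stratum. The sketch below computes the unequal-variance (Welch) t statistic and its Welch-Satterthwaite degrees of freedom from two hypothetical samples. Turning the statistic into a p-value needs a t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df

# Hypothetical values of the administrative variable y* within one stratum.
y_star_respondents = [3, 4, 3, 5, 4]
y_star_nonrespondents = [2, 3, 2, 3]
t, df = welch_t(y_star_respondents, y_star_nonrespondents)
print(round(t, 3), round(df, 2))
```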

Testing MAR (2)

Use the MAR definition: $P(i_m \mid y, y_{nr}, X) = P(i_m \mid X)$.

Let X be a set of survey design covariates which "controls" the bias in survey variables (due to non-response). Let $x^*$ be a categorical, fully observed covariate that is orthogonal to X.

Assuming MAR and conditional on X, the survey non-response rates should be independent of $x^*$.

H0 for testing the MAR assumption is given by:

$I_f(y \mid X, x^* = x_i^*) = I_f(y \mid X, x^* = x_j^*)$

for overall non-response, and

$I_r(y \mid X, x^* = x_i^*) = I_r(y \mid X, x^* = x_j^*)$

for item non-response.

Rejection of H0 means that, conditionally on the set of survey design covariates X, the MAR assumption is violated. If $x^*$ is strongly correlated with some survey variable, rejection of H0 means that the non-response depends on the survey variable and/or the survey topic.
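A rough sketch of the idea: within one cell of the design covariates X, tabulate the non-response rate by category of the extra covariate and compare two categories. The record layout, the category labels and the binomial standard error (used instead of the bootstrap) are assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

def nonresponse_rates(records):
    """records: iterable of (X_cell, x_star, responded) -> {X_cell: {x_star: (rate, n)}}."""
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [non-respondents, total]
    for cell, x_star, responded in records:
        tally[cell][x_star][0] += int(not responded)
        tally[cell][x_star][1] += 1
    return {cell: {k: (nr / n, n) for k, (nr, n) in groups.items()}
            for cell, groups in tally.items()}

def z_statistic(rate_a, n_a, rate_b, n_b):
    """Difference in rates over its binomial standard error (a simplification of the bootstrap)."""
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return (rate_a - rate_b) / se

# Hypothetical design cell "c1" split by two categories of x*.
records = ([("c1", "low", True)] * 40 + [("c1", "low", False)] * 10 +
           [("c1", "high", True)] * 30 + [("c1", "high", False)] * 20)
rates = nonresponse_rates(records)["c1"]
z = z_statistic(*rates["high"], *rates["low"])
print(rates, round(z, 3))
```

A large absolute value of the statistic within a design cell suggests that non-response still depends on the extra covariate after conditioning on X.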

Israeli Social Survey 2006

The Israeli Social Survey (ISS) has been conducted annually since 2002 on a sample of persons aged 20 and older. The main purpose of the ISS is to provide up-to-date information on the welfare of Israelis and on their living conditions. The ISS is the first survey conducted by ICBS using the Population Register as a sampling frame.

The sample size in 2006 was 9,499 persons. Of these, 562 persons did not belong to the sampling frame (deceased or abroad for over a year), so the final sample included 8,937 persons; 1,648 of them did not respond to the survey (18.4 percent of the final sample).
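The sample accounting can be checked with a few lines of arithmetic. The variable names are illustrative, and the number of respondents is derived here rather than quoted from the slides.

```python
gross_sample = 9_499     # persons drawn in 2006
out_of_frame = 562       # deceased or abroad for over a year
non_respondents = 1_648

final_sample = gross_sample - out_of_frame
respondents = final_sample - non_respondents  # derived, not stated on the slide
non_response_rate = 100 * non_respondents / final_sample

print(final_sample, respondents, round(non_response_rate, 1))
```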

Israeli Social Survey 2006

Four key ISS variables were chosen:

1. Worked last week (4 categories) - no item non-response;

2. Optimism - general (3 categories) - item non-response rate is 11.0 percent;

3. Gross salary from all places of work (10 categories) - item non-response rate is 5.3 percent;

4. Degree of religiosity - Jews (5 categories) - item non-response rate is 0.5 percent.

We use three administrative covariates which are highly correlated with the survey topic and some important survey variables:

1. Reported income from work (Tax Authority);

2. Work status (Tax Authority);

3. Degree of religiosity of Jews (derived from the educational databases).

Empirical results: testing covariates' conditional distributions

$H_0: E(y^* \mid x, z=0) = E(y^* \mid x, z=1)$

Estimated means of administrative covariates, by strata

| Strata | N resp | N non-resp | Income from work: respondents | Income from work: non-respondents | Income from work: p-value | Labor status: respondents | Labor status: non-respondents | Labor status: p-value | Religiosity - Jews: respondents | Religiosity - Jews: non-respondents | Religiosity - Jews: p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Israeli-born Jews, males, age group 25-34 | 477 | 136 | 3.2470 | 3.2111 | 0.0055 | 1.5625 | 1.6876 | <0.0001 | 1.4173 | 1.5846 | <0.0001 |
| Jews, immigrants who arrived by 1989, females, age group 55-64 | 235 | 36 | 3.3188 | 3.0877 | <0.0001 | 2.1324 | 2.2832 | <0.0001 | 1.2525 | 1.9048 | <0.0001 |
| Jews, immigrants who arrived in 1990 and later, males, age group 45-54 | 95 | 27 | 3.2249 | 3.0000 | 0.2318 | 1.3337 | 2.8503 | <0.0001 | 1.1802 | 2.0146 | <0.0001 |
| Arabs outside East Jerusalem, males, age group 35-44 | 115 | 13 | 3.2677 | 2.8763 | 0.0008 | 1.5631 | 1.7353 | 0.0001 | --- | --- | --- |
| Arabs outside East Jerusalem, females, age group 25-34 | 133 | 22 | 2.5654 | 2.2700 | 0.0007 | 2.2244 | 2.1927 | 0.3195 | --- | --- | --- |

Respondents' and non-respondents' distributions differ significantly (p < 0.01 in 11 of the 13 comparisons shown).

Empirical results: testing MAR for interview non-response

$H_0: I_f(y \mid X, x^* = x_i^*) = I_f(y \mid X, x^* = x_j^*)$

[Figure: three panels for "Working last week": (a) conditional on reported income (categories 1-5); (b) conditional on work status (salaried employee; self-employed; unemployed or not part of the labor force); (c) conditional on degree of religiosity - Jews (not religious, secular; religious; ultra-religious (Haredi)).]

H0 is rejected, p-value < 0.01 for all three covariates.

Empirical results: testing MAR for item non-response

$H_0: I_r(y \mid X, x^* = x_i^*) = I_r(y \mid X, x^* = x_j^*)$

[Figure: three panels for "Gross salary from all places of work": (a) conditional on reported income (categories 1-5); (b) conditional on labor market status (salaried employee; self-employed; unemployed or not part of the labor force); (c) conditional on degree of religiosity - Jews (not religious, secular; religious; ultra-religious (Haredi)).]

H0 is rejected, p-value < 0.01 for all three covariates.

Conclusions

We propose nonparametric statistical tests for checking the validity of the MCAR and MAR assumptions, where the test statistics are based on the width of the interval between the estimated sharp bounds for the conditional mean.

Significant departures from the MAR assumption were found in the ISS 2006 data. Non-response propensity varies significantly between population groups assumed to be homogeneous according to the survey design.

The ISS survey design can be improved using available administrative covariates, such as income, labor market status, and degree of religiosity of Jews.

Yury Gubman
Senior Coordinator
Israeli Central Bureau of Statistics
[email protected]
Tel. 972 (2) 6593204
Fax 972 (2) 6593203