
EVALUATION OF DUN AND BRADSTREET'S DMI FILE AS AN ESTABLISHMENT SURVEY SAMPLING FRAME

Christopher L. Moriarity, National Center for Health Statistics
David W. Chapman, Federal Deposit Insurance Corporation

Christopher L. Moriarity, 6525 Belcrest Rd., Room 915, Hyattsville, MD 20782 USA
[email protected]

ABSTRACT

The National Employer Health Insurance Survey (NEHIS) was conducted in 1994 to collect data on U.S. employer-sponsored health insurance. An important aspect of conducting the 1994 NEHIS was the choice of the private sector sampling frame. Because of confidentiality regulations, the business establishment lists maintained by the U.S. Bureau of the Census and the U.S. Bureau of Labor Statistics were not available. Therefore, the Duns Market Identifiers (DMI) File, a commercially available business establishment list from the Dun and Bradstreet Corporation, was used.

This paper presents an evaluation of the DMI File as a sampling frame for establishment surveys. Topics discussed include issues related to frame coverage and the accuracy of information contained in the file.

Key Words: Undercoverage; Overcoverage; Prospecting Records

1. INTRODUCTION

The National Employer Health Insurance Survey (NEHIS) was conducted in 1994 to collect data on U.S. employer-sponsored health insurance. The NEHIS was co-sponsored by three U.S. Federal Government agencies: the National Center for Health Statistics (NCHS), the Health Care Financing Administration, and the Agency for Health Care Policy and Research (now called the Agency for Healthcare Research and Quality). The survey contractor for the NEHIS was Westat, Inc. The NEHIS asked questions on (a) the names, types, and characteristics of health insurance plans offered, if any, (b) the number of employees enrolled, and (c) health insurance costs. Followups to the 1994 NEHIS have been conducted since 1997 as the Medical Expenditure Panel Survey - Insurance Component by the Agency for Healthcare Research and Quality.

An important aspect of conducting the 1994 NEHIS was the choice of the sampling frame for the private sector portion of the survey. Because of confidentiality regulations, the business establishment lists maintained by the U.S. Bureau of the Census and the U.S. Bureau of Labor Statistics were not available. The Duns Market Identifiers (DMI) File, a commercially available business establishment list from the Dun and Bradstreet Corporation, was used as the NEHIS private sector sampling frame. (Note: The Dun and Bradstreet Corporation recently renamed the DMI File as "Prospecting Records." For ease of exposition, we refer to the file by the name it was known by when we worked with it.)

This paper presents information learned from our work with the DMI File as the sampling frame for the 1994 NEHIS. This paper, along with results presented in Marker and Edwards (1997) and the other references we provide below, can serve as useful information to survey researchers considering the use of the DMI File as a sampling frame for a U.S. business establishment survey.

Most of the work summarized in this paper was conducted between 1993 and 1995, using DMI File resources from late 1993 and early 1994. The 1994 NEHIS did not use the public sector portion of the DMI File, nor was any use made of the private sector total sales information. Hence, we are not able to provide any evaluation of the quality of the public sector data, nor the sales information.

----------

Notes: David W. Chapman was a contract employee at NCHS at the time the NEHIS was conducted. The views expressed in this paper are attributable to the authors and do not necessarily reflect those of the National Center for Health Statistics, Centers for Disease Control and Prevention, or the Federal Deposit Insurance Corporation.


However, the NEHIS made extensive use of the employee count information and corporate linkage information in the private sector portion of the DMI File, and we discuss what we learned in the process.

The organization of this paper is as follows: we begin with an overview of the DMI File, summarize results presented in Marker and Edwards (1997), discuss issues related to file coverage and the accuracy of information in the DMI File, and then conclude with some suggestions and recommendations for potential users of the DMI File and for the Dun and Bradstreet Corporation.

2. OVERVIEW OF THE DMI FILE

The DMI File contains information about business establishments (individual business locations) in the private sector and government entities in the public sector. Individual records in the private sector portion of the file represent a business establishment, and include basic information such as company name, address, telephone, names of corporate officers, etc. Also, information about total sales and the number of employees is provided.

The primary source of business establishments for the DMI File, which is updated monthly, is credit inquiries. Other sources of business establishment listings used by Dun and Bradstreet include Department of Motor Vehicle records, newspapers, commercial telephone directories ("yellow pages"), unemployment insurance records, and other public records.

Each business establishment record is assigned a unique "DUNS Number" by Dun and Bradstreet. The DMI File uses DUNS Numbers to link individual business establishments to their "parent" establishments, for establishments that are part of a multi-establishment firm. This corporate linkage information is of particular value to potential DMI File users who need to know the structure of multi-establishment firms.

3. SUMMARY OF RESULTS IN THE MARKER AND EDWARDS (1997) PAPER

Marker and Edwards (1997) describe information that Westat, Inc., learned as a result of their NEHIS work with the private sector portion of the DMI File and from another U.S. business establishment survey for which Westat was the survey contractor prior to the NEHIS. We summarize here some of the results presented in Marker and Edwards (1997):

• The number of employees in the business establishment was missing on about 13 percent of the records.

• The DMI File contained a substantial proportion of records that do not correspond to currently-existing business establishments.

• The reported information for the number of employees often, but not always, was consistent with the number of employees reported for sampled business establishments in the NEHIS.

• The corporate linkage information on the DMI File generally appeared to be of good quality.

• A reduced version of the DMI File, referred to as the DMI "Abstract File," can be purchased from Dun and Bradstreet, and it was a very useful tool for designing and selecting a sample from the DMI File. (The DMI Abstract File contains all the records that are in the complete DMI File, but only a portion of the data items; e.g., it includes number of employees and corporate linkage information, but does not include names or addresses.)


4. UNDERCOVERAGE AND OVERCOVERAGE: SAMPLING FRAME ISSUES

It is impossible for Dun and Bradstreet, or anyone else, to have a list of U.S. business establishments that is complete and accurate at all times. Due to the continuous processes of "births" and "deaths" of businesses, as well as other important events such as acquisitions, mergers, spin-offs, and moves, a hypothetical business establishment list that was complete and accurate at some moment in time would become incomplete and inaccurate almost immediately. Given that some amount of inaccuracy cannot be avoided in the DMI File, it is important for potential DMI File users to have a good understanding of what kinds of inaccuracies they might expect to encounter. Our discussion first focuses on "undercoverage," i.e., erroneous omissions. We then discuss "overcoverage," i.e., erroneous inclusions.

A useful first step in a coverage evaluation is an aggregate comparison of DMI File information to other references, such as figures published by the U.S. Bureau of the Census and the U.S. Bureau of Labor Statistics. Marker and Edwards (1997) do such a comparison, as do McCauley (1981) and Howland (1988). However, as pointed out by Cox (1997), consistency of totals does not necessarily imply complete coverage; undercoverage could be masked by overcoverage.

4.1 UNDERCOVERAGE IN THE DMI FILE: DELISTING

Dun and Bradstreet expends considerable effort updating the DMI File. Hence, most potential DMI File users would expect that no identified establishments would be purposely excluded from the DMI File (e.g., see Marker and Edwards 1997, p. 21). However, we found that some establishments are purposely excluded from the DMI File.

We discovered a problem with omitted establishments when we used the DMI Abstract File corporate linkage information to produce an estimate of the number of employees in the firm for business establishments sampled for the NEHIS that are part of a multi-establishment firm (Moriarity and Siller, 1995). Theoretically, this information can be obtained for a given business establishment record by following the corporate linkage information in the DMI File to the "ultimate headquarters record." However, we were not always able to locate the "ultimate headquarters record" for the NEHIS sample cases. We contacted Dun and Bradstreet about this matter, and we supplied specific examples where we had been unable to locate the ultimate headquarters record. Dun and Bradstreet personnel then explained to us that we had identified some "delisted" records. When we requested an explanation of what "delisted" meant, we were informed that businesses can choose not to be listed in the DMI File.
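
As an illustration of the linkage-following step described above, the sketch below walks parent DUNS Numbers until an ultimate headquarters record is reached and flags chains that dead-end, which is the symptom that revealed the delisted records. The record layout and names (parent_of, find_ultimate_hq) are hypothetical, not the actual DMI File structure.

```python
# Hedged sketch: follow corporate linkage to the ultimate headquarters record.
# parent_of maps a DUNS Number to its parent's DUNS Number (illustrative only).

def find_ultimate_hq(duns, parent_of, max_depth=25):
    """Return (ultimate DUNS, status), where status explains any failure."""
    seen = set()
    current = duns
    while len(seen) < max_depth:
        if current in seen:
            return current, "cycle"          # defensive check for bad linkage
        seen.add(current)
        parent = parent_of.get(current)
        if parent is None or parent == current:
            return current, "ok"             # no parent recorded: treat as HQ
        if parent not in parent_of:
            return parent, "missing"         # parent absent, e.g., delisted
        current = parent
    return current, "too deep"

# Toy linkage: establishment 3 points to a parent (99) that is not on the file.
parent_of = {1: 1, 2: 1, 3: 99}
print(find_ultimate_hq(2, parent_of))  # (1, 'ok')
print(find_ultimate_hq(3, parent_of))  # (99, 'missing')
```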

While our research detected that delisting had occurred, we were not able to estimate the extent of delisting.

Clearly, the delisting procedure raises serious concerns regarding the use of the DMI File as a business establishment survey sampling frame, as delisted establishments have no chance of being sampled. We also are concerned that delisting was not mentioned in any of the DMI File documentation that was provided to us by Dun and Bradstreet. (The issue of less than optimal communication between Dun and Bradstreet and DMI File users on DMI File issues has been documented previously; e.g., Verway 1980, 1981.)

4.2 UNDERCOVERAGE IN THE DMI FILE: INFORMATION FROM THE NEHIS POSTSTRATIFICATION PROCESS

As described in Wallace et al. (1995), the respondent weights in the private sector portion of the NEHIS were ratio adjusted (poststratified) to align with independent estimates of the number of employees. This weight ratio adjustment method dampens sampling variability in estimates that are correlated with the number of employees, and it also provides a mechanism for adjusting for sampling frame undercoverage.
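
As a sketch of this adjustment (with invented cell labels and a simplified data layout, not the NEHIS weighting system), each respondent weight in a cell is multiplied by the ratio of the independent employment total for the cell to the weighted survey total:

```python
# Hedged sketch of poststratification to independent employment totals.
from collections import defaultdict

def poststratify(records, independent_totals):
    """records: dicts with 'cell', 'weight', 'employees'; adjusts in place."""
    weighted = defaultdict(float)
    for r in records:
        weighted[r["cell"]] += r["weight"] * r["employees"]
    factors = {c: independent_totals[c] / t for c, t in weighted.items() if t > 0}
    for r in records:
        r["weight"] *= factors.get(r["cell"], 1.0)
    return factors

records = [{"cell": "MD-services-1to9", "weight": 120.0, "employees": 4},
           {"cell": "MD-services-1to9", "weight": 95.0, "employees": 7}]
print(poststratify(records, {"MD-services-1to9": 1550.0}))
# factor ~ 1.35: weights inflated, consistent with undercoverage of
# small establishments
```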


The weight ratio adjustments were applied using independent estimates of employment provided by the U.S. Bureau of Labor Statistics. The ratio adjustment cells were formed within state, cross-classified by two broad groups of Standard Industrial Classification (SIC) Codes and by four business establishment employment sizes (1-9, 10-49, 50-249, 250 and above). Approximately 400 ratio adjustment cells were formed, with about 100 cells in each employment size category.

The median of the weight ratio adjustment factors was derived for each of the four employment size categories; Table 1 displays the median values. Interquartile ranges are included in Table 1 to provide a measure of dispersion for the weight ratio adjustment factors within each of the employment size categories.

Table 1: Median Weight Adjustment Factors Used in the 1994 NEHIS, by Employment Size Class

Employment Size   Median Ratio Adjustment Factor   Interquartile Range

1-9               1.35                             0.35
10-49             1.12                             0.24
50-249            0.98                             0.29
250 and over      0.88                             0.36
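
The summaries in Table 1 can be computed mechanically from the cell-level factors. The sketch below does so for two size classes; the factor values are invented for illustration, since the roughly 100 actual cell factors per size class are not reproduced here.

```python
# Hedged sketch: median and interquartile range of cell-level adjustment
# factors within an employment size class (illustrative values only).
import statistics

def median_and_iqr(factors):
    q1, _, q3 = statistics.quantiles(factors, n=4)  # quartile cut points
    return statistics.median(factors), q3 - q1

cell_factors = {"1-9": [1.10, 1.25, 1.35, 1.48, 1.60],
                "250 and over": [0.70, 0.80, 0.88, 0.95, 1.10]}
for size, f in cell_factors.items():
    med, iqr = median_and_iqr(f)
    print(size, round(med, 2), round(iqr, 2))
```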

The median weight ratio adjustment factor of 1.35 for business establishments with small employment (1-9 employees) suggests that substantial undercoverage exists in the DMI File for establishments in this size category.

Aldrich et al. (1989) describe an evaluation study which showed that the portion of the DMI File being evaluated in the study had substantial undercoverage of new businesses. If most new businesses are assumed to be small, this could provide a partial explanation for the level of small-employment business undercoverage suggested by information from the NEHIS poststratification process.

4.3 OVERCOVERAGE

The DMI File has a problem with overcoverage, as discussed by Marker and Edwards (1997), Phillips (1993), and references cited in Davis et al. (1993). While some overcoverage cannot be avoided, it appears that Dun and Bradstreet could expend more effort to reduce the amount of overcoverage in the DMI File.

For a sample selected from the DMI File, overcoverage leads to increased survey costs due to efforts to contact cases that are determined to be out-of-scope (e.g., no longer in business). Over 8% of the DMI File business establishment records selected for the NEHIS were classified as out-of-scope.

Also, overcoverage in the DMI File leads to the occurrence of cases with undetermined eligibility status; this introduces additional uncertainty in survey estimates. For example, in the NEHIS, approximately 12% of the sampled DMI File business establishment records wound up being classified as "dead-end" (undetermined eligibility status) cases. The NEHIS used computer-assisted telephone interviewing for data collection, and "dead-end" cases resulted from the following scenarios:

• Some sample cases did not have a telephone number. For all cases that did not have a telephone number, directory assistance was contacted by Westat, Inc., to try to obtain a number. Some of these cases did not have a listing.


• Some sample cases had an invalid telephone number; i.e., a recording would indicate the number was nonworking, or questionable sounds such as a "fast busy" would occur. Westat, Inc., contacted directory assistance to try to verify the number or obtain an updated number. If there was no listing, or the same number was obtained, the case was considered a "dead end." If a different number was obtained from directory assistance, and if this number either was invalid or the respondent indicated that the number was not for the sample establishment, the case was considered a "dead end."

• Some sample cases had working telephone numbers, but no contact was made after several attempts. Directory assistance was contacted, but they did not have a different listing for the sample case.

Personnel from the U.S. government agencies sponsoring the NEHIS, and from Westat, Inc., conducted special procedures to attempt to get more information about 50 "dead-end" cases randomly selected from all "dead-end" cases in Maryland. The results of these efforts have been summarized previously by Moriarity (1995). Almost all of the cases appeared to be no longer in business; in fact, some of the cases had been out of business for 5, 10, or even 15 years. In some instances, new unrelated businesses were occupying the same locations as the former businesses. This discovery suggests that for some addresses there may be multiple listings in the DMI File (with only one, the current business, being valid).

5. ACCURACY OF INFORMATION IN THE DMI FILE

As mentioned in the previous section, the DMI File has some inaccurate telephone number information. This is an issue for anyone considering use of the DMI File for a telephone survey.

Similarly, there appear to be some problems with addresses, and this could mask the existence of duplicate records. For example, two cases that were part of the "dead-end" investigation were located on the same street, but one case had a street name of "Westside Dr.," while the other case had a street name of "W. Side Dr." Address standardization software is available to resolve anomalies such as these, and to help locate cases with potentially incorrect addresses. Since address inaccuracies are an issue for researchers considering use of the DMI File for a mailback survey, address standardization would be a recommended approach for Dun and Bradstreet to implement.
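
A toy illustration of why standardization helps: normalizing abbreviations, punctuation, and spacing makes the "Westside Dr." / "W. Side Dr." pair above collide. This is a deliberately minimal sketch, not production address standardization software.

```python
# Minimal address normalization sketch for duplicate detection.
import re

ABBREV = {"w": "west", "e": "east", "n": "north", "s": "south",
          "dr": "drive", "st": "street", "rd": "road"}

def normalize(address):
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    tokens = [ABBREV.get(t, t) for t in tokens]
    return "".join(tokens)  # drop spaces so "west side" matches "westside"

print(normalize("Westside Dr.") == normalize("W. Side Dr."))  # True
```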

In addition, we discovered some inconsistencies between employment information for the business establishment versus employment for the firm when we used the DMI Abstract File corporate linkage information to produce an estimate of the number of employees in the firm for business establishments sampled for the NEHIS that are part of multi-establishment firms (Moriarity and Siller, 1995). For example, we discovered instances where, within the set of records for a given firm, an individual establishment record had a larger number of employees than the firm employee count on the associated "ultimate headquarters" record.

We also discovered some anomalies in the corporate linkage information that complicated our effort to produce an estimate of the number of employees in the firm for business establishments sampled for the NEHIS that are part of a multi-establishment firm. For example, we encountered cases where the DMI File variable "status" was equal to "0" and the variable "subsidiary" was equal to "3." This is a contradiction because Status = 0 indicates a single-location establishment, but Subsidiary = 3 indicates the establishment is a subsidiary of another entity, hence not a single-location establishment.
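
Both anomalies lend themselves to simple automated checks. The sketch below encodes the two checks implied above; the field names ("status", "subsidiary", "employees_here", "firm_duns") are illustrative stand-ins, not the actual DMI record layout.

```python
# Hedged sketch of two consistency checks on DMI-style records.

def check_record(rec, firm_employees):
    """firm_employees maps a firm's HQ DUNS to the firm-level employee count."""
    problems = []
    # Status = 0 (single location) contradicts Subsidiary = 3 (a subsidiary).
    if rec["status"] == 0 and rec["subsidiary"] == 3:
        problems.append("single-location record flagged as a subsidiary")
    # An establishment cannot employ more people than its entire firm.
    firm_total = firm_employees.get(rec["firm_duns"])
    if firm_total is not None and rec["employees_here"] > firm_total:
        problems.append("establishment count exceeds firm count")
    return problems

rec = {"status": 0, "subsidiary": 3, "firm_duns": 7, "employees_here": 500}
print(check_record(rec, {7: 200}))  # both checks fire
```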

6. CONCLUSIONS AND RECOMMENDATIONS

Although improvements are warranted, we believe that the DMI File is the best U.S. business establishment list that is available to the public. In particular, we believe that the corporate linkage information available with the DMI File is more complete than that provided in any other file available to the public.


We have described some anomalies that existed in the DMI File at the time when it was used as the sampling frame for the 1994 NEHIS. There is evidence in the existing literature that Dun and Bradstreet has been making a steady effort to improve the quality of the DMI File, and we encourage them to continue those efforts. Several specific recommendations are given below.

6.1 RECOMMENDATIONS FOR DMI FILE USERS

We agree with Marker and Edwards' (1997) recommendation to acquire the DMI Abstract File for survey design research and sample selection. The purchase price of the DMI Abstract File was substantial; however, acquisition of the file permitted Westat, Inc., and the NEHIS government sponsoring agencies to carry out extensive survey design research for the NEHIS.

Until the amount of overcoverage in the DMI File is reduced, we suggest that DMI File users request Dun and Bradstreet to provide assistance to help resolve sample cases with unknown eligibility status, and to provide a refund for sample cases that are found to be out of business. It may be possible to include such arrangements in the contract between Dun and Bradstreet and the DMI File user.

6.2 RECOMMENDATIONS FOR THE DUN AND BRADSTREET CORPORATION

We recommend that Dun and Bradstreet consider the following measures to improve the quality of the DMI File:

• Verify telephone numbers using independent sources, when possible; consider the deletion of records where a valid telephone number cannot be obtained. (It seems unlikely that a viable business would lack a telephone number.)

• Standardize addresses, and do a match within the DMI File to identify and resolve all occurrences where multiple records have the same address.

• Perform consistency checks for employment information for establishments within multi-establishment firms; resolve inconsistencies.

Phillips (1993) outlines an elaborate system of editing and imputation to resolve missing and inconsistent information on the DMI File. To the extent that Dun and Bradstreet is not already applying these techniques, we suggest that they consider doing so.

We also recommend that Dun and Bradstreet provide the best information they can to DMI File users about the file's undercoverage, overcoverage, and quality of the individual data items (e.g., employment, SIC Code, NAICS Code, corporate linkage information) on the file. In particular, if the delisting procedure is going to be continued by Dun and Bradstreet, the DMI File documentation should state that delisting is done, the reasons why, and the extent to which it is done. Also, it would be useful if information were provided about the amount of missing data for variables such as telephone number, employment, corporate linkage, etc.

ACKNOWLEDGMENTS: The authors thank two persons who provided very helpful insights and information to us about the DMI File: Brian MacDonald of the U.S. Bureau of Labor Statistics, and Bruce Phillips of the U.S. Small Business Administration. The authors also thank several employees of the Dun and Bradstreet Corporation who made a sincere effort to answer the many questions we posed to them about the DMI File.


REFERENCES

Aldrich, H., A. Kalleberg, P. Marsden, and J. Cassell (1989), "In Pursuit of Evidence: Sampling Procedures for Locating New Businesses," Journal of Business Venturing, 4, pp. 367-386.

Cox, B.G. (1997), "Quality of U.S. Business Establishment Frames: Discussion," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 31-33.

Davis, S.J., J. Haltiwanger, and S. Schuh (1993), "Small Business and Job Creation: Dissecting the Myth and Reassessing the Facts," Working Paper No. 4494, National Bureau of Economic Research, Inc.

Howland, M. (1988), Plant Closings and Worker Displacement: The Regional Issues, Kalamazoo, Michigan: W.E. Upjohn Institute for Employment Research.

Marker, D.A., and W.S. Edwards (1997), "Quality of the DMI File as a Business Sampling Frame," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 21-30.

McCauley, J. (1981), "A Critical Examination of the Dun and Bradstreet Data Files - A Rebuttal," Review of Public Data Use, 9, pp. 145-148.

Moriarity, C. (1995), "'Dead-end' Cases," unpublished memorandum, Hyattsville, Maryland: National Center for Health Statistics.

Moriarity, C., and A. Siller (1995), "Technical Documentation of the Creation of 'Ultimate' Firm Size," unpublished memorandum, Hyattsville, Maryland: National Center for Health Statistics.

Phillips, B.D. (1993), "Perspectives on Small Business Sampling Frames," Proceedings of the First International Conference on Establishment Surveys, American Statistical Association, pp. 177-184.

Verway, D.I. (1980), "A Critical Examination of the Dun and Bradstreet Data Files," Review of Public Data Use, 8, pp. 369-374.

Verway, D.I. (1981), "Reply to James McCauley," Review of Public Data Use, 9, p. 149.

Wallace, L., C. Bryant, D.W. Chapman, D.A. Marker, and C.L. Moriarity (1995), "Weighting and Estimation Procedures for the 1994 NEHIS," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 192-197.


TECHNIQUES FOR SAMPLING ESTABLISHMENTS IN MULTI-ESTABLISHMENT FIRMS

David W. Chapman, Federal Deposit Insurance Corporation
John Sommers, Agency for Healthcare Research and Quality

David W. Chapman, FDIC, Room 2062, 550 17th St., N.W., Washington, DC
[email protected]

ABSTRACT

The National Employer Health Insurance Survey (NEHIS) was conducted in 1994 to collect data on U.S. employer-sponsored health insurance, including the number, types, and characteristics of insurance plans offered; plan enrollments; plan costs; and benefits paid. Follow-ups to the NEHIS have been conducted annually since 1997 as the Insurance Component of the Medical Expenditure Panel Survey (MEPS-IC). Because of the emphasis on state-level estimates, the establishment, rather than the entire firm, has been the basic sampling unit for these surveys. For the NEHIS, large numbers of sample establishments were selected from some of the bigger U.S. firms, which created a significant reporting burden for them. We discuss alternate methods considered to limit the number of establishments selected from large firms for the MEPS-IC.

Key Words: State estimation, Respondent burden, Enterprise sampling

1. INTRODUCTION

In 1994 the National Employer Health Insurance Survey (NEHIS), a national survey of business establishments (locations) and governments, was conducted by Westat, Inc., under contract to the National Center for Health Statistics (NCHS). The survey was jointly sponsored by NCHS, the Agency for Healthcare Research and Quality (AHRQ), and the Health Care Financing Administration (HCFA). The survey, which served as a baseline survey, collected establishment and employee data on health insurance coverage of employees and on health plan characteristics, for nearly 40,000 establishments in both the private and public sectors. A major objective of the NEHIS was to provide state-level estimates of establishment and employee characteristics. The types of survey items collected in the NEHIS included (1) whether an establishment offers health insurance to its employees, (2) the number and types of plans offered, if any, (3) the number of employees eligible for health insurance and the number enrolled in different plans, (4) the characteristics of the establishment's health plans, (5) the costs to the employee and employer of health insurance (premiums), and (6) total benefits paid (claims).

Follow-on surveys to the NEHIS have been conducted annually, beginning in 1997, as part of the Medical Expenditure Panel Survey (MEPS) program, sponsored by AHRQ. This part of MEPS is referred to as MEPS-IC (insurance component). The MEPS-IC is conducted by the U.S. Census Bureau under contract to AHRQ. The first MEPS-IC, carried out in 1997, is referred to as the 1996 MEPS-IC since the data collected relate to insurance characteristics for 1996. The annual sample size for MEPS-IC surveys is about 27,000 establishments and governments.

This paper addresses a basic sample design issue considered during the planning of the 1996 MEPS-IC: how best to control the number of establishments that are selected from any firm (a group of one or more establishments under common ownership). Sample selection for the public sector is not included in this discussion because sampling units for the public sector are individual governments, and selection of higher-level organizational units is not an issue.

Firms were not used as sampling units in the selection of the NEHIS sample primarily because of the goal of producing state estimates. This goal dictated that establishments, rather than firms, would have to be the unit of analysis, since establishments can be uniquely identified with a specific state, whereas many firms are located in two or more states. Since an adequate establishment-level sampling frame was available for selecting the NEHIS sample, there did not appear to be any advantage to selecting firms as first-stage sampling units. (For an evaluation of the NEHIS sampling frame, see Moriarity and Chapman, 2000.) Therefore, the sample design for NEHIS was a single-stage stratified random sample of establishments.

In the selection of the NEHIS sample, there was no control of the number of establishments that were selected from any single firm. As a result, one firm had about 140 establishments selected for the sample, and several others had about 100 establishments selected. This created a large respondent burden for these firms because the health insurance data for the NEHIS often had to be collected from a human resources department at the firm level.


For the 1996 MEPS-IC, state-level estimates were still a priority. (For the 1996 MEPS-IC and subsequent MEPS-IC surveys, the sample was designed so that a minimum sample size would be achieved for 40 states.) Therefore, the establishment continues to be the basic unit of analysis. However, to reduce the respondent burden for the large firms, a method had to be used to control the number of establishments selected from any one firm.

In planning for the 1996 MEPS-IC, two methods were considered for controlling the number of establishments selected from a single firm:

(1) A two-stage sampling approach: firms at the first stage and establishments within firms at the second stage.

(2) A probability adjustment method: a single-stage sample that reduces the selection probabilities of establishments in large firms.

A discussion of sample selection issues for small and large firms is given in Section 2. The two sampling methods listed above are defined and discussed in Section 3. An analysis of the advantages and disadvantages of the two methods, and the reasoning that led to the selection of the probability adjustment method for use in the MEPS-IC, are given in Section 4. Some key results of the application of this method to the MEPS-IC are presented in Section 5. A brief summary of considerations for possible modifications of the sampling method for future MEPS-IC samples is given in Section 6. Much of the material given in Sections 2, 3, and 4 has been summarized from an earlier paper by Chapman et al. (1996).

2. SAMPLE SELECTION FOR SMALL FIRMS VERSUS LARGE FIRMS

Establishments in the private sector can be classified as belonging either to single-establishment firms (SEFs) or to multi-establishment firms (MEFs). A SEF is an establishment that is self-owned. A MEF is a collection of two or more establishments that have common ownership. The highest level of ownership is referred to as the firm, the enterprise, or the ultimate parent.

Of the approximately 6.3 million establishments in the nation in 1994, only about 1.3 million belonged to MEFs. The distribution of the size of MEFs in terms of number of establishments in 1994 is given in Table 1. The derivations in this table are based on the linkage information provided in the sampling frame for the NEHIS in 1994: Dun and Bradstreet's Dun's Market Identifiers file (which has been renamed the Prospecting Records File). This table shows that in 1994 there were only 1,150 firms with 100 or more establishments nationwide.

Table 1. Number of establishments per MEF for the NEHIS Sampling Frame in 1994

Number of Establishments   Number of MEFs   Number of Establishments in    Number of Employees in
per MEF                                     Category (rounded to           Category (rounded to
                                            nearest 1000)                  nearest 1000)

2-4                        221,931          518,000                        10,718,000
5-9                        23,152           144,000                        5,544,000
10-19                      7,276            95,000                         4,684,000
20-49                      3,382            101,000                        5,981,000
50-99                      1,112            77,000                         5,068,000
100-199                    630              87,000                         6,586,000
200 or more                520              292,000                        17,041,000
Total                      258,003          1,314,000                      55,622,000

There is no issue of controlling the number of selections of establishments within firms for sampling SEFs since, by definition, establishments and firms are identical for SEFs. (The SEFs consist of about 5.0 million of the approximately 6.3 million establishments in the nation in 1994.) The control of the number of establishments selected is not necessary for small firms either, since the number of selections in any one firm is not likely to be large for these firms. The need for control is in the selection of establishments from large firms.

Since our concern is with the number of establishments selected in large firms, we defined "large firm" in terms of the expected number of establishments selected for the sample from a firm, based on a single-stage selection method like that used in 1994 for the NEHIS. For a given firm, the expected number of "hits" (i.e., establishments selected for the sample) is the sum of the probabilities of selection of the establishments in the firm.
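
In sketch form (with invented stratum rates), the computation is a sum of stratum selection probabilities over a firm's establishments:

```python
# Hedged sketch: expected sample "hits" per firm under single-stage selection.
from collections import defaultdict

def expected_hits(establishments, rate_by_stratum):
    """establishments: (firm_id, stratum) pairs; rates are illustrative."""
    hits = defaultdict(float)
    for firm_id, stratum in establishments:
        hits[firm_id] += rate_by_stratum[stratum]
    return dict(hits)

ests = [("firmA", "MD-large"), ("firmA", "VA-large"), ("firmB", "MD-small")]
print(expected_hits(ests, {"MD-large": 0.5, "VA-large": 0.4, "MD-small": 0.02}))
# {'firmA': 0.9, 'firmB': 0.02}
```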

During the planning for the 1996 MEPS-IC, it was assumed that the criterion for defining a “large” firm would be an expected number of sample hits of about 15 or 20. However, based on cost and response rate information collected in the MEPS-IC, the criterion chosen to define a “large” firm is an expected number of hits greater than 1. This criterion typically identifies between 800 and 1,000 firms as “large.”

3. TWO METHODS FOR CONTROLLING THE NUMBER OF HITS PER FIRM

The two methods being considered for controlling the number of establishment hits for a single firm were discussed briefly in the introduction. These methods are described in more detail in Subsections 3.1 and 3.2.

3.1 The Two-Stage Approach

The most obvious way to control the number of establishments selected per firm is to select firms as the first stage in a two-stage sampling procedure. For the second stage, a sample of establishments would be selected from all those in the firm. This would certainly allow the number of hits per selected firm to be controlled. However, a number of design choices would be required in developing an efficient two-stage selection approach.

As was pointed out in the previous section, two-stage sampling would only be applied to the large firms: those with expected numbers of sample establishment hits greater than 1. For the SEFs and the establishments belonging to small firms, the sample would be selected as a single-stage sample, similar to the 1994 sample design for NEHIS. To identify the large firms, the establishment universe would be stratified as though there would only be a single-stage sample for them too. Once this was done, the expected number of hits per firm would be computed by summing the selection probabilities (i.e., stratum sampling rates) for all the establishments in each firm.

Once the large firms are defined, the establishment sample would be allocated between (1) the SEFs and the small MEFs and (2) the large MEFs. This allocation would presumably be based on both the number of establishments and the number of employees in these two primary strata, since both of these types of estimates are of interest in NEHIS. The next step would be to determine how many firms would be selected from the large MEF stratum and how many establishments would be selected from each firm at the second stage.

The optimum solution to this allocation question could be approximated from the simplified formula provided by Hansen et al. (1953, p. 286) for the optimum cluster sample size for simple two-stage cluster sampling. The sampling plan within the large MEF stratum would not fit the simple cluster sampling model used to derive the optimum cluster size by Hansen et al. (1953), because clusters (firms) would not likely be selected with equal probability. However, this calculation would give some indication of what the optimum cluster size should be. The parameters needed to apply the approximate formula in Hansen et al. (1953) could be estimated from data collected in NEHIS.
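
For reference, the usual simplified form of that optimum, under a two-part cost model with cost c_1 per sampled firm, cost c_2 per sampled establishment, and intraclass correlation rho, is given below. This is the standard textbook expression, quoted for orientation rather than as the exact formula on p. 286 of Hansen et al. (1953).

```latex
% Optimum number of elements per cluster in simple two-stage sampling,
% standard simplified form (not a verbatim quotation of the source):
\bar{n}_{\mathrm{opt}} \;=\; \sqrt{\frac{c_1}{c_2}\cdot\frac{1-\rho}{\rho}}
```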

It is likely that a probability proportional to size (PPS) type sample of firms would be selected at the first stage. It would be difficult to develop a two-stage procedure that would provide limited establishment-within-firm sample sizes and also be appropriate for meeting the state-level precision targets. One approach would be to start with a single-stage stratified design and a sample allocation to these strata that would be sufficient to meet the state precision requirements. The stratum sample sizes would define target sampling rates for the strata, which would be incorporated into the sampling and subsampling of large firms, based on the use of a “composite measure of size” for each firm, as described by Folsom et al. (1987). See Chapman et al. (1996) for a detailed description of how this method would be used to select the sample for MEPS-IC.


Folsom et al. (1987) show that, with this method of selecting first-stage sampling units in proportion to their composite measures of size, the target stratum sampling rates will be achieved and the within-firm sample sizes will be constant, except for any firms selected with certainty. For certainty selections, the number of sample hits in the firm could not be controlled unless the target stratum sampling rates for the single-stage strata were reduced. This is a major drawback of the two-stage selection procedure.

3.2 The Probability Adjustment Method

The second approach controls the number of hits for a firm by reducing the probabilities of selection of the establishments in large firms. The amount of reduction is derived to be such that the expected number of establishment sample hits for the large firms will be reduced to a more manageable level.

This is done by first developing the single-stage sample design without regard to the number of hits per firm, as was done for the 1994 design of the NEHIS. Once these strata are defined, the expected number of establishment sample hits is calculated for each firm. Adjustments to the selection probabilities are made in a specified way that will reduce the expected number of hits for large firms. The key to the procedure is to recognize that the selection of an SRS of n_h out of N_h establishments is nearly equivalent to selecting a systematic PPS sample of n_h out of N_h establishments, where the selection interval is N_h/n_h and each establishment has a measure of size (MOS) of 1. The MOS will be reduced to below 1 for establishments in large firms.

To achieve the appropriate MOS reductions, the MOS of each establishment in a large firm is set equal to r = R/A, where A is the initial expected number of establishment hits for the firm and R is the target reduced number. For example, if the initial expected number of hits for a firm were 60 and the target reduced number were 15, the MOS of establishments in that firm would be set equal to 0.25 instead of 1. For the probabilities to be reduced properly for a large firm, the non-reduced MOSs in a stratum have to be increased by an amount that will restore the sum of the MOSs in the stratum to the original sum, which is the number of establishments in the stratum, N_h.

As an example, assume that a stratum contains N_h = 100 establishments. Suppose that 20 of these 100 establishments had their MOSs reduced because they belonged to large firms, and that the reduced stratum total of the MOSs was 88 (i.e., 8 from the 20 reduced MOSs plus 80 from the non-reduced MOSs). To restore the sum of the MOSs in the stratum to 100, the 80 unaffected establishments would be assigned a MOS of 1.15 (i.e., 92/80).
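
The arithmetic of the compensation step can be checked directly; the snippet below reproduces the numbers in the example above.

```python
# Numeric check of the MOS compensation example (N_h = 100, 20 reduced MOSs).
N_h = 100
reduced_moss = [0.4] * 20              # 20 large-firm establishments, total 8
unaffected = 80                        # establishments that kept MOS = 1
inflation = (N_h - sum(reduced_moss)) / unaffected
print(inflation)                                   # 1.15 (i.e., 92/80)
print(sum(reduced_moss) + unaffected * inflation)  # 100.0, stratum total restored
```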

Care has to be taken in applying this process not to increase the expected number of sample hits for a firm to some number above the “large firm” threshold as a consequence of the compensating MOS increases. Therefore, an iterative process might be needed, where any firm that exceeds the large-firm threshold because of the compensating MOS increases in the previous iteration would have the MOSs of its establishments reduced appropriately. Presumably, such an iterative process would converge within a few iterations.

Once all of the MOSs were adjusted appropriately, a sample of the original size, n_h, would be selected from each stratum as a systematic PPS sample. In a given stratum, the selection interval would still be N_h/n_h.
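
A minimal sketch of that final selection step follows, with invented MOS values; it is a plain systematic PPS draw within one stratum, not the MEPS-IC production selection program.

```python
# Hedged sketch: systematic PPS selection within a stratum using adjusted MOSs.
import random

def systematic_pps(moss, n_h):
    """Select n_h units with probability proportional to the MOSs."""
    interval = sum(moss) / n_h        # equals N_h / n_h when MOSs sum to N_h
    start = random.uniform(0, interval)
    selected, cum, i = [], 0.0, 0
    for k in range(n_h):
        point = start + k * interval
        while cum + moss[i] < point:  # advance to the unit covering this point
            cum += moss[i]
            i += 1
        selected.append(i)
    return selected

random.seed(7)
moss = [0.25] * 8 + [1.5] * 12        # adjusted MOSs restored to N_h = 20
print(systematic_pps(moss, 4))        # e.g., [6, 11, 14, 17]
```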

4. COMPARISON OF THE TWO METHODS

The main advantage of the two-stage approach is that the number of establishments selected in each of the large firms can be controlled exactly. With the probability adjustment method, only the expected number of hits for a large firm is controlled. (However, this does not seem like it would be a major problem.)

A second advantage of the two-stage procedure is that it would be straightforward to compute firm-level estimates for large firms, since they are first-stage sampling units. With the probability adjustment method, making firm-level estimates is more difficult because of the need to compute firm selection probabilities from an establishment sample. Also, the firm selection probabilities cannot easily be controlled in the probability adjustment method. Several methods that could be used to make firm-level estimates for MEPS-IC are given by Sommers (2000).

There are several disadvantages of the two-stage approach. First, it is complicated by the need to allocate the sample between the large-firm stratum and the balance of the population, and the need to determine the optimum number of large firms to select at the first stage. The use of the composite measure of size would also add considerable complexity. Second, as was mentioned in Section 3.1, the method breaks down for any firms selected with certainty. For these, the number of sample hits could not be reduced unless the target sampling rates for their establishments were also reduced. As a result, either the sample sizes for the certainty cases would exceed the large-stratum threshold, or there would be sample size losses for the certainty cases that would have to be compensated for by increasing sample sizes for SEFs or other MEFs.

Perhaps the most significant disadvantage of the two-stage approach would be that, by design, some large firms would be excluded from the sample. This would add considerably to the variance of the estimators. With the probability adjustment method, nearly all the large firms would be retained, even though the selection probabilities of their establishments are reduced. (Although these selection probability reductions, and the corresponding selection probability increases for establishments in SEFs and small MEFs, will also increase the variances of estimators, we expect that these variance increases would be much smaller than those associated with the two-stage approach.) This is a major advantage of the probability adjustment method.

Another advantage of the probability adjustment method is that it is relatively straightforward to apply, since it is a single-stage procedure. Calculating the needed reductions in the MOSs for large firms, and the corresponding inflation factors for the MOSs in each stratum, is straightforward. A disadvantage of the probability adjustment method, which was already indicated, is that the number of sample hits for a firm cannot be precisely controlled. However, controlling the expected number of hits should be adequate.

One other potential disadvantage of the probability adjustment method would be a shortage, within a stratum, of establishments whose MOSs could be adjusted upward to compensate for the establishments whose measures had to be reduced because they were in large firms. There was some investigation of the potential for occurrence of this problem during the planning for the 1996 MEPS-IC, using the data from the NEHIS sampling frame. The investigation, which is described in more detail by Chapman et al. (1996), Section 4, suggested that this would not likely be a problem. In that investigation, it was found that, for a “large” firm criterion of 20 expected hits, the maximum adjustment in a stratum would be 1.78, with the second largest adjustment being 1.47.

Because of the comparative advantages of the probability adjustment method, it was the chosen approach for MEPS-IC. First, this method retains sample representation of all, or nearly all, of the large firms in the population. Second, the probability adjustment method is much easier to apply because it involves only one stage of sampling. It avoids many of the design decisions that would have to be made for the two-stage approach, and avoids the problem of how to treat certainty firms. Finally, the investigation, using the NEHIS sampling frame, of the potential stratum MOS weight adjustment factors that would be required for applying the probability adjustment method suggested that the probability adjustment method should work well for MEPS-IC.

5. APPLICATION OF THE PROBABILITY ADJUSTMENT METHOD TO THE MEPS-IC

The probability adjustment method has been used for all rounds of the MEPS-IC. A “tiered” method of reducing probabilities of selection is used to substantially reduce the number of multiple hits in large firms, as follows (a code sketch of the tiering appears after the list):

1. If the original expected number of hits, A, for a firm is above 50, the target expected number of hits is set at 15. To do this, the measure of size (MOS) for each establishment in the firm is set equal to 15/A.

2. If the original expected number of hits is between 20 and 50, it is reduced to 10. In this case, the MOS of each establishment in the firm is set equal to 10/A.

3. If the original expected number of hits is between 6 and 20, the expected number is halved (i.e., the MOS is set equal to 0.5 for each establishment in the firm).

4. If the original expected number of hits is over 3 but less than 6, the expected number is reduced to 3 (i.e., the MOS is set equal to 3/A).

5. If the original expected number of hits is between 1 and 3, it is reduced to 1 (i.e., the MOS is set equal to 1/A).
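
A direct transcription of the tiers, in sketch form; note that the handling of boundary values of A (exactly 20, 6, 3, or 1) is a guess here, since the list above leaves those edges ambiguous.

```python
# Hedged sketch of the tiered MOS rule: A is a firm's original expected hits,
# and the returned value is the MOS assigned to each of its establishments.

def tiered_mos(a):
    if a > 50:
        return 15.0 / a   # target 15 expected hits
    if a > 20:
        return 10.0 / a   # target 10
    if a > 6:
        return 0.5        # expected hits halved
    if a > 3:
        return 3.0 / a    # target 3
    if a > 1:
        return 1.0 / a    # target 1
    return 1.0            # small firms: MOS left at 1

for a in (80, 30, 12, 4, 2, 0.6):
    print(a, round(tiered_mos(a), 3))
# 0.188, 0.333, 0.5, 0.75, 0.5, 1.0
```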


There are several reasons why the expected numbers of hits for large MEFs have been reduced considerably more than was anticipated at the outset of the MEPS-IC. First, it was discovered during the 1996 MEPS-IC that the intraclass correlation between establishments within the same firm is very high. Second, it was recognized in the first year that response rates were very low for firms with the highest numbers of establishments and could only be improved with special handling, which is very costly and is limited by the number of personnel available. (In general, costs per establishment for large firms increase with the number of establishments in the sample.)

Although increasing the effort to reduce the number of hits per large firm was motivated more by practical considerations than by precision or sample quality issues, there has been an increase in the response rate for large firms. For example, the response rate for firms with over 1,000 employees went from 58% in 1996 to 71% in 1997.

The modifications of selection probabilities have not been large for MEPS-IC. Currently, approximately 15% of the frame units have had their probabilities of selection and their MOSs decreased. Of these, less than 0.04% had their MOSs decreased to less than 0.5, and none were decreased below 0.3. The average decrease in the MOS was to 0.75. To make up for these probability reductions, 14% of the frame units had their probabilities of selection and their MOSs increased. The maximum increase in the MOS was to 2.0, and the average increase was to 1.6. If there were no intraclass correlation between establishments within firms, the increase in variance due to the modifications of establishment selection probabilities would be less than 3%, based on a comparison of sums of squares of sampling weights (reciprocals of selection probabilities) before and after the changes in selection probabilities.
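
In sketch form, the comparison amounts to the ratio of relative sums of squared weights before and after the adjustment; the weights below are invented, scaled so the example lands near the magnitudes reported above.

```python
# Hedged sketch: approximate variance inflation from unequal reweighting,
# assuming negligible intraclass correlation (w = 1 / selection probability).

def weighting_variance_ratio(w_before, w_after):
    def relvar(w):
        return len(w) * sum(x * x for x in w) / sum(w) ** 2
    return relvar(w_after) / relvar(w_before)

w_before = [10.0] * 100                  # equal weights
w_after = [12.0] * 20 + [9.5] * 80       # a few raised, the rest trimmed
print(round(weighting_variance_ratio(w_before, w_after), 3))  # 1.01, about +1%
```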

Overall, the probability adjustment method appears to be quite effective. The average number of establishments selected in the large firms has been reduced substantially, thereby decreasing the reporting burden of these firms, which increases the survey response rate and decreases costs. The increase in variance associated with the adjustments in establishment selection probabilities is at most 3%. In fact, due to the high intraclass correlation between establishments in the same firm, this method may actually decrease the variance of most survey estimates.

6. CONSIDERATIONS FOR MODIFICATIONS OF THE SAMPLING METHOD FOR FUTURE MEPS-IC SAMPLES

As more data are collected and analyzed for MEPS-IC, allocations of the sample to strata may change based on revised stratum variance estimates. Furthermore, AHRQ intends to reassess the rules currently used to reduce the expected numbers of establishments within firms, based on improved estimates of data collection costs and intraclass correlations between establishments within the same firm, and on experience with response rates associated with increased numbers of establishments within firms.

7. REFERENCES

Chapman, D.W., C.L. Moriarity, and J. Sommers (1996), “Should Firms Be Used As Sampling Units for Selecting Establishments for the 1997 National Employer Health Insurance Survey?” Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 353-358.

Folsom, R.E., F.J. Potter, and S.K. Williams (1987), "Notes on a Composite Measure for Self-Weighting Samples in Multiple Domains," Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 792-796.

Hansen, M.H., W.N. Hurwitz, and W.G. Madow (1953), Sample Survey Methods and Theory, Vol. 1, New York: Wiley.

Moriarity, C.L., and D.W. Chapman (2000), “Evaluation of Dun and Bradstreet’s DMI File As an Establishment Survey Sampling Frame,” Proceedings of the International Conference on Establishment Surveys II, forthcoming.

Sommers, J. (2000), “Methods to Produce Establishment and Firm Level Estimates for an Economic Survey,” Proceedings of the International Conference on Establishment Surveys II, forthcoming.


METHODS TO PRODUCE ESTABLISHMENT AND FIRM LEVEL ESTIMATES FOR AN ECONOMIC SURVEY

John Sommers, Agency for Healthcare Research and Quality
2101 E. Jefferson St., Suite 500, Rockville, MD 20852

[email protected]

ABSTRACT

Business surveys use either establishments (a single physical location) or firms (an organization of one or more establishments) as the sampling unit. The choice depends upon the frame, the estimates required, and the level at which data can be reported. At times the level at which required data can be reported varies within the same survey. This necessitates making both establishment and firm level estimates from the same survey. The paper explores several methods to accomplish this in terms of a single health insurance survey. Pros and cons of each method are given and a recommendation is made.

1. BACKGROUND

Economic surveys usually have as final sampling units either establishments (i.e., specific business locations) or firms (i.e., entities which own or control one or more establishments). The choice of the sampling unit can depend upon many factors. Four of the key reasons for the choice of unit are:

• At what level data are likely to be available – Data can be kept either for the firm or the establishment or both. Some information, such as sales or numbers of employees, may be available at the establishment level. Other information, such as total assets, total profits, or total federal taxes paid, may only be reportable or logical to ask for at the firm level. Thus, the choice of sampling unit is heavily influenced by whether one can actually obtain data for the sampling unit.

• What type of sampling frame is available – Most sampling frames are lists of establishments. Due to the complexity of grouping establishments or how the frame was developed, there may be no list of firms. For instance, the Bureau of Labor Statistics’ (BLS) Business Establishment List (BEL) contains establishments. These are not tied together by firms. The Census Bureau’s frame does have firm designations. Because of confidentiality, currently Census cannot share this information with BLS (Chapman, 1995).

• What types of estimates are required – If estimates are required for geographic areas below the national level, establishments, which are location specific, are required as the sampling unit. Many large firms cross geographic boundaries. The Medical Expenditure Panel Survey - Insurance Component (MEPS-IC) is an example of a survey with this requirement: it must produce a variety of estimates of employment-related insurance results by State.

• What policies the data may support – If one is trying to obtain data to support government policy, it is important to know at what level the policy is likely to be applied. For instance, if one only had information on establishments and not firms, it would be difficult to assess provisions that apply by firm size. The effect of a policy on a small independent business of fewer than 10 persons could be different from the effect on a small establishment owned by a very large multi-unit firm.

Although care is taken to choose a sampling unit for which data can be collected, a survey can contain data elements reportable at the firm or establishment level, but not both. In such cases one requires a means, such as two sets of weights, to produce estimates for both firms and establishments in the sample. This is similar to the case of household samples, where there are household or family weights and person weights. For household surveys, dual weights are easily produced due to the sample clustering: households are selected and then persons within households. For an economic survey with a sample selected from an establishment frame in which the establishments are linked by firm, this same method of clustering, selecting firms and then establishments within firms, can be used.

Zarkin et al. (1995) discuss the issue of data collected at multiple levels in the same survey. They note that this could be the result of the level at which the data are kept, or could be required because policy is to be applied at a given level, since the level of collection can be influenced by the level of policy application. They recommend the use of cluster sampling of firms as the solution to this problem.

However, the loss of sampling efficiency for geographic estimates, and the relative importance of establishment versus firm level estimates, may make an alternative to the cluster sampling method more desirable. The MEPS-IC is an example of such a survey: it currently does not use a cluster sampling technique, yet firm level estimates are needed.

This paper discusses several methods that can be used to produce establishment and firm level estimates from a survey which has establishments as the final stage of sampling. It gives an overview of the methods of implementation and briefly discusses possible variance estimation techniques and the pros and cons of each method.

It should be noted that the methods are presented in terms of the sample for the MEPS-IC survey and do not always translate to other surveys and sample designs. However, some of the methods do translate to other designs; for instance, some can be used with designs which use permanent random numbers (Ohlsson, 1995).

2. DESIGN OF THE MEPS-IC

The MEPS-IC is an annual survey of business establishments which collects data about employer-sponsored health insurance. Most data, such as plan premiums, enrollments, etc., can be reported for the selected establishments. Since State estimates are required, establishments were selected as the sampling unit. The sample design is a stratified sequential sample of establishments, with establishments placed in strata by State, establishment size, and firm size. There are also minimum samples selected in each State to allow for State level estimation. Currently, establishments from the same firm can be in different strata.

To select the sample, probabilities of selection are determined for each establishment, and establishments are sorted by strata, with each establishment assigned to a subinterval of the interval [0, n], where n is the overall sample size. The subinterval for an establishment has a length equal to its probability of selection and starts at the sum of the probabilities of selection of the establishments which precede it in the sort order. The selection is made sequentially using a single uniform random start selected in the interval [0, 1]. Variances of establishment level estimates can be calculated using a software package such as SUDAAN (Shah et al., 1995), or a replicate method can be used by creating replicates or random groups using the sort order (Wolter, 1985). For more information concerning the MEPS-IC sample design, see Sommers (1999).
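
To make the selection mechanics concrete, the following sketch (in Python, with hypothetical names; this is not the MEPS-IC production code) implements sequential selection with a single random start, under the assumption that each establishment's probability of selection is at most 1:

    import math
    import random

    def sequential_select(probs, r=None):
        """probs: establishment selection probabilities, listed in the
        stratified sort order (they sum to the overall sample size n).
        Returns the indices of the selected establishments."""
        if r is None:
            r = random.random()          # single uniform start in [0, 1]
        selected, cum = [], 0.0
        for i, p in enumerate(probs):
            lo, hi = cum, cum + p        # establishment i's subinterval
            # selected iff some point r + k (k an integer) lies in (lo, hi]
            if math.floor(hi - r) > math.floor(lo - r):
                selected.append(i)
            cum = hi
        return selected

    # Four establishments with probabilities summing to n = 2; with start
    # r = 0.6, the selection points 0.6 and 1.6 pick establishments 1 and 3.
    print(sequential_select([0.5, 0.25, 0.75, 0.5], r=0.6))  # [1, 3]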

Since the first survey was conducted in 1997, we have found that some data items can only be collected at the firm level. Of specific interest are questions concerning retirees with employer-sponsored health insurance. This information is only available at the firm level. Even when one can identify the establishments where the retirees worked, the information by establishment cannot be used to produce an estimate, because there are numerous retirees from closed establishments which had no chance of selection into the sample.

One requires firm level weights, or other means, to produce these estimates. Following is a discussion of some of the choices of sample design and estimation methods that can be used, how a firm level weight would be calculated, the means that might be used to calculate variances for firm level estimates, and other related issues.

3. SAMPLING FIRMS

One method to solve the problem of firm level estimates easily is to use cluster sampling, with firms as the clusters, and then to select establishments from the selected firms. This design was recommended by Zarkin et al. (1995). It could be accomplished with a slight modification of the current IC methods. One could simply re-sort the establishment file so that all establishments from the same firm are clustered together on the file. The probability of selection of a particular firm would be the sum of the probabilities of selection of its establishments, or 1 if that sum were greater than 1. As part of the sort, stratification of firms could be used. To maintain some control of the establishment sample and sample size within States, the establishments within a firm could be sorted by State and size. Using the same single random start technique as is now used, a sample of firms and establishments could be selected.

This method, which is commonly used in household surveys of households and persons, allows simple calculation of both firm and establishment weights, and variances can be produced by Taylor series methods with software such as SUDAAN, or by any of the standard replicate methods.

In spite of its apparent ease of use, this method has drawbacks. The survey still has State level estimates as its main focus. With a cluster sample of firms, although the expected sample sizes within States, and within the strata within States, would equal the present allocation, the numbers actually selected will vary, thus reducing the efficiency of the State estimates.

Another drawback of this method, and of all the other methods proposed in this paper, is the need for a frame which supports the method. This means that one needs a frame which links establishments within each firm. All of the methods rely on these links for their implementation. Unlike household surveys, where such links are established during data collection, these methods require knowledge of the links as part of the original selection process.

The method also requires a more complex database, since data are kept at multiple levels.

4. THE CURRENT METHOD

In order to make estimates at the firm level, one can use the probability of selection of firms to calculate firm level weights, estimates, and sampling errors. This can be done with the current IC sampling method.

To calculate the probability of selection of a firm, one need only calculate the probability of selecting at least one establishment from the firm. To do this one must take advantage of the single random start used. If the interval for an establishment is defined as [k + a, k + b], where k is an integer and 0 < a < b < 1, then the establishment will be selected if the random start falls between a and b. To find whether a firm is to be selected, one simply must find the length covered by the projections of the intervals of the firm's establishments onto the interval [0, 1]. (Note that some intervals can project onto two disjoint intervals on the unit interval. For instance, if the interval of an establishment were [1.75, 2.25], its projection is the two intervals [0, .25] and [.75, 1].)

To calculate this length, one must write a program to compute all the projections and then, starting with the lowest starting point among the projections for the firm's establishments, repeatedly find a projection which intersects the current interval and has the largest endpoint, expanding the interval until no further intersecting projection can be found. When this is done, one has coverage of the interval from the lowest starting point to the last high endpoint. One then starts a new interval with the projections which do not intersect the interval just built. This process continues until an interval is built which has 1 as its endpoint, or until no remaining projection has a right endpoint beyond the last built interval. The probability of selection is then the sum of the lengths of the constructed intervals.
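
The interval-merging computation described above amounts to measuring the union of the projected intervals. A minimal sketch (hypothetical names, assuming each establishment's probability is at most 1; not AHRQ's code) follows:

    def project_to_unit(start, prob):
        """Project an establishment's subinterval [start, start + prob]
        onto [0, 1]; with prob <= 1 the projection is at most two pieces."""
        a = start % 1.0
        b = a + prob
        if b <= 1.0:
            return [(a, b)]
        return [(a, 1.0), (0.0, b - 1.0)]   # wraps past an integer boundary

    def firm_selection_prob(establishments):
        """establishments: (start, prob) pairs for one firm. Returns the
        total length of the union of the projections on [0, 1], i.e. the
        probability that the single random start selects the firm."""
        pieces = []
        for start, prob in establishments:
            pieces.extend(project_to_unit(start, prob))
        pieces.sort()
        total, cur_lo, cur_hi = 0.0, None, None
        for lo, hi in pieces:
            if cur_lo is None:
                cur_lo, cur_hi = lo, hi
            elif lo <= cur_hi:              # extends the interval being built
                cur_hi = max(cur_hi, hi)
            else:                           # start a new disjoint interval
                total += cur_hi - cur_lo
                cur_lo, cur_hi = lo, hi
        if cur_lo is not None:
            total += cur_hi - cur_lo
        return total

    # Projections [0, .25] and [.75, 1] from the first establishment overlap
    # the second's [0.10, 0.30] on [0.10, 0.25]; union length = 0.55.
    print(firm_selection_prob([(1.75, 0.5), (0.10, 0.20)]))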

To calculate the variance of firm level estimates under this method, one has to use a replicate method: there are no strata, and yet the design is not simple random sampling of firms. One suggestion is to place establishments into replicates, calculate firm level probabilities of selection for each replicate, and repeat the entire firm weighting process for each replicate. Establishment level variances can be calculated using the current strategy.

Aside from the difficulty of implementation, this method has other drawbacks. It does not control the probability of selection of individual firms, nor the sample allocation of firms. The calculation of the probabilities of firm selection and of the variances of firm level estimates is nontrivial. However, the method preserves the establishment level estimates and their efficiency. It is also the case that the probabilities of selection of firms are correlated with the size of the firm. Thus, for estimates of total employment and other values which correlate with size, the firm estimates are likely to be reasonably good. Because of the overall sample size, and because only National estimates of firm values are required, the sample is likely to yield adequate firm level estimates.

As with the cluster sample method, this method also would require a more complex database and a frame that links establishments by firm.

5. SELECTION OF ESTABLISHMENTS WITHIN FIRMS CLUSTERED BY STRATA

A third method of selection retains the establishment structure of the original sample and gives another method of calculating firm level probabilities of selection. In this method, rather than using a single random start, which essentially combines the selection from all establishment strata into one process, one selects establishments within each stratum with independent random starts.

As part of the selection of establishments within each stratum, the establishments within each firm are grouped. This means that the probability of selecting an establishment from a firm within a stratum is simply the sum of the probabilities of selecting each individual establishment of the firm within the stratum, or one, whichever is less. It also means that the probability of not selecting any establishment from the firm within the stratum is 1 minus this value. This probability of not selecting firm f within stratum s will be called pn(f, s).

Since selection is independent across the strata, the probability of not selecting any establishment of firm f from any of the strata, pn(f), is

    pn(f) = ∏_s pn(f, s).

The probability of selecting at least one establishment of firm f from some stratum is then 1 - pn(f), the desired probability of selection.
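
A short sketch (hypothetical names, not AHRQ code) of this stratified calculation: the firm's establishment probabilities are summed within each stratum, capped at one, and the stratum-level non-selection probabilities are multiplied together:

    from collections import defaultdict

    def firm_prob_stratified(estabs):
        """estabs: (stratum, prob) pairs for one firm's establishments.
        Returns 1 - prod_s pn(f, s)."""
        by_stratum = defaultdict(float)
        for stratum, prob in estabs:
            by_stratum[stratum] += prob      # grouped within the stratum
        pn_f = 1.0
        for total in by_stratum.values():
            pn_f *= 1.0 - min(1.0, total)    # pn(f, s) for this stratum
        return 1.0 - pn_f

    # Two strata: probabilities 0.3 + 0.2 = 0.5 and 0.4, so the firm's
    # selection probability is 1 - (1 - 0.5)(1 - 0.4), approximately 0.70.
    print(firm_prob_stratified([("s1", 0.3), ("s1", 0.2), ("s2", 0.4)]))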

As with the current method, variances would require the use of replicates. However, the process of calculating the probabilities of firm selection appears easier to implement. The method also preserves the current establishment sample. Because we can sort the establishments within strata in any manner, as long as the establishments within firms are grouped, we may also be able to consider firm characteristics in the sort. This could lead to better firm estimates than the current establishment sampling methods provide.

As with the previous two methods, this method also requires a more complex database and a frame that links establishments within firms.

6. PRORATION

Another method that can be used to solve the problem of firm level data, for numerical totals like the types of estimates required in the IC, is proration. This method is simple to implement and is likely used by many surveys. If one collects a numerical piece of information for a firm and desires to make estimates of National totals, one can simply allocate the firm's result to each of its establishments. One can show that if the allocation process is such that the sum of the allocations of the firm's value across all its establishments equals the firm's total reported value, and if all firm values are allocated in this way, then one theoretically has an unbiased estimate of the National total when one uses these prorated values with the establishment sample and the associated probabilities of selection.

Writing p(f, i) for the probability of selection of the ith establishment in the fth firm, a(f, i) for its allocation, and wt(f, i) = 1/p(f, i) for its weight, the expected value of the weighted sum of the allocations across the establishments within the fth firm, E[f], is

    E[f] = Σ_i a(f, i) wt(f, i) p(f, i) = Σ_i a(f, i) [1/p(f, i)] p(f, i) = Σ_i a(f, i),

which equals the firm total, given the condition placed upon the proration process. Since the expected value of the estimator over the frame is the sum of all the firm totals, the method produces an unbiased estimate.

Note that any proration which meets this condition, such as prorating the value equally over all the establishments in the firm, will produce an unbiased estimate. However, if one has, as is expected, probabilities of selection of establishments which correlate with total employment at the establishment, then the allocation should correlate with these chances of selection. Thus, an allocation in proportion to the frame employment at each establishment, or to its probability of selection, would be recommended.
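
A minimal sketch of the recommended allocation (hypothetical data and names): the firm value is prorated in proportion to frame employment, and the prorated values are then weighted like any other establishment-level item:

    def prorate(firm_value, frame_employments):
        """Allocate firm_value across establishments in proportion to frame
        employment; the allocations sum exactly to firm_value."""
        total = sum(frame_employments)
        return [firm_value * emp / total for emp in frame_employments]

    # A firm reports 120 retirees; its establishments have frame
    # employments 10, 30, and 60.
    allocations = prorate(120, [10, 30, 60])     # [12.0, 36.0, 72.0]

    # Suppose the second and third establishments were sampled, with
    # selection probabilities 0.30 and 0.60; weight = 1/probability.
    sampled = [(allocations[1], 1 / 0.30), (allocations[2], 1 / 0.60)]
    print(sum(a * w for a, w in sampled))  # this firm's contribution: about 240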

While proration appears to be a form of imputation, it also resembles a multiplicity estimator (Sirken, 1972). One can see this if one sums the values of the estimator over all the establishments within a single firm. In the case of the estimation of retirees for the IC, the firm weight is the weighted sum of the establishment employments for the selected establishments within the firm, divided by the firm level employment. Considering the estimator in this fashion seems to give it more statistical credibility.

As opposed to the methods suggested previously, this method need not rely on linked establishment information on the frame. One can, at the time of collection, acquire the establishment and firm employments to be used for allocation purposes. However, it is still best to have frame linkage and employments if possible; this avoids problems caused by missing data. Also, the use of the frame employments, which allows the proportions to add to one, is preferable to using reported establishment and firm employments. This is because the employments reported for firms are generally estimates; the sum of the reported establishment employments does not equal the firm employment, and thus the proportions allocated to the establishments would not add to the total reported value. This could lead to bias in one's estimates.

Since the estimate has been converted to an establishment estimate, sampling variances would use the same method used for the other estimates made with the establishment sample.

It should be noted that this method only works for total values that can be prorated across the firm's establishments. However, with some clever allocation methods across a firm's establishments, many estimates can be made into totals. For instance, if one wished to estimate the total number of firms that offer health insurance, one could allocate a value less than one to each of a firm's establishments so that the sum across the entire firm was one. Then the expected sum for each firm, summed across its establishments, is zero or one, and the expected total of the allocated variable across all establishments would equal the number of firms which offer health insurance.
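
Under the same hypothetical conventions, the firm-count trick reuses the prorate() helper from the sketch above to spread an offering firm's indicator value of 1 across its establishments:

    # For a firm that offers insurance, allocate shares of the value 1 so
    # that they sum to exactly one across the firm's establishments.
    shares = prorate(1, [10, 30, 60])    # [0.1, 0.3, 0.6]
    print(sum(shares))                   # 1.0; weighted over the sampled
                                         # establishments, these shares
                                         # estimate the number of offering firms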

In summary, the use of frame information allocates the collected information in a way that produces an unbiased estimate based upon the collected information. The frame values themselves are used only for allocation; no frame values appear in the results that are produced.

Proration of this nature, however, does not allow one to produce proper estimates for geographic areas. For instance, for the IC, one should not estimate the number of retirees in Florida by summing the prorated firm values over all the sample establishments in Florida, because the expected value argument does not produce unbiased estimates below the National level. Currently, producing such geographic estimates with firm level data requires that one either collect data from the firm for each State, or make National estimates and then distribute the total across States using related data. Using the retiree example, one could distribute the National total according to the percent of people over 65 in each State.

7. CONCLUSIONS

To produce firm and establishment level estimates from the same economic survey requires a frame that links establishments together into firms. Each method given may have a use depending upon the situation. However, it appears that the second and third methods, in which the probabilities of selecting a firm are derived for an establishment survey where the firm is not a sampling unit, are probably too much work. The advantage of these methods is that they do allow for a sample optimized on the establishment allocation.

The choice between the other two methods, cluster sampling of firms and proration of firm information to establishments, should be made by considering the types and quality of the estimates needed. For instance, if a large number of important firm level estimates must be made along with some establishment level estimates, it is probably best to use a cluster sample. At the opposite extreme, where the geographic estimates are most important and few firm level estimates are required, one should avoid cluster samples, produce the best establishment level sample with which to make the geographic estimates, and make firm level estimates through proration. The latter is the current approach of the MEPS-IC.

8. REFERENCES

Chapman, D. W. (1995), "Evaluation of Sampling Frames for the 1994 National Employer Health Insurance Survey," Technical Report, Contract 200-92-0510, Task Order 41, Washington, DC: Klemm Analysis Group.

Ohlsson, E. (1995), "Coordination of Samples Using Permanent Random Numbers," in Cox, B. G., D. A. Binder, B. N. Chinnappa, A. Christianson, M. J. Colledge, and P. S. Kott (eds.), Business Survey Methods, New York, NY: Wiley, pp. 153-170.

Sirken, M. G. (1972), "Stratified Sample Surveys with Multiplicity," Journal of the American Statistical Association, 67, pp. 68-73.

Shah, B. V., B. G. Barnwell, and G. S. Bieler (1995), SUDAAN User's Manual: Software for Analysis of Correlated Data, Release 6.40, Research Triangle Park, NC: Research Triangle Institute.

Sommers, J. P. (1999), List Sample Design of the 1996 Medical Expenditure Panel Survey Insurance Component, MEPS Methodology Report No. 6, AHCPR Pub. No. 99-0037, Rockville, MD: Agency for Health Care Policy and Research.

Wolter, K. M. (1985), Introduction to Variance Estimation, New York, NY: Springer-Verlag.

Zarkin, G. A., S. A. Garfinkel, F. J. Potter, and J. J. McNeill (1995), "Implications of the Sampling Unit for Policy Analysis," Inquiry, 32, pp. 310-319.

IMPUTATION FOR ESTABLISHMENT SURVEYS: LESSONS LEARNED FROM AN EMPLOYER HEALTH INSURANCE SURVEY

Leslie Wallace and David A. Marker, Westat
Leslie Wallace, Westat, 1650 Research Blvd., Rockville, MD 20850, [email protected]

ABSTRACT

This paper summarizes lessons learned from imputation for the National Employer Health Insurance Survey (NEHIS). The NEHIS was a national survey of over 40,000 employers collecting information on the types of health insurance offered to employees and the costs and coverage of such insurance. While the items presented stem from work done on the NEHIS, they are applicable to imputation in establishment surveys in general. Some of the issues discussed include the organization of a large-scale imputation task, the desire to impute values that maintain the multivariate relationships inherent in the data set, and problems associated with having responses at different levels (i.e., covering different portions of the establishment or the firm to which the establishment belongs). The resolution of many of these problems led to a construct-edit-impute cycle which was implemented multiple times until imputed and constructed values passed edits.

Key Words: Hot-deck imputation, Attenuation of key relationships, Large-scale complex surveys

1. INTRODUCTION

Much of the recent research into imputation methodology has focused on developing optimal procedures for a single variable or set of variables, where the patterns of missingness and underlying distributions follow standard distributions. In contrast, it is frequently necessary to impute for many variables from a single survey, with an even larger set of potential covariates and complex covariance structures among the variables to be imputed. Further, the imputations need to be completed in a relatively short time frame within a constrained budget. The analyst also is unlikely to be able to anticipate all of the important analyses for which the imputed data are to be used. This often prevents analysts from being able to produce optimal imputations for each variable. Instead, one tries to produce a set of imputed variables that minimizes the attenuation of key relationships, hopefully reduces nonresponse bias, and satisfies the time and budgetary constraints.

These complexities often require the selection of methods that use imputed values of one variable to impute others, possibly through iterative procedures. Both Bayesian and hot-deck procedures have been proposed for such situations. This paper describes how these issues were addressed when imputing for the National Employer Health Insurance Survey (NEHIS). A number of strategies could have been chosen. It was necessary to develop imputation models for dozens of variables, many of which were highly correlated with each other, requiring careful consideration of the sequence of imputation. Many of the variables had to be imputed satisfying arithmetic constraints. Item nonresponse rates varied from one percent to about 75 percent, following complex joint nonresponse patterns (Judkins, 1997). After briefly discussing the complications in the dataset, the authors explain the strategies implemented to produce imputations that achieve many of the characteristics desired of optimal procedures.

1.1. Reasons to Impute Data for Complex Surveys

Large complex datasets typically contain large numbers of variables measured on even larger numbers of respondents. Such datasets are the logical result of surveys that attempt to understand the relationships among characteristics of the population of inference and multiple outcome measures.

The missing responses in the questionnaire items can be handled in one of two ways: they can be filled in by some form of imputation, or they can be left as missing with missing data codes assigned in the data files. The use of imputation to assign values for item nonresponse in large-scale surveys has a number of advantages (see, for example, Kalton, 1983). One is that carefully implemented imputations can reduce the risk of bias in many survey estimates arising from missing data. Second, with data assigned by imputation, estimates can be calculated as if the dataset were complete. Thus, analyses are easier to conduct and the results easier to present. By including the responses from partially-complete cases, the power of statistical analyses for marginal means and totals is also increased. Third, the results obtained from different analyses will be consistent with one another, a feature that need not apply to the results obtained from an incomplete data set. (For example, estimates of total costs will only equal the sum of the cost totals by establishment size after size has been imputed for all establishments.) However, when analyzing an imputed data set, it needs to be recognized that the standard errors of the estimates are larger than those that would apply if there were no missing data.

The alternative to imputation is to leave the task of compensating for the missing data to the data analysts. Analysts can then develop methods of handling the missing data to satisfy their specific analytic models and objectives (Little and Rubin, 1987). Such an approach may be preferable in some cases, but it is often impractical. Most analysts confronted with this task are likely to rely on the options available in software packages for handling missing data. Moreover, given the wide range of analyses that are conducted with a survey data set, it is unrealistic to believe that an efficient compensation procedure can be developed for each individual analysis, while it is possible to retain a core set of relationships when producing an imputed data set to be used by others. Finally, the data producers may have access to useful restricted information that can assist in their imputations.

1.2. Constraints Limiting Development of Optimal Imputation for Each Variable

Much of the research on imputation has concentrated on the best methods for imputing for a single variable at a time. In large complex datasets, the situation is much harder, because the resulting data must satisfy multiple logical consistencies that are often intertwined. These relationships can take the form of one variable being the sum or ratio of others, or of one variable approximating such a relationship without satisfying it exactly. Also, in developing models to use in imputation, it is desirable to anticipate the main analyses that are planned for the imputed data and to try to avoid attenuating the covariances among the variables whose relationships are being investigated. By their very nature, large complex datasets are analyzed by many users over many years. It is impossible to anticipate all of the significant analyses that will be conducted by the analysts. It is only possible to work with those who designed the original study, to try to anticipate which relationships are most important to preserve accurately during the imputation process.

With the number of variables collected in the dozens or even hundreds, it is often impractical to try to develop the best model for each variable. Rather, limits may be placed on the time and resources devoted to the development of each model, with the goal of filling in as much missing data as possible, given finite resources. If the survey is not a repeated survey, there may be a lack of historical data on the relationships among the variables. In this situation, model development has to be based on the observed responses or on a priori beliefs. Not only is this more complicated, since the database used to develop the model contains the very biases that are hoped to be reduced, but the time between data collection and database production is often limited by the sponsors' eagerness to begin analyses.

1.3. Achievable Goals of Imputation

The goal of any imputation should be to provide a database containing complete cases, allowing for easy, consistent analyses. The resulting database should minimize bias from nonresponse in univariate analyses and attenuation of key multivariate relationships. Subgroup analyses should be consistent with marginal distributions. The greater the resources (statistical skill, knowledge of potential uses, budget, time, etc.) available, the better one will be able to achieve these goals. From the experience of imputing for the National Employer Health Insurance Survey (NEHIS), it is hoped that one will better understand how to address this with finite resources, while still trying to come as close as possible to these ideals.

2. NATIONAL EMPLOYER HEALTH INSURANCE SURVEY (NEHIS)

2.1. Magnitude and Complexity of Problem

The 1994 National Employer Health Insurance Survey (NEHIS) was sponsored by three United States health agencies: the Health Care Financing Administration (HCFA), the Agency for Health Care Policy and Research (AHCPR), and the National Center for Health Statistics (NCHS). NEHIS collected information on the health insurance plans offered by 40,000 private-sector establishments and governments (collectively referred to as establishments), and the 50,000 health insurance plans offered by those private and public-sector respondents. More than 100 variables were collected for each establishment and each health plan. Fifty of these variables were selected for imputation. Since it was necessary to model these variables separately for the public and private sectors, and for fully-insured and self-funded health plans, this required almost 150 separate imputation models (Yansaneh et al., 1998). Each model had to evaluate dozens of potential covariates. Ideally, covariates would be found that were highly correlated with the variable to be imputed and were present when the imputation variable was missing. Further complicating the effort was the fact that the data set was to be used by at least three government agencies and many additional unknown analysts. The anticipated uses ranged from modeling national accounts to estimating levels of health insurance coverage to understanding the types of coverage included by different health plans. Thus, it was impossible to gain agreement from sponsors on the few vital relationships that must not become attenuated through imputation. Best efforts would be needed to maintain many different relationships.

The item response rates for the 150 variables to be modeled varied from 99 percent to 25 percent, but in almost all cases the response rates were above 70 percent. The percentage of imputation variables by response-rate category is summarized as follows, given in the form response rate (percent of variables): 95-100% (37%), 90-95% (21%), 85-90% (13%), 80-85% (8%), 75-80% (9%), 70-75% (9%), and below 70% (3%). Even though the government did not plan to publish estimates for the few variables with low response rates, it planned to use them in a variety of modeling efforts, since no other source exists for this information. Imputed data based on low response rates were thought to be preferable to using the unimputed data for modeling purposes, because the imputation models were likely to significantly reduce some sources of bias.

As with many complex surveys, variables were measured at different levels. For NEHIS, data were collected at the firm (corporation) level, the establishment level, the health insurance plan level, and the plan within establishment level. It is important to retain this structure in the imputed data. If a variable is collected once for all establishments in a firm, then it should only be imputed once for the set of establishments within a firm that did not respond.

Further complicating the imputation, the data were subject to numerous logical consistency requirements. These requirements range from situations where employer and employee contributions must add up to the total premium, to much more complex arrangements involving combinations of single and family-coverage enrollments and contributions and total plan costs. This required frequent cycling between imputation, editing, and re-imputation to achieve an imputed dataset that matched the cleanliness of the reported data.

2.2. Approach Chosen

2.2.1. Ordering and Grouping Variables

Many of these consistency requirements could only be evaluated when the last of the variables involved was imputed. Thus, it was very important to determine an order of imputation that would maximize the available covariates at each step, and simultaneously allow for checking logical edits as soon as they could possibly be checked. To accomplish all of these tasks, the variables were broken up into chunks of related variables (e.g., enrollments in health plans). The chunks were put together into groups so that all chunks in a group could be imputed simultaneously, since a variable in one chunk would not be needed to check the imputation of a variable in another chunk in the same group. The groups were then ordered in a logical sequence to provide the maximum available covariates at each step. The basic groups were to first impute enrollments, then component costs, then finally total costs per enrollee. Figure 1 provides a simplified overview of this approach. The following factors should be taken into consideration in deciding on the order of imputation for a large complex data set when one wants to avoid large numbers of cycles:

1. If one variable is used in the construction of a second variable, then the first variable should be imputed before the second.

2. The imputations should follow the logical sequence (if any) suggested by the patterns of missingness of the imputation variables, that is, the joint frequencies that identify sets of imputation variables that are missing together. For instance, if the first variable happens to be a strong covariate of the second, and is present in most cases where the second variable is missing, then the first variable should be imputed first.

3. Within groups, variables using deterministic imputation should be imputed before variables requiring stochastic imputation.

[Figure 1. Overview of NEHIS Imputation: a flow chart of the sequential impute-edit process. Group A (enrollments; chunks 1 and 2) is imputed and run through edit checks, with failures re-imputed; the cycle then repeats for Group B (component costs; chunks 3, 4, and 5) and for Group C (total costs per enrollee; chunk 6), after which the process is done.]

4. Within groups, decisions about the order of imputation need only be made for imputation variables that are very highly correlated with one another. The order of imputation is not crucial for variables that are not highly correlated.

5. If the best covariates are the same for a set of imputation variables within a group, and those variables are frequently all missing for the same cases, then those imputation variables should be imputed as a block.

The chunks into which the NEHIS imputation variables were partitioned covered the following data areas: fully-insured premiums; premium equivalents (self-funded plans); plan enrollments; plan costs; deductibles and co-payments; additional plan-level variables; and additional establishment-level variables.

The complex nature of the NEHIS data set had a major impact on the order in which the variables were imputed. For instance, health insurance plan costs are a function of plan enrollments and premiums. Therefore, enrollments and premiums were imputed before plan costs. Also, within the chunk consisting of premiums, the employer and employee contributions to the premiums for single coverage were found to be the most highly correlated covariates for the corresponding contributions and premiums for family coverage. Therefore, the single-coverage contributions and premiums were imputed first, and then used in the imputation of their family-coverage counterparts. Several variables within some chunks consisting of premiums and enrollments were imputed simultaneously as a block. Examples of such blocks of variables are the number of retirees under 65 and the number of retirees over 65 for all plans; the employer contributions to premiums and the premiums for single coverage for fully-insured plans in the public sector; and the number of enrollees and the number of enrollees with family coverage for private sector plans.

In planning any imputation process it is important to decide how consistent the imputed data should be. In NEHIS, the goal was to make the data set after imputation at least as good as the one before imputation in terms of allowable ranges and multivariate relationships between variables. For example, care was taken not to impute data values that were out of range, and to be sure that algebraic relationships among variables were preserved (such as one variable being the sum of three others). These consistency requirements frequently necessitated both an edit-impute cycle and an impute-construct cycle. The data set before imputation was edited, then missing values were imputed, and then the imputed data were edited again. Any values that failed edits and were set to missing were then re-imputed. Similarly, during the course of imputation, an impute-construct cycle was implemented for sets of variables with algebraic relationships that needed to be maintained; once a value was imputed, others could be logically constructed using that value. These constructed variables were then subject to their own edits.

2.2.2. Imputation Methods

To understand the extent and patterns of missingness in the NEHIS data set, frequency distributions of all imputation variables and covariates were constructed and examined. For each set of variables, an appropriate imputation strategy was selected. Alternative approaches were evaluated in terms of the quality of the imputations and the associated costs. Some approaches which may be sub-optimal were chosen because they kept the number of passes through the data to a minimum, thereby reducing both data processing costs and time while producing results that are essentially comparable to those produced by optimal but very expensive and time-consuming approaches. Other approaches were not considered for implementation in NEHIS for a variety of reasons; for instance, regression imputation was not used primarily because of the pervasive problem of missing data in the most highly correlated covariates. The cold-deck method was not used because of the lack of comparable past data on the same population. Logistic regression imputation was not used because of the relatively small number, and the relatively low nonresponse rates, of binary imputation variables in the NEHIS data set. Bayesian methods (Gibbs sampling and other Markov chain Monte Carlo methods) were not used for three reasons. First, it is difficult to identify prior distributions for imputation in complex surveys with multiple sponsors. Second, Bayesian methods assume smooth standard distributions, which many variables do not have (e.g., deductibles, co-payments). Third, these methods require significant computing capability. For a more complete discussion of Bayesian methods versus alternative methods, see Marker et al. (in press).

Variables missing only one or two percent of the time were deterministically imputed. Mean or modal imputation within cells was used, depending on whether the variable was continuous or categorical. Given the low rate of missing data, the resulting deflation of the variance was thought to be trivial compared to the savings in time and effort relative to stochastic imputation methods, which require the development of models.
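
A tiny sketch (hypothetical) of this deterministic rule for the values within a single imputation cell:

    from statistics import mean, mode

    def deterministic_impute(values, categorical=False):
        """values: one cell's entries, with None marking missing items.
        Continuous cells get the mean; categorical cells get the mode."""
        observed = [v for v in values if v is not None]
        fill = mode(observed) if categorical else mean(observed)
        return [fill if v is None else v for v in values]

    print(deterministic_impute([10, None, 14]))                 # [10, 12, 14]
    print(deterministic_impute(["HMO", None, "HMO", "PPO"], categorical=True))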

Variables with higher rates of missing data were generally imputed using a hot deck. Examination of bivariate correlations (categorical variables were converted to dummy variables) was used to identify the best covariates, which were then compared with patterns of missingness across imputation variables and potential covariates. Highly correlated covariates that were generally present whenever the imputation variable was missing were chosen to define the hot-deck cells. If the covariates were continuous, they were split into 3 or 5 categories based on their empirical distribution. The resulting imputed variables were then subjected to the same edits used for reported data and, if they failed edits, re-imputed.
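
A minimal sketch (hypothetical names and data; not Westat's production software) of hot-deck imputation within cells, where a random donor from the recipient's cell supplies the missing value:

    import random

    def hot_deck(records, impute_var, cell_vars, rng=random.Random(0)):
        """records: list of dicts; None in impute_var marks missingness.
        cell_vars: categorized covariates defining the hot-deck cells."""
        cells = {}
        for r in records:
            if r[impute_var] is not None:
                key = tuple(r[v] for v in cell_vars)
                cells.setdefault(key, []).append(r[impute_var])
        for r in records:
            if r[impute_var] is None:
                donors = cells.get(tuple(r[v] for v in cell_vars))
                if donors:           # cells without donors await a later pass
                    r[impute_var] = rng.choice(donors)
        return records

    data = [
        {"size": "small", "sector": "private", "premium": 1800},
        {"size": "small", "sector": "private", "premium": None},  # recipient
        {"size": "small", "sector": "private", "premium": 2100},
    ]
    hot_deck(data, "premium", ["size", "sector"])
    print(data[1]["premium"])   # a donated value: 1800 or 2100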

A variant of the hot deck, which we will refer to as the Hot-Deck-Variant (HDV) method, was implemented in situations where there was one highly significant continuous covariate for a given imputation variable, and this covariate turned out to be a count variable (for example, the number of enrollees in a health insurance plan or the number of employees at an establishment). This method is a form of nearest-neighbor imputation within cells. In this procedure, the covariate itself (rather than categories of it) was the last variable used (perhaps along with one or two moderately correlated categorical variables) to define the boundaries that could be crossed, if necessary, to find a donor. This procedure has the advantage of easy implementation, and its results are comparable to those obtained from regression imputation (Aigner et al., 1975).
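
A companion sketch of the HDV idea under the same hypothetical conventions: the donor is the respondent whose continuous count covariate is nearest to the recipient's, i.e., a nearest-neighbor hot deck:

    def hdv(records, impute_var, count_var):
        """Nearest-neighbor hot deck on a single continuous count covariate."""
        donors = [r for r in records if r[impute_var] is not None]
        for r in records:
            if r[impute_var] is None and donors:
                nearest = min(donors,
                              key=lambda d: abs(d[count_var] - r[count_var]))
                r[impute_var] = nearest[impute_var]
        return records

    staff = [
        {"employees": 12, "premium": 1900},
        {"employees": 800, "premium": 3100},
        {"employees": 20, "premium": None},   # nearest donor has 12 employees
    ]
    hdv(staff, "premium", "employees")
    print(staff[2]["premium"])   # 1900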

Once all the imputation was completed in a group of chunks, imputation moved on to the next group. At this point, additional edit constraints that the reported data had passed were applied to the imputed data. If a conflict arose between two imputed values from different variable groups, the most recently imputed data were revised, to minimize the need for cycling across groups of variables. There were situations, however, where earlier imputations were found at a later stage to fail edits and therefore required re-imputation. This required a second pass through the imputation process to impute for these complex situations. Careful review of all edits for the smaller number of cases needing imputation during this second pass ensured that no third pass through the data would be necessary.

As an example, imputed enrollments in a health plan may have met all the required edits on enrollments (e.g., enrollment is not greater than total employment at the establishment), but the imputed enrollment causes costs per enrollee to go outside allowable ranges. This is not identified until costs have been imputed in a later group. If the cost data were reported, then it becomes necessary to go back and re-impute enrollments.

The hot deck and HDV procedures were the most frequently used imputation methods in NEHIS, primarily because the NEHIS data set contains a large number of imputation variables with weakly correlated covariates. Another reason is computational convenience: the procedures are easily implemented by the standard imputation software available to NEHIS staff. Of the approximately 150 imputation models implemented in the NEHIS imputation task, 60 percent used hot deck, 30 percent used HDV, and 10 percent used deterministic imputation.

2.3. Strengths and Weaknesses of Approach

The combination of hot deck and HDV imputations described above allows for complete case analyses involving these 50 variables. This significantly increases the utility of the resulting data set for multivariate analyses by providing consistency across tabular analyses. The relatively simple models used for the imputations have hopefully reduced much of the potential bias from item nonresponse. Correlations before and after imputation for four key pairs of variables were reviewed for each imputation model (resulting in 12 correlation pairs). The results of the review generally indicated that attenuation of relationships had been minimized. For the variables being reviewed, the correlations before imputation ranged from .43 to .98. In all but two cases, the correlations before and after imputation differed by .04 or less. The two largest differences were .07 (the correlation went from .97 to .90) and .08 (the correlation went from .45 to .37). Limited resources did not allow a more thorough and systematic assessment of the effect of imputation on the multivariate data structure.

The final database provides clear documentation of the source of each imputed value (from a donor versus resulting from an edit constraint). The full range of data users is unknown and thus could not be consulted on the models that were used for the hot-deck imputations. The models, however, are quite straightforward and easily described in documentation (which covariates were used for which variables), so users can decide on their appropriateness for their analyses. This approach, however, just like Bayesian methods, was very labor intensive and time consuming. It took multiple statisticians and statistical programmers many months to complete this work. This large effort was a result of the number of imputation variables combined with the very complex logical and edit constraints imposed on the resulting data.

3. CONCLUDING REMARKS

Large complex datasets typically contain hundreds of variables on thousands of respondents. These datasets usually contain item nonresponse for nearly all variables. The missing responses in the questionnaire items can be handled in one of two ways: they can be filled in by some form of imputation, or they can be left as missing with missing data codes assigned in the data files. Carefully implemented imputations such as those for NEHIS can reduce the risk of bias in survey estimates arising from missing data. Also, analyses can be conducted from the imputed data set making use of respondents who had partially reported data, increasing the power of analyses.

4. REFERENCES

Aigner, D. J., Goldberger, A. S., and Kalton, G. (1975), "On the Explanatory Power of Dummy Variable Regressions," International Economic Review, 16, 2, pp. 503-510.

Judkins, D. R. (1997), "Imputing for Swiss Cheese Patterns of Missing Data," Proceedings of Statistics Canada Symposium 97, New Directions in Surveys and Censuses, pp. 143-148.

Kalton, G. (1983), Compensating for Missing Survey Data, Research Report Series, Ann Arbor, Michigan: Institute for Social Research.

Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York: John Wiley & Sons.

Marker, D. A., Judkins, D. R., and Winglee, M. (in press), "Large-Scale Imputation for Complex Surveys," Survey Nonresponse.

Yansaneh, I. S., Wallace, L., and Marker, D. (1998), "Imputation Methods for Large Complex Datasets: An Application to the NEHIS," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 314-319.

LESSONS LEARNED FROM CONDUCTING A NATIONAL HEALTH INSURANCE SURVEY OF ESTABLISHMENTS: DISCUSSION

Brenda G. Cox, Mathematica Policy Research, Inc.
Brenda G. Cox, Mathematica Policy Research, Inc., 600 Maryland Ave. SW, Suite 550, Washington, DC

[email protected]

ABSTRACT

The Medical Expenditure Panel Survey and its predecessor, the National Employer Health Insurance Survey, have considerable experience in surveying businesses about the health insurance benefits they offer to their employees. The methodological issues encountered by these surveys are common to all surveys of health insurance benefits. These issues relate to frame coverage and quality, and to the level of the business used to define the data collection and analysis unit. This paper discusses four methodological investigations associated with these two issues and presents a suggested change in sampling unit for future surveys.

Key Words: Business surveys, Dun’s Market Identifiers file, employer surveys, health insurance

Seven years ago, the first International Conference on Establishment Surveys (ICES-I) occurred. That conference solicited papers that documented current methods and procedures for business surveys. The organizers also had a hidden agenda: to promote an international exchange of ideas that would facilitate progress in improving business survey methods. Although ICES-I focused primarily on government-sponsored economic surveys, the underlying principles are also relevant for a broad array of organizational entities. I have implemented these procedures in designing a census of substance abuse facilities, a survey of organizations that use agricultural market information, a survey of emergency food pantries and soup kitchens, and a survey of employers providing health insurance to their employees. The latter survey encountered many of the same problems as described for the National Employer Health Insurance Survey (NEHIS) and the Medical Expenditure Panel Survey insurance component (MEPS-IC). In discussing these papers, I focus on NEHIS/MEPS-IC experience that illustrates the unique features of business surveys.

Moriarity and Chapman's (2000) paper on the "Evaluation of Dun and Bradstreet's DMI File as an Establishment Survey Sampling Frame" provides a very useful summary of the attributes of the Dun's Market Identifiers (DMI) file. The DMI file is the best business frame currently available to most Federal agencies and to private organizations. Due to confidentiality laws, Federal agencies such as the Agency for Healthcare Research and Quality do not have access to comprehensive business registers such as BLS' Business Establishment List (BEL) or the Census Bureau's Standard Statistical Establishments List (SSEL). This reflects "that uniquely American problem": legislated barriers to data sharing across Federal agencies.

The DMI file is an administrative database, with all the negative attributes that that implies. Undercoverage occurs particularly for small businesses and newly formed businesses, for which DMI's data sources (credit reports and telephone listings) tend to be incomplete and out of date. Moriarity and Chapman address what I have found to be an important source of error in the DMI database: the presence of a substantial number of "duds." Duds are records that are not associated with operating enterprises or establishments. Presumably, these records reflect businesses that are no longer in operation. Moriarity and Chapman refer to such cases as "overcoverage," which is somewhat misleading; they reflect invalid records rather than duplicated listings for the same organization.

Moriarity and Chapman's findings with respect to DMI data quality agree with my experience. However, I do not concur with their recommendation to delete records without telephone numbers. My approach is to have all such cases, when sampled, forwarded to our Locating Department for detailed searches using publicly available databases, the Internet, and local business organizations such as the Chamber of Commerce and the Better Business Bureau. Of course, Dun and Bradstreet should take the same steps.

The Chapman and Sommers (2000) paper on "Techniques for Sampling Establishments in Multi-Establishment Firms" furnishes an example of another difficulty business surveys face: relating the unit of sampling to the analytic goals of the survey. The process of deciding the sampling unit must balance the needs of the analyst with the practical aspects of data collection. In particular, the choice of the sample unit relates directly to the level of the organization where businesses tend to compile and store the desired survey information. Thus, production surveys tend to focus on the establishment, while financial surveys focus on the enterprise.

It is not clear from this paper what the ideal unit of analysis would be for the NEHIS/MEPS-IC, if data could be collected from such a unit. The MEPS-IC and its predecessor the NEHIS have used the establishment as the sampling unit as a way to facilitate state-level estimation. The obvious question is: how does the establishment relate to the way businesses maintain information? Logically, one would expect that the information resides where the purchase is made. For health insurance, I expect the enterprise is the unit that typically purchases and maintains employee insurance data. I also expect that enterprises maintain data associated with their purchasing unit, which will be the individual health insurance markets. Most markets will tend to be nested within state boundaries, but some markets may be metropolitan areas that cross state lines. For its Vermont Parity Evaluation Survey, Mathematica is using the "Vermont portion of the business enterprise" as the sampling and analysis unit. Zarkin et al. (1995) indicate that the enterprise is the most appropriate sampling and analysis unit for health insurance employer surveys. The MEPS-IC might want to consider the feasibility of alternative sampling units, such as the cross of the enterprise with states.

Another feature that I have found to be uniquely different for business surveys is the extent and complexity of the data editing and imputation needed to prepare the data set for analysis. The types of data typically collected are more difficult to provide, which in turn leads to nontrivial levels of nonresponse for many items. Logical interrelationships between data items are very common and this, in combination with the complexity of the questions being asked, leads to a need for extensive editing prior to analysis.

Wallace and Marker (2000) address the problem of imputation of highly interrelated business survey data in their paper, "Imputation for Establishment Surveys: Lessons Learned from an Employer Health Insurance Survey." Generally, I have found that most analysts do not have the technical knowledge or the time and patience to resolve missing data problems prior to data analysis. As a consequence, many surveys produce public use data sets with logical inconsistencies corrected and missing data imputed. The approach that Wallace and Marker implemented required multiple passes through the data set, with editing, then imputation, editing of the imputation-revised data, re-imputation of discrepant imputed values, re-editing of the re-imputed data, and so forth. Generally, I have been able to devise imputation approaches that produced consistent data that met edit checks the first time. Doing so, though, required designing imputation runs that reflect the desired consistency between data items. More passes through the data may also be needed to complete the imputation while preserving consistency.

Another difference in approaches is minor. When possible, Wallace and Marker categorize continuous variables prior to using them in sorting for the hot-deck technique. The only time they use the continuous response is the situation where only one variable is useful for sorting. For me, this latter event (only one data item correlated with the items being imputed) is rare. I often use one continuous variable as the last variable in sorting. If it is highly correlated with the data item, I sometimes sort first by a categorized version of it, along with other categorized items, and finally sort by the continuous variable for added control.

In “Methods to Produce Establishment and Firm Level Estimates for an Economic Survey,” Sommers (2000) returns to the issue of the level of the company for which information is to be collected and analyzed. While the establishment is an acceptable unit for data collection on current employees, it breaks down for retirees, who are no longer linked to an establishment. Sommers describes how “firm” level estimates can be developed for the MEPS-IC when the establishment has been used as the sampling unit. I found his presentation of the current method unclear; I could not quite understand the rationale for the firm weighting. The other two methods are much more straightforward. I should point out that the proration method is a variation of the classic multiplicity adjustment. Sommers adjusts the data as opposed to adjusting the weight. In this situation, a firm is considered to be selected when one or more of its establishments is selected.
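
To show the multiplicity logic in miniature, here is a sketch of a prorated firm-level total. It assumes, as one classic variant of the multiplicity adjustment, that the firm-level value is divided evenly across the firm’s establishments; it is not necessarily Sommers’ exact formulation, and all field names are illustrative.

    # Minimal sketch of a multiplicity-style (prorated) firm-level estimator.
    def firm_total_prorated(sample):
        """sample: list of dicts with keys 'weight' (establishment sampling
        weight), 'firm_value' (firm-level quantity, e.g., retirees covered),
        and 'n_estabs' (the firm's establishment count, from the frame)."""
        return sum(rec["weight"] * rec["firm_value"] / rec["n_estabs"]
                   for rec in sample)

Because each of a firm’s n_estabs establishments could bring the firm into the sample, dividing the firm value by n_estabs keeps multi-establishment firms from being overcounted.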

The problem Sommers is describing here for estimation relates to the problem Chapman and Sommers describe for sampling. Neither the establishment nor the enterprise would appear to be the best data collection unit (or analysis unit). For another study, I proposed a unit that I called the health insurance unit. The health insurance unit is the collection of establishments subject to the same set of health insurance plans. It is logical for a business to provide insurance to its component establishments using a common set of plans whenever possible. Thus, a small business owner in Southern West Virginia might cover all the employees in its 50 convenience stores using one plan. When a business’ establishments fall in different health insurance markets, it must set up multiple health insurance units. Mathematica employees in DC, for instance, are offered different health insurance plans from those offered to Mathematica employees in Princeton, NJ. Unfortunately, health insurance units do cross state lines. Mathematica’s DC health insurance unit covers our downtown office as well as our downtown telephone center in Columbia, MD.
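
A minimal sketch of how such units could be formed from an establishment list follows, assuming each establishment record carries the set of plans offered there (the field names are illustrative):

    # Minimal sketch: group establishments into health insurance units,
    # i.e., sets of establishments offered the same set of plans.
    from collections import defaultdict

    def health_insurance_units(establishments):
        """establishments: list of dicts with keys 'id' and 'plans'
        (an iterable of plan identifiers offered at that establishment)."""
        units = defaultdict(list)
        for est in establishments:
            units[frozenset(est["plans"])].append(est["id"])
        return list(units.values())  # each inner list is one unit

Grouping on the full plan set, rather than on geography alone, is what allows a unit to cross state lines, as in the DC/Columbia example above.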

The papers presented in this session have been interesting and informative. My comments have addressed an issue that I think merits further methodological investigation for health insurance surveys of employers. That issue is: for what level of the business organization is it most convenient for businesses to report, and how does this relate to the analytic goals of the survey?

REFERENCES

Chapman, D.W. (2000), “Techniques for Sampling Establishments in Multi-Establishment Firms,” paper presented at the second International Conference on Establishment Surveys, Buffalo, New York.

Moriarity, C.L., and D.W. Chapman (2000), “Evaluation of Dun and Bradstreet’s DMI File as an Establishment Survey Sampling Frame,” paper presented at the second International Conference on Establishment Surveys, Buffalo, New York.

Sommers, J. (2000), “Methods to Produce Establishment and Firm Level Estimates for an Economic Survey,” paper presented at the second International Conference on Establishment Surveys, Buffalo, New York.

Wallace, L., and D. Marker (2000), “Imputation for Establishment Surveys: Lessons Learned from an Employer Health Insurance Survey,” paper presented at the second International Conference on Establishment Surveys, Buffalo, New York.

Zarkin, G.A., S.A. Garfinkel, and J.J. McNeill (1995), “Employment-Based Health Insurance: Implications of the Sampling