
Cutoff Sampling and Inference

James R. Knaub, Jr.¹

¹ Energy Information Administration, EI-53, DOE, 1000 Independence Ave., Washington DC 20585

Abstract: This article surveys the use of cutoff sampling and inference by various organizations and as described in the literature. This is a technique often used for establishment surveys. Online searches were made that included the key words "cutoff sampling" and "cut-off sampling"; both spellings are in use. Various approaches are described, but the focus is on the model-based approach using the classical ratio estimator (CRE). Conclusions are drawn.

Key words: establishment surveys, total survey error, model-based classical ratio estimator, CRE, multiple regression, RSE, RSESP, certainty stratum, link relative estimator

(I) Introduction:

This introduction is followed by three more sections. Section II shows the results of a literature search. Section III lists the advantages of one of these approaches: cutoff sampling with ratio estimation as used for electric power data at the Energy Information Administration (EIA) in the Electric Power Division (EPD). Section IV discusses concepts relating cutoff sampling to total survey error. This approach could be used analytically to complement many methods, but is considered here in the context of model-based cutoff sampling for establishment surveys. A short introduction to cutoff sampling is found in an encyclopedia published by Sage: Knaub (2007c).

Pragmatism is the basis for choosing a cutoff sample for an establishment survey. That is generally true for any sample design, but other designs are generally considered more elegant because they purport to give all members of the universe fair representation, while in a cutoff sample some members of the universe have no chance of being selected (unless a second sample is taken from among those excluded by the cutoff, which eliminates this objection). Whenever any subset of a population has no chance of selection and estimation is based on a model of what is available, there is always a chance that substantial changes could take place in the subset not being sampled (resulting in "model failure") and go undetected. However, this thinking may not include a practical consideration of data quality (total survey error). One should consider all sources of error and their likely magnitudes. Historically, only sampling error is considered when calculating estimates of the standard error of a total for a data element collected on a sample survey, and an (estimated) total from a census survey thereby has a standard error of zero. Recently there has been more consideration of the impact on accuracy due to imputation (e.g., Steel and Fay (1995)), but that impact is still often ignored, perhaps even when it may be clearly substantial. Even when the impact of imputation is considered, nonsampling errors, such as measurement error, might still be ignored. Cutoff sampling may be demonstrably more practical in terms of accuracy when total survey error is taken into account. (Knaub (2004) shows a way that total survey error may be considered in this context.)

To date, the most frequently occurring reason for choosing a cutoff sample may be pragmatic, but not for reasons of increased accuracy; rather, the aim is to reduce the cost of producing the survey without too much of a decrease in accuracy. When most of the data (by volume, for a data element such as electricity generation) can be collected from relatively few large establishments in an establishment survey (i.e., the data distribution is highly skewed), it may be most practical, considering survey costs, to ignore the many very small establishments, especially when the data element of interest is a ratio of other data elements, such as cost per unit volume. If cost is collected for a truncated part of the universe, and the volume of whatever is sold or purchased is for that same part of the universe, then each of those data elements is biased downward, but their ratio, here cost per (unit) volume, may be reasonably accurate, and the data collection may be affordable. Still, one must have some information regarding how much bias this may cause.

Another consideration mentioned above would be a type of cutoff sample where the smaller establishments not in that sample are subject to a secondary data collection, such as the stratified data collection found in Knaub (1989b) and elsewhere. However, that is not the purpose of cutoff sampling that is the primary focus of this article. Here, the primary focus is on cutoff sampling for which data collection from the smallest establishments is not practical on a frequent basis, due to data quality considerations and perhaps also due to costs, but a classical ratio estimator or multiple regression equivalent is used to estimate for the entire universe. For recurring surveys, where sample data are collected, say, monthly, and census data are collected, say, annually, past census data may make very good regressor data for the current sample. It may be difficult to obtain accurate data from the smallest members of the population annually, and impossible monthly. (One possible example may be a small utility that only reads meters every three months. This has occurred with data collected by the Energy Information Administration (EIA/EPD).) Many small operations are too burdened by monthly data collection to provide accurate data. Also, other regressor data may be available. In our current example, electric generator nameplate capacity may be of use. (For wind-powered generation, capacity is the best regressor available.) This could be similar to the use of administrative records often discussed in the statistical literature, though here not as substitutes but as regressor data. Also, number of employees may be used as a regressor variable.
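To make the mechanics concrete, here is a minimal sketch of a classical ratio estimator applied to a cutoff sample, using prior-census values as the regressor. All numbers, the threshold, and the variable names are invented for illustration only; this is not code from any EIA system.

```python
# Sketch (hypothetical data): classical ratio estimator (CRE) for a cutoff sample,
# using prior annual census values x as the regressor for the current variable y.

frame_x = [950.0, 620.0, 410.0, 88.0, 35.0, 12.0, 7.0, 3.0]   # census x for every unit in the frame
cutoff = 100.0                                                 # size threshold applied to x

# Units at or above the cutoff report currently; "1.02 * x" stands in for their reported y.
sample = [(x, 1.02 * x) for x in frame_x if x >= cutoff]
excluded_x = [x for x in frame_x if x < cutoff]                # below the cutoff: y not collected

beta_hat = sum(y for _, y in sample) / sum(x for x, _ in sample)    # CRE slope
total_hat = sum(y for _, y in sample) + beta_hat * sum(excluded_x)  # predict below the cutoff

coverage = sum(x for x, _ in sample) / sum(frame_x)            # regressor coverage of the sample
print(f"beta_hat={beta_hat:.3f}, estimated total={total_hat:.1f}, coverage={coverage:.1%}")
```

With a single regressor, the same fitted slope can also serve to impute for nonrespondents above the cutoff, which is part of the administrative simplicity noted later in this article.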
(Note that a regressor variable in model-based sampling would be referred to as an auxiliary variable in model-assisted design-based sampling, such as may be found in Särndal, Swensson and Wretman (1992).) Multiple regression may be of use (Knaub (2003) and Knaub (1996)). The emphasis should be on reducing total survey error (nonsampling error and sampling error combined). See Knaub (2004) for a survey performance measurement scheme. In general, any kind of performance measurement for official statistics, an area where cutoff sampling may be particularly useful, seems largely lacking. This is the focus of Knaub (2004).

Cutoff sampling for multiple attributes cannot generally result in simply the selection of only the respondents that are largest for each attribute, but computer algorithms are sometimes written to select the fewest respondents that will provide a given minimal 'coverage' for each subject variable. Results may vary by test data set and by the definition of what is being measured and how it is measured in each case. One must determine what is being "covered," and thus decide what determines that a response is "large," or that a potential respondent is expected to be "large." The ultimate goal is often given by relative standard error levels set for given subject variables. (Bias should be considered, as in Knaub (2001). A more general form of relative standard error is considered in Knaub (2003) and Knaub (2004). Standard errors often give us enough information.) Coverage is important when considering the possible impacts of model failure for the data with no possibility of selection. Establishment surveys generally have highly skewed data distributions, so the very largest observations that are most necessary to obtain are fairly obvious. Royall (1970) showed that the minimal variance may be obtained for a given sample size for a single subject variable if only the largest establishments, as measured by a regressor, are sampled. The desired accuracy may vary greatly by subject variable. Sample reassessments should be made on occasion.

If one of the regressors, composed of data from a previously collected census, is for the same data element as the corresponding subject variable in the sample, then "coverage" could easily be defined as the percent of the census total for that variable that is covered by the same respondents that are in the current sample. Once predictions are made by linear regression - for example, using the classical ratio estimator, CRE - the observed values in the sample should constitute about the same percent of the estimated total as the coverage that had been measured. In fact, if (1) there are no 'add-ons' due to frame changes or other problems (see Knaub (2002)), (2) only that one regressor is used, and (3) prediction is by the CRE, then those two percents should match exactly. The situation is somewhat changed with multiple regression, but factors such as fuel switching (Knaub (2003)), for example, can make multiple regression for electric generation data desirable. For cases where none of the possible regressors or sets of regressors can include a previous census of the same data element as one of the variables of interest in the sample, one may still be able to judge whether the predicted portion is of reasonable proportions, so one will often correctly be able to determine when to look for software errors or other problems.

Coverage, as stated above, is an important concept for cutoff sampling. If coverage is high, say 80 percent or more, then "model failure" is less of a concern than it might be, all factors considered. (Again, model failure refers to the failure of a model to apply well to the cutoff data. Of course this is all a matter of degree.)
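To see why those two percents coincide in that special case, write s for the sample, r for the units below the cutoff, and $\hat{\beta} = \sum_{s} y_i / \sum_{s} x_i$ for the CRE slope (the notation here is ours). Then the observed share of the estimated total equals the regressor coverage:

$$
\frac{\sum_{s} y_i}{\hat{T}}
= \frac{\sum_{s} y_i}{\sum_{s} y_i + \hat{\beta}\sum_{r} x_i}
= \frac{\hat{\beta}\sum_{s} x_i}{\hat{\beta}\sum_{s} x_i + \hat{\beta}\sum_{r} x_i}
= \frac{\sum_{s} x_i}{\sum_{N} x_i}.
$$

When the regressor is a previous census of the same data element, the right-hand side is exactly the coverage defined above, which is why the two percentages match when conditions (1)-(3) hold.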


(Note that multiple regression may be used to impute a number that will occasionally be negative, with a relatively small absolute value compared to its standard error (STDI in SAS PROC REG). These numbers are basically zeroes. It may be pragmatic to replace them with zeroes, and perhaps the small positive numbers should be replaced with zeroes also.) A key concern should be to keep data processing and statistical methodology as simple as possible. This may reduce errors and increase the ease of interpreting results.

(II) Survey of Uses and Literature Regarding Cutoff Sampling and Inference

This literature search may have missed some important work, but it does represent a strenuous online search and review of major relevant text material.

In Elisson and Elvers (2001), the authors say that "Haan, Opperdoes, and Schut (1999) have positive experience with cut-off sampling in a Consumer Price Index context." Elisson and Elvers (2001), in their own study, recommend vigilance in determining "…estimates for the population part that has been cut off." The use of administrative data can be helpful. They point out the need for "…an appropriate measure of size," and care in establishing and monitoring relationships between variables specific to the organization of an industry. The internet has references to European studies of interest.

The OECD (2006) gives us a definition of "Cut-Off Sampling": "A sampling procedure in which a predetermined threshold is established with all units in the universe at or above the threshold being included in the sample and all units below the threshold being excluded. The threshold is usually specified in terms of the size of some known relevant variable. In the case of establishments, size is usually defined in terms of employment or output."

Further, three manuals by the International Monetary Fund offer some insights. IMF (2004a), page 17, says the following: "It is possible that a cutoff sample could be more efficient if the bias component of the excluded units is small. For example, if the noncovered units have substantial variation with regard to price change but small bias (that is, the average price change is not much different), the RMSE [root mean square error] could be smaller using the cutoff sample, and the survey costs could be much lower." IMF (2004b) tells us the same on page 108.
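As a purely hypothetical numerical illustration of that point (the numbers are invented for exposition), recall that mean square error combines variance and squared bias:

$$
\mathrm{MSE} = \mathrm{Var} + \mathrm{Bias}^{2}.
$$

A probability sample with variance 9 and negligible bias then has MSE 9, while a cutoff sample with near-zero sampling variance and a bias of 2 has MSE of about 4; by this criterion the cutoff estimate is the more accurate despite its bias, and it may also cost much less to collect.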


So, using RMSE as a measure of accuracy, an International Monetary Fund document has stated that cutoff sampling can at times be more accurate than other reasonable alternatives. IMF (2004c) offers us this on page 72: "5.38 If the error criterion is not minimal bias but minimal mean square error (=variance+squared bias) then, since any estimator from cut-off sampling has zero variance, cut-off sampling might be a good choice where the variance reduction more than offsets the introduction of a small bias. De Haan, Opperdoes and Schut (1999) demonstrate that this may indeed be the case for some item groups." Of course this leaves us with the question of nonsampling error, but often the smallest establishments have more trouble with that.

Model Quality Report (c. 2000) Vol. I, Sections 4.5 and 9.5 are on "Cut-off sampling." They say that cutoff sampling is considered when the cost of establishing a complete frame is too great to be justified by the resulting increase in accuracy. This report was influenced by Särndal, Swensson, and Wretman (1992). In section 4.5.1, they discuss the bias when cutoff units are ignored, and in section 4.5.2, they describe the bias when survey weighted ratio estimation is used. That bias is due to the difference between the ratio of the two totals and the ratio of two estimates of those totals, when using a "[design]-consistent estimator" of that ratio. However, note that model-based cutoff sampling as used in Knaub (1999a), Knaub (2003), Knaub (2004), and elsewhere has bias due to the difference between the ratio of two totals and the ratio of subtotals for those two variables. This is discussed in Model Quality Report (c. 2000) Vol. I, section 9.5. Various examples are given in Chapters 5 and 9, including ratio versus regression estimation.

Eurostat (2006) says the following on page 5: "Methodologists have a strong preference for random sampling [as opposed to cutoff sampling]." However, one may say that such a preference does not consider total survey error, but only the view that cutoff sampling is done solely for convenience and/or cost savings.

Statistics Sweden (2001), p. 30: "Cutoff sampling. When good sampling frames are not available, probability sampling is generally impracticable. Also, for surveys relating to products with a very small weight, small samples of perhaps only 1-10 units are wanted and it may then be more practical to cover only the largest.

"Cutoff sampling means selecting only units with the largest subweights (sales, turnover, population etc.). This method is reasonable if these units together cover a large share of the total and/or if they are more stable so that it is easier to obtain information from them.


"In some cases the sizes of the sampling units are only known approximately. In these cases one can speak of a 'judgmental cutoff' method.

"Exhibit 4: Example of cutoff sampling

Five companies out of 30 in Sweden cover 95 per cent of the total market for heating oil used in single-family houses. These five companies are included with certainty and the other 25 companies have zero probability of inclusion."

ILO (1999), Item 6 - Sampling and data collection: Here we are given some good news about the use of cutoff sampling. Discussant: Mr. G. O'Hanlon (Ireland). Invited papers by: Canada, Austria and Netherlands; Contributed paper by: Japan:

"39. The paper from Netherlands presented results of the use of scanner data for assessing methods used in the selection of items and in the production of CPI. Cut-off sampling and three probability sampling methods were compared on the basis of Monte Carlo simulations from scanner data obtained directly from Dutch supermarkets. The methods compared were: simple random sampling, stratified sampling, cut-off sampling and sampling with probability proportional to size.

"40. The empirical results showed that sampling with probability proportional to size did not necessarily decrease mean squared error. The research concluded that simple random sampling should not be used, sampling proportional to size was useful but that cut-off sampling led to better results."

Here we are told that there are often practical problems to consider. Discussant: Mr. R. Edwards (Australia). Invited paper by: Netherlands; Contributed paper by: Slovenia, Azerbaijan and Netherlands:

"50. The interrelationship between the sampling design, index bias and procedures to minimise errors was described at some length. The meeting discussed the possibility to speed up the process of introducing new items in the index in United States where probability sampling was used. It was noted that it was not possible to do so without violating the sampling method which was quite elaborate and rigid. It was however also not possible under cut-off sampling to bring in new items into the index immediately they were introduced in the market because their volume of sale would be small."
Sima Assaf (2005), p. 6: "Ten security firms and two investigation firms were sampled from this combined frame where sample size is based on our experience with the sample size in the PPI of manufacturing, determined by revenue and variance of price movements. The actual firms sampled by a combined method of judgmental and cut-off sampling. Firms that employ less than 10 employees were not included in the sample. Because the structure of the investigation and security services industry is relatively centralized, and because most of the firms are homogeneous and focus on provision of those services, it can be assumed that the level of prices, price movements is determined mainly by the large firms rather than by the small firms. In the price indices, the risk of using a non-probability sample is relatively small because the variance over time of the price changes between the producers of those products is relatively small. Additionally, because the small firms are unstable, a probability estimate would not have yielded the desired results. Finally, the sample size reflects the relative weight of the specific industry out of all service industries."

Another online reference comes from the US National Academy of Sciences with regard to trade. NAS (1992) says simplicity and cost savings can be advantages (p. 246, "ADVANTAGES OF CUTOFF SAMPLING"), but that (p. 247, "DISADVANTAGES OF CUTOFF SAMPLING") "The effect of their [transaction] omission on highly aggregated data is fairly small, but some of the thousands of cells of data published monthly are subject to much larger effects."

Another economic application is found in OTS (1995), page 4, Appendix B: Sampling: "Minimum cut-off sampling is an efficient method to analyze nonhomogeneous assets to help determine the thrift's risk of loss. This sampling method selects all assets with a balance (or commitment) equal to or greater than a cut-off amount. Exhibit 1 of this Appendix shows the basic steps to select the assets to be reviewed in minimum cut-off sampling. This sampling methodology can be used to review the underwriting standards used by an institution for its nonhomogeneous assets."

Plewes, T.J. (1988) considers the importance of the skewed nature of establishment survey data. On pages 72 and 73 he says that "Given the importance of large units, extensive resources are devoted to improving frame coverage and content for large units. One-stage, highly stratified designs, with certainty selection of large establishments are used in the vast majority of establishment surveys profiled." He refers to FCSM (1988). Plewes went on to say that sometimes a "probability design" is not used. He says that "Estimators which do not reflect probability of selection are also commonly used in establishment surveys. The estimators in use may generally be described as model-based, although the model often is implicit, rather than explicitly stated. Imputation techniques are frequently employed because cutoff sampling is a common design practice." He said that sampling error information was not always calculated and published, but that this did not seem to be due to agency preference, but rather more related to "… the use of nonprobability-based estimation procedures." This is unfortunate, as sampling error information may be readily calculated for missing data when regressor data are available. (See Knaub (1996) and Knaub (2001).)


FCSM (1988) discusses establishment surveys, and includes various comments on cutoff sampling. They say that "A number of establishment surveys employ a form of cutoff sampling where no units are selected below a specified size. Data for smaller firms are either imputed from administrative records or from large firm characteristics, or they are excluded from the target population altogether. Obviously surveys that purport to cover all establishments must adjust for units not given a chance for selection." They also note that "Units slightly smaller than the certainty cutoff will be given a much higher chance of selection than the smallest units," which is an apparent reference to the very different situation in which "The largest establishments will likely be in a 'take all' stratum when optimum stratification techniques are used," and thus there are not necessarily any cases where the probability of selection is zero.

FCSM (1988) describes a situation where one must track a value over time. They say that "A measure of how this value changes from month to month during the coming year is desired. The sample that has been selected is a cutoff sample representing some convenient group of establishments in the SIC code. Because of the nonrandom nature of the sample, stand alone estimates of monthly totals are not possible. [Current author's note: One may beg to differ.] However, if one is willing to assume that the month-to-month movements of the reporting establishments is adequate to measure the month-to-month movement of the universe as a whole, then a link-relative estimate may be used. … The link-relative estimator is biased. If the assumption that the responding establishments are representative of the universe is not true, estimates formed using this procedure are biased. In practice the bias can be severe. A common use of this estimator involves measuring change for very large establishments only and then assuming that the changes are reflective of the small establishments as well."

Thus one might say that link relative estimators seem similar to ratio estimators. The equations in FCSM (1988) for the ratio estimator (page 25; see Knaub (2005), referenced below) and the link relative estimator (page 26) are very similar, differing only in that the former is an adjustment to a previous census, while the latter is an adjustment to a previous link relative estimate. (Note that a ratio estimator can actually use a variety of auxiliary/regressor data, and even multiple regression. Also, one must be careful on page 26 not to be confused by the notation. More on this will be said below.) Madow, L.H., and Madow, W.G. (1978) and Madow, L.H., and Madow, W.G. (1979) are base references for the link relative estimators. These papers are found on the internet, on the American Statistical Association website, as shown in the "References and Bibliography." See Appendix II for a description of the classical ratio estimator (CRE) as a cumulative period link relative estimator with regard to cutoff sampling.
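In rough terms (the notation here is ours, not that of FCSM (1988)), with s denoting the responding cutoff sample, x and X the previous-census values for the sampled units and for the whole universe, and t indexing the current period, the two estimators of a current-period total may be written

$$
\hat{Y}^{\,\mathrm{ratio}}_{t} = \frac{\sum_{i \in s} y_{i,t}}{\sum_{i \in s} x_{i}}\, X,
\qquad
\hat{Y}^{\,\mathrm{link}}_{t} = \frac{\sum_{i \in s} y_{i,t}}{\sum_{i \in s} y_{i,t-1}}\, \hat{Y}^{\,\mathrm{link}}_{t-1}.
$$

Both scale a sample ratio up to a universe figure; they differ in whether the base being adjusted is a census total or the previous period's estimate, which is the distinction drawn in the text above.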


Some other sources from http://www.amstat.org/, the American Statistical Association web site, follow.

From Tupek, A.R., Copeland, K.R. and Waite, P.J. (1988): The authors express a clear preference for "Explicit imputation methods [which] typically use administrative data for the missing establishments as proxy for survey data." (See page 301 of this reference.)

Bailar, B.A., Isaki, C.T., Wolter, K.M. (1983): In section 5.1, "Cutoff Sample Versus Stratification," page 18, the authors describe empirical results using a procedure from Hansen, Hurwitz and Madow (1953). Their experiments examined the mean square errors (MSEs) obtained when ratio estimators were used with two strata. Ratios were for "… sample sums of 1979 to 1978 data …." The first stratum consisted of the establishments with values from 1978 over a specified threshold. Their goal was to determine the optimal mix for a fixed total sample size to be distributed between these two strata. Six characteristics were studied, and in each case the obtained MSE was minimized when all observations came from the first stratum, "…indicating the optimality of a cutoff sample with a ratio estimator. …the establishments in the cutoff sample represented only 63 to 90 percent of the total." They went on, however, to speculate that differences in the rates of change between years for the different strata could increase if a longer period were to elapse between the two data collections, and that "Should the differences in the rates of change increase over time, the stratified random sample design would yield the smaller MSE." They express a preference for probability sampling in large surveys, but perhaps they could have been convinced today that cutoff sampling may be useful for small-area statistics with adequate regressor (auxiliary) data, although that is doubtful considering the stand Hansen took against model-based sampling in his argument with Royall. They did write that "The Bureau uses a fairly large number of cutoff surveys in the industrial area. Part of the reason for doing that is that most of the monthly surveys are voluntary, not mandatory, and the small establishments have a poor record of responding." The small establishments, based on experience at the Energy Information Administration's Electric Power Division, appear to have a disproportionately large degree of difficulty reporting accurate data when required to report, especially on a frequent basis.

The usefulness of a "certainty stratum," such as the one indicated in Bailar, Isaki, and Wolter (1983), is also mentioned in Sweet and Sigman (1995). Sweet and Sigman build their work in that area on work done by Pierre Lavallee and Michel A. Hidiroglou. In Hidiroglou (1979), several works are referenced with regard to "… stratifying a population into take-all and take-some…" strata. This has been used for electric power sales data (Knaub (1989b)). Helfand, Impett, and Trager (1978) describes a fixed certainty "component," and "Rotating samples are used for the noncertainty component." A certainty stratum is also discussed in Butani, Stamas, and Brick (1997).
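For contrast with a pure cutoff design, the sketch below illustrates the take-all/take-some idea in a few lines. The frame values, threshold, and sampling fraction are arbitrary illustrations, not taken from any of the cited designs.

```python
import random

# Hypothetical frame of establishment sizes (e.g., prior-census sales).
frame = {f"est{i:02d}": size for i, size in enumerate(
    [900, 750, 480, 300, 120, 60, 45, 30, 22, 15, 9, 5, 3, 2, 1])}

TAKE_ALL_THRESHOLD = 100   # units at or above this size are selected with certainty
TAKE_SOME_FRACTION = 0.3   # simple random sample taken below the threshold

take_all = [u for u, size in frame.items() if size >= TAKE_ALL_THRESHOLD]
below = [u for u, size in frame.items() if size < TAKE_ALL_THRESHOLD]
take_some = random.sample(below, max(1, round(TAKE_SOME_FRACTION * len(below))))

# A pure cutoff sample would stop at 'take_all'; the take-some stratum gives the
# smaller units a nonzero chance of selection.
print("certainty stratum:", take_all)
print("sampled below cutoff:", take_some)
```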


Dorfman and Valliant (1993) reminds us that certainty strata have "zero" for variance estimates. Here we will consider that to be symptomatic of a customary approach to survey statistics, but it is generally less than optimal. True, that part of the sample is a census for a specific part of the population, and there is no sampling error. However, that does not mean that there is no nonsampling error, so a total survey error approach, as in Knaub (2004), is generally best.

Kirkendall (1992) investigates seasonality in the data. A single regression coefficient, β, is studied, and for the examples studied, it says on page 640 that "it is likely that estimating a value of β using the certainty companies only would give a reasonable approximation to the seasonality in the total, even though it is clear that the seasonality is not necessarily the same for each utility." It was also mentioned that test data were used in "earlier evaluations" and that "Traditional justification for the use of a cut-off sample includes the requirement that there is a high correlation between the two data sources (here between the values of the same variable reported by the same utility at different points in time), and that the coverage of the sample exceed 80% or 90% of the totals being measured."

Ahmed and Kirkendall (1981) starts by saying that "The Department of Energy, Energy Information Administration, Office of Oil and Gas is responsible for publishing timely and accurate weekly estimates of total inputs to refineries, total production of individual petroleum products, and total inventories of crude oil and individual petroleum products at the national level." They go on to say that "Royall [1970] showed theoretically and empirically that for producing an estimate close to the true population total under this model random sampling plans are often inferior to strategies which call for purposive (non-random) selection of samples." Their experimentation concludes with the following: "It appears that the estimation accuracy does not deteriorate too badly the year following sample selection. To maintain accuracy, operators of the system will have to be conscious of large changes in reporting patterns and of births and deaths in the system. In conclusion, we believe that this exercise has demonstrated that the model-based approach will lead to a manageable sample size and acceptable estimates." The influence/legacy of this paper might be inferred from a feature article in the Petroleum Supply Monthly, Energy Information Administration (EIA), July 1995, Heppner and French (1995). The EIA also uses cutoff sampling with ratio estimation for repeated monthly electric power surveys, with regressor data provided by annual census data (Knaub (2002)). This followed Kirkendall et al. (1990).

Another approach to the use of cutoff sampling, employed by the Bureau of Mines, formerly a part of the United States Department of the Interior, is described in a Bureau of Mines Information Circular, Harding and Berger (1971). In this Information Circular (IC), Katherine Harding and Arthur Berger describe a procedure developed by Abe Rothman. It was found empirically that if estimates were sequentially calculated using data as they became available from mail surveys (using ratio and difference estimates), then they could be counted upon to reach a prescribed level of accuracy at a given point in the process. Charts were graphed on previously collected data "…to (1) study the derived estimates' relationships to sample size and accuracy and (2) relate these findings, based on the desired limits for accuracy established, to current reporting periods." They noted that after a certain point in data collection, the desired accuracy was historically obtained and sustained, indicating they could proceed with a report at that point. The process called for a continuous review to see if the 'cutoffs' of this type that were established needed to be adjusted.

Similarly, the author has suggested that preliminary results for annual census surveys of electric power data at the US Energy Information Administration could have been released as soon as estimated aggregate values for key data elements achieved a given level for the corresponding estimated relative standard errors. This would be a 'real time' test, based on current data rather than an historical trend regarding when a given level of accuracy is expected, of when a preliminary report should be released, and might therefore be more satisfactory. Standard errors would be increased by missing data, as well as by nonsampling error due to the preliminary nature of the collected data, which might require further editing in such a case.

Ordinary convenience sampling would generally be less accurate (more biased). That, as the name implies, means that the data were relatively easily obtained. Schonlau, Fricker, and Elliott (2002) mentions the possibility of model-based inference in connection with this method, but notes that it is "controversial."

Särndal, Swensson, and Wretman (1992) discusses cutoff sampling on pages 531-533, and elsewhere. They, as well as Royall (1970), Kirkendall et al. (1990), Ahmed and Kirkendall (1981), and others, consider the use of classical ratio and/or other ratio-type estimators in conjunction with cutoff sampling. Royall (1970) discusses variance, and Särndal, Swensson, and Wretman (1992) have a "remark" and a "result" on page 181 that may be of interest. They state that "…the ratio estimator is very precise when the population points … are tightly scattered around a straight line through the origin and with a certain (unknown) slope…." An exercise on page 554 mentions that a cutoff estimator they describe (which is a classical ratio estimator) is model-unbiased. (Another source discussing model unbiasedness is Cochran (1977), page 158.) Särndal, Swensson, and Wretman (1992), however, say that there is another type of bias introduced by cutoff sampling, in that part of the population is not represented (creating an opportunity for model failure). However, they say this can be justified by cost considerations when it is not practical to obtain a frame, and if the bias is considered to be negligible, say in some highly skewed business surveys.

They discuss two options: either ignore the missing part (which the Energy Information Administration (EIA) routinely does with electric plants with less than one megawatt of capacity), or use a "ratio adjustment" as just discussed. On page 533 they say that frame undercoverage is equivalent to cutoff sampling, and they mention that and bias again on page 544. They do not seem to have considered cutoff sampling, with ratio estimation, as being more than expedient. However, taking a total survey error point-of-view, it appears highly likely that much of the electric power data collected by the EIA produce more accurate results when the smallest respondents are not asked to report on monthly samples, but only on annual censuses for regressor data. (Note that in a design-based sample, if the smallest observations are unreliable or not at all forthcoming, then imputation is needed and thus modeling will still be done.) Särndal, Swensson, and Wretman (1992), Exercise 14.5, page 554, does note that classical ratio estimation for cutoff sampling is model-unbiased (as explained in Cochran (1977), as stated above), a desirable property.

In section 7.3 of Brewer (2002), there is a good discussion of regression properties, with one exception, which is discussed in Knaub (2005). One point that is key to estimation for establishment surveys such as those that are the topic of this article is found on page 110 of Brewer (2002) and reiterated on page 353 of Knaub (2006): there, linear regression through the origin is described as a natural choice. Knaub (2005) discusses the classical ratio estimator, a robust method which falls under that set of estimators. Särndal, Swensson and Wretman (1992) continues the discussion of "alternative ratio models" using a coefficient of heteroscedasticity that is the same as that used in Brewer (2002), except for a factor of two.

On page 377 of Royall (1970), it says "For a model which seems to apply in many practical problems, the conventional ratio estimator is shown to be, in a certain natural sense, optimal …." On page 382 he says that "For a wide class of variance functions, if [weighted least squares regression through the origin] … is to be used, then … the best sample to observe consists of those n units having the largest x [regressor] values." This is cutoff sampling. Royall said that a referee pointed out that the use of a certainty stratum was a "step toward" this. Royall said that probability proportional to size sampling plans may also be such a "step." Note (see Särndal, Swensson, and Wretman (1992), pages 517 and 531, and Cumberland and Royall (1982)) that subsequent work by Royall and others emphasized the use of balanced sampling or even random sampling to guard against "model failure," because the part of the population not subject to sampling may not follow the model that the sample seems to obey. However, considering total survey error, one should contemplate all sources of error and both their likely and near worst case impacts. Then one can also consider Royall (1970) and the advantages when the model does not "fail."

An argument developed in the 1970s and 1980s between design-based survey statisticians, notably Morris Hansen, who believed that randomization should be the basis for estimation, and model-based survey statisticians, notably Richard Royall, who believed that estimation should be conditioned on the sample obtained. (See, for example, Hansen, Madow, and Tepping (1978) and Royall (1978).) Ken Brewer worked to combine these two views, culminating in his text, Brewer (2002).
Royall and others have considered model-based sampling other than cutoff sampling, to limit the impact of model failure. However, one should also consider that an unfortunate random selection is actually ameliorated by the use of modeling in model-assisted design-based sampling and estimation, the premise for Särndal, Swensson, and Wretman (1992), so that would appear to argue for Brewer's approach. Nevertheless, cutoff sampling seems indicated for many establishment surveys from a total survey error approach. Model-based estimation is required for cutoff sampling, because these are cases where collection of data from the smallest respondents is not practical. If a design-based sample were attempted, and many or all of the smallest respondents could not respond reliably, if at all, so that their data needed to be imputed, then one is back to a model of some sort anyway. It is better to keep the process simple, and use regression to fill in for any missing data, so that the variance of the estimated totals may be estimated. A cutoff sample may then yield the minimum variance for a given sample size. (See Royall (1970).)

Knaub (1999b), on the EIA website, provides a very short, simple, and illustrated explanation of how model-based sampling is used to estimate totals for data elements from a cutoff sample survey when regressor data are available for the universe. This was meant as an introduction for electric power data customers. Knaub (1996), page 2, says "Respondent burden and resource considerations prompted the EIA to try collecting these [electric power] data with a small, cutoff sample. Note that cutoff sampling is more practical administratively since no special imputation procedure needs to be invoked." This is because imputation for nonresponse may use the same regression approach as when dealing with data not in the sample. Classical ratio estimation has proven to be useful in both cases. Knaub (1996) goes on to say that "Cutoff samples have performed very well, and are very practical for these highly skewed data." Further, Knaub (2002) draws together references and ideas useful for estimation from highly skewed establishment surveys, such as the electric power surveys for which they were developed.

As noted in Knaub (2005), page 2: "Balanced sampling can lower bias, but Knaub (2002), pages 2-3, and 17-19, found that in highly skewed establishment surveys, extraordinary nonsampling error in the smallest responses, when small respondents are required to respond on too frequent a basis, can make it impractical to use such data, thus forcing the use of a cut-off sample. However, whether a balanced model-based sample, a cut-off model-based sample, or randomization for a design-based or model-based sample is implemented, ratio estimation has often proved useful, and the classical ratio estimator has long-standing and continued appeal." Experimentation was done, varying the coefficient of heteroscedasticity for instances of imputing for nonresponse as opposed to mass imputation for what is not in the sample, but the classical ratio estimator generally does well under both circumstances.

Unfortunately, many good statisticians think of randomization as a necessity for estimation of variance. That may be changing, but consider Bailar (1984):

Page 1: "Most of the surveys on which public policy is based are probability samples, but that is not true for some of the industrial surveys. In these surveys, a cut-off sample is often used in which large industrial establishments over a certain size are included with certainty and smaller establishments are omitted completely."


Page 2: "Again, in an ideal survey, the sampling variances would be calculated correctly, observing the sample design, and reflect the effect of interviewer and coder variability, and the effect of imputing for nonresponse. In some cases sampling variances do not reflect a complex sample design. Rarely do they include interviewer, coder, or other processing variances. In cases where cut-off sampling is used, sampling variances are not computed because there is no probability sampling."

However, model-based variances can be calculated. (See page 12 in Knaub (2004), pages 6 and 7 in Knaub (1996), and pages 877-879, particularly Figure 1, in Knaub (1992). Also, Knaub (2003) emphasizes that multiple regression may be used. Knaub (2001) is also about variance - and bias - for model-based cutoff sampling.)

Summary: There are several different ways to consider cutoff sampling:

1) A truncated universe may be used, and considered close enough to complete, especially for price data or any other ratio of two data elements, as opposed to having an interest in totals for either the numerator or denominator in such a ratio as total revenue divided by total sales.

2) In the case of link relative estimation one may also only consider a truncated universe.

3) A ratio estimator may be used to estimate for the entire universe. Relative standard errors and other considerations of accuracy (see Knaub (2004) and Knaub (2001)) should be contemplated, including consideration of the possible impact of model failure for the portion of the universe not sampled.

4) There could be design-based sampling of data below the cutoff. See Sweet and Sigman (1995), Hidiroglou (1979), and Knaub (1989b), for example.

5) Sometimes a cutoff sample may be done for analytical purposes only.


(III) Reasons found in the Electric Power Division (EPD) of the Energy Information Administration (EIA) for Using Model-Based Cutoff Sampling Instead of Design-Based Sampling For Electric Power Data:

1) Variance is lower for a given sample size. (See Royall (1970).)

2) Model failure-type bias would be applied to a relatively small part of the skewed population's total for a given data element - thus having relatively low impact.

3) The smallest potential respondents tend to have disproportionately large nonsampling error. Evidence from electric power data survey experience at the EIA:

a) scatterplots that show a "bulge" near the origin (see Knaub (2002))

b) respondent contact reports with comments indicating that the smallest respondents may be less capable of correctly filing the form, and may have other problems such as filing a monthly form when meters are only read quarterly

c) in previous testing on electric generation data, eliminating some of the smaller observations, yielding a smaller sample size, did not always raise estimated RSEs

d) much earlier testing with electric sales data showed about the same results for design-based ratio estimation with a stratified random sample and a certainty stratum of largest respondents, as was found for a cutoff sample using only that certainty stratum

4) Between the events mentioned above in 3c and 3d, a table was published of estimated z-values, $z = (\hat{T} - T)/\hat{\sigma}$, for test data with observations removed that were actually collected in a census, so that T (the total of a census of observed values, which does include some nonsampling error), the estimate of T, and its estimated variance were all obtained. Results appeared favorable. See page 750 in Knaub (1990) and page 878 in Knaub (1992).

5) Model-based balanced sampling and probability proportional to size (PPS design-based) sampling should be considered as alternatives, but if the smallest observations collected (from the smallest establishments) in the sample are unreliable, which may happen when asking for data on a frequent basis, then cutoff sampling may be the more accurate alternative, from a total survey error perspective.

6) Study results can also be found on the InterStat web site (search ‘Knaub’), including a study of bias and variance (Knaub (2001)).

7) Each year, for a number of years, a census of sales and revenue data, and a census of generation data have been obtained by the Electric Power Division of the EIA, and no instance seems to have been noted where the total of 12 monthly cutoff sample results had not compared well to the corresponding annual census results that followed.

8) Because (see 3 and 5 above) the smallest observations are difficult or impossible to collect well on a frequent basis, in a design-based sample they often should be imputed … bringing us back to the use of a model. When data are imputed in any way in a design-based sample, the negative impact on accuracy is often ignored, just as nonsampling error is often ignored, but from a total survey error point-of-view, this is another reason to use a cutoff sample and concentrate on collecting high quality data for the largest respondents monthly, and high quality regressor data, say, annually.

9) Resource savings are considerable for respondents for whom this is the heaviest burden, and for the Federal Government.

Summary for cutoff sampling from the first nine points: It is often more accurate and has lower respondent and data collection burden.

10) Further, Särndal, Swensson and Wretman (1992) mentions that an incomplete frame can actually amount to a cutoff sample. This occurs with electricity generation data considered by the EIA when plants with nameplate capacities under 1 megawatt are routinely ignored.

11) Another approach, considered for estimating sales of electricity to the public transportation sector, was to match the largest suppliers with the largest users prior to establishing the complete frame of suppliers. The remainder (the cutoff portion) of the population could be represented in the regressor (demand data, here from the Department of Transportation) as a lump sum if the classical ratio estimator is used with such a cutoff. (This is derived on pages 776 and 777 of Knaub (1991b), "Incompletely Specified Auxiliary Data.")

12) Note that econometrics texts often transform regressions to 'remove' heteroscedasticity in such a way that they assume that the degree of heteroscedasticity is that found in ratio estimation. (See Maddala (1977) and Griffiths, Hill, and Judge (1993).) Econometric data are often dictated by what is available, and thus survey design weights are not used. Using only already available data makes this similar to cutoff sampling. Because ratio estimation may be robust enough to handle both cutoff sampling and nonresponse, and also to deal with available economic data, it could be used in the same way in each case. Econometrics texts strive to derive nearly ordinary least squares (OLS) regression so that hypothesis tests can be used, but there are arguments for avoiding that. Confidence intervals and relative standard errors (RSEs) are easily interpretable, but hypothesis tests are often misinterpreted. (See Knaub (1987), Knaub (1989a), and Appendix III.) Sequential hypothesis tests are an exception because the merits of different decisions are compared. Any kind of sensitivity analysis may be helpful, but a single p-value is not very meaningful, as p-values are functions of sample sizes.

The next section discusses measurement of accuracy, which is important for cutoff sampling and all other sample and census surveys. We will discuss methodology particularly suitable for cutoff sampling.


(IV) Concepts for cutoff sampling: Points considering standard errors:

1) Estimates of relative standard error, RSE, are designed to indicate the amount of error to be expected due to sampling, based on the variability of the data observed.

2) Estimates of relative standard error with respect to a superpopulation, RSESP, are designed to indicate total survey error, based on the overall variability of the data generating process assumed by the model, and are influenced by the model, by sampling error, and by nonsampling error. (See Knaub (2004).)

3) The RSESP may be used as a general tool to compare models with different regressors and data stratifications, and then to indicate sample selections to be made. To appreciate this, first consider the following three figures from Knaub (1999b):

[Figures 1, 2, and 3 from Knaub (1999b) appear here.]

From page 2 of Knaub (1999b) we have the following: "In establishment surveys, data may be very skewed (i.e., with a few large values and relatively many small values) so that a cutoff sample of the largest establishments may be quite practical. Treatment of heteroscedasticity (a phenomenon involving variance) for optimizing estimation may be substantially altered if imputation is done for one or more 'larger' establishments. (See Knaub (1997).)" Testing since that time seems to indicate the classical ratio estimator, CRE, is robust for all such cases. Knaub (1999b) goes on to say, "Note that 'heteroscedasticity' refers to the nonconstant variance of the data about the regression line. Notice in Figures 1, 2 and 3 that data closer to the origin may have less variance."

From page 3 of Knaub (1999b) we have that "Using model-based inference with a cutoff sample may, at times, have advantages over design-based sampling. This could occur, and has occurred, for instance, when the data are highly skewed, and nonsampling error, time and cost considerations indicate that the data from the smallest entities are not efficiently obtainable. (In the case of an annual census of electric utilities, and a monthly sample, a small utility may not, for example, read its customers' meters more often than once every two or three months. This has happened, and this sort of thing can complicate the collection of accurate monthly observations!) A model may sometimes provide for better estimation than a design-based sample could, given these data quality considerations."

Two Important Lessons Learned:

1) Regarding setting the intercept at zero: Extensive experience with electric power data has shown that allowing a nonzero intercept is not helpful. In Brewer (2002), we see an argument explaining this, which is noted in Knaub (2005). On pages 109-110 of Brewer (2002), the usefulness of an intercept term in survey sampling is questioned. The following is stated: "It is more often the case than not, in survey sampling, that the most appropriate supplementary variable is close to being proportional to its corresponding survey variable, and that their natural relationship or line of best fit is a straight line through the origin. If the range of the supplementary variable is limited (and it often is limited by the process of stratification on size) then the inclusion of an intercept term permits the estimated relationship to stray well away from the origin, with a consequent loss of efficiency."

2) Regarding the robustness of the CRE for both imputation for nonresponse and for mass imputation for the smallest cases: The classical ratio estimator (CRE) for cutoff sampling does not usually provide the very best results, but is seldom in error by a great deal. That is, it is not volatile, and extensive experience has shown it to be a good choice for both mass imputation under cutoff sampling and, perhaps surprisingly, imputation for nonresponse as well.
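As a bridge to the variance discussion that follows, here is one way (the notation is ours, a sketch rather than a quotation of the cited work) to write the working model behind Figures 1-3 and the CRE, and to see the connection to the econometric 'removal' of heteroscedasticity mentioned in point 12 of Section III:

$$
y_i = \beta x_i + \epsilon_i, \qquad \operatorname{Var}(\epsilon_i) = \sigma^2 x_i^{2\gamma}, \qquad \gamma = \tfrac{1}{2}.
$$

Dividing through by $\sqrt{x_i}$ gives $y_i/\sqrt{x_i} = \beta\sqrt{x_i} + \epsilon_i/\sqrt{x_i}$, whose error term has constant variance. Ordinary least squares through the origin on the transformed data then returns $\hat{\beta} = \sum_s y_i / \sum_s x_i$, the CRE slope, so the ratio-estimation weighting and the transformation used in econometrics texts amount to the same thing in this case.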


Variance: From Knaub (2004), when using one regressor and a zero intercept, the "exact" variance under the model $y_i = \beta x_i + e_{0i} x_i^{\gamma}$, with a WLS estimate $\beta^* \equiv \hat{\beta}$ of $\beta$, is

$$
V_L(T^* - T) \;=\; \sum_{N-n} \sigma_{e_i}^{*2}/w_i \;+\; \Big(\sum_{N-n} x_i\Big)^{2} V^*(\beta^*),
\qquad \text{with } w_i = \big[x_i^{\gamma}\big]^{-2}.
$$

(If $\gamma = 0.5$, then one has the model-based ratio estimate with variance proportionate to $x$, the CRE.) Note that $\sum_{N-n} \sigma_{e_i}^{*2}/w_i$ resembles the mean square error, which is the variance plus bias-squared. However, $\sum_{N-n} \sigma_{e_i}^{*2}/w_i$ accounts only for the error due to the estimated residuals under a given model, and does not even include the error due to estimating the model coefficient. That is the second term, $\big(\sum_{N-n} x_i\big)^{2} V^*(\beta^*)$. Under the RSESP, we need to make the summations over $N$ rather than $N-n$.
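A minimal numerical sketch of that variance for the CRE case ($\gamma = 0.5$) follows. The data are invented, and the final line reports an RSE computed as the estimated standard error divided by the estimated total, which is one common convention; none of this reproduces any published EIA figures.

```python
# Sketch: model variance of the CRE prediction error for a cutoff sample (gamma = 0.5),
# using invented data. x is the regressor (e.g., a prior census) for all N units;
# y is observed only for the sampled (largest) units.
x_sample = [950.0, 620.0, 410.0, 180.0]        # units above the cutoff
y_sample = [970.0, 640.0, 400.0, 190.0]        # their reported current values
x_cutoff = [88.0, 35.0, 12.0, 7.0, 3.0]        # units below the cutoff (y not collected)

n = len(x_sample)
beta_hat = sum(y_sample) / sum(x_sample)       # CRE slope (WLS with w_i = 1/x_i)

# Estimated residual variance parameter: weighted residual sum of squares / (n - 1).
sigma2_hat = sum((y - beta_hat * x) ** 2 / x
                 for x, y in zip(x_sample, y_sample)) / (n - 1)

term_residual = sigma2_hat * sum(x_cutoff)                      # sum over N-n of sigma*^2_e / w_i
term_coeff = (sum(x_cutoff) ** 2) * sigma2_hat / sum(x_sample)  # (sum over N-n of x)^2 * V*(beta*)

var_hat = term_residual + term_coeff
total_hat = sum(y_sample) + beta_hat * sum(x_cutoff)
rse = (var_hat ** 0.5) / total_hat
print(f"total_hat={total_hat:.1f}, var_hat={var_hat:.2f}, RSE={rse:.2%}")
```

Extending the two summations from the N-n cutoff units to all N units would give the corresponding RSESP-style quantity described above.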

Multiple regression may be useful, as in Knaub (1996) and Knaub (2003). Recent experiments with multiple regression ratio estimation/prediction for electric generation by energy source (say, coal or solar) have used generation from a previous census for the same energy source, generation from that census for all other energy sources, and plant nameplate capacity, as three regressors. As suggested in Knaub (2003), multiple regression sometimes helps substantially, and does not appear to be substantially detrimental in the other cases.²

Wind powered generation had a particularly interesting result. Estimated RSESP graphs (a performance measure - see Knaub (2004)) showed that the multiple regression model was clearly superior. The use of nameplate capacity as regressor data appears very useful. Below is an example of a graph of estimated gross generation over a twelve-month period (Figure 4), and associated estimated RSE and estimated RSESP graphs (Figures 5 and 6), for wind-powered generation for all plant types (regulated and unregulated) in the Pacific Contiguous Census Division (California, Oregon and Washington). The red dot-marked lines below are for the one-regressor case. The blue triangle-marked lines are for the three-regressor cases.

² Thanks to Joel Douglas, formerly of SAIC, for programming support, and an extremely useful testing suggestion that left out a procedure that had excluded data apparently not modeled well with one regressor. Also, thanks to John Vetter for suggesting the time series presentation. The proper use of these suggestions is the author's responsibility and does not represent an endorsement by others.


Figure 4

Figure 5


Figure 6

----------------------------------------------------------------------------------------------------------

Note that the RSE estimate and RSESP estimate graphs have similar patterns. This is often the case. However, because the sets of observed data and imputed/predicted values collected in 'publication groups' may be parts of generally broader sets of data modeled in 'estimation groups,' these RSE and RSESP patterns may vary. In the case of electric sales data, some parts are censused or nearly censused by the EIA. The next example (Figure 7) is for residential sales data from traditional 'bundled' sources (meaning transmission costs are included). The rises in RSESP estimates that are not reflected by the RSE estimates may be due to Investor Owned Utilities (IOUs), which are supposed to be censused. A substantial rise in nonsampling error among IOUs will cause a rise in RSESP estimates without a corresponding rise in the RSE estimates for the set of traditional utilities that includes a certainty IOU stratum, and sampled municipal and cooperative sales.


Figure 7

---------------------------------------------------------------------------------------------------------- Sampling consideration: We can collect only so much data monthly, and the smaller observations are particularly problematic. However, we have to collect some monthly data for all categories that we want to publish, to avoid possibly severe model failure. In small area estimation we 'borrow strength' across groups of data for modeling, but if a group is only represented by other groups, then substantial changes may be missed. As for not collecting data below a cutoff, one must consider all possible impacts and their likely magnitudes.


Concluding remark: For many highly skewed establishment surveys where regressor data are available, cutoff sampling with ratio estimation is not only cost effective, but also may be more accurate than other viable alternatives. It also contributes to simplicity in data collection and data processing. Further, it reduces reporting burden for those most disproportionately burdened by reporting, and frees resources at the data collection agency to be used more meaningfully. Acknowledgements: Thanks to Joel Douglas for SAS programming, construction of graphs 4-7, and for discussions, and to Ken Brewer and others for suggestions and/or helpful discussions. Any errors are mine. References and Bibliography: Ahmed, Y.Z., and Kirkendall, N.J. (1981), “Results of Model-Based Approach to Sampling,” Proceedings of the Survey Research Methods Section, ASA, pp. 674-679, http://www.amstat.org/sections/srms/proceedings/ Bailar, B.A. (1984), “The Quality of Survey Data,” Proceedings of the Survey Research Methods Section, ASA, pp. http://www.amstat.org/sections/srms/proceedings/papers/1984_009.pdf Bailar, B.A., Isaki, C.T., Wolter, K.M. (1983), “A Survey Practitioner’s Viewpoint,” Proceedings of the Survey Research Methods Section, ASA, pp. 16-25. http://www.amstat.org/sections/SRMS/proceedings/papers/1983_004.pdf Brewer, K.R.W. (1963), "Ratio Estimation in Finite Populations: Some Results Deducible from the Assumption of an Underlying Stochastic Process," Australian Journal of Statistics, 5, pp. 93-105. Brewer, K.R.W. (1995), “Combining Design-Based and Model-Based Inference,” Business Survey Methods, ed. by B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott, John Wiley & Sons, pp. 589-606. Brewer, KRW (2002), Combining survey sampling inferences: The weighing of Basu's elephants, Arnold: London and Oxford University Press. Butani, S., Stamas, G., and Brick, M. (1997), “Sample Redesign for the Current Employment Statistics Survey,” Proceedings of the Survey Research Methods Section, ASA, pp. 517-522, http://www.amstat.org/sections/srms/proceedings/


Carroll, R.J., and Ruppert, D. (1988), Transformation and Weighting in Regression, Chapman &Hall. Chaudhuri, A. and Stenger, H. (1992), Survey Sampling: Theory and Methods, Marcel Dekker, Inc. Cochran, W.G.(1977), Sampling Techniques, 3rd ed., John Wiley & Sons. Cumberland, W.G., and Royall, R.M. (1982), “Does SRS Provide Adequate Balance?,” Proceedings of the Survey Research Methods Section, ASA, pp. 226-229, http://www.amstat.org/sections/srms/proceedings/ Dalén, J. (2005), “Sampling Issues in Business Surveys,” Pilot Project 1 of the European Community's Phare 2002 Multi Beneficiary Statistics Programme, Quality in Statistics, http://epp.eurostat.ec.europa.eu/pls/portal/docs/PAGE/PGP_DS_QUALITY/TAB47143266/QIS_PHARE2002_SAMPLING_ISSUES.PDF Dorfman, A., and Valliant, R. (1993), “Quantile Variance Estimators in Complex Surveys,” Proceedings of the Survey Research Methods Section, ASA, pp. 866-871, http://www.amstat.org/sections/srms/proceedings/ Elisson, H, and Elvers, E (2001), “Cut-off sampling and estimation,” Statistics Canada International Symposium Series – Proceedings. http://www.statcan.ca/english/freepub/11-522-XIE/2001001/session10/s10a.pdf Eurostat (2006), Eurostat, “Handbook on methodological aspects related to sampling designs and weights estimations,” Version 1.0, July 2006 http://forum.europa.eu.int/irc/dsis/nacecpacon/info/data/en/handbook%20part3%20-%20sampling%20and%20estimation.pdf from http://forum.europa.eu.int/irc/dsis/nacecpacon/info/data/en/index.htm FCSM (1988), Federal Committee on Statistical Methodology, Statistical Policy Working Paper 15 - Measurement of Quality in Establishment Surveys, 1988, http://www.fcsm.gov/working-papers/wp15.html Griffiths, W.E., Hill, R.C., Judge, G.G. (1993), Learning and Practicing Econometrics, Wiley. Haan, J. De, E. Opperdoes, and C.M. Schut (1999). “Item Selection in the Consumer Price Index: Cut-off Versus Probability Sampling”, Survey methodology, 25, pp. 31-41. Hansen, M.H., Hurwitz, W.N., and Madow, W.G. (1953). Sample Survey Methods and Theory, Volume I. Wiley.


Hansen, M.H., Madow, W.G., and Tepping, B.J. (1978), “On Inference and Estimation from Sample Surveys,” Proceedings of the Survey Research Methods Section, ASA, pp. 82-107. http://www.amstat.org/sections/srms/proceedings/ Hansen, M.H., Madow, W.G., and Tepping, B.J. (1983), “An Evaluation of Model-Dependent and Probability-Sampling Inferences in Sample Surveys: Rejoinder,” Journal of the American Statistical Association, Vol. 78, No. 384 (Dec., 1983), pp. 805-807. Harding, K. and Berger, A. (1971), United States Department of the Interior, Bureau of Mines Information Circular, IC 8516, “A Practical Approach to Cutoff Sampling for Repetitive Surveys,” June 1971. Helfand, S.D., Impett, L.R., and Trager, M.L. (1978), “Annual Sample Update of the Census Bureau’s Monthly Business Surveys,” Proceedings of the Survey Research Methods Section, ASA, pp. 128-133. http://www.amstat.org/sections/srms/proceedings/ Heppner, T.G., and French, C.L. (1995), “Accuracy of Petroleum Supply Data,” Petroleum Supply Monthly, Energy Information Administration, July 1995. Hidiroglou, M.A. (1979), “On the Inclusion of Large Units in Simple Random Sampling,” Proceedings of the Survey Research Methods Section, ASA, pp. 305-308. http://www.amstat.org/sections/srms/proceedings/ Holmberg, Anders (2003), Essays on Model Assisted Survey Planning, Uppsala. http://www.diva-portal.org/diva/getDocument?urn_nbn_se_uu_diva-3417-1__fulltext.pdf ILO (1999), International Labour Organization, Joint UN/ECE/ILO Meeting on Consumer Price Indices (3-5 November 1999, Geneva), Summary of Discussion, http://www.ilo.org/public/english/bureau/stat/guides/cpi/summary.htm IMF (2004a), International Monetary Fund, “Manual on Export and Import Price Indices,” Chapter 5, 2004: http://www.imf.org/external/np/sta/tegeipi/ch5.pdf IMF (2004b), International Monetary Fund, “Manual on the Producer Price Index,” Chapter 5, 2004: http://www.imf.org/external/np/sta/tegppi/ch5.pdf IMF (2004c), International Monetary Fund, “Manual on the Consumer Price Index,” Chapter 5, 2004: http://www.ilo.org/public/english/bureau/stat/download/cpi/ch5.pdf Kadilar, C., and Cingi, H. (2006), “New Ratio Estimators Using Correlation Coefficient,” InterStat, http://interstat.statjournals.net/, March 2006.


Karmel, T.S., and Jain, M. (1987), “Comparison of Purposive and Random Sampling Schemes for Estimating Capital Expenditure,” Journal of the American Statistical Association, American Statistical Association, 82, pp. 52-57. Kirkendall, et.al. (1990), “Sampling and Estimation: Making Best Use of Available Data,” seminar at the EIA, September 1990. Kirkendall, N.J. (1992), “When Is Model-Based Sampling Appropriate for EIA Surveys?” http://www.amstat.org/sections/srms/proceedings/ 1992, pp. 637-642. Knaub, J.R., Jr. (1987), "Practical Interpretation of Hypothesis Tests," Vol. 41, No. 3 (August), letter, The American Statistician, American Statistical Association, pp. 246-247. Knaub, J.R., Jr. (1989a), "Fellegi-Sunter Record Linkage Theory as Compared to Hypothesis Testing," Computing Science and Statistics, Proceedings of the 21st Symposium on the Interface, pages 524-527. Knaub, J.R., Jr. (1989b), "Ratio Estimation and Approximate Optimum Stratification in Electric Power Surveys," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 848-853. http://www.amstat.org/sections/srms/proceedings/ Knaub, J.R., Jr. (1990), “Some Theoretical and Applied Investigations of Model and Unequal Probability Sampling for Electric Power Generation and Cost,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 748-753. http://www.amstat.org/sections/srms/proceedings/ Knaub, J.R., Jr. (1991a), position statement, The Future of Statistical Software: Proceedings of a Forum, National Research Council, National Academy Press, p.82. Knaub, J.R., Jr. (1991b), “Some Applications of Model Sampling to Electric Power Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 773-778. http://www.amstat.org/sections/srms/proceedings/ Knaub, J.R., Jr. (1992), "More Model Sampling and Analyses Applied to Electric Power Data," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 876-881. http://www.amstat.org/sections/srms/proceedings/ Knaub, J.R., Jr. (1993), "Alternative to the Iterated Reweighted Least Squares Method: Apparent Heteroscedasticity and Linear Regression Model Sampling," Proceedings of the International Conference on Establishment Surveys, American Statistical Association, pp. 520-525.


Knaub, J.R., Jr. (1994), "Relative Standard Error for a Ratio of Variables at an Aggregate Level Under Model Sampling," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 310-312. http://www.amstat.org/sections/srms/proceedings/ Knaub, J.R., Jr. (1995a), "Planning Monthly Sampling of Electric Power Data for a Restructured Electric Power Industry," Data Quality, Vol. 1, No.1, March 1995, pp. 13-20. Knaub, J.R., Jr. (1995b), "A New Look at 'Portability' for Survey Model Sampling and Imputation," Proceedings of the Section on Survey Research Methods, Vol. II, American Statistical Association, pp. 701-705. http://www.amstat.org/sections/srms/proceedings/ Knaub. J.R., Jr. (1996), “Weighted Multiple Regression Estimation for Survey Model Sampling,” InterStat, May 1996, http://interstat.statjournals.net/. (Note that there is a shorter version in the ASA Survey Research Methods Section proceedings, 1996.) Knaub, J.R., Jr. (1997), "Weighting in Regression for Use in Survey Methodology," InterStat, April 1997, http://interstat.statjournals.net/. (Note shorter, but improved version in the 1997 Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 153-157.) Knaub, J.R., Jr. (1998a), “Filling in the Gaps for A Partially Discontinued Data Series,” InterStat, October 1998, http://interstat.statjournals.net/. (Note shorter, more recent version in ASA Business and Economic Statistics Section proceedings, 1998.) Knaub, J.R., Jr. (circa 1998b), "Model-Based Sampling, Inference and Imputation," found on the EIA web site under http://www.eia.doe.gov/cneaf/electricity/page/forms.html. Knaub, J.R., Jr. (1999a), “Using Prediction-Oriented Software for Survey Estimation,” InterStat, August 1999, http://interstat.statjournals.net/, partially covered in "Using Prediction-Oriented Software for Model-Based and Small Area Estimation," in ASA Survey Research Methods Section proceedings, 1999, and partially covered in "Using Prediction-Oriented Software for Estimation in the Presence of Nonresponse,” presented at the International Conference on Survey Nonresponse, 1999. Knaub, J.R. Jr. (1999b), “Model-Based Sampling, Inference and Imputation,” EIA web site: http://www.eia.doe.gov/cneaf/electricity/forms/eiawebme.pdf Knaub, J.R., Jr. (2000), “Using Prediction-Oriented Software for Survey Estimation - Part II: Ratios of Totals,” InterStat, June 2000, http://interstat.statjournals.net/. (Note shorter, more recent version in ASA Survey Research Methods Section proceedings, 2000.)


Knaub, J.R., Jr. (2001), “Using Prediction-Oriented Software for Survey Estimation - Part III: Full-Scale Study of Variance and Bias,” InterStat, June 2001, http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods Section proceedings, 2001.) Knaub, J.R., Jr. (2002), “Practical Methods for Electric Power Survey Data,” InterStat, July 2002, http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods Section proceedings, 2002.) Knaub, J.R., Jr. (2003), “Applied Multiple Regression for Surveys with Regressors of Changing Relevance: Fuel Switching by Electric Power Producers,” InterStat, May 2003, http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods Section proceedings, 2003.) Knaub, J.R., Jr. (2004), “Modeling Superpopulation Variance: Its Relationship to Total Survey Error,” InterStat, August 2004, http://interstat.statjournals.net/. (Note another version in ASA Survey Research Methods Section proceedings, 2004.) Knaub, J.R., Jr. (2005), “Classical Ratio Estimator,” InterStat, October 2005, http://interstat.statjournals.net/. Knaub, J.R., Jr. (2006), Book Review, Journal of Official Statistics, Vol. 22, No. 2, 2006, pp. 351–355, http://www.jos.nu/Articles/abstract.asp?article=222351 Click on “Full Text.” Knaub, J.R., Jr. (2007a), “Heteroscedasticity and Homoscedasticity” in Encyclopedia of Measurement and Statistics, Editor: Neil J. Salkind, Sage, Vol. 2, pp. 431-432. Knaub, J.R., Jr. (2007b), “Survey Weights” in Encyclopedia of Measurement and Statistics, Editor: Neil J. Salkind, Sage, Vol. 3, p. 981. Knaub, J.R., Jr. (2007c), forthcoming. “Cutoff Sampling.” In Encyclopedia of Survey Research Methods, Editor: Paul J. Lavrakas, Sage, to appear in December 2007. Lee, H., Rancourt, E., and Särndal, C.-E. (1999), “Variance Estimation from Survey Data Under Single Value Imputation,” presented at the International Conference on Survey Nonresponse, Oct. 1999, published in Survey Nonresponse, ed by Groves, Dillman, Eltinge and Little, 2002, John Wiley & Sons, Inc., pp 315-328. Maddala, G.S. (1977), Econometrics, McGraw-Hill. Madow, L.H., and Madow, W.G. (1978), “On Link Relative Estimators,” Proceedings of the Survey Research Methods Section, ASA, pp. 534-539. http://www.amstat.org/sections/srms/proceedings/ Madow, L.H., and Madow, W.G. (1979), “On Link Relative Estimators II,” Proceedings of the Survey Research Methods Section, ASA, pp. 336-339. http://www.amstat.org/sections/srms/proceedings/


Model Quality Report (c. 2000), Model Quality Report in Business Statistics, Volume I, Theory and Methods for Quality Evaluation, and Volume IV, Guidelines for Implementation of Model Quality Reports, General Editors: Pam Davies, Paul Smith (circa 2000) at http://amrads.jrc.it NAS (1992), “Behind the Numbers: U.S. Trade in the World Economy,” The National Academy of Sciences, http://books.nap.edu/openbook.php?record_id=1865&page=R1 OECD (2004), Organisation for Economic Co-operation and Development, The “Short-Term Economic Statistics (STES) Timeliness Framework,” URL http://www.oecd.org/, search on “cutoff sampling” to find the link to “STES Timeliness Framework: Efficient Sample Designs,” March 13, 2004. OECD (2006), Organisation for Economic Co-operation and Development, Glossary of Statistical Terms, downloaded Nov 4, 2006 from http://stats.oecd.org/glossary/detail.asp?ID=5713. OTS (1995), Examination Handbook 209.B, Office of Thrift Supervision, US Dept. of the Treasury, http://www.ots.treas.gov/docs/4/422030.pdf Plewes, T.J. (1988), “Focusing on Quality in Establishment Surveys,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 71-74. http://www.amstat.org/sections/srms/proceedings/ Rao, Poduri, S.R.S. (1992), unpublished correspondence, Aug. - Oct. 1992, on covariances associated with three Royall and Cumberland model sampling variance estimators. Referenced in Knaub (1994). Royall, R.M. (1970), "On Finite Population Sampling Theory Under Certain Linear Regression Models," Biometrika, 57, pp. 377-387. Royall, R.M. (1978), Discussion of some papers presented, Proceedings of the Survey Research Methods Section, ASA, p. 102. http://www.amstat.org/sections/srms/proceedings/ Samaniego, F.J., and Watnik, M.R. (1997), “The Separation Principle in Linear Regression,” Journal of Statistics Education, Vol. 5, No. 3, http://www.amstat.org/publications/jse/v5n3/samaniego.html. Särndal, C.-E., Swensson, B. and Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag. Särndal, C.-E., and Lundström, S., (2005), Estimation in Surveys with Nonresponse, Wiley.


Schonlau, M., Fricker, R.D., and Elliott, M.N. (2002), Conducting Research Surveys via E-mail and the Web, RAND Corporation, pp. 33-34. http://www.rand.org/pubs/monograph_reports/MR1480/ Sima Assaf (2005), Voorburg Group on Service Statistics, Service Price Index for Investigation and Security Services, Central Bureau of Statistics, Israel, August 2005 http://www.stat.fi/voorburg2005/assaf.pdf Statistics Sweden (2001), “The Swedish Consumer Price Index, A handbook of methods,” http://www.scb.se/statistik/PR/PR0101/handbok.pdf Steel, P. and Fay, R.E. (1995), “Variance Estimation for Finite Populations with Imputed Data,” Proceedings of the Section on Survey Research Methods, Vol. I, American Statistical Association, pp. 374-379. http://www.amstat.org/sections/srms/proceedings/ Sweet, E.M. and Sigman, R.S. (1995), “Evaluation of Model-Assisted Procedures for Stratifying Skewed Populations Using Auxiliary Data,” Proceedings of the Section on Survey Research Methods, Vol. I, American Statistical Association, pp. 491-496. http://www.amstat.org/sections/srms/proceedings/ Tupek, A.R., Copeland, K.R. and Waite, P.J. (1988), "Sample Design and Estimation Practices in Federal Establishment Surveys,” in the American Statistical Association Proceedings on the Section on Survey Research Methods, pp. 298-303. http://www.amstat.org/sections/srms/proceedings/ Valliant, R., Dorfman, A.H., and Royall, R.M. (2000), Finite Population Sampling and Inference, A Predictive Approach, John Wiley & Sons. Willett, J.B. and Singer, J.D. (1988), “Another Cautionary Note about R-square: Its Use in Weighted Least-Squares Regression Analysis,” The American Statistician, Vol. 42, pages 236-238. World Bank (retrieved Feb 2007), Chapter 6, “Sampling and Price Collection,” http://siteresources.worldbank.org/ICPINT/Resources/ch6_Sampling_Apr06.doc


Appendix I

Correspondence on the Coefficient of Heteroscedasticity, γ, for Establishment Surveys

From: Ken Brewer Sent: Monday, September 25, 2006 9:53 PM To: Knaub, James Subject: RE: JOS book review ... ... I would never expect to find a value of gamma less than 0.5 in a business survey. … ... From: Ken Brewer Sent: Wednesday, September 27, 2006 9:28 PM To: Knaub, James Subject: RE: Small respondents What I am completely firm about is that gamma is almost always less than one, and if a respondent is tiny enough, that is all that is needed to produce an extremely high relative variance. We have had correspondence in the past that served to convince me that putting gamma equal to 0.5 was often useful, even when a higher value was more correct, and the unreliability of small respondents' responses was one of the main reasons. … … From: Ken Brewer Wed 12/20/2006 1:51 AM To: Knaub, James Subject: Re: Small respondents ... I'm quite happy for you to use either or both of the quotes [above]. I might mention the reason why I would never expect the value of gamma to fall below 0.5 for a business collection. 0.5 is the value the collection would take if all large businesses behaved like random aggregations of small ones, which is rather an extreme situation. The greater the extent of central control, the higher the value of gamma that would be expected. … ... Dr Ken Brewer, A.Stat. Visiting Fellow School of Finance and Applied Statistics College of Business and Economics LF Crisp Bdg (Bdg No. 26) The Australian National University Canberra ACT 0200 Australia (See Brewer (2002), page 111 for theoretical limits on gamma, the coefficient of heteroscedasticity.)
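One rough way to gauge $\gamma$ from sample data, offered only as an illustrative sketch (it is not a method prescribed in the correspondence or papers above), is to regress the log of the absolute residuals from an initial ratio fit on $\log(x)$; the slope then approximates $\gamma$:

import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.lognormal(2.0, 1.0, n)
gamma_true = 0.7                                   # within Brewer's expected range (0.5, 1)
y = 1.2 * x + rng.normal(scale=0.3 * x ** gamma_true)

b = y.sum() / x.sum()                              # initial ratio (CRE) fit through the origin
resid = y - b * x
slope, intercept = np.polyfit(np.log(x), np.log(np.abs(resid) + 1e-12), 1)
print(f"estimated gamma ~ {slope:.2f} (data simulated with gamma = {gamma_true})")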


Appendix II

CRE and the Link Relative Estimator for Cutoff Sampling

The ratio estimator on page 25 of FCSM (1988) includes weights in both the numerator and denominator that provide expansion estimates for the variable of interest, and for the auxiliary data. Those weights are based on the survey design (they could be survey weights or calibrated weights), and are implicitly combined with regression weights of a specific level to form a ratio estimator. See pages 35 and 42 of Särndal and Lundström (2005), and pages 232 and 233 of Särndal, Swensson, and Wretman (1992), for combining those survey and regression weights regarding the generalized regression estimator (GREG). Page 42 in Särndal and Lundström (2005) shows the ratio estimator as a special case. The classical ratio estimator (Brewer (2002), page 126) is studied in Knaub (2005). In the case of cutoff samples, this is useful because the survey weights have been simplified to zero for cases below the cutoff, and one otherwise. Thus, setting the survey sample weights all to unity on page 25 of FCSM (1988), we have a ratio estimator useful for cutoff sampling. This classical ratio estimator can be related to a presentation of the link relative estimator (from Madow, L.H., and Madow, W.G. (1978), and Madow, L.H., and Madow, W.G. (1979)) found on page 26 of FCSM (1988), by the same treatment of weights. Such a link relative estimator would be useful in cutoff sampling, as in other cases, if we want to know the relative changes between successive periods based on a set of k respondents that are found each reporting period. Between any two periods, we could consider an estimate of change, times the old estimate of total, to be identical to a ratio estimator for a cutoff sample, when those k respondents are the establishments that equal or exceed the cutoff threshold for the measure of size used. Consider the formula for the link relative estimator in FCSM (1988). It may be easier to follow with a change in notation, as shown below. Note that the basic mechanical difference between using the one-regressor classical ratio estimator with cutoff sampling, and the use of the link relative estimator with nonprobability sampling, is that the link relative estimator uses respondents that happen to be available in the periods of interest. Those respondents could be a cutoff sample. The remaining difference is one of application. The link relative estimator emphasizes the change from one period to another. The classical ratio estimator (CRE) may use past data from the same data element as a single regressor, and the slope (regression coefficient) would be the change for the first period. After that, it would be the cumulative change. Also note that other regressor data can be used with the CRE, and multiple regression versions are possible. (See Knaub (2003).) Variance estimates are thus easily calculated. (See page 12 in Knaub (2004), pages 6 and 7 in Knaub (1996), and pages 877-879, particularly Figure 1, in Knaub (1992).)


Link Relative Estimator for the Case of Starting with an Estimate of a Total

$\hat{Y}_L(0) \equiv$ beginning estimate of total

- On pages 25 and 26 of FCSM (1988), the example cited is one where an initial total is known, here $\hat{Y}_L(0)$, and "A measure of how this value changes from month to month during the coming year is desired."

$\hat{X}(i) \equiv$ total, for month $i$, for the $k$ respondents being followed each month

- Thus: $\hat{Y}_L(i) = [\hat{X}(i)/\hat{X}(i-1)]\,\hat{Y}_L(i-1)$, and for the first month, $\hat{Y}_L(1) = [\hat{X}(1)/\hat{X}(0)]\,\hat{Y}_L(0)$.

CRE as Cumulative Link Relative Estimator for Cutoff Sampling

Applying $\hat{Y}_L(i) = [\hat{X}(i)/\hat{X}(i-1)]\,\hat{Y}_L(i-1)$ over $p$ time periods:

$\hat{Y}_L(p) = [\hat{X}(p)/\hat{X}(p-1)] \cdot [\hat{X}(p-1)/\hat{X}(p-2)] \cdot [\hat{X}(p-2)/\hat{X}(p-3)] \cdot \ldots \cdot [\hat{X}(1)/\hat{X}(0)]\,\hat{Y}_L(0)$

So, for the $p$th time period, a chain application of the link relative estimator yields the following:

$\hat{Y}_L(p) = [\hat{X}(p)/\hat{X}(0)]\,\hat{Y}_L(0)$

When applying this to a cutoff sample, $\hat{X}(0)$ is analogous to $\sum_{i=1}^{n} x_i$, and $\hat{Y}_L(0)$ is analogous to $\sum_{i=1}^{N} x_i$. Here $\hat{X}(p)$ is analogous to $\sum_{i=1}^{n} y_i$. From Knaub (2005), page 1: "… the CRE of the total is $\left[\sum_{i=1}^{n} y_i \Big/ \sum_{i=1}^{n} x_i\right] \sum_{i=1}^{N} x_i$."
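A small numerical check of this equivalence, using hypothetical monthly totals, is sketched below; chaining the link relative estimator period by period gives the same result as applying $\hat{X}(p)/\hat{X}(0)$ to the starting total directly:

import numpy as np

rng = np.random.default_rng(4)
p = 12                                          # number of periods followed
# Monthly totals X(0), ..., X(p) for a fixed set of k respondents (simulated drift).
X = 1000.0 * np.cumprod(np.r_[1.0, rng.uniform(0.95, 1.10, p)])
Y0 = 1800.0                                     # beginning estimate of the full total

Y = Y0                                          # chain: Y_L(i) = [X(i)/X(i-1)] * Y_L(i-1)
for i in range(1, p + 1):
    Y = (X[i] / X[i - 1]) * Y

print(Y, (X[p] / X[0]) * Y0)                    # the chained and direct forms agree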


Appendix III

Hypothesis Testing vs. Fellegi-Sunter Record Linkage Error Probabilities

With regard to editing data, the Fellegi-Sunter record linkage theory presents two types of errors that many consider analogous to the two types of error in hypothesis testing, and we may relate them to 'false positives' or 'false negatives' in data identified by edits, such as in a confidence band around a scatterplot. However, this situation is actually better than with the error probabilities associated with hypothesis testing, because p-values are sample size dependent and are only useful when comparing one to another in an appropriate manner, such as in sequential hypothesis tests. In contrast, Fellegi-Sunter error probabilities are not sample size dependent (except that they are better estimated with larger samples). (See Knaub (1989a).)


Appendix IV

Multiple Regression and the CRE

For multiple regression, with x-values being regressor data, and y-values being the current data element whose total is being estimated (the ‘variable of interest’), we have

$y_i = b_1 x_{1i} + b_2 x_{2i} + b_3 x_{3i} + b_4 x_{4i} + \ldots + b_j x_{ji} + e_{0i} w_i^{-1/2}$

where $w_i = z_i^{-2\gamma}$ and $z_i = \hat{y}_i = \sum_j b'_j x_{ji}$ (Knaub (2003)), where the $b'_j$ are preliminary estimates of the $b_j$.

So, $y_i = \sum_j b_j x_{ji} + e_{0i} w_i^{-1/2}$, and therefore $y_i^* = \sum_j b_j x_{ji}$, where $y_i^*$ is the WLS estimator.

If we graph $y_i$ vs. $y_i^*$, which is $y_i$ vs. $\sum_j b_j x_{ji}$, then the slope would be 1, with variance about the line due to the estimated residuals $e_i = e_{0i} w_i^{-1/2}$.

For one variable, then, similarly, $y_i^* = b x_i$, so if we graph $y_i$ vs. $y_i^*$, which is $y_i$ vs. $b x_i$, then the slope would be 1 again. However, we normally graph $y_i$ vs. $x_i$ and obtain slope $b$.

For the CRE, $\gamma = 0.5$, we have $b = \sum_{k=1}^{n} y_k \Big/ \sum_{k=1}^{n} x_k$, so each estimate of $y$ is $y_i^* = \left( \sum_{k=1}^{n} y_k \Big/ \sum_{k=1}^{n} x_k \right) x_i$.

This often works well (the $e_i$ are small) when there is a strong, linear relationship between $y$ and $x$ through the origin. However, if $y$ is not consistently close to a constant multiple of $x$, then a ratio estimate is not very good. The multiple regression equivalent is shown above, when $\gamma = 0.5$.
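A minimal sketch of this weighting scheme follows, with simulated data and assumed names; a preliminary unweighted fit supplies the $b'_j$, which define $z_i$ and the weights $w_i = z_i^{-2\gamma}$ for the final WLS fit:

import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([rng.lognormal(2.5, 1.0, n), rng.lognormal(1.5, 1.0, n)])
beta_true = np.array([0.8, 0.3])
z_true = X @ beta_true
y = z_true + rng.normal(scale=0.4 * np.sqrt(z_true))
gamma = 0.5

# Step 1: preliminary coefficients b'_j from an unweighted fit through the origin.
b_prelim = np.linalg.solve(X.T @ X, X.T @ y)

# Step 2: z_i = sum_j b'_j x_ji, then regression weights w_i = z_i^(-2*gamma).
z = np.clip(X @ b_prelim, 1e-8, None)              # guard against nonpositive predictions
w = z ** (-2.0 * gamma)

# Step 3: final WLS coefficients through the origin with these weights.
b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print("preliminary b':", np.round(b_prelim, 3), " final WLS b:", np.round(b_wls, 3))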


Appendix V

An Electric Power Data Example of When Not to Use a Ratio

Consider the relationship of fuel consumption for useful thermal output, UTO (utoc), to electric generation (g), for a combined heat and power plant, CHP:

1) Would the graph of utoc vs g go through the origin? (That is, when g is zero, would utoc be zero?)

2) Is there a linear relationship? (That is, would utoc always be a constant multiple of g?)

3) Is the standard error of utoc the square root of the product of a constant and g?

The third condition often falls into place if the other two do, but number two is not always true in this example, and the first condition is definitely not true here. When generation is zero, consumption for UTO is not only not necessarily zero, it could be huge. It could be any size. That is, electric generation and the production of useful thermal output are not necessarily strongly related for CHPs. This would seem especially true when generation is very small. (See Cochran (1977), page 158.3) So, equating a ratio of consumption for useful thermal output to generation in a past year, to the corresponding ratio in the current sample period, is likely to be unreliable. Consumption for UTO might often graph against electric gross generation as in the following:

[Schematic scatterplot: Consumption for UTO (vertical axis) vs. Generation (horizontal axis), with points scattered widely and no tendency toward a line through the origin.]

A graph like this would appear to have none of the desired properties.

3 Cochran (1977) references Brewer (1963) and Royall (1970) in the section “Conditions Under Which the Ratio Estimator is a Best Linear Unbiased Estimator.” Also relevant is Brewer (2002), top of page 110, also given in Knaub (2006), top of page 353.
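The sketch below, on simulated data meant to mimic the CHP situation, suggests one hedged way to screen for the three conditions above before trusting a ratio model; the specific diagnostics are illustrative assumptions, not a prescribed procedure:

import numpy as np

rng = np.random.default_rng(6)
n = 200
g = rng.lognormal(2.0, 1.0, n)                      # electric generation
utoc = rng.lognormal(3.0, 1.0, n)                   # UTO fuel use, here unrelated to g

# (1) Fit with an intercept; a large intercept relative to typical fitted values
#     warns that a through-the-origin ratio model is inappropriate.
slope, intercept = np.polyfit(g, utoc, 1)
print(f"intercept {intercept:.1f} vs. mean fitted value {np.mean(slope * g + intercept):.1f}")

# (2) If utoc were near a constant multiple of g, unit-level ratios would cluster tightly.
ratios = utoc / g
print(f"coefficient of variation of utoc/g: {ratios.std() / ratios.mean():.2f}")

# (3) Under the CRE model the residual spread should grow roughly like sqrt(g).
b = utoc.sum() / g.sum()
resid = utoc - b * g
lo, hi = g < np.median(g), g >= np.median(g)
print(f"residual SD, small g: {resid[lo].std():.1f}; large g: {resid[hi].std():.1f}")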



Misconceptions and Concerns Regarding “Cutoff Sampling and Inference” (2007) J. Knaub – February 2014 – Addendum Posted by InterStat  

 

Three misconceptions will be addressed here regarding the following:

1) representativeness

2) multiple variables of interest, i.e., attributes, i.e., questions on a survey - leading to 'quasi-cutoff' sampling

3) multiple regression

Since the primary article at Knaub(2007) was written, it has become apparent that cutoff sampling enjoys some use, but that there are misconceptions and concerns, particularly numbers 1 and 2 above.

First, there are different ways for a sample to be 'representative' of a population. Randomization is what occurs to most survey statisticians. However, a simple random sample should not often safely be considered representative. (How often do you see a simple random sample in practice?) We generally need to stratify. Further, to guard against a particularly bad selection, one would typically do much better with a model-assisted design-based sample. In model-based sampling, the regressor ensures representativeness. As in its role as "auxiliary" data for the model-assisted design-based case, regressor data give us some information regarding the entire population. With a cutoff, or quasi-cutoff sample (defined below), with regression through the origin (RTO), and weighted least squares (WLS), we generally minimize the damage by estimating for the y-values of the points near the origin. They can only be so much in error. If we try to collect them, however, very small observations often need to be collected from very small business entities that cannot provide high quality data on a frequent basis. Because the smallest observations collected may have disproportionately large nonsampling error, it may often be best to underestimate heteroscedasticity. See Knaub(2005) and Knaub(2009).

Note that model‐based (regression) estimation is required.   

So, one of the most effective ways to make a sample "representative" is to use auxiliary data for model-assisted design-based sampling and estimation, which would be the regressor data in model-based methods. Another effective method is stratification. Many still try to impose ideas of randomization on model-based estimation, when that need not be the case. That is because that practice uses randomization for representativeness, not regression data by category/stratum. You cannot judge one thing by the standards that apply to another. (Ken Brewer did an excellent job of combining both (Brewer(2002)), but here we do not have that luxury.) The y data relate to the x data in a way that has measurable heteroscedasticity, such that the smallest observations impact the regression coefficient(s) more, and thus the estimation of the missing data, and the estimation of variance for estimated totals. In practice, for establishment surveys, many years of production, testing, experimenting and general research has shown that this is generally good enough to produce reliable results with meaningful variance estimation, either as it is, or with the rearrangement of which data are modeled together, and perhaps what the model regressor(s) will be. (Also, see Douglas(2013).) This generally tests out well.


 

Second, the cutoff sampling of Knaub(2007) is not just for one attribute, and therefore cutoffs may not be strict. That is, because a respondent often reports for more than one variable of interest, some respondents selected because they have a large size variable indicator (regressor x, or linear combination of multiple regressors) for one attribute may also report for another attribute for which they have a small size variable. (At the Energy Information Administration [EIA] this kind of reporting has sometimes been called a "volunteer.") Thus, we may have a "quasi-cutoff" sample, and the smaller observations may help guard against "model failure" (which seems to be so exaggerated in many analysts' minds). See Knaub(2010) and Knaub(2014).

To explain this further, from Knaub(2014) we have the following:  

Quasi-cutoff sampling, as noted in Knaub(2011), is a result of the fact that surveys generally ask more than one question. When a respondent is asked for a datum because it is large for that attribute, it often answers for other attributes for which it may not be large. For those other attributes, there will then be data collected corresponding to smaller [x-values] than would have been taken in a strictly cutoff sample. Thus the cutoff may be raised, as other data will be collected, which will contribute to the coverage, which in turn, based on the relationship of the [x]'s to the [y]'s, will determine the variance for the error in the estimated totals. (See Knaub(2013).) Note that this is why some cutoffs seem to include so little in Douglas(2007). Because size measures will vary by attribute, this is far more efficient than using one size measure for a probability proportionate to size (PPS) sample, as has been struggled with at the US Energy Information Administration (EIA). Note that often, for agencies producing Official Statistics, there is one very good regressor available for each attribute, and here we rely on that relationship between x (regressor data) and y (attribute) to estimate for missing y-values. This is not to be confused with multivariate (multi-attribute) estimation. Multiple attributes are considered when sampling, if respondents do respond for more than one attribute, but the estimation here is based on each attribute's relationship to a regressor or regressors.
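A hedged sketch of quasi-cutoff selection and per-attribute CRE estimation, on a hypothetical frame with invented names, may help fix ideas: a respondent enters the sample if it is large on any attribute, so each attribute's collected data include some small "volunteer" observations.

import numpy as np

rng = np.random.default_rng(7)
N, A = 1000, 3                                      # establishments, attributes (survey questions)
x = rng.lognormal(2.0, 1.2, (N, A))                 # per-attribute size measures (regressor data)
y = 1.1 * x + rng.normal(scale=0.3 * np.sqrt(x))    # current values for each attribute (simulated)

cutoffs = np.quantile(x, 0.90, axis=0)              # one cutoff per attribute
selected = (x >= cutoffs).any(axis=1)               # in sample if large on ANY attribute

for a in range(A):
    xs, ys = x[selected, a], y[selected, a]         # includes small "volunteer" values for attribute a
    b = ys.sum() / xs.sum()                         # per-attribute CRE slope
    total = ys.sum() + b * x[~selected, a].sum()
    coverage = xs.sum() / x[:, a].sum()             # coverage of the size measure
    print(f"attribute {a}: coverage {coverage:.0%}, estimate {total:,.0f}, true {y[:, a].sum():,.0f}")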

 

Third, there is the question of multiple regression. Many may not realize that there has been substantial work in this area with regard to model-based estimation for establishment surveys. Experimentation bears out the skepticism in Brewer(2002) regarding this. However, at times, good multiple regressors may be both available and useful. It is good to look into this. This has been used on occasion at the EIA. See Knaub(1996) and Knaub(2003).

Example scatterplot showing quasi-cutoff sampling and multiple regression, provided by Joel Douglas (EIA): This scatterplot illustrates multiple regression, but to use this for editing and analyses, one would turn this scatterplot to show two-dimensional views, one at a time. The view below illustrates that the red points with predicted y-values are influenced by both regressors. It also illustrates quasi-cutoff sampling in that this is for electric generation from a minor fuel. Likely many if not most of the observed y-values for data points here are collected from respondents who are primarily in the sample because of their electric generation fired from other, more 'important' fuels. This was designed to meet an expected standard for estimated relative standard error (RSE):

[Figure: scatterplot of generation against two regressors; red points mark predicted y-values.]


Scatterplots help illustrate relationships between modeling, sampling, and analyses for data editing (Knaub(2009)). Nonsampling error may become evident, even though the only measure that the EIA has regarding nonsampling error is a proxy measure: mean absolute data revisions.

References:

Brewer, K.R.W. (2002). Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold, London, pages 109-110.

Douglas, J.R. (2007), "Model-Based Sampling Methodology for the new EIA-923," http://www.eia.gov/pressroom/presentations/asa/asa_meeting_2007/fall/files/modeleia923.ppt. Presented to the American Statistical Association Committee on Energy Statistics, October 18, 2007.

Douglas, J.R. (2013), "Efficiently Utilizing Available Regressor Data Through a Multi-Tiered Survey Estimation Strategy," InterStat, September 2013, http://interstat.statjournals.net/YEAR/2013/abstracts/1309001.php

Knaub, J.R., Jr. (1996), "Weighted Multiple Regression Estimation for Survey Model Sampling," Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 596-599. http://www.amstat.org/sections/srms/proceedings/papers/1996_101.pdf

Knaub, J.R., Jr. (2003), "Applied Multiple Regression for Surveys with Regressors of Changing Relevance: Fuel Switching by Electric Power Producers," InterStat, May 2003, http://interstat.statjournals.net/YEAR/2003/abstracts/0305002.php?Name=305002.

Knaub, J.R., Jr. (2005), "Classical Ratio Estimator," InterStat, October 2005, http://interstat.statjournals.net/YEAR/2005/abstracts/0510004.php?Name=510004 - on model-based CRE.

Knaub, J.R., Jr. (2007), "Cutoff Sampling and Inference," InterStat, April 2007, http://interstat.statjournals.net/YEAR/2007/abstracts/0704006.php?Name=704006

Knaub, J.R., Jr. (2009), "Properties of Weighted Least Squares Regression for Cutoff Sampling in Establishment Surveys," InterStat, December 2009, http://interstat.statjournals.net/YEAR/2009/abstracts/0912003.php?Name=912003.

Knaub, J.R., Jr. (2010), "On Model-failure When Estimating from Cutoff Samples," InterStat, July 2010, http://interstat.statjournals.net/YEAR/2010/abstracts/1007005.php?Name=007005

Knaub, J.R., Jr. (2011), "Cutoff Sampling and Total Survey Error," Journal of Official Statistics, Letter to the Editor, 27(1), 135-138, http://www.jos.nu/Articles/abstract.asp?article=271135. (click on "Full Text")

Knaub, J.R., Jr. (2013), "Projected Variance for the Model-Based Classical Ratio Estimator: Estimating Sample Size Requirements," to be published in the Proceedings of the Survey Research Methods Section, American Statistical Association, http://www.amstat.org/sections/srms/proceedings/, for 2013, available online, circa April 2014.

Knaub, J.R., Jr. (2014), "Efficacy of Quasi-Cutoff Sampling and Model-Based Estimation For Establishment Surveys - and Related Considerations," InterStat, January 2014, http://interstat.statjournals.net/YEAR/2014/abstracts/1401001.php


Knaub, J.R., Jr. (2013), “Projected Variance for the Model-Based Classical Ratio Estimator: Estimating Sample Size Requirements,” to be published in the Proceedings of the Survey Research Methods Section, American Statistical Association, http://www.amstat.org/sections/srms/proceedings/, for 2013, available online, circa April 2014. Knaub J.R., Jr. (2014), “Efficacy of Quasi-Cutoff Sampling and Model-Based Estimation For Establishment Surveys - and Related Considerations, InterStat, January 2014, http://interstat.statjournals.net/YEAR/2014/abstracts/1401001.php