
[Journal of Political Economy, 2001, vol. 109, no. 4] © 2000 by The Nobel Foundation

Micro Data, Heterogeneity, and the Evaluation of Public Policy: Nobel Lecture

James J. Heckman
University of Chicago and American Bar Foundation

This paper summarizes the contributions of microeconometrics to economic knowledge. Four main themes are developed. (1) Microeconometricians developed new tools to respond to econometric problems raised by the analysis of the new sources of micro data produced after the Second World War. (2) Microeconometrics improved on aggregate time-series methods by building models that linked economic models for individuals to data on individual behavior. (3) An important empirical regularity detected by the field is the diversity and heterogeneity of behavior. This heterogeneity has profound consequences for economic theory and for econometric practice. (4) Microeconometrics has contributed substantially to the scientific evaluation of public policy.

On behalf of all economists who analyze microeconomic data and who use microeconometrics to unite theory and evidence and to evaluate policy interventions of all kinds, I accept the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel.

The field of microeconometrics emerged in the past 40 years to aid economists in providing more accurate descriptions of the economy, in designing and evaluating public policies, and in testing economic

Bank of Sweden Nobel Memorial Lecture in Economic Sciences, presented in Stockholm, December 8, 2000. I am grateful to Jaap Abbring, Pedro Carneiro, Lars Hansen, Steve Levitt, Costas Meghir, Robert Moffitt, Jeffrey Smith, and Edward Vytlacil for helpful comments. This research was supported by National Science Foundation grant 97-09-873, National Institute of Child Health and Human Development grant 40-4043-000-85-261, and grants from the American Bar Foundation. I thank the National Science Foundation, the National Institutes of Health, and the American Bar Foundation for their support over the years.


theories and estimating the parameters of well-posed economic models. It is a scientific field within economics that links the theory of individual behavior to individual data, where individuals may be firms, persons, or households. Research in microeconometrics is data-driven. The availability of new forms of data has raised challenges and opportunities that have stimulated all of the important developments in the field and have changed the way economists think about economic reality. Research in the field is also policy-driven. Questions of economic policy that can be addressed with data motivate much of the research in this field. Research questions in this field are also motivated by the desire to test and implement new economic models.

In this lecture, I discuss four main themes in microeconometrics, a field that has been recognized by the Nobel committee for the first time this year. The first theme is that the post-World War II development of rich new data on individuals and firms gave economists a deeper understanding of the economy. At the same time, it confronted econometricians with a host of unsolved problems that could not be adequately addressed with methods developed in Cowles Commission simultaneous equations econometrics. Developments in microeconometrics have been stimulated by empirical problems that arise in analyzing economic data.

The second theme is closely related to the first. Microeconometrics grew out of Cowles econometrics in response to its perceived empirical failures. Cowles econometrics was aggregative in character and was first applied on a wide scale to economic time series. Many of the Cowles econometric models were not motivated as solutions to precisely formulated individual decision problems. Even when they were, the literature on the aggregation problem in econometrics formally established that the link between the decision maker and the aggregate data used to estimate the models was not clear. Microeconometrics developed precisely formulated models of individual behavior and estimated models on individual data. The link between theory and data became much closer.

The third theme of this lecture is that a number of important empirical discoveries have emerged from microeconometric investigations. The most important discovery was the evidence on the pervasiveness of heterogeneity and diversity in economic life. When a full analysis of heterogeneity in responses was made, a variety of candidate averages emerged to describe the average person, and the long-standing edifice of the representative consumer was shown to lack empirical support. This changed the way economists think about econometric models and policy evaluation. A new model of microeconomic phenomena emerged. In the context of regression analysis, not only were intercepts variable but so were the slope coefficients, and both slopes and


intercepts could be correlated with regressors. Accounting for heterogeneity and diversity and its implications for economics and econometrics is a central message of this lecture and a main theme of my life's work.

The fourth theme of my lecture is that microeconometrics has contributed substantially to scientific policy evaluation based on econometric models, which has always been a central problem in econometrics. As difficulties in identifying structural parameters became evident, whether in macro or micro data, microeconomists, following important suggestions by Marschak (1953) and Hurwicz (1962), began to ask whether it was necessary to recover all the parameters of structural models to answer specific policy questions in a principled way. This gave rise to a new emphasis on problem-specific parameters, or treatment effects, which in general are distinct from structural parameters. These parameters answer more limited economic questions but are more easily identified or bounded. Understanding the advantages and limitations of these treatment effects and relating them to the structural parameters of the older literature is a recent advance.

I. Microeconometrics: Origins and a Definition

Econometrics is a branch of economics that unites economic theory with statistical methods to interpret economic data and to design and evaluate social policies. Economic theory plays an integral role in the application of econometric methods because the data do not speak for themselves on many questions of interpretation. Econometrics uses economic theory to guide the construction of counterfactuals and to provide discipline on empirical research in economics.

The production of a large database that can be used to describe the economy, to test theories about it, and to evaluate public policy is a major development of twentieth-century economics. Prior to the twentieth century, economics was largely a deductive discipline that drew on anecdotal observations and on introspection to test theory and evaluate public policies.

Alfred Marshall's theoretically fruitful notion of the representative firm and the representative consumer was firmly rooted in economic theory by the time economists began the systematic collection and analysis of aggregate economic data. The early econometricians focused on aggregate data to measure business cycles and to build models that could be the basis for an empirically based approach to macro policy evaluation. Using linear equation systems, these scholars developed a framework for analyzing causal models and producing policy counterfactuals. For the first time, causation was distinguished from correlation in a formally precise way that could be empirically implemented.

Despite these substantial intellectual contributions, empirical results


from these methods proved to be disappointing. Almost from the outset, aggregate time-series data were perceived to be weak, and empirical macro models were perceived as ineffective in testing theories and producing policy advice (see Morgan 1990). With a few notable exceptions, macroeconometricians turned to using statistical time-series methods in which the link between the statistical model and economic theory was usually weak.1

Early on, Orcutt (1962) advocated a program of combining micro and macro data to produce a more credible description of economic phenomena and to test alternative economic theories. At the time he set forth his views, the micro database was small, computers had limited power, and a whole host of econometric problems that arose in using micro data to estimate behavioral relationships were not understood, much less solved. Nonetheless, Orcutt's vision was a bold one, and he helped set into motion the forces that produced modern microeconometrics.

Microeconometrics extended the Cowles theory by building richer economic models in which heterogeneity of agents plays a fundamental role and the equations being estimated are more closely linked to individual data and individual choice models. At its heart, economic theory is about individuals and their interactions in markets or other social settings. The data needed to test the micro theory are micro data. The econometric literature on the aggregation problem (Theil [1954]; see, e.g., Green [1964] or Fisher [1969] for surveys) demonstrated the fragility of aggregate data for inferring either the size or the sign of micro relationships. In the end, this literature produced negative results and demonstrated the importance of using micro data as the building block of an empirically based economic science. It provided a major motivation for the collection and analysis of microeconomic data.

Another motivation was the growth of the modern welfare state and the ensuing demand for information about the characterization, causation, and solutions to social problems and the public demand for the objective evaluation of social programs directed toward specific groups. Application of the principles of the Cowles paradigm and its extensions by Theil (1961) gave rise to a demand for structural estimation based on micro data. In the optimistic era of the 1960s and 1970s, estimation of policy-invariant structural parameters on micro data became a central goal of policy-oriented econometric analysis to consider the effects of

1 See, however, the important work of Fair (1976, 1994), Hansen and Sargent (1980, 1991), and Hansen and Singleton (1982), which constitutes an exception to this rule. Heckman (2000) discusses this development. These methods formulate well-posed decision problems for individuals in deriving their estimating equations but apply them to aggregate data.


old policies in new environments and to consider the possible effects of new policies never tried.

Another use for structural models independent of interest in policy analysis was to test economic theory. Labor economics in particular had been enriched by the application of neoclassical theory to the labor market. This demand was further fueled by the emergence of a micro theory-based macroeconomics. The numerical magnitudes of individual-level preference and production parameters played a crucial role in macro theory and macro policy debates.

Another demand for structural estimation arose from the need to synthesize and interpret the flood of micro data that began to pour into economics in the mid 1950s. The advent of micro surveys coupled with the introduction of the computer and the development and dissemination of multiple regression methods by Theil (1961, 1971) and Goldberger (1964) made it possible to produce hundreds, if not thousands, of regressions quickly. The resulting flood of numbers was difficult to interpret or to use to test theories or create an informed policy consensus. A demand for low-dimensional economically interpretable models to summarize the growing mountains of micro data was created, and there was increasing recognition that standard regression methods did not capture all of the features of the data, nor did they provide a framework for interpreting the data within well-posed economic models.

Before I turn to specific developments in the field, it is useful to consider two distinct policy evaluation questions that differ greatly in the data and assumptions required to answer them. The evolution of microeconometrics in the past 30 years can be described as moving from answering the harder structural questions to answering the relatively easier treatment effect questions.

II. Economic Policy, Economic Models, and Econometric Policy Evaluation

Two conceptually distinct policy evaluation questions are often confused. Their careful separation is a major development in microeconomics and is a major theme of this lecture. The first question is (1) What is the effect of a program in place on participants and nonparticipants compared to no program at all or some alternative program? This is what is now called the treatment effect problem. The second and the more ambitious question raised is (2) What is the likely effect of a new program or an old program applied to a new environment?


The second question raises the same type of problems that arise from estimating the demand for a new good.2 Its answer requires structural estimation.

It is easier to answer the first question than the second, although the early literature attempted to answer both by estimating structural models. A major development in policy evaluation research to which I have contributed has been clarification of the conditions that must be satisfied to answer both types of questions, and other related questions.

The goal of structural econometric estimation is to provide the ingredients to solve a variety of decision problems. Those decision problems entail such distinct tasks as (a) evaluating the effectiveness of an existing policy, (b) projecting the likely effectiveness of a policy in environments different from the one in which it was experienced, or (c) forecasting the effects of a new policy, never previously experienced.3

In this lecture I consider only decision problems that arise in policy analysis.

Additional benefits of structural models are that they can be used to test economic theory and make quantitative statements about the relative importance of causes within a theory. In addition, structural models based on invariant parameters can be compared across empirical studies. Empirical knowledge can be cumulated within structural frameworks. However, for certain important classes of decision problems, knowledge of all or even any structural parameters of a model is unnecessary. This is fortunate because recovering structural parameters is usually not an easy task.

In the recent literature on policy evaluation, the implicit goal has been to recover the ingredients of models required to solve more specific decision problems. This may entail knowing only combinations of structural parameters or parameters that are not structural in any conventional sense of that term. Thus the modern treatment effect literature in economics takes as its main goal the estimation of one or another treatment effect parameter, not the full range of parameters pursued in structural econometrics, although the precise questions being answered in particular studies are often not clearly stated. These treatment parameters are identified under weaker conditions than are required for recovering all of the structural parameters of the model.

2 This question is discussed in basic papers by Lancaster (1966, 1971), Quandt (1970), McFadden (1974), Domencich and McFadden (1975), and Gorman (1980) (first written in 1956), among others.

3 Marschak (1953) stressed these features of structural estimation. Similar issues arise in estimating the demand for new goods. Structural methods can be used to estimate the parameters of demand equations in a given economic environment, to forecast the demand for goods in a different environment, and to forecast the demand for a new good never previously consumed. Knowledge of the parameters of demand functions is crucial in testing alternative theories of consumer demand and measuring the strength of complementarities and substitution among goods.


The Cowles distinctions between endogenous and exogenous variables and the later distinctions of weak, strong, and super exogeneity developed in the literature on estimating structural parameters and policy forecasting (Engle, Hendry, and Richard 1983) are largely irrelevant in identifying certain widely used treatment parameters. By focusing on one particular decision problem, the treatment effect literature achieves its objectives under weaker and hence more credible conditions than are invoked in the structural econometrics literature. At the same time, the parameters so generated are less readily transported to different environments to estimate the effects of the same policy in a different setting or the effects of a new policy, and they are difficult to compare across studies. The treatment effect literature has to be extended to make such projections and comparisons, and, unsurprisingly, the required extensions are nonparametric versions of the assumptions used by structural econometricians.4

To make this discussion specific, but at the same time keep it simple, consider the prototypical problem of determining the impact of taxes and welfare payments on labor supply. This problem motivated the early literature in evaluating the welfare state (Cain and Watts 1973), motivated my own research, and remains an important policy problem down to this day.

Following the conventional theory of consumer demand, write an interior solution labor supply equation of hours of work H in terms of wages, W, and other variables including assets, demographic structure, and the like. Denote these other variables by X. Let U denote an unobservable from the point of view of the observing economist. As we shall see, unobservables play a big role in microeconometrics. There is much evidence that unobservables are empirically important. Modern microeconometrics is devoted to accounting for them.

In the most general form for H,

H = f(W, X, U). (1)

Assume for simplicity that f is differentiable in all of its arguments. Equation (1) is a Marshallian causal function.5 Its derivatives produce the ceteris paribus effect of a change in the argument being varied on H. Suppose that we wish to evaluate the effect of a change in a proportional wage tax on labor supply. Proportional wage taxes at rate t make the after-tax wage W(1 - t). Assume that agents correctly perceive the tax and ignore any general equilibrium effects of the tax. In the language of treatment effects, the treatment effect or causal effect of

4 This point is developed more fully in Heckman, LaLonde, and Smith (1999) and Heckman and Vytlacil (2001a, 2001d, 2002).

5 See Heckman (2000) or Heckman and Vytlacil (2001a, 2002) for a rigorous definition of Marshallian causal functions.


a tax change on labor supply, defined at the individual level, is f(W(1 - t), X, U) - f(W(1 - t'), X, U) for the same person subject to two different taxes, t and t'.

An additively separable version of the Marshallian causal function (1) is

H = f(W, X) + U, E(U) = 0. (2)

This version enables the analyst to define the ceteris paribus effects of W and X on H without having to know the level of the unknown (to the econometrician) unobservable U. A parametric version of (1) is

H = f(W, X, U, v), (1')

where v is a low-dimensional parameter that generates the f of equation (1). A parametric version of (2) is

H = f(W, X, v) + U. (2')

The parameters v reduce the dimensionality of the identification problem from that of identifying an infinite-dimensional function to that of identifying a finite set of parameters. They play a crucial role in forecasting the effects of an old policy in different populations, in cumulating evidence across studies, and in forecasting the effects of a new policy. A linear-in-parameters representation of H writes

H = αX + β ln W + U, (3)

where we adopt a semilog specification to represent models widely used in the literature on labor supply (see Killingsworth 1983).
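
To fix ideas, the short sketch below simulates the semilog specification (3) with invented parameter values (the coefficients, the wage distribution, and the tax rates are assumptions for illustration, not estimates from this lecture) and computes the individual-level effect of moving from tax t to t'. Under the additively separable form (3) that effect collapses to β[ln(1 - t') - ln(1 - t)] and does not depend on U.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical parameters for the semilog specification (3),
# H = alpha * X + beta * ln(W * (1 - t)) + U; values are illustrative only.
alpha, beta = -50.0, 300.0

X = rng.normal(0.0, 1.0, n)                     # stand-in for "other variables"
W = rng.lognormal(mean=2.8, sigma=0.5, size=n)  # pre-tax hourly wage offers
U = rng.normal(0.0, 200.0, n)                   # unobserved heterogeneity

def hours(t):
    """Labor supply under a proportional wage tax at rate t."""
    return alpha * X + beta * np.log(W * (1 - t)) + U

t_old, t_new = 0.20, 0.30
effect = hours(t_new) - hours(t_old)

# Under the separable semilog form, the causal effect of the tax change is
# the same for every individual and is free of U:
print(effect.mean())
print(beta * (np.log(1 - t_new) - np.log(1 - t_old)))
```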

As in Marschak (1953), it is useful to distinguish three different policy evaluation problems. A tax is externally imposed on a population or a subpopulation of the economy. (Thus the tax is determined independently of U, but it may depend on X and W, variables that we observe and on which we can condition.) In case 1, tax t has been implemented in the past, and we wish to forecast the effects of the tax in a population with the same distribution of (W, X, U) variables that prevailed when historical measurements of tax variation were made. In case 2, tax t has been implemented in the past, but we wish to project the effects of the same tax to a different population of (W, X, U) variables. In case 3, the tax has never been implemented, and we wish to forecast the effect of a tax either on an initial population used to estimate (1) or on a different population.

Suppose that the goal of the analysis is to determine the effect of taxes on average labor supply on a relevant population with distribution G(W, X, U). In case 1, we have data from the same population for which we wish to construct a forecast. Suppose that we observe different tax


regimes. Persons face externally imposed tax rate tj in regime j, j = 1, …, J. In the sample from each regime we can identify

E(H | W, X, tj) = ∫ f(W(1 - tj), X, U) dG(U | X, W). (4)

For the entire population this function is

E(H | tj) = ∫ f(W(1 - tj), X, U) dG(U, X, W). (5)

This function is assumed to apply to the target population of interest. Knowledge of (4) or (5) from the historical data can be projected into all future periods provided that the joint distributions of data are temporally invariant. If one regime has been experienced in the past, lessons from it apply to the future, provided that the same f(·) and G(·) prevail. No explicit counterfactual state need be constructed. No knowledge of any Marshallian causal function or structural parameter is required to do policy analysis for case 1. It is not necessary to break apart (4) or (5) to isolate f from G.6
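
As a concrete illustration of case 1, the sketch below (with a made-up data-generating process) estimates the conditional mean (4) directly from historical data pooled across externally imposed tax regimes and reuses it to forecast mean hours under one of those regimes for a future population with the same f(·) and G(·). At no point are f and G separated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical historical data: wages, a discrete X, unobservables, and an
# externally assigned tax regime t_j in {0.0, 0.1, 0.2}, independent of U.
W = rng.lognormal(2.8, 0.4, n)
X = rng.integers(0, 3, n)
U = rng.normal(0, 150, n)
t = rng.choice([0.0, 0.1, 0.2], size=n)

H = 250 * np.log(W * (1 - t)) - 40 * X + U      # f(.) is unknown to the analyst

df = pd.DataFrame({"H": H, "t": t, "X": X,
                   "Wbin": pd.qcut(W, 10, labels=False)})

# Case 1: estimate E(H | W, X, t_j) cell by cell, without isolating f or G.
cond_mean = df.groupby(["t", "X", "Wbin"])["H"].mean()

# Forecast mean hours under regime t = 0.2 for a future population with the
# same distribution of (W, X, U): weight the estimated cell means by the
# population shares of the (X, W) cells.
cell_share = df.groupby(["X", "Wbin"]).size() / len(df)
forecast = (cond_mean.loc[0.2] * cell_share).sum()
print(forecast)
```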

Case 2 resembles case 1 except for one crucial difference. Because we are now projecting the same policy onto a different population, it is necessary to break (4) or (5) into its components and determine f(W(1 - tj), X, U) separately from G(U, X, W). The problem of policy evaluation becomes much harder. A quotation from Frank Knight (1921) is apt: "The existence of a problem in knowledge depends on the future being different from the past, while the possibility of a solution of the problem depends on the future being like the past" (p. 313).

The assumptions required to project the effects of the old policy in a new regime require that we borrow from the past to determine the components of (4) or (5) on new populations. Those assumptions follow.

a) Knowledge of f(·) is needed for the new population. This may entail determination of f on a support different from that used to determine f in an initial sample if the target population has a support different from that of the original source population. At this stage,

6 It is not even required that t be externally specified. If a policy-setting function t = h(X, W, U) generates t and h is 1-1 in U and t given (X, W), then each t is associated with a unique U given (X, W). Provided that the goal of the analysis is to forecast the effects of future t generated by h, we can use historical data to do so. If h is not 1-1 in U and t given (X, W), then it is not possible, in general, to use historical data to predict the effect of t variation generated by h on mean H. If, however, the goal is to forecast policies generated by a new rule (including external variations of t unrelated to U), then case 1 no longer is relevant, and it is necessary to do structural estimation (Lucas 1976).


structural estimation comes into its own. It sometimes enables us to extrapolate f from a source population to a target population. A completely nonparametric solution to this problem is impossible even if we adopt structural additive separability assumption (2) unless the supports of target and source populations coincide.

Some structure must be placed on f even if (2) characterizes the labor supply model. Parametric structure (3) is traditional in the labor supply literature, and versions of a linear-in-parameters model dominate applied econometric research.7

b) Knowledge of G(·) for the target population is also required. In this context, exogeneity enters as a crucial facilitating assumption.

Assumption 1. (X, W) independent of U.

If we define exogeneity by assumption 1, then G(U | X, W) = G(U).8 In this case, if we assume that the distribution of unobservables is the same in the sample as in the forecast or target regime, G(U) = G*(U), where G*(U) is the distribution of unobservables in the target population, we can project to a new population using the relationship

E(H | W, X, tj) = ∫ f(W(1 - tj), X, U) dG(U), (6)

provided that we can determine f(·) over the new support of X, W, U. If, however, G* ≠ G, G* must somehow be determined. This entails invoking some structural assumptions to determine the relationship between G and G*.

In case 3, where no tax has previously been introduced, knowledge of the target population is required. Taxes operate through the term W(1 - t). If there is no wage variation in samples extracted from the past, there is no way to identify the effect of taxes on labor supply since by assumption t = 0, and it is not possible to determine the effect of the first argument on labor supply. The problem is only worse if we assume that taxes operate on labor supply independently of wages.

7 The assumption that f(W, X) is real analytic so that it can be extended to other domains is another structural assumption. This assumption is exploited in Heckman and Singer (1984) to solve a censoring problem in duration analysis.

8 There are many definitions of this term. Assumption 1 is often supplemented by the additional assumption that the distribution of X does not depend on the parameters of the model (e.g., v in [1'] or [2']) (see Engle et al. 1983).


Then, even if there is wage variation, it is impossible to identify tax effects or to project them to a new population.9

The preceding discussion applies with equal force to analyses of aggregate data and to analyses of micro data. Using individual variation in micro surveys provides a new avenue of identification of f and G not available in macro data. It thus facilitates identification of structural parameters.

The treatment effect literature extends Marschak's first case by allowing the treatment (t) to be endogenous. Consider two populations. These can be subpopulations of a general population and will be referred to as the treatment group and the comparison group. In one population the tax is tj and in the other the tax is tk, which may be no tax at all. If the two populations are identical in terms of f and G and differ only in an externally imposed tax rate, then it is possible to determine the effect on mean hours of work of tj relative to tax tk for either population for any given X, W by simply contrasting mean hours in the two populations, E(H | W, X, tj) - E(H | W, X, tk), over domains of common support for W, X. No knowledge of f or G is required, so no structural estimation is required. Moreover, as previously noted, there are (stringent) conditions under which this exercise is valid even if t is endogenously determined by a stable policy rule provided that the rule is 1-1 in (t and U) for a given X, W.

In the context of the labor supply example, the literature on treatment effects seeks to identify the contrasts in mean hours worked on a given population of (X, W, U) that would arise from different externally

9 If wages vary in the prepolicy period, it may not be necessary to decompose (4) into f and G, or to do structural estimation, in order to estimate the effect of taxes on labor supply in a regime that introduces taxes for the first time. Define W̃ = W(1 - t). If the support of W̃ in the target regime is contained in the support of W in the historical regime, the supports of the X are the same in both regimes, and the conditional distributions of U given X, W and U given X, W̃ are the same, then knowledge of (4) over the support of W in the historical or source regime is enough to determine the effect of taxes in the target regime. More precisely, letting "historical" denote the past data and "target" denote the target population for projection, we may write these assumptions as (a) support(X, W̃)_target ⊆ support(X, W)_historical and (b) G(U | X, W̃)_target = G(U | X, W)_historical, where W̃ = W(1 - t) for random variables W defined in the new regime and (W̃)_target = W_historical. In this case, no structural estimation is required to forecast the effect of taxes on labor supply in the target population. A fully nonparametric policy evaluation is possible by estimating (4) or (5) nonparametrically (and not decomposing E(H | X, W) into the f(·) and G(·) components). Under assumption a, we may find a counterpart value of W(1 - t) = W̃ in the target population for each X to insert in the nonparametric version of (4) (or [5]). If these conditions are not met, it is necessary to build up the G and the f functions over the new supports using the appropriate distributions. We enter the realm in which structural estimation is required, either to extend the support of the f(·) functions or to determine G(U | X, W) or both. It is still necessary to determine the relationship between W̃ and X in the target population.


imposed policy (t) regimes without decomposing mean hours into f or G components, using data from populations in which t is not externally specified. Policy experiments (natural or unnatural) that change t and do not change f or G identify such effects. Instruments that shift t keeping f and G invariant are also used. A variety of methods are used to control for observed and unobserved differences in outcomes across policy regimes that are unrelated to the policy being evaluated. The identifying conditions required to estimate treatment effects are generally weaker than those required to identify f and G in the sense that fewer assumptions are required to identify the treatment effects. At the same time, the estimates produced are very problem-specific and apply only to the populations being studied. The treatment effects lack the transportability of f to new environments and the interpretability of f in terms of ceteris paribus changes (causal effects) for all of the conditioning variables except t.
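
A stylized version of this contrast, with invented numbers: two groups share the same f and G by construction and face different externally imposed taxes, and the mean-hours difference E(H | W, X, tj) - E(H | W, X, tk) is computed cell by cell over the common support of (X, W), with no structural estimation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 40_000

W = rng.lognormal(2.8, 0.4, n)
X = rng.integers(0, 2, n)
U = rng.normal(0, 150, n)

# Treatment group faces t_j = 0.25; comparison group faces t_k = 0.0.
D = rng.integers(0, 2, n)
t = np.where(D == 1, 0.25, 0.0)
H = 250 * np.log(W * (1 - t)) - 40 * X + U

df = pd.DataFrame({"H": H, "D": D, "X": X,
                   "Wbin": pd.qcut(W, 10, labels=False)})

# Contrast mean hours cell by cell over the common support of (X, Wbin).
means = df.groupby(["D", "X", "Wbin"])["H"].mean().unstack(level=0)
contrast = (means[1] - means[0]).dropna()
print(contrast.mean())                 # close to 250 * ln(0.75) in this setup
```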

This dualism between treatment effects and structural equations runs throughout the literature and my own work. I return to this theme, but first I consider how the availability of micro data provided the impetus for the development of microeconometrics.

III. New Features of Micro Data

The micro data first produced on a large scale in the 1950s revealed patterns and features that were not easily rationalized by standard models of consumer demand and labor supply or that were well modeled by conventional regression analysis. Important dimensions of heterogeneity and diversity that are masked in macro data were uncovered. These findings challenged the standard econometric tool kit of the day.

Inspection of cross-section data reveals that otherwise observationally identical people make different choices, earn different wages, and hold different levels and compositions of asset portfolios. These data reveal the inadequacy of the traditional representative agent paradigm.10 Table 1 presents a typical sample of data on labor supply. A considerable fraction of people do not work, and we do not observe wages for nonworkers. The R2 (measure of explained fit) of any micro relationship is typically low, so the unobservables account for a lot of the variability in hours of work. Different assessments of the unobservables have different effects on the interpretation of the evidence. For example, is joblessness due to unobserved tastes for leisure on the part of workers or a failure of the market to generate wage offers that are observed only

10 Lancaster (1966, 1971), Quandt (1970), McFadden (1974), and Domencich and McFadden (1975) were among the first to question the empirical validity of the representative agent empirical paradigm. See Kirman (1992) for a recent assessment of the representative agent paradigm.


TABLE 1
Participation, Hours Worked, and Wage Data, NLSY Data, 1979-94

                                           R2 from Regressions
                      Percentage     -------------------------------------------------------
Demographic           Working        Total Hours Worked on        Log Wage on Education
Group                 at Age 29      Education and Experience     and Experience
-----------------------------------------------------------------------------------------
White males           83.5%          .12                          .10
Black males           75.0%          .15                          .14
Hispanic males        80.0%          .11                          .10
White females         76.4%          .15                          .17
Black females         69.6%          .18                          .21
Hispanic females      66.6%          .18                          .10

Source: National Longitudinal Survey of Youth, 1979-94, as used in Carneiro et al. (2001).

if they are accepted?11 Are all women transients in the labor market, or do some women (or most) have a long-term attachment to it?12

There are additional problems with using these data that are much less apparent in analyses of time-series data. Wages are missing for nonworkers. How can one estimate the effect of wages on labor supply if wages are available only for workers? How can one interrelate the various dimensions of labor supply (hours of work, work or not work, number of periods worked) in order to do counterfactual policy analysis?

Analyses of the new data gave rise to a variety of econometric problems: (a) accounting for discreteness of outcome variables; (b) rationalizing choices made at both the extensive and intensive margins (models for discrete choice and for joint discrete and continuous choices) within a common structural model; and (c) accounting systematically for missing data, with prices or wages missing because of choices made by individuals.

Focusing solely on the statistical aspects of microeconometrics obscures its basic contributions. After all, many statisticians worried about some of these problems. Models for discrete data were analyzed by Goodman (1968), Haberman (1974), and Bishop, Fienberg, and Holland (1975), although it was economists who pioneered the study of models with jointly determined discrete and continuous outcomes (Heckman 1974a, 1974c) and models with systematically missing data (Gronau 1974; Heckman 1974a, 1974c, 1976a, 1976c, 1979).13 An important contribution of microeconometrics was to clarify the limitations of, and to extend, these statistical frameworks for estimating economic models, making causal distinctions and solving various versions of the policy evaluation problem described in Section II.

11 Flinn and Heckman (1982) analyze this question and show the difficulty of resolving it using data on market choices.

12 Heckman and Willis (1977) and Heckman (1981a) analyze this question.

13 See Holt (1985) for a discussion of the originality of the work of econometricians in analyzing models for data not missing at random.


Unlike the models developed by statisticians, the class of microeconometric models developed to exploit and interpret the new sources of micro data emphasized the role of economics and causal frameworks in interpreting evidence, in establishing causal relationships, and in constructing counterfactuals, whether they were counterfactual missing wages in the analysis of female labor supply or counterfactual policy states that arise in evaluating social policies. Research in microeconometrics demonstrated that it was necessary to be careful in accounting for the sources of manifest differences among apparently similar individuals. Different assumptions about the sources of unobserved heterogeneity have a profound effect on the estimation and economic interpretation of empirical evidence, in evaluating programs in place, and in using the data to forecast new policies and assess the effect of transporting existing policies to new environments.

Heterogeneity due to unmeasured variables became an important topic in this literature because its manifestations were so evident in the data and the consequences of ignoring it turned out to be so profound. The problem became even more apparent as panel micro data became available and it was possible to observe persistent differences over time for the same persons.

IV. Potential Outcomes, Counterfactuals, and Selection Bias

My initial efforts in the field of microeconometrics were focused on building models to capture the central features of data like that displayed in table 1 within well-posed choice theoretic models that also could be used to address structural policy evaluation problems (question 2 problems as defined in Sec. II). I was inspired by the work of Mincer (1962) on female labor supply and was challenged by the opportunity of building a precise econometric framework for analyzing the various dimensions of female labor supply and their relationship with wages. In accomplishing this task, I drew on two sets of econometric tools that were available, and my attempts to fuse these tools into a common research instrument produced both frustration and discovery.

The two sets of tools available to me were (1) classical Cowles Commission simultaneous equations theory and (2) models of discrete choice originating in mathematical psychology that were introduced into economics by Quandt (1956, 1970), McFadden (1974, 1981), and Domencich and McFadden (1975). My goal was to unite these two literatures in order to produce an economically motivated, low-dimensional, simultaneous equations model with both discrete and continuous endogenous variables that accounted for systematically missing wages for nonworkers and different dimensions of labor supply within a common framework, that could explain female labor supply, and that could


be the basis for a rigorous analysis of policies never previously implemented.

The standard model of labor supply embodied in equation (1), (2), or (3) is not adequate to account for the data in table 1. Neither is Cowles econometrics. Under standard conditions, Cowles methods can account for the correlation between W and U in equation (3), assuming that wages are measured for everyone. Such correlation can arise from measurement error in wages or because of common unobservables in the wage and labor supply equations (e.g., more motivated people work more and have higher wages, and motivation is not observed). Cowles methods do not tell us what to do when wages are missing, how to account for nonworkers, or how to relate the decision to work with the decision on hours of work.

In a series of papers written in the period 1972-75 (Heckman 1973, 1974a, 1974c, 1976a, 1976c, 1978a),14 I developed index models of potential outcomes to unite Cowles econometrics and discrete choice theory as well as to unify the disjointed and scattered literature on sample selection, truncation, and limited dependent variables that characterized the literature of the day.15 I also developed a variety of two-stage estimators for this class of models to circumvent computational difficulties associated with estimating these models by the method of maximum likelihood.

Following the literature in mathematical psychology and discrete choice as synthesized and extended by McFadden (1974, 1981), define

Yi = gi(Xi, Ui), i = 1, …, I, (7)

as latent random variables reflecting potential outcomes. In the context of discrete choice, the Yi are latent utilities associated with choice i, and they depend on both observed (Xi) and unobserved (Ui) characteristics. These are also called index function models. Within each choice i, the level of the utility may vary. More generally, as in the Cowles program and, in particular, Haavelmo (1943), equation (7) may represent any potential outcome, including wages, hours of work, and the like. Equations (1) and (7) are Marshallian causal relationships that tell us how hypothetical outcome Yi varies as the arguments on the right-hand side are manipulated holding everything else but the manipulated variable fixed.

Depending on the context, the Yi may be directly observed or only their manifestations may be observed. In models of discrete choice, the

14 The earliest papers were published in 1974 (Heckman 1974a, 1974c) but widely circulated before then. Two others (Heckman 1976c, 1978a) were actually written in 1973 and widely circulated at that time to many senior econometricians.

15 See Heckman and MaCurdy (1985) for a systematic development of index function models.


Yi are never observed, but we observe argmaxi{Yi}. In the more general class of models I considered, some of the Yi can be observed under certain conditions.

To consider these models in the most elementary setting, consider a version with three potential outcome functions. The literature analyzes models with many potential outcomes. Write the potential outcomes in additively separable form as

Y0 = g0(X) + U0,
Y1 = g1(X) + U1,
Y2 = g2(X) + U2. (8)

These are latent variables that may be only imperfectly observed. In the context of the neoclassical theory of labor supply, the theory of search, and the theory of consumer demand, the reservation wage or reservation price at zero hours of work (zero demand for the good) plays a central role. It informs us what price it takes to induce someone to work the first hour or buy the first unit of a good. Denote this potential reservation wage function by Y0. Let Y1 be the market wage function, what the market offers. With no fixed costs of work, a person works (D = 1) if

Y1 ≥ Y0 ⇔ D = 1. (9)

Otherwise the person does not work. Potential hours of work Y2 are generated from the same preferences that produce the reservation wage function, so Y2 and Y0 are generated by a common set of parameters. In my 1974c paper, I produced a class of simple tractable functional forms in which

Y0 = ln R = log reservation wage, (10a)

Y1 = ln W = market wage, (10b)

Y2 = (ln W - ln R)/γ, γ > 0, (10c)

and observed hours of work are written as

H = Y2 · 1(ln W ≥ ln R) = [(ln W - ln R)/γ] · 1(ln W ≥ ln R),

where 1(A) is an indicator that equals one if A is true. Proportional taxes or transfers t introduce another source of variation into these equations so that in place of W one uses the after-tax wage W(1 - t).


The unobservables U1 and U0 account for why otherwise observationally identical people (with the same X) make different choices.16
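
A brief simulation of the functional forms (10a)-(10c), under illustrative distributional assumptions (the means, variances, and γ below are invented): log market and log reservation wages are drawn jointly, participation follows the rule ln W ≥ ln R, and hours and wages are observed only for workers, so the observed wage distribution is a selected sample of the potential one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
gamma = 0.4                          # slope parameter in (10c), gamma > 0

# Joint draw of (ln W, ln R); the correlation stands in for common
# unobservables in the wage and reservation wage equations.
mean = [2.8, 2.6]
cov = [[0.25, 0.10],
       [0.10, 0.30]]
lnW, lnR = rng.multivariate_normal(mean, cov, size=n).T

D = (lnW >= lnR).astype(int)         # participation rule (9)
H = np.where(D == 1, (lnW - lnR) / gamma, 0.0)   # observed hours

print("fraction working:       ", D.mean())
print("mean potential log wage:", lnW.mean())
print("mean observed log wage: ", lnW[D == 1].mean())   # selected sample
```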

Closely related to this model is the pioneering model of Roy (1951) on self-selection in the labor market that was rediscovered in the 1970s. His model is a version of the model for index functions just presented.17

From equation (8), Y0 and Y1 are potential outcomes and Y2 is a latent utility:

Y2 ≥ 0 ⇒ D = 1 and Y1 observed,
Y2 < 0 ⇒ D = 0 and Y0 observed. (11)

Thus observed Y is

Y = DY1 + (1 - D)Y0. (12)

In the original Roy model, Y2 = Y1 - Y0. In the generalized Roy model, Y2 is more freely specified but may depend on Y1 and Y0.
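
The original Roy model of equations (11) and (12) is easy to simulate; the numbers below are illustrative only. Because agents sort on Y2 = Y1 - Y0, the observed outcomes in each sector are nonrandom samples of the corresponding potential outcomes.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Potential outcomes (8) with illustrative means and positively correlated
# sector-specific unobservables U0 and U1.
U0, U1 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n).T
Y0 = 10.0 + U0
Y1 = 10.2 + U1

Y2 = Y1 - Y0                      # latent utility in the original Roy model
D = (Y2 >= 0).astype(int)         # selection rule (11)
Y = D * Y1 + (1 - D) * Y0         # observed outcome (12)

# Self-selection: the mean of Y1 among those who chose sector 1 differs
# from the population mean of Y1 (and similarly for sector 0).
print("E[Y1]       :", Y1.mean())
print("E[Y1 | D=1] :", Y1[D == 1].mean())
print("E[Y0]       :", Y0.mean())
print("E[Y0 | D=0] :", Y0[D == 0].mean())
```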

These models of potential outcomes contain several distinct ideas. (1) As in the Cowles Commission analyses, there is a hypothetical superpopulation of potential outcomes defined by possible values assumed by the Yj for ceteris paribus changes in the X and the U. These are models of Marshallian causal functions usually represented by low-dimensional structural models to facilitate forecasting and policy analysis. (2) In contrast to the Cowles models, but as in the models for discrete choice, some of the latent variables are not observed (e.g., ln R is not observed but is sometimes elicited by a questionnaire). (3) In contrast to either the Cowles model or the discrete choice model, some of the latent variables are observed, but only as a consequence of choices; that is, they are observed selectively.

Thus we observe ln W - ln R up to scale and observe wages only if

16 In Heckman (1974a), I present a more explicit structural model of labor supply, child care, and wages that develops, among other things, the first rigorous econometric framework for analyzing the effect of progressive taxes on labor supply and the effect of informal markets on labor supply. In that paper, I characterize preferences by the marginal rate of substitution function, generate Y0 and Y2 from the consumer indifference curves, and produce Y2 from a solution of consumer first-order conditions and the budget constraint. In that model the unobservables affecting preferences translate into variation across consumers (or workers) in the slopes of indifference curves. Characterizing consumer preferences by the slopes of indifference curves facilitates the analysis of labor supply with kinked income tax schedules and provides a more flexible class of preferences than is produced by simple linear or semilog specifications of labor supply equations. The analysis of progressive taxes for the convex case appears in an appendix to Heckman (1974b). Because of space limitations, the editor, T. W. Schultz, requested a condensed presentation. The full formal analysis was published later in Heckman and MaCurdy (1981, 1985) and Heckman, Killingsworth, and MaCurdy (1981). Hausman (1980, 1985) extends this analysis to the nonconvex case.

17 Roy develops an economic model of income inequality and sorting but does not consider any of the econometric issues arising from his model. Lee (1978) and Willis and Rosen (1979) are two applications of the Roy model.


ln W ≥ ln R (D = 1). This selective sampling of potential outcomes gives rise to the problem of selection bias. We observe only selected subsamples of the latent population variables. In the context of the Roy model, we observe Y0 or Y1 but not both.

If there were no unobservables in the model, this selective sampling would not be a cause for any concern. Conditioning on X, we would obtain unbiased or consistent estimators of the missing outcomes for those who do not work using the outcomes of those who do work. Yet the data in table 1, which are typical, reveal that the observables explain only a small fraction of the variance of virtually all microeconomic variables. It is necessary to account for heterogeneity in preferences and selective sampling on unobservables. As a consequence of selection rule (9), in general the wages and hours we observe are a selected sample of the potential outcomes from the larger population.18 Accounting for this is a major issue if we seek to estimate structural relationships (the parameters of the causal functions) or describe the world of potential outcomes (the equations such as [10a]-[10c]). This gives rise to the problem of selection bias. To solve this problem required a new analysis of discrete choice and mixed continuous-discrete choice that revised conventional Cowles econometrics and demonstrated the inadequacy of conventional statistical models for discrete data in making causal distinctions. The theory of discrete choice and mixed discrete-continuous choice challenged the received Cowles paradigm by linking econometrics more closely to choice and decision processes. I consider these revisions in Appendix A. In brief, log linear models used by statisticians to model discrete data were unable to make the ceteris paribus distinctions between true and spurious causality that are required in econometric policy analysis, and new conditions for coherence in simultaneous equations models were developed to make models probabilistically and economically well defined (Heckman 1976c, 1978a; Heckman and MaCurdy 1986). Amemiya (1985) presents a masterful summary of the main developments in this literature.

V. Selection Bias and Missing Data

Selection bias arises in estimating structural models with partially observed potential outcomes. But the problem of selection bias is more general and can arise when a rule other than simple random sampling is used to sample the underlying population that is the object of interest. The distorted representation of a true population in a sample as a

18 If (9) applies, then there must be selection bias in observing wages or reservation wages except for degenerate cases (Heckman 1993). The selection in potential hours is an immediate consequence of (9) since D = 1 ⇔ ln W - ln R ≥ 0.


Fig. 1. Relationship between hypothetical (counterfactual) population and observed data.

consequence of a sampling rule is the essence of the selection problem. The identification problem is to recover features of a hypothetical population from an observed sample (see fig. 1). The hypothetical population can refer to the potential wages of all persons whether or not they work (and wages are observed for them) or to the potential outcomes of any choice problem in which only actual choices are observed. Distorting selection rules may arise from decisions of sample survey statisticians or the economic self-selection decisions of the sort previously discussed, where, as a consequence of self-selection, we observe


only subsets of a population of potential outcomes (e.g., Y0 or Y1 in the Roy model).

A random sample of a population produces a description of the population distribution of characteristics that provides a full enumeration of the models of potential outcomes presented in the previous sections. A sample selected by any rule not equivalent to random sampling produces a description of the population distribution of characteristics that does not accurately describe the true population distribution of characteristics, no matter how big the sample size.

Two characterizations of the selection problem are fruitful. The first, which originates in statistics, involves characterizing the sampling rule depicted in figure 1 as applying a weighting to hypothetical population distributions to produce observed distributions. The second, which originates in econometrics, explicitly treats the selection problem as a missing data problem and, in its essence, uses observables to impute the relevant unobservables.

A. Weighted Distributions

Any selection bias model can be described in terms of weighted distributions. Let Y be a vector of outcomes of interest and let X be a vector of control or explanatory variables. The population distribution of (Y, X) is F(y, x). To simplify the exposition, assume that the density is well defined and write it as f(y, x).

Any sampling rule is equivalent to a nonnegative weighting function q(y, x) that alters the population density. People are selected into the sampled population by a rule that differs, in general, from random sampling. Let (Y*, X*) denote the random variables produced from sampling. The density of the sampled data, g(y*, x*), may be written as

g(y*, x*) = q(y*, x*) f(y*, x*) / ∫ q(y*, x*) f(y*, x*) dy* dx*, (13)

where the denominator of the expression is introduced to make the density g(y*, x*) integrate to one as is required for proper densities. Simple random sampling corresponds to the case in which q(y, x) = 1. Sampling schemes for which q(y, x) = 0 for some values of (Y, X) create special problems because not all values of (Y, X) are sampled.19

In many problems in economics, attention focuses on f(y | x), the

19 For samples in which q(y, x) = 0 for a nonnegligible proportion of the population, it is useful to consider two cases. A truncated sample is one for which the probability of observing the sample from the larger random sample is not known. For such a sample, (13) is the density of all the sampled Y and X values. A censored sample is one for which the probability is known or can be consistently estimated.


conditional density of Y given X = x. If samples are selected solely on the x variables (selection on the exogenous variables), q(y, x) = q(x), and there is no problem about using selected samples to make valid inferences about the population conditional density.

Sampling on both y and x is termed general stratified sampling, and a variety of different sampling schemes can be characterized by the structure they place on the weights (Heckman 1987).

From a sample of data, it is not possible to recover the true density f(y, x) without knowledge of the weighting rule. On the other hand, if the weight q(y*, x*) is known, the support of (y, x) is known, and q(y, x) is nonzero, then f(y, x) can always be recovered because

f(y*, x*) = [g(y*, x*)/q(y*, x*)] ∫ q(y*, x*) f(y*, x*) dy* dx*, (14)

by hypothesis both the numerator and denominator of the right-hand side are known, and we know ∫ f(y*, x*) dy* dx* = 1, so it is possible to determine ∫ q(y*, x*) f(y*, x*) dy* dx*. It is fundamentally easier to correct for sampling plans with known nonnegative weights or weights that can be estimated separately from the full model than it is to correct for selection in which the weights are not known and must be estimated jointly with the model.20 Choice-based sampling, length-biased sampling, and size-biased sampling are examples of the former; sampling arising from selection in the model of equations (10a)-(10c) or in the generalized Roy model is an example of the latter.21
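
For the case of known weights, the logic of equation (14) can be checked numerically. The sketch below draws a length-biased sample (inclusion probability proportional to y, so q(y, x) ∝ y) from an invented population and recovers the population mean by reweighting observations by 1/q, in the spirit of the known-weight literature cited in note 20.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Hypothetical population: a positive outcome y (say, a duration).
y = rng.gamma(shape=2.0, scale=1.5, size=n)

# Length-biased sampling: q(y, x) proportional to y.
keep = rng.uniform(0, y.max(), n) < y
y_s = y[keep]

print("population mean of y:", y.mean())
print("selected-sample mean:", y_s.mean())          # biased upward

# Known weights: reweight by 1 / q(y) to undo the distortion, the sample
# analogue of recovering f from g in equation (14).
w = 1.0 / y_s
print("reweighted mean:     ", np.average(y_s, weights=w))
```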

    The requirements that (a) the support of (y, x) is known and (b)q(y, x) is nonzero are not innocuous. In many important problems ineconomics, requirement b is not satisfied: the sampling rule excludesobservations for certain values of (y, x), and hence it is impossiblewithout invoking further assumptions to determine the population dis-tribution of (Y, X) at those values. If neither the support nor the weightis known, it is impossible, without invoking strong assumptions, to de-

    20 Selection with known weights has been studied under the rubric of the Horvitz-Thompson estimates since the mid 1950s. Rao (1965, 1985) summarizes this research instatistics. Important contributions to the choice-based sampling literature in economicswere made by Manski and Lerman (1977), Cosslett (1981), and Manski and McFadden(1981). Length-biased sampling is analytically equivalent to choice-based sampling andhas been studied since the late nineteenth century by Danish actuaries (see Sheps andMenken 1973; Trivedi and Baker 1983). Heckman and Singer (1985) extend the classicalanalysis of length-biased sampling in duration analysis to consider models with unobserv-ables dependent across spells and time-varying variables. In their more general case, simpleweighting methods with weights determined independently from the model are notavailable.

21 Lewbel (2001) presents an interesting analysis of a selection model based on a Roy-type model in which weights can be constructed independently of the full model to recover the marginal outcome distributions but not the full outcome and selection rule distributions.


determine whether the fact that data are missing at certain (y, x) values is due to the sampling plan or to the population density having no support at those values. Using this framework, Heckman (1987) analyzes a variety of sampling plans of interest in economics, showing what assumptions they make about the weights and the model to solve the inferential problem of going from the observed population to the hypothetical population.

Figure 2 illustrates the problem arising from q(y, x) = 0 in a simple way. In figure 2a, I depict a truncated distribution for Y with data missing for values of Y below c. Any shape of the true hyperpopulation density is possible below c. Figure 2b shows a regression version of the same problem for a labor supply function H written in terms of the wage W. We can fit the regression within the sample, but how do we project it to new samples or to the hypothetical population?

B. A Regression Representation of the Selection Problem When There Is Selection on Unobservables

A regression version of the selection problem, when the weights q(y, x) cannot be estimated independently of the model, originates in the work of Gronau (1974), Heckman (1976a, 1976c, 1978a, 1979), and Lewis (1974). It starts from the Roy model, using (8), assuming (U₀, U₁, U₂) independent of X, Z. It is closely related to Lester Telser's characterization of simultaneous equations bias in a conventional Cowles system.22

I use Z to denote variables that affect choices, whereas the X affect outcomes. There may be variables in common in X and Z. We observe Y (see eq. [12]). Then

$$E(Y \mid X, Z, D = 1) = E(Y_1 \mid X, Z, D = 1) = \mu_1(X) + E(U_1 \mid X, Z, D = 1) \tag{15a}$$

and

$$E(Y \mid X, Z, D = 0) = E(Y_0 \mid X, Z, D = 0) = \mu_0(X) + E(U_0 \mid X, Z, D = 0). \tag{15b}$$

The conditional means of U₀ and U₁ are the control functions or bias functions as introduced and defined in Heckman (1980a) and Heckman and Robb (1985, 1986). The mean observed outcomes (the left-hand-side variables) are generated by the mean of the potential outcomes plus a bias term.

Define P(z) = Pr(D = 1 | Z = z). As a consequence of decision rule

22 See equation system (A1) in App. A. See Telser (1964) and the discussions in Heckman (1976c, 1978a, 2000).

Fig. 2. a, Data for Y < c missing. Two possible slopes for the density below c. b, The problem of extrapolating out of sample.


Fig. 3. Control function or selection bias as a function of P(z)

(11), in Heckman (1980a) I demonstrate that under general conditions we may always write these expressions as

$$E(Y \mid X, Z, D = 1) = \mu_1(X) + K_1(P(Z)) \tag{16a}$$

and

$$E(Y \mid X, Z, D = 0) = \mu_0(X) + K_0(P(Z)), \tag{16b}$$

where K₁(P(Z)) and K₀(P(Z)) are control functions and depend on Z only through P. The functional forms of the K depend on specific distributional assumptions. See Heckman and MaCurdy (1985) for a catalog of examples.

The value of P is related to the magnitude of the selection bias. As samples become more representative, P(z) → 1 and K₁(P) → 0. See figure 3, which plots the control function K₁(P(z)) versus P. As P → 1, the sample becomes increasingly representative since the probability that any type of person is included in the sample is the same (and P = 1). The bias function declines with P. We can compute the population mean of Y₁ in samples with little selection (high P). In general, regressions on selected samples are biased for μ₁(X). We conflate the selection bias term with the function of interest. If there are variables in Z not in X, regressions on selected samples would indicate that they belong in the regression. Representation (16a) and (16b) is the basis for an entire


econometric literature on selection bias in regression functions.23 The key idea in all this literature is to control for the effect of P on fitted relationships.24

The control functions relate the missing data (the U₀ and U₁) to observables. Under a variety of assumptions, it is possible to form these functions up to unknown parameters and to identify μ₀(X), μ₁(X), and the unknown parameters from regression analysis and control for selection bias (see Heckman 1976a; Heckman and Robb 1985, 1986; Heckman and Vytlacil 2002).

In the early literature, specific functional forms for (15) and (16) were derived assuming that the U were jointly normally distributed.

Assumption 2. (U₀, U₁, U₂) ~ N(0, Σ).

Assumption 3. (U₀, U₁, U₂) independent of (X, Z).

Assumption 2 coupled with assumption 3 produces precise functional

forms for K₁ and K₀. For censored samples, a two-step estimation procedure was developed: (1) estimate P(Z) from data on the decision to work and (2) use the estimated P(Z) to form K₁(P(Z)) and K₀(P(Z)) up to unknown parameters. Then (16a) and (16b) can be estimated using regression. This produces a convenient expression linear in the parameters when μ₁(X) = Xβ₁ and μ₀(X) = Xβ₀.25 A direct one-step regression procedure was developed for truncated samples (see Heckman and Robb 1985, 1986). Equations (16a) and (16b) became the basis for an entire literature, which generalized and extended the early models and remains active to this day.
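To make the two-step logic concrete, the following is a minimal Python sketch under assumptions 2 and 3, in which the control function for participants reduces to a scaled inverse Mills ratio. The data-generating process, parameter values, and use of a hand-rolled probit are illustrative assumptions for the sketch, not a reproduction of any estimates discussed in the text.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 20_000

# Illustrative selection design: the outcome is observed only when D = 1
x = rng.normal(size=n)
z = rng.normal(size=n)                       # enters the choice equation, not the outcome
u1, u2 = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n).T
d = (0.3 + 0.8 * z - 0.7 * x + u2 > 0).astype(float)   # participation decision
y = np.where(d == 1, 1.0 + 0.5 * x + u1, np.nan)        # mu1(X) = 1 + 0.5 x

# Step 1: probit for the participation probability by maximum likelihood
W = np.column_stack([np.ones(n), z, x])
def neg_loglik(g):
    p = np.clip(norm.cdf(W @ g), 1e-10, 1 - 1e-10)
    return -np.sum(d * np.log(p) + (1 - d) * np.log(1 - p))
g_hat = minimize(neg_loglik, np.zeros(3)).x
index = W @ g_hat

# Step 2: under joint normality, E(U1 | X, Z, D = 1) is proportional to the
# inverse Mills ratio lambda = pdf/cdf evaluated at the index; add it as a regressor
lam = norm.pdf(index) / norm.cdf(index)
sel = d == 1
X2 = np.column_stack([np.ones(sel.sum()), x[sel], lam[sel]])
two_step = np.linalg.lstsq(X2, y[sel], rcond=None)[0]

naive = np.linalg.lstsq(np.column_stack([np.ones(sel.sum()), x[sel]]),
                        y[sel], rcond=None)[0]
print("true slope on x      : 0.500")
print(f"naive OLS on workers : {naive[1]:.3f}")
print(f"two-step control fn  : {two_step[1]:.3f}")
```

With the exclusion restriction (z shifts participation but not the outcome), the coefficient on the inverse Mills ratio term absorbs the selection bias and the slope on x is recovered; without such a restriction, identification rests on functional form alone, the concern raised in footnote 25.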

C. Empirical Results from These Models and Their Consequences for Economics

The regression framework is useful for investigating microeconomic phenomena from selected samples in the general case of selection covered by the Roy model. In general, no simple weighting with weights that can be estimated separately from the complete model is available to solve the selection problem in the Roy model. Versions of this model

23 Heckman, Ichimura, Smith, and Todd (1998) present methods for testing the suitability of this expression in a semiparametric setting.

24 Heckman (1980a) suggests a series expansion of the K₁ and K₀ functions in terms of polynomials of P and suggests that a test for the absence of selection can be based on a test of whether the joint set of polynomials is statistically significant in an outcome equation. Andrews (1991) and Newey (1994) provide more general analyses.

25 Corrections for using estimated P(Z) in first-stage estimation are given in Heckman (1979) and Newey and McFadden (1994). Assumptions 2 and 3 were also used to estimate the model by maximum likelihood, as in Heckman (1974a, 1974c). The early literature was not clear about the sources of identification, whether exclusion restrictions were needed, and the role of normality.


Fig. 4. a, Median black-white male wage ratio, 1940–90. b, Percentage of males not in labor force, 1940–90. Source: Heckman and Todd (1999).

have been applied to a variety of problems in economics besides investigations of labor supply and wages.

Recognizing the potential importance of selection shapes the way we interpret economic and social data and gauge the effectiveness of social policy. Consider, for example, the important question of whether there has been improvement in the economic status of African Americans. As depicted in figure 4a, the median black-white male wage ratio increased in the United States over the period 1940–80 and then stabilized (see


the dark curve in fig. 4a). This statistic is widely cited as justification for a whole set of social policies put into place in this period. Over the same period, blacks were withdrawing from the labor force (P(Z) was going down), and hence from the statistics used to measure wages, at a much greater rate than whites (see fig. 4b). Correcting for the selective withdrawal of low-wage black workers from employment reduces and virtually eliminates black male economic progress compared to that of whites and challenges optimistic assessments of African-American economic progress.26
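The footnote to this passage describes the correction behind these numbers: if dropouts come from the bottom of the wage distribution and are fewer than half the population, the population median wage can still be read off the observed wage distribution of workers. A minimal sketch of that quantile calculation, with purely illustrative numbers rather than the data behind figure 4, is below.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative latent (potential) log wages for a full population
wages = rng.normal(loc=2.5, scale=0.6, size=100_000)

# Suppose the lowest-wage 30 percent withdraw from work (selective dropout)
p_work = 0.70
cutoff = np.quantile(wages, 1 - p_work)
observed = wages[wages > cutoff]            # wages are seen only for workers

# Naive median of workers overstates the population median
naive_median = np.median(observed)

# If dropouts all lie below the workers and 1 - p_work < 0.5, the population
# median sits at rank (0.5 - (1 - p_work)) / p_work within the observed wages
rank = (0.5 - (1 - p_work)) / p_work
corrected_median = np.quantile(observed, rank)

print(f"true population median : {np.median(wages):.3f}")
print(f"median of workers only : {naive_median:.3f}")
print(f"corrected median       : {corrected_median:.3f}")
```

Applied separately by race and year, this is the kind of adjustment the text describes for the series plotted in figure 4a.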

Thinking about issues in this way has much wider generality. It affects the way we analyze inequality and the effects on employment and welfare of alternative ways of organizing the labor market. In European discussions, the low-wage, high-inequality U.S. labor market is often compared unfavorably to high-wage, low-inequality European labor markets.

These comparisons founder on the same issues that arise in discussions of black-white wage gaps. In Europe, the unemployed and the nonemployed are not counted in computing the wage measures used to gauge the performance of the labor market. This practice understates wage inequality and overstates wage levels for the entire population by counting only the workers. A recent paper by Blundell, Reed, and Stoker (1999) indicates the importance of the selection problem in the English context. The English data reveal a growth in the real wages of workers over the period 1978–94 (see the top curve in fig. 5a). At the same time, the proportion of persons working has declined (see fig. 5b), and accounting for dropouts reduces the level and rate of growth of real wages. The observed growth in real wages may be a consequence of improvements in skill endowments and skill prices (e.g., μ₁(X)) or of improvements in the nonmarket sector that change the conditional mean of the unobservables in the wage equation by eliminating workers with low potential wages from the labor market. Adjusting for selection (the lower two curves in fig. 5a) greatly reduces estimated wage growth.

Accounting for selection also affects measures of wage variability over the cycle (Bils 1985). Low-wage persons drop out of the workforce (and hence the statistics used to measure worker wages) in recessions, and they return to it in booms. Changing composition partially offsets measured wage variability. Thus measured wages appear to exhibit too little variability over the business cycle. When a Roy model of self-selection

26 The particular selection correction used to produce the numbers used in this figure is to use median wages of workers, assuming that low-wage workers are the ones who drop out and that dropouts are less than 50 percent of the entire population. Butler and Heckman (1977) first raised this issue. Subsequent research by Brown (1984), Juhn (1997), Chandra (2000), and Heckman, Lyons, and Todd (2000) verifies the importance of accounting for dropouts in analyzing black-white wage differentials. Research on this important question is very active.

Fig. 5. a, Wage predictions from a micro model, aggregate wage, and corrections, rebased to 1978. b, Wages and labor market participation, British males. Source: Blundell et al. (1999).


is estimated with multiple market sectors, the argument becomes more subtle. Over the cycle, not only is there entry and exit from the workforce, but there are movements of workers across sectors within the workforce. Measured wages are thus not simply the price of labor services. In addition to the standard selection effect, measured wages include the effect of the weighting placed on different mixes of skills used by workers over the cycle (Heckman and Sedlacek 1985).

Accounting for both the extensive and intensive margins affects our view of the operation of the labor market. Consider equation (2). We observe hours of work only for workers. To focus on the selection problem, assume, contrary to fact, that wages are observed for everyone, and ignore any endogeneity in wages (so W is independent of U). Let D = 1 denote work. The observed labor supply conditional on W and X is

$$E(H \mid W, X, D = 1) = H(W, X) + E(U \mid W, X, D = 1), \tag{17}$$

where the Marshallian labor supply parameter, or causal parameter for wages, ∂H/∂W, is the ceteris paribus change in labor supply due to a change in wages. Compensating for income effects, we can construct a utility-constant labor supply function from this to conduct a welfare analysis, for example, to compute measures of consumer surplus. But a labor supply function fit on a selected sample of workers identifies two wage effects: the Marshallian effect and a compositional or selection effect due to entry and exit from the workforce (Heckman 1978c). Thus

$$\frac{\partial E(H \mid W, X, D = 1)}{\partial W} = \frac{\partial H(W, X)}{\partial W} + \frac{\partial E(U \mid W, X, D = 1)}{\partial W}. \tag{18}$$

The second term is a selection effect, or compositional effect, arising from the change in the composition of the unobservables due to the entry or exit of people into the workforce induced by the wage change. This is not a ceteris paribus change corresponding to the parameters of classical consumer theory. Equation (18) does not tell us how much a given worker would change her labor supply when wages change. However, it does inform us of what an in-sample wage change would predict for average labor supply. It answers a Marschak question 1 type of evaluation question. Under proper conditions on the support of W, it can be used to estimate the within-sample effects of taxes on labor supply.27
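The decomposition in (18) is easy to see by simulation: regressing observed hours on wages in a sample of workers mixes the Marshallian slope with the compositional effect of wage-induced entry. The data-generating process and parameter values below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Illustrative design: latent hours H* = a + b*W + U; a person works if H* > 0
a, b = -5.0, 2.0                      # b is the Marshallian (causal) wage slope
w = rng.uniform(1.0, 6.0, size=n)     # wages, assumed observed for everyone
u = rng.normal(0.0, 3.0, size=n)
h_star = a + b * w + u
work = h_star > 0                     # extensive margin: D = 1 if working
h_obs, w_obs = h_star[work], w[work]  # hours observed only for workers

# Slope of E(H | W, D = 1) in W: the Marshallian slope plus dE(U | W, D = 1)/dW
ols_slope = np.polyfit(w_obs, h_obs, 1)[0]
print(f"Marshallian slope b           : {b:.2f}")
print(f"OLS slope on the worker sample: {ols_slope:.2f}")
# The gap is the second term in (18): at higher wages, people with lower U
# enter the workforce, which pulls E(U | W, D = 1) down as W rises and makes
# the fitted slope flatter than the causal slope.
```

For the within-sample policy predictions discussed above, the flatter fitted slope is the relevant object; for a ceteris paribus welfare calculation it is not.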

Aggregate labor supply elasticities are inclusive of the effects of entry into and exit from the workforce as well as the effects of movement along a Marshallian labor supply curve. This simple observation has had

    27 The required condition is assumption a in n. 9 of Sec. II.


substantial effects on the specification, estimation, and interpretation of labor supply in macroeconomic models.

In the early 1980s, a literature in macroeconomics arose claiming that aggregate labor supply elasticities were too small and wage movements were too large for a neoclassical model of the labor market to explain the U.S. time-series data. I have already discussed why measured wage variation understates the variation in the price of labor. In Heckman (1984), I go on to note that the macro literature focused exclusively on the interior solution labor supply component (the first term on the right of [18]) and ignored the selection effect arising from workers' entry into and exit from the labor force. Since half of the aggregate labor supply movements are at the extensive margin (Coleman 1984), where the labor supply elasticity is higher, the standard 1980s calculations understated the true aggregate labor supply elasticity and hence understated the ability of a neoclassical labor supply model to account for fluctuations in aggregate labor supply. Accounting for choices at the extensive margin changed the way macroeconomists perceived and modeled the labor market (see, e.g., Hansen 1985; Rogerson 1988).

Empirical developments in the labor supply literature reinforced this conclusion. Early on, the evidence called into question the empirical validity of model (10). Fixed costs of work make it unlikely that the index for hours of work is as tightly linked to the participation decision as that model suggests. When workers jump into the labor market, they tend to work a large number of hours, not a small number of hours, as equations (10) suggest if the U are normally distributed. Heckman (1976a, 1980b) and Cogan (1981) proposed a more general model with fixed costs in which participation and hours of work equations are less tightly linked. This produces an even greater elasticity for the second term in equation (18). The evidence also called into question the validity of the normality assumption, especially for hours of work data. Hours of work distributions from many countries reveal spiking at standard hours of work.28 This led to developments to relax the normality assumption used in the early models.

    D. Identification

Much of the econometric literature on the selection problem combines discussions of identification (going from populations generated by selection rules back to the source population) with discussions of estimation in solving the inferential problem of going from observed

28 See the articles in the Journal of Human Resources (vol. 25, Summer 1990) special issue on Taxes and Labor Supply in Industrial Countries.


samples to hypothetical populations.29 It is analytically useful to distinguish the conditions required to identify the selection model from ideal data from the numerous practical and important problems of estimating the model. Understanding the sources of identification of a model is essential to understanding how much of what we are getting out of an empirical model is a consequence of what we put into it.

A conference at the Educational Testing Service (ETS) in 1985 brought together economists and statisticians and provided some useful contrasts in points of view on causal modeling and selection models (see Wainer 1986 [reissued 2000]).30 At that conference, Holland (1986) used the law of iterated expectations to write the conditional distribution of an outcome, say Y₁, on X in the following form:

$$F(Y_1 \mid X) = F(Y_1 \mid X, D = 1)\Pr(D = 1 \mid X) + F(Y_1 \mid X, D = 0)\Pr(D = 0 \mid X). \tag{19}$$

From the analysis of (11) and (12), we observe Y₁ only if D = 1. In a censored sample, we can identify F(Y₁ | X, D = 1) and Pr(D = 1 | X), and hence Pr(D = 0 | X). We do not observe Y₁ when D = 0. Hence, we do not identify F(Y₁ | X). In independent work, Smith and Welch (1986) made a similar decomposition of conditional means (replacing F with E).

Holland questioned how one could identify F(Y₁ | X) and briefly compared selection models with other approaches. Smith and Welch (1986) and some of the authors at the ETS conference discussed how to bound F(Y₁ | X) (or E(Y₁ | X)) by placing bounds on the missing components (F(Y₁ | X, D = 0) and E(Y₁ | X, D = 0), respectively).31 A clear precedent for this idea was the work of Peterson (1976), who developed nonparametric bounds for the competing risks model of duration analysis, which is mathematically identical to the Roy model of equations (11) and (12).32

I discuss some recent developments in this literature in Appendix B.
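The bounding idea is simple to state computationally: with a censored sample one knows F(Y₁ | X, D = 1) and Pr(D = 1 | X), so worst-case bounds on E(Y₁ | X) follow from filling in the unidentified E(Y₁ | X, D = 0) with the extremes of a known outcome support. The support values and simulated data below are illustrative assumptions in the spirit of this bounding literature, not figures from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Illustrative censored sample: y1 has known support [0, 1], observed only when d = 1
y1 = rng.beta(2, 2, size=n)                 # bounded support is the key assumption
d = rng.uniform(size=n) < (0.4 + 0.5 * y1)  # selection favors high outcomes
y_lo, y_hi = 0.0, 1.0                       # known support of Y1

p1 = d.mean()                               # Pr(D = 1), identified
mean_observed = y1[d].mean()                # E(Y1 | D = 1), identified

# Worst-case bounds on E(Y1): replace the unidentified E(Y1 | D = 0) by the
# smallest and largest values the support allows (eq. [19] with E in place of F)
lower = mean_observed * p1 + y_lo * (1 - p1)
upper = mean_observed * p1 + y_hi * (1 - p1)

print(f"true E(Y1)        : {y1.mean():.3f}")
print(f"identified pieces : E(Y1|D=1)={mean_observed:.3f}, Pr(D=1)={p1:.3f}")
print(f"worst-case bounds : [{lower:.3f}, {upper:.3f}]")
```

The width of the bounds is (y_hi - y_lo)(1 - Pr(D = 1)), so they are informative only when selection is limited or the support is tight; the additional assumptions studied in this literature serve to tighten them.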

The normality assumption that was widely used in the early literature was called into question. Arabmazar and Schmidt (1981) and Goldberger (1983) presented Monte Carlo analysis of models showing substantial bias for models with continuous outcomes when normality was assumed but the true model was nonnormal. The empirical evidence is more mixed. Normality is not a bad assumption for analyzing models of self-selection for log wage outcomes once allowance is made for

29 See Heckman (2000) for one precise definition of identification.

30 The exchange between Tukey and me recorded in that volume highlights the contrast between statisticians and econometricians in the value placed on making identifying discussions explicit and in making causal distinctions.

31 Smith and Welch use their analysis to bound the effects of dropping out on the black-white wage gap discussed in subsection C.

32 The competing risks model replaces max(Y₀, Y₁) with min(Y₀, Y₁).


truncation and self-selection.33 See figures 6a and 6b, from Heckman and Sedlacek (1985), and the related analysis of Blundell et al. (1999). These studies show that when one accounts for selection bias, a normal model fits wage distributions rather well. Olsen (1980) and Lee (1982) present early nonnormal but parametric extensions of the early normal Roy framework. Heckman and MaCurdy (1985) present a synthesis of this literature. Heckman (1980a) presents an early nonparametric estimator of the control function using a series expansion in P.
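In the spirit of that series-expansion idea (see also footnote 24), the normality-based control function can be replaced by a flexible function of the estimated participation probability, for instance a low-order polynomial, and the absence of selection can be examined by asking whether the polynomial terms matter. The sketch below uses an assumed simulated design of the same kind as the earlier two-step example and is illustrative only, not the estimator of any particular paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 20_000

# Assumed selection design: outcome observed only when d = 1
x = rng.normal(size=n)
z = rng.normal(size=n)
u1, u2 = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n).T
d = (0.3 + 0.8 * z - 0.7 * x + u2 > 0).astype(float)
y = np.where(d == 1, 1.0 + 0.5 * x + u1, np.nan)

# First stage: estimate the participation probability P by probit
W = np.column_stack([np.ones(n), z, x])
nll = lambda g: -np.sum(d * norm.logcdf(W @ g) + (1 - d) * norm.logcdf(-(W @ g)))
p_hat = norm.cdf(W @ minimize(nll, np.zeros(3)).x)

# Second stage: replace the normality-based K1(P) with a cubic polynomial in P-hat
sel = d == 1
controls = np.column_stack([p_hat[sel] ** k for k in (1, 2, 3)])
X2 = np.column_stack([np.ones(sel.sum()), x[sel], controls])
coef = np.linalg.lstsq(X2, y[sel], rcond=None)[0]

print(f"slope on x with polynomial-in-P controls: {coef[1]:.3f}  (true value 0.50)")
# A joint test that the coefficients on the three polynomial terms are zero is a
# test for the absence of selection on unobservables in this specification; the
# polynomial only approximates the true control function, so a small
# approximation bias in the slope can remain.
```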

Heckman and Honore (1990) consider identification of the Roy model under a variety of conditions. They establish that under normality, the model is identified even if there are no regressors, so there are no exclusion restrictions. They further establish that the model is identified (up to subscripts) even if one observes only Y but does not know if it is Y₁ or Y₀. The original normality assumption used in selection models was based on powerful functional form assumptions.34

Heckman and Honore develop a nonparametric Roy model and establish conditions under which variation in regressors over time or across people can identify the model nonparametrically. One can replace distributional assumptions with different types of variation in the data to identify the Roy version of the selection model. Heckman and Smith (1998) extend this line of analysis to the generalized Roy model. It turns out that decision rule (12) plays a crucial role in securing identification of the selection model. In a more general case, where Y₂ may depend on Y₁ − Y₀ but on other unobservables as well, even with substantial variation in regressors across persons or over time, only partial identification of the full selection model is possible. When the models are not identified, it is still possible to bound crucial parameters, and an entire literature has grown up elaborating this idea. See Appendix B for a discussion of this literature. Heckman, Ichimura, Smith, and Todd (1998), among others, discuss semiparametric estimation of selection models (see also Robinson 1988; Ahn and Powell 1993).

VI. Microdynamics and Panel Data: Heterogeneity versus State Dependence and Life Cycle Labor Supply

The initial micro data were cross sections. Thus early work on discrete choice, limited dependent variables, and models with mixed continuous-discrete endogenous variables was cross-sectional in nature and focused exclusively on explaining variation over people at a point in time. This

33 Normality of latent variables turns out to be an acceptable assumption for discrete choice models except under extreme conditions (Todd 1996).

34 Powerful, but testable. The model is overidentified. See, e.g., Bera, Jarque, and Lee (1984) for the tests of distributional assumptions within a class of limited dependent variable models.

Fig. 6. Predicted vs. observed log wage distribution from the generalized Roy model: a, nonmanufacturing sector; b, manufacturing sector. Source: Heckman and Sedlacek (1985).


gave rise to multiple interpretations of the sources of the unobservables in (7) and (8). The random utility models introduced in the literature on discrete choice interpreted these as temporally independent preference shocks (McFadden 1974), especially when discrete choice was considered. Other interpretations were (a) systematic variations in unobserved preferences that were stable over time and (b) omitted characteristics of choices and agents that may or may not be stable over time.35

With the advent of panel data in labor economics, an accomplishment due in large part to Morgan and his group at the Institute for Survey Research at the University of Michigan, it was possible to explore these sources of variation more systematically (see Stafford 2001). The issue was especially important in the study of female labor supply.

Mincer (1962) used an implicit version of the random utility model to argue that cross-section labor force participation data could be used to estimate Hicks-Slutsky income and substitution effects. His idea was that H in equation (1) measured the fraction of the lifetime that people worked and that, if leisure time is perfectly substitutable over time, the timing of labor supply is irrelevant and could be determined by the draw of a coin. Then a regression of labor force participation rates on W would identify a Hicks-Slutsky wage effect.36 Ben-Porath (1973) assumed instead that shocks were permanent, stable traits of individuals, interpreted labor force participation as a corner solution, and showed that a regression of labor force participation rates on wages would identify parameters from a distribution of tastes for work, and not the Hicks-Slutsky substitution effect (i.e., it would define the parameters of Pr(D = 1 | X) from eq. [9]).

This issue is also important in understanding employment and unemployment data. A frequently noted empirical regularity in the analysis of unemployment data is that those who were unemployed in the past or have worked in the past are more likely to be unemployed (or work) in the future. Is this due to a causal effect of being unemployed (or working), or is it a manifestation of a stable trait (e.g., some people are lazier than others and observables are persistent)? One theory of macroeconomics was built around the premise that promoting work through macro policies would foster higher levels of employment (Phelps 1972). The distinction between true and spurious effects is the distinction between true and spurious state dependence.

In a series of papers, I developed econometric models to use panel data to investigate these issues. One set of studies builds on the model

35 Heckman and Snyder (1997) consider the history of these ideas.

36 Heckman (1978c) provides a formal analysis and the relationship to the random utility model studied by McFadden (1974, 1981).


of equations (10a)–(10c) but places them in a life cycle setting. My work on life cycle labor supply (Heckman 1974b, 1976b) demonstrated that the marginal utility of wealth constant (Frisch) demand functions were the relevant concept for analyzing the evolution of labor supply over the life cycle in environments of perfect certainty or with complete contingent claims markets. Building on this work, MaCurdy and I (Heckman and MaCurdy 1980), drawing on Heckman (1974b) and thesis research by MaCurdy (1978, 1981), formulated and estimated a life cycle version of the model of equations (10) that interpreted one of the key unobservables in the model as the marginal utility of wealth, λ. In the economic settings we assumed, λ is a stable unobservable or fixed effect derived from economic theory. The models we developed extended, for the first time, models for limited dependent variables, systematically missing data, and joint continuous-discrete endogenous variables to a panel setting.37 Our evidence and my related joint work with Willis (Heckman and Willis 1977) suggest that a synthesis of the views of Ben-Porath and Mincer was appropriate, and a pure random utility specification was inappropriate. This framework has been extended to account for human capital and uncertainty in important papers by Altug and Miller (1990, 1998).
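Because λ enters as a person-specific fixed effect in this life cycle setting, a log-linearized Frisch labor supply equation can be estimated by differencing the fixed effect out of panel data. The sketch below uses an assumed log-linear specification and simulated data (the parameter names and values are illustrative, not from the paper), and it abstracts from the participation decision and missing wages that the actual analysis also handles.

```python
import numpy as np

rng = np.random.default_rng(6)
n_people, n_years = 5_000, 5
eta = 0.4                                   # assumed Frisch (lambda-constant) wage elasticity

# The marginal utility of wealth enters log hours as a person fixed effect f_i
f = rng.normal(0.0, 0.5, size=n_people)
log_w = rng.normal(2.0, 0.3, size=(n_people, n_years)) + 0.2 * f[:, None]
log_h = f[:, None] + eta * log_w + rng.normal(0.0, 0.2, size=(n_people, n_years))

# Pooled OLS of log hours on log wages confounds eta with the fixed effect,
# because wages are correlated with f (by construction here)
pooled = np.polyfit(log_w.ravel(), log_h.ravel(), 1)[0]

# First-differencing removes f_i; within-person wage growth identifies eta
d_w = np.diff(log_w, axis=1).ravel()
d_h = np.diff(log_h, axis=1).ravel()
frisch = np.polyfit(d_w, d_h, 1)[0]

print(f"assumed Frisch elasticity : {eta:.2f}")
print(f"pooled OLS estimate       : {pooled:.2f}")
print(f"first-difference estimate : {frisch:.2f}")
```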

In related work, I generalized the static cross-sectional models of discrete choice to a dynamic setting and used this generalization to address the problem of heterogeneity versus state dependence. This fundamental problem can be understood most simply by considering the following urn schemes (Heckman 1981a).

In the first scheme there are I individuals who possess urns with the same content of red and black balls. On T independent trials, individual i draws a ball and then puts it back in his or her urn. If a red ball is drawn at trial t, person i experiences the event (e.g., is employed, is unemployed, etc.). If a black ball is drawn, person i does not experience the event. This model corresponds to a simple Bernoulli model and captures the essential idea underlying the choice process in McFadden's (1974) work on discrete choice. From data generated by this urn scheme, one would not observe the empirical regularity that a person who experiences the event in the past is more likely to experience the event in the future. Irrespective of their event histories, all people have the same probability of experiencing the event.

A second urn scheme generates data that would give rise to a measured effect of past events on current events solely due to heterogeneity. In this model, individuals possess distinct urns that differ in their composition of red and black balls. As in the first model, sampling is done

37 Browning, Deaton, and Irish (1985) adapt this idea to repeated cross-section data using standard methods for analyzing synthetic cohorts.


with replacement. However, in contrast to the first model, information concerning an individual's past experience of the event provides information useful in locating the position of the individual in the population distribution of urn compositions.

The person's past record can be used to estimate the person-specific urn composition. The conditional probability that individual i experiences the event at time t is a function of his past experience of the event. The contents of each urn are unaffected by actual outcomes and in fact are constant. There is no true state dependence.

The third urn scheme generates data characterized by true state dependence. In this model, individuals start out with identical urns. On each trial, the contents of the urn change as a consequence of the outcome of the trial. For example, if a person draws a red ball and experiences the event, additional new red balls are added to his urn. Subsequent outcomes are affected by previous outcomes because the choice set for subsequent trials is altered as a consequence of experiencing the event.

A variant of the third urn scheme can be constructed that corresponds to a renewal model. In this scheme, new red balls are added to an individual's urn on successive drawings of red balls until a black ball is drawn, and then all of the red balls added as a result of the most recent continuous run of drawings of red balls are removed from the urn. The composition of the urn is then the same as it was before the first red ball in the run was drawn. A model corresponding to fixed costs of labor force entry is a variant of the renewal scheme in which new red balls are added to an individual's urn only on the first draw of a red ball in any run of red draws.

The crucial feature that distinguishes the third scheme from the second is that the contents of the urn (the choice set) are altered as a consequence of previous experience. The key point is not that the choice set changes across trials but that it changes in a way that depends on previous outcomes of the choice process. To clarify this point, it is useful to consider a fourth urn scheme that corresponds to models with more general types of heterogeneity, to be introduced more formally below.

In this model individuals start out with identical urns, exactly as in the first urn scheme. After each trial, but independent of the outcome of the trial, the contents of each person's urn are changed by discarding a randomly selected portion of balls and replacing the discarded balls with a randomly selected group of balls from a larger urn (say, with a very large number of balls of both colors). If the individual urns are not completely replenished in each trial, information about the outcomes of previous trials is useful in forecasting the outcomes of future trials, although the information from a previous trial declines with its remoteness in time. As in the second and third urn models,


previous outcomes give information about the contents of each urn. Unlike the second model, the fourth model is a scheme in which the information depreciates, since the contents of the urn are changed in a random fashion. In contrast to the third model, the contents of the urn do not change as a consequence of any outcome of the choice process.

In the literature on female labor force participation, models of extreme homogeneity (corresponding to urn model 1) and extreme heterogeneity (corresponding to urn model 2 with urns either all red or all black) are both consistent with the cross-sectional evidence. This is the contrast between Mincer (1962) and Ben-Porath (1973). Heckman and Willis (1977) estimate a model of heterogeneity in female labor force participation probabilities that is a probit analogue of urn model 2.

Urn model 3 is of special interest. It is consistent with human capital theory and other models that stress the impact of prior work experience on current work choices. Human capital investment acquired through on-the-job training may generate structural state dependence. Fixed costs incurred by labor force entrants may also generate structural state dependence as a renewal process. So may spell-specific human capital. This urn model is also consistent with psychological choice models in which, as a consequence of receiving a stimulus of work, women's preferences are altered so that labor force activity is reinforced (Atkinson, Bower, and Crothers 1965), or with economic models of habit formation.

Panel data can be used to discriminate among these models. For example, an implication of the second urn model is that the probability that a woman participates does not change with her labor force experience. An implication of the third model in the general case is that participation probabilities change with work experience. One method for discriminating between these two models utilizes individual labor force histories of sufficient length to estimate the probability of participation in different subintervals of the life cycle. If the estimated probabilities for a given woman do not differ at different stages of the life cycle, there is no evidence of structural state dependence.
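A small simulation makes the distinction concrete: under the second urn scheme, past participation predicts future participation in the cross section even though each person's probability never changes, while under the third scheme a person's own probability rises with her experience. The scheme parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, t = 20_000, 10

def simulate(scheme):
    """Return an n x t array of 0/1 events for urn scheme 2 or 3."""
    if scheme == 2:
        # Heterogeneity only: person-specific probabilities fixed over time
        p = rng.beta(2, 2, size=n)[:, None] * np.ones((n, t))
        return (rng.uniform(size=(n, t)) < p).astype(int)
    # Scheme 3: true state dependence; each event raises the next-period probability
    y = np.zeros((n, t), dtype=int)
    p = np.full(n, 0.5)
    for s in range(t):
        y[:, s] = rng.uniform(size=n) < p
        p = np.clip(0.5 + 0.05 * y[:, : s + 1].sum(axis=1), 0.0, 0.95)
    return y

for scheme in (2, 3):
    y = simulate(scheme)
    # Cross-sectional regularity: last period's event predicts this period's in both schemes
    p_after_1 = y[y[:, -2] == 1, -1].mean()
    p_after_0 = y[y[:, -2] == 0, -1].mean()
    # Within-person comparison: participation rates early vs. late in the panel
    early, late = y[:, : t // 2].mean(), y[:, t // 2 :].mean()
    print(f"scheme {scheme}: P(event_t | event_t-1) = {p_after_1:.2f} vs "
          f"{p_after_0:.2f}; early rate = {early:.2f}, late rate = {late:.2f}")
```

Only the within-person comparison across subintervals separates the schemes, which is why cross-sectional data alone cannot settle the issue.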

Heckman (1981b, 1981c) develops a class of discrete data stochastic processes that generalize the discrete choice model of McFadden (1974) to a dynamic setting. That setup is sufficiently general to test among all four urn schemes and to present a framework for dynamic discrete choice.38 Heckman and Singer (1986) present an explicit generalization of the McFadden model in which Weibull shocks arrive at Poisson arrival

38 Heckman and Willis (1973, 1976) present a prototype for this class of models in their analysis of fertility dynamics. Lillard and Willis (1978) apply this model to the analysis of earnings dynamics.


    times. Dagsvik (1994) presents a generalization to a continuous timeset