Have language skills always been so valuable? The low return to English fluency during the Age of Mass
Migration∗
Zachary Ward†
The Australian National University
February 2018
Abstract
English skills are highly valuable for today’s immigrants, but has this always been the case? We estimate the premium for English fluency and the rate of language acquisition in the early 20th century US using new linked data on over two hundred thousand immigrants. Compared with today’s immigrants, fewer early 20th century immigrants arrived with English proficiency, yet many acquired language skills rapidly after arrival. Learning to speak English was correlated with a small upgrade in occupational-based earnings (0 to 4.5 percent). Various empirical methods suggest that the English premium has more than doubled between 1910 and 2010, revealing that English fluency has become an increasingly large barrier to immigration over time.
JEL Classification: F22, J24, J61, J62, N31, N32
Keywords: English fluency, language, immigrant assimilation
∗This paper was previously circulated as “The Role of English Fluency in Migrant Assimilation: Evidence from United States History.” We would like to thank Brian Cadena, Ann Carlos, Katherine Eriksson, Dustin Frye, Tue Gorgens, Tim Hatton, Priti Kalsi, Ling-Yu Kong, Edward Kosack, Martine Mariotti, Xin Meng, Amber McKinney, Julie Moschion, John Tang and Jose Tessada for helpful pointers and discussions. We also thank the audience members at the Australian National University, the 2015 Australasian Cliometric Conference, Colby College, the 2016 EH-Clio conference at Pontificia Universidad Catolica, the 2015 Natural Experiments in History Workshop, the 2017 Society of Labor Economists Annual Conference, La Trobe University, the University of Adelaide, the University of Colorado, the University of Melbourne, and the University of Queensland. Many thanks go to Lee Alston who helped me to gain access to the full-count census data, Rohan Alexander who created the file to Americanize names, and to Rowena Gray who generously shared her data on tasks. All errors are my own.
†Email: [email protected], Research School of Economics, HW Arndt Building 25A, College of Business and Economics, The Australian National University, Canberra, ACT 2600, Australia.
1 Introduction
The value of English skills in the United States is estimated to be quite high − depending
on the methodology, going from speaking no English to speaking English very well leads to
a 33 to over 100 percent increase in income (Bleakley and Chin, 2004; Chiswick and Miller,
2014). This implies that gaining English fluency is one of the most important investments
that an immigrant can make. Further, the high value of English fluency is fundamental for
understanding one of the most-studied patterns in the immigration literature: the migrant
assimilation profile, as shown in Figure 1 (Chiswick, 1978; Borjas, 2015; Lubotsky, 2007).
Immigrants who arrive, often without the ability to speak English, earn much less than natives;
however, in the decades after arrival, immigrants acquire more human capital (such as English
fluency) and converge towards natives’ earnings.
Have language skills always been this valuable? In this paper, we turn to the early 20th
century and estimate the importance of English fluency during the Age of Mass Migration,
one of the key migration episodes in American history. There are reasons to suspect that the
English premium was lower in the early 20th century; first, the early 20th century assimilation
profile in Figure 1 suggests a low value of English skills. Immigrants who arrived one hundred
years ago initially held similarly skilled jobs to natives at arrival, potentially showing no
penalty despite arriving without English skills (Abramitzky, Boustan and Eriksson, 2014).1
Moreover, immigrants surprisingly had no improvement relative to natives after arrival,
possibly showing no benefit from acquiring English fluency. A second reason to suspect a
lower return to English fluency in the past is that job tasks were much less interactive relative
to today since the structure of the economy was dominated by agriculture and manufacturing
rather than services (Katz and Margo, 2014; Michaels et al., 2017). While these reasons
point to a lower return to English fluency, early 20th century officials stressed that English
1The early 20th century assimilation profile can only be estimated based on the skill content of immigrant occupations and not wages since wage income is not recorded until the 1940 United States Census. However, the same basic differences in assimilation patterns hold when using occupations in recent decades; for example, see Borjas (2015).
fluency was a primary determinant of immigrant economic outcomes, and argued that English
should be taught to new arrivals; indeed, this was one force driving the “Americanization”
movement, which led to many states enacting policy changes aimed to assimilate immigrants
into American culture (Lleras-Muney and Shertzer, 2015; Fouka, 2016).
One reason why those in the early 20th century claimed that English fluency was highly
valuable is that cross-sectional estimates showed that both English fluency and occupational
status increased rapidly with more years of stay, suggesting that immigrants learned to
speak English and upgraded their jobs (Jenks and Lauck, 1926). However, as is well known
today, cross-sectional estimates of immigrant outcomes are biased by cohort quality change
and selective return migration (Borjas, 1985; Dustmann and Gorlach, 2015; Lubotsky, 2007).
Panel data are needed to address these issues because they track the same individual over time;
indeed, using panel data is especially important in the context of the Age of Mass Migration
since the return flow was large (≥ 40% of inflows) and highly selective (Abramitzky, Boustan
and Eriksson, 2014; Bandiera, Rasul and Viarengo, 2013; Ward, 2017). Therefore, for immi-
grants who arrived between 1900 and 1919, we create new panel data using machine-learning
techniques based on Feigenbaum (2016) to estimate how many permanent immigrants were
able to speak English near arrival and the rate of acquisition after arrival. Ultimately, we
are able to follow over 200,000 immigrants from either 1910 to 1920 or 1920 to 1930.
We show that few pre-World War I arrivals from non-English-speaking sources came
with English skills (about 30 percent of recent arrivals in the 1910 Census). Yet while many
immigrants arrived without the ability to speak English, they had a high rate of language
acquisition: within ten years of arrival, more than 80 percent of immigrants were able to
speak some English. Importantly, this is found in the panel data, which shows that the fast
rate of acquisition is not biased by the selective return of those with lower English skills.
We also show with task-based measures that immigrants sorted into jobs that required
more communication tasks in the decades after arrival; however, after two decades of stay
immigrants still held jobs that required fewer communication skills than natives’ jobs.
The fact that many arrived without English skills contrasts with the lack of an occupational-based
earnings gap between immigrants and natives in Figure 1; moreover, the rapid increase
in English fluency after arrival also contrasts with the flat assimilation profile. This
suggests that arriving without English skills did not yield a large penalty; further, gaining
language human capital had little effect on improving immigrants’ relative position with
natives. We use a variety of empirical strategies to test whether this is true. Primarily, we
exploit the hundreds of thousands of immigrants in the panel data by estimating an individual fixed effect
model; this method eliminates time-invariant individual-specific unobservables that may be
correlated with English skills, such as ability. The analysis reveals that the relationship
between speaking English and occupational outcomes was weak in the early 20th century,
where those who gained English fluency had an associated increase in occupational-based
earnings of 0 to 4.5%; these results are consistent with a flat assimilation profile in the early
20th century.
Our main approach of using individual fixed-effects may yield an upper bound estimate
since acquiring English skills is endogenous and those who acquired English skills may have
acquired other unobservable skills. One may also be concerned that linking messy historical
data will lead to false links and bias our estimate (Bailey et al., 2017). Using an alternative
empirical strategy with non-linked data − instrumenting for English fluency with the inter-
action of age at arrival and arriving from a non-English-speaking source as in Bleakley and
Chin (2004) − leads to the same qualitative conclusion that language skills were relatively
unimportant for upgrading jobs in the early 20th century compared with the high return to
English skills for immigrants in recent decades.
Since then and over the past 100 years, the gap between immigrants’ and natives’ occu-
pations has gradually widened; one reason may be that the English premium has increased
or that fewer recent migrants arrive with the ability to speak English. Using the 2000 Census
and 2008 to 2012 ACS, we provide suggestive evidence that more recent immigrants arrived
with basic English skills than early 20th century arrivals (74 percent v. 30 percent).2 Al-
though we are not able to recreate our preferred individual fixed effects methodology with
recent data, we also show with OLS that the occupational premium for English skills is about
18.4 percent, more than double the premium from the early 20th century when estimated
with a similar methodology.3 Age-at-arrival analysis is also consistent with a large increase
to the English premium over time. Thus, the widening gap between natives and
immigrants over the last 100 years does not appear to stem from a declining fraction of English speakers;
rather, the evidence suggests that the penalty for arriving without English fluency has increased over the past
century.
Our paper draws from the large literature on the relationship between the value of human
capital and technology. We argue that the return to a specific piece of human capital −
the ability to speak English − was low in the early 20th century and has increased over
time, relating to others who have shown an increased return to education over the second
half of the 20th century (Acemoglu and Autor, 2011; Goldin and Katz, 2008). A common
interpretation of the increasing return to human capital is skill-biased technical
change, where demand shifts that complement high-skilled work have outpaced the relative
increase in high-skilled workers. We argue that the increasing value of English skills may also
reflect technology-driven shifts; for instance, others have shown that there was an increase in
demand for those with people skills in the early 20th century, an increase in the amount of
interaction in jobs over the 20th century and an increase in the value of social skills in recent
decades (Deming, 2017; Gray, 2013; Michaels, Rauch and Redding, 2017). Therefore, the
technological setting is key for understanding the relative performance of immigrants and
2The early 20th century data was recorded by an enumerator as a binary variable (0=cannot speak English, 1=can speak English). The late 20th century data was self-reported on a more qualitative scale (0=speaks no English, 1=speaks English not well, 2=speaks English well, 3=speaks English very well). The results in this paragraph treat people who report any English skills (values 1-3) in the late 20th century as able to speak English.
3We cannot use the individual fixed-effects methodology with recent census data since it is not linked over time. Instead, these results are from a simple OLS regression like that reported in Chiswick and Miller (2014). We use a similar strategy to impute earnings by occupation in the early 20th century and early 21st century data for this estimate.
natives in the early 20th century, just as it is for understanding the outcomes of immigrants
in more recent decades (Lalonde and Topel, 1992; Lubotsky, 2011; Perlmann, 2005).
Our paper also relates to a fast-growing literature on the Age of Mass Migration using
newly digitized sources to study the major immigration questions, such as the selection of
immigrants (Abramitzky, Boustan and Eriksson, 2013; Spitzer and Zimran, 2017) and the
effects of immigration in the short run and long run (Ager and Hansen, 2017; Sequeira, Nunn
and Qian, 2017; Tabellini, 2017). Our paper connects with the assimilation literature, where
others have shown that cultural assimilation in terms of having more American-sounding
names leads to a higher level of income and occupational standing for the first and sec-
ond generation (Abramitzky, Boustan and Eriksson, 2017; Biavaschi, Giulietti and Siddique,
2017). We complement these results by showing that cultural assimilation in terms of ac-
quiring language skills was quite fast; however, this did not lead to substantial economic
assimilation − at least relative to today’s estimates of the English premium. These results
help to explain why early 20th century language policy that enforced English-only
instruction did little to improve immigrant economic outcomes (Lleras-Muney and Shertzer,
2015).
2 Historical Context
Since 1850, there has been a secular decline in the percentage of immigrants from English-
speaking sources (see Figure 2). For example, about 70 percent of the immigrant stock in
1850 were from England or Ireland but soon non-English-speaking countries such as Ger-
many and Norway became major senders to the United States; by 1880, the percentage from
English-speaking sources had dropped to 50 percent.4 The 1880s marked a turning point, as the
geographical composition of the flow shifted toward lower-income countries in Southern and Eastern
Europe; by 1910, only 30 percent of the immigrant stock came from an English-speaking source.
4The period when Europeans came to the United States is referred to as the Age of Mass Migration (1850-1913), which has been studied extensively, most notably by Hatton and Williamson (1998). See Abramitzky and Boustan (2017) for a recent overview of the literature on historical immigration to the United States.
Fewer of the newer arrivals came with English fluency, and they appeared to pick up English at
slower rates than prior European arrivals: for example, the share of the immigrant stock able to speak English
decreased rapidly between 1900 (82%) and 1910 (70%).5
The worsening perceived quality of Southern and Eastern European immigrants led to a
severe nativist backlash against the country’s open immigration policies, with immigrants’
decreasing ability to speak English a particularly salient feature.6 This view was also held
by some of the most prominent academics studying immigration at the time: after analyzing
data on thousands of immigrants from the Dillingham Commission, Jeremiah Jenks and W.
Jett Lauck argued that “the greatest obstacle to a more rapid [assimilation] is that the recent
immigrant cannot speak English” (Jenks and Lauck, 1926; pg 269). They concluded that
“progress in industry, in business, in the trades and professions and in the accumulation
of property, are all primarily a result of the development in the recent immigrant popula-
tion having an English-speaking ability” (pg 293). These statements were reflective of the
“Americanization” fervor during the 1910s and 1920s, a movement focused on assimilating
immigrants through language instruction for children and adults (Bloch, 1920; Lleras-Muney
and Shertzer, 2015).
Despite the heightened focus on English during this time period, the importance of
English fluency for occupational upgrading remains unknown.7 Jenks and Lauck (1926)
argued for the importance of English by showing the positive associations of staying longer in
the United States, English proficiency, and skill, suggesting that those who stayed longer were
5These results are based on a sample of foreign-born individuals over the age of 16, which was drawn from IPUMS. See Figure A1 for the rate of English fluency across the 20th century.
6Many began to lobby the government to maintain the “national origins” of the American population, which culminated in the Immigration Quota Acts of 1921 and 1924. Prior to this in 1906, English fluency was added as a requirement for citizenship.
7There are a few studies that estimate the association between English proficiency and occupational outcomes (Blau, 1980; Lleras-Muney and Shertzer, 2015). Most notably, Jasso and Rosenzweig (1989, 1980) estimate the premium of English and acquisition rates in 1900 for Germans and 1980 for Mexicans. They find a larger “return” for speaking English in the early 20th century, but the interpretation across time is unclear because the estimated return in 1900 is based on occupational prestige while the return in 1980 is estimated for hourly wages. We will show results across time using a similar measure of occupational standing. An exception to these studies on the occupational return in the early 20th century is Inwood et al. (2016), who show that the association between English proficiency and wages increased between 1911 and 1931.
able to learn English and upgrade their occupation. However, as is well known today, these
results could simply reflect selective return migration of low-skilled non-English-speaking
immigrants.8 We improve on Jenks and Lauck’s original study and more recent research on
English acquisition during this time period (e.g. Vigdor, 2010; Kuziemko and Ferrie, 2014;
Jasso and Rosenzweig, 1989) by creating panel data, reducing bias from selective return
migration.
3 Data
3.1 Measuring English Skills in the Early 20th Century
To estimate the value of English skills in the early 20th century, we use the restricted-access
full-count 1910 to 1930 Censuses from IPUMS, which we accessed at the NBER (Ruggles et
al., 2017).9 Unlike mailed census forms in recent decades, the early 20th century censuses
were taken by enumerators from door to door, and thus the ability to speak English (“Yes”
or “No”) was a judgment by the enumerator rather than self-reported as in recent Census
data. Enumerators did not have an explicit cut-off point for whether a respondent was able
to speak English between 1910 and 1930, leading to a familiar problem of measurement error
in language studies (Bleakley and Chin, 2004; Dustmann and Van Soest, 2001).10 Moreover,
8See Abramitzky et al. (2014) and Borjas (1985) for a discussion of this problem of estimating the rate of occupational upgrading with repeated cross sections. Kuziemko and Ferrie (2014) show a similar set of correlations as Jenks and Lauck (1926) between length of stay, English fluency and skill using micro-data from IPUMS. See Vigdor (2010) for a nice discussion of the English acquisition of immigrants across the early and late 20th century; however, as is noted by the author, the results also may be biased by selective emigration.
9While English skills were measured in the 1890 and 1900 Census, we focus our attention on the 1910-1930 Censuses for a few reasons. First, the 1890 micro-data was lost in a fire. Second, there may be more measurement error in the 1900 Census variable (Stevens, 1999). In 1900, three census questions under the broad heading of ‘Education’ were asked in a row: whether an individual could read, could write, and could speak English. The Census Bureau noted that some census takers simply recorded ‘yes’ or ‘no’ three times in a row - this problem was discovered as it appeared that black individuals had low rates of English proficiency when they likely only had low literacy rates (Census Bureau, 1913, page 1265). By 1910, the census sheets were corrected so as to not have the questions in order. English proficiency was not asked of immigrants again until 1980.
10The 1890 Census gave instructions to record English fluency based on whether an immigrant was “able to speak English so as to be understood in ordinary conversation” − a higher bar than simply knowing a
it is unclear whether an immigrant was responding to the enumerator questions in person, or
whether it was another member of the household answering for the immigrant; unfortunately,
who was responding to the enumerator is not included in the data.
Given issues with measurement, one must first ask whether the English variable from the
1910 to 1930 Censuses actually reflects true English skills. A straightforward test of this is
to estimate whether the English variable follows well-known age-at-arrival patterns, where
older arrivals are less likely to speak English as adults compared with younger arrivals. This
pattern is thought to be related to neurobiological changes in the brain prior to and during
puberty, which make it more difficult to acquire a second language (Singleton, 1999). There-
fore, we estimate the age-at-arrival profile by regressing English ability on age at arrival
and other controls such as country of birth, age, sex and fraction of immigrants from their
own country of birth in the county. To provide an idea of how these age-at-arrival patterns
look for the more well-known English variable from recent years, we separately estimate the
age-at-arrival patterns for the pooled 1900-1930 Census and the pooled 2000 Census and
2008-2012 ACS.11
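As a rough illustration of what an age-at-arrival profile measures, the sketch below computes the share recorded as speaking English at each age at arrival. It assumes no other controls, in which case the coefficients on a full set of age-at-arrival indicators reduce to these group means; the regressions described above add controls, so this is not the estimation used here. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def age_at_arrival_profile(records):
    """Share recorded as speaking English, by age at arrival.

    records: iterable of (age_at_arrival, speaks_english) pairs, where
    speaks_english is 0 or 1. With no other controls, these group means
    equal the coefficients on a full set of age-at-arrival indicators.
    """
    totals = defaultdict(int)
    speakers = defaultdict(int)
    for age, speaks in records:
        totals[age] += 1
        speakers[age] += speaks
    return {age: speakers[age] / totals[age] for age in sorted(totals)}


# Made-up toy records for illustration only (not census data).
toy_records = [(5, 1), (5, 1), (10, 1), (10, 0), (16, 0), (16, 1), (16, 0), (16, 0)]
profile = age_at_arrival_profile(toy_records)
```

A downward-sloping `profile` in real data would correspond to the pattern plotted in Figure 3.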
We plot the estimated age-at-arrival fixed effects in Figure 3 for both the early 20th
century data and the early 21st century data. Fortunately, the early 20th century data
shows a similarly sloped age-at-arrival pattern where older arrivals were less likely to speak
English as an adult compared with younger arrivals, affirming that the measure reflects
some level of English proficiency. The age-at-arrival patterns in the early 20th century are
strikingly similar to the patterns from the early 21st century when one codes the English
variable as whether one is able to speak any English, whether not well, well or very well.12
few words. However, this guidance was not in the instructions for later Censuses.
11The sample for this regression consists of all immigrants from non-English-speaking countries aged 17 to 55 and those who arrived under the age of 17. The regression controls for age, country of birth, sex, cohort of arrival, year and fraction of county from same country of birth. We do not use the 1980 and 1990 Census because they do not record a specific year of arrival, making it impossible to back out a precise age at arrival.
12On the other hand, if one follows the more common method where those who speak English “not well” are instead placed in the unable to speak English group (Chiswick and Miller, 2014), then a 17-year-old arrival would be 30 percentage points less likely to speak English − much too steep of a decline relative to the early 20th century data.
Therefore, we interpret the English variable in the early 20th century as reflective of basic
English skills, where there was a low bar to clear for being recorded as able to speak English.
Further, whenever we compare English proficiency across the early and late 20th century, we
will make the assumption that those who self-reported any English ability in the late 20th
century had a similar level of skill as those who were recorded as able to speak English in
the early 20th century. Of course, this assumption is untestable, so comparisons across time
are only suggestive.
3.2 Building New Linked Data
With this measure of English ability, we aim to estimate how many immigrants arrived
with English skills, the rate at which immigrants learned to speak English after arrival, and
the return to speaking English. To answer the research question on the speed of language
acquisition, we need to create a panel that tracks individual immigrants’ ability to speak
English over time; this is an alternative to the method of using repeated cross sections, which
suffers from the well-documented problem of selective return migration (see Abramitzky et
al. (2014) for a discussion of this bias).
Therefore, we build a new panel to fix any bias that arises from selective out-migration.
To do this, we take Europeans first observed in the 1910 full-count Census and the
1920 full-count Census, and then link them ten years later to the full-count 1920 and 1930
censuses, respectively.13 For each of the base samples, we do not link forward the entire set
of European immigrants. Primarily, we drop immigrants from English-speaking countries
such as England and Ireland because we are interested in how non-native speakers acquired
human capital after arrival. Second, we only keep immigrants who arrived within the past
ten years in order to track specific immigrant cohorts over time.14 Third, we drop immigrants
who were under the age of 10 at first observation because they were not asked about their
13Note that we do not use the full-count 1900 Census because the English variable has yet to be digitized.
14I drop those who arrived in the same year as the Census (e.g. 1910 arrivals in 1910) since the Census does not cover the entire year of arrivals.
ability to speak English; we also drop those who are older than 40 to ensure that no one
would be older than 50 ten years later − this is to reduce bias from death. Note that when
we estimate the occupational return to acquiring English fluency, we will drop those without
reported occupations, which primarily drops children.
To link the data across years, we find similar matches based on first name, last name, year
of birth, country of birth and year of arrival. To find the best match, we follow the method
outlined by Feigenbaum (2016) and first hand-link 15 random samples of 2,000 immigrants
each from the 1920 to the 1930 Census. The 15 random samples are from 15 different language
groups, such as German, Italian, Polish and Dutch; see Appendix B for more detail on
this and the overall linking process. After hand-linking these individuals to form a set of
training data, we estimate a probit to find the best match, relying on observables such as the
closeness in year of birth, year of arrival, and string distance for first and last name.15 We are
particularly concerned with falsely matching an immigrant to a wrong individual and keeping
them in the dataset (Bailey et al., 2017); this is because we will use an individual fixed effects
methodology to estimate the return to English skills, which may lead to attenuation bias if
we have a high level of false positives.
Therefore, we take a very conservative approach and only keep immigrants if they have a
high predicted probability of being a true match from the probit, and have no close second
matches. Making this decision to reduce false positives necessarily lowers the efficiency of
finding matches because few clear both bars of a high probability of a link and no close
second match; however, we believe this is a worthwhile trade off since we are linking full-
count censuses and can afford a much lower linking rate. Based on the predicted probabilities
from the training data, we are able to track 96,400 males from 1910 to 1920, and 108,590
males from 1920 to 1930. Indeed, the backward linking rates for 1910 to 1920 (4%)
and 1920 to 1930 (5%) are lower than those of others who link immigrants in the literature with
15We are also concerned about immigrants who Americanize their first name (Biavaschi et al., 2017), so we link based on an Americanized version of each first name based on information from behindthename.com.
automated methods (around 15-25 percent) and less than in our training data (25 percent)
(Catron, 2017). This demonstrates that efficiently modeling the hand-linking process for
immigrants is more difficult than modeling the link between the 1915 Iowa and 1940 Census
as in Feigenbaum (2016). Despite the low linking rate, we still end with over 200,000 linked
immigrants in our sample. One may be concerned with the low linking rate; we have also
linked immigrants using more traditional methods related to Abramitzky et al. (2014) with
about a 15-25 percent linking rate, and find the same qualitative results that English language
acquisition was fast and that acquiring English fluency was associated with a small upgrade
in occupation.
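The conservative rule described above, keeping a link only when the best candidate has a high predicted probability and no close second candidate exists, can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold values `min_prob` and `min_gap` are hypothetical, the closeness of the second match is measured here as a simple difference in predicted probabilities, and the probabilities are taken as given (for example, from a fitted probit). It is not the procedure's actual implementation.

```python
def select_link(candidates, min_prob=0.9, min_gap=0.2):
    """Return the id of the accepted match, or None if no link is kept.

    candidates: list of (candidate_id, predicted_probability) pairs for one
    base-year immigrant. min_prob and min_gap are hypothetical thresholds
    chosen for illustration, not values from the paper.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_p = ranked[0]
    if best_p < min_prob:
        return None  # fails the "high predicted probability" bar
    if len(ranked) > 1 and best_p - ranked[1][1] < min_gap:
        return None  # fails the "no close second match" bar
    return best_id


# One clear match is kept; an ambiguous pair is dropped.
clear = select_link([("a", 0.95), ("b", 0.30)])
ambiguous = select_link([("a", 0.95), ("b", 0.90)])
```

Raising either threshold trades a lower linking rate for fewer false positives, which is the trade-off discussed in the text.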
While the linked datasets solve the problem of selective return migration, there are also a
few limitations. Primarily, linked datasets are non-random, as individuals with very common
names, those who died or those who changed their name (e.g. females after marriage) cannot
be linked forward. We are particularly concerned that a successful link is related to better
English proficiency; if so, then we would mistakenly infer that permanent immigrants had
better English skills at arrival when it would actually just reflect a bias from the linking
process. To gauge the representativeness of the sample, we compare an arrival cohort in
the IPUMS random sample to the same cohort in the linked sample in the second year of
observation (e.g., 1920 Census for the 1910 to 1920 linked sample). The linked and cross-
sectional sample should contain the same information since each immigrant has stayed in
the United States for 11 to 20 years; however, while the IPUMS cross section is random, the
linked sample may not be.
The representativeness of each linked sample is shown in Table 1. The linked samples
are indeed biased; they contain immigrants with higher English-speaking ability, by 2.2 to
3.4 percentage points. This is likely related to differences in linking by country of birth,
where we are much more likely to link someone from Northern and Western Europe relative to
Southern and Eastern Europe; this pattern is common in studies that link immigrants (e.g.
Abramitzky et al., 2014). We are also more likely to link farmers rather than laborers or
low-skilled service workers. One may be concerned that our linking strategy relies on very
unusual names that may not be representative of immigrants in general, so we test whether
the names of individuals have a different amount of “foreignness” according to the index used
by Abramitzky et al. (2017); however, we do not find that our sample has names that are
especially more or less foreign-sounding.16 Due to the differences in representativeness, we
re-weight our panel to be representative on ability to speak English, literacy, occupational
categories, and country of birth.17 We use our weighted sample for the rest of the analysis.
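One simple way to reweight a linked sample toward a random sample is cell-based weighting, where each linked observation receives the ratio of its cell's share in the random sample to its share in the linked sample. The sketch below illustrates that idea only; the exact reweighting procedure is not spelled out here, so the cell definition, the function name and the toy counts are all assumptions.

```python
from collections import Counter


def cell_weights(linked_cells, reference_cells):
    """Weight per cell = (share in reference sample) / (share in linked sample).

    Each element is a hashable cell label, e.g. a tuple such as
    (speaks_english, literate, occupation_group, country_of_birth).
    Cells absent from the linked sample cannot be reweighted and are omitted.
    """
    linked = Counter(linked_cells)
    reference = Counter(reference_cells)
    n_linked = sum(linked.values())
    n_reference = sum(reference.values())
    return {
        cell: (reference[cell] / n_reference) / (count / n_linked)
        for cell, count in linked.items()
    }


# Toy example: the linked sample over-represents English speakers (90% vs 80%).
linked_sample = ["speaks"] * 9 + ["does_not"] * 1
random_sample = ["speaks"] * 8 + ["does_not"] * 2
weights = cell_weights(linked_sample, random_sample)

# Weighted share of English speakers in the linked sample after reweighting.
weighted_total = sum(weights[c] for c in linked_sample)
weighted_share = sum(weights[c] for c in linked_sample if c == "speaks") / weighted_total
```

In this toy case the weighted share of English speakers in the linked sample matches the random sample's 80 percent.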
4 English Fluency Rates
4.1 Speed of English Acquisition
We estimate the rate of English acquisition for arrivals between 1900 and 1919 using the
following flexible form:
$$\text{SpeakEnglish}_{ict} = \phi_c + \mu_{t-c} + \Pi' X_{it} + \varepsilon_{it} \qquad (1)$$

The ability of individual $i$ from arrival cohort $c$ to speak English in census year $t$ is modeled as
a non-linear function of years in the United States ($\mu_{t-c}$), incorporated as fixed effects for
every two years (e.g. 0 to 1 years, 2 to 3 years, etc.). This parameterization captures the
quick acquisition of English within the first ten years of stay and a leveling off in the second
ten years. We also estimate arrival cohort fixed effects ($\phi_c$) for every five-year entry cohort
to capture changes in cohort quality in terms of English-speaking ability (e.g. 1900-1904
arrivals, 1905-1909 arrivals, etc.). In various regressions we include control variables
16The foreignness index ranges from 0 to 100 and measures how prevalent a given first or last name is among foreigners relative to natives. A measure close to 100 indicates that it is more foreign.
17We reweight to match the random sample means. We have alternatively used the inverse-proportional weighting method proposed by Bailey et al. (2017). Using inverse-proportional weights does not qualitatively change our estimated association between English acquisition and occupational upgrading, but does lead English fluency rates in our linked sample to be slightly higher than English rates in the random sample. Therefore, we prefer our method of weighting to pin down the linked sample’s English levels to the random sample’s English levels.
in Xit such as the country of birth and age at arrival. To capture potential biases from
selective return migration, we run the regression twice, once with the panel data and once
with repeated cross sections.
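The fixed-effect cells in Equation (1) can be sketched as follows. This is a minimal illustration of the binning only (two-year bins for years in the United States and five-year bins for arrival year); the function names are hypothetical and the regression itself is not shown.

```python
def stay_bin(years_in_us):
    """Two-year bin for years since arrival: 0-1 -> (0, 1), 2-3 -> (2, 3)."""
    start = (years_in_us // 2) * 2
    return (start, start + 1)


def arrival_cohort(arrival_year):
    """Five-year entry cohort: 1900-1904 -> (1900, 1904), and so on."""
    start = (arrival_year // 5) * 5
    return (start, start + 4)


def fixed_effect_cells(census_year, arrival_year):
    """Return the (cohort, years-in-US) cells an observation falls into."""
    return arrival_cohort(arrival_year), stay_bin(census_year - arrival_year)


# An immigrant who arrived in 1903 and is observed in the 1910 census.
example = fixed_effect_cells(1910, 1903)
```

Each distinct cell then enters the regression as its own indicator variable, which is what allows the acquisition profile to be non-linear in years of stay.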
The estimated rate of acquisition is shown in Figure 4 for the 1900-04 cohort. The
regression estimates that 30 percent of arrivals in the panel data, or those who stayed at
least ten years, knew how to speak English within the first year of arrival. After this low
start, the rate of English proficiency increased rapidly within ten years of arrival: for those
who had stayed ten to eleven years, 80 percent of immigrants were able to speak English,
50 percentage points higher than arrivals in their first year. This estimate is reasonable as
second language acquisition can take only a couple of years, and we interpret the English
variable as reflective of basic English skills (Krashen et al., 1979). After ten years of stay, the
rate of acquisition leveled off as most immigrants knew how to speak some English; however,
even after 20 years of stay, about 10% of immigrants were still unable to speak English.
The estimated rate of acquisition from the repeated cross sections is also shown in
Figure 4. The repeated cross section would estimate a higher rate of English acquisition
since arrivals start at a lower percentage of 19%, but end at the same 90% fluency after
twenty years. This is consistent with the arguments by Lubotsky (2007) and Abramitzky
et al. (2014) that repeated cross-sections tend to overestimate improvements in immigrants’
attributes because of negatively selected return migration. In this case, immigrants with
worse English proficiency at arrival tended to return at higher rates.
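A toy calculation can illustrate why negatively selected return migration makes repeated cross sections overstate acquisition. The cohort below is hypothetical: the starting shares are chosen to mimic the 19% and 30% figures, but the split between stayers and leavers is invented for illustration.

```python
def share_speaking(groups):
    """Share of a set of groups that spoke English at arrival."""
    total = sum(g["n"] for g in groups)
    speakers = sum(g["n"] for g in groups if g["speaks_at_arrival"])
    return speakers / total


# Hypothetical arrival cohort of 100 immigrants.
arrivals = [
    {"n": 19, "speaks_at_arrival": True, "stays": True},
    {"n": 44, "speaks_at_arrival": False, "stays": True},
    {"n": 37, "speaks_at_arrival": False, "stays": False},
]

# A cross section taken near arrival still includes future return migrants.
cross_section_start = share_speaking(arrivals)

# A panel only contains those who stayed until the next census.
panel_start = share_speaking([g for g in arrivals if g["stays"]])

# Suppose both series end at the same fluency level among stayers.
end_level = 0.90
cross_section_gain = end_level - cross_section_start
panel_gain = end_level - panel_start
```

Because the leavers in this example did not speak English, the cross section starts lower than the panel and shows a larger apparent gain, even though no stayer learned any faster.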
How does this speed of acquisition compare across the earlier and more recent immigrant
cohorts? In Figure 4, we also plot the mean English fluency over time of the 1990-94 arrival
cohort, estimated in the same way as the early 20th century using repeated cross sections,
but this time pooling immigrants from non-English-speaking sources in the 2000 Census
and 2008-2012 ACS.18 The figure shows two main conclusions. First, more arrivals from
non-English-speaking countries in 1990 had a basic level of English fluency at arrival than
18 Note that with the 1990s cohort, we code an observation as able to speak English if they spoke English not well, well or very well.
immigrants from the early 20th century cross section (74% vs. 19%). This could be due to the
spread of English internationally compared with the beginning of the 20th century or visa
restrictions requiring some level of English proficiency. In contrast, for the pre-World War
I arrivals, there was nothing stopping individuals from freely arriving within a 2-week trip
from Europe.
Another conclusion from Figure 4 is that immigrants during the Age of Mass Migration
acquired English fluency at fast rates after arrival; they nearly caught up to the 1990s arrivals
within 15 to 20 years of stay when over 90% of immigrants could speak some English.
Unfortunately, selective out-migration cannot be corrected for in the late 20th century
data, but if return migrants were negatively selected on English ability as they were on
income (Lubotsky, 2007), then return migration would not overturn these results that late
20th century immigrants arrived with higher levels of English fluency.19
The other main benefit from using panel or repeated cross sections is that one can estimate
how cohorts changed in their ability to speak English near arrival. In the appendix, we show
that subsequent arrival cohorts increased their levels of English fluency at arrival after the
1900-1904 cohort. In particular, the beginning of World War I led to a rapid increase
in arrivals’ English fluency, likely because few could cross the Atlantic due to restricted
shipping; also, the country of birth mix may have shifted toward countries with higher levels
of English proficiency at arrival.
Indeed, English fluency rates at arrival depended strongly on the immigrant’s origin.
In Figure 5 we show English fluency levels by language of origin, as proxied by mother’s
tongue. The figure is sorted by English fluency at arrival, where the leftmost ethnicities
arrived with the lowest levels of English skills. Southern and Eastern European ethnicities
19 This evidence is only suggestive because the variables do not match precisely. For example, as opposed to coding the post-1980 self-reported English proficiency of “not well” as “able to speak English,” we recode it as “unable to speak English.” When one does this, the results that late 20th century immigrants have a slower rate of acquisition remain, but the results on starting levels differ depending on the matching of English variables. In fact, this way of merging variables suggests that the early 20th century immigrants assimilated much more quickly than recent immigrants in terms of English skills. However, based on the age-at-arrival/English fluency profiles across time, we do not believe this is the best way to match the data.
such as Poles, Romanians and Greeks all dominate the left-hand side, where 20% to 30%
of immigrants were able to speak English within one year of arrival. These fluency rates
for Eastern and Southern Europeans are often lower when compared with immigrants from
Northern and Western Europe; the Dutch, Norwegians, Germans and Danish all had higher
levels of fluency at arrival, from 50 to 70%. After fifteen plus years in the United States,
most ethnicities had over 90 percent of their group as able to speak English.
5 The English Premium
5.1 The English Premium in the Early 20th Century
Immigrants acquired basic English skills at relatively fast rates in the early 20th century;
this could reflect that English was highly valuable for improving outcomes. In this section,
we estimate the English premium between 1910 and 1930. Estimating the premium for
English has straightforward econometric issues: primarily, the ability to speak English could
be correlated with an unobserved omitted variable that could positively bias the estimate.
Instead, we leverage the panel features of the linked dataset to estimate the association
between upgrading one’s occupation and learning to speak English. Note that we only aim
to estimate the association while reducing the threat of unobservables; another method, aimed
at providing exogenous variation in English ability, will be discussed later.
We estimate the effect of English on occupation instead of wages because wage data are
first available in the 1940 Census. To explore the effect on occupation, we first group
immigrants into six occupational categories: high-skilled white collar (e.g. managers and
doctors), medium-skilled white collar (e.g., salesmen and clerks), semi-skilled workers (e.g.
craftsmen), farmers, low-skilled service/manual workers (e.g. waiters and operatives), and
laborers.20 Later we will assign occupational scores for each occupation to provide an esti-
20 These are coded based on the first digit of the occ1950 variable from IPUMS. High-skilled white collar are occupations that start with 0 or 2, farmers start with 1, medium-skilled white collar start with 3 or 4, semi-skilled start with 5, low-skilled service and manual workers start with 6 or 7, and laborers start with 8
mate of the return to English for earnings, but here we provide descriptive evidence based
on occupational categories.
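The grouping rule described in footnote 20 can be sketched as a lookup on the first digit of the three-digit occ1950 code. The function name is hypothetical, and the cutoff used to flag non-occupational responses (codes of 980 and above) is an assumption of this sketch rather than something stated in the text.

```python
# First digit of the zero-padded occ1950 code -> occupational group.
GROUP_BY_FIRST_DIGIT = {
    "0": "high-skilled white collar",
    "2": "high-skilled white collar",
    "1": "farmers",
    "3": "medium-skilled white collar",
    "4": "medium-skilled white collar",
    "5": "semi-skilled",
    "6": "low-skilled service/manual",
    "7": "low-skilled service/manual",
    "8": "laborers",
    "9": "laborers",
}


def occupation_group(occ1950, non_occupational_from=980):
    """Map an occ1950 code to one of six groups.

    Returns None for codes treated as non-occupational, which are dropped.
    The default cutoff of 980 is an assumption made for this sketch.
    """
    if occ1950 < 0 or occ1950 >= non_occupational_from:
        return None
    return GROUP_BY_FIRST_DIGIT[f"{occ1950:03d}"[0]]


groups = [occupation_group(code) for code in (75, 100, 290, 490, 595, 690, 970, 995)]
```

Zero-padding matters here: a code such as 75 must be read as 075 so that its first digit is 0 rather than 7.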
We estimate the rate at which one changes occupations in the following linear probability
model:
\[ \text{OccupationGroup}_{it} = \gamma_0 + \gamma_1 \text{SpeaksEnglish}_{it} + \phi_i + \Pi' X_{it} + \varepsilon_{it} \qquad (2) \]
The dependent variable is a zero / one variable for whether one belongs to one of six occu-
pational groups. We run the regression six times − once for each of the groups in order to
estimate how learning to speak English affects the net flow into or out of an occupational
group. After controlling for individual fixed effects ϕi, the coefficient γ1 will produce an
estimate of the effect of English while accounting for numerous unobservable factors that are
constant within an individual i, including unobserved ability. This essentially estimates the
extra movement into an occupational group (or for occupational score in a later regression)
for those who acquired English skills relative to those who knew how to speak English at
first observation and those who never learned how to speak English.21
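With only two observations per person, an individual fixed effects regression with a year effect is equivalent to regressing the change in the outcome on the change in English ability with an intercept. The sketch below illustrates that first-difference logic on hypothetical data; it omits the controls in Xit and the sample weights, and the function name is not from the paper.

```python
def first_difference_slope(panel):
    """Slope from regressing the change in the outcome on the change in
    English ability, with an intercept absorbing the common time shift.

    panel: list of (y_first, y_second, english_first, english_second).
    """
    dy = [y2 - y1 for y1, y2, _, _ in panel]
    dx = [e2 - e1 for _, _, e1, e2 in panel]
    mean_dy = sum(dy) / len(dy)
    mean_dx = sum(dx) / len(dx)
    cov = sum((a - mean_dx) * (b - mean_dy) for a, b in zip(dx, dy))
    var = sum((a - mean_dx) ** 2 for a in dx)
    return cov / var


# Hypothetical data; the outcome is an indicator for holding a laborer job.
toy_panel = [
    # switchers: learned English between censuses
    (1, 0, 0, 1), (1, 0, 0, 1), (1, 1, 0, 1), (0, 0, 0, 1),
    # always spoke English
    (1, 0, 1, 1), (0, 0, 1, 1),
    # never learned English
    (1, 1, 0, 0), (1, 1, 0, 0),
]

gamma_1 = first_difference_slope(toy_panel)  # -0.25
```

In the toy data, half of the switchers leave laborer jobs compared with a quarter of the non-switchers, so the estimate is the difference between the two groups' average changes, or minus 25 percentage points.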
For controls, we include the year of observation, which accounts for the average shift in
occupational group over time as identified by those who either do not acquire English skills
or those who had already acquired English skills. We further interact year with age at first
observation (grouped into five-year intervals) to allow for job switching to vary by points in
the life cycle. We also interact years in the United States at initial observation, grouped into
two-year intervals, with year for the same reason. Finally, we include controls for literacy,
logged population in a county and fraction of immigrants from the same birthplace in county
to account for changes in general human capital, size of network, and population density.
or 9. However, any non-occupational response is dropped from the dataset.
21 See Table A1 for descriptive statistics of the three groups of those who always knew how to speak English, those who never knew how to speak English and switchers. Those who always knew how to speak English were more skilled at arrival than switchers while those who never learned how to speak English were less skilled. The return to English skills is estimated to be low if one compares switchers to only those who never learned, or switchers to only those who always knew how to speak English.
We drop individuals for whom we do not observe jobs in both censuses; this mostly leads
to dropping children. In all regressions we calculate the standard errors by clustering on
country of birth.
The results are shown in Table 2. The table is split into two panels, one for the years
1910 to 1920 (or immigrants who arrived between 1900 and 1909), and one for years 1920
to 1930 − though results are similar regardless of which dataset is used. The results show
that between censuses those who learned to speak English were much less likely to hold a
laborer job by the next census, and slightly more likely to hold slightly higher skilled jobs.
For example, learning to speak English was associated with a 7.6 percentage point drop in
being a laborer, which was about a fourth of the percentage of laborers at first observation.
Acquiring English skills most commonly led to a higher number of unskilled service, semi-
skilled or low-skilled white collar jobs, indicating a movement up the occupational ladder.
Note that learning to speak English did not lead to a large flow into professional jobs such as
a manager, doctor or lawyer. However, the base number of immigrants holding managerial
or professional occupations was small, so a lack of statistical significance for moving into this
highest skill group may reflect that it was a relatively rare outcome to begin with.
5.2 Sorting into Occupations by Task Intensity
Immigrants moved up in the occupational distribution slightly, but they may have moved
into jobs that required more communication. One way to measure this is to estimate whether
immigrants sorted into jobs that were more intensive in communication tasks rather than
other tasks, such as manual-based tasks (Peri and Sparber, 2009). We do this using the task
data from Gray (2013), who calculates the extent of communication tasks based on data
in the 1956 Report from the United States Employment Service. Communication tasks are
measured on a 1 to 6 scale, where jobs such as teamsters and laborers have a rating below 2,
and salespeople and managers have a rating greater than 4. Following the task literature,
we transform the communication variable into percentiles of the 1910 distribution, such that
being at the 10th percentile implies that the immigrant held a job that was more intensive
in communication tasks than 10 percent of those in 1910 (Autor et al., 2003).
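The percentile transformation can be sketched as the share of a reference distribution with a strictly lower communication rating. The reference ratings below are hypothetical values on the 1 to 6 scale, with one entry per worker, and the function name is an assumption of this sketch.

```python
from bisect import bisect_left


def percentile_in_reference(score, reference_scores):
    """Percent of the reference distribution with a strictly lower score."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, score)
    return 100.0 * below / len(ordered)


# Hypothetical communication ratings for ten workers in the reference year.
reference_1910 = [1.5, 1.8, 2.0, 2.5, 3.0, 3.5, 4.0, 4.2, 4.5, 5.0]

low_job = percentile_in_reference(2.0, reference_1910)   # 20.0
high_job = percentile_in_reference(4.2, reference_1910)  # 70.0
```

Using one entry per worker means that occupations with more workers carry more weight in the reference distribution, which is one reasonable reading of "10 percent of those in 1910".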
To estimate the sorting of immigrants into either manual or communication tasks, we
simply re-estimate our English acquisition regression, but now with the percentile communi-
cation as the dependent variable. The results are shown in Panel A of Figure 6. According
to our linked samples, permanent immigrants started at about the 48th percentile of the
1910 communication task distribution at arrival, suggesting that they were not far behind
natives in communication-based tasks and improved to the 57th percentile after two decades
of stay.
The raw percentiles may be misleading because they do not account for the age of the
immigrant nor the year of observation. To address these issues, we estimate the rate at which the
gap between immigrants and natives in communication-based task intensity closed by pooling
the panel of immigrants and repeated cross-section of natives from the 1910-1930 censuses.22
The estimated gap at arrival and the rate at which it closed are shown in Panel B for 1900-1904
arrivals, where immigrants held jobs about 14 percentiles lower in the communication task
distribution, and then closed this gap by only about one third after 20 years of stay.
This evidence shows that while immigrants did sort into jobs which required more com-
munication, they did not move up in the communication-distribution at a fast rate. To the
extent that jobs with communication-based tasks were more highly rewarded in the early
20th century, we would expect immigrants to improve on initial earnings gaps with natives,
but not by much. In the next section, we estimate the actual association between acquisition
of English skills and improving occupational-based earnings.
22 When measuring the trend of immigrants into communication jobs in the regression, we additionally add age fixed effects to control for the life-cycle profile, and also estimate the rate of convergence between immigrants and natives using a quadratic in years since arrival.
5.3 The Occupational-Based Earnings Premium
The occupational categories and task data show that immigrants who acquired English skills
slightly moved up in the occupational distribution and into more communication-intensive
jobs; however, they do not give a simple estimate of the English premium. To estimate
this, one needs to assign each of the nearly 250 occupational codes an occupational score
since income and wages are not observed until the 1940 Census.23 Unfortunately, there is no
representative occupational score at this level of detail for each decade between 1910 and
1930; therefore, we resort to other occupational scores used in the literature: the score based
on the 1901 Cost of Living Survey (CLS), wage data from the 1940 Census, and wage and
business income data from the 1950 Census (occscore from IPUMS).24 Our preferred score is
from the 1940 Census, since it is based on the average wages by occupation and country of
birth, while the 1901 and 1950 scores reflect both immigrant and native earnings.25 However,
since our time period is between 1910 and 1930, we also report the 1901 CLS to reflect the
wider income distribution in the early 20th century. Finally, we also show results from the
1950 score since it is the one used by Abramitzky et al. (2014) to estimate the assimilation
profile in Figure 1.
The results from running Equation (2) with logged occupational score as the dependent
variable are shown in Table 3. The first column shows the estimate when applying
occupational scores based on the 1901 Cost of Living Survey. The estimated association
between acquiring English skills and occupational-based earnings is between 4.2 and 4.5%.
The second column uses the immigrant-specific occupational score from 1940 and finds that
acquiring English skills is associated with a 0.5% upgrade between 1910 and 1920, and a
23 This is based on the standardized occupational codes variable occ1950 in IPUMS.
24 For less than 2% of observations there is no occupation score in the 1901 Cost of Living Survey. For each missing occupation, I calculate its position in the 1950 occupational score distribution, which has scores for all occupations. I assume the missing occupation’s point in the 1901 distribution is the same as in the 1950 distribution, and then fill in its score based on its predicted wage. We use the farmer income value from Abramitzky et al. (2014).
25 The basic method to create this occupational score relies heavily on Collins and Wanamaker (2017), where we impute self-employed earnings for non-wage workers. See Appendix E for further detail.
2.4% upgrade between 1920 and 1930. The estimate from the immigrant-specific score is
very similar to the one when using the 1950 occupational score shown in the third column.
The finding of essentially no return to speaking English using the 1950 score is consistent
with the flat assimilation profile estimated by Abramitzky, Boustan and Eriksson (2014),
with little penalty for not speaking English at arrival.
As expected, the occupational scores reflecting a more compressed wage distribution
(1940 and 1950) imply a lower return to English skills, while the score reflective of a wide
wage distribution (1901) implies a stronger return to English skills. Therefore, it appears that
the return to English is correlated with the general return to human capital in the economy,
where wider wage distributions lead to higher returns to language capital (Goldin and
Katz, 2008). We prefer an estimate that is between the 1901 and 1940 score such that the
return to acquiring English skill is between zero and 4.5% in the early 20th century. However,
note that the variable for speaking English in these regressions is not exogenous, but may be
correlated with other factors - such as other types of United States specific human capital
- that change over time; however, these other factors likely positively affect labor market
outcomes, suggesting that the individual fixed effect estimate is an upper bound of the true
return to English skills.
5.4 A Discussion of Age-At-Arrival Analysis
The estimate from the linked sample shows a relatively low return to English compared with
estimates of a wage return greater than 20 percent for recent decades (Chiswick and Miller,
2014). However, our estimate has a few limitations: mainly, learning to speak English is
not exogenous. Further, the linked data may have false positives, and measurement error in
the English variable may attenuate results (Dustmann and Van Soest, 2001).26 Here we
26 One way to check measurement error is to see how many people were recorded as able to speak English at first observation but then recorded as not able at second observation: this happened to 2.5% of those recorded as able to speak English at first observation. The results that English acquisition had a low return are unaffected if one drops those who “downgraded” English skills. Specifically, regressions that drop those who downgraded English skills and use the 1940 immigrant-specific score yield a 0.0% return between
briefly discuss an additional empirical strategy to estimate the English premium where one
could exploit the well-defined relationship between age at arrival and the ability to speak
English, as shown previously in Figure 3. Bleakley and Chin (2004) use this relationship
to instrument for the ability to speak English based on whether one arrived at an older or
younger age, and whether the immigrant was born in an English-speaking country. Further,
we can implement this strategy using cross-sectional data, avoiding any potential bias from the linked
data. While we briefly discuss the strategy here, we employ it fully in Appendix D.
Figure 7 shows the basic intuition behind the Bleakley and Chin (2004) strategy. Panel
B estimates the age-at-arrival English fluency and occupational profile in the early 21st
century and shows that non-English-speaking sources fall in English speaking ability after
the critical period of language acquisition ends. Non-English-speaking sources also have a
steeper occupational profile, which falls at the same arrival ages when English fluency levels
drop; this combination of profiles for English fluency and occupational score form the basis of
Bleakley and Chin’s (2004) argument that older arriving immigrants were strongly penalized
for a lack of English skills.
However, the age-at-arrival strategy applied to the early 20th century shows little penalty
for those unable to speak English. While the age-at-arrival and English fluency relationship
is largely the same as in the late 20th century, the effect of age-at-arrival on occupational
scores is similar for English and non-English-speaking sources.27 In other words, older
arrivals who had lower levels of English fluency did not also have substantially lower levels
of occupational score. This evidence is consistent with our argument that a lack of English
fluency did not strongly penalize workers in the early 20th century compared with workers
in the early 21st century; we discuss point estimates for this instrumental variables strategy
at further length in Appendix D.
1910-1920, and a 2.0% return between 1920-1930.
27 Alexander and Ward (2018) also estimate the age-at-arrival and wage profile for English-speaking and non-English-speaking sources using a sample of brothers linked from arrival records to the 1940 Census, and find the same result that there is no difference in the age-at-arrival profiles across sources.
6 The Increasing Return to English Fluency
6.1 A Consistent Method to Estimate the Changing Premium
The age-at-arrival analysis is consistent with an increasing return to English fluency over
the past 100 years; however, it is identified off of variation in outcomes for child arrivals (or
the 1.5 generation) and may not be fully applicable to the rest of the immigrant population.
Unfortunately we cannot compare our preferred estimate from the individual fixed effects
strategy to English premium estimates from recent decades since the early 21st century
censuses are not linked. Moreover, recent estimates of the English premium are on income
rather than occupational score.28
Therefore, to create a consistent estimate of the association between English fluency and
economic outcomes, we estimate an OLS model for data in 1910 and data in 2010; note that
the OLS method is also used by Chiswick and Miller (2014) to show the association between
English fluency and economic outcomes.29 Further, we create an occupational score in the
same manner in 2010 as we did in our prior analysis, where the score in 2010 reflects average
income by country of birth and occupation, similar to the occupational score measure from
the 1940 Census. Therefore, we regress an immigrant’s log occupational score on the ability
to speak English, a measure of general human capital (literacy in the early 20th century
and having more than 8 years of education in the early 21st century), age, the fraction of
immigrants in the county, the population of the county and country of birth. This model is
clearly parsimonious but it is difficult to include other variables that are both in the 1910
and 2010 data.30
28 A further issue is that recent estimates often group English speakers based on whether they spoke English “very well”/“well” or “not well”/“not at all”, rather than our preferred grouping of speaking any English since the age-at-arrival and English profiles look similar (recall Figure 3).
29 We refer to the 2008-2012 ACS as the 2010 data for convenience. The sample for both datasets is of 25-60 year-old males from non-English-speaking sources.
30 We use eight years of education since it likely reflects the ability to read and write and about 80 percent of immigrants in 2010 held more than 8 years of education, similar to the percent of immigrants in 1910 who could read and write. If one uses a smaller level of schooling to reflect literacy, such as 4 years of education, the results are qualitatively the same.
The results from the regression in Table 4 show that an OLS estimate of the occupational-
based return to English fluency in 1910 is 6.3 percent, which is higher than the individual
fixed effects estimates of 0 to 2.4 percent when using the same occupational score. The
difference between the OLS and individual fixed effect estimate is unsurprising if one expects
English fluency to be positively biased from an omitted variable such as ability. Using a
closely-related regression with the 2010 data, the occupational-based return to English skills
is 16.9 log points or 18.4 percent. Based on this method, the English premium is about three
times higher than the estimated association in 1910.
One caveat to the occupational-based estimates is that they only capture part of the English
premium due to the limited nature of the occupational score. If one instead uses log income
rather than log occupational score in 2010, then the return increases from 16.9 to 35.8 log
points (or 43 percent). The difference between the occupational-based return and the income
return suggests that the occupational-based score captures about 40 percent of the benefit
to gaining English skills in 2010, where the rest of the benefit comes from increased income
within occupation. One should keep this in mind when interpreting the results from the
early 20th century when we only have occupational scores, although we cannot quantify how
much information we lose from not having income.
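The conversion from log points to percent used for the 2010 estimates follows from exponentiating the coefficient. A minimal sketch, with a hypothetical function name:

```python
import math


def log_points_to_percent(log_points):
    """Convert a coefficient in log points to a percent change."""
    return 100.0 * (math.exp(log_points / 100.0) - 1.0)


occupational_return = log_points_to_percent(16.9)  # about 18.4 percent
income_return = log_points_to_percent(35.8)        # about 43 percent
```

The gap between log points and percent widens as the coefficient grows, which is why the income-based return of 35.8 log points corresponds to roughly 43 percent.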
Overall, the analysis shows that the premium to English skills in the early 20th century
was less than the English premium in recent decades. While we cannot compare our preferred
individual fixed effects estimate over time, the recreation of the methods used by others in
the literature based on OLS (Chiswick and Miller, 2014) and instrumental variables (Bleakley
and Chin, 2004) consistently points in the same direction of an increasing English premium
over the 20th century. Moreover, a low value of English skills in the early 20th century
and a high value of English skills in the early 21st century help to explain the changing
assimilation profiles shown in Figure 1.
6.2 Discussion of Changing Premium
Why did the return to English skills increase over the past 100 years? A common explanation
for a changing return to skill is technology-driven shifts in demand; therefore, it may
be that the premium for language skills was smaller due to a lower demand for English
skills. The early 20th century economy was more agricultural and industrial compared with
the service-dominated economy today, and still had a large focus on brawn rather than
brain. The shift away from agriculture over the 20th century coincided with increasing
urbanization rates; as population density increased, the tasks performed by workers also
changed toward more interaction (Boustan et al., 2013; Michaels et al., 2017). This is easily
seen when examining the structure of the labor force over the past century, in which the
fraction of white-collar jobs has tripled, agricultural jobs have been all but eliminated, and
the proportion of blue collar jobs has decreased (Katz and Margo, 2014). At the same time,
Michaels et al. (2017) show that the importance of interactive tasks for jobs has grown rapidly
between 1880 and 2000, especially in cities where immigrants tended to locate. Deming
(2017) further demonstrates that jobs requiring social skills have had an increasing premium
since 1980. These demand shifts favoring interaction, social skills, and general human capital
may have increased the demand for English skills, causing English fluency to be of primary
importance for immigrants to succeed in the United States in the 21st century (Goldin and
Katz, 2008).
Another possibility for a low return to English skills in the early 20th century is that
discrimination against immigrants was rampant, and therefore any type of human capi-
tal received a small premium in the market. Indeed, discrimination appears to have been
widespread; for example, those who changed their name to be more “American” received
a premium in the labor market (Biavaschi et al., 2017), and brothers who had more American-
sounding names earned more than brothers who had more foreign-sounding names (Abramitzky
et al., 2017). Yet these two results suggest that there is a positive return to becoming more
“American”, and a similar positive return would likely hold for becoming more American by
learning to speak English. In Appendix Table A2 we show that the English premium is sim-
ilar for immigrants who experienced more discrimination (Southern and Eastern Europeans)
and immigrants who experienced less discrimination (Northern and Western Europeans),
suggesting that discrimination was not the main driver of the low English premium.
7 Concluding Remarks
Surprisingly little is known about the importance of English for immigrant outcomes one
hundred years ago. Using new linked data between 1910 and 1930, we show a few simple
relationships. First, many immigrants arrived without English skills, which contrasts with a
lack of occupational-based earnings deficit for immigrant arrivals; second, immigrants rapidly
acquired English skills in the years after arrival, which contrasts with a flat assimilation pro-
file where immigrants barely improved on their relative position with natives. This suggests
that English skills had little occupational value in the early 20th century relative to recent
decades, which we directly show using individual fixed effects and age-at-arrival analysis.
Therefore, our results help to explain the assimilation profile in the early 20th century and
why the assimilation profile has changed over time.
While we cannot definitively pinpoint the mechanisms for why the value of English skills
was lower in the early 20th century, we argue that it likely reflects the structure of the econ-
omy, which was primarily agricultural and manufacturing. In this setting, interaction and
social skills were relatively unimportant compared with today’s service-dominated economy.
Therefore, technological change can influence immigrants’ relative position with natives,
especially if it influences the relative return to communication and language skills. Further,
these results on a low value of English skills help to understand why the Americanization
movement, which aimed to increase immigrants’ English fluency levels, did little to improve
the foreign-born’s adult economic outcomes (Lleras-Muney and Shertzer, 2015).
While we stress the importance of English fluency for understanding the variation in
immigrant assimilation profiles over time, it is not the only determinant of the profile. In
particular, immigrants’ earnings relative to natives’ also depend on their pre-immigration
human capital; indeed, this point has been stressed by Borjas (1985, 1995, 2015) as immi-
grant sources have shifted to poorer countries following the Immigration and Nationality
Act of 1965. Therefore, another reason for the difference in assimilation profiles across time
may be that immigrants in the past had pre-immigration human capital levels similar to
natives, compared with today’s difference in human capital between natives and immigrants
(Abramitzky and Boustan, 2017).
If current trends continue, then the English premium may increase even further in future
decades. If technological shifts continue to favor those with skill, especially social skills
(Goldin and Katz, 2008; Deming, 2017), and if immigrants do not have higher rates of
investment in English skills either pre-arrival or post-arrival (Borjas, 2015), then the premium
for English skills will increase. If so, then immigrants’ economic position relative to natives
will gradually worsen over time.
References
Abramitzky, Ran and Leah Platt Boustan, “Immigration in American History,” Journal of Economic Literature, 2017.
Abramitzky, Ran, Leah Platt Boustan, and Katherine Eriksson, “Have the poor always been less likely to migrate? Evidence from inheritance practices during the Age of Mass Migration,” Journal of Development Economics, 2013, 102, 2–14.
Abramitzky, Ran, Leah Platt Boustan, and Katherine Eriksson, “A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration,” Journal of Political Economy, 2014, 122 (3), 467–506.
Abramitzky, Ran, Leah Platt Boustan, and Katherine Eriksson, “Cultural Assimilation during the Age of Mass Migration,” Working Paper 22381, National Bureau of Economic Research July 2017.
Acemoglu, Daron and David Autor, “Skills, tasks and technologies: Implications for employment and earnings,” in “Handbook of labor economics,” Vol. 4, Elsevier, 2011, pp. 1043–1171.
Ager, Philipp and Casper Worm Hansen, “Closing Heaven’s Door: Evidence from the 1920s US Immigration Quota Acts,” 2017.
Alexander, Rohan and Zachary Ward, “Age at Arrival and Assimilation during the Age of Mass Migration,” Journal of Economic History, 2018.
Autor, David H, Frank Levy, and Richard J Murnane, “The skill content of recent technological change: An empirical exploration,” The Quarterly journal of economics, 2003, 118 (4), 1279–1333.
Bailey, Martha, Connor Cole, Morgan Henderson, and Catherine Massey, “How Well Do Automated Linking Methods Perform in Historical Samples? Evidence from New Ground Truth,” Technical Report, Working Paper 2017.
Biavaschi, Costanza, Corrado Giulietti, and Zahra Siddique, “The Economic Payoff of Name Americanization,” Journal of Labor Economics, 2017, 35 (4), 1089–1116.
Blau, Francine D, “Immigration and labor earnings in early twentieth century America,” 1980.
Bleakley, Hoyt and Aimee Chin, “Language Skills and Earnings: Evidence from Childhood Immigrants,” Review of Economics and Statistics, 2004, 86 (2), 481–496.
Bleakley, Hoyt and Aimee Chin, “Age at Arrival, English Proficiency, and Social Assimilation Among US Immigrants,” American Economic Journal: Applied Economics, 2010, pp. 165–192.
Bloch, Louis, “The Ability of European Immigrants to Speak English,” Quarterly publications of the American Statistical Association, 1920, 17 (132), 402–416.
Borjas, George J, “Assimilation, Changes in Cohort Quality, and the Earnings of Immigrants,” Journal of Labor Economics, 1985, 3 (4), 463–489.
Borjas, George J, “Assimilation in Cohort Quality Revisited: What Happened to Immigrant Earnings in the 1980s?,” Journal of Labor Economics, 1995, 13 (2), 211–245.
Borjas, George J, “The Slowdown in the Economic Assimilation of Immigrants: Aging and Cohort Effects Revisited Again,” Journal of Human Capital, 2015, 9 (4), 483–517.
Boustan, Leah Platt, Devin Bunten, and Owen Hearey, “Urbanization in the United States, 1800-2000,” Technical Report, National Bureau of Economic Research 2013.
Catron, Peter, “The Citizenship Advantage: Immigrant Socioeconomic Attainment across Generations in the Age of Mass Migration,” 2017.
Chiswick, Barry R, “The Effect of Americanization on the Earnings of Foreign-born Men,” Journal of Political Economy, 1978, 86 (5), 897–921.
Chiswick, Barry R and Paul W Miller, “Do enclaves matter in immigrant adjustment?,” City & Community, 2005, 4 (1), 5–35.
Chiswick, Barry R and Paul W Miller, “International migration and the economics of language,” Handbook of the Economics of International Migration, 1A: The Immigrants, 2014, 1, 211.
Deming, David J, “The growing importance of social skills in the labor market,” The Quarterly Journal of Economics, 2017.
Dustmann, Christian and Arthur Van Soest, “Language fluency and earnings: Estimation with misclassified language indicators,” Review of Economics and Statistics, 2001, 83 (4), 663–674.
Dustmann, Christian and Joseph-Simon Gorlach, “Selective out-migration and the estimation of immigrants earnings profiles,” in “Handbook of the Economics of International Migration,” Vol. 1, Elsevier, 2015, pp. 489–533.
Feigenbaum, James J, “A Machine Learning Approach to Census Record Linking,” 2016.
Fouka, Vasiliki, “Backlash: The unintended effects of language prohibition in US schools after World War I,” Stanford Center for International Development Working Paper, 2016, 591.
Goldin, Claudia and Lawrence F Katz, The race between education and technology, Harvard University Press, 2008.
Gray, Rowena, “Taking technology to task: The skill content of technological change in early twentieth century united states,” Explorations in Economic History, 2013, 50 (3), 351–367.
Guven, C and A Islam, “Age at migration, language proficiency, and socioeconomic outcomes: evidence from australia,” Demography, 2015, 52 (2), 513.
Hatton, Timothy J, “The Immigrant Assimilation Puzzle in Late Nineteenth-Century America,” The journal of economic history, 1997, 57 (01), 34–62.
Hatton, Timothy J and Jeffrey G Williamson, “The age of mass migration: Causes and economic impact,” OUP Catalogue, 1998.
Inwood, Kris, Chris Minns, and Fraser Summerfield, “Reverse assimilation? Immigrants in the Canadian labour market during the Great Depression,” European Review of Economic History, 2016, 20 (3), 299–321.
Jasso, Guillermina and Mark R Rosenzweig, “Language Skill Acquisition, Labor Markets and Locational Choice: The Foreign-Born in the United States, 1900 and 1980,” in “Migration and Labor Market Adjustment,” Springer, 1989, pp. 217–239.
Jasso, Guillermina and Mark R Rosenzweig, The new chosen people: Immigrants in the United States, Russell Sage Foundation, 1990.
Jenks, Jeremiah Whipple and William Jett Lauck, The immigration problem, Funk & Wagnalls Company, 1926.
Katz, Lawrence F and Robert A Margo, “Technical change and the relative demand for skilled labor: The united states in historical perspective,” Technical Report, National Bureau of Economic Research 2014.
Krashen, Stephen D, Michael A Long, and Robin C Scarcella, “Age, rate and eventual attainment in second language acquisition,” Tesol Quarterly, 1979, pp. 573–582.
Kuziemko, Ilyana and Joseph Ferrie, “The Role of Immigrant Children in Their Parents’ Assimilation in the United States, 1850–2010,” in “Human Capital in History: The American Record,” University of Chicago Press, 2014, pp. 97–120.
Lafortune, Jeanne, Jose Tessada, and Ethan Lewis, “People and Machines: A Look at the Evolving Relationship Between Capital and Skill In Manufacturing 1860-1930 Using Immigration Shocks,” Working Paper 21435, National Bureau of Economic Research July 2016.
LaLonde, Robert J and Robert H Topel, “The assimilation of immigrants in the US labor market,” in “Immigration and the workforce: Economic consequences for the United States and source areas,” University of Chicago Press, 1992, pp. 67–92.
Lleras-Muney, Adriana and Allison Shertzer, “Did the Americanization Movement Succeed? An Evaluation of the Effect of English-Only and Compulsory Schooling Laws on Immigrants,” American Economic Journal: Economic Policy, 2015, 7 (3), 258–90.
Lubotsky, Darren, “Chutes or ladders? A longitudinal analysis of immigrant earnings,” Journal of Political Economy, 2007, 115 (5), 820–867.
Lubotsky, Darren, “The effect of changes in the US wage structure on recent immigrants’ earnings,” The Review of Economics and Statistics, 2011, 93 (1), 59–71.
Massey, Catherine G, “Playing with matches: An assessment of accuracy in linked historical data,” Historical Methods: A Journal of Quantitative and Interdisciplinary History, 2017, pp. 1–15.
Michaels, Guy, Ferdinand Rauch, and Stephen J Redding, “Task specialization in US cities from 1880-2000,” Technical Report, National Bureau of Economic Research, accessed “http://personal.lse.ac.uk/michaels/” 2017.
Peri, Giovanni and Chad Sparber, “Task specialization, immigration, and wages,” American Economic Journal: Applied Economics, 2009, 1 (3), 135–69.
Perlmann, Joel, Italians Then, Mexicans Now: Immigrant Origins and the Second-Generation Progress, 1890-2000, Russell Sage Foundation, 2005.
Sequeira, Sandra, Nathan Nunn, and Nancy Qian, “Migrants and the Making of America: The Short-and Long-Run Effects of Immigration during the Age of Mass Migration,” Technical Report, National Bureau of Economic Research 2017.
Singleton, David, “Age and second language acquisition,” Annual review of applied linguistics, 2001, 21, 77–89.
Spitzer, Yannay and Ariell Zimran, “Migrant Self-Selection: Anthropometric Evidence from the Mass Migration of Italians to the United States, 1907–1925,” 2017.
Stevens, Gillian, “A century of US censuses and the language characteristics of immigrants,” Demography, 1999, 36 (3), 387–397.
Tabellini, Marco, “Gifts of the Immigrants, Woes of the Natives: Lessons from the Age of Mass Migration,” 2017.
Vigdor, Jacob L, From immigrants to Americans: The rise and fall of fitting in, Rowman & Littlefield, 2010.
Ward, Zachary, “Birds of Passage: Return Migration, Self-Selection and Immigration Quotas,” Explorations in Economic History, 2017.
Table 1: Representativeness of the Linked Samples

                               1900-1909 Cohort in 1920                  1910-1919 Cohort in 1930
                               Cross       Panel: Difference from Cross  Cross       Panel: Difference from Cross
                                           Unweighted    Weighted                    Unweighted    Weighted
Speak English                  0.895       0.0337***     -2.55e-09       0.955       0.0216***     -6.05e-05
                               (0.306)     (0.00254)     (0.00273)       (0.207)     (0.000976)    (0.00124)
Literate                       0.852       0.0582***     -3.50e-09       0.902       0.0372***     -0.000119
                               (0.355)     (0.00294)     (0.00318)       (0.297)     (0.00144)     (0.00171)
South or East Europe           0.748       -0.255***     3.96e-09        0.807       -0.184***     0.000552
                               (0.434)     (0.00377)     (0.00363)       (0.395)     (0.00221)     (0.00192)
Age                            35.44       0.308***      0.290***        36.92       -0.894***     -0.987***
                               (7.095)     (0.0608)      (0.0624)        (6.862)     (0.0371)      (0.0384)
Professional                   0.119       0.0125***     7.53e-10        0.124       0.00272       -0.000115
                               (0.324)     (0.00277)     (0.00278)       (0.330)     (0.00171)     (0.00174)
Sales/Clerical                 0.0507      0.0130***     1.54e-09        0.0642      0.00642***    -0.000116
                               (0.219)     (0.00189)     (0.00186)       (0.245)     (0.00128)     (0.00127)
Semi-Skilled                   0.227       0.0111***     -4.82e-09       0.221       0.0206***     -0.000138
                               (0.419)     (0.00356)     (0.00362)       (0.415)     (0.00216)     (0.00218)
Unskilled Service/Operative    0.294       -0.0565***    -3.81e-09       0.298       -0.0462***    5.02e-05
                               (0.456)     (0.00383)     (0.00399)       (0.457)     (0.00232)     (0.00247)
Farmer                         0.0531      0.0512***     1.46e-09        0.0375      0.0367***     4.76e-05
                               (0.224)     (0.00202)     (0.00186)       (0.190)     (0.00112)     (0.000916)
Laborer                        0.229       -0.0511***    4.49e-09        0.232       -0.0421***    0.000248
                               (0.420)     (0.00352)     (0.00370)       (0.422)     (0.00213)     (0.00230)
Foreignness Index, First name  0.663       -3.74e-06     0.0172***       0.671       -0.00812***   0.000454
                               (0.190)     (0.00163)     (0.00167)       (0.188)     (0.000987)    (0.00101)
Foreignness Index, Last name   0.730       -0.0102***    0.0187***       0.746       -0.0324***    -0.0157***
                               (0.203)     (0.00173)     (0.00175)       (0.188)     (0.000962)    (0.000970)
Observations                   16,258      Panel: 96,400                 57,482      Panel: 108,590

Notes: Data is from linked samples between 1910 and 1920, and 1920 and 1930; cross-sectional data is from IPUMS
1% samples in 1920 and a 5% sample in 1930 (Ruggles et al., 2015). This table shows whether the linked (panel)
samples are representative with respect to the random cross-sectional samples from IPUMS. Weights are applied to
match English proficiency, literacy, being from Southern or Eastern Europe, and occupational categories according to
the 1920 and 1930 IPUMS samples. *p<0.10, **p<0.05, ***p<0.01
Table 2: Acquiring English Skills and Occupational Categories, Individual Fixed Effects

                             I              II          III           IV               V           VI
                             Professional/  Sales/      Semi-         Unskilled        Farmer      Laborer
                             Manager        Clerical    Skilled       Service/Oper.

Panel B: 1910 to 1920 Census
Speak English                0.00550        0.0202***   0.0255***     0.0186           0.00463     -0.0745***
                             (0.00489)      (0.00446)   (0.00110)     (0.0121)         (0.00434)   (0.0100)
Mean of Dep. Var. in 1910    0.0612         0.0683      0.22          0.265            0.0544      0.331
Individual FE                Yes            Yes         Yes           Yes              Yes         Yes
Number of ind                77,448         77,448      77,448        77,448           77,448      77,448

Panel C: 1920 to 1930 Census
Speak English                0.00739        0.00812***  0.0269***     0.0332***        0.00312     -0.0787***
                             (0.00513)      (0.00234)   (0.00634)     (0.00673)        (0.00356)   (0.00612)
Mean of Dep. Var. in 1920    0.0748         0.0525      0.245         0.3              0.0565      0.271
Individual FE                Yes            Yes         Yes           Yes              Yes         Yes
Number of ind                84,595         84,595      84,595        84,595           84,595      84,595

Notes: Data is from linked samples between 1910 to 1920 in Panel A, and 1920 to 1930 in Panel B. Each cell reports results
from a separate regression of the occupational category on the ability to speak English, individual fixed effects and controls
described in text such as literacy and fraction of foreign born in the country. Standard errors are clustered by country of birth.
*p<0.10, **p<0.05, ***p<0.01
Table 3: Speaking English and Occupational Score, Individual Fixed Effects

                             Occupational Score based on:
                             1901 CLS      1940 Census   1950 Census

Panel A: 1910 to 1920 Census
Speak English                0.0420***     0.00535       0.00636
                             (0.00394)     (0.00498)     (0.00509)
Individual FE                Yes           Yes           Yes
Number of ind                77,448        77,448        77,448

Panel B: 1920 to 1930 Census
Speak English                0.0453***     0.0242***     0.0266***
                             (0.00399)     (0.00659)     (0.00487)
Individual FE                Yes           Yes           Yes
Number of ind                84,595        84,595        84,595

Notes: Data is from linked samples between 1910 to 1920 in Panel A, and 1920 to 1930 in Panel B. Each cell
reports results from a separate regression of log occupational score on the ability to speak English,
individual fixed effects and controls described in text such as literacy and fraction of foreign born in
the country. The 1901 CLS uses income scores from the 1901 Cost of Living Survey. The 1940 Census is based
on income from the 1940 Census and is country of birth-specific; see Appendix E for further detail. The
1950 census is the occscore variable from IPUMS. Standard errors are clustered by country of birth.
*p<0.10, **p<0.05, ***p<0.01
Table 4: Association between Speaking English and Outcomes, 1910 and 2010

                                     1910 Census         2008-2012 ACS
                                     Log (Occ. Score)    Log (Occ Score)   Log (Income)
Speak English                        0.0633***           0.169***          0.358***
                                     (0.00356)           (0.00277)         (0.00580)
Literacy                             0.0571***
                                     (0.00395)
More than 8 years of education                           0.176***          0.277***
                                                         (0.00208)         (0.00411)
Fraction of own Migrants in County   0.0225**            0.0167***         -0.169***
                                     (0.0111)            (0.00598)         (0.0116)
Log (County Pop)                     0.00741***          0.00492***        0.00596***
                                     (0.000844)          (0.000618)        (0.00115)
Country of Birth FE                  Y                   Y                 Y
Age FE                               Y                   Y                 Y
Observations                         37,289              427,227           427,227
R-squared                            0.284               0.305             0.164

Notes: Data is from the 1910 Census and the 2008-2012 ACS. In the 2008-2012 ACS, speaking English is
coded as 1 if an immigrant is able to speak any English, whether not well or very well. Both samples are
of male immigrants from non-English speaking countries aged 25 to 60. We use the immigrant-specific
occupational score from 1940 in the 1910 census. The occupational score in 2010 is based on the mean
total income by occupation and country of birth in the 2008-2012 ACS, which is the same method of
calculating occupational score. Standard errors are clustered by birth.
Figure 1: Assimilation Profiles Across Time for Permanent Immigrants
Notes: The typical assimilation profile in the early 20th century is found by Abramitzky, Boustan and Eriksson (2014); late 20th century by Lubotsky (2007). The findings only represent the assimilation of permanent immigrants who stay throughout a panel.

Figure 2: Fraction of Immigrant Stock Born in an English-Speaking Country
Notes: Data is from 1850-2014 IPUMS. The graph separates countries by whether English is an official language or dominantly spoken; for example, India and Philippines have English as an official language, but it is not predominantly spoken by the populace. See Bleakley and Chin (2010) for a further discussion.

Figure 3: Age at Arrival and English Proficiency Profile, Early 20th and 21st Century
Notes: Data is from 1900-1930, 2000 Censuses and 2008-2012 ACS. The figure plots age-at-arrival fixed effects from a regression of ability to speak English on age at arrival, age, year, cohort of arrival, country of birth, sex, and fraction of immigrants from same birthplace in county.

Figure 4: Speed of Language Acquisition Across the 20th century
Notes: Data is from linked panel data 1910-1920 and 1920-1930; the 1910-1930, 2000 IPUMS random samples and 2008-2012 ACS. The figure shows the mean ability to speak English in the years after arrival. RCS stands for repeated cross section.

Figure 5: Speed of Language Acquisition by Ethnicity
Notes: Data is from linked samples between 1910-1920 and 1920-1930. The figure shows the mean ability to speak English by ethnicity, as proxied by mother’s tongue.

Figure 6: Immigrants moved into Jobs with more Communication Tasks
Notes: Data is from 1910-1920 and 1920-1930 linked samples, and the 1910-1930 IPUMS random samples. Communication tasks are rated on a 1 to 6 scale and then transformed into percentiles based on the 1910 Census. The top figure estimates the rate of moving up the communication distribution for immigrants. The bottom figure estimates the gap between immigrants and natives after accounting for life-cycle effects and period effects.

Figure 7: Age-at-Arrival, English Fluency and Occupation in Early 20th and 21st Century
Notes: The figure shows the residuals of the ability to speak English and the log occupational score after removing the effects of age, sex and country of birth. Panel A uses the 1900 to 1930 Censuses and Panel B uses the 2000 and 2008-2012 ACS. See Appendix D for a fuller exploration of English and age-at-arrival effects.
Online appendix, not meant for publication
Table A1: Descriptives of Groups of Always Speak English, Switchers, and Never Learners at First Observation

                               Always Speak English   Switchers   Never Speak English
Literate                       0.913                  0.716       0.602
                               (0.283)                (0.451)     (0.489)
Age                            27.15                  27.95       30.20
                               (6.155)                (6.437)     (6.394)
South or East Europe           0.724                  0.869       0.932
                               (0.447)                (0.337)     (0.251)
Log (Occ. Score), 1940         6.791                  6.695       6.672
                               (0.348)                (0.274)     (0.239)
Professional                   0.0868                 0.0383      0.0294
                               (0.282)                (0.192)     (0.169)
Sales/Clerical                 0.0730                 0.0287      0.0175
                               (0.260)                (0.167)     (0.131)
Semi-skilled                   0.247                  0.170       0.129
                               (0.431)                (0.375)     (0.335)
Unskilled Service/Operative    0.301                  0.298       0.261
                               (0.459)                (0.458)     (0.439)
Farmer                         0.0439                 0.0297      0.0316
                               (0.205)                (0.170)     (0.175)
Laborer                        0.249                  0.435       0.532
                               (0.432)                (0.496)     (0.499)

Observations                   127,421                30,991      3,631

Notes: Data is from linked data from 1910-1920, and 1920-1930.
Table A2: Speaking English and Occupational Upgrading, Alternative Samples

                   I            II           III          IV           V
Sample:            Base         Old          New          Child        Adult
                                Sources      Sources      Arrivals     Arrivals

Panel A: 1910 to 1920 Census
Speak English      0.00535      0.00512      0.0140***    0.0199       0.00355
                   (0.00498)    (0.0138)     (0.00284)    (0.0209)     (0.00330)
Individual FE      Yes          Yes          Yes          Yes          Yes
Number of ind      77,448       39,469       38,858       14,770       63,557

Panel B: 1920 to 1930 Census
Speak English      0.0242***    0.0369**     0.0264***    0.0376*      0.0220***
                   (0.00659)    (0.0131)     (0.00630)    (0.0204)     (0.00501)
Individual FE      Yes          Yes          Yes          Yes          Yes
Number of ind      84,595       32,712       51,883       19,025       65,570

Notes: Data is from linked samples between 1910 to 1920 in Panel A, and 1920 to 1930
in Panel B. Each cell reports results from a separate regression of log occupational
score (based on the 1940 Census) on the ability to speak English with individual fixed
effects and controls described in text. Each column limits the sample to a different
subsample. Old sources are from Northern and Western Europe; new sources are
from Southern and Eastern Europe. Child arrivals are those who arrived under the
age of 16. Standard errors are clustered by country of birth. *p<0.10, **p<0.05,
***p<0.01
Figure A1: Able to Speak English, 1900 to 2010
Notes: Data is from IPUMS (1900-1930; 1980-2010).

Figure A2: Rate of English Acquisition Raw Data, 1900-1919 Cohorts
Notes: Data is from linked data from 1910-1920, and 1920-1930.

Figure A3: Cohort Effects
Notes: Data is from IPUMS (1910-1930) and linked samples (1910-1920; 1920-1930). The figure shows the mean ability to speak English for arrivals by arrival cohort. RCS stands for repeated cross section.

Figure A4: Cohort effects when accounting for birth place and age at arrival
Notes: Data is from IPUMS (1910-1930) and linked samples (1910-1920; 1920-1930). The figure shows the mean ability to speak English for arrivals by arrival cohort. RCS stands for repeated cross section.
B Linking Process
B.1 A machine-learning approach to linking immigrants
In this section, we provide further detail on how we build a longitudinal dataset which
links immigrants from Census to Census. We follow the approach discussed at length by
Feigenbaum (2016) where we hand-link a set of immigrants to ensure a set of high-quality
links, train an algorithm to find a best link based on links in our hand-linked dataset, and
then apply the algorithm to the overall set of potential links for the rest of the census to
pick the best link. We pursue this method over other linking strategies, such as the linking
algorithm used by Abramitzky, Boustan and Eriksson (2014), in order to reduce biases
associated with false positives, as discussed by Bailey et al. (2017). We discuss our method
in detail below, but the reader should reference Feigenbaum (2016) since much of the method
is based on his discussion.
Before drawing random samples to build training data, we set the sampling frame and
pre-process the data in the following manner. First, we take the 1920 Census and keep
male European immigrants who arrived between 1900 and 1919 and are between 10 and 40
years old. We create a new variable which is the Americanization of all first names based
on information from behindthename.com31; for example, the Americanization of Giuseppe
is Joseph. We do this because individuals may have Americanized their names between
censuses, as shown by Biavaschi et al. (2017); yet note that Americanization may be more
likely to occur immediately after arrival (and thus before first observation at the Census)
than between censuses (Carneiro et al., 2015) and thus may not be as strong of a bias. For
individuals without an Americanized first name, we keep their original first name string.
Next, we drop any individual who shares the same values of the following variables with
another individual: Americanized first name, last name string, country of birth, year of birth,
year of immigration and mother’s tongue. These are the variables on which we link between
1920 and 1930 to determine the best match; we therefore drop anyone with an identical set of
linking variables because we cannot distinguish them from other potential links.
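The de-duplication step above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names in `LINK_KEYS` and the three toy records are hypothetical and do not come from the census files.

```python
from collections import Counter

# Hypothetical field names for the six linking variables described in the text.
LINK_KEYS = ("first_name_amer", "last_name", "birth_country",
             "birth_year", "arrival_year", "mother_tongue")

def link_key(record):
    """Tuple of the linking variables for one person."""
    return tuple(record[k] for k in LINK_KEYS)

def drop_indistinguishable(records):
    """Keep only people whose combination of linking variables is unique."""
    counts = Counter(link_key(r) for r in records)
    return [r for r in records if counts[link_key(r)] == 1]

# Toy records (made up): the two Joseph Rossis cannot be told apart.
rossi = {"first_name_amer": "joseph", "last_name": "rossi", "birth_country": "Italy",
         "birth_year": 1890, "arrival_year": 1910, "mother_tongue": "Italian"}
nelson = {"first_name_amer": "nels", "last_name": "nelson", "birth_country": "Sweden",
          "birth_year": 1890, "arrival_year": 1915, "mother_tongue": "Swedish"}
people = [rossi, dict(rossi), nelson]

unique_people = drop_indistinguishable(people)
print([p["last_name"] for p in unique_people])
```

Both duplicated records are dropped rather than one of them being kept, since neither can be distinguished from the other.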
After pre-processing the data in this way, we draw 15 different random samples of 2,000
immigrants each. These random samples are drawn by mother’s tongue, a variable which best
reflects the ethnicity of each immigrant. The different mother’s tongues are German,
Yiddish/Jewish, Dutch, Swedish, Danish, Norwegian, Italian, French, Romanian, Greek,
Russian/Ukrainian, Slovak/Slovene, Polish, Finnish, and Magyar/Hungarian.32 We sample by
mother’s tongue since the method to pick the best link may vary based on the ethnicity of an
immigrant; for example, the way to pick the best match among immigrants from German-
language sources may differ from the way to pick the best match among immigrants from
Italian-language sources. We draw 2,000 since Feigenbaum (2016) suggests that training
datasets converge quickly after a sample of 500; we increase to 2,000 in case linking the
foreign-born requires more data than Feigenbaum’s sample of Iowans.
31 See Appendix B and C in Alexander and Ward (2018) for further detail.
32 These are based on the mtongue variable in IPUMS. Slovak/Slovene includes Czech, Slovak, Serbo-Croatian, Yugoslavian, Slovene and Lithuanian.
After drawing the random samples of 2,000 for a total of 30,000 immigrants, we draw
a set of potential matches for them from the 1930 Census, from which we hand-pick the
best link. The set of potential matches is restricted in the following ways. First, the
difference in year of birth must be at most three years, which is the same restriction used
by Feigenbaum (2016). Second, the difference in year of arrival must be at most seven years;
we discuss linking on year of arrival, a variable on which Abramitzky et al. (2014) do not
link, in the next section. Third, the Jaro-Winkler score in Americanized first name must
be greater than or equal to 0.75; this is to allow for differences in spelling across censuses.
Fourth, the Jaro-Winkler score in last name must be greater than or equal to 0.80; we allow
for a slightly greater deviation in first name due to Americanizing the first name. Fifth, one
of the following conditions must hold: either the first letter of the Americanized first name
must match, or the first letter of the last name must match, or the soundex version of both
first and last name must match. Sixth, for immigrants with more than 25 potential matches,
we keep the best 25 candidates based on the linking score provided by Feigenbaum (2016).
Seventh, the country of birth and mother’s tongue must match exactly.
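As a rough illustration of the restrictions above, the sketch below implements the first four conditions and the seventh (exact match on country of birth and mother’s tongue). It is a minimal sketch under stated assumptions, not the code used to build the dataset: the Jaro-Winkler function is a standard textbook implementation with a prefix scale of 0.1, the field names are hypothetical, and the fifth (first-letter or soundex) and sixth (top-25) conditions are omitted for brevity. The two records are illustrative; the country and mother’s tongue values are made up.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings (0 = no similarity, 1 = identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity with a bonus for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)

def is_candidate(a, b):
    """Conditions one to four and seven from the text (five and six omitted)."""
    return (abs(a["birth_year"] - b["birth_year"]) <= 3
            and abs(a["arrival_year"] - b["arrival_year"]) <= 7
            and jaro_winkler(a["first_amer"], b["first_amer"]) >= 0.75
            and jaro_winkler(a["last"], b["last"]) >= 0.80
            and a["birth_country"] == b["birth_country"]
            and a["mother_tongue"] == b["mother_tongue"])

# Illustrative records; country and mother's tongue are hypothetical values.
p1920 = {"first_amer": "nels", "last": "nelson", "birth_year": 1890,
         "arrival_year": 1915, "birth_country": "Sweden", "mother_tongue": "Swedish"}
p1930 = {"first_amer": "neils", "last": "nillson", "birth_year": 1887,
         "arrival_year": 1920, "birth_country": "Sweden", "mother_tongue": "Swedish"}
print(is_candidate(p1920, p1930))
```

Passing this filter only makes a pair a potential match; whether it is kept depends on the later hand-linking or model-based selection.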
The 15 random samples of 2,000 immigrants yield 228,391 potential matches. Note
that out of the 30,000 individuals we sampled, we found a potential match for only 21,913
individuals, which may be due to return migration, under-enumeration, death, spelling errors
by enumerators, Americanization between censuses, or mistyping by those digitizing the
dataset. Therefore, the best possible linking rate we can achieve with our training dataset
is 73 percent; however, we will match far fewer than this due to common names or being
unable to find a good match. For individuals for whom we find a potential match, we pick the
best link of the (on average) 10.3 potential links.
We choose the best potential match for each of the 21,913 individuals in our dataset.
Many individuals do not have any potential match that is the true person, which we code
as zero. For those with multiple good potential matches that are very close in name, year of
arrival and year of birth, we code them as zero since we cannot determine the best match.
After linking, we are left with a linked dataset of 7,548 individuals, or 34.4 percent of
those with potential matches and 25.2 percent of our original random sample. Linking rates vary
by mother’s tongue as shown in Table B1, where we are most likely to link Dutch immigrants
(35.4 percent of starting sample), and least likely to link someone who is Romanian (12.7
percent of the starting sample).
With these sets of potential links, we train multiple models to choose the best match
among the potential links; the aim is to apply the coefficients from these models to
the full link between censuses. We model each potential match based on our experience
linking individuals; the link is a function of first name match, last name match, differences
in year of birth and year of arrival, the number of other potential matches that have the same
exact last name or exact NYSIIS code, and the length of the last name. The models for each
language of origin are shown in Tables B2-B5. The coefficients estimated from these
probit models can then be used to gauge the best possible match when linking
the entire censuses.
Before linking the full-count censuses, we must set two meta-parameters for each ethnicity
to determine which of the potential links are to be included in the final linked dataset. First,
we must determine the threshold for the likelihood that a potential link is actually a true
link. For example, Nels Nelson who was born in 1890 and arrived in 1915 may match with
Neils Nillson who was born in 1887 and arrived in 1920, but should we keep this link or
not? We term this meta-parameter b1 such that we keep any individual with a predicted
probability that is greater than or equal to b1. Note that b1 can range between 0 and 1.
The second meta-parameter we must determine is the threshold for similarity between the
best potential match and second-best potential match. For example, we may find multiple
Nels Nelsons that match between 1920 and 1930 that deviate slightly on year of birth and
year of arrival; should we keep one of the two close links in our dataset or not? We term
this meta-parameter b2 such that we keep any individual whose predicted probability
is at least b2 times the predicted probability of the second-best match. Note that
b2 can range from 1 to infinity.
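The two thresholds can be read as a simple decision rule. The sketch below is an assumption-laden illustration, not the authors’ code: the function name `select_link` is hypothetical, and it reads “b2 times greater” as “at least b2 times as large”. The probabilities in the example are made up.

```python
def select_link(probabilities, b1, b2):
    """Return the index of the accepted candidate, or None if no link is kept.

    probabilities: predicted probabilities for one person's potential matches.
    b1: minimum predicted probability for the best candidate.
    b2: minimum ratio of the best to the second-best predicted probability.
    """
    if not probabilities:
        return None
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    best = probabilities[ranked[0]]
    if best < b1:
        return None                      # best candidate is not convincing enough
    if len(ranked) > 1 and best < b2 * probabilities[ranked[1]]:
        return None                      # a close alternative exists
    return ranked[0]

# Made-up probabilities for three people, with hypothetical b1 = 0.5 and b2 = 3.
print(select_link([0.9, 0.1], b1=0.5, b2=3))   # clear winner
print(select_link([0.9, 0.5], b1=0.5, b2=3))   # second-best too close
print(select_link([0.4], b1=0.5, b2=3))        # below the probability threshold
```

Raising either threshold can only remove links, which is why both trade linking efficiency against false positives.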
Setting the values of b1 and b2 will influence the efficiency of linking and also the number
of false positives. In particular, we can measure the true positive rate (TPR) with our
training data. The formula for the true positive rate is the number of true positives (correctly
linked) divided by the sum of true positives and false negatives (fail to link a true match).
A related metric for the performance of the probit is the positive prediction value (PPV),
which measures the likelihood that a link in the set of predicted links is a true link. The
formula for the positive prediction value is the number of true positives divided by the sum
of true positives and false positives.
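The two formulas can be written out directly. The following is a minimal sketch with hypothetical link identifiers; a link is represented as a pair of record IDs, and the toy sets are made up for illustration.

```python
def tpr_ppv(true_links, predicted_links):
    """True positive rate and positive prediction value for a set of predicted links."""
    true_links, predicted_links = set(true_links), set(predicted_links)
    tp = len(true_links & predicted_links)      # correctly linked
    fn = len(true_links - predicted_links)      # true matches we failed to link
    fp = len(predicted_links - true_links)      # links that are not true matches
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return tpr, ppv

# Toy example: four true (hand-linked) pairs, three predicted pairs.
truth = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
predicted = {(1, "a"), (2, "b"), (5, "e")}
tpr, ppv = tpr_ppv(truth, predicted)
print(f"TPR = {tpr:.3f}, PPV = {ppv:.3f}")
```

In this toy case half of the true links are recovered (TPR of 0.5) and two of the three predicted links are correct (PPV of about 0.667).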
Given that we are linking full-count to full-count censuses, the potential size of a linked
dataset could be very large. The cost of reducing the TPR may therefore be less than
the cost of linking to the wrong individual, so we err on the side of a higher PPV at the
expense of a lower linking rate and TPR; this method is essentially the same as relying
more on uncommon names rather than common names to build our dataset. To determine
the parameters for b1 and b2, we perform a grid search between 0 and 1 for the predicted
probability b1, and between 1 and 50 for the ratio. We choose b2 so the PPV is at least 0.90
in each of our training datasets; after hitting this rate of PPV, we pick b1 to maximize the
TPR.
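A grid search of this kind can be sketched as follows. This is a simplified illustration, not the authors’ procedure: it searches b1 and b2 jointly, keeps only combinations whose PPV is at least 0.90, and among those picks the highest TPR. The grid step sizes, the function names, and the toy training data are all assumptions.

```python
def evaluate(groups, b1, b2):
    """TPR and PPV of the threshold rule on hand-linked training data.

    groups: one list per person; each candidate is (probability, is_true_link).
    """
    tp = fp = fn = 0
    for cands in groups:
        ranked = sorted(cands, key=lambda c: c[0], reverse=True)
        has_true = any(is_true for _, is_true in cands)
        accepted = None
        if ranked and ranked[0][0] >= b1:
            if len(ranked) == 1 or ranked[0][0] >= b2 * ranked[1][0]:
                accepted = ranked[0]
        if accepted is None:
            fn += has_true               # a true match existed but was not linked
        elif accepted[1]:
            tp += 1
        else:
            fp += 1
            fn += has_true               # wrong link chosen, true match missed
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return tpr, ppv

def grid_search(groups, min_ppv=0.90):
    """Return (tpr, ppv, b1, b2) with the highest TPR subject to PPV >= min_ppv."""
    b1_grid = [i / 20 for i in range(21)]    # assumed step of 0.05 on [0, 1]
    b2_grid = range(1, 51)                   # assumed integer ratios 1..50
    best = None
    for b2 in b2_grid:
        for b1 in b1_grid:
            tpr, ppv = evaluate(groups, b1, b2)
            if ppv >= min_ppv and (best is None or tpr > best[0]):
                best = (tpr, ppv, b1, b2)
    return best

# Made-up training data for four people.
training = [
    [(0.95, True), (0.02, False)],
    [(0.60, True), (0.50, False)],
    [(0.72, False), (0.05, True)],
    [(0.30, True)],
]
best = grid_search(training)
print(best)
```

In this toy data the PPV constraint can only be met by rejecting the third person’s misleading top candidate, which also sacrifices the second person’s true but closely contested link; this mirrors the trade-off between PPV and TPR described in the text.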
The critical values for the probability threshold b1 and ratio threshold b2 are presented
in Table B6. It is immediately clear that the true positive rate is much lower than the positive
prediction value of 0.90: the values range from 0.265 for Russian to 0.82 for Dutch. Since
the true positive rates are relatively low, the overall linking rate between
censuses will be low; for example, with Russian we only keep 26.5 percent of the positive
hand-linked matches in our dataset. Since the hand-linked rate was only 15.9 percent,
one could expect an overall linking rate of 4.2 percent. However, we believe this to be a
worthwhile trade-off to increase confidence that the links are the correct individuals. Note
that the ratio threshold b2 sometimes requires the predicted probability of the best link to
be nearly 33 times larger than that of the second-best link; we believe we are being very
conservative to ensure the best matches in our dataset.
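The expected linking rate quoted for Russian follows from multiplying the two rates given in the text. A short check, with variable names chosen here for illustration:

```python
# Rates reported in the text for Russian mother's tongue.
true_positive_rate = 0.265   # share of hand-linked matches the algorithm keeps
hand_link_rate = 0.159       # share of the starting sample that was hand-linked

expected_link_rate = true_positive_rate * hand_link_rate
print(f"Expected overall linking rate: {expected_link_rate:.1%}")
```

The product is about 0.042, which matches the 4.2 percent reported above.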
While we can easily apply our algorithm to linking the 1920-1930 censuses since these
censuses contain country of birth, mother’s tongue, year of arrival, year of birth, first name
and last name, we cannot easily apply this method to linking the 1910-1920 censuses. This
is because the mother’s tongue variable in the 1910 census was often coded as
English to reflect one’s ability to speak English rather than the language of the mother;
therefore the mother’s tongue “English” is severely overrepresented in the 1910 preliminary
full-count data. Since we do not have reliable mother’s tongue data in the 1910 census, it
is difficult to identify whether to use, for example, the linking predictions from the
Jewish/Yiddish mother’s tongue regression or the Russian mother’s tongue regression for those
who list a country of birth as Russia. We circumvent this problem by linking between the
1910 and 1920 censuses without mother’s tongue, and then assume the individual’s mother’s
tongue is the most common mother’s tongue out of the 25 potential matches in the 1920 census.
One trade-off when building the linked dataset is that those included in the dataset are
those with no close alternatives; primarily, this implies that they have uncommon names.
We show the results for representativeness in Table 1 in the main text. To make the sample
more representative of the population, we reweight it to match the 1920 and 1930 full-count
censuses distribution of ability to speak English, country of birth, and ability to read and
write.
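The text does not spell out the reweighting procedure, so the following is only one simple way it could be done: cell-based weights, where each linked individual is weighted by the ratio of the population share to the sample share of their cell. The function name and the toy cells are hypothetical.

```python
from collections import Counter

def cell_weights(sample_cells, population_cells):
    """Weight for each cell = population share / linked-sample share.

    A cell is any hashable label, e.g. (speaks_english, literate, birth_country).
    """
    sample_counts = Counter(sample_cells)
    population_counts = Counter(population_cells)
    n_sample, n_population = len(sample_cells), len(population_cells)
    return {cell: (population_counts[cell] / n_population) / (count / n_sample)
            for cell, count in sample_counts.items()}

# Toy example: English speakers are over-represented in the linked sample.
linked = ["speaks", "speaks", "speaks", "does_not"]
population = ["speaks", "speaks", "does_not", "does_not"]
weights = cell_weights(linked, population)

# Weighted share of English speakers in the linked sample after reweighting.
total = sum(weights[c] for c in linked)
weighted_share = sum(weights[c] for c in linked if c == "speaks") / total
print(weights, weighted_share)
```

After weighting, the share of English speakers in the toy linked sample equals the population share of one half.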
B.2 Matching on Years in the United States
For the creation of the main dataset, we match immigrants across censuses based on years
in the United States.33 Abramitzky, Boustan and Eriksson (2014) do not match on years in
the United States because of concerns over heaping on zeroes or fives or from misreporting
in the variable. The advantages of matching on years in the United States are that it could
potentially increase match rates due to not dropping similar names, and then also decrease
the number of false positives due to having an extra piece of information. Yet if the variable
was not recorded accurately then it would not improve the quality of the linked dataset.
One way to quantify the accuracy of the variable is to compare the extent of heaping
for years in the United States with heaping for age. One can measure the extent of heaping
by assuming a smooth distribution of age or years in the United States and then counting the
number of people who report a value ending in zero or five; this is the basis of
the ABCC index described by A’Hearn, Baten and Crayen (2009). This index ranges from
0 to 100 and is interpreted as the percentage of people who know their true age or years in
the US. Therefore, we will calculate the ABCC index for age and years in the United States
using a sample of individuals aged 18 to 52, with years of stay between 3 and 22. These
restrictions are used to reflect the restrictions for the linked samples.34
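As a rough sketch of the calculation, the ABCC index is a rescaling of the Whipple index, which compares the number of people reporting a value ending in zero or five with the one-fifth share expected under a smooth distribution. The function name and the toy counts below are illustrative assumptions, not the actual census tabulations.

```python
def abcc_index(counts):
    """Return the ABCC index for a dict mapping reported value -> head count.

    Assumes the values span a range in which exactly one in five integers
    ends in 0 or 5 (true for ages 18 to 52 and for 3 to 22 years of stay).
    """
    total = sum(counts.values())
    heaped = sum(n for value, n in counts.items() if value % 5 == 0)
    # Whipple index: observed count on multiples of five relative to the
    # one-fifth share expected under a smooth distribution, times 100.
    whipple = 100 * heaped / (total / 5)
    # ABCC rescales Whipple so that 100 means no heaping and 0 means that
    # everyone reports a multiple of five; values below 100 are capped.
    return 100 * (1 - max(whipple - 100, 0) / 400)

# Toy data: a perfectly smooth age distribution, and one with extra mass at 30.
smooth_ages = {age: 10 for age in range(18, 53)}
heaped_ages = dict(smooth_ages)
heaped_ages[30] += 35
```

With the toy data, the smooth distribution scores 100 and the distribution with extra mass at age 30 scores lower.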
Before comparing the ABCC index for years in the United States and age, we show in Figure B1
the degree of heaping for both variables in the 1900 to 1930 United States Censuses.
There is clear heaping on the zeroes and fives for both variables, yet it is unclear which
variable has a larger amount of heaping.
According to the ABCC index, the years in the United States variable is slightly more
heaped than the age variable: The ABCC index is 94.7 for age, suggesting that 94.7 percent
of individuals reported their true age. At the same time, the ABCC index for years in the
United States is 94.2, only 0.5 points lower than for age. This suggests that the years in the
United States variable is heaped to a similar degree as age.
Applying the ABCC methodology to measure heaping in the years in the US variable
may not be accurate because it assumes that the underlying distribution is smooth. This
may not be the case for inflow data because of large yearly fluctuations that reflect the
business cycle (Hatton and Williamson, 1998); note that age does not have this problem
since fertility rates change smoothly. However, the immigration data between 1888 and 1922
from the Historical Statistics (Carter et al., 2006) show that the average inflow in years
that end in zero or five was 94.3 percent of that in other years. Thus, there should be less
heaping in the years in the United States variable than in the age variable in the first place.
33I discuss matching on Years in the US, which is equivalent to matching on Year of Arrival. That is, Year of Arrival = Year − Years in US. Year and Years in the US both change by ten between censuses.
34I start the calculation at years in the United States at 3 since heaping is very low between years zero and two. If we include these values, then there is less heaping for the years in the United States variable, which reinforces my conclusion that it is a good variable to match on.
When hand-linking data between censuses, year of arrival was often a good indicator of
a match, but if there were multiple individuals with close years of arrival, it was difficult to
choose which individual was the correct match. Therefore, we did not take year of arrival
as informative if there were two potential matches with small differences in year of arrival
(within one or two years). However, the results from the probit indicate that year of arrival
does provide information about the quality of the match, where those with larger differences
in year of arrival are less likely to be matched. (See Tables B2 - B5).
Table B1: Hand-Linking the 1920 to 1930 Censuses
Mother’s Tongue     Random   N with at least 1    N of Potential    Successful   Overall        Linking Rate given
in 1920             Draw     Potential Match      Matches in 1930   Link         Linking Rate   1 Potential Match
                             in 1930
German              2,000    1,476                11,378            632          31.6           42.8
Yiddish, Jewish     2,000    1,745                25,810            617          30.9           35.4
Dutch               2,000    1,295                6,684             707          35.4           54.6
Swedish             2,000    1,621                21,682            691          34.6           42.6
Danish              2,000    1,537                18,697            582          29.1           37.9
Norwegian           2,000    1,518                13,995            599          30.0           39.5
Italian             2,000    1,806                29,773            620          31.0           34.3
French              2,000    1,113                3,205             460          23.0           41.3
Romanian            2,000    799                  2,955             258          12.9           32.3
Greek               2,000    1,576                21,932            369          18.5           23.4
Russian/Ukrainian   2,000    1,481                10,513            318          15.9           21.5
Czech/Slovak        2,000    1,454                16,114            333          16.7           22.9
Polish              2,000    1,677                27,394            466          23.3           27.8
Finnish             2,000    1,382                8,103             529          26.5           38.3
Magyar, Hungarian   2,000    1,433                10,156            367          18.4           25.6
Notes: Results from linking immigrants from the 1920 Census to the 1930 Census.
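The two linking-rate columns in Table B1 use different denominators: the overall rate divides successful links by the random draw, while the conditional rate divides by the number of drawn individuals with at least one potential match. A short check, using a hypothetical helper name, reproduces the German and Romanian rows:

```python
def linking_rates(random_draw, n_with_candidate, successful_links):
    """Return (overall rate, rate given at least one potential match), in percent."""
    overall = 100 * successful_links / random_draw
    conditional = 100 * successful_links / n_with_candidate
    return round(overall, 1), round(conditional, 1)

german = linking_rates(2000, 1476, 632)
romanian = linking_rates(2000, 799, 258)
```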
Table B2: Probit Results, Set 1
Mother’s tongue:                    English       German        Yiddish       Dutch
Year of Birth Diff is 1             -0.495***     -0.293***     -0.519***     -0.508***
                                    (0.0902)      (0.0920)      (0.0802)      (0.113)
Year of Birth Diff is 2             -0.817***     -0.489***     -0.737***     -0.804***
                                    (0.103)       (0.105)       (0.0891)      (0.144)
Year of Birth Diff is 3             -0.888***     -0.776***     -0.882***     -0.946***
                                    (0.110)       (0.120)       (0.104)       (0.155)
Year of Arr. Diff is 1              -0.600***     -0.450***     -0.464***     -0.472***
                                    (0.101)       (0.101)       (0.0862)      (0.121)
Year of Arr. Diff is 2              -0.785***     -0.650***     -0.726***     -0.835***
                                    (0.108)       (0.114)       (0.0950)      (0.142)
Year of Arr. Diff is 3              -1.038***     -0.540***     -0.993***     -1.114***
                                    (0.120)       (0.115)       (0.109)       (0.164)
Year of Arr. Diff is 4              -0.949***     -0.650***     -1.116***     -1.195***
                                    (0.146)       (0.158)       (0.156)       (0.242)
Year of Arr. Diff is 5              -0.937***     -0.901***     -1.083***     -1.706***
                                    (0.165)       (0.180)       (0.178)       (0.302)
Year of Arr. Diff is 6              -1.253***     -0.921***     -1.305***     -1.524***
                                    (0.170)       (0.211)       (0.215)       (0.309)
Year of Arr. Diff is 7              -1.201***     -0.894***     -1.031***     -1.263***
                                    (0.195)       (0.251)       (0.216)       (0.319)
JW Distance First name              -6.726**      -2.016        -3.088        1.314
                                    (2.910)       (2.441)       (2.770)       (2.148)
JW distance last name               -9.251***     -9.449***     -8.668***     -12.23***
                                    (0.867)       (0.732)       (0.771)       (0.980)
NYSIIS First name match             -0.103        0.177         0.150         0.612*
                                    (0.415)       (0.325)       (0.370)       (0.331)
NYSIIS Last name match              -0.397***     -0.385***     -0.301***     -0.518***
                                    (0.108)       (0.122)       (0.0957)      (0.183)
Hits                                -0.175***     -0.200***     -0.179***     -0.271***
                                    (0.0197)      (0.0192)      (0.0229)      (0.0261)
Hits squared                        0.00419***    0.00517***    0.00389***    0.00677***
                                    (0.000680)    (0.000684)    (0.000745)    (0.000968)
First letter of last name match     0.238         -0.121        0.161         -0.477***
                                    (0.179)       (0.121)       (0.159)       (0.178)
First letter of first name match    0.0793        0.536***      1.620***      0.350
                                    (0.296)       (0.180)       (0.542)       (0.274)
NYSIIS last name match, 2 hits      0.543***      0.491***      0.596***      -1.114***
                                    (0.132)       (0.168)       (0.121)       (0.280)
NYSIIS last name match, unique      1.223***      0.879***      1.484***      2.012***
                                    (0.130)       (0.167)       (0.125)       (0.267)
JW Distance in NYSIIS Last name     -1.788**      -3.336***     -2.538***     -5.629***
                                    (0.782)       (0.652)       (0.734)       (0.997)
JW Distance in NYSIIS First name    -1.114*       -0.170        -0.0438       -0.421
                                    (0.649)       (0.253)       (0.126)       (0.342)
Middle initial match, if have one   1.096***      1.480***      0.127         0.877***
                                    (0.131)       (0.321)       (0.743)       (0.325)
Constant                            1.420***      1.381***      -0.256        2.928***
                                    (0.549)       (0.418)       (0.705)       (0.507)
Observations                        12,975        11,227        25,691        6,651
Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.
Table B3: Probit Results, Set 2
Mother’s tongue:                    Swedish       Danish        Norwegian     Italian
Year of Birth Diff is 1             -0.586***     -0.400***     -0.445***     -0.258***
                                    (0.0760)      (0.0830)      (0.0904)      (0.0694)
Year of Birth Diff is 2             -0.755***     -0.713***     -0.814***     -0.606***
                                    (0.0861)      (0.0997)      (0.112)       (0.0846)
Year of Birth Diff is 3             -0.996***     -1.012***     -1.111***     -0.984***
                                    (0.102)       (0.120)       (0.130)       (0.112)
Year of Arr. Diff is 1              -0.605***     -0.653***     -0.736***     -0.373***
                                    (0.0803)      (0.0856)      (0.101)       (0.0814)
Year of Arr. Diff is 2              -0.926***     -1.128***     -0.803***     -0.496***
                                    (0.0958)      (0.113)       (0.115)       (0.0878)
Year of Arr. Diff is 3              -0.990***     -1.146***     -0.838***     -0.699***
                                    (0.0991)      (0.116)       (0.118)       (0.0986)
Year of Arr. Diff is 4              -1.160***     -1.613***     -1.042***     -0.799***
                                    (0.147)       (0.183)       (0.165)       (0.131)
Year of Arr. Diff is 5              -1.034***     -1.590***     -1.154***     -0.777***
                                    (0.154)       (0.204)       (0.197)       (0.146)
Year of Arr. Diff is 6              -0.998***     -1.459***     -0.848***     -0.979***
                                    (0.166)       (0.223)       (0.192)       (0.173)
Year of Arr. Diff is 7              -1.107***     -1.082***     -1.675***     -1.614***
                                    (0.195)       (0.175)       (0.290)       (0.247)
JW Distance First name              -4.548***     -4.985***     -2.658*       0.915
                                    (1.518)       (1.604)       (1.572)       (1.259)
JW distance last name               -6.400***     -5.700***     -7.320***     -10.52***
                                    (0.859)       (0.939)       (0.825)       (0.606)
NYSIIS First name match             0.244         0.187         0.402*        0.491***
                                    (0.234)       (0.222)       (0.231)       (0.189)
NYSIIS Last name match              -0.0609       -0.0626       -0.349***     0.0145
                                    (0.0992)      (0.101)       (0.112)       (0.0960)
Hits                                -0.172***     -0.221***     -0.215***     -0.0653***
                                    (0.0189)      (0.0215)      (0.0202)      (0.0252)
Hits squared                        0.00358***    0.00497***    0.00473***    0.000408
                                    (0.000630)    (0.000720)    (0.000712)    (0.000809)
First letter of last name match     0.221         0.371*        0.380**       0.0309
                                    (0.137)       (0.191)       (0.156)       (0.148)
First letter of first name match    0.456***      1.024***      0.693***      0.155
                                    (0.164)       (0.216)       (0.190)       (0.110)
NYSIIS last name match, 2 hits      0.738***      0.0664        0.0171        0.707***
                                    (0.126)       (0.163)       (0.162)       (0.122)
NYSIIS last name match, unique      1.175***      1.449***      1.538***      0.713***
                                    (0.132)       (0.167)       (0.159)       (0.127)
JW Distance in NYSIIS Last name     -1.670**      -1.886**      -1.578**      -3.784***
                                    (0.765)       (0.899)       (0.710)       (0.609)
JW Distance in NYSIIS First name    -0.743***     -0.519**      -0.780***     -0.0790
                                    (0.254)       (0.254)       (0.277)       (0.147)
Middle initial match, if have one   1.194***      1.630***      1.017***      -
                                    (0.141)       (0.134)       (0.237)
Constant                            0.540         0.0836        0.498         0.800**
                                    (0.337)       (0.404)       (0.362)       (0.331)
Observations                        21,648        18,690        13,893        29,591
Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.
Table B4: Probit Results, Set 3
Mother’s tongue:                    French        Romanian      Greek         Russian
Year of Birth Diff is 1             -0.423***     -0.110        -0.318***     -0.0167
                                    (0.141)       (0.155)       (0.0903)      (0.117)
Year of Birth Diff is 2             -0.727***     -0.392**      -0.736***     -0.371***
                                    (0.170)       (0.161)       (0.109)       (0.124)
Year of Birth Diff is 3             -0.618***     -0.411**      -0.809***     -0.354***
                                    (0.168)       (0.170)       (0.121)       (0.131)
Year of Arr. Diff is 1              -0.323*       -0.114        -0.537***     -0.392***
                                    (0.173)       (0.174)       (0.105)       (0.120)
Year of Arr. Diff is 2              -0.456**      -0.0539       -0.646***     -0.650***
                                    (0.189)       (0.179)       (0.109)       (0.135)
Year of Arr. Diff is 3              -0.599***     -0.470**      -0.789***     -0.934***
                                    (0.189)       (0.197)       (0.120)       (0.152)
Year of Arr. Diff is 4              -1.293***     -0.495**      -1.042***     -0.932***
                                    (0.277)       (0.238)       (0.164)       (0.200)
Year of Arr. Diff is 5              -0.677***     -0.577**      -0.792***     -1.142***
                                    (0.235)       (0.277)       (0.162)       (0.234)
Year of Arr. Diff is 6              -0.895***     -0.486*       -1.422***     -0.918***
                                    (0.260)       (0.280)       (0.224)       (0.226)
Year of Arr. Diff is 7              -1.108***     -0.919**      -1.315***     -0.845***
                                    (0.319)       (0.360)       (0.256)       (0.252)
JW Distance First name              -5.042*       -7.971        -0.616        -2.432
                                    (2.952)       (5.298)       (1.966)       (3.893)
JW distance last name               -12.18***     -9.360***     -9.277***     -9.429***
                                    (1.108)       (0.995)       (0.749)       (0.813)
NYSIIS First name match             -0.0698       -1.201        0.158         0.412
                                    (0.372)       (0.731)       (0.294)       (0.578)
NYSIIS Last name match              -1.022***     -0.591***     -0.158        -0.907***
                                    (0.206)       (0.225)       (0.114)       (0.157)
Hits                                -0.341***     -0.251***     -0.216***     -0.225***
                                    (0.0362)      (0.0313)      (0.0254)      (0.0225)
Hits squared                        0.0109***     0.00663***    0.00503***    0.00545***
                                    (0.00171)     (0.00136)     (0.000831)    (0.000815)
First letter of last name match     0.191         0.101         0.298         0.183
                                    (0.212)       (0.189)       (0.208)       (0.158)
First letter of first name match    1.155***      1.009*        0.437*        -0.0285
                                    (0.425)       (0.529)       (0.229)       (0.300)
NYSIIS last name match, 2 hits      -0.696**      -0.605*       0.880***      0.0710
                                    (0.335)       (0.365)       (0.147)       (0.273)
NYSIIS last name match, unique      1.833***      1.805***      0.839***      1.171***
                                    (0.319)       (0.366)       (0.155)       (0.269)
JW Distance in NYSIIS Last name     -3.571***     -3.109***     -3.322***     -5.051***
                                    (1.032)       (0.871)       (0.700)       (0.839)
JW Distance in NYSIIS First name    -0.00510      0.00271       -0.572***     -0.140
                                    (0.216)       (0.269)       (0.218)       (0.236)
Middle initial match, if have one   1.110**       -             0.732         2.714**
                                    (0.482)                     (0.482)       (1.263)
Constant                            1.625**       1.894**       1.251***      1.630**
                                    (0.644)       (0.948)       (0.477)       (0.705)
Observations                        3,190         2,899         21,761        10,481
Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.
Table B5: Probit Results, Set 4
Mother’s tongue:                    Czech/Slovak  Polish        Finnish       Magyar
Year of Birth Diff is 1             -0.354***     -0.309***     -0.352***     -0.217**
                                    (0.111)       (0.0781)      (0.0996)      (0.110)
Year of Birth Diff is 2             -0.789***     -0.469***     -0.595***     -0.547***
                                    (0.131)       (0.0903)      (0.117)       (0.122)
Year of Birth Diff is 3             -0.990***     -0.528***     -0.655***     -0.850***
                                    (0.156)       (0.103)       (0.118)       (0.137)
Year of Arr. Diff is 1              -0.702***     -0.263***     -0.463***     -0.569***
                                    (0.119)       (0.0846)      (0.106)       (0.117)
Year of Arr. Diff is 2              -0.854***     -0.525***     -0.577***     -0.848***
                                    (0.137)       (0.0957)      (0.121)       (0.133)
Year of Arr. Diff is 3              -1.021***     -0.728***     -0.936***     -1.266***
                                    (0.151)       (0.112)       (0.126)       (0.163)
Year of Arr. Diff is 4              -1.163***     -0.755***     -1.353***     -0.951***
                                    (0.230)       (0.153)       (0.197)       (0.182)
Year of Arr. Diff is 5              -1.111***     -0.590***     -1.051***     -0.907***
                                    (0.245)       (0.169)       (0.196)       (0.185)
Year of Arr. Diff is 6              -0.835***     -0.879***     -0.662***     -1.244***
                                    (0.232)       (0.211)       (0.185)       (0.246)
Year of Arr. Diff is 7              -0.723**      -0.962***     -1.175***     -0.989***
                                    (0.289)       (0.332)       (0.248)       (0.221)
JW Distance First name              -6.304*       -2.514        1.789         -7.769***
                                    (3.228)       (2.770)       (1.852)       (2.581)
JW distance last name               -12.84***     -11.49***     -7.890***     -8.110***
                                    (0.949)       (0.658)       (0.769)       (0.815)
NYSIIS First name match             -0.871*       0.148         1.090***      -0.503
                                    (0.449)       (0.369)       (0.307)       (0.368)
NYSIIS Last name match              -0.335**      -0.599***     -0.680***     -0.457***
                                    (0.168)       (0.115)       (0.128)       (0.138)
Hits                                -0.232***     -0.119***     -0.284***     -0.246***
                                    (0.0275)      (0.0250)      (0.0218)      (0.0236)
Hits squared                        0.00564***    0.00168**     0.00806***    0.00605***
                                    (0.000935)    (0.000816)    (0.000813)    (0.000862)
First letter of last name match     -0.169        0.254*        0.0135        -0.00937
                                    (0.163)       (0.146)       (0.141)       (0.164)
First letter of first name match    0.466         0.389*        0.266         0.324
                                    (0.321)       (0.222)       (0.181)       (0.328)
NYSIIS last name match, 2 hits      0.241         0.334**       0.146         -0.117
                                    (0.227)       (0.164)       (0.170)       (0.214)
NYSIIS last name match, unique      1.094***      1.181***      1.402***      1.732***
                                    (0.228)       (0.166)       (0.165)       (0.212)
JW Distance in NYSIIS Last name     -4.801***     -3.330***     -1.975***     -2.694***
                                    (0.911)       (0.627)       (0.638)       (0.746)
JW Distance in NYSIIS First name    -0.497*       0.0479        -1.126**      -0.138
                                    (0.282)       (0.244)       (0.498)       (0.201)
Middle initial match, if have one   -0.637        1.672         1.348***      2.898**
                                    (2.430)       (6.124)       (0.370)       (1.384)
Constant                            3.593***      1.228**       0.556         2.076***
                                    (0.652)       (0.498)       (0.399)       (0.513)
Observations                        16,041        27,298        8,006         9,891
Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.
Table B6: Meta-parameters for Linked Samples
Mother’s Tongue                                      b1      b2    PPV     TPR
English                                              0.236   1.7   0.901   0.760
German                                               0.36    1.3   0.900   0.749
Yiddish, Jewish                                      0.343   2.1   0.902   0.587
Dutch                                                0.211   1.8   0.902   0.905
Swedish                                              0.263   5     0.902   0.555
Danish                                               0.313   3.1   0.902   0.638
Norwegian                                            0.334   2.1   0.901   0.723
Italian                                              0.479   1.5   0.900   0.470
French                                               0.336   1.2   0.901   0.889
Romanian                                             0.408   2.4   0.903   0.627
Greek                                                0.462   1.9   0.904   0.414
Russian/Ukrainian                                    0.508   2     0.904   0.445
Czech, Slovak, Serbo-Croatian, Slovene, Lithuanian   0.357   2.3   0.901   0.639
Polish                                               0.346   8.2   0.903   0.399
Finnish                                              0.262   3.1   0.902   0.705
Magyar, Hungarian                                    0.375   6.4   0.903   0.564
Notes: Keep individuals in the linked sample who have a predicted probability above b1.
We also keep those whose predicted probability is more than b2 times the second-highest
predicted score.
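The selection rule in the notes to Table B6 can be sketched as below. The notes can be read as requiring both conditions or either one; this sketch assumes both are required by default and exposes the other reading through a flag. The function name, the flag and the input layout are hypothetical.

```python
def select_links(candidates, b1, b2, require_both=True):
    """Pick at most one link per person from predicted match probabilities.

    candidates: dict mapping a person id to a list of
        (candidate id, predicted probability) pairs from the probit.
    b1: minimum predicted probability for the best candidate.
    b2: minimum ratio of the best to the second-best predicted probability.
    require_both: assumption about how the two rules combine (see lead-in).
    """
    links = {}
    for person, scored in candidates.items():
        if not scored:
            continue
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        best_id, best_p = ranked[0]
        # A person with a single candidate has no competing second-best score.
        second_p = ranked[1][1] if len(ranked) > 1 else 0.0
        above_threshold = best_p > b1
        clearly_best = best_p > b2 * second_p
        if require_both:
            keep = above_threshold and clearly_best
        else:
            keep = above_threshold or clearly_best
        if keep:
            links[person] = best_id
    return links

# Toy example using the German meta-parameters from Table B6.
toy = {
    "a": [("x", 0.80), ("y", 0.30)],   # high score, clearly ahead of runner-up
    "b": [("x", 0.50), ("y", 0.45)],   # high score, but runner-up is too close
    "c": [("z", 0.20)],                # unique candidate, but score is too low
}
linked = select_links(toy, b1=0.36, b2=1.3)
```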
Table B7: Linking Rates for Non-English-Speaking Source Countries
                     1910-1920 Link                                      1920-1930 Link
Ethnicity            Potential Links in 1920   Linked   Linking Rate   Potential Links in 1930   Linked   Linking Rate
German               318417                    21350    6.7            216442                    22918    10.6
Yiddish, Jewish      329563                    8252     2.5            253668                    8520     3.4
Dutch                37708                     5539     14.7           36717                     7707     21.0
Swedish              116040                    6112     5.3            73095                     5841     8.0
Danish               34647                     2346     6.8            27519                     2696     9.8
Norwegian            68986                     5765     8.4            39703                     4134     10.4
Italian              564562                    23800    4.2            486866                    29724    6.1
French               26966                     2544     9.4            33192                     1829     5.5
Romanian             22938                     1212     5.3            16934                     737      4.4
Greek                77596                     1119     1.4            85401                     2083     2.4
Russian              136836                    5012     3.7            113983                    3210     2.8
Czech/Slovak         292614                    2199     0.8            198365                    5665     2.9
Polish               361386                    3906     1.1            263270                    7733     2.9
Finnish              41911                     3211     7.7            25515                     3065     12.0
Hungarian            96307                     889      0.9            58080                     2627     4.5
Notes: The linked sample sizes for the 1910 to 1920 and 1920 to 1930 censuses. Note that we only link non-English-speaking source countries.
Figure B1: Heaping on Age and Years in the US
Notes: Data is from IPUMS (1900-1930). The figure shows the residuals of the ability the log occupational score after removing the effects of age, sex and country of birth. The right hand side graph treats immigrants as able to speak English if they speak not well, well or very well.
C Discussion of Data Choices
C.1 Coding of Countries
I group the following countries into one country to maintain consistency across the 1900 to
1930 data:
• Russia includes Russia, Poland, Latvia, Lithuania, Estonia, and any Baltic state
• Hungary includes Hungary, Czechoslovakia and Yugoslavia
We code “Old” source countries, “New” source countries, and English-speaking countries
as follows.
• Old source countries include: Canada, Denmark, Finland, Iceland, Norway, Swe-
den, England, Scotland, Wales, Ireland, Belgium, France, Luxembourg, Netherlands,
Switzerland, Australia and New Zealand. New source countries are all others, including
those from Eastern Europe, Asia, Africa and Central/South America.
• English-speaking countries in 1900 are Australia, (English) Canada, England, Ireland,
Scotland, India, British West Indies colonies (e.g., Antigua, Barbados, etc.), New
Zealand, the Philippines and Wales. The British and American colonies are a very
small number of the overall immigrant total, so coding them as non-English does not
qualitatively affect results. For the years 1910 to 1930, an immigrant is coded as
English-speaking if the mother’s tongue is English or if he or she was born in one of the
countries coded as English-speaking for 1900. Between 1980 and 2010, English-speaking
countries are coded following Bleakley and Chin (2010).
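The grouping rules above can be written as a small lookup. This is only an illustrative sketch: the birthplace strings are hypothetical labels rather than IPUMS codes, and "any Baltic state" is represented by the three countries named in the list.

```python
# Countries folded into a single unit for consistency across 1900 to 1930.
RUSSIA_GROUP = {"Russia", "Poland", "Latvia", "Lithuania", "Estonia"}
HUNGARY_GROUP = {"Hungary", "Czechoslovakia", "Yugoslavia"}

# "Old" source countries as listed in the text; everything else is "New".
OLD_SOURCES = {
    "Canada", "Denmark", "Finland", "Iceland", "Norway", "Sweden",
    "England", "Scotland", "Wales", "Ireland", "Belgium", "France",
    "Luxembourg", "Netherlands", "Switzerland", "Australia", "New Zealand",
}

def harmonize_country(birthplace):
    """Map a reported birthplace to the grouped country used in the analysis."""
    if birthplace in RUSSIA_GROUP:
        return "Russia"
    if birthplace in HUNGARY_GROUP:
        return "Hungary"
    return birthplace

def source_type(birthplace):
    """Classify the grouped country as an 'Old' or 'New' source."""
    return "Old" if harmonize_country(birthplace) in OLD_SOURCES else "New"
```

For example, `harmonize_country("Poland")` returns `"Russia"`, which is then classified as a New source.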
D Recreating Bleakley and Chin’s (2004) Empirical Strategy
D.1 Age-at-Arrival Analysis
One way to measure the English premium is to exploit a well-defined relationship between
age at arrival and English fluency as an adult. In particular, second language acquisition
is easier at younger ages during the so-called “critical period;” after age 8-11, the ability
to acquire a second language decreases at a linear rate (Bleakley and Chin, 2004). To fix
ideas for how one can test this empirically, researchers typically estimate a variation of the
following regression using a sample of adults who arrived as children:
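$$
\text{SpeakEng}_{ij} = a_0 + a_1 \mathbb{1}[\text{AgeArrival} \geq 8]_{ij} + a_2 \text{NonEngSpeakCntry}_{ij} + a_3 \mathbb{1}[\text{AgeArrival} \geq 8] \times \text{NonEngSpeakCntry}_{ij} + \gamma_j + \Pi' X_{ij} + \upsilon_{ij} \tag{3}
$$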
where SpeakEngij is an indicator that equals one if individual i from country of birth j can
speak English as an adult. After controlling for country of birth (γj) and various individual-
level observables (Xij) such as age, the equation fundamentally estimates two age-at-arrival
profiles: one for immigrants from English-speaking countries and one for immigrants from
non-English-speaking countries. Using English fluency as the dependent variable, a1 should
equal zero as English-speaking immigrants already know English no matter the age at arrival,
and a3 should be negative, reflecting the critical period of language acquisition.
To relate speaking English with labor market outcomes, one can estimate the same equa-
tion as above, but instead using wages as the dependent variable. However, one drawback
of using this strategy with a cross section of data is collinearity: since age at arrival, age
and years in the United States are collinear, one cannot precisely separate the effects. For
example, if one controls for age in the regression and finds a negative age-at-arrival profile,
this may reflect older arrivals having fewer years in the United States rather than worse
adaptability at an older age. The typical solution to this problem is to include natives in
the regression to identify the aging profile, and estimate the effect of age at arrival on the
native-immigrant gap by age (Schaafsma and Sweetman, 2001). Nevertheless, one could run
the following regression:
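$$
\log(\text{Wages}_{ij}) = b_0 + b_1 \mathbb{1}[\text{AgeArrival} \geq 8]_{ij} + b_2 \text{NonEngSpeakCntry}_{ij} + b_3 \mathbb{1}[\text{AgeArrival} \geq 8] \times \text{NonEngSpeakCntry}_{ij} + \eta_j + \Gamma' X_{ij} + \nu_{ij} \tag{4}
$$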
Given that those arriving at older ages from non-English-speaking countries generally have
less English fluency, one would expect that these same immigrants would also have lower
wages; in other words, the expectation is that b3 is negative. Indeed, this is exactly what
studies find in present-day settings (Bleakley and Chin, 2004; Guven and Islam, 2015).
To retrieve a precise estimate of the return to speaking English, studies use a two-stage
least squares strategy where Equation 3 predicts English ability in the first stage with the
interaction 1[AgeArrival ≥ 8] × NonEngSpeakCntryij as the excluded instrument. The
results from this first stage are then used in the second stage for a regression of wages on
predicted English ability.35 The 2SLS estimate of the premium for English is essentially a
ratio of the reduced form and the first stage ($\beta_{IV} = b_3 / a_3$). In the typical study, the instrument
is not simply the interaction between a dummy variable for age at arrival above eight and
being born in a non-English-speaking country, but rather the interaction between the number of
years by which age at arrival exceeds 8 and being born in a non-English-speaking country.
This reflects the linear decrease in English-speaking ability observed in the data (Bleakley
and Chin, 2004). Finally, note that this IV estimates the effect only for those who are
affected by the instrument, that is, the local average treatment effect (LATE).
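The ratio interpretation of the 2SLS estimate can be illustrated on simulated data. The sketch below is not the estimation used in this appendix: it drops the country fixed effects and the controls, uses the dummy version of the instrument rather than the linear one, and every parameter value is invented for illustration. With only the two dummies and their interaction as regressors, the OLS coefficient on the interaction equals a difference-in-differences of cell means, which keeps the example free of any regression library.

```python
import random

def interaction_coefficient(records, outcome):
    """OLS coefficient on late x noneng in a saturated two-dummy model,
    computed as a difference-in-differences of the four cell means."""
    def cell_mean(late, noneng):
        values = [r[outcome] for r in records
                  if r["late"] == late and r["noneng"] == noneng]
        return sum(values) / len(values)
    return (cell_mean(1, 1) - cell_mean(0, 1)) - (cell_mean(1, 0) - cell_mean(0, 0))

rng = random.Random(0)
TRUE_PREMIUM = 0.2  # assumed return to speaking English in the simulation
records = []
for _ in range(40000):
    late = rng.randint(0, 1)        # arrived at age 8 or older
    noneng = rng.randint(0, 1)      # born in a non-English-speaking country
    ability = rng.gauss(0, 1)       # unobserved; raises both fluency and wages
    latent = 1.0 - 0.8 * late * noneng + 0.5 * ability + rng.gauss(0, 1)
    speak = 1 if latent > 0 else 0
    log_wage = TRUE_PREMIUM * speak + 0.5 * ability - 0.05 * late + rng.gauss(0, 0.5)
    records.append({"late": late, "noneng": noneng,
                    "speak": speak, "log_wage": log_wage})

a3 = interaction_coefficient(records, "speak")      # first stage
b3 = interaction_coefficient(records, "log_wage")   # reduced form
beta_iv = b3 / a3                                   # ratio (Wald) estimate
```

Because the simulated instrument is unrelated to unobserved ability, the ratio should land near the assumed premium even though a direct comparison of wages by fluency would be confounded.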
D.2 Regression evidence from the early 20th and early 21st centuries
I run the age-at-arrival regression specifications for years 1900 to 1930 from Equation 3 with
two dependent variables: the ability to speak English and occupational score. The control
variables are purposely parsimonious (we only control for country of birth, current age,
and sex), as other controls like current neighborhood or marital status could be considered
outcomes of age at arrival. The results are shown in Table D1. As expected from the critical
period hypothesis, the coefficient on the interaction between arriving at or older than age
8 and being from a non-English-speaking country is negative. The effect is consistently
negative across all four decades and estimates that arriving at an older age leads to about
a 1.9 to 8.9 percentage point drop in the likelihood of speaking English as an adult. Note
the wide variation in the age-at-arrival effect on English fluency, which suggests that
English proficiency may be driven by factors outside of the critical period hypothesis, such
as a different mix of countries of origin or selection at older ages.
The second panel shows the effect of an immigrant’s age at arrival on occupational score.
The results, especially in the years 1900 and 1910, show a perplexing correlation: immigrants
from non-English-speaking countries who arrived older than age 8, despite speaking English
at lower rates, held jobs that in 1900 paid about 1.1% more than otherwise similar immigrants who
arrived under the age of 8. The effect is similarly positive in 1910, and is negative but
statistically insignificant in 1920. Finally, by 1930, the expected results hold: immigrants
who arrived at older ages had lower English fluency rates and worse-paying occupations.
The results for lower English fluency and occupation are not being driven by literacy: when
controlling for literacy in the bottom two panels, the main results on English fluency and
occupation do not change.
How does the Bleakley and Chin analysis perform in the data between 1900 and 1930? In
Panel A of Table D2 we show the 2SLS estimates, recreating the Bleakley and Chin (2004)
specification (i.e., using a linear decrease in English-speaking ability rather than a dummy
variable). While the pooled estimate for 1900 to 1930 yields an IV estimate of about a 17.6
percent return to English skills, which reflects the relationships shown in Panel A of Figure 3, the
2SLS strategy estimates an unreasonably large negative premium for English in 1900 (18 log
points) and a very large positive premium in 1930 (86 log points). This reflects the decade
by decade analysis in Table D1.
35Thus, the reduced form equation of the outcome on the excluded instrument is Equation 4.
Why do the results differ so drastically between the pooled analysis and each year between
1900 and 1930? It may be that the premium truly rose between 1900 and 1930, or it may be
due to other factors besides a rise in the English premium. In the second row, we show what
happens when the birth country composition from 1900 to 1930 is fixed at 1900 levels. For
example, Germany was 26.6% of the sample in 1900, but only 8.9% of the sample in 1930;
we reweight the Germans from 1910 to 1930 to be 26.6% of the sample. The results from
this weighting show that there is a significant negative premium from 1900 to 1930 using
the age-at-arrival analysis. Therefore, one reason for the difference in premium from 1900 to
1930 in the unweighted sample (from negative to positive) is that the composition of
non-English-speaking sources shifted to poorer countries whose immigrants had worse outcomes. This suggests
that the instrumental variable is correlated with other aspects of the child’s environment
rather than just English ability. See Alexander and Ward (2018) for a further examination
of the effect of age at arrival in the Age of Mass Migration.
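As a worked illustration of the fixed-composition reweighting, suppose the weight for birth country c in census year t is simply the ratio of its 1900 sample share to its share in year t. The symbols w and s are introduced here for illustration and the exact weighting scheme is an assumption. For Germans in 1930 the figures quoted above give:

```latex
w_{c,t} = \frac{s_{c,1900}}{s_{c,t}},
\qquad
w_{\text{Germany},1930} = \frac{0.266}{0.089} \approx 2.99
```

Under this assumption each German in the 1930 sample counts roughly three times, which restores Germany's 26.6% share.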
In summary, the results from this section show that immigrants held higher paying jobs in
1900 despite having lower English skills. This implies that English was relatively unimportant
in 1900 compared to other non-language human capital that is correlated with age at arrival.
By 1930, this relationship is no longer true, which could either be because English was more
important or that non-English sources came from poorer countries. Ultimately, the age-at-
arrival analysis likely does not isolate the English premium throughout the entire 1900-1930
time period. This is likely because age-at-arrival effects are not similar between English-
speaking and non-English-speaking sources throughout different migrant cohorts.
Finally, we show the results from the instrumental variables strategy when applied to
the 2000 Census and the 2008-2012 ACS. Note that these estimates differ from the ones
reported in Bleakley and Chin (2004) first because we use an occupational score (that reflects
average earnings by occupation and country of birth) rather than income. Second, we use
the 2000 Census and 2008-2012 ACS rather than the 1990 Census. Third, we code the ability
to speak English as whether one spoke any English, rather than treat the English categories
as a continuous variable.
The results in Table D3 estimate a much larger premium for English skills relative to
the estimates from the early 20th century. The instrumental variables strategy estimates
that speaking any English leads to an increase in occupational score of 175 log points, about
10 times the estimate from the pooled analysis between 1900 and 1930. This evidence is
consistent with the main argument of the paper that the return to English skills has increased
over the past 100 years.
Table D1: Age-at-arrival Effects, 1900 to 1930
                                                       I             II            III           IV
                                                       1900          1910          1920          1930
Outcome: Can Speak English
Arrived Older than 8 × Non-English Speaking Country    -0.0460***    -0.0886***    -0.0295***    -0.0186***
                                                       (0.00185)     (0.00477)     (0.00493)     (0.00137)
Arrived Older than 8                                   -0.00492***   0.00117       -0.00261      0.000654
                                                       (0.000522)    (0.00199)     (0.00378)     (0.000605)
Outcome: Occupational Score
Arrived Older than 8 × Non-English Speaking Country    0.0105**      0.00507       -0.00260      -0.0211***
                                                       (0.00416)     (0.0105)      (0.00964)     (0.00463)
Arrived Older than 8                                   -0.0353***    -0.0331***    -0.0410***    -0.0381***
                                                       (0.00322)     (0.00865)     (0.00841)     (0.00411)
Outcome: Can Speak English, Control for Literacy
Arrived Older than 8 × Non-English Speaking Country    -0.0403***    -0.0823***    -0.0232***    -0.0124***
                                                       (0.00178)     (0.00467)     (0.00488)     (0.00133)
Arrived Older than 8                                   -0.00384***   0.00294       -0.00263      0.000980
                                                       (0.000600)    (0.00200)     (0.00381)     (0.000617)
Outcome: Occupational Score, Control for Literacy
Arrived Older than 8 × Non-English Speaking Country    0.0127***     0.00787       -3.03e-05     -0.0187***
                                                       (0.00415)     (0.0104)      (0.00964)     (0.00463)
Arrived Older than 8                                   -0.0349***    -0.0324***    -0.0410***    -0.0380***
                                                       (0.00321)     (0.00864)     (0.00839)     (0.00411)
N                                                      89,959        22,081        24,851        121,689
Notes: Data is from IPUMS (1900-1930). The dependent variable is listed for each panel. Each
column represents the same regression but for a different census. Sex, current age, and country
of birth are also controls in the regression. Robust standard errors are in parentheses. *p<0.10,
**p<0.05, ***p<0.01
Table D2: Recreating Bleakley and Chin (2004) 2SLS Estimate, early 20th century
                      Pooled 1900-1930    1900        1910        1920        1930
Normal Weights
Can Speak English     0.176***            -0.181**    -0.0463     -0.0309     0.858***
                      (0.0597)            (0.0731)    (0.0938)    (0.280)     (0.238)
Fix at 1900 Country Composition
Can Speak English     -0.101              -0.181**    -0.185      -0.680      0.144
                      (0.0983)            (0.0731)    (0.145)     (0.710)     (0.362)
Observations          258,580             89,959      22,081      24,851      121,689
Notes: Data is from IPUMS (1900-1930). The dependent variable is the logged occupational
score from the 1940 census. The sample consists of immigrants aged 16-55 who arrived under age
17. The second row reweights the 1910 to 1930 samples to have the same birth country
composition as in the 1900 sample. *p<0.10, **p<0.05, ***p<0.01.
Table D3: Recreating Bleakley and Chin (2004) 2SLS Estimate, early 21st century
                      Pooled 2000 Census and 2008-2012 ACS    2000        2008-2012
Can Speak English     1.752***                                1.505***    2.151***
                      (0.159)                                 (0.198)     (0.255)
Observations          356,050                                 171,535     184,515
Notes: Data is from IPUMS (2000, 2008-2012 ACS). The dependent variable is the logged
occupational score based on average total income by country of birth and occupation. The sample
consists of immigrants aged 16-55 who arrived under age 17. *p<0.10, **p<0.05, ***p<0.01.
E Immigrant-Specific Occupational Score
In this section, we repeat the information given in Alexander and Ward’s Appendix D (2018)
on the creation of the immigrant-occupational score for interested readers:
We create the immigrant-specific occupational score to improve on the standard
occupational scores used in the literature, such as the 1950 occscore from IPUMS
and the 1901 Cost of Living Survey score. There are important limitations when
using these commonly used scores; for example, the 1901 Cost of Living Score
is only representative for married urban families and therefore does not provide
an accurate estimate for rural workers. The 1950 occupational score reflects
earnings after World War II, and therefore understates wage gaps for data prior
to World War II (Goldin and Margo, 1992). Moreover, neither score reflects
earnings that are specific to immigrants and thus they understate any difference
between immigrants and natives, a key interest for this paper.
We create an alternative occupational score that is based on income reported
in the full-count 1940 United States Census. Our approach follows Collins and
Wanamaker (henceforth CW) (2014, 2017) in that we impute income separately
by group; but instead of groups separated by race and region as in CW, we
impute income separately by country of birth. Therefore, the occupation score
is essentially the average earnings in each occupation / country of birth cell.
We provide further details on how we create the score below, but we follow
Appendix I.b of CW (2017) to fix for self-employed earnings and non-monetary
compensation for farm laborers and farmers.
First, we take the full-count 1940 United States Census and top-code income
to 5,000 for wage workers. For self-employed workers, we ignore their reported
wage income since this is not consistently reported, but we instead impute their
income. To do this, we follow the strategy laid out by CW (2017) where we
take the ratio of self-employed earnings to wage-worker earnings by occupation
in the 1960 census, assume this ratio from 1960 is a good proxy for the ratio
in 1940, and multiply the ratio with the mean wage income by occupation and
country of birth. This leads to an imputed income for each self-employed person
that varies by occupation and country of birth. Then we collapse the 1940 data
by detailed occupation code and country of birth to get an average income for
each occupation, which forms the occupational score for the large majority of our
data.
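The steps for non-farm occupations can be sketched as follows. This is a simplified illustration rather than the code behind the score: the record layout and function name are hypothetical, the 1960 ratios are passed in as given, and self-employed workers in a cell with no wage workers are skipped, which is an assumption the text does not address.

```python
from collections import defaultdict

TOP_CODE = 5000  # top-code applied to wage income in the 1940 census

def occupational_scores(records, self_emp_ratio):
    """Average income by (occupation, country of birth) cell.

    records: dicts with keys "occ", "bpl", "self_employed", "wage_income".
    self_emp_ratio: dict mapping occupation to the 1960 ratio of
        self-employed earnings to wage-worker earnings.
    """
    # Step 1: mean top-coded wage income of wage workers in each cell.
    wage_sum, wage_n = defaultdict(float), defaultdict(int)
    for r in records:
        if not r["self_employed"]:
            cell = (r["occ"], r["bpl"])
            wage_sum[cell] += min(r["wage_income"], TOP_CODE)
            wage_n[cell] += 1
    wage_mean = {cell: wage_sum[cell] / wage_n[cell] for cell in wage_sum}

    # Step 2: impute self-employed income, then average everyone by cell.
    total, count = defaultdict(float), defaultdict(int)
    for r in records:
        cell = (r["occ"], r["bpl"])
        if r["self_employed"]:
            if cell not in wage_mean:
                continue  # assumption: no wage workers to anchor the imputation
            income = self_emp_ratio[r["occ"]] * wage_mean[cell]
        else:
            income = min(r["wage_income"], TOP_CODE)
        total[cell] += income
        count[cell] += 1
    return {cell: total[cell] / count[cell] for cell in total}

# Toy example: two wage workers and one self-employed carpenter born in Italy.
toy_records = [
    {"occ": "carpenter", "bpl": "Italy", "self_employed": False, "wage_income": 1000},
    {"occ": "carpenter", "bpl": "Italy", "self_employed": False, "wage_income": 6000},
    {"occ": "carpenter", "bpl": "Italy", "self_employed": True, "wage_income": 0},
]
toy_scores = occupational_scores(toy_records, {"carpenter": 1.5})
```

In the toy data the second wage is top-coded to 5,000, the wage-worker mean is 3,000, and the self-employed carpenter is assigned 1.5 times that mean.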
We do not take the above approach for farm laborers and farmers because they
may receive compensation in kind which is not recorded in the income data. We
take a few extra steps to estimate their incomes. Starting with farm laborers and
once again following CW (2017), we increase farm laborers’ mean wage income
in the 1940 census by 26 percent to reflect in-kind compensation, which is based
on the 1957 USDA report Major Statistical Series of the U.S. Department of
Agriculture. The next step is to estimate income for farmers. First, we assume
that the perquisite rate of farmers in the 1960 census is 35 percent (also based on
the USDA report), and we scale up their reported (wage and business) income
by this factor. To create the final estimate for farmer income in 1940, we assume
that the ratio between farm laborer and farmer income (inclusive of perquisites)
in 1960 is the same as in 1940. Therefore, we need to estimate farm laborers’
income in 1960, which we do by boosting their reported income by 19 percent to reflect
in-kind compensation.
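The farm adjustments described above amount to a short chain of ratios. The sketch below assumes that each percentage is applied as a multiplicative markup on reported income (for example, 35 percent becomes a factor of 1.35); the function name and the input figures are hypothetical.

```python
def farmer_income_1940(laborer_wage_1940, laborer_wage_1960, farmer_income_1960):
    """Impute 1940 farmer income from farm laborer income and the 1960 ratio."""
    laborer_1940 = laborer_wage_1940 * 1.26   # in-kind markup for 1940 farm laborers
    laborer_1960 = laborer_wage_1960 * 1.19   # in-kind markup for 1960 farm laborers
    farmer_1960 = farmer_income_1960 * 1.35   # perquisite markup for 1960 farmers
    # Assume the farmer-to-laborer income ratio in 1940 equals the 1960 ratio.
    return laborer_1940 * farmer_1960 / laborer_1960

# Made-up inputs, in dollars, purely to show the arithmetic.
example_income = farmer_income_1940(400, 1000, 2000)
```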