
Have language skills always been so valuable? The low return to English fluency during the Age of Mass Migration∗

Zachary Ward†

The Australian National University

February 2018

Abstract

English skills are highly valuable for today’s immigrants, but has this always been the case? We estimate the premium for English fluency and the rate of language acquisition in the early 20th century US using new linked data on over two hundred thousand immigrants. Compared with today’s immigrants, fewer early 20th century immigrants arrived with English proficiency, yet many acquired language skills rapidly after arrival. Learning to speak English was correlated with a small upgrade in occupational-based earnings (0 to 4.5 percent). Various empirical methods suggest that the English premium has more than doubled between 1910 and 2010, revealing that English fluency has become an increasingly large barrier to immigration over time.

JEL Classification: F22, J24, J61, J62, N31, N32

Keywords: English fluency, language, immigrant assimilation

∗This paper was previously circulated as “The Role of English Fluency in Migrant Assimilation: Evidence from United States History.” We would like to thank Brian Cadena, Ann Carlos, Katherine Eriksson, Dustin Frye, Tue Gorgens, Tim Hatton, Priti Kalsi, Ling-Yu Kong, Edward Kosack, Martine Mariotti, Xin Meng, Amber McKinney, Julie Moschion, John Tang and Jose Tessada for helpful pointers and discussions. We also thank the audience members at the Australian National University, the 2015 Australasian Cliometric Conference, Colby College, the 2016 EH-Clio conference at Pontificia Universidad Catolica, the 2015 Natural Experiments in History Workshop, the 2017 Society of Labor Economists Annual Conference, La Trobe University, the University of Adelaide, the University of Colorado, the University of Melbourne, and the University of Queensland. Many thanks go to Lee Alston who helped me to gain access to the full-count census data, Rohan Alexander who created the file to Americanize names, and to Rowena Gray who generously shared her data on tasks. All errors are my own.

†Email: [email protected], Research School of Economics, HW Arndt Building 25A, College of Business and Economics, The Australian National University, Canberra, ACT 2600, Australia.


1 Introduction

The value of English skills in the United States is estimated to be quite high − depending

on the methodology, going from speaking no English to speaking English very well leads to

a 33 to over 100 percent increase in income (Bleakley and Chin, 2004; Chiswick and Miller,

2014). This implies that gaining English fluency is one of the most important investments

that an immigrant can make. Further, the high value of English fluency is fundamental for

understanding one of the most-studied patterns in the immigration literature: the migrant

assimilation profile, as shown in Figure 1 (Chiswick, 1978; Borjas, 2015; Lubotsky, 2007).

Immigrants who arrive, often without the ability to speak English, earn much less than natives;

however, in the decades after arrival, immigrants acquire more human capital (such as English

fluency) and converge towards natives’ earnings.

Have language skills always been this valuable? In this paper, we turn to the early 20th

century and estimate the importance of English fluency during the Age of Mass Migration,

one of the key migration episodes in American history. There are reasons to suspect that the

English premium was lower in the early 20th century; first, the early 20th century assimilation

profile in Figure 1 suggests a low value of English skills. Immigrants who arrived one hundred

years ago initially held similarly skilled jobs to natives at arrival, potentially showing no

penalty despite arriving without English skills (Abramitzky, Boustan and Eriksson, 2014).1

Moreover, immigrants surprisingly had no improvement relative to natives after arrival,

possibly showing no benefit from acquiring English fluency. A second reason to suspect a

lower return to English fluency in the past is that job tasks were much less interactive relative

to today since the structure of the economy was dominated by agriculture and manufacturing

rather than services (Katz and Margo, 2014; Michaels et al., 2017). While these reasons

point to a lower return to English fluency, early 20th century officials stressed that English

1 The early 20th century assimilation profile can only be estimated based on the skill content of immigrant occupations and not wages, since wage income is not recorded until the 1940 United States Census. However, the same basic differences in assimilation patterns hold when using occupations in recent decades; for example, see Borjas (2015).


fluency was a primary determinant of immigrant economic outcomes, and argued that English

should be taught to new arrivals; indeed, this was one force driving the “Americanization”

movement, which led to many states enacting policy changes aimed to assimilate immigrants

into American culture (Lleras-Muney and Shertzer, 2015; Fouka, 2016).

One reason why those in the early 20th century claimed that English fluency was highly

valuable is because cross-sectional estimates showed that both English fluency and occupa-

tional status increased rapidly with more years of stay, suggesting that immigrants learned to

speak English and upgraded their jobs (Jenks and Lauck, 1926). However, as is well known

today, cross-sectional estimates of immigrant outcomes are biased by cohort quality change

and selective return migration (Borjas, 1985; Dustmann and Gorlach, 2015; Lubotsky, 2007).

Panel data are needed to address these issues because they track the same individual over time;

indeed, using panel data is especially important in the context of the Age of Mass Migration

since the return flow was large (≥ 40% of inflows) and highly selective (Abramitzky, Boustan

and Eriksson, 2014; Bandiera, Rasul and Viarengo, 2013; Ward, 2017). Therefore, for immi-

grants who arrived between 1900 and 1919, we create new panel data using machine-learning

techniques based on Feigenbaum (2016) to estimate how many permanent immigrants were

able to speak English near arrival and the rate of acquisition after arrival. Ultimately, we

are able to follow over 200,000 immigrants, either from 1910 to 1920 or from 1920 to 1930.

We show that few pre-World War I arrivals from non-English-speaking sources came

with English skills (about 30 percent of recent arrivals in the 1910 Census). Yet while many

immigrants arrived without the ability to speak English, they had a high rate of language

acquisition: within ten years of arrival, more than 80 percent of immigrants were able to

speak some English. Importantly, this is found in the panel data, which shows that the fast

rate of acquisition is not biased by the selective return of those with lower English skills.

We also show with task-based measures that immigrants sorted into jobs that required

more communication tasks in the decades after arrival; however, after two decades of stay

immigrants still held jobs that required fewer communication skills than those held by natives.


The fact that many arrived without English skills contrasts with the lack of an

occupational-based earnings gap between immigrants and natives in Figure 1; moreover, the rapid increase

in English fluency after arrival also contrasts with the flat assimilation profile. This

suggests that arriving without English skills did not yield a large penalty; further, gaining

language human capital had little effect on improving immigrants’ relative position with

natives. We use a variety of empirical strategies to test whether this is true. Primarily, we

exploit the more than 200,000 linked individuals in the panel data by estimating an individual fixed effect

model; this method eliminates time-invariant individual-specific unobservables that may be

correlated with English skills, such as ability. The analysis reveals that the relationship

between speaking English and occupational outcomes was weak in the early 20th century,

where those who gained English fluency had an associated increase in occupational-based

earnings of 0 to 4.5%; these results are consistent with a flat assimilation profile in the early

20th century.

Our main approach of using individual fixed-effects may yield an upper bound estimate

since acquiring English skills is endogenous and those who acquired English skills may have

acquired other unobservable skills. One may also be concerned that linking messy historical

data will lead to false links and bias our estimate (Bailey et al., 2017). Using an alternative

empirical strategy with non-linked data − instrumenting for English fluency with the inter-

action of age at arrival and arriving from a non-English-speaking source as in Bleakley and

Chin (2004) − leads to the same qualitative conclusion that language skills were relatively

unimportant for upgrading jobs in the early 20th century compared with the high return to

English skills for immigrants in recent decades.

Over the past 100 years, the gap between immigrants’ and natives’ occupations

has gradually widened; one reason may be that the English premium has increased

or that fewer recent migrants arrive with the ability to speak English. Using the 2000 Census

and 2008 to 2012 ACS, we provide suggestive evidence that more recent immigrants arrived


with basic English skills than early 20th century arrivals (74 percent vs. 30 percent).2 Although

we are not able to recreate our preferred individual fixed effects methodology with

recent data, we also show with OLS that the occupational premium for English skills is about

18.4 percent, more than double the premium from the early 20th century when estimated

with a similar methodology.3 Age-at-arrival analysis is also consistent with a large increase

to the English premium over time. Thus, the widening gap between natives and

immigrants over the last 100 years is likely not due to a declining fraction of English speakers;

rather, one reason is that the penalty for arriving without English fluency has increased over the past

century.

Our paper draws from the large literature on the relationship between the value of human

capital and technology. We argue that the return to a specific piece of human capital −

the ability to speak English − was low in the early 20th century and has increased over

time, relating to others who have shown an increased return to education over the second

half of the 20th century (Acemoglu and Autor, 2011; Goldin and Katz, 2008). A common

interpretation attributes the increasing return to human capital to skill-biased technical

change, where demand shifts that complement high-skilled work have outpaced the relative

increase in high-skilled workers. We argue that the increasing value of English skills may also

reflect technology-driven shifts; for instance, others have shown that there was an increase in

demand for those with people skills in the early 20th century, an increase in the amount of

interaction in jobs over the 20th century and an increase in the value of social skills in recent

decades (Deming, 2017; Gray, 2013; Michaels, Rauch and Redding, 2017). Therefore, the

technological setting is key for understanding the relative performance of immigrants and

2 The early 20th century data was recorded by an enumerator as a binary variable (0=cannot speak English, 1=can speak English). The late 20th century data was self-reported on a more qualitative scale (0=speaks no English, 1=speaks English not well, 2=speaks English well, 3=speaks English very well). The results in this paragraph treat people who report any English skills (values 1-3) in the late 20th century as able to speak English.

3 We cannot use the individual fixed-effects methodology with recent census data since it is not linked over time. Instead, these results are from a simple OLS regression like that reported in Chiswick and Miller (2014). We use a similar strategy to impute earnings by occupation in the early 20th century and early 21st century data for this estimate.


natives in the early 20th century, just as it is for understanding the outcomes of immigrants

in more recent decades (Lalonde and Topel, 1992; Lubotsky, 2011; Perlmann, 2005).

Our paper also relates to a fast-growing literature on the Age of Mass Migration using

newly digitized sources to study the major immigration questions, such as the selection of

immigrants (Abramitzky, Boustan and Eriksson, 2013; Spitzer and Zimran, 2017) and the

effects of immigration in the short run and long run (Ager and Hansen, 2017; Sequeira, Nunn

and Qian, 2017; Tabellini, 2017). Our paper connects with the assimilation literature, where

others have shown that cultural assimilation in terms of having more American-sounding

names leads to a higher level of income and occupational standing for the first and sec-

ond generation (Abramitzky, Boustan and Eriksson, 2017; Biavaschi, Giulietti and Siddique,

2017). We complement these results by showing that cultural assimilation in terms of ac-

quiring language skills was quite fast; however, this did not lead to substantial economic

assimilation − at least relative to today’s estimates of the English premium. These results

help to understand why early 20th century language policy that enforced English-only

instruction did little to improve immigrant economic outcomes (Lleras-Muney and Shertzer,

2015).

2 Historical Context

Since 1850, there has been a secular decline in the percentage of immigrants from English-

speaking sources (see Figure 2). For example, about 70 percent of the immigrant stock in

1850 were from England or Ireland, but soon non-English-speaking countries such as Germany

and Norway became major senders to the United States; by 1880, the percentage from

English-speaking sources had dropped to 50 percent.4 The 1880s marked a turning point, as the

geographical composition of the flow shifted toward lower-income countries in Southern and Eastern

Europe; by 1910, only 30 percent of the immigrant stock came from an English-speaking source.

4 The period when Europeans came to the United States is referred to as the Age of Mass Migration (1850-1913), which has been studied extensively, most notably by Hatton and Williamson (1998). See Abramitzky and Boustan (2017) for a recent overview of the literature on historical immigration to the United States.


Fewer of the newer arrivals arrived with English fluency and appeared to pick up English at

slower rates than prior European arrivals: for example, the stock’s ability to speak English

decreased rapidly between 1900 (82%) and 1910 (70%).5

The worsening perceived quality of Southern and Eastern European immigrants led to a

severe nativist backlash against the country’s open immigration policies, with immigrants’

decreasing ability to speak English a particularly salient feature.6 This view was also held

by some of the most prominent academics studying immigration at the time: after analyzing

data on thousands of immigrants from the Dillingham Commission, Jeremiah Jenks and W.

Jett Lauck argued that “the greatest obstacle to a more rapid [assimilation] is that the recent

immigrant cannot speak English” (Jenks and Lauck, 1926; pg 269). They concluded that

“progress in industry, in business, in the trades and professions and in the accumulation

of property, are all primarily a result of the development in the recent immigrant popula-

tion having an English-speaking ability” (pg 293). These statements were reflective of the

“Americanization” fervor during the 1910s and 1920s, a movement focused on assimilating

immigrants through language instruction for children and adults (Bloch, 1920; Lleras-Muney

and Shertzer, 2015).

Despite the heightened focus on English during this time period, the importance of

English fluency for occupational upgrading still remains unknown.7 Jenks and Lauck (1926)

argued for the importance of English by showing the positive associations of staying longer in

the United States, English proficiency, and skill, suggesting that those who stayed longer were

5 These results are based on a sample of foreign-born individuals over the age of 16, which was drawn from IPUMS. See Figure A1 for the rate of English fluency across the 20th century.

6 Many began to lobby the government to maintain the “national origins” of the American population, which culminated in the Immigration Quota Acts of 1921 and 1924. Prior to this in 1906, English fluency was added as a requirement for citizenship.

7 There are a few studies that estimate the association between English proficiency and occupational outcomes (Blau, 1980; Lleras-Muney and Shertzer, 2015). Most notably, Jasso and Rosenzweig (1989, 1980) estimate the premium of English and acquisition rates in 1900 for Germans and 1980 for Mexicans. They find a larger “return” for speaking English in the early 20th century, but the interpretation across time is unclear because the estimated return in 1900 is based on occupational prestige while the return in 1980 is estimated for hourly wages. We will show results across time using a similar measure of occupational standing. An exception to these studies on the occupational return in the early 20th century is Inwood et al. (2016), who show that the association between English proficiency and wages increased between 1911 and 1931.


able to learn English and upgrade their occupation. However, as is well known today, these

results could simply reflect selective return migration of low-skilled non-English-speaking

immigrants.8 We improve on Jenks and Lauck’s original study and more recent research on

English acquisition during this time period (e.g. Vigdor, 2010; Kuziemko and Ferrie, 2014;

Jasso and Rosenzweig, 1989) by creating panel data, reducing bias from selective return

migration.
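The selective-return concern can be illustrated with a small arithmetic sketch. The cohort below is entirely hypothetical: the counts, the field names and the function `fluency_growth` are illustrative assumptions, not data or code from this paper. If non-English speakers are more likely to leave, a repeated cross section compares all arrivals with later stayers and so overstates how much individuals learned, whereas a panel compares stayers with themselves.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Immigrant:
    speaks_at_arrival: bool  # recorded as able to speak English near arrival
    speaks_later: bool       # recorded as able to speak English a decade later
    stayed: bool             # False if the person returned home before the later census


def share(flags: List[bool]) -> float:
    """Fraction of True values in a non-empty list."""
    return sum(flags) / len(flags)


def fluency_growth(cohort: List[Immigrant]) -> Tuple[float, float]:
    """Return (cross-sectional growth, panel growth) in the English-speaking share."""
    stayers = [p for p in cohort if p.stayed]
    later = share([p.speaks_later for p in stayers])
    # Repeated cross section: the early census sees everyone, the later one only stayers.
    cross_section = later - share([p.speaks_at_arrival for p in cohort])
    # Panel: the same stayers are observed in both censuses.
    panel = later - share([p.speaks_at_arrival for p in stayers])
    return cross_section, panel


# Hypothetical cohort of 10: four non-speakers return, six stay.
cohort = (
    [Immigrant(True, True, True)] * 3        # spoke English at arrival, stayed
    + [Immigrant(False, True, True)] * 2     # learned English after arrival, stayed
    + [Immigrant(False, False, True)] * 1    # never learned, stayed
    + [Immigrant(False, False, False)] * 4   # non-speakers who returned home
)

cross_section, panel = fluency_growth(cohort)
print(f"cross-section growth: {cross_section:.2f}, panel growth: {panel:.2f}")
```

In this made-up cohort the cross section implies a gain of about 53 percentage points, while the same stayers followed over time gained about 33; the difference comes entirely from who left.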

3 Data

3.1 Measuring English Skills in the Early 20th Century

To estimate the value of English skills in the early 20th century, we use the restricted-access

full-count 1910 to 1930 Censuses from IPUMS, which we accessed at the NBER (Ruggles et

al., 2017).9 Unlike mailed census forms in recent decades, the early 20th century censuses

were taken by enumerators from door to door, and thus the ability to speak English (“Yes”

or “No”) was a judgment by the enumerator rather than self-reported as in recent Census

data. Enumerators did not have an explicit cut-off point for whether a respondent was able

to speak English between 1910 and 1930, leading to a familiar problem of measurement error

in language studies (Bleakley and Chin, 2004; Dustmann and Van Soest, 2001).10 Moreover,

8 See Abramitzky et al. (2014) and Borjas (1985) for a discussion of this problem of estimating the rate of occupational upgrading with repeated cross sections. Kuziemko and Ferrie (2014) show a similar set of correlations as Jenks and Lauck (1926) between length of stay, English fluency and skill using micro-data from IPUMS. See Vigdor (2010) for a nice discussion of the English acquisition of immigrants across the early and late 20th century; however, as is noted by the author, the results also may be biased by selective emigration.

9 While English skills were measured in the 1890 and 1900 Census, we focus our attention on the 1910-1930 Censuses for a few reasons. First, the 1890 micro-data was lost in a fire. Second, there may be more measurement error in the 1900 Census variable (Stevens, 1999). In 1900, three census questions under the broad heading of ‘Education’ were asked in a row: whether an individual could read, could write, and could speak English. The Census Bureau noted that some census takers simply recorded ‘yes’ or ‘no’ three times in a row; this problem was discovered as it appeared that black individuals had low rates of English proficiency when they likely only had low literacy rates (Census Bureau, 1913, page 1265). By 1910, the census sheets were corrected so as to not have the questions in order. English proficiency was not asked of immigrants again until 1980.

10 The 1890 Census gave instructions to record English fluency based on whether an immigrant was “able to speak English so as to be understood in ordinary conversation”, a higher bar than simply knowing a


it is unclear whether an immigrant was responding to the enumerator questions in person, or

whether it was another member of the household answering for the immigrant; unfortunately,

who was responding to the enumerator is not included in the data.

Given issues with measurement, one must first ask whether the English variable from the

1910 to 1930 Censuses actually reflects true English skills. A straightforward test of this is

to estimate whether the English variable follows well-known age-at-arrival patterns, where

older arrivals are less likely to speak English as adults compared with younger arrivals. This

pattern is thought to be related to neurobiological changes in the brain prior to and during

puberty, which make it more difficult to acquire a second language (Singleton, 1999). There-

fore, we estimate the age-at-arrival profile by regressing English ability on age at arrival

and other controls such as country of birth, age, sex and fraction of immigrants from their

own country of birth in the county. To provide an idea of how these age-at-arrival patterns

look for the more well-known English variable from recent years, we separately estimate the

age-at-arrival patterns for the pooled 1900-1930 Census and the pooled 2000 Census and

2008-2012 ACS.11

We plot the estimated age-at-arrival fixed effects in Figure 3 for both the early 20th

century data and the early 21st century data. Fortunately, the early 20th century data

shows a similarly sloped age-at-arrival pattern where older arrivals were less likely to speak

English as an adult compared with younger arrivals, affirming that the measure reflects

some level of English proficiency. The age-at-arrival patterns in the early 20th century are

strikingly similar to the patterns from the early 21st century when one codes the English

variable as whether one is able to speak any English, whether not well, well or very well.12

few words. However, this guidance was not in the instructions for later Censuses.

11 The sample for this regression consists of all immigrants from non-English-speaking countries aged 17 to 55 and those who arrived under the age of 17. The regression controls for age, country of birth, sex, cohort of arrival, year and fraction of county from same country of birth. We do not use the 1980 and 1990 Census because they do not record a specific year of arrival, making it impossible to back out a precise age at arrival.

12 On the other hand, if one follows the more common method where those who speak English “not well” are instead placed in the unable to speak English group (Chiswick and Miller, 2014), then a 17-year-old arrival would be 30 percentage points less likely to speak English, much too steep of a decline relative to the early 20th century data.


Therefore, we interpret the English variable in the early 20th century as reflective of basic

English skills, where there was a low bar to clear for being recorded as able to speak English.

Further, whenever we compare English proficiency across the early and late 20th century, we

will make the assumption that those who self-reported any English ability in the late 20th

century had a similar level of skill as those who were recorded as able to speak English in

the early 20th century. Of course, this assumption is untestable, so comparisons across time

are only suggestive.

3.2 Building New Linked Data

With this measure of English ability, we aim to estimate how many immigrants arrived

with English skills, the rate at which immigrants learned to speak English after arrival, and

the return to speaking English. To answer the research question on the speed of language

acquisition, we need to create a panel that tracks individual immigrants’ ability to speak

English over time; this is an alternative to the method of using repeated cross sections, which

suffers from the well-documented problem of selective return migration (see Abramitzky et

al. (2014) for a discussion of this bias).

Therefore, we build a new panel to fix any bias that arises from selective out-migration.

To do this, we take Europeans first observed in the 1910 full-count Census and the

1920 full-count Census, and then link them ten years later to the full-count 1920 and 1930

censuses, respectively.13 For each of the base samples, we do not link forward the entire set

of European immigrants. Primarily, we drop immigrants from English-speaking countries

such as England and Ireland because we are interested in how non-native speakers acquired

human capital after arrival. Second, we only keep immigrants who arrived within the past

ten years in order to track specific immigrant cohorts over time.14 Third, we drop immigrants

who were under the age of 10 at first observation because they were not asked about their

13 Note that we do not use the full-count 1900 Census because the English variable has yet to be digitized.

14 We drop those who arrived in the same year as the Census (e.g. 1910 arrivals in 1910) since the Census does not cover the entire year of arrivals.


ability to speak English; we also drop those who are older than 40 to ensure that no one

would be older than 50 ten years later − this is to reduce bias from death. Note that when

we estimate the occupational return to acquiring English fluency, we will drop those without

reported occupations, which primarily drops children.
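The sample restrictions above can be summarized as a simple filter. The following is a minimal sketch under stated assumptions: the `Record` fields and the function `in_base_sample` are hypothetical names, the input is assumed to be already limited to European-born immigrants, and the set of English-speaking sources lists only the two examples named in the text rather than the paper's full list.

```python
from dataclasses import dataclass

# Illustrative only: the text names England and Ireland as examples; the
# paper's full list of English-speaking sources is not reproduced here.
ENGLISH_SPEAKING_EXAMPLES = {"England", "Ireland"}


@dataclass
class Record:
    birthplace: str
    arrival_year: int
    age: int  # age at first observation


def in_base_sample(rec: Record, census_year: int) -> bool:
    """Apply the three restrictions described in the text to one census record."""
    from_non_english_source = rec.birthplace not in ENGLISH_SPEAKING_EXAMPLES
    # Arrived within the past ten years, excluding arrivals in the census year itself.
    years_since_arrival = census_year - rec.arrival_year
    recent_arrival = 1 <= years_since_arrival <= 10
    # Drop those under 10 (not asked about English) and those older than 40.
    age_ok = 10 <= rec.age <= 40
    return from_non_english_source and recent_arrival and age_ok
```

For the 1910 base year, for example, `in_base_sample(Record("Italy", 1905, 25), 1910)` keeps the record, while a 1910 arrival or a 41-year-old is dropped.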

To link the data across years, we find similar matches based on first name, last name, year

of birth, country of birth and year of arrival. To find the best match, we follow the method

outlined by Feigenbaum (2016) and first hand-link 15 random samples of 2,000 immigrants

from the 1920 to 1930 Censuses each. The 15 random samples are from 15 different language

groups, such as German, Italian, Polish and Dutch; see Appendix B for more detail on

this and the overall linking process. After hand-linking these individuals to form a set of

training data, we estimate a probit to find the best match, relying on observables such as the

closeness in year of birth, year of arrival, and string distance for first and last name.15 We are

particularly concerned with falsely matching an immigrant to a wrong individual and keeping

them in the dataset (Bailey et al., 2017); this is because we will use an individual fixed effects

methodology to estimate the return to English skills, which may lead to attenuation bias if

we have a high level of false positives.

Therefore, we take a very conservative approach and only keep immigrants if they have a

high predicted probability of being a true match from the probit, and have no close second

matches. Making this decision to reduce false positives necessarily lowers the efficiency of

finding matches because few clear both bars of a high probability of a link and no close

second match; however, we believe this is a worthwhile trade off since we are linking full-

count censuses and can afford a much lower linking rate. Based on the predicted probabilities

from the training data, we are able to track 96,400 males from 1910 to 1920, and 108,590

males from 1920 to 1930. Indeed, the backward linking rates for 1910 to 1920 (4%)

and 1920 to 1930 (5%) are lower than those of others who link immigrants in the literature with

15 We are also concerned about immigrants who Americanize their first name (Biavaschi et al., 2017), so we link based on an Americanized version of each first name based on information from behindthename.com.


automated methods (around 15-25 percent) and less than in our training data (25 percent)

(Catron, 2017). This demonstrates that efficiently modeling the hand-linking process for

immigrants is more difficult than modeling the link between the 1915 Iowa and 1940 Census

as in Feigenbaum (2016). Despite the low linking rate, we still end with over 200,000 linked

immigrants in our sample. One may be concerned with the low linking rate; we have also

linked immigrants using more traditional methods related to Abramitzky et al. (2014) with

about a 15-25 percent linking rate, and find the same qualitative results that English language

acquisition was fast and that acquiring English fluency was associated with a small upgrade

in occupation.
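The conservative rule of keeping a link only when the best candidate has a high predicted probability and no close second match can be sketched as follows. This is a hedged illustration, not the paper's code: the function `select_link` and both cut-off values are hypothetical, and the scores are assumed to come from a fitted model such as the probit described above.

```python
from typing import Dict, Optional


def select_link(
    scores: Dict[str, float],
    min_prob: float = 0.9,    # hypothetical cut-off for a "high" match probability
    max_second: float = 0.5,  # hypothetical ceiling on the runner-up's probability
) -> Optional[str]:
    """Return the id of the candidate to keep, or None if no link is kept.

    `scores` maps each candidate's id to its predicted probability of being
    the true match for one base-year immigrant.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_prob = ranked[0]
    second_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    # Both bars must be cleared: a strong best match and no close second match.
    if best_prob >= min_prob and second_prob <= max_second:
        return best_id
    return None
```

Tightening either cut-off lowers the share of records linked but also lowers the risk of false positives, which is the trade-off discussed in the text.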

While the linked datasets solve the problem of selective return migration, there are also a

few limitations. Primarily, linked datasets are non-random, as individuals with very common

names, those who died or those who changed their name (e.g. females after marriage) cannot

be linked forward. We are particularly concerned that a successful link is related to better

English proficiency; if so, then we would mistakenly infer that permanent immigrants had

better English skills at arrival when it would actually just reflect a bias from the linking

process. To gauge the representativeness of the sample, we compare an arrival cohort in

the IPUMS random sample to the same cohort in the linked sample in the second year of

observation (e.g., 1920 Census for the 1910 to 1920 linked sample). The linked and cross-

sectional sample should contain the same information since each immigrant has stayed in

the United States for 11 to 20 years; however, while the IPUMS cross section is random, the

linked sample may not be.

The representativeness of each linked sample is shown in Table 1. The linked samples

are indeed biased; they contain immigrants with higher English-speaking ability, by 2.2 to

3.4 percentage points. This is likely related to differences in linking by country of birth,

where we are much more likely to link an immigrant from Northern and Western Europe than one from

Southern and Eastern Europe; this pattern is common in studies that link immigrants (e.g.

Abramitzky et al., 2014). We are also more likely to link farmers rather than laborers or


low-skilled service workers. One may be concerned that our linking strategy relies on very

unusual names that may not be representative of immigrants in general, so we test whether

the names of individuals have a different amount of “foreignness” according to the index used

by Abramitzky et al. (2017); however, we do not find that our sample has names that are

especially more or less foreign-sounding.16 Due to the differences in representativeness, we

re-weight our panel to be representative on ability to speak English, literacy, occupational

categories, and country of birth.17 We use our weighted sample for the rest of the analysis.
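One way to re-weight a linked sample so that it matches a random sample is cell-based post-stratification, sketched below. This is an assumption about the general technique rather than a description of the exact procedure used here; the function names and the toy shares are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def cell_weights(
    linked_cells: List[Hashable], target_shares: Dict[Hashable, float]
) -> List[float]:
    """Weight each linked record by (target share of its cell) / (linked share of its cell)."""
    counts = Counter(linked_cells)
    n = len(linked_cells)
    return [target_shares[cell] / (counts[cell] / n) for cell in linked_cells]


def weighted_share(cells: List[Hashable], weights: List[float], cell: Hashable) -> float:
    """Weighted fraction of records that fall in `cell`."""
    total = sum(weights)
    return sum(w for c, w in zip(cells, weights) if c == cell) / total


# Toy example: English speakers are over-represented in the linked sample
# (3 of 4 records) relative to a hypothetical random-sample share of one half.
linked = ["speaks"] * 3 + ["does_not_speak"]
target = {"speaks": 0.5, "does_not_speak": 0.5}
weights = cell_weights(linked, target)
```

After weighting, `weighted_share(linked, weights, "speaks")` equals the target share of one half. In practice the cells would be defined jointly over the characteristics listed above, such as English ability, literacy, occupational category and country of birth.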

4 English Fluency Rates

4.1 Speed of English Acquisition

We estimate the rate of English acquisition for arrivals between 1900 and 1919 using the

following flexible form:

$$\text{SpeakEnglish}_{ict} = \phi_c + \mu_{t-c} + \Pi' X_{it} + \varepsilon_{it} \qquad (1)$$

The ability of individual i from arrival cohort c to speak English in census year t is modeled as

a non-linear function of years in the United States ($\mu_{t-c}$), incorporated as fixed effects for

every two years (e.g. 0 to 1 years, 2 to 3 years, etc.). This parameterization captures the

quick acquisition of English within the first ten years of stay and a leveling off in the second

ten years. We also estimate arrival cohort fixed effects ($\phi_c$) for every five-year entry cohort

to capture changes in the cohort quality in terms of English speaking ability (e.g. 1900-

1904 arrivals, 1905-1909 arrivals, etc.). In various regressions we include control variables

16The foreignness index ranges from 0 to 100 and measures the prevalence a given first or last name appearsfor foreigners relative to natives. A measure close to 100 indicates that it is more foreign.

17 We reweight to match the random sample means. We have alternatively used the inverse-proportional weighting method proposed by Bailey et al. (2017). Using inverse-proportional weights does not qualitatively change our estimated association between English acquisition and occupational upgrading, but does lead English fluency rates in our linked sample to be slightly higher than English rates in the random sample. Therefore, we prefer our method of weighting to pin down the linked sample’s English levels to the random sample’s English levels.

in Xit such as the country of birth and age at arrival. To capture potential biases from

selective return migration, we run the regression twice, once with the panel data and once

with repeated cross sections.
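The binning behind Equation (1) can be sketched as follows. The code assigns each observation to a two-year bin of years in the United States and a five-year arrival cohort, then takes the raw share speaking English in each cell. The function names and toy records are hypothetical, and the cell means are only a descriptive analogue: the regression itself is additive in the cohort and years-in-US effects and includes the controls in Xit.

```python
from collections import defaultdict


def years_in_us_bin(census_year, arrival_year):
    """Lower bound of the two-year bin (0 for 0-1 years, 2 for 2-3 years, ...)."""
    years = census_year - arrival_year
    return (years // 2) * 2


def arrival_cohort(arrival_year, start=1900, width=5):
    """First year of the five-year entry cohort (1900, 1905, ...)."""
    return start + ((arrival_year - start) // width) * width


def fluency_by_cell(records):
    """Share speaking English within each (cohort, years-in-US bin) cell."""
    totals = defaultdict(lambda: [0, 0])
    for census_year, arrival_year, speaks_english in records:
        key = (arrival_cohort(arrival_year),
               years_in_us_bin(census_year, arrival_year))
        totals[key][0] += speaks_english
        totals[key][1] += 1
    return {key: spoke / count for key, (spoke, count) in totals.items()}


# Hypothetical records: (census year, arrival year, speaks English 0/1).
records = [
    (1910, 1909, 0),
    (1910, 1909, 1),
    (1910, 1901, 1),
    (1920, 1901, 1),
]
cell_means = fluency_by_cell(records)
```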

The estimated rate of acquisition is shown in Figure 4 for the 1900-04 cohort. The

regression estimates that 30 percent of arrivals in the panel data, or those who stayed at

least ten years, knew how to speak English within the first year of arrival. After this low

start, the rate of English proficiency increased rapidly within ten years of arrival: for those

who had stayed ten to eleven years, 80 percent of immigrants were able to speak English,

50 percentage points higher than arrivals in their first year. This estimate is reasonable as

second language acquisition can take only a couple of years, and we interpret the English

variable as reflective of basic English skills (Krashen et al., 1979). After ten years of stay, the

rate of acquisition leveled off as most immigrants knew how to speak some English; however,

even after 20 years of stay, about 10% of immigrants were still unable to speak English.

The estimated rate of acquisition from the repeated cross sections is also shown in

Figure 4. The repeated cross sections imply a higher rate of English acquisition

since arrivals start at a lower fluency rate of 19% but end at the same 90% fluency after

twenty years. This is consistent with the arguments by Lubotsky (2007) and Abramitzky

et al. (2014) that repeated cross-sections tend to overestimate improvements in immigrants’

attributes because of negatively selected return migration. In this case, immigrants with

worse English proficiency at arrival tended to return at higher rates.

How does this speed of acquisition compare across the earlier and more recent immigrant

cohorts? In Figure 4, we also plot the mean English fluency over time of the 1990-94 arrival

cohort, estimated in the same way as the early 20th century using repeated cross sections,

but this time pooling immigrants from non-English-speaking sources in the 2000 Census

and 2008-2012 ACS.18 The figure shows two main conclusions. First, more arrivals from

non-English-speaking countries in 1990 had a basic level of English fluency at arrival than

18 Note that with the 1990s cohort, we code an observation as able to speak English if they spoke English not well, well or very well.

immigrants from the early 20th century cross section (74% vs. 19%). This could be due to the

spread of English internationally compared with the beginning of the 20th century or visa

restrictions requiring some level of English proficiency. In contrast, for the pre-World War

I arrivals, there was nothing stopping individuals from freely arriving within a 2-week trip

from Europe.

Another conclusion from Figure 4 is that immigrants during the Age of Mass Migration

acquired English fluency at fast rates after arrival; they nearly caught up to the 1990s arrivals

within 15 to 20 years of stay when over 90% of immigrants could speak some English.

Unfortunately, selective out-migration cannot be corrected for in the late 20th century

data, but if return migrants were negatively selected on English ability as they were on

income (Lubotsky, 2007), then return migration would not overturn the result that late

20th century immigrants arrived with higher levels of English fluency.19

The other main benefit from using panel or repeated cross sections is that one can estimate

how cohorts changed in their ability to speak English near arrival. In the appendix, we show

that subsequent arrival cohorts increased their levels of English fluency at arrival after the

1900-1904 cohort. In particular, the beginning of World War I led to a rapid increase

in arrivals’ English fluency, likely because few could cross the Atlantic due to restricted

shipping; also, the country of birth mix may have shifted toward countries with higher levels

of English proficiency at arrival.

Indeed, English fluency rates at arrival depended strongly on the immigrant’s origin.

In Figure 5 we show English fluency levels by language of origin, as proxied by mother’s

tongue. The figure is sorted by English fluency at arrival, where the leftmost ethnicities

arrived with the lowest levels of English skills. Southern and Eastern European ethnicities

19 This evidence is only suggestive because the variables do not match precisely. For example, instead of coding the post-1980 self-reported English proficiency of “not well” as “able to speak English,” one can recode it as “unable to speak English.” When one does this, the results that late 20th century immigrants have a slower rate of acquisition remain, but the results on starting levels differ depending on the matching of English variables. In fact, this way of merging variables suggests that the early 20th century immigrants assimilated much more quickly than recent immigrants in terms of English skills. However, based on the age at arrival/English fluency profiles across time, we do not believe this is the best way to match the data.

such as Poles, Romanians and Greeks all dominate the left-hand side, where 20% to 30%

of immigrants were able to speak English within one year of arrival. These fluency rates

for Eastern and Southern Europeans are often lower when compared with immigrants from

Northern and Western Europe; the Dutch, Norwegians, Germans and Danes all had higher

levels of fluency at arrival, from 50 to 70%. After fifteen plus years in the United States,

most ethnicities had over 90 percent of their group as able to speak English.

5 The English Premium

5.1 The English Premium in the Early 20th Century

Immigrants acquired basic English skills at relatively fast rates in the early 20th century;

this could reflect that English was highly valuable for improving outcomes. In this section,

we estimate the English premium between 1910 and 1930. Estimating the premium for

English has straightforward econometric issues: primarily, the ability to speak English could

be correlated with an unobserved omitted variable that could positively bias the estimate.

Instead, we leverage the panel features of the linked dataset to estimate the association

between upgrading one’s occupation and learning to speak English. Note that we only aim

to estimate the association while reducing the threat of unobservables; another method aimed

to provide exogenous variation in English ability will be discussed later.

We estimate the effect of English on occupation instead of wages because wages are

first available in the 1940 Census. To explore the effect on occupation, we first group

immigrants into six occupational categories: high-skilled white collar (e.g. managers and

doctors), medium-skilled white collar (e.g., salesmen and clerks), semi-skilled workers (e.g.

craftsmen), farmers, low-skilled service/manual workers (e.g. waiters and operatives), and

laborers.20 Later we will assign occupational scores for each occupation to provide an estimate

20 These are coded based on the first digit of the occ1950 variable from IPUMS. High-skilled white collar are occupations that start with 0 or 2, farmers start with 1, medium-skilled white collar start with 3 or 4, semi-skilled start with 5, low-skilled service and manual workers start with 6 or 7, and laborers start with 8

of the return to English for earnings, but here we provide descriptive evidence based

on occupational categories.
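The first-digit rule from footnote 20 can be sketched as a small lookup. The function name and the cutoff `max_occupational_code=970` are assumptions for illustration: the text only states that non-occupational responses are dropped, not which codes count as non-occupational.

```python
def occupation_group(occ1950, max_occupational_code=970):
    """Map a three-digit occ1950 code to one of six groups by its first digit.

    Returns None for codes treated as non-occupational (assumed cutoff).
    """
    if occ1950 < 0 or occ1950 > max_occupational_code:
        return None  # assumed to be a non-occupational response, so dropped
    first_digit = occ1950 // 100  # codes below 100 have a leading zero
    if first_digit in (0, 2):
        return "high-skilled white collar"
    if first_digit == 1:
        return "farmers"
    if first_digit in (3, 4):
        return "medium-skilled white collar"
    if first_digit == 5:
        return "semi-skilled"
    if first_digit in (6, 7):
        return "low-skilled service/manual"
    return "laborers"  # first digit 8 or 9


groups = [occupation_group(code) for code in (45, 100, 310, 520, 690, 970, 995)]
```

Each of the six regressions described next would then use a zero/one indicator for membership in one of these groups as its dependent variable.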

We estimate the rate at which one changes occupations in the following linear probability

model:

$$\text{OccupationGroup}_{it} = \gamma_0 + \gamma_1 \text{SpeaksEnglish}_{it} + \phi_i + \Pi' X_{it} + \varepsilon_{it} \qquad (2)$$

The dependent variable is a zero/one variable for whether one belongs to one of six occupational

groups. We run the regression six times, once for each of the groups, in order to

estimate how learning to speak English affects the net flow into or out of an occupational

group. After controlling for individual fixed effects ϕi, the coefficient γ1 will produce an

estimate of the effect of English while accounting for numerous unobservable factors that are

constant within an individual i, including unobserved ability. This essentially estimates the

extra movement into an occupational group (or for occupational score in a later regression)

for those who acquired English skills relative to those who knew how to speak English at

first observation and those who never learned how to speak English.21
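To make the identification concrete: with exactly two observations per person, an individual fixed effects estimator without additional controls coincides with OLS on first differences, where the intercept absorbs the common change between censuses. The sketch below implements that simplified case on hypothetical toy data; the function name is illustrative, and the controls in Xit and the clustering of standard errors are omitted.

```python
def first_difference_estimate(panel):
    """OLS slope of the change in outcome on the change in English ability.

    panel: list of (y_first, y_second, english_first, english_second) tuples,
    one per person. An intercept is included implicitly by demeaning, so the
    average change among non-switchers is netted out.
    """
    dy = [y2 - y1 for y1, y2, _, _ in panel]
    dx = [e2 - e1 for _, _, e1, e2 in panel]
    n = len(panel)
    mean_dy = sum(dy) / n
    mean_dx = sum(dx) / n
    cov = sum((x - mean_dx) * (y - mean_dy) for x, y in zip(dx, dy))
    var = sum((x - mean_dx) ** 2 for x in dx)
    return cov / var


# Hypothetical toy panel: the outcome is 1 if the person is a laborer.
panel = [
    (1, 0, 0, 1),  # learned English, left laborer work
    (1, 1, 0, 1),  # learned English, stayed a laborer
    (1, 1, 0, 0),  # never learned, stayed a laborer
    (1, 1, 0, 0),  # never learned, stayed a laborer
    (0, 0, 1, 1),  # always spoke English, never a laborer
    (0, 0, 1, 1),  # always spoke English, never a laborer
]
gamma_1 = first_difference_estimate(panel)
```

In the toy data half of the switchers leave laborer work while no one else changes group, so the estimate is the difference in mean changes between switchers and everyone else.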

For controls, we include the year of observation, which accounts for the average shift in

occupational group over time as identified by those who either do not acquire English skills

or those who had already acquired English skills. We further interact year with age at first

observation (grouped into five-year intervals) to allow for job switching to vary by points in

the life cycle. We also interact years in the United States at initial observation, grouped into

two-year intervals, with year for the same reason. Finally, we include controls for literacy,

logged population in a county and fraction of immigrants from the same birthplace in county

to account for changes in general human capital, size of network, and population density.

or 9. However, any non-occupational response is dropped from the dataset.

21 See Table A1 for descriptive statistics of the three groups: those who always knew how to speak English, those who never knew how to speak English, and switchers. Those who always knew how to speak English were more skilled at arrival than switchers, while those who never learned how to speak English were less skilled. The return to English skills is estimated to be low if one compares switchers to only those who never learned, or switchers to only those who always knew how to speak English.

We drop individuals for whom we do not observe jobs in both censuses; this mostly leads

to dropping children. In all regressions we calculate the standard errors by clustering on

country of birth.

The results are shown in Table 2. The table is split into two panels, one for the years

1910 to 1920 (that is, immigrants who arrived between 1900 and 1909), and one for the years 1920

to 1930, though results are similar whichever panel is used. The results show

that between censuses those who learned to speak English were much less likely to hold a

laborer job by the next census, and slightly more likely to hold slightly higher skilled jobs.

For example, learning to speak English was associated with a 7.6 percentage point drop in

being a laborer, which was about a fourth of the percentage of laborers at first observation.

Acquiring English skills most commonly led to more low-skilled service, semi-skilled

or medium-skilled white collar jobs, indicating a movement up the occupational ladder.

Note that learning to speak English did not lead to a large flow into professional jobs such as

a manager, doctor or lawyer. However, the base number of immigrants holding managerial

or professional occupations was small, so a lack of statistical significance for moving into this

highest skill group may reflect that it was a relatively rare outcome to begin with.

5.2 Sorting into Occupations by Task Intensity

Immigrants moved up in the occupational distribution slightly, but they may have moved

into jobs that required more communication. One way to measure this is to estimate whether

immigrants sorted into jobs that were more intensive in communication tasks rather than

other tasks, such as manual-based tasks (Peri and Sparber, 2009). We do this using the task

data from Gray (2013), who calculates the extent of communication tasks based on data

in the 1956 Report from the United States Employment Service. Communication tasks are

measured on a 1 to 6 scale, where jobs such as teamsters and laborers have a rating below 2,

and salespeople and managers have a rating greater than 4. Following the task literature,

we transform the communication variable into percentiles of the 1910 distribution, such that

being at the 10th percentile implies that the immigrant held a job that was more intensive

in communication tasks than 10 percent of those in 1910 (Autor et al., 2003).
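The percentile transformation can be sketched as follows. The helper name, the toy reference scores, and the choice to count strictly lower scores are assumptions for illustration; the text does not state how ties or employment weights are handled.

```python
from bisect import bisect_left


def percentile_in_reference(score, reference_scores_sorted):
    """Percent of the reference distribution with a strictly lower score."""
    below = bisect_left(reference_scores_sorted, score)
    return 100.0 * below / len(reference_scores_sorted)


# Hypothetical 1910 communication scores on the 1 to 6 scale, one per worker.
reference_1910 = sorted([1.5, 1.8, 2.0, 2.5, 3.0, 3.2, 3.9, 4.1, 4.4, 5.0])

low_percentile = percentile_in_reference(1.8, reference_1910)
high_percentile = percentile_in_reference(4.1, reference_1910)
```

Fixing the reference distribution at 1910 means that a rising percentile over time reflects movement into more communication-intensive jobs rather than a shift in the benchmark.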

To estimate the sorting of immigrants into either manual or communication tasks, we

simply re-estimate our English acquisition regression, but now with the communication percentile

as the dependent variable. The results are shown in Panel A of Figure 6. According

to our linked samples, permanent immigrants started at about the 48th percentile of the

1910 communication task distribution at arrival, suggesting that they were not far behind

natives in communication-based tasks and improved to the 57th percentile after two decades

of stay.

The raw percentiles may be misleading because they do not account for the age of the

immigrant nor the year of observation. To address these issues, we estimate the rate at which the

gap between immigrants and natives in communication-based task intensity closed by pooling

the panel of immigrants and repeated cross-section of natives from the 1910-1930 censuses.22

The estimated gap at arrival and the rate at which it closed are shown in Panel B for 1900-1904

arrivals, where immigrants held jobs about 14 percentiles lower in the communication task

distribution, and then closed this gap by only about one third after 20 years of stay.

This evidence shows that while immigrants did sort into jobs which required more communication,

they did not move up in the communication distribution at a fast rate. To the

extent that jobs with communication-based tasks were more highly rewarded in the early

20th century, we would expect immigrants to improve on initial earnings gaps with natives,

but not by much. In the next section, we estimate the actual association between acquisition

of English skills and improving occupational-based earnings.

22 When measuring the trend of immigrants into communication jobs in the regression, we additionally add age fixed effects to control for the life-cycle profile, and also estimate the rate of convergence between immigrants and natives using a quadratic in years since arrival.

5.3 The Occupational-Based Earnings Premium

The occupational categories and task data show that immigrants who acquired English skills

slightly moved up in the occupational distribution and into more communication-intensive

jobs; however, they do not give a simple estimate of the English premium. To estimate

this, one needs to assign each of the nearly 250 occupational codes an occupational score

since income and wages are not observed until the 1940 Census.23 Unfortunately, there is no

representative occupational score at this level of detail for each decade between 1910 and

1930; therefore, we resort to other occupational scores used in the literature: the score based

on the 1901 Cost of Living Survey (CLS), wage data from the 1940 Census, and wage and

business income data from the 1950 Census (occscore from IPUMS).24 Our preferred score is

from the 1940 Census, since it is based on the average wages by occupation and country of

birth, while the 1901 and 1950 scores reflect both immigrant and native earnings.25 However,

since our time period is between 1910 and 1930, we also report the 1901 CLS to reflect the

wider income distribution in the early 20th century. Finally, we also show results from the

1950 score since it is the one used by Abramitzky et al. (2014) to estimate the assimilation

profile in Figure 1.

The results from running Equation (2) with logged occupational score as the dependent

variable are shown in Table 3. The first column shows the estimate when applying

occupational scores based on the 1901 Cost of Living Survey. The estimated association

between acquiring English skills and occupational-based earnings is between 4.2 and 4.5%.

The second column uses the immigrant-specific occupational score from 1940 and finds that

acquiring English skills is associated with a 0.5% upgrade between 1910 and 1920, and a

23 This is based on the standardized occupational codes variable occ1950 in IPUMS.

24 For less than 2% of observations there is no occupation score in the 1901 Cost of Living Survey. For each missing occupation, we calculate its position in the 1950 occupational score distribution, which has scores for all occupations. We assume the missing occupation’s point in the 1901 distribution is the same as in the 1950 distribution, and then fill in its score based on its predicted wage. We use the farmer income value from Abramitzky et al. (2014).

25 The basic method to create this occupational score relies heavily on Collins and Wanamaker (2017), where we impute self-employed earnings for non-wage workers. See Appendix E for further detail.

2.4% upgrade between 1920 and 1930. The estimate from the immigrant-specific score is

very similar to the one when using the 1950 occupational score shown in the third column.

The finding of essentially no return to speaking English using the 1950 score is consistent

with the flat assimilation profile estimated by Abramitzky, Boustan and Eriksson (2014),

with little penalty for not speaking English at arrival.

As expected, the occupational scores reflecting a more compressed wage distribution

(1940 and 1950) imply a lower return to English skills, while the score reflective of a wide

wage distribution (1901) implies a stronger return to English skills. Therefore, it appears that

the return to English is correlated with the general return to human capital in the economy,

where wider wage distributions lead to higher returns to language capital (Goldin and

Katz, 2008). We prefer an estimate that is between the 1901 and 1940 score such that the

return to acquiring English skill is between zero and 4.5% in the early 20th century. However,

note that the variable for speaking English in these regressions is not exogenous, but may be

correlated with other factors, such as other types of United States-specific human capital,

that change over time. These other factors likely positively affect labor market

outcomes, suggesting that the individual fixed effect estimate is an upper bound of the true

return to English skills.

5.4 A Discussion of Age-At-Arrival Analysis

The estimate from the linked sample shows a relatively low return to English compared with

estimates of a wage return greater than 20 percent for recent decades (Chiswick and Miller,

2014). However, our estimate has a few limitations: mainly, learning to speak English is

not exogenous. Further, the linked data may have false positives, and measurement error in

the English variable may attenuate results (Dustmann and Van Soest, 2001).26 Here we

26 One way to check measurement error is to see how many people were recorded as able to speak English at first observation but then recorded as not able at second observation: this happened to 2.5% of those recorded as able to speak English at first observation. The results that English acquisition had a low return are unaffected if one drops those who “downgraded” English skills. Specifically, regressions that drop those who downgraded English skills and use the 1940 immigrant-specific score yield a 0.0% return between

briefly discuss an additional empirical strategy to estimate the English premium where one

could exploit the well-defined relationship between age at arrival and the ability to speak

English, as shown previously in Figure 3. Bleakley and Chin (2004) use this relationship

to instrument for the ability to speak English based on whether one arrived at an older or

younger age, and whether the immigrant was born in an English-speaking country. Further,

we can apply this strategy using cross-sectional data, avoiding any potential bias from the linked

data. While we briefly discuss the strategy here, we employ it fully in Appendix D.

Figure 7 shows the basic intuition behind the Bleakley and Chin (2004) strategy. Panel

B estimates the age-at-arrival English fluency and occupational profile in the early 21st

century and shows that non-English-speaking sources fall in English speaking ability after

the critical period of language acquisition ends. Non-English-speaking sources also have a

steeper occupational profile, which falls at the same arrival ages when English fluency levels

drop; this combination of profiles for English fluency and occupational score form the basis of

Bleakley and Chin’s (2004) argument that older arriving immigrants were strongly penalized

for a lack of English skills.

However, the age-at-arrival strategy applied to the early 20th century shows little penalty

for those unable to speak English. While the age-at-arrival and English fluency relationship

is largely the same as in the late 20th century, the effect of age-at-arrival on occupational

scores is similar for English-speaking and non-English-speaking sources.27 In other words, older

arrivals who had lower levels of English fluency did not also have substantially lower levels

of occupational score. This evidence is consistent with our argument that a lack of English

fluency did not strongly penalize workers in the early 20th century compared with workers

in the early 21st century; we discuss point estimates for this instrumental variables strategy

at further length in Appendix D.

1910-1920, and a 2.0% return between 1920-1930.

27 Alexander and Ward (2018) also estimate the age-at-arrival and wage profile for English-speaking and non-English-speaking sources using a sample of brothers linked from arrival records to the 1940 Census, and find the same result that there is no difference in the age-at-arrival profiles across sources.

6 The Increasing Return to English Fluency

6.1 A Consistent Method to Estimate the Changing Premium

The age-at-arrival analysis is consistent with an increasing return to English fluency over

the past 100 years; however, it is identified off of variation in outcomes for child arrivals (or

the 1.5 generation) and may not be fully applicable to the rest of the immigrant population.

Unfortunately we cannot compare our preferred estimate from the individual fixed effects

strategy to English premium estimates from recent decades since the early 21st century

censuses are not linked. Moreover, recent estimates of the English premium are on income

rather than occupational score.28

Therefore to create a consistent estimate of the association between English fluency and

economic outcomes, we estimate an OLS model for data in 1910 and data in 2010; note that

the OLS method is also used by Chiswick and Miller (2014) to show the association between

English fluency and economic outcomes.29 Further, we create an occupational score in the

same manner in 2010 as we did in our prior analysis, where the score in 2010 reflects average

income by country of birth and occupation, similar to the occupational score measure from

the 1940 Census. Therefore, we regress an immigrant’s log occupational score on the ability

to speak English, a measure of general human capital (literacy in the early 20th century

and having more than 8 years of education in the early 21st century), age, the fraction of

immigrants in the county, the population of the county and country of birth. This model is

clearly parsimonious but it is difficult to include other variables that are both in the 1910

and 2010 data.30
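For reference, the regression described in this paragraph can be written out roughly as below, estimated separately on the 1910 and 2010 samples. The notation is our own shorthand and an assumption rather than a specification reproduced from the tables: $k(i)$ denotes individual $i$'s county, $b(i)$ the country of birth, $f(\cdot)$ an unspecified function of age, and $\theta_{b(i)}$ country-of-birth effects. Whether county population enters in logs is not stated here.

$$\ln(\text{OccScore}_{i}) = \beta_0 + \beta_1 \text{SpeaksEnglish}_{i} + \beta_2 \text{HumanCapital}_{i} + f(\text{Age}_{i}) + \beta_3 \text{FracImmigrant}_{k(i)} + \beta_4 \text{Pop}_{k(i)} + \theta_{b(i)} + \varepsilon_{i}$$

Under this shorthand, $\beta_1$ is the coefficient compared across the two years.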

28 A further issue is that recent estimates often group English speakers based on whether they spoke English “very well”/“well” or “not well”/“not at all”, rather than our preferred grouping of speaking any English since the age-at-arrival and English profiles look similar (recall Figure 3).

29 We refer to the 2008-2012 ACS as the 2010 data for convenience. The sample for both datasets is of 25-60 year-old males from non-English-speaking sources.

30 We use eight years of education since it likely reflects the ability to read and write and about 80 percent of immigrants in 2010 held more than 8 years of education, similar to the percent of immigrants in 1910 who could read and write. If one uses a smaller level of schooling to reflect literacy, such as 4 years of education, the results are qualitatively the same.

The results from the regression in Table 4 show that an OLS estimate of the occupational-based

return to English fluency in 1910 is 6.3 percent, which is higher than the individual

fixed effects estimates of 0 to 2.4 percent when using the same occupational score. The

difference between the OLS and individual fixed effect estimate is unsurprising if one expects

English fluency to be positively biased from an omitted variable such as ability. Using a

closely-related regression with the 2010 data, the occupational-based return to English skills

is 16.9 log points or 18.4 percent. Based on this method, the English premium in 2010 is about three

times the estimated association in 1910.

One caveat to the occupational-based estimates is that they only capture part of the English

premium due to the limited nature of the occupational score. If one instead uses log income

rather than log occupational score in 2010, then the return increases from 16.9 to 35.8 log

points (or 43 percent). The difference between the occupational-based return and the income

return suggests that the occupational-based score captures about 40 percent of the benefit

to gaining English skills in 2010, where the rest of the benefit comes from increased income

within occupation. One should keep this in mind when interpreting the results from the

early 20th century when we only have occupational scores, although we cannot quantify how

much information we lose from not having income.
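The conversions between log points and percent used above follow from exponentiating the coefficient. A minimal sketch, with an illustrative function name:

```python
import math


def log_points_to_percent(log_points):
    """Convert a coefficient in log points (100 x log difference) to a percent change."""
    return 100.0 * (math.exp(log_points / 100.0) - 1.0)


occ_score_return = log_points_to_percent(16.9)  # occupational-based return, 2010
income_return = log_points_to_percent(35.8)     # income-based return, 2010
```

These reproduce the 18.4 percent and 43 percent figures quoted in the text, up to rounding. The gap between log points and percent grows with the size of the coefficient, which is why it matters for the 2010 estimates more than for the small early 20th century ones.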

Overall, the analysis shows that the premium to English skills in the early 20th century

was less than the English premium in recent decades. While we cannot compare our preferred

individual fixed effects estimate over time, the recreation of the methods used by others in

the literature based on OLS (Chiswick and Miller, 2014) and instrumental variables (Bleakley

and Chin, 2004) consistently point in the same direction of an increasing English premium

over the 20th century. Moreover, a low value of English skills in the early 20th century

and a high value of English skills in the early 21st century help to explain the changing

assimilation profiles shown in Figure 1.

6.2 Discussion of Changing Premium

Why did the return to English skills increase over the past 100 years? A common explanation

for a changing return to skill is technology-driven shifts in demand; therefore, it may

be that the premium for language skills was smaller due to a lower demand for English

skills. The early 20th century economy was more agricultural and industrial compared with

the service-dominated economy today, and still had a large focus on brawn rather than

brain. The shift away from agriculture over the 20th century coincided with increasing

urbanization rates; as population density increased, the tasks performed by workers also

changed toward more interaction (Boustan et al., 2013; Michaels et al., 2017). This is easily

seen when examining the structure of the labor force over the past century, in which the

fraction of white-collar jobs has tripled, agricultural jobs have been all but eliminated, and

the proportion of blue collar jobs has decreased (Katz and Margo, 2014). At the same time,

Michaels et al. (2017) show that the importance of interactive tasks for jobs has grown rapidly

between 1880 and 2000, especially in cities where immigrants tended to locate. Deming

(2017) further demonstrates that jobs with social skills have had an increasing premium

since 1980. These demand shifts favoring interaction, social skills, and general human capital

may have increased the demand for English skills, causing English fluency to be of primary

importance for immigrants to succeed in the United States in the 21st century (Goldin and

Katz, 2008).

Another possibility for a low return to English skills in the early 20th century is that

discrimination against immigrants was rampant, and therefore any type of human capital

received a small premium in the market. Indeed, discrimination appears to have been

widespread; for example, those who changed their name to be more “American” received

a premium in the labor market (Biavaschi et al., 2017), and brothers who had more American-

sounding names earned more than brothers who had more foreign-sounding names (Abramitzky

et al., 2017). Yet these two results suggest that there is a positive return to becoming more

“American”, and a similar positive return would likely hold for becoming more American by

learning to speak English. In Appendix Table A2 we show that the English premium is similar

for immigrants who experienced more discrimination (Southern and Eastern Europeans)

and immigrants who experienced less discrimination (Northern and Western Europeans),

suggesting that discrimination was not the main driver of the low English premium.

7 Concluding Remarks

Surprisingly little is known about the importance of English for immigrant outcomes one

hundred years ago. Using new linked data between 1910 and 1930, we show a few simple

relationships. First, many immigrants arrived without English skills, which contrasts with a

lack of occupational-based earnings deficit for immigrant arrivals; second, immigrants rapidly

acquired English skills in the years after arrival, which contrasts with a flat assimilation profile

where immigrants barely improved on their relative position with natives. This suggests

that English skills had little occupational value in the early 20th century relative to recent

decades, which we directly show using individual fixed effects and age-at-arrival analysis.

Therefore, our results help to explain the assimilation profile in the early 20th century and

why the assimilation profile has changed over time.

While we cannot definitively pinpoint the mechanisms for why the value of English skills

was lower in the early 20th century, we argue that it likely reflects the structure of the economy,

which was primarily agricultural and manufacturing. In this setting, interaction and

social skills were relatively unimportant compared with today’s service-dominated economy.

Therefore, technological change can influence immigrants’ relative position with natives, especially

if it influences the relative return to communication and language skills. Further,

these results on a low value of English skills help to understand why the Americanization

movement, which aimed to increase immigrants’ English fluency levels, did little to improve

the foreign-born’s adult economic outcomes (Lleras-Muney and Shertzer, 2015).

While we stress the importance of English fluency for understanding the variation in

immigrant assimilation profiles over time, it is not the only determinant of the profile. In

particular, immigrants’ earnings relative to natives’ also depend on their pre-immigration

human capital; indeed, this point has been stressed by Borjas (1985, 1995, 2015) as immigrant

sources have shifted to poorer countries following the Immigration and Nationality

Act of 1965. Therefore, another reason for the difference in assimilation profiles across time

may be that immigrants in the past had pre-immigration human capital levels similar to

natives, compared with today’s difference in human capital between natives and immigrants

(Abramitzky and Boustan, 2017).

If current trends continue, then the English premium may increase even further in future

decades. If technological shifts continue to favor those with skill, especially social skills

(Goldin and Katz, 2008; Deming, 2017), and if immigrants do not have higher rates of

investment in English skills either pre-arrival or post-arrival (Borjas, 2015), then the premium

for English skills will increase. If so, then immigrants’ economic position relative to natives

will gradually worsen over time.


References

Abramitzky, Ran and Leah Platt Boustan, “Immigration in American History,” Journal of Economic Literature, 2017.

Abramitzky, Ran, Leah Platt Boustan, and Katherine Eriksson, “Have the poor always been less likely to migrate? Evidence from inheritance practices during the Age of Mass Migration,” Journal of Development Economics, 2013, 102, 2–14.

Abramitzky, Ran, Leah Platt Boustan, and Katherine Eriksson, “A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration,” Journal of Political Economy, 2014, 122 (3), 467–506.

Abramitzky, Ran, Leah Platt Boustan, and Katherine Eriksson, “Cultural Assimilation during the Age of Mass Migration,” Working Paper 22381, National Bureau of Economic Research July 2017.

Acemoglu, Daron and David Autor, “Skills, tasks and technologies: Implications for employment and earnings,” in “Handbook of labor economics,” Vol. 4, Elsevier, 2011, pp. 1043–1171.

Ager, Philipp and Casper Worm Hansen, “Closing Heaven’s Door: Evidence from the 1920s US Immigration Quota Acts,” 2017.

Alexander, Rohan and Zachary Ward, “Age at Arrival and Assimilation during the Age of Mass Migration,” Journal of Economic History, 2018.

Autor, David H, Frank Levy, and Richard J Murnane, “The skill content of recent technological change: An empirical exploration,” The Quarterly journal of economics, 2003, 118 (4), 1279–1333.

Bailey, Martha, Connor Cole, Morgan Henderson, and Catherine Massey, “How Well Do Automated Linking Methods Perform in Historical Samples? Evidence from New Ground Truth,” Technical Report, Working Paper 2017.

Biavaschi, Costanza, Corrado Giulietti, and Zahra Siddique, “The Economic Payoff of Name Americanization,” Journal of Labor Economics, 2017, 35 (4), 1089–1116.

Blau, Francine D, “Immigration and labor earnings in early twentieth century America,” 1980.

Bleakley, Hoyt and Aimee Chin, “Language Skills and Earnings: Evidence from Childhood Immigrants,” Review of Economics and Statistics, 2004, 86 (2), 481–496.

Bleakley, Hoyt and Aimee Chin, “Age at Arrival, English Proficiency, and Social Assimilation Among US Immigrants,” American Economic Journal: Applied Economics, 2010, pp. 165–192.

Bloch, Louis, “The Ability of European Immigrants to Speak English,” Quarterly publications of the American Statistical Association, 1920, 17 (132), 402–416.

Borjas, George J, “Assimilation, Changes in Cohort Quality, and the Earnings of Immigrants,” Journal of Labor Economics, 1985, 3 (4), 463–489.

Borjas, George J, “Assimilation in Cohort Quality Revisited: What Happened to Immigrant Earnings in the 1980s?,” Journal of Labor Economics, 1995, 13 (2), 211–245.

Borjas, George J, “The Slowdown in the Economic Assimilation of Immigrants: Aging and Cohort Effects Revisited Again,” Journal of Human Capital, 2015, 9 (4), 483–517.

Boustan, Leah Platt, Devin Bunten, and Owen Hearey, “Urbanization in the United States, 1800-2000,” Technical Report, National Bureau of Economic Research 2013.

Catron, Peter, “The Citizenship Advantage: Immigrant Socioeconomic Attainment across Generations in the Age of Mass Migration,” 2017.

Chiswick, Barry R, “The Effect of Americanization on the Earnings of Foreign-born Men,” Journal of Political Economy, 1978, 86 (5), 897–921.

Chiswick, Barry R and Paul W Miller, “Do enclaves matter in immigrant adjustment?,” City & Community, 2005, 4 (1), 5–35.

Chiswick, Barry R and Paul W Miller, “International migration and the economics of language,” Handbook of the Economics of International Migration, 1A: The Immigrants, 2014, 1, 211.

Deming, David J, “The growing importance of social skills in the labor market,” The Quarterly Journal of Economics, 2017.

Dustmann, Christian and Arthur Van Soest, “Language fluency and earnings: Estimation with misclassified language indicators,” Review of Economics and Statistics, 2001, 83 (4), 663–674.

Dustmann, Christian and Joseph-Simon Gorlach, “Selective out-migration and the estimation of immigrants earnings profiles,” in “Handbook of the Economics of International Migration,” Vol. 1, Elsevier, 2015, pp. 489–533.

Feigenbaum, James J, “A Machine Learning Approach to Census Record Linking,” 2016.

Fouka, Vasiliki, “Backlash: The unintended effects of language prohibition in US schools after World War I,” Stanford Center for International Development Working Paper, 2016, 591.

Goldin, Claudia and Lawrence F Katz, The race between education and technology, Harvard University Press, 2008.

Gray, Rowena, “Taking technology to task: The skill content of technological change in early twentieth century united states,” Explorations in Economic History, 2013, 50 (3), 351–367.

Guven, C and A Islam, “Age at migration, language proficiency, and socioeconomic outcomes: evidence from australia,” Demography, 2015, 52 (2), 513.

Hatton, Timothy J, “The Immigrant Assimilation Puzzle in Late Nineteenth-Century America,” The journal of economic history, 1997, 57 (01), 34–62.

Hatton, Timothy J and Jeffrey G Williamson, “The age of mass migration: Causes and economic impact,” OUP Catalogue, 1998.

Inwood, Kris, Chris Minns, and Fraser Summerfield, “Reverse assimilation? Immigrants in the Canadian labour market during the Great Depression,” European Review of Economic History, 2016, 20 (3), 299–321.

Jasso, Guillermina and Mark R Rosenzweig, “Language Skill Acquisition, Labor Markets and Locational Choice: The Foreign-Born in the United States, 1900 and 1980,” in “Migration and Labor Market Adjustment,” Springer, 1989, pp. 217–239.

Jasso, Guillermina and Mark R Rosenzweig, The new chosen people: Immigrants in the United States, Russell Sage Foundation, 1990.

Jenks, Jeremiah Whipple and William Jett Lauck, The immigration problem, Funk & Wagnalls Company, 1926.

Katz, Lawrence F and Robert A Margo, “Technical change and the relative demand for skilled labor: The united states in historical perspective,” Technical Report, National Bureau of Economic Research 2014.

Krashen, Stephen D, Michael A Long, and Robin C Scarcella, “Age, rate and eventual attainment in second language acquisition,” Tesol Quarterly, 1979, pp. 573–582.

Kuziemko, Ilyana and Joseph Ferrie, “The Role of Immigrant Children in Their Parents’ Assimilation in the United States, 1850–2010,” in “Human Capital in History: The American Record,” University of Chicago Press, 2014, pp. 97–120.

Lafortune, Jeanne, Jose Tessada, and Ethan Lewis, “People and Machines: A Look at the Evolving Relationship Between Capital and Skill In Manufacturing 1860-1930 Using Immigration Shocks,” Working Paper 21435, National Bureau of Economic Research July 2016.

LaLonde, Robert J and Robert H Topel, “The assimilation of immigrants in the US labor market,” in “Immigration and the workforce: Economic consequences for the United States and source areas,” University of Chicago Press, 1992, pp. 67–92.

Lleras-Muney, Adriana and Allison Shertzer, “Did the Americanization Movement Succeed? An Evaluation of the Effect of English-Only and Compulsory Schooling Laws on Immigrants,” American Economic Journal: Economic Policy, 2015, 7 (3), 258–90.

Lubotsky, Darren, “Chutes or ladders? A longitudinal analysis of immigrant earnings,” Journal of Political Economy, 2007, 115 (5), 820–867.

Lubotsky, Darren, “The effect of changes in the US wage structure on recent immigrants’ earnings,” The Review of Economics and Statistics, 2011, 93 (1), 59–71.

Massey, Catherine G, “Playing with matches: An assessment of accuracy in linked historical data,” Historical Methods: A Journal of Quantitative and Interdisciplinary History, 2017, pp. 1–15.

Michaels, Guy, Ferdinand Rauch, and Stephen J Redding, “Task specialization in US cities from 1880-2000,” Technical Report, National Bureau of Economic Research, accessed “http://personal.lse.ac.uk/michaels/” 2017.

Peri, Giovanni and Chad Sparber, “Task specialization, immigration, and wages,” American Economic Journal: Applied Economics, 2009, 1 (3), 135–69.

Perlmann, Joel, Italians Then, Mexicans Now: Immigrant Origins and the Second-Generation Progress, 1890-2000, Russell Sage Foundation, 2005.

Sequeira, Sandra, Nathan Nunn, and Nancy Qian, “Migrants and the Making of America: The Short- and Long-Run Effects of Immigration during the Age of Mass Migration,” Technical Report, National Bureau of Economic Research 2017.

Singleton, David, “Age and second language acquisition,” Annual review of applied linguistics, 2001, 21, 77–89.

Spitzer, Yannay and Ariell Zimran, “Migrant Self-Selection: Anthropometric Evidence from the Mass Migration of Italians to the United States, 1907–1925,” 2017.

Stevens, Gillian, “A century of US censuses and the language characteristics of immigrants,” Demography, 1999, 36 (3), 387–397.

Tabellini, Marco, “Gifts of the Immigrants, Woes of the Natives: Lessons from the Age of Mass Migration,” 2017.

Vigdor, Jacob L, From immigrants to Americans: The rise and fall of fitting in, Rowman & Littlefield, 2010.

Ward, Zachary, “Birds of Passage: Return Migration, Self-Selection and Immigration Quotas,” Explorations in Economic History, 2017.


Table 1: Representativeness of the Linked Samples

                                1900-1909 Cohort in 1920                1910-1919 Cohort in 1930
                                Cross       Panel: Difference from Cross Cross       Panel: Difference from Cross
                                            Unweighted    Weighted                  Unweighted    Weighted

Speak English                   0.895       0.0337***     -2.55e-09     0.955       0.0216***     -6.05e-05
                                (0.306)     (0.00254)     (0.00273)     (0.207)     (0.000976)    (0.00124)
Literate                        0.852       0.0582***     -3.50e-09     0.902       0.0372***     -0.000119
                                (0.355)     (0.00294)     (0.00318)     (0.297)     (0.00144)     (0.00171)
South or East Europe            0.748       -0.255***     3.96e-09      0.807       -0.184***     0.000552
                                (0.434)     (0.00377)     (0.00363)     (0.395)     (0.00221)     (0.00192)
Age                             35.44       0.308***      0.290***      36.92       -0.894***     -0.987***
                                (7.095)     (0.0608)      (0.0624)      (6.862)     (0.0371)      (0.0384)
Professional                    0.119       0.0125***     7.53e-10      0.124       0.00272       -0.000115
                                (0.324)     (0.00277)     (0.00278)     (0.330)     (0.00171)     (0.00174)
Sales/Clerical                  0.0507      0.0130***     1.54e-09      0.0642      0.00642***    -0.000116
                                (0.219)     (0.00189)     (0.00186)     (0.245)     (0.00128)     (0.00127)
Semi-Skilled                    0.227       0.0111***     -4.82e-09     0.221       0.0206***     -0.000138
                                (0.419)     (0.00356)     (0.00362)     (0.415)     (0.00216)     (0.00218)
Unskilled Service/Operative     0.294       -0.0565***    -3.81e-09     0.298       -0.0462***    5.02e-05
                                (0.456)     (0.00383)     (0.00399)     (0.457)     (0.00232)     (0.00247)
Farmer                          0.0531      0.0512***     1.46e-09      0.0375      0.0367***     4.76e-05
                                (0.224)     (0.00202)     (0.00186)     (0.190)     (0.00112)     (0.000916)
Laborer                         0.229       -0.0511***    4.49e-09      0.232       -0.0421***    0.000248
                                (0.420)     (0.00352)     (0.00370)     (0.422)     (0.00213)     (0.00230)
Foreignness Index, First name   0.663       -3.74e-06     0.0172***     0.671       -0.00812***   0.000454
                                (0.190)     (0.00163)     (0.00167)     (0.188)     (0.000987)    (0.00101)
Foreignness Index, Last name    0.730       -0.0102***    0.0187***     0.746       -0.0324***    -0.0157***
                                (0.203)     (0.00173)     (0.00175)     (0.188)     (0.000962)    (0.000970)

Observations                    16,258      Panel: 96,400               57,482      Panel: 108,590

Notes: Data is from linked samples between 1910 and 1920, and 1920 and 1930; cross-sectional data is from IPUMS

1% samples in 1920 and a 5% sample in 1930 (Ruggles et al., 2015). This table shows whether the linked (panel)

samples are representative with respect to the random cross-sectional samples from IPUMS. Weights are applied to

match English proficiency, literacy, being from Southern or Eastern Europe, and occupational categories according to

the 1920 and 1930 IPUMS samples. *p<0.10, **p<0.05, ***p<0.01


Table 2: Acquiring English Skills and Occupational Categories, Individual Fixed Effects

                            I             II            III           IV            V             VI
                            Professional/ Sales/        Semi-         Unskilled     Farmer        Laborer
                            Manager       Clerical      Skilled       Service/Oper.

Panel B: 1910 to 1920 Census
Speak English               0.00550       0.0202***     0.0255***     0.0186        0.00463       -0.0745***
                            (0.00489)     (0.00446)     (0.00110)     (0.0121)      (0.00434)     (0.0100)

Mean of Dep. Var. in 1910   0.0612        0.0683        0.22          0.265         0.0544        0.331

Individual FE               Yes           Yes           Yes           Yes           Yes           Yes
Number of ind               77,448        77,448        77,448        77,448        77,448        77,448

Panel C: 1920 to 1930 Census
Speak English               0.00739       0.00812***    0.0269***     0.0332***     0.00312       -0.0787***
                            (0.00513)     (0.00234)     (0.00634)     (0.00673)     (0.00356)     (0.00612)

Mean of Dep. Var. in 1920   0.0748        0.0525        0.245         0.3           0.0565        0.271

Individual FE               Yes           Yes           Yes           Yes           Yes           Yes
Number of ind               84,595        84,595        84,595        84,595        84,595        84,595

Notes: Data is from linked samples between 1910 to 1920 in Panel B, and 1920 to 1930 in Panel C. Each cell reports results

from a separate regression of the occupational category on the ability to speak English, individual fixed effects and controls

described in text such as literacy and fraction of foreign born in the country. Standard errors are clustered by country of birth.

*p<0.10, **p<0.05, ***p<0.01


Table 3: Speaking English and Occupational Score, Individual Fixed Effects

                Occupational Score based on:
                1901 CLS      1940 Census   1950 Census

Panel A: 1910 to 1920 Census
Speak English   0.0420***     0.00535       0.00636
                (0.00394)     (0.00498)     (0.00509)

Individual FE   Yes           Yes           Yes
Number of ind   77,448        77,448        77,448

Panel B: 1920 to 1930 Census
Speak English   0.0453***     0.0242***     0.0266***
                (0.00399)     (0.00659)     (0.00487)

Individual FE   Yes           Yes           Yes
Number of ind   84,595        84,595        84,595

Notes: Data is from linked samples between 1910 to 1920 in Panel

A, and 1920 to 1930 in Panel B. Each cell reports results from

a separate regression of log occupational score on the ability to

speak English, individual fixed effects and controls described in

text such as literacy and fraction of foreign born in the country.

The 1901 CLS uses income scores from the 1901 Cost of Living

Survey. The 1940 Census is based on income from the 1940 Cen-

sus and is country of birth-specific; see Appendix E for further de-

tail. The 1950 census is the occscore variable from IPUMS. Stan-

dard errors are clustered by country of birth. *p<0.10, **p<0.05,

***p<0.01


Table 4: Association between Speaking English and Outcomes, 1910 and 2010

                                    1910 Census       2008-2012 ACS     2008-2012 ACS
                                    Log (Occ. Score)  Log (Occ Score)   Log (Income)

Speak English                       0.0633***         0.169***          0.358***
                                    (0.00356)         (0.00277)         (0.00580)
Literacy                            0.0571***
                                    (0.00395)
More than 8 years of education                        0.176***          0.277***
                                                      (0.00208)         (0.00411)
Fraction of own Migrants in County  0.0225**          0.0167***         -0.169***
                                    (0.0111)          (0.00598)         (0.0116)
Log (County Pop)                    0.00741***        0.00492***        0.00596***
                                    (0.000844)        (0.000618)        (0.00115)

Country of Birth FE                 Y                 Y                 Y
Age FE                              Y                 Y                 Y
Observations                        37,289            427,227           427,227
R-squared                           0.284             0.305             0.164

Notes: Data is from the 1910 Census and the 2008-2012 ACS. In the 2008-2012 ACS, speaking English is

coded as 1 if an immigrant is able to speak any English, whether not well or very well. Both samples are

of male immigrants from non-English speaking countries aged 25 to 60. We use the immigrant-specific

occupational score from 1940 in the 1910 census. The occupational score in 2010 is based on the mean

total income by occupation and country of birth in the 2008-2012 ACS, which is the same method of

calculating occupational score. Standard errors are clustered by country of birth.


Figure 1: Assimilation Profiles Across Time for Permanent Immigrants

Notes: The typical assimilation profile in the early 20th century is found by Abramitzky, Boustan and Eriksson (2014); late 20th century by Lubotsky (2007). The findings only represent the assimilation of permanent immigrants who stay throughout a panel.

Figure 2: Fraction of Immigrant Stock Born in an English-Speaking Country

Notes: Data is from 1850-2014 IPUMS. The graph separates countries by whether English is an official language or dominantly spoken; for example, India and the Philippines have English as an official language, but it is not predominantly spoken by the populace. See Bleakley and Chin (2010) for a further discussion.

Figure 3: Age at Arrival and English Proficiency Profile, Early 20th and 21st Century

Notes: Data is from 1900-1930, 2000 Censuses and 2008-2012 ACS. The figure plots age-at-arrival fixed effects from a regression of ability to speak English on age at arrival, age, year, cohort of arrival, country of birth, sex, and fraction of immigrants from same birthplace in county.

Figure 4: Speed of Language Acquisition Across the 20th century

Notes: Data is from linked panel data 1910-1920 and 1920-1930; the 1910-1930, 2000 IPUMS random samples and 2008-2012 ACS. The figure shows the mean ability to speak English in the years after arrival. RCS stands for repeated cross section.

Figure 5: Speed of Language Acquisition by Ethnicity

Notes: Data is from linked samples between 1910-1920 and 1920-1930. The figure shows the mean ability to speak English by ethnicity, as proxied by mother’s tongue.

Figure 6: Immigrants moved into Jobs with more Communication Tasks

Notes: Data is from 1910-1920 and 1920-1930 linked samples, and the 1910-1930 IPUMS random samples. Communication tasks are rated on a 1 to 6 scale and then transformed into percentiles based on the 1910 Census. The top figure estimates the rate of moving up the communication distribution for immigrants. The bottom figure estimates the gap between immigrants and natives after accounting for life-cycle effects and period effects.

Figure 7: Age-at-Arrival, English Fluency and Occupation in Early 20th and 21st Century

Notes: The figure shows the residuals of the ability to speak English and the log occupational score after removing the effects of age, sex and country of birth. Panel A uses the 1900 to 1930 Censuses and Panel B uses the 2000 and 2008-2012 ACS. See Appendix D for a fuller exploration of English and age-at-arrival effects.

Online appendix, not meant for publication


Table A1: Descriptives of Groups of Always Speak English, Switchers, and Never learners at first observation

                              Always Speak English  Switchers             Never Speak English
Literate                      0.913                 0.716                 0.602
                              (0.283)               (0.451)               (0.489)
Age                           27.15                 27.95                 30.20
                              (6.155)               (6.437)               (6.394)
South or East Europe          0.724                 0.869                 0.932
                              (0.447)               (0.337)               (0.251)
Log (Occ. Score), 1940        6.791                 6.695                 6.672
                              (0.348)               (0.274)               (0.239)
Professional                  0.0868                0.0383                0.0294
                              (0.282)               (0.192)               (0.169)
Sales/Clerical                0.0730                0.0287                0.0175
                              (0.260)               (0.167)               (0.131)
Semi-skilled                  0.247                 0.170                 0.129
                              (0.431)               (0.375)               (0.335)
Unskilled Service/Operative   0.301                 0.298                 0.261
                              (0.459)               (0.458)               (0.439)
Farmer                        0.0439                0.0297                0.0316
                              (0.205)               (0.170)               (0.175)
Laborer                       0.249                 0.435                 0.532
                              (0.432)               (0.496)               (0.499)

Observations                  127,421               30,991                3,631

Notes: Data is from linked data from 1910-1920, and 1920-1930.


Table A2: Speaking English and Occupational Upgrading, Alternative Samples

                I             II            III           IV            V
Sample:         Base          Old           New           Child         Adult
                              Sources       Sources       Arrivals      Arrivals

Panel A: 1910 to 1920 Census
Speak English   0.00535       0.00512       0.0140***     0.0199        0.00355
                (0.00498)     (0.0138)      (0.00284)     (0.0209)      (0.00330)

Individual FE   Yes           Yes           Yes           Yes           Yes
Number of ind   77,448        39,469        38,858        14,770        63,557

Panel B: 1920 to 1930 Census
Speak English   0.0242***     0.0369**      0.0264***     0.0376*       0.0220***
                (0.00659)     (0.0131)      (0.00630)     (0.0204)      (0.00501)

Individual FE   Yes           Yes           Yes           Yes           Yes
Number of ind   84,595        32,712        51,883        19,025        65,570

Notes: Data is from linked samples between 1910 to 1920 in Panel A, and 1920 to 1930

in Panel B. Each cell reports results from a separate regression of log occupational

score (based on the 1940 Census) on the ability to speak English with individual fixed

effects and controls described in text. Each column limits the sample to a different

subsample. Old sources are from Northern and Western Europe; new sources are

from Southern and Eastern Europe. Child arrivals are those who arrived under the

age of 16. Standard Errors are clustered by country of birth. *p<0.10, **p<0.05,

***p<0.01


Figure A1: Able to Speak English, 1900 to 2010

Notes: Data is from IPUMS (1900-1930; 1980-2010).


Figure A2: Rate of English Acquisition Raw Data, 1900-1919 Cohorts

Notes: Data is from linked data from 1910-1920, and 1920-1930.


Figure A3: Cohort Effects

Notes: Data is from IPUMS (1910-1930) and linked samples (1910-1920; 1920-1930). The figure shows the mean ability to speak English for arrivals by arrival cohort. RCS stands for repeated cross section.

Figure A4: Cohort effects when accounting for birth place and age at arrival

Notes: Data is from IPUMS (1910-1930) and linked samples (1910-1920; 1920-1930). The figure shows the mean ability to speak English for arrivals by arrival cohort. RCS stands for repeated cross section.

B Linking Process

B.1 A machine-learning approach to linking immigrants

In this section, we provide further detail on how we build a longitudinal dataset which

links immigrants from Census to Census. We follow the approach discussed at length by

Feigenbaum (2016) where we hand-link a set of immigrants to ensure a set of high-quality

links, train an algorithm to find a best link based on links in our hand-linked dataset, and

then apply the algorithm to the overall set of potential links for the rest of the census to

pick the best link. We pursue this method over other linking strategies, such as the linking

algorithm used by Abramitzky, Boustan and Eriksson (2014), in order to reduce biases

associated with false positives, as discussed by Bailey et al. (2017). We discuss our method

in detail below, but the reader should reference Feigenbaum (2016) since much of the method

is based on his discussion.

Before drawing random samples to build training data, we set the sampling frame and

pre-process the data in the following manner. First, we take the 1920 Census and keep

male European immigrants who arrived between 1900 and 1919 and are between 10 and 40

years old. We create a new variable which is the Americanization of all first names based

on information from behindthename.com31; for example, the Americanization of Giuseppe

is Joseph. We do this because individuals may have Americanized their names between

censuses, as shown by Biavaschi et al. (2017); yet note that Americanization may be more

likely to occur immediately after arrival (and thus before first observation at the Census)

than between censuses (Carneiro et al., 2015) and thus may not be as strong a source of bias. For

individuals without an Americanized first name, we keep their original first name string.

Next, we drop any individual who shares with another individual the same values of: Americanized first

name, last name string, country of birth, year of birth, year of immigration and mother’s

tongue. These are the variables which we will link on between 1920 and 1930 to determine

the best match; therefore we drop anyone with the exact same set of linking variables because

we cannot distinguish between them and other potential links.
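
The two pre-processing steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the lookup table, the field names and the helper functions are hypothetical, and only the Giuseppe-to-Joseph example comes from the text.

```python
from collections import Counter

# Hypothetical lookup; the paper builds its mapping from behindthename.com.
AMERICANIZED = {"GIUSEPPE": "JOSEPH"}

# Linking variables named in the text; the field names are assumptions.
LINK_KEYS = ("first_am", "last", "birthplace", "birth_year",
             "arrival_year", "mother_tongue")


def americanize(first_name: str) -> str:
    """Return the Americanized name, or the original string if none is known."""
    name = first_name.upper()
    return AMERICANIZED.get(name, name)


def preprocess(records: list[dict]) -> list[dict]:
    """Add an Americanized first name, then drop indistinguishable records."""
    prepared = [{**r, "first_am": americanize(r["first"])} for r in records]
    counts = Counter(tuple(r[k] for k in LINK_KEYS) for r in prepared)
    # Keep only records whose full set of linking variables is unique.
    return [r for r in prepared
            if counts[tuple(r[k] for k in LINK_KEYS)] == 1]
```

Note that in this sketch a Giuseppe and a Joseph who agree on every other linking variable are both dropped, because after Americanization they can no longer be told apart.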

After pre-processing the data in this way, we draw 15 different random samples of 2,000

immigrants each. The samples are drawn separately by mother’s tongue, the variable that best

reflects the ethnicity of each immigrant. The mother’s tongues are German, Yiddish/Jewish,

Dutch, Swedish, Danish, Norwegian, Italian, French, Romanian, Greek, Russian/Ukrainian,

Slovak/Slovene, Polish, Finnish, and Magyar/Hungarian.32 We sample by

mother’s tongue since the method to pick the best link may vary based on the ethnicity of an

immigrant; for example, the way to pick the best match among immigrants from

German-language sources may differ from the way to pick the best match among immigrants from

Italian-language sources. We draw 2,000 since Feigenbaum (2016) suggests that training

datasets converge quickly after a sample of 500; we increase to 2,000 in case linking the

foreign-born requires more training data than Feigenbaum’s sample of Iowans.

31 See Appendix B and C in Alexander and Ward (2018) for further detail.

32 These are based on the mtongue variable in IPUMS. Slovak/Slovene includes Czech, Slovak, Serbo-Croatian, Yugoslavian, Slovene and Lithuanian.

After drawing the random samples of 2,000 for a total of 30,000 immigrants, we draw

a set of potential matches for them from the 1930 Census, from which we hand-pick the

best link. The set of potential matches is restricted in the following ways. First, the

difference in year of birth must be at most three years, which is the same restriction used

by Feigenbaum (2016). Second, the difference in year of arrival must be at most seven years;

we discuss linking on year of arrival, a variable which Abramitzky et al. (2014) do not link on,

in the next section. Third, the Jaro-Winkler score in Americanized first name must

be greater than or equal to 0.75; this allows for differences in spelling across censuses.

Fourth, the Jaro-Winkler score in last name must be greater than or equal to 0.80; we allow

for a slightly greater deviation in first name because the first name has been Americanized. Fifth, one

of the following conditions must hold: either the first letter of the Americanized first name

must match, or the first letter of the last name must match, or the soundex version of both

first and last name must match. Sixth, for immigrants with more than 25 potential matches,

we keep the best 25 candidates based on the linking score provided by Feigenbaum (2016).

Seventh, the country of birth and mother’s tongue must match exactly.
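
A minimal sketch of these candidate restrictions is given below. It is an illustration rather than the code used for the paper: the record field names are hypothetical, the Jaro-Winkler and Soundex routines are plain textbook implementations, and the sixth restriction (keeping the best 25 candidates by linking score) is omitted because the scoring function is not reproduced here.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity between two strings (1.0 means identical)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    a_seq = [ch for ch, hit in zip(a, a_hit) if hit]
    b_seq = [ch for ch, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, scale: float = 0.1) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to four letters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def soundex(name: str) -> str:
    """American Soundex code of a non-empty name, e.g. ROBERT -> R163."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (out + "000")[:4]


def is_candidate(p1920: dict, p1930: dict) -> bool:
    """Apply restrictions 1-5 and 7 to a pair of (hypothetical) records."""
    f1, f2 = p1920["first_am"].upper(), p1930["first_am"].upper()
    l1, l2 = p1920["last"].upper(), p1930["last"].upper()
    if abs(p1920["birth_year"] - p1930["birth_year"]) > 3:
        return False
    if abs(p1920["arrival_year"] - p1930["arrival_year"]) > 7:
        return False
    if jaro_winkler(f1, f2) < 0.75 or jaro_winkler(l1, l2) < 0.80:
        return False
    if (p1920["birthplace"], p1920["mother_tongue"]) != (
            p1930["birthplace"], p1930["mother_tongue"]):
        return False
    return (f1[0] == f2[0] or l1[0] == l2[0]
            or (soundex(f1) == soundex(f2) and soundex(l1) == soundex(l2)))
```

Under these rules, a pair such as Nels Nelson (born 1890, arrived 1915) and Neils Nillson (born 1887, arrived 1920) passes as a potential match, provided country of birth and mother’s tongue agree.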

The 15 random samples of 2,000 immigrants yield 228,391 potential matches. Note

that out of the 30,000 individuals we sampled, we found a potential match for only 21,913

individuals, which may be due to return migration, under-enumeration, death, spelling errors

by enumerators, Americanization between censuses, or mistyping by those digitizing the

dataset. Therefore, the best possible linking rate we can achieve with our training dataset

is 73 percent; however, we will match far fewer than this because of common names or being

unable to find a good match. For individuals for whom we find a potential match, we pick the

best link of the (on average) 10.3 potential links.

We choose the best potential match for each of the 21,913 individuals in our dataset.

Many individuals do not have any potential match that is the true person, which we code

as zero. We also code as zero those with multiple good potential matches that are very close

in name, year of arrival and year of birth, since we cannot determine the best match.

After linking, we are left with a linked dataset of 7,548 individuals, or 34.4 percent of those

with potential matches, and 25.2 percent of our original random sample. Linking rates vary

by mother’s tongue as shown in Table B1, where we are most likely to link Dutch immigrants

(35.4 percent of starting sample), and least likely to link someone who is Romanian (12.7

percent of the starting sample).

With these sets of potential links, we train multiple models to choose the best match

among the potential links; the aim is to apply the coefficients from these models to

the full linking between censuses. We model each potential match based on our experience

linking individuals; the link is a function of first name match, last name match, differences

in year of birth and year of arrival, the number of other potential matches that have the same

exact last name or exact NYSIIS code, and the length of the last name. The models for each

language of origin are shown in Tables B2-B5. The coefficients estimated from the

probit models can then be used to gauge the best possible match when linking

the entire censuses.

Before linking the full-count censuses, we must set two meta-parameters for each ethnicity

to determine which of the potential links are to be included in the final linked dataset. First,

we must determine the threshold for the likelihood that a potential link is actually a true

link. For example, Nels Nelson who was born in 1890 and arrived in 1915 may match with

Neils Nillson who was born in 1887 and arrived in 1920, but should we keep this link or

not? We term this meta-parameter b1 such that we keep any individual with a predicted

probability that is greater than or equal to b1. Note that b1 can range between 0 and 1.

The second meta-parameter we must determine is the threshold for similarity between the

best potential match and second-best potential match. For example, we may find multiple

Nels Nelsons that match between 1920 and 1930 that deviate slightly on year of birth and

year of arrival; should we keep one of the two close links in our dataset or not? We term

this meta-parameter b2 such that we keep any individual that has a predicted probability

that is b2 times greater than the predicted probability of the second-best match. Note that

b2 can range from 1 to infinity.
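
The two thresholds can be read as a simple decision rule, sketched below. The function name and the input format (a mapping from candidate identifier to predicted probability) are hypothetical, and treating "b2 times greater" as a greater-than-or-equal comparison is an assumption.

```python
from typing import Hashable, Optional


def choose_link(probs: dict, b1: float, b2: float) -> Optional[Hashable]:
    """Return the candidate to keep for one person, or None if no link is kept.

    probs maps each candidate id to its predicted probability of being the
    true match (for example, from the probit models described in the text).
    """
    if not probs:
        return None
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_id, best_p = ranked[0]
    # Threshold b1: the best candidate must be likely enough on its own.
    if best_p < b1:
        return None
    # Threshold b2: the best candidate must beat the runner-up by a factor b2.
    if len(ranked) > 1 and best_p < b2 * ranked[1][1]:
        return None
    return best_id
```

For example, with b1 = 0.5 and b2 = 3, a person whose two candidates have predicted probabilities 0.9 and 0.6 is left unlinked, since the best candidate is not sufficiently distinct from the second-best.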

Setting the values of b1 and b2 will influence the efficiency of linking and also the number

of false positives. In particular, we can measure the true positive rate (TPR) with our

training data. The formula for the true positive rate is the number of true positives (correctly

linked) divided by the sum of true positives and false negatives (fail to link a true match).

A related metric for the performance of the probit is the positive prediction value (PPV),

which measures the likelihood that a link in the set of predicted links is a true link. The formula

for the positive prediction value is the number of true positives divided by the sum of true

positives and false positives.

Given that we are linking full-count to full-count censuses, the potential size of a linked

dataset could be very large. The cost of reducing the TPR may thus be less than

the cost of linking to the wrong individual. We therefore err on the side of reducing the

linking rate and TPR in order to increase the PPV; this is essentially the same as relying

more on uncommon names rather than common names to build our dataset. To determine

the parameters for b1 and b2, we perform a grid search between 0 and 1 for the predicted

probability b1, and between 1 and 50 for the ratio. We choose b2 so the PPV is at least 0.90

in each of our training datasets; after hitting this rate of PPV, we pick b1 to maximize the

TPR.
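
The threshold search can be illustrated with the sketch below, which computes the TPR and PPV on hand-linked training data and searches a grid of (b1, b2) values. It is a simplification under stated assumptions: the data layout is hypothetical, the grid steps are arbitrary, and the search is done jointly (keep every pair that meets the PPV floor, then maximize the TPR) rather than in the two stages described in the text.

```python
def evaluate(people, b1, b2):
    """Return (TPR, PPV) for thresholds b1 and b2.

    people is a list with one entry per person; each entry is a list of
    (predicted_probability, is_true_match) tuples for that person's candidates.
    """
    tp = fp = fn = 0
    for cands in people:
        ranked = sorted(cands, key=lambda c: c[0], reverse=True)
        linked = None
        if ranked and ranked[0][0] >= b1:
            if len(ranked) == 1 or ranked[0][0] >= b2 * ranked[1][0]:
                linked = ranked[0]
        has_true = any(is_true for _, is_true in cands)
        if linked is not None and linked[1]:
            tp += 1                      # linked to the true person
        else:
            if linked is not None:
                fp += 1                  # linked to the wrong person
            if has_true:
                fn += 1                  # a true match existed but was missed
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return tpr, ppv


def grid_search(people, min_ppv=0.90):
    """Pick (b1, b2) with PPV >= min_ppv and the highest TPR, or None."""
    best = None
    for i in range(1, 100):              # b1 in 0.01, 0.02, ..., 0.99
        b1 = i / 100
        for b2 in range(1, 51):          # b2 in 1, 2, ..., 50
            tpr, ppv = evaluate(people, b1, b2)
            if ppv >= min_ppv and (best is None or tpr > best[2]):
                best = (b1, b2, tpr, ppv)
    return best
```

Raising either threshold can only remove links, so the PPV floor is bought at the cost of a lower TPR, which is the trade-off discussed above.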

The critical values for the probability threshold b1 and ratio threshold b2 are presented

in Table B6. It is immediately clear that the true positive rate is much lower than the positive

prediction value of 0.90: the values range from 0.265 for Russian to 0.82 for Dutch. Because

the true positive rates are relatively low, the overall linking rate between

censuses will be low; for example, for Russian we keep only 26.5 percent of the positive

hand-linked matches in our dataset. Since the hand-linked rate was only 15.9 percent,

one could expect an overall linking rate of 4.2 percent. However, we believe this to be a

worthwhile trade-off to increase confidence that the links are the correct individuals. Note

that the ratio threshold b2 sometimes requires the best link’s predicted probability to be nearly

33 times that of the second-best link; we believe we are being very conservative to ensure

the best matches in our dataset.
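
The expected overall linking rate quoted above is the product of the hand-linked rate and the true positive rate. A small worked example for the Russian sample, using only the two figures given in the text (the function name is illustrative):

```python
def expected_link_rate(hand_link_rate: float, true_positive_rate: float) -> float:
    """Share of the starting sample expected to end up in the linked data."""
    return hand_link_rate * true_positive_rate


# Russian mother's tongue: 15.9 percent hand-linked, TPR of 0.265.
russian = expected_link_rate(0.159, 0.265)
print(f"{russian:.1%}")  # 4.2%
```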

While we can easily apply our algorithm to linking the 1920-1930 censuses since these

censuses contain country of birth, mother’s tongue, year of arrival, year of birth, first name

and last name, we cannot easily apply this method to linking the 1910-1920 censuses. This

is because the mother’s tongue variable in the 1910 census was often coded as

English to reflect one’s ability to speak English rather than the language of the mother;

therefore the mother’s tongue “English” is severely overrepresented in the 1910 preliminary

full-count data. Since we do not have reliable mother’s tongue data in the 1910 census, it

is difficult to identify whether to use, for example, the linking predictions from the

Jewish/Yiddish mother’s tongue regression or the Russian mother’s tongue regression for those

who list a country of birth as Russia. We circumvent this problem by linking between the

1910 and 1920 censuses without mother’s tongue, and then assume that the individual’s mother’s

tongue is the most common mother’s tongue among the 25 potential matches in the 1920 census.

One trade-off when building the linked dataset is that those included in the dataset are

those with no close alternatives; primarily, this implies that they have uncommon names.

We show the results for representativeness in Table 1 in the main text. To make the sample
more representative of the population, we reweight it to match the distribution of ability to
speak English, country of birth, and ability to read and write in the 1920 and 1930 full-count
censuses.


B.2 Matching on Year in the United States

For the creation of the main dataset, we match immigrants across censuses based on years

in the United States.33 Abramitzky, Boustan and Eriksson (2014) do not match on years in

the United States because of concerns over heaping on zeroes or fives or from misreporting

in the variable. The advantages of matching on years in the United States are that it could
increase match rates, because similar names need not be dropped, and that it could decrease
the number of false positives, because an extra piece of information is available. Yet if the
variable was not recorded accurately, it would not improve the quality of the linked dataset.

One way to quantify the accuracy of the variable is to compare the extent of heaping
for years in the United States with heaping for age. One can measure the extent of heaping
by assuming a smooth distribution of age or years in the United States, and then counting
the number of people who report a value ending in zero or five; this is the approach used in
the ABCC index as described by A’Hearn, Baten and Crayen (2009). This index ranges from
0 to 100 and is interpreted as the percentage of people who know their true age or years in
the US. We therefore calculate the ABCC index for age and years in the United States
using a sample of individuals aged 18 to 52, with years of stay between 3 and 22. These
restrictions reflect the restrictions for the linked samples.34
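As a rough illustration of the calculation just described, the following Python sketch computes a Whipple-type index (W) on a restricted range and converts it to the ABCC index. It assumes the standard linear transformation ABCC = (1 − (W − 100)/400) × 100 for W ≥ 100; the function names and the toy data are hypothetical and are not taken from the replication files.

```python
def whipple_index(values, low, high):
    """Whipple-type index on the inclusive integer range [low, high].

    Equals 100 when exactly one fifth of reports end in 0 or 5 and
    500 when every report does. The range must span whole five-year
    cycles so that one fifth is the correct benchmark.
    """
    if (high - low + 1) % 5 != 0:
        raise ValueError("range must cover whole five-year cycles")
    in_range = [v for v in values if low <= v <= high]
    if not in_range:
        raise ValueError("no observations in range")
    heaped = sum(1 for v in in_range if v % 5 == 0)
    return 100.0 * heaped / (len(in_range) / 5.0)


def abcc_index(values, low, high):
    """ABCC index: 100 means no excess heaping, 0 means full heaping."""
    w = whipple_index(values, low, high)
    if w < 100.0:
        return 100.0
    return (1.0 - (w - 100.0) / 400.0) * 100.0


# Toy data (hypothetical): one person at every age 18-52, plus five extra at 30.
ages = list(range(18, 53)) + [30] * 5
print(abcc_index(ages, 18, 52))  # 87.5
```

Both ranges used in the text (ages 18 to 52 and years of stay 3 to 22) span whole five-year cycles, so the same function could be applied to either variable.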

Before comparing the ABCC index for years in the United States and age, Figure B1
shows the degree of heaping for both variables in the 1900 to 1930 United States Censuses.
There is clear heaping on the zeroes and fives for both variables, yet it is unclear from the
figure which variable has more heaping.

According to the ABCC index, the years in the United States variable is slightly more
heaped than the age variable: the ABCC index is 94.7 for age, suggesting that 94.7 percent
of individuals reported their true age. At the same time, the ABCC index for years in the
United States is 94.2, only 0.5 less than for age. This suggests that the years in the United
States variable is about as heaped as age.

Applying the methodology of ABCC to measure heaping for the years in the US variable

may not be accurate because it assumes that the underlying distribution is smooth. This

may not be the case for inflow data because of large yearly fluctuations that reflect the

business cycle (Hatton and Williamson, 1998); note that age does not have this problem

since fertility rates change smoothly. However, the immigration data between 1888 and 1922

from the Historical Statistics (Carter et al., 2006) shows that the average inflow in years

33 I discuss matching on Years in the US, which is equivalent to matching on Year of Arrival. That is, Year of Arrival = Year − Years in US. Year and Years in the US both change by ten in between censuses.

34 I start the calculation at years in the United States at 3 since heaping is very low between years zero and two. If we include these values, then there is less heaping for the years in the United States variable, which reinforces my conclusion that it is a good variable to match on.


that end in zero or five was 94.3 percent of the average inflow in other years. Thus, if
anything, there should be less heaping in the years in the United States variable than in the
age variable in the first place.

When hand-linking data between censuses, year of arrival was often a good indicator of

a match, but if there were multiple individuals with close years of arrival, it was difficult to

choose which individual was the correct match. Therefore, we did not take year of arrival

as informative if there were two potential matches with small differences in year of arrival

(within one or two years). However, the results from the probit indicate that year of arrival

does provide information about the quality of the match, where those with larger differences

in year of arrival are less likely to be matched. (See Tables B2 - B5).


Table B1: Hand-Linking the 1920 to 1930 Censuses

Mother’s Tongue      Random Draw   N with at least 1   N of Potential    Successful   Overall        Linking Rate given
                     in 1920       Potential Match     Matches in 1930   Link         Linking Rate   1 Potential Match
                                   in 1930
German               2,000         1,476               11,378            632          31.6           42.8
Yiddish, Jewish      2,000         1,745               25,810            617          30.9           35.4
Dutch                2,000         1,295               6,684             707          35.4           54.6
Swedish              2,000         1,621               21,682            691          34.6           42.6
Danish               2,000         1,537               18,697            582          29.1           37.9
Norwegian            2,000         1,518               13,995            599          30.0           39.5
Italian              2,000         1,806               29,773            620          31.0           34.3
French               2,000         1,113               3,205             460          23.0           41.3
Romanian             2,000         799                 2,955             258          12.9           32.3
Greek                2,000         1,576               21,932            369          18.5           23.4
Russian/Ukrainian    2,000         1,481               10,513            318          15.9           21.5
Czech/Slovak         2,000         1,454               16,114            333          16.7           22.9
Polish               2,000         1,677               27,394            466          23.3           27.8
Finnish              2,000         1,382               8,103             529          26.5           38.3
Magyar, Hungarian    2,000         1,433               10,156            367          18.4           25.6

Notes: Results from linking immigrants from the 1920 Census to the 1930 Census.


Table B2: Probit Results, Set 1

Mother’s tongue:                     English      German       Yiddish      Dutch

Year of Birth Diff is 1              -0.495***    -0.293***    -0.519***    -0.508***
                                     (0.0902)     (0.0920)     (0.0802)     (0.113)
Year of Birth Diff is 2              -0.817***    -0.489***    -0.737***    -0.804***
                                     (0.103)      (0.105)      (0.0891)     (0.144)
Year of Birth Diff is 3              -0.888***    -0.776***    -0.882***    -0.946***
                                     (0.110)      (0.120)      (0.104)      (0.155)
Year of Arr. Diff is 1               -0.600***    -0.450***    -0.464***    -0.472***
[...]
                                     (0.195)      (0.251)      (0.216)      (0.319)
JW Distance First name               -6.726**     -2.016       -3.088       1.314
                                     (2.910)      (2.441)      (2.770)      (2.148)
JW distance last name                -9.251***    -9.449***    -8.668***    -12.23***
                                     (0.867)      (0.732)      (0.771)      (0.980)
NYSIIS First name match              -0.103       0.177        0.150        0.612*
                                     (0.415)      (0.325)      (0.370)      (0.331)
NYSIIS Last name match               -0.397***    -0.385***    -0.301***    -0.518***
                                     (0.108)      (0.122)      (0.0957)     (0.183)
Hits                                 -0.175***    -0.200***    -0.179***    -0.271***
                                     (0.0197)     (0.0192)     (0.0229)     (0.0261)
Hits squared                         0.00419***   0.00517***   0.00389***   0.00677***
                                     (0.000680)   (0.000684)   (0.000745)   (0.000968)
First letter of last name match      0.238        -0.121       0.161        -0.477***
                                     (0.179)      (0.121)      (0.159)      (0.178)
First letter of first name match     0.0793       0.536***     1.620***     0.350
                                     (0.296)      (0.180)      (0.542)      (0.274)
NYSIIS last name match, 2 hits       0.543***     0.491***     0.596***     -1.114***
                                     (0.132)      (0.168)      (0.121)      (0.280)
NYSIIS last name match, unique       1.223***     0.879***     1.484***     2.012***
                                     (0.130)      (0.167)      (0.125)      (0.267)
JW Distance in NYSIIS Last name      -1.788**     -3.336***    -2.538***    -5.629***
                                     (0.782)      (0.652)      (0.734)      (0.997)
JW Distance in NYSIIS First name     -1.114*      -0.170       -0.0438      -0.421
                                     (0.649)      (0.253)      (0.126)      (0.342)
Middle initial match, if have one    1.096***     1.480***     0.127        0.877***
                                     (0.131)      (0.321)      (0.743)      (0.325)
Constant                             1.420***     1.381***     -0.256       2.928***
                                     (0.549)      (0.418)      (0.705)      (0.507)

Observations                         12,975       11,227       25,691       6,651

Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.



Table B3: Probit Results, Set 2

Mother’s tongue:                     Swedish      Danish       Norwegian    Italian

Year of Birth Diff is 1              -0.586***    -0.400***    -0.445***    -0.258***
                                     (0.0760)     (0.0830)     (0.0904)     (0.0694)
[...]
Year of Arr. Diff is 1               -0.605***    -0.653***    -0.736***    -0.373***
                                     (0.0803)     (0.0856)     (0.101)      (0.0814)
[...]
JW Distance First name               -4.548***    -4.985***    -2.658*      0.915
                                     (1.518)      (1.604)      (1.572)      (1.259)
JW distance last name                -6.400***    -5.700***    -7.320***    -10.52***
                                     (0.859)      (0.939)      (0.825)      (0.606)
NYSIIS First name match              0.244        0.187        0.402*       0.491***
                                     (0.234)      (0.222)      (0.231)      (0.189)
NYSIIS Last name match               -0.0609      -0.0626      -0.349***    0.0145
                                     (0.0992)     (0.101)      (0.112)      (0.0960)
Hits                                 -0.172***    -0.221***    -0.215***    -0.0653***
                                     (0.0189)     (0.0215)     (0.0202)     (0.0252)
Hits squared                         0.00358***   0.00497***   0.00473***   0.000408
                                     (0.000630)   (0.000720)   (0.000712)   (0.000809)
First letter of last name match      0.221        0.371*       0.380**      0.0309
                                     (0.137)      (0.191)      (0.156)      (0.148)
First letter of first name match     0.456***     1.024***     0.693***     0.155
                                     (0.164)      (0.216)      (0.190)      (0.110)
NYSIIS last name match, 2 hits       0.738***     0.0664       0.0171       0.707***
                                     (0.126)      (0.163)      (0.162)      (0.122)
NYSIIS last name match, unique       1.175***     1.449***     1.538***     0.713***
                                     (0.132)      (0.167)      (0.159)      (0.127)
JW Distance in NYSIIS Last name      -1.670**     -1.886**     -1.578**     -3.784***
                                     (0.765)      (0.899)      (0.710)      (0.609)
JW Distance in NYSIIS First name     -0.743***    -0.519**     -0.780***    -0.0790
                                     (0.254)      (0.254)      (0.277)      (0.147)
Middle initial match, if have one    1.194***     1.630***     1.017***     -
                                     (0.141)      (0.134)      (0.237)
Constant                             0.540        0.0836       0.498        0.800**
                                     (0.337)      (0.404)      (0.362)      (0.331)

Observations                         21,648       18,690       13,893       29,591

Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.


Table B4: Probit Results, Set 3

Mother’s tongue:                     French       Romanian     Greek        Russian

Year of Birth Diff is 1              -0.423***    -0.110       -0.318***    -0.0167
                                     (0.141)      (0.155)      (0.0903)     (0.117)
Year of Birth Diff is 2              -0.727***    -0.392**     -0.736***    -0.371***
                                     (0.170)      (0.161)      (0.109)      (0.124)
Year of Birth Diff is 3              -0.618***    -0.411**     -0.809***    -0.354***
                                     (0.168)      (0.170)      (0.121)      (0.131)
Year of Arr. Diff is 1               -0.323*      -0.114       -0.537***    -0.392***
                                     (0.173)      (0.174)      (0.105)      (0.120)
Year of Arr. Diff is 2               -0.456**     -0.0539      -0.646***    -0.650***
                                     (0.189)      (0.179)      (0.109)      (0.135)
Year of Arr. Diff is 3               -0.599***    -0.470**     -0.789***    -0.934***
                                     (0.189)      (0.197)      (0.120)      (0.152)
[...]
Year of Arr. Diff is 6               -0.895***    -0.486*      -1.422***    -0.918***
                                     (0.260)      (0.280)      (0.224)      (0.226)
[...]
JW Distance First name               -5.042*      -7.971       -0.616       -2.432
                                     (2.952)      (5.298)      (1.966)      (3.893)
[...]
NYSIIS First name match              -0.0698      -1.201       0.158        0.412
                                     (0.372)      (0.731)      (0.294)      (0.578)
NYSIIS Last name match               -1.022***    -0.591***    -0.158       -0.907***
                                     (0.206)      (0.225)      (0.114)      (0.157)
Hits                                 -0.341***    -0.251***    -0.216***    -0.225***
                                     (0.0362)     (0.0313)     (0.0254)     (0.0225)
Hits squared                         0.0109***    0.00663***   0.00503***   0.00545***
                                     (0.00171)    (0.00136)    (0.000831)   (0.000815)
First letter of last name match      0.191        0.101        0.298        0.183
                                     (0.212)      (0.189)      (0.208)      (0.158)
First letter of first name match     1.155***     1.009*       0.437*       -0.0285
                                     (0.425)      (0.529)      (0.229)      (0.300)
NYSIIS last name match, 2 hits       -0.696**     -0.605*      0.880***     0.0710
                                     (0.335)      (0.365)      (0.147)      (0.273)
[...]
JW Distance in NYSIIS Last name      -3.571***    -3.109***    -3.322***    -5.051***
                                     (1.032)      (0.871)      (0.700)      (0.839)
JW Distance in NYSIIS First name     -0.00510     0.00271      -0.572***    -0.140
                                     (0.216)      (0.269)      (0.218)      (0.236)
Middle initial match, if have one    1.110**      -            0.732        2.714**
                                     (0.482)                   (0.482)      (1.263)
Constant                             1.625**      1.894**      1.251***     1.630**
                                     (0.644)      (0.948)      (0.477)      (0.705)

Observations                         3,190        2,899        21,761       10,481

Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.


Table B5: Probit Results, Set 4

Mother’s tongue:                     Czech/Slovak Polish       Finnish      Magyar

Year of Birth Diff is 1              -0.354***    -0.309***    -0.352***    -0.217**
                                     (0.111)      (0.0781)     (0.0996)     (0.110)
[...]
Year of Arr. Diff is 7               -0.723**     -0.962***    -1.175***    -0.989***
                                     (0.289)      (0.332)      (0.248)      (0.221)
JW Distance First name               -6.304*      -2.514       1.789        -7.769***
                                     (3.228)      (2.770)      (1.852)      (2.581)
[...]
NYSIIS First name match              -0.871*      0.148        1.090***     -0.503
                                     (0.449)      (0.369)      (0.307)      (0.368)
NYSIIS Last name match               -0.335**     -0.599***    -0.680***    -0.457***
                                     (0.168)      (0.115)      (0.128)      (0.138)
Hits                                 -0.232***    -0.119***    -0.284***    -0.246***
                                     (0.0275)     (0.0250)     (0.0218)     (0.0236)
Hits squared                         0.00564***   0.00168**    0.00806***   0.00605***
                                     (0.000935)   (0.000816)   (0.000813)   (0.000862)
First letter of last name match      -0.169       0.254*       0.0135       -0.00937
                                     (0.163)      (0.146)      (0.141)      (0.164)
First letter of first name match     0.466        0.389*       0.266        0.324
                                     (0.321)      (0.222)      (0.181)      (0.328)
NYSIIS last name match, 2 hits       0.241        0.334**      0.146        -0.117
                                     (0.227)      (0.164)      (0.170)      (0.214)
[...]
JW Distance in NYSIIS Last name      -4.801***    -3.330***    -1.975***    -2.694***
                                     (0.911)      (0.627)      (0.638)      (0.746)
JW Distance in NYSIIS First name     -0.497*      0.0479       -1.126**     -0.138
                                     (0.282)      (0.244)      (0.498)      (0.201)
Middle initial match, if have one    -0.637       1.672        1.348***     2.898**
                                     (2.430)      (6.124)      (0.370)      (1.384)
Constant                             3.593***     1.228**      0.556        2.076***
                                     (0.652)      (0.498)      (0.399)      (0.513)

Observations                         16,041       27,298       8,006        9,891

Notes: Dependent variable is a successful link in our training data. Each column is a different set of training data.

Table B6: Meta-parameters for Linked Samples

Mother’s Tongue                                        b1      b2     PPV     TPR
English                                                0.236   1.7    0.901   0.760
German                                                 0.36    1.3    0.900   0.749
Yiddish, Jewish                                        0.343   2.1    0.902   0.587
Dutch                                                  0.211   1.8    0.902   0.905
Swedish                                                0.263   5      0.902   0.555
Danish                                                 0.313   3.1    0.902   0.638
Norwegian                                              0.334   2.1    0.901   0.723
Italian                                                0.479   1.5    0.900   0.470
French                                                 0.336   1.2    0.901   0.889
Romanian                                               0.408   2.4    0.903   0.627
Greek                                                  0.462   1.9    0.904   0.414
Russian/Ukrainian                                      0.508   2      0.904   0.445
Czech, Slovak, Serbo-Croatian, Slovene, Lithuanian     0.357   2.3    0.901   0.639
Polish                                                 0.346   8.2    0.903   0.399
Finnish                                                0.262   3.1    0.902   0.705
Magyar, Hungarian                                      0.375   6.4    0.903   0.564

Notes: Keep individuals in the linked sample who have a predicted probability above b1. We also keep those whose predicted probability is more than b2 times the second-highest predicted score.
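To make the role of the two thresholds concrete, the following Python sketch applies them to the predicted probabilities of one individual's candidate links. It assumes that a link is kept only when both conditions hold (best probability above b1, and best probability more than b2 times the second-best); the function name and the example probabilities are hypothetical, while the thresholds are the Polish values from Table B6.

```python
def accept_link(candidate_probs, b1, b2):
    """Return the index of the accepted candidate, or None.

    Assumed rule: keep the best candidate only if its predicted
    probability exceeds b1 AND is more than b2 times the
    second-highest predicted probability (when a second candidate exists).
    """
    if not candidate_probs:
        return None
    ranked = sorted(range(len(candidate_probs)),
                    key=lambda i: candidate_probs[i], reverse=True)
    best = ranked[0]
    p_best = candidate_probs[best]
    if p_best <= b1:
        return None
    if len(ranked) > 1 and p_best <= b2 * candidate_probs[ranked[1]]:
        return None
    return best


# Thresholds for Polish mother's tongue (Table B6).
B1_POLISH, B2_POLISH = 0.346, 8.2

print(accept_link([0.90, 0.05], B1_POLISH, B2_POLISH))  # 0: clear best match
print(accept_link([0.90, 0.20], B1_POLISH, B2_POLISH))  # None: close alternative
```

Under this reading, a high ratio threshold such as 8.2 discards any individual with a moderately plausible second candidate, which is why common names are underrepresented in the linked sample.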


Table B7: Linking Rates for Non-English-Speaking Source Countries

                     1910-1920 Link                              1920-1930 Link
Ethnicity            Potential Links   Linked    Linking Rate    Potential Links   Linked    Linking Rate
                     in 1920                                     in 1930
German               318417            21350     6.7             216442            22918     10.6
Yiddish, Jewish      329563            8252      2.5             253668            8520      3.4
Dutch                37708             5539      14.7            36717             7707      21.0
Swedish              116040            6112      5.3             73095             5841      8.0
Danish               34647             2346      6.8             27519             2696      9.8
Norwegian            68986             5765      8.4             39703             4134      10.4
Italian              564562            23800     4.2             486866            29724     6.1
French               26966             2544      9.4             33192             1829      5.5
Romanian             22938             1212      5.3             16934             737       4.4
Greek                77596             1119      1.4             85401             2083      2.4
Russian              136836            5012      3.7             113983            3210      2.8
Czech/Slovak         292614            2199      0.8             198365            5665      2.9
Polish               361386            3906      1.1             263270            7733      2.9
Finnish              41911             3211      7.7             25515             3065      12.0
Hungarian            96307             889       0.9             58080             2627      4.5

Notes: The linked sample sizes for the 1910 to 1920 and 1920 to 1930 censuses. Note that we only link non-English-speaking source countries.


Figure B1: Heaping on Age and Years in the US

Notes: Data is from IPUMS (1900-1930). The figure shows the residuals of the ability the log occupational score after removing the effects of age, sex and country of birth. The right hand side graph treats immigrants as able to speak English if they speak not well, well or very well.


C Discussion of Data Choices

C.1 Coding of Countries

I group the following countries into one country to maintain consistency across the 1900 to

1930 data:

• Russia includes Russia, Poland, Latvia, Lithuania, Estonia, and any Baltic state

• Hungary includes Hungary, Czechoslovakia and Yugoslavia

The following are how we code “Old” source countries, “New” source countries, and

English-speaking countries.

• Old source countries include: Canada, Denmark, Finland, Iceland, Norway, Swe-

den, England, Scotland, Wales, Ireland, Belgium, France, Luxembourg, Netherlands,

Switzerland, Australia and New Zealand. New source countries are all others, including

those from Eastern Europe, Asia, Africa and Central/South America.

• English-speaking countries in 1900 are Australia, (English) Canada, England, Ireland,

Scotland, India, British West Indies colonies (e.g., Antigua, Barbados, etc.), New

Zealand, the Philippines and Wales. The British and American colonies account for a very
small share of the overall immigrant total, so coding them as non-English does not
qualitatively affect results. For the years 1910 to 1930, English is coded by whether
the mother’s tongue is English or one was born in one of the English-speaking countries
listed for 1900. Between 1980 and 2010, English countries are coded to follow Bleakley and
Chin (2010).

D Recreating Bleakley and Chin’s (2004) Empirical Strategy

D.1 Age-at-Arrival Analysis

One way to measure the English premium is to exploit a well-defined relationship between

age at arrival and English fluency as an adult. In particular, second language acquisition

is easier at younger ages during the so-called “critical period;” after age 8-11, the ability

to acquire a second language decreases at a linear rate (Bleakley and Chin, 2004). To fix

ideas for how one can test this empirically, researchers typically estimate a variation of the

following regression using a sample of adults who arrived as children:


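$$
\begin{aligned}
\mathit{SpeakEng}_{ij} = {} & a_0 + a_1 \mathbf{1}[\mathit{AgeArrival} \geq 8]_{ij} + a_2 \mathit{NonEngSpeakCntry}_{ij} \\
& + a_3 \mathbf{1}[\mathit{AgeArrival} \geq 8] \times \mathit{NonEngSpeakCntry}_{ij} + \gamma_j + \Pi' X_{ij} + \upsilon_{ij}
\end{aligned}
\tag{3}
$$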

where SpeakEngij is an indicator that equals one if individual i from country of birth j can

speak English as an adult. After controlling for country of birth (γj) and various individual-

level observables (Xij) such as age, the equation fundamentally estimates two age-at-arrival

profiles: one for immigrants from English-speaking countries and one for immigrants from

non-English-speaking countries. Using English fluency as the dependent variable, a1 should

equal zero as English-speaking immigrants already know English no matter the age at arrival,

and a3 should be negative, reflecting the critical period of language acquisition.

To relate speaking English with labor market outcomes, one can estimate the same equa-

tion as above, but instead using wages as the dependent variable. However, one drawback

of using this strategy with a cross section of data is collinearity: since age at arrival, age

and years in the United States are collinear, one cannot precisely separate the effects. For

example, if one controls for age in the regression and finds a negative age-at-arrival profile,

this may reflect older arrivals having fewer years in the United States rather than worse

adaptability at an older age. The typical solution to this problem is to include natives in

the regression to identify the aging profile, and estimate the effect of age at arrival on the

native-immigrant gap by age (Schaafsma and Sweetman, 2001). Nevertheless, one could run

the following regression:

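$$
\begin{aligned}
\log(\mathit{Wages}_{ij}) = {} & b_0 + b_1 \mathbf{1}[\mathit{AgeArrival} \geq 8]_{ij} + b_2 \mathit{NonEngSpeakCntry}_{ij} \\
& + b_3 \mathbf{1}[\mathit{AgeArrival} \geq 8] \times \mathit{NonEngSpeakCntry}_{ij} + \eta_j + \Gamma' X_{ij} + \nu_{ij}
\end{aligned}
\tag{4}
$$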

Given that those arriving at older ages from non-English-speaking countries generally have

less English fluency, one would expect that these same immigrants would also have lower

wages; in other words, the expectation is that b3 is negative. Indeed, this is exactly what

studies find in present-day settings (Bleakley and Chin, 2004; Guven and Islam, 2015).

To retrieve a precise estimate of the return to speaking English, studies use a two-stage
least squares strategy where Equation 3 predicts English ability in the first stage, with the
interaction 1[AgeArrival ≥ 8] × NonEngSpeakCntryij as the excluded instrument. The
results from this first stage are then used in the second stage for a regression of wages on
predicted English ability.35 The 2SLS estimate of the premium for English is essentially the
ratio of the reduced form to the first stage ($\beta_{IV} = b_3 / a_3$). In the typical study,
the instrument is not simply the interaction between a dummy variable for age at arrival
above eight and being born in a non-English-speaking country, but rather the interaction
between the number of years of age at arrival past age 8 and being born in a
non-English-speaking country. This reflects the linear decrease in English-speaking ability
observed in the data (Bleakley and Chin, 2004). Finally, note that this IV strategy estimates
the effect only for those who are affected by the instrument, that is, the local average
treatment effect (LATE).
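The ratio interpretation can be illustrated with the dummy-instrument coefficients reported in the first two panels of Table D1. The Python sketch below simply divides the reduced-form coefficient (occupational score) by the first-stage coefficient (can speak English) for each census year. The function name is hypothetical, and because the 2SLS estimates reported in this appendix use the linear instrument and are estimated jointly, these back-of-the-envelope ratios are not expected to match them exactly.

```python
# Interaction coefficients from Table D1 (dummy instrument, no literacy control).
first_stage = {1900: -0.0460, 1910: -0.0886, 1920: -0.0295, 1930: -0.0186}   # a3
reduced_form = {1900: 0.0105, 1910: 0.00507, 1920: -0.00260, 1930: -0.0211}  # b3


def wald_ratio(b3, a3):
    """Ratio of the reduced form to the first stage (beta_IV = b3 / a3)."""
    if a3 == 0:
        raise ZeroDivisionError("first-stage coefficient is zero")
    return b3 / a3


ratios = {year: wald_ratio(reduced_form[year], first_stage[year])
          for year in sorted(first_stage)}

for year, ratio in ratios.items():
    print(year, round(ratio, 3))
```

The sign pattern is the point of the exercise: a positive reduced form combined with a negative first stage, as in 1900, mechanically implies a negative estimated premium.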

D.2 Regression evidence from the early 20th and early 21st centuries

I run the age-at-arrival regression specification from Equation 3 for the years 1900 to 1930
with two dependent variables: the ability to speak English and occupational score. The control
variables are purposely parsimonious: we only control for country of birth, current age,
and sex, as other controls like current neighborhood or marital status could be considered
outcomes of age at arrival. The results are shown in Table D1. As expected from the critical
period hypothesis, the coefficient on the interaction between arriving at or older than age
8 and being from a non-English-speaking country is negative. The effect is consistently
negative across all four decades and implies that arriving at an older age leads to about
a 1.9 to 8.9 percentage point drop in the likelihood of speaking English as an adult. Note
the wide variation in the age-at-arrival effect on English fluency, which suggests that
English proficiency may be driven by factors outside of the critical period hypothesis, such
as a different mix of country of origin or selection at older ages.

The second panel shows the effect of an immigrant’s age at arrival on occupational score.
The results, especially in the years 1900 and 1910, show a perplexing correlation: in 1900,
immigrants from non-English-speaking countries who arrived older than age 8, despite
speaking English at lower rates, held jobs that paid about 1.1% more than otherwise similar
immigrants who arrived under the age of 8. The effect is similarly positive in 1910, and it
is negative but statistically insignificant in 1920. Finally, by 1930, the expected results hold:
immigrants who arrived at older ages had lower English fluency rates and worse-paying
occupations. The results for lower English fluency and occupation are not being driven by
literacy: when controlling for literacy in the bottom two panels, the main results on English
fluency and occupation do not change.

How does the Bleakley and Chin analysis perform in the data between 1900 and 1930? In
the first panel of Table D2 we show the 2SLS estimates, recreating the Bleakley and Chin (2004)

35Thus, the reduced form equation of the outcome on the excluded instrument is Equation 4.


specification (i.e., using a linear decrease in English-speaking ability rather than a dummy
variable). While the pooled estimate for 1900 to 1930 yields an IV estimate of about a 17.6
percent return to English skills, which reflects the relationships shown in Panel A of Figure 3,
the 2SLS strategy estimates an unreasonably large negative premium for English in 1900
(18 log points) and a very large positive premium in 1930 (86 log points). This reflects the
decade-by-decade analysis in Table D1.

Why do the results change so drastically between the pooled analysis and each individual
year between 1900 and 1930? It may be that the premium truly rose between 1900 and 1930,
or the change may be due to other factors. In the second row, we show what happens when
the birth country composition from 1900 to 1930 is fixed at 1900 levels. For example,
Germany was 26.6% of the sample in 1900, but only 8.9% of the sample in 1930; we reweight
the Germans from 1910 to 1930 to be 26.6% of the sample. The results from this weighting
show a sizable, though imprecisely estimated, negative pooled premium for 1900 to 1930
using the age-at-arrival analysis. Therefore, one reason for the difference in premium from
1900 to 1930 in the unweighted sample (from negative to positive) is that the composition of
non-English-speaking sources shifted to poorer countries whose migrants had worse outcomes.
This suggests that the instrumental variable is correlated with other aspects of the child’s
environment rather than just English ability. See Alexander and Ward (2018) for a further
examination of the effect of age at arrival in the Age of Mass Migration.
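The reweighting can be sketched in a few lines of Python. The example uses the German shares quoted above (26.6% in 1900, 8.9% in 1930) and lumps every other birth country into a single hypothetical "Other" group for illustration; the function name and that aggregate group are assumptions, not part of the actual procedure, which reweights each country separately.

```python
def composition_weights(base_shares, sample_shares):
    """Weight per birth country so a later sample matches the base-year mix.

    Each observation from country c gets weight base_share / sample_share.
    """
    weights = {}
    for country, base in base_shares.items():
        share = sample_shares.get(country, 0.0)
        if share <= 0.0:
            raise ValueError(f"{country} is absent from the later sample")
        weights[country] = base / share
    return weights


shares_1900 = {"Germany": 0.266, "Other": 0.734}  # "Other" is a hypothetical aggregate
shares_1930 = {"Germany": 0.089, "Other": 0.911}

weights = composition_weights(shares_1900, shares_1930)

# After weighting, the 1930 sample reproduces the 1900 composition.
reweighted = {c: shares_1930[c] * weights[c] for c in shares_1930}
print(round(weights["Germany"], 2))  # 2.99
```

Each German observation in 1930 therefore counts roughly three times, while observations from the growing source countries are weighted down.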

In summary, the results from this section show that immigrants held higher paying jobs in

1900 despite having lower English skills. This implies that English was relatively unimportant

in 1900 compared to other non-language human capital that is correlated with age at arrival.

By 1930, this relationship is no longer true, which could either be because English was more

important or that non-English sources came from poorer countries. Ultimately, the age-at-

arrival analysis likely does not isolate the English premium throughout the entire 1900-1930

time period. This is likely because age-at-arrival effects are not similar between English-

speaking and non-English-speaking sources throughout different migrant cohorts.

Finally, we show the results from the instrumental variables strategy when applied to
the 2000 Census and the 2008-2012 ACS. Note that these estimates differ from the ones
reported in Bleakley and Chin (2004) for three reasons. First, we use an occupational score
(that reflects average earnings by occupation and country of birth) rather than income.
Second, we use the 2000 Census and 2008-2012 ACS rather than the 1990 Census. Third, we
code the ability to speak English as whether one spoke any English, rather than treating the
English categories as a continuous variable.

The results in Table D3 estimate a much larger premium for English skills relative to
the estimates from the early 20th century. The instrumental variables strategy estimates
that speaking any English leads to an increase in occupational score of 175 log points, about
10 times the estimate from the pooled analysis between 1900 and 1930. This evidence is
consistent with the main argument of the paper that the return to English skills has increased
over the past 100 years.


Table D1: Age-at-arrival Effects, 1900 to 1930

                                     I            II           III          IV
                                     1900         1910         1920         1930

Outcome: Can Speak English
Arrived Older than 8 ×               -0.0460***   -0.0886***   -0.0295***   -0.0186***
  Non-English Speaking Country       (0.00185)    (0.00477)    (0.00493)    (0.00137)
Arrived Older than 8                 -0.00492***  0.00117      -0.00261     0.000654
                                     (0.000522)   (0.00199)    (0.00378)    (0.000605)

Outcome: Occupational Score
Arrived Older than 8 ×               0.0105**     0.00507      -0.00260     -0.0211***
  Non-English Speaking Country       (0.00416)    (0.0105)     (0.00964)    (0.00463)
Arrived Older than 8                 -0.0353***   -0.0331***   -0.0410***   -0.0381***
                                     (0.00322)    (0.00865)    (0.00841)    (0.00411)

Outcome: Can Speak English, Control for Literacy
Arrived Older than 8 ×               -0.0403***   -0.0823***   -0.0232***   -0.0124***
  Non-English Speaking Country       (0.00178)    (0.00467)    (0.00488)    (0.00133)
Arrived Older than 8                 -0.00384***  0.00294      -0.00263     0.000980
                                     (0.000600)   (0.00200)    (0.00381)    (0.000617)

Outcome: Occupational Score, Control for Literacy
Arrived Older than 8 ×               0.0127***    0.00787      -3.03e-05    -0.0187***
  Non-English Speaking Country       (0.00415)    (0.0104)     (0.00964)    (0.00463)
Arrived Older than 8                 -0.0349***   -0.0324***   -0.0410***   -0.0380***
                                     (0.00321)    (0.00864)    (0.00839)    (0.00411)

N                                    89,959       22,081       24,851       121,689

Notes: Data is from IPUMS (1900-1930). The dependent variable is listed for each panel. Each column represents the same regression but for a different census. Sex, current age, and country of birth are also controls in the regression. Robust standard errors are in parentheses. *p<0.10, **p<0.05, ***p<0.01


Table D2: Recreating Bleakley and Chin (2004) 2SLS Estimate, early 20th century

                         Pooled 1900-1930   1900        1910       1920       1930

Normal Weights
Can Speak English        0.176***           -0.181**    -0.0463    -0.0309    0.858***
                         (0.0597)           (0.0731)    (0.0938)   (0.280)    (0.238)

Fix at 1900 Country Composition
Can Speak English        -0.101             -0.181**    -0.185     -0.680     0.144
                         (0.0983)           (0.0731)    (0.145)    (0.710)    (0.362)

Observations             258,580            89,959      22,081     24,851     121,689

Notes: Data is from IPUMS (1900-1930). The dependent variable is the logged occupational score from the 1940 census. The sample consists of immigrants aged 16-55 who arrived under age 17. The second row reweights the 1910 to 1930 samples to have the same birth country composition as in the 1900 sample. *p<0.10, **p<0.05, ***p<0.01.

Table D3: Recreating Bleakley and Chin (2004) 2SLS Estimate, early 21st century

                         Pooled 2000 Census       2000        2008-2012
                         and 2008-2012 ACS
Can Speak English        1.752***                 1.505***    2.151***
                         (0.159)                  (0.198)     (0.255)
Observations             356,050                  171,535     184,515

Notes: Data is from IPUMS (2000, 2008-2012 ACS). The dependent variable is the logged occupational score, constructed from the average total income by country of birth and occupation. The sample consists of immigrants aged 16-55 who arrived under age 17. *p<0.10, **p<0.05, ***p<0.01.

E Immigrant-Specific Occupational Score

In this section, we repeat the information given in Alexander and Ward’s Appendix D (2018)

on the creation of the immigrant-occupational score for interested readers:

We create the immigrant-specific occupational score to improve on the standard

occupational scores used in the literature, such as the 1950 occscore from IPUMS

and the 1901 Cost of Living Survey score. There are important limitations when

using these commonly used scores; for example, the 1901 Cost of Living Score

is only representative for married urban families and therefore does not provide

an accurate estimate for rural workers. The 1950 occupational score reflects

earnings after World War II, and therefore understates wage gaps for data prior

to World War II (Goldin and Margo, 1992). Moreover, neither score reflects

earnings that are specific to immigrants and thus they understate any difference

between immigrants and natives, a key interest for this paper.


We create an alternative occupational score that is based on income reported

in the full-count 1940 United States Census. Our approach follows Collins and

Wanamaker (henceforth CW) (2014, 2017) in that we impute income separately

by group; but instead of groups separated by race and region as in CW, we

impute income separately by country of birth. Therefore, the occupation score

is essentially the average earnings in each occupation / country of birth cell.

We provide further details on how we create the score below, but we follow

Appendix I.b of CW (2017) to fix for self-employed earnings and non-monetary

compensation for farm laborers and farmers.

First, we take the full-count 1940 United States Census and top-code income

to 5,000 for wage workers. For self-employed workers, we ignore their reported

wage income since this is not consistently reported, but we instead impute their

income. To do this, we follow the strategy laid out by CW (2017) where we

take the ratio of self-employed earnings to wage-worker earnings by occupation

in the 1960 census, assume this ratio from 1960 is a good proxy for the ratio

in 1940, and multiply the ratio with the mean wage income by occupation and

country of birth. This leads to an imputed income for each self-employed person

that varies by occupation and country of birth. Then we collapse the 1940 data

by detailed occupation code and country of birth to get an average income for

each occupation, which forms the occupational score for the large majority of our

data.

We do not take the above approach for farm laborers and farmers because they
may receive compensation in kind, which is not recorded in the income data. We
take a few extra steps to estimate their incomes. Starting with farm laborers and
once again following CW (2017), we increase farm laborers’ mean wage income
in the 1940 census by 26 percent to reflect in-kind compensation, which is based
on the 1957 USDA report Major Statistical Series of the U.S. Department of
Agriculture. The next step is to estimate income for farmers. First, we assume
that the perquisite rate of farmers in the 1960 census is 35 percent (also based on
the USDA report), and we scale up their reported (wage and business) income
by this factor. To create the final estimate for farmer income in 1940, we assume
that the ratio between farm laborer and farmer income (inclusive of perquisites)
in 1960 is the same as in 1940. Therefore, we need to estimate farm laborers’
income in 1960, which we do by boosting their reported income by 19 percent to
reflect in-kind compensation.
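The steps quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions: all function names and dollar amounts are hypothetical, "scaling up by this factor" is read as multiplying by one plus the perquisite rate, and the 1960 self-employed-to-wage ratio is passed in as a given number rather than computed from census data.

```python
TOP_CODE = 5000.0          # top-code for 1940 wage income
LABORER_PERQ_1940 = 0.26   # in-kind boost for farm laborers, 1940
LABORER_PERQ_1960 = 0.19   # in-kind boost for farm laborers, 1960
FARMER_PERQ_1960 = 0.35    # perquisite rate for farmers, 1960


def cell_score(wage_incomes, n_self_employed, ratio_1960):
    """Average income in one occupation / country-of-birth cell.

    Wage workers contribute top-coded wage income; each self-employed
    worker is assigned ratio_1960 times the cell's mean wage income.
    """
    capped = [min(w, TOP_CODE) for w in wage_incomes]
    wage_mean = sum(capped) / len(capped)
    imputed = ratio_1960 * wage_mean
    total = sum(capped) + n_self_employed * imputed
    return total / (len(capped) + n_self_employed)


def farm_laborer_income_1940(mean_wage_1940):
    """Farm laborer income including in-kind compensation."""
    return mean_wage_1940 * (1 + LABORER_PERQ_1940)


def farmer_income_1940(mean_wage_1940, farmer_income_1960, laborer_wage_1960):
    """Farmer income in 1940, assuming the 1960 farmer/laborer ratio holds."""
    farmer_1960 = farmer_income_1960 * (1 + FARMER_PERQ_1960)
    laborer_1960 = laborer_wage_1960 * (1 + LABORER_PERQ_1960)
    return farm_laborer_income_1940(mean_wage_1940) * farmer_1960 / laborer_1960


# Hypothetical inputs, for illustration only.
score = cell_score([1200.0, 1800.0, 9000.0], n_self_employed=1, ratio_1960=1.5)
laborer = farm_laborer_income_1940(400.0)
farmer = farmer_income_1940(400.0, farmer_income_1960=3000.0, laborer_wage_1960=1500.0)
print(round(score, 2), round(laborer, 2), round(farmer, 2))
```

In the example, the 9,000 wage report is capped at 5,000 before averaging, so a single very high earner cannot dominate a small occupation / country-of-birth cell.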
