incorporating dialectal variability for socially equitable...
Post on 08-Apr-2018
224 Views
Preview:
TRANSCRIPT
Incorporating Dialectal Variability for Socially Equitable
Language IdentificationDavid Jurgens, Yulia Tsvetkov, and Dan Jurafsky
McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction” Journal
of Computing Sciences in Colleges 20(3) 2005.
“This paper describes […] how even the most simple of these methods
using data obtained from the World Wide Web achieve accuracy
approaching 100% on a test suite comprised of ten European languages”
McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction” Journal
of Computing Sciences in Colleges 20(3) 2005.
Whose language are we identifying?
Whose language are we identifying?
Whose language are we identifying?
Whose language are we identifying?
Global platforms attract global diversity in a language
English
Global platforms attract global diversity in a language
English 125M Speakers
90M Speakers
79M Speakers
60M Speakers
251M Speakers
Global platforms attract global diversity in a language
English
French Spanish Arabic
125M Speakers
90M Speakers
79M Speakers
60M Speakers
251M Speakers
5
Human Development Index of text’s origin country
Estimated LID accuracy for
English tweets
{EducationLife expectancy Income
(Labov, 1964; Ash, 2002)
5
Human Development Index of text’s origin country
Estimated LID accuracy for
English tweets
{EducationLife expectancy Income
MoreDialect
LessDialect
(Labov, 1964; Ash, 2002)
5
Human Development Index of text’s origin country
Estimated LID accuracy for
English tweets
{EducationLife expectancy Income
MoreDialect
LessDialect
(Labov, 1964; Ash, 2002)
Current language detection methods perform significantly worse in less-developed countries
5
Human Development Index of text’s origin country
Estimated LID accuracy for
English tweets
{EducationLife expectancy Income
MoreDialect
LessDialect
(Labov, 1964; Ash, 2002)
Current language detection methods perform significantly worse in less-developed countries
5
Human Development Index of text’s origin country
Estimated LID accuracy for
English tweets }23%
{EducationLife expectancy Income
MoreDialect
LessDialect
(Labov, 1964; Ash, 2002)
6
Keyword Filter“flu”, “sick”
Practical Motivation: Epidemic Detection
NLP Which symptoms?
6
6
Keyword Filter“flu”, “sick”
Practical Motivation: Epidemic Detection
NLP Which symptoms?
LanguageDetection
6
6
Keyword Filter“flu”, “sick”
Practical Motivation: Epidemic Detection
non-English
NLP Which symptoms?
LanguageDetection
6
6
Keyword Filter“flu”, “sick”
Practical Motivation: Epidemic Detection
non-English
NLP Which symptoms?
LanguageDetection
6
6
Keyword Filter“flu”, “sick”
Practical Motivation: Epidemic Detection
non-English
NLP Which symptoms?
LanguageDetection
6
6
Keyword Filter“flu”, “sick”
Practical Motivation: Epidemic Detection
NLP Which symptoms?
LanguageDetection
non-English?
6
Failing to recognize a language silences its
speakers’ voices
Current language detection methods perform significantly worse in less-developed countries
8
Human Development Index of text’s origin country
Estimated accuracy for
English tweets
MoreDialect
LessDialect
(Labov, 1964; Ash, 2002)
Current language detection methods perform significantly worse in less-developed countries
8
Human Development Index of text’s origin country
Estimated accuracy for
English tweets
MoreDialect
LessDialect
(Labov, 1964; Ash, 2002)
Our goal is make language ID performance equal for all
languages across all dialects
Current language detection methods perform significantly worse in less-developed countries
8
Human Development Index of text’s origin country
Estimated accuracy for
English tweets
MoreDialect
LessDialect
(Labov, 1964; Ash, 2002)
Our goal is make language ID performance equal for all
languages across all dialects
This is a
universal
NLP issue!
Key Problems: Current methods struggle in the global setting because
9
Key Problems: Current methods struggle in the global setting because
9
Data: No corpora that captures global variation in lexicon and dialect
Key Problems: Current methods struggle in the global setting because
9
Data: No corpora that captures global variation in lexicon and dialect
Model: makes simplistic assumptions about how multilinguals communicate
Our approach
10
NLP methodologies capable of handling linguistic variation
Better social representation through network-based
sampling
Our Data Solution: Improve linguistic representation through network-based sampling
11
Our Data Solution: Improve linguistic representation through network-based sampling
11
Bootstrap dialectic corpora using existing classifiers to find monolingual individuals
Our Data Solution: Improve linguistic representation through network-based sampling
11
Bootstrap dialectic corpora using existing classifiers to find monolingual individuals
Our Data Solution: Improve linguistic representation through network-based sampling
11
Bootstrap dialectic corpora using existing classifiers to find monolingual individuals
eng
Our Data Solution: Improve linguistic representation through network-based sampling
11
Bootstrap dialectic corpora using existing classifiers to find monolingual individuals
engeng
Our Data Solution: Improve linguistic representation through network-based sampling
11
Bootstrap dialectic corpora using existing classifiers to find monolingual individuals
engeng
eng
engeng
fra
Our Data Solution: Improve linguistic representation through network-based sampling
11
Bootstrap dialectic corpora using existing classifiers to find monolingual individuals
engeng
eng
engeng
engfra
Our Data Solution: Improve linguistic representation through network-based sampling
11
Bootstrap dialectic corpora using existing classifiers to find monolingual individuals
Sample from the geolocated Twitter social network to include text from people at all locations
engeng
eng
engeng
engfra
Build a strategically-diverse corpora and synthesize code-switched examples
12
Build a strategically-diverse corpora and synthesize code-switched examples
12
Topical
Build a strategically-diverse corpora and synthesize code-switched examples
12
Topical Geographic
Build a strategically-diverse corpora and synthesize code-switched examples
12
Topical
Social
Geographic
Build a strategically-diverse corpora and synthesize code-switched examples
12
Topical
Social
Geographic
Multilingual
Our model solution: treat language identification as a character-based sequence to sequence task.
13
Encoder
Decoder
Je vais commander à emporter. I’m too lazy to cook.Jaech et al. 2016; Samih et al. 2016
Our model solution: treat language identification as a character-based sequence to sequence task.
13
Encoder
Decoder
Je vais commander à emporter. I’m too lazy to cook.Jaech et al. 2016; Samih et al. 2016
Represents a multi-layer recurrent neural network
Our model solution: treat language identification as a character-based sequence to sequence task.
13
Encoder
Decoder
Je vais commander à emporter. I’m too lazy to cook.J e _ o k .…
Jaech et al. 2016; Samih et al. 2016
Represents a multi-layer recurrent neural network
Our model solution: treat language identification as a character-based sequence to sequence task.
13
Encodes the whole sentence using its charactersEncoder
Decoder
Je vais commander à emporter. I’m too lazy to cook.J e _ o k .…
Jaech et al. 2016; Samih et al. 2016
Represents a multi-layer recurrent neural network
Our model solution: treat language identification as a character-based sequence to sequence task.
13
Encodes the whole sentence using its charactersEncoder
Decoder
Je vais commander à emporter. I’m too lazy to cook.J e _ o k .…
Decode each word’s language from the sentence encoding
Jaech et al. 2016; Samih et al. 2016
Represents a multi-layer recurrent neural network
Our model solution: treat language identification as a character-based sequence to sequence task.
13
Encodes the whole sentence using its characters
Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .
Encoder
Decoder
Je vais commander à emporter. I’m too lazy to cook.J e _ o k .…
Decode each word’s language from the sentence encoding
Jaech et al. 2016; Samih et al. 2016
Represents a multi-layer recurrent neural network
14
Equilid vs off-the-shelf
Lui et al. 2013, 2014
Our M
etho
d
14
0
25
50
75
100
70 Languages on Twitter
Mac
ro F
1
langi
d.py CL
D2Ou
r Met
hod
Equilid vs off-the-shelf
Lui et al. 2013, 2014
Our M
etho
d
14
0
25
50
75
100
70 Languages on Twitter
Mac
ro F
1
langi
d.py CL
D2Ou
r Met
hod
0
25
50
75
100
Geo-diverse Tweets
Mac
ro F
1
langi
d.py
CLD2 Ou
r Met
hod
Equilid vs off-the-shelf
Lui et al. 2013, 2014
Our M
etho
d
14
0
25
50
75
100
70 Languages on Twitter
Mac
ro F
1
langi
d.py CL
D2Ou
r Met
hod
0
25
50
75
100
Geo-diverse Tweets
Mac
ro F
1
langi
d.py
CLD2 Ou
r Met
hod
0
25
50
75
100
Multilingual Tweets
Mac
ro F
1
Polyg
lot CLD2
Equilid vs off-the-shelf
Lui et al. 2013, 2014
Our M
etho
d
15
Equilid even outperforms system specifically tuned for each dataset
0
50
100
70 Languages on Twitter
9291.2M
acro
F1
Our M
etho
d
Mac
ro F
1
langi
d.py CL
D2Ou
r Met
hod
0
50
100
TweetLID
79.678.7
Jaec
h et
al. (
2016
)
Our M
etho
d
Jaec
h et
al. (
2016
)
Case Study: Do our solutions provide socially-equitable language identification for
health-related queries?
16
1M Tweets with any of 385 English terms from
established lexicons for influenza, psychological well-
being, and social health
Case Study: Do our solutions provide socially-equitable language identification for
health-related queries?
Lamb et al., (2013); Smith et al., (2016); Preotiuc-Pietro et al., (2015); Park et al., (2016)16
1M Tweets with any of 385 English terms from
established lexicons for influenza, psychological well-
being, and social health
Case Study: Do our solutions provide socially-equitable language identification for
health-related queries?
Lamb et al., (2013); Smith et al., (2016); Preotiuc-Pietro et al., (2015); Park et al., (2016)16
Task: does the language identification system recognize every tweet as English?
1M Tweets with any of 385 English terms from
established lexicons for influenza, psychological well-
being, and social health
Equilid raises the bar for socially-equitable language identification
17
Human Development Index of text’s origin country
Estim
ated
acc
urac
y fo
r En
glish
twee
ts
Social Equality doesn’t stop at Language Identification
18
Methodologies capable of handling
language as it is used
Better social representation in
our data
Social Equality doesn’t stop at Language Identification
18
Methodologies capable of handling
language as it is used
Better social representation in
our data
19
David Jurgens, Yulia Tsvetkov, and Dan Jurafsky
Be equitable! https://github.com/davidjurgens/equilid
top related