The Linguistics of Twitter - PyCon 2011 Presentation


Upload: michael-healy

Post on 18-Dec-2014


DESCRIPTION

'The Linguistics of Twitter', a presentation from PyCon 2011, which I hope starts a dialogue about what we need in order to measure the effects of social media accurately.

TRANSCRIPT

Page 1: The Linguistics of Twitter - PyCon 2011 Presentation

American English Regional Dialects

Changing Speech Patterns, Changing Online Measurement

Michael D. Healy
[email protected]
http://michaeldhealy.com
@MichaelDHealy

Page 2: The Linguistics of Twitter - PyCon 2011 Presentation

Michael D. Healy

• Econometrics
• Linguistics
• Not an Engineer

Measuring and Influencing Online and Offline Behavior

Why am I here?

This Seemed Like an Interesting Problem

Page 3: The Linguistics of Twitter - PyCon 2011 Presentation

Plan of Action

• Background
• Where We Stand
    o Data Collection Interlude
• Historical Context
• Where We May Be Going
• Potential Solutions
    o Sort Of

Page 4: The Linguistics of Twitter - PyCon 2011 Presentation

Introduction: Hawaiian Pidgin Video

Page 5: The Linguistics of Twitter - PyCon 2011 Presentation

Plan of Action

• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions

Page 6: The Linguistics of Twitter - PyCon 2011 Presentation

Background
Regional Differences In Word Choice

MrEverything6's Tweet

Dallas, Texas Region

coke - Coca-Cola or soft drink in general?

Coca-Cola Probably Wants To Know

Page 7: The Linguistics of Twitter - PyCon 2011 Presentation

Background
Regional Differences In Pronunciation
More Than Just Drawl

pin

Is that:
Pin a tail on the donkey.
-OR-
Give me a 'pin' to write with.

Page 8: The Linguistics of Twitter - PyCon 2011 Presentation

Plan of Action

• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions

Page 9: The Linguistics of Twitter - PyCon 2011 Presentation

Where We Stand

Page 10: The Linguistics of Twitter - PyCon 2011 Presentation

Where We Stand

Page 11: The Linguistics of Twitter - PyCon 2011 Presentation

Detailed Dialectal Map

http://aschmann.net/AmEng/

Page 12: The Linguistics of Twitter - PyCon 2011 Presentation

Where We Stand

Wait!
Isn't This All Just Poor English?
They Don't Speak The King's English!

1) America Doesn't Have A King

Page 13: The Linguistics of Twitter - PyCon 2011 Presentation

Where We Stand

Wait!
Isn't This All Just Poor English?

2) English Doesn't Have An Authority Like:

French: L'Académie française

Spanish: Asociación de Academias de la Lengua Española

Numerous Others: http://en.wikipedia.org/wiki/List_of_language_regulators

Page 14: The Linguistics of Twitter - PyCon 2011 Presentation

Where We Stand

Who Is Right?
Everyone

Prescriptive Linguistics: Tell You What Is Right

Descriptive Linguistics: Describe How You Communicate

Trying To Sell More Widgets?

Probably Descriptive Is Best

Page 15: The Linguistics of Twitter - PyCon 2011 Presentation

Where We Stand

Selected American English Dialects:
• New England
• Northern
• North Midland
• South Midland
• NYC
• Western
• AAVE
• Hawaiian Pidgin

Page 16: The Linguistics of Twitter - PyCon 2011 Presentation

Plan of Action

• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions

Page 17: The Linguistics of Twitter - PyCon 2011 Presentation

Historical Context

Linguists Thought TV Would Make Us All Sound The Same

Think Tom Brokaw

Area of 'Standard American English'

Not Overly Large
Not Heavily Populated

Page 18: The Linguistics of Twitter - PyCon 2011 Presentation

Historical Context

Been To Wisconsin?

Seen Fargo?

Biggest Change In Spoken English Since 1750

Going On Right Now - After TV

'Oh yeah? Yeah'

Page 19: The Linguistics of Twitter - PyCon 2011 Presentation

Historical Context

Portions Of America Experience Some or All of the Northern Cities Vowel Shift

Page 20: The Linguistics of Twitter - PyCon 2011 Presentation

Historical Context

To Sum This Up:
People In The Northern Cities Region Are Producing An English That Sounds Very Different From Other Dialects

Page 21: The Linguistics of Twitter - PyCon 2011 Presentation

Historical Context

America Has Been Multi-Lingual Since July 9, 1776

Page 22: The Linguistics of Twitter - PyCon 2011 Presentation

Plan of Action

• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions

Page 23: The Linguistics of Twitter - PyCon 2011 Presentation

Where We May Be Going

Page 24: The Linguistics of Twitter - PyCon 2011 Presentation

Where We May Be Going

~74% of Americans Live In A Megaregion

Megaregions Tied To Existing Dialect Regions

Page 25: The Linguistics of Twitter - PyCon 2011 Presentation

Where We May Be Going

William Labov, PhD
Professor of Linguistics
University of Pennsylvania
http://www.ling.upenn.edu/~wlabov/

Pretty Much The Authority on American English Dialects

'And instead of getting a pepper-and-salt effect, we find very clear and sharp divisions between the dialects of the United States, which are getting more different from each other as time goes on.'

Page 26: The Linguistics of Twitter - PyCon 2011 Presentation

Plan of Action

• Background
• Where We Stand
• Historical Context
• Where We May Be Going
• Potential Solutions

Page 27: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

One American Dialect Is Unique In Geography:

African-American Vernacular English (AAVE)

Not In A Geographically Contiguous Region

Page 28: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Center For Applied Linguistics.

"Thats the way baseball go."

Page 29: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Correct the Spelling & Grammar

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        # Honor the max_dist argument instead of hard-coding 2.
        self.max_dist = max_dist

    def replace(self, word):
        # Words already in the dictionary pass through unchanged.
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        # Accept the top suggestion only if it is within max_dist edits.
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

Page 30: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Example 1

well im gonna go so i’ll talk to u lata 1

Corrected Example 1

Well mi Donna go so I'll talk to U late
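One way to see why dictionary-based correction goes wrong on a word like 'lata': the dictionary word nearest by edit distance is not necessarily the word the writer meant. The sketch below is a minimal pure-Python illustration, not the enchant-based code from the slide. The three-word vocabulary is hypothetical, and `nearest_word` is a helper name invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical mini-dictionary; a real spell checker has far more entries.
VOCAB = ["later", "late", "lake"]


def nearest_word(word, vocab=VOCAB):
    """Return the vocabulary word with the smallest edit distance."""
    return min(vocab, key=lambda candidate: edit_distance(word, candidate))


for candidate in VOCAB:
    print(candidate, edit_distance("lata", candidate))
print(nearest_word("lata"))
```

Here 'late' is one edit away from 'lata' while the intended 'later' is two edits away, so a nearest-match rule picks the wrong word.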

Page 31: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Build Out a Dictionary of Words

Regex Match and Replace

proper_words = {
    'hater': ['enemy', 'jealous individual', 'not friend'],
    'coke': ['coke', 'soda', 'pop'],
}

Which Region?
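The 'Which Region?' question suggests that a lookup table needs a region as part of its key. The following is a minimal sketch under that assumption. The region codes, the glosses, and the names `REGIONAL_SENSES`, `resolve`, and `normalize` are all illustrative and do not come from the talk.

```python
# Hypothetical sense table: region codes and glosses are illustrative only.
REGIONAL_SENSES = {
    "coke": {"south": "soft drink", "default": "Coca-Cola"},
}


def resolve(word, region):
    """Pick the gloss for `word` in `region`, falling back to a default."""
    senses = REGIONAL_SENSES.get(word.lower())
    if senses is None:
        return word  # unknown words pass through unchanged
    return senses.get(region, senses["default"])


def normalize(text, region):
    """Resolve each whitespace-separated token for the given region."""
    return " ".join(resolve(token, region) for token in text.split())


print(normalize("grab me a coke", "south"))
print(normalize("grab me a coke", "northeast"))
```

The same tweet text then normalizes differently depending on where it was written, which is only useful if the region is known. That is where geographic indexing comes in later in the deck.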

Page 32: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Example 2

well i gotta go, i’ll talk to you later aight bye 1

Page 33: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

import re

replacement_patterns = [
    (r'gotta', 'got to'),
    (r"i\'ll", 'I will'),
    ('aight', 'all right'),
]

class RegexReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile each pattern once, up front.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply every (pattern, replacement) pair in order.
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
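The patterns on this slide match anywhere in the text, so 'aight' would also fire inside a word such as 'straight', and the straight-apostrophe pattern would miss a curly apostrophe like the one in the example tweet as transcribed here. A possible refinement, sketched under those assumptions, anchors each pattern on word boundaries. `PATTERNS` and `replace_slang` are names invented for this example.

```python
import re

# Same three rewrites as the slide, but anchored on word boundaries,
# case-insensitive, and accepting both straight and curly apostrophes.
PATTERNS = [
    (re.compile(r"\bgotta\b", re.IGNORECASE), "got to"),
    (re.compile(r"\bi['\u2019]ll\b", re.IGNORECASE), "I will"),
    (re.compile(r"\baight\b", re.IGNORECASE), "all right"),
]


def replace_slang(text):
    """Apply each pattern in order and return the rewritten text."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text


print(replace_slang("well i gotta go, i\u2019ll talk to you later aight bye 1"))
print(replace_slang("go straight home"))
```

Word boundaries keep the rewrite from touching 'straight', but they do nothing for the trailing '1', which still needs a rule that knows what it means in context.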

Page 34: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Example 2

well i gotta go, i’ll talk to you later aight bye 1

well i got to go, I will talk to you later All right Bye 1 (!?)

Page 35: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Example 2

well i got to go, I will talk to you later All right Bye 1 (!?)

Here '1' has the concept of: I understand

Page 36: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Solution?
Bayesian Prediction Using a Custom Corpus

First Step: Tag Existing Data

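import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def tokenize(para):
    # Print the list of sentences found in the paragraph.
    print(tokenizer.tokenize(para))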
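Once data has been tagged, 'Bayesian prediction' could be as simple as a naive Bayes classifier over word counts. The sketch below is a toy version of that idea in plain Python, not the author's implementation. The four tagged examples, the region labels, and the function names `train` and `predict` are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged corpus: the labels and example texts are made up.
TAGGED = [
    ("hella traffic on the bridge today", "norcal"),
    ("that show was hella good", "norcal"),
    ("yall want a coke or what", "texas"),
    ("fixin to grab a coke yall", "texas"),
]


def train(tagged):
    """Count label frequencies and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in tagged:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TAGGED)
print(predict("hella questions", model))
print(predict("yall got a coke", model))
```

Add-one smoothing keeps unseen words from zeroing out a label. With a corpus this small the result only shows the mechanics, which is why tagging real data is the first step.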

Page 37: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Solution?
Bayesian Prediction Using a Custom Corpus

Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol

Tokenized as:
'Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol'

So lots of custom work to be done...
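The stock sentence tokenizer returned the whole tweet as a single sentence, so part of the custom work might be a splitter that treats runs of dots as breaks. The sketch below is one hedged way to do that with regular expressions. The rules and the names `split_sentences` and `tokenize_words` are assumptions for this example, not part of NLTK, and the input is the tail of the tweet above.

```python
import re

# Runs of two or more dots act as sentence breaks in casual tweets;
# ordinary ., ! and ? followed by whitespace still count too.
BREAK = re.compile(r"\.{2,}\s*|(?<=[.!?])\s+")
TOKEN = re.compile(r"[#@]?\w+(?:['\u2019]\w+)*")


def split_sentences(text):
    """Split on dot runs or on whitespace after ., ! or ?."""
    return [part.strip() for part in BREAK.split(text) if part.strip()]


def tokenize_words(sentence):
    """Keep hashtags, mentions and apostrophes attached to their words."""
    return TOKEN.findall(sentence)


tweet = ("she called I hit ignored..neva pick up on da first "
         "call..playa rule number 23 lol")
for sentence in split_sentences(tweet):
    print(tokenize_words(sentence))
```

This yields three segments for the example, but it does nothing about spelling variants such as 'neva' or 'da', which still need a dialect-aware lexicon.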

Page 38: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

_andBeautyKills: – after tonight, don’t leave your boy roun’ me,umma #true playa fareal.

Local To SF:
Neecy89: This african boy jus started askin me hella questions idk if he was tryin to be nice or tryna kill me lol

Page 39: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Geographic Indexing
SimpleGeo

import simplegeo.shared, simplegeo.places
from simplegeo.shared import Feature

client = simplegeo.places.Client('your-oauth-token', 'your-oauth-secret')
properties = {"province": "CA", "city": "San Francisco", "name": "SimpleGeo SF",
              "country": "US", "phone": "+1 415 626 1375",
              "address": "41 Decatur St", "postcode": "94103"}
f = simplegeo.places.Feature((37.772392, -122.405752), properties=properties)
client.add_feature(f)
'SG_5uZpvipNjVaSbbDv5bvZaa_37.772392_-122.405752@1291847366'

Page 40: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions

Geographic Indexing
SimpleGeo: Queries

import simplegeo.places

def start(lon, lat):
    # The credentials file holds the OAuth token and secret on separate lines.
    oauth, secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
    client = simplegeo.places.Client(oauth, secret)
    results = client.search(lon, lat)
    return results

def search(lon, lat, tweet):
    results = start(lon, lat)
    # Report every tweet word that exactly matches a nearby place name.
    for word in tweet.split():
        for i in results:
            data = i.to_dict()
            if word == data['properties']['name']:
                print('%s %s' % (data['name'], word))

Page 41: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions: SimpleGeo-Tools

import simplegeo.places
import simplegeo.context

class SimpleGeoAuth(object):
    def __init__(self):
        # The credentials file holds the OAuth token and secret on separate lines.
        self.oauth, self.secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
        self.places_client = simplegeo.places.Client(self.oauth, self.secret)
        self.context_client = simplegeo.context.Client(self.oauth, self.secret)

    def SimpleGeoContextualQuery(self, lat, lon, text):
        # Return the first nearby place whose name exactly matches a word in the text.
        geo_results = self.places_client.search(lat, lon)
        for word in text.split():
            for geo_result in geo_results:
                data = geo_result.to_dict()
                if word == data['properties']['name']:
                    return data['name'], word

    def SimpleGeoContextQuery(self, lat, lon):
        context_results = self.context_client.get_context(lat, lon)
        return context_results
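The matching idea in these slides (look up places near the tweet's coordinates, then check whether the tweet mentions any of them) can be sketched without any particular geo service. The version below is a stand-alone illustration under stated assumptions. The place list is hypothetical with approximate coordinates, `haversine_km` and `places_mentioned` are names invented here, and the argument order is fixed as (lat, lon) because the slide code uses both orders. It matches on substrings instead of single words so that multi-word names such as 'Dolores Park' can be found.

```python
import math

# Hypothetical nearby places as (name, latitude, longitude); in practice
# these would come from whatever places API is available.
PLACES = [
    ("SimpleGeo SF", 37.772392, -122.405752),
    ("Dolores Park", 37.7596, -122.4269),
    ("Fenway Park", 42.3467, -71.0972),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def places_mentioned(lat, lon, text, radius_km=5.0, places=PLACES):
    """Return names of places within radius_km whose name occurs in text."""
    lowered = text.lower()
    return [name for name, plat, plon in places
            if haversine_km(lat, lon, plat, plon) <= radius_km
            and name.lower() in lowered]


print(places_mentioned(37.77, -122.41, "meet me at dolores park lol"))
```

The 5 km default radius is arbitrary. A tweet sent from San Francisco that mentions 'Fenway Park' would not match, because that place is filtered out by distance before the text is checked.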

Page 42: The Linguistics of Twitter - PyCon 2011 Presentation

Potential Solutions: Connect the APIs

Page 43: The Linguistics of Twitter - PyCon 2011 Presentation

References

Jacob Perkins: NLTK Master Ninja
Python Text Processing with NLTK 2.0 Cookbook
https://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
http://streamhacker.com/

Eisenstein, J., O'Connor, B., Smith, N., and Xing, E. (2010). A Latent Variable Model for Geographic Lexical Variation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, October 2010.

Cheng, Z., Caverlee, J., and Lee, K. (2010). You Are Where You Tweet: A Content-Based Approach to Geo-Locating Twitter Users. In CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management.

Page 44: The Linguistics of Twitter - PyCon 2011 Presentation

References

Repustate: Sentiment Analysis API http://repustate.com/

Rapleaf Personalization API https://www.rapleaf.com/

SimpleGeo GIS Solution API http://simplegeo.com/

Page 45: The Linguistics of Twitter - PyCon 2011 Presentation

Michael D. Healy SimpleGeo-Tools

Michael D. Healy [email protected] http://michaeldhealy.com @MichaelDHealy

SimpleGeo-Tools https://github.com/michaeldhealy/SimpleGeo-Tools