Spatio-temporal demographic classification of the Twitter users
DESCRIPTION
Use of social media continues to increase day by day, with implications for the creation of ‘big’ data – Twitter alone was forecast to have created 1.8 zettabytes of data in 2011. This talk presents initial work towards the creation of geo-temporal geodemographic classifications using Twitter social media data. London was chosen as the study area because of its high incidence of users and the consequent expectation that higher penetration might be associated with lower demographic bias.
TRANSCRIPT
Spatio-temporal demographic classification of the Twitter users
Paul Longley, Muhammad Adnan, Guy Lansley
Department of Geography, University College London
Web: http://www.uncertaintyofidentity.com
Outline
1. Introduction
• Geodemographics
• Social Media Geodemographics
2. Twitter
3. A geo-temporal demographic classification of Twitter users
• Residence of Twitter users
• Ethnic classification of Twitter users
• Age classification of Twitter users
• Computing the demographic classification
Introduction
• Geodemographics
• “Analysis of people by where they live” [1]
• Night time characteristics of the population
• Social Media Geodemographics
• Moving beyond the night time geography
• Who: Ethnicity, Gender, and Age of social media users
• When: What time of day conversations happen
• Where: Where social media conversations happen
[1] Sleight, P. (2004). Targeting Customers: How to Use Geodemographic and Lifestyle Data in Your Business.
Twitter (www.twitter.com)
• Online social-networking and microblogging service
• Launched in 2006
• Users can send messages of 140 characters or less
• Approximately 200 million active users [2]
• 350 million tweets daily
• In 2013, UK and London were ranked 4th and 3rd, respectively, in terms of the number of posted tweets [3]
[2] Twitter. 2012. What is Twitter? Retrieved 31st December, 2012, from https://business.twitter.com/basics/what-is-twitter/.
[3] Bennet, S. 2013. Revealed: The Top 20 Countries and Cities of Twitter [STATS]. Retrieved 31st December, 2013, from http://www.mediabistro.com/alltwitter/twitter-top-countries_b26726.
Data available through the Twitter API
• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
Twitter data for the case study
• Approx. 8 million geo-tagged tweets (Jan – Dec, 2013)
• Sent by 385,050 unique users
• 155,249 users sent 5 or more tweets (7.6 million tweets)
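The five-tweet threshold above can be illustrated with a small sketch. The record layout (pairs of user ID and tweet text) and the function name `active_users` are assumptions made for illustration; the slides do not describe how the filtering was implemented.

```python
from collections import Counter

def active_users(tweets, min_tweets=5):
    """Return the set of user IDs with at least `min_tweets` tweets.

    `tweets` is an iterable of (user_id, tweet_text) pairs; this record
    layout is a hypothetical one chosen for the example.
    """
    counts = Counter(user_id for user_id, _ in tweets)
    return {user for user, n in counts.items() if n >= min_tweets}

# Toy data: user "a" sent 5 tweets, user "b" sent 2.
sample = [("a", "t")] * 5 + [("b", "t")] * 2
kept = active_users(sample)
```

On the toy data only user "a" meets the threshold.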
Variables for creating a geo-temporal classification
1. Residence
• Where Twitter users live
2. Ethnicity
• Probable ethnic origins of Twitter users
3. Age
• Probable age of Twitter users
4. Land Use Category of a Tweet message
• Residential; Non-domestic building; Park etc.
5. Temporal Scales
• Day, Afternoon, Night, Peak travel hours
Residence of Twitter Users
• A 170 m × 170 m grid was used to find the probable residence of users
• A probable residence was found for 75,522 users
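A minimal sketch of how a 170 m grid could be used to pick a probable residence. It assumes tweet coordinates have already been projected to metres (for example, easting and northing), and it assumes the most frequently used cell stands in for the home location; the slides do not state the actual selection rule, and the function names are hypothetical.

```python
from collections import Counter

CELL_SIZE_M = 170  # grid resolution stated on the slide

def grid_cell(easting, northing, cell_size=CELL_SIZE_M):
    """Map projected coordinates in metres to an integer (column, row) cell."""
    return (int(easting // cell_size), int(northing // cell_size))

def probable_residence(points, cell_size=CELL_SIZE_M):
    """Return the grid cell that contains most of one user's tweets.

    Assumption: the modal cell approximates the residence. The rule
    actually used in the study is not given in the slides.
    """
    cells = Counter(grid_cell(e, n, cell_size) for e, n in points)
    cell, _count = cells.most_common(1)[0]
    return cell

# Toy data: two tweets fall in the same cell, one falls elsewhere.
points = [(530000.0, 180040.0), (530050.0, 180100.0), (531500.0, 182000.0)]
home = probable_residence(points)
```

A rule restricted to night-time tweets would fit the "night time geography" framing of the talk, but that too would be an assumption.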
Extracting demographic attributes of Twitter users by using their forenames and surnames
A name is a statement of the bearer’s cultural, ethnic, and linguistic identity [4]
[4] Mateos P, Longley P A, O’Sullivan D 2011. Ethnicity and population structure in personal naming networks. PloS ONE (Public Library of Science) 6 (9) e22943.
Analysing Names on Twitter
• Some examples of NAME variations on Twitter
• Approx. 68% of the accounts have real names
Fake Names
Castor 5.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Justin Bieber Home
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes
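The distinction between the two lists above can be approximated with a simple heuristic before any name classifier is applied. This is only a sketch: the list of honorifics, the function name `split_name`, and the rules are all hypothetical and are not the procedure used in the study or by Onomap.

```python
import re

# Hypothetical list of honorifics to strip; not taken from the slides.
TITLES = {"dr", "prof", "mr", "mrs", "ms", "miss"}

def split_name(display_name):
    """Return (forename, surname) if the text looks like a personal name, else None."""
    # Split on whitespace and commas, then drop full stops around each token.
    tokens = [t.strip(".") for t in re.split(r"[\s,]+", display_name.strip())]
    tokens = [t for t in tokens if t and t.lower() not in TITLES]
    if len(tokens) < 2:
        return None  # a single token cannot give a forename-surname pair
    if not all(t.isalpha() for t in tokens):
        return None  # digits, underscores or punctuation suggest a handle
    if not (tokens[0][0].isupper() and tokens[-1][0].isupper()):
        return None
    return tokens[0], " ".join(tokens[1:])

pair = split_name("Carolina Thomas, Dr.")
```

A heuristic like this still accepts "Justin Bieber Home" as a real name, which illustrates why surface rules alone cannot separate real names from fake ones.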
Onomap: Names to Ethnicity classification
• Onomap was created by clustering the names of 1 billion individuals around the world
• Onomap (www.onomap.org) was applied to forename – surname pairs
Kevin Hodge (English)
Pablo Mateos (Spanish)
…
Age estimation from ‘forenames’
• Monica dataset provided by CACI Ltd, UK
• Supplemented with UK birth certificate records
[5] Longley, P., Adnan, M., Lansley, G. 2013. “The geo-temporal demographics of Twitter usage”. Environment and Planning A. (In Press)
Age distribution of Twitter users
Twitter Users vs. 2011 Census (Greater London)
Land-use Categories
• Every tweet message was assigned a land-use category
Variables for creating a geo-temporal classification
1. Residence
V1: Tweet made near probable London residence
V2: Tweeter lives ‘outside the UK’
V3: Tweeter lives in the rest of the UK outside London
2. Total Number of Tweets
V4: Total number of tweets made by the user
3. Ethnicity
V5: West European
V6: East European
V7: Greek or Turkish
V8: South East Asian
V9: Other Asian
V10: African & Caribbean
V11: Jewish
V12: Chinese
V13: Other minority
4. Age
V14: <=20
V15: 21 - 30
V16: 31 - 40
V17: 41 - 50
V18: 50+
5. Tweets outside the UK
V19: In West Europe (not including UK)
V20: In East Europe
V21: In North America
V22: In Central or South America
V23: In Australasia
V24: In Africa
V25: In Middle East
V26: In Asia
V27: In Paris
Variables for creating a geo-temporal classification
6. Number of countries visited
V28: Number of countries tweeter has visited
7. London Land Use Category
V29: Residential location
V30: Non-domestic buildings
V31: Transport links and locations
V32: Green-spaces
V33: All other land uses
8. 2011 London Output Area Classification
V34: Intermediate Lifestyles
V35: High Density and High Rise Flats
V36: Settled Asians
V37: Urban Elites
V38: City Vibe
V39: London Life-Cycle
V40: Multi-Ethnic Suburbs
V41: Ageing-City Fringe
9. Temporal Scales
V42: Morning Peak Hours
V43: Week Day
V44: Afternoon
V45: Week Night
V46: Weekend
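To make the temporal variables concrete, the sketch below assigns a tweet timestamp to one of the five categories V42 to V46. Every hour boundary in it is an illustrative assumption: the slides name the categories but do not define them, and the function name `temporal_category` is hypothetical.

```python
from datetime import datetime

def temporal_category(ts):
    """Assign a timestamp to one of the five temporal variables (V42-V46).

    All hour boundaries are assumptions chosen for illustration only.
    """
    if ts.weekday() >= 5:      # Monday is 0, so 5 and 6 are Saturday and Sunday
        return "Weekend"
    if 7 <= ts.hour < 10:      # assumed morning peak
        return "Morning Peak Hours"
    if 10 <= ts.hour < 12:     # assumed remainder of the working morning
        return "Week Day"
    if 12 <= ts.hour < 18:     # assumed afternoon
        return "Afternoon"
    return "Week Night"        # everything else on a weekday

category = temporal_category(datetime(2013, 6, 3, 8, 30))  # a Monday morning
```

Counting a user's tweets per category and dividing by the user's total would give one possible form for these variables, though the slides do not say whether counts or proportions were used.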
Computing the geo-temporal classifications
• Segmentations were created by using K-means clustering algorithm
• K-means tries to find cluster centroids by minimising
$$V = \sum_{x=1}^{n} \sum_{y=1}^{n} (z_x - \mu_y)^2$$
• Seven clusters
• Group A: London Residents
• Group B: Commuting Professionals
• Group C: Student Lifestyle
• Group D: The Daily Grind
• Group E: Spectators
• Group F: Visitors
• Group G: Workplace and tourist activity
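The K-means step named on this slide can be sketched in a few lines. This is a plain-Python version of Lloyd's algorithm on toy two-dimensional data with two clusters, not the authors' implementation; the study used 46 variables and seven clusters, and the starting centroids and function names here are assumptions for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Mean of each cluster's members; an empty cluster keeps its old centroid."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(old)
    return new

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        updated = update(points, assign(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids

def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
v = within_cluster_ss(points, labels, centroids)
```

Each iteration can only lower or hold the within-cluster sum of squares, which is the quantity K-means minimises. The result depends on the starting centroids, so in practice the algorithm is usually run from several starts and the lowest-scoring solution is kept.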
Group A: London Residents
• Tweets made near primary residential locations
• Tweets made on weeknights or weekends
Group B: Commuting Professionals
• Tweets made from:
• Transport locations
• ‘Urban Elites’ LOAC classification
• Tweets made by individuals of intermediate age (21-30)
Group F: Visitors
• Tweeters live outside London
• Tweets originated from residential land uses
• Mixed age groups
Group G: Workplace and tourist activity
• Tweets sent from non-domestic buildings
• Full range of Twitter age cohorts
• Tweets originate from a mix of residents and international visitors
Conclusion
• Geo-temporal demographic classifications
• Census (night time geography)
• Social media data (day and travel time geography)
• Issues of representation
• An insight into the residential and travel geographies of individuals
• An insight into the spatial activity patterns of different kinds of social media users
Any questions?
Thank you for listening