PhD Colloquium: Spatial Analysis
DESCRIPTION
Presentation given as part of a PhD Colloquium on Spatial Analysis, delivered on Wed 11th January 2013.
TRANSCRIPT
Data Mining to Understand International Dimensions to Online Identity - a classification of 2+ billion names and their linkage to virtual identities and social network traffic.
• Alistair Leak
• UCL SECReT
• [email protected]
Who am I?
Education:
Kingston University (BSc) - GIS
UCL (M.Res) - Advanced Spatial Analysis and Visualisation
UCL 3+1 - PhD Security and Crime Science
Supervisors:
1st Supervisor: Professor Paul Longley
2nd Supervisor: Dr James Cheshire
Definitions:
• Netnography
  – “A qualitative, interpretive research methodology that uses internet-optimized ethnographic research techniques to study the social context in online communities” (Kozinets, 2009)
• Cybergeodemographics
  – “The analysis of people by where they live and by whom they interact with, in real and virtual space” (Longley, 2012)
Uncertainty of Identity: Work Package 4: Cybergeodemographics
• Use of primary and secondary data to relate virtual Internet traffic to the probable physical locations from which it emanated; and the development of typologies of social networks that are robust, generalized and related to physical locations.
Data Collection Tools (WP1)
Text Analytics (WP2)
Cybergeodemographics (WP4)
Secondary Data
Working Title:
• “Data Mining to Understand International Dimensions to Online Identity - a classification of 2+ billion names and their linkage to virtual identities and social network traffic”
Objectives:
• Develop spatial context of name network classification
• Develop typologies of social networks
• Measure how representative social media is of the underlying population.
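One simple way to express the third objective is to compare each group's share of the sample with its share of the reference population. The sketch below is only an illustration under stated assumptions: the function name, the group labels and the counts are hypothetical, and the slides do not say which measure the project will adopt.

```python
from typing import Dict


def representation_ratios(sample_counts: Dict[str, int],
                          population_counts: Dict[str, int]) -> Dict[str, float]:
    """Ratio of sample share to population share for each group.

    A ratio of 1.0 means the group appears in the sample in proportion to
    the population; below 1.0 it is under-represented, above 1.0 it is
    over-represented.
    """
    sample_total = sum(sample_counts.values())
    population_total = sum(population_counts.values())
    if sample_total == 0 or population_total == 0:
        return {}

    ratios = {}
    for group, population in population_counts.items():
        if population == 0:
            continue  # no population baseline, so the ratio is undefined
        sample_share = sample_counts.get(group, 0) / sample_total
        population_share = population / population_total
        ratios[group] = sample_share / population_share
    return ratios


# Hypothetical counts, for illustration only.
tweeting_users = {"group_a": 30, "group_b": 70}
census_residents = {"group_a": 50, "group_b": 50}
ratios = representation_ratios(tweeting_users, census_residents)
```

With these made-up counts, group_a is under-represented among users and group_b is over-represented.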
Work Plan
• M.Res (Present – 2013)
  – Foundation work
    • Assess representative capability of tweet data
  – Skills Development
    • Spatio-Temporal Data Mining
    • Database Management
• Ph.D (2013 – 2016)
  – Objectives
    • Develop spatial component of names networks
    • Develop typologies of social networks
    • Develop a measure of uncertainty
  – Completion in August 2016
Data Sources:
*Sina Weibo
Case Study: Tweets in London
• 1.4 million tweets over 3 months (Sep - Dec 2012)
What’s in a Tweet?
First Name
Surname
Unique ID
Popularity
Interactions
# Themes
Possibilities:
• Political Affiliation
• Gender
• Age
• Location
Time/Date
Location
• Gender
  – Database of 62,000 names with genders
  – Determined by forename
• Demographic
  – OAC: Output Area Classification
• ONOMAP
  – Ethnicity, religion, geographical origin
  – Determined by forename and surname combination
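The forename-based gender step can be pictured as a dictionary lookup on the first token of a user's display name. This is a minimal sketch, not the project's actual pipeline: the four-entry `FORENAME_GENDER` table is a hypothetical stand-in for the 62,000-name database, and the function names are invented for illustration.

```python
from typing import Optional

# Hypothetical stand-in for the 62,000-name forename/gender database.
FORENAME_GENDER = {
    "james": "male",
    "paul": "male",
    "sarah": "female",
    "emily": "female",
}


def forename_of(display_name: str) -> Optional[str]:
    """Return the lower-cased first token of a display name, or None if empty."""
    tokens = display_name.strip().split()
    return tokens[0].lower() if tokens else None


def classify_gender(display_name: str) -> str:
    """Look the forename up in the table; names not found stay 'unknown'."""
    forename = forename_of(display_name)
    if forename is None:
        return "unknown"
    return FORENAME_GENDER.get(forename, "unknown")
```

Names missing from the table are left as "unknown" rather than guessed, which keeps the unclassified share of users visible. A forename and surname lookup of the ONOMAP kind could follow the same pattern with a two-part key.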
Data Classification
[Figure: Tweets by ONOMAP Religion]
[Figure: Tweets by ONOMAP Group]
Challenges of Study
• Signal from Noise
  – Tweets are not all sent from individuals' homes
    • Day and night demographics
  – Not all geolocated tweets come from real people
• Data Quality/Sample Size
  – Twitter users are self-selecting
    • Only a small proportion have enabled location services
    • Dataset currently has 92,000 unique users
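The sample-size concern above can be checked by counting distinct users and the share of them with at least one located tweet. The sketch below assumes each tweet is a dictionary with hypothetical `user_id`, `lat` and `lon` keys; the real collection schema is not described in the slides.

```python
from typing import Dict, Iterable, Optional, Tuple

Tweet = Dict[str, Optional[float]]


def summarise_users(tweets: Iterable[Tweet]) -> Tuple[int, int, float]:
    """Return (unique users, users with a located tweet, located share)."""
    all_users = set()
    located_users = set()
    for tweet in tweets:
        user = tweet["user_id"]
        all_users.add(user)
        # A tweet counts as located only if both coordinates are present.
        if tweet.get("lat") is not None and tweet.get("lon") is not None:
            located_users.add(user)
    share = len(located_users) / len(all_users) if all_users else 0.0
    return len(all_users), len(located_users), share


# Hypothetical records, for illustration only.
sample = [
    {"user_id": 1, "lat": 51.52, "lon": -0.13},
    {"user_id": 1, "lat": None, "lon": None},
    {"user_id": 2, "lat": None, "lon": None},
]
unique, located, share = summarise_users(sample)
```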
Target Areas of Study
• Spatio-temporal differentiation of tweets
  – Night
  – Day
  – Travel
• Expansion of the Methodology for World Names
  – Initially into Europe.
• Application of new name datasets.
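The night, day and travel split listed under spatio-temporal differentiation could start from a tweet's local timestamp. In this sketch the hour boundaries (night from 22:00 to 06:00, travel during assumed commuting peaks of 07:00 to 09:00 and 17:00 to 19:00) are placeholder assumptions, not values taken from the study.

```python
from datetime import datetime


def time_band(timestamp: datetime) -> str:
    """Assign a tweet timestamp to 'night', 'travel' or 'day'.

    The hour thresholds are illustrative assumptions only.
    """
    hour = timestamp.hour
    if hour >= 22 or hour < 6:
        return "night"
    if 7 <= hour < 9 or 17 <= hour < 19:
        return "travel"
    return "day"


# Example: a tweet sent at 08:15 falls in the assumed morning commuting peak.
band = time_band(datetime(2012, 10, 1, 8, 15))
```

Night-time tweets are the more plausible proxy for home location, which is why the bands are kept separate before any link to residential geodemographics is made.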
References:
• Dale, M. R. T., and M-J. Fortin. "From graphs to spatial graphs." Annual Review of Ecology, Evolution, and Systematics 41.1 (2010): 21.
• Fischer, E. (July, 2011). World Map of Flickr and Twitter Locations. In See Something or Say Something. Available at http://www.flickr.com/photos/walkingsf/5912169471/in/set-72157627140310742
• http://urbantick.blogspot.co.uk/2010/12/ncl-social-networks.html
• Kozinets, Robert V. Netnography: Doing ethnographic research online. Sage Publications Limited, 2009.
• R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
• Rao, D., Yarowsky, D., Shreevats, A., & Gupta, M. (2010, October). Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents (pp. 37-44). ACM.
Thank you
X Factor Graph
Produced with R and Gephi