Big Social Data: The Spatial Turn in Big Data (video available soon on YouTube)
DESCRIPTION
Big Social Data: The Spatial Turn in Big Data
By Richard Heimann & Abe Usher
University of Maryland Baltimore County

Webinar Description: Increased access to spatial data and improved application of spatial analytical methods hold considerable potential for social scientific research. This webinar focuses on substantive social science research perspectives while showing participants the rewards of applying geographic information systems (GIS), Big Data, and spatial analytics in their own research.

As the hype of Web 2.0 wears off and collaborative use of the Internet becomes a societal norm, we are witnessing an unprecedented explosion in the creation and analysis of geospatial data. Just as major governments are reducing their investments in location intelligence, individuals and non-government organizations are fueling a bonfire of innovation in the world of GIS data. Traditional spatial analyses grew up in an era of sparse data and very weak computational power. Today, both of those circumstances are reversed, and many of the old solutions are no longer suitable for answering today's questions.

"Big Social Data: The Spatial Turn in Big Data" reflects this change and combines two things which, until recently, engaged quite different groups of researchers and practitioners. Together, they require particular techniques and a sophisticated understanding of the special problems associated with spatial social data. Geographic Data Mining, or Geographic Knowledge Discovery, is not new, but it is developing and changing rapidly as more, and different, data become available and people see new applications. The days of ‘Big Data’ require fresh thinking. The webinar will highlight connections between spatial concepts and data availability.
Emerging social media data will be emphasized over traditional social science data, since they better reflect some of the more recent developments in Big Data, most notably the socially critical exploration of such data.
TRANSCRIPT
1
Big Social Data: The Spatial Turn in Big Data
Rich Heimann, UMBC Adjunct Faculty
Abe Usher, HumanGeo Group
May 9, 2013
2
Agenda
• Major Trends; Foundational Definitions [Abe]
• Long Tail of Big Social Data [Rich]
• Laws of the Spatial Sciences [Rich]
  – Big Data; Small Theory [Rich]
• Important Big Data Concepts [Abe]
  – The Kitchen Model [Abe]
• Vignettes [Rich & Abe]
• So, what?
• Additional Resources
3
Major Trends
• Location Explosion 2004-present
4
• Location Explosion 2004-present
• Proliferation of mobile computing
Major Trends
7 billion devices in 2014
5
• Location Explosion 2004-present
• Proliferation of mobile computing
• Social networking
Major Trends
> 700 million comments daily
> 144 million connections daily
6
• Location Explosion 2004-present
• Proliferation of mobile computing
• Social networking
• Gamification of geo
Major Trends
7
• Location Explosion 2004-present
• Proliferation of mobile computing
• Social networking
• Gamification of geo
Impact: Continuous, global geo-located observations, shared across the Internet.
Major Trends
8
Definitions
Volunteered Geographic Information* (VGI)
“harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals”
* http://en.wikipedia.org/wiki/Volunteered_geographic_information
9
Volunteered Geographic Information (VGI)
Social Media
“a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content”
* http://goo.gl/oSrIS
Definitions
10
Volunteered Geographic Information (VGI)
Social Media
Big Data
“is high-volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”
* http://goo.gl/DFFbr
Definitions
11
Long Tail: Traditional Social Science Data
Head: Big Data; nontraditional social science data.
Head: Big Data – Large continuous datasets coincident over Time & Space. Ideal for multivariate analysis.
Tail {power law distribution}: Data in the tail is often individually curated and unmaintained beyond its initially designed use case. As a result, the data is discontiguous from other research efforts and discontinuous over space and time. Dark data is data that is suspected to exist, or ought to exist, but is difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The long tail is an intractably large management problem.
Long Tail of Big Social Data
12
Power law              80%             20%
Number of Grants       7,478           1,869
Dollar Amount          $938,548,595    $1,199,088,125
Total Grants (NSF07)   9,347 (Count)   $2,137,636,716 (Amount)
Long Tail of NSF Data
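The split on this slide can be checked with a few lines of arithmetic. The sketch below is a minimal illustration that uses only the grant counts and dollar amounts shown on the slide; the variable names are illustrative and do not come from the authors' code.

```python
# Grant counts and dollar amounts as given on the slide (NSF07).
tail_grants, head_grants = 7_478, 1_869
tail_dollars, head_dollars = 938_548_595, 1_199_088_125

total_grants = tail_grants + head_grants
total_dollars = tail_dollars + head_dollars

# Share of grants and of dollars held by each group.
tail_grant_share = tail_grants / total_grants
head_grant_share = head_grants / total_grants
tail_dollar_share = tail_dollars / total_dollars
head_dollar_share = head_dollars / total_dollars

print(f"80% group: {tail_grant_share:.1%} of grants, {tail_dollar_share:.1%} of dollars")
print(f"20% group: {head_grant_share:.1%} of grants, {head_dollar_share:.1%} of dollars")
```

On these figures the 20% group accounts for roughly 56% of the dollars, while the 80% group shares roughly 44%.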
13
Tobler’s [Tobler, 1970] First Law of Geography (TFLG)
TFLG: “All things are related, but nearby things are more related than distant things”
Spatial Heterogeneity: “Second law of geography” [Goodchild, 2003].
Spatial Simpson’s Paradox: Global model will always compete and may be inconsistent with local models.
Anyon (1982): social science should be empirically grounded, theoretically explanatory and socially critical.
Laws of Spatial Science
http://www.bigdatarepublic.com/author.asp?section_id=2948
14
Spatial Simpson’s Paradox: Global standards will always compete with local social phenomena.
[Figure labels: Violence in the north; Violence in the south; Violence]
Global models average regionally variant phenomena. Local models account for regional variation.
Big Data; Small Theory
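The tension between global and local models can be seen in a small numeric sketch. The data below are synthetic and the two "regions" are hypothetical; the example only illustrates how a pooled (global) least-squares slope can have the opposite sign to the slope within each region.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic (x, y) observations for two hypothetical regions.
north_x, north_y = [1, 2, 3, 4], [10, 9, 8, 7]
south_x, south_y = [6, 7, 8, 9], [16, 15, 14, 13]

# Local models: one slope per region.
local_north = ols_slope(north_x, north_y)
local_south = ols_slope(south_x, south_y)

# Global model: a single slope fitted to the pooled data.
global_slope = ols_slope(north_x + south_x, north_y + south_y)

print(local_north, local_south, global_slope)
```

In this toy case both regional slopes are negative while the pooled slope is positive, which is the kind of inconsistency the slide warns about.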
15
Important Big Data Concepts
Aggregation
Association
Correlation
16
Important Big Data Concepts
• Aggregation: Quantitative methods for creating descriptive statistics
• Association: Methods of identifying relationships of one data element to another
• Correlation: The process of quantifying a correspondence between two comparable entities
17
Two Vignettes
1. Spatial patterning of Tweet composition from the Presidential elections of 2012.
2. Pattern of life analysis of a major US city.
18
Kitchen Model
Chef Ingredients Utensils Recipes
19
Kitchen Model
Chef Ingredients Utensils Recipes
20
Practice: Recommended Tools
• Python
• R
• Quantum GIS
• Google Earth
21
Vignette 1: The Flesch-Kincaid Reading Algorithm
22
RE = 206.835 – (1.015 x ASL) – (84.6 x ASW)
RE = Readability Ease; ASL = Average Sentence Length (i.e., the number of words divided by the number of sentences); ASW = Average number of syllables per word (i.e., the number of syllables divided by the number of words)
The output, i.e., RE is a number generally ranging from 0 to 100. The higher the number, the easier the text is to read.
• Scores between 90.0 and 100.0 are considered easily understandable by an average 5th grader.
• Scores between 60.0 and 70.0 are considered easily understood by 8th and 9th graders.
• Scores between 0.0 and 30.0 are considered easily understood by college graduates.
The Flesch-Kincaid Reading Algorithm
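The formula above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the syllable counter is a crude vowel-group heuristic chosen for brevity, and `grade_level` uses the standard Flesch-Kincaid grade-level coefficients, which are not shown on the slide.

```python
import re

def count_syllables(word):
    """Rough syllable count: vowel groups, minus a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def text_stats(text):
    """Return (ASL, ASW): words per sentence and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)
    asw = sum(count_syllables(w) for w in words) / len(words)
    return asl, asw

def reading_ease(asl, asw):
    """Flesch Reading Ease (RE), as defined on the slide."""
    return 206.835 - 1.015 * asl - 84.6 * asw

def grade_level(asl, asw):
    """Flesch-Kincaid grade level (standard coefficients; an addition here)."""
    return 0.39 * asl + 11.8 * asw - 15.59

asl, asw = text_stats("The cat sat. The dog ran.")
print(reading_ease(asl, asw), grade_level(asl, asw))
```

As a check against figures quoted later in the deck, `grade_level(5.7, 1.02)` rounds to -1.3, the grade level cited for Green Eggs and Ham. Very simple text can score above 100 on RE, which is why the 0 to 100 range is described as general rather than strict.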
23
Clean Text “this gas situation is absolutely ridiculous.”
Language english
Latitude 41.0862
Longitude -74.1520
USERID “ ”
Kincaid 14.3
Flesch 3.3
Flesch-Kincaid (Mean Centered)
-76.273849
Leesbaarheid Score 56
Leesbaarheid Grade 11
The Flesch-Kincaid Reading Algorithm
24
Clean Text “down here in beach bout to shut this down wit & feeling the vibe s.”
Language english
Latitude 33.68709
Longitude -78.88915
USERID “ ”
Kincaid 3.5
Flesch 100
Flesch-Kincaid (Mean Centered)
20.42615
Leesbaarheid Score 22.9
Leesbaarheid Grade 4
The Flesch-Kincaid Reading Algorithm
25
Time Span: 2012-10-23 to 2012-11-06 (1 temporal bin, 2 weeks);
Spatial Area: Data Clipped to US;
Original Sample: 110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words);
Data processing: Removal of hashtags, @{users}, URLs, thresholding and mean centering;
Pruned Sample: 47,690 observations;
Method: Local Indicator of Spatial Autocorrelation (Moran’s I) with LISA Classifications of High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH);
Spatial Weights: knn40;
Data Reduction: pseudo p-values 0.05, 0.01, 0.001.
By the numbers...
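The data-processing step listed above can be sketched as below. This is an assumption-laden illustration: it treats "removal of hashtags" as dropping the whole hashtag token, it omits the thresholding step (the slide does not say what threshold was used), and the mention, hashtag, and URL in the sample string are hypothetical.

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, and hashtags, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)        # @{users} and hashtags
    return re.sub(r"\s+", " ", text).strip().lower()

def mean_center(values):
    """Subtract the sample mean so each score is relative to the average."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical raw tweet; only the sentence itself appears in the deck.
raw = "This gas situation is absolutely ridiculous. @someone #sandy http://example.com/x"
print(clean_tweet(raw))
print(mean_center([1.0, 2.0, 3.0]))
```

Mean centering explains why the per-tweet scores reported in this vignette can be negative: they are deviations from the sample average rather than raw readability scores.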
26
Region               mean     SD      0%       25%    50%   75%    100%    data:n
East North Central    0.6193  16.514  -76.274  -5.77  4.93  11.92  20.426   7579
East South Central    0.6314  16.576  -74.673  -5.27  4.93  12.23  20.426   3028
Mid-Atlantic         -0.1988  16.590  -76.273  -6.47  3.73  11.43  20.426   6278
Mountain             -0.1212  16.586  -73.174  -7.00  4.32  11.43  20.426   2452
New England          -0.1837  16.864  -73.174  -7.00  4.32  11.43  20.426   2392
Pacific              -0.8560  17.276  -78.274  -7.78  3.72  11.43  20.426   5390
Southeast             0.1469  16.730  -79.373  -5.78  4.32  11.43  20.426  10022
West North Central    0.6010  16.385  -78.274  -5.78  5.22  12.23  20.426   2781
West South Central    0.8323  16.386  -79.273  -4.77  5.33  12.12  20.426   5572
The Flesch-Kincaid Reading Algorithm
27
The Flesch-Kincaid Reading Algorithm
library(ggplot2)
# Jittered points plus a boxplot of mean-centered scores by region
ggplot(Twitter, aes(x = regiontxt, y = flecMC)) +
  geom_point(colour = "lightblue", alpha = 0.1, position = "jitter") +
  geom_boxplot(outlier.size = 1, alpha = 0.1) +
  labs(x = "Region", y = "Flesch Kincaid Index")
boxplot(flecMC ~ regiontxt, ylab = "flecMC", xlab = "regiontxt", data = Twitter)
https://gist.github.com/rheimann/5525909
29
https://github.com/rheimann
The Flesch-Kincaid Reading Algorithm
Raw Data: data:n 47,690
30
High, High [n=77] = El Paso, Oklahoma City, Omaha, Detroit, Memphis
Low, Low [n=74] = NYC & San Jose #nerds
Low, High [n=53] = Sacramento
High, Low [n=55] = Wichita, Kansas City, Tulsa, Nashville
pseudo p-value < 0.05; data:n 862 (3-digit Zip Codes)
Gassaway, WV
Watertown NY
Ithaca NY
Columbus OH
Fresno CA
The Flesch-Kincaid Reading Algorithm
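The High/Low classes on this slide come from a local Moran's I analysis. The sketch below shows one common form of the statistic with row-standardized k-nearest-neighbor weights; it is an illustration under stated assumptions, not the authors' code. It uses planar distance on synthetic points, a toy `k=2` rather than the knn40 weights named in the deck, and it omits the permutation step that produces pseudo p-values.

```python
import math

def knn_neighbors(points, k):
    """Indices of the k nearest neighbors (planar distance) of each point."""
    neighbors = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def local_moran(points, values, k):
    """Return (local I, quadrant label) for each observation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    z = [(v - mean) / sd for v in values]
    results = []
    for i, nbrs in enumerate(knn_neighbors(points, k)):
        lag = sum(z[j] for j in nbrs) / len(nbrs)  # average z of the neighbors
        # First letter: the observation itself; second letter: its neighbors.
        label = ("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((z[i] * lag, label))
    return results

# Synthetic data: one cluster of high values, one cluster of low values.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
values = [9.0, 8.0, 9.5, 1.0, 2.0, 1.5]
results = local_moran(points, values, k=2)
for value, (i_local, label) in zip(values, results):
    print(value, round(i_local, 3), label)
```

Positive local I values with HH or LL labels indicate clusters of similar values; HL and LH labels mark observations that differ from their neighbors.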
31
Rank  ZIP code, City, State        Median Home Price ($)

100 Zip Code: Flesch-Kincaid Index (Mean Centered) -3.2266; Leesbaarheid School Index 5.44
  6   10014, New York, NY          4,116,506
  8   10021, New York, NY          3,980,829
  1   10065, New York, NY          6,534,430
  10  10075, New York, NY          3,885,409
076 Zip Code: Flesch-Kincaid Index (Mean Centered) -3.761; Leesbaarheid School Index 5.5
  2   07620, Alpine, NJ            5,745,038
119 Zip Code: Flesch-Kincaid Index (Mean Centered) -0.0538; Leesbaarheid School Index 5.2
  4   11962, Sagaponack, NY        4,180,385
940 Zip Code: Flesch-Kincaid Index (Mean Centered) -3.596; Leesbaarheid School Index 5.87
  3   94027, Atherton, CA          4,897,864
  5   94010, Hillsborough, CA      4,127,250
  7   94022, Los Altos Hills, CA   4,016,050
902 Zip Code: Flesch-Kincaid Index (Mean Centered) 1.4095; Leesbaarheid School Index 4.96
  9   90274, Rolling Hills, CA     3,972,500
The Flesch Reading Ease Algorithm
32
Green Eggs and Ham by Dr. Seuss averages 5.7 words per sentence and 1.02 syllables per word, with a grade level of −1.3. (Most of the 50 words used are monosyllabic; "anywhere", which occurs 8 times, is the only exception.) The 50-dimensional space is small.
Even for this fairly small Twitter sample, and after extensive data processing to remove words that occur only once and words shorter than three characters, the space still has N = 12,603 dimensions.
Data Processing includes removing stop words and stemming.
110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words);
Top 50 words include: [romney, obama, election, vote, hope]
Green Eggs and Ham: N - Dimensional Problems
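The "N-dimensional" point is that each distinct term becomes one dimension. A minimal sketch of counting those dimensions, applying only the two filters named on the slide (drop words that occur once, drop words shorter than three characters), is given below. Stop-word removal and stemming are omitted, and the sample documents are hypothetical.

```python
import re
from collections import Counter

def vocabulary(docs, min_count=2, min_length=3):
    """Distinct terms kept after dropping rare and very short words."""
    counts = Counter(
        word
        for doc in docs
        for word in re.findall(r"[a-z]+", doc.lower())
    )
    return sorted(
        w for w, c in counts.items()
        if c >= min_count and len(w) >= min_length
    )

# Hypothetical mini-corpus, for illustration only.
docs = [
    "hope you vote in the election",
    "vote vote vote",
    "election day hope",
]
terms = vocabulary(docs)
print(len(terms), terms)
```

The length of the returned list is the dimensionality of the term space for that corpus.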
33
Vignette 2: Spatial Patterns of Activity
34
Spatial Patterns of Activity: Geolocated Social Media
New forms of aggregation unlock new insights in your data.
Useful for coarse pattern analysis
Looks interesting
Difficult to analyze directly
35
Rich & Abe
Geolocated Social Media
Python
Geohash Algorithm
Code on Github
Spatial Patterns of Activity: Applying the Kitchen Sink
36
States, Counties, and Census tracts
• All different sizes
• Sometimes change
• This is a problem: MAUP http://goo.gl/wQLTW
Spatial Patterns of Activity: Let’s use Political Boundaries
37
States, Counties, and Census tracts
• All different sizes
• Sometimes change
• This is a problem: MAUP http://goo.gl/wQLTW
Spatial Patterns of Activity: Let’s NOT use Political Boundaries
38
Invented in 2008 by Gustavo Niemeyer
Similar to quadtree; breaks the world into rectangles
Based on a z-curve algorithm
Useful for 2-d binning
Spatial Patterns of Activity: Geohash
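The geohash idea described on this slide can be sketched in a few lines of Python. This is a generic illustration of the published algorithm, not the code from the authors' GitHub repositories: the function names are illustrative, and the sample coordinate is a commonly used geohash test point rather than data from the deck.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=5):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:          # upper half of the interval: emit a 1
            bits = (bits << 1) | 1
            rng[0] = mid
        else:                     # lower half of the interval: emit a 0
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:        # five bits per base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def bin_counts(points, precision=5):
    """2-d binning: count observations per geohash cell."""
    return Counter(geohash_encode(lat, lon, precision) for lat, lon in points)

counts = bin_counts([(57.64911, 10.40744), (57.64912, 10.40745), (0.0, 0.0)])
print(counts)
```

Because every cell at a given precision comes from the same halving scheme, the bins do not depend on political boundaries, and nearby points usually share a prefix. Shorter strings give coarser cells.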
39
[Figure: geohash cells, each labeled with a count (values from 1 to 9).]
Spatial Patterns of Activity: Geohash Math
Notional example: Occurrence of geolocated tweets related to coffee.
40
[Figure: geohash cells, each labeled with a count (values from 1 to 9).]
Spatial Patterns of Activity: Geohash Math
41
Spatial Patterns of Activity: Geohash Math
42
Activity near Washington DC
Spatial Patterns of Activity: 3-d Google Earth
43
Activity near Washington DC
Spatial Patterns of Activity: 3-d Google Earth
44
Spatial Patterns of Activity: Avoid the Classic Blunders
http://xkcd.com/1138/
45
Night activity near Washington DC
Spatial Patterns of Activity: Isolating a Time Series
46
Spatial Patterns of Activity: Isolating a Time Series
School Event
Tourists
School Event
Spatial Patterns of Activity: A Caffeinated Example
• Aggregation: Where is the most commentary about coffee and Starbucks?
• Association: Is commentary about coffee and Starbucks associated with the location of Starbucks stores? (Yes)
• Correlation: What is the numeric relationship between geo-located coffee commentary and actual stores?
Where is Starbucks?
81 spatial regions identified with textual references to the words ‘coffee’ and/or ‘Starbucks.’
8 of the 81 regions are boxes that include both references to ‘coffee’ and ‘Starbucks’ within a narrow window of time.
7 of 8 (88%) accurately classify a region as containing a Starbucks by using simple text analysis alone.
.09
.52
.88
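The "7 of 8" figure and the correlation question can be illustrated with a short sketch. Only the counts 8 and 7 come from the slide; the per-cell mention and store counts below are hypothetical values invented for illustration, and Pearson's r is used here as one plausible way to express the "numeric relationship", not necessarily the measure the authors computed.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# From the slide: 8 regions flagged by text, 7 of which contain a store.
flagged, confirmed = 8, 7
hit_rate = confirmed / flagged

# Hypothetical per-cell counts, for illustration only.
coffee_mentions = [4, 8, 5, 6, 2, 9, 3, 1]
store_counts = [1, 3, 2, 2, 0, 3, 1, 0]
r = pearson(coffee_mentions, store_counts)

print(f"hit rate = {hit_rate:.3f}, r = {r:.2f}")
```

The hit rate of 0.875 is the value the slide rounds to 88%.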
49
• Putting data in geospatial context unlocks insight.
• Location teaches us more about what we are analyzing.
• Adhere to statistical assumptions and avoid misspecification in our models.
• The “Big Data” aspects of social media mean that the faucet is always running, enabling experimentation.
So, What?
50
Eugene Wigner (Nobel Laureate), 1960
“The Unreasonable Effectiveness of Mathematics in the Natural Sciences”
Peter Norvig, Director of Research at Google Inc.
“The Unreasonable Effectiveness of Data”
Academic Works; Embracing Complexity
51
Additional resources; Code and stuff...
Rich Heimann
Code and Data: https://github.com/rheimann
Slides: http://www.slideshare.net/rheimann04
Twitter: @rheimann
UMBC: [email protected]
Company: Data Tactics Corporation: http://goo.gl/8QWty

Abe Usher
Code and Data: https://github.com/abeusher
Twitter: @abeusher
Company: HumanGeo Group: http://goo.gl/uDbZP
52
Thank you!!
http://www.umbc.edu/shadygrove/gis/gis.php
53
Recommended resources: Books
54
Foundational data:
1. Geonames.org: http://www.geonames.org/
2. GADM.org: http://gadm.org/
Streaming data:
1. Twitter API: https://dev.twitter.com/
2. Datasift: http://datasift.com/
3. GNIP: http://gnip.com/
Recommended resources: Data