big social data: the spatial turn in big data (video available soon on youtube)

1

Big Social Data: The Spatial Turn in Big Data

Rich Heimann, UMBC Adjunct Faculty
Abe Usher, HumanGeo Group
May 9, 2013

2

Agenda

• Major Trends; Foundational Definitions [Abe]
• Long Tail of Big Social Data [Rich]
• Laws of the Spatial Sciences [Rich]
  – Big Data; Small Theory [Rich]
• Important Big Data Concepts [Abe]
  – The Kitchen Model [Abe]
• Vignettes [Rich & Abe]
• So, what?
• Additional Resources

3

Major Trends

• Location explosion, 2004-present
• Proliferation of mobile computing: 7 billion devices in 2014
• Social networking: > 700 million comments daily; > 144 million connections daily
• Gamification of geo

Impact: Continuous, global geo-located observations, shared across the Internet.

8

Definitions

Volunteered Geographic Information* (VGI): “harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals”

* http://en.wikipedia.org/wiki/Volunteered_geographic_information

9

Definitions

Social Media*: “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content”

* http://goo.gl/oSrIS

10

Definitions

Big Data*: “is high-volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”

* http://goo.gl/DFFbr

11

Long Tail of Big Social Data

Head: Big Data; nontraditional social science data. Large continuous datasets, coincident over time and space. Ideal for multivariate analysis.

Tail {power law distribution}: Traditional social science data. Data in the tail is often individually curated and unmaintained beyond its initially designed use case. As a result, the data is discontiguous from other research efforts and discontinuous over space and time.

Dark data is data that is suspected to exist, or ought to exist, but is difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The long tail is an intractably large management problem.

12

Long Tail of NSF Data

Power law           80%             20%
Number of Grants    7,478           1,869
Dollar Amount       $938,548,595    $1,199,088,125

Total Grants (NSF07): 9,347 (Count); $2,137,636,716 (Amount)
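The 80/20 split in the NSF figures can be checked with a few lines of arithmetic. The sketch below uses only the grant counts and dollar amounts listed on the slide. The variable names are illustrative, and the dollar total is recomputed from the two columns rather than taken from the stated total.

```python
# Grant counts and dollar amounts as listed on the slide (NSF07).
tail_grants, head_grants = 7_478, 1_869
tail_dollars, head_dollars = 938_548_595, 1_199_088_125

total_grants = tail_grants + head_grants
total_dollars = tail_dollars + head_dollars  # recomputed from the two columns

# Share of grants and share of dollars held by the 20% group.
head_grant_share = head_grants / total_grants
head_dollar_share = head_dollars / total_dollars

print(f"20% group: {head_grant_share:.1%} of grants, {head_dollar_share:.1%} of dollars")
print(f"80% group: {1 - head_grant_share:.1%} of grants, {1 - head_dollar_share:.1%} of dollars")
```

Under these figures, the 20% group of grants accounts for a little over half of the dollars, and the remaining 80% of grants share the rest.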

13

Laws of Spatial Science

Tobler’s First Law of Geography (TFLG) [Tobler, 1970]: “All things are related, but nearby things are more related than distant things.”

Spatial Heterogeneity: “Second law of geography” [Goodchild, 2003].

Spatial Simpson’s Paradox: A global model will always compete with local models and may be inconsistent with them.

Anyon (1982): social science should be empirically grounded, theoretically explanatory and socially critical.

http://www.bigdatarepublic.com/author.asp?section_id=2948

14

Big Data; Small Theory

Spatial Simpson’s Paradox: Global standards will always compete with local social phenomena.

[Figure panels: Violence in the north; Violence in the south; Violence]

Global models average regionally variant phenomena. Local models account for regional variation.
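A small numeric sketch can make the global-versus-local point concrete. The observations below are hypothetical (they are not data from the talk), and `ols_slope` is an illustrative helper. Within each region the fitted slope is negative, while the pooled (global) fit is positive.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical observations for two regions (illustrative values only).
north_x, north_y = [1, 2, 3, 4], [10, 9, 8, 7]
south_x, south_y = [6, 7, 8, 9], [16, 15, 14, 13]

local_north = ols_slope(north_x, north_y)    # local model, north only
local_south = ols_slope(south_x, south_y)    # local model, south only
global_slope = ols_slope(north_x + south_x, north_y + south_y)  # pooled model

print("north:", local_north, "south:", local_south, "global:", round(global_slope, 3))
```

The pooled slope has the opposite sign of both regional slopes. This is the sense in which a global model can be inconsistent with local ones.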

15

Important Big Data Concepts

Aggregation

Association

Correlation


16

Important Big Data Concepts

• Aggregation: Quantitative methods for creating descriptive statistics
• Association: Methods of identifying relationships of one data element to another
• Correlation: The process of quantifying a correspondence between two comparable entities

17

Two Vignettes

1. Spatial patterning of tweet composition during the 2012 presidential election.

2. Pattern of life analysis of a major US city.

18

Kitchen Model

Chef Ingredients Utensils Recipes


20

Practice: Recommended Tools

• Python
• R
• Quantum GIS
• Google Earth

21

Vignette 1: The Flesch-Kincaid Reading Algorithm

22

RE = 206.835 – (1.015 x ASL) – (84.6 x ASW)

RE = Readability Ease; ASL = Average Sentence Length (i.e., the number of words divided by the number of sentences); ASW = Average number of syllables per word (i.e., the number of syllables divided by the number of words)

The output, i.e., RE is a number generally ranging from 0 to 100. The higher the number, the easier the text is to read.

• Scores between 90.0 and 100.0 are considered easily understandable by an average 5th grader.

• Scores between 60.0 and 70.0 are considered easily understood by 8th and 9th graders.

• Scores between 0.0 and 30.0 are considered easily understood by college graduates.

The Flesch-Kincaid Reading Algorithm
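The reading-ease formula can be sketched directly in Python. The function and parameter names below are illustrative. The grade-level function uses the standard Flesch-Kincaid grade formula, which does not appear on the slide and is included here as an assumption about how the "Kincaid" scores were produced. The sample inputs are the Green Eggs and Ham averages quoted later in the deck (5.7 words per sentence, 1.02 syllables per word).

```python
def averages(n_words, n_sentences, n_syllables):
    """Return (ASL, ASW) from raw counts."""
    return n_words / n_sentences, n_syllables / n_words

def flesch_reading_ease(asl, asw):
    """RE = 206.835 - (1.015 x ASL) - (84.6 x ASW), as on the slide."""
    return 206.835 - (1.015 * asl) - (84.6 * asw)

def flesch_kincaid_grade(asl, asw):
    """Standard Flesch-Kincaid grade level (assumed; not shown on the slide)."""
    return (0.39 * asl) + (11.8 * asw) - 15.59

# Averages quoted in the deck for Green Eggs and Ham.
asl, asw = 5.7, 1.02
print("Reading ease:", round(flesch_reading_ease(asl, asw), 1))
print("Grade level:", round(flesch_kincaid_grade(asl, asw), 1))
```

Very simple text can score above 100 on reading ease, which is why the slide describes the range as "generally" 0 to 100.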

23

Clean Text “this gas situation is absolutely ridiculous.”

Language english

Latitude 41.0862

Longitude -74.1520

USERID “ ”

Kincaid 14.3

Flesch 3.3

Flesch-Kincaid (Mean Centered)

-76.273849

Leesbaarheid Score 56

Leesbaarheid Grade 11


24

Clean Text “down here in beach bout to shut this down wit & feeling the vibe s.”

Language english

Latitude 33.68709

Longitude -78.88915

USERID “ ”

Kincaid 3.5

Flesch 100

Flesch-Kincaid (Mean Centered)

20.42615

Leesbaarheid Score 22.9

Leesbaarheid Grade 4


25

Time Span: 2012-10-23 to 2012-11-06 (1 temporal bin, 2 weeks);

Spatial Area: Data Clipped to US;

Original Sample: 110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words);

Data processing: Removal of hashtags, @{users}, URLs, thresholding and mean centering;

Pruned Sample: 47,690 observations;

Method: Local Indicators of Spatial Association (LISA; local Moran’s I) with LISA classifications of High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH);

Spatial Weights: knn40;

Data Reduction: pseudo p-values 0.05, 0.01, 0.001.

By the numbers...
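The data-processing step listed above (removal of hashtags, @{users} and URLs, then mean centering) might look roughly like the following. This is a minimal sketch, not the authors' code. The regular expressions, function names and the sample tweet are assumptions, and the thresholding step is left out because the slide does not say what was thresholded.

```python
import re

# Illustrative patterns; the authors' actual rules are not given in the slides.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_tweet(text):
    """Strip URLs, @mentions and hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

def mean_center(scores):
    """Subtract the sample mean from every score."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

# Hypothetical raw tweet built around the clean text shown earlier in the deck.
raw = "@someone this gas situation is absolutely ridiculous. #sandy http://example.com/x"
print(clean_tweet(raw))
print(mean_center([1.0, 2.0, 6.0]))
```

Mean centering makes the readability scores comparable across regions as deviations from the overall average, which is the form used in the regional summary that follows.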

26

Region               mean     SD      0%       25%    50%   75%    100%    n
East North Central    0.6193  16.514  -76.274  -5.77  4.93  11.92  20.426   7579
East South Central    0.6314  16.576  -74.673  -5.27  4.93  12.23  20.426   3028
Mid-Atlantic         -0.1988  16.590  -76.273  -6.47  3.73  11.43  20.426   6278
Mountain             -0.1212  16.586  -73.174  -7.00  4.32  11.43  20.426   2452
New England          -0.1837  16.864  -73.174  -7.00  4.32  11.43  20.426   2392
Pacific              -0.8560  17.276  -78.274  -7.78  3.72  11.43  20.426   5390
Southeast             0.1469  16.730  -79.373  -5.78  4.32  11.43  20.426  10022
West North Central    0.6010  16.385  -78.274  -5.78  5.22  12.23  20.426   2781
West South Central    0.8323  16.386  -79.273  -4.77  5.33  12.12  20.426   5572


27


library(ggplot2)
# Jittered points under a boxplot of the mean-centered Flesch-Kincaid index by region
ggplot(Twitter, aes(x = regiontxt, y = flecMC)) +
  geom_point(colour = "lightblue", alpha = 0.1, position = "jitter") +
  geom_boxplot(outlier.size = 1, alpha = 0.1) +
  labs(x = "Region", y = "Flesch Kincaid Index")
# Base graphics equivalent
boxplot(flecMC ~ regiontxt, ylab = "flecMC", xlab = "regiontxt", data = Twitter)

https://gist.github.com/rheimann/5525909

29

https://github.com/rheimann


Raw Data: data:n 47,690


30

High, High [n=77]

Low, Low [n=74]

Low, High [n=53]

High, Low [n=55]

= El Paso, Oklahoma City, Omaha, Detroit, Memphis

= NYC & San Jose #nerds

= Sacramento

= Wichita, Kansas City, Tulsa, Nashville

pseudo p-value < 0.05; n = 862 (3-digit ZIP Codes)

Gassaway, WV

Watertown NY

Ithaca NY

Columbus OH

Fresno CA
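The HH/LL/HL/LH classes come from the local Moran statistic. Below is a minimal pure-Python sketch under stated assumptions: row-standardised k-nearest-neighbour weights (the talk used k = 40; the toy example uses k = 3), hypothetical coordinates and values, and no permutation step, so no pseudo p-values are computed. The function names are illustrative and this is not the authors' implementation.

```python
import math

def knn_neighbours(points, k):
    """Map each point index to the indices of its k nearest neighbours."""
    neighbours = {}
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        neighbours[i] = [j for _, j in dists[:k]]
    return neighbours

def local_moran(values, neighbours):
    """Return (I_i, quadrant) per observation, using row-standardised weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # second moment of the deviations
    results = []
    for i in range(n):
        # Spatial lag: average deviation among the neighbours of i.
        lag = sum(z[j] for j in neighbours[i]) / len(neighbours[i])
        if z[i] > 0:
            quadrant = "HH" if lag > 0 else "HL"
        else:
            quadrant = "LH" if lag > 0 else "LL"
        results.append((z[i] * lag / m2, quadrant))
    return results

# Hypothetical toy data: a high-valued cluster and a low-valued cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
values = [10, 11, 12, 11, 1, 2, 1, 2]
results = local_moran(values, knn_neighbours(points, k=3))
for (local_i, quadrant), point in zip(results, points):
    print(point, quadrant, round(local_i, 3))
```

In practice the significance filter (the pseudo p-values on the slide) would come from conditional permutation, which is omitted here for brevity.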




31

Rank ZIP code, City, State Median Home Price ($)

Flesch-Kincaid Index Mean Centered

Leesbaarheid School Index

100 Zip Code -3.2266 5.446 10014, New York, NY 4,116,5068 10021, New York, NY 3,980,8291 10065, New York, NY 6,534,43010 10075, New York, NY 3,885,409

076 Zip Code -3.761 5.52 07620, Alpine, NJ 5,745,038

119 Zip Code -0.0538 5.24 11962, Sagaponack, NY 4,180,3855 940 Zip Code3 94027, Atherton, CA 4,897,8645 94010, Hillsborough, CA 4,127,2507 94022, Los Altos Hills, CA 4,016,050 -3.596 5.87

902 Zip Code9 90274, Rolling Hills, CA 3,972,500 1.4095 4.96

The Flesch Reading Ease Algorithm

32

Green Eggs and Ham by Dr. Seuss averages 5.7 words per sentence and 1.02 syllables per word, with a grade level of −1.3. (Most of the 50 words used are monosyllabic; "anywhere", which occurs 8 times, is the only exception.) The 50-dimensional space is small.

Even for this fairly small Twitter sample, after extensive data processing to remove words that occur only once and words shorter than three characters, the space still has N = 12,603 dimensions.

Data Processing includes removing stop words and stemming.

110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words);

Top 50 words include: [romney, obama, election, vote, hope]

Green Eggs and Ham: N - Dimensional Problems

http://en.wikipedia.org/wiki/Green_Eggs_and_Ham
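The "dimension" here is the size of the vocabulary that survives pruning. A rough sketch of that count follows, with an illustrative stop-word subset and hypothetical tweets. Stemming, which the slide mentions, is omitted, and the tokenizer and function name are assumptions.

```python
import re
from collections import Counter

# Illustrative subset only; a real stop-word list would be much longer.
STOP_WORDS = {"the", "a", "and", "i", "to", "is", "of", "in", "for", "this"}

def vocabulary(texts, min_count=2, min_length=3):
    """Words kept after dropping stop words, rare words and very short words."""
    counts = Counter(
        token
        for text in texts
        for token in re.findall(r"[a-z']+", text.lower())
        if token not in STOP_WORDS
    )
    return {
        word
        for word, count in counts.items()
        if count >= min_count and len(word) >= min_length
    }

# Hypothetical tweets for illustration.
tweets = ["I hope you vote", "Vote for the election", "election day is here", "go vote"]
vocab = vocabulary(tweets)
print(len(vocab), sorted(vocab))
```

Each surviving word is one dimension in a bag-of-words representation, so the pruning thresholds directly control how large the space becomes.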

33

Vignette 2: Spatial Patterns of Activity

34

Spatial Patterns of Activity: Geolocated Social Media

New forms of aggregation unlock new insights in your data.

Useful for coarse pattern analysis

Looks interesting

Difficult to analyze directly

35

Spatial Patterns of Activity: Applying the Kitchen Sink

• Rich & Abe
• Geolocated Social Media
• Python
• Geohash Algorithm

Code on Github

36

Spatial Patterns of Activity: Let’s use Political Boundaries

States, Counties, and Census tracts

• All different sizes
• Sometimes change
• This is a problem: MAUP (modifiable areal unit problem) http://goo.gl/wQLTW

37

Spatial Patterns of Activity: Let’s NOT use Political Boundaries

States, Counties, and Census tracts

• All different sizes
• Sometimes change
• This is a problem: MAUP (modifiable areal unit problem) http://goo.gl/wQLTW

38

Invented in 2008 by Gustavo Niemeyer

Similar to quadtree; breaks the world into rectangles

Based on a z-curve algorithm

Useful for 2-d binning

Spatial Patterns of Activity: Geohash
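The rectangle-splitting idea can be sketched as a small encoder. This follows the publicly documented geohash scheme (alternating longitude and latitude bisections, five bits per base-32 character). It is not the code the speakers published, and the function name and default precision are illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, precision=5))
```

Because each extra character only refines the same cell, a shorter geohash is always a prefix of a longer one for the same point. That property is what makes geohashes convenient for 2-d binning at several resolutions.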

39

Spatial Patterns of Activity: Geohash Math

Notional example: Occurrence of geolocated tweets related to coffee.

[Figure: grid of cells labeled with per-cell counts.]
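Counting observations per cell is a one-liner once each tweet carries a geohash. The sketch below assumes the geohash strings have already been computed. The sample values are hypothetical and `bin_counts` is an illustrative name.

```python
from collections import Counter

def bin_counts(geohashes, precision):
    """Count observations per geohash cell at the given precision."""
    return Counter(gh[:precision] for gh in geohashes)

# Hypothetical geohashes for tweets mentioning coffee.
tweet_hashes = ["dqcjqc", "dqcjqf", "dqcjr1", "dqcjr4", "dqcm02"]

fine = bin_counts(tweet_hashes, precision=5)    # smaller cells
coarse = bin_counts(tweet_hashes, precision=4)  # larger cells

print(dict(fine))
print(dict(coarse))
```

Truncating the string merges neighbouring cells into their parent rectangle, so the same data can be aggregated at a coarser resolution without recomputing any coordinates.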

40



41


42

Activity near Washington DC

Spatial Patterns of Activity: 3-d Google Earth

43

Activity near Washington DC

Spatial Patterns of Activity: 3-d Google Earth

44

Spatial Patterns of Activity: Avoid the Classic Blunders

http://xkcd.com/1138/

45

Night activity near Washington DC

Spatial Patterns of Activity: Isolating a Time Series

46

Spatial Patterns of Activity: Isolating a Time Series

School Event

Tourists

School Event

Spatial Patterns of Activity: A Caffeinated Example

• Aggregation: Where is the most commentary about coffee and Starbucks?
• Association: Is commentary about coffee and Starbucks associated with the location of Starbucks stores? (Yes)
• Correlation: What is the numeric relationship between geo-located coffee commentary and actual stores?

Where is Starbucks?

81 spatial regions identified with textual references to the words ‘coffee’ and/or ‘Starbucks.’

8 of the 81 regions are boxes that include both references to ‘coffee’ and ‘Starbucks’ within a narrow window of time.

7 of 8 (88%) accurately classify a region as containing a Starbucks by using simple text analysis alone.
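The proportions behind these statements are simple ratios of the counts given on the slide. A minimal sketch, with illustrative variable names:

```python
# Counts as stated on the slide.
regions_total = 81       # regions mentioning 'coffee' and/or 'Starbucks'
regions_both_terms = 8   # regions with both terms in a narrow time window
regions_with_store = 7   # of those, regions that actually contain a Starbucks

share_flagged = regions_both_terms / regions_total
hit_rate = regions_with_store / regions_both_terms

print(f"share of regions flagged: {share_flagged:.3f}")
print(f"hit rate among flagged regions: {hit_rate:.3f}")
```

The hit rate is measured only over the 8 flagged regions, so it says nothing about stores located in the other 73 regions.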

.09

.52

.88

49

• Putting data in geospatial context unlocks insight.

• Location teaches us more about what we are analyzing.

• Adhere to statistical assumptions and avoid misspecification in our models.

• The “Big Data” aspects of social media mean that the faucet is always running, enabling experimentation.

So, What?

50

Academic Works; Embracing Complexity

Eugene Wigner (Nobel Laureate): “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” (1960)

Peter Norvig, Director of Research at Google Inc.: “The Unreasonable Effectiveness of Data”

51

Additional resources; Code and stuff...

Rich Heimann
• Code and Data: https://github.com/rheimann
• Slides: http://www.slideshare.net/rheimann04
• Twitter: @rheimann
• UMBC: [email protected]
• Company: Data Tactics Corporation: http://goo.gl/8QWty

Abe Usher
• Code and Data: https://github.com/abeusher
• Twitter: @abeusher
• Company: HumanGeo Group: http://goo.gl/uDbZP

52

Thank you!!

http://www.umbc.edu/shadygrove/gis/gis.php


53

Recommended resources: Books

54

Recommended resources: Data

Foundational data:
1. Geonames.org: http://www.geonames.org/
2. GADM.org: http://gadm.org/

Streaming data:
1. Twitter API: https://dev.twitter.com/
2. Datasift: http://datasift.com/
3. GNIP: http://gnip.com/
