Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)

Rich Heimann, UMBC Adjunct Faculty; Abe Usher, HumanGeo Group; May 9, 2013

Upload: richard-heimann

Post on 27-Jan-2015

DESCRIPTION

Big Social Data: The Spatial Turn in Big Data
By Richard Heimann & Abe Usher
University of Maryland Baltimore County

Webinar Description: Increased access to spatial data and improved application of spatial analytical methods offer real potential for social scientific research. This webinar focuses on substantive social science research perspectives while showing participants the rewards of applying geographic information systems (GIS), Big Data, and spatial analytics in their own research.

As the hype of Web 2.0 wears off and collaborative use of the Internet becomes a societal norm, we are witnessing an unprecedented explosion in the creation and analysis of geospatial data. Just as major governments are reducing their investments in location intelligence, individuals and non-government organizations are fueling a bonfire of innovation in the world of GIS data.

Traditional spatial analyses grew up in an era of sparse data and very weak computational power. Today, both of those circumstances are reversed, and many of the old solutions are no longer suitable for answering today's questions.

"Big Social Data: The Spatial Turn in Big Data" reflects this change and combines two things which, until recently, engaged quite different groups of researchers and practitioners. Together, they require particular techniques and a sophisticated understanding of the special problems associated with spatial social data. Geographic Data Mining, or Geographic Knowledge Discovery, is not new, but it is developing and changing rapidly as more, and different, data become available and people see new applications. The days of ‘Big Data’ require fresh thinking. The webinar will highlight connections between spatial concepts and data availability.

Emerging social media data will be emphasized over traditional social science data because they better reflect some of the more recent developments in Big Data, most notably the socially critical exploration of such data.

TRANSCRIPT

Page 1: Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)

Rich Heimann, UMBC Adjunct Faculty
Abe Usher, HumanGeo Group
May 9, 2013

Page 2: Agenda

• Major Trends; Foundational Definitions [Abe]
• Long Tail of Big Social Data [Rich]
• Laws of the Spatial Sciences [Rich]
  – Big Data; Small Theory [Rich]
• Important Big Data Concepts [Abe]
  – The Kitchen Model [Abe]
• Vignettes [Rich & Abe]
• So, what?
• Additional Resources

Page 3: Major Trends

• Location explosion, 2004 to present

Page 4: Major Trends

• Location explosion, 2004 to present
• Proliferation of mobile computing
  – 7 billion devices in 2014

Page 5: Major Trends

• Location explosion, 2004 to present
• Proliferation of mobile computing
• Social networking
  – > 700 million comments daily
  – > 144 million connections daily

Page 6: Major Trends

• Location explosion, 2004 to present
• Proliferation of mobile computing
• Social networking
• Gamification of geo

Page 7: Major Trends

• Location explosion, 2004 to present
• Proliferation of mobile computing
• Social networking
• Gamification of geo

Impact: Continuous, global geo-located observations, shared across the Internet.

Page 8: Definitions

• Volunteered Geographic Information* (VGI): “harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals”

* http://en.wikipedia.org/wiki/Volunteered_geographic_information

Page 9: Definitions

• Volunteered Geographic Information (VGI)
• Social Media*: “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content”

* http://goo.gl/oSrIS

Page 10: Definitions

• Volunteered Geographic Information (VGI)
• Social Media
• Big Data*: “is high-volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”

* http://goo.gl/DFFbr

Page 11: Long Tail of Big Social Data

Long Tail: Traditional social science data
Head: Big Data; nontraditional social science data

• Head: Big Data. Large continuous datasets coincident over time and space. Ideal for multivariate analysis.
• Tail {power law distribution}: Data in the tail are often individually curated and unmaintained beyond their initially designed use case. As a result, the data are discontiguous from other research efforts and discontinuous over space and time.
• Dark data are suspected to exist, or ought to exist, but are difficult or impossible to find. The problem of dark data is real and prevalent in the tail.
• The long tail is an intractably large management problem.

Page 12: Long Tail of NSF Data

Power law           80%             20%
Number of Grants    7,478           1,869
Dollar Amount       $938,548,595    $1,199,088,125

Total Grants (NSF07): 9,347 (Count); $2,137,636,716 (Amount)
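The 80/20 split in the NSF07 figures above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the dictionary keys reuse the slide's column labels, and the helper name `shares` is made up here. On these figures, the 20% group of grants holds roughly 56% of the dollars.

```python
# Grant counts and dollar amounts copied from the NSF07 figures above.
grants = {"80%": 7_478, "20%": 1_869}
dollars = {"80%": 938_548_595, "20%": 1_199_088_125}

def shares(column):
    """Return each group's fraction of the column total."""
    total = sum(column.values())
    return {group: value / total for group, value in column.items()}

grant_share = shares(grants)
dollar_share = shares(dollars)

for group in grants:
    print(f"{group} group: {grant_share[group]:.1%} of grants, "
          f"{dollar_share[group]:.1%} of dollars")
```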

Page 13: Laws of Spatial Science

• Tobler’s First Law of Geography (TFLG) [Tobler, 1970]: “All things are related, but nearby things are more related than distant things”
• Spatial Heterogeneity: “Second law of geography” [Goodchild, 2003].
• Spatial Simpson’s Paradox: Global model will always compete and may be inconsistent with local models.
• Anyon (1982): social science should be empirically grounded, theoretically explanatory and socially critical.

http://www.bigdatarepublic.com/author.asp?section_id=2948

Page 14: Big Data; Small Theory

Spatial Simpson’s Paradox: Global standards will always compete with local social phenomena.

[Figure: violence in the north compared with violence in the south]

Global models average regionally variant phenomena. Local models account for regional variation.
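A toy illustration of the global-versus-local point: the numbers below are invented for the sketch (they are not from the webinar data), and ordinary least squares stands in for whatever model is actually fitted. Within each hypothetical region the slope is negative, yet the pooled, global slope is positive.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical (x, y) observations for two regions.
north = ([1, 2, 3], [5, 4, 3])
south = ([6, 7, 8], [10, 9, 8])

slope_north = ols_slope(*north)    # local model, north
slope_south = ols_slope(*south)    # local model, south
slope_global = ols_slope(north[0] + south[0], north[1] + south[1])

print(slope_north, slope_south, round(slope_global, 3))
```

The sign flip between the local and pooled fits is the pattern the slide calls a spatial Simpson's paradox.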

Page 15: Important Big Data Concepts

• Aggregation
• Association
• Correlation

Page 16: Important Big Data Concepts

• Aggregation: Quantitative methods for creating descriptive statistics
• Association: Methods of identifying relationships of one data element to another
• Correlation: The process of quantifying a correspondence between two comparable entities

Page 17: Two Vignettes

1. Spatial patterning of Tweet composition from the Presidential election of 2012.

2. Pattern-of-life analysis of a major US city.

Page 18: Kitchen Model

• Chef
• Ingredients
• Utensils
• Recipes

Page 20: Practice: Recommended Tools

• Python
• R
• Quantum GIS
• Google Earth

Page 21: Vignette 1: The Flesch-Kincaid Reading Algorithm

Page 22: The Flesch-Kincaid Reading Algorithm

RE = 206.835 – (1.015 x ASL) – (84.6 x ASW)

RE = Readability Ease; ASL = Average Sentence Length (i.e., the number of words divided by the number of sentences); ASW = Average number of syllables per word (i.e., the number of syllables divided by the number of words)

The output, i.e., RE, is a number generally ranging from 0 to 100. The higher the number, the easier the text is to read.

• Scores between 90.0 and 100.0 are considered easily understandable by an average 5th grader.

• Scores between 60.0 and 70.0 are considered easily understood by 8th and 9th graders.

• Scores between 0.0 and 30.0 are considered easily understood by college graduates.
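The RE formula above translates directly into code. The sketch below is not the scoring code used for the webinar: the function names are made up, and the syllable counter is a crude vowel-group heuristic assumed here only so the example runs end to end.

```python
import re

def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """RE = 206.835 - 1.015 * ASL - 84.6 * ASW, as defined above."""
    asl = n_words / n_sentences      # average sentence length
    asw = n_syllables / n_words      # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

def count_syllables(word):
    # Assumption: approximate syllables as runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def score_text(text):
    """Score raw text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)

print(round(flesch_reading_ease(100, 10, 150), 3))
print(round(score_text("The cat sat."), 2))
```

Very short texts such as tweets can fall outside the nominal 0 to 100 range, since the formula itself does not clamp the result.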

Page 23: The Flesch-Kincaid Reading Algorithm

Clean Text: “this gas situation is absolutely ridiculous.”
Language: english
Latitude: 41.0862
Longitude: -74.1520
USERID: “ ”
Kincaid: 14.3
Flesch: 3.3
Flesch-Kincaid (Mean Centered): -76.273849
Leesbaarheid Score: 56
Leesbaarheid Grade: 11

Page 24: The Flesch-Kincaid Reading Algorithm

Clean Text: “down here in beach bout to shut this down wit & feeling the vibe s.”
Language: english
Latitude: 33.68709
Longitude: -78.88915
USERID: “ ”
Kincaid: 3.5
Flesch: 100
Flesch-Kincaid (Mean Centered): 20.42615
Leesbaarheid Score: 22.9
Leesbaarheid Grade: 4
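Clean Text fields like the ones above come from stripping Twitter-specific tokens before scoring. A minimal sketch of such a step follows, assuming simple regular expressions for URLs, @users and hashtags; the lowercasing is also an assumption (the examples happen to be lowercase), and both function names are hypothetical.

```python
import re

def clean_tweet(text):
    """Strip URLs, @users and hashtags, then normalise whitespace."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)         # @users and hashtags
    # Assumption: lowercase the result, as in the examples above.
    return re.sub(r"\s+", " ", text).strip().lower()

def mean_center(scores):
    """Subtract the sample mean from every score."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

print(clean_tweet("Vote today! #election @BarackObama http://t.co/abc123"))
print(mean_center([1.0, 2.0, 3.0]))
```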

Page 25: By the numbers...

• Time Span: 2012-10-23 to 2012-11-06 (1 temporal bin, 2 weeks)
• Spatial Area: Data clipped to US
• Original Sample: 110,737 observations; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words)
• Data Processing: Removal of hashtags, @{users}, URLs; thresholding and mean centering
• Pruned Sample: 47,690 observations
• Method: Local Indicators of Spatial Association (LISA), using local Moran’s I, with classifications of High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH)
• Spatial Weights: k-nearest neighbors with k = 40 (knn40)
• Data Reduction: pseudo p-values 0.05, 0.01, 0.001
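The method named above can be sketched in plain Python. This is an illustrative version of local Moran's I with k-nearest-neighbor weights and the HH/LL/HL/LH labels, assuming row-standardized weights and Euclidean distance on raw coordinates. It is not the authors' code, the function names and toy data are made up, and the permutation step that produces pseudo p-values is left out.

```python
import math

def knn_neighbors(points, k):
    """Map each point index to the indices of its k nearest neighbours."""
    neighbors = {}
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        neighbors[i] = [j for _, j in dists[:k]]
    return neighbors

def local_moran(values, neighbors):
    """Return (I_i, quadrant) per observation, with I_i = z_i * lag_i / m2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n               # variance of the values
    results = []
    for i in range(n):
        # Row-standardised spatial lag: mean deviation of the neighbours.
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        quadrant = ("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((z[i] * lag / m2, quadrant))
    return results

# Toy data: two tight clusters, one of low scores and one of high scores.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
scores = [1, 1, 1, 9, 9, 9]
lisa = local_moran(scores, knn_neighbors(points, k=2))
print(lisa)
```

The first letter of each label describes the observation and the second describes its neighbors, so HL marks a high value surrounded by low ones.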

Page 26: The Flesch-Kincaid Reading Algorithm

Region               mean      SD       0%        25%     50%    75%     100%     n
East North Central    0.6193   16.514   -76.274   -5.77   4.93   11.92   20.426    7579
East South Central    0.6314   16.576   -74.673   -5.27   4.93   12.23   20.426    3028
Mid-Atlantic         -0.1988   16.590   -76.273   -6.47   3.73   11.43   20.426    6278
Mountain             -0.1212   16.586   -73.174   -7.00   4.32   11.43   20.426    2452
New England          -0.1837   16.864   -73.174   -7.00   4.32   11.43   20.426    2392
Pacific              -0.8560   17.276   -78.274   -7.78   3.72   11.43   20.426    5390
Southeast             0.1469   16.730   -79.373   -5.78   4.32   11.43   20.426   10022
West North Central    0.6010   16.385   -78.274   -5.78   5.22   12.23   20.426    2781
West South Central    0.8323   16.386   -79.273   -4.77   5.33   12.12   20.426    5572

Page 27: The Flesch-Kincaid Reading Algorithm

# Jittered points with boxplots of the mean-centered index by region
library(ggplot2)
ggplot(Twitter, aes(x = regiontxt, y = flecMC)) +
  geom_point(colour = "lightblue", alpha = 0.1, position = "jitter") +
  geom_boxplot(outlier.size = 1, alpha = 0.1) +
  labs(x = "Region", y = "Flesch Kincaid Index")

# Base-graphics version of the same boxplot
boxplot(flecMC ~ regiontxt, ylab = "flecMC", xlab = "regiontxt", data = Twitter)

https://gist.github.com/rheimann/5525909

Page 28: The Flesch-Kincaid Reading Algorithm

Raw Data: n = 47,690

https://github.com/rheimann

Page 29: The Flesch-Kincaid Reading Algorithm

pseudo p-value < 0.05; n = 862 (3-digit ZIP Codes)

LISA classes:
• High, High [n=77]
• Low, Low [n=74]
• Low, High [n=53]
• High, Low [n=55]

Map annotations:
= El Paso, Oklahoma City, Omaha, Detroit, Memphis
= NYC & San Jose #nerds
= Sacramento
= Wichita, Kansas City, Tulsa, Nashville

Map labels: Gassaway, WV; Watertown, NY; Ithaca, NY; Columbus, OH; Fresno, CA

https://github.com/rheimann

Page 30: The Flesch Reading Ease Algorithm

Columns: Rank; ZIP code, City, State; Median Home Price ($); Flesch-Kincaid Index Mean Centered; Leesbaarheid School Index

100 Zip Code                                        -3.2266   5.446
         10014, New York, NY           4,116,506
   8     10021, New York, NY           3,980,829
   1     10065, New York, NY           6,534,430
  10     10075, New York, NY           3,885,409

076 Zip Code                                        -3.761    5.52
         07620, Alpine, NJ             5,745,038

119 Zip Code                                        -0.0538   5.24
         11962, Sagaponack, NY         4,180,385

940 Zip Code                                        -3.596    5.87
   3     94027, Atherton, CA           4,897,864
   5     94010, Hillsborough, CA       4,127,250
   7     94022, Los Altos Hills, CA    4,016,050

902 Zip Code                                         1.4095   4.96
   9     90274, Rolling Hills, CA      3,972,500

Page 31: Green Eggs and Ham: N-Dimensional Problems

Green Eggs and Ham by Dr. Seuss averages 5.7 words per sentence and 1.02 syllables per word, with a grade level of −1.3. (Most of the 50 words used are monosyllabic; "anywhere", which occurs 8 times, is the only exception.) The 50-dimensional space is small.

Even for this fairly small Twitter sample, and after extensive data processing to remove words that occur only once and words shorter than three characters, the space has N = 12,603 dimensions.

Data processing includes removing stop words and stemming.

110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words);

Top 50 words include: [romney, obama, election, vote, hope]
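The −1.3 grade level quoted for Green Eggs and Ham can be reproduced from the two averages given. The sketch assumes the standard Flesch-Kincaid grade-level formula, 0.39 × ASL + 11.8 × ASW − 15.59, which does not appear on the slides; the function name is illustrative.

```python
def flesch_kincaid_grade(asl, asw):
    """Flesch-Kincaid grade level from average sentence length (words)
    and average syllables per word."""
    return 0.39 * asl + 11.8 * asw - 15.59

# Averages quoted above for Green Eggs and Ham.
grade = flesch_kincaid_grade(asl=5.7, asw=1.02)
print(round(grade, 1))  # -> -1.3
```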

Page 32: Vignette 2: Spatial Patterns of Activity

Page 33: Spatial Patterns of Activity: Geolocated Social Media

New forms of aggregation unlock new insights in your data.

• Useful for coarse pattern analysis
• Looks interesting
• Difficult to analyze directly

Page 34: Spatial Patterns of Activity: Applying the Kitchen Sink

• Rich & Abe
• Geolocated Social Media
• Python
• Geohash Algorithm
• Code on GitHub

Page 35: Spatial Patterns of Activity: Let’s use Political Boundaries

• States, counties, and Census tracts
• All different sizes
• Sometimes change
• This is a problem: the modifiable areal unit problem (MAUP) http://goo.gl/wQLTW

Page 36: Spatial Patterns of Activity: Let’s NOT use Political Boundaries

• States, counties, and Census tracts
• All different sizes
• Sometimes change
• This is a problem: the modifiable areal unit problem (MAUP) http://goo.gl/wQLTW

Page 37: Spatial Patterns of Activity: Geohash

• Invented in 2008 by Gustavo Niemeyer
• Similar to quadtree; breaks the world into rectangles
• Based on a z-curve algorithm
• Useful for 2-d binning
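For readers who have not seen it, the geohash encoding can be sketched as follows. This follows the publicly documented scheme (interleave longitude and latitude bisection bits, longitude first, and emit one base-32 character per five bits); the function name is illustrative and this is not the code referenced on the slides.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True                             # bits alternate: lon, lat, lon, ...
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:                       # upper half: emit 1, raise lower bound
            bits = (bits << 1) | 1
            rng[0] = mid
        else:                                  # lower half: emit 0, drop upper bound
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:                     # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, precision=5))  # -> u4pru
```

Each extra character splits the current rectangle into 32 smaller ones, so truncating a geohash yields the enclosing, coarser cell. That property is what makes it convenient for 2-d binning.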

Page 38: Spatial Patterns of Activity: Geohash Math

Notional example: Occurrence of geolocated tweets related to coffee.

[Figure: grid of geohash cells, each labeled with its tweet count]
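Counting occurrences per cell, as in the notional coffee example, reduces to grouping geohash strings by a shared prefix. The sketch below assumes the tweets have already been encoded. The five-character strings are made-up illustrative values and `count_by_cell` is a hypothetical helper name.

```python
from collections import Counter

def count_by_cell(geohashes, precision):
    """Count observations per geohash cell at the given prefix length."""
    return Counter(g[:precision] for g in geohashes)

# Made-up geohashes standing in for coffee-related tweets.
tweets = ["dqcjq", "dqcjr", "dqcjq", "dqcm0", "9q8yy"]

fine = count_by_cell(tweets, 5)      # small cells
coarse = count_by_cell(tweets, 3)    # larger cells containing the small ones
print(fine.most_common())
print(coarse.most_common())
```

Shortening the prefix sums the counts of the child cells into their parent, so the same data can be viewed at several resolutions without re-encoding.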

Page 39: Spatial Patterns of Activity: Geohash Math

[Figure: grid of geohash cells, each labeled with its tweet count]

Page 40: Spatial Patterns of Activity: Geohash Math

Page 41: Spatial Patterns of Activity: 3-d Google Earth

Activity near Washington DC

Page 43: Spatial Patterns of Activity: Avoid the Classic Blunders

http://xkcd.com/1138/

Page 44: Spatial Patterns of Activity: Isolating a Time Series

Night activity near Washington DC

Page 45: Spatial Patterns of Activity: Isolating a Time Series

[Figure annotations: School Event; Tourists; School Event]

Page 46: Spatial Patterns of Activity: A Caffeinated Example

• Aggregation: Where is the most commentary about coffee and Starbucks?
• Association: Is commentary about coffee and Starbucks associated with the location of Starbucks stores? (Yes)
• Correlation: What is the numeric relationship between geo-located coffee commentary and actual stores?
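One common way to put a number on the correlation question is a Pearson coefficient over per-cell counts. The slides do not say which measure was used, so Pearson's r is an assumption here, and the counts below are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical counts for six spatial cells.
coffee_tweets = [4, 8, 5, 6, 2, 9]    # geolocated coffee mentions per cell
stores = [1, 3, 1, 2, 0, 3]           # stores per cell

r = pearson_r(coffee_tweets, stores)
print(round(r, 2))
```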

Page 47: Where is Starbucks?

• 81 spatial regions identified with textual references to the words ‘coffee’ and/or ‘Starbucks.’
• 8 of the 81 regions are boxes that include references to both ‘coffee’ and ‘Starbucks’ within a narrow window of time.
• In 7 of those 8 regions (88%), simple text analysis alone accurately classifies the region as containing a Starbucks.

[Figure labels: .09, .52, .88]
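The rule described here (flag a region when 'coffee' and 'Starbucks' are both mentioned close together in time, then check the flagged regions against known store locations) can be sketched as below. The 30-minute window, the event format and the function names are all assumptions made for illustration; only the 7-of-8 figure comes from the slide.

```python
def mentions_both(events, window=30):
    """True if a 'coffee' and a 'starbucks' mention occur within `window` minutes.

    `events` is a list of (minute, keyword) pairs for one region.
    """
    coffee = [t for t, word in events if word == "coffee"]
    starbucks = [t for t, word in events if word == "starbucks"]
    return any(abs(c - s) <= window for c in coffee for s in starbucks)

def precision(flagged_regions, regions_with_store):
    """Share of flagged regions that really contain a store."""
    flagged = set(flagged_regions)
    if not flagged:
        return 0.0
    return len(flagged & set(regions_with_store)) / len(flagged)

# Hypothetical regions and events.
regions = {
    "a": [(0, "coffee"), (10, "starbucks")],
    "b": [(0, "coffee"), (500, "starbucks")],
    "c": [(5, "coffee")],
}
flagged = [name for name, events in regions.items() if mentions_both(events)]
print(flagged)

# The slide's result: 7 of the 8 flagged regions contain a store.
print(precision(range(8), range(7)))
```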

Page 48: So, What?

• Putting data in geospatial context unlocks insight.

• Location teaches us more about what we are analyzing.

• Adhere to statistical assumptions and avoid misspecification in our models.

• The “Big Data” aspects of social media mean that the faucet is always running, enabling experimentation.

Page 49: Academic Works; Embracing Complexity

Eugene Wigner (Nobel Laureate), 1960
“The Unreasonable Effectiveness of Mathematics in the Natural Sciences”

Peter Norvig, Director of Research at Google Inc.
“The Unreasonable Effectiveness of Data”

Page 50: Additional resources; Code and stuff...

Rich Heimann
Code and Data: https://github.com/rheimann
Slides: http://www.slideshare.net/rheimann04
Twitter: @rheimann
UMBC: [email protected]
Company: Data Tactics Corporation: http://goo.gl/8QWty

Abe Usher
Code and Data: https://github.com/abeusher
Twitter: @abeusher
Company: HumanGeo Group: http://goo.gl/uDbZP

Page 51: Thank you!!

http://www.umbc.edu/shadygrove/gis/gis.php

Page 52: Recommended resources: Books

Page 53: Recommended resources: Data

Foundational data:
1. Geonames.org: http://www.geonames.org/
2. GADM.org: http://gadm.org/

Streaming data:
1. Twitter API: https://dev.twitter.com/
2. Datasift: http://datasift.com/
3. GNIP: http://gnip.com/