The Spatial Scan Statistic

National Institute of Statistical Sciences, Workshop on Statistics and Counterterrorism. G. P. Patil. November 20, 2004, New York University.


Page 1: The Spatial Scan Statistic

1

National Institute of Statistical Sciences

Workshop on Statistics and Counterterrorism

G. P. Patil

November 20, 2004
New York University

Page 2: The Spatial Scan Statistic

2

Geoinformatic Surveillance System

[Diagram labels: Geoinformatic spatio-temporal data from a variety of data products and data sources with agencies, academia, and industry; Masks, filters; Spatially distributed response variables; Hotspot analysis; Indicators, weights; Masks, filters; Prioritization; Decision support systems.]

Page 3: The Spatial Scan Statistic

3

Agency Databases, Thematic Databases, Other Databases

Homeland Security

Disaster Management

Public Health

Ecosystem Health

Other Case Studies

Statistical Processing: Hotspot Detection, Prioritization, etc.

Data Sharing, Interoperable Middleware

Standard or De Facto Data Model, Data Format, Data Access

Arbitrary Data Model, Data Format, Data Access

Application Specific De Facto Data/Information Standard

Page 4: The Spatial Scan Statistic

4

Page 5: The Spatial Scan Statistic

5

The Spatial Scan Statistic

• Move a circular window across the map.
• Use a variable circle radius, from zero up to a maximum where 50 percent of the population is included.
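The moving-window idea can be sketched in a few lines of Python. This is a minimal sketch, not the presenter's implementation: the slide does not say how a window is scored, so the Poisson log-likelihood ratio used here (the form commonly associated with Kulldorff's scan statistic) is an assumption, as are the function names and the `(x, y, population, cases)` cell layout.

```python
import math

def poisson_llr(cases_in, expected_in, total_cases):
    """Poisson log-likelihood ratio for one window (assumed scoring rule)."""
    if expected_in <= 0 or cases_in <= expected_in:
        return 0.0  # only windows with an elevated rate are of interest
    llr = cases_in * math.log(cases_in / expected_in)
    if total_cases > cases_in:
        llr += (total_cases - cases_in) * math.log(
            (total_cases - cases_in) / (total_cases - expected_in))
    return llr

def circular_scan(cells, max_pop_fraction=0.5):
    """cells: list of (x, y, population, cases).
    Returns (best_llr, best_zone), best_zone being a sorted list of cell indices."""
    total_pop = sum(c[2] for c in cells)
    total_cases = sum(c[3] for c in cells)
    best_llr, best_zone = 0.0, []
    for cx, cy, _, _ in cells:
        # Growing the radius from zero adds cells in order of distance.
        # Cells at exactly equal distance are added one at a time here.
        order = sorted(range(len(cells)),
                       key=lambda i: math.hypot(cells[i][0] - cx, cells[i][1] - cy))
        pop_in, cases_in, zone = 0, 0, []
        for i in order:
            pop_in += cells[i][2]
            if pop_in > max_pop_fraction * total_pop:
                break  # stop at the 50 percent population cap
            cases_in += cells[i][3]
            zone.append(i)
            expected = total_cases * pop_in / total_pop
            llr = poisson_llr(cases_in, expected, total_cases)
            if llr > best_llr:
                best_llr, best_zone = llr, sorted(zone)
    return best_llr, best_zone

# Illustrative data: six cells on a line, with an excess of cases in cells 3 and 4.
cells = [(0, 0, 100, 1), (1, 0, 100, 1), (2, 0, 100, 1),
         (3, 0, 100, 10), (3.9, 0, 100, 12), (5, 0, 100, 1)]
best_llr, best_zone = circular_scan(cells)
print(best_zone, round(best_llr, 2))
```

Sorting cells by distance from each center stands in for sweeping the radius continuously, since the set of cells inside the circle only changes when the radius passes a cell.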

Page 6: The Spatial Scan Statistic

6

A small sample of the circles used

Page 7: The Spatial Scan Statistic

7

Detecting Emerging Clusters

• Instead of a circular window in two dimensions, we use a cylindrical window in three dimensions.
• The base of the cylinder represents space, while the height represents time.
• The cylinder is flexible in its circular base and starting date, but we only consider those cylinders that reach all the way to the end of the study period. Hence, we are only considering ‘alive’ clusters.
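A minimal sketch of the 'alive' cylinder restriction, assuming time is split into integer periods 0 to n_periods - 1 and events are `(x, y, t)` triples. The function names and data layout are illustrative and are not taken from the slides.

```python
import math

def alive_cylinders(centers, radii, n_periods):
    """Yield (center, radius, start) for every cylinder whose time interval
    runs from `start` to the last period, i.e. every 'alive' cylinder."""
    for center in centers:
        for radius in radii:
            for start in range(n_periods):
                yield center, radius, start

def cases_in_cylinder(events, center, radius, start, n_periods):
    """Count events (x, y, t) inside the circular base from period `start` onward."""
    cx, cy = center
    return sum(1 for x, y, t in events
               if start <= t < n_periods
               and math.hypot(x - cx, y - cy) <= radius)

# Illustrative events: two recent cases near the origin, one old, one far away.
events = [(0, 0, 0), (0, 0, 4), (0.5, 0, 3), (3, 3, 4)]
n_cylinders = len(list(alive_cylinders([(0, 0), (1, 1)], [0.5, 1.0], 5)))
recent = cases_in_cylinder(events, (0, 0), 1.0, 3, 5)
print(n_cylinders, recent)
```

Fixing the end of every cylinder at the final period reduces the search to one free time parameter, the starting date.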

Page 8: The Spatial Scan Statistic

8

West Nile Virus Surveillance in New York City

• 2000 Data: Simulation/Testing of Prospective Surveillance System
• 2001 Data: Real Time Implementation of Daily Prospective Surveillance

Page 9: The Spatial Scan Statistic

9

West Nile Virus Surveillance in New York City

Major epicenter on Staten Island

• Dead bird surveillance system: June 14
• Positive bird report: July 16 (coll. July 5)
• Positive mosquito trap: July 24 (coll. July 7)
• Human case report: July 28 (onset July 20)

Page 10: The Spatial Scan Statistic

10

Page 11: The Spatial Scan Statistic

11

Hospital Emergency Admissions in New York City

• Hospital emergency admissions data from a majority of New York City hospitals.
• At midnight, hospitals report the last 24 hours of data to the New York City Department of Health
• A spatial scan statistic analysis is performed every morning
• If there is an alarm, a local investigation is conducted

Page 12: The Spatial Scan Statistic

12

Issues

Page 13: The Spatial Scan Statistic

13

Geospatial Surveillance

Page 14: The Spatial Scan Statistic

14

Spatial Temporal Surveillance

Page 15: The Spatial Scan Statistic

15

Syndromic Crisis-Index Surveillance

Page 16: The Spatial Scan Statistic

16

Hotspot Prioritization

Page 17: The Spatial Scan Statistic

17

Page 18: The Spatial Scan Statistic

18

National Applications

• Biosurveillance
• Carbon Management
• Coastal Management
• Community Infrastructure
• Crop Surveillance
• Disaster Management
• Disease Surveillance
• Ecosystem Health
• Environmental Justice
• Sensor Networks
• Robotic Networks
• Environmental Management
• Environmental Policy
• Homeland Security
• Invasive Species
• Poverty Policy
• Public Health
• Public Health and Environment
• Syndromic Surveillance
• Social Networks
• Stream Networks

Page 19: The Spatial Scan Statistic

19

Geographic Surveillance and Hotspot Detection for Homeland Security

Cyber Security and Computer Network Diagnostics: Securing the nation's computer networks from cyber attack is an important aspect of Homeland Security. Project develops diagnostic tools for detecting security attacks, infrastructure failures, and other operational aberrations of computer networks.

Tasking of Self-Organizing Surveillance Mobile Sensor Networks: Many critical applications of surveillance sensor networks involve finding hotspots. The upper level set scan statistic is used to guide the search by estimating the location of hotspots based on the data previously taken by the surveillance network.

Drinking Water Quality and Water Utility Vulnerability: New York City has installed 892 drinking water sampling stations. Currently, about 47,000 water samples are analyzed annually. The ULS scan statistic will provide a real-time surveillance system for evaluating water quality across the distribution system.

Surveillance Network and Early Warning: Emerging hotspots for disease or biological agents are identified by modeling events at local hospitals. A time-dependent crisis index is determined for each hospital in a network. The crisis index is used for hotspot detection by scan statistic methods.

West Nile Virus: An Illustration of the Early Warning Capability of the Scan Statistic: West Nile virus is a serious mosquito-borne disease. The mosquito vector bites both humans and birds. Scan statistical detection of dead bird clusters provides an early crisis warning and allows targeted public education and increased mosquito control.

Crop Pathogens and Bioterrorism: Disruption of American agriculture and our food system could be catastrophic to the nation's stability. This project has the specific aim of developing novel remote sensing methods and statistical tools for the early detection of crop bioterrorism.

Disaster Management: Oil Spill Detection, Monitoring, and Prioritization: The scan statistic hotspot delineation and poset prioritization tools will be used in combination with our oil spill detection algorithm to provide for early warning and spatial-temporal monitoring of marine oil spills and their consequences.

Network Analysis of Biological Integrity in Freshwater Streams: This study employs the network version of the upper level set scan statistic to characterize biological impairment along the rivers and streams of Pennsylvania and to identify subnetworks that are badly impaired.

Center for Statistical Ecology and Environmental Statistics
G. P. Patil, Director

Page 20: The Spatial Scan Statistic

20

Hotspot Detection Innovation
Upper Level Set Scan Statistic

Attractive Features

• Identifies arbitrarily shaped clusters
• Data-adaptive zonation of candidate hotspots
• Applicable to data on a network
• Provides both a point estimate as well as a confidence set for the hotspot
• Uses hotspot-membership rating to map hotspot boundary uncertainty
• Computationally efficient
• Applicable to both discrete and continuous syndromic responses
• Identifies arbitrarily shaped clusters in the spatial-temporal domain
• Provides a typology of space-time hotspots with discriminatory surveillance potential

Page 21: The Spatial Scan Statistic

21

Candidate Zones for Hotspots

Goal: Identify geographic zone(s) in which a response is significantly elevated relative to the rest of a region

A list of candidate zones Z is specified a priori
– This list becomes part of the parameter space and the zone must be estimated from within this list
– Each candidate zone should generally be spatially connected, e.g., a union of contiguous spatial units or cells
– Longer lists of candidate zones are usually preferable
– Expanding circles or ellipses about specified centers are a common method of generating the list

Page 22: The Spatial Scan Statistic

22

Scan Statistic Zonation for Circles and Space-Time Cylinders

Cholera outbreak along a river flood-plain
• Small circles miss much of the outbreak
• Large circles include many unwanted cells

Outbreak expanding in time
• Small cylinders miss much of the outbreak
• Large cylinders include many unwanted cells

[Figure axes: Space, Time]

Page 23: The Spatial Scan Statistic

23

ULS Candidate Zones

Question: Are there data-driven (rather than a priori) ways of selecting the list of candidate zones?

Motivation for the question: A human being can look at a map and quickly determine a reasonable set of candidate zones and eliminate many other zones as obviously uninteresting. Can the computer do the same thing?

A data-driven proposal: Candidate zones are the connected components of the upper level sets of the response surface. The candidate zones have a tree structure (echelon tree is a subtree), which may assist in automated detection of multiple, but geographically separate, elevated zones.

Null distribution: If the list is data-driven (i.e., random), its variability must be accounted for in the null distribution. A new list must be developed for each simulated data set.
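The point about the null distribution can be illustrated with a small Monte Carlo sketch. Everything here is a toy stand-in chosen for brevity: the statistic (largest total over runs of adjacent cells above the data set's own mean), the uniform null, and all function names are assumptions, not the ULS statistic itself. What the sketch does show is that the candidate list is rebuilt inside the statistic for every simulated data set.

```python
import random

def toy_scan_statistic(counts):
    """Toy stand-in: candidate zones are runs of adjacent cells whose count
    exceeds the mean of THIS data set, so the zone list is data-driven.
    The statistic is the largest zone total."""
    level = sum(counts) / len(counts)
    best = run = 0
    for c in counts:
        run = run + c if c > level else 0
        best = max(best, run)
    return best

def simulate_null(rng, n_cells, total):
    """Scatter `total` cases uniformly over `n_cells` cells (assumed null model)."""
    counts = [0] * n_cells
    for _ in range(total):
        counts[rng.randrange(n_cells)] += 1
    return counts

def monte_carlo_p_value(observed, n_sim=999, seed=0):
    rng = random.Random(seed)
    obs_stat = toy_scan_statistic(observed)
    exceed = 0
    for _ in range(n_sim):
        sim = simulate_null(rng, len(observed), sum(observed))
        # The zone list is regenerated from `sim`, not reused from `observed`.
        if toy_scan_statistic(sim) >= obs_stat:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)

observed = [1, 1, 1, 25, 28, 1, 1, 1, 0, 1]
p_value = monte_carlo_p_value(observed)
print(p_value)
```

Reusing the observed zone list for the simulated data would understate the variability of the maximum and bias the p-value downward.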

Page 24: The Spatial Scan Statistic

24

ULS Scan Statistic

Data-adaptive approach to reduced parameter space $\Omega_0$

• Zones in $\Omega_0$ are connected components of upper level sets of the empirical intensity function $G_a = Y_a / A_a$
• Upper level set (ULS) at level $g$ consists of all cells $a$ where $G_a \ge g$
• Upper level sets may be disconnected. Connected components are the candidate zones in $\Omega_0$
• These connected components form a rooted tree under set inclusion.
– Root node = entire region $R$
– Leaf nodes = local maxima of empirical intensity surface
– Junction nodes occur when connectivity of ULS changes with falling intensity level
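The candidate zones described on this slide can be enumerated by lowering the level one intensity value at a time and merging adjacent cells. The sketch below is one way to do that with a union-find structure; the function name, the dictionary-based inputs, and the five-cell example are assumptions made for illustration.

```python
def uls_candidate_zones(intensity, neighbors):
    """intensity: dict cell -> empirical intensity of that cell.
    neighbors: dict cell -> iterable of adjacent cells (symmetric).
    Returns every connected component of every upper level set,
    each as a frozenset of cells."""
    parent = {}    # union-find forest over cells already in the level set
    members = {}   # component root -> set of member cells

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    zones = set()
    for g in sorted(set(intensity.values()), reverse=True):
        new_cells = [a for a in intensity if intensity[a] == g]
        for a in new_cells:
            parent[a] = a
            members[a] = {a}
            for b in neighbors.get(a, ()):
                if b in parent:  # b is already in the upper level set
                    ra, rb = find(a), find(b)
                    if ra != rb:
                        parent[rb] = ra
                        members[ra] |= members.pop(rb)
        # Only components touched at this level are new candidate zones.
        for a in new_cells:
            zones.add(frozenset(members[find(a)]))
    return zones

# Illustrative example: five cells in a row with three local maxima.
intensity = {0: 5, 1: 1, 2: 4, 3: 2, 4: 3}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
zones = uls_candidate_zones(intensity, neighbors)
print(sorted(sorted(z) for z in zones))
```

In this example the three singleton zones are the leaves (local maxima), {2, 3, 4} appears at a junction where two zones coalesce, and the full set of cells is the root.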

Page 25: The Spatial Scan Statistic

25

Upper Level Set (ULS) of Intensity Surface

[Figure: intensity G over region R, cut at level g. The hotspot zones at level g, labeled Z1, Z2, Z3, are the connected components of the upper level set.]

Page 26: The Spatial Scan Statistic

26

Changing Connectivity of ULS as Level Drops

[Figure: intensity G over region R cut at two levels g, with zones labeled Z1, Z2, Z3 and Z4, Z5, Z6.]

Page 27: The Spatial Scan Statistic

27

ULS Connectivity Tree

[Figure: schematic intensity "surface" (intensity G, two levels g, zones Z1 to Z6) with its connectivity tree and junction nodes A, B, C.]

N.B. Intensity surface is cellular (piece-wise constant), with only finitely many levels. A, B, C are junction nodes where multiple zones coalesce into a single zone.

Page 28: The Spatial Scan Statistic

28

A confidence set of hotspots on the ULS tree. The different connected components correspond to different hotspot loci while the nodes within a connected component correspond to different delineations of that hotspot.

[Figure: tessellated region R and its ULS tree, with labels MLE, Junction Node, Alternative Hotspot Locus, Alternative Hotspot Delineation.]

Page 29: The Spatial Scan Statistic

29

Network Analysis of Biological Integrity in Freshwater Streams

Page 30: The Spatial Scan Statistic

30

New York City Water Distribution Network

Page 31: The Spatial Scan Statistic

31

NYC Drinking Water Quality Within-City Sampling Stations

• 892 sampling stations
• Each station about 4.5 feet high and draws water from a nearby water main
• Sampling frequency increased after 9-11
• Currently, about 47,000 water samples analyzed annually
• Parameters analyzed:
– Bacteria
– Chlorine levels
– pH
– Inorganic and organic pollutants
– Color, turbidity, odor
– Many others

Page 32: The Spatial Scan Statistic

32

Network-Based Surveillance

• Subway system surveillance
• Drinking water distribution system surveillance
• Stream and river system surveillance
• Postal System Surveillance
• Road transport surveillance
• Syndromic Surveillance

Page 33: The Spatial Scan Statistic

33

Syndromic Surveillance

• Symptoms of disease such as diarrhea, respiratory problems, headache, etc.
• Earlier reporting than diagnosed disease
• Less specific, more noise

Page 34: The Spatial Scan Statistic

34

Syndromic Surveillance

[Figure, left panel labels: Admissions Records; Symbol Stream; Method of ε-machines; Hospital Behavior Model; Crisis Behavior Model; Formal Language Measure; High dimensional vector; LSA; Reduced dimensional vector; Crisis Index.]

[Figure, right panel labels: 1. Language Sample (… abaaabaaababababaaabab …); 2. Tree of all substrings of length l with transition probabilities such as P(a|0), P(b|0); 3. PFSA.]

(left) The overall procedure, leading from admissions records to the crisis index for a hospital. The hotspot detection algorithm is then applied to the crisis index values defined over the hospital network. (right) The ε-machine procedure for converting an event stream into a parse tree and finally into a probabilistic finite state automaton (PFSA).
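The step from a symbol stream to transition probabilities can be illustrated by counting substrings. This is only a fixed-order approximation and is not the ε-machine reconstruction itself, which goes further and merges contexts that predict the same future. The function name and the choice of context length are assumptions; the sample string is the one shown on the slide.

```python
from collections import Counter, defaultdict

def transition_probabilities(stream, length):
    """Estimate P(next symbol | preceding `length` symbols) by counting
    substrings of the symbol stream."""
    counts = defaultdict(Counter)
    for i in range(len(stream) - length):
        context = stream[i:i + length]
        counts[context][stream[i + length]] += 1
    return {ctx: {sym: n / sum(c.values()) for sym, n in c.items()}
            for ctx, c in counts.items()}

sample = "abaaabaaababababaaabab"  # language sample from the slide
probs = transition_probabilities(sample, 1)
for context, dist in sorted(probs.items()):
    print(context, dist)
```

In this sample a "b" is always followed by an "a", so the estimated automaton has no transition labeled b out of the state reached after a b.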

Page 35: The Spatial Scan Statistic

35

Experimental Validation

[Figure: pressure sensitive floor with numbered cells and colored regions (green, red, tan, blue); state diagrams labeled Clockwise and Counter-Clockwise (wall following) and a single-state diagram with transitions a, b, c, d, e, f (random walk). Other labels: Target Behavior; Analyze String Rejections.]

Formal Language Events:
a – green to red or red to green
b – green to tan or tan to green
c – green to blue or blue to green
d – red to tan or tan to red
e – blue to red or red to blue
f – blue to tan or tan to blue

Page 36: The Spatial Scan Statistic

36

Emergent Surveillance Plexus (ESP)

Surveillance Sensor Network Testbed
Autonomous Ocean Sampling Network

Types of Hotspots

• Hotspots due to multiple, localized, stationary sources
• Hotspots corresponding to areas of interest in a stationary mapped field
• Time-dependent, localized hotspots
• Hotspots due to moving point sources

Page 37: The Spatial Scan Statistic

37

Ocean SAmpling MObile Network OSAMON

Page 38: The Spatial Scan Statistic

38

Ocean SAmpling MObile Network OSAMON Feedback Loop

• Network sensors gather preliminary data
• ULS scan statistic uses available data to estimate hotspot
• Network controller directs sensor vehicles to new locations
• Updated data is fed into ULS scan statistic system

Page 39: The Spatial Scan Statistic

39

SAmpling MObile Networks (SAMON)
Additional Application Contexts

• Hotspots for radioactivity and chemical or biological agents to prevent or mitigate the effects of terrorist attacks or to detect nuclear testing
• Mapping elevation, wind, bathymetry, or ocean currents to better understand and protect the environment
• Detecting emerging failures in a complex networked system like the electric grid, internet, cell phone systems
• Mapping the gravitational field to find underground chambers or tunnels for rescue or combat missions

Page 40: The Spatial Scan Statistic

40

Mote, Smart Dust: Small, flexible, low-cost sensor node

Sensor Devices

RF Component of Alcohol Sensor

Miniaturized Spec Node Prototype

Giner’s Transdermal Alcohol Sensor

Page 41: The Spatial Scan Statistic

41

Scalable Wireless Geo-Telemetry with Miniature Smart Sensors

Geo-telemetry enabled sensor nodes deployed by a UAV into a wireless ad hoc mesh network: Transmitting data and coordinates to TASS and GIS support systems

Page 42: The Spatial Scan Statistic

42

Architectural Block Diagram of Geo-Telemetry Enabled Sensor Node with Mesh Network Capability

Page 43: The Spatial Scan Statistic

43

Standards Based Geo-Processing Model

Page 44: The Spatial Scan Statistic

44

UAV Capable of Aerial Survey

Page 45: The Spatial Scan Statistic

45

Data Fusion Hierarchy for Smart Sensor Network with Scalable Wireless

Geo-Telemetry Capability

Page 46: The Spatial Scan Statistic

46

Wireless Sensor Networks for Habitat Monitoring

Page 47: The Spatial Scan Statistic

47

Target Tracking in Distributed Sensor Networks

Page 48: The Spatial Scan Statistic

48

Video Surveillance and Data Streams

Page 49: The Spatial Scan Statistic

49

Video Surveillance and Data Streams
Turning Video into Information

Measuring Behavior by Segments

• Customer Intelligence

• Enterprise Intelligence

• Entrance Intelligence

• Media Intelligence

• Video Mining Service

Page 50: The Spatial Scan Statistic

50

Deterministic Finite Automata (DFA)

[Figure: state diagram with a start state and transitions labeled a, b, c.]

Directed Graph (loops & multiple edges permitted) such that:

• Nodes are called States

• Edges are called Transitions

• Distinguished initial (or starting) state

• Transitions are labeled by symbols from a given finite alphabet, $\Sigma$ = {a, b, c, . . . }

• The same symbol can label several transitions

• A given symbol can label at most one transition from a given state (deterministic)

Page 51: The Spatial Scan Statistic

51

Deterministic Finite Automata (DFA)
Formal Definition

[Figure: state diagram with a start state and transitions labeled a, b, c.]

Quadruple $(Q, q_0, \Sigma, \delta)$ such that:

• $Q$ is a finite set of states

• $\Sigma$ is a finite set of symbols, called the alphabet

• $q_0 \in Q$ is the initial state

• $\delta : Q \times \Sigma \to Q \cup \{\text{Blocked}\}$ is the transition function:

$\delta(q, a) = \text{Blocked}$ if there is no transition from $q$ labeled by $a$

$\delta(q, a) = q'$ if $a$ is a transition from $q$ to $q'$

Page 52: The Spatial Scan Statistic

52

DFA and Strings

[Figure: state diagram with a start state and transitions labeled a, b, c; one path is drawn as a blue dashed line.]

Any path through the graph starting from the initial state determines a string from the alphabet.
Example: The blue dashed path determines the string a b c a

Conversely, any string from the alphabet is either blocked or determines a path through the graph.
Example: The following strings are blocked: c, aa, ac, abb, etc.
Example: The following strings are not blocked: a, b, ab, bb, etc.

The collection of all unblocked strings is called the language accepted or determined by the DFA (all states are “final” in our approach)
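A transition function with a Blocked value is easy to express as a lookup table. The automaton below is hypothetical: its labels were chosen to agree with the blocked and unblocked examples on this slide, but the state names and the target of each transition are assumptions, since the original diagram is not recoverable from the transcript.

```python
from itertools import product

BLOCKED = None  # stands in for the "Blocked" value of the transition function

# Hypothetical transition table: (state, symbol) -> next state.
DELTA = {
    ("q0", "a"): "q1",
    ("q0", "b"): "q2",
    ("q1", "b"): "q3",
    ("q2", "b"): "q0",
    ("q2", "c"): "q3",
    ("q3", "c"): "q3",
    ("q3", "a"): "q1",
}

def delta(state, symbol):
    """Transition function: next state, or BLOCKED if no such transition."""
    return DELTA.get((state, symbol), BLOCKED)

def run(string, start="q0"):
    """Follow the string from the start state; return the final state or BLOCKED."""
    state = start
    for symbol in string:
        state = delta(state, symbol)
        if state is BLOCKED:
            return BLOCKED
    return state

def language_up_to(max_length, alphabet="abc"):
    """All unblocked strings of length 1..max_length (every state counts as final)."""
    return ["".join(t)
            for n in range(1, max_length + 1)
            for t in product(alphabet, repeat=n)
            if run("".join(t)) is not BLOCKED]

print(language_up_to(2))
```

Because every state is treated as final, membership in the language reduces to never hitting BLOCKED while reading the string.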

Page 53: The Spatial Scan Statistic

53

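Strings and Languages

$\Sigma$ = (finite) alphabet

$\Sigma^*$ = set of all (finite) strings from $\Sigma$

A language is any subset of $\Sigma^*$.

Not all languages can be determined by a DFA.

Different DFAs can accept the same language.

Let $\Sigma^i = \Sigma \times \Sigma \times \cdots \times \Sigma$ ($i$-fold cartesian product).

$\Sigma^i$ consists of all strings of length $i$.

Then, $\Sigma^*$ decomposes as

$$\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \cdots$$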

Page 54: The Spatial Scan Statistic

54

Probabilistic Finite Automata (PFA)

A PFA is a DFA $(Q, q_0, \Sigma, \delta)$ with a probability attached to each transition such that the sum of the probabilities across all transitions from a given node is unity.

Formally, $p : Q \times \Sigma \to [0, 1]$ such that

• $p(q, a) = 0$ if and only if $\delta(q, a) = \text{Blocked}$

• $\sum_{a \in \Sigma} p(q, a) = 1$ for all $q \in Q$

Multiplying branch probabilities lets us assign a probability value $\mu(q_0, s)$ to each string $s$ in $\Sigma^*$. E.g., $\mu(q_0, abca) = (.8)(1)(.6)(.4) = .192$

[Figure: PFA state diagram with start state $q_0$ and transitions labeled a, .8; b, .2; b, 1; c, .6; a, .4; b, .5; c, .5.]
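Multiplying branch probabilities along a path is a short loop. The automaton below is hypothetical: the symbols and probabilities are the ones visible in the figure labels, but the state names and transition targets are assumptions, picked so that the string abca reproduces the product (.8)(1)(.6)(.4) from the slide.

```python
# Hypothetical PFA: (state, symbol) -> (next state, probability).
PFA = {
    ("q0", "a"): ("q1", 0.8),
    ("q0", "b"): ("q2", 0.2),
    ("q1", "b"): ("q3", 1.0),
    ("q2", "b"): ("q0", 0.5),
    ("q2", "c"): ("q3", 0.5),
    ("q3", "c"): ("q3", 0.6),
    ("q3", "a"): ("q1", 0.4),
}

def string_probability(pfa, string, start="q0"):
    """Product of the branch probabilities along the path; 0.0 if blocked."""
    prob, state = 1.0, start
    for symbol in string:
        if (state, symbol) not in pfa:
            return 0.0  # blocked strings get probability zero
        state, p = pfa[(state, symbol)]
        prob *= p
    return prob

def outgoing_totals(pfa):
    """Sum of transition probabilities leaving each state (should all be 1)."""
    totals = {}
    for (state, _), (_, p) in pfa.items():
        totals[state] = totals.get(state, 0.0) + p
    return totals

print(round(string_probability(PFA, "abca"), 3))
print(outgoing_totals(PFA))
```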

Page 55: The Spatial Scan Statistic

55

Properties of $\mu(q_0, s)$

• For fixed $q_0$, $\mu(q_0, s)$ is a measure on $\Sigma^*$

• Support of $\mu$ is the language accepted by the DFA

• For fixed $q_0$, $\mu(q_0, s)$ is a probability measure on $\Sigma^i$ ($\Sigma^i$ = strings of length $i$). This probability measure is written as $\mu^{(i)}$.

• Given a probability distribution $w(i)$ across string lengths $i$,

$$\mu(q_0, s) = \sum_{i=0}^{\infty} w(i)\, \mu^{(i)}(q_0, s)$$

defines a probability measure across $\Sigma^*$, called the w-weighted probability measure of the PFA. If all $w(i)$ are positive, then the support of $\mu$ is also the language accepted by the underlying DFA.

Page 56: The Spatial Scan Statistic

56

Distance Between Two PFA

Let A and B be two PFAs on the same alphabet $\Sigma$

Let w(i) be a probability distribution across string lengths i

Let $\mu_A$ and $\mu_B$ be the w-weighted probability measures of A and B

Define the distance between A and B as the variational distance between the probability measures $\mu_A$ and $\mu_B$:

$d(A, B) = \| \mu_A - \mu_B \|$
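A truncated version of this distance can be computed directly by enumerating strings. The sketch below makes several assumptions that the slide does not settle: the weights w(i) are given only for a finite range of lengths (longer strings get weight zero), the norm is taken as the sum of absolute differences (some authors use half of that), and the two single-state automata are invented for the example.

```python
from itertools import product

def string_probability(pfa, string, start):
    """Product of branch probabilities along the path; 0.0 if blocked."""
    prob, state = 1.0, start
    for symbol in string:
        if (state, symbol) not in pfa:
            return 0.0
        state, p = pfa[(state, symbol)]
        prob *= p
    return prob

def weighted_measure(pfa, start, alphabet, weights):
    """weights[i] is w(i) for string length i; returns {string: weighted probability}."""
    measure = {}
    for i, w in enumerate(weights):
        for symbols in product(alphabet, repeat=i):
            s = "".join(symbols)
            p = w * string_probability(pfa, s, start)
            if p > 0:
                measure[s] = p
    return measure

def variational_distance(m1, m2):
    """Sum of absolute differences between two measures on strings."""
    return sum(abs(m1.get(s, 0.0) - m2.get(s, 0.0)) for s in set(m1) | set(m2))

# Two invented single-state PFAs on the alphabet {a, b}.
PFA_A = {("s", "a"): ("s", 0.5), ("s", "b"): ("s", 0.5)}
PFA_B = {("s", "a"): ("s", 1.0)}
weights = [0.0, 0.5, 0.5]  # w(0), w(1), w(2)

mu_a = weighted_measure(PFA_A, "s", "ab", weights)
mu_b = weighted_measure(PFA_B, "s", "ab", weights)
distance = variational_distance(mu_a, mu_b)
print(distance)
```

The number of strings grows exponentially with length, so in practice the weights would need to decay quickly or the enumeration would have to be replaced by something smarter.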

Page 57: The Spatial Scan Statistic

57

Crop Attack Decision Support System

[Diagram labels: Site Identification Module; Signature Development Module; Key Crop Areas; Crops; NOAA Weather; Threat Locations; Plants (Infected, Non-infected, Sentinel); Ground Cameras; Air/Space Platforms; Hyperspectral Imagery; Signature Library; Data Processing; Anomaly Report; Ground Truthing.]

Page 58: The Spatial Scan Statistic

58

Crop Biosurveillance/Biosecurity

Page 59: The Spatial Scan Statistic

59

Crop Biosurveillance/Biosecurity
Data Processing Module

[Diagram labels: Hyperspectral Imagery; Signature Library; Image Segmentation (hyperclustering); Proxy Signal (per segment); Disease Signature; Similarity Index (per segment); Tessellation (segmentation) of raster grid; Signature Similarity Map; Hotspot/Anomaly Detection.]

Page 60: The Spatial Scan Statistic

60

Prioritization Innovation
Partial Order Set Ranking

We also present a prioritization innovation. It lies in the ability for prioritization and ranking of hotspots based on multiple indicator and stakeholder criteria without having to integrate indicators into an index, using Hasse diagrams and partial order sets. This leads us to early warning systems, and also to the selection of investigational areas.

Page 61: The Spatial Scan Statistic

HUMAN ENVIRONMENT INTERFACE: LAND, AIR, WATER INDICATORS

RANK  COUNTRY           LAND     AIR   WATER
   1  Sweden           69.01   35.24     100
   2  Finland          76.46   19.05      98
   3  Norway           27.38   63.98     100
   5  Iceland           1.79   80.25     100
  13  Austria          40.57   29.85     100
  22  Switzerland      30.17   28.10     100
  39  Spain            32.63    7.74     100
  45  France           28.34    6.50     100
  47  Germany          32.56    2.10     100
  51  Portugal         34.62   14.29      82
  52  Italy            23.35    6.89     100
  59  Greece           21.59    3.20      98
  61  Belgium          21.84    0.00     100
  64  Netherlands      19.43    1.07     100
  77  Denmark           9.83    5.04     100
  78  United Kingdom   12.64    1.13     100
  81  Ireland           9.25    1.99     100

for land - % of undomesticated land, i.e., total land area minus domesticated land (permanent crops and pastures, built-up areas, roads, etc.)
for air - % of renewable energy resources, i.e., hydro, solar, wind, geothermal
for water - % of population with access to safe drinking water
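The indicator table can be turned into a partial order without combining the indicators into an index. The sketch below is one possible way to do this, under the assumption that a higher value is better for every indicator and that one country ranks above another only when it is at least as good on all three indicators and differs on at least one. The function names and the choice of six countries are illustrative only.

```python
# (land, air, water) values for a subset of the countries in the table.
INDICATORS = {
    "Sweden":   (69.01, 35.24, 100),
    "Finland":  (76.46, 19.05, 98),
    "Norway":   (27.38, 63.98, 100),
    "Austria":  (40.57, 29.85, 100),
    "Portugal": (34.62, 14.29, 82),
    "Belgium":  (21.84, 0.00, 100),
}

def dominates(x, y):
    """True if x is at least as good as y on every indicator and not identical."""
    vx, vy = INDICATORS[x], INDICATORS[y]
    return vx != vy and all(p >= q for p, q in zip(vx, vy))

def maximal_elements():
    """Countries that no other country dominates (top of the Hasse diagram)."""
    return sorted(c for c in INDICATORS
                  if not any(dominates(o, c) for o in INDICATORS))

def cover_relations():
    """Pairs (x, y) where x dominates y with no country strictly in between.

    These are the edges one would draw in a Hasse diagram.
    """
    names = list(INDICATORS)
    return sorted(
        (x, y) for x in names for y in names
        if dominates(x, y)
        and not any(dominates(x, z) and dominates(z, y) for z in names)
    )

print(maximal_elements())
for upper, lower in cover_relations():
    print(upper, ">", lower)
```

Pairs such as Sweden and Finland remain incomparable under this rule, which is exactly the information an index would hide.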

Page 62: The Spatial Scan Statistic

Hasse Diagram (all countries)

Figure: Hasse diagram of all countries, with nodes labeled 1 through 141.

Page 63: The Spatial Scan Statistic

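Hasse Diagram (Western Europe)

Figure: Hasse diagram of the Western European countries. Node labels as laid out in the figure: Norway; Sweden, Finland, Austria, Switzerland; Italy, Portugal, Greece, Spain; France; Denmark; Ireland; UK, Germany; Belgium, Netherlands.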

Page 64: The Spatial Scan Statistic

Ranking Partially Ordered Sets – 5: Linear extension decision tree

Poset (Hasse Diagram): elements a, b (top row); c, d (middle row); e, f (bottom row)

Figure: decision tree whose root-to-leaf paths enumerate the linear extensions of the poset.

Jump Size: 1 3 3 2 3 5 4 3 3 2 4 3 4 4 2 2

Page 65: The Spatial Scan Statistic

Cumulative Rank Frequency Operator – 5: An Example of the Procedure

In the example from the preceding slide, there are a total of 16 linear extensions, giving the following cumulative frequency table.

             Rank
Element    1    2    3    4    5    6
a          9   14   16   16   16   16
b          7   12   15   16   16   16
c          0    4   10   16   16   16
d          0    2    6   12   16   16
e          0    0    1    4   10   16
f          0    0    0    0    6   16

Each entry gives the number of linear extensions in which the element (row label) receives a rank equal to or better than the column heading.
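The cumulative frequency table can be reproduced by brute-force enumeration. The edges of the Hasse diagram did not survive in this transcript, so the cover relations below (a above c, b above d, c above e and f, d above f) are an assumption, chosen because they give exactly 16 linear extensions and the same table. The function names are illustrative.

```python
from itertools import permutations

# Assumed cover relations (x, y), meaning "x must be ranked ahead of y".
COVERS = [("a", "c"), ("b", "d"), ("c", "e"), ("c", "f"), ("d", "f")]
ELEMENTS = "abcdef"

def linear_extensions():
    """All total orderings of ELEMENTS that respect every pair in COVERS."""
    result = []
    for perm in permutations(ELEMENTS):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[x] < pos[y] for x, y in COVERS):
            result.append(perm)
    return result

def cumulative_rank_frequency(extensions):
    """table[e][r-1] = number of extensions in which e has rank r or better."""
    n = len(ELEMENTS)
    table = {e: [0] * n for e in ELEMENTS}
    for ext in extensions:
        for rank, e in enumerate(ext):      # rank 0 is the best rank
            for r in range(rank, n):
                table[e][r] += 1
    return table

extensions = linear_extensions()
table = cumulative_rank_frequency(extensions)
print(len(extensions), "linear extensions")
for e in ELEMENTS:
    print(e, table[e])
```

Enumerating all permutations is only practical for very small posets; larger posets need sampling or a dedicated enumeration algorithm.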

Page 66: The Spatial Scan Statistic

Cumulative Rank Frequency Operator – 6: An Example of the Procedure

Figure: cumulative frequency (vertical axis, 0 to 16) plotted against rank (horizontal axis, 1 to 6), with one curve for each of the elements a, b, c, d, e, f.

The curves are stacked one above the other and the result is a linear ordering of the elements: a > b > c > d > e > f
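Whether the curves are stacked can be checked directly from the cumulative frequency table of the example. In the sketch below, one curve is taken to lie above another when it is at least as high at every rank and the two are not identical; that reading of "stacked", and the function names, are assumptions for illustration.

```python
# Cumulative rank frequencies for ranks 1..6, copied from the example table.
CRF = {
    "a": [9, 14, 16, 16, 16, 16],
    "b": [7, 12, 15, 16, 16, 16],
    "c": [0, 4, 10, 16, 16, 16],
    "d": [0, 2, 6, 12, 16, 16],
    "e": [0, 0, 1, 4, 10, 16],
    "f": [0, 0, 0, 0, 6, 16],
}

def stacked_above(x, y):
    """True if the curve of x is never below the curve of y and they differ."""
    return CRF[x] != CRF[y] and all(p >= q for p, q in zip(CRF[x], CRF[y]))

def linear_ordering():
    """Return the elements from best to worst if all curves are stacked, else None."""
    order = sorted(CRF, key=lambda e: CRF[e], reverse=True)
    # Pointwise comparison is transitive, so checking neighbours is enough.
    if all(stacked_above(order[i], order[i + 1]) for i in range(len(order) - 1)):
        return order
    return None  # some curves cross, so no linear ordering is obtained yet

print(linear_ordering())
```

A `None` result corresponds to the situation in which the curves cross and a single pass does not yield a linear ordering.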

Page 67: The Spatial Scan Statistic

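Cumulative Rank Frequency Operator – 7: An example where F must be iterated

Figure: three panels showing the Original Poset (Hasse Diagram) on the elements a, b, c, d, e, f, g, h, the poset obtained after applying F, and the poset obtained after applying F a second time (F²).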

Page 68: The Spatial Scan Statistic

Incorporating Judgment: Poset Cumulative Rank Frequency Approach

• Certain of the indicators may be deemed more important than the others

• Such differential importance can be accommodated by the poset cumulative rank frequency approach

• Instead of the uniform distribution on the set of linear extensions, we may use an appropriately weighted probability distribution $\pi$, e.g.,

$$\pi(\omega) = w_0 + w_1 n_1(\omega) + w_2 n_2(\omega) + \cdots + w_p n_p(\omega)$$
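A weighted version of the cumulative rank frequency calculation can be sketched as follows. The six-element poset, its cover relations, and the weight function that doubles the weight of extensions starting with b are all hypothetical choices for illustration; they are not the weighting formula of the slide. Any positive weight function on linear extensions could be substituted.

```python
from itertools import permutations

# Hypothetical poset: (x, y) means "x must be ranked ahead of y".
COVERS = [("a", "c"), ("b", "d"), ("c", "e"), ("c", "f"), ("d", "f")]
ELEMENTS = "abcdef"

def linear_extensions():
    """All total orderings of ELEMENTS that respect every pair in COVERS."""
    for perm in permutations(ELEMENTS):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[x] < pos[y] for x, y in COVERS):
            yield perm

def weighted_crf(weight):
    """Cumulative rank probabilities under a weighted distribution on extensions.

    `weight` maps a linear extension (a tuple) to a positive number; the
    weights are normalised so that they sum to one.
    """
    exts = list(linear_extensions())
    total = sum(weight(x) for x in exts)
    n = len(ELEMENTS)
    table = {e: [0.0] * n for e in ELEMENTS}
    for x in exts:
        p = weight(x) / total
        for rank, e in enumerate(x):
            for r in range(rank, n):
                table[e][r] += p
    return table

uniform = weighted_crf(lambda x: 1.0)
favour_b = weighted_crf(lambda x: 2.0 if x[0] == "b" else 1.0)
print(round(uniform["b"][0], 4), round(favour_b["b"][0], 4))
```

With the uniform weight the table is the count table divided by the number of extensions; a non-uniform weight shifts probability toward the orderings that the judgment favours.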


Page 73: The Spatial Scan Statistic

Space-Time Poverty Hotspot Typology

Federal Anti-Poverty Programs have had little success in eradicating pockets of persistent poverty.

Can spatial-temporal patterns of poverty hotspots provide clues to the causes of poverty and lead to improved location-specific anti-poverty policy?

Page 74: The Spatial Scan Statistic

Covariate Adjustment: Known Covariate Effects (age, population size, etc.)

$Y_a$ = count in cell a

$Y_a \sim \text{Poisson}(\lambda_a A_a)$, where for cell a:
$\lambda_a$ = relative risk (unknown)
$A_a$ = numerical covariate adjustment (known)

Hotspot Hypothesis Testing Model

$H_0$: $\lambda_a$ are equal for all cells (constant relative risk)
$H_1$: $\lambda_a$ take two distinct values, an elevated value in an unknown zone Z and a smaller value outside Z

List of candidate zones Z (ULS approach)

All connected components of upper level sets of the adjusted cellular surface $Y_a / A_a$
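The candidate zones can be illustrated on a toy tessellation. The sketch below computes the adjusted rates (count divided by adjustment) and collects the connected components of every upper level set; the five cells in a row, their counts and adjustments, and the function names are made-up inputs for the example, not data from the talk.

```python
def adjusted_rates(counts, adjustments):
    """Adjusted cellular surface: count divided by adjustment, cell by cell."""
    return {a: counts[a] / adjustments[a] for a in counts}

def upper_level_set_components(rates, neighbors, level):
    """Connected components of the set of cells whose rate is >= level."""
    active = {a for a, r in rates.items() if r >= level}
    seen, components = set(), []
    for start in sorted(active):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                       # depth-first search within `active`
            a = stack.pop()
            if a in seen:
                continue
            seen.add(a)
            comp.add(a)
            stack.extend(b for b in neighbors[a] if b in active and b not in seen)
        components.append(frozenset(comp))
    return components

def candidate_zones(rates, neighbors):
    """Union of the components over all distinct rate levels."""
    zones = set()
    for level in sorted(set(rates.values())):
        zones.update(upper_level_set_components(rates, neighbors, level))
    return zones

# Five hypothetical cells arranged in a row: 0 - 1 - 2 - 3 - 4.
counts = {0: 8, 1: 2, 2: 1, 3: 6, 4: 9}
adjustments = {0: 2, 1: 2, 2: 1, 3: 2, 4: 3}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

rates = adjusted_rates(counts, adjustments)
zones = candidate_zones(rates, neighbors)
for z in sorted(zones, key=lambda s: (len(s), sorted(s))):
    print(sorted(z))
```

The list of zones is small because every zone is a connected component of some upper level set; a scan statistic would then be evaluated on each zone in the list.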

Page 75: The Spatial Scan Statistic

Covariate Adjustment: Given Covariates, Unknown Effects

GLM Model

$Y_a \sim \text{Poisson}(\lambda_a A_a)$, where for cell a:
$\lambda_a$ = relative risk (unknown)
$A_a$ = covariate adjustment (unknown)
$X_a$ = vector of covariate values for cell a (known)
$\beta$ = vector of covariate effects (unknown)

Model: $\log(A_a) = X_a^T \beta$, or $\log(\lambda_a A_a) = \log(\lambda_a) + X_a^T \beta$

Hotspot Hypothesis Testing Model

$H_0$: $\lambda_a$ are equal for all cells (constant relative risk)
$H_1$: $\lambda_a$ take two distinct values, an elevated value in an unknown zone Z and a smaller value outside Z

List of candidate zones Z (ULS approach)

All connected components of upper level sets of the adjusted cellular surface $Y_a / A_a$

Here the model must be fitted under the null hypothesis before determining the adjustments $A_a$ and the candidate zones Z.
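Fitting under the null hypothesis can be sketched with a single covariate. Under a constant relative risk the model reduces to a Poisson regression with an intercept (the log of the common relative risk) and a slope (the covariate effect). The sketch below fits it by Newton-Raphson in plain Python; the data, the single-covariate restriction, and the function name are assumptions for illustration, and a real analysis would normally use a GLM routine from a statistics library.

```python
import math

def fit_null_poisson(y, x, iterations=50):
    """Newton-Raphson fit of log E[Y_a] = b0 + b1 * x_a (null model).

    b0 plays the role of the log of the constant relative risk and
    b1 the role of the covariate effect.
    """
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Fisher information matrix entries.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical covariate values and counts for five cells.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2, 3, 6, 7, 12]

b0, b1 = fit_null_poisson(y, x)
adjustments = [math.exp(b1 * xi) for xi in x]              # fitted A_a
adjusted_rates = [yi / ai for yi, ai in zip(y, adjustments)]
fitted = [math.exp(b0) * ai for ai in adjustments]         # fitted means under H0
print(round(b0, 4), round(b1, 4))
```

The adjusted rates computed this way are the surface on which the upper level sets, and hence the candidate zones, would then be determined.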

Page 76: The Spatial Scan Statistic

Incorporating Spatial Autocorrelation

Ignoring autocorrelation typically results in:

• under-assessment of variability
• over-assessment of significance (H0 rejected too frequently)

How can we account for possible autocorrelation?

GLMM (SAR) Model

$Y_a$ = count in cell a
$Y_a$ distributed as Poisson
$\eta_a = \log(E[Y_a])$

The $Y_a$ are conditionally independent given the $\eta_a$.

The $\eta_a$ are jointly Gaussian with a Simultaneous AutoRegressive (SAR) specification:

$$\eta_a - \mu_a = \rho \sum_b W_{ab}\,(\eta_b - \mu_b) + \varepsilon_a$$

Here, $\mu_a = E[\eta_a]$,
$\varepsilon_a$ are iid $N(0, \sigma^2)$,
$W_{ab}$ is a spatial weight expressing the "degree of association" between cells a and b (take $W_{aa} = 0$ and $\sum_b W_{ab} = 1$).

Thus, the residual $\eta_a - \mu_a$ for cell a is a (by $\rho$) deflated weighted average of the residuals for neighboring cells, plus a disturbance term $\varepsilon_a$.
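The SAR specification says that each residual equals a deflated weighted average of the neighboring residuals plus an independent disturbance. A small numerical sketch makes this concrete: the residuals are obtained by iterating that relation to its fixed point. The four-cell line graph, the row-standardised weights, the value 0.5 for the autocorrelation parameter, and the function name are assumptions for the example.

```python
import random

def sar_residuals(W, eps, rho, iterations=200):
    """Solve r = rho * W r + eps by fixed-point iteration.

    Converges when |rho| < 1 and the rows of W sum to one, because each
    pass then shrinks the error by at least a factor |rho|.
    """
    n = len(eps)
    r = list(eps)
    for _ in range(iterations):
        r = [rho * sum(W[a][b] * r[b] for b in range(n)) + eps[a]
             for a in range(n)]
    return r

# Four cells in a line; each row gives equal weight to the cell's neighbours.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]   # iid N(0, 1) disturbances

rho = 0.5
r = sar_residuals(W, eps, rho)
print([round(v, 4) for v in r])
```

Setting the autocorrelation parameter to zero returns the disturbances unchanged, which is the independent case.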

Page 77: The Spatial Scan Statistic

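Incorporating Spatial Autocorrelation

SAR Model:

$$\eta_a - \mu_a = \rho \sum_b W_{ab}\,(\eta_b - \mu_b) + \varepsilon_a$$

Matrix Form:

$$\boldsymbol{\eta} - \boldsymbol{\mu} = \rho\,\mathbf{W}(\boldsymbol{\eta} - \boldsymbol{\mu}) + \boldsymbol{\varepsilon}$$

$$\boldsymbol{\eta} - \boldsymbol{\mu} = (\mathbf{I} - \rho\mathbf{W})^{-1}\boldsymbol{\varepsilon}$$

$$\boldsymbol{\eta} \sim \text{MVN}\left(\boldsymbol{\mu},\ \sigma^2\left[(\mathbf{I} - \rho\mathbf{W})^T(\mathbf{I} - \rho\mathbf{W})\right]^{-1}\right)$$

Unknown Parameters: $\mu$, $\rho$, $\sigma^2$

Special Cases:

$\sigma^2 = 0$: classical (iid) spatial scan ($\rho$ is not identifiable here)
$\rho = 0$, $\sigma^2 > 0$: overdispersed classical scan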
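The step from the matrix form of the SAR model to the covariance matrix can be spelled out as follows. This is a sketch that assumes $\mathbf{I} - \rho\mathbf{W}$ is invertible and that the disturbance vector $\boldsymbol{\varepsilon}$ has covariance $\sigma^2\mathbf{I}$; here $\boldsymbol{\eta}$ is the vector of log means, $\boldsymbol{\mu}$ its expectation, $\rho$ the autocorrelation parameter, and $\mathbf{W}$ the matrix of spatial weights.

$$
\begin{aligned}
(\mathbf{I} - \rho\mathbf{W})(\boldsymbol{\eta} - \boldsymbol{\mu}) &= \boldsymbol{\varepsilon} \\
\boldsymbol{\eta} - \boldsymbol{\mu} &= (\mathbf{I} - \rho\mathbf{W})^{-1}\boldsymbol{\varepsilon} \\
\operatorname{Var}(\boldsymbol{\eta}) &= (\mathbf{I} - \rho\mathbf{W})^{-1}\,\sigma^2\mathbf{I}\,\left[(\mathbf{I} - \rho\mathbf{W})^{-1}\right]^T \\
&= \sigma^2\left[(\mathbf{I} - \rho\mathbf{W})^T(\mathbf{I} - \rho\mathbf{W})\right]^{-1}
\end{aligned}
$$

The last line uses $(\mathbf{M}^T\mathbf{M})^{-1} = \mathbf{M}^{-1}(\mathbf{M}^{-1})^T$ with $\mathbf{M} = \mathbf{I} - \rho\mathbf{W}$.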

Page 78: The Spatial Scan Statistic

Incorporating Spatial Autocorrelation

GLMM (SAR) Model

$$Y_a \sim \text{Poisson}(\exp(\eta_a)) = \text{Poisson}(\exp(\mu_a)\exp(\eta_a - \mu_a)) = \text{Poisson}(\lambda_a A_a)$$

where $\boldsymbol{\eta} \sim \text{MVN}\left(\boldsymbol{\mu},\ \sigma^2\left[(\mathbf{I} - \rho\mathbf{W})^T(\mathbf{I} - \rho\mathbf{W})\right]^{-1}\right)$

Hotspot Hypothesis Testing Model

$H_0$: $\mu_a$ are equal (to $\mu$) for all cells a (constant relative risk)
$H_1$: $\mu_a$ take two distinct values, an elevated value in an unknown zone Z and a smaller value outside Z

List of candidate zones Z (ULS approach)

All connected components of upper level sets of the adjusted cellular surface $Y_a / A_a$, where $A_a = \exp(\eta_a - \mu_a)$

Here the model must be fitted under the null hypothesis ($\mu_a = \mu$) before determining the adjustments $A_a$ and the candidate zones Z.

Page 79: The Spatial Scan Statistic

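Spatial Autocorrelation Plus Covariates

In the SAR model, $Y_a \sim \text{Poisson}(\exp(\eta_a))$ where

$$\boldsymbol{\eta} \sim \text{MVN}\left(E[\boldsymbol{\eta}],\ \sigma^2\left[(\mathbf{I} - \rho\mathbf{W})^T(\mathbf{I} - \rho\mathbf{W})\right]^{-1}\right),$$

express the mean of $\boldsymbol{\eta}$ as

$$E[\boldsymbol{\eta}] = \boldsymbol{\mu} + \mathbf{X}\boldsymbol{\beta}$$

and formulate the Hotspot Hypothesis Testing Model in terms of the constant term $\boldsymbol{\mu}$ in this expression.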

Page 80: The Spatial Scan Statistic

CAR Model

The entire formulation is similar for Conditional AutoRegressive (CAR) specifications, except that the form of the variance-covariance matrix of $\boldsymbol{\eta}$ changes.

In the CAR model, $Y_a \sim \text{Poisson}(\exp(\eta_a))$ where

$$\boldsymbol{\eta} \sim \text{MVN}\left(E[\boldsymbol{\eta}],\ \sigma_*^2\,(\mathbf{I} - \rho_*\mathbf{W}_*)^{-1}\mathbf{A}\right)$$

and $\mathbf{A}$ is diagonal.

However, parameters in CAR and SAR have very different interpretations.

In CAR, the conditional variances are

$$\operatorname{Var}(\eta_a \mid \eta_b,\ b \neq a) = \sigma_*^2 A_{aa}$$

which (strangely) do not depend on the autocorrelation parameter $\rho_*$.

In SAR, the conditional variances are

$$\operatorname{Var}(\eta_a \mid \eta_b,\ b \neq a) = \frac{\sigma^2}{1 + \rho^2 \sum_b W_{ba}^2}$$

This expression is intuitively appealing since the conditional variances are decreasing functions of $\rho^2$ and are smallest for cells a with many strongly associated neighbors ($W_{ba}^2$ relatively large for many b).
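The SAR conditional variance can be checked numerically on a small example. The sketch below evaluates the closed-form expression and compares it with the reciprocal of the diagonal of the SAR precision matrix, the matrix product of the transpose of (I minus rho W) with (I minus rho W), divided by sigma squared. The star-shaped neighbourhood, the parameter values, and the function names are assumptions for illustration.

```python
def sar_conditional_variances(W, rho, sigma2):
    """sigma^2 / (1 + rho^2 * sum_b W[b][a]^2) for each cell a."""
    n = len(W)
    return [sigma2 / (1.0 + rho ** 2 * sum(W[b][a] ** 2 for b in range(n)))
            for a in range(n)]

def precision_diagonal(W, rho, sigma2):
    """Diagonal of (I - rho W)^T (I - rho W) / sigma^2."""
    n = len(W)
    M = [[(1.0 if i == j else 0.0) - rho * W[i][j] for j in range(n)]
         for i in range(n)]
    return [sum(M[b][a] ** 2 for b in range(n)) / sigma2 for a in range(n)]

# Star graph: cell 0 is the hub, cells 1..3 are leaves; rows sum to one
# and the diagonal is zero.
third = 1.0 / 3.0
W = [
    [0.0, third, third, third],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
rho, sigma2 = 0.5, 1.0

variances = sar_conditional_variances(W, rho, sigma2)
precisions = precision_diagonal(W, rho, sigma2)
print([round(v, 4) for v in variances])
```

In this example the hub, which every leaf weights fully, has the smallest conditional variance, in line with the remark about cells with many strongly associated neighbors.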

Page 81: The Spatial Scan Statistic

Geoinformatic Surveillance System

Diagram components: Geoinformatic spatio-temporal data from a variety of data products and data sources with agencies, academia, and industry; Masks, filters; Spatially distributed response variables; Hotspot analysis; Prioritization; Decision support systems; Indicators, weights; Masks, filters.

Page 82: The Spatial Scan Statistic

Agency Databases, Thematic Databases, Other Databases

Homeland Security; Disaster Management; Public Health; Ecosystem Health; Other Case Studies

Statistical Processing: Hotspot Detection, Prioritization, etc.

Data Sharing, Interoperable Middleware

Standard or De Facto Data Model, Data Format, Data Access

Arbitrary Data Model, Data Format, Data Access

Application Specific De Facto Data/Information Standard