spatial surveillance

49
Spatial biosurveillan ce Slides and Software and Papers at: http://www.autonlab.org [email protected] 412-268-7599 Thanks to: Howard Burkom, Greg Cooper, Kenny Daniel, Bill Hogan, Martin Kulldorf, Robin Sabhnani, Jeff Schneider, Rich Tsui, Mike Wagner Andrew Moore Carnegie Mellon [email protected] Daniel Neill Carnegie Mellon [email protected] Authors of Slides Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Upload: sahil-patwardhan

Post on 06-Apr-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 1/49

Spatial biosurveillance

Slides and Software and Papers at:

http://www.autonlab.org

[email protected]

Thanks to: 

Howard Burkom, GregCooper, Kenny Daniel, Bill

Hogan, Martin Kulldorf,Robin Sabhnani, Jeff

Schneider, Rich Tsui,Mike Wagner

Andrew Moore

Carnegie Mellon

[email protected]

Daniel Neill

Carnegie Mellon

[email protected]

Authors of Slides

Note to other teachers and users of these slides. Andrew would be delighted if you found this sourcematerial useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

your own needs. PowerPoint originals are available. If you make use of a significant portion of theseslides in your own lecture, please include this message, or the following link to the source repository of

Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Page 2: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 2/49

..Early Thursday Morning. Russia. April 1979...

Sverdlovsk

Page 3: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 3/49

Sverdlovsk

Page 4: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 4/49

Sverdlovsk

•During April and

May 1979, therewere 77Confirmedcases of

inhalationalanthrax

Page 5: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 5/49

Page 6: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 6/49

Goals of spatial cluster

detection• To identify the locations, shapes, and sizes of

potentially anomalous spatial regions.

• To determine whether each of these potentialclusters is more likely to be a “true” cluster ora chance occurrence.

• In other words, is anything unexpected goingon, and if so, where?

Page 7: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 7/49

Disease surveillance

Given: count for each zip code

(e.g. number of Emergency Dept.

visits, or over-the-counter drugsales, of a specific type)

Do any regions have sufficiently highcounts to be indicative of an emerging

disease epidemic in that area?

How many cases dowe expect to see in

each area?

 Are there any regionswith significantly more

cases than expected?

Page 8: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 8/49

A simple approach

• For each zip code:

 – Infer how many cases we expect to see,either from given denominator data (e.g.census population) or from historical data(e.g. time series of previous counts).

 – Perform a separate statistical significancetest on that zip code, obtaining its p -value.

• Report all zip codes that are significantat some level α.

What are the potential problems?

Page 9: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 9/49

A simple approach

• For each zip code:

 – Infer how many cases we expect to see,either from given denominator data (e.g.census population) or from historical data(e.g. time series of previous counts).

 – Perform a separate statistical significancetest on that zip code, obtaining its p -value.

• Report all zip codes that are significantat some level α.

One abnormal zipcode might not

be interesting…

But a cluster of abnormal zip

codes would be!

What are the potential problems?

Page 10: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 10/49

A simple approach

• For each zip code:

 – Infer how many cases we expect to see,either from given denominator data (e.g.census population) or from historical data(e.g. time series of previous counts).

 – Perform a separate statistical significancetest on that zip code, obtaining its p -value.

• Report all zip codes that are significantat some level α.

Multiple hypothesis testing

Thousands of locations to test…

5% chance of false positive for each…

 Almost certain to get largenumbers of false alarms!

How do we bound the overall probability

of getting any false alarms?

What are the potential problems?

Page 11: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 11/49

One Step of Spatial Scan

Entire area being scanned

Page 12: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 12/49

One Step of Spatial Scan

Entire area being scanned

Current region being considered

Page 13: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 13/49

One Step of Spatial Scan

Entire area being scanned

Current region being considered

I have a populationof 5300 of whom 53

are sick (1%)

Everywhere else has a

population of 2,200,000 ofwhom 20,000 are sick (0.9%)

Page 14: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 14/49

One Step of Spatial Scan

Entire area being scanned

Current region being considered

I have a populationof 5300 of whom 53

are sick (1%)

Everywhere else has a

population of 2,200,000 ofwhom 20,000 are sick (0.9%)

So... is that a big deal? 

Evaluated with Score function.

Page 15: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 15/49

Scoringfunctions

• Define models: – of the null hypothesis

H0: no attacks.

 – of the alternative

hypotheses H1(S):attack in region S.

Page 16: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 16/49

Scoringfunctions

• Define models: – of the null hypothesis

H0: no attacks.

 – of the alternative

hypotheses H1(S):

attack in region S.

• Derive a score functionScore(S) = Score(C, B).

 – Likelihood ratio:

 – To find the mostsignificant region:

)|Data(

))(|Data()(Score

0

1

 H  L

S H  LS =

)(Scoremaxarg*   SSS

=

Page 17: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 17/49

Scoringfunctions

• Define models: – of the null hypothesis

H0: no attacks.

 – of the alternative

hypotheses H1(S):

attack in region S.

• Derive a score functionScore(S) = Score(C, B).

 – Likelihood ratio:

 – To find the mostsignificant region:

)|Data(

))(|Data()(Score

0

1

 H  L

S H  LS =

)(Scoremaxarg*   SSS

=

C = Countin region B = Baseline (e.g.

Population at risk)

Page 18: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 18/49

Scoringfunctions

• Define models: – of the null hypothesis

H0: no attacks.

 – of the alternative

hypotheses H1(S):

attack in region S.

• Derive a score functionScore(S) = Score(C, B).

 – Likelihood ratio:

 – To find the mostsignificant region:

)|Data(

))(|Data()(Score

0

1

 H  L

S H  LS =

)(Scoremaxarg*   SSS

=

Example: Kulldorf’s score

Assumption: ci ~ Poisson(qbi)

H 0 : q = q all everywhere

H 1: q = q in  inside region,

q = q out outside region

Page 19: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 19/49

Scoringfunctions

• Define models: – of the null hypothesis

H0: no attacks.

 – of the alternative

hypotheses H1(S):

attack in region S.

• Derive a score functionScore(S) = Score(C, B).

 – Likelihood ratio:

 – To find the mostsignificant region:

)|Data(

))(|Data()(Score

0

1

 H  L

S H  LS =

)(Scoremaxarg*   SSS

=

Example: Kulldorf’s score

Assumption: ci ~ Poisson(qbi)

H 0 : q = q all everywhere

H 1: q = q in  inside region,

q = q out outside region

tot 

tot 

tot 

tot 

tot 

tot 

 B

C C 

 B B

C C C C 

 B

C C S D loglog)(log)( −

−−+=

(Individually Most Powerful statistic for detecting significant increases) (but still…just an example)

Page 20: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 20/49

The generalized spatial scan1. Obtain data for a set of spatial locations si.

2. Choose a set of spatial regions S to search.

3. Choose models of the data under nullhypothesis H0 (no clusters) and alternativehypotheses H1(S) (cluster in region S).

4. Derive a score function F(S) based onH1(S) and H0.

5. Find the most anomalous regions (i.e.

those regions S with highest F(S)).6. Determine whether each of these potential

clusters is actually an anomalous cluster.

Page 21: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 21/49

1. Obtain data for a set of spatial locations si.

• For each spatial location si, we aregiven a count ci and a baseline bi.

• For example: ci = # of respiratorydisease cases, bi = at-risk population.

• Goal: to find regions where the countsare higher than expected, given thebaselines.

ci = 20,bi = 5000

Population-based method: Expectation-based method:

Baselines represent population, whethergiven (e.g. census) or inferred (e.g.from sales); can be adjusted for age,

risk factors, seasonality, etc.

Under null hypothesis, we expectcounts to be proportional to baselines.

Compare disease rate (count / pop)inside and outside region.

Baselines represent expected counts,inferred from the time series of 

previous counts, accounting for day-of-week and seasonality effects.

Under null hypothesis, we expectcounts to be equal to baselines.

Compare region’s actual countto its expected count.

Page 22: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 22/49

1. Obtain data for a set of spatial locations si.

• For each spatial location si, we aregiven a count ci and a baseline bi.

• For example: ci = # of respiratorydisease cases, bi = at-risk population.

• Goal: to find regions where the countsare higher than expected, given thebaselines.

ci = 20,bi = 5000

Population-based method: Expectation-based method:

Baselines represent population, whethergiven (e.g. census) or inferred (e.g.from sales); can be adjusted for age,

risk factors, seasonality, etc.

Under null hypothesis, we expectcounts to be proportional to baselines.

Compare disease rate (count / pop)inside and outside region.

Baselines represent expected counts,inferred from the time series of 

previous counts, accounting for day-of-week and seasonality effects.

Under null hypothesis, we expectcounts to be equal to baselines.

Compare region’s actual countto its expected count.

Discussion question: When is itpreferable to use each method?

Page 23: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 23/49

2. Choose a set of spatial regions S to search.

• Some practical considerations:

• Set of regions should cover entire search space.

• Adjacent regions should partially overlap.

• Choose a set of regions that corresponds well withthe size/shape of the clusters we want to detect.

• Typically, we consider some fixed shape (e.g. circle,rectangle) and allow its location and dimensions to vary.

Don’t search too few regions: Don’t search too many regions:

Computational infeasibility!

Overall power to detect any givensubset of regions reduced because of 

multiple hypothesis testing.

Reduced power to detect clustersoutside the search space.

Page 24: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 24/49

2. Choose a set of spatial regions S to search.

• Our typical approach fordisease surveillance:• map spatial locations to grid

• search over the set of allgridded rectangular regions.

• Allows us to detect bothcompact and elongatedclusters (importantbecause of wind- or water-borne pathogens).

• Computationally efficient

• can evaluate any rectangularregion in constant time

• can use fast spatial scanalgorithm

Page 25: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 25/49

2. Choose a set of spatial regions S to search.

• Our typical approach fordisease surveillance:• map spatial locations to grid

• search over the set of allgridded rectangular regions.

• Allows us to detect bothcompact and elongatedclusters (importantbecause of wind- or water-borne pathogens).

• Computationally efficient

• can evaluate any rectangularregion in constant time

• can use fast spatial scanalgorithm

Can also search overnon-axis-aligned

rectangles byexamining multiple

rotations of the data

Page 26: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 26/49

3-4. Choose models of the data under H0 and

H1(S), and derive a score function F(S).• Most difficult steps: must choose models which are

efficiently computable and relevant.

Page 27: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 27/49

3-4. Choose models of the data under H0 and

H1(S), and derive a score function F(S).• Most difficult steps: must choose models which are

efficiently computable and relevant.

F(S) must be computable as function of some additive sufficient statistics of region S,e.g. total count C(S) and total baseline B(S).

Page 28: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 28/49

3-4. Choose models of the data under H0 and

H1(S), and derive a score function F(S).• Most difficult steps: must choose models which are

efficiently computable and relevant.

F(S) must be computable as function of some additive sufficient statistics of region S,e.g. total count C(S) and total baseline B(S).

 Any simplifying assumptions should notgreatly affect our ability to distinguish

between clusters and non-clusters.

tradeoff!

Page 29: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 29/49

3-4. Choose models of the data under H0 and

H1(S), and derive a score function F(S).• Most difficult steps: must choose models which are

efficiently computable and relevant.

F(S) must be computable as function of some additive sufficient statistics of region S,e.g. total count C(S) and total baseline B(S).

 Any simplifying assumptions should notgreatly affect our ability to distinguish

between clusters and non-clusters.

tradeoff!

Iterative design process

Test detectionpower

High power?

 Y  Calibrate

Done

NFind unmodeled effect

harming performance

Remove effect

(preprocessing)?Filter out regions(postprocessing)?

 Add complexity

to model?

Basicmodel

Page 30: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 30/49

3-4. Choose models of the data under H0 and

H1(S), and derive a score function F(S).• Most difficult steps: must choose models which are

efficiently computable and relevant.

F(S) must be computable as function of some additive sufficient statistics of region S,e.g. total count C(S) and total baseline B(S).

 Any simplifying assumptions should notgreatly affect our ability to distinguish

between clusters and non-clusters.

tradeoff!

Iterative design process

Test detectionpower

High power?

 Y  Calibrate

Done

NFind unmodeled effect

harming performance

Remove effect

(preprocessing)?Filter out regions(postprocessing)?

 Add complexity

to model?

Basicmodel

Discussion question: What effectsshould be treated with each technique?

Page 31: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 31/49

Computing the score functionMethod 1 (Frequentist, hypothesis testing approach):

Use likelihood ratio)|Pr(

))(|Pr()(

0

1

 H  Data

S H  DataSF  =

Method 2 (Bayesian approach):

Use posterior probability)Pr(

))(Pr())(|Pr()( 11

 Data

S H S H  DataSF  =

Prior probability of region S

What to do when each hypothesis has a parameter space Θ?

Method A (Maximum likelihood approach)

Method B (Marginal likelihood approach)),|Pr(max)|Pr( )(

θ θ 

H  Data H  Data H Θ∈=

∫ Θ∈

=)(

)Pr(),|Pr()|Pr( H 

 H  Data H  Data

θ 

θ θ 

Page 32: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 32/49

Computing the score functionMethod 1 (Frequentist, hypothesis testing approach):

Use likelihood ratio)|Pr(

))(|Pr()(

0

1

 H  Data

S H  DataSF  =

Method A (Maximum likelihood approach)

),|Pr(max)|Pr( )(θ 

θ 

H  Data H  Data H Θ∈

=

Most common (frequentist) approach: uselikelihood ratio statistic, with maximum likelihoodestimates of any free parameters, and compute

statistical significance by randomization.

Page 33: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 33/49

5. Find the most anomalous regions, i.e.

those regions S with highest F(S).• Naïve approach: compute F(S) for each spatial

region S.

• Better approach: apply fancy algorithms (e.g.Kulldorf’s SatScan or the fast spatial scanalgorithm (Neill and Moore, KDD 2004).

Problem: millions of regions to search!

d h l

Page 34: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 34/49

5. Find the most anomalous regions, i.e.

those regions S with highest F(S).• Naïve approach: compute F(S) for each spatial

region S.

• Better approach: apply fancy algorithms (e.g.Kulldorf’s SatScan or the fast spatial scanalgorithm (Neill and Moore, KDD 2004).

Problem: millions of regions to search!

Using a multiresolution data structure(overlap-kd tree) enables us to

efficiently move between searching atcoarse and fine resolutions.

Start by examining large rectangular regions S. If we can show thatnone of the smaller rectangles contained in S can have high scores,

we do not need to individually search each of these subregions.

5 Fi d th t l i i

Page 35: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 35/49

5. Find the most anomalous regions, i.e.

those regions S with highest F(S).• Naïve approach: compute F(S) for each spatial

region S.

• Better approach: apply fancy algorithms (e.g.Kulldorf’s SatScan or the fast spatial scanalgorithm (Neill and Moore, KDD 2004).

Problem: millions of regions to search!

Using a multiresolution data structure(overlap-kd tree) enables us to

efficiently move between searching atcoarse and fine resolutions.

Start by examining large rectangular regions S. If we can show thatnone of the smaller rectangles contained in S can have high scores,

we do not need to individually search each of these subregions.

Result: 20-2000x speedupsvs. naïve approach, withoutany loss of accuracy

6 D t i h th h f th t ti l

Page 36: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 36/49

6. Determine whether each of these potential

clusters is actually an anomalous cluster.• Frequentist approach: calculate statistical significance of 

each region by randomization testing.

D(S) = 21.2

D(S) = 18.5

D(S) = 15.1

Original grid G

6 D t i h th h f th t ti l

Page 37: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 37/49

6. Determine whether each of these potential

clusters is actually an anomalous cluster.• Frequentist approach: calculate statistical significance of 

each region by randomization testing.

D(S) = 21.2

D(S) = 18.5

D(S) = 15.1

Original grid G

D*(G1) = 16.7 D*(G999) = 4.2

1. Create R = 999 replica grids by sampling under H0,using max-likelihood estimates of any free params.

6 Dete mine hethe each of these potential

Page 38: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 38/49

6. Determine whether each of these potential

clusters is actually an anomalous cluster.• Frequentist approach: calculate statistical significance of 

each region by randomization testing.

D(S) = 21.2

D(S) = 18.5

D(S) = 15.1

Original grid G

D*(G1) = 16.7 D*(G999) = 4.2

1. Create R = 999 replica grids by sampling under H0,using max-likelihood estimates of any free params.

2. Find maximum region score D* for each replica.

6 Determine whether each of these potential

Page 39: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 39/49

6. Determine whether each of these potential

clusters is actually an anomalous cluster.• Frequentist approach: calculate statistical significance of 

each region by randomization testing.

D(S) = 21.2

D(S) = 18.5

D(S) = 15.1

Original grid G

D*(G1) = 16.7 D*(G999) = 4.2

1. Create R = 999 replica grids by sampling under H0,using max-likelihood estimates of any free params.

2. Find maximum region score D* for each replica.

3. For each potential cluster S, count R beat = numberof replica grids G’ with D*(G’) higher than D(S).

4.  p -value of region S = (R beat+1)/(R+1).

5. All regions with  p -value < α are significant at level α.

6 Determine whether each of these potential

Page 40: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 40/49

6. Determine whether each of these potential

clusters is actually an anomalous cluster.• Bayesian approach: calculate posterior probability

of each potential cluster.

Original grid G

1. Score of region S = Pr(Data | H1(S)) Pr(H1(S))

2. Total probability of the data: Pr(Data) =

Pr(Data | H0) Pr(H0) + ∑S Pr(Data | H1(S)) Pr(H1(S))

3. Posterior probability of region S: Pr(H1(S) | Data) =

Pr(Data | H1(S)) Pr(H1(S)) / Pr(Data).

4. Report all clusters with posterior probability >

some threshold, or “sound the alarm” if totalposterior probability of all clusters sufficiently high.

No randomization testing necessary… about1000x faster than naïve frequentist approach!

Page 41: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 41/49

Making the spatial scan fast

Naïve frequentist scan 1000 replicas

x 12 hrs / replica

= 500 days!

256 x 256 grid = 1 billion regions!

Fast frequentist scan Naïve Bayesian scan

12 hrs (to searchoriginal grid)

1000 replicas

x 36 sec / replica

= 10 hrs

Fast Bayesian scan

Page 42: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 42/49

Why the Scan Statistic speed obsession?

• Traditional ScanStatistics very

expensive,especially withRandomizationtests

• Going national• A few hours

could actually

matter!

Page 43: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 43/49

Results

S f l

Page 44: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 44/49

Summary of results

• The fast spatial scan results inhuge speedups (as comparedto exhaustive search), makingfast real-time detection ofclusters feasible.

• No loss of accuracy: fast

spatial scan finds the exactsame regions and p-values asexhaustive search.

OTC data from National

Retail Data Monitor

ED data

P f i

Page 45: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 45/49

Performance comparison

429.854.4 ns81 minutes1.1 trillionAxis-alignedrectangles

fast spatialscan

429.853600 ns45 days1.1 trillionAxis-aligned

rectangles

exhaustive

413.56400 ns16 hours150 billionCircles

centeredat datapts

SaTScan

Likelihoodratio

Time / region

Searchtime (total)

Numberof regions

Searchspace

Algorithmname

• On ED dataset (600,000records), 1000 replicas

• For SaTScan: M=17,000distinct spatial locations

• For Exhaustive/fast: 256 x 256grid

P f i

Page 46: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 46/49

Performance comparison

429.854.4 ns81 minutes1.1 trillionAxis-alignedrectangles

fast spatialscan

429.853600 ns45 days1.1 trillionAxis-aligned

rectangles

exhaustive

413.56400 ns16 hours150 billionCircles

centeredat datapts

SaTScan

Likelihoodratio

Time / region

Searchtime (total)

Numberof regions

Searchspace

Algorithmname

• On ED dataset (600,000records), 1000 replicas

• For SaTScan: M=17,000distinct spatial locations

• For Exhaustive/fast: 256 x 256grid

• Algorithms: Neill and Moore, NIPS 2003, KDD 2004• Deployment: Neill, Moore, Tsui and Wagner,

Morbidity and Mortality Weekly Report, Nov. ‘04 

Page 47: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 47/49

SC

d-dimensional partitioning• Parent region S is divided into 2d

overlapping children: an “upper child”

and a “lower child” in each dimension.• Then for any rectangular subregion S’

of S, exactly one of the following istrue:

 – S’ is contained entirely in (at least) oneof the children S1… S2d.

 – S’ contains the center region SC, whichis common to all the children.

• Starting with the entire grid G and

repeating this partitioning recursively,we obtain the overlap-kd treestructure.

S5

S1

S2

S3

S4

S6

S

• Algorithm: Neill, Moore and Mitchell NIPS 2004

Limitations of the algorithm

Page 48: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 48/49

Limitations of the algorithm

• Data must be aggregated to a grid.

• Not appropriate for very high-

dimensional data.• Assumes that we are interested in

finding (rotated) rectangular regions.

• Less useful for special cases (e.g.square regions, small regions only).

• Slower for finding multiple regions.

Related work

Page 49: Spatial Surveillance

8/3/2019 Spatial Surveillance

http://slidepdf.com/reader/full/spatial-surveillance 49/49

Related work

 – non-specific clustering: evaluates generaltendency of data to cluster

 – focused clustering: evaluates risk w.r.t. agiven spatial location (e.g. potentialhazard)

 – disease mapping: models spatial variationin risk by applying spatial smoothing.

 – spatial scan statistics (and related

techniques).