Modeling & Analyzing Massive Terrain Data Sets

Pankaj K. Agarwal

Duke University

STREAM Project

Scalable Techniques for hi-Resolution Elevation data Analysis & Modeling

In collaboration with: Lars Arge, Helena Mitasova

Students: Andrew Danner, Thomas Molhave, Ke Yi

Diverse elevation point data: density, distribution, accuracy

[Figure: elevation surveys from 1974, 1995, 1998, 1999, 2001, and 2004]

Lidar: 0.15m vertical accuracy; altitude 700m and 2300m

Photogrammetry: 0.76m vertical accuracy (5ft contours)

RTK-GPS: 0.10m vertical accuracy

Constructing Digital Elevation Models

Grid, TIN, Contour DEMs

Natural feature extraction

Extracted foredunes 1999-2004 using profile curvature and elevation threshold

Maps of topo parameters: computed simultaneously with interpolation using partial derivatives of the RST function. Terrain features: combined thresholds of topo parameters.

Tracking evolving features

Slip faces: slope > 30deg (panels: 1995, 1999, 2001; new slip face in 1999)

How to automatically track features that change some of their properties over time?

Dune crests: profile curvature threshold

Hydrology Analysis

• Flow Analysis
• Watershed hierarchy

Challenge: Massive Data Sets

• LIDAR
– NC Coastline: 200 million points, over 7 GB
– Neuse River basin (NC): 500 million points, over 17 GB
• Output is also large
– 10ft grid: 10GB
– 5ft grid: 40GB

Approximation Algorithms

• Exact computation expensive
– Many practical problems are intractable
– Multiple & often conflicting optimization criteria
– Suffices to find a near-optimal solution
• Tunable approximation algorithms
– Tradeoff between accuracy and efficiency
– User specifies tolerance

I/O-Bottleneck

• Data resides in secondary memory
• Disk access is 10^6 times slower than main memory access
– Maximize useful data transferred with each access
– Amortize large access time by transferring large contiguous blocks of data
• I/O, not CPU, is often the bottleneck for processing massive data sets

[Figure: disk drive with track, magnetic surface, read/write arm, and read/write head]

• Traditional algorithms optimize CPU computation
– Not sensitive to penalty of disk access
• I/O model
– Memory is finite
– Data is transferred in blocks
– Complexity measured in disk blocks transferred

I/O-efficient Algorithms [AV88]

[Figure: I/O model with CPU, RAM of size M, and disk; data is transferred in blocks of size B, B ~ 2K-8K]

Elevation Data to Watershed Hierarchies: A Pipeline Approach

• Given a point cloud sampled from a bivariate function z=F(x,y) (elevation data)

• Goal: construct a grid DEM for z
• DEM Applications
– Hydrology
– Ecology
– Viewsheds …
• Many algorithms only work on grid data

Point Cloud to Grid DEM

Sub-meter resolution: improved structures representation

2004 lidar-based 30cm resolution DSM

1999 lidar 2m resolution DSM

Binning (within-cell average) is sufficient for ~1-3m DEMs, but interpolation produces improved geometry representation and allows us to create higher resolution DEMs/DSMs

2004 lidar: 0.5m resolution binning

interpolated DSM
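
Binning as described above (within-cell average) can be sketched in a few lines. This is a minimal in-memory illustration, not the project's implementation; the function name `bin_points` and the `cell_size` parameter are assumptions made for the example.

```python
from collections import defaultdict


def bin_points(points, cell_size):
    """Average the z-values of all points that fall into the same grid cell.

    points: iterable of (x, y, z); returns {(row, col): mean z}.
    Cells that receive no points are simply absent from the result.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (row, col) -> [sum of z, count]
    for x, y, z in points:
        cell = (int(y // cell_size), int(x // cell_size))
        acc = sums[cell]
        acc[0] += z
        acc[1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}


# Hypothetical sample: two points share cell (0, 0), one falls in cell (0, 1).
points = [(0.2, 0.3, 10.0), (0.8, 0.1, 12.0), (1.5, 0.4, 20.0)]
dem = bin_points(points, cell_size=1.0)
```

Empty cells are the practical limit of this approach: once the cell size drops below the point spacing, many cells receive no points, which is where interpolation becomes necessary.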

General Approach: Interpolation

• Kriging, IDW, Splines, …

• Modern data sets can have millions of points
• Direct interpolation does not scale

[Figure: N points are interpolated in O(N^3) time to obtain z=F(x,y)]
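
To make the interpolation step concrete, here is a minimal sketch of inverse distance weighting (IDW), the simplest of the methods listed above. Note that IDW costs O(N) per evaluated location; the O(N^3) cost on the slide arises for methods such as splines and kriging, which solve a dense linear system over all N points. The function name `idw` and the `power` parameter are assumptions for illustration.

```python
import math


def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted estimate of z at (x, y).

    points: iterable of (px, py, pz). Every input point contributes with
    weight 1 / distance**power; an exact hit returns that point's z.
    """
    num = 0.0
    den = 0.0
    for px, py, pz in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            return pz
        w = 1.0 / d ** power
        num += w * pz
        den += w
    return num / den


# Hypothetical sample: the midpoint of two samples gets their average.
z_mid = idw([(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)], 1.0, 0.0)
```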

Improving Interpolation with Quad Trees

• Segment initial point set using a quad tree
– Each quad tree leaf has fewer than K points (K in the range 10-100)
• Interpolate each segment separately
• N/K segments
– O(K^3) interpolation per segment
– O(K^2 N) total running time
– Linear in N
• Boundary smoothness?
– Use points from nearby segments
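
The segmentation step can be sketched as a recursive split that stops once a leaf holds fewer than K points. This is an in-memory illustration only (the talk builds the tree on disk); the name `build_quadtree` and the `max_depth` guard against duplicate points are assumptions added for the example.

```python
def build_quadtree(points, bounds, k, depth=0, max_depth=32):
    """Split `bounds` recursively until every leaf holds fewer than k points.

    points: list of (x, y, z); bounds: (xmin, ymin, xmax, ymax).
    Returns a list of (bounds, points) leaves; empty quadrants are skipped.
    max_depth stops the recursion when many points coincide.
    """
    if len(points) < k or depth >= max_depth:
        return [(bounds, points)]
    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    children = {}
    for p in points:
        key = (p[0] >= xmid, p[1] >= ymid)  # (east half?, north half?)
        children.setdefault(key, []).append(p)
    leaves = []
    for (east, north), pts in children.items():
        child = (xmid if east else xmin, ymid if north else ymin,
                 xmax if east else xmid, ymax if north else ymid)
        leaves.extend(build_quadtree(pts, child, k, depth + 1, max_depth))
    return leaves


# Hypothetical sample with K = 2: every leaf ends up with a single point.
sample = [(1, 1, 5.0), (9, 1, 6.0), (1, 9, 7.0), (9, 9, 8.0), (8, 8, 9.0)]
leaves = build_quadtree(sample, (0, 0, 10, 10), k=2)
```

With N/K leaves and O(K^3) work per leaf, the total is (N/K) · K^3 = K^2 N, which is the linear-in-N bound stated above.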

Overview of the Algorithm

• Construct quad-tree on disk using bulk loading [AAPV 01]

• For each quad-tree node, find points in neighboring segments

• Arrange points on disk for sequential interpolation
– Use the algorithm by Mitasova, Mitas, Harmon
– Can use any other interpolation method

Building an External Quad Tree

• Build top few levels in memory
• Buffer points in lower levels
• Write buffers to disk when full

[Figure: example with K=2; buffers are written to disk using block I/O]

Building a TIN

TIN: Triangulated Irregular Network

Constrained Delaunay Triangulation

Developed an I/O-efficient algorithm

Builds TIN for Neuse River Basin (14GB) in 7.5 hours

Future Work: Hierarchical DEMs

• Considerable work in graphics and GIS communities on multiresolution hierarchies for terrains
– Focuses on visualization
– Most algorithms perform random memory access
• Out-of-core simplification
– I/O-efficient algorithms for terrain simplification (edge contraction method)
• Feature preserving simplification
– Preserves main river networks
– Preserves visibility for most of the points

Coping with Noisy Data

• Nearly all flow models assume [O’Callaghan and Mark 84, de Berg et al. 96]
– water flows downwards
– stops at a minimum

Denoising via Flooding [Jenson and Domingue 88, Arge et al. 03]

• Pour water uniformly on terrain
• Water level rises until steady-state
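
The steady state described above can be illustrated on a small in-memory grid. The sketch below uses a priority-queue flood that grows inward from the grid boundary; it is a substitute chosen for brevity, not the I/O-efficient algorithm of [Arge et al. 03]. It assumes water can leave the terrain anywhere on the boundary and uses 4-connectivity; the function name `flood` is hypothetical.

```python
import heapq


def flood(dem):
    """Raise each cell to the lowest level from which water can reach the boundary.

    dem: list of rows of elevations. Returns a new grid; the input is unchanged.
    """
    rows, cols = len(dem), len(dem[0])
    filled = [row[:] for row in dem]
    visited = [[False] * cols for _ in range(rows)]
    heap = []
    # Seed the queue with every boundary cell: these can always drain.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(heap, (dem[r][c], r, c))
                visited[r][c] = True
    # Always expand from the lowest known spill level.
    while heap:
        level, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not visited[nr][nc]:
                visited[nr][nc] = True
                filled[nr][nc] = max(dem[nr][nc], level)
                heapq.heappush(heap, (filled[nr][nc], nr, nc))
    return filled


# Hypothetical sample: the pit of height 1 is raised to its spill level 2.
filled = flood([[3, 3, 3],
                [3, 1, 3],
                [3, 2, 3]])
```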

River Network After Flooding

Sort(N) I/Os

After flooding

Denoising via Flooding [Jenson and Domingue 88, Arge et al. 03]

• Pour water uniformly on terrain
• Water level rises until steady-state
• Use topological persistence (Edelsbrunner et al.) to measure the importance of a minimum
• Flood low-persistence minima, preserve high-persistence minima

Topological Persistence [Edelsbrunner et al. 02]

The Union-Find Problem

• A universe of N elements: x_1, x_2, …, x_N
• Initially N singleton sets: {x_1}, {x_2}, …, {x_N}
• Each set has a representative
• Maintain the partition under
– Union(x_i, x_j): joins the sets containing x_i and x_j
– Find(x_i): returns the representative of the set containing x_i

Formulated as Batched Union-Find

• On a terrain represented as a TIN
• On reaching
– A minimum or maximum: do nothing
– A regular point u: issue union(u,v) for a lower neighbor v
– A saddle u: let v and w be nodes from u’s two connected pieces in its lower link; issue find(v), find(w), union(u,v), union(u,w)
• Sequence of union/find operations can be computed in advance

[Figure: lower link of a vertex]
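
As a rough illustration of how union-find yields the persistence of minima, the sketch below sweeps the vertices of a height graph from low to high and merges components as they meet. It is a simplified online variant that unions with every lower neighbor rather than issuing the batched sequence above, and it assumes distinct heights. When two components meet, the one with the higher minimum is merged away and that minimum's persistence is recorded. The name `minima_persistence` and the input format are assumptions.

```python
def minima_persistence(height, edges):
    """Return {minimum: persistence} for each minimum that gets merged away.

    height: {vertex: elevation} with distinct elevations;
    edges: iterable of (a, b) pairs, e.g. the edges of a TIN.
    The global minimum never merges into anything and is not reported.
    """
    neighbors = {v: [] for v in height}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    parent = {}  # union-find forest over the vertices swept so far
    lowest = {}  # component root -> lowest vertex (the minimum) it contains

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    persistence = {}
    for u in sorted(height, key=height.get):  # sweep upward
        parent[u] = u
        lowest[u] = u
        for v in neighbors[u]:
            if v not in parent:  # v is higher than u, not swept yet
                continue
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # The component with the higher minimum is absorbed by the other.
            old, young = sorted((ru, rv), key=lambda r: height[lowest[r]])
            if lowest[young] != u:  # skip the trivial case of u joining a basin
                persistence[lowest[young]] = height[u] - height[lowest[young]]
            parent[young] = old
    return persistence


# Hypothetical 1D profile a-b-c-d-e with minima at a, c, and e.
profile = {"a": 1, "b": 5, "c": 2, "d": 6, "e": 0}
pers = minima_persistence(profile, [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
```

In the sample, the basin of c merges into the basin of a at the saddle b, and the basin of a merges into the basin of e at the saddle d.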

Classical Union-Find Algorithm

[Figure: forest of trees whose roots are the set representatives; Union(d, h) uses link-by-rank; Find(n) uses path compression]
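
The two heuristics named in the figure, link-by-rank and path compression, fit in a short class. This is the textbook in-memory structure, given here as a minimal sketch; the class and method names are chosen for the example.

```python
class UnionFind:
    """Disjoint sets with link-by-rank and path compression."""

    def __init__(self, elements):
        self.parent = {x: x for x in elements}  # every element starts as a root
        self.rank = {x: 0 for x in elements}    # upper bound on tree height

    def find(self, x):
        # First pass: walk up to the representative (the root).
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited node at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        # Link-by-rank: hang the shallower tree under the deeper one.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return rx


uf = UnionFind("dhn")
uf.union("d", "h")
```

Both operations chase pointers through memory, which is why this structure performs poorly once the data no longer fits in RAM.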

Complexity

• O(N α(N)) for a sequence of N union and find operations [Tarjan 75]
– α(·): inverse Ackermann function (grows very slowly!)
– Optimal in the worst case [Tarjan 79, Fredman and Saks 89]
• Batched (off-line) version
– Entire sequence known in advance
– Can be improved to linear on RAM [Gabow and Tarjan 85]
– Not possible on a pointer machine [Tarjan 79]

Our Results

• An I/O-efficient algorithm for the batched union-find problem using O(sort(N)) = O((N/B) log_{M/B}(N/B)) I/Os
– Same as sorting
– Optimal in the worst case
• A simpler algorithm using O(sort(N) log(N/M)) I/Os
– Implemented
• Apply to persistence computation, flooding
– O(sort(N)) time
• Other applications
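
To get a feel for the sort(N) bound, the sketch below evaluates (N/B) log_{M/B}(N/B) alongside the cost of a plain scan, with constant factors dropped. The function names and the sample values of N, M, and B (all measured in items) are assumptions for illustration.

```python
import math


def scan_ios(n, b):
    """Number of block transfers needed to read n items once, b items per block."""
    return math.ceil(n / b)


def sort_ios(n, m, b):
    """The sorting bound (N/B) * log_{M/B}(N/B), constant factors dropped.

    n: number of items, m: items that fit in memory, b: items per block.
    The logarithm is clamped to 1 so that data fitting in memory costs one scan.
    """
    blocks = n / b
    return blocks * max(1.0, math.log(blocks, m / b))


# Hypothetical machine: 2^30 items, 2^20 fit in memory, 2^10 per block.
n, m, b = 2 ** 30, 2 ** 20, 2 ** 10
scan_cost = scan_ios(n, b)
sort_cost = sort_ios(n, m, b)
```

For these values the logarithm is only 2, so sorting costs about two scans of the data. An algorithm that instead pays one I/O per item would need 2^30 transfers, roughly 500 times more.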

I/O-Efficient Batched Union-Find

• Two-stage algorithm
– Convert to interval union-find: compute an order on the elements s.t. each union joins two adjacent sets
– Solve batched interval union-find
• Each stage formulated as a geometric problem and solved using sort(N) I/Os

Experimental Results

Traditional algorithm good as long as data fits in memory

Experimental Results: Neuse River Basin

On the entire data set: EM (external-memory algorithm) takes 5 hours, IM (internal-memory algorithm) fails

0.5 billion points (14GB raw data), sampled to obtain smaller data sets

Contour Trees

Previous Results

• Directly maintain contours
– O(N log N) time [van Kreveld et al. 97]
– Needs union-split-find for circular lists
– Does not extend to higher dimensions
• Two sweeps by maintaining components, then merge
– O(N log N) time [Carr et al. 03]
– Extends to arbitrary dimensions

Join Tree and Split Tree [Carr et al. 03]

[Figure: join tree and split tree on nodes 1-9]

Qualified nodes

[Figure: join tree and split tree]

Final Contour Tree

[Figure: join tree, split tree, and the resulting contour tree on nodes 1-9]

Hard to BATCH!

Another Characterization

Let w be the highest node that is a descendant of v in the join tree and an ancestor of u in the split tree; then (u, w) is a contour tree edge

[Figure: join tree, split tree, and contour tree on nodes 1-9, with nodes u, v, and w marked in each]

Now can BATCH!

Experimental Results

On the Neuse River Basin data set

Coping with Noise: Future Work

• Constructing hydrologically conditioned DEMs
– Formulate as an optimization problem
– Using persistent homology
– Integrating topology and geometry
– Handling flat areas
• Scalability remains a big issue
• Handling anthropogenic features

Back to Pipeline: Flow Modeling

From Elevation to River Networks

• Where does water go?
• From higher elevation to lower elevation
• Single flow directions form a tree

[Figure: terrain nodes with elevations 80, 85, 90, 95, 100, and 110 draining to the sea]
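
A single flow direction per cell can be sketched by sending each grid cell to its lowest strictly lower neighbor. This is a simplified in-memory variant that compares elevations only (it does not divide by neighbor distance as slope-based rules do) and uses 8-connectivity; the function name `flow_directions` is hypothetical.

```python
def flow_directions(dem):
    """Map every cell to its lowest strictly lower 8-neighbor.

    dem: list of rows of elevations. Returns {(r, c): (nr, nc) or None};
    None marks a local minimum, where the flow stops.
    """
    rows, cols = len(dem), len(dem[0])
    downstream = {}
    for r in range(rows):
        for c in range(cols):
            best = None
            best_z = dem[r][c]
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                        if dem[nr][nc] < best_z:
                            best, best_z = (nr, nc), dem[nr][nc]
            downstream[(r, c)] = best
    return downstream


# Hypothetical 2x3 grid reusing the elevations from the figure.
downstream = flow_directions([[110, 100, 95],
                              [90, 85, 80]])
```

Because every cell points to a strictly lower cell, following the pointers can never cycle, so the flow directions form a forest with one tree per local minimum.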

Drainage Area

• How much area is upstream of each node?

• Each node has initial drainage area (1)

• Drainage area of internal nodes depends on drainage area of children

[Figure: flow tree in which every node starts with drainage area 1 and internal nodes accumulate the drainage areas of their children]
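
The accumulation rule above can be sketched by visiting nodes from highest to lowest and pushing each node's area to its downstream neighbor. This is an in-memory illustration; the function name `drainage_area` and the dictionary-based inputs are assumptions.

```python
def drainage_area(downstream, elevation):
    """Drainage area of every node: 1 for itself plus the areas of its children.

    downstream: {node: node it flows to, or None at an outlet};
    elevation: {node: height}, with flow strictly downhill.
    """
    area = {v: 1 for v in downstream}  # each node has initial drainage area 1
    # Highest first, so a node's area is final before it is passed downstream.
    for v in sorted(downstream, key=elevation.get, reverse=True):
        target = downstream[v]
        if target is not None:
            area[target] += area[v]
    return area


# Hypothetical flow tree: a and b drain into c; c and d drain into the outlet e.
flow = {"a": "c", "b": "c", "c": "e", "d": "e", "e": None}
heights = {"a": 110, "b": 100, "c": 90, "d": 95, "e": 80}
area = drainage_area(flow, heights)
```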

Drainage Area example (Panama)

IFSARE and SRTM based stream analysis

Stream and watershed boundary extraction for Panama for on-going water quality research to provide more detailed and up-to-date data.

SRTM: 7400x3600 DSM at 90m resolution for entire Panama. IFSARE: 10800x11300 DEM at 10m resolution for the Panama canal section.

Using r.terraflow (officially released in GRASS 6.2 in 2006), streams can be extracted in 3-4 hours; a tile-based approach takes days.

[Figure: SRTM DEM for entire Panama with main rivers extracted from DEM in 3 hr]

Hierarchical Watershed Decomposition

Watershed Hierarchies

• Decompose a river network into a hierarchy of hydrological units
• All water in HU flows to a common outlet
• Hierarchy provides tunable level of detail
• Method used: Pfafstetter [VV99]
• Want a solution scalable to large modern hi-res terrains

Pfafstetter

• Find main river
• Find four largest tributaries
• Label basins/interbasins
• Recurse until single path

[Figure: river network annotated with drainage areas, then with Pfafstetter basin and interbasin labels]

Recurse

Example Watershed Boundaries

[Figure: watershed boundaries labeled 1-9, 91-99, and 991-999]

A Complete Pipeline

Implementation

• TPIE: C++ primitives for I/O-efficient algorithms
• GRASS: Open Source GIS
• Interpolation: Regularized spline with tension (in GRASS)
• Data:
– North Carolina LIDAR
  • Neuse river basin: 400 million points (NC Floodmaps)
  • Outer banks coastal data: 128 million points (NOAA CSC)
– USGS 30m NED

Grid Construction Results

Resolution (ft)        40        20        10
Grid cells (x10^6)     221       885       3542
Points (x10^6)         205       340       415
Total time             12h32m    14h46m    26h52m
Time spent (%)
  Build quad tree      8.9       7.1       5.7
  Find neighbors       31.6      32.4      29.1
  Interpolate          58.8      58.5      59.3
  Write output         0.7       2         5.9

• Neuse
• Point thinning
• Adjusted interpolation parameters

Sample Watershed Results

Size (MB)                150       713       5819
Size (million cells)     30.8      147       396.5
Total time               10m29s    58m10s    187m43s
Time spent (%)
  Importing data         8         7         16
  Sorting by flow        16        15        13
  Building river list    31        35        29
  Sorting river list     19        20        19
  Computing labels       7         6         6
  Sort by grid order     14        13        12
  Exporting data         5         4         5

Future Work

• Terrain Modeling

• Terrain Analysis

Multivalent terrain models

- multiple surfaces
- hybrid raster and 3D vector for structured environments (urban)
- voxel for unstructured environments (forested areas)

Multiple surface representation: DSM and DEM

Terrain analysis on DEMs with structures

Investigate multivalent elevation field representations that preserve road and stream connectivity

Dynamic Data

• Continuous data streams from various sensors
– Maintain smart summaries of data
– Communication, energy efficient algorithms for sensor networks
• Coping with moving/deformable objects
– Tracking evolving terrain features
– Kinetic algorithms that update the structure locally

Terrain Analysis

• Flow modeling
– Erosion analysis
– Soil moisture modeling (collaboration with ecologists)
• Tracking evolving features
• Visibility based navigation
– Finding paths from which most of the terrain is visible
– Computing well concealed paths
– Covering terrains from a few points

Acknowledgments

• Collaborators
– Lars Arge, Helena Mitasova
– Andrew Danner, Thomas Molhave, Ke Yi
• Sponsored by
– Army Research Office W911NF-04-1-0278

Relevant Publications

• Danner, Yi, Molhave, Agarwal, Arge, Mitasova. TerraStream: From Elevation Data to Watershed Hierarchies. Submitted for publication, 2007.
• Agarwal, Arge, Danner. From point cloud to grid DEM: A scalable approach. In Proc. 12th SSDH, 771–788, 2006.
• Agarwal, Arge, Yi. I/O-efficient construction of constrained Delaunay triangulations. In Proc. ESA, 355–366, 2005.
• Agarwal, Arge, Yi. I/O-efficient batched union-find and its applications to terrain analysis. In Proc. 22nd SoCG, 2006.
• Arge, Danner, Haverkort, Zeh. I/O-efficient hierarchical watershed decomposition of grid terrain models. In Proc. 12th SSDH, 825–844, 2006.
• Danner. I/O Efficient Algorithms and Applications in Geographic Information Systems. PhD thesis, Duke University, 2006.

Thanks!

Future Directions: Noise Removal

• Bridge detection/removal

• Other flow routing methods

• Flow routing on flat surfaces

• Comparing flow networks

Mapping LIDAR point density

Year    Points per 2m grid cell    Sensor / source
1996    0.2
1997    0.9                        ATM II NASA
1998    0.4
1999    1.4
2001    0.2                        NCflood Leica
2003    2.0                        EARL NASA
2004    15.0                       SHOALS USACE
2005    6.0                        SHOALS USACE

[Figure: point density maps (pt / 2m grid cell) for 1998, 2001 (NC Flood), and 2004 (SHOALS); captions: point density at the beginning of the project, current industry standard]

Building an External Quad Tree

• After first scan of data:
– Write top of tree to disk
– Flush all buffers

• Recurse on buffers

• levels in one scan

• I/Os total

• Identifying minima likely due to noise
• Don’t want to remove real features
– Persistent homology [ELZ 02]
– Computed in Sort(N) I/Os

Coping with Noisy Data

Interpolation

• Scan through list of points with neighbors
• Interpolate each leaf separately
– Use any interpolation method
– Evaluate at grid cells
– Write grid cell values as (i,j,z) as they are computed
• Sort grid cells by raster order

[Figure: Interpolate, Evaluate, write (i, j, z), Sort]
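
The final sort step can be sketched as follows: leaves emit (i, j, z) triples in whatever order they are processed, and one sort by (row, column) restores raster order. This in-memory example assumes every grid cell is produced exactly once; the function name `to_raster` is hypothetical, and on massive grids the sort would be an external-memory sort.

```python
def to_raster(cells, nrows, ncols):
    """Arrange (i, j, z) triples into a row-major grid.

    cells: iterable of (row, col, z), in arbitrary order, one per grid cell.
    """
    ordered = sorted(cells, key=lambda cell: (cell[0], cell[1]))
    return [[z for _, _, z in ordered[r * ncols:(r + 1) * ncols]]
            for r in range(nrows)]


# Hypothetical output of two leaves, written in leaf order.
cells = [(1, 0, 12.5), (0, 1, 11.0), (0, 0, 10.0), (1, 1, 13.0)]
grid = to_raster(cells, nrows=2, ncols=2)
```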
