Massive Data Algorithmics: An Introduction
In the name of Allah
Overview
MADALGO
SCALGO
Basic Concepts
The TerraFlow Project
STREAM
The TerraStream Project
TPIE
MADALGO- Introduction
Center for MAssive Data ALGOrithmics
A major basic research center funded by
The Danish National Research
Foundation
Covers all areas of the design, analysis and
implementation of algorithms and data
structures for processing massive data
MADALGO- Four core research areas
I/O-efficient algorithms
◦ Algorithms designed in a two-level external
memory (or I/O-) model
◦ The memory hierarchy consists of a main
memory of limited size M and an external
memory (disk) of unlimited size
◦ The goal is to minimize the number of
I/O-operations (or simply I/Os), where one I/O
reads or writes a block of B consecutive
elements from or to disk
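To make the cost measure concrete, here is a minimal sketch that counts I/Os under the model above. The scanning and sorting bounds used are the standard ones from the external-memory literature, not something stated on the slides; the function names and the sample values of N, M and B are illustrative assumptions.

```python
import math


def scan_ios(n: int, b: int) -> int:
    """I/Os needed to read n consecutive elements in blocks of b elements."""
    return -(-n // b)  # integer ceiling of n / b


def sort_ios(n: int, m: int, b: int) -> float:
    """Standard sorting bound (N/B) * log_{M/B}(N/B), with constants ignored."""
    blocks = n / b
    if blocks <= 1:
        return 1.0  # the whole input fits in a single block
    return blocks * max(1.0, math.log(blocks, m / b))


# Assumed example: 2^30 elements, memory of 2^20 elements, blocks of 2^10.
print(scan_ios(2**30, 2**10))
print(sort_ios(2**30, 2**20, 2**10))
```

The point of the comparison is that reading elements one at a time in random order can cost up to N I/Os, while a blocked scan costs only about N/B.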
MADALGO- Four core research areas
Cache-oblivious algorithms
◦ Algorithms designed in the I/O-model, but
without knowledge of M and B, and then
analyzed as I/O-model algorithms
◦ The resulting analysis holds simultaneously
on all levels of any multi-level memory hierarchy
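A classic illustration of the cache-oblivious idea is a recursive matrix transpose. The sketch below is an assumption-laden example, not taken from the slides: it always splits the longer side in half, so M and B never appear in the code. Python lists do not model a real memory layout, so this only shows the recursion pattern.

```python
def transpose(src, dst, r0, r1, c0, c1):
    """Write the transpose of src[r0:r1][c0:c1] into dst by halving the longer side."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if rows == 1 and cols == 1:
        dst[c0][r0] = src[r0][c0]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)


matrix = [[r * 5 + c for c in range(5)] for r in range(3)]  # 3 x 5 input
result = [[None] * 3 for _ in range(5)]                     # 5 x 3 output
transpose(matrix, result, 0, 3, 0, 5)
```

Because the recursion eventually reaches sub-blocks that fit in any given cache level, the same code is analyzed once in the I/O-model and the bound applies at every level.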
MADALGO- Four core research areas
Streaming algorithms
◦ Only one (or a small constant number of)
sequential pass(es) over the data is (are)
allowed
◦ Solve a given problem using significantly less
space than the input data size
◦ Process each data element as fast as possible
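As one example of these three constraints, the sketch below uses the Misra-Gries frequent-items algorithm, chosen here for illustration and not named on the slides. It makes a single pass, keeps at most k - 1 counters regardless of the input size, and guarantees that any item occurring more than n/k times in a stream of n items is among the returned candidates.

```python
def misra_gries(stream, k):
    """One pass over stream using at most k - 1 counters (k >= 2)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# The input is consumed as an iterator, so it is read only once.
candidates = misra_gries(iter(["a", "b", "a", "c", "a", "d", "a"]), k=3)
```

The returned counts are lower bounds, not exact frequencies; a second pass would be needed to verify the candidates.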
MADALGO- Four core research areas
Algorithm engineering
◦ the design and analysis of practical algorithms
◦ efficient implementation of these algorithms
◦ experimentation that provides insight into
their applicability and further improvements
SCALGO
SCALGO: SCALable alGOrithmics
Was founded in 2009 in Aarhus, Denmark
Mission: to bring cutting-edge massive
terrain data-processing technology to
market
Terrain
Terrain: The vertical and horizontal
dimensions of the land surface
LIDAR
LIDAR: Light Detection And Ranging
an optical remote sensing technology
measures the distance to, or other
properties of, a target by illuminating the
target with light
often uses pulses from a laser
Point cloud
A set of vertices in a three-dimensional
coordinate system
Usually defined by X, Y, and Z coordinates
Typically intended to be representative of
the external surface of an object
DEM
DEM: Digital elevation model
A digital model or 3D representation of a
terrain's surface
◦ The two most used types of DEM are the regular grid
and the triangulated irregular network (TIN)
Regular grid DEM
a matrix of equally spaced points with
each point having x, y and z coordinate
values
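A minimal sketch of this representation, assuming a row-major height matrix plus an origin and a cell size. The class name `GridDEM` and its fields are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GridDEM:
    x_origin: float             # x coordinate of column 0
    y_origin: float             # y coordinate of row 0
    cell_size: float            # spacing between neighbouring points
    heights: List[List[float]]  # heights[row][col] is the z value

    def point(self, row: int, col: int) -> Tuple[float, float, float]:
        """Return the (x, y, z) coordinates of the grid point at (row, col)."""
        x = self.x_origin + col * self.cell_size
        y = self.y_origin + row * self.cell_size
        return (x, y, self.heights[row][col])


dem = GridDEM(100.0, 200.0, 10.0, [[1.0, 2.0], [3.0, 4.0]])
print(dem.point(1, 1))
```

Only z is stored per point; x and y follow from the position in the matrix, which is what makes the regular grid compact.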
Regular grid DEM- Quadtree
a tree data structure in which each internal node has exactly four children
most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions
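The recursive subdivision can be sketched as a small point quadtree. This is an illustrative assumption, not the structure used by any of the projects described here: a square node splits into four equal quadrants once it holds more than `capacity` points, and a depth limit guards against endless splitting on duplicate points.

```python
class QuadTree:
    def __init__(self, x, y, size, capacity=1, depth=0, max_depth=16):
        self.x, self.y, self.size = x, y, size  # lower-left corner and side length
        self.capacity = capacity
        self.depth, self.max_depth = depth, max_depth
        self.points = []
        self.children = None  # None for a leaf, else exactly four subtrees

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half,
                     self.capacity, self.depth + 1, self.max_depth)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pending, self.points = self.points, []
        for px, py in pending:
            self._child_for(px, py).insert(px, py)

    def _child_for(self, px, py):
        half = self.size / 2
        col = 1 if px >= self.x + half else 0
        row = 1 if py >= self.y + half else 0
        return self.children[2 * row + col]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


tree = QuadTree(0, 0, 8)
for p in [(1, 1), (6, 6), (7, 7)]:
    tree.insert(*p)
```

Dense regions end up with deeper subtrees while empty regions stay as single large leaves.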
Triangulated Irregular Network (TIN)
Irregularly distributed nodes and lines
with three-dimensional coordinates
arranged in a network of non-overlapping
triangles
TIN- Delaunay triangulation
A triangulation for a set of points such
that no point is inside the circumcircle of
any triangle
maximizes the minimum angle of all the
angles of the triangles in the triangulation
tends to avoid skinny triangles
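The empty-circumcircle condition can be checked with the standard in-circle determinant. The sketch below is a plain floating-point version with hypothetical function names; robust implementations typically use exact arithmetic for this predicate.

```python
def orientation(a, b, c):
    """Positive if a, b, c are in counter-clockwise order, negative if clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of triangle a, b, c."""
    adx, ady = a[0] - d[0], a[1] - d[1]
    bdx, bdy = b[0] - d[0], b[1] - d[1]
    cdx, cdy = c[0] - d[0], c[1] - d[1]
    ad2 = adx * adx + ady * ady
    bd2 = bdx * bdx + bdy * bdy
    cd2 = cdx * cdx + cdy * cdy
    det = (ad2 * (bdx * cdy - bdy * cdx)
           - bd2 * (adx * cdy - ady * cdx)
           + cd2 * (adx * bdy - ady * bdx))
    # The sign of det assumes counter-clockwise order; correct for clockwise input.
    return det * orientation(a, b, c) > 0


a, b, c, d = (0, 0), (4, 0), (2, 1), (2, -1)
print(in_circumcircle(a, b, c, d))  # triangle a, b, c with d as the test point
print(in_circumcircle(a, d, c, b))  # triangle a, d, c with b as the test point
```

For these four points, the first call reports that d is inside the circumcircle of the skinny triangle a, b, c, so the diagonal a-b is not Delaunay, while the diagonal c-d passes the test.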
The TerraFlow Project
Emerged from experience with
terrain analysis applications that do not
scale up to large datasets
a software package for computing flow
routing and flow accumulation on massive
grid-based terrains
based on theoretically optimal algorithms
designed using external memory
paradigms
Flow direction, flow routing and flow accumulation
The flow directions of a cell correspond to the directions in which water would flow if poured at that cell onto the terrain
◦ water cannot go uphill
The flow routing problem: the problem of assigning flow directions to all cells in the DEM such that
1. flow directions do not induce any cycles;
2. every cell has a flow path off the edge of the terrain
The flow accumulation of a terrain is an index which estimates the surface runoff for each cell in the terrain
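The definitions above can be sketched on a tiny in-memory grid. This is an illustrative single-flow-direction version and not the TerraFlow algorithm: each cell points to its steepest strictly lower neighbour (so no cycles can occur), and accumulation is obtained by visiting cells from highest to lowest. Sinks and flat areas are not handled, so condition 2 is not guaranteed in general.

```python
def flow_directions(dem):
    """Map each cell (r, c) to its steepest strictly lower neighbour, or None."""
    rows, cols = len(dem), len(dem[0])
    direction = {}
    for r in range(rows):
        for c in range(cols):
            best, best_drop = None, 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                        dist = (dr * dr + dc * dc) ** 0.5
                        drop = (dem[r][c] - dem[nr][nc]) / dist
                        if drop > best_drop:  # water cannot go uphill
                            best, best_drop = (nr, nc), drop
            direction[(r, c)] = best
    return direction


def flow_accumulation(dem, direction):
    """Each cell contributes one unit; totals are pushed downhill along the directions."""
    acc = {cell: 1 for cell in direction}
    by_height = sorted(direction, key=lambda rc: dem[rc[0]][rc[1]], reverse=True)
    for cell in by_height:
        target = direction[cell]
        if target is not None:
            acc[target] += acc[cell]
    return acc


dem = [[9, 8, 7],
       [8, 6, 5],
       [7, 5, 1]]
directions = flow_directions(dem)
accumulation = flow_accumulation(dem, directions)
```

On this grid every cell drains to the lowest corner, so that corner accumulates the flow of all nine cells. On massive grids the sort by height is exactly the step that needs an I/O-efficient algorithm.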
STREAM- Introduction
STREAM: Scalable Techniques for
hi-Resolution Elevation data Analysis and
Modeling
Located in the CS department at Duke
University
Funded by the U.S. Army Research Office
STREAM- Projects
Constructing DEM
◦ developed two methods for efficiently
converting LIDAR point sets to more
conventional formats:
Grid Construction: uses a quad-tree segmentation
TIN Construction: uses a Delaunay triangulation
algorithm
Terrain Flow Modeling
◦ improvements to existing work done as part
of the TerraFlow project
STREAM- Projects
Noise Removal
◦ There is some level of noise in DEMs derived
from LIDAR
◦ The method computes a persistence score for
topological features
◦ It uses this score to remove small topological
features that are likely the result of noise
STREAM- Projects
Hierarchical Watershed Decomposition
◦ partitions a terrain into a hierarchy of nested
watersheds
STREAM- Projects
Topographic Change
◦ Detecting topographic change can quickly
identify beach dunes damaged by hurricanes,
monitor urban development or measure
change in forest growth
TerraSTREAM- Introduction
A series of libraries and front-ends for these libraries
Allows the user to perform a series of computational tasks on very large digital elevation models
The data is represented either as a TIN or a GRID
A collaboration between Duke University CS researchers and researchers at MADALGO
TerraStream- Features
DEM Construction
◦ Computes a digital elevation model (DEM)
from a point cloud
◦ The input data is typically gathered using
LIDAR
◦ Constructs both TINs and grids
TerraStream- Features
DEM Topological Conditioning
◦ Simplifies digital elevation models by first
identifying and then removing insignificant
geographical features
◦ Significance is measured by the feature's height,
area, volume, or any combination of these
◦ A feature is insignificant if its significance is
smaller than some threshold specified by the
user
TerraStream- Features
Flow Routing
◦ Compute flow directions for each data point in a DEM
◦ The routing models supported are
steepest-flow-descent
multiple-flow-directions
flux decomposition
Flow Accumulation
◦ Accumulate amounts of, e.g., water on a DEM along flow paths as computed by the flow routing module
TerraStream- Features
Flood Simulation
◦ Flood Mask
computes a mask of the cells that are flooded if the
water level were raised 'x' units
◦ General
Transforms a DEM to a new DEM
The height of each cell in the produced DEM is the
minimum height that the water level needs to be
raised to in order for that particular cell to flood
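One way to compute such a transformed DEM is a priority-queue flood from the terrain boundary. The sketch below is an assumption, not TerraStream's implementation: it assumes water enters from the edge of the grid, moves between 4-connected cells, and that the flood height of a cell is the lowest water level at which some path from the edge reaches it.

```python
import heapq


def flood_levels(dem):
    """For each cell, the minimum water level at which it floods from the grid edge."""
    rows, cols = len(dem), len(dem[0])
    level = [[None] * cols for _ in range(rows)]
    heap = []
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                level[r][c] = dem[r][c]
                heapq.heappush(heap, (dem[r][c], r, c))
    while heap:
        h, r, c = heapq.heappop(heap)  # lowest water level reached so far
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and level[nr][nc] is None:
                level[nr][nc] = max(dem[nr][nc], h)
                heapq.heappush(heap, (level[nr][nc], nr, nc))
    return level


def flood_mask(dem, x):
    """Cells flooded when the water level is raised to x (base level assumed 0)."""
    return [[lvl <= x for lvl in row] for row in flood_levels(dem)]


dem = [[5, 3, 5],
       [5, 1, 5],
       [5, 5, 5]]
levels = flood_levels(dem)
mask = flood_mask(dem, 3)
```

In this example the centre cell has height 1 but only floods at level 3, because water must first pass over the height-3 cell on the top edge.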
TerraStream- Features
Contour Map Computation
◦ Computes the contour map of a terrain
TerraStream- Features
Raster Quality Assessment
◦ takes a raster and point cloud
◦ computes how far the center of each raster cell is from the closest point in the point cloud
◦ This makes it easy to spot areas of the grid where there are no points nearby
◦ If the point cloud is the same one used to generate the input raster, this can be used for quality control of the point cloud, the classification algorithm used, and the produced raster
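A brute-force sketch of this assessment, assuming a raster described by its size, cell size and origin (all hypothetical parameter names). A real implementation on massive data would need a spatial index or an I/O-efficient algorithm instead of the nested loops used here.

```python
import math


def nearest_point_distance(rows, cols, cell_size, points, x_origin=0.0, y_origin=0.0):
    """Distance from each raster cell centre to the closest (x, y, ...) point."""
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cx = x_origin + (c + 0.5) * cell_size  # cell centre, x
            cy = y_origin + (r + 0.5) * cell_size  # cell centre, y
            row.append(min(math.hypot(cx - px, cy - py) for px, py, *_ in points))
        result.append(row)
    return result


cloud = [(0.5, 0.5, 10.0)]  # a single LIDAR point in the first cell
distances = nearest_point_distance(2, 2, 1.0, cloud)
```

Cells with large values in the output are those whose heights were interpolated far from any measured point.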
TerraStream- Features
Watershed Hierarchy Construction
◦ Construct a Pfafstetter labeling of the watersheds of a DEM
LS-Factor Computation
◦ LS-factor: an aggregate of the slope length factor (L) and the slope steepness factor (S)
◦ estimate the effects of slope length and steepness on erosion
Format Flexibility
◦ reading and writing mosaic grids in many common formats
TPIE- Introduction
TPIE: The Templated Portable I/O Environment
A tool-box providing efficient and convenient tools
To ease the implementation of algorithms and data structures on very large sets of data
The algorithms and data structures that form the core of TPIE all provide efficient worst-case space, time and disk usage guarantees
On Windows, TPIE is known to work with the Microsoft Visual Studio 2008 and 2010 compilers
TPIE- Example
Internal sorting
TPIE- Example
Reading and writing file streams
TPIE- Example
External sorting
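The following is not TPIE code and does not use TPIE's API; it is a generic Python sketch of the idea behind external sorting, under the assumption that only `memory_limit` elements fit in memory at once. Sorted runs are written to temporary files and then combined with a k-way merge. The function names are hypothetical.

```python
import heapq
import os
import tempfile


def _write_run(run):
    """Sort one in-memory run and write it to a temporary file, one integer per line."""
    run.sort()
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{value}\n" for value in run)
    return path


def external_sort(values, memory_limit):
    """Yield the integers in values in sorted order, holding at most memory_limit per run."""
    paths, run = [], []
    for value in values:
        run.append(value)
        if len(run) == memory_limit:
            paths.append(_write_run(run))
            run = []
    if run:
        paths.append(_write_run(run))
    files = [open(path) for path in paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*streams)  # k-way merge of the sorted runs
    finally:
        for f in files:
            f.close()
        for path in paths:
            os.remove(path)


print(list(external_sort([5, 3, 9, 1, 7, 2, 8], memory_limit=3)))
```

Each element is written once and read once per merge pass, which is where the blocked, sequential access pattern of the I/O-model pays off.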
TPIE- Example
Priority queue
TPIE- I/O parameters
M and B
get_block_size() implementation
TPIE- I/O parameters
Elements’ block size
◦ Pass the block factor to the constructor
The End
Thank you for your time