Upload: voduong

Post on 09-Mar-2018

Page 1: Massive Data Algorithmics - Sharifce.sharif.edu/courses/91-92/1/ce787-1/resources/root/MADALGO... · Center for MAssive Data ALGOrithmics ... core of TPIE all provide efficient worst-case

Massive Data Algorithmics An Introduction

In the name of Allah

Overview

MADALGO

SCALGO

Basic Concepts

The TerraFlow Project

STREAM

The TerraStream Project

TPIE

MADALGO- Introduction

Center for MAssive Data ALGOrithmics

A major basic research center funded by the Danish National Research Foundation

Covers all areas of the design, analysis and implementation of algorithms and data structures for processing massive data

MADALGO- Four core research areas

I/O-efficient algorithms

◦ Algorithms designed in a two-level external memory (or I/O-) model

◦ The memory hierarchy consists of a main memory of limited size M and an external memory (disk) of unlimited size

◦ The goal is to minimize the number of times a block of B consecutive elements is read from (or written to) disk (an I/O-operation, or simply I/O)
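
To make the cost measure concrete, the sketch below counts I/Os under this model. The function names are illustrative, and the sorting formula is the standard external-memory bound (N/B) log_{M/B}(N/B), which the slides do not state; constant factors are ignored.

```python
import math


def scan_ios(n, b):
    """I/Os needed to read n consecutive elements in blocks of b elements."""
    return -(-n // b)  # ceiling division on integers


def sort_ios(n, m, b):
    """Rough I/O count for external sorting: (n/b) * log_{m/b}(n/b).

    Assumes m >= 2*b so that the merge fan-out m/b is at least 2.
    """
    blocks = scan_ios(n, b)
    if blocks <= 1:
        return blocks
    passes = max(1.0, math.log(blocks) / math.log(m / b))
    return math.ceil(blocks * passes)


# Example: one million elements, memory for 10,000, blocks of 100 elements.
print(scan_ios(10**6, 100))
print(sort_ios(10**6, 10**4, 100))
```

Scanning costs N/B I/Os rather than N because each I/O moves a whole block, which is why algorithms in this model try to touch data in block-sized, sequential chunks.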

MADALGO- Four core research areas

Cache-oblivious algorithms

◦ Algorithms designed in the I/O-model, but without knowledge of M and B, and then analyzed as I/O-model algorithms

◦ Because the algorithm never uses M or B, the resulting bounds hold simultaneously on all levels of any multi-level memory hierarchy
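
A classic illustration of the idea is recursive matrix transposition, sketched below. The code never mentions M or B: it keeps halving the larger dimension until the sub-block is tiny, so at some recursion depth the sub-blocks fit in whatever cache level exists. The function name and base-case size are illustrative choices, and Python lists of lists only model the access pattern, not a real contiguous memory layout.

```python
def transpose(src, dst, r0, r1, c0, c1):
    """Write the transpose of src[r0:r1][c0:c1] into dst, recursively."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 2 and cols <= 2:
        # Constant-size base case, chosen independently of any cache size.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        mid = (r0 + r1) // 2  # split the taller side
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2  # split the wider side
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)


src = [[r * 5 + c for c in range(5)] for r in range(3)]  # 3 x 5 matrix
dst = [[None] * 3 for _ in range(5)]                     # 5 x 3 result
transpose(src, dst, 0, 3, 0, 5)
```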

MADALGO- Four core research areas

Streaming algorithms

◦ Only one sequential pass over the data (or a small constant number of passes) is allowed

◦ Solve a given problem using significantly less space than the input data size

◦ Process each data element as fast as possible
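
As one example of these constraints, the sketch below uses the Misra-Gries frequent-items summary, a well-known streaming technique that is not named in the slides. It makes a single pass, keeps at most k - 1 counters regardless of the stream length, and any item occurring more than n/k times in a stream of n items is guaranteed to survive in the summary (the stored counts are lower bounds, not exact frequencies).

```python
def misra_gries(stream, k):
    """One-pass summary holding at most k - 1 candidate frequent items."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# 'a' occurs 6 times out of 11, so it must survive with k = 2.
summary = misra_gries("aababcabada", k=2)
print(summary)
```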

MADALGO- Four core research areas

Algorithm engineering

◦ The design and analysis of practical algorithms

◦ Efficient implementation of these algorithms

◦ Experimentation that provides insight into their applicability and further improvements

SCALGO

SCALGO: SCALable alGOrithmics

Founded in 2009 in Aarhus, Denmark

Mission: to bring cutting-edge massive terrain data-processing technology to market

Terrain

Terrain: the vertical and horizontal dimensions of the land surface

LIDAR

LIDAR: Light Detection And Ranging

An optical remote sensing technology

Measures the distance to, or other properties of, a target by illuminating the target with light

Often uses pulses from a laser

Point cloud

A set of vertices in a three-dimensional coordinate system

Usually defined by X, Y, and Z coordinates

Typically intended to be representative of the external surface of an object

DEM

DEM: Digital elevation model

A digital model or 3D representation of a terrain's surface

◦ The two most commonly used types of DEM are the regular grid and the triangulated irregular network (TIN)

Regular grid DEM

A matrix of equally spaced points, each with x, y and z coordinate values

Regular grid DEM- Quadtree

a tree data structure in which each internal node has exactly four children

most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions

Triangulated Irregular Network (TIN)

Irregularly distributed nodes and lines with three-dimensional coordinates, arranged in a network of non-overlapping triangles

TIN- Delaunay triangulation

A triangulation of a set of points such that no point lies inside the circumcircle of any triangle

Maximizes the minimum angle over all the angles of the triangles in the triangulation

Tends to avoid skinny triangles
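
The empty-circumcircle condition can be checked with the standard in-circle determinant, sketched below. The helper names are illustrative; the code only verifies a given triangulation (by brute force) and does not construct one, and with floating-point input the sign test can be unreliable for nearly co-circular points.

```python
def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of triangle abc."""
    adx, ady = a[0] - d[0], a[1] - d[1]
    bdx, bdy = b[0] - d[0], b[1] - d[1]
    cdx, cdy = c[0] - d[0], c[1] - d[1]
    ad2 = adx * adx + ady * ady
    bd2 = bdx * bdx + bdy * bdy
    cd2 = cdx * cdx + cdy * cdy
    det = (adx * (bdy * cd2 - bd2 * cdy)
           - ady * (bdx * cd2 - bd2 * cdx)
           + ad2 * (bdx * cdy - bdy * cdx))
    # The sign of det assumes counterclockwise abc; flip it otherwise.
    orient = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return det * orient > 0


def is_delaunay(points, triangles):
    """Check that no point lies strictly inside any triangle's circumcircle."""
    for i, j, k in triangles:
        for idx, p in enumerate(points):
            if idx not in (i, j, k) and in_circumcircle(
                    points[i], points[j], points[k], p):
                return False
    return True


# Four points forming a flat kite; only one of its two diagonals is Delaunay.
pts = [(0, 0), (4, 0), (2, 1), (2, -1)]
print(is_delaunay(pts, [(0, 1, 2), (0, 3, 1)]))  # long diagonal
print(is_delaunay(pts, [(0, 3, 2), (1, 2, 3)]))  # short diagonal
```

Choosing the short diagonal avoids the two thin triangles produced by the long one, which is the "avoids skinny triangles" behavior in practice.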

The TerraFlow Project

Emerged from experience with terrain analysis applications that do not scale up to large datasets

A software package for computing flow routing and flow accumulation on massive grid-based terrains

Based on theoretically optimal algorithms designed using external memory paradigms

Flow direction, flow routing and flow accumulation

The flow directions of a cell correspond to the directions in which water would flow if poured at that cell onto the terrain

◦ water cannot go uphill

The flow routing problem: the problem of assigning flow directions to all cells in the DEM such that

1. flow directions do not induce any cycles;

2. every cell has a flow path off the edge of the terrain

The flow accumulation of a terrain is an index which estimates the surface runoff for each cell in the terrain
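
A small in-memory sketch of these definitions follows, using a single steepest-descent direction per cell over its eight neighbors. This is only an illustration under stated assumptions, not TerraFlow's I/O-efficient algorithm: every cell contributes one unit of water, flow goes only to a strictly lower neighbor (so no cycles can arise), and the DEM is assumed to have no interior pits, since a pit would violate condition 2.

```python
import math

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_directions(dem):
    """Map each cell to its steepest strictly lower neighbor (or None)."""
    rows, cols = len(dem), len(dem[0])
    direction = {}
    for r in range(rows):
        for c in range(cols):
            best, best_slope = None, 0.0
            for dr, dc in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Slope = height drop divided by horizontal distance.
                    slope = (dem[r][c] - dem[nr][nc]) / math.hypot(dr, dc)
                    if slope > best_slope:
                        best, best_slope = (nr, nc), slope
            direction[(r, c)] = best
    return direction


def flow_accumulation(dem):
    """Accumulate one unit of water per cell along the flow directions."""
    direction = flow_directions(dem)
    acc = {cell: 1 for cell in direction}
    # Visit cells from highest to lowest so that upstream totals are final
    # before they are passed downstream.
    for cell in sorted(direction, key=lambda rc: -dem[rc[0]][rc[1]]):
        target = direction[cell]
        if target is not None:
            acc[target] += acc[cell]
    return acc


dem = [[9, 8, 7],
       [8, 6, 4],
       [7, 4, 1]]
acc = flow_accumulation(dem)
print(acc[(2, 2)])  # the lowest corner collects water from every cell
```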

STREAM- Introduction

STREAM: Scalable Techniques for hi-Resolution Elevation data Analysis and Modeling

Located in the CS department at Duke University

Funded by the U.S. Army Research Office

STREAM- Projects

Constructing DEM

◦ Developed two methods for efficiently converting LIDAR point sets to more conventional formats:

Grid Construction: uses a quad-tree segmentation

TIN Construction: uses a Delaunay triangulation algorithm

Terrain Flow Modeling

◦ Improvements to existing work done as part of the TerraFlow project

STREAM- Projects

Noise Removal

◦ There is some level of noise in DEMs derived from LIDAR

◦ Computes a persistence score for topological features

◦ Uses this persistence score to remove small topological features that are likely the result of noise

STREAM- Projects

Hierarchical Watershed Decomposition

◦ Partitions a terrain into a hierarchy of nested watersheds

STREAM- Projects

Topographic Change

◦ Detecting topographic change can quickly identify beach dunes damaged by hurricanes, monitor urban development or measure change in forest growth

TerraStream- Introduction

A series of libraries and front-ends for these libraries

Allows the user to perform a series of computational tasks on very large digital elevation models

The data is represented either as a TIN or a GRID

A collaboration between Duke University CS researchers and researchers at MADALGO

TerraStream- Features

DEM Construction

◦ Computes a digital elevation model (DEM) from a point cloud

◦ The input data is typically gathered using LIDAR

◦ Constructs both TINs and grids

TerraStream- Features

DEM Topological Conditioning

◦ Simplifies digital elevation models by first identifying and then removing insignificant geographical features

◦ Significance is measured by the feature's height, area, volume, or any combination of these

◦ A feature is insignificant if its significance is smaller than a threshold specified by the user

TerraStream- Features

Flow Routing

◦ Compute flow directions for each data point in a DEM

◦ The routing models supported are

steepest-flow-descent

multiple-flow-directions

flux decomposition

Flow Accumulation

◦ Accumulate amounts of, e.g., water on a DEM along flow paths as computed by the flow routing module

TerraStream- Features

Flood Simulation

◦ Flood Mask

Computes a mask of the cells that are flooded if the water level were raised 'x' units

◦ General

Transforms a DEM to a new DEM

The height of each cell in the produced DEM is the minimum height to which the water level needs to be raised in order for that particular cell to flood
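
One way to picture the general transformation is the in-memory sketch below. It makes assumptions the slides do not state: water enters only from the edge of the grid, cells are 4-connected, and the water level is treated as an absolute height rather than a rise above some base level. Under those assumptions a cell floods at the smallest level for which some path from the edge stays at or below that level, which a priority queue computes by growing inward from the lowest boundary cell. This is not TerraStream's implementation.

```python
import heapq


def flood_heights(dem):
    """For each cell, the minimum water level at which it floods."""
    rows, cols = len(dem), len(dem[0])
    level = [[None] * cols for _ in range(rows)]
    heap = []
    # Seed the queue with every boundary cell at its own elevation.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                level[r][c] = dem[r][c]
                heapq.heappush(heap, (dem[r][c], r, c))
    while heap:
        h, r, c = heapq.heappop(heap)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and level[nr][nc] is None:
                # Water must clear both the path so far and the cell itself.
                level[nr][nc] = max(h, dem[nr][nc])
                heapq.heappush(heap, (level[nr][nc], nr, nc))
    return level


def flood_mask(dem, water_level):
    """Boolean mask of cells flooded at the given absolute water level."""
    return [[h <= water_level for h in row] for row in flood_heights(dem)]


# A basin behind a wall of height 5 with one gap of height 3.
dem = [[5, 5, 5, 5, 5],
       [5, 1, 1, 1, 5],
       [5, 1, 0, 1, 5],
       [5, 1, 1, 1, 5],
       [5, 5, 3, 5, 5]]
heights = flood_heights(dem)
print(heights[2][2])
```

In this example the basin floor is low, but it stays dry until the water reaches the gap in the wall, so every interior cell receives the gap's height.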

TerraStream- Features

Contour Map Computation

◦ Computes the contour map of a terrain

TerraStream- Features

Raster Quality Assessment

◦ Takes a raster and a point cloud

◦ Computes how far the center of each raster cell is from the closest point in the point cloud

◦ Makes it easy to spot areas of the grid where there are no points nearby

◦ If the point cloud is the same one used to generate the input raster, this can be used for quality control of the point cloud, the classification algorithm used, and the produced raster
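
The measurement itself is simple to state in code. The brute-force sketch below uses assumed conventions (a raster whose origin is at (0, 0), square cells of side `cell_size`, points given as (x, y) pairs) and compares every cell center with every point; a tool built for massive data would need a spatial index or an I/O-efficient method instead.

```python
import math


def nearest_point_distances(rows, cols, cell_size, points):
    """Distance from each raster cell center to its closest cloud point."""
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cx = (c + 0.5) * cell_size  # cell center, x
            cy = (r + 0.5) * cell_size  # cell center, y
            row.append(min(math.hypot(cx - px, cy - py) for px, py in points))
        result.append(row)
    return result


# A 2 x 2 raster with a single point at the center of the first cell.
dist = nearest_point_distances(2, 2, 1.0, [(0.5, 0.5)])
print(dist)
```

Cells with large values are those far from any measured point, which is exactly where the raster's elevations are least trustworthy.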

TerraStream- Features

Watershed Hierarchy Construction

◦ Construct a Pfafstetter labeling of the watersheds of a DEM

LS-Factor Computation

◦ LS-factor: an aggregate of the slope length factor (L) and the slope steepness factor (S)

◦ estimate the effects of slope length and steepness on erosion

Format Flexibility

◦ reading and writing mosaic grids in many common formats

TPIE- Introduction

TPIE: The Templated Portable I/O Environment

A toolbox of efficient and convenient tools that ease the implementation of algorithms and data structures on very large sets of data

The algorithms and data structures that form the core of TPIE all provide efficient worst-case space, time and disk usage guarantees

On Windows, TPIE is known to work with the Microsoft Visual Studio 2008 and 2010 compilers

TPIE- Example

Internal sorting

TPIE- Example

Reading and writing file streams

TPIE- Example

External sorting
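
As a rough, library-independent illustration of what external sorting does, the sketch below sorts integers with a two-phase merge sort using only the Python standard library. It is not TPIE code and does not mirror TPIE's API; `run_size` is a hypothetical parameter standing in for the memory size M. Phase one writes sorted runs that each fit in memory to temporary files, and phase two merges the runs while reading them sequentially.

```python
import heapq
import tempfile


def read_run(f):
    """Yield the integers stored one per line in an open run file."""
    for line in f:
        yield int(line)


def external_sort(items, run_size):
    """Yield items in sorted order, holding at most run_size in memory."""
    runs, buffer = [], []

    def flush():
        buffer.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{value}\n" for value in buffer)
        f.seek(0)  # rewind so the merge phase reads from the start
        runs.append(f)
        buffer.clear()

    # Phase 1: cut the input into sorted runs on disk.
    for value in items:
        buffer.append(value)
        if len(buffer) >= run_size:
            flush()
    if buffer:
        flush()

    # Phase 2: merge all runs, reading each one sequentially.
    try:
        yield from heapq.merge(*(read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()


print(list(external_sort([5, 3, 9, 1, 4, 8, 2], run_size=3)))
```

A single merge pass suffices here because all runs are merged at once; with more runs than the memory can buffer, the merge would be repeated over several passes.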

TPIE- Example

Priority queue

TPIE- I/O parameters

M and B

get_block_size() implementation

TPIE- I/O parameters

Elements’ block size

◦ Pass the block factor to the constructor

The End

Thank you for your time