datomic r-trees

Datomic R-trees James Sofra @sofra https://github.com/jsofra/datomic-rtree

Upload: jsofra

Post on 05-Dec-2014

DESCRIPTION

Slides for a talk given at Melbourne Functional Users Group on an R-tree based spatial indexer for Datomic.

TRANSCRIPT

Page 1: Datomic R-trees

Datomic R-trees

James Sofra @sofra

https://github.com/jsofra/datomic-rtree

Page 2: Datomic R-trees

Summary

● Motivations
● Datomic overview
● Datomic R-tree implementation
● Hilbert Curves
● Bulk loading (via Hilbert Curves)
● Future plans

Page 3: Datomic R-trees

Motivations

● I have an interest in geospatial applications
– e.g. Thunderstorm probability application (THESPA)
● Datomic is an interesting database that makes different trade-offs to other databases
– Wonder how far we can take the ability to describe arbitrary structures in Datomic

Page 4: Datomic R-trees

Why don't we have both?

Page 5: Datomic R-trees

Datomic Overview

● Immutable database
● Time-based facts (stored as entities)
● ACID transactions
● Expressive queries using Datalog
● Pluggable storage
● Flexible enough to act as a row, column or graph database
● Schema that describes attributes that can be attached to entities
– Attributes have a type: String, Long, Double, Inst, Ref etc.
● Database functions
– Stored in the database, see the in-transaction value

Page 6: Datomic R-trees

Datomic Overview - Architecture

Page 7: Datomic R-trees

Datomic Motivations

● Things that make Datomic appealing for spatial data
– Time-based nature of Datomic is useful for time series data, which we often have
– No need to add spatial operations (union, intersection, etc.) to the database; they can be handled by libraries in the peers
– Spatial indexes can be stored as regular data, which allows a lot of freedom over the choice of index and makes it possible to handle multiple indexes over subsets of the data in space and time
– Flexible entity structures are useful because spatial data frequently does not fit nicely in a table
– Immutability is surprisingly useful in lots of different applications!

Page 8: Datomic R-trees

R-trees

● "R-Trees: A Dynamic Index Structure for Spatial Searching"
– Guttman, A (1984)
● Efficient query of multi-dimensional data
● Groups nearby objects
● Balanced (all leaf nodes at same level)
● Aims for nodes to minimise empty space coverage and overlap
● Designed for storage on disk (as used in databases)

Page 9: Datomic R-trees

R-trees - Insertions

● Choose a leaf node to insert into
● Insert entry into leaf node and enlarge node
● If node has more than max number of children, split the node and propagate enlargement and splits up the tree

Page 10: Datomic R-trees

Datomic R-tree - Schema

:rtree/root :db.type/ref

:rtree/max-children :db.type/long

:rtree/min-children :db.type/long

:node/children :db.type/ref

:node/is-leaf? :db.type/boolean

:node/entry :db.type/ref

:bbox/min-x :db.type/double

:bbox/min-y :db.type/double

:bbox/max-x :db.type/double

:bbox/max-y :db.type/double

Page 11: Datomic R-trees

Datomic R-tree - choose-leaf
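
The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal sketch of Guttman's ChooseLeaf step, written in Python rather than the Clojure of the library. The `Node` class and the helper names are hypothetical; they only mirror the bbox and is-leaf attributes from the schema slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (min_x, min_y, max_x, max_y), mirroring the :bbox/* attributes
BBox = Tuple[float, float, float, float]


def area(b: BBox) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a: BBox, b: BBox) -> BBox:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(node_box: BBox, entry_box: BBox) -> float:
    """Extra area the node's box needs in order to cover the entry."""
    return area(union(node_box, entry_box)) - area(node_box)


@dataclass
class Node:  # hypothetical in-memory stand-in for a node entity
    bbox: BBox
    is_leaf: bool
    children: List["Node"] = field(default_factory=list)


def choose_leaf(node: Node, entry_box: BBox) -> Node:
    """Descend from the root, always taking the child that needs the
    least enlargement; ties go to the child with the smaller area."""
    while not node.is_leaf:
        node = min(
            node.children,
            key=lambda c: (enlargement(c.bbox, entry_box), area(c.bbox)),
        )
    return node


left = Node((0, 0, 1, 1), is_leaf=True)
right = Node((10, 10, 11, 11), is_leaf=True)
root = Node((0, 0, 11, 11), is_leaf=False, children=[left, right])
chosen = choose_leaf(root, (0.5, 0.5, 1.5, 1.5))
```

In this example the new entry overlaps the left leaf, so that leaf needs far less enlargement and is the one chosen.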

Page 12: Datomic R-trees

Datomic R-tree - split-node

Page 13: Datomic R-trees

Datomic R-tree - pick-seeds
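
The slide's code is missing from the transcript. The sketch below shows the PickSeeds step from Guttman's quadratic split, in Python. Whether the library uses the quadratic or the linear variant is not stated here, so treat the choice as an assumption; all names are hypothetical.

```python
from itertools import combinations


def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def pick_seeds(boxes):
    """Return the indices of the pair of boxes that would waste the most
    area if placed in the same node (quadratic PickSeeds)."""
    def waste(pair):
        i, j = pair
        return area(union(boxes[i], boxes[j])) - area(boxes[i]) - area(boxes[j])

    return max(combinations(range(len(boxes)), 2), key=waste)


# boxes are (min_x, min_y, max_x, max_y)
boxes = [(0, 0, 1, 1), (1, 1, 2, 2), (9, 9, 10, 10)]
seeds = pick_seeds(boxes)
```

The two seeds become the first entries of the two nodes produced by the split.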

Page 14: Datomic R-trees

Datomic R-tree - pick-next
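
Again the slide's code is not in the transcript. This is a hedged Python sketch of Guttman's PickNext step, assuming the quadratic split: pick the remaining entry with the strongest preference for one group, then assign it to the group that needs the least enlargement. Guttman's further tie-breaks (smaller area, then fewer entries) are left out, and the function name and return shape are hypothetical.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(group_box, entry_box):
    return area(union(group_box, entry_box)) - area(group_box)


def pick_next(group_a, group_b, remaining):
    """group_a and group_b are the current bounding boxes of the two groups.
    Returns (index into remaining, "a" or "b")."""
    def preference(i):
        return abs(enlargement(group_a, remaining[i])
                   - enlargement(group_b, remaining[i]))

    idx = max(range(len(remaining)), key=preference)
    d_a = enlargement(group_a, remaining[idx])
    d_b = enlargement(group_b, remaining[idx])
    return idx, ("a" if d_a <= d_b else "b")


choice = pick_next((0, 0, 1, 1), (9, 9, 10, 10), [(4, 4, 5, 5), (0, 0, 2, 2)])
```

The second entry sits right next to group a and far from group b, so it has the largest difference in enlargement and is assigned first.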

Page 15: Datomic R-trees

Datomic R-tree – regular transaction

Database function

New entry with new ID

Add new entry as child to leaf node

Transaction for adding new entry, calls database function

Page 16: Datomic R-trees

Datomic R-tree – split transaction

New entry

Remove root

Create new leaf nodes

Add new root

Page 17: Datomic R-trees

Bulk loading

● Issues with single insertion loading of R-tree
– Becomes slow with many insertions
– The resulting tree is not always as efficient as it could be
● Bulk loading builds a tree once from a number of entities
● Two basic approaches: top-down and bottom-up

● Bulk loading does not imply bulk insertion

Page 18: Datomic R-trees

Bulk loading – sort based loading

● Aims for better R-tree performance
● Bottom-up approach
● Sorts all entities in an order that aims to preserve locality
● Partitions the entities into clusters that are (hopefully) spatially collocated
● Recursively applies partitioning to build up the tree
● “Sort-based Query-adaptive Loading of R-trees”
– D. Achakeev; B. Seeger; P. Widmayer (2012)
● “Sort-based parallel loading of R-trees”
– D. Achakeev; M. Seidemann; M. Schmidt; B. Seeger (2012)

Page 19: Datomic R-trees

Hilbert Curves

● A continuous fractal space-filling curve
● First described by mathematician David Hilbert in 1891
● Useful because it enables mapping from 2D to 1D while preserving some notion of locality
● Other options are: Peano curve, Z-order curve (aka Morton curve)
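
The 2D to 1D mapping can be sketched with the standard iterative conversion from grid coordinates to a distance along the curve. This is a generic Python version for an n by n grid where n is a power of two; it is not taken from the library, and the name `xy2d` is just the conventional one for this routine.

```python
def xy2d(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its
    position along the Hilbert curve, in the range 0 .. n*n - 1."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Walk a 4 x 4 grid in Hilbert order.
n = 4
cells = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda c: xy2d(n, *c))
```

Each step along `cells` moves to a directly neighbouring grid cell, which is the locality property that makes the sort order useful for grouping nearby entities.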

Page 21: Datomic R-trees

Bulk loading – Hilbert sort based

● Better Hilbert partitioning

Page 22: Datomic R-trees

Bulk loading via Hilbert curves

● Insert all entities into Datomic (or use existing entities)

● Entities include an indexed Hilbert value attribute

● Obtain a seq of the entities using the :avet index with the Hilbert value

● Perform partitioning
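
The steps above can be sketched without Datomic. In this Python sketch a plain sort stands in for reading the :avet index, and the partitioning is the simplest possible one: fixed-size runs of max-children entries (plain Hilbert packing), not the cost-based partitioning covered on the following slides. The dict layout and function names are hypothetical.

```python
def union_all(boxes):
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def make_node(children, is_leaf):
    return {"bbox": union_all([c["bbox"] for c in children]),
            "is_leaf": is_leaf,
            "children": children}


def bulk_load(entities, max_children):
    """entities: list of (hilbert_value, bbox) pairs. Returns the root node."""
    if not entities or max_children < 2:
        raise ValueError("need at least one entity and max_children >= 2")
    # Stand-in for walking the indexed Hilbert attribute in order.
    ordered = sorted(entities, key=lambda e: e[0])
    level = [{"bbox": bbox} for _, bbox in ordered]
    is_leaf = True
    while True:
        # Pack consecutive runs into nodes, then repeat one level up.
        level = [make_node(level[i:i + max_children], is_leaf)
                 for i in range(0, len(level), max_children)]
        is_leaf = False
        if len(level) == 1:
            return level[0]


entities = [(3, (3, 3, 4, 4)), (0, (0, 0, 1, 1)), (4, (4, 4, 5, 5)),
            (1, (1, 1, 2, 2)), (2, (2, 2, 3, 3))]
root = bulk_load(entities, max_children=2)
```

Fixed-size runs can leave the last node of a level with fewer than min-children entries, which is one reason to choose the partition points more carefully.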

Page 23: Datomic R-trees

Bulk - hilbert-ents

Takes advantage of Datomic index API to get direct access to the Hilbert index

Page 24: Datomic R-trees

Bulk - min-cost-index

List of options for the next partition point

Must be at least min-children in the partition

Page 25: Datomic R-trees

Bulk - cost-partition

Page 26: Datomic R-trees

Bulk - p-cost-partition

Page 27: Datomic R-trees

Bulk - dyn-cost-partition
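
The slide's code is not in the transcript, so the following is only a guess at the general shape of a dynamic-programming partition, in Python. It assumes the cost of a partition is the area of its bounding box and that the goal is to minimise the total cost over a Hilbert-sorted sequence, with every partition holding between min-children and max-children entries. The cost function and all names are assumptions, not the library's definitions.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def union_all(boxes):
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def dyn_partition(boxes, min_children, max_children):
    """Split the sorted boxes into consecutive runs [start, end) that
    minimise the summed bounding-box area. Returns a list of (start, end)."""
    n = len(boxes)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal cost of partitioning boxes[:i]
    cut = [0] * (n + 1)      # cut[i]: start of the last run in that optimum
    best[0] = 0.0
    for end in range(1, n + 1):
        for size in range(min_children, min(max_children, end) + 1):
            start = end - size
            if best[start] == inf:
                continue  # the prefix before this run cannot be partitioned
            cost = best[start] + area(union_all(boxes[start:end]))
            if cost < best[end]:
                best[end] = cost
                cut[end] = start
    if best[n] == inf:
        raise ValueError("no partition satisfies the size limits")
    runs = []
    end = n
    while end > 0:
        runs.append((cut[end], end))
        end = cut[end]
    return runs[::-1]


boxes = [(0, 0, 1, 1), (1, 0, 2, 1),
         (10, 10, 11, 11), (11, 10, 12, 11), (12, 10, 13, 11)]
runs = dyn_partition(boxes, min_children=2, max_children=3)
```

Here the two boxes near the origin end up in one run and the three distant boxes in another, because splitting 3 + 2 would force one run to span both clusters.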

Page 28: Datomic R-trees

Conclusions

● It works!

(install-single-insertions conn 50000 20 10)
– "Elapsed time: 119114.342783 msecs"

(install-and-bulk-load conn 50000 20 10)
– "Elapsed time: 6511.543299 msecs"

(time (naive-intersecting all-entries search-box))
– "Elapsed time: 870.575802 msecs"

(time (intersecting root search-box))
– "Elapsed time: 2.927883 msecs"

* Note: these times should be regarded with suspicion since they only use the in-memory database
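
The gap between the two search timings comes from pruning: a tree search only descends into nodes whose bounding box intersects the search box, while the naive version tests every entry. A minimal Python sketch of both, assuming a hypothetical dict-based tree whose leaf nodes hold entry boxes directly (this is not the library's code):

```python
def intersects(a, b):
    """True when boxes (min_x, min_y, max_x, max_y) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def naive_intersecting(entries, search_box):
    return [e for e in entries if intersects(e, search_box)]


def intersecting(node, search_box):
    if not intersects(node["bbox"], search_box):
        return []  # prune the whole subtree
    if node["is_leaf"]:
        return [e for e in node["children"] if intersects(e, search_box)]
    return [hit
            for child in node["children"]
            for hit in intersecting(child, search_box)]


leaf1 = {"bbox": (0, 0, 2, 2), "is_leaf": True,
         "children": [(0, 0, 1, 1), (1, 1, 2, 2)]}
leaf2 = {"bbox": (8, 8, 10, 10), "is_leaf": True,
         "children": [(8, 8, 9, 9), (9, 9, 10, 10)]}
root = {"bbox": (0, 0, 10, 10), "is_leaf": False, "children": [leaf1, leaf2]}

search_box = (0.5, 0.5, 1.5, 1.5)
all_entries = leaf1["children"] + leaf2["children"]
tree_hits = intersecting(root, search_box)
naive_hits = naive_intersecting(all_entries, search_box)
```

Both calls return the same entries; the tree version never looks inside `leaf2`.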

Page 29: Datomic R-trees

Future plans

● Retractions and updates
● Bulk insertions
● More search and query support
● Schema for supporting Meridian Shapes and Features
● Investigate other R-trees: R* tree, R+ tree

Page 30: Datomic R-trees

Questions?

Thank you! Any questions?

James Sofra @sofra

Page 31: Datomic R-trees

Other Interesting Resources

● "The R*-tree: an efficient and robust access method for points and rectangles"
● “OMT: Overlap Minimizing Top-down Bulk Loading Algorithm for R-tree”
– T. Lee; S. Lee (2003)
● “The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree”
– L. Arge; M. de Berg; K. Yi (2004)
● “Compact Hilbert Indices”
– Hamilton, C (2006)
● “R-Trees: Theory and Applications”
– Manolopoulos, Y; Nanopoulos, A; Papadopoulos, A. N; Theodoridis, Y (2006)