What we’ll be covering…
Background on geospatial concepts
What is LocationTech?
Background on big data frameworks
Overview of LocationTech projects for processing big geo data.
Geospatial Data
Core of GIS (Geographic information system)
Raster (images, weather data)
Vector (points of interest, country boundaries)
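To make the raster/vector split concrete, here is a minimal sketch in plain Python. The `Raster` and `PointFeature` classes, their fields, and the sample values are hypothetical illustrations, not types from any LocationTech project. A raster is a grid of cell values tied to an extent, while a vector feature is a geometry with attributes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Raster:
    """A grid of cell values anchored to a geographic extent (hypothetical type)."""
    cells: List[List[float]]                   # row 0 is the northern edge
    extent: Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

    def value_at(self, x: float, y: float) -> float:
        """Return the value of the cell that contains the point (x, y)."""
        xmin, ymin, xmax, ymax = self.extent
        rows, cols = len(self.cells), len(self.cells[0])
        # Scale the coordinate into grid space; clamp points on the far edge.
        col = min(int((x - xmin) / (xmax - xmin) * cols), cols - 1)
        row = min(int((ymax - y) / (ymax - ymin) * rows), rows - 1)
        return self.cells[row][col]


@dataclass
class PointFeature:
    """A vector feature: a point geometry plus an attribute (hypothetical type)."""
    name: str
    x: float
    y: float


# A made-up 2x2 temperature raster covering the square from (0, 0) to (2, 2).
temperature = Raster(cells=[[10.0, 12.0], [14.0, 16.0]], extent=(0.0, 0.0, 2.0, 2.0))
cafe = PointFeature("cafe", 1.5, 0.5)

# Combining the two models: sample the raster at the vector point.
cafe_temp = temperature.value_at(cafe.x, cafe.y)
```

Sampling a raster at a vector location, as in the last line, is one of the simplest operations that crosses the two data models.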
Vector Data (Polygons)
Source: https://ryouready.wordpress.com/2009/11/16/infomaps-using-r-visualizing-german-unemployment-rates-by-color-on-a-map/
Feature Extraction (Image Segmentation)
Source: http://www.professeurs.polymtl.ca/christopher.pal/
Large geospatial data
Landsat 8 on AWS: 311,405 scenes @ ~800 MB each. That's 250 TB and counting.
OpenStreetMap: planet.osm is 617 GB.
3 years of geotagged tweets: 3 TB
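The Landsat figure can be checked with quick arithmetic. This sketch assumes decimal units (1 TB = 1,000,000 MB) and takes the ~800 MB per scene at face value, so the result is only approximate.

```python
# Sizes quoted on the slide; decimal units are an assumption.
scenes = 311_405
mb_per_scene = 800

landsat_tb = scenes * mb_per_scene / 1_000_000   # MB -> TB
osm_tb = 617 / 1_000                             # GB -> TB
tweets_tb = 3

total_tb = landsat_tb + osm_tb + tweets_tb
print(f"Landsat 8: about {landsat_tb:.0f} TB, all three datasets: about {total_tb:.0f} TB")
```

The Landsat archive alone comes to roughly 249 TB, which matches the rounded 250 TB on the slide.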
Project to build a better search engine, back in the early 2000’s.
Worked for small datasets, but was not scalable.
After reading Google's papers on the Google File System and MapReduce, the Nutch developers added a distributed file system and a MapReduce implementation to Nutch.
In 2006, those portions were spun out of Nutch to form…
Matei Zaharia
Worked with Hadoop at UC Berkeley
Noticed Hadoop was not a good fit for Machine Learning algorithms and other iterative models.
So in 2009, he created…
Apache Accumulo
Created by the NSA in 2008
Donated to the Apache Software Foundation in 2011
Graduated to a top-level project in 2012
Almost defunded by the US government the same year.
(Sec. 929) Prohibits any DOD component from utilizing the cloud computing database developed by the National Security Agency (NSA) and known as "Accumulo" after the end of FY2013, unless the DOD CIO certifies that: (1) there are no viable commercial open source databases that have such security features, or (2) Accumulo itself has become a successful open source database project. Requires DOD and intelligence community officials to coordinate the use by DOD components of cloud computing infrastructure and services offered by the intelligence community for purposes other than intelligence analysis.
Diagram: HDFS Name Node with three Data Nodes; Accumulo Master with three Tablet Servers
Accumulo
BigTable clone (columnar database)
Records stored on HDFS
Lexicographically sorted table index
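A lexicographically sorted index matters because it turns a query over a key range into one contiguous scan. The toy table below illustrates the idea with Python's `bisect` module. It is not Accumulo's API; the `SortedTable` class and the `tweet#YYYY-MM` row-key layout are hypothetical.

```python
import bisect


class SortedTable:
    """Toy key/value table that keeps its row keys in lexicographic order."""

    def __init__(self):
        self._keys = []     # sorted list of row keys
        self._values = {}   # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)   # insert while preserving sort order
        self._values[key] = value

    def scan(self, start, end):
        """Return all (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for key, value in [
    ("tweet#2014-03", "c"),
    ("tweet#2009-01", "a"),
    ("user#42", "d"),
    ("tweet#2011-07", "b"),
]:
    table.put(key, value)

# Because keys are sorted, a range of months is one contiguous slice.
hits = table.scan("tweet#2010", "tweet#2015")
```

The design choice that follows is that row keys should be built so that records queried together sort next to each other.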
72 frames × 14 billion points per frame = ~1 trillion points in total
Generated in three hours on a 10-node cluster
HEAT MAP FROM 2009 TO 2014 MONTH-BY-MONTH
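The numbers on this slide can be sanity-checked with a few lines of arithmetic. The sketch assumes that 2009 to 2014 is inclusive (six years of monthly frames), that the three hours are wall-clock time, and that the work was spread evenly across the ten nodes. None of these are stated on the slide.

```python
# Six years of monthly frames, assuming the range 2009 to 2014 is inclusive.
frames = (2014 - 2009 + 1) * 12
points_per_frame = 14_000_000_000

total_points = frames * points_per_frame

# Rough throughput, assuming three wall-clock hours and an even split over 10 nodes.
seconds = 3 * 60 * 60
points_per_second = total_points / seconds
points_per_node_per_second = points_per_second / 10

print(f"{frames} frames, {total_points:,} points")
print(f"about {points_per_second / 1e6:.0f} million points per second overall")
```

Under those assumptions the cluster processed on the order of 93 million points per second, or roughly 9 million per node.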
SELECT tweet.text, user.name FROM tweet, user WHERE bbox(tweet.location, -115, 45, -110, 50) AND tweet.user_id = user.user_id
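The query above combines a spatial filter with an ordinary join. Below is a minimal in-memory sketch of the same logic in Python. It assumes the `bbox` arguments are ordered (xmin, ymin, xmax, ymax) in longitude/latitude, which the slide does not state, and the sample tweets, users, and the `in_bbox` helper are made up for illustration.

```python
# Made-up sample rows; locations are (longitude, latitude).
tweets = [
    {"text": "Hello from Montana", "user_id": 1, "location": (-112.0, 46.6)},
    {"text": "Hello from Philly", "user_id": 2, "location": (-75.2, 39.9)},
]
users = {1: "alice", 2: "bob"}   # user_id -> name


def in_bbox(location, xmin, ymin, xmax, ymax):
    """True if the point lies inside the box (argument order is an assumption)."""
    x, y = location
    return xmin <= x <= xmax and ymin <= y <= ymax


# Spatial filter first, then the join on user_id.
results = [
    (t["text"], users[t["user_id"]])
    for t in tweets
    if in_bbox(t["location"], -115, 45, -110, 50) and t["user_id"] in users
]
```

A full scan like this list comprehension is exactly what a spatial index is meant to avoid once the table holds billions of rows.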
GeoTrellis
a Scala library for geospatial data types and operations.
adds geospatial capabilities to Spark (raster now, vector soon!).
stores and queries rasters on HDFS, Accumulo, and S3
Benchmark Results
439.5 GB of monthly temperature model output data
USA temperature yearly average, 2006 to 2100
40 m3.xlarge instances (estimated $2.00 USD per hour on spot market)
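The benchmark computes a yearly average from monthly rasters. As a rough illustration of that kind of cell-by-cell operation, here is a plain Python sketch. It is not GeoTrellis code and does not use Spark; the `yearly_average` function and the sample data are hypothetical, and every raster is assumed to have the same shape.

```python
def yearly_average(monthly_rasters):
    """Average a list of equally sized rasters cell by cell."""
    n = len(monthly_rasters)
    rows = len(monthly_rasters[0])
    cols = len(monthly_rasters[0][0])
    return [
        [sum(raster[i][j] for raster in monthly_rasters) / n for j in range(cols)]
        for i in range(rows)
    ]


# Twelve made-up 1x2 rasters: month m holds the values m and 2m.
monthly = [[[float(m), 2.0 * m]] for m in range(1, 13)]
average = yearly_average(monthly)
```

Each output cell depends only on the matching cell in each input raster, which is why this kind of operation splits cleanly into tiles that can be processed in parallel.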
THANK YOU
@lossyrob
gitter.im/geotrellis/geotrellis
github.com/geotrellis/geotrellis