the sequoia 2000 storage benchmark michael stonebraker presented by handy patriawan
TRANSCRIPT
The SEQUOIA 2000 Storage Benchmark
Michael Stonebraker
Presented by
Handy Patriawan
Introduction
Different types of benchmark oriented at different application areas
Business data processing (e.g. TPC-x) Represent typical needs of DBMS users (update oriented transactions, stress transaction system and overhead of simple command processing)
ECAD (Electronic Computer Aided Design) application Contains complex commands that have high locality of reference on small data set and efficiency of client-server DBMS connection
Many more See http://www.benchmarkresources.com/handbook/)
Reasons for New Benchmark Unfortunately those benchmark does not
represent all applications. One example application is Engineering and
Scientific DBMS. Why?
Massive size Complex data type Sophisticated searching
Reasons for New Benchmark (cont’d) Massive size
Usually stores substantial number of images and simulations output
Four main SEQUOIA 2000 research group has ~ 10^14 bytes (100 TB!)
Complex data type Often include multi-dimensional arrays, spatial information
and other data types (e.g. temporal data type) Usually not present in most benchmarks
Sophisticated search Current DBMS implementation search method may not be
enough for scientific & engineering community Again… usually not present. B-Tree is not good enough
Earth Science New DBMS benchmark is needed Reasons as mentioned before Categories of Earth Science (ES) research
Field studies Remote sensing Simulation
Earth Science (cont’d) Field studies
{ (longitude, latitude, elevation, array-of-value) } Santa Barbara SEQUOIA 200 group has collected field
data from Antarctic Ocean about the effect of ozone depletion on organisms
Remote sensing Value (longitude, latitude, wavelength-band, time) Thematic Mapper (TM) sensors on the Landsat
satellites sample the Earth’s surface on a 30 m x 30 m grid, 7 wavelength-band every 15 days
Earth Science (cont’d) Simulations
Array-of-values(longitude, latitude, elevation, time) General Circulation Models (GCMs) for modeling
climate.
Benchmark Data (Intro) ES researchers usually focus on problems at the
following four scales: Local (e.g. a river drainage basin) Regional (e.g. a study area of one or two states) National (e.g. a study area of one country) Earth (e.g. the study area is the whole world)
Most studies use tile based study area Coarse: several Km2/tile, simulation output Medium: 1 Km2/tile, Advance Very High Resolution
Radiometer(AVHRR) Fine: <100 m/tile, typical Thematic Mapper sensor
data
Proposed Benchmark Regional Benchmark
Focus on Regional size problem Region size is 1280 Km x 800 Km encompassing the states
of California and Nevada
National Benchmark Focus on National size problem Region size is 5500 Km x 3000 Km covers United States
Earth Benchmark The whole globe of Earth
Types of Data Raster
Capture the values of energy absorption in five wavelength bands
Point Contains the names and locations of specific
geographic features Polygon
Homogeneous landuse/landcover Directed graph
Drainage networks
Type of Data (cont’d)
Regional National Earth
Raster1.064
GB~ 17 GB 2 TB
Point 1.83 MB 27.5 MB < >
Polygon 19.1 MB 286 MB < >
Directed Graph
47.8 MB 1.1 GB < >
Benchmark Setup Data set should be stored in tertiary memory
set (e.g. optical disk robot or tape robot) Maintain the “durability” of benchmark Example, Wisconsin Benchmark that measures disk
benchmark can be fitted into memory
Benchmark Queries Eleven queries Each is written in POSTQUEL POSTQUEL queries use the following
POSTGRES schema create RASTER (time = int4, location = box, band =
int4, data = int2[][]) create POINT (name = char[], location = point) create POLYGON (landuse = int4, location = polygon) create GRAPH (identifier = int4, segment = open-
polygon)
Benchmark Queries (cont’d) Two dimensional array is broken into sub array
for storage convenience. National benchmark one array is 129 Mbytes
Five collections of benchmarks Data load Raster queries Polygon and point queries Spatial joins involving one or more types of objects recursion
Benchmark Queries (cont’d) Constants
RECTANGLE: a geographic rectangle of size 100 Km x 100 Km randomly placed in study region
BAND: a random wavelength band TIME: a random time LANDUSE: a random landuse/landcover type LOCATION: a random geographic point in the study area POINT-NAME: the name of a random point in the POINT class INT-1: an integer, set for this benchmark at 64 FLOAT-1: a floating point number, set for this benchmark at
1.0 FLOAT-2: a floating point number, set for this benchmark at
10.0
Data Load Query 1: Create and load the data base and build
any necessary secondary indexes Expose between high performance (with streaming “bulk”
copy) and low performance (just running one insert query per record)
They can differ by the factor of 10! The whole benchmark must be copied from tape to disk,
otherwise the query will run at the speed of tape R-tree indexes on location in POINT and POLYGON classes and
on segment in the GRAPH class. B-tree indexes on time and band on RASTER, name for POINT
and name for POLYGON. Index on the function, size, operating on polygon locations in
one query ??????? Time: run on 4 data sets and time to build 8 indexes ???????
Raster Queries Query 2 (time travel): Select AVHRR data for a
given wavelength band and rectangular region ordered by ascending time
retrieve (clip (RASTER.data, RECTANGLE), RASTER.time)where RASTER.band = BANDorder by ascending time
Return 26 images for the selected band, clipped to correct rectangle and ordered by ascending time
ES scientists run many queries on raster data received from satellites
Time: running the query and time for returning the data to application program (excluding putting data on the screen)
Raster Queries (cont’d) Query 3: Select AVHRR data for a given time
and geographic rectangle and the calculate an arithmetic function of the five wavelength band values for each cell in the study rectangle
retrieve (raster-avg {clip (RASTER.data), RECTANGLE})where RASTER.time = TIME
raster-avg is a user-defined function that computes a weighted average of the individual cell values in RASTER.data
ES scientists run different weighted averages for a given study area and look for the one that produces the best output
Raster Queries (cont’d) Query 4: Select AVHRR data for a given time,
wavelength band and geographic rectangle. Lower the resolution of the image by a factor of 64 to a cell size of 4 Km x 4 Km and store it as a new DBMS object
retrieve into FOO-1 (time = RASTER.time,location = RASTER.location, band = RASTER.band,data = lower-res (clip (RASTER.data, RECTANGLE), INT-1))where RASTER.time = TIME and RASTER.band = BAND
INT-1 = 64 Creating abstracts of raster data. It is useful ES scientists to
see a lower resolution of data and zoom into areas of interest.
Point and Polygon Queries Query 5: Find the POINT record that has a
specific name retrieve (POINT.all)
where POINT.name = POINT-NAME “non-spatial subsetting of POINT data on a non-spatial
attribute” A system needs to have a non-spatial indexing and able to
produce spatial and non-spatial attributes
Point and Polygon Queries Query 6: Find all polygons that intersect a
specific rectangle and store them in the DBMS retrieve into FOO-2 (POLYGON.all)
where POLYGON.location || RECTANGLE “||”: user-defined “polygon intersects rectangle” operator
that return true if the location of polygon intersects with the RECTANGLE
This query needs (some kind of ) a spatial indexing to be efficient
Point and Polygon Queries Query 7: Find all polygons that are more than
a specific size and within a specific circle retrieve (POLYGON.all)
where size(POLYGON.location) > FLOAT-1 andPOLYGON.location <|> circle (LOCATION, FLOAT-2)
FLOAT-1: threshold polygon size, 1 KmLOCATION: center of circleFLOAT-2: radius, 10 Km
<|>: return true if polygon (left operand) is inside circle (right operand)
Efficient execution requires a query optimizer capable of selecting the more restrictive clause and to execute it first.
For this benchmark, most polygons are larger than 1 Km2, so the “POLYGON.location<|>…” should be executed first
Spatial Joins Query 8: Show the landuse/landcover in a 50
Km quadrangle surrounding a given point retrieve (POLYGON.landuse, POLYGON.location)
where POLYGON.location || make-box (POINT.location, 50)and POINT.name = “POINT-NAME”
make-box: create a rectangle with radius of 50 from location
||: return true if a polygon intersects with the rectangle of interest
Joins polygon and rectangle spatial data type
Spatial Joins Query 9: Find the raster data for a given
landuse type in a study rectangle for a given wavelength band and time
retrieve (POLYGON.location, clip(RASTER.data, POLYGON.location))where POLYGON.landuse = LANDUSEand RASTER.band = BAND and RASTER.time = TIME
LANDUSE: landuse classifications desired BAND: wavelength band of desired raster data TIME: time of desired raster data Joins landuse and polygon data type and non-spatial data
type
Spatial Joins Query 10: Find the names of all points within
polygons of a specific vegetation type and create this a new DBMS object
retrieve into FOO-3 (POINT.name)where POINT.location || POLYGON.locationand POLYGON.landuse = LANDUSE
“||”: “point inside polygon” operator, allowed to be overloaded by POSTGRES
Joins point, polygon and landuse data type
Recursion Query 11: Find all segments of any waterway
that are within 20 Km downstream of a specific geographic point
retrieve into temp(GRAPH.identifier, GRAPH.segment, partial-length(GRAPH.segment, LOCATION))where LOCATION *&* GRAPH.segment
append * to temp(GRAPH.identifier, GRAPH.segment, length = temp.length + length (GRAPH.segment)where end (temp.segment) = begin (GRAPH.segment)and GRAPH.identifier notin (temp.identifier)and temp.length < 20
Recursion (cont’d) *&*: return true if a point is on a specific segment “temp” will have the final result of the query More like Breadth-First-Search Example application of this query
Dunsmuir spill query after a toxic chemical spill into Sacramento River near town Dunsmuir.
Benchmark Constraints and Reporting Conventions1. It is permissible to run this benchmark on any combination
of hardware and software, but the retail price of the hardware must be reported. The result of the benchmark should be reported as a collection of all 11 benchmarks and performance is calculated with the following formula:
performance = (total elapsed time for the benchmark)/ (retail price of hardware)
In case of incomplete test rest, the report should contain the results of the queries that could be run.
2. The new object result and data returned to application program can be discarded, no display requirement is needed. This is a storage benchmark, not a visualization benchmark.
Benchmark Constraints and Reporting Conventions (cont.)3. Benchmark can be coded in any language (e.g. use SQL for
RDBMS).The number for high-level declarative interface must be reported as well if the query is run using low level algorithmic interface to DBMS.
4. Security is still needed! Any system that does not run the application in a separate protection domain from DBMS must clearly note this fact.
Benchmark Results Not all queries can be run efficiently. The use of Z
transformation won’t help. Array simulation using blobs (binary large objects)
needs special implementation (no indexing) Conclusion? No relation DBMS and OODBMS are
included in the benchmark Contestants:
GRASS: public domain GIS written by US Army Corps of Engineers Construction Engineering Research Laboratory (CERL)
IPW: A raster-oriented image processing written at UC, Santa Barbara
POSTGRES: next-generation DBMS written at UC, Berkeley
Conclusions
Query GRASS IPW POSTGRES
1 46260 1530* 5270
2 74.0 18.9 21.2
3 117 6.0 16.0
4 4.0 0.7 6.0
5 X X 1.0
6 1.0 X 31.0**
7 13.0 X 42.7**
8 20.0 X 110**
9 5.0 X 2
10 380 X 624
11 X X X
Cost 67,300 67,300 12,000
Conclusions (cont’d) A system should not limit the object size (e.g.
one polygon consists of 5184 nodes) The fastest system on raster data is IPW,
optimize for efficiency. Better than GRASS because of selective read Better than POSTGRE because of careful pipelining without
temporary data writing to disk. GRASS is good with raster data, but the technique it
uses does not work with overlapping polygons Query Optimizer in POSTGRES is incorrectly execute the
first condition in where clause No “Swiss Cheese” implementation on POSTGRES.