D1.3.2 Continuous Report on Performance Evaluation
Project funded by the European Commission within the Seventh Framework Programme (2007 – 2013)
Collaborative Project
GeoKnow – Making the Web an Exploratory Place for Geospatial Knowledge
Deliverable 1.3.2 Continuous Report on Performance Evaluation
Dissemination Level Public
Due Date of Deliverable Month 12, 30/11/2013
Actual Submission Date 30/11/2013
Work Package WP1 – Requirements, Design, Benchmarking, Component Integration
Task T1.3 – Performance Benchmarking and Evaluation
Type Report
Approval Status Final
Version 1.0
Number of Pages 50
Filename D1.3.2_Continuous_Report_on_Performance_Evaluation.pdf
Abstract: The purpose of this deliverable is to summarize the performance evaluation of the GeoKnow components as developed within the first project year.
The information in this document reflects only the author’s views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided “as is” without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.
Project Number: 318159 Start Date of Project: 01/12/2012 Duration: 36 months
D1.3.2 – v. 1.0
Page 2
History
Version  Date        Reason                                 Revised by
0.0      19/11/2013  Initial Draft                          Mirko Spasić
0.1      21/11/2013  Initial Review                         Hugh Williams
0.2      25/11/2013  Summary, Acronyms, and Outline         Mirko Spasić
0.3      26/11/2013  Figures and Tables                     Mirko Spasić
0.4      03/12/2013  Review                                 Kostas Patroumpas
0.5      03/12/2013  Review                                 Giorgos Giannopoulos
0.6      04/12/2013  Reviewer’s comments/suggestions added  Mirko Spasić
0.7      06/01/2014  Final version                          Mirko Spasić

Author List
Organisation  Name            Contact Information
OGL           Hugh Williams   [email protected]
OGL           Mirko Spasic    [email protected]
OGL           Orri Erling     [email protected]
OGL           Ivan Mikhailov  [email protected]

Time Schedule before Delivery
Next Action     Deadline    Care of
First version   19/11/2013  Mirko Spasić (OGL)
Second version  06/01/2014  Mirko Spasić (OGL)
Executive Summary
This deliverable gives an update on the setup and configuration of the GeoKnow benchmarking laboratory and the specification of the benchmarks to be used. The benchmark has been expanded so that it can also be run against relational data. An improved procedure for migrating OSM data from PostGIS to Virtuoso is presented. Benchmark comparison results are reported from running the FacetBench program against Virtuoso (SPARQL and SQL) and PostGIS, both hosting OSM data. Analytical queries have been developed, along with the export of their results to a .dxf file.
Abbreviations and Acronyms
DXF    Drawing Interchange Format
ETL    Extract, Transform, Load
EWKT   Extended Well-Known Text
GIS    Geographic Information System
LGD    Linked Geo Data
LOD    Linked Open Data
OSM    Open Street Map
RDBMS  Relational Database Management System
SRID   Spatial Reference System Identifier
VOS    Virtuoso Open Source
WKT    Well-Known Text
Table of Contents 1. INTRODUCTION ......................................................................................................................................................... 7 1.1 OUTLINE.......................................................................................................................................................................................8
2. MIGRATION OF THE OSM DATA FROM POSTGRESQL TO VIRTUOSO....................................................... 9 2.1 GLOBAL IDEA...............................................................................................................................................................................9 2.2 SCHEMA CHOICES .....................................................................................................................................................................10 2.3 MIGRATION PROCEDURES ......................................................................................................................................................10 2.4 ETL PERFORMANCE ANALYSIS .............................................................................................................................................14
3. BENCHMARK RESULTS ..........................................................................................................................................16 3.1 DATASETS .................................................................................................................................................................................16 3.2 LGD BULK LOAD .....................................................................................................................................................................17 3.3 OSM BULK LOAD OVER SQL FEDERATION.........................................................................................................................19 3.4 VIRTUOSO SPARQL RESULTS ...............................................................................................................................................19 3.5 VIRTUOSO SQL RESULTS ........................................................................................................................................................20 3.6 POSTGIS SQL RESULTS ..........................................................................................................................................................23 3.7 RESULTS COMPARISON ...........................................................................................................................................................24
4. QUERY PLANS ...........................................................................................................................................................27 4.1 POSTGIS QUERY PLANS .........................................................................................................................................................27
5. GRID DIVISION .........................................................................................................................................................28 6. ANALYTICAL QUERIES ...........................................................................................................................................30 6.1 PRODUCING DXF .....................................................................................................................................................................31
7. CONCLUSION .............................................................................................................................................................34 8. APPENDIX ..................................................................................................................................................................35 8.1 POSTGIS QUERY PLANS...........................................................................................................................................................35 8.2 GRID DIVISION..........................................................................................................................................................................36 8.3 EXECUTION OF THE ANALYTICAL QUERIES .........................................................................................................................38 8.4 VIRTUOSO PROCEDURE FOR PRODUCING DXF FILE ..........................................................................................................49
9. BIBLIOGRAPHY ........................................................................................................................................................51
List of Figures
Figure 1: Linked Geodata Browser..................................................................................................................................7
Figure 2: Virtuoso SPARQL results...............................................................................................................................20
Figure 3: Virtuoso SQL Results (single instance)...................................................................................................22
Figure 4: Virtuoso SQL Results (cluster) ...................................................................................................................23
Figure 5: PostGIS results...................................................................................................................................................24
Figure 6: Power Run Comparison.................................................................................................................................25
Figure 7: Throughput Run Comparison .....................................................................................................................25
Figure 8: A Fragment of a Bitmap with Count of Sales ........................................................................................33
List of Tables
Table 1: ETL Performance Analysis .............................................................................................................................15
Table 2: Data Distribution................................................................................................................................................19
Table 3: Virtuoso SPARQL results ................................................................................................................................20
Table 4: Virtuoso SQL Results (single instance).....................................................................................................21
Table 5: Virtuoso SQL Results (cluster) .....................................................................................................................22
Table 6: PostGIS results ....................................................................................................................................................24
Table 7: BI Query Results .................................................................................................................................................31
1. Introduction
The primary goal of this report is to summarize the performance evaluation of the GeoKnow components (mainly Virtuoso) as developed within the first project year.
In the previous deliverable D1.3.1 (GeoKnow Consortium, 2013) we specified the setup and configuration of the GeoKnow benchmarking laboratory. We used the geospatial benchmark built in the LOD2 project (LOD2 Consortium, 2010) as a starting point, because it is focused on addressing practical challenges in the geo-browsing components developed by the University of Leipzig (browser.linkedgeodata.org). This benchmark emulates heavy drill-down style online access patterns and access to large volumes of thematic data.
Figure 1: Linked Geodata Browser
That benchmark has been developed and improved further. The improvement primarily concerns expanding the benchmark so that it applies not only to RDF data but to relational data as well. SQL queries (Virtuoso and PostGIS) were presented there, and the data migration procedures from PostGIS to Virtuoso were defined; they are enhanced here. This opens the opportunity for a performance comparison between RDF and relational spatial data management systems, which is presented in this deliverable. The intent is to run this benchmark against the planet-wide OSM dataset in both PostgreSQL and Virtuoso. For Virtuoso, we also compare the scale-out and single-server versions.
1.1 Outline
In Section 2 we describe in detail the procedure that migrates geodata from PostgreSQL to Virtuoso.
In Section 3 the benchmarking results are presented: Virtuoso in both SQL and SPARQL, and PostGIS in SQL as a point of reference.
Query plans are analyzed in Section 4, as one of the reasons why Virtuoso outperformed PostGIS by a large factor.
In Section 5, a grid division task is presented, as a new idea for how a scale-out system can be improved.
BI queries can be found in Section 6, as well as the Virtuoso procedure for producing the .dxf file.
Section 7 contains conclusions, while Section 8 is an appendix providing a link from which the benchmark programs can be downloaded, as well as the PostGIS query plans, the details of the grid division task, and the BI queries.
References are found in Section 9.
2. Migration of the OSM Data from PostgreSQL to Virtuoso
2.1 Global Idea
In order to complete a fair performance comparison between spatial data management under the relational model and under RDF, we must have the same data, or data very close in scale, in every data source. In the previous deliverable D1.3.1 (GeoKnow Consortium, 2013), we presented in detail the procedures for loading OSM data into PostgreSQL, as well as the PostGIS and Virtuoso OSM schemas. We gave some ideas on how to load the same data into Virtuoso. In the next section, we elaborate on these ideas.
We will look in detail at ETL from PostgreSQL to Virtuoso via SQL federation. Here we will see how to change the normalization between schemas, from a denormalized key-value pair structure in PostGIS to a normalized "triple table" in Virtuoso. We will also look at data type conversion, overall data transfer speed, and automatic parallelization.
ETL, even with medium data sizes, like with OSM at a little under 600 GB in PostgreSQL files, is a performance game, like everything in databases. Data must move fast, expressing the transformation logic must be compact, and parallelism must be automatic. Next to nobody can write parallel code and the few that can are needed somewhere else.
We considered three options for performing this migration:
• The first possible option is to dump the data into CSV, do some sed scripts or the like for the transformation (maybe in Hadoop, if the data is really large), and then to use the target database's bulk load utility. This makes the steps so simple that they can be delegated with some possibility of success. This is what data integration tends to be like. From our experience with the TPC-‐H bulk load (Erling, 2013), CSV loading is foolproof, easy, and fast.
• The second option is to write a JDBC program that reads one database and writes into the other. We decided not to try this, because such a program would have to be explicitly multithreaded, would have loops, would require array parameters in order not to be killed by client-server latency, would be liable to run into oddities of JDBC implementations, and so forth. Moreover, it could be a few hundred lines long, and very slow because of lock contention, transactions not being turned off, or something of the sort.
• Here, we will explore a third possibility: vectored stored procedures. We will introduce a design pattern that runs table-‐to-‐table copy and normalization changes, with perfect parallelism and scale-‐out, in SQL procedures. This will work from the file system as well, since a CSV file can be accessed as a table. For number of code lines, time-‐to-‐solution, as well as run-‐time performance, this is unbeatable.
2.2 Schema choices
Elements (or data primitives) are the basic components of OpenStreetMap’s conceptual data model of the physical world. They consist of nodes (representing specific points on the earth’s surface, defined by their latitude and longitude, e.g. a park bench, or a water well), ways (ordered list of between 2 and 2000 nodes defining linear features and area boundaries, e.g. rivers or roads), and relations (which are sometimes used to explain how other elements work together, e.g. a route relation which lists the ways that form a major highway). All types of data elements can have tags. Tags describe functions of the particular element to which they are attached. A tag consists of two free format text fields, a key, and a value. For example, highway=residential defines the way as a road whose main function is to give access to people’s homes.
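The element model just described can be sketched, purely for illustration, as a couple of Python data classes (the class and field names are our own, not part of any OSM library):

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical minimal model of the OSM primitives described above.
@dataclass
class Node:
    id: int
    lat: float                      # latitude of the point
    lon: float                      # longitude of the point
    tags: Dict[str, str] = field(default_factory=dict)

@dataclass
class Way:
    id: int
    node_ids: List[int]             # ordered list of 2..2000 node references
    tags: Dict[str, str] = field(default_factory=dict)

# A residential road: the tag key and value are free-format text fields.
road = Way(id=1, node_ids=[10, 11, 12], tags={"highway": "residential"})
print(road.tags["highway"])         # -> residential
```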
The PostgreSQL OSM implementation exists in both normalized and denormalized variants. The denormalized variant uses the hstore column type, a built-in non-first-normal-form set of key-value pairs that can occur as a column value. In Virtuoso, the equivalent would be to use an array as a column value, but this is not very efficient. Rather, we go the normalized route, getting outstanding JOIN performance and space efficiency from the column store. Since this is a freestyle race, we take the liberty of borrowing the IRI datatype from the RDF side of Virtuoso. This offers a fast mapping between names and integer identifiers, which is especially handy for tags. PostgreSQL likely has a similar encoding as part of its hstore implementation.
The geometry types are transferred as strings, and then re-parsed into the Virtuoso equivalents. The EWKT syntax is compatible between the systems. The potentially long geometries are stored in a LONG ANY column, and the always short ones (e.g., bounding boxes and points) in an ANY column. In both implementations, there is an R-Tree index (Guttman, 1984) on the points but not on the linestrings.
2.3 Migration Procedures
To ETL the PostgreSQL-based dataset, we attach the OSM tables as remote tables using Virtuoso's SQL federation (VDB) feature. This is not in the Open Source Edition (VOS), but the same effect can be achieved by dumping the tables into files and defining the files as tables with the file-table feature.
The tables which need no special transformation go with just an INSERT ... SELECT, like this:

log_enable (2);
INSERT INTO users SELECT * FROM users1 &
In this example, the table users is Virtuoso’s table, and users1 is the table attached from PostgreSQL. The first line disables logging and makes inserts non-transactional, with row-by-row autocommit enabled.
The tables which have special datatypes (like geometries or hstores) need a little application logic, like this:
CREATE PROCEDURE copy_ways ()
{
  log_enable (2);
  RETURN (
    SELECT COUNT (ins_ways (id, version, user_id, tstamp, changeset_id,
                            tags, linestring_wkt, bbox_wkt))
      FROM ways1);
}
The table ways1 is the remote attached table. The scan of the remote table is automatically split by ranges of its primary key, so there is no need for explicit parallelism. The ins_ways function is called on each thread, on a whole vector of values for each column. In this way operations are batched together, gaining by locality, and eliminating interpretation overhead.
The ins_ways procedure, with inline comments, follows:

CREATE PROCEDURE ins_ways (
  IN id BIGINT, IN version INT, IN user_id INT, IN tstamp DATETIME,
  IN changeset_id BIGINT, IN tags ANY ARRAY,
  IN linestring VARCHAR, IN bbox VARCHAR)
{
-- The vectored declaration means that each statement is run on the full input
-- before going to the next.
-- Thus, by default, the insert gets 10K consecutive rows to insert. The conversion functions
-- like st_ewkt_read are also run in a tight loop over a large number of values.
  VECTORED;
  INSERT INTO ways VALUES (
    id, version, user_id, tstamp, changeset_id,
    st_ewkt_read (charset_recode (linestring, '_WIDE_', 'UTF-8')),
    st_ewkt_read (charset_recode (bbox, '_WIDE_', 'UTF-8')));
  -- tags is a vector of strings where each string is a serialization of the hstore content.
  -- split_and_decode splits each string into an array at the delimiter.
  tags := split_and_decode (
    TRIM (
      REPLACE (
        REPLACE (
          REPLACE (
            REPLACE (tags, '"=>"', '!!!'),
            '&', '%26'),
          '", "', '&'),
        '=', '%3D'),
      '"'));
  NOT VECTORED
  {
    DECLARE a1, b1 VARCHAR;
    DECLARE ws, vs, ts ANY ARRAY;
    DECLARE n_sets, n_tags, set_no, wid, inx, pos, fill INT;
-- We insert triples of the form tag, way_id, tag_value. For each of these, we reserve an
-- array of 100K elements. We put the values into the array, and insert when full, or when all rows of
-- input are done. An insert of 100K values in one go is much faster than inserting 100K values
-- singly, especially on a cluster.
    ws := make_array (100000, 'ANY');
    ts := make_array (100000, 'ANY');
    vs := make_array (100000, 'ANY');
    fill := 0;
    DECLARE tag_arr, str ANY ARRAY;
    n_sets := vec_length (tags);
-- For each row of input to the vectored function:
    FOR (set_no := 0; set_no < n_sets; set_no := set_no + 1)
    {
      wid := vec_ref (id, set_no);
      tag_arr := vec_ref (tags, set_no);
      n_tags := LENGTH (tag_arr);
-- For each tag in the H-Store string:
FOR ( inx := 0; inx < n_tags; inx := inx + 2) {
-- split the tag into a key and a value at the !!! delimiter
        str := tag_arr[inx];
        pos := strstr (str, '!!!');
        a1 := substring (str, 1, pos);
        b1 := subseq (str, pos + 3);
-- add to the array of key-value pairs to insert
        way_tag_add (ws, ts, vs, fill, wid, a1, b1);
      }
    }
    way_tag_ins (ws, ts, vs);
  }
}
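For readers unfamiliar with Virtuoso PL, the effect of the REPLACE chain plus split_and_decode can be mimicked in Python. This is a hedged sketch: parse_hstore and the sample string are our own, and the real hstore serialization has escaping rules this ignores.

```python
from urllib.parse import unquote

# Sketch of what the REPLACE/split_and_decode chain in ins_ways does to one
# serialized hstore string such as:  "highway"=>"residential", "name"=>"Main St"
def parse_hstore(s: str):
    # same replacement order as the procedure: mark the "=>" delimiter, then
    # protect literal '&' and '=' by URL-encoding before using them as separators
    s = s.replace('"=>"', '!!!').replace('&', '%26').replace('", "', '&')
    s = s.replace('=', '%3D').strip('"')
    pairs = []
    for item in s.split('&'):                 # one item per key-value pair
        key, _, value = item.partition('!!!') # split at the !!! delimiter
        pairs.append((unquote(key), unquote(value)))
    return pairs

print(parse_hstore('"highway"=>"residential", "name"=>"Main St"'))
# -> [('highway', 'residential'), ('name', 'Main St')]
```

The URL-encoding step matters: a literal '&' inside a tag value survives the split and is decoded back at the end.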
Now, we define the functions for adding a way, key, value triple into the batch, and for inserting the batch.
CREATE PROCEDURE way_tag_ins (
  INOUT ws ANY ARRAY, INOUT ts ANY ARRAY, INOUT vs ANY ARRAY)
{
-- given an array of way ids, tag names, and tag values, insert all rows where the tag is not 0.
-- if the tag is empty, call it unknown instead.
-- the __i2id function replaces the tag name with an IRI ID that is persistently mapped to the
-- name. The insert and the tag name-to-id mapping are done as a single operation.
-- this is a single network round trip for each in a cluster setting.
  FOR VECTORED (IN wid INT := ws, IN tag ANY := ts, IN val VARCHAR := vs)
  {
    IF (tag <> 0)
    {
      IF ('' = tag)
        tag := 'unknown';
      INSERT INTO ways_tags VALUES (__i2id (tag), wid, val);
    }
  }
}
CREATE PROCEDURE way_tag_add (
  INOUT ws ANY ARRAY, INOUT ts ANY ARRAY, INOUT vs ANY ARRAY,
  INOUT fill INT, IN wid INT, INOUT tg VARCHAR, INOUT val VARCHAR)
{
-- add at the end of the arrays; if full, insert the content and replace with fresh arrays
-- the INOUT keyword means call by reference, which is important to avoid copying larger arrays,
-- and to return new ones to the caller
  ws[fill] := wid;
  ts[fill] := tg;
  vs[fill] := val;
  fill := fill + 1;
  IF (100000 = fill)
  {
    way_tag_ins (ws, ts, vs);
    fill := 0;
    ws := make_array (100000, 'ANY');
    ts := make_array (100000, 'ANY');
    vs := make_array (100000, 'ANY');
  }
}
The same logic can be applied to any simple data transformation task. Vectoring and automatic parallelism make sure that there is full platform utilization without explicitly working with threads. The NOT VECTORED {} section allows the procedure to aggregate over all the values in a vector. The FOR VECTORED construct in the INSERT function switches back into running on a vector composed in the scalar part so as to get the insert throughput and cluster-‐friendly message pattern.
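The accumulate-and-flush batching used by way_tag_add and way_tag_ins is a general pattern, independent of Virtuoso; a minimal Python sketch follows (BatchInserter and its names are illustrative, not any Virtuoso API):

```python
# Rows are buffered and handed to the sink in large batches instead of one
# round trip per row; a large batch insert is much cheaper, especially on a cluster.
class BatchInserter:
    def __init__(self, sink, batch_size=100_000):
        self.sink = sink            # callable receiving a list of rows
        self.batch_size = batch_size
        self.rows = []

    def add(self, way_id, tag, value):
        self.rows.append((way_id, tag, value))
        if len(self.rows) == self.batch_size:
            self.flush()            # buffer full: write it out

    def flush(self):
        if self.rows:
            self.sink(self.rows)
            self.rows = []          # fresh buffer, as in way_tag_ins

inserted = []
b = BatchInserter(inserted.extend, batch_size=2)
b.add(1, 'highway', 'residential')
b.add(1, 'name', 'Main St')         # second add triggers a flush
b.add(2, 'highway', 'primary')
b.flush()                           # flush the partially filled last batch
print(len(inserted))                # -> 3
```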
2.4 ETL Performance Analysis
We analyze bulk copying of the nodes from the PostGIS OSM database. The copy normalizes the denormalized tags (key-value pairs in an hstore column) in the PostGIS node table into a separate table. Besides this, it copies the node row and inserts it into a geometry index.
The ETL runs in the same cluster setup as the LGD experiments. Each of the 48 hardware threads reads a range of the PostGIS nodes table, partitions the rows, and sends them across the cluster according to their node number. This is done for each batch of 10000 consecutive nodes. A partitioned function is called in each partition that has at least one node to insert. The function inserts the nodes, parses the tags, and inserts the tags. When all inserts of the batch have returned, the next batch is fetched. This takes place on 48 concurrent threads, each running the identical operation.
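The fan-out just described, with each worker reading ranges in batches of 10000 and grouping rows by destination partition before sending, can be sketched in Python. The partitioning function, worker count, and ranges below are illustrative assumptions, not the actual Virtuoso cluster logic.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 4
BATCH = 10_000

def partition_of(node_id):
    return node_id % N_PARTITIONS       # assumed partitioning by node number

def copy_range(lo, hi, sink):
    """Read one key range in batches, grouping rows by destination partition."""
    for start in range(lo, hi, BATCH):
        groups = defaultdict(list)
        for node_id in range(start, min(start + BATCH, hi)):
            groups[partition_of(node_id)].append(node_id)
        for part, rows in groups.items():
            sink(part, rows)            # one message per partition per batch

counts = defaultdict(int)
lock = threading.Lock()

def sink(part, rows):
    with lock:                          # the stand-in for the remote insert
        counts[part] += len(rows)

with ThreadPoolExecutor(max_workers=4) as pool:
    step = 25_000
    for lo in range(0, 100_000, step):  # each worker owns one range
        pool.submit(copy_range, lo, lo + step, sink)

print(sum(counts.values()))             # -> 100000 rows copied in total
```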
The platform utilization is 20.5 cores busy on average.
Table 1 summarizes the top lines of the oprofile execution profile for a 44-minute slice of the run. We note that data copying dominates, followed by R-tree maintenance. The SQL insert operations do not make it into the top 16. A factor of 2 improvement in throughput is possible by removing extraneous data copying. Note that the top function frees a tagged piece of memory, e.g. a string or array; if these were not copied, they would not have to be freed. PL interpretation overhead is high (code_vec_run, qst_get, ...). The cluster interconnect operations are found at the bottom (not shown), accounting for under 2% altogether. In summary, scalar operations scattered around memory slow things down, as always. The dk_free_tree function is especially bad because it misses cache one line at a time when freeing the arrays representing data rows. A column-major representation, with logically contiguous data contiguous in memory as used in SQL execution itself, is much better.
Samples   %        Function
6398115   17.7354  dk_free_tree
5442236   15.0858  rd_box_union
3660531   10.1469  itc_geo_row
3409762    9.4518  dc_append_box
1726192    4.7850  box_to_any_1
1420490    3.9376  code_vec_run_v
1306923    3.6228  dc_append_bytes
1156007    3.2044  cmp_boxes_safe
1087496    3.0145  memcpy_16
 931214    2.5813  sslr_qst_get
 846778    2.3473  ap_alloc_box
 675882    1.8735  ins_for_vect
 506690    1.4045  box_deserialize_string
 429502    1.1906  itc_geo_check_link
 406177    1.1259  n_coerce
 395266    1.0957  dk_alloc

Table 1: ETL Performance Analysis
Platform utilization is lowered by each thread periodically waiting for data from PostGIS; during this time it is not available for anything else. There is a fixed set of 48 threads through the run, each servicing 1/48th of the data. Splitting the reading of the remote data into still more threads could improve platform utilization. When a thread is sending inserts to other partitions, it also receives and executes inserts for its own partition. In this way we avoid a proliferation of threads, as each of the 48 threads sends in 48 directions, potentially resulting in 48*48 concurrently executable operations. Having up to 2304 threads on 48 hardware threads is not efficient and has high transient memory consumption.
3. Benchmark results
In the previous deliverable (GeoKnow Consortium, 2013), we presented the benchmark metrics, and the reporting template containing all the metrics. In this section, we give the benchmark results. The tested systems include Virtuoso in both SQL and SPARQL and PostGIS in SQL as a point of reference.
GeoKnow benchmarks cover the following types of operations:

• Bulk load of geodata
• Bulk transformation and calculating covering grids of different resolutions
• Lookup queries combining geospatial and thematic conditions
• Analytical queries with geospatial aspects
The test platform is two machines, each with dual Xeon E5-2630 and 192 GB RAM, connected by a QDR InfiniBand interconnect. The PostGIS data is on SSD; the Virtuoso data is on 2x4 7200 rpm commodity disks. Warmup queries were run and all the data was in memory for the benchmark runs, so there is no performance difference between PostGIS and Virtuoso related to the different kinds of disks. The only reason the PostGIS data was on SSD was to speed up loading the data into Virtuoso.
3.1 Datasets
The test datasets are:
• DBpedia – miscellaneous reference data, about 800K point geometries
• Geonames – geospatial hierarchy, approx. 8M point geometries
• Natural Earth – various datasets: countries, urban areas
• Linked Geodata of Sept 2013 – 1.9G point geometries
• The SQL reference dataset is a dump of Open Street Map with 1.3G nodes.
The normalization of LGD has been changed so that nodes refer directly to their geometry, not via an extra subject that exists only for this purpose.
The Cultural Admin Countries and Cultural Urban Areas Landscan datasets from Natural Earth 10M scale have been integrated into the LGD dataset to provide national frontiers and contours of urban areas. Each square of up to 10000 features is assigned to a country and to a city if the square intersects a country or city in the corresponding Natural Earth dataset. This integration is then imported into the LGD database as both tables and triples. Each square thus has a synthetic URI <sqxxx>, where xxx is the sq_id column value in decimal. This URI occurs as the subject of the properties geo:geometry, which is the square as a rectilinear polygon; <sq-belongs-to-country>, with the country as a Geonames URI; and <sq-belongs-to-city>, with the city as a Geonames URI. This reference dataset serves to map the LGD content to recognizable countries.
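The square-to-country assignment can be illustrated with a toy sketch. Countries are approximated here by bounding boxes and all names are made up; the real integration uses Natural Earth polygons.

```python
def rects_intersect(a, b):
    # each rect is (min_x, min_y, max_x, max_y)
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Illustrative "countries": name -> bounding rectangle
countries = {"A": (0, 0, 10, 10), "B": (20, 0, 30, 10)}

def triples_for_square(sq_id, rect):
    """Emit triples for one grid square, using the synthetic <sqxxx> URI scheme."""
    subj = f"<sq{sq_id}>"
    triples = [(subj, "geo:geometry", rect)]
    for name, crect in countries.items():
        if rects_intersect(rect, crect):
            triples.append((subj, "<sq-belongs-to-country>", name))
    return triples

# Square 42 overlaps country A only, so it gets one belongs-to-country triple.
print(triples_for_square(42, (5, 5, 12, 12)))
```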
3.2 LGD Bulk Load
Dbpedia and Geonames are of insignificant size for bulk load and are loaded in about 15 minutes each. From the load history we can see:
select datediff ('second',
(select min (ll_started) from load_list where ll_file like '%dbpedia%'), (select max (ll_done) from load_list where ll_file like '%dbpedia%'));
1777
select datediff ('second',
(select min (ll_started) from load_list where ll_file like '%geonames%'), (select max (ll_done) from load_list where ll_file like '%geonames%'));
1444
The LGD dataset in its totality is about 30bn triples, of which most are not relevant for our purpose. For example, it models an OSM node as a subject that refers to a separate subject of type geometry, which has a single property: the WKT of the point. OSM itself does not normalize in this way. Besides, the URIs of the nodes and their geometry subjects both contain the same number, which is the synthetic key from OSM.
The dataset also contains a sameAs assertion of each point to a web service URI that repeats the coordinates of the point. These are never accessed but have to be stored. Some things, such as the RDF type triples, which all have the same type, do not take much space, but the URI strings with coordinates in them take a large amount of space, since they do not compress particularly well. In any case, these are sure never to be accessed.
For this reason the load rate of LGD as a whole is not a very relevant metric as it consists of redundant data that is sometimes very compressible and sometimes not. Therefore we will isolate a few different cases:
• Insert of geometries: Here each triple has a unique, never before seen URI and a unique geometry object. The predicate and graph are identical on all rows. The load rate for a run of 1.9bn triples loaded is 309Kt/s.
• A separate case occurs with the association of nodes to their geometry proxy subjects. Here every triple has a different, but pre-‐existing object and subject URI but none of these are in cache. The load rate is 410Kt/s.
Generally bulk load rates with LGD are less than with other datasets because LGD is split into files by the predicate. Most other data has different properties of the same subject in consecutive places. The latter offers locality on the subject and eliminates the overhead of resolving the subject every time. On the other hand, as most LGD properties are just for bloat it is good to partition it in this way so that one can omit whole chunks of it.
The data used in the experiments has the distribution shown in Table 2, grouped by predicate and sorted by descending count. Numbers are in millions.
Predicate Count
http://www.w3.org/2003/01/geo/wgs84_pos#geometry 2002
http://www.opengis.net/ont/geosparql#asWKT 1996
http://www.w3.org/2003/01/geo/wgs84_pos#lat 1996
http://www.w3.org/2003/01/geo/wgs84_pos#long 1996
http://geovocab.org/geometry#geometry 1987
http://www.w3.org/2002/07/owl#sameAs 738
http://linkedgeodata.org/ontology/source 124
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 106
http://linkedgeodata.org/ontology/building 61
http://www.w3.org/2000/01/rdf-schema#label 24
http://linkedgeodata.org/ontology/addr%3Ahousenumber 23
http://linkedgeodata.org/ontology/addr%3Astreet 21
http://www.w3.org/1999/02/22-rdf-syntax-ns#_1 18
http://www.w3.org/1999/02/22-rdf-syntax-ns#_0 18
http://www.w3.org/1999/02/22-rdf-syntax-ns#_2 15
http://purl.org/dc/terms/subject 15
http://purl.org/dc/terms/contributor 15
http://linkedgeodata.org/ontology/addr%3Acity 15
http://linkedgeodata.org/ontology/posSeq 14
http://linkedgeodata.org/ontology/tiger%3Acfcc 13
http://www.w3.org/1999/02/22-rdf-syntax-ns#_3 13
http://linkedgeodata.org/ontology/tiger%3Acounty 13
http://linkedgeodata.org/ontology/addr%3Acountry 12
http://linkedgeodata.org/ontology/tiger%3Areviewed 12
http://www.w3.org/1999/02/22-rdf-syntax-ns#_4 11
http://www.w3.org/ns/prov#wasDerivedFrom 11
http://www.w3.org/2000/01/rdf-schema#comment 11
http://dbpedia.org/ontology/wikiPageID 11
http://dbpedia.org/ontology/wikiPageRevisionID 11
http://dbpedia.org/ontology/abstract 10
Table 2: Data Distribution
We note that LGD has a long tail of very domain-specific predicates that cannot be used in a benchmark due to their small number of occurrences.
3.3 OSM Bulk Load over SQL Federation
The task consists of copying the OpenStreetMap SQL structures over SQL federation into equivalent Virtuoso SQL structures. The schemas are not 1:1 identical: Virtuoso uses a normalized SQL schema, while the PostgreSQL OSM implementation uses non-first-normal-form (hstore) columns for the key-value pairs representing tags.
The test has a scale-out Virtuoso importing on multiple threads from a single PostGIS instance. The PostGIS data is on 2 SSDs so as not to make the test I/O-bound on the PostGIS side. The Virtuoso data is on 8 commodity hard disks.
3.4 Virtuoso SPARQL results
In the previous deliverable D1.3.1 (GeoKnow Consortium, 2013), we presented the template that contains all the relevant metrics. In this section, we give the results of the benchmark run over RDF data loaded in Virtuoso on a cluster of computers (Table 3). Virtuoso was run in cluster mode, where one logical database (with LinkedGeoData) is served by a collection of server processes (in our case four of them) spread over a cluster of two machines.
The benchmark tested how the system behaves when it handles one user at a time (power run, the "1 stream" row) and 16 concurrent users (throughput run, the "16 streams" row). It measures how many queries can be finished per second (PagePerSec) and reports this number divided by the dollar cost of the system under test (PagePerSec/K$). These metrics are measured separately for each step (1-12) of the query workload, for different zoom levels. The geometric mean of the metrics in columns step01 to step06 is written in the column Low zoom score, indicating the ability of the system to cope with low zoom level queries. The high zoom score is reported similarly, as well as the total score.
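The score computation above can be sketched directly: the low zoom score is the geometric mean of the six low zoom PagePerSec values, and the per-K$ figure divides it by the system price. A minimal sketch, using the 1-stream step rates and the $13,000 price reported below:

```python
import math

def geo_mean(values):
    # Geometric mean: exp of the mean of the logs.
    return math.exp(sum(math.log(v) for v in values) / len(values))

# PagePerSec for steps 1-6 (the low zoom steps), 1-stream run, from Table 3.
low_zoom_rates = [0.19532, 0.47221, 1.00594, 1.18161, 2.23065, 2.19829]
price_k_dollars = 13.0  # the $13,000 system price from the table header

low_score = geo_mean(low_zoom_rates)
print(round(low_score, 4))                    # ~0.9018, the Low zoom score
print(round(low_score / price_k_dollars, 2))  # 0.07, i.e. 0.07/K$
```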
FacetBench 1.0 (SF=1)
Hardware: 2x (dual Xeon E5-2630, 2.33GHz, 192GB RAM, 8 disks)
Software: Virtuoso v7, Linux 2.6
Price: $13.000 @ November 23, 2013
database size
bulk load time
Metrics: PagePerSec, PagePerSec/K$

            step01    step02    step03    step04    step05    step06    Low zoom score
            Z=0       Z=1       Z=2       Z=3       Z=4       Z=4
1 stream    0.19532   0.47221   1.00594   1.18161   2.23065   2.19829   0.901762 (0.07/K$)
16 streams  0.141476  0.531156  1.06908   1.41945   1.91938   2.10602   0.878953 (0.07/K$)

            step07    step08    step09    step10    step11    step12    High zoom score     Total score
            Z=5       Z=5       Z=6       Z=6       Z=7       Z=7
1 stream    1.6787    1.89      3.4626    3.84468   6.38162   4.50653   3.26492 (0.25/K$)   1.71586 (0.13/K$)
16 streams  1.62373   2.46687   4.91135   4.67658   10.8606   7.404     4.41157 (0.34/K$)   1.96915 (0.15/K$)

Table 3: Virtuoso SPARQL results
From Table 3, we can conclude that the average execution time of low zoom level queries is 1.66s, while the average for the high zoom levels is 0.34s, giving a total average of 1.00s. These values are for the power run; for the throughput run with 16 users they are: low zoom level, 30.90s; high zoom level, 4.44s; total, 17.67s. A graphical representation of these numbers is given in Figure 2. For low zoom level queries, the average execution times for the power run are 18 times shorter than for the throughput run, while at high zoom levels the execution in 16 parallel streams is almost 13 times slower than in the power run. This is an expected result, because the CPU utilization in the power run already reached its peak (the system under test has 24 cores, so the CPU utilization was 2400%). Therefore, we could not expect shorter execution times in the throughput run.
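The averages above can be reproduced from the per-step rates in Table 3 if the average query time at a step is taken as the number of parallel streams divided by the step's PagePerSec rate, averaged arithmetically over the six steps of a zoom level. This is a reconstruction that matches the reported figures, not a definition given in the deliverable:

```python
# Per-step PagePerSec rates from Table 3.
low_1stream   = [0.19532, 0.47221, 1.00594, 1.18161, 2.23065, 2.19829]
high_1stream  = [1.6787, 1.89, 3.4626, 3.84468, 6.38162, 4.50653]
low_16streams = [0.141476, 0.531156, 1.06908, 1.41945, 1.91938, 2.10602]

def avg_time(rates, streams):
    # Average query time at a step is streams/rate; the zoom-level average
    # is the arithmetic mean over that level's six steps.
    return sum(streams / r for r in rates) / len(rates)

print(round(avg_time(low_1stream, 1), 2))     # 1.66 s, power run, low zoom
print(round(avg_time(high_1stream, 1), 2))    # 0.34 s, power run, high zoom
print(round(avg_time(low_16streams, 16), 2))  # ~30.9 s, throughput run, low zoom
```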
Figure 2: Virtuoso SPARQL results
3.5 Virtuoso SQL results
In this section, we present the same reporting template, but for the relational dump of OpenStreetMap with 1.3G nodes. Here we used a single Virtuoso instance, as well as the cluster configuration. The results for the single instance are given in Table 4:
FacetBench 1.0 (SF=1)
Hardware: 2x (dual Xeon E5-2630, 2.33GHz, 192GB RAM, 8 disks)
Software: Virtuoso v7, Linux 2.6
Price: $13.000 @ November 23, 2013
database size
bulk load time
Metrics: PagePerSec, PagePerSec/K$

            step01     step02     step03    step04    step05    step06    Low zoom score
            Z=0        Z=1        Z=2       Z=3       Z=4       Z=4
1 stream    0.0529207  0.0921005  0.203413  0.505894  0.927902  0.956663  0.276474 (0.02/K$)
16 streams  0.243775   0.360965   0.78531   2.01543   4.45029   4.52088   1.18727 (0.09/K$)

            step07     step08     step09    step10    step11    step12    High zoom score     Total score
            Z=5        Z=5        Z=6       Z=6       Z=7       Z=7
1 stream    2.02634    2.03004    5.71102   4.36872   13.7174   8.81834   4.80896 (0.37/K$)   1.15306 (0.09/K$)
16 streams  10.4645    10.7674    32.2244   25.8124   87.8918   54.6322   27.6458 (2.13/K$)   5.72914 (0.44/K$)

Table 4: Virtuoso SQL Results (single instance)
In the power run, the average execution time of low zoom level queries is 6.46s, but for the high zoom levels the average is lower, as expected: 0.26s. The total average time in the power run is 3.36s. In the throughput run, the average values for low zoom level queries, high zoom level queries, and the total average are 24.23s, 0.77s and 12.50s, respectively. The queries running in isolation thus executed more than 3 times faster. This is an expected result as well, because the CPU utilization in the power run was not that high. These values are shown in Figure 3.
Figure 3: Virtuoso SQL Results (single instance)
Later in this section, we present the results of the same benchmark, but with Virtuoso running in cluster mode (4 processes, 2 machines). The reporting template is shown in Table 5.
FacetBench 1.0 (SF=1)
Hardware: 2x (dual Xeon E5-2630, 2.33GHz, 192GB RAM, 8 disks)
Software: Virtuoso v7, Linux 2.6
Price: $13.000 @ November 23, 2013
database size
bulk load time
Metrics: PagePerSec, PagePerSec/K$

            step01    step02    step03    step04    step05    step06    Low zoom score
            Z=0       Z=1       Z=2       Z=3       Z=4       Z=4
1 stream    0.631473  1.42531   4.32713   8.96861   15.3139   14.245    4.43334 (0.34/K$)
16 streams  1.92231   3.21515   4.77236   7.49611   12.4098   12.818    5.71997 (0.44/K$)

            step07    step08    step09    step10    step11    step12    High zoom score     Total score
            Z=5       Z=5       Z=6       Z=6       Z=7       Z=7
1 stream    21.1416   22.4719   56.4972   50.5051   100       78.7402   46.8512 (3.60/K$)   14.4121 (1.11/K$)
16 streams  24.4641   25.5404   67.2199   56.166    138.739   95.5892   56.0432 (4.31/K$)   17.9043 (1.38/K$)

Table 5: Virtuoso SQL Results (cluster)
In the power run, the average execution times of low zoom level queries, high zoom level queries, and the total average are 0.46s, 0.03s, and 0.24s, respectively, while in the throughput run these numbers are 3.55s, 0.35s, and 1.95s, which is about 8 times slower. All of these values are graphically presented in Figure 4.
Figure 4: Virtuoso SQL Results (cluster)
3.6 PostGIS SQL results
In this section, we give the benchmark results of PostGIS in SQL as a point of reference. The dataset being tested is almost the same as the one in the previous section. The reporting template is shown in Table 6.
FacetBench 1.0 (SF=1)
Hardware: 2x (dual Xeon E5-2630, 2.33GHz, 192GB RAM, SSD)
Software: PostgreSQL 9.1 with PostGIS 1.5.8, Linux 2.6
Price: $13.000 @ November 23, 2013
database size
bulk load time
Metrics: PagePerSec, PagePerSec/K$

            step01    step02    step03    step04    step05    step06    Low zoom score
            Z=0       Z=1       Z=2       Z=3       Z=4       Z=4
1 stream    0.00109   0.00589   0.01194   0.01681   0.02321   0.24638   0.00952 (0.0007/K$)
16 streams  0.00964   0.04613   0.11510   0.19664   0.29607   0.30318   0.09842 (0.0076/K$)

            step07    step08    step09    step10    step11    step12    High zoom score       Total score
            Z=5       Z=5       Z=6       Z=6       Z=7       Z=7
1 stream    0.05367   0.06179   0.17985   0.18241   1.53304   1.069     0.23738 (0.0183/K$)   0.04754 (0.0037/K$)
16 streams  0.74051   0.75399   5.28043   4.3347    23.3231   15.166    4.06399 (0.3126/K$)   0.63242 (0.0486/K$)

Table 6: PostGIS results
In the power run, the average execution times of low zoom level queries, high zoom level queries, and the total average are 218.96s, 7.91s and 113.43s, respectively. In the throughput run, the corresponding numbers are only slightly higher (by 8% to 77%): 388.94s, 8.55s and 198.75s. This ratio is reasonable because the CPU utilization in the power run was very low in this case. Figure 5 contains a graph of the average execution time of each step in the workload.
Figure 5: PostGIS results
3.7 Results Comparison
In this section, we summarize the results collected in the preceding sections. We compare the four systems separately on the power run and on the throughput run.
In Figure 6, the power run comparison is presented. Virtuoso, in both SQL and SPARQL, outperformed PostGIS by a large factor. Specifically, all the queries in the power run executed 33 times slower in PostGIS than in Virtuoso SQL (single server). If we compare PostGIS with Virtuoso SPARQL, the factor is even greater: 131 for low zoom level queries, 23 for high zoom level queries, and 113 in total. Comparing Virtuoso SPARQL and SQL (single server), the relational version is almost 4 times slower on low zoom level queries, while it is 23% faster on high zoom levels; in total, the SQL version is more than 3 times slower. However, comparing Virtuoso SPARQL and SQL in the cluster configuration, SQL is more than 3 times faster on low zoom levels, more than 13 times faster on high zoom levels, and more than 4 times faster in total. The largest factor is therefore between PostGIS and Virtuoso SQL in the cluster setting (more than 466).
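The quoted power run factors follow directly from the per-category averages reported in the preceding sections; a small cross-check sketch (numbers copied from sections 3.4-3.6, small differences against the quoted factors come from rounding):

```python
# Per-category average execution times in seconds (low zoom, high zoom, total),
# power run, as reported in sections 3.4-3.6.
postgis         = (218.96, 7.91, 113.43)
virtuoso_sparql = (1.66, 0.34, 1.00)
virtuoso_sql    = (6.46, 0.26, 3.36)

def factors(slower, faster):
    # How many times slower the first system is in each category.
    return tuple(round(s / f, 1) for s, f in zip(slower, faster))

print(factors(postgis, virtuoso_sparql))  # (131.9, 23.3, 113.4)
print(factors(postgis, virtuoso_sql))     # (33.9, 30.4, 33.8)
```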
Figure 6: Power Run Comparison
In Figure 7, the throughput run comparison is shown. Virtuoso, in both variants, again outperformed PostGIS, but not by as large a factor as in the previous case. On low zoom levels, the factor was more than 16 for the SQL version (single server) and 12.6 for the SPARQL version; on high zoom levels, it was 11 for SQL (single server), but for SPARQL it was almost 2. From Figure 7, it is obvious that PostGIS was slightly faster than Virtuoso SPARQL on the highest zoom level. Taking into account all steps of the workload, PostGIS was almost 16 times slower than Virtuoso SQL (single server) and more than 11 times slower than Virtuoso SPARQL. Comparing the Virtuoso versions, on low zoom level queries the SQL version (single server) was 22% faster, while on high zoom levels it was almost 6 times faster; in total, the SQL version (single server) is 30% faster. Virtuoso running on the cluster was 6 times faster than on a single server, and more than 100 times faster than PostGIS.
When analyzing these results, one should bear in mind that Virtuoso SPARQL was tested only on a cluster of computers.
Figure 7: Throughput Run Comparison
4. Query Plans
One of the reasons why PostGIS is much slower than Virtuoso is its query planner.
4.1 PostGIS Query Plans
For most queries, the PostGIS query planner did not choose the optimal query plan, which is why the average execution times are poor compared with the Virtuoso results. Take for example the facet count query:
EXPLAIN ANALYZE select t.type, count(*) as cnt
from nodes as n, node_types as t
where n.id=t.node_id
and ST_Intersects(geom, ST_MakeEnvelope(LONGITUDE-WIDTH/2, LATITUDE-HEIGHT/2, LONGITUDE+WIDTH/2, LATITUDE+HEIGHT/2, 4326))
group by t.type order by cnt desc limit 50
Almost all the facet count queries have the query plan stated in Appendix 8.1; the exceptions are the queries from the highest zoom level. In the problematic query plan, we have a hash join of the tables nodes and node_types, where in the "build" phase of the algorithm the hash table is built over the relation nodes, and the relation node_types is then scanned sequentially in the "probe" phase. This is wrong, because nodes is the larger relation. The average execution time of these queries at the lowest zoom level is about 1000s.
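The build/probe asymmetry is easy to see in a toy hash join: the hash table should be built on the smaller input so that the larger one is only streamed through. A minimal illustration, not PostgreSQL's implementation; the table names follow the OSM schema, the rows are made up:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash the build relation on its join key.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: stream the probe relation and look up matches.
    return [{**b, **p}
            for p in probe_rows
            for b in table.get(p[probe_key], [])]

nodes = [{"id": i, "geom": (float(i), float(i))} for i in range(1000)]  # larger relation
node_types = [{"node_id": 7, "type": "amenity"},
              {"node_id": 42, "type": "shop"}]                          # smaller relation

# Correct build-side choice: hash the 2-row node_types, stream the 1000-row
# nodes. The plan criticised above does the opposite.
joined = hash_join(node_types, nodes, "node_id", "id")
print(len(joined))  # 2
```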
If we disable the query planner's use of hash-join plans, which is on by default, with
SET enable_hashjoin = false;
we get a merge join instead. In this case, the execution is about 20% faster, but even then Virtuoso is significantly faster.
If we disable the use of merge-join plans as well, with
SET enable_mergejoin = false;
we get a nested loop instead. This brings a slight further improvement in execution time compared with the merge join.
Queries from the highest zoom level have the correct query plan (Appendix 8.1): a nested loop using the geo index over nodes. This leads to comparable execution times, noticeable in every figure and table in sections 3.6 and 3.7.
The same analysis applies to all other queries (instance queries and instance aggregation queries).
5. Grid Division
This task divides the globe into squares of some fraction of a degree on the side, in such a way that no square contains more than a set number of points. The initial setting is a 30x30 division into squares of 12 degrees on the side. The process is iterative, dividing a square into 4 equal (in terms of angle) squares whenever it has more points than the set limit.
The task accesses the totality of the geodata in the system and has high demand for throughput but is only moderately sensitive to latency.
The main query on each iteration is:
insert into geo_stat (gs_sq_id, gs_geo, gs_cnt)
select sq_id, sq_geo, count (*)
from geo_square, rdf_quad
where sq_status = 1 and st_intersects (sq_geo, o)
group by sq_id
having count (*) > grain option (order);
This is a spatial join between the squares that had more than the desired count and the totality of the RDF geometries: all geometries intersecting a square are retrieved and grouped by the id of the square. Only those squares that exceed the count are returned for processing in the next iteration. The algorithm stops when all squares are below the count. The access pattern is a drill-down into all densely populated locations at the same time.
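The subdivision loop can be sketched in a few lines of Python, operating on an in-memory point list rather than on the RDF store; the 30x30 initial grid of 12-degree squares follows the description above, and the synthetic points are made up:

```python
def count_in(points, sq):
    x0, y0, side = sq
    return sum(1 for x, y in points if x0 <= x < x0 + side and y0 <= y < y0 + side)

def grid_divide(points, limit, start_side=12.0):
    # Initial 30x30 division into 12-degree squares, then every square holding
    # more than `limit` points is split into 4 equal quarters, iterating until
    # no square is over the limit.
    squares = [(-180.0 + i * start_side, -180.0 + j * start_side, start_side)
               for i in range(30) for j in range(30)]
    done = []
    while squares:
        overfull = []
        for sq in squares:
            if count_in(points, sq) > limit:
                x0, y0, side = sq
                h = side / 2
                overfull += [(x0, y0, h), (x0 + h, y0, h),
                             (x0, y0 + h, h), (x0 + h, y0 + h, h)]
            else:
                done.append(sq)
        squares = overfull
    return done

# A dense synthetic cluster of 500 points near (10, 50).
pts = [(10 + (k % 17) * 1e-3, 50 + (k % 13) * 1e-3) for k in range(500)]
cells = grid_divide(pts, limit=100)
print(all(count_in(pts, sq) <= 100 for sq in cells))  # True
```

Because each split replaces a square by four half-open quarters covering exactly the same area, the final squares partition the grid, so every point still falls into exactly one square.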
The task is first run with a cold geo index. The cluster status summaries before and after can be found in Appendix 8.2.
The first pass is then repeated with a warm cache, followed by the whole run. The world is split into squares until no square has over 20000 points. After each iteration, the count of squares with over 20000 points, as well as the total number of points within these squares, is given.
We note that each geometry point is counted twice, since it occurs in the object position of a triple twice: once for the geometry and once for the denormalization where the node refers directly to its geometry.
At intervals we show the cluster status summary with CPU and interconnect utilization; this is not repeated for all points. We note that as soon as the working set is in memory there is near perfect platform utilization. For the first few iterations, the number of points does not decrease and the run time increases slightly. This is because the number of distinct squares, and hence of distinct geo lookup keys, increases, i.e. more lookups retrieve the same number of points. The increase is small, though, as vectoring and other techniques absorb the overhead. After this, the times start dropping as the number of squares with over 20000 points, and their share of the total point population, starts dropping. Here we see the selective part of the geo lookups.
At the end of the division, the following query summarizes the task:
select top 100 floor (log (st_area (sq_geo))) as a, count (*)
from geo_square
group by a
order by 2;
a aggregate
INTEGER INTEGER NOT NULL
_______________________________________________________________________________
-13 4
-14 16
3 336
-12 347
4 626
2 966
0 3548
-1 9095
-2 19499
-11 20272
-4 45051
-9 59074
-8 96539
-5 97976
-7 144831
After this, each point belongs to exactly one grid square. The grid square can be efficiently determined for a given point by a lookup in a small R-tree. The count of distinct squares is at most in the millions, hence the squares themselves can easily be replicated on all nodes of a scale-out system.
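The point-to-square lookup can be sketched without a full R-tree: since every final square nests inside one of the initial 12-degree cells, bucketing the squares by that cell already gives a fast lookup. A hypothetical illustration; the square coordinates are made up:

```python
from collections import defaultdict

# Final non-overlapping grid squares as (x0, y0, side).
squares = [(0.0, 0.0, 12.0), (12.0, 0.0, 6.0), (18.0, 0.0, 6.0),
           (12.0, 6.0, 6.0), (18.0, 6.0, 6.0)]

# Bucket each square under the 12-degree cell it nests in.
index = defaultdict(list)
for x0, y0, side in squares:
    index[(int(x0 // 12), int(y0 // 12))].append((x0, y0, side))

def square_of(x, y):
    # Each point belongs to exactly one square, so only the candidates under
    # the point's top-level 12-degree cell need to be checked.
    for x0, y0, side in index[(int(x // 12), int(y // 12))]:
        if x0 <= x < x0 + side and y0 <= y < y0 + side:
            return (x0, y0, side)
    return None

print(square_of(13.5, 2.0))  # (12.0, 0.0, 6.0)
```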
6. Analytical Queries
In order to make informed business decisions, there is a need to turn the data in a corporate database into useful information. The following BI queries demonstrate this. Queries touching large fractions of the data are primarily useful for checking consistency and for high-level data summarization. Business intelligence analytics in this context would be more scoped, for example to specific countries, or to rural and urban areas.
We begin with data summarization questions and comparisons between datasets:
• Q1: For each country, show the total count of features in DBpedia, Geonames and OSM.
sparql select ?cname
(sum(if (?g_graph = <lgd_ext>, 1, 0)) as ?n_lgd)
(sum(if (?g_graph = <http://dbpedia.org>, 1, 0)) as ?n_dbp)
(sum(if (?g_graph != <http://dbpedia.org> && ?g_graph != <lgd_ext> && ?g_graph != <sqs>, 1, 0)) as ?n_geo)
where {
graph ?g_graph { ?feature geo:geometry ?sgeo . } .
?sq geo:geometry ?sqgeo .
filter (bif:st_intersects (?sqgeo, ?sgeo))
?sq <sq-belongs-to-country> ?country .
?country <http://www.geonames.org/ontology#name> ?cname
} group by ?cname order by desc 2
• Q2: For each country, show the count of offers (amenities for sale), with the total price, sorted by the count of them.
sparql select ?country, count(1), sum (?sale_price)
where {
?offer a <http://linkedgeodata.org/ontology/Offer> ;
<http://linkedgeodata.org/ontology/subject> ?re_subj ;
<http://linkedgeodata.org/ontology/sale_price> ?sale_price .
?re_subj geo:geometry ?sgeo .
?sq geo:geometry ?sqgeo .
filter (bif:st_intersects (?sqgeo, ?sgeo))
?sq <sq-belongs-to-country> ?country .
} group by ?country order by desc 2
For this, sale events are generated so that an amenity has a 1/20 chance of being for sale in each of the past 10 years, i.e. an amenity is expected to have been for sale 0.5 times in the past decade. A surface area is randomly chosen between 100 and 1000. The price is ±50% of a country-dependent average, and being within 20 km of a city doubles the price.
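The generation rules above can be sketched as follows; `near_city` stands in for the 20 km distance test, and treating the country average as a per-offer price (rather than per square meter) is an assumption for illustration:

```python
import random

def generate_offers(amenity_ids, country_avg_price, near_city, seed=1):
    # Each amenity has a 1/20 chance of a sale event in each of the past 10
    # years; the surface area is uniform in [100, 1000]; the price is the
    # country-dependent average +/-50%, doubled within 20 km of a city
    # (`near_city` is a hypothetical stand-in for that distance test).
    rng = random.Random(seed)
    offers = []
    for aid in amenity_ids:
        for year in range(2004, 2014):
            if rng.random() < 1 / 20:
                area = rng.uniform(100, 1000)
                price = country_avg_price * rng.uniform(0.5, 1.5)
                if near_city(aid):
                    price *= 2
                offers.append((aid, year, area, price))
    return offers

offers = generate_offers(range(10000), 1000.0, near_city=lambda aid: aid % 2 == 0)
print(round(len(offers) / 10000, 1))  # 0.5 sale events per amenity on average
```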
• Q3: Count of all features and count of features that belong to some country.
sparql select count (*) count (?country)
where {
?feature geo:geometry ?sgeo .
graph <sqs> { ?sq geo:geometry ?sqgeo . } .
filter (bif:st_intersects (?sqgeo, ?sgeo)) .
optional { ?sq <sq-belongs-to-country> ?country . }
}
The execution times of these queries are listed in Table 7.
Query name Time in ms
Q1 200764
Q2 131916
Q3 364100
Table 7: BI Query Results
The results, execution times and query plans of these queries can be found in Appendix 8.3. Some similar BI queries could be:
• What is the ratio of the count of points within 10 km of a city center over all the points in the world?
• As above, except now we count the unique points within 10km of any city center in the country of the point and divide by the total number of points in the country. Group by country, sort by highest percentage first. Note that there can be points that are within 10km of more than one city center.
• List countries ranked by the count of points divided by the population. Where are the most active contributions?
• Return the top 100 squares with the highest price of retail space per square meter for countries in a region.
6.1 Producing DXF
Textual output is not the most convenient format for the results of these queries; it would be more useful if the results could be displayed on a map. In this section, we present a way of doing this.
Drawing Exchange Format (DXF) is a file format for graphics information that enables data interoperability between AutoCAD and other programs. It is widely used as a de facto standard on PC-based CAD/CAM platforms. DXF enables vector data exchange as well as 2D and 3D graphics drawing.
Here, we detail the Virtuoso procedure that produces a .dxf file containing a map with the integrated features requested by a BI query. As an example, we consider the following query:
select ?city, <sql:BEST_LANGMATCH>(?cityoname,
"en-gb;q=0.8, en;q=0.7, fr;q=0.6, *;q=0.1", "") as ?city_official_name,
count(1) as ?sales_count,
avg(?re_subj_lat) as ?avg_lat, avg (?re_subj_long) as ?avg_long
where {
?offer a <http://linkedgeodata.org/ontology/Offer> ;
<http://linkedgeodata.org/ontology/subject> ?re_subj ;
<http://linkedgeodata.org/ontology/sale_price> ?sale_price .
?re_subj geo:lat ?re_subj_lat ; geo:long ?re_subj_long .
graph <sq-city> { ?sq <sq-belongs-to-city> ?city }
filter (?sq =
<(NUM,NUM)SHORT::sql:xy_square_iid> (?re_subj_long, ?re_subj_lat))
?city <http://www.geonames.org/ontology#officialName> ?cityoname
}
group by ?city
This query summarizes the offers per city (the count of them) and returns their average latitude and longitude. If we export the results of the query to a .dxf file, a fragment of a bitmap rendered by software handling this kind of file could look as shown in Figure 8. The colors and sizes of the city markers are chosen depending on the count of offers in the city in question.
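Such an export can be sketched as a minimal ENTITIES-only DXF writer, emitting one CIRCLE per city; the radius and colour formulas below are made-up placeholders for the size and colour choices mentioned above, not the actual Virtuoso procedure:

```python
def cities_to_dxf(rows):
    # Minimal ENTITIES-only DXF (R12 style): one CIRCLE per city, with the
    # radius and colour index derived from the sales count. The scaling
    # formulas are hypothetical placeholders.
    lines = ["0", "SECTION", "2", "ENTITIES"]
    for lat, lon, sales in rows:
        radius = 0.1 + 0.1 * min(sales, 2500) ** 0.5  # marker size from count
        color = 1 + min(sales // 100, 6)              # AutoCAD colour index 1-7
        lines += ["0", "CIRCLE", "8", "0",            # entity on layer 0
                  "62", str(color),                   # colour
                  "10", str(lon), "20", str(lat),     # centre x (long), y (lat)
                  "40", str(round(radius, 3))]        # radius
    lines += ["0", "ENDSEC", "0", "EOF"]
    return "\n".join(lines) + "\n"

dxf = cities_to_dxf([(48.86, 2.35, 350), (52.52, 13.40, 120)])
print(dxf.count("CIRCLE"))  # 2
```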
Figure 8: A Fragment of a Bitmap with Count of Sales
The Virtuoso procedure producing this .dxf file can be found in Appendix 8.4. Its execution time was 56s for the map of the whole world.
7. Conclusion
In this deliverable, we presented an update of the configuration of the GeoKnow Benchmarking System. We improved the benchmark and expanded it to relational data as well, and improved the migration procedure for OSM data from PostGIS to Virtuoso. We used this Benchmarking System to evaluate the performance of the RDF stores and RDBMSes (Virtuoso and PostGIS) and to compare the results between them. We presented PostGIS query plans as one of the possible reasons why PostGIS is much slower than Virtuoso. The grid division task was specified as a new idea for how a scale-out system can be improved. The BI queries were specified, developed, and measured, as well as a way of producing a .dxf file from the results of the queries.
8. Appendix
All the scripts and programs from the GeoKnow Benchmark are available as a git project: https://github.com/GeoKnow/GeoBenchLab
The current version of the migration scripts that transfer the OSM data from PostgreSQL to Virtuoso is available here:
https://dl.dropboxusercontent.com/u/27316106/migration.tar.gz
8.1 PostGIS query plans
Query plan of all the facet count queries (except queries from the highest zoom level):
Limit (cost=47835546.24..47835546.37 rows=50 width=101) (actual time=1125920.197..1125920.210 rows=50 loops=1)
-> Sort (cost=47835546.24..47835547.30 rows=423 width=101) (actual time=1125920.195..1125920.200 rows=50 loops=1)
Sort Key: (count(*))
Sort Method: top-N heapsort Memory: 32kB
-> HashAggregate (cost=47835527.96..47835532.19 rows=423 width=101) (actual time=1125919.583..1125919.780 rows=909 loops=1)
-> Hash Join (cost=44785326.16..47831863.32 rows=732929 width=101) (actual time=1026163.888..1124232.847 rows=2311400 loops=1)
Hash Cond: (t.node_id = n.id)
-> Seq Scan on node_types t (cost=0.00..761033.86 rows=27936686 width=109) (actual time=0.010..12868.866 rows=27936634 loops=1)
-> Hash (cost=44246225.23..44246225.23 rows=32859434 width=8) (actual time=1024822.421..1024822.421 rows=96122281 loops=1)
Buckets: 4096 Batches: 4096 (originally 2048) Memory Usage: 1025kB
-> Bitmap Heap Scan on nodes n (cost=3137472.07..44246225.23 rows=32859434 width=8) (actual time=238660.910..972205.201 rows=96122281 loops=1)
Recheck Cond: (geom && '0103000020E610000001000000050000001361C3D32B6599BF6688635DDCD648401361C3D32B6599BF6688635DDC164B404F1E166A4DF321406688635DDC164B404F1E166A4DF321406688635DDCD648401361C3D32B6599BF6688635DDCD64840'::geometry)
Filter: _st_intersects(geom, '0103000020E610000001000000050000001361C3D32B6599BF6688635DDCD648401361C3D32B6599BF6688635DDC164B404F1E166A4DF321406688635DDC164B404F1E166A4DF321406688635DDCD648401361C3D32B6599BF6688635DDCD64840'::geometry)
-> Bitmap Index Scan on idx_nodes_geom (cost=0.00..3129257.21 rows=98578313 width=0) (actual time=238636.331..238636.331 rows=96147748 loops=1)
Index Cond: (geom && '0103000020E610000001000000050000001361C3D32B6599BF6688635DDCD648401361C3D32B6599BF6688635DDC164B404F1E166A4DF321406688635DDC164B404F1E166A4DF321406688635DDCD648401361C3D32B6599BF6688635DDCD64840'::geometry)
Query plan of the facet count queries from the highest zoom level:
Limit (cost=424622.60..424622.73 rows=50 width=101) (actual time=393.969..393.981 rows=50 loops=1)
-> Sort (cost=424622.60..424622.93 rows=133 width=101) (actual time=393.968..393.973 rows=50 loops=1)
Sort Key: (count(*))
Sort Method: quicksort Memory: 37kB
-> HashAggregate (cost=424616.85..424618.18 rows=133 width=101) (actual time=393.865..393.882 rows=89 loops=1)
-> Nested Loop (cost=0.00..424616.19 rows=133 width=101) (actual time=2.215..392.729 rows=1232 loops=1)
-> Index Scan using idx_nodes_geom on nodes n (cost=0.00..76852.86 rows=5948 width=8) (actual time=0.415..124.744 rows=29344 loops=1)
Index Cond: (geom && '0103000020E6100000010000000500000034BA83D899C211408048BF7D1DF0494034BA83D899C21140F853E3A59BF449400ABFD4CF9B0A1240F853E3A59BF449400ABFD4CF9B0A12408048BF7D1DF0494034BA83D899C211408048BF7D1DF04940'::geometry)
Filter: _st_intersects(geom, '0103000020E6100000010000000500000034BA83D899C211408048BF7D1DF0494034BA83D899C21140F853E3A59BF449400ABFD4CF9B0A1240F853E3A59BF449400ABFD4CF9B0A12408048BF7D1DF0494034BA83D899C211408048BF7D1DF04940'::geometry)
-> Index Scan using pk_node_types on node_types t (cost=0.00..58.25 rows=17 width=109) (actual time=0.009..0.009 rows=0 loops=29344)
Index Cond: (node_id = n.id)
8.2 Grid Division
Cluster 4 nodes, 1285 s. 11 m/s 3 KB/s 666% cpu 496% read 0% clw threads 1r 0w 0i buffers 12974829 206 d 0 w 2 pfs
cl 1: 5 m/s 2 KB/s 129% cpu 204% read 0% clw threads 1r 0w 0i buffers 3269072 38 d 0 w 2 pfs
cl 2: 1 m/s 0 KB/s 186% cpu 83% read 0% clw threads 0r 0w 0i buffers 3215267 58 d 0 w 0 pfs
cl 3: 1 m/s 0 KB/s 177% cpu 100% read 0% clw threads 0r 0w 0i buffers 3237583 54 d 0 w 0 pfs
cl 4: 1 m/s 0 KB/s 173% cpu 108% read 0% clw threads 0r 0w 0i buffers 3252907 56 d 0 w 0 pfs
Iter 1 -- 219713 msec.
Cluster 4 nodes, 219 s. 60 m/s 15 KB/s 4589% cpu 0% read 0% clw threads 1r 0w 0i buffers 13110965 239 d 0 w 0 pfs
cl 1: 30 m/s 11 KB/s 1142% cpu 0% read 0% clw threads 1r 0w 0i buffers 3305481 36 d 0 w 0 pfs
cl 2: 10 m/s 1 KB/s 1135% cpu 0% read 0% clw threads 0r 0w 0i buffers 3260047 70 d 0 w 0 pfs
cl 3: 10 m/s 1 KB/s 1142% cpu 0% read 0% clw threads 0r 0w 0i buffers 3269431 66 d 0 w 0 pfs
cl 4: 10 m/s 1 KB/s 1168% cpu 0% read 0% clw threads 0r 0w 0i buffers 3276006 67 d 0 w 0 pfs
select count (*), sum (gs_cnt) from geo_stat where gs_cnt > 10000;
274 3998898331
Iter 2 Done. -- 218499 msec.
760 3997850489
Iter 3 . -- 223179 msec.
2074 3992914437
Iter 4 Done. -- 227770 msec.
4748 3969470524
Iter 5 Done. -- 233642 msec.
9897 3905745538
Iter 6 . -- 262611 msec.
20089 375759190
Iter 7 Done. -- 295055 msec.
35305 3381997920
Done. -- 308270 msec.
Cluster 4 nodes, 308 s. 6735 m/s 2208 KB/s 3442% cpu 0% read 13% clw threads 1r 0w 0i buffers 13130500 19774 d 0 w 0 pfs
cl 1: 3368 m/s 1276 KB/s 852% cpu 0% read 13% clw threads 1r 0w 0i buffers 3310532 5087 d 0 w 0 pfs
cl 2: 1122 m/s 310 KB/s 872% cpu 0% read 0% clw threads 0r 0w 0i buffers 3264875 4898 d 0 w 0 pfs
cl 3: 1122 m/s 310 KB/s 860% cpu 0% read 0% clw threads 0r 0w 0i buffers 3274259 4894 d 0 w 0 pfs
cl 4: 1122 m/s 310 KB/s 857% cpu 0% read 0% clw threads 0r 0w 0i buffers 3280834 4895 d 0 w 0 pfs
43244 2523650520
Iter 8 Done. -- 254961 msec.
28145 1322847177
Iter 9 Done. -- 151372 msec.
16041 592433083
Iter 10 Done. -- 68823 msec.
Cluster 4 nodes, 69 s. 3556 m/s 3581 KB/s 3505% cpu 0% read 13% clw threads 1r 0w 0i buffers 13138482 27756 d 0 w 0 pfs
cl 1: 1778 m/s 1883 KB/s 880% cpu 0% read 13% clw threads 1r 0w 0i buffers 3312301 6856 d 0 w 0 pfs
cl 2: 592 m/s 566 KB/s 884% cpu 0% read 0% clw threads 0r 0w 0i buffers 3266946 6969 d 0 w 0 pfs
cl 3: 592 m/s 565 KB/s 873% cpu 0% read 0% clw threads 0r 0w 0i buffers 3276330 6965 d 0 w 0 pfs
cl 4: 592 m/s 565 KB/s 866% cpu 0% read 0% clw threads 0r 0w 0i buffers 3282905 6966 d 0 w 0 pfs
5090 141521188
Iter 11 Done. -- 17592 msec.
88 2052076
Iter 12 Done. -- 1015 msec.
5 105737
Iter 13 Done. -- 655 msec.
0 0
8.3 Execution of the Analytical Queries
profile ('
sparql select ?country, count(1), sum (?sale_price)
where
{
?offer a <http://linkedgeodata.org/ontology/Offer> ;
<http://linkedgeodata.org/ontology/subject> ?re_subj ;
<http://linkedgeodata.org/ontology/sale_price> ?sale_price .
?re_subj geo:geometry ?sgeo .
?sq geo:geometry ?sqgeo .
filter (bif:st_intersects (?sqgeo, ?sgeo))
?sq <sq-belongs-to-country> ?country .
} group by ?country order by desc 2
');
result
LONG VARCHAR
_______________________________________________________________________________
http://sws.geonames.org/3077311/ 103580 121039191590
http://sws.geonames.org/2635167/ 78357 109253805239
http://sws.geonames.org/3175395/ 18986 23035773345
http://sws.geonames.org/1861060/ 18881 13113363308
http://sws.geonames.org/2782113/ 11456 12394479122
http://sws.geonames.org/3144096/ 11045 12037065109
http://sws.geonames.org/4197000/ 8793 20766773241
http://sws.geonames.org/1668284/ 7756 6574466926
http://sws.geonames.org/2077456/ 6169 -4285350253
http://sws.geonames.org/6251999/ 5927 6920736725
http://sws.geonames.org/3723988/ 5867 16469098525
http://sws.geonames.org/3923057/ 5542 6398777654
http://sws.geonames.org/1694008/ 5392 2777228420
http://sws.geonames.org/3017382/ 5375 5170332015
http://sws.geonames.org/3382998/ 5125 4902294018
http://sws.geonames.org/2264397/ 4886 5199820916
http://sws.geonames.org/3575830/ 3805 4072452309
http://sws.geonames.org/3865483/ 3298 2528875647
http://sws.geonames.org/2963597/ 2983 4379407846
http://sws.geonames.org/2658434/ 2754 2409249475
{
time 5.6e-08% fanout 1 input 1 rows
time 0.0021% fanout 1 input 1 rows
{ hash filler
wait time 0% of exec real time, fanout 0
QF {
time 4.5e-06% fanout 0 input 0 rows
Stage 1
time 0.0007% fanout 21798.7 input 48 rows
RDF_QUAD 1.1e+06 rows(s_13_12_t1.O, s_13_12_t1.S)
inlined P = #/subject
time 0.18% fanout 3.44745 input 1.04634e+06 rows
Stage 2
time 0.0044% fanout 0 input 4.18535e+06 rows
Sort hf 39 replicated(s_13_12_t1.O) -> (s_13_12_t1.S)
}
}
Subquery 45
{
time 4.8e-08% fanout 1 input 1 rows
{ fork
time 0.00011% fanout 1 input 1 rows
{ fork
wait time 2.2e-09% of exec real time, fanout 0
QF {
time 4.4e-05% fanout 0 input 0 rows
Stage 1
time 0.00016% fanout 5154.27 input 48 rows
RDF_QUAD 2.4e+05 rows(s_13_12_t5.S, s_13_12_t5.O)
inlined P = #¶sq-belongs-to-country
time 3.2e-05% fanout 1 input 247405 rows
END Node
After test:
0: if ( 0 = 1 ) then 5 else 4 unkn 5
4: BReturn 1
5: BReturn 0
time 0.0022% fanout 1 input 247405 rows
RDF_QUAD 1 rows(s_13_12_t4.O)
inlined P = ##geometry , S = k_s_13_12_t5.S
time 0.37% fanout 48 input 247405 rows
Precode:
0: QNode {
time 0% fanout 0 input 0 rows
dpipe
s_13_12_t4.O -> __RO2SQ -> __ro2sq
}
2: BReturn 0
Stage 2
time 64% fanout 85.3929 input 1.18754e+07 rows
geo 3 st_intersects (__ro2sq) node on DB.DBA.RDF_GEO 0 rows
s_13_12_t3.O
time 30% fanout 0.999796 input 1.01408e+09 rows
RDF_QUAD_POGS 2 rows(s_13_12_t3.S)
P = ##geometry , O = cast
hash partition+bloom by 43 ()
time 5.2% fanout 0.000434695 input 1.01387e+09 rows
Hash source 39 1.7 rows(cast) -> (s_13_12_t1.S)
time 0.18% fanout 0.994389 input 440725 rows
Stage 3
time 0.014% fanout 0.783069 input 440725 rows
RDF_QUAD 1.1 rows(s_13_12_t2.S, s_13_12_t2.O)
inlined P = #/sale_price , S = q_s_13_12_t1.S
time 0.035% fanout 1 input 345118 rows
RDF_QUAD 0.8 rows()
inlined P = ##type , S = k_q_s_13_12_t1.S , O = #/Offer
time 0.0012% fanout 0 input 345118 rows
Sort (set_no, s_13_12_t5.O) -> (s_13_12_t2.O, inc)
}
}
time 9.8e-07% fanout 96 input 1 rows
group by read node
(gb_set_no, s_13_12_t5.O, aggregate, aggregate)
time 0.0013% fanout 0 input 96 rows
Sort (aggregate) -> (s_13_12_t5.O, aggregate)
}
time 4.3e-06% fanout 96 input 1 rows
Key from temp (s_13_12_t5.O, aggregate, aggregate)
After code:
0: QNode {
time 0% fanout 0 input 0 rows
dpipe
s_13_12_t5.O -> __RO2SQ -> __ro2sq
}
2: callret-1 := := artm aggregate
6: callret-2 := := artm aggregate
10: country := := artm __ro2sq
14: BReturn 0
time 2.3e-08% fanout 0 input 96 rows
Subquery Select(country, callret-1, callret-2)
}
After code:
0: QNode {
time 0% fanout 0 input 0 rows
dpipe
callret-2 -> __RO2SQ -> callret-2
callret-1 -> __RO2SQ -> callret-1
country -> __RO2SQ -> country
}
2: BReturn 0
time 2.5e-08% fanout 0 input 96 rows
Select (country, callret-1, callret-2)
}
131916 msec 3860% cpu, 1.0153e+09 rnd 4.16182e+10 seq 99.699% same seg 0.281927% same pg
633 disk reads, 0 read ahead, 0.119788% wait
42977 messages 13581 bytes/m, 0.0078% clw
Compilation: 4 msec 0 reads 0% read 0 messages 0% clw
CPU: Intel Sandy Bridge microarchitecture, speed 2299.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
19335499 31.9950 cmpf_geo
12022031 19.8932 itc_page_search
3860950 6.3888 dv_compare
3370690 5.5776 dc_any_cmp
2833115 4.6880 itc_param_cmp
2184864 3.6154 hash_source_chash_input
1728851 2.8608 gen_qsort
1383613 2.2895 cs_decode
1198923 1.9839 box_to_any_1
669706 1.1082 itc_next
667466 1.1045 page_find_leaf
478367 0.7916 ce_result
472240 0.7814 dc_append_bytes
profile ('
sparql select ?cname
(sum(if (?g_graph = <lgd_ext>, 1, 0)) as ?n_lgd)
(sum(if (?g_graph = <http://dbpedia.org>, 1, 0)) as ?n_dbp)
(sum(if (?g_graph != <http://dbpedia.org> && ?g_graph != <lgd_ext> && ?g_graph != <sqs>, 1, 0)) as ?n_geo)
where
{
graph ?g_graph { ?feature geo:geometry ?sgeo . } .
?sq geo:geometry ?sqgeo .
filter (bif:st_intersects (?sqgeo, ?sgeo))
?sq <sq-belongs-to-country> ?country .
?country <http://www.geonames.org/ontology#name> ?cname
} group by ?cname order by desc 2
');
United Kingdom of Great Britain and Northern Ireland 361739459 133784 516810
Czech Republic 141304028 25122 180744
Japan 88147723 13342 37289
Repubblica Italiana 77138844 13073 46984
Kingdom of Norway 51099745 7182 21254
Republic of Austria 29263512 1692 44605
Republic of Suriname 17671910 4924 84530
Commonwealth of Australia 17325308 19531 166792
Canada 16661337 6508 22338
Taiwan 14649920 2231 33687
Plurinational State of Bolivia 13617794 7054 103531
Dominica 11907956 1581 12762
Republic of France 11508495 1899 63185
Republic of Chile 11187920 8733 300017
Kingdom of Tonga 10447980 4376 62762
Federal Democratic Republic of Nepal 10228427 5417 46208
Republic of Indonesia 9684176 1514 311943
Ireland 6250504 3748 21269
Portuguese Republic 6224670 2408 27947
New Zealand 5933053 1874 73332
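The per-graph breakdown above relies on SPARQL's conditional-sum idiom, sum(if (?g_graph = <g>, 1, 0)), to pivot a single pass over the intersecting geometries into three counters per country. As a minimal sketch of that aggregation logic only (the data and graph names below are hypothetical placeholders, not the benchmark dataset):

```python
from collections import defaultdict

LGD, DBP, SQS = "lgd_ext", "http://dbpedia.org", "sqs"

def count_per_graph(matches):
    """matches: iterable of (country_name, source_graph) pairs.
    Returns (country, [n_lgd, n_dbp, n_geo]) tuples, one pass over the input,
    mirroring the three sum(if(...)) aggregates of the SPARQL query."""
    totals = defaultdict(lambda: [0, 0, 0])
    for cname, g in matches:
        t = totals[cname]
        if g == LGD:
            t[0] += 1          # counted into ?n_lgd
        elif g == DBP:
            t[1] += 1          # counted into ?n_dbp
        elif g != SQS:
            t[2] += 1          # everything else goes to ?n_geo
    # like "order by desc 2": sort by the LinkedGeoData count, descending
    return sorted(totals.items(), key=lambda kv: -kv[1][0])

# hypothetical match pairs, for illustration only
pairs = [("Japan", LGD), ("Japan", DBP), ("Japan", "geonames"),
         ("Ireland", LGD), ("Ireland", LGD)]
```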
{
time 3.9e-08% fanout 1 input 1 rows
Subquery 27
{
time 2.4e-08% fanout 1 input 1 rows
{ fork
time 9.4e-05% fanout 1 input 1 rows
{ fork
wait time 3.1e-05% of exec real time, fanout 0
QF {
time 0.0018% fanout 0 input 0 rows
Stage 1
time 0.00011% fanout 5154.27 input 48 rows
RDF_QUAD 2.4e+05 rows(s_15_8_t2.S, s_15_8_t2.O)
inlined P = #¶sq-belongs-to-country
time 2.2e-05% fanout 1 input 247405 rows
END Node
After test:
0: if ( 0 = 1 ) then 5 else 4 unkn 5
4: BReturn 1
5: BReturn 0
time 0.00071% fanout 1 input 247405 rows
RDF_QUAD 1 rows(s_15_8_t1.O)
inlined P = ##geometry , S = k_s_15_8_t2.S
time 24% fanout 0.527002 input 247405 rows
Precode:
0: QNode {
time 0% fanout 0 input 0 rows
dpipe
s_15_8_t1.O -> __RO2SQ -> __ro2sq
}
2: BReturn 0
Stage 2
time 0.00046% fanout 1 input 247405 rows
RDF_QUAD 1 rows(s_15_8_t3.O)
inlined P = ##name , S = q_s_15_8_t2.O
time 0.054% fanout 42.2287 input 247405 rows
Stage 3
time 43% fanout 85.3929 input 1.18754e+07 rows
geo 3 st_intersects (__ro2sq) node on DB.DBA.RDF_GEO 0 rows
s_7_1_t0.O
time 31% fanout 0.999796 input 1.01408e+09 rows
RDF_QUAD_POGS 2 rows(s_7_1_t0.G)
P = ##geometry , O = cast
After code:
0: neq := Call neq (s_7_1_t0.G, #/dbpedia.org )
5: neq := Call neq (s_7_1_t0.G, #¶lgd_ext )
10: __and := Call __and (neq, neq)
15: neq := Call neq (s_7_1_t0.G, #¶sqs )
20: __and := Call __and (__and, neq)
25: if (__and = 0 ) then 29 else 34 unkn 34
29: callretSimpleCASE := := artm 0
33: Jump 38 (level=0)
34: callretSimpleCASE := := artm 1
38: equ := Call equ (s_7_1_t0.G, #/dbpedia.org )
43: if (equ = 0 ) then 47 else 52 unkn 52
47: callretSimpleCASE := := artm 0
51: Jump 56 (level=0)
52: callretSimpleCASE := := artm 1
56: equ := Call equ (s_7_1_t0.G, #¶lgd_ext )
61: if (equ = 0 ) then 65 else 70 unkn 70
65: callretSimpleCASE := := artm 0
69: Jump 74 (level=0)
70: callretSimpleCASE := := artm 1
74: BReturn 0
time 1.3% fanout 0 input 1.01387e+09 rows
Sort (s_15_8_t3.O) -> (callretSimpleCASE, callretSimpleCASE, callretSimpleCASE)
}
}
time 4.3e-07% fanout 107 input 1 rows
group by read node
(s_15_8_t3.O, aggregate, aggregate, aggregate)
time 0.0009% fanout 0 input 107 rows
Sort (aggregate) -> (s_15_8_t3.O, aggregate, aggregate)
}
}
time 2e-05% fanout 107 input 1 rows
Key from temp (s_15_8_t3.O, aggregate, aggregate, aggregate)
After code:
0: QNode {
time 0% fanout 0 input 0 rows
dpipe
s_15_8_t3.O -> __RO2SQ -> __ro2sq
}
2: n_lgd := := artm aggregate
6: n_dbp := := artm aggregate
10: n_geo := := artm aggregate
14: cname := := artm __ro2sq
18: BReturn 0
time 1.1e-08% fanout 0 input 107 rows
Subquery Select(cname, n_lgd, n_dbp, n_geo)
}
After code:
0: QNode {
time 0% fanout 0 input 0 rows
dpipe
n_geo -> __RO2SQ -> n_geo
n_dbp -> __RO2SQ -> n_dbp
n_lgd -> __RO2SQ -> n_lgd
cname -> __RO2SQ -> cname
}
2: BReturn 0
time 1.4e-08% fanout 0 input 107 rows
Select (cname, n_lgd, n_dbp, n_geo)
}
200764 msec 2735% cpu, 1.01476e+09 rnd 4.16172e+10 seq 99.5598% same seg 0.416313% same pg
6996 messages 74069 bytes/m, 4.7% clw
Compilation: 3 msec 0 reads 0% read 0 messages 0% clw
profile ('
sparql select count (*) count (?country)
where {
?feature geo:geometry ?sgeo .
graph <sqs> { ?sq geo:geometry ?sqgeo . } .
filter (bif:st_intersects (?sqgeo, ?sgeo)) .
optional { ?sq <sq-belongs-to-country> ?country . }
}');
2010826982 1013872046
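The two counts differ because the optional clause leaves ?country unbound for squares that fall outside every country: count (*) counts every solution, whereas count (?country) counts only solutions where the variable is bound. A minimal Python sketch of that distinction, with None standing for an unbound optional value (the bindings are hypothetical):

```python
def counts(rows):
    """rows: list of country bindings per solution; None where the
    OPTIONAL pattern did not match."""
    n_all = len(rows)                                 # count (*)
    n_bound = sum(1 for c in rows if c is not None)   # count (?country)
    return n_all, n_bound

rows = ["DE", "DE", None, "FR", None]
# counts(rows) -> (5, 3)
```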
{
time 2e-08% fanout 1 input 1 rows
time 0.00013% fanout 1 input 1 rows
{ hash filler
wait time 2.1e-06% of exec real time, fanout 0
QF {
time 1e-06% fanout 0 input 0 rows
Stage 1
time 0.0068% fanout 5154.27 input 48 rows
RDF_QUAD_POGS 2.5e+05 rows(t4.S, t4.O)
inlined P = #¶sq-belongs-to-country
time 0.00066% fanout 2.72089 input 247405 rows
Stage 2
time 0.0003% fanout 0 input 989620 rows
Sort hf 39 replicated(t4.S) -> (t4.O)
}
}
time 6.7e-06% fanout 1 input 1 rows
{ fork
wait time 0% of exec real time, fanout 498180
QF {
time 0.0055% fanout 10378.8 input 48 rows
RDF_QUAD_POGS 3.8e+08 rows(t2.S, t2.O)
inlined P = ##geometry G = #¶sqs
time 2.3e-05% fanout 1 input 498180 rows
END Node
After test:
0: if ( 0 = 1 ) then 5 else 4 unkn 5
4: BReturn 1
5: BReturn 0
time 6.5e-06% fanout 0 input 498180 rows
qf select node output: (qf_set_no, t2.S, t2.O)
}
time 1% fanout 1 input 498180 rows
Precode:
0: QNode {
time 0% fanout 0 input 0 rows
dpipe
t2.O -> __RO2SQ -> __ro2sq
}
2: BReturn 0
outer {
time 0.00058% fanout 0.496618 input 498180 rows
Hash source 39 1 rows(k_t2.S) -> (t4.O)
time 3.2e-05% fanout 2.01362 input 247405 rows
end of outer}
set_ctr
out: (t4.O)
shadow: (t4.O)
wait time 0% of exec real time, fanout 0
Precode:
0: isnotnull := Call isnotnull (t4.O)
5: BReturn 0
QF {
time 42% fanout 83.8705 input 2.39126e+07 rows
geo 3 st_intersects (__ro2sq) node on DB.DBA.RDF_GEO 0 rows
t1.O
time 57% fanout 0 input 2.00557e+09 rows
RDF_QUAD_POGS 2 rows()
P = ##geometry , O = cast
After code:
0: sum callret-1 isnotnull set no set_ctr
5: sum callret-0 1 set no set_ctr
10: BReturn 0
}
}
time 1e-08% fanout 0 input 1 rows
Select (callret-0, callret-1)
}
364100 msec 4291% cpu, 2.00584e+09 rnd 7.67135e+10 seq 89.4223% same seg 10.0151% same pg
7766 disk reads, 7213 read ahead, 0.452942% wait
7781 messages 141187 bytes/m, 2.6% clw
Compilation: 2 msec 0 reads 0% read 0 messages 0% clw
8.4 Virtuoso Procedure for Producing DXF File
create procedure lgd_render_amounts_of_sales ()
{
declare ses, ctx any;
ses := string_output ();
DXFOUT_PREAMBLE (ses, ctx);
for (select SHP_SOURCE_IDX, deserialize (SHP_GEOM) as geom from DB.DBA."SHP_NE1_ne_10m_admin_0_countries") do
http_st_dxf_entity (geom, vector (62, 253), ses);
for (
sparql define input:storage "" select ?city,
<sql:BEST_LANGMATCH>(?cityoname, "en-gb;q=0.8, en;q=0.7, fr;q=0.6, *;q=0.1", "") as ?city_official_name,
count(1) as ?sales_count,
avg(?re_subj_lat) as ?avg_lat, avg (?re_subj_long) as ?avg_long where
{
?offer a <http://linkedgeodata.org/ontology/Offer> ;
<http://linkedgeodata.org/ontology/subject> ?re_subj ;
<http://linkedgeodata.org/ontology/sale_price> ?sale_price .
?re_subj geo:lat ?re_subj_lat ; geo:long ?re_subj_long .
graph <sq-city> { ?sq <sq-belongs-to-city> ?city }
filter (?sq = <(NUM,NUM)SHORT::sql:xy_square_iid> (?re_subj_long, ?re_subj_lat))
?city <http://www.geonames.org/ontology#officialName> ?cityoname
} group by ?city) do
{
declare color integer;
declare sz double precision;
color := 10 + 10 * floor (log10("sales_count") * 5.0);
sz := (1+log10("sales_count")) * 0.1;
http_raw_dxf_entity (vector (0, 'TEXT', 1, "city_official_name", 6, 'BYLAYER', 62, color,
10, "avg_long" + 1.5*sz, 20, "avg_lat", 30, 0.0,
11, "avg_long" + 1.5*sz, 21, "avg_lat", 31, 0.0,
40, 2*sz, 41, 1.0, 50, 0.0, 71, 0, 72, 0, 73, 2 ), ses);
http_raw_dxf_entity (vector (0, 'CIRCLE', 6, 'BYLAYER', 62, color, 10, "avg_long", 20, "avg_lat", 40, sz), ses);
}
DXFOUT_CONCLUSION (ses, ctx);
string_to_file ('counts_of_sales.dxf', ses, -2);
}
;
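In the rendering procedure above, the DXF color index and circle radius of each city marker grow with the logarithm of its sales count, so cities with orders-of-magnitude more sales stand out without dwarfing the rest. A minimal Python sketch of just that scaling, using the same formulas as the procedure body:

```python
import math

def marker_style(sales_count):
    """Log-scaled DXF color index and circle radius for a city marker,
    mirroring the arithmetic in lgd_render_amounts_of_sales."""
    color = 10 + 10 * math.floor(math.log10(sales_count) * 5.0)
    size = (1 + math.log10(sales_count)) * 0.1
    return color, size
```

Each additional decade of sales raises the color index by 50 and the radius by 0.1 drawing units.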
create procedure xy_square_iid (in x any array, in y any array) returns IRI_ID_8
{
vectored;
return (select __i2id (sprintf ('sq%d', sq_id)) from geo_square where st_intersects (sq_geo, st_point (x, y)));
}
;
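The vectored procedure above resolves each point to its containing grid square through an st_intersects lookup against the geo_square table. For a regular grid the same point-to-square mapping can be computed arithmetically; the sketch below assumes a hypothetical 1-degree grid with row-major square numbering, purely for illustration (the benchmark's actual square geometry and numbering live in geo_square):

```python
import math

def xy_square_id(x, y, square_deg=1.0):
    """Map a lon/lat point to a 'sq%d' identifier on a regular grid.
    Assumes hypothetical row-major numbering over a 360 x 180 degree extent."""
    col = int(math.floor((x + 180.0) / square_deg))
    row = int(math.floor((y + 90.0) / square_deg))
    cols_per_row = int(360 / square_deg)
    return "sq%d" % (row * cols_per_row + col)
```

The real procedure then turns the 'sq%d' string into an IRI id with __i2id, so the filter in the DXF query compares IRI ids directly instead of geometries.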
9. Bibliography
Bayer, R. (1971). Binary B-trees for Virtual Memory. In Proceedings of the 1971 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control (pp. 219-235). New York, NY, USA: ACM.
GeoKnow Consortium. (2013, June 18). Design and Setup of Benchmarking System. Retrieved December 4, 2013, from GeoKnow Deliverables: svn.aksw.org/projects/GeoKnow/Deliverables/D1.3.1/D1.3.1_Design_and_Setup_of_Benchmarking_System.pdf
Guttman, A. (1984). R-Trees: A Dynamic Index Structure for Spatial Searching. In B. Yormark, SIGMOD'84, Proceedings of Annual Meeting (pp. 47-57). Boston, Massachusetts: ACM Press.
LOD2 Consortium. (2010). Creating Knowledge out of Interlinked Data. Retrieved December 4, 2013, from http://lod2.eu/Welcome.html
Open Geospatial Consortium. (2012). GeoSPARQL - A Geographic Query Language for RDF Data. Retrieved December 4, 2013, from http://www.opengeospatial.org/standards/geosparql
OpenStreetMap Project. (n.d.). About OpenStreetMap. Retrieved December 4, 2013, from OpenStreetMap Web site: http://www.openstreetmap.org/about
OpenStreetMap Project. (2013, September 8). Osmosis. Retrieved December 4, 2013, from OpenStreetMap Wiki: http://wiki.openstreetmap.org/wiki/Osmosis