

Open Source geocomputation: using the R data analysis language integrated with GRASS GIS and PostgreSQL database systems

Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Breiviksveien 40, N-5045 Bergen, Norway
Email: [email protected]

Markus Neteler
Institute of Physical Geography and Landscape Ecology, University of Hannover, Schneiderberg 50, D-30167 Hannover, Germany
Email: [email protected]

    Abstract

We report on work in progress on the integration of the GRASS GIS, the R data analysis programming language and environment, and the PostgreSQL database system. All of these components are released under Open Source licenses. This means that application programming interfaces are documented both in source code and in other materials, simplifying insight into the workings of the respective systems. Current versions of this note and accompanying code are to be found at the Hannover GRASS site, together with earlier papers on related topics.


    1 Introduction

The practice of geocomputation is evolving to take advantage of the peer-review qualities brought to software development by access to source code. The progress reported on in this paper covers recent advances in the GRASS geographical information system, in particular the introduction of floating point raster cell values and an enhanced sites data format, in the R data analysis language environment, and in their interfacing both directly and through the PostgreSQL database system.

The interactive R language, a dialect of S (Chambers 1998), is eminently extensible, providing for functions written by users, for the dynamic loading of compiled code into user functions, and for the packaging of libraries of such functions for release to others. R, unlike S-PLUS, another dialect of S, conducts all computation in memory, presenting a potential difficulty for the analysis of very large data sets. One way to overcome this is to utilize a proxy mechanism through an external database system, such as PostgreSQL; others will be forthcoming as the Omega project matures.

While R is a general data analysis environment, it has been extensively used for modelling and simulation, and for comparison of modern statistical classification methods with, for example, neural nets, and is well suited to assist in unlocking the potential of data held in GIS. Over and above classical, graphical, and modern statistical techniques in the base R library and supplementary packages, packages for point pattern analysis, geostatistics, exploratory spatial data analysis and spatial econometrics are available. The paper is illustrated with worked examples showing how open source software development can benefit geocomputation. Previous work on the use of R with spatial data is reported by Bivand and Gebhardt (forthcoming), and on interfacing R and GRASS by Bivand (1999, 2000).

    2 Open Source Development and GeoComputation

One of the most striking innovations to emerge with full vigour in the late 1990s, but building on earlier precedent, is Open Source software. While there are good technical reasons for releasing the source code of computer programs to anyone connected to the Internet, or willing to spend little more than the cost of a bottle of beer on a CD-ROM, the trend profoundly affects the measurement of research and technological development expenditure and investment. Since the software tools are not priced, knowledge, the practice of winning new knowledge, and the style in which new knowledge is won are sharply focused.

This style is of key importance, and involves the honing of ideas in open collaboration and competition between at least moderately altruistic peers. The prime examples are the conception of the World Wide Web and its fundamental protocols at CERN, by an errant scientist simply interested in sharing research results in an open and seamless way; the transport mechanism for about 80% of all electronic mail, sendmail, written from California but now with a world cast of thousands of contributors; and the Linux operating system, developed by a Finnish student of undoubted organisational talent, but perhaps luckily no training in the commercialisation of innovations, nor in business strategy.

For these innovators, it has been important not to be hindered by institutional barriers and hierarchies. Acceptance and esteem in Open Source communities are grounded in the quality of contributions and interventions made by individuals in a completely flat structure. Asking the right question, leading to a perverse bug being identified or solved, immediately confers great honour on the person making the contribution. In money terms, probably no commercial organisation could afford to employ problem chasers or consultants to solve such problems: they are simply too costly. The entry into the value chain comes later, in the customisation, packaging and reselling of software, in software services rather than in the production and sale of software itself.


As Eric Raymond has shown in three major analyses, most importantly in ``The Magic Cauldron'', software is generally but wrongly thought of as a product, not a service. Typically, software engineers work within organisations to provide and administer services internally, customising database applications, adapting purchased software and network systems to the specific needs of their employers. Very few, less than ten percent, are employed in companies that make their living exclusively from selling software. Assuming that these programmers have equivalent productivity, one could say that corporate internal markets ``buy'' most of the software written, and that very little of that code would be of any value to a competitor, even if the competitor could read it. Software looks and feels much more like a service than a product.

If software is a service, not a product, then the commercial opportunities lie in customising it to suit specific needs, unless there really is a secret algorithm embedded in the code which cannot be emulated in any satisfactory way. This is an issue of some weight in relation to intellectual property rights, which are not well adapted to software, in that software algorithms are only seldom truly original, and even less often the only feasible solutions to a given problem. Even without seeing each other's source code, two programmers might well reach very similar solutions to a given problem; if they did see each other's code, the joint solution would very likely be a better one. Reverse engineering physical products can be costly and time-consuming, but doing the same to software is easy, although one may question the ethics of those doing this commercially. Most often programmers are driven to reverse engineer software with bugs, because the seller cannot or will not help them make it work. An important empirical regularity pointed out by Raymond is the immediate fall in value of software products to zero when the producer goes out of business or terminates the product line. It is as though customers pay for the product as some kind of advance payment for future updates and new versions. If they knew that no new versions would be coming, they would drop the product immediately, even though it still worked as advertised.

Raymond summarises five discriminators pushing towards open source:

- reliability, stability, and scalability are critical;
- correctness of design and implementation cannot readily be verified by other means than independent peer review;
- the software is critical to the user's control of his/her business;
- the software establishes or enables a common computing and communications infrastructure;
- key methods (or functional equivalents of them) are part of common engineering knowledge.

Any of these alone may be sufficient to cross the boundary at which external economies begin influencing behaviour in companies, termed network externalities in Raymond's article. If the company can benefit more in terms of its internal and/or external service quality and reliability, then there is good reason to consider opening its source.

The second of Raymond's criteria is of key importance in academic and scientific settings, but others also have a role to play. Bringing supercomputing power to larger and less well resourced communities has sprung from open source projects such as PVM and MPI, upon which others again, in particular Beowulf and friends, have developed. Criteria four and five go to the heart of the success of shared, open protocols for interaction and interfacing components of computing systems.

Next, we will present the two main program systems being integrated, before going on to show how the interfaces are constructed and function. We assume that interested readers will be able to supplement these sketches by accessing online documentation. In particular, online documentation for the PostgreSQL database system (or possibly also the newly GPL'd MySQL system) should be referred to. PostgreSQL is a leading open source ORDBMS, used in developing newer ideas in database technology. PostgreSQL also has native geometry data types, which have been used to aid vector GIS integration in other projects. Since some of our results are preliminary, we are anxious to receive suggestions from others interested in this work.


Mention has also been made of the Omega project, a project deserving attention as a source of inspiration for data analysis. Many of the key contributors to S and R are active in the Omegahat core developers' group, and are working on systems intended to handle distributed data analysis tasks, with streams of data arriving, possibly in real time, from distributed sources, and undergoing analysis, possibly on distributed compute servers, for presentation to possibly multiple viewers. At the moment, Omegahat involves XML, CORBA, and an S-like interactive language based directly on Java, and is experimental.

2.1 R

The background to R is interesting, in that it is based on two computer programming traditions: Bell Laboratories through S, and MIT through Scheme. As Ihaka (1998) relates, meeting the first edition of the classic SICP (1985; second edition: Abelson, Sussman and Sussman, 1996) opened up ``a wonderful view of programming''. He met Scheme at the same time as an early version of S, and quickly realized that some features of modern programming language design, such as lexical scoping, can lead to superior practical solutions. These advantages have, among others, brought Tierney, the author of Lisp-Stat, into the R core development team. Differences in the underlying computing models between S and R are many and important, and sometimes concern the active user. They are outlined in R system documentation, in the R FAQ available at the archive site, and in the R complement to Venables and Ripley (1997). Venables and Ripley are also actively porting their own work to R, partly as a service to their readers, and partly because the differences in the underlying computing models tease out potential coding infelicities. Many of these issues are actively debated on R discussion lists; for details, see the R archives.

The key differences lie in scoping and memory management. Scoping is concerned with the environments in which symbol/value pairs are searched during the evaluation of a function, and poses a problem for porting from S to R when use has been made of the S model, rather than that derived from Scheme (Ihaka and Gentleman, 1996, pp. 301-310).
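The practical difference can be seen in a two-line example (ours, not from the paper): R resolves free variables in the environment where a function was defined, so closures retain the values of their maker's arguments.

make.interest <- function(rate) {
    # `rate' is a free variable of the inner function; R's lexical scoping
    # finds it in make.interest()'s environment, where S's scoping rules
    # would search the global frame instead
    function(principal) principal * rate
}
at.5pct <- make.interest(0.05)
at.5pct(1000)    # 50 in R; under S scoping the inner `rate' is not found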

Memory management differs in that S generates large numbers of data files in the working directory during processing, allocating and reallocating memory dynamically. R, being based on Scheme, starts by occupying a user-specified amount of memory, but subsequently works within this, not committing any intermediate results to disk. Memory is allocated to fixed size ``cons cells'', used to store language elements, functions and strings, and to a memory heap, within which user data are accommodated. Within the memory heap, a simple but efficient garbage collector makes the best possible use of the memory at the program's disposal. This may hinder the analysis of very large data sets, since all data have to be held in memory, but in practice this problem has been alleviated by falling RAM prices. One reason cited by some authors for using R is that it is easier to compile dynamically loaded code modules for R than, for example, for S-PLUS, since you just need the same (open-source) compilation system that you used to install R in the first place.
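A minimal sketch of this route, with invented file and routine names: a C source file is compiled with the same tools that built R, then loaded and called.

# mysum.c (hypothetical) contains:
#   void mysum(double *x, int *n, double *ans) {
#       int i; *ans = 0.0;
#       for (i = 0; i < *n; i++) *ans += x[i];
#   }
# compiled at the shell with: R CMD SHLIB mysum.c   (produces mysum.so)
dyn.load("mysum.so")                  # load the shared library into R
x <- rnorm(1000)
res <- .C("mysum", as.double(x), as.integer(length(x)), ans=double(1))
res$ans                               # equals sum(x)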

Ihaka (1998) places the future of R clearly within open-source software. Indeed, the rapid development of R as a computing environment for data analysis and graphics bears out many of the points made by Raymond. Over several years, it has become clear that, even when well-regarded commercial software products are available, communities of users and developers are often able to gain momentum based on very rapid debugging by large numbers of interested participants: bazaar-style development. Open source is not limited to Unix or Unix-based operating systems, since open-source compilers and associated tools have been ported to proprietary desktop systems like MS Windows 95, MS NT 4, and others. These in turn have permitted software, like R, to be ported to these platforms, with little version slack.

R supports a rich variety of statistical graphics; one infelicity is that in general no provision is made for constraining the x and y axes of a plot to the same scale (with the exception of eqscplot() in the MASS library). The image() function plots filled polygons centred on the given grid centres, but there is no bundled image.legend() function in R to mirror the one in S. R has a legend() function, but it has to


    Figure 2. Integration of tools in a geographical information system.

From initially interfacing R and GRASS through loose coupling, transferring data through ad hoc text files, the work presented here involves tight coupling, running R from within the GRASS shell environment, and additionally system enhancement, by dynamically loading compiled GIS library functions into the R executable environment.

    Figure 3. Layering of shell, GRASS and R environments.

Because R has developed as an interactive language with clear scoping rules and sophisticated handling of environment issues, while GRASS is a collection of partly coordinated separate programs built on common library functions, constructing working interfaces with GRASS has been challenging. One of the advantages of working with open source projects in this context has been that occasionally poorly documented functions can be read as source code, and when necessary thoroughly instrumented and debugged, to establish how they


    actually work. This is of particular help when interface functions are being written, since the mindsets and

    attitudes of several different communities of programmers need to be established.

The key insight needed to establish the interface is that GRASS treats the shell environment as its basis, assigning values to shell variables on initiation which are used by successive GRASS programs, all of which are started with the values assigned to these variables. These name the current location and mapset, and these in turn point to session and default metadata on the location's regional extent and raster resolution, stored in a fixed form in the GRASS database, a collection of directories under the current location. Since GRASS simply sets up a shell environment, initiating specified shell variables, all regular programs, including R, can in principle be run from within GRASS, in addition to GRASS programs themselves. Finally, R has a system(command) function, where command is a string giving the command to be invoked; this command is executed in the shell environment in which R itself was started. This means that system() can be used from within R to run GRASS (and other) programs, including programs requiring user interaction. Alternatively, the tcltkgrass GUI front-end can be started before (or from) R, giving full access to GRASS as well as R. This is illustrated in Fig. 3.
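For example (a sketch using standard GRASS commands against the Spearfish data), any GRASS program can be run from the R prompt, and its output captured if required:

system("g.region -p")                     # print the current region settings
system("r.info elevation.dem")            # metadata for a raster layer
ver <- system("g.version", intern=TRUE)   # capture the command's output in R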

Furthest to the right in Fig. 3 is the use of GRASS library functions called from R through dynamically loaded shared libraries. The advantage of this is to speed up interaction and to allow R to be enhanced with GRASS functionality, in particular reading GRASS data directly into R objects. Dynamically loaded shared libraries also increase the speed of operations not vectorised within R itself, especially for loops. Vectors are the basic building blocks of R (note: 1-based vectors!), so many operations are inherently fast, because data is held in memory. However, reading the R interpreted and compiled code (for the base library), as well as the documentation on writing extension packages, gives a good idea of where compiling shared libraries may be worth the extra effort involved.
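The difference is easy to demonstrate (our illustration): summing a vector element by element in an interpreted loop is far slower than the single call into compiled code.

x <- runif(100000)
s <- 0
system.time(for (i in 1:length(x)) s <- s + x[i])  # interpreted loop
system.time(sum(x))                                # vectorised: one call into C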

    Figure 4. Feasible routes for data transfer between R, GRASS and PostgreSQL.

Native R supports the input and output of ASCII text data, especially in the most frequently used form of data frame objects. These have a tabular form, with rows corresponding to observations, each with a character string row name, and columns to variables, each again with a string column name. Data frames contain numeric vectors and factors (which may be ordered), and can also contain character vectors; factors are an R object class of some importance, and are integer-valued, with 1-based integers pointing to the string label vector attribute. The read.table() function is however a memory hog, since reading in large tables with character variables for conversion to factors can use up much more memory than is necessary. The scan()


function is more flexible, but data frames are certainly among the most used objects in R. Data frames differ from matrices in that matrices only contain one data type.
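The representation of factors described above can be inspected directly (a small illustration of ours):

f <- factor(c("arable", "pasture", "arable", "water"))
levels(f)        # the string label vector attribute: "arable" "pasture" "water"
as.integer(f)    # the 1-based integer codes: 1 2 1 3
unclass(f)       # the codes together with the levels attribute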

R objects may be output for re-entry with the dump() function, and entered using source(), both for functions and data objects. The R memory image may also be saved to disk for loading when a work session resumes, or for transfer to a different workstation. Every user of R has been caught by problems provoked by inadvertently loading an old image file, .RData, in the working directory on startup, in particular an apparent lack of memory despite little fresh data being read in.
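A short round trip shows the mechanism (file names are our own):

x <- 1:10
dump("x", file="x.R")    # write a text representation of the object
rm(x)
source("x.R")            # re-create x by parsing the dumped file
save.image()             # write the whole workspace to .RData for later sessions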

There are a number of variants of data entry into R, and work is progressing on interfacing S family languages to RDBMSs (James, 2000), within the Omega project. Other read.table() flavours are also available, for example for *.csv files. In the following subsections, we will describe how the R/GRASS interface has been developed.

    3.1 File transfer

Work on interfacing GRASS and R began from the facility in the GRASS s.menu program to output an S structure for raster map layers sampled at site point locations. This was evidently unsatisfactory as a general technique, although it still more or less works for the sampled points. Progress on file-based transfer using [rs].out.ascii, [rs].in.ascii, and r.stats -1 on the GRASS side, run through the R system() function, is reported in Bivand (1999) initially, and as a fully supported R package in Bivand (2000). This route is still used for site data in the current (0.1-4) version of the GRASS package for R. The help pages of this package are appended to this paper.

As an example, we can examine the code of the sites.get() function. The main part of the function is:

FILE


+
| Layer: elevation.dem                 Date: Tue Mar 14 07:15:44 1989
| Mapset: PERMANENT                    Login of Creator: grass
| Location: spearfish
| DataBase: /usr/local/grass5.0b/data
| Title: DEM (7.5 minute) ( elevation )
|
| Type of Map: cell                    Number of Categories: 1840
| Rows: 466
| Columns: 633
| Total Cells: 294978
| Projection: UTM (zone 13)
| N: 4928000  S: 4914020  Res: 30
| E:  609000  W:  590010  Res: 30
|
| Data Source:
| 7.5 minute Digital Elevation Model (DEM) produced by the USGS
|
| Data Description:
| Elevation above mean sea level in meters for spearfish database
|
| Comments:
| DEM was distributed on 1/2 inch tape and was extracted using
| GRASS DEM tape reading utilities. Data was originally in feet
| and was converted to meters using the GRASS map calculator
| (Gmapcalc). Data was originally supplied for both the
| Spearfish and Deadwood North USGS 1:24000 scale map series
| quadrangles and was patched together using GRASS
|
+

> system.time(spear <- rast.get(G, rastlist=c("elevation.dem", "landuse"),
+   catlabels=c(FALSE, TRUE)))     # call reconstructed; arguments assumed
> str(spear)
List of 2
 $ elevation.dem: num [1:106400] NA NA NA NA NA NA NA NA NA NA ...
 $ landuse.f    : Factor w/ 8 levels "residential",..: NA NA NA NA NA NA NA

The current interface is for raster and site GIS data; work on an interface for vector data is proceeding, but is complicated by substantial infelicities in GRASS methods for joining attribute values of vector objects to position data. Changes in GRASS programs like v.in.shape and its relatives are occurring rapidly, and are also associated with the use of PostgreSQL for storing tables of attribute data. We will return to this topic in concluding.


    3.2 Binary transfer

Early reactions to the release of the interpreted interface indicated problems with the bloat of using data frames to contain imported raster layers (now replaced by lists of vectors) and with confusing differences in metadata. It turned out that data written by both R and GRASS to ASCII files were subject to assumptions about the format and precision needed, so that direct binary transfer became a high priority to ensure that both programs were accessing the same data. While transfer times are not of major importance, since users tend to move data across the interface seldom, an improvement of better than an order of magnitude seemed worth the trouble involved in programming the interface in C.

> system.time(G <- gmeta())        # call reconstructed from context
> system.time(spear <- rast.get(G, rastlist=c("elevation.dem", "landuse"),
+   catlabels=c(FALSE, TRUE)))     # call reconstructed as above
> str(spear)
List of 2
 $ elevation.dem: num [1:106400] NA NA NA NA NA NA NA NA NA NA ...
 $ landuse.f    : Factor w/ 9 levels "no data","resid..",..: NA NA NA NA NA

A by-product of programming the raster interface in C has been the possibility of moving R factor levels across the interface as category labels loaded directly into the GRASS raster database structure. While incurring the cost of extra disk usage as compared to GRASS reclassing technology, raster file compression reduces the importance of this issue. In the case below of reclassifying the elevation data from Spearfish by quartiles, the Tukey five number summary, the resulting data file is actually smaller than its null file, recording NAs.

> elev.f <- ordered(cut(spear$elevation.dem, fivenum(spear$elevation.dem),
+   include.lowest=TRUE))          # call reconstructed from the quartile text above
> str(elev.f)
 Factor w/ 4 levels "(1.07e+03,1..",..: NA NA NA NA NA NA NA NA NA NA ...
> class(elev.f)
[1] "ordered" "factor"
> col <- terrain.colors(4)         # palette assumed; assignment garbled in source
> plot(G, codes(elev.f), reverse=reverse(G), col=col)
> legend(c(590000, 595000), c(4914000, 4917000),
+   legend=levels(elev.f)[1:2], fill=col[1:2], bty="n")
> legend(c(597000, 602000), c(4914000, 4917000),
+   legend=levels(elev.f)[3:4], fill=col[3:4], bty="n")


Figure 5. plot.grassmeta() object dispatch function used to display the classified elevation map in R.

> rast.put(G, "elev.f", elev.f, "Transfer of fivenum elevation classification",
+   cat=TRUE)
> system("ls -l $LOCATION/cell/elev.f")
-rw-r--r--   1 rsb   users    9909 Jul 15 14:09 elev.f
> system("r.info elev.f")
+
| Layer: elev.f                        Date: Sat Jul 15 14:09:14 2000
| Mapset: rsb                          Login of Creator: rsb
| Location: spearfish
| DataBase: /usr/local/grass5.0b/data
| Title: Transfer of fivenum elevation classification ( elev.f )
|
| Type of Map: raster                  Number of Categories: 4
| Rows: 280


| Columns: 380
| Total Cells: 106400
| Projection: UTM (zone 13)
| N: 4928000  S: 4914000  Res: 50
| E:  609000  W:  590000  Res: 50
|
| Data Source:
|
| Data Description:
| generated by rastput()
|
+

> table(elev.f)
elev.f
(1.07e+03,1.2e+03] (1.2e+03,1.32e+03] (1.32e+03,1.49e+03] (1.49e+03,1.84e+03]
             26517              26162               26354                  26

    > system("r.report elev.f units=c")

    r.stats: 100%

    +

    | RASTER MAP CATEGORY REPORT

    |LOCATION: spearfish Sat Jul 15 14:10:54 20

    |

    | north: 4928000 east: 609000

    |REGION south: 4914000 west: 590000

    | res: 50 res: 50

    |

    |MASK:none

    |

    |MAP: Tranfer of fivenum elevation classification (elev.f in rsb)

    |

    | Category Information | ce

    |#|description | cou

    |

    |1|(1.07e+03,1.2e+03] . . . . . . . . . . . . . . . . . . . . . . . . .| 265

    |2|(1.2e+03,1.32e+03] . . . . . . . . . . . . . . . . . . . . . . . . .| 261

    |3|(1.32e+03,1.49e+03]. . . . . . . . . . . . . . . . . . . . . . . . .| 263

    |4|(1.49e+03,1.84e+03]. . . . . . . . . . . . . . . . . . . . . . . . .| 263

    |*|no data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .| 10|

    |TOTAL |1064

    +


    Figure 6. GRASS display of R reclassed elevation.

One specific problem encountered in writing the interface is a result of mixing two differing system models: R requires that compiled code uses its error handling routines, returning control to the interactive command line on error, while not only the GRASS programs but also their underlying libraries quite often simply exit() on error. This of course also terminates R if the offending code is dynamically loaded, behaviour that is hazardous for R data held in objects in memory. Patches have been applied to key functions in the GRASS libgis.a library, and are both available from the CVS repository and will be in the released source code from 5.0 beta 8.

    3.3 Transfer via shared PostgreSQL database

The static memory model of R leads to problems with large datasets, which are especially common in the field of geospatial analysis. To overcome this problem, the "R PostgreSQL Package" (RPgSQL package) has been developed. It links R to the open source DBMS PostgreSQL. Users can read and write data frames to and from a PostgreSQL database. As the RPgSQL provides an interface to PostgreSQL, the user does not need to know SQL statements. As in other R extensions, a set of commands is available to exchange data frames and select data from the PostgreSQL database. The RPgSQL package also offers support for "proxy data frames". Such a proxy data frame is an R object that inherits from


the data.frame class, but does not contain any data. Instead of writing to R memory, the proxy data frame generates the appropriate SQL query and retrieves the resulting data from PostgreSQL. The RPgSQL system thus allows the user to process large datasets without loading them completely into the R memory space.

The RPgSQL is loaded after starting R:

> library(RPgSQL)

In our case we are starting R within GRASS, to access GRASS data directly. Before using the RPgSQL, an empty database needs to be created. This can be done externally:

$ createdb test

or within R:

> system("createdb geodata")

The user must have permission to create databases; this can be specified when creating a PostgreSQL user (createuser command).

Then the connection to the database "geodata" has to be established:

> db.connect("localhost", "5432", "geodata")

Within this database "geodata" a set of tables can be stored. The database may of course be initiated externally, by copying data into previously created tables using the standard database tools, for example psql. The key goal of this form of data transfer is to use the GRASS-PostgreSQL interface to establish the database prior to work in R, and to use the R-PostgreSQL interface to pass back results for further work in GRASS.

An issue of some weight, and as yet not fully resolved in the GRASS-PostgreSQL interface, is stewardship of the GIS metadata: GRASS assumes that it "knows" the metadata universe, but this is in fact not "attached" to data stored in the external database. It is probable that fuller use of this solution to data transfer will require the choice of one of two future routes: the registration of a GRASS metadata table in PostgreSQL (somewhat similar to the use of gmeta() above); or the implementation of a GRASS daemon serving metadata, and possibly data in general, to programs requesting it. Although these are serious matters, and indicate that this is far from resolved, attempts to interface through a shared external database do illuminate the problems involved in a very helpful way.

    > data("soildata")

    In case the user has the data frame available loaded into R, it can be moved further to PostgreSQL. It is

    recommended to use the "names" parameter to avoid name convention conflicts with PostgreSQL:

    > db.write.table("soildata", name="soildata", no.clobber=F)

    Now the memory of R can be freed of the "soildata" data frame as it was copied to PostgreSQL:

    > rm(soildata)

    To reconnect the data set to R, the proxy binding follows. Due to the proxy method the data interchange

    between PostgreSQL and R is established:> bind.db.proxy("soildata")

    > summary("soildata")

    Table name: soildata

    Database: geodata

    Host: localhost

    Dimensions: 10 (columns) 1000 (rows)

    The data access is different to the common data access as the proxy methods only delivers selected data to the

    R system. An example for selective data access is by addressing rows and columns:

    > selected.data


> selected.data
       area perimeter g2.i stonr horiz humus  pH
25  5232.31   322.597   26    20    Ah    20 5.4
26 75256.80  1986.950   27    10   Ap1    10 5.6
27  5276.81   285.431   28     6   aAp     6 4.8

or by addressing fields through column names:

> humus.data <- soildata$humus   # assignment reconstructed
> humus.data
 [1] 1.45 1.33 1.45 2.71 1.45 1.33 1.33 1.45 1.74 1.45 2.81 1.33 1.78 1.69 2.51
[16] 1.33 1.77 1.33 2.41 0.00 0.00 1.62 1.33 2.10 2.10 1.69 2.38 1.33 2.51 1.33
[31] 1.34 2.10 0.00 1.92 0.00 1.45 1.33 2.10 0.00 1.33 2.51 2.71 1.69 1.33 2.71
[46] 2.10 1.33 1.69 2.10 2.71 1.34 2.71 2.71 2.10 1.33 1.33 2.68 1.74 1.33 1.45
[61] 2.71 2.38 2.68 1.69 1.33 1.33 1.33 1.33 1.33 2.71 2.71 2.10 2.71 2.10 1.69
[76] 1.33 1.45 1.33

These selected data (or results from calculations) can be written as a new table to the PostgreSQL database, too:

> db.write.table(selected.data, name="selected_data", no.clobber=F)

To read it back, the command

> bind.db.proxy("selected_data")

is used.

Although users do not need to know the SQL query language, it is possible to send SQL statements to the PostgreSQL backend:

> db.execute("select * from humus", clear=F)
> db.read.column("humus", as.is=T)

Calculations are performed by addressing the rows to use in the calculation. To calculate the mean of humus contents from row 1 to 1000, enter:

> mean(na.omit(soildata$humus[1:1000]))
[1] 1.723718

Finally the data cleanup can be done. The command

> rm(soildata)

would remove the data frame from R memory, and

> db.rm("soildata", ask=F)

removes the table from the PostgreSQL database "geodata". Then

> db.disconnect()

disconnects the data link between R and the PostgreSQL backend.

The new R-PostgreSQL interface offers virtually unlimited data capacity to overcome the memory limitations within R. A basic problem of the current implementation is the large decrease in speed caused by the proxy method. The recently introduced ODBC driver has not so far been tested to compare it to the RPgSQL route. Perhaps future versions of the PostgreSQL driver will offer fast data exchange as it is required for GIS


Figure 7: Soil pH values within the Spearfish region.

For starting up the systems, R is called from within the GRASS software:

grass5.0beta
R --vsize=30M --nsize=500K

Then the GRASS/R interface is loaded:

> library(GRASS)
> G <- gmeta()                        # reconstructed: read the GRASS metadata
> soilsph <- rast.get(G, "soilsph")   # reconstructed: read the soil pH raster
> names(soilsph) <- "ph"              # reconstructed from the later use of soilsph$ph
> gc()

    free total (Mb)

    Ncells 367838 512000 9.8

    Vcells 3218943 3932160 30.0

Depending on the resolution previously set in GRASS, the demand on memory differs. The next step is to copy the map "soilsph" into the PostgreSQL database system. A new database needs to be created to store the new pH value table:


    > system("createdb soils_ph_db")

    > library(RPgSQL)

    > db.connect("localhost", "5432", "soils_ph_db")

    > db.write.table(soilsph$ph, name="soilsph", no.clobber=F)

It is recommended to use the "name" parameter in db.write.table to avoid name conflicts (as PostgreSQL rejects special characters within table names).

To simulate later access to the stored table, we disconnect from the SQL backend and reconnect again without loading the dataset. Using the gc() command we can watch the memory consumption of our dataset.

    > db.disconnect()

    > rm(soilsph)

    > gc()

    free total (Mb)

    Ncells 364378 512000 9.8

    Vcells 3884663 3932160 30.0

Most of the R memory is now free again. We connect back to PostgreSQL and now, instead of reading in the table, a proxy binding is set up as described above:

    > db.connect("localhost", "5432", "soils_ph_db")

    > bind.db.proxy("soilsph")

    > gc()

    free total (Mb)

    Ncells 364395 512000 9.8

    Vcells 3882569 3932160 30.0

There is nearly no change in the memory requirement. A small problem is the decrease in speed, as the requested data have to pass through the proxy system.

    > summary(soilsph)

    Table name: soilsph

    Database: soils_ph_db

    Host: localhost

    Dimensions: 1 (columns) 26600 (rows)

Selected data can be accessed by specifying the minimum and maximum row numbers:

> soilsph[495:500]
    I.data.
495       4
496       4
497       4
498       4
499       4
500       3

But usually we would want to analyse the full dataset (here equal to our soils map with distributed pH values). To find out the maximum number of values, alias raster cells, we can enter:

> maxrows <- nrow(soilsph)           # reconstructed; 26600 rows as reported above
> soils.ph.frame <- data.frame(east(G), north(G),
+   soilsph$ph[1:maxrows])           # frame construction reconstructed
> soilsph$ph[soilsph$ph == 0] <- NA  # reconstructed: zero cells represent NULL


> names(soils.ph.frame) <- c("x", "y", "z")   # names reconstructed from the summary below
> summary(soils.ph.frame)
       x                y                  z
 Min.   :590100   Min.   :4914000   Min.   :  1.000
 1st Qu.:594800   1st Qu.:4918000   1st Qu.:  3.000
 Median :599500   Median :4921000   Median :  4.000
 Mean   :599500   Mean   :4921000   Mean   :  3.379
 3rd Qu.:604300   3rd Qu.:4924000   3rd Qu.:  4.000
 Max.   :609000   Max.   :4928000   Max.   :  5.000
                                    NA's   :977.000

To calculate the cubic trend surface, we have to mask NULL values from the calculation. The function to fit a trend surface through the data points is provided by the "spatial" package:

> library(spatial)
> names(soils.ph.frame)
> ph.ctrend <- surf.ls(3, na.omit(soils.ph.frame))   # cubic trend surface; call reconstructed
> plot(G, trmat.G(ph.ctrend, G, east=east(G), north=north(G)),
+   reverse=reverse(G), col=terrain.colors(20))


Figure 8: Trend surface of pH values within the Spearfish region.

    4.2 Pontypool PCBs sites data

The interface package is furnished with examples using a data set used extensively in Bailey and Gatrell (1995, p. 149), with positions and values of standardised PCB contamination around a plant for incinerating chemical wastes. In the following, the sites data are in a GRASS database called "pontypool2", and the interface will be used to display and analyse the data.

Initially, the data are moved to R as a data frame; at present sites.get and sites.put use ASCII file transfer, and column names in the GRASS sites file (if available) are ignored. The names of the data frame are tidied up, and the contamination values are classified into five classes, one of which turns out to be empty. Using glyph and colour coding, a map of the values is drawn, showing three markedly higher values at or near the incinerator site.

    > library(GRASS)

    Running in /usr/local/grass5.0b/data/pontypool2/rsb

    > system("g.list sites")


site list files available in mapset rsb:
ex.pcbs.in

> G <- gmeta()                       # call reconstructed
> summary(G)
Data from GRASS 5.0 LOCATION pontypool2 with 88 columns and 108 rows; UTM, zone: 30
The west-east range is: 68000, 70200, and the south-north: 74500, 77200;
West-east cell sizes are 25 units, and south-north 25 units.

> pcbs <- sites.get(G, "ex.pcbs.in") # call reconstructed from the text above
> str(pcbs)
`data.frame': 70 obs. of 4 variables:
 $ east : num 68504 68353 68405 68546 68510 ...
 $ north: num 76800 76868 76709 76631 77091 ...
 $ var1 : num 1 2 3 4 5 6 7 8 9 10 ...
 $ var2 : num 9 6 11 25 11 9 3 7 6 5 ...
> colnames(pcbs) <- c("x", "y", "id", "z")   # names reconstructed from str() below
> str(pcbs)
`data.frame': 70 obs. of 4 variables:
 $ x : num 68504 68353 68405 68546 68510 ...
 $ y : num 76800 76868 76709 76631 77091 ...
 $ id: num 1 2 3 4 5 6 7 8 9 10 ...
 $ z : num 9 6 11 25 11 9 3 7 6 5 ...
> summary(pcbs$z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00    8.00   11.00  106.00   19.75 4620.00
> pcbs.o <- ordered(cut(pcbs$z, c(0, 25, 50, 500, 1500, 5000),
+   labels=c("insignificant", "low", "medium", "high", "crisis")))  # breaks assumed; original garbled
> table(pcbs.o)
pcbs.o
insignificant   low  medium  high  crisis
           53    14       1     0       2
> plot(G)
> rect(G$xlims[1], G$ylims[1], G$xlims[2], G$ylims[2], col="gray")
> Col <- rainbow(5)                  # palette assumed; assignment garbled in source
> points(pcbs$x, pcbs$y, pch = codes(pcbs.o), col=Col[codes(pcbs.o)])
> legend(c(67560, 68300), c(75050, 74500), pch = c(1:5),
+   legend = levels(pcbs.o), col=Col, bg="gray")
> title("PCBs contamination levels")


Before proceeding to interpolate contamination values across the study area, the distribution of the observed measurements is plotted against the normal distribution on quantile-quantile plots:

> oldpar <- par(mfrow=c(1,2))        # reconstructed
> qqnorm(pcbs$z, ylab="Sample quantiles of pcbs values")
> qqnorm(log(pcbs$z), ylab="Sample quantiles of log pcbs values")
> par(oldpar)


Finally, two plots are made of spline interpolation, the first from the akima library, using a wrapper function in the interface package employing bicubic splines and default parameters. The second uses the recently revised s.surf.rst GRASS program with specified parameters; in both cases interpolation is based on the logarithm of the observed values. In the first case, interpolations are only made within the convex hull of the observation points, while in the second case for the whole region.

> oldpar <- par(mfrow=c(1,2))                   # reconstructed
> ak2 <- interp(pcbs$x, pcbs$y, log(pcbs$z))    # akima call reconstructed; arguments assumed
> plot(G, ak2, reverse=reverse(G))
> points(pcbs$x, pcbs$y, pch=18)
> title("Bicubic spline interpolation")
> sites.put(G, lname="ex.pcbslog.in", east=pcbs$x, north=pcbs$y,
+   var=log(pcbs$z))
> system("g.list sites")
site list files available in mapset rsb:
ex.pcbs.in ex.pcbslog.in
> system(paste("s.surf.rst input=ex.pcbslog.in elev=ex.rst",


    + "tension=160 smooth=0.0 segmax=700"))

    > plot(G, rast.get(G, "ex.rst", F)$ex.rst, reverse=reverse(G))

    > points(pcbs$x, pcbs$y, pch=18)

    > title("Regularized Spline with Tension")

    > par(oldpar)

    4.3 Assist/Leicester

We will here start with the simple example of exercise 1 in the NW Leicestershire Grass Seeds data set (http://www.geog.le.ac.uk/assist/grass/data/leics/leics.tar.Z), for an area of 12x12 km at 50x50 m cell resolution. The two raster map layers involved are topo, elevation in metres interpolated from 20 m interval contours digitised from a 1:50,000 topographic map, and landcov, a land use classification into eight classes from Landsat TM imagery. Exercise 1 asks the user to find the average height of each of the land use classes, and the area occupied by each of the classes in hectares. In GRASS, this is done by creating a reclassed map layer with r.average, which assigns mean values to the category labels, and r.report to access the results.

    > system("r.average base=landcov cover=topo output=avheight")

    r.stats: 100%

    percent complete: 100%

    > system("r.report landcov units=c,h")

    r.stats: 100%

    +

    | RASTER MAP CATEGORY REPORT

    |LOCATION: leics Mon Jul 17 15:07:39 20

    |

    | north: 322000 east: 456000

    |REGION south: 310000 west: 444000

    | res: 50 res: 50

    |


elevation by land use, making box width proportional to the numbers of cells in each land use class (note that the first, empty, count is dropped by subsetting c2[2:9]).

> library(GRASS)
Running in /usr/local/grass5.0b/data/leics/rsb
> G <- gmeta()                                    # call reconstructed
> leics <- rast.get(G, rastlist=c("landcov", "topo"),
+   catlabels=c(TRUE, FALSE))                     # call reconstructed
> str(leics)
List of 2
 $ landcov.f: Factor w/ 9 levels "Background","In..",..: 6 6 6 6 6 6 6 6 6 6
 $ topo     : num [1:57600] 80 80 80 80 80 80 80 80 78 76 ...
> c1 <- tapply(leics$topo, leics$landcov.f, mean) # reconstructed: mean height by class
> c2 <- table(leics$landcov.f)                    # reconstructed: cell counts
> c3 <- cbind(c1, c2 * 0.25)                      # reconstructed: 50m x 50m cells are 0.25 ha
> colnames(c3) <- c("Average Height", "Total Area")
> c3[2:9,]
            Average Height Total Area
Industry          83.61101     417.75
Residential       76.30066    1658.00
Quarry           135.25405     138.75
Woodland         142.94454     437.25
Arable           104.02143    5728.25
Pasture          133.48335    4204.00
Scrub            148.24870    1639.50
Water             99.26912     176.50
> boxplot(leics$topo ~ leics$landcov.f, width=c2[2:9], col="gray")


    Figure 9. Boxplot of elevation in metres by land cover classes, NW Leicestershire; box widths proportional to

    share of land cover types in study region.

It is of course difficult to carry out effective data analysis, especially graphical data analysis, in a GIS like GRASS, but boxplots such as these are of great help in showing more of what the empirical distributions of cell values look like.

Over and above substantive analysis of map layers, we are often interested in checking the data themselves, particularly when they are the result of interpolation of some kind. Here we use a plot of the empirical cumulative distribution function to see whether the contours used for digitising the original data are visible as steps in the distribution; as Figure 10 shows, they are.

    > library(stepfun)

    > plot(ecdf(leics$topo), verticals=T, do.points=F)

    > for (i in seq(40,240,20)) lines(c(i,i),c(0,1),lty=2, col="sienna2")


Figure 10. Empirical cumulative distribution function of elevation values in metres, NW Leicestershire; also known as the hypsometric integral.

A further tool to apply in the search for the impact of interpolation artifacts is the density plot over elevation. A conclusion that can be drawn from Figure 11 is that only at bandwidths above 6 metres do the artifacts cease to be intrusive.

    > plot(density(leics$topo, bw=2), col="black")

    > lines(density(leics$topo, bw=4), col="green")

    > lines(density(leics$topo, bw=6), col="brown")

    > lines(density(leics$topo, bw=8), col="blue")

    > for (i in seq(40,240,20)) lines(c(i,i),c(0,0.015),lty=2, col="sienna2")

    > legend(locator(), legend=c("bw=2", "bw=4", "bw=6", "bw=8"),

    + col=c("black", "green", "brown", "blue"), lty=1)


    Figure 11. Density plots of elevation in metres from bandwidth 2 to 8 metres showing the effects of

    interpolating from digitised contours.

    4.4 Lisa MidWest socioeconomic data

While the interface mechanisms discussed above have addressed the needs of users with raster data, we turn here to the challenging topic of thematic mapping of attribute data associated with closed polygons. In the S language family, support is available outside R for a map() function, permitting polygons to be filled with colors chosen from a palette stored in a vector. This function uses binary vector data stored in an S-specific format in three files: a file of edge line segments with coordinates, typically geographical coordinates projected to the user-chosen specification at plot time; a polygon file listing the edges, with negative pointers for reversed edges; and a named labels file, joining the polygons to data vectors. In the S community, and more generally at Bell Labs, interest has been shown in geographical databases, which has resulted in the release of an S map.builder enabling users to construct their own databases from collections of line segments, defined as pairs of coordinates. The methods employed are described in Becker (1993) and Becker and Wilks (1995). Ray Brownrigg has ported some of the code to R, and maintains an FTP site with helpful map materials.

The data set used for the example given here was included in the materials of the ESDA with LISA conference held in Leicester in 1996. The data are for the five midwest states of Illinois, Indiana, Michigan, Ohio, and Wisconsin, and stem from the 1990 census. The boundary data stem from the same source, but the units used are not well documented; there are 437 counties included in the data set. For this presentation, the position data were transformed from an unknown projection back to geographical coordinates, and projected from these to UTM zone 16. R regression and prediction functions were used to carry out a second order polynomial transformation, as in GRASS i.rectify2 used for image data. Projection was carried out using the GMT mapproject program.


It did not prove difficult to import the reformatted midwest.arc file into GRASS with v.in.arc, but joining it to the midwest.pat data file was substantially more difficult. As an alternative, the midwest.arc file was converted using awk to a line segment input file for map.builder, a collection of simple awk and shell scripts and two small programs compiled from C: intersect and gdbuild. This, after the removal of duplicate line segments identified by intersect, permitted the production of a polylines file and a polygon file. These were converted in awk to a form permitting R lists to be built using split(). From there, getting R to do maps was uncomplicated, by reading in the files and splitting them. The split() function is an R workhorse, being central to the tapply() used above, and is an optimised part of R's internal compiled code. It splits its first argument by the second, returning a list with elements containing vectors with the component parts of the first argument. A short utility function, mkpoly(), was used to build a new list of closed polygons from the input data; a sketch is given below.
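The source of mkpoly() is not reproduced here; a rough sketch of what such a helper must do (our reconstruction, relying only on the file format described above) is:

mkpoly <- function(edges, sx, sy) {
    # assemble one closed polygon from a vector of signed edge numbers:
    # a negative number marks an edge traversed in reverse; sx and sy are
    # lists of the x and y coordinate vectors of the edge line segments
    x <- numeric(0)
    y <- numeric(0)
    for (e in edges) {
        xs <- sx[[abs(e)]]
        ys <- sy[[abs(e)]]
        if (e < 0) {
            xs <- rev(xs)    # flip the reversed edge
            ys <- rev(ys)
        }
        x <- c(x, xs)
        y <- c(y, ys)
    }
    list(x=x, y=y)           # a form that polygon() accepts
}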

> mwpl <- scan("mwpl.dat", list(id=0, x=0, y=0))    # session reconstructed; file
> mwpe <- scan("mwpe.dat", list(id=0, edge=0))      #   names and details assumed
> s.mwpe <- split(mwpe$edge, mwpe$id)
> sx.mwpl <- split(mwpl$x, mwpl$id)
> sy.mwpl <- split(mwpl$y, mwpl$id)
> polys <- vector("list", length(s.mwpe))
> for (i in 1:length(s.mwpe))
+   polys[[i]] <- mkpoly(s.mwpe[[i]], sx.mwpl, sy.mwpl)
> mwcpl <- scan("mwcpl.dat", list(id=0, x=0, y=0))  # polygon label points
> library(splancs)
> pnames <- read.table("pnames.dat")                # polygon-to-county names
> mwdata <- read.table("midwest.pat", row.names=1)  # county attribute table
> ophp <- ordered(cut(mwdata$php, quantile(mwdata$php), include.lowest=TRUE))
> id.ophp <- data.frame(id=pnames$id, ophp=ophp)
> pointers <- integer(nrow(id.ophp))
> for (i in 1:length(pointers))
+   pointers[i] <- which(pnames$id == id.ophp$id[i])
> gr <- gray(seq(0.9, 0.3, length=4))               # grey palette assumed
> library(MASS)
> eqscplot(c(100000, 1100000), c(100000, 1300000), type="n")
> for (i in 1:nrow(id.ophp))
+   polygon(polys[[pointers[i]]], col=gr[codes(id.ophp$ophp[i])])


    > legend(c(107000, 119300), c(476000, 95000),

    + legend=levels(id.ophp$ophp), fill=gr)

    Figure 12. Percentage of persons aged 25 years and over with professional or higher degrees, 1990, five

    MidWest states.

A further example is to use the pointer vector to paint polygons meeting chosen attribute conditions; this is analogous to the examples given in the GRASS-PostgreSQL documentation. The R which() function is not unlike the SQL SELECT command, but returns the row numbers of the objects found in the data table, rather than the requested tuples. To retrieve selected columns, their names or indices can be given in the columns field of the [ , ] access operator for data frames. As may have been noticed, there is also an [[ ]] access operator for lists, so that polys[[pointers[i]]] returns the pointers[i]'th list element of polys, which is itself a list of two numeric vectors of equal length, containing the x and y coordinates of the polygon. When we attempt to label the five counties with percentages of persons 25 years and over with professional and higher degrees over 14 percent, we have to go back through the chain of pointers to find the label point coordinates originally used to label the polygons at which to plot the text.
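The analogy can be stated in two lines (our illustration, using the mwdata columns introduced above):

# roughly: SELECT pop, php FROM mwdata WHERE php > 14;
rows <- which(mwdata$php > 14)    # row numbers of the matching observations
mwdata[rows, c("pop", "php")]     # then fetch the requested columns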


> where <- which(mwdata$php > 14)    # assignment reconstructed from str() below
> str(where)
 int [1:5] 10 39 155 181 275
> mwdata[where, c("pop", "pcd", "php")]
                    pop      pcd      php
CHAMPAIGN (IL)   173025 41.29581 17.75745
JACKSON (IL)      61067 36.64367 14.08989
MONROE (IN)      108978 37.74230 17.20123
TIPPECANOE (IN)  130598 36.24544 15.25713
WASHTENAW (MI)   282937 48.07851 20.79132
> pointers <- integer(nrow(mwdata))                  # join reconstructed as above
> for (i in 1:length(pointers))
+   pointers[i] <- which(pnames$id == id.ophp$id[i])
> eqscplot(c(100000, 1100000), c(100000, 1300000), type="n")
> for (i in 1:nrow(mwdata)) polygon(polys[[i]])
> for (i in which(mwdata$php > 4.447 & mwdata$php <= 10))
+   polygon(polys[[pointers[i]]], col="yellow")      # fill colours assumed
> for (i in which(mwdata$php > 10 & mwdata$php <= 14))
+   polygon(polys[[pointers[i]]], col="orange")      # fill colours assumed
> for (i in which(mwdata$php > 14))
+   polygon(polys[[pointers[i]]], col="red")
> texts <- rownames(mwdata)[where]                   # labels assumed
> text(mwcpl$x[pnames$id[pointers[where]]],
+   mwcpl$y[pnames$id[pointers[where]]], texts, pos=4, cex=0.8)


    Figure 13. Finding counties with high percentages of persons aged 25 years and over with professional or

    higher degrees, 1990, five MidWest states.

To conclude this subsection, the maps package county database is used to map the same variable; the database is plotted using unprojected geographical coordinates at present. The problem that arises is that the map() function fills the polygons with specified shades or colours in the order of the polygons in the chosen database, and joining the data table attribute rows to the polygon database is thus a point of weakness. Here we see that the county database returns a vector of 436 names of counties in the five states, while the ESDA-LISA MidWest data reports 437. After translating the county names into the same format, the %in% operator is used to locate the county name or names not in the county database. Fortunately, only one county is not matched, Menominee (WI), an Indian Reservation in the north east of the state. The next task is to use its label coordinates from the MidWest data set to find out in which county database polygon it has been included. While there must be a way of reformatting the polygon data extracted from map(), use was made here of awk externally, and then of split() and inout() as above. The county from the county database in which Menominee has been subsumed is its neighbour Shawano, with which it is aggregated on the relevant variables before mapping. In fact, both Menominee and Shawano were in the bottom decile before aggregation, with Menominee holding the lowest rank for percentage aged 25 and over holding higher or professional degrees, and Shawano ranked 33. The aggregated Shawano comes in at rank 20 of the 436 county


    database counties.

> library(maps)
> mw <- ...
> cnames <- ...
> str(cnames)
 chr [1:436] "illinois,adams" "illinois,alexander" "illinois,bond" ...
> mwnames <- ...
> str(mwnames)
 chr [1:437] "adams (il)" "alexander (il)" "bond (il)" "boone (il)" ...
> ntab <- ...
> names(ntab) <- ...
> ntab
 illinois   indiana  michigan      ohio wisconsin
   "(il)"    "(in)"    "(mi)"    "(oh)"    "(wi)"
> for (i in 1:length(mwnames)) {
+   x <- ...
+ }
> mw[405,]
                 id  area  pop     dens white black   ai as or       pw pb
MENOMINEE (WI) 3020 0.021 3890 185.2381   416     0 3469  0  5 10.69409  0
                    pai pas       por pop25      phs      pcd       php popp
MENOMINEE (WI) 89.17738   0 0.1285347  1922 62.69511 7.336108 0.5202914   38
                ppoppov    ppov   ppov17   ppov59   ppov60 met categ
MENOMINEE (WI) 98.20051 48.6911 64.30848 43.31246 18.21862   0   LHR
> mwpts <- ...
> mwpts[which(mwpts$id == 3020),]
     id      x       y        e       n
51 3020 13.983 10.2195 -88.6893 44.9859
> cpolys <- ...
> str(cpolys)
List of 3
 $ x    : num [1:7381] -91.5 -91.5 -91.5 -91.5 -91.4 ...
 $ y    : num [1:7381] 40.2 40.2 40.0 40.0 39.9 ...
 $ range: num [1:4] -92.9 -80.5 37.0 47.5
> summary(cpolys$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -92.92  -89.16  -87.15  -86.80  -84.47  -80.51  435.00
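Because this transcript is also incomplete, the matching logic can be restated as a self-contained sketch: normalise the two name vectors, detect the unmatched county with %in%, and test its label point against candidate polygons with inout() from the splancs package. All object names and coordinates below are toy values, not the MidWest data:

# Hedged sketch of the matching steps; names and coordinates are invented.
library(splancs)                       # provides inout()
cnames <- c("wisconsin,shawano", "wisconsin,oconto")
mwnames <- c("shawano (wi)", "menominee (wi)", "oconto (wi)")
ntab <- c("(il)" = "illinois", "(in)" = "indiana", "(mi)" = "michigan",
          "(oh)" = "ohio", "(wi)" = "wisconsin")
state <- substring(mwnames, nchar(mwnames) - 3)
county <- substring(mwnames, 1, nchar(mwnames) - 5)
mwnames2 <- paste(ntab[state], county, sep = ",")
mwnames2[!(mwnames2 %in% cnames)]      # "wisconsin,menominee"
# locate the county database polygon containing its label point
polys <- list("wisconsin,shawano" = cbind(c(0, 2, 2, 0), c(0, 0, 2, 2)),
              "wisconsin,oconto" = cbind(c(2, 4, 4, 2), c(0, 0, 2, 2)))
labpt <- cbind(1, 1)                   # label coordinates of the lost county
for (nm in names(polys))
    if (inout(labpt, polys[[nm]])) print(nm)   # "wisconsin,shawano"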


Figure 14. Percentage of persons aged 25 years and over with professional or higher degrees, 1990, five MidWest states, mapped using the R maps package.

This worked example points up the difficulties caused by absent metadata needed for integrating data tables with mapped representations, and suggests that further work on the maps package is needed to ensure that users actually map what they assume they are mapping. It is very possible that the RPgSQL link could be used to monitor the joining of map polygon databases to attribute database tables.
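As a sketch of what such monitoring might look like, the two name vectors could be pushed into PostgreSQL and an outer join left to report the mismatches. The RPgSQL function names used here follow that package's documentation, and the database, table and column names are our own invention:

# Hypothetical join audit via RPgSQL; assumes a running PostgreSQL server
# and a database "geo", neither of which is part of the example above.
library(RPgSQL)
db.connect(dbname = "geo")
db.write.table(data.frame(name = cnames), name = "polynames")
db.write.table(data.frame(name = mwnames2), name = "attrnames")
db.execute(paste("CREATE TABLE unmatched AS",
                 "SELECT a.name FROM attrnames a",
                 "LEFT JOIN polynames p ON a.name = p.name",
                 "WHERE p.name IS NULL"))
db.read.table("unmatched")             # attribute rows lacking a polygon
db.disconnect()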

    5 Conclusions and further developments

GRASS, R, PostgreSQL, XGobi and other programs used here are all open source, and for this reason are inherently transparent. Within the open source community generally, it has become a matter of common benefit to adopt openness, especially with respect to problems, because this increases the chances of resolving them. In this spirit, we would like to point more to issues needing resolution than to the self-evident benefits of using both the programs discussed above and the proposed interface mechanisms.

While progress is being made in GRASS on associating site, line and area objects with attribute values, not least in the context of the GRASS-PostgreSQL link, it is still far from easy to join geographical objects to tables of attributes using object id numbers. Migration to the new sites format in GRASS is not fully complete, and even within this format, it is not always clear how the attribute table should be used. Similar reservations hold for vector map layers.

In the past, attempts have been made to use PostgreSQL's geometric data types, which support the same kind of object model as the OGC Simple Feature model used in GML. They are not in themselves built topologies, but to move from cleaned elements to topology is feasible. GRASS has interactive tools for accomplishing this, but on many occasions GRASS users import vector data from other sources, support for which is increasing, now also with support for storing attribute tables in PostgreSQL. In the example above, use was made of tools from the S map.builder collection to prepare cleaned and built topologies for the MidWest data set. Here again, it was necessary to combine automation of topology construction with object labelling to establish a correct mapping from clean polygons to attribute data, as is also done in GRASS v.in.arc and other import programs.
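As a sketch under the same assumptions as above (RPgSQL call names, an available database "geo"), storing a cleaned boundary in PostgreSQL's built-in polygon type and testing point containment with its geometric operators might look like this; the table, id and coordinates are invented:

# Hypothetical sketch: the built-in "polygon" type stores simple features
# without topology; shared boundaries are simply duplicated between rows.
library(RPgSQL)
db.connect(dbname = "geo")
db.execute("CREATE TABLE county_geom (id integer, boundary polygon)")
db.execute(paste("INSERT INTO county_geom VALUES",
                 "(3020, '((0,0),(2,0),(2,2),(0,2))')"))
# containment test with a geometric operator (polygon contains point)
db.execute("SELECT id FROM county_geom WHERE boundary @> point '(1,1)'")
db.disconnect()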

These concerns are certainly among those that increase the relevance of work in the Omega project on interfacing database, analysis and presentation systems, including three-tier interfaces using CORBA/IDL. They also indicate a possible need for rethinking the GRASS database model, so that more components can be entrusted to an abstract access layer rather than to hard-wired directory trees (work that is already being done for attribute data).

On the data analysis side, most of the key questions have been taken up and are being investigated in the Omega project, especially those of importance for raster GIS interfaces, like very large data sets. R's memory model is most effective for realistically sized statistical problems, but is restricting for applications like remote sensing or medical imaging, where special solutions are needed (particularly to avoid using data frames). This issue of scalability deserves urgent attention, as do ways of parallelising R tasks for distribution to multiple compute servers; combining such parallelisation with the R-PostgreSQL link to feed compute nodes with subsets of proxied data tables is not far off.

Finally, we would like to encourage those who feel a need or interest in open source development to contribute: it is rewarding, and although combining projects with closed deliverables with apparent altruism may seem difficult, experience of the open source movement indicates that, on balance, projects advance better when difficulties are shared with others, who may have met similar difficulties before.


    References

Abelson, H., Sussman, G. and Sussman, J. 1996. Structure and interpretation of computer programs. (Cambridge MA: MIT Press).

Bailey, T. and Gatrell, A. 1995. Interactive spatial data analysis. (Harlow: Longman).

Becker, R. A. 1993. Maps in S. Bell Labs, Lucent Technologies, Murray Hill, NJ.

Becker, R. A. and Wilks, A. R. 1995. Constructing a geographical database. Bell Labs, Lucent Technologies, Murray Hill, NJ.

Bivand, R. S. 1999. Integrating GRASS 5.0 and R: GIS and modern statistics for data analysis. In: Proceedings 7th Scandinavian Research Conference on Geographical Information Science, Aalborg, Denmark, pp. 111-127.

Bivand, R. S. 2000. Using the R statistical data analysis language on GRASS 5.0 GIS database files. Computers & Geosciences, 26, pp. 1043-1052.

Bivand, R. S. and Gebhardt, A. (forthcoming). Implementing functions for spatial statistical analysis using the R language. Journal of Geographical Systems.

Chambers, J. M. 1998. Programming with data. (New York: Springer).

Ihaka, R. 1998. R: past and future history. Unpublished paper.

Ihaka, R. and Gentleman, R. 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5, 299-314.

Jones, D. A. 2000. A common interface to relational databases from R and S: a proposal. Bell Labs, Lucent Technologies, Murray Hill, NJ (Omega project).

Neteler, M. 2000. Advances in open-source GIS. Proceedings 1st National Geoinformatics Conference of Thailand, Bangkok.

Venables, W. N. and Ripley, B. D. 1999. Modern applied statistics with S-PLUS. (New York: Springer); book resources: http://www.stats.ox.ac.uk/pub/MASS3.

    Appendices

    Interface package help pages

    GRASS package for R

    R code for making and labelling polygons
