

Open Source geocomputation: using the R data analysis language integrated with GRASS GIS and PostgreSQL database systems

Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Breiviksveien 40, N-5045 Bergen, Norway
Email: [email protected]

Markus Neteler
Institute of Physical Geography and Landscape Ecology, University of Hannover, Schneiderberg 50, D-30167 Hannover, Germany
Email: [email protected]

    Abstract

We report on work in progress on the integration of the GRASS GIS, the R data analysis programming language and environment, and the PostgreSQL database system. All of these components are released under Open Source licenses. This means that application programming interfaces are documented both in source code and in other materials, simplifying insight into the workings of the respective systems. Current versions of this note and accompanying code are to be found at the Hannover GRASS site, together with earlier papers on related topics.


    1 Introduction

The practice of geocomputation is evolving to take advantage of the peer-review qualities brought to software development by access to source code. The progress reported on in this paper covers recent advances in the GRASS geographical information system, in particular the introduction of floating point raster cell values and an enhanced sites data format, in the R data analysis language environment, and in their interfacing both directly and through the PostgreSQL database system.

The interactive R language, a dialect of S (Chambers 1998), is eminently extensible, providing for functions written by users, for the dynamic loading of compiled code into user functions, and for the packaging of libraries of such functions for release to others. R, unlike S-PLUS, another dialect of S, conducts all computation in memory, presenting a potential difficulty for the analysis of very large data sets. One way to overcome this is to utilize a proxy mechanism through an external database system, such as PostgreSQL; others will be forthcoming as the Omega project matures.

While R is a general data analysis environment, it has been extensively used for modelling and simulation, and for comparison of modern statistical classification methods with, for example, neural nets, and is well suited to assist in unlocking the potential of data held in GIS. Over and above classical, graphical, and modern statistical techniques in the base R library and supplementary packages, packages for point pattern analysis, geostatistics, exploratory spatial data analysis and spatial econometrics are available. The paper is illustrated with worked examples showing how open source software development can benefit geocomputation. Previous work on the use of R with spatial data is reported by Bivand and Gebhardt (forthcoming), and on interfacing R and GRASS by Bivand (1999, 2000).

    2 Open Source Development and GeoComputation

One of the most striking innovations to emerge with full vigour in the late 1990s, but building on earlier precedent, is Open Source software. While there are good technical reasons for releasing the source code of computer programs to anyone connected to the Internet, or willing to spend little more than the cost of a bottle of beer on a CD-ROM, the trend profoundly affects the measurement of research and technological development expenditure and investment. Since the software tools are not priced, knowledge, the practice of winning new knowledge, and the style in which new knowledge is won are sharply focused.

This style is of key importance, and involves the honing of ideas in open collaboration and competition between at least moderately altruistic peers. The prime examples are the conception of the World Wide Web and its fundamental protocols at CERN, by an errant scientist simply interested in sharing research results in an open and seamless way; the transport mechanism for about 80% of all electronic mail, sendmail, written from California but now with a world cast of thousands of contributors; and the Linux operating system, developed by a Finnish student of undoubted organisational talent, but perhaps luckily no training in the commercialisation of innovations, nor in business strategy.

For these innovators, it has been important not to be hindered by institutional barriers and hierarchies. Acceptance and esteem in Open Source communities are grounded in the quality of contributions and interventions made by individuals in a completely flat structure. Asking the right question, leading to a perverse bug being identified or solved, immediately confers great honour on the person making the contribution. In money terms, probably no commercial organisation could afford to employ problem chasers or consultants to solve such problems: they are simply too costly. The entry into the value chain comes later, in the customisation, packaging and reselling of software, in software services rather than in the production and sale of software itself.


As Eric Raymond has shown in three major analyses, most importantly in ``The Magic Cauldron'', software is generally but wrongly thought of as a product, not a service. Typically, software engineers work within organisations to provide and administer services internally, customising database applications, adapting purchased software and network systems to the specific needs of their employers. Very few, less than ten percent, are employed in companies that make their living exclusively from selling software. Assuming that these programmers have equivalent productivity, one could say that corporate internal markets ``buy'' most of the software written, and that very little of that code would be of any value to a competitor, even if the competitor could read it. Software looks and feels much more like a service than a product.

If software is a service, not a product, then the commercial opportunities lie in customising it to suit specific needs, unless there really is a secret algorithm embedded in the code which cannot be emulated in any satisfactory way. This is an issue of some weight in relation to intellectual property rights, which are not well adapted to software, in that software algorithms are only seldom truly original, and even less often the only feasible solutions to a given problem. Even without seeing each other's source code, two programmers might well reach very similar solutions to a given problem; if they did see each other's code, the joint solution would very likely be a better one. Reverse engineering physical products can be costly and time-consuming, but doing the same to software is easy, although one may question the ethics of those doing this commercially. Most often programmers are driven to reverse engineer software with bugs, because the seller cannot or will not help them make it work. An important empirical regularity pointed out by Raymond is the immediate fall in value of software products to zero when the producer goes out of business or terminates the product line. It is as though customers pay for the product as some kind of advance payment for future updates and new versions. If they knew that no new versions would be coming, they would drop the product immediately, even though it still worked as advertised.

Raymond summarises five discriminators pushing towards open source:

- reliability, stability, and scalability are critical;
- correctness of design and implementation cannot readily be verified by other means than independent peer review;
- the software is critical to the user's control of his/her business;
- the software establishes or enables a common computing and communications infrastructure;
- key methods (or functional equivalents of them) are part of common engineering knowledge.

Any of these alone may be sufficient to cross the boundary at which external economies begin influencing behaviour in companies, termed network externalities in Raymond's article. If the company can benefit more in terms of its internal and/or external service quality and reliability, then there is good reason to consider opening its source.

The second of Raymond's criteria is of key importance in academic and scientific settings, but others also have a role to play. Bringing supercomputing power to larger and less well resourced communities has sprung from open source projects such as PVM and MPI, upon which others again, in particular Beowulf and friends, have developed. Criteria four and five go to the heart of the success of shared, open protocols for interaction and interfacing components of computing systems.

Next, we will present the two main program systems being integrated, before going on to show how the interfaces are constructed and function. We assume that interested readers will be able to supplement these sketches by accessing online documentation. In particular, online documentation for the PostgreSQL database system (or possibly also the newly GPL'd MySQL system) should be referred to. PostgreSQL is a leading open source ORDBMS, used in developing newer ideas in database technology. PostgreSQL also has native geometry data types, which have been used to aid vector GIS integration in other projects. Since some of our results are preliminary, we are anxious to receive suggestions from others interested in this work.


Mention has also been made of the Omega project, a project deserving attention as a source of inspiration for data analysis. Many of the key contributors to S and R are active in the Omegahat core developers' group, and are working on systems intended to handle distributed data analysis tasks, with streams of data arriving, possibly in real time, from distributed sources, and undergoing analysis, possibly on distributed compute servers, for presentation to possibly multiple viewers. At the moment, Omegahat involves XML, CORBA, and an S-like interactive language based directly on Java, and is experimental.

2.1 R

The background to R is interesting, in that it is based on two computer programming traditions: Bell Laboratories through S, and MIT through Scheme. As Ihaka (1998) relates, meeting the first edition of the classic SICP (1985; second edition: Abelson, Sussman and Sussman, 1996) opened up ``a wonderful view of programming''. He met Scheme at the same time as an early version of S, and quickly realized that some features of modern programming language design, such as lexical scoping, can lead to superior practical solutions. These advantages have, among others, brought Tierney, the author of Lisp-Stat, into the R core development team. Differences in the underlying computing models between S and R are many and important, and sometimes concern the active user. They are outlined in R system documentation, in the R FAQ available at the archive site, and in the R complement to Venables and Ripley (1997). Venables and Ripley are also actively porting their own work to R, partly as a service to their readers, and partly because the differences in the underlying computing models tease out potential coding infelicities. Many of these issues are actively debated on R discussion lists; for details, see the R archives.

The key differences lie in scoping and memory management. Scoping is concerned with the environments in which symbol/value pairs are searched during the evaluation of a function, and poses a problem for porting from S to R when use has been made of the S model, rather than that derived from Scheme (Ihaka and Gentleman, 1996, pp. 301-310).
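The practical difference can be seen in a two-line example (ours, not from the paper): R resolves free variables in the environment where a function was defined, so closures retain the values of their maker's arguments.

make.interest <- function(rate) {
    # `rate' is a free variable of the inner function; R's lexical scoping
    # finds it in make.interest()'s environment, where S's scoping rules
    # would search the global frame instead
    function(principal) principal * rate
}
at.5pct <- make.interest(0.05)
at.5pct(1000)    # 50 in R; under S scoping the inner `rate' is not found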

Memory management differs in that S generates large numbers of data files in the working directory during processing, allocating and reallocating memory dynamically. R, being based on Scheme, starts by occupying a user-specified amount of memory, but subsequently works within this, not committing any intermediate results to disk. Memory is allocated to fixed size ``cons cells'', used to store language elements, functions and strings, and to a memory heap, within which user data are accommodated. Within the memory heap, a simple but efficient garbage collector makes the best possible use of the memory at the program's disposal. This may hinder the analysis of very large data sets, since all data have to be held in memory, but in practice this problem has been alleviated by falling RAM prices. One reason cited by some authors for using R is that it is easier to compile dynamically loaded code modules for R than, for example, for S-PLUS, since you just need the same (open-source) compilation system that you used to install R in the first place.
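A minimal sketch of this route, with invented file and routine names: a C source file is compiled with the same tools that built R, then loaded and called.

# mysum.c (hypothetical) contains:
#   void mysum(double *x, int *n, double *ans) {
#       int i; *ans = 0.0;
#       for (i = 0; i < *n; i++) *ans += x[i];
#   }
# compiled at the shell with: R CMD SHLIB mysum.c   (produces mysum.so)
dyn.load("mysum.so")                  # load the shared library into R
x <- rnorm(1000)
res <- .C("mysum", as.double(x), as.integer(length(x)), ans=double(1))
res$ans                               # equals sum(x)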

Ihaka (1998) places the future of R clearly within open-source software. Indeed, the rapid development of R as a computing environment for data analysis and graphics bears out many of the points made by Raymond. Over several years, it has become clear that, even when well-regarded commercial software products are available, communities of users and developers are often able to gain momentum based on very rapid debugging by large numbers of interested participants: bazaar-style development. Open source is not limited to Unix or Unix-based operating systems, since open-source compilers and associated tools have been ported to proprietary desktop systems like MS Windows 95, MS NT 4, and others. These in turn have permitted software, like R, to be ported to these platforms, with little version slack.

R supports a rich variety of statistical graphics; one infelicity is that in general no provision is made for constraining the x and y axes of a plot to the same scale (with the exception of eqscplot() in the MASS library). The image() function plots filled polygons centred on the given grid centres, but there is no bundled image.legend() function in R to mirror the one in S. R has a legend() function, but it has to


    Figure 2. Integration of tools in a geographical information system.

From initially interfacing R and GRASS through loose coupling, transferring data through ad hoc text files, the work presented here involves tight coupling, running R from within the GRASS shell environment, and additionally system enhancement, by dynamically loading compiled GIS library functions into the R executable environment.

    Figure 3. Layering of shell, GRASS and R environments.

Because R has developed as an interactive language with clear scoping rules and sophisticated handling of environment issues, while GRASS is a collection of partly coordinated separate programs built on common library functions, constructing working interfaces with GRASS has been challenging. One of the advantages of working with open source projects in this context has been that occasionally poorly documented functions can be read as source code, and when necessary thoroughly instrumented and debugged, to establish how they


    actually work. This is of particular help when interface functions are being written, since the mindsets and

    attitudes of several different communities of programmers need to be established.

The key insight needed to establish the interface is that GRASS treats the shell environment as its basis, assigning values to shell variables on initiation which are used by successive GRASS programs, all of which are started with the values assigned to these variables. These name the current location and mapset, and these in turn point to session and default metadata on the location's regional extent and raster resolution, stored in a fixed form in the GRASS database, a collection of directories under the current location. Since GRASS simply sets up a shell environment, initiating specified shell variables, all regular programs, including R, can in principle be run from within GRASS, in addition to GRASS programs themselves. Finally, R has a system(command) function, where command is a string giving the command to be invoked; this command is executed in the shell environment in which R itself was started. This means that system() can be used from within R to run GRASS (and other) programs, including programs requiring user interaction. Alternatively, the tcltkgrass GUI front-end can be started before (or from) R, giving full access to GRASS as well as R. This is illustrated in Fig. 3.
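For example (a sketch using standard GRASS commands against the Spearfish data), any GRASS program can be run from the R prompt, and its output captured if required:

system("g.region -p")                     # print the current region settings
system("r.info elevation.dem")            # metadata for a raster layer
ver <- system("g.version", intern=TRUE)   # capture the command's output in R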

Furthest to the right in Fig. 3 is the use of GRASS library functions called from R through dynamically loaded shared libraries. The advantage of this is to speed up interaction and to allow R to be enhanced with GRASS functionality, in particular reading GRASS data directly into R objects. Dynamically loaded shared libraries also increase the speed of operations not vectorised within R itself, especially for loops. Vectors are the basic building blocks of R (note: 1-based vectors!), so many operations are inherently fast, because data is held in memory. However, reading the R interpreted and compiled code (for the base library), as well as the documentation on writing extension packages, gives a good idea of where compiling shared libraries may be worth the extra effort involved.
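The difference is easy to demonstrate (our illustration): summing a vector element by element in an interpreted loop is far slower than the single call into compiled code.

x <- runif(100000)
s <- 0
system.time(for (i in 1:length(x)) s <- s + x[i])  # interpreted loop
system.time(sum(x))                                # vectorised: one call into C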

    Figure 4. Feasible routes for data transfer between R, GRASS and PostgreSQL.

Native R supports the input and output of ASCII text data, especially in the most frequently used form of data frame objects. These have a tabular form, with rows corresponding to observations, each with a character string row name, and columns to variables, each again with a string column name. Data frames contain numeric vectors and factors (which may be ordered), and can also contain character vectors; factors are an R object class of some importance, and are integer-valued, with 1-based integers pointing to the string label vector attribute. The read.table() function is however a memory hog, since reading in large tables with character variables for conversion to factors can use up much more memory than is necessary. The scan()


function is more flexible, but data frames are certainly among the most used objects in R. Data frames differ from matrices in that matrices only contain one data type.
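The representation of factors described above can be inspected directly (a small illustration of ours):

f <- factor(c("arable", "pasture", "arable", "water"))
levels(f)        # the string label vector attribute: "arable" "pasture" "water"
as.integer(f)    # the 1-based integer codes: 1 2 1 3
unclass(f)       # the codes together with the levels attribute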

R objects may be output for re-entry with the dump() function, and entered using source(), both for functions and data objects. The R memory image may also be saved to disk for loading when a work session resumes, or for transfer to a different workstation. Every user of R has been caught by problems provoked by inadvertently loading an old image file, .RData, in the working directory on startup, in particular an apparent lack of memory despite little fresh data being read in.
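A short round trip shows the mechanism (file names are our own):

x <- 1:10
dump("x", file="x.R")    # write a text representation of the object
rm(x)
source("x.R")            # re-create x by parsing the dumped file
save.image()             # write the whole workspace to .RData for later sessions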

There are a number of variants of data entry into R, and work is progressing on interfacing S family languages to RDBMSs (James, 2000), within the Omega project. Other read.table() flavours are also available, for example for *.csv files. In the following subsections, we will describe how the R/GRASS interface has been developed.

    3.1 File transfer

Work on interfacing GRASS and R began from the facility in the GRASS s.menu program to output an S structure for raster map layers sampled at site point locations. This was evidently unsatisfactory as a general technique, although it still more or less works for the sampled points. Progress on file-based transfer using [rs].out.ascii, [rs].in.ascii, and r.stats -1 on the GRASS side, run through the R system() function, is reported in Bivand (1999) initially, and as a fully supported R package in Bivand (2000). This route is still used for site data in the current (0.1-4) version of the GRASS package for R. The help pages of this package are appended to this paper.

As an example, we can examine the code of the sites.get() function. The main part of the function is:

FILE


+
| Layer: elevation.dem                 Date: Tue Mar 14 07:15:44 1989
| Mapset: PERMANENT                    Login of Creator: grass
| Location: spearfish
| DataBase: /usr/local/grass5.0b/data
| Title: DEM (7.5 minute) ( elevation )
|
| Type of Map: cell                    Number of Categories: 1840
| Rows: 466
| Columns: 633
| Total Cells: 294978
| Projection: UTM (zone 13)
| N: 4928000  S: 4914020  Res: 30
| E:  609000  W:  590010  Res: 30
|
| Data Source:
| 7.5 minute Digital Elevation Model (DEM) produced by the USGS
|
| Data Description:
| Elevation above mean sea level in meters for spearfish database
|
| Comments:
| DEM was distributed on 1/2 inch tape and was extracted using
| GRASS DEM tape reading utilities. Data was originally in feet
| and was converted to meters using the GRASS map calculator
| (Gmapcalc). Data was originally supplied for both the
| Spearfish and Deadwood North USGS 1:24000 scale map series
| quadrangles and was patched together using GRASS
|
+

> system.time(spear <- rast.get(G, rastlist=c("elevation.dem", "landuse"),
+   catlabels=c(FALSE, TRUE)))     # call reconstructed; arguments assumed
> str(spear)
List of 2
 $ elevation.dem: num [1:106400] NA NA NA NA NA NA NA NA NA NA ...
 $ landuse.f    : Factor w/ 8 levels "residential",..: NA NA NA NA NA NA NA

The current interface is for raster and site GIS data; work on an interface for vector data is proceeding, but is complicated by substantial infelicities in GRASS methods for joining attribute values of vector objects to position data. Changes in GRASS programs like v.in.shape and its relatives are occurring rapidly, and are also associated with the use of PostgreSQL for storing tables of attribute data. We will return to this topic in concluding.


    3.2 Binary transfer

Early reactions to the release of the interpreted interface indicated problems with the bloat of using data frames to contain imported raster layers (now replaced by lists of vectors) and with confusing differences in metadata. It turned out that data written by both R and GRASS to ASCII files were subject to assumptions about the format and precision needed, so that direct binary transfer became a high priority to ensure that both programs were accessing the same data. While transfer times are not of major importance, since users tend to move data across the interface seldom, an improvement of better than an order of magnitude seemed worth the trouble involved in programming the interface in C.

> system.time(G <- gmeta())        # call reconstructed from context
> system.time(spear <- rast.get(G, rastlist=c("elevation.dem", "landuse"),
+   catlabels=c(FALSE, TRUE)))     # call reconstructed as above
> str(spear)
List of 2
 $ elevation.dem: num [1:106400] NA NA NA NA NA NA NA NA NA NA ...
 $ landuse.f    : Factor w/ 9 levels "no data","resid..",..: NA NA NA NA NA

A by-product of programming the raster interface in C has been the possibility of moving R factor levels across the interface as category labels loaded directly into the GRASS raster database structure. While incurring the cost of extra disk usage as compared to GRASS reclassing technology, raster file compression reduces the importance of this issue. In the case below of reclassifying the elevation data from Spearfish by quartiles, the Tukey five number summary, the resulting data file is actually smaller than its null file, recording NAs.

> elev.f <- ordered(cut(spear$elevation.dem, fivenum(spear$elevation.dem),
+   include.lowest=TRUE))          # call reconstructed from the quartile text above
> str(elev.f)
 Factor w/ 4 levels "(1.07e+03,1..",..: NA NA NA NA NA NA NA NA NA NA ...
> class(elev.f)
[1] "ordered" "factor"
> col <- terrain.colors(4)         # palette assumed; assignment garbled in source
> plot(G, codes(elev.f), reverse=reverse(G), col=col)
> legend(c(590000, 595000), c(4914000, 4917000),
+   legend=levels(elev.f)[1:2], fill=col[1:2], bty="n")
> legend(c(597000, 602000), c(4914000, 4917000),
+   legend=levels(elev.f)[3:4], fill=col[3:4], bty="n")


Figure 5. plot.grassmeta() object dispatch function used to display the classified elevation map in R.

> rast.put(G, "elev.f", elev.f, "Transfer of fivenum elevation classification",
+   cat=TRUE)
> system("ls -l $LOCATION/cell/elev.f")
-rw-r--r--   1 rsb   users    9909 Jul 15 14:09 elev.f
> system("r.info elev.f")
+
| Layer: elev.f                        Date: Sat Jul 15 14:09:14 2000
| Mapset: rsb                          Login of Creator: rsb
| Location: spearfish
| DataBase: /usr/local/grass5.0b/data
| Title: Transfer of fivenum elevation classification ( elev.f )
|
| Type of Map: raster                  Number of Categories: 4
| Rows: 280


| Columns: 380
| Total Cells: 106400
| Projection: UTM (zone 13)
| N: 4928000  S: 4914000  Res: 50
| E:  609000  W:  590000  Res: 50
|
| Data Source:
|
| Data Description:
| generated by rastput()
|
+

> table(elev.f)
elev.f
(1.07e+03,1.2e+03] (1.2e+03,1.32e+03] (1.32e+03,1.49e+03] (1.49e+03,1.84e+03]
             26517              26162               26354                  26

    > system("r.report elev.f units=c")

    r.stats: 100%

    +

    | RASTER MAP CATEGORY REPORT

    |LOCATION: spearfish Sat Jul 15 14:10:54 20

    |

    | north: 4928000 east: 609000

    |REGION south: 4914000 west: 590000

    | res: 50 res: 50

    |

    |MASK:none

    |

    |MAP: Tranfer of fivenum elevation classification (elev.f in rsb)

    |

    | Category Information | ce

    |#|description | cou

    |

    |1|(1.07e+03,1.2e+03] . . . . . . . . . . . . . . . . . . . . . . . . .| 265

    |2|(1.2e+03,1.32e+03] . . . . . . . . . . . . . . . . . . . . . . . . .| 261

    |3|(1.32e+03,1.49e+03]. . . . . . . . . . . . . . . . . . . . . . . . .| 263

    |4|(1.49e+03,1.84e+03]. . . . . . . . . . . . . . . . . . . . . . . . .| 263

    |*|no data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .| 10|

    |TOTAL |1064

    +


    Figure 6. GRASS display of R reclassed elevation.

One specific problem encountered in writing the interface is a result of mixing two differing system models: R requires that compiled code uses its error handling routines, returning control to the interactive command line on error, while not only the GRASS programs but also their underlying libraries quite often simply exit() on error. This of course also terminates R if the offending code is dynamically loaded, behaviour that is hazardous for R data held in objects in memory. Patches have been applied to key functions in the GRASS libgis.a library, and are both available from the CVS repository and will be in the released source code from 5.0 beta 8.

    3.3 Transfer via shared PostgreSQL database

The static memory model of R leads to problems with large datasets, which are especially common in the field of geospatial analysis. To overcome this problem, the "R PostgreSQL Package" (RPgSQL package) has been developed. It links R to the open source DBMS PostgreSQL. Users can read and write data frames to and from a PostgreSQL database. As the RPgSQL provides an interface to PostgreSQL, the user does not need to know SQL statements. As in other R extensions, a set of commands is available to exchange data frames and select data from the PostgreSQL database. The RPgSQL package also offers support for "proxy data frames". Such a proxy data frame is an R object that inherits from


the data.frame class, but does not contain any data. Instead of writing to R memory, the proxy data frame generates the appropriate SQL query and retrieves the resulting data from PostgreSQL. The RPgSQL system thus allows the user to process large datasets without loading them completely into the R memory space.

The RPgSQL is loaded after starting R:

> library(RPgSQL)

In our case we are starting R within GRASS, to access GRASS data directly. Before using the RPgSQL, an empty database needs to be created. This can be done externally:

$ createdb test

or within R:

> system("createdb geodata")

The user must have permission to create databases; this can be specified when creating a PostgreSQL user (createuser command).

Then the connection to the database "geodata" has to be established:

> db.connect("localhost", "5432", "geodata")

Within this database "geodata" a set of tables can be stored. The database may of course be initiated externally, by copying data into previously created tables using the standard database tools, for example psql. The key goal of this form of data transfer is to use the GRASS-PostgreSQL interface to establish the database prior to work in R, and to use the R-PostgreSQL interface to pass back results for further work in GRASS.

An issue of some weight, and as yet not fully resolved in the GRASS-PostgreSQL interface, is stewardship of the GIS metadata: GRASS assumes that it "knows" the metadata universe, but this is in fact not "attached" to data stored in the external database. It is probable that fuller use of this solution to data transfer will require the choice of one of two future routes: the registration of a GRASS metadata table in PostgreSQL (somewhat similar to the use of gmeta() above); or the implementation of a GRASS daemon serving metadata, and possibly data in general, to programs requesting it. Although these are serious matters, and indicate that this is far from resolved, attempts to interface through a shared external database do illuminate the problems involved in a very helpful way.

    > data("soildata")

    In case the user has the data frame available loaded into R, it can be moved further to PostgreSQL. It is

    recommended to use the "names" parameter to avoid name convention conflicts with PostgreSQL:

    > db.write.table("soildata", name="soildata", no.clobber=F)

    Now the memory of R can be freed of the "soildata" data frame as it was copied to PostgreSQL:

    > rm(soildata)

    To reconnect the data set to R, the proxy binding follows. Due to the proxy method the data interchange

    between PostgreSQL and R is established:> bind.db.proxy("soildata")

    > summary("soildata")

    Table name: soildata

    Database: geodata

    Host: localhost

    Dimensions: 10 (columns) 1000 (rows)

    The data access is different to the common data access as the proxy methods only delivers selected data to the

    R system. An example for selective data access is by addressing rows and columns:

    > selected.data


> selected.data
       area perimeter g2.i stonr horiz humus  pH
25  5232.31   322.597   26    20    Ah    20 5.4
26 75256.80  1986.950   27    10   Ap1    10 5.6
27  5276.81   285.431   28     6   aAp     6 4.8

or by addressing fields through column names:

> humus.data <- soildata$humus   # assignment reconstructed
> humus.data
 [1] 1.45 1.33 1.45 2.71 1.45 1.33 1.33 1.45 1.74 1.45 2.81 1.33 1.78 1.69 2.51
[16] 1.33 1.77 1.33 2.41 0.00 0.00 1.62 1.33 2.10 2.10 1.69 2.38 1.33 2.51 1.33
[31] 1.34 2.10 0.00 1.92 0.00 1.45 1.33 2.10 0.00 1.33 2.51 2.71 1.69 1.33 2.71
[46] 2.10 1.33 1.69 2.10 2.71 1.34 2.71 2.71 2.10 1.33 1.33 2.68 1.74 1.33 1.45
[61] 2.71 2.38 2.68 1.69 1.33 1.33 1.33 1.33 1.33 2.71 2.71 2.10 2.71 2.10 1.69
[76] 1.33 1.45 1.33

These selected data (or results from calculations) can be written as a new table to the PostgreSQL database, too:

> db.write.table(selected.data, name="selected_data", no.clobber=F)

To read it back, the command

> bind.db.proxy("selected_data")

is used.

Although users do not need to know the SQL query language, it is possible to send SQL statements to the PostgreSQL backend:

> db.execute("select * from humus", clear=F)
> db.read.column("humus", as.is=T)

Calculations are performed by addressing the rows to use in the calculation. To calculate the mean of humus contents from row 1 to 1000, enter:

> mean(na.omit(soildata$humus[1:1000]))
[1] 1.723718

Finally the data cleanup can be done. The command

> rm(soildata)

would remove the data frame from R memory, and

> db.rm("soildata", ask=F)

removes the table from the PostgreSQL database "geodata". Then

> db.disconnect()

disconnects the data link between R and the PostgreSQL backend.

The new R-PostgreSQL interface offers virtually unlimited data capacity to overcome the memory limitations within R. A basic problem of the current implementation is the large decrease in speed caused by the proxy method. The recently introduced ODBC driver has not so far been tested to compare it to the RPgSQL route. Perhaps future versions of the PostgreSQL driver will offer fast data exchange as it is required for GIS


Figure 7: Soil pH values within the Spearfish region.

For starting up the systems, R is called from within the GRASS software:

grass5.0beta
R --vsize=30M --nsize=500K

Then the GRASS/R interface is loaded:

> library(GRASS)
> G <- gmeta()                        # reconstructed: read the GRASS metadata
> soilsph <- rast.get(G, "soilsph")   # reconstructed: read the soil pH raster
> names(soilsph) <- "ph"              # reconstructed from the later use of soilsph$ph
> gc()

    free total (Mb)

    Ncells 367838 512000 9.8

    Vcells 3218943 3932160 30.0

Depending on the resolution previously set in GRASS, the demand on memory differs. The next step is to copy the map "soilsph" into the PostgreSQL database system. A new database needs to be created to store the new pH value table:


    > system("createdb soils_ph_db")

    > library(RPgSQL)

    > db.connect("localhost", "5432", "soils_ph_db")

    > db.write.table(soilsph$ph, name="soilsph", no.clobber=F)

It is recommended to use the "name" parameter in db.write.table to avoid name conflicts (as PostgreSQL rejects special characters within table names).

To simulate later access to the stored table, we disconnect from the SQL backend and reconnect again without loading the dataset. Using the gc() command we can watch the memory consumption of our dataset.

    > db.disconnect()

    > rm(soilsph)

    > gc()

    free total (Mb)

    Ncells 364378 512000 9.8

    Vcells 3884663 3932160 30.0

Most of the R memory is now free again. We connect back to PostgreSQL and now, instead of reading in the table, a proxy binding is set up as described above:

    > db.connect("localhost", "5432", "soils_ph_db")

    > bind.db.proxy("soilsph")

    > gc()

    free total (Mb)

    Ncells 364395 512000 9.8

    Vcells 3882569 3932160 30.0

There is nearly no change in the memory requirement. A small problem is the decrease in speed, as the requested data have to pass through the proxy system.

    > summary(soilsph)

    Table name: soilsph

    Database: soils_ph_db

    Host: localhost

    Dimensions: 1 (columns) 26600 (rows)

Selected data can be accessed by specifying the minimum and maximum row numbers:

> soilsph[495:500]
    I.data.
495       4
496       4
497       4
498       4
499       4
500       3

But usually we would want to analyse the full dataset (here equal to our soils map with distributed pH values). To find out the maximum number of values, alias raster cells, we can enter:

> maxrows <- nrow(soilsph)           # reconstructed; 26600 rows as reported above
> soils.ph.frame <- data.frame(east(G), north(G),
+   soilsph$ph[1:maxrows])           # frame construction reconstructed
> soilsph$ph[soilsph$ph == 0] <- NA  # reconstructed: zero cells represent NULL


> names(soils.ph.frame) <- c("x", "y", "z")   # names reconstructed from the summary below
> summary(soils.ph.frame)
       x                y                  z
 Min.   :590100   Min.   :4914000   Min.   :  1.000
 1st Qu.:594800   1st Qu.:4918000   1st Qu.:  3.000
 Median :599500   Median :4921000   Median :  4.000
 Mean   :599500   Mean   :4921000   Mean   :  3.379
 3rd Qu.:604300   3rd Qu.:4924000   3rd Qu.:  4.000
 Max.   :609000   Max.   :4928000   Max.   :  5.000
                                    NA's   :977.000

To calculate the cubic trend surface, we have to mask NULL values from the calculation. The function to fit a trend surface through the data points is provided by the "spatial" package:

> library(spatial)
> names(soils.ph.frame)
> ph.ctrend <- surf.ls(3, na.omit(soils.ph.frame))   # cubic trend surface; call reconstructed
> plot(G, trmat.G(ph.ctrend, G, east=east(G), north=north(G)),
+   reverse=reverse(G), col=terrain.colors(20))


Figure 8: Trend surface of pH values within the Spearfish region.

    4.2 Pontypool PCBs sites data

The interface package is furnished with examples using a data set used extensively in Bailey and Gatrell (1995, p. 149), with positions and values of standardised PCB contamination around a plant for incinerating chemical wastes. In the following, the sites data are in a GRASS database called "pontypool2", and the interface will be used to display and analyse the data.

Initially, the data are moved to R as a data frame; at present sites.get and sites.put use ASCII file transfer, and column names in the GRASS sites file (if available) are ignored. The names of the data frame are tidied up, and the contamination values are classified into five classes, one of which turns out to be empty. Using glyph and colour coding, a map of the values is drawn, showing three markedly higher values at or near the incinerator site.

    > library(GRASS)

    Running in /usr/local/grass5.0b/data/pontypool2/rsb

    > system("g.list sites")


site list files available in mapset rsb:
ex.pcbs.in

> G <- gmeta()                       # call reconstructed
> summary(G)
Data from GRASS 5.0 LOCATION pontypool2 with 88 columns and 108 rows; UTM, zone: 30
The west-east range is: 68000, 70200, and the south-north: 74500, 77200;
West-east cell sizes are 25 units, and south-north 25 units.

> pcbs <- sites.get(G, "ex.pcbs.in") # call reconstructed from the text above
> str(pcbs)
`data.frame': 70 obs. of 4 variables:
 $ east : num 68504 68353 68405 68546 68510 ...
 $ north: num 76800 76868 76709 76631 77091 ...
 $ var1 : num 1 2 3 4 5 6 7 8 9 10 ...
 $ var2 : num 9 6 11 25 11 9 3 7 6 5 ...
> colnames(pcbs) <- c("x", "y", "id", "z")   # names reconstructed from str() below
> str(pcbs)
`data.frame': 70 obs. of 4 variables:
 $ x : num 68504 68353 68405 68546 68510 ...
 $ y : num 76800 76868 76709 76631 77091 ...
 $ id: num 1 2 3 4 5 6 7 8 9 10 ...
 $ z : num 9 6 11 25 11 9 3 7 6 5 ...
> summary(pcbs$z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00    8.00   11.00  106.00   19.75 4620.00
> pcbs.o <- ordered(cut(pcbs$z, c(0, 25, 50, 500, 1500, 5000),
+   labels=c("insignificant", "low", "medium", "high", "crisis")))  # breaks assumed; original garbled
> table(pcbs.o)
pcbs.o
insignificant   low  medium  high  crisis
           53    14       1     0       2
> plot(G)
> rect(G$xlims[1], G$ylims[1], G$xlims[2], G$ylims[2], col="gray")
> Col <- rainbow(5)                  # palette assumed; assignment garbled in source
> points(pcbs$x, pcbs$y, pch = codes(pcbs.o), col=Col[codes(pcbs.o)])
> legend(c(67560, 68300), c(75050, 74500), pch = c(1:5),
+   legend = levels(pcbs.o), col=Col, bg="gray")
> title("PCBs contamination levels")


Before proceeding to interpolate contamination values across the study area, the distribution of the observed measurements is plotted against the normal distribution on quantile-quantile plots:

> oldpar <- par(mfrow=c(1,2))        # reconstructed
> qqnorm(pcbs$z, ylab="Sample quantiles of pcbs values")
> qqnorm(log(pcbs$z), ylab="Sample quantiles of log pcbs values")
> par(oldpar)


Finally, two plots are made of spline interpolation, the first from the akima library, using a wrapper function in the interface package employing bicubic splines and default parameters. The second uses the recently revised s.surf.rst GRASS program with specified parameters; in both cases interpolation is based on the logarithm of the observed values. In the first case, interpolations are only made within the convex hull of the observation points, while in the second case for the whole region.

> oldpar <- par(mfrow=c(1,2))                   # reconstructed
> ak2 <- interp(pcbs$x, pcbs$y, log(pcbs$z))    # akima call reconstructed; arguments assumed
> plot(G, ak2, reverse=reverse(G))
> points(pcbs$x, pcbs$y, pch=18)
> title("Bicubic spline interpolation")
> sites.put(G, lname="ex.pcbslog.in", east=pcbs$x, north=pcbs$y,
+   var=log(pcbs$z))
> system("g.list sites")
site list files available in mapset rsb:
ex.pcbs.in ex.pcbslog.in
> system(paste("s.surf.rst input=ex.pcbslog.in elev=ex.rst",


    + "tension=160 smooth=0.0 segmax=700"))

    > plot(G, rast.get(G, "ex.rst", F)$ex.rst, reverse=reverse(G))

    > points(pcbs$x, pcbs$y, pch=18)

    > title("Regularized Spline with Tension")

    > par(oldpar)

    4.3 Assist/Leicester

We will here start with the simple example of exercise 1 in the NW Leicestershire Grass Seeds data set (http://www.geog.le.ac.uk/assist/grass/data/leics/leics.tar.Z), for an area of 12x12 km at 50x50 m cell resolution. The two raster map layers involved are topo, elevation in metres interpolated from 20 m interval contours digitised from a 1:50,000 topographic map, and landcov, a land use classification into eight classes from Landsat TM imagery. Exercise 1 asks the user to find the average height of each of the land use classes, and the area occupied by each of the classes in hectares. In GRASS, this is done by creating a reclassed map layer with r.average, which assigns mean values to the category labels, and r.report to access the results.

    > system("r.average base=landcov cover=topo output=avheight")

    r.stats: 100%

    percent complete: 100%

    > system("r.report landcov units=c,h")

    r.stats: 100%

    +

    | RASTER MAP CATEGORY REPORT

    |LOCATION: leics Mon Jul 17 15:07:39 20

    |

    | north: 322000 east: 456000

    |REGION south: 310000 west: 444000

    | res: 50 res: 50

    |


elevation by land use, making box width proportional to the numbers of cells in each land use class (note that the first, empty, count is dropped by subsetting c2[2:9]).

> library(GRASS)
Running in /usr/local/grass5.0b/data/leics/rsb
> G <- gmeta()                                    # call reconstructed
> leics <- rast.get(G, rastlist=c("landcov", "topo"),
+   catlabels=c(TRUE, FALSE))                     # call reconstructed
> str(leics)
List of 2
 $ landcov.f: Factor w/ 9 levels "Background","In..",..: 6 6 6 6 6 6 6 6 6 6
 $ topo     : num [1:57600] 80 80 80 80 80 80 80 80 78 76 ...
> c1 <- tapply(leics$topo, leics$landcov.f, mean) # reconstructed: mean height by class
> c2 <- table(leics$landcov.f)                    # reconstructed: cell counts
> c3 <- cbind(c1, c2 * 0.25)                      # reconstructed: 50m x 50m cells are 0.25 ha
> colnames(c3) <- c("Average Height", "Total Area")
> c3[2:9,]
            Average Height Total Area
Industry          83.61101     417.75
Residential       76.30066    1658.00
Quarry           135.25405     138.75
Woodland         142.94454     437.25
Arable           104.02143    5728.25
Pasture          133.48335    4204.00
Scrub            148.24870    1639.50
Water             99.26912     176.50
> boxplot(leics$topo ~ leics$landcov.f, width=c2[2:9], col="gray")


    Figure 9. Boxplot of elevation in metres by land cover classes, NW Leicestershire; box widths proportional to

    share of land cover types in study region.

It is of course difficult to carry out effective data analysis, especially graphical data analysis, in a GIS like GRASS, but boxplots such as these are of great help in showing more of what the empirical distributions of cell values look like.

Over and above substantive analysis of map layers, we are often interested in checking the data themselves, particularly when they are the result of interpolation of some kind. Here we use a plot of the empirical cumulative distribution function to see whether the contours used for digitising the original data are visible as steps in the distribution; as Figure 10 shows, they are.

    > library(stepfun)

    > plot(ecdf(leics$topo), verticals=T, do.points=F)

    > for (i in seq(40,240,20)) lines(c(i,i),c(0,1),lty=2, col="sienna2")


Figure 10. Empirical cumulative distribution function of elevation values in metres, NW Leicestershire; also known as the hypsometric integral.

A further tool to apply in the search for the impact of interpolation artifacts is the density plot over elevation. A conclusion that can be drawn from Figure 11 is that only at bandwidths above 6 metres do the artifacts cease to be intrusive.

    > plot(density(leics$topo, bw=2), col="black")

    > lines(density(leics$topo, bw=4), col="green")

    > lines(density(leics$topo, bw=6), col="brown")

    > lines(density(leics$topo, bw=8), col="blue")

    > for (i in seq(40,240,20)) lines(c(i,i),c(0,0.015),lty=2, col="sienna2")

    > legend(locator(), legend=c("bw=2", "bw=4", "bw=6", "bw=8"),

    + col=c("black", "green", "brown", "blue"), lty=1)


    Figure 11. Density plots of elevation in metres from bandwidth 2 to 8 metres showing the effects of

    interpolating from digitised contours.

    4.4 Lisa MidWest socioeconomic data

While the interface mechanisms discussed above have addressed the needs of users with raster data, we turn here to the challenging topic of thematic mapping of attribute data associated with closed polygons. In the S language family, support is available outside R for a map() function, permitting polygons to be filled with colors chosen from a palette stored in a vector. This function uses binary vector data stored in an S-specific format in three files: a file of edge line segments with coordinates, typically geographical coordinates projected to the user-chosen specification at plot time; a polygon file listing the edges, with negative pointers for reversed edges; and a named labels file, joining the polygons to data vectors. In the S community, and more generally at Bell Labs, interest has been shown in geographical databases, which has resulted in the release of an S map.builder enabling users to construct their own databases from collections of line segments, defined as pairs of coordinates. The methods employed are described in Becker (1993) and Becker and Wilks (1995). Ray Brownrigg has ported some of the code to R, and maintains an FTP site with helpful map materials.

The data set used for the example given here was included in the materials of the ESDA with LISA conference held in Leicester in 1996. The data are for the five midwest states of Illinois, Indiana, Michigan, Ohio, and Wisconsin, and stem from the 1990 census. The boundary data stem from the same source, but the units used are not well documented; there are 437 counties included in the data set. For this presentation, the position data were transformed from an unknown projection back to geographical coordinates, and projected from these to UTM zone 16. R regression and prediction functions were used to carry out a second order polynomial transformation, as in GRASS i.rectify2 used for image data. Projection was carried out using the GMT mapproject program.


It did not prove difficult to import the reformatted midwest.arc file into GRASS with v.in.arc, but joining it to the midwest.pat data file was substantially more difficult. As an alternative, the midwest.arc file was converted using awk to a line segment input file for map.builder, a collection of simple awk and shell scripts and two small programs compiled from C: intersect and gdbuild. This, after the removal of duplicate line segments identified by intersect, permitted the production of a polylines file and a polygon file. These were converted in awk to a form permitting R lists to be built using split(). From there, getting R to do maps was uncomplicated, by reading in the files and splitting them. The split() function is an R workhorse, being central to the tapply() used above, and is an optimised part of R's internal compiled code. It splits its first argument by the second, returning a list with elements containing vectors with the component parts of the first argument. A short utility function, mkpoly(), was used to build a new list of closed polygons from the input data; a sketch is given below.
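The source of mkpoly() is not reproduced here; a rough sketch of what such a helper must do (our reconstruction, relying only on the file format described above) is:

mkpoly <- function(edges, sx, sy) {
    # assemble one closed polygon from a vector of signed edge numbers:
    # a negative number marks an edge traversed in reverse; sx and sy are
    # lists of the x and y coordinate vectors of the edge line segments
    x <- numeric(0)
    y <- numeric(0)
    for (e in edges) {
        xs <- sx[[abs(e)]]
        ys <- sy[[abs(e)]]
        if (e < 0) {
            xs <- rev(xs)    # flip the reversed edge
            ys <- rev(ys)
        }
        x <- c(x, xs)
        y <- c(y, ys)
    }
    list(x=x, y=y)           # a form that polygon() accepts
}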

> mwpl <- scan("mwpl.dat", list(id=0, x=0, y=0))    # session reconstructed; file
> mwpe <- scan("mwpe.dat", list(id=0, edge=0))      #   names and details assumed
> s.mwpe <- split(mwpe$edge, mwpe$id)
> sx.mwpl <- split(mwpl$x, mwpl$id)
> sy.mwpl <- split(mwpl$y, mwpl$id)
> polys <- vector("list", length(s.mwpe))
> for (i in 1:length(s.mwpe))
+   polys[[i]] <- mkpoly(s.mwpe[[i]], sx.mwpl, sy.mwpl)
> mwcpl <- scan("mwcpl.dat", list(id=0, x=0, y=0))  # polygon label points
> library(splancs)
> pnames <- read.table("pnames.dat")                # polygon-to-county names
> mwdata <- read.table("midwest.pat", row.names=1)  # county attribute table
> ophp <- ordered(cut(mwdata$php, quantile(mwdata$php), include.lowest=TRUE))
> id.ophp <- data.frame(id=pnames$id, ophp=ophp)
> pointers <- integer(nrow(id.ophp))
> for (i in 1:length(pointers))
+   pointers[i] <- which(pnames$id == id.ophp$id[i])
> gr <- gray(seq(0.9, 0.3, length=4))               # grey palette assumed
> library(MASS)
> eqscplot(c(100000, 1100000), c(100000, 1300000), type="n")
> for (i in 1:nrow(id.ophp))
+   polygon(polys[[pointers[i]]], col=gr[codes(id.ophp$ophp[i])])


    > legend(c(107000, 119300), c(476000, 95000),

    + legend=levels(id.ophp$ophp), fill=gr)

    Figure 12. Percentage of persons aged 25 years and over with professional or higher degrees, 1990, five

    MidWest states.

A further example is to use the pointer vector to paint polygons meeting chosen attribute conditions; this is analogous to the examples given in the GRASS-PostgreSQL documentation. The R which() function is not unlike the SQL SELECT command, but returns the row numbers of the objects found in the data table, rather than the requested tuples. To retrieve selected columns, their names or indices can be given in the columns field of the [ , ] access operator for data frames. As may have been noticed, there is also an [[ ]] access operator for lists, so that polys[[pointers[i]]] returns the pointers[i]'th list element of polys, which is itself a list of two numeric vectors of equal length, containing the x and y coordinates of the polygon. When we attempt to label the five counties with percentages of persons 25 years and over with professional and higher degrees over 14 percent, we have to go back through the chain of pointers to find the label point coordinates originally used to label the polygons at which to plot the text.
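The analogy can be stated in two lines (our illustration, using the mwdata columns introduced above):

# roughly: SELECT pop, php FROM mwdata WHERE php > 14;
rows <- which(mwdata$php > 14)    # row numbers of the matching observations
mwdata[rows, c("pop", "php")]     # then fetch the requested columns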


> where <- which(mwdata$php > 14)    # assignment reconstructed from str() below
> str(where)
 int [1:5] 10 39 155 181 275
> mwdata[where, c("pop", "pcd", "php")]
                    pop      pcd      php
CHAMPAIGN (IL)   173025 41.29581 17.75745
JACKSON (IL)      61067 36.64367 14.08989
MONROE (IN)      108978 37.74230 17.20123
TIPPECANOE (IN)  130598 36.24544 15.25713
WASHTENAW (MI)   282937 48.07851 20.79132
> pointers <- integer(nrow(mwdata))                  # join reconstructed as above
> for (i in 1:length(pointers))
+   pointers[i] <- which(pnames$id == id.ophp$id[i])
> eqscplot(c(100000, 1100000), c(100000, 1300000), type="n")
> for (i in 1:nrow(mwdata)) polygon(polys[[i]])
> for (i in which(mwdata$php > 4.447 & mwdata$php <= 10))
+   polygon(polys[[pointers[i]]], col="yellow")      # fill colours assumed
> for (i in which(mwdata$php > 10 & mwdata$php <= 14))
+   polygon(polys[[pointers[i]]], col="orange")      # fill colours assumed
> for (i in which(mwdata$php > 14))
+   polygon(polys[[pointers[i]]], col="red")
> texts <- rownames(mwdata)[where]                   # labels assumed
> text(mwcpl$x[pnames$id[pointers[where]]],
+   mwcpl$y[pnames$id[pointers[where]]], texts, pos=4, cex=0.8)


    Figure 13. Finding counties with high percentages of persons aged 25 years and over with professional or

    higher degrees, 1990, five MidWest states.

To conclude this subsection, the maps package county database is used to map the same variable; the database is plotted using unprojected geographical coordinates at present. The problem that arises is that the map() function fills the polygons with specified shades or colours in the order of the polygons in the chosen database, and joining the data table attribute rows to the polygon database is thus a point of weakness. Here we see that the county database returns a vector of 436 names of counties in the five states, while the ESDA-LISA MidWest data reports 437. After translating the county names into the same format, the %in% operator is used to locate the county name or names not in the county database. Fortunately, only one county is not matched, Menominee (WI), an Indian Reservation in the north east of the state. The next task is to use its label coordinates from the MidWest data set to find out in which county database polygon it has been included. While there must be a way of reformatting the polygon data extracted from map(), use was made here of awk externally, and then of split() and inout() as above. The county from the county database in which Menominee has been subsumed is its neighbour Shawano, with which it is aggregated on the relevant variables before mapping. In fact, both Menominee and Shawano were in the bottom decile before aggregation, with Menominee holding the lowest rank for percentage aged 25 and over holding higher or professional degrees, and Shawano ranked 33. The aggregated Shawano comes in at rank 20 of the 436 county


    database counties.

> library(maps)
> mw <- ...
> cnames <- ...
> str(cnames)
 chr [1:436] "illinois,adams" "illinois,alexander" "illinois,bond" ...
> mwnames <- ...
> str(mwnames)
 chr [1:437] "adams (il)" "alexander (il)" "bond (il)" "boone (il)" ...
> ntab <- ...
> names(ntab) <- ...
> ntab
 illinois   indiana  michigan      ohio wisconsin
   "(il)"    "(in)"    "(mi)"    "(oh)"    "(wi)"
> for (i in 1:length(mwnames)) {
+   x <- ...
+ }
> mw[405,]
                 id  area  pop     dens white black   ai as or       pw pb
MENOMINEE (WI) 3020 0.021 3890 185.2381   416     0 3469  0  5 10.69409  0
                    pai pas       por pop25      phs      pcd       php popp
MENOMINEE (WI) 89.17738   0 0.1285347  1922 62.69511 7.336108 0.5202914   38
                ppoppov    ppov   ppov17   ppov59   ppov60 met categ
MENOMINEE (WI) 98.20051 48.6911 64.30848 43.31246 18.21862   0   LHR
> mwpts <- ...
> mwpts[which(mwpts$id == 3020),]
     id      x       y        e       n
51 3020 13.983 10.2195 -88.6893 44.9859
> cpolys <- ...
> str(cpolys)
List of 3
 $ x    : num [1:7381] -91.5 -91.5 -91.5 -91.5 -91.4 ...
 $ y    : num [1:7381] 40.2 40.2 40.0 40.0 39.9 ...
 $ range: num [1:4] -92.9 -80.5 37.0 47.5
> summary(cpolys$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -92.92  -89.16  -87.15  -86.80  -84.47  -80.51  435.00
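Because this transcript is also incomplete, the matching logic can be restated as a self-contained sketch: normalise the two name vectors, detect the unmatched county with %in%, and test its label point against candidate polygons with inout() from the splancs package. All object names and coordinates below are toy values, not the MidWest data:

# Hedged sketch of the matching steps; names and coordinates are invented.
library(splancs)                       # provides inout()
cnames <- c("wisconsin,shawano", "wisconsin,oconto")
mwnames <- c("shawano (wi)", "menominee (wi)", "oconto (wi)")
ntab <- c("(il)" = "illinois", "(in)" = "indiana", "(mi)" = "michigan",
          "(oh)" = "ohio", "(wi)" = "wisconsin")
state <- substring(mwnames, nchar(mwnames) - 3)
county <- substring(mwnames, 1, nchar(mwnames) - 5)
mwnames2 <- paste(ntab[state], county, sep = ",")
mwnames2[!(mwnames2 %in% cnames)]      # "wisconsin,menominee"
# locate the county database polygon containing its label point
polys <- list("wisconsin,shawano" = cbind(c(0, 2, 2, 0), c(0, 0, 2, 2)),
              "wisconsin,oconto" = cbind(c(2, 4, 4, 2), c(0, 0, 2, 2)))
labpt <- cbind(1, 1)                   # label coordinates of the lost county
for (nm in names(polys))
    if (inout(labpt, polys[[nm]])) print(nm)   # "wisconsin,shawano"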


Figure 14. Percentage of persons aged 25 years and over with professional or higher degrees, 1990, five MidWest states, mapped using the R maps package.

This worked example points up the difficulties caused by absent metadata needed for integrating data tables with mapped representations, and suggests that further work on the maps package is needed to ensure that users actually map what they assume they are mapping. It is very possible that the RPgSQL link could be used to monitor the joining of map polygon databases to attribute database tables.
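As a sketch of what such monitoring might look like, the two name vectors could be pushed into PostgreSQL and an outer join left to report the mismatches. The RPgSQL function names used here follow that package's documentation, and the database, table and column names are our own invention:

# Hypothetical join audit via RPgSQL; assumes a running PostgreSQL server
# and a database "geo", neither of which is part of the example above.
library(RPgSQL)
db.connect(dbname = "geo")
db.write.table(data.frame(name = cnames), name = "polynames")
db.write.table(data.frame(name = mwnames2), name = "attrnames")
db.execute(paste("CREATE TABLE unmatched AS",
                 "SELECT a.name FROM attrnames a",
                 "LEFT JOIN polynames p ON a.name = p.name",
                 "WHERE p.name IS NULL"))
db.read.table("unmatched")             # attribute rows lacking a polygon
db.disconnect()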

    5 Conclusions and further developments

GRASS, R, PostgreSQL, XGobi and other programs used here are all open source, and for this reason are inherently transparent. Within the open source community generally, it has become a matter of common benefit to adopt openness, especially with respect to problems, because this increases the chances of resolving them. In this spirit, we would like to point more to issues needing resolution than to the self-evident benefits of using both the programs discussed above and the proposed interface mechanisms.

While progress is being made in GRASS on associating site, line and area objects with attribute values, not least in the context of the GRASS-PostgreSQL link, it is still far from easy to join geographical objects to tables of attributes using object id numbers. Migration to the new sites format in GRASS is not fully complete, and even within this format, it is not always clear how the attribute table should be used. Similar reservations hold for vector map layers.

In the past, attempts have been made to use PostgreSQL's geometric data types, which support the same kind of object model as the OGC Simple Feature model used in GML. They are not in themselves built topologies, but to move from cleaned elements to topology is feasible. GRASS has interactive tools for accomplishing this, but on many occasions GRASS users import vector data from other sources, support for which is increasing, now also with support for storing attribute tables in PostgreSQL. In the example above, use was made of tools from the S map.builder collection to prepare cleaned and built topologies for the MidWest data set. Here again, it was necessary to combine automation of topology construction with object labelling to establish a correct mapping from clean polygons to attribute data, as is also done in GRASS v.in.arc and other import programs.
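As a sketch under the same assumptions as above (RPgSQL call names, an available database "geo"), storing a cleaned boundary in PostgreSQL's built-in polygon type and testing point containment with its geometric operators might look like this; the table, id and coordinates are invented:

# Hypothetical sketch: the built-in "polygon" type stores simple features
# without topology; shared boundaries are simply duplicated between rows.
library(RPgSQL)
db.connect(dbname = "geo")
db.execute("CREATE TABLE county_geom (id integer, boundary polygon)")
db.execute(paste("INSERT INTO county_geom VALUES",
                 "(3020, '((0,0),(2,0),(2,2),(0,2))')"))
# containment test with a geometric operator (polygon contains point)
db.execute("SELECT id FROM county_geom WHERE boundary @> point '(1,1)'")
db.disconnect()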

These concerns are certainly among those that increase the relevance of work in the Omega project on interfacing database, analysis and presentation systems, including three-tier interfaces using CORBA/IDL. They also indicate a possible need for rethinking the GRASS database model, so that more components can be entrusted to an abstract access layer rather than to hard-wired directory trees (work that is already being done for attribute data).

On the data analysis side, most of the key questions have been taken up and are being investigated in the Omega project, especially those of importance for raster GIS interfaces, like very large data sets. R's memory model is most effective for realistically sized statistical problems, but is restricting for applications like remote sensing or medical imaging, where special solutions are needed (particularly to avoid using data frames). This issue of scalability deserves urgent attention, as do ways of parallelising R tasks for distribution to multiple compute servers; combining such parallelisation with the R-PostgreSQL link to feed compute nodes with subsets of proxied data tables is not far off.

Finally, we would like to encourage those who feel a need or interest in open source development to contribute: it is rewarding, and although combining projects with closed deliverables with apparent altruism may seem difficult, experience of the open source movement indicates that, on balance, projects advance better when difficulties are shared with others, who may have met similar difficulties before.


    References

Abelson, H., Sussman, G. and Sussman, J. 1996. Structure and interpretation of computer programs. (Cambridge MA: MIT Press).

Bailey, T. and Gatrell, A. 1995. Interactive spatial data analysis. (Harlow: Longman).

Becker, R. A. 1993. Maps in S. Bell Labs, Lucent Technologies, Murray Hill, NJ.

Becker, R. A. and Wilks, A. R. 1995. Constructing a geographical database. Bell Labs, Lucent Technologies, Murray Hill, NJ.

Bivand, R. S. 1999. Integrating GRASS 5.0 and R: GIS and modern statistics for data analysis. In: Proceedings 7th Scandinavian Research Conference on Geographical Information Science, Aalborg, Denmark, pp. 111-127.

Bivand, R. S. 2000. Using the R statistical data analysis language on GRASS 5.0 GIS database files. Computers & Geosciences, 26, pp. 1043-1052.

Bivand, R. S. and Gebhardt, A. (forthcoming). Implementing functions for spatial statistical analysis using the R language. Journal of Geographical Systems.

Chambers, J. M. 1998. Programming with data. (New York: Springer).

Ihaka, R. 1998. R: past and future history. Unpublished paper.

Ihaka, R. and Gentleman, R. 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5, 299-314.

Jones, D. A. 2000. A common interface to relational databases from R and S: a proposal. Bell Labs, Lucent Technologies, Murray Hill, NJ (Omega project).

Neteler, M. 2000. Advances in open-source GIS. Proceedings 1st National Geoinformatics Conference of Thailand, Bangkok.

Venables, W. N. and Ripley, B. D. 1999. Modern applied statistics with S-PLUS. (New York: Springer); book resources: http://www.stats.ox.ac.uk/pub/MASS3.

    Appendices

    Interface package help pages

    GRASS package for R

    R code for making and labelling polygons
