

Statistics and Computing (1995) 5, 59-72

Query languages for statistical databases

ABDULLAH UZ TANSEL

Baruch College, City University of New York, New York, NY 10010, USA

Received 1992

Statistical database management systems keep raw, elementary and/or aggregated data and include query languages with facilities to calculate various statistics from this data. In this article we examine statistical database query languages with respect to the criteria identified and taxonomy developed in Ozsoyoglu and Ozsoyoglu (1985b). The criteria include statistical metadata and objects, aggregation features and interface to statistical packages. The taxonomy of statistical database query languages classifies them with respect to the data model used, the type of user interface and method of implementation. Temporal databases are rich sources of data for statistical analysis. Aggregation features of temporal query languages, as well as the issues in calculating aggregates from temporal data, are also examined.

Keywords: Aggregation, statistical query languages, statistical databases, summary tables, temporal databases

1. Introduction

Database management systems (DBMS) maintain and manage the data about an organization and its operations. Traditionally, databases have been developed for commercial business data processing to allow easy and fast access to data, and to improve productivity of application development. Such databases can be labelled corporate database management systems (CDBMS). These databases provide vital information for the operation and management of the organizations they serve. They support day-to-day operation of the enterprise (such as transaction processing), as well as the functions of middle and top management (auditing, planning, staffing, marketing, etc.). These functions require extensive use of reports that summarize and/or classify data extracted from the database as well as presenting results of application of various mathematical and statistical techniques, such as calculation of averages, sums, indexes or trends. However, CDBMS generally provide limited support in this regard, perhaps providing no more than sums, averages, maxima and minima. Moreover, CDBMS are not suitable for the management of demographic, census, social and economic data. These applications require extensive use of statistical analysis techniques that range from calculating simple summary statistics to complex statistical techniques such as factor analysis, discriminant analysis and so on. They also require special conceptual and internal modelling constructs which are not available in CDBMS. Furthermore, data aggregation features of CDBMS are add-on, ad hoc and usually inefficient.

0960-3174 © 1995 Chapman & Hall

Databases that provide statistical analysis capabilities and/or maintain data about large populations are called statistical databases (SDB). A statistical database management system (SDBMS) models data in a way suitable for the SDB user's needs and allows application of statistical analysis techniques as its user interface. Thus, an SDBMS is expected to have powerful, easy-to-use, and efficient data aggregation features. For more advanced statistical data analysis requirements, the SDBMS provides interfaces to statistical analysis procedures, which may be transparent to users or produce explicit output data in a format to be fed into statistical packages.

Statistical software packages have been available for a long time. They have been widely and extensively used by economists and researchers in social sciences. Examples of such packages are SPSS, P-STAT, BMD and SAS. However, the data management capabilities of these packages are limited and most user requirements are met by file management systems and customized application programs. To accommodate these needs, new features have been added, at an increasing pace, to the statistical packages: for example, B+ tree file organization in P-STAT (Buhler 1981), new data manipulation commands of SPSS-X (Fry 1981), and an SQL interface to SAS (SAS 1982). However, there are major differences between statistical packages and statistical databases. Statistical databases provide conceptual modelling of statistical objects, easy-to-use query languages, and rich internal modelling constructs which are vital in meeting SDB user needs. Although the statistical packages provide powerful and advanced statistical analysis procedures, their data management facilities are far more restrictive than those of SDBMS. This is the major difference between statistical packages and SDBMS and also partly explains the rationale for the development of SDBMS. In this article, we survey existing and proposed statistical database query languages. We follow the approach of Ozsoyoglu and Ozsoyoglu (1985b), which is a comprehensive study of such languages up to that date. We include new languages that have appeared since the publication of that study. We also examine the temporal dimension of statistical data as well as the aggregation features of temporal query languages.

The article has eight sections. Section 2 identifies the criteria to be used in evaluating and classifying SDB query languages. Section 3 gives a taxonomy of SDB query languages. Sections 4 and 5 survey the existing and proposed SDBMS with respect to the taxonomy developed in Section 3. Summary tables are commonly and widely used statistical objects. Section 6 is therefore devoted to examining summary table generation features of SDB query languages. Section 7 examines temporal query languages and their facilities to calculate aggregates from temporal data, and Section 8 is the conclusion.

2. Evaluation criteria for the SDB query languages

In an SDB environment, operational use of data dictates the type of data modelling and manipulation capabilities of SDBMS. For instance, evaluation of census data requires summarization, classification and aggregation of raw data whereas in exploratory data analysis the users deal with representative, interpreted or 'cleaned' subsets of data. Also, to meet SDB user requirements, comprehensive conceptual modelling capabilities for various statistical objects need to be incorporated into the data model. These include summary tables, matrices and scatter diagrams (Denning et al. 1983; Maier and Cirilli 1983; Ozsoyoglu and Ozsoyoglu 1983b, 1984b; Shoshani 1982; Su et al. 1983). Following the approach of Ozsoyoglu and Ozsoyoglu (1985b) we will use data and metadata definition, data manipulation, interface to statistical packages and the expressive power of the languages as the major criteria in evaluating SDB query languages. For the metadata definition we will consider: (1) the objects definable in the data model and processable in the language; (2) data description such as units of measurement, missing values, and data quality information; (3) footnotes; (4) keywords; (5) textual descriptions; (6) temporal data and time dimension; and (7) editing specifications and data structuring capabilities.

Among SDB objects, summary tables (that is, tabular representations of aggregated data) have an important place. They are commonly and frequently used in SDB applications. All statistical packages provide some summary table output formatting facilities, albeit limited and mostly at an elementary level. There are many SDBMS that include summary tables as a modelling object and provide powerful and easy-to-use summary table processing languages. STBE is a good example of such a language (Ozsoyoglu and Ozsoyoglu 1984a). Considering the objects manipulated by a language, its expressive power determines the possible objects derivable by that language. We will evaluate data manipulation capabilities of SDB languages with respect to: (1) aggregation capabilities; (2) subsetting and sampling; (3) metadata manipulation; and (4) handling the time dimension explicitly. Whenever possible we will comment on the expressive power of SDB languages as well as ease of use, syntax and functionality of these languages.

Data aggregation functions, that is, the calculation of simple summary statistics such as average, minimum, maximum, count and sum, are very common operations. In one form or another, query languages of CDBMS include these functions. However, they are added to the query languages in an add-on and ad hoc manner and their calculation is usually inefficient. Since aggregation operations are extremely frequent in statistical database applications, query languages of most SDBMS provide powerful and user-friendly aggregation facilities, e.g. STRAND (Johnson 1981), GENISYS (Maness and Dintelman 1981b), SSDB (Ozsoyoglu and Ozsoyoglu 1984b).

Statistical packages include advanced statistical analysis procedures. Execution of these procedures usually requires that a long list of parameters be initialized by a set of syntactically complex commands. However, modern statistical languages (such as St, SC, and SPIDA) are easier to use. There are three forms of interface between an SDB query language and a statistical analysis procedure (package).

Alternative I. The SDBMS includes a library of statistical analysis procedures. Its query language provides syntactically simple and easy-to-use commands to invoke the statistical analysis procedures. The user specifies the commands only; the execution of the statistical analysis procedures is arranged by the SDBMS and it is transparent to the user. This approach is limited and inflexible since it interfaces only to the statistical library of the system, not to the other statistical packages which may contain a rich library of such procedures.

Alternative II. The user specifies the execution of a statistical procedure from a statistical package in his or her query. The SDBMS prepares the input and the necessary commands for the package. Then, the user initiates the execution of the statistical analysis procedure as a separate operation. This alternative provides a rich set of possibilities. However, integration of various statistical packages into an SDBMS requires heavy overhead.

Alternative III. The SDBMS prepares a flat table of data which is later given to the statistical package as the input. The user prepares the set of commands for the statistical analysis procedure and invokes its execution.

3. A taxonomy of SDBMS

A taxonomy of statistical database management systems has been developed in Ozsoyoglu and Ozsoyoglu (1985b) and later modified in Tansel (1991). In the remainder of the article we will use this taxonomy in examining the statistical query languages.

3.1. SDBMS built on top of CDBMS

The majority of the systems in this category are relational systems. They include HSDB (Ikeda and Kobayashi 1981) on top of Model 204 (Computer Corporation of America 1979), Ghosh's extension to SQL (Ghosh 1984a), System K on SQL/DS (Maier and Cirilli 1983), STRAND (Johnson 1981) on top of INGRES (Stonebraker et al. 1976), and GRAFSTAT (Stein 1986) on top of DB2 (SQL/DS 1981).

Another approach is to use a generalized interface system that links together available CDBMS, statistical packages and graphics software using a high-level interface language. Examples are PASTE (Weiss and Weeks 1983), SUBYL (Heiler and Bergman 1983), GPI (Hollabaugh and Reinwald 1981), and PEPIN-SICLA (Boufares et al. 1985).

3.2. Separately developed SDBMS

Systems in this category generally use relations as data modelling tools and an algebra or calculus based language. We further group them into six categories with respect to their data model and query languages.

Relational data model (Codd 1972) and relational query languages. These systems have been developed within the formalism of the relational data model. They provide new internal (file) organization techniques and conceptual modelling tools suitable for SDBMS as well as well-defined aggregation operations in their query languages. Examples are RAPID (Turner et al. 1979) and CAS SDB (Kohji and Sato 1983), which use relational algebra; ABE (Klug 1981), which uses relational calculus; SIR/SQL (Anderson et al. 1983), GENISYS (Maness and Dintelman 1985a), and CANTOR (Karasolo and Sevenson 1986), which use SQL; and JANUS (Klensin 1983) and the algebra of Fortunato et al. (1986), which use tables and relational algebra-like operators. QUIS is another such system that handles incomplete information (Chan and Michalewicz 1986). The July system uses a universal relation interface (D'attri and Ricci 1988). The summary data model generates summary tables from the existing ones (Chen et al. 1988).

Network and hierarchical data model. Examples are SIR/DBMS (Anderson et al. 1983), TPL and TPLDCS (Weiss et al. 1981) and BROWSE (Hendrix et al. 1978).

Formal extensions of relational model. Examples are SSDL (Brown et al. 1985), Klug's work (Klug 1981, 1982) and extensions of Ozsoyoglu and Ozsoyoglu (1983a) and Ozsoyoglu et al. (1987).

SDBMS with graphical user interfaces. These systems provide graphical, two-dimensional and diagrammatic query languages in contrast to the line-oriented query languages in which query statements are coded. Examples are SUBJECT (Chan and Shoshani 1980), GUIDE (Wong and Kuo 1982), ABE (Klug 1981), STBE (Ozsoyoglu and Ozsoyoglu 1984a), SEEDs on-line code book (Merrill et al. 1983), ALDS data editor (Thomas and Hall 1983), GRASP (Catarci and Santucci 1980) and GRASS (Rafanelli and Ricci 1990).

Natural language based user interface. An example is LIDS 86 (Sato 1988).

Query languages that calculate aggregates from temporal data. Examples are TQUEL (Snodgrass 1987), HQUEL (Tansel 1990), TBE (Tansel et al. 1989), Tansel's extension to relational algebra (Tansel 1986), the temporal data model of Shoshani and Segev (Shoshani and Kawagoe 1987; Segev and Shoshani 1987), and the query language of TEER (Elmasri and Wuu 1990).

4. Query languages of SDBMS built on top of CDBMS

STRAND (Johnson 1981) is a derivative of CABLE (Shoshani 1979) and is built on top of the INGRES (Stonebraker et al. 1976) relational database system. STRAND does not have data definition language features. STRAND queries are translated into QUEL, the query language of INGRES. The STRAND query language allows easy query formulation since it is based on the E/R model (Chen 1976). Aggregate operations (called summarization) can easily be applied on n entity sets when they are connected by links. Such queries are called chain queries and it is sufficient to specify the beginning and ending entity sets of the chain. Then, the system performs n-way join operations and calculates the aggregates automatically. However, the existence of several paths between two entity sets causes ambiguity in query processing. The STRAND query language includes the projection and restriction operations of relational algebra but not the set theoretic operations. There are no time and metadata handling capabilities in STRAND. Alternative III can be used as the interface to statistical packages.
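The ambiguity of chain queries can be illustrated with a small Python sketch. It is not STRAND's algorithm: the entity sets, the links and the function names are hypothetical, and the sketch only assumes that a chain is a simple path over undirected links between entity sets. When more than one path joins the beginning and ending entity sets, the join chain cannot be chosen automatically.

```python
def chain_paths(links, start, end):
    """Return every simple path from start to end over undirected links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    paths = []

    def extend(path):
        node = path[-1]
        if node == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in path:  # keep the path simple (no repeated entity set)
                extend(path + [nxt])

    extend([start])
    return paths


def resolve_chain(links, start, end):
    """Return the unique join chain, or raise if none or several exist."""
    paths = chain_paths(links, start, end)
    if len(paths) != 1:
        raise ValueError(f"{len(paths)} chains between {start} and {end}")
    return paths[0]


# Hypothetical entity sets and links, for illustration only.
LINKS = [
    ("SUPPLIER", "ORDER"),
    ("ORDER", "PART"),
    ("SUPPLIER", "CONTRACT"),
    ("CONTRACT", "PART"),
]
```

With only the first two links, `resolve_chain(LINKS[:2], "SUPPLIER", "PART")` yields a single chain through ORDER; with all four links there are two candidate chains and the call raises an error, which mirrors the ambiguity described above.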

HSDB (Ikeda and Kobayashi 1981) is an SDBMS implemented on top of the relational system Model 204 (Computer Corporation of America 1979). It has extensive metadata definition facilities to specify type of data (discrete or continuous), missing values, the unit, precision and theoretical distribution of data and summary statistics about the columns in relations. HSDB also keeps metadata information about derived data, like date of creation, the creator and the formula used in derivation. HSDB maintains summary tables and provides a limited set of operations on them. This feature of HSDB is further elaborated on in Section 6 below. Alternative I is the interface to statistical packages since the query language of HSDB can execute statistical analysis procedures on relations and summary tables.

Ghosh (1984a) proposes an extension to the relational model and the query language SQL. The main object of this extension is a statistical relational table (SRT), which is a relation whose attributes are the category attributes and the summary attribute of the summary table. System K (Maier and Cirilli 1983) is an object-oriented knowledge base management system which is built on top of the SQL/DS system. It has major metadata management capabilities and limited statistical functions. GRAFSTAT (Stein 1986) is an application software system containing a rich and integrated set of tools for scientists and engineers. It can be used to explore data interactively, to create graphics and to analyse data using functions of applied statistics. For the database functions, GRAFSTAT provides an interface to DB2 or SQL/DS via the SQL language.

BROWSE (Harlee 1986) is an on-line search facility developed as an enhancement to the LABSTAT database system used in the Bureau of Labor Statistics. The BROWSE facility was developed as a set of command procedures (in the WYLBUR text editing system of IBM) that use specially designed files of metadata. It includes commands that allow the user to search and identify the time series according to various criteria. The search criteria are given as keywords. For instance, searching the price index data by the keywords 'gasoline' and 'New York' will list all the time series available on gasoline price indexes in the New York region.

SUBYL (Heiler and Bergman 1983), PASTE (Weiss and Weeks 1983), GPI (Hollabaugh and Reinwald 1981), and PEPIN-SICLA (Boufares et al. 1985) are examples of systems that use a CDBMS, statistical package, and graphics software available off-the-shelf to create a system for managing the statistical data. These three-component software packages are usually combined by a common interface which includes capabilities for data transformations from one component to another as well as commands to access the facilities of each individual software component. SUBYL (Heiler and Bergman 1983) is a system that manages time series. It uses the relational CDBMS Model 204. GPI (Hollabaugh and Reinwald 1981) has 'customizer' software for tailoring a statistical package to access a specified data/file structure that is described in a dictionary. In PASTE (Weiss and Weeks 1983), the user writes a query by using the commands of the CDBMS (RAPID) and the statistical package (SPSS). PEPIN-SICLA (Boufares et al. 1985) is a data analysis system that combines two existing software packages, PEPIN, a relational database management system (Halanbondrainy 1983), and SICLA, a statistical data analysis package (Joiner et al. 1986). A common interface links these two systems. At this interface the user specifies relational algebra operations to extract data from the relational database. Then, the resulting data is fed to SICLA for data analysis. Data analysis routines range from elementary statistics to more complex procedures like factor analysis and cluster analysis. The user works with menus. The system includes metadata about the variables and their modalities, i.e. their identifiers, long names, statistical types and members of modalities.

5. Query languages of separately developed SDBMS

5.1. SDB query languages based on relational models

RAPID (Hammond 1981; Turner et al. 1979) is a relational system developed by Statistics Canada and widely used by statistical agencies in several countries. Each RAPID relation is a self-describing transposed file which allows efficient processing of statistical queries. It contains data as well as metadata such as attribute names, data types, size, domain, last update, status, etc. Additional metadata, such as entities and items of entities, are maintained in the RAPID dictionary. An entity may be a relation, a codeset for data compression, a value set or a comment. Items describe information about entities. The query language of RAPID is based on relational algebra. RAPID provides an interface to SAS and SPSS, and to the table producing system, TPL.

GENISYS (Dintelman and Maness 1982; Maness and Dintelman 1981a, b) is another SDBMS that is based on the relational model. As the basic modelling construct it uses a tabular view of entities which is like a relation. Each table (relation) is stored as a file. GENISYS relations, however, are not in first normal form since they allow repeating fields or a range of values in columns. Metadata is stored in a dictionary which can be browsed by the user. GENISYS uses data encoding for data compression. Null values are used to represent undefined values. The database administrator defines links between entities which are later used for join operations. The query language of GENISYS is GQL, which is similar to SQL. The user specifies aggregation operations on a population of values by grouping them with respect to the ranges defined in the category attributes (classification criteria). Then aggregates are calculated for each group. Note that this grouping, in contrast to the partitioning concept of SQL, is similar to the aggregation-by-template operation of SSDB (Ozsoyoglu and Ozsoyoglu 1984b). As an example, consider the relation scheme PERSON (ID, SEX, BIRTH-YEAR, BIRTH-COUNTY, DEATH-YEAR) and the query to compute average age at death by sex, by birth year and by birth county. In GQL, this query is expressed as:

SELECT AVERAGE (DEATH-YEAR - BIRTH-YEAR)
BY SEX
BY BIRTH-YEAR (1900...1954, 5)
BY BIRTH-COUNTY ('UTAH', 'SALT LAKE', *).

Records in the PERSON relation are grouped according to sex (male, female), BIRTH-YEAR in ranges of size 5 from 1900 to 1954 (i.e. (1900, 1901, 1902, 1903, 1904), ...) and BIRTH-COUNTY for UTAH, SALT LAKE, and for all counties (i.e. *). For each group average age is calculated.
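The grouping semantics of this query can be sketched in Python. This is an illustration only, not the GENISYS implementation: the sample records, the function names and the reading of `*` as an extra group that collects every county are assumptions made for the example. Note that, under this reading, a record can contribute to more than one group, which is one way in which such grouping differs from a disjoint partition.

```python
from collections import defaultdict

# Hypothetical PERSON tuples: (ID, SEX, BIRTH_YEAR, BIRTH_COUNTY, DEATH_YEAR)
PERSON = [
    (1, "M", 1900, "UTAH", 1970),
    (2, "M", 1903, "UTAH", 1963),
    (3, "F", 1906, "SALT LAKE", 1986),
    (4, "F", 1901, "DAVIS", 1981),
]


def year_range(year, low=1900, high=1954, width=5):
    """Return the (start, end) range containing year, or None if outside."""
    if not low <= year <= high:
        return None
    start = low + ((year - low) // width) * width
    return (start, min(start + width - 1, high))


def average_age_at_death(persons, counties=("UTAH", "SALT LAKE")):
    """Average DEATH_YEAR - BIRTH_YEAR per (sex, birth-year range, county)."""
    sums = defaultdict(lambda: [0, 0])  # group key -> [total age, count]
    for _id, sex, birth, county, death in persons:
        rng = year_range(birth)
        if rng is None:
            continue
        # Assumed semantics: a record belongs to its own county group
        # (if that county is listed) and always to the '*' group.
        county_groups = [c for c in counties if c == county] + ["*"]
        for c in county_groups:
            acc = sums[(sex, rng, c)]
            acc[0] += death - birth
            acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


result = average_age_at_death(PERSON)
```

Here the two male records born in 1900 and 1903 fall in the same (1900, 1904) range for UTAH, while the record born in DAVIS contributes only to the `*` group of its sex and range.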

ABE (Klug 1981) is a screen oriented language similar to QBE (Zloof 1987). In ABE, queries are formulated as hierarchically arranged subqueries (windows). A subquery invokes another subquery by providing parameters and applies an aggregate operation on the result produced by the invoked subquery. This is an elegant way to implement the partitioning of tuples in a relation. Unlike SQL, this method retains the empty partitions. As far as aggregation queries are concerned, ABE is more user-friendly, as well as more powerful, than both SQL and QBE. Moreover, some nested aggregation queries that are expressible in ABE cannot be expressed in SQL. Additionally, quantifiers, like all, some, and only are implemented by set comparison operators in ABE. ABE is based on a relational calculus with aggregate functions and can express conjunctive relational queries. However, it is not relationally complete since it does not include the set union operation.
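The point about empty partitions can be made concrete with a short Python sketch. It does not reproduce ABE syntax; the relations and function names are hypothetical. The first function groups the tuples that exist, so a department without employees never appears. The second invokes a parameterised subquery once per department and aggregates whatever it returns, so an empty result still produces a count of zero.

```python
# Hypothetical relations, for illustration only.
DEPT = ["sales", "research", "audit"]  # outer relation
EMP = [("ann", "sales"), ("bob", "sales"), ("eve", "research")]


def count_by_partitioning(emp):
    """Group-by style: partitions exist only for values present in emp."""
    counts = {}
    for _name, dept in emp:
        counts[dept] = counts.get(dept, 0) + 1
    return counts


def employees_of(dept, emp):
    """Parameterised subquery: the employee names of one department."""
    return [name for name, d in emp if d == dept]


def count_by_subquery(depts, emp):
    """Invoke the subquery once per department and aggregate its result."""
    return {d: len(employees_of(d, emp)) for d in depts}
```

In this sketch `count_by_partitioning(EMP)` has no entry for the audit department, whereas `count_by_subquery(DEPT, EMP)` reports it with a count of zero.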

The system CANTOR (Karasolo and Sevenson 1983) has been designed for the analysis of large sets of data. The data model of CANTOR is an object-based data model and includes elementary objects (e.g. integers or text), tuples and sets of objects. A relation is the special case of a set object and it is the set of uniform tuples. The query language of CANTOR is SAL, which is a non-recursive algebraic language. SAL query expressions include arithmetic operators (e.g. +), comparison operators (e.g. >), arithmetic functions (e.g. SQR) and logical operators (e.g. and). Set comparison and set containment are also included. Other operations of SAL are set union, projection, selection and Cartesian product, which are defined similarly to the relational algebra operations. In addition to aggregation operations like COMPUTE, SUM and PRODUCT, SAL also includes a grouping operator that partitions the tuples of a relation and applies an aggregate operation on each partition. The design considerations for CANTOR storage structures, query processing and its optimization have been reported in Karasolo and Sevenson (1986).

JANUS is an SDBMS used within a large-scale data analysis and modelling system called CONSYSTENT (Klensin 1983). JANUS uses relations with set-valued attributes. Links (similar to GENISYS) are defined to model relationships among the relations. JANUS has an algebraic query language that includes operations similar to set theoretic operations and the outer join (Rosenthal and Reiner 1984) operation of the relational algebra. For statistical analysis JANUS provides an interface to the CONSYSTENT system by using Alternative II or III.

The SIR/DBMS (Anderson et al. 1983) provides a relational conceptualization of data on which hierarchical and network views can be superimposed. The query language of SIR/DBMS is SIR/SQL+, which operates on relations only. The examples provided in Anderson et al. (1983) demonstrate that SIR/SQL+ has the same expressive power as SQL. SIR/DBMS also has facilities for: (1) naming, labelling and documenting the data in the database; (2) data quality control checks; (3) I/O security control; (4) a set of simple statistical procedures that include frequency distributions, histograms, descriptive statistics, scatter diagrams, line printer plots and simple linear regression; (5) summary table production; and (6) an interface to the statistical packages BMDP, SAS, and SPSS via Alternative III.

CAS SDB (Kohji and Sato 1983) is another experimental system. It has extensive metadata management facilities and an interface to the statistical package SAS. The query language of CAS SDB is a subset of the relational algebra. Recently, SQL has been added to SAS (SAS User Manual 1992). Tabular statistical data is visualized as a relation. The cases in statistical data correspond to tuples of a relation and the variables correspond to attributes. Thus, SAS supports a full version of SQL.

QUIS (Chan and Michalewicz 1986) is a statistical query language which handles incomplete information. It is based on a relational data model. QUIS is screen oriented and tabular in nature. For the query specification, the screen is divided into five areas: output table area, schema table area, logical expression area, function definition area and message area. The output of a query is defined as a relation skeleton in the output table area. A relation skeleton consists of a header row and columns. Attribute names are specified in the columns. The schema display area lists the relation schemes in the database for user convenience. Selection conditions appear in the logical expression area. The function definition area serves for the specification of functions that are complex calculations. Functions help break queries into logical segments. The message area provides feedback information to the user, including error messages, interpretation of the query, explanation of the result and indications of work load. QUIS also handles incomplete information by applying Lipski's methodology (Lipski 1979).

The summary data model is another extension of the relational data model (Chen et al. 1988). A category (also called type or class) is defined as a subset of a relation instance with respect to a classification criterion, e.g. male employees or programmers. Functions, like sum, count, etc., are applied to categories to compute summary statistics. These precomputed summary statistics are stored in the database and used in calculating new summary statistics for the categories whose statistics are not stored in the database. Note that in these calculations, the original relation instance (the raw data) is never used. Since it is not possible to store all the possible categories in a database, only a preselected group of categories is kept in the database. It is also shown that determining whether a category is derivable from a set of other categories is NP-hard.
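The idea of deriving new summary statistics from stored ones, without touching the raw data, can be sketched in Python as follows. This is a simplified illustration and not the algorithm of Chen et al.: the categories, the figures and the function names are hypothetical, and only additive statistics (count and sum) over a subset or over disjoint categories are handled.

```python
# Hypothetical precomputed summaries: category -> (count, sum of salary).
STORED = {
    "employees": (10, 500_000),
    "male employees": (6, 320_000),
    "male programmers": (2, 130_000),
}


def difference(whole, part):
    """Summary of whole minus part; valid when part is a subset of whole."""
    return (whole[0] - part[0], whole[1] - part[1])


def union(a, b):
    """Summary of the union of two disjoint categories."""
    return (a[0] + b[0], a[1] + b[1])


def average(summary):
    """Average derived from a (count, sum) pair; None for an empty category."""
    count, total = summary
    return total / count if count else None


# Categories whose statistics are not stored, derived from stored ones.
female = difference(STORED["employees"], STORED["male employees"])
male_non_programmers = difference(
    STORED["male employees"], STORED["male programmers"]
)
```

For example, `average(female)` is obtained from two stored pairs alone. The hard part identified in the text, deciding whether a requested category is derivable at all from the stored ones, is not addressed by this sketch.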

The suitability of SQL-based systems for statistical and scientific databases has been discussed in Klensin and Romberg (1988). This study reports that SQL is complex and tedious for carrying out most statistical analysis applications. The complexity stems from the nature of the relational data model, i.e. normalization and lack of navigation primitives, and the limited operation set of SQL. It also critically evaluates SQL2 (Melton 1988) which is being developed as an international standard by the International Standards Organization. SQL2 is an extended version of SQL for handling issues commonly encountered in statistical and scientific databases. These issues are categorized with respect to database usage patterns, data types and structures, domains and restrictions on relationships among values, and data analysis requirements and their implications. They include transactions and data cleaning, storage management and auditing, extended data types to manage nominal, ordinal and interval type numeric data effectively, aggregated data types such as arrays and nominal multiple response questions, storing values from computations, and access to statistical metadata.

5.2. SDB query languages based on network or hierarchical models

The majority of SDBMS, proposed or implemented, utilize the relational data model. There are few SDBMS based on the hierarchical and network data models. One such system is SIR/DBMS which allows navigation on hierarchical and network databases by a procedural language (Anderson et al. 1983).

The table producing language (TPL) descriptive codebook system (TPLDC) (Weiss et al. 1981) utilizes the trees formed from the links between entity sets. To define these trees, the database administrator prepares a graph, G, from the relationships between the entity sets. A one-to-many relationship between the entity sets A and B is represented by a directed edge from A to B. A one-to-one relationship is also represented by one directed edge. A many-to-many link between the entity sets A and B is represented by two directed edges, one from A to B, another one from B to A. All possible rooted directed trees are enumerated into a set called S. In these trees, a node has only a one-to-many link to each of its children. The set V is obtained by deleting all the trees in S that are subtrees of other trees in S. Finally, the trees in V are defined by association statements in the database schema of the TPLDC system. Each tree is called a view and is used to produce summary tables. The use command of TPL is used to pick a view. Then, a subset of this view is formed by the select command which allows arithmetic, comparison and logical operators. Finally, the user specifies the attributes (called variables) to be extracted and used in producing the summary tables. There are two problems with TPL. First, even if there are a small number of entities, the set V may be very large. In fact, it may contain as many as n! trees if there are n entity sets. Second, a user has to deal with the entire tree even if only a subset of the tree is needed. In spite of these problems TPL is widely used, in more than 200 computer centres.
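The first step of this construction, building the graph G from the relationships, can be sketched as follows. The triple encoding of relationships, the kind labels and the entity set names are assumptions of this sketch, not TPLDC syntax; enumerating the trees in S and pruning them to V is not attempted here.

```python
def build_link_graph(relationships):
    """Return the directed edges (from_set, to_set) of the graph G.

    `relationships` is an iterable of (a, b, kind) triples, where kind is
    "1:N", "1:1" or "M:N". This encoding is hypothetical.
    """
    edges = set()
    for a, b, kind in relationships:
        if kind in ("1:N", "1:1"):
            # One directed edge from A to B.
            edges.add((a, b))
        elif kind == "M:N":
            # Two directed edges, one in each direction.
            edges.add((a, b))
            edges.add((b, a))
        else:
            raise ValueError(f"unknown relationship kind: {kind}")
    return edges


# Hypothetical entity sets and relationships.
relationships = [
    ("DEPT", "EMP", "1:N"),
    ("EMP", "BADGE", "1:1"),
    ("EMP", "PROJECT", "M:N"),
]
edges = build_link_graph(relationships)
```

Each many-to-many link contributes two edges, which is one reason the number of rooted trees over G can grow so quickly with the number of entity sets.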

A hypertext system organizes the secondary information about the statistical objects stored in the Luxemburg Income Study Database (Stephenson 1988). Such information includes methods of data collection, problems with individual observations, precise definitions of missing values, etc.

5.3. Formal extensions to the relational model

SSDL is a procedural data manipulation language based on a data model that includes various objects: set, ordered set, vector, matrix, time series, text and G-relation (a generalized relation) (Brown et al. 1985). A G-relation is a complex data type borrowed from the Semantic Association Model (SAM*) (Su 1983). A G-relation is a recursive structure and its attributes receive their values from a complex domain which may include sets, vectors, matrices, or another G-relation. SSDL has a procedural language with a rich set of operators for the objects it supports. These operations include relational algebra operations for the G-relations, set theory operations for the sets and ordered sets, linear algebra operations for the vectors, matrices, and time series and string manipulation operations for the text data. The query language of SSDL also provides scanning and blocking constructs. A looping statement and currency indicators are used to scan explicitly the tuples of a G-relation. Thus, the query language of SSDL is very powerful. However, queries may require coding of long procedures. This approach deviates from the commonly agreed notion that a database query language should provide a minimal set of operators with an acceptable level of expressive power.

Klug extends relational algebra and relational calculus by an operation called aggregate formation (Klug 1982). This operation partitions the operand relation according to a set of designated attributes and applies an aggregate function on the specified attribute for each partition. Considering the PERSON relation given above, the following algebraic statement calculates the number of persons for each BIRTH-COUNTY:

PERSON (BIRTH-COUNTY, COUNT (ID))

Klug adds a similar function to the relational calculus by defining two types of formulas: a range formula and a qualifier. The calculus language is the basis for the statistical language ABE that we discussed earlier. Klug also shows that relational algebra and relational calculus, with the aggregate formation operation, have the same expressive power.
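The effect of the aggregate formation statement on PERSON can be imitated in a few lines of Python. This is only a sketch of the partition-then-aggregate idea: the function name, the dictionary representation of tuples and the sample rows are assumptions, and nothing here reflects Klug's formal definition beyond what is stated above.

```python
from collections import defaultdict


def aggregate_formation(relation, group_attrs, target, agg):
    """Partition `relation` on `group_attrs` and apply `agg` to the
    `target` values of each partition."""
    partitions = defaultdict(list)
    for row in relation:
        key = tuple(row[a] for a in group_attrs)
        partitions[key].append(row[target])
    return {key: agg(values) for key, values in partitions.items()}


# Hypothetical instance of the PERSON relation.
PERSON = [
    {"ID": 1, "BIRTH-COUNTY": "Kings"},
    {"ID": 2, "BIRTH-COUNTY": "Queens"},
    {"ID": 3, "BIRTH-COUNTY": "Kings"},
]

# Counterpart of PERSON (BIRTH-COUNTY, COUNT (ID)).
counts = aggregate_formation(PERSON, ["BIRTH-COUNTY"], "ID", len)
```

Every tuple falls into exactly one partition, which is the property that distinguishes partitioning from the template-based grouping discussed below.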

Ozsoyoglu and Ozsoyoglu (1983b) extend the relational model by set-valued attributes. An attribute value may be a simple value or a set of values. Thus, one level of nesting is allowed. They also define relational algebra (Ozsoyoglu and Ozsoyoglu 1983b) and relational calculus languages (Ozsoyoglu et al. 1987) for set-valued relations and show that these two languages have the same expressive power. The relational algebra includes pack, unpack, and aggregation-by-template operations in addition to the basic relational algebra operations. Pack is a one-attribute nesting operation. To pack attribute A of a relation, the relation is partitioned according to the remaining attributes and the A-values in each partition are grouped into a set. The common value of a partition and the set of A-values form a tuple of the result. Unpack is a one-attribute unnest operation and is applied on a set-valued attribute. A new tuple is created for each value in the set-valued attribute by repeating the remaining attribute values. The aggregation-by-template operation is used for aggregate calculations. This operation groups the tuples of a relation with respect to ranges specified by a template relation. Then, the specified aggregate function is applied on the designated attribute and an aggregate value is calculated for each group. A template value and the corresponding calculated aggregate value form a tuple of the result. The aggregation-by-template operation is different from the aggregate formation operation. The former uses grouping whereas the latter uses partitioning. However, each operation can be expressed by the other operation.
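Pack and unpack, as described above, can be sketched in Python over relations held as lists of dictionaries. The representation, the function names and the DEPT/SKILL example are assumptions made for illustration only.

```python
def pack(relation, attr):
    """Nest `attr`: rows that agree on all remaining attributes are merged
    and their `attr` values are collected into one set."""
    groups = {}
    for row in relation:
        rest = tuple(sorted((k, v) for k, v in row.items() if k != attr))
        groups.setdefault(rest, set()).add(row[attr])
    return [dict(rest, **{attr: frozenset(values)})
            for rest, values in groups.items()]


def unpack(relation, attr):
    """Unnest the set-valued `attr`: one output row per member of the set,
    repeating the remaining attribute values."""
    out = []
    for row in relation:
        for value in sorted(row[attr]):
            flat_row = dict(row)
            flat_row[attr] = value
            out.append(flat_row)
    return out


# Hypothetical flat relation.
flat = [
    {"DEPT": "toys", "SKILL": "sales"},
    {"DEPT": "toys", "SKILL": "repair"},
    {"DEPT": "books", "SKILL": "sales"},
]
packed = pack(flat, "SKILL")
```

In this example unpacking the packed relation returns the original tuples (in a possibly different order), so the two operations behave as inverses here.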

The calculus language defined by Ozsoyoglu et al. (1987) manipulates set-valued relations and includes functions that are equivalent to the aggregation-by-template operation of the relational algebra. This language forms the basis for the summary table manipulation language STBE (Ozsoyoglu and Ozsoyoglu 1984a) of the statistical database system SSDB (Ozsoyoglu and Ozsoyoglu 1984b).
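The aggregation-by-template operation referred to here can be sketched as follows. The sketch reads "grouping" as meaning that the template decides group membership, so a tuple may fall into no group or into several; that reading, the use of Python `range` objects as template values, and the sample data are all assumptions.

```python
def aggregation_by_template(relation, attr, template, target, agg):
    """For each value set in `template`, aggregate the `target` values of
    the rows whose `attr` value lies in that set."""
    result = []
    for value_set in template:
        members = [row[target] for row in relation if row[attr] in value_set]
        result.append((value_set, agg(members)))
    return result


# Hypothetical relation and template of age ranges {0...20} and {21...40}.
persons = [
    {"ID": 1, "AGE": 5},
    {"ID": 2, "AGE": 20},
    {"ID": 3, "AGE": 21},
    {"ID": 4, "AGE": 40},
    {"ID": 5, "AGE": 65},  # matches no template value, so it is not counted
]
template = [range(0, 21), range(21, 41)]
by_age = aggregation_by_template(persons, "AGE", template, "ID", len)
```

Each result pair holds a template value and its aggregate, which corresponds to one cell of a summary table with a set-valued category attribute.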

5.4. SDB query languages with graphical user interfaces

A large group of people at different organizational levels interact with the databases. However, the majority of these people do not have formal education in using computers. Generally, they encounter difficulty in using the text-based query languages even though they are designed with the goal of being user-friendly, i.e. easy to use, easy to learn. Some of the difficulties identified are (Wong and Kuo 1982):

1. Usually, the conceptual schema contains a long list of entities and their attributes, so the user has to remember their names, acronyms, etc.

2. Data models are mostly based on abstract concepts such as sets or logic which do not provide easy-to-understand semantic information.

3. Lack of feedback and facilities for incremental query formulation, especially for complex queries.

4. Lack of details in level of database schema.

5. Lack of metadata browsing facilities.

The large volume of statistical databases accentuates these problems and makes the SDB query languages even more difficult to use. Considering these problems and following the approach of QBE (Zloof 1987) and other developments in graphical user interfaces, statistical database researchers and designers proposed and/or implemented graphical, menu-driven, browsing-based and diagrammatic query languages (Cattell 1980). Among such systems and languages, we can list STBE (Ozsoyoglu and Ozsoyoglu 1984a), ABE (Klug 1981), GUIDE (Wong and Kuo 1982), SUBJECT (Chan and Shoshani 1980), SEEDS On-line Codebook (Merrill et al. 1983), ALDS Data Editor (Thomas and Hall 1983), GRASS (Rafanelli and Ricci 1990), GRASP (Catarci and Santucci 1990), and TBE-2 (Malmborg 1988).

GUIDE (Wong and Kuo 1982) uses the E/R model (Chen 1976) to represent the entities and the relationships among them. GUIDE keeps additional metadata information in two directories. The hierarchical subject directory classifies the entity types (subjects) hierarchically. Similarly, the hierarchical attribute directory organizes attribute types into hierarchical groups. The user refers to both directories from a sequence of menus to locate relevant subjects (entity types) and their attributes. After locating the required metadata the user traverses the E/R graph. Any part of the graph can be brought to the screen and the user selects entities and paths among them. Previously selected parts of the graph may be combined to form more complex queries. Once the scheme relevant for a query is determined the user can execute the query and see the results on the screen. A graphical data display facility is made available to display the aggregation results in formats such as tables, pie and bar charts, and plots.

The SUBJECT system (Chan and Shoshani 1980) represents metadata and data as an acyclic directed graph where nodes denote the categories in the system and the edges show the hierarchical relations among them. There are two types of nodes: cluster nodes and cross-product nodes. Cluster nodes are used to represent a collection of items or values of the same type. For instance, a cluster node may represent a state whose subordinate nodes are a collection of cities. Cross-product nodes are used to represent multidimensional structures, usually parameters or categories of the data. As an example, in modelling population counts by state and by year, state and year cluster nodes are combined by a cross-product node.

A user can enter the SUBJECT system at the root node of the directed graph to find the subject categories that exist in the system. By selecting a subject category the user is provided with more detailed information. The user can continue to browse and eventually select a data file. At this point the user can explore the attributes and parameters of the file and express a query by selecting the desired nodes of the graph. An alternative to the browsing capability is provided for experienced users whereby they can search the data file using keys. Thus, they may quickly locate the desired data file and proceed to express query conditions by moving around the directed graph that file represents. Another feature of the SUBJECT system is node sharing which permits more than one arc to point to the same node. Node sharing allows attribute domains to be shared among different files, which eliminates duplication of data values, achieves consistency in naming the items, and provides join domains.

Table-by-Example-2 (TBE-2) is a proposed graphical user language which utilizes a window-icon-mouse interface (Malmborg 1988). The two components of the system are a metadata browser and a graphic table design language. An E/R-like data model is used in representing the metadata. In using the browser, the statistical entities (objects) are pointed to and selected by the mouse. This also invokes a pop-up menu which lists various operations available. Once the user selects the relevant objects, summary tables can be created by using the table design language. Hence, this is a simple and intuitive user interface that requires little documentation in specifying user requests. A prototype of this system is being implemented for demonstration purposes.

5.5. Natural language based user interfaces

LIDS86 is an experimental system which uses an object-oriented statistical data model and has a natural language-based user interface (Sato 1988). Its data model organizes the metadata into two levels: conceptual level and DB level. The conceptual level represents all the possible statistical data obtainable in the object world whereas the DB level represents the actual statistical data stored in the database. Metadata knowledge is kept in a statistical data dictionary in a hierarchy of frames which are connected by a-kind-of links. In addition to statistical tables (called statistical objects), category and summary attributes of these tables are also considered as statistical objects. Another feature of this approach is the distinction between category and classification. Categories represent groupings of objects according to the values in the domains of category attributes. A classification, on the other hand, represents groups of categories defined to serve specific purposes. The statistical data dictionary can be browsed using hierarchically organized menu screens which can accommodate 22 types of questions on the metadata.

The user formulates queries in a natural query language via a dialogue with the system. The initial query specification is a natural language statement which refers to statistical objects. The system checks the ambiguities in the user request and guides the user to make the right choice by providing available knowledge on category attributes, classifications and their hierarchical structure. Then, the command generator produces a command to retrieve the desired statistics from the database.

6. Summary table manipulation languages

Summary tables are organized in the form of cells which are qualified by the row and column captions. Figure 1(a) is a fictitious summary table scheme whose instance is depicted in Fig. 1(b). This summary table gives the population counts, in thousands, by race and by age group in Manhattan. The row and column captions are called the category attributes and they are used to group the individuals (entities, records) stored in the database. The cell attribute of the summary table represents aggregate data and is also called a summary attribute. An aggregate value is calculated for each combination of the row and column category attribute values. For instance, the aggregate value is 350 for white (row category attribute value) and {0...20} (column category attribute value). Note that a category attribute may have a single value (e.g. race) or a set of values (e.g. age groups). Therefore, summary tables cannot be represented directly by first normal form relations. A formal extension to the relational model to allow set-valued attributes for converting summary tables into relations and utilizing the relational database theory has been reported in Ozsoyoglu and Ozsoyoglu (1983a). A summary table with only one cell attribute is called an elementary summary table. The example in Fig. 1 is an elementary summary table. Several elementary summary tables may be combined in one composite summary table for user convenience or output formatting. Derivability of primitive summary tables from other primitive summary tables and/or elementary (raw) data has been explored in Sato (1981).

MANHATTAN-POPULATION-COUNTS    AGE
RACE                           Cell (COUNT)

(a)

MANHATTAN-POPULATION-COUNTS    AGE
RACE                           {0...20}    {21...40}
WHITE                          350         451
BLACK                          150         170
OTHER                          183         204

(b)

Fig. 1. Example summary table: population data for 1989. (a) Scheme; (b) instance

MANHATTAN-POPULATION-COUNTS    AGE
YEAR    RACE                   Cell (COUNT)

(a)

MANHATTAN-POPULATION-COUNTS    AGE
YEAR    RACE                   {0...20}    {21...40}
1987    WHITE                  322         390
        BLACK                  120         153
        OTHER                  80          45
1988    WHITE                  340         395
        BLACK                  140         168
        OTHER                  112         120
1989    WHITE                  350         451
        BLACK                  150         170
        OTHER                  183         204

(b)

Fig. 2. Summary table with time. (a) Scheme; (b) instance

[Figure: the MANHATTAN-POPULATION-COUNTS summary table (RACE by AGE, cell attribute Count) repeated along a Time axis running from 1987 to 1989.]

Fig. 3. Three-dimensional view of summary tables

Time (year, month, week, etc.) is commonly used as a category attribute. Consider the example summary table in Fig. 1 and rearrange it for the years 1987, 1988, and 1989. The new scheme and its instance are given in Fig. 2. Year is included in the row category attributes at a higher level than 'race'. It could also be placed below 'race'. We could even include year in the column category attributes. In each case we obtain a different summary table; however, they all contain the same information.

Another alternative is to add the time as a third dimension to the summary tables. This approach follows the 'time cube' metaphor which views the time as the third dimension of a relation (Tansel 1990). Figure 3 depicts the summary table of Fig. 1 in this three-dimensional view. We can visualize it as three separate summary tables (one for each year) stacked together. Note that this representation of a summary table can be converted to the two-dimensional view or vice versa.
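The conversion between the three-dimensional and two-dimensional views can be sketched in Python using the counts of Fig. 2. The dictionary layout and the function names are assumptions chosen for the illustration; they do not correspond to any system surveyed here.

```python
AGES = ("{0...20}", "{21...40}")
RACES = ("WHITE", "BLACK", "OTHER")

# Counts from Fig. 2, one list of (age-group) rows per year.
FIG2 = {
    1987: [(322, 390), (120, 153), (80, 45)],
    1988: [(340, 395), (140, 168), (112, 120)],
    1989: [(350, 451), (150, 170), (183, 204)],
}

# Three-dimensional view: one cell per (year, race, age group).
cube = {
    (year, race, age): count
    for year, rows in FIG2.items()
    for race, row in zip(RACES, rows)
    for age, count in zip(AGES, row)
}


def year_slice(cube, year):
    """One summary table of the stack: the cells of a single year."""
    return {(r, a): v for (y, r, a), v in cube.items() if y == year}


def to_two_dimensional(cube):
    """Year as the outer row category attribute, as in Fig. 2."""
    table = {}
    for (y, r, a), v in cube.items():
        table.setdefault((y, r), {})[a] = v
    return table


def to_cube(table):
    """Inverse of to_two_dimensional."""
    return {(y, r, a): v
            for (y, r), cells in table.items()
            for a, v in cells.items()}
```

The slice for 1989 reproduces the table of Fig. 1, and the round trip through the two-dimensional view loses no cells, which is the sense in which the views carry the same information.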

The STRAND (Johnson 1981) query language includes an operation called summarization which calculates a complex summary table from several elementary summary tables. The cross-product abstraction of the SUBJECT system (Chan and Shoshani 1980) roughly corresponds to a primitive summary table. The HSDB system (Ikeda and Kobayashi 1981) has the capabilities to create and manipulate summary tables. It has an operation to create a primitive summary table (called an elementary summary table) from a relation. Ghosh proposes two languages to manipulate primitive summary tables that are represented as flat relations (Ghosh 1984a, b). Ranges of set-valued category attributes are converted to a single value, e.g. {0...20} is converted to 10. The first language is based on relational algebra and includes project, aggregate, and sampling operations. The second language QBSRT is a graphical language similar to QBE.

The TPL language (Table Producing Language System 1980) contains nine commands to produce arbitrarily complex summary tables from tree structured files. Use initiates a file and select picks a subset of that file. The table command specifies the row and column category attribute trees by specifying their nodes (category attributes) in the sequence generated by the preorder tree traversal method. Define and compute commands are used to specify new category attributes and cell attributes, respectively. Post-compute command calculates new cell attribute values. Relative-time changes the values of the DATE attribute automatically in time series data. Median and quantile commands calculate medians and quantiles of the cell values, respectively.
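The preorder sequence that the table command expects can be illustrated with a short Python sketch. The tuple encoding of the tree and the attribute names are hypothetical, and the output is only the node order, not TPL command syntax.

```python
def preorder(node):
    """Yield category attribute names in preorder: a node first, then each
    of its subtrees from left to right."""
    name, children = node
    yield name
    for child in children:
        yield from preorder(child)


# Hypothetical row category attribute tree: YEAR with subtrees RACE and SEX,
# where SEX is further broken down by AGE.
row_tree = ("YEAR", [("RACE", []), ("SEX", [("AGE", [])])])
row_sequence = list(preorder(row_tree))
```

Listing the nodes in this order is enough to rebuild the nesting of captions only if the tree shape is also known, so the sketch should be read as showing the ordering alone.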

STBE (Ozsoyoglu and Ozsoyoglu 1984a; 1985a) is a powerful graphical query language which utilizes the example query concept to express a query as a hierarchically organized set of subqueries. There is a ROOT subquery which defines the output of the query and invokes subqueries. The output of a subquery is a relation or a summary table. A subquery applies an aggregate operation on the result produced by a subquery it invokes. Conditions and links among the relations (joins) are specified by example elements. STBE converts references to summary tables to corresponding set-valued relations. Internally all operations are carried out by using relations. Thus, STBE utilizes the relational database theory and provides a powerful facility for the manipulation of summary tables.

STL (Ozsoyoglu and Ozsoyoglu 1985a) is an algebraic language used in the implementation of STBE. STBE queries are translated to STL expressions for query processing. STL includes the basic relational algebra operations along with the operations to manipulate set-valued relations (Ozsoyoglu and Ozsoyoglu 1983b) as well as operations to manipulate summary tables. Relation formation converts a summary table to a set-valued relation. Primitive summary table formation is the inverse operation, used to form a summary table from a relation. The concatenate operation combines two summary tables having the same row or column category attributes. Extract is the inverse operation, used to remove a summary table from a concatenated summary table. Attribute split and attribute merge operations provide primitive/non-primitive summary table transformation capabilities by relocating the rows/columns of a summary table. These are the basic operations of STL and they are powerful enough to express any query expressible by the relational languages. However, some complex queries may require long expressions in STL. To simplify the expressions for such queries, Aggregation-over-table and Attribute-Removal-by-Aggregation operations are added to STL.

Statistical Sweden System Handler AXIS (Malmborg 1986) and Statistical Database National Land Agency of Japan (Sato et al. 1986) process summary tables and their query languages are menu driven. CSO/Bank of England tape service system stores time series data and provides on-line data via a simple command language (Whistler 1986). A system for deriving summary tables from temporal databases has been reported in Ahn et al. (1990). The proposed system calculates summary tables from the base relations where time-stamps are attached to tuples. The system also allows storage of calculated summary tables so that summary tables derivable from the existing ones are not recalculated from the temporal relations. The July System (D'attri and Ricci 1988) utilizes a universal relation interface in generating summary tables from a relational database. It aims at providing a simple and user-friendly environment for non-technical SDB users by shielding the schema details from them.

7. Statistical capabilities of temporal query languages

A temporal database contains past, present and, possibly, future data. Query languages of such databases are called temporal query languages. In this section we examine aggregation features of such languages. TQUEL (Snodgrass 1987) is an extension to QUEL, the query language of the INGRES database management system, for handling temporal data. TQUEL uses flat relations which are augmented by four implicit time attributes. Aggregate functions of QUEL have been extended in TQUEL (Snodgrass et al. 1987). TQUEL includes stdev, first, last, avgti, varts, earliest, and latest in addition to the max, min, avg, sum, count, and any aggregates of QUEL. Avgti represents the average growth or decrease experienced by values of an attribute over time and varts represents the variability of time spacing. The others are self-explanatory. Cumulative and instantaneous versions of aggregate functions are defined. In calculating the aggregates, the former considers all the values up to, and including, a time point, whereas the latter considers only the values valid at a designated time point.
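The difference between the cumulative and instantaneous versions can be shown with a small Python sketch. It is not TQUEL: each tuple carries a single half-open validity interval [start, end) instead of TQUEL's four implicit time attributes, and the function names and data are assumptions.

```python
def instantaneous(tuples, t, agg):
    """Aggregate only the values valid at time point t."""
    return agg([v for (start, end, v) in tuples if start <= t < end])


def cumulative(tuples, t, agg):
    """Aggregate all values that became valid up to and including t."""
    return agg([v for (start, end, v) in tuples if start <= t])


# Hypothetical (start, end, value) tuples, valid over [start, end).
history = [
    (1985, 1987, 30),
    (1987, 1990, 40),
    (1988, 1991, 50),
]

inst_sum = instantaneous(history, 1988, sum)
cum_sum = cumulative(history, 1988, sum)
```

At time point 1988 the instantaneous sum ignores the tuple that expired in 1987, while the cumulative sum still includes it.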

A time sequence collection (TSC) can be visualized as a two-dimensional array where surrogates (identifiers of objects) and time represent the row and column indexes of the temporal values, respectively (Shoshani and Kawagoe 1987; Segev and Shoshani 1987). A TSC models a time-dependent attribute of an object (entity) type. Algebra and vector-based operations have been defined. The aggregation operation can be applied to groupings along the time dimension or the surrogate dimension to create a new TSC. In applying aggregation on the surrogate dimension the objects of the source TSC are grouped according to the surrogate values in another TSC and values in each group are aggregated.
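Aggregation along the two dimensions of a TSC can be sketched in Python. The sketch simplifies the model considerably: a TSC is a dictionary from surrogate to a list of values at equally spaced time points, time grouping is by fixed-size blocks, and the surrogate grouping is given as a plain mapping instead of being read from another TSC. All names and values are hypothetical.

```python
def aggregate_time(tsc, group_size, agg):
    """Aggregate along the time dimension: consecutive blocks of
    `group_size` time points collapse into one value per surrogate."""
    return {
        s: [agg(vals[i:i + group_size]) for i in range(0, len(vals), group_size)]
        for s, vals in tsc.items()
    }


def aggregate_surrogates(tsc, group_of, agg):
    """Aggregate along the surrogate dimension: surrogates in the same
    group are combined time point by time point."""
    groups = {}
    for s, vals in tsc.items():
        groups.setdefault(group_of[s], []).append(vals)
    return {g: [agg(col) for col in zip(*rows)] for g, rows in groups.items()}


# Hypothetical TSC: sales per employee at four time points.
sales = {"e1": [10, 20, 30, 40], "e2": [1, 2, 3, 4], "e3": [5, 5, 5, 5]}
dept_of = {"e1": "d1", "e2": "d1", "e3": "d2"}

half_yearly = aggregate_time(sales, 2, sum)
by_dept = aggregate_surrogates(sales, dept_of, sum)
```

Both results are again surrogate-by-time arrays, matching the statement that aggregation creates a new TSC.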

TEER is an extension of the entity relationship model for managing temporal data (Elmasri and Wuu 1990). It is based on attribute time-stamping where disjoint union of time intervals is used as time-stamps. TEER follows the approach of Gadia and Vaishnav (1985) in defining snapshots for a time point or interval and extracting the time reference of an object. The query language of TEER includes aggregate functions and their temporal counterparts, i.e. TSUM, TMIN, TMAX, TAVERAGE, TCOUNT, and TEXIST. These functions apply the aggregation operation on the values valid over a specified time period. On the other hand, the regular aggregate functions return an aggregate value for each time point by considering only the values valid at that time point.

Tansel extends the relational data model to handle temporal data by attaching time-stamps to an attribute (Tansel 1986). An attribute value (called a temporal atom) is represented as a (time-stamp, value) pair. The time-stamp is a time interval [l, u) which is denoted by its lower bound (l) and its upper bound (u). A temporal atom, ([l, u), v), asserts that the value v is valid over the time period represented by [l, u). The history of an attribute of an object is represented by a set of temporal atoms. Thus, these are non-first normal form relations with one level of nesting (Jaeschke and Schek 1982). Temporal relational algebra of this model includes the basic relational algebra operations with appropriate modifications as well as new ones to manipulate temporal data (Tansel 1986). The Aggregate formation operation of Klug (1981) is also extended to temporal relations. In addition to the usual aggregates (AVG, MIN, MAX, COUNT and SUM), temporal aggregates are also included, e.g. FIRST, LAST, WAVG (weighted average), PSUM (proportional sum). In the last two, time serves as the weight.

A new operation, 'enumeration', has been added to the relational algebra. It creates a uniform set of values. It applies a small scale aggregate operation to the attribute values valid over a specified interval. Thus, data values for this interval are turned into uniform values. Consider the employee relation EMP (SS#, SALARY, DEPT, MANAGER) where the last three attributes are sets of triplets that represent the history of these attributes. We want to calculate the average salary in each department in 1988. If every employee has a single salary value for the entire year, computing the average salary is straightforward. However, this is not the case most of the time. An employee may have different salary values due to promotion or demotion in a year. A representative salary value for each employee has to be calculated first. This could be the weighted average where time serves as the weight, or simply the last salary of an employee. The choice depends on how the value will be used. Finally departmental averages are calculated by using these representative (uniform) values. In the following two statements, enumeration and aggregate formation perform this calculation using the EMP relation:

EMPSAL = EMP (ID, DEPT, WAVG (SALARY))_T

where T contains the time interval representing 1988, i.e. [1988, 1989) and

EMPSAL (DEPT, AVG (SALARY)).
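The two statements can be traced with a Python sketch that first computes a time-weighted salary per employee over T and then averages by department. It simplifies the model: DEPT is treated as a plain value instead of a history, time is measured in fractional years, and the employee data are invented for the example.

```python
def overlap(l, u, wl, wu):
    """Length of the intersection of [l, u) and [wl, wu)."""
    return max(0.0, min(u, wu) - max(l, wl))


def wavg(history, window):
    """Time-weighted average of the (l, u, v) triplets over `window`."""
    wl, wu = window
    weight = sum(overlap(l, u, wl, wu) for (l, u, _) in history)
    if weight == 0:
        return None  # no value is valid inside the window
    return sum(v * overlap(l, u, wl, wu) for (l, u, v) in history) / weight


# Hypothetical EMP instance; SALARY is a history of (l, u, value) triplets.
EMP = [
    {"ID": 1, "DEPT": "toys",
     "SALARY": [(1987.0, 1988.5, 1000.0), (1988.5, 1990.0, 1200.0)]},
    {"ID": 2, "DEPT": "toys", "SALARY": [(1986.0, 1991.0, 900.0)]},
    {"ID": 3, "DEPT": "books",
     "SALARY": [(1988.0, 1988.25, 2000.0), (1988.25, 1989.0, 2400.0)]},
]
T = (1988.0, 1989.0)  # the interval [1988, 1989)

# Enumeration step: one representative (uniform) salary per employee.
EMPSAL = [{"ID": e["ID"], "DEPT": e["DEPT"], "SALARY": wavg(e["SALARY"], T)}
          for e in EMP]

# Aggregate formation step: average of the uniform salaries per department.
by_dept = {}
for row in EMPSAL:
    if row["SALARY"] is not None:
        by_dept.setdefault(row["DEPT"], []).append(row["SALARY"])
DEPT_AVG = {d: sum(v) / len(v) for d, v in by_dept.items()}
```

Replacing `wavg` with a function that returns the last salary valid in T would give the other representative value mentioned above, and the departmental averages would change accordingly.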

Algebra and calculus languages for arbitrarily nested relations that handle time have been formulated in Tansel (1992).

HQUEL (Tansel and Arkun 1986a; Tansel 1990) is another extension to QUEL for manipulating temporal relations with attribute time-stamping and one level of nesting. HQUEL allows specification of temporal predicates as regular QUEL conditions in where clauses. HQUEL provides regular QUEL aggregates and time related aggregates of temporal relational algebra. Aggregate specification is augmented by the 'when' clause which defines time frames (intervals). Values valid over this time frame participate in the calculation of the aggregate function.

Time-by-example (TBE) (Tansel et al. 1989) is a graphical user-friendly query language that extends STBE for manipulating temporal data. It uses the example query concept of QBE and the hierarchically arranged subqueries concept of ABE. This feature is especially useful for aggregate calculations. TBE includes regular and temporal aggregate functions which we have described above in the discussion of temporal relational algebra. Powerful aggregation features of TBE allow specification of aggregation operations in a hierarchical fashion. TBE is relationally complete since it is based on a relational calculus language defined for set-valued relations (Ozsoyoglu et al. 1987).

8. Conclusion

In this article we have surveyed existing and proposed statistical database query languages by following the criteria identified for evaluating such languages and the taxonomy of SDBMS developed in Ozsoyoglu and Ozsoyoglu (1985b). We have added two more categories to this taxonomy, namely natural language based SDB query languages and query languages of temporal databases with capabilities to calculate statistics. Ozsoyoglu and Ozsoyoglu (1985a) is a comprehensive and well-organized survey of SDB query languages as of that date, and includes research results presented at the first two workshops on statistical databases as well as at other conferences. Since then, one workshop and two conferences have been held on statistical and scientific databases. The present paper includes research reported at these conferences as well as the research results on temporal databases with statistical capabilities.

There are many proposals for statistical databases. However, we have not observed full implementation of any new systems that satisfy the criteria listed in Section 2 in an integrated manner. Instead, there are many experimental implementations of the proposed systems. At the same time, certain existing trends have become more predominant recently. Similar to the object-oriented data models in database design, object-oriented statistical data models are also being proposed. Another trend is the conceptualization of statistical metadata, which is voluminous in SDB environments, and storing this knowledge in directory and dictionary systems for the users' convenience. Improving the user interface, for reaching the metadata and for specification of user requests, is also a major concern among SDB researchers and designers. Graphical browsing and interaction schemes are generally being proposed. We also see proposals for SDB query languages based on natural languages. Although there is considerable research activity in temporal databases, incorporation of the temporal dimension in statistical databases and their query languages remains to be investigated and developed.

Acknowledgements

The author would like to thank Dr Gultekin Ozsoyoglu and Z. M. Ozsoyoglu of Case Western Reserve University for their valuable contributions, and Bilge Aydin for patiently typing the manuscript. The research has been supported in part by PSC-CUNY research grant #663294.

References

Ahn, T. H., Jo, H. J., Kim, J. H., Yoon, J. L. and Byung, J. K. (1990) Temporal summary table management and graphic interface. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management, Charlotte, NC.

Anderson, G. A., Snider, T., Robinson, B. and Toporek, J. (1983) An integrated support system for inter-package communication and handling large volume output from statistical database analysis operation. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Boufares, P., Elkabbaj, Y., Joiner, G. and Ounally, H. (1985) La version SM90 du SGBD relationnel PEPIN. Journées SM90, Versailles, France.

Brown, W. A., Navathe, S. B. and Su, S. Y. W. (1983) Complex data types and a data manipulation language for scientific and statistical databases. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Buhler, R. (1981) Data manipulation in P-STAT. In Proceedings of the First International Workshop on Statistical Database Management, Menlo Park, CA.

Catarci, T. and Santucci, G. (1990) GRASP: a graphical system for statistical database. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management, Charlotte, NC.

Cattell, R. G. G. (1980) An entity-based database user interface. In Proceedings of the ACM SIGMOD Conference.

Chan, C. and Michalewicz, Z. (1986) A query language capable of handling incomplete information and statistics. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Chan, P. and Shoshani, A. (1980) SUBJECT: A dictionary driven system for organizing and accessing large statistical databases. In Proc. VLDB Conf.

Chen, M. C., McNamee, L. and Melkanoff, M. (1988) A model of summary data and its applications in statistical databases. In Proceedings of the 4th International Conference on Statistical and Scientific Database Management, Rome.

Chen, P. P. S. (1976) The entity relationship model: toward a unifying view of data. ACM Transactions on Database Systems, 1(1).

Codd, E. F. (1972) Relational completeness of database sublanguages. In Database Systems (Courant Computer Science Symposia Series, Vol. 6), Prentice-Hall, Englewood Cliffs, NJ.

Computer Corporation of America (1979) File Manager's Technical Reference Manual, Model 204 Database Management System. Computer Corporation of America, Cambridge, MA.

D'attri, A. and Ricci, F. L. (1988) Interpretation of statistical queries to relational databases. In Proceedings of the 4th International Conference on Statistical and Scientific Database Management, Rome.

Denning, D. E., Nichelson, W., Sande, G. and Shoshani, A. (1983) Research topics in statistical database management. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Dintelman, S. M. and Maness, A. T. (1982) An implementation of a query language supporting path expressions. In Proceedings of the ACM SIGMOD Conference, Orlando, FL.

Elmasri, R. and Wuu, G. T. J. (1990) A temporal model and query language for ER-databases. In Proceedings of the 6th International Conference on Data Engineering, Los Angeles, CA, pp. 76-83.

Fortunato, E., Rafanelli, N., Ricci, F. L. and Sebastio, A. (1986) An algebra for statistical data. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Fry, J. B. (1981) Data manipulation in SPSS and SPSS-X. In Proceedings of the 1st International Workshop on Statistical Database Management, Menlo Park, CA.

Gadia, S. H. and Vaishnav, J. H. (1985) A query language for a homogeneous temporal database. In Proceedings of the Symposium on PODS, Portland, OR, pp. 51-58.

Ghosh, S. P. (1984a) Statistical relation tables for statistical database management. Technical Report RJ 4394, IBM Research Laboratory, San Jose, CA.

Ghosh, S. P. (1984b) An application of statistical databases in manufacturing testing. Proceedings of the IEEE COMDEC Conference, Chicago, IL.

Halanbondrainy, H. (1983) La système SICLA. In Proceedings of the 3rd International Symposium on Data Analysis, Versailles.

Hammond, R. (1981) Metadata in the RAPID DBMS. In Proceedings of the 1st International Workshop on Statistical Database Management, Menlo Park, CA.

Harlee, G. L. (1986) LABSTAT BROWSE: a search facility built for an existing database. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Heiler, S. and Bergman, R. F. (1983) SIBYL: An economist's workbench. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Hendrix, G. G. et al. (1978) Developing a natural language interface to complex data. ACM Transactions on Database Systems, 3(2), pp. 105-47.

Hollabaugh, L. A. and Reinwald, L. T. (1981) GPI: a statistical package/database interface. In Proceedings of the 1st International Workshop on Statistical Database Management, Menlo Park, CA.

Ikeda, H. and Kobayashi, Y. (1981) Additional facilities of a conventional DBMS to support interactive statistical analysis. In Proceedings of the 1st International Workshop on Statistical Database Management, Menlo Park, CA.

Jaeschke, G. and Schek, H. J. (1982) Remarks on the algebra of non first normal form relations. In Proceedings of the 1st ACM SIGACT/SIGMOD PODS Conference, Los Angeles, CA, pp. 124-138.

Johnson, R. (1981) Modelling summary data. In Proceedings of the ACM SIGMOD Conference, Ann Arbor, MI.

Joiner, G., Kezouit, O. and Halanbondrainy, H. (1986) Data analysis for relational databases: the PEPIN-SICLA Systems. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Karasolo, I. and Sevenson, P. (1983) An overview of CANTOR--a new system for data analysis. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Karasolo, I. and Sevenson, P. (1986) The design of CANTOR--a new system for data analysis. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Klensin, J. C. (1983) A statistical database component of data analysis and modelling system: lessons from eight years of user experience. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Klensin, J. C. and Romberg, R. M. (1988) Statistical data management requirements and the SQL standards--an evolving comparison. In Proceedings of the 4th International Conference on Statistical and Scientific Database Management, Rome.

Klug, A. (1981) ABE--a query language for constructing aggregates-by-example. In Proceedings of the 1st International Workshop on Statistical Database Management, Menlo Park, CA.

Klug, A. (1982) Access paths in the ABE statistical query facility. In Proceedings of the ACM SIGMOD Conference, Orlando, FL.

Kohji, S. and Sato, H. (1983) Statistical database research project in Japan and the CAS SDB project. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Lipski, D. (1979) The semantic issues connected with incomplete information. ACM Transactions on Database Systems, 4(3).

Maier, M. and Cirilli, C. (1983) SYSTEM/K: a knowledge base management system. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Malnborg, E. (1986) On the semantics of aggregated data. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg, pp. 152-58.

Malnborg, E. (1988) Design of user interface for an object-oriented statistical database. In Proceedings of the 4th International Conference on Statistical and Scientific Database Management, Rome.

Maness, A. T. and Dintelman, S. A. (1981a) Design of the genealogical information system. In Proceedings of the 1st International Workshop on Statistical Database Management, Menlo Park, CA.

Maness, A. T. and Dintelman, S. A. (1981b) The GENISYS data definition facilities. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

McCarthy, J. I. (1982) Metadata management for large statistical databases. In Proceedings of the VLDB Conference, Mexico City.

McKenzie, E. and Snodgrass, R. (1987) Supporting valid time: a historical relational algebra. Technical Report, Department of Computer Science, University of North Carolina at Chapel Hill.

Melton, J. (1988) ISO Database language. CPH-2a, ANSI X3H2-88-127, ANSI X3H2 ISO/IEC JTC1/SC21/WG3, Database Languages, ISO-ANSI (Working Draft) SQL2.

Merrill, D., McCarthy, J., Gey, F. and Holmes, H. (1983) Distributed data management in a minicomputer network. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Olken, F. (1983) How baroque should a statistical database management system be? In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Ozsoyoglu, G. and Ozsoyoglu, Z. M. (1983a) Features of SSDB. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Ozsoyoglu, G. and Ozsoyoglu, Z. M. (1983b) An extension of relational algebra for summary tables. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Ozsoyoglu, G. and Ozsoyoglu, Z. M. (1984a) STBE--a database query language for manipulating summary data. In Proceedings of the IEEE COMPDEC Conference, Los Angeles, CA.

Ozsoyoglu, G. and Ozsoyoglu, Z. M. (1984b) SSDB--an architecture for statistical databases. In Proceedings of the 4th IJCIT Conference, Jerusalem, Israel.

Ozsoyoglu, G. and Ozsoyoglu, Z. M. (1985a) A query language for statistical databases. In Query Processing in Database Systems (W. Kim, D. Reiner and D. S. Batory, eds.), Springer-Verlag, New York.

Ozsoyoglu, G. and Ozsoyoglu, Z. M. (1985b) Statistical database query languages. IEEE Transactions on Software Engineering, 11(10), pp. 1071-80.

Ozsoyoglu, G., Ozsoyoglu, Z. M. and Mata, F. (1985) A language and a physical organization technique for summary tables. In Proceedings of the ACM SIGMOD Conference, Austin, TX.

Ozsoyoglu, G., Ozsoyoglu, Z. M. and Matos, V. (1987) Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Transactions on Database Systems, 12(4), pp. 566-592.

Rafanelli, M. and Ricci, F. (1990) A visual interface for browsing and manipulating statistical entities. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management, Charlotte, NC.

Rosenthal, A. and Reiner, D. (1984) Extending the algebraic framework of query processing to handle outerjoins. In Proceedings of the VLDB Conference, Singapore.

SAS User Manual (1992) SAS Institute Inc, Box 8000, Cary, NC.

Sato, H. (1981) Handling summary information in a database: derivability. In Proceedings of the ACM SIGMOD Conference, Orlando, FL.

Sato, H. (1988) A data model, knowledge base and natural language processing for sharing a large statistical database. In Proceedings of the 4th International Conference on Statistical and Scientific Database Management, Rome.

Sato, H., Takayaki, O., Youshindu, N. and Pysouke, F. (1986) Conceptual schema for a wide-scope statistical database and its applications. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Segev, A. and Shoshani, A. (1987) Logical modelling of temporal databases. In Proceedings of the SIGMOD Conference, San Francisco, CA, pp. 454-466.

Shoshani, A. (1979) CABLE: a language based on the E-R model. In Proceedings of the E-R Conference, Los Angeles, CA.

Shoshani, A. (1982) Statistical databases: characteristics, problems and some solutions. In Proceedings of the VLDB Conference, Mexico City, pp. 208-222.

Shoshani, A. and Kawagoe, K. (1987) Temporal data management. In Proceedings of the VLDB Conference, Kyoto, Japan, pp. 79-88.

Snodgrass, R. (1987) The temporal query language TQUEL. ACM Transactions on Database Systems, 12(2), pp. 247-298.

Snodgrass, R., Gomez, S. and McKenzie, E. (1987) Aggregates in the temporal query language TQUEL. Technical Report, Department of Computer Science, University of North Carolina at Chapel Hill.

SQL/DS (1981) SQL/Data system: general information. Report GH24-5012, IBM Corporation, Department GRIT, 180 Kost Road, Mechanicsburg, PA 17055.

Stein, D. M. (1986) A database interface to an integrated data analysis and plotting tool. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Stephenson, G. A. (1988) Knowledge browsing: front ends to statistical database. In Proceedings of the 4th International Conference on Statistical and Scientific Database Management, Rome.

Stonebraker, M., Wong, E., Kreps, P. and Held, G. (1976) The design and implementation of INGRES. ACM Transactions on Database Systems, 1(3), pp. 189-222.

Su, S. Y. W. (1983) SAM*: A semantic association model for corporate and scientific-statistical database. Information Sciences, 29, pp. 151-199.

Su, S. Y. W., Navathe, S. B. and Batory, D. S. (1983) Logical and physical modeling of statistical/scientific databases. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Table Producing Language System, version 5 (1980) Bureau of Labor Statistics, Washington, DC.

Tansel, A. U. (1986) Adding time dimensions to relational model and extending relational algebra. Information Systems, 11(4), pp. 343-355.

Tansel, A. U. (1987) A statistical interface for historical relational databases. In Proceedings of the Data Engineering Conference, Los Angeles, CA.

Tansel, A. U. (1988) A statistical database for planning and research. Technical Report, Baruch College, CUNY.

Tansel, A. U. (1990a) Modelling temporal data. Journal of Information and Software Technology, 32(8).

Tansel, A. U. (1990b) A historical query language. Information Sciences, 32(8), 514-20.

Tansel, A. U. (1991) Statistical database query languages. In Statistical and Scientific Databases (ed. Z. Michalewicz), pp. 233-65. Ellis Horwood, London.

Tansel, A. U. (1992) Temporal relational data model. Technical Report, CIS-26-92, Baruch College, CUNY.

Tansel, A. U. and Arkun, M. E. (1986a) HQUEL, a query language for historical relational databases. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Tansel, A. U. and Arkun, M. E. (1986b) Aggregation operations in historical relational databases. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Tansel, A. U. and Garnett, I. (1991) Equivalence of algebra and calculus languages for nested relation. Computer and Mathematics with Applications, 23(10), 3-25.

Tansel, A. U., Arkun, M. E. and Ozsoyoglu, G. (1989) Time-by-example database query language. IEEE Transactions on Software Engineering, 15(4).

Thomas, J. J. and Hall, D. L. (1983) ALDS project: Motivation, statistical database management issues, perspectives, and directions. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Turner, M., Hammond, R. and Cotten, P. (1979) A DBMS for statistical databases. In Proceedings of the VLDB Conference, Rio de Janeiro, Brazil.

Weiss, S. E. and Weeks, P. L. (1983) PASTE--a tool to put application systems together easily. In Proceedings of the 2nd International Workshop on Statistical Database Management, Los Altos, CA.

Weiss, S. E., Weeks, P. L. and Byrd, N. J. (1981) Must we navigate through databases? In Proceedings of the 1st International Workshop on Statistical Database Management, Menlo Park, CA.

Whistler, D. (1986) The design of a database management system for economic time series data. In Proceedings of the 3rd International Workshop on Statistical and Scientific Database Management, Luxemburg.

Wong, H. K. T. and Kuo, I. (1982) GUIDE: Graphical user interface for database exploration. In Proceedings of the VLDB Conference, Mexico City.

Zloof, M. M. (1977) Query-by-example: a database language. IBM Systems Journal, 16(4), 324-43.