
Incremental Data Organization for Ancient Document Databases

Shinichi UESHIMA†, Kazuhiro OHTSUKI††, Jun-ya MORISHITA†††, Qing QIAN††††, Hiroaki OISO††††† and Katsumi TANAKA††††††

† Faculty of Informatics, Kansai University. E-mail: [email protected]. †† Faculty of Cross-Cultural Studies, Kobe University. E-mail: [email protected].

††† Faculty of Foreign Languages, Himeji Dokkyo University. E-mail: [email protected]. †††† Graduate School of Science and Technology, Kobe University. E-mail: [email protected].

††††† Nippon Steel Information & Communication Systems Inc. (ENICOM). E-mail: [email protected]. †††††† Faculty of Engineering, Kobe University. E-mail: [email protected].

Abstract

In this paper, we introduce a mechanism for incremental data organization of semi-structured data in handling ancient Chinese document data. The objective of the mechanism is to support scientists' incremental and hypothetical work processes (object/type identification, classification and verification/abstraction from users' multiple viewpoints). We have developed a prototype incremental database system based on an Object-Oriented DBMS, GemStone. The mechanisms realized by our system are (1) an instance-based data model which allows class extensions of heterogeneous objects, multiple class memberships and multiple roles of instances, and incremental object/schema evolution, (2) anchor object definition and manipulation, which enables users to define any fragments of text data as independent objects at run-time, (3) an active rule mechanism for enforcing the integrity constraints of the class membership of heterogeneous objects and incremental schema/object generation, and (4) a data set analyzing tool, which generates an intensional representation for a given data set to verify the validity of classification work and/or to discover concepts.

1 Introduction

Recently, much attention has been focused on Scientific Database Systems[1][2][3][4]. This kind of database system raises new research issues, such as:

• Treatment of a vast volume of raw or semi-structured data, which does not have a complete database schema and should be incrementally structured.

• Treatment of complex-structured data and multimedia data, for which Object-Oriented Database (OODB) systems might provide a reasonable solution.

Proceedings of the Fourth International Conference on Database Systems for Advanced Applications (DASFAA'95), Ed. Tok Wang Ling and Yoshifumi Masunaga, Singapore, April 10-13, 1995. © World Scientific Publishing Co. Pte Ltd

• Treatment of data files with multiple formats, for which it is desirable to provide facilities for more sophisticated data sharing among users and for querying.

• Integration of scientific data and their application programs.

• Visualization techniques and effective graphical user interfaces.

We are concerned with the first three issues, and we especially focus our attention on the first issue, the incremental data organization [3] of semi-structured data in scientific databases, which we regard as one of the most important issues of scientific data management. In scientific database systems, a large amount of primary data, often called raw data, are already gathered and stored in files, most of which are semi-structured. The term 'semi-structured' implies that the primary data have an incomplete set of predefined attributes and an incomplete database schema for classifying and storing them.

In this paper, we describe a mechanism for the incremental data organization in handling ancient Chinese document data (represented as wooden-slip writings). The objective of the mechanism is to support scientists' work processes, which are roughly classified into the following: the object/type identification process to discover and identify basic information units in pre-stored primary data, the classification process to classify those units into groups, and the verification process to examine scientists' "concepts" or "hypotheses". All of the above processes are done in a manner that is repetitive, incremental and hypothetical. These processes involve repetitive data retrieval, data creation, and schema updates for the semi-structured data. Also, it should be noted that the results of these processes may differ depending upon each scientist's viewpoint.

As for the incremental data organization, Zdonik[3] noticed the importance of a mechanism which enables users to define subobjects contained in a pre-stored object incrementally.


This notion, say post-identified objects, is also important in handling ancient Chinese text data, and we realized it by the notion of anchors shown in the present paper. Semi-structured data are often stored not in databases but in conventional files. Recently, Shoens et al.[5] and Abiteboul et al.[6] developed mechanisms to provide database views (schemata) for external files. Both of these works will be very useful for providing database functionalities to conventional files when it is possible to have a complete type (structure) definition for those files. Our approach is different from these two works because we do not know a complete type structure for ancient Chinese text data. That is, we cannot assume a complete class hierarchy for storing these data in advance.

In order to realize a mechanism for supporting scientists’ incremental and hypothetical data organization, we took the following approach:

• Instance-based object data model¹: We provide a very flexible instance-based object data model, which allows (1) class extensions that can have heterogeneous objects as their members, (2) multiple class memberships of instances, (3) multiple roles of instances, and (4) incremental object/schema evolution².

• Run-time object identification and creation: We realized this facility on our instance-based object data model; it enables users to define any fragments of text data as independent objects at run-time. This notion is called anchors.

• Active rule mechanism for incremental data organization: We adopted the well-known ECA mechanism[7] in order to cope with the following integrity control problems:

- How to enforce integrity constraints on the class membership of heterogeneous objects.

- How to enforce integrity constraints on users' incremental object/schema evolution work.

• An intensional representation of a data set: We developed a mechanism which generates an intensional representation, that is, a query representation, for a given data set; this will be useful for verifying the validity of classification work and/or discovering common concepts among data.

¹The instance-based data model we realized is called the Obase model[8], and it was implemented over a conventional class-based OODBMS.

²Here, incremental object evolution means a facility to migrate objects freely from one class to another, and to add or delete attributes per object in an incremental manner.

Figure 1: Primary object for a wooden-slip

By the above approach, we have developed a prototype incremental database system called TextLink/Gem based on the conventional Object-Oriented DBMS GemStone³.

2 Motivations and System Requirements

2.1 Wooden-Slip Semi-structured Data

Our primary materials are a collection of Chinese wooden-slips written in the Han Era (206 B.C.-220 A.D.). Wooden-slips, each of which is called mokkan in Japanese, were used for all kinds of documents, such as letters, labels, tags, diaries, official documents and so on. A long document was written by piling up these slips together. Usually, a pile of slips was dissolved as it lay buried in the ground for many years. Moreover, the slips themselves are often broken up.

Each piece of wooden-slip is stored in our object database as a primary object which contains its photograph image, the Chinese text sentence interpreted from the photograph, and other predefined attributes such as identification codes, excavation place and date (see Fig. 1). These predefined attributes are general ones for wooden-slip researchers. We assume that those primary objects are public data shared by all the researchers and that updates are seldom made to primary objects. Each researcher uses our system by adding specific attributes and values to primary objects from his own viewpoint.
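To make the structure of a primary object concrete, the sketch below renders it in Python. It is only an illustration of the attribute set described above (the actual system stores primary objects as GemStone objects), and the field names and all field values are our own placeholders.

from dataclasses import dataclass, field

@dataclass
class PrimaryObject:
    # Predefined, shared attributes of a wooden-slip primary object.
    identification_code: str      # identification codes
    excavation_place: str         # excavation place
    excavation_date: str          # excavation date
    image: bytes = b""            # photograph image of the slip
    sentence: str = ""            # Chinese text interpreted from the photograph
    extra: dict = field(default_factory=dict)  # researcher-specific attributes added later

# A hypothetical record; the values below are placeholders, not real data.
slip = PrimaryObject("Mokkan33", "(excavation place)", "(excavation date)",
                     sentence="(interpreted text)")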

In terms of modern interpretation, each wooden-slip corresponds to a page of a document, and a sequence of wooden-slips corresponds to a document. In analyzing these ancient documents, the most difficult problems to cope with are the following:

³GemStone is a trademark of Servio Corporation.


Figure 2: Workflow model and system requirements


• Unknown document types: We do not know completely either what document types existed in ancient days, such as letters, poems, bills and orders, or the formats of those document types. Therefore, it is difficult to determine the document types themselves, the supertype/subtype relationships of ancient documents, and their attribute structures. This leads to the problem that we cannot define in advance a type hierarchy (class hierarchy) under which ancient documents are classified.

• Multiple viewpoints: Since the order of wooden-slips may not always be preserved when they are discovered, there exist several levels of multiplicity and ambiguity. According to each researcher's viewpoint, a wooden-slip may be regarded as a page of different documents. Furthermore, a document might be regarded as a letter by one researcher, but as a bill by another. Therefore, the attribute structures of wooden-slip data largely depend on the viewpoints, the analyzing methods, and the objectives of researchers.

2.2 Workflow of Scientists

As illustrated in Fig. 2, the researchers' work processes are roughly classified into three processes: object/type identification, classification and verification. These work processes are repeated many times, and during these processes the database schema and objects evolve incrementally in a bottom-up manner.

2.2.1 Object/type identification

The object identification means to find a basic information unit, which may be embedded in a certain attribute value of a primary object, and to make it an independent entity. It is often desirable to add attributes and/or behaviors to the selected portion of a certain attribute value (for example, a substring of the interpreted Chinese sentence or a subregion of a wooden-slip's photograph image).

The type identification is concerned with inferring the type of an object. For example, a researcher might regard a wooden-slip object as a certain page of a document of type Letter, but another researcher might regard it as a certain page of a document of type Bill. According to each inferred type, each researcher will add his own attributes and values.

Fig. 3 illustrates an example of these processes. A researcher assumes the value 'xxx' to be a person's name and creates a new object O2 for describing the properties of 'xxx'. He also assumes the type of the primary object to be 'Bill' and adds the attribute 'kind' to the primary object. Since different researchers usually have their own viewpoints, it is necessary to allow a situation in which the same object has different views, that is, different attribute structures and values. Fig. 4 shows an example, where a primary object has two different views, Letter and Bill, and their corresponding images.

2.2.2 Classification

A researcher tries to classify primary objects (as well as newly-created post-identified objects) into groups of similar objects. He may use querying operations over newly created objects and/or newly added attributes in order to collect objects. Sometimes, he collects objects one by one according to his intuition. Again, it should be noted that the classification results may differ depending upon each researcher's viewpoint.


Figure 3: Example of object/type identification

[Figure 4 content: a primary object (with attributes year: 1920, sentence, image) is shown twice, as 'O as Letter' with the added attribute contents: 'apologize', and as 'O as Bill' with the added attributes item: 'rice', price: '$100', and kind: 'food official bill'.]

Figure 4: Example of multiple assignment from viewpoints

Groups classifying primary objects are called categories. A primary object may belong to more than one category; that is, multiple memberships of primary objects should be allowed. It is useful to classify some categories further into more detailed ones, called subcategories. The semantics of the category-subcategory relationship is close to the a-kind-of relationship. The category-subcategory relationships form a partial order. This structure is generated by researchers in a trial-and-error manner.

2.2.3 Verification

Researchers often heuristically collect primary objects under an ambiguous category based on their intuition and hypotheses. Scientific database systems should offer analyzing tools for users' intuitively-collected data sets. With these DBMS-assisted tools, a researcher might discover explicit concepts from his data; that is, he can abstract and verify his vague concepts and hypotheses, which, we believe, is the major objective for scientists in using scientific database management systems.

2.3 Requirements

Considering the above characteristics of wooden-slip data and scientists' work, we believe scientific database systems should have the following facilities in addition to the facility for storing and retrieving semi-structured (raw) data:

• Transforming values into objects at run-time: This facility should enable users to dynamically make any data fragment (a certain portion of some predefined attribute value) in a primary object an independent object.

• Run-time object/type evolution: This facility enables objects and types to evolve at run-time. That is, it allows attributes to be added to or deleted from types and objects, and allows an object to migrate from one class to another. Furthermore, it should support a class extension consisting of objects with heterogeneous attribute structures.

• Run-time schema evolution: The class hierarchy should be able to evolve at run-time.

• Multiple class memberships and multiple roles: This facility should allow an object to belong to more than one class, and to have multiple views from several viewpoints.

• Supporting integrity control for evolution: A mechanism is needed to enforce integrity constraints on object creation/evolution and schema evolution. Our model makes the integrity problem especially complicated, since we allow heterogeneous objects.


• Generating an intensional representation for data sets: One of the important facilities to help a scientist's verification process is to generate an intensional representation of his data set for verifying a classification or discovering concepts from the data set.

3 Instance-Based Data Model with Multiple Viewpoints

In our data model, we do not use a class hierarchy as a database schema. We will introduce a new data model satisfying our requirements⁴.

3.1 Instance-based Data Model

We describe basic constructs of our data model.

• Unified object model: The model does not differentiate between classes and instances; both are uniformly expressed as objects. Each object can have its own attributes independently of other objects, and we can freely and interactively add attributes to an object. An object which corresponds to ordinary data is called a primary data object, and one which works as a container of objects is called a category object. Each data object and each category object can have its own attributes and their values.

• Super/subobject relationships: The following three relationships are treated as superobject-subobject relationships in our model:

- the a-kind-of relationship between category objects and their subcategory objects.

- the instance-of relationship between data objects and category objects.

- set-membership relationship between category objects and data objects.

• Relationship attributes and their inheritance among objects: A superobject/subobject relationship itself can also have its own attributes, called relationship attributes. An object can inherit both the attributes of its superobject and the relationship attributes by the inheritance operation. This differs from class-based inheritance because inheritance occurs between instances, and not only attributes but also attribute values are inherited. This functionality is used to realize the multiple viewpoints discussed later.

⁴Our own data model was actually implemented as an upper-level layer over the class-hierarchy data model provided by an ordinary OODBMS.


Figure 5: Data model

The inheritance operation from a category object z to a data object o is expressed by o as z. This is not a persistent object but a temporary operation; however, we can consider the result a temporary object modified by inheritance. We term this a virtual object. In our data model, a virtual object can be treated in the same way as a persistent object, except that it is temporary.

Figure 5 illustrates objects in our data model. Intuitively, our data model is an instance-based data model, where objects are organized according to super/subobject relationships and each relationship can have its own attribute values. Furthermore, each object can have its own attribute structure. Virtual objects based on inheritance of attributes and values are also supported. Data models such as ours have already been discussed and implemented by several authors[8].
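A minimal sketch of these constructs in Python follows. The actual system realizes them over GemStone, so the class name Obj, the method names, the example attribute values, and the precedence rule used when merging attributes are our own assumptions, not the system's interface.

class Obj:
    # Uniform object: both data objects and category objects are instances of Obj.
    def __init__(self, name, **attrs):
        self.name = name
        self.attrs = dict(attrs)   # the object's own attributes (addable at any time)
        self.supers = {}           # superobject -> relationship attributes of the link

    def add_to(self, superobject, **rel_attrs):
        # Make self a subobject of `superobject`; the link itself may carry attributes.
        self.supers[superobject] = dict(rel_attrs)

    def as_(self, category):
        # The `o as z` operation: build a temporary virtual object that inherits the
        # category's attributes and the relationship attributes of the link.
        # (Letting the object's own attributes take precedence is our assumption.)
        virtual = dict(category.attrs)
        virtual.update(self.supers.get(category, {}))
        virtual.update(self.attrs)
        return virtual

# Usage: a category object, a data object, and a relationship attribute (illustrative).
mokkan = Obj("Mokkan", material="wood")            # category object
slip = Obj("slip1", sentence="...", image=b"")     # data object with its own attributes
slip.add_to(mokkan, registered_by="researcher A")  # relationship attribute on the link
print(slip.as_(mokkan))  # {'material': 'wood', 'registered_by': 'researcher A', 'sentence': '...', 'image': b''}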

3.2 Multiple Viewpoints

Viewpoints play an important role, as discussed in Section 2. However, the notion of a viewpoint is in general too broad, so we simplify it and assume that a viewpoint is represented by a category object in our data model. In the present paper, the notion of multiple viewpoints is classified into multiple membership and multiple role in our data model.

• Multiple memberships: An object can simultaneously belong to more than one category object in our data model. We can also classify an object into a new category object interactively, even if the object already belongs to other category objects. This gives multiple membership of objects.

• Multiple roles: An object can be enriched, as a virtual object, with inheritance from both the attributes of its category object and the corresponding relationship attributes. This means that the object is given a role based on the category object. Furthermore, owing to the multiple membership facility, we can produce various virtual objects, each differing from the others, from a single object.


Figure 6: Implementation of multiple viewpoint example


To give a certain viewpoint to a data object, a user defines a category object for the viewpoint and stores the data object in it as its subobject. Researchers can freely give their viewpoints to some data objects and access them independently of any other viewpoints through the multiple membership capability.

Figure 4 in the previous section can be implemented by multiple roles. We define the viewpoints Letter and Bill as categories and assign the additional attributes to each category, as shown in Fig. 6. The virtual objects O as Letter and O as Bill are the desired user objects.
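Continuing this example, the self-contained sketch below shows with plain Python dictionaries how the two virtual objects of Fig. 6 can be assembled from the category attributes, the relationship attributes of each membership, and the object's own attributes; the merge order is our assumption, and the attribute values are taken from the figure for illustration only.

# The virtual object merges the category's attributes, the relationship attributes
# of the membership link, and the object's own attributes (values are illustrative).
category_letter = {"contents": None}
category_bill = {"item": None, "price": None, "kind": None}
own = {"year": 1920, "sentence": "...", "image": "..."}
rel_letter = {"contents": "apologize"}
rel_bill = {"item": "rice", "price": "$100", "kind": "food official bill"}

o_as_letter = {**category_letter, **rel_letter, **own}   # O as Letter
o_as_bill = {**category_bill, **rel_bill, **own}         # O as Bill
print(o_as_letter)
print(o_as_bill)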

3.3 Wooden-slip Database System

Our system is implemented on the commercial OODBMS GemStone, and so we have implemented our own data model over the class-hierarchy data model provided by GemStone. We arrange three classes under the class Object for the implementation of our data model.

• Incremental object class: The incremental object class is the principal one, which supports a set of attributes and a set of oids for super/subobject relationships. The inheritance mechanism is also supported by methods in this class. All the objects in our data model are implemented as instances of this class.

• Relationship attribute class: The relationship attribute class is implemented as a subclass of the incremental object class. This class supports the attributes added to relationships between objects. Although it is a subclass of our principal class, its instances cannot be accessed directly by users; they are accessed only through the methods of the inheritance mechanism.

• Wooden-slip object class: As a subclass of the incremental object class, this class constructs primary objects, which contain the elemental attributes of a wooden-slip, such as identification numbers, excavated place and date, its image, and its interpreted text (a rough analogue of this layering is sketched below).

Figure 7: Windows of wooden-slip database system

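A rough Python analogue of this three-class layering is sketched below; the real classes are defined in GemStone, so the field and parameter names here are our own.

class IncrementalObject:
    # Principal class: a per-object attribute set plus the oids of super/subobject links.
    # Inheritance between objects is realized by methods of this class.
    def __init__(self):
        self.attributes = {}
        self.super_oids = []
        self.sub_oids = []

class RelationshipAttribute(IncrementalObject):
    # Attributes attached to a super/subobject link; in the prototype such instances
    # are reached only through the inheritance methods, never directly by users.
    pass

class WoodenSlipObject(IncrementalObject):
    # Primary object: identification numbers, excavation place and date,
    # photograph image, and interpreted text.
    def __init__(self, ident, place, date, image=b"", text=""):
        super().__init__()
        self.attributes.update(ident=ident, place=place, date=date,
                               image=image, text=text)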

Figure 7 shows the windows of our OODBMS-based system. There are two windows, called Mokkan and Mokkan (Wooden-slip) Category Graph. The larger window is a browser for wooden-slip objects whose extension is the category Mokkan; the attributes of the wooden-slip objects in Mokkan can be browsed in this window. The lower front window, Mokkan Category Graph, shows our database schema. This is a category/subcategory hierarchy constructed by a user in an incremental manner. Each wooden-slip object belonging to a certain category is browsed through the Mokkan Category Graph.

4 Incremental Object Identification and Anchor Objects

A researcher reveals an entity, which is often embedded as a portion of the attribute value of a certain primary object, and then makes the entity an independent object. This is object identification, and making an entity into an object is what we call objectifying the entity.

This process follows the researcher's activities, so it proceeds incrementally and interactively. The created objects are post-identified, but they are as important to the researcher as the predefined primary objects, so they should be offered the same functionalities as primary objects.

In our system, users can create objects interactively and can handle them in the same manner as primary objects.


However, it is very important to choose how to objectify the portion embedded in a primary object.

To obtain the portion as an object, we introduce here the anchor object, which does not hold a duplicated value of the selected portion but only a reference to the original.

4.1 Anchor Object

An anchor object is an object which has, in addition to ordinary object features, the location information of a selected portion embedded in the attribute value of a primary object, stored as a reference. The location information contains the oid of the referred object, its attribute name, and the region identifier of the attribute value for the portion. The region identifier is, for instance, the position of a substring in the interpreted Chinese sentence or the relative coordinates of a rectangle in the wooden-slip's image. The location information can be assigned extensionally or intensionally. The former is to specify the target values interactively with a pointing device; the latter is to define anchor objects by a query. For instance, a user issues a query to obtain the positions of a desired keyword included in the text data of primary objects.
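The sketch below illustrates the location information and both ways of creating anchors. It is hypothetical Python (the actual anchors are instances of the incremental object class of Section 3.3), and the names Anchor, resolve and find_keyword_anchors are our own.

from dataclasses import dataclass

@dataclass
class Anchor:
    # An anchor holds its own attributes plus location information; no value is copied.
    attrs: dict         # the anchor's own attributes, e.g. {"type": "person's name"}
    ref_oid: str        # oid of the referred primary object
    ref_attribute: str  # attribute name, e.g. "sentence" or "image"
    region: tuple       # region identifier: (start, end) in a text, or a rectangle in an image

    def resolve(self, db):
        # Dereference the anchor (text case shown): return the referred portion.
        start, end = self.region
        return db[self.ref_oid][self.ref_attribute][start:end]

# Extensional definition: construct an Anchor directly from a region picked with a pointing device.
# Intensional definition: derive anchors from a query, as below.
def find_keyword_anchors(db, keyword):
    # Create an anchor for every occurrence of `keyword` in the interpreted text.
    anchors = []
    for oid, obj in db.items():
        pos = obj["sentence"].find(keyword)
        while pos != -1:
            anchors.append(Anchor({"word": keyword}, oid, "sentence",
                                  (pos, pos + len(keyword))))
            pos = obj["sentence"].find(keyword, pos + 1)
    return anchors

# Usage with a toy in-memory database of primary objects (contents are made up).
db = {"oid1": {"sentence": "xxx bought rice"}, "oid2": {"sentence": "letter to xxx"}}
for a in find_keyword_anchors(db, "xxx"):
    print(a.ref_oid, a.region, a.resolve(db))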

By defining anchor objects, we have the following advantages:

• A user can handle the target values without concern for their data types, such as bit sequences of image data or character sequences of text data.

• Anchor objects reduce the volume of data compared with the case in which the system copies the data of the target values and stores them independently. In other words, a user can handle the target values as surrogates.

• By extracting the values of portions, users can build new databases, such as a sub-image database or a text-fragment database, since anchor objects can be handled in the same way as primary objects. Users can browse the sub-image data according to their own interests and classify those data.

Furthermore, anchor objects enjoy the various advantages that ordinary objects possess. In particular, we can employ our ECA mechanism to control anchor object manipulations, such as the generation and deletion of anchor objects. We discuss this issue in Section 5.

4.2 Implementation of Anchor Objects

Anchor objects are implemented as objects of the class incremental object defined in Section 3.3, like primary objects. The location information is added to the structure of these objects. The method to obtain the referred object from the selected portion is also supported.

Figure 8: Browser for anchor objects

In Fig. 3, which appeared in Section 2, O2 is implemented as an anchor object. It has two attribute-value pairs of its own, type and age, and also has the reference to 'xxx' in the attribute value of O1 (the oid of O1, the attribute name sentence, and the position of 'xxx'). Both primary objects and anchor objects are stored in the GemStone database independently.

Anchor objects work as data objects. Users can classify anchor objects by making them subobjects of category objects. Anchor objects are displayed in the form of tags, which we call post-its, in Fig. 7. In the image data of the larger window, three anchors have been created; the anchor objects are colored for the three corresponding regions. They are also listed in the upper small pane. Each anchor object has its own attributes, such as Color of the tag, Word, which denotes the corresponding subtext of the wooden-slip's interpreted text, and Attribute, which gives a meaning to the specified subtext, and so on.

Fig. 8 shows a browser for anchor objects, called the Post-it Browser. It browses the anchor objects and their attributes (such as color) for the primary object Mokkan33, and it can browse all the anchor objects defined by a user.

Since each anchor object is an ordinary object, it can become an attribute value of an object. Therefore, anchor objects can appear as values of relationship attributes, and so it is possible to define different anchors depending on users' viewpoints.

5 An ECA Mechanism for Incremental Data Organization

ECA rules are often used in the field of active database systems[7]. Each ECA rule consists of descriptions of an Event, a Condition and an Action, which means that when a specified event occurs, the system examines whether the specified condition is satisfied by the database, and if the condition holds, then the specified action will be automatically executed. Conventionally, this ECA mechanism has been used mainly for maintaining database integrity constraints as a trigger, or for notifying database users of changes as an alerter. Most events have been concerned with updates of database instances or with user-invoked interactions.



As described in Sections 3 and 4, we mainly handle the following objects: primary objects, category objects, and anchor objects. Our ECA rules interrelate these three kinds of objects; that is, through these ECA rules, creating and/or updating objects invokes another object creation, an object update, or a user notification. Since a hierarchy of category objects (the category-subcategory hierarchy) forms the database schema in our system, some ECA rules are used to incrementally create or maintain the database schema. For example, the following are usage examples of ECA rules:

• Monitoring anchor object manipulation: This kind of ECA rule monitors event occurrences concerned with the creation, deletion, and update of anchor objects. If such an event occurs, the ECA rules try to create, delete, or update other anchors after checking their specified conditions. For example, when a new anchor is created for a certain subtext of a wooden-slip object, an ECA rule examines whether the referred subtext is a new one; if so, it asks whether or not the system should create more anchors for the other wooden-slip objects that have the same subtext. Fig. 9 shows such an ECA rule (a minimal sketch of this kind of rule appears after this list).

• Monitoring category object manipulation: This kind of ECA rule monitors event occurrences concerned with the creation, deletion and update of category objects. For example, when a category object is deleted from the category hierarchy, an ECA rule checks whether the deleted category has primary objects as its subobjects, and if so, it asks users whether or not those primary objects need to be migrated.

• Monitoring primary object migration: This kind of ECA rule monitors the migration of primary objects to category objects. For example, users can specify, as a condition of an ECA rule, the maximum number of primary objects that a category object can own. When a migration occurs, such an ECA rule checks the condition, and if it is violated, it notifies the user and prohibits the migration operation.
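The general shape of such a rule can be sketched as follows. This is hypothetical Python; the prototype expresses its rules inside GemStone, and the event name, condition, and action here are our simplified reading of the anchor-creation rule described above.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ECARule:
    event: str                        # e.g. "anchor_created"
    condition: Callable[[dict], bool]
    action: Callable[[dict], None]

rules = []

def signal(event, context):
    # Fire every registered rule whose event matches and whose condition holds.
    for rule in rules:
        if rule.event == event and rule.condition(context):
            rule.action(context)

# Example rule (cf. Fig. 9): when a new anchor is created for a subtext that has not
# been anchored before, ask whether anchors should also be created for the other
# wooden-slip objects containing the same subtext.
known_subtexts = set()

def is_new_subtext(ctx):
    return ctx["subtext"] not in known_subtexts

def propose_more_anchors(ctx):
    known_subtexts.add(ctx["subtext"])
    print("Create anchors for other slips containing %r? [y/n]" % ctx["subtext"])

rules.append(ECARule("anchor_created", is_new_subtext, propose_more_anchors))
signal("anchor_created", {"subtext": "xxx"})   # triggers the question once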

Also, these ECA rules are used to maintain the consistency of the database schema when the schema is updated by users. When users send an update message, which is declared as an event, the conditions that might result in an inconsistent state of the schema as a consequence of this event are checked. If the check is passed, then the actions are automatically executed. For example, cycle detection and removal in an object's is-a hierarchy, detection and repair of disconnected components in an object's is-a hierarchy, and removal of transitive edges are represented as ECA rules.

Figure 9: Browser for ECA rule


6 Intensional Representation of a Data Set

6.1 Overview

In his research work, a scientist first examines and classifies primary objects and anchor objects from various viewpoints, guided by his vague concepts or hypotheses. He creates a category object and collects some primary objects into it. This category object simply gives a naive classification. His further study can modify this naive classification into a convincing one with detailed attributes, and can reveal a new concept.

Therefore, it is desirable to prepare a data set analyzing tool that derives an abstracted feature from the set of primary objects belonging to a category object. Such a tool will help users to verify the validity of the category object and its data. This process may abstract their concepts, and may lead to the discovery of a new concept.

Here we discuss an intensional representation of a given text data set. We characterize the set of primary objects in a category object by representative words. This corresponds to deriving a disjunctive query representation of the given data. We determine those words algorithmically.



6.2 Our Approach

We first prepare some notation. Let G be the whole set of primary objects, S a given data set of primary objects we are concerned with, and W the set of words contained in the text data of S. For a word w in W, Q(w) denotes the set of primary objects retrieved from G by a query finding the word w.

We seek a set of words which characterizes S through a disjunctive query operation. The result of the query should match the target S well, and the number of words should be suitably small, so that the word set is representative of S.

Now we represent S in terms of a disjunctive query with n words, w_1, w_2, ..., w_n:

    S = Q_n ∪ O_+ − O_−,    (1)

where

    O_+ ≡ S − S ∩ Q_n,
    O_− ≡ S̄ ∩ Q_n.

∪ and − denote set union and set difference, respectively; S̄ is the complement of S, and Q_n denotes the result of the disjunctive query.

Now we introduce a performance measure f_S(Q_n) (Eq. 2) to examine how well Q_n matches S. Eq. 2 combines the precision factor |S ∩ Q_n| / |Q_n| and the recall factor |S ∩ Q_n| / |S|, which are well-known parameters in information retrieval, weighted by a parameter α (0 ≤ α ≤ 1); here | · | denotes the cardinality of a set, and the value of α is determined according to users' purposes.
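Since the printed formula is not fully recoverable from our copy, the sketch below assumes one natural instantiation of such a measure, a weighted sum of precision and recall; the function name and the exact combination are our assumptions.

def f_measure(S, Q, alpha=0.5):
    # Assumed form of the performance measure: a weighted sum of the precision
    # factor |S ∩ Q|/|Q| and the recall factor |S ∩ Q|/|S|. The paper's exact
    # combination may differ; treat this as an illustration only.
    hit = len(S & Q)
    if not Q or not S:
        return 0.0
    return alpha * hit / len(Q) + (1 - alpha) * hit / len(S)

print(f_measure({1, 2, 3, 4}, {1, 2, 3, 5}))   # toy sets, not the paper's data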

Obviously, the problem of finding a set of words maximizing the performance measure is NP-complete. So we propose a heuristic algorithm, which determines the representative words iteratively.

Assume we have a subset of i − 1 words, w_1, w_2, ..., w_{i−1}. We determine the i-th word w_i in the following way:

Examine the relation equivalent to Eq. 1 for a query with a single word,

    s_i = q(w_i) ∪ o_+ − o_−,    (3)

where g_i denotes G − Q_{i−1} and s_i is S − S ∩ Q_{i−1}; q(w_i) is the result of the query for w_i on g_i. Equation 3 is similar to Eq. 1, but the sets s_i and g_i are reduced by the preceding Q_{i−1}, so we can process Eq. 3 independently of the preceding steps.

Figure 10: Representative words for wooden-slip text


Evaluate the performance measure f_S(Q_i) for each candidate w_i. This can easily be done using the following equations:

    |S ∩ Q_i| = |S ∩ Q_{i−1}| + |s_i ∩ q(w_i)|,
    |Q_i| = |Q_{i−1}| + |q(w_i)|.

Choose the candidate maximizing the performance measure as w_i, and iterate this procedure until the performance measure decreases.

This approach gives a set of n words that locally maximizes the performance measure at each step. Our approach does not necessarily give the optimal solution, but yields an approximate one.
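A minimal sketch of the iterative selection follows, assuming the text data are pre-indexed as a mapping from candidate words to Q(w) and reusing the assumed weighted measure from above. It evaluates the measure on the accumulated query result rather than through the reduced sets g_i and s_i of Eq. 3, which yields the same values but is simpler to write down.

def representative_words(S, index, alpha=0.5):
    # Greedily pick words w1, w2, ... whose disjunctive query best matches S.
    # `index` maps each candidate word w to Q(w), the set of objects containing w.
    def measure(q):
        # Assumed weighted precision/recall combination (see the sketch above).
        hit = len(S & q)
        return 0.0 if not q else alpha * hit / len(q) + (1 - alpha) * hit / len(S)

    words, Q, best = [], set(), 0.0
    while True:
        cand = max(index, key=lambda w: measure(Q | index[w]))
        score = measure(Q | index[cand])
        if score <= best:          # stop when the measure no longer increases
            return words, Q, best
        words.append(cand)
        Q |= index[cand]
        best = score

# Toy data (object ids and words are made up): picks "w1" and then "w2".
print(representative_words({1, 2, 3, 4}, {"w1": {1, 2, 3}, "w2": {3, 4}, "w3": {9}}))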

6.3 Example

We have applied our algorithm to the interpreted Chinese texts of the wooden-slip data. The database consists of approximately 400 interpreted Chinese texts. All texts are fragmented into character strings of length 1 up to 6, since each Chinese character, as well as each sequence of characters, has its own meaning. We built indices on these fragments and set α = 0.5.

First we give 46 interpreted texts as S. In Fig. 10, we see that 5 words are contained in the texts, with f_S(Q) = 0.944915, which gives the maximum in this example. In this case, the precision factor and the recall factor are 46/59 and 1.0, respectively; |O_+| is 0 and |O_−| is 13. These 5 words are considered to be representative words of the given interpreted texts. Users can proceed to the verification process: build a new hypothesis to re-interpret the sentences, and conjecture the type of each wooden-slip primary object.

7 Related Works

Zdonik[3] noticed the importance of incremental data organization in scientific database systems. In particular, he noticed the existence and the representation of post-identified objects, and showed the notion with the example of a subregion of weather-map data representing the location of a typhoon.


                     anchoring          classifier   schema evolution   verifier
Zdonik[3]            yes                no           no                 no
Shoens et al.[5]     no                 yes          no                 no
Abiteboul et al.[6]  yes (by grammar)   no           no                 no
ours                 yes                no           yes                yes

Table 1: Comparison with related works

This notion is very close to the notion of anchors shown in the present paper. We have implemented the notion of anchors in our system, where anchors are treated as independent objects with incremental data organization capabilities.

Semi-structured data are often stored not in databases but in conventional files. Conventional files are more flexible than formatted databases, but they cannot enjoy the data sharing and querying facilities offered by conventional database systems. Recently, Shoens et al.[5] tried to provide database views (schemata) for external files. Under the assumption that there exists a predefined type structure (class hierarchy) for external files, they implemented an automatic classifier for external files. For each external file, their system automatically creates a surrogate object, which is a descriptor of the external file, and those surrogate objects are stored and managed by an OODBMS. Furthermore, the automatic classifier extracts information from the external files, which is stored as object attributes. Abiteboul et al.[6] tried to provide a database view (schema) called a structuring schema for external files. They used predefined grammars, which describe the structures of external files, and provided a transformation mechanism from such a grammar definition into an OODB schema (class hierarchy). Our approach is different from these two works because we cannot assume the existence of predefined structures for ancient Chinese documents. Therefore, our approach is (1) to provide a very flexible object data model with multiple class memberships, multiple roles, and class extensions having heterogeneous objects, (2) to allow incremental attribute addition/deletion per object, (3) to use an ECA mechanism for supporting the consistency of incremental data organization, and (4) to provide a database-supported tool to verify the classification itself.

These recent works about semi-structured databases can be compared with our work in Table 1.

8 Concluding Remarks

In this paper, we described an object-oriented scientific database system for ancient Chinese documents. We have focused our attention on the aspect of incremental data organization of semi-structured data, which is regarded as one of the most important topics of scientific data management. We have suggested the following for this purpose:

(1) a flexible data model which allows (1-1) class extensions that can have heterogeneous objects as their members, (1-2) multiple class memberships of instances, (1-3) multiple roles of instances, and (1-4) incremental object/schema evolution, (2) run-time object/type identification and evolution, (3) the use of an active rule mechanism for incremental data organization, and (4) a tool for the intensional representation of data sets. We also described an implementation of the above functionalities for ancient wooden-slip data over a commercial OODBMS.

Acknowledgements: This work was partly supported by the Special Coordination Fund for Promoting Science and Technology, for the Self-Organizing Information Base Systems (SOIBS) Project.

References

[1] Shoshani, A., Olken, F., and Wong, H., Characteristics of Scientific Databases, Proc. of the 10th VLDB Conference, Boston, MA, June 1984.
[2] French, J., Jones, A., and Pfaltz, J., Summary of the Final Report of the NSF Workshop on Scientific Database Management, SIGMOD Record, Vol. 19, No. 4, pp. 32-40, 1990.
[3] Zdonik, S., Incremental Database Systems: Databases from the Ground Up, Proc. of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 408-412, May 1993.
[4] IEEE Computer Society, Special Issue on Scientific Databases, Bulletin of the Technical Committee on Data Engineering, Vol. 16, No. 1, March 1993.
[5] Shoens, K. et al., The Rufus System: Information Organization for Semi-structured Data, Proc. of the 19th VLDB Conference, Dublin, Ireland, pp. 97-107, 1993.
[6] Abiteboul, S., Cluet, S., and Milo, T., Querying and Updating the File, Proc. of the 19th Intl. Conf. on Very Large Data Bases, pp. 73-84, August 1993.
[7] McCarthy, D.R. and Dayal, U., The Architecture of an Active Data Base Management System, Proc. of ACM SIGMOD Symposium on the Management of Data, pp. 215-224, Oregon, 1989.
[8] Tanaka, K., Nishio, S., Yoshikawa, M., Shimojo, S., Morishita, J., and Jozen, T., Obase Object Database Model: Towards a More Flexible Object-Oriented Database System, Proc. of the International Symposium on Next Generation Database Systems and Their Applications (NDA'93), pp. 159-166, September 1993.
