annotating and querying databases through colors and blocks

10
MONDRIAN: Annotating and querying databases through colors and blocks Floris Geerts Hasselt University & University of Edinburgh [email protected] Anastasios Kementsietsidis School of Informatics University of Edinburgh [email protected] Diego Milano Dipartimento di Informatica e Sistemistica Universit` a di Roma “La Sapienza” [email protected] Abstract Annotations play a central role in the curation of scien- tific databases. Despite their importance, data formats and schemas are not designed to manage the increasing vari- ety of annotations. Moreover, DBMS’s often lack support for storing and querying annotations. Furthermore, anno- tations and data are only loosely coupled. This paper intro- duces an annotation-oriented data model for the manipula- tion and querying of both data and annotations. In particu- lar, the model allows for the specification of annotations on sets of values and for effectively querying the information on their association. We use the concept of block to rep- resent an annotated set of values. Different colors applied to the blocks represent different annotations. We introduce a color query language for our model and prove it to be both complete (it can express all possible queries over the class of annotated databases), and minimal (all the algebra operatorsare primitive). We present MONDRIAN, a proto- type implementation of our annotation mechanism, and we conduct experiments that investigate the set of parameters which influence the evaluation cost for color queries. 1. Introduction From biology to astronomy, scientific databases play a central role in the advancement of science by providing ac- cess to large collections of data. At the same time, these databases are of particular interest to computer scientists due to the data management challenges that they pose [9]. Apart from the often staggering amounts of data, two ad- ditional characteristics make the management of scientific databases challenging. First, scientific data come in a vari- ety of formats which range from flat-formatted files to im- ages and electronic publications. Thus, the challenge here is to integrate [15], annotate [7] and cross-reference [20] such diverse collections of data. Second, scientists often analyze data that are collected from a variety of sources and, in turn, this analysis results in new data which are used by other sci- entists, resulting a continuous feedback of data. In such a setting, it is important to maintain data provenance [10, 22], i.e., where the data that a scientist is using come from. In this paper, we offer a new model to annotate databases and a new language to query annotated databases. Our work is motivated by the pressing needs of biological databases and our examples are drawn from this domain. Figures 1(a), (b) and (c) show three sample relations taken from GDB [2] (a human gene database), Swissprot [3] (a protein database), and PIR [1] (a protein sequence database). Each relation stores, correspondingly, pairs of identifiers and names of genes, proteins and protein sequences. Two key points distinguish our work from other mecha- nisms proposed for annotation management: First, we argue that for an annotation mechanism to be useful in practice, it should support the annotation of sets of values. In the literature, existing mechanisms assume that each annotation is attached to a particular value of a specific attribute (e.g., see [6]). So, one can annotate, for example, the gene name NF1, in Figure 1(a), with the name of the researcher that discovered this gene. A possible implemen- tation of such a mechanism requires adding a column, say annot gname, in the GDB relation and use it to store the annotations of the gname column values. However, in a number of domains, including scientific databases, it becomes increasingly important to annotate sets of values. As an example, consider the relation shown in Figure 1(d). The relation associates GDB gene identi- fiers to SwissProt protein and PIR sequence identifiers. The semantics of this association is that the specified gene is re- lated to the indicated protein (and protein sequence). Such relations are widely used in the biological domain and of- fer a quick way to cross-reference and establish associa- tions between independent biological sources [20, 22]. In this setting, it is useful to record, for each stored associa- tion, what evidence exists for its validity [13]. In the figure, we show possible annotations of the relation in the form of blocks and block labels. In more detail, a block is used to in- dicate the set of values for which an annotation exists, while

Upload: others

Post on 12-Feb-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Annotating and querying databases through colors and blocks

MONDRIAN: Annotating and querying databases through colors and blocks

Floris GeertsHasselt University &

University of [email protected]

Anastasios KementsietsidisSchool of Informatics

University of [email protected]

Diego MilanoDipartimento di Informatica

e SistemisticaUniversita di Roma “La Sapienza”

[email protected]

Abstract

Annotations play a central role in the curation of scien-tific databases. Despite their importance, data formats andschemas are not designed to manage the increasing vari-ety of annotations. Moreover, DBMS’s often lack supportfor storing and querying annotations. Furthermore, anno-tations and data are only loosely coupled. This paper intro-duces an annotation-oriented data model for the manipula-tion and querying of both data and annotations. In particu-lar, the model allows for the specification of annotations onsets of values and for effectively querying the informationon their association. We use the concept of block to rep-resent an annotated set of values. Different colors appliedto the blocks represent different annotations. We introducea color query language for our model and prove it to beboth complete (it can express all possible queries over theclass of annotated databases), and minimal (all the algebraoperators are primitive). We present MONDRIAN, a proto-type implementation of our annotation mechanism, and weconduct experiments that investigate the set of parameterswhich influence the evaluation cost for color queries.

1. Introduction

From biology to astronomy, scientific databases play acentral role in the advancement of science by providing ac-cess to large collections of data. At the same time, thesedatabases are of particular interest to computer scientistsdue to the data management challenges that they pose [9].Apart from the often staggering amounts of data, two ad-ditional characteristics make the management of scientificdatabases challenging. First, scientific data come in a vari-ety of formats which range from flat-formatted files to im-ages and electronic publications. Thus, the challenge here isto integrate [15], annotate [7] and cross-reference [20] suchdiverse collections of data. Second, scientists often analyzedata that are collected from a variety of sources and, in turn,this analysis results in new data which are used by other sci-

entists, resulting a continuous feedback of data. In such asetting, it is important to maintain data provenance [10, 22],i.e., where the data that a scientist is using come from.

In this paper, we offer a new model to annotate databasesand a new language to query annotated databases. Our workis motivated by the pressing needs of biological databasesand our examples are drawn from this domain. Figures 1(a),(b) and (c) show three sample relations taken from GDB [2](a human gene database), Swissprot [3] (a protein database),and PIR [1] (a protein sequence database). Each relationstores, correspondingly, pairs of identifiers and names ofgenes, proteins and protein sequences.

Two key points distinguish our work from other mecha-nisms proposed for annotation management:• First, we argue that for an annotation mechanism to beuseful in practice, it should support the annotation of sets ofvalues. In the literature, existing mechanisms assume thateach annotation is attached to a particular value of a specificattribute (e.g., see [6]). So, one can annotate, for example,the gene name NF1, in Figure 1(a), with the name of theresearcher that discovered this gene. A possible implemen-tation of such a mechanism requires adding a column, sayannot gname, in the GDB relation and use it to store theannotations of the gname column values.

However, in a number of domains, including scientificdatabases, it becomes increasingly important to annotatesets of values. As an example, consider the relation shownin Figure 1(d). The relation associates GDB gene identi-fiers to SwissProt protein and PIR sequence identifiers. Thesemantics of this association is that the specified gene is re-lated to the indicated protein (and protein sequence). Suchrelations are widely used in the biological domain and of-fer a quick way to cross-reference and establish associa-tions between independent biological sources [20, 22]. Inthis setting, it is useful to record, for each stored associa-tion, what evidence exists for its validity [13]. In the figure,we show possible annotations of the relation in the form ofblocks and block labels. In more detail, a block is used to in-dicate the set of values for which an annotation exists, while

Page 2: Annotating and querying databases through colors and blocks

gid gname120231 NF1120232 NF2120233 NGFB120234 NGFR120235 NHS

sid snameP01138 Beta-NGFP08138 TNR16P14543 NidogenP21359 NeurofibrominP35240 Merlin

pid pnameA01399 Nerve growth factorA25218 Tumor necrosis factorA45770 MerlinI78852 NeurofibromatosisQ6T45 Nancy-Horan syndrome

120232 P35240A45770

JohnJohn, Mary

120231 P21359I78852

John Mary

pid gid sid

120234 P08138A25218Peter

120233 P01138A01399Mary

(a) GDB relation (b) Swissprot relation (c) PIR relation (d) An integrated relation

Figure 1. Three biological sources and their integrated relation

block labels are used to indicate the annotations for thisblock. In our example, the annotations indicate the names ofthe curators who verified that a particular association holds.So, in the first tuple, a block indicates Mary’s belief that thegene with gid 120231 produces protein with sid P21359.Although conceptually easy to illustrate, complex annota-tions like those shown here pose interesting challenges interms of how they can be implemented. In this paper, wepropose an implementation that has two desirable proper-ties: first, it does not require any restructuring of the ex-isting schema of the database to be annotated (only extratables need to be added); and second, it is such that the an-notation of the database imposes minimum overhead both interms of space, and in terms of query execution time whencompared to its unannotated version.

• Second, we believe that annotations should be treated asfirst-class citizens of the database, that is, we should beable to query values and annotations alike (in isolation orin unison). Currently, query languages cannot see the an-notations and they can only transmit them [11]. However,for curators, annotations are of equal or even greater im-portance than values. A curator using the relation in Figure1(d) might want to find which tuples are annotated by eitherJohn or Mary. This is a value-based query, but it refers toan annotation value (rather than a data value) which, as thefigure shows, spans attribute boundaries, and it can appearover distinct attribute sets, in different tuples. As anotherexample, the curator might be interested in finding whichgene-protein sequence (gid, sid) pairs are annotated, and bywhom. This is not a value-based annotation query. Rather,it refers to the attributes on which annotations are appliedon each tuple. Finally, we note that sometimes the lackof annotations might also be of interest to a curator. In aheavily curated database a curator might want to find whichgene-protein (gid, pid) pairs are not annotated. All theseoperations assume that we have a query language capableof expressing queries over annotated databases.

Desirable properties of such a query language are that itis at a level of abstraction that is independent of the cho-sen representation of annotations and that it is user friendly.Any relational representation of annotated databases thatwe can think of offers the relational algebra or, on a prac-

tical level, SQL as candidate query languages. However,each query posed in these languages is not annotation-representation independent. A change in the representationof annotations requires us to reformulate all our queries.Furthermore, while writing such queries, we should makesure that the result of each query must be interpretable as anannotated database again. It is clear that one needs to posesevere syntactic and semantic conditions on such queries inorder to achieve this goal. It is not clear what these condi-tions should be.

We opt for a different approach and introduce a newquery language hereafter referred to as the color algebra(since we use colors to represent annotations). By defini-tion, our algebra is annotation-representation independent.Furthermore, any query in this language produces a color(annotated) relation on every input color relation. More-over, the semantics of colors and blocks is transparent ineach algebra operator. Our algebra facilitates the queryingof color databases and can easily express queries such as theones described in the previous paragraphs.

The contributions of this paper are as follows:• We introduce the first annotation mechanism for relationaldatabases that is capable of annotating both single valuesand the associations between multiple values.• We introduce an algebra to query values and annotationsalike. Our algebra includes well-known operators, like se-lection and projection, properly re-defined to account forcolors and blocks, along with new operators that are partic-ular to the querying of annotations. We formalize the notionof annotation in relational databases and we prove that ouralgebra is both complete (it expresses all possible queriesover annotated databases) and minimal (every operator isprimitive, and thus necessary).• We present MONDRIAN1 which is an implementationof our annotation mechanism over a relational DBMS. Weinvestigate the space overhead of our representation, andwe study the cost of evaluating queries over annotateddatabases and the parameters that influence this cost.

The remainder of this paper is organized as follows.First, we review related work. Section 2 introduces the ba-

1Piet Mondrian: Dutch painter whose paintings mainly consist of colorblocks.

2

Page 3: Annotating and querying databases through colors and blocks

sic notations and presents the color algebra while Section 3describes the relational representation and presents the com-pleteness result. Section 4 introduces MONDRIAN and of-fers a description of our implementation and experiments.Section 5 concludes with a summary of the results and adiscussion on future work.

1.1. Related work

Most existing annotation systems focus on text andHTML documents (e.g., Annotea [18]) and are often spe-cialized to support annotations for a particular kind of data,e.g., genomic sequences [19, 7]. Research on these systemshas been focused on scalability, distributive support of an-notations, and other features.

Bhagwat et al. [6] propose an annotation mechanismfor relational databases where annotations are stored in ex-tra annotation attributes. The authors extend the Select-Project-Join-Union fragment of SQL with a PROPAGATEclause which allows the user to specify how annotationsshould propagate. The focus is thus on the propagation ofthe annotations through queries, and the issue of how toquery the annotations themselves is not addressed. Also,only single values are annotated. The DBNotes system [12]extends this framework and offers limited support of query-ing annotations over single values.

An extensive literature exists regarding the computationof provenance. Annotations provide a solid way of keep-ing track of provenance. Indeed, computing provenance byforwarding annotations along data transformations has beenproposed in various forms [5, 18, 21, 6]. The data prove-nance problem without the use of annotation is studied byCui et al. [14], Buneman et al. [10, 11], and Widom [23].In this work, a “reverse” query is generated to compute dataprovenance. In this paper, we will not address the issueof provenance. Instead, we provide a foundation on whichboth provenance information and other forms of annotationscan be managed.

An unrelated work (although the title suggests other-wise) regards “Colorful XML” [17]. A new XML datamodel, called multi-colored trees, is proposed and colors areused to add semantic structure over the XML data nodes.

2. Colors and Blocks

2.1. Basic notation

As already mentioned, our aim is to provide a mecha-nism for annotating sets of attribute values. We refer to sucha set of attribute values as a block. As an example, in Figure1(d), there are six different blocks, and each block has an as-sociated annotation. In the remainder of the paper, for easeof presentation and notational convenience, we assume thateach annotation is represented by a color. Therefore, in-stead of talking about annotations and annotated blocks we

talk about colors and color blocks, respectively. Similarly,we talk about color databases (databases that are annotated)and color queries (queries on annotated databases) that arewritten using a color algebra (an algebra that accounts forannotations).

Let D be a standard relational database consisting of therelations R1, . . . , Rk. For each relation Ri, we denote its setof attributes by sort(Ri), while we use ri to denote an in-stance of the relation. We use upper-case letters early in thealphabet (A, B, . . .) to denote attribute names while upper-case letters late in the alphabet (X, Y, . . .) are used to de-note sets of attributes. Accordingly, lower-case letters earlyin the alphabet (a, b, . . .) are used to denote attribute values,while those late in the alphabet (x, y, . . .) are used to denotesets of attributes values. Finally, C denotes a set of colors.

Let r be an instance of relation R and let t be a tuplein r. The annotation, or coloring, of a tuple t is performedthrough a coloring function χ. Function χ accepts as inputa tuple t and a non-empty set of attributes Y ⊆ sort(R) andassigns a set of colors to the values in t[Y ]. For a tuple t, thetriplet (t, Y, χ(t, Y )) defines a color block which consists ofthe attribute values in t[Y ] along with their assigned colors.If χ(t, Y ) = ∅, then the values in t[Y ] are not within a colorblock. Hereafter, we use 〈r, χ〉 to denote a relation r whosetuples are colored through function χ.

Example 1. Consider the relation in Figure 1(d). Then, thecoloring of each tuple in the relation is expressed throughthe following coloring function χ (where, ti is the ith tuplein the relation):

χ(t1, {pid, gid}) = {John} χ(t1, {gid, sid})= {Mary}χ(t2, {pid, gid}) = {John, Mary} χ(t2, {gid, sid})= {John}

χ(t3, {gid, sid})= {Mary}χ(t4, {pid, gid, sid}) = {Peter}

Moreover, for every tuple ti and all other sets of at-tributes Y , χ(ti, Y ) is (implicitly) defined to be the emptyset.

2.2. Color algebra (CA)

In what follows, we introduce the main set of operatorsof the color algebra (CA) and we present a number of exam-ples to illustrate their use. The full set of operators (avail-able in [16]) is omitted here due to lack of space.Projection: We define the projection πA1···Ak

as the opera-tor which takes as input any instance 〈r, χ〉 of sort contain-ing {A1, . . . , Ak} and returns the instance 〈r′, χ′〉 of sort{A1, . . . , Ak} such that

r′ = {t[A1, . . . , Ak] | t ∈ r} (normal projection)

and for any t ∈ r, and any Y ⊆ {A1, . . . , Ak},

χ′(t[A1, . . . , Ak], Y ) =[Z

χ(t, Y ∪ Z),

where Z ranges over all subsets of sort(R)\{A1, . . . , Ak}.

3

Page 4: Annotating and querying databases through colors and blocks

120232A45770

120231I78852

pid gid

120234A25218

120233A01399 120232A45770

120231I78852

pid gid

120234A25218

120232A45770

120231I78852

pid gid

120234A25218

120233A01399

(a) A simple projection (b) An L-type block projection (c) A U-type block projection

Figure 2. The projection operators

Example 2. Consider the relation 〈r, χ〉 shown in Fig-ure 1(d). Then, expression πpid,gid(r) returns the relationr′ shown in Figure 2(a). Note that the tuples t1, t2 and t3get those (projected) blocks of r involving only the {gid} at-tribute, while tuple t4 gets a (projected) block of r involvingthe {pid,gid} attributes.

The projection operator removes parts of a relationalinstance based on schema-level information. Here, weintroduce a corresponding operator that uses schema-levelinformation to remove blocks.

Block Projection: We offer two types of block projectionthat allow for the projection of blocks based on whetherblocks contain or are contained in a specified set of at-tributes. Specifically, the L-type (Lower) block projectionoperator ΠL

A1,...,Aktakes as input an instance 〈r, χ〉 of sort

containing {A1, . . . , Ak}, and returns the instance 〈r ′, χ′〉of the same sort defined by

r′ = {t | t ∈ r and there exists a block(t, Y, χ(t, Y ))

with A1, . . . , Ak ⊆ Y }and for any t ∈ r′, and any set of attributes Y ⊆ sort(R′),

χ′(t, Y ) =

(χ(t, Y ) if {A1, . . . , Ak} ⊆ Y , χ(t, Y ) �= ∅;

∅ otherwise.

The U-type (Upper) projection operator ΠUA1,...,A�

is de-fined similarly, except that r′ = r and in the definition ofχ′(t, Y ), Y ⊆ {A1, . . . , Ak} must hold. We also defineΠL

∅ = Id, while ΠU∅ only returns the unannotated tuples.

Example 3. Consider the relation 〈r, χ〉 in Figure 2(a). As-sume that we want to find all the tuples with at least one an-notation that involves the protein identifier (pid) attribute.Expression ΠL

pid(r) returns the desired result, shown in Fig-ure 2(b). Operator ΠL sets a lower bound on the set ofattributes that must participate in a selected block. Thus,in our example, both tuples that do not have an annotationinvolving the pid attribute, and blocks that do not annotatethe attribute, are removed.

On the other hand, we may want all the tuples of relationr that might have an annotation only on the gid attribute.Then, the expression ΠU

gid(r) finds all such tuples. The re-sulting relation is shown in Figure 2(c). Operator ΠU setsan upper bound on the set of attributes that may participatein a selected block. Blocks only involving a subset of thisupper bound are also selected. In our example, any blockthat involves an attribute other that gid is removed.

120232 P35240A45770

120231 P21359I78852

pid gid sid

120233 P01138A01399

pid gid sid

120234 P08138A25218

(a) A block selection (b) A simple selection

120232 P35240

120231 P21359

gid sid

120233 P01138

(c) A projected unionof two queries

120234 P08138

Figure 3. The selection and union operators

We note that an unannotated tuple always satisfies thecondition of the ΠU operator while it always violates thecondition of the ΠL operator. Thus, it is always preservedby the former and dropped by the latter.

By combining ΠL and ΠU , we can find all tuples thathave a block on a specific attribute set (and only this set).Expression ΠL

gid(ΠUgid(r)) returns all tuples with a block on

gid alone. These are the first three tuples in Figure 2(c).

Selection: On input 〈r, χ〉, operator σA=a returns the in-stance 〈r′, χ′〉 of the same sort defined by r ′ = {t | t ∈r, t[A] = a} and χ′ is the restriction of χ to r′.

On input 〈r, χ〉 of sort containing {A, B}, operatorσA=B returns the instance 〈r′, χ′〉 of the same sort definedby r′ = {t | t ∈ r, t[A] = t[B]}. Concerning χ′, for anyt ∈ r′, and any Y ⊆ sort(R′) we have that

χ′(t, Y ) =

(χ(t, Y ) A, B �∈ Y ;

χ(t, Y ) ∩ β(t, A) ∩ β(t, B) otherwise.

where β(t, A) (resp. β(t, B)) is the set of colors ofall blocks in t containing attribute A (resp. B). So, fortuples that satisfy the selection condition at the value level,only those blocks containing A (resp. B) for which thereexists a block, of the same color, containing B (resp.A), are selected. If no such blocks exist, the selectionattributes become unannotated. Thus, selection requires anagreement on both the data and block level, as one of ourfollowing examples illustrates (Example 5).

Similar to the selection operator, that identifies tupleswith a particular data value, we offer a block selectionoperator that identifies blocks which have a specific color.

Block Selection: The operator Σc, where c ∈ C, takes asinput any instance 〈r, χ〉 and returns the instance 〈r ′, χ′〉 ofthe same sort defined by

r′ = {t | t ∈ r and there exists a block in t of color c},

and for any t ∈ r′ and any set of attributes Y ⊆ sort(R),χ′(t, Y ) = χ(t, Y ) ∩ {c}.

Example 4. Consider again relation 〈r, χ〉 in Figure 1(d).Assume that we want to find all the tuples that have a blockannotated by Mary, or concern the protein with sid P08138.Also, assume that we are only interested in keeping the{gid, sid} attributes from these tuples. Then, the expression

πgid,sid((ΣMary(r)) ∪ (σsid=“P08138”(r)))

4

Page 5: Annotating and querying databases through colors and blocks

120232 P35240

120231 P21359

gid sid

120233 P01138

120234 P08138

120232A45770

120231I78852

pid gid

120234A25218

(a) A relation (b) Another relation

120232A45770120231I78852

pid gid

120234A25218

120231

gid' sid'

P21359120232 P35240

120234 P08138

(c) Cartesian-product followed by selection

Figure 4. Selection and product operators

returns the desired result. Due to space limitations, wehaven’t introduced formally the union operator whose defi-nition is rather straightforward (see [16] for all the defini-tions). Figure 3 shows both the intermediate results for eachpart of the union, and the final result of the query. Noticethat the block selection operator maintains only the tuplesthat have at least one annotation from Mary. At the sametime, from these tuples, the operator keeps only the blocksbelonging to Mary. On the other hand, a selection predi-cate of the form A = a focuses on values without alteringthe block structure. Our next example shows that this is notthe case for selections of the form A = B.

The next two operators are of particular importance bothfor the completeness of our algebra and because, alongwith the selection operator, they define the color join.

Product: Given two instances 〈r, χr〉 and 〈s, χs〉 of disjointsorts, the product operator × returns the instance 〈r ′, χ′〉with sort(R′) = sort(R) ∪ sort(S) defined by r′ = r × s(normal product). For any tuple t ∈ r ′ and Y ⊆ sort(R′),

χ′(t, Y ) =

8><>:

χr(πsort(R)(t), Y ) if Y ⊆ sort(R);

χs(πsort(S)(t), Y ) if Y ⊆ sort(S);

∅ otherwise.

Merge: A natural operation on blocks is merging. Themerge operator µY,Z , with Y, Z being sets of attributes suchthat Y ∩ Z = ∅, takes as input instances 〈r, χ〉 of sortsort(R) containing Y ∪ Z and returns the instance 〈r ′, χ′〉of the same sort defined by r ′ = r. For any t ∈ r′ and anyX ⊆ sort(R),

χ′(t, X) = χ(t, X1) ∩ χ(t, X2),

where X = X1 ∪ X2, X1 ⊆ Y , X2 ⊆ Z and χ(t, X) = ∅.Intuitively, the merge operator considers each tuple t and itidentifies pairs of blocks that are contained in Y and Z , re-spectively, and have the same color. Then, it replaces twoequi-colored blocks with a new block that is the result oftheir merging. Blocks that are contained in Y and Z butcannot be merged, are dropped, as are the blocks not con-tained in Y and Z .

Example 5. Consider relation 〈r, χ〉 from Figure 2(b) andrelation 〈r′, χ′〉 from Figure 3(c) (which are copied in Fig-ures 4(a) and (b) for convenience). Figure 4(c) shows the

120232A45770120231I78852

pid gid

120234A25218

sid'

P21359P35240

P08138

(a) Projecting out gid'

A45770I78852

pid

A25218

120231

gid' sid'

P21359120232 P35240

120234 P08138

(c) Projection after the mergeoperator is applied

(b) Projecting out gid

A45770I78852

pid

A25218

120231

gid' sid'

P21359120232 P35240

120234 P08138

Figure 5. The merge operator

relation that results from taking the product of r and r ′ andapplying an equality condition on the gid attribute, that is,

σgid=gid’(r × δ(r′))

where δ is a renaming function such that δ(gid) = gid′ andδ(sid) = sid′. Again, due to space limitations, we omit theformal definition of the renaming operator (see [16] for allthe definitions). Notice that the selection only selects thoseblocks containing gid which have a equi-colored block con-taining gid’, and vice versa. If no such blocks exists, as isthe case for gid 120231, the resulting tuple is unannotated.

Now, what if we want to remove the duplicate gid columnfrom Figure 4(c)? It turns out that, depending on whichof the two gid columns we project out, we end up with adifferent block structure, as shown in Figures 5(a) and (b).We can avoid this by using the µ{pig, gid},{gid’, sid’} operator.After the merge operator is applied, irrespectively of whichof the two attributes is projected out, the resulting relationis the same. This is shown in Figure 5(c).

The presented operators, along with the operators ofunion, renaming, and recoloring (which changes the colorof a block), which are not presented here, constitute thewhole set of operators of our algebra. Since the result ofeach operator on a color relation is again a color relation,we can compose all operators.

Definition 1. The color algebra (CA) consists of all expres-sions obtained by composing a finite number of the opera-tors mentioned above.

Our first theoretical result shows that we cannot hope toreduce the set of operators in our algebra and that all the op-erators are necessary. In the next section, Theorem 3 showsthat our operators are also sufficient for our needs.

Theorem 1 (Minimality). The set of operators in the coloralgebra is minimal.

Proof Sketch The proof considers each operator in turnand shows that it cannot be expressed in terms of the otheroperators. (see [16] for the full details) .

We conclude this section with a few observations. Thefirst observation concerns the color queries that are writ-ten using the color operators that have a relational alge-bra counter-part (e.g. selection (σ), projection (π), product(×)). Each such color query results in the same set of tuples

5

Page 6: Annotating and querying databases through colors and blocks

as the one we would get by applying the corresponding re-lational algebra query on an unannotated database. The dif-ference between the two queries is the presence of blocks.Thus, by using the CA algebra, we don’t lose any data.

Our second observation is that we can define a color joinin CA as well. This join identifies attributes based on boththeir values and block structure and merges the “common”blocks. More specifically, we define 〈r, χr〉 �� 〈s, χs〉 bythe expression (assuming we join on attributes Ai and Aj)

πsort(r)∪sort(s)\{Aj}`µsort(r),sort(s)(

ΠLAi

(σAi=Aj(r × s)) ∪ (ΠL

AjσAi=Aj

(r × s)))´

As noted in Example 5, we obtain an equivalent expressionwhen projection includes Aj instead of Ai.

3. Connection with relational model

3.1. Relational representation

In this section, we provide a relational representationof color databases. In what follows, we define a map-ping rep from color databases to a special type of relationaldatabases. The inverse mapping rep−1 can also be definedbut we omit the details due to lack of space.

Given a relation schema R with sort(R) ={A1, . . . , Ak} we define the relation SR of sort{A1, . . . , Ak, B1, . . . , Bk, γ}, where the attributes Bi

are of type Boolean, the γ attribute is of type color, andthere exist a bijection assoc(Ai, Bi) mapping attributeAi to attribute Bi and viceversa. We denote the class ofrelational databases satisfying the above schema constraintsby CRep.

Let CD denote the class of color databases. The mappingis a function rep : CD �→ CRep such that rep(R) = SR.Let 〈r, χ〉 be a color relation instance over R. For eachtuple t ∈ r, the representation rep(〈r, χ〉) contains the tuple(t, 0, . . . , 0, c), where c is a designated “blank” color notappearing in r. Furthermore, for each annotated tuple t ∈ rand each Y ⊆ sort(R) such that χ(t, Y ) = ∅, the relationrep(〈r, χ〉) contains the set of tuples

{(t, B1, . . . , Bk, c) | c ∈ χ(t, Y )},

where Bi = 1 if Ai ∈ Y and assoc(Ai, Bi) holds, andbi = 0 otherwise.

Note that the Boolean attributes in the representationare used to determine which of the corresponding data at-tributes belong to a block.

This concludes the definition of mapping rep. The ex-tension to color databases, i.e., a set of color relations, isdefined analogously.

Example 6. Consider again the color relation 〈r, χ〉 givenin Figure 1(d). The following relation contains the threetuples in rep(〈r, χ〉) which correspond to the representation

of the first tuple in 〈r, χ〉. We assume that assoc(pid, bpid),assoc(gid, bgid), and assoc(sid, bsid).

pid gid sid bpid bgid bsid γ

I78852 120231 P21359 0 0 0 cI78852 120231 P21359 1 1 0 JohnI78852 120231 P21359 0 1 1 Mary

where c is a new color. The other tuples in rep(〈r, χ〉) areobtained similarly.

A few words about the choice of representation. Notethat we can normalize our representation so that the val-ues of a tuple are not repeated for every block. Separatingthe data from the annotation representation not only savesspace but also facilitates the incorporation of our mecha-nism to existing databases since no re-structuring of theexisting schemas is necessary. We also note that our rep-resentation has several advantages over alternative repre-sentations. Indeed, we explored alternative representationswhose schemas are exponentially large, with respect to thesize of the schema of the relation to be annotated (since eachpossible block, i.e., set of attributes is represented by a sepa-rate column). Such alternative representations have severaldisadvantages including waste of space, since we need toallocate the space of a column for each possible block, evenif this block does not exist in the currently annotated tuple.Furthermore, query processing is much more expensive andcomplicated in such representations.

3.2. Expressiveness

The relational representation of color databases suggestsanother candidate query language, namely the normal re-lational algebra on this representation and specifically thefragment consisting of the union of conjunctive queries. Inthis section, we establish a link between our algebra and thisfragment of the normal relational algebra.

3.2.1 Color relational algebra (CRA)It is clear that not every relational algebra query on a rep-resentation of a color relation results in a representation ofa color relation again. For example, any projection consist-ing only of data attributes, does not correspond to a colordatabase. It would therefore be desirable to identify theclass of algebra expressions which, when applied to anyrepresentation of a color relation, results in a representationof a color relation.

Recall the CRep is the set of relational databases whichrepresent a color database through the rep mapping.

Definition 2. A positive relational algebra query Q is col-ored if for every relational database D ∈ CRep, the queryresult Q(D) ∈ CRep as well.

We identify the following three necessary and sufficientsyntactic conditions for a positive relational algebra query

6

Page 7: Annotating and querying databases through colors and blocks

to be colored. Without loss of generality we may assumethat such query is given in normal form [4]. More specif-ically it is the union of conjunctive queries of the formπXσF (R1 × · · · × Rk). Clearly, a positive query is col-ored if it is the union of colored conjunctive queries. It iseasily verified that a conjunctive query is colored iff it sat-isfies the following three properties:(i) The projection must contain a single color attribute;(ii) Since each data attribute corresponds to a uniqueBoolean attribute (and vice versa), queries should “respect”this relationship. In other words, if a data attribute is part ofa query result schema, then the associated Boolean attributeshould also be (and vice versa);(iii) If some new data attributes are introduced which are inthe schema of the query result, also associated new Booleanattributes should be introduced (and vice versa).

The above characterization is simple and provides aneasy test (which runs in linear time in the size of the ex-pression) to check whether a query, written as a union ofconjunctive queries, is colored.

Definition 3. The color relational algebra (CRA) consistsof the class of colored positive relational algebra queries.

3.2.2 Color algebra vs Color relational algebraThe color algebra (CA) and the class of color relational al-gebra (CRA) queries are closely connected. First of all,there exists a translation of any CA query into a CRA query.More specifically,

Theorem 2 (Soundness). For every color database 〈D, χ〉and every CA expression Q, there exists a color relationalalgebra (CRA) expression P such that

rep(Q(〈D, χ〉)) = P (rep(〈D, χ〉)).Moreover, given the CA expression Q, the CRA expressionP is of polynomial size, to that of Q.

Proof Sketch The proof consists of translating each oper-ator in CA into a CRA query. (see [16] for the translationrules)

The previous theorem gives us a way of implement-ing the CA on top of existing relational DBMS. Our colorDBMS, represents color databases as described in Section 3and when a CA query is issued, it translates it first into thecorresponding CRA query and then to the equivalent SQLquery. Then, the SQL query is executed in a standard rela-tional DBMS.

Our theoretical result is that the CA has the same func-tionality (expressive power) as the CRA. More specifically:

Theorem 3 (Completeness). For every relational databaseD in CRep, and every color relational algebra expression P ,there exists a color algebra expression Q such that

rep−1(P (D)) = Q(rep−1(D)).

Proof Sketch The proof consists of a translation of any CRAquery into a CA query and relies heavily on the fact thatCRA queries only have a single color attribute in their pro-jection. See [16] for details.

Since the CA and CRA have the same expressive powerwhy opt for one versus the other? For one thing, CRAqueries are complex to write. The simple CA expressionσAi=Aj (r) is equivalent to the following CRA expression(which consists of a union of four CRA queries):

σAi=Aj∧Bi=0∧Bj=0(rep(r)) ∪ σAi=Aj∧Bi=1∧Bj=1(rep(r))∪πsort(rep(r))(σAi=Aj∧Bi=1∧Bj=0(rep(r)) ��

δ(σAi=Aj∧Bi=0∧Bj=1(rep(r))∪πδ(sort(()r))(σAi=Aj∧Bi=1∧Bj=0(rep(r)) ��

δ(σAi=Aj∧Bi=0∧Bj=1(rep(r))

Clearly, a user cannot be expected to write such CRAexpressions. The CA is not only simple syntactically butalso independent of the underlying representation of anno-tations. Any change on the representation of annotationsleaves CA queries unaffected. On the other hand, sucha change would probably necessitate a re-writting of theequivalent CRA query.

4. The MONDRIAN System

The current status of the MONDRIAN system includesthe following components. MONDRIAN is implementedon top of the MySQL relational DBMS. The MySQL serverruns on a linux-based Pentium 4 PC (CPU 1.8GHz, 2GBRAM). On top of MySQL, we have implemented a modulethat accepts text-based CA queries. The module translateseach such query first to its equivalent CRA query and thento an equivalent SQL query. The resulting SQL query isthen sent to the MySQL server and is executed against therepresentation of an annotated database.

In our experiments, we used real biological data fromthe Swissprot [3] database. The relational representation ofthe Swissprot data was based on the schema of the UCSCGenome Browser database [19]. From this relational rep-resentation, we extracted two relations for our experiments,namely, relation Protein containing 200,000 protein tuples(560MB in size), and relation Public that contained fourmillion tuples that concern publications related to proteins(750MB in size). Relation Protein has eight attributes in itsschema, while relation Public has five. We used the abovetwo relations as pools from which we generated three dif-ferent experimental data sets. Each set contained five unan-notated relations for each of the two pools. The sizes ofthese five relations varied from 10,000 to 50,000 tuples (in10,000 tuple increments). Thus, the total number of createdrelations was 30. Each experimental data set allowed forexecuting experiments with different relation sizes. By us-ing different data sets, we avoided any possible bias in themeasurements due to characteristics of the underlying data.

7

Page 8: Annotating and querying databases through colors and blocks

A second module of the implementation was responsiblefor annotating unannotated relations. The annotation pro-cess is influenced by three user-specified parameters. Thefirst parameter, called MaxNo, limits the number of blocksthat can appear in each annotated tuple. The second pa-rameter, called AvgNo, specifies the average size of eachgenerated block. The last parameter is the cardinality of C(the number of available colors). In general, C can vary be-tween one (all blocks have the same color) and the numberof blocks in the relation (each block has its own color).

In this setting, we conducted two sets of experiments that(a) prove feasibility, i.e., that our annotation mechanism canbe implemented efficiently in practice, and (b) study perfor-mance, i.e., the space/time overhead of our mechanism.(1): In the first set, we compared the cost of executing CAqueries over annotated databases to the cost of executingequivalent CRA queries over the corresponding unanno-tated databases.(2): In the second set, we investigated how the annotationparameters influence the evaluation cost of CA queries.All reported times are averaged over five runs of each ex-periment, over each of the different sets of (un)annotatedrelation instances (to rule out CPU interference and any biasfrom using a single instance). For annotated relations, thereported size is the number of annotated tuples and not thenumber of representation tuples. The cumulative size of the(un)annotated data sets used in these experiments is 26GB.

4.1. Costs of using colors and blocks

In this experiment, we investigate the additional cost, interms of time and space, in querying an annotated databaseinstead of an unannotated one. The experiment has twoparts. Initially, we considered three types of queries, allwritten in relational algebra, where each type uses one ofthe operators of selection, projection and join. For exam-ple, one query projects on the id and description of a pro-tein, while another selected only proteins of tomatoes. Forthe third type, we joined the Protein relation with the Publicrelation resulting proteins along with their associated pub-lications. For each query type, we measured the evaluationtime over our experimental data sets.

In the second part, we annotated the relations used inthe first part. While generating the annotated relations, weused what we consider to be representative parameter val-ues. We assumed that both MaxNo and AvgNo are equal tothree, since tuples are expected to involve a small numberof blocks with a few attributes in each one. Furthermore,we assumed that each color represents a curator and thuswe set C to 100. Note that indeed the number of curatorsin Swissprot, one of the most heavily curated databases, isclose to 40.

Given the annotated relations, we considered queriesthat, syntactically, are identical to the queries in the first

0.00

2.00

4.00

6.00

8.00

10.00

12.00

10000 20000 30000 40000 50000

Relational instance size (in tuples)

Tim

e (

in s

eco

nd

s)

Projection

Color Projection

Selection

Color Selection

Join

Color Join

Figure 6. Color vs. normal algebra

part of the experiment. However, instead of the normal rela-tional algebra operators, the queries used their color coun-terparts, i.e., they contained a CA projection instead of aprojection, a CA selection instead of a selection and a colorjoin instead of a normal join (note that a color join can beexpressed through the select, merge and project CA op-erators). So, for example, the query that projects on theid and description of proteins was translated to a querythat projects on the data and blocks of these two attributes.Given the queries, we measured their evaluation time overthe annotated databases, and we compared these times withthe times collected from the first part.

Figure 6 shows the results of this comparison for vari-ous relation sizes. Next to the cost of each relational op-erator, we show the cost of its color counter-part. In gen-eral, each color operator costs from three to five times asmuch as its relational counterpart. There are two main rea-sons for this. First, remember that color operators are actu-ally applied on the representation of the annotated relation.This representation is bigger in size than the correspond-ing unannotated relation since it includes one tuple for eachcolor of each block. With MaxNo equal to three, annotat-ing a relation with 10,000 tuples results in a relation thatis close to 30,000 tuples (assuming single colored blocks).Thus, while a normal projection is applied on 10,000 tuples,a color projection is actually applied on 30,000. Second, re-member that color operators perform extra processing sincethey also consider Boolean attributes and operate on them.

Note that the overhead of supporting annotations is notprohibitive and it is more than balanced by the added valueof being able to represent and query complex annotations.We expect that further optimizations of our implementation(e.g. use of specialized indexes) would further reduce theabove costs making our solutions even more attractive.

4.2. Query evaluation cost parameters

In what follows, we investigate the relationship betweenevaluation cost of color queries and the three annotation pa-rameters. Fo these experiments, we considered three dif-ferent types of color queries. The first type included colorqueries that involved only the selection operator where the

8

Page 9: Annotating and querying databases through colors and blocks

0

1

2

3

4

5

6

10000 20000 30000 40000 50000

Relational instance size (in tuples)

Tim

e (

in m

illiseco

nd

s)

MaxNo = 5

MaxNo = 4

MaxNo = 3

MaxNo = 2

MaxNo = 1

(a) Selection on a data value

0

10

20

30

40

50

60

70

80

90

10000 20000 30000 40000 50000

Relational instance size (in tuples)

Tim

e (

in m

illiseco

nd

s)

MaxNo = 5

MaxNo = 4

MaxNo = 3

MaxNo = 2

MaxNo = 1

(b) Block selection and projection

Figure 7. Varying the MaxNo parameter

selection was on a data value. Note that the size of the re-sult set of such queries is expected to be independent ofblocks. The second type of color queries involved oper-ators that are mostly block-dependent, namely, the opera-tors of block projection and block selection. Finally, thethird type of queries involved both block-independent andblock-dependent operators. We used three different param-eter configurations to annotate relations. For each result-ing annotated relation, in each configuration, we executedqueries of all three types and measured the correspondingevaluation times. In what follows, we present each parame-ter configuration and we review our key findings.

Configuration 1: In this configuration, we annotated the re-lations in our experimental data sets once for each value ofMaxNo between one and five. The AvgNo parameter was setto three and the C was 100. In Figure 7, we show the run-ning times for the first and second type of queries, for differ-ent relation sizes (the third type exhibits the same runningtime trends as the second type). The main conclusion fromthese experiments is that the evaluation cost of most CA op-erators, with the exception of selections on data values, isheavily influenced by the maximum number of blocks pertuple. This is because this number increases the number oftuples in the underlying representation. The trend shown inFigure 7(b) is explained as follows. In the representation,there is one tuple for each color of each block. Assumingsingle-colored blocks, for an annotated relation with X tu-ples, there are (MaxNo + 1) × X representation tuples. Asthe figure shows, evaluation time increases sharply, when

0

2000

4000

6000

8000

10000

12000

1 2 3 4 5

AvgNo

Nu

mb

er

of

tup

les

10000

20000

30000

40000

50000

Figure 8. Result size of the ΠU operator

both MaxNo and X increase. In spite of the increase repre-sentation size, operations on the data side, like selections ondata values, are not influenced by the variance of MaxNo, asFigure 7(a) illustrates.Configuration 2: For our second configuration, we anno-tated relations of various sizes by varying, this time, theAvgNo parameter between the values of one and five. Pa-rameter MaxNo was set to three and C to 100. Again, weexecuted color queries from all the types and recorded theirevaluation times. Our experiments showed that varyingthe AvgNo parameter had negligible effects on the runningtimes of various queries, for a fixed number of annotatedtuples. To a large extent, this is to be expected since anyvariance of AvgNo does not influence the number of tuplesin the underlying representation. As AvgNo increases, theonly change, representation-wise, is that more Boolean at-tributes are set to 1, instead of 0. It is interesting to note theinteraction between the value of AvgNo and the size of theresult of block projections. As an example, in Figure 8 weshow that, for a relation of fixed size, as we increase AvgNowe decrease the number of tuples in the result size of a U-type block projection on three attributes. This is because aswe increase AvgNo, increasingly less blocks are containedwithin the projected attribute set.Configuration 3: Our last configuration considered anno-tated relations where both MaxNo and AvgNo were set tothree. Here, five different values where considered for C,namely, 1, 10, 100, 1000 and as many colors as there areblocks. Our experiments showed that the parameter hasnegligible effects on the evaluation times of algebra oper-ators (since again availability of colors does not affect therepresentation size) with the exception of the block selec-tion operator. As Figure 9(a) shows, when C is 10, there isa sharp increase on the evaluation time of the operator. Thereason for this is shown in Figure 9(b). With less availablecolors, there is a large number of blocks sharing a color.

5. Conclusions and future workIn this paper, we proposed a new model for data annota-

tions which, unlike previous works, it allows annotating notonly values but also sets of values. Furthermore, our workis novel in that it also considers the importance of query-

9

Page 10: Annotating and querying databases through colors and blocks

0

1000

2000

3000

4000

5000

6000

7000

8000

10000 20000 30000 40000 50000

Relational instance size (in tuples)

Tim

e (

in m

illi

se

co

nd

s)

1

10

100

1000

# of blocks

(a) Evaluation time of block selection

0

1000

2000

3000

4000

5000

6000

7000

8000

1 10 100 1000 alldifferent

Number of colors

Nu

mb

er

of

tup

les

10000

20000

30000

40000

50000

(b) Result tuples of block selection

Figure 9. Block selection as C varies

ing annotations. To this end, we introduced a color algebra,and we proved that it is both complete and minimal. Wepresented the MONDRIAN annotation management systemand we performed experiments on the feasibility and perfor-mance of our solutions.

We believe that colored databases provide the rightframework to answer data provenance questions. Futurework will be directed towards showing this. In terms of im-plementation, we are considering migrating our implemen-tation from the MySQL server to MonetDB [8]. MonetDBoffers a vertical fragmentation storage model and initial in-vestigation shows that this model is particularly suitable forthe processing performed by CA algebra operators. Finally,an interesting topic is the extension of our algebra to ac-count for negation and the to allow blocks across multipletuples.

Acknowledgments Floris Geerts is a postdoctoral re-searcher of the FWO Vlaanderen and is supported in partby EPSRC GR/S63205/01.

References

[1] Cathy H. Wu, Lai-Su L. Yeh, et al. The Protein InformationResource. Nucleic Acids Research, 31: 345-347, 2003.

[2] GDB(tm) Human Genome Database [database online]. urlhttp://www.gdb.org/.

[3] The SWISS-PROT Protein Knowledgebase. URL:http://www.ebi.ac.uk/swissprot/.

[4] S. Abiteboul, R. Hull, and V. Vianu. Foundations ofDatabases. Addison-Wesley, 1995.

[5] P. A. Bernstein and T. Bergstraesser. Meta-data support fordata transformations using microsoft repository. IEEE DataEng. Bull., 22(1):9–14, 1999.

[6] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijay-vargiya. An Annotation Management System for RelationalDatabases. In VLDB, pages 900–911, 2004.

[7] biodas.org. http://biodas.org.[8] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For

Query-Intensive Applications. Ph.d. thesis, Universiteit vanAmsterdam, Amsterdam, The Netherlands, May 2002.

[9] P. Buneman. The two cultures of digital curation. In SS-DBM, pages 7–, 2004.

[10] P. Buneman, S. Khanna, and W. C. Tan. Why and where: Acharacterization of data provenance. In ICDT, pages 316–330, 2001.

[11] P. Buneman, S. Khanna, and W. C. Tan. On propagation ofdeletions and annotations through views. In ACM PODS,pages 150–158, 2002.

[12] L. Chiticariu, W. C. Tan, and G. Vijayvargiya. Dbnotes: apost-it system for relational databases based on provenance.In ACM SIGMOD, pages 942–944, 2005.

[13] T. G. O. Consortium. The gene ontology (go) database andinformatics resource. Nucl. Acids Res, 32:258–261, 2004.

[14] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineageof view data in a warehousing environment. ACM Trans.Database Syst., 25(2):179–227, 2000.

[15] S. Davidson, G. C. Overton, and P. Buneman. Challenges inIntegrating Biological Data Sources. Journal of Computa-tional Biology, 2(4):557–572, 1995.

[16] F. Geerts, A. Kementsietsidis, and D. Milano. MON-DRIAN: Annotating and querying databases throughcolors and blocks. Technical report. Available athttp://www.lfcs.inf.ed.ac.uk/research/database/annotation.html.

[17] H. V. Jagadish, L. Lakshmanan, M. Scannapieco, D. Srivas-tava, and N. Wiwatwattana. Colorful xml: One hierarchyisn’t enough. In ACM SIGMOD, pages 251–262, 2004.

[18] J. Kahan, M.-R. Koivunen, E. Prud’hommeaux, and R. R.Swick. Annotea: an open rdf infrastructure for shared webannotations. Computer Networks, 39(5):589–608, 2002.

[19] D. Karolchik, R. Baertsch, M. Diekhans, T. Furey, A. Hin-richs, Y. Lu, K. Roskin, M. Schwartz, C. Sugnet, D. Thomas,R. Weber, D. Haussler, and W. Kent. The UCSC GenomeBrowser Database. Nucl. Acids Res, 31:51–54, 2003.

[20] A. Kementsietsidis, M. Arenas, and R. J. Miller. Data Map-ping in Peer-to-Peer Systems: Semantics and AlgorithmicIssues. In ACM SIGMOD, pages 325–336, 2003.

[21] W. C. Tan. Containment of relational queries with annota-tion propagation. In 9th Int’l Workshop on Database Pro-gramming Languages (DBPL), pages 37–53, 2003.

[22] W.-C. Tan. Research Problems in Data Provenance. IEEEData Engineering Bulletin, 27(4):45–52, Dec 2004.

[23] J. Widom. Trio: A system for integrated management ofdata, accuracy, and lineage. In CIDR, pages 262–276, 2005.

10