
Semistructured Probabilistic Databases

Alex Dekhtyar
[email protected]
Department of Computer Science
University of Kentucky

Judy Goldsmith
[email protected]
Department of Computer Science
University of Kentucky

Sean R. Hawkes
[email protected]
Department of Computer Science
University of Kentucky

Abstract

This work describes a new theoretical framework for uniform storage and management of diverse probabilistic information.

1 Introduction

The need to store and manage probabilistic information can appear in a number of different applications, from multimedia databases storing the results of image recognition to logistics databases to stock market prediction software to a wide variety of applications of Bayesian Nets [17]. Over the past 13 years there have been a number of relational [6, 2, 9, 15] and object [14, 10] data models proposed to support storage and querying of probabilistic information. Unfortunately, these approaches are not sufficiently flexible to handle the different contexts in which probabilities must be discussed in analyzing a stochastic system. For instance, consider academic advising, where the expectation of a student’s success may be represented in a variety of forms: a simple probability distribution for one course, a joint probability distribution for several courses, or a simple or joint conditional probability distribution (success in later courses may depend on earlier grades).

This variety of formats would require separate storage in any of the current probabilistic relational models, making even simple queries hard to express. Thus, we propose a new, semistructured probabilistic data model designed to alleviate this problem.

The semistructured data model [1, 5, 18] has gained wide acceptance recently as the means of representing data that lacks a rigid structure or schema. In particular, the similarity of the semistructured data model and the underlying data model for Extensible Markup Language (XML) [4], the emerging open standard for data storage and transmission over the Internet, makes our choice of this approach attractive. In this paper, we present the formal model for semistructured probabilistic objects and provide the theoretical foundations for storing and managing them. In [13] we have started the process of translating this model into XML.

In Section 2, we introduce the advising application. Section 3 gives formal definitions of semistructured probabilistic objects, and Section 4 introduces the underlying algebra for semistructured probabilistic databases.

2 Motivating Example

The following comes from our work on modeling academic advising as a process of uncertain inference [7].

Consider a database designed to assist faculty members with the academic advising process. Typically, an advisor sees each advisee once every semester to suggest the set of courses to take next semester. The advisor tries to suggest courses that fulfill degree requirements and for which the student has the highest chance (probability) of success.

Statistical information about student performance in different classes can be extracted from the student transcript database maintained by every university. Under the assumption that this information correctly reflects (approximates) the true probabilities, it can be used to assist the advisor in her recommendation.

Note that the type of probabilistic information available to the advisor in this example varies greatly. The simplest is a probability distribution of student performance in one course. The advisor can compare the probability distribution for Databases with the probability distribution for Operating Systems in order to choose the course that has a higher probability of success.

                              major: CS             major: CS
                              college: ENG          year: 2000
                              instructor: Smith
                              instructor: Jones
DB  P        DB  OS  P        DB  OS  P             DB  OS  P
A   0.2      A   A   0.05     A   A   0.01          A   A   0.1
B   0.3      A   B   0.05     A   B   0.02          A   B   0.1
C   0.4      A   C   0.1      A   C   0.2           A   C   0.01
F   0.1      A   F   0.01     A   F   0.01          A   F   0
             B   A   0.04     B   A   0.05          B   A   0.1
             B   B   0.08     B   B   0.12          B   B   0.2
             ... ... ...      ... ... ...           ... ... ...
             F   C   0.02     F   C   0.03          F   C   0.01
             F   F   0.03     F   F   0.01          F   F   0
                                                    DataStructures = A
                                                    Logic ∈ {A, B}

Figure 1. Different types of probabilistic information to be stored in the Advisor database (from left to right: single variable probability distribution, joint probability distribution of 2 variables, joint probability distribution with context, conditional joint probability distribution with context).

Another type of probabilistic information that can be useful in this situation is a joint probability distribution. When the advisor needs to consider the entire list of courses for the student to take next semester, she is interested in the student’s success in all courses at once.

To make matters more complicated, we notice that a student’s success in future classes can depend intrinsically on her current grades. A C in a Data Structures class may suggest to the advisor that the student might not do well in Algorithms, while an A in Logic suggests a good chance of success in Artificial Intelligence. Other information that can affect the probability distribution includes general information about the student or course background, such as the student’s major, college, graduate/undergraduate status, the professor who teaches the course, etc.

The possible types of probabilistic information to be stored in a database for advising support are shown in Figure 1. There, and in the examples throughout the paper, we limit possible grades to A, B, C, F.

When trying to store this data using one of the previously proposed probabilistic database models, relational or object, a number of problems will be encountered.

Probabilistic relational models [2, 9, 15] lack the flexibility to store some of our data in a straightforward manner. In the advising application the courses are viewed as random variables. As such, it is natural to represent each course as a database attribute that can take values from the set $\{A, B, C, F\}$. However, with such an interpretation, a joint probability distribution of student grades in two courses will have a schema different from the joint probability distribution of three courses, and therefore will have to be stored in a separate relation. In such a database, expressing queries like “Find all probability distributions that include Databases as a random variable” is very inconvenient, if at all possible.

Probabilistic object models [14, 10] are also not a good fit for storing this kind of data. In the framework of Eiter et al. [10], a probabilistic object is a “real” object, some of whose properties are uncertain and probabilistically described. For our application, the probability distribution is the object that needs to be stored.

With this example in mind, we proceed to describe our data model (footnote 1).

3 Data Model

Consider a universe $\mathcal{V}$ of random variables $\{v'_1, \dots, v'_m\}$. With each random variable $v \in \mathcal{V}$ we associate $dom(v)$, the set of its possible values. Given a set $V = \{v_1, \dots, v_q\} \subseteq \mathcal{V}$, $dom(V)$ will denote $dom(v_1) \times \dots \times dom(v_q)$.

Let $\mathcal{R} = (A_1, \dots, A_n)$ be a collection of regular relational attributes. For $A \in \mathcal{R}$, $dom(A)$ will denote the domain of $A$. We define a semistructured schema $R^{*}$ over $\mathcal{R}$ as a multiset of attributes from $\mathcal{R}$. For example, if $\mathcal{R} = \{year, major, college\}$, the following are valid semistructured schemas over $\mathcal{R}$: $R^{*}_1 = \{year, college\}$; $R^{*}_2 = \{year, year, major, college\}$; $R^{*}_3 = \{major, major, major\}$.

Let $\mathcal{P}$ denote a probability space used in the framework to represent probabilities of different events. Examples of such probability spaces include (but are not limited to) the interval $[0, 1]$ and the set $C[0,1]$ of all subintervals of $[0, 1]$ [16, 8, 15]. For each probability space $\mathcal{P}$ there should exist a notion of a consistent probability distribution over $\mathcal{P}$ (footnote 2).

Footnote 1: A good advisor does not inform a student that she has probability x of getting grade y in a given course. Thus, all probabilities used in examples here were chosen randomly, and do not represent actual observed or computed probabilities for these events.

DB  OS  P        DB  OS  P       DB  OS  P        DB  OS  P
A   A   0.09     B   A   0.12    C   A   0.03     F   A   0
A   B   0.12     B   B   0.16    C   B   0.08     F   B   0.01
A   C   0.03     B   C   0.13    C   C   0.11     F   C   0.02
A   F   0.005    B   F   0.01    C   F   0.045    F   F   0.04

Table 1. Joint Probability Distribution of Student Performance in Databases and Operating Systems Courses.

We are ready to define the key notion of our framework: Semistructured Probabilistic Objects (SPOs).

Definition 1 A Semistructured Probabilistic Object (SPO) $S$ is defined as a quadruple $S = \langle T, V, P, C \rangle$, where
• $T$ is a relational tuple over some semistructured schema $R^{*}$ over $\mathcal{R}$. We will refer to $T$ as the context of $S$.
• $V = \{v_1, \dots, v_q\} \subseteq \mathcal{V}$ is a set of random variables that participate in $S$. We require that $V \neq \emptyset$.
• $P : dom(V) \longrightarrow \mathcal{P}$ is the probability table of $S$. Note that $P$ need not be complete, but it must be consistent w.r.t. $\mathcal{P}$.
• $C = \{(u_1, X_1), \dots, (u_s, X_s)\}$, where $\{u_1, \dots, u_s\} = U \subseteq \mathcal{V}$ and $X_i \subseteq dom(u_i)$, $1 \leq i \leq s$, such that $V \cap U = \emptyset$. We refer to $C$ as the set of conditionals of $S$.

An explanation of this definition is in order. For our data model to be able to store all the probability distributions mentioned in Section 2 (see Figure 1), the following information needs to be stored in a single object:

1. Participating random variables. These variables determine the probability distribution described in an SPO.

2. Probability table. If only one random variable participates, it is a simple probability distribution table; otherwise the distribution will be joint. The probability table may be complete, when the information about the probability of every instance is supplied, or incomplete. It is convenient to visualize the probability table $P$ as a table of rows of the form $(\bar{x}, \alpha)$, where $\bar{x} \in dom(V)$ and $\alpha = P(\bar{x})$. Thus, we will speak about rows and columns of the probability table where it makes explanations more convenient.

3. Conditionals. A probability table may represent a conditional distribution, conditioned by some prior information. The conditional part of its SPO stores the prior information in one of two forms: “random variable $u$ has value $x$” or “the value of random variable $u$ is restricted to a subset $X$ of its values”. In our definition, this is represented as a pair $(u, X)$. When $X$ is a singleton set, we get the first type of condition.

4. Context. The context provides supporting information for a probability distribution: information about the known values of certain parameters, which are not considered to be random variables by the application.

Footnote 2: For $\mathcal{P} = [0, 1]$ the consistency constraint states that the probabilities in a complete probability distribution must add up to exactly 1.

Example 1 Consider the joint probability distribution of student grades in Databases (DB) and Operating Systems (OS) for College of Engineering majors who have a grade of A or B in Data Structures (DS), defined in Table 1.

We can break this information into our four categories as follows:

participating random variables: $V = \{DB, OS\}$.

probability table: specified in Table 1. Here, the probability table defines a complete and consistent probability distribution.

conditionals: there is a single conditional $DS \in \{A, B\}$ associated with this distribution, which is stored in an SPO as $C = \{(DS, \{A, B\})\}$.

context: information about the student’s college within the University is not represented by a random variable in our universe. It is, therefore, represented as a relational attribute college. Thus, college: Engineering is the context of the probabilistic information in this example.
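As an informal illustration, the quadruple of Definition 1 can be sketched as a small Python data structure holding the SPO of Example 1. The class name `SPO`, its field names, and the helper methods are hypothetical and not part of the formal model. The sketch assumes $\mathcal{P} = [0, 1]$ and a shared grade domain; the treatment of incomplete tables (listed rows must not sum to more than 1) is our own assumption, since the paper only states the constraint for complete tables.

```python
from dataclasses import dataclass, field
from itertools import product
from math import isclose

GRADES = ("A", "B", "C", "F")  # dom(v) for every course, as in the paper's examples


@dataclass
class SPO:
    """Hypothetical in-memory layout of S = <T, V, P, C> with P = [0, 1]."""
    context: list       # T: (attribute, value) pairs; a list permits repeated attributes
    variables: tuple    # V: participating random variables, in column order
    table: dict         # P: tuple of values (one per variable) -> probability
    conditionals: dict = field(default_factory=dict)  # C: variable u -> set X of values

    def is_complete(self):
        """True when every row of dom(V) has a probability."""
        return all(r in self.table for r in product(GRADES, repeat=len(self.variables)))

    def is_well_formed(self):
        """Check that V is non-empty, V and U are disjoint, and P is consistent."""
        if not self.variables or set(self.variables) & set(self.conditionals):
            return False
        if any(not 0.0 <= p <= 1.0 for p in self.table.values()):
            return False
        total = sum(self.table.values())
        # A complete table must sum to 1; for an incomplete one we only
        # assume (our choice) that the listed rows do not exceed 1.
        return isclose(total, 1.0) if self.is_complete() else total <= 1.0 + 1e-9


# Table 1, in row order (DB, OS) = (A, A), (A, B), ..., (F, F).
probs = [0.09, 0.12, 0.03, 0.005, 0.12, 0.16, 0.13, 0.01,
         0.03, 0.08, 0.11, 0.045, 0.0, 0.01, 0.02, 0.04]
example1 = SPO(
    context=[("college", "Engineering")],
    variables=("DB", "OS"),
    table=dict(zip(product(GRADES, repeat=2), probs)),
    conditionals={"DS": {"A", "B"}},
)
```

Storing the context as a list of pairs rather than a dictionary is a deliberate choice here: a semistructured schema is a multiset, so the same attribute (for example, instructor in Figure 1) may occur more than once.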

4 Semistructured Probabilistic Algebra

Let us fix the universe of random variables $\mathcal{V}$, the universe of context attributes $\mathcal{R}$, and the probability space $\mathcal{P}$. In the remainder of this paper we will assume that $\mathcal{P} = [0, 1]$.

A finite collection $\mathcal{S} = \{S_1, \dots, S_n\}$ of semistructured probabilistic objects over $\langle \mathcal{V}, \mathcal{R}, \mathcal{P} \rangle$ is called a semistructured probabilistic relation (SPR). A finite collection $\mathcal{DS} = \{\mathcal{S}_1, \dots, \mathcal{S}_m\}$ of SPRs is called a semistructured probabilistic database (SPD).

One important difference between semistructured probabilistic databases and classic relational or relational probabilistic databases is that different semistructured probabilistic relations have the same “schema”, as their attributes range over the same universe $\langle \mathcal{V}, \mathcal{R}, \mathcal{P} \rangle$. Unlike relational databases, where tuples over different schemas must belong to different relations, any two SPOs can be in the same semistructured probabilistic relation. Thus, the division of a semistructured probabilistic database into relations is done for the purpose of specifying the structure of the database. For example, if the SPD is built from the information supplied by three different experts, this information can be arranged into three SPRs, according to the origin of each object inserted in the database.

The key to the efficient use of semistructured probabilistic databases in representing probabilistic information is the management of the data stored in SPDs. Just as with probabilistic relational databases, where the probabilistic relational algebras of Barbara et al. [2], Dey and Sarkar [9] and Lakshmanan et al. [15] extend classical relational algebra by adding probability-specific (and probability theory compliant) manipulation of the probabilistic attributes in the relations, a new semistructured algebra needs to be developed for SPDs in order to capture properly the manipulation of probabilities.

In the remainder of this section we introduce such an algebra, called Semistructured Probabilistic Algebra (SP-Algebra). This algebra will extend the definitions of the standard relational operations select, project, cartesian product and join to account for the appropriate maintenance of probabilistic information within SPOs, as well as introduce a new operation, conditionalization (see also [9]), which is specific to the probabilistic nature of the data.

4.1 Selection

Given an SPO $S = \langle T, V, P, C \rangle$, the selection operation may apply to any of its four parts. Each part requires its own language of selection constraints.

Selection on context, participants and conditionals will result in the entire SPO either being selected or not (as is the case with selection on classical relations). Selection on the probability table will result in only part of the original probability table $P$ being returned. A selection based on the probabilistic values is also possible, and it should also result in only part of the probability table being returned. The five different types of selections are illustrated in the following example.

Example 2 Consider the SPO $S$ described in Example 1. The first three queries below return $S$.

1. “Find all probability distributions related to College of Engineering majors”.

2. “Find all probability distributions that involve the Operating Systems course”.

3. “Find all probability distributions related to students who took Data Structures and got a grade of C or better”.

4. “What information is available about the probability of getting an A in Databases?” $S.P$ contains four entries that relate to the probability of getting an A in Databases. This part of $S.P$ should be returned as a result together with the $T$, $V$ and $C$ parts of $S$. The remainder of $S.P$ should not be returned.

5. “What outcomes have the probability over 0.1?” In the probability table of $S$, there are five possible outcomes that have a probability greater than 0.1. In the result of executing this query on $S$, $S.P$ should contain exactly these five lines, with $S.T$, $S.V$ and $S.C$ remaining unchanged.

4.1.1 Selection on Context, Participation or Conditionals

Here, we define the three selection operations that do not alter the content of the selected objects. We start by defining the acceptable languages for selection conditions for the three types of selects.

Recall that the universe $\mathcal{R}$ of context attributes consists of a finite set of attributes $A_1, \dots, A_n$ with domains $dom(A_1), \dots, dom(A_n)$. With each attribute $A \in \mathcal{R}$ we associate a set $Pr(A)$ of allowed predicates. We assume that equality and inequality are allowed for all $A \in \mathcal{R}$.

Definition 2
1. An atomic context selection condition is an expression $c$ of the form $A \; Q \; x$ ($Q(A, x)$), where $A \in \mathcal{R}$, $x \in dom(A)$ and $Q \in Pr(A)$.

2. An atomic participation selection condition is an expression $c$ of the form $v \in V$, where $v \in \mathcal{V}$ is a random variable.

3. An atomic conditional selection condition is one of the following expressions: $u = \{x_1, \dots, x_h\}$ or $u \ni x$, where $u \in \mathcal{V}$ is a random variable and $x, x_1, \dots, x_h \in dom(u)$.

S:                      $\sigma_{DB=A}(S)$:         $\sigma_{P>0.11}(S)$:
major: CS               major: CS                   major: CS
DB  OS  P               DB  OS  P                   DB  OS  P
A   A   0.1             A   A   0.1                 B   A   0.13
A   B   0.1             A   B   0.1                 C   C   0.16
B   A   0.13            Logic = A                   Logic = A
B   C   0.09
C   C   0.16
Logic = A

Figure 2. Selection on Probabilistic Table and on Probability values in SP-Algebra

Complex selection conditions can be formed as Boolean combinations of atomic selection conditions.

Definition 3 Let $S = \langle T, V, P, C \rangle$ be an SPO and let $c : Q(A, x)$ be an atomic context selection condition. Then $\sigma_c(S) = \{S\}$ iff
• $A \in R^{*}$;
• For some instance $A^{*}$ of $A$ in $T$, $(S.T.A^{*}, x) \in Q$;
otherwise $\sigma_c(S) = \emptyset$.

Definition 4 Let $S = \langle T, V, P, C \rangle$ be an SPO and let $c : v \in V$ be an atomic participation selection condition. Then $\sigma_c(S) = \{S\}$ iff $v \in V$.

Definition 5
1. Let $S = \langle T, V, P, C \rangle$ be an SPO and let $c : u = \{x_1, \dots, x_h\}$ be an atomic conditional selection condition. Then $\sigma_c(S) = \{S\}$ iff $C \ni (u, X)$ and $X = \{x_1, \dots, x_h\}$.

2. Let $c : u \ni x$ be an atomic conditional selection condition. Then $\sigma_c(S) = \{S\}$ iff $C \ni (u, X)$ and $X \ni x$.

The semantics of atomic selection conditions can be extended to their Boolean combinations in a straightforward manner: $\sigma_{C \wedge C'}(S) = \sigma_C(\sigma_{C'}(S))$ and $\sigma_{C \vee C'}(S) = \sigma_C(S) \cup \sigma_{C'}(S)$.

The interpretation of negation in the context selection condition requires some additional explanation. In order for a selection condition of the form $\neg Q(A, x)$ to succeed on some SPO $S = \langle T, V, P, C \rangle$, attribute $A$ must be present in $T$. If $A$ is not in $R^{*}$, the selection condition does not get evaluated and the result will be $\emptyset$. Therefore, the statement $S \in \sigma_c(S) \vee S \in \sigma_{\neg c}(S)$ is not necessarily true. This also applies to conditional selection conditions.

Finally, for a semistructured probabilistic relation $\mathcal{S}$, $\sigma_C(\mathcal{S}) = \bigcup_{S \in \mathcal{S}} \sigma_C(S)$.
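The three all-or-nothing selections of Definitions 3 to 5 can be sketched in a few lines of Python. The dictionary encoding of an SPO (keys "T", "V", "P", "C") and all function names below are illustrative assumptions, not notation from the paper. Each function returns a list holding the SPO or an empty list, mirroring $\{S\}$ and $\emptyset$.

```python
import operator

# Hypothetical encoding of the SPO of Example 1 as a dict with keys "T", "V", "P", "C".
S = {
    "T": [("college", "Engineering")],   # context: (attribute, value) pairs
    "V": {"DB", "OS"},                   # participating random variables
    "P": {},                             # table omitted: these selections never read it
    "C": {"DS": {"A", "B"}},             # conditionals: variable -> allowed values
}


def select_context(spo, attr, pred, x):
    """Definition 3: keep spo iff some instance of attr in T satisfies pred(value, x)."""
    return [spo] if any(a == attr and pred(val, x) for a, val in spo["T"]) else []


def select_participation(spo, v):
    """Definition 4: keep spo iff v is one of its participating variables."""
    return [spo] if v in spo["V"] else []


def select_conditional_equals(spo, u, xs):
    """Definition 5(1): keep spo iff (u, X) is in C and X equals xs."""
    return [spo] if u in spo["C"] and spo["C"][u] == set(xs) else []


def select_conditional_contains(spo, u, x):
    """Definition 5(2): keep spo iff (u, X) is in C and x is in X."""
    return [spo] if x in spo["C"].get(u, set()) else []


def select_relation(relation, selector, *args):
    """Selection over an SPR: collect the per-object results."""
    return [s for spo in relation for s in selector(spo, *args)]
```

With this encoding, both `select_context(S, "major", operator.eq, "CS")` and `select_context(S, "major", operator.ne, "CS")` return an empty list, because `major` does not occur in the context of `S`. This is the behaviour of negation described above: neither a condition nor its negation needs to select an object.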

4.1.2 Selection on Probability Table or Probabilities

The selection operations considered in the previous section were simple in that their result on a semistructured probabilistic relation was always a subset of the relation.

The two types of selections introduced here are more complex. The result of each operation applied to an SPO can be a non-empty part of the original SPO. In particular, both operations preserve the context, participating random variables and conditionals in an SPO, but may return only a subset of the rows of the probability table. Both in the case of selection on the probability table and selection on probabilities, the selection condition will indicate which rows are to be included and which are to be omitted.

Definition 6 An atomic probabilistic table selection condition is an expression of the form $v = x$ where $v \in \mathcal{V}$ and $x \in dom(v)$. Probabilistic table selection conditions are Boolean combinations of atomic probabilistic table selection conditions.

Definition 7 Let $S = \langle T, V, P, C \rangle$ be an SPO, $V = \{v_1, \dots, v_k\}$ and let $c : v = x$ be an atomic probabilistic table selection condition.

If $v \in V$, then (assume $v = v_i$, $1 \leq i \leq k$) the result of selection from $S$ on $c$, $\sigma_c(S)$, is a semistructured probabilistic object $S' = \langle T, V, P', C \rangle$, where

\[ P'(y_1, \dots, y_i, \dots, y_k) = \begin{cases} P(y_1, \dots, y_i, \dots, y_k) & \text{if } y_i = x; \\ \text{undefined} & \text{if } y_i \neq x. \end{cases} \]

Definition 8 An atomic probabilistic selection condition is an expression of the form $P \; op \; \alpha$, where $\alpha \in [0, 1]$ and $op \in \{=, \neq, \leq, \geq, <, >\}$. Probabilistic selection conditions are Boolean combinations of atomic probabilistic selection conditions.

Definition 9 Let $S = \langle T, V, P, C \rangle$ be an SPO and let $c : P \; op \; \alpha$ be an atomic probabilistic selection condition. Let $\bar{x} \in dom(V)$. The result of selection from $S$ on $c$ is defined as follows: $\sigma_{P \; op \; \alpha}(S) = S' = \langle T, V, P', C \rangle$, where

\[ P'(\bar{x}) = \begin{cases} P(\bar{x}) & \text{if } P(\bar{x}) \; op \; \alpha; \\ \text{undefined} & \text{otherwise.} \end{cases} \]

S:                             $\pi_{DB,OS}(S)$ (step I):    $\pi_{DB,OS}(S)$ (step II):
major: CS                      major: CS                     major: CS
DB  OS  Compilers  P           DB  OS  P                     DB  OS  P
A   A   A          0.05        A   A   0.05                  A   A   0.05
A   B   A          0.04        A   B   0.04                  A   B   0.07
B   A   B          0.07        B   A   0.07                  B   A   0.13
A   B   C          0.03        A   B   0.03                  B   C   0.03
B   A   A          0.02        B   A   0.02                  B   B   0.08
B   A   C          0.04        B   A   0.04                  C   C   0.06
B   C   A          0.02        B   C   0.02                  Logic = A
B   C   B          0.01        B   C   0.01
B   B   A          0.03        B   B   0.03
B   B   B          0.05        B   B   0.05
C   C   A          0.01        C   C   0.01
C   C   B          0.02        C   C   0.02
C   C   C          0.03        C   C   0.03
Logic = A                      Logic = A

Figure 3. Projection on Probabilistic Table and on Probability values in SP-Algebra

Example 3 Figure 2 shows two examples of select queries on an SPO. The central object is obtained from the original SPO (left) as the result of the query “Find all information about the probability of getting an A in Databases”, represented as $\sigma_{DB=A}(S)$. In the probability table of the resulting SPO, only the rows that have the value of the DB random variable equal to A will remain.

The rightmost object in the figure is the result of the query “Find all grade combinations whose probability is greater than 0.11”. This query can be written as $\sigma_{P>0.11}(S)$. The probability table of the resulting object will contain only those rows from the original probability table where the probability value was greater than 0.11.
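The two queries of Example 3 can be reproduced with a short Python sketch over the probability table of Figure 2. Only the table is modelled, since context, participants and conditionals are left unchanged by these selections; the function names are hypothetical, and "undefined" rows are represented simply by their absence from the dictionary.

```python
import operator

# Probability table of the leftmost SPO in Figure 2, keyed by (DB, OS).
variables = ("DB", "OS")
P = {("A", "A"): 0.1, ("A", "B"): 0.1, ("B", "A"): 0.13,
     ("B", "C"): 0.09, ("C", "C"): 0.16}


def select_table(variables, table, v, x):
    """Definition 7: keep the rows whose v-column equals x; the rest become undefined.

    Raises ValueError if v does not participate (the definition only covers v in V).
    """
    i = variables.index(v)
    return {row: p for row, p in table.items() if row[i] == x}


def select_probability(table, op, alpha):
    """Definition 9: keep the rows whose probability p satisfies op(p, alpha)."""
    return {row: p for row, p in table.items() if op(p, alpha)}


db_is_a = select_table(variables, P, "DB", "A")        # central object of Figure 2
above_011 = select_probability(P, operator.gt, 0.11)   # rightmost object of Figure 2
```

Because both operations only filter rows, applying them in either order gives the same table, for example `select_probability(select_table(variables, P, "DB", "B"), operator.gt, 0.1)` and the same two calls with the order reversed.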

Different selection operations commute, as shown in the following theorem:

Theorem 1 Let $c$ and $c'$ be two (arbitrary) selection conditions and let $\mathcal{S}$ be a semistructured probabilistic relation. Then $\sigma_c(\sigma_{c'}(\mathcal{S})) = \sigma_{c'}(\sigma_c(\mathcal{S}))$.

4.2 Projection

Just as with selection, the results of the projection operation differ, depending on which parts of an SPO are to be projected out.

Projection on context and conditionals is similar to the traditional relational algebra projection: either a context attribute or a conditional is removed from an SPO, which does not change otherwise. These operations change the semantics of the SPO and thus must be used with caution. However, it can be argued that removing attributes from the relations in a relational database system also changes the semantics of the data. Due to space restrictions, we do not describe such projections in further detail.

A somewhat more difficult and delicate operation is the projection on the set of participating random variables. The removal of a random variable from the SPO’s participant set entails that information related to this random variable has to be removed from the probability table as well. Informally, this corresponds to removing one random variable from consideration in a joint probability distribution, which is usually called marginalization. The result of this operation is a new marginal probability distribution. It will be the purpose of our projection operation to compute this marginal probability distribution.

This computation can be performed in two steps. First, the columns for random variables that are to be projected out are removed from the probability table. In the remainder of the table, there can now be duplicate rows: rows in which all values (except for the probability value) coincide. All duplicate rows of the same type are then collapsed (coalesced) into one, with the new probability value computed as the sum of the values in the collapsed rows.

A formal definition of this procedure is given below.

Definition 10 Let $S = \langle T, V, P, C \rangle$ be an SPO, $V = \{v_1, \dots, v_q\}$, $q > 1$ and $L_V \subset V$, $L_V \neq \emptyset$. The projection of $S$ on $L_V$, denoted $\pi_{L_V}(S)$, is defined to be an object $S' = \langle T, L_V, P', C \rangle$ where $P' : dom(L_V) \longrightarrow [0, 1]$ and for each $\bar{x} \in dom(L_V)$,

\[ P'(\bar{x}) = \sum_{\bar{y} \in dom(V - L_V);\; P(\bar{x}, \bar{y}) \text{ is defined}} P(\bar{x}, \bar{y}). \]

Notice that projection on the set of participants is allowed only if the set of participants is not a singleton and if at least one random variable remains in the resulting set.

S:                        $\mu_{OS=A}(S)$ (step I):    $\mu_{OS=A}(S)$ (step II):
college: Engineering      college: Engineering         college: Engineering
DB  OS  P                 DB  OS  P                    DB  P
A   A   0.09              A   A   0.09                 A   0.375
A   B   0.12              B   A   0.12                 B   0.5
A   C   0.03              C   A   0.03                 C   0.125
A   F   0.005             F   A   0                    F   0
B   A   0.12              DS ∈ {A, B}                  DS ∈ {A, B}
B   B   0.16                                           OS = A
B   C   0.13
B   F   0.01
C   A   0.03
C   B   0.08
C   C   0.11
C   F   0.045
F   A   0
F   B   0.01
F   C   0.02
F   F   0.04
DS ∈ {A, B}

Figure 4. Conditionalization in SP-Algebra.

Example 4 Figure 3 illustrates how projection on the set of participating random variables works. First, the columns of random variables to be projected out are removed from the probability table (step I). Next, the remaining rows are coalesced (step II). After the Compilers random variable has been projected out, the interim probability table has three rows (B, A) with probabilities 0.07, 0.02 and 0.04 respectively. These rows are combined into one row with probability value set to $0.07 + 0.02 + 0.04 = 0.13$. Similar operations are performed on the other rows.

Theorem 2 Let $S = \langle T, V, P, C \rangle$ be an SPO with $P$ being a joint probability distribution of random variables $V$. Then, for any $\emptyset \neq L_V \subset V$, the probability table $P'$ from $S' = \langle T, L_V, P', C \rangle = \pi_{L_V}(S)$ contains the correct marginal probability distribution of variables in $L_V$ derived from $P$.
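The two-step procedure of Definition 10 can be illustrated with a short Python sketch applied to the leftmost table of Figure 3. The function name `project` and the dictionary representation of the probability table are our own illustrative choices; rows missing from the dictionary play the role of undefined entries and are simply skipped by the sum.

```python
from collections import defaultdict

# Probability table of the leftmost SPO in Figure 3, keyed by (DB, OS, Compilers).
variables = ("DB", "OS", "Compilers")
P = {
    ("A", "A", "A"): 0.05, ("A", "B", "A"): 0.04, ("B", "A", "B"): 0.07,
    ("A", "B", "C"): 0.03, ("B", "A", "A"): 0.02, ("B", "A", "C"): 0.04,
    ("B", "C", "A"): 0.02, ("B", "C", "B"): 0.01, ("B", "B", "A"): 0.03,
    ("B", "B", "B"): 0.05, ("C", "C", "A"): 0.01, ("C", "C", "B"): 0.02,
    ("C", "C", "C"): 0.03,
}


def project(variables, table, keep):
    """Marginalize `table` onto the variables listed in `keep` (Definition 10)."""
    if not keep or not set(keep) <= set(variables):
        raise ValueError("keep must be a non-empty subset of the participating variables")
    idx = [variables.index(v) for v in keep]
    marginal = defaultdict(float)
    for row, p in table.items():
        # Step I drops the projected-out columns; step II sums the duplicate rows.
        marginal[tuple(row[i] for i in idx)] += p
    return dict(marginal)


marginal = project(variables, P, ("DB", "OS"))
for row, p in sorted(marginal.items()):
    print(row, round(p, 2))
```

The printed values are rounded because floating-point sums such as 0.07 + 0.02 + 0.04 need not be exactly 0.13; an implementation that must preserve exact probabilities could use `fractions.Fraction` instead of floats.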

4.3 Conditionalization

Conditionalization is an operation specific to probabilistic algebras. Dey and Sarkar [9] were the first to consider this operation in the context of probabilistic databases.

Similarly to the projection operation, conditionalization reduces the probability distribution table. The difference is that the result of conditionalization is a conditional probability distribution. Given a joint probability distribution, conditionalization answers the following general query: “What is the probability distribution of the remaining random variables if the value of some random variable $v$ in the distribution is restricted to a subset $X$ of its values?”

Informally, the conditionalization operation proceeds on a given SPO as follows. The input to the operation is one participating random variable of the SPO, $v$, and a subset of its values $X$. The first step of conditionalization consists of removing from the probability table of the SPO all rows whose $v$ values are not from the set $X$. Then, the $v$ column is removed from the table. The remaining rows are coalesced (if needed) in the same manner as in the projection operation, and the probability values are normalized. Finally, $(v, X)$ is added to the set of conditionals of the resulting SPO.

The formal definition of conditionalization is given below. Note that if the original table is incomplete, there is no meaningful way to normalize a conditionalized probability distribution. Thus, we restrict this operation to situations where normalization is well-defined.

Definition 11 An SPO $S = \langle T, V, P, C \rangle$ is conditionalization-compatible with an atomic conditional selection condition $v = \{x_1, \dots, x_h\}$ iff
• $v \in V$;
• The restriction of $P$ on $\{x_1, \dots, x_h\}$ for $v$ is a complete function.

Definition 12 Let $S = \langle T, V, P, C \rangle$ be an SPO which is conditionalization-compatible with an atomic conditional selection condition $c : v = \{x_1, \dots, x_h\}$.

The result of conditionalization of $S$ by $c$, denoted $\mu_c(S)$, is defined as follows:

\[ \mu_c(S) = \langle T, V', P', C' \rangle, \]

where
• $V' = V - \{v\}$;
• $C' = C \cup \{(v, \{x_1, \dots, x_h\})\}$;
• $P' : dom(V') \longrightarrow [0, 1]$. Let

\[ N = \sum_{\bar{y} \in dom(V')} \; \sum_{x \in \{x_1, \dots, x_h\}} P(\bar{y}, x). \]

For any $\bar{y} \in dom(V')$,

\[ P'(\bar{y}) = \frac{\sum_{x \in \{x_1, \dots, x_h\}} P(\bar{y}, x)}{N}. \]

We can show that the definition above does indeed compute the conditional probability distribution.

Theorem 3 Let $S = \langle T, V, P, C \rangle$, $v \in V$ and let $c : v = \{x_1, \dots, x_h\}$ be an atomic selection condition. If $S$ is conditionalization-compatible with $c$, then $\mu_c(S)$ correctly computes the probability distribution of the random variables in $V - \{v\}$ from $P$, under the assumption that $v \in \{x_1, \dots, x_h\}$.

Conditionalization can be extended to a semistructured relation in a straightforward manner. Given a relation $\mathcal{S}$, $\mu_c(\mathcal{S})$ will consist of $\mu_c(S)$ for each $S \in \mathcal{S}$ that is conditionalization-compatible with $c$.

Example 5 Consider the SPO $S$ defined in Example 1, describing the joint probability distribution of performance in Databases and Operating Systems for College of Engineering majors who received either A or B in Data Structures. Figure 4 depicts the work of the conditionalization operation $\mu_{OS=A}(S)$. The original object is shown on the left. As $S.P$ is a complete distribution, $S$ is conditionalization-compatible with $OS = A$. The first step of conditionalization consists of removing all rows that do not satisfy the conditionalization condition from $S.P$ (result depicted in the center). Then, in step II, the OS column is dropped from the table, probability values in the remaining rows are normalized and $OS = A$ is added to the list of conditionals. The rightmost object in Figure 4 shows the final result.
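The computation of Example 5 can be reproduced with the following Python sketch of Definitions 11 and 12. The function name, argument layout and shared grade domain are assumptions made for illustration. The explicit check for a zero normalization constant is also our own addition, since the definitions do not discuss that case.

```python
from itertools import product
from math import isclose

GRADES = ("A", "B", "C", "F")  # assumed common domain of all course variables
variables = ("DB", "OS")
probs = [0.09, 0.12, 0.03, 0.005, 0.12, 0.16, 0.13, 0.01,
         0.03, 0.08, 0.11, 0.045, 0.0, 0.01, 0.02, 0.04]
P = dict(zip(product(GRADES, repeat=2), probs))  # Table 1, keyed by (DB, OS)


def conditionalize(variables, table, conditionals, v, xs, domain=GRADES):
    """Sketch of conditionalization by c: v = xs (Definitions 11 and 12)."""
    xs = set(xs)
    i = variables.index(v)  # raises ValueError if v does not participate
    rest = tuple(u for u in variables if u != v)
    # Definition 11: every row whose v-value lies in xs must be defined.
    for row in product(domain, repeat=len(variables)):
        if row[i] in xs and row not in table:
            raise ValueError("SPO is not conditionalization-compatible")
    # Keep rows with v in xs, drop the v column, and coalesce duplicates.
    reduced = {}
    for row, p in table.items():
        if row[i] in xs:
            key = row[:i] + row[i + 1:]
            reduced[key] = reduced.get(key, 0.0) + p
    n = sum(reduced.values())  # normalization constant N
    if n == 0:
        raise ValueError("conditioning event has probability 0 (case not covered here)")
    new_table = {key: p / n for key, p in reduced.items()}
    return rest, new_table, {**conditionals, v: xs}


rest, cond_table, cond_c = conditionalize(variables, P, {"DS": {"A", "B"}}, "OS", {"A"})
```

Here $N = 0.09 + 0.12 + 0.03 + 0 = 0.24$, so the resulting table maps A, B, C and F to 0.375, 0.5, 0.125 and 0 (up to floating-point rounding), in agreement with the rightmost object of Figure 4.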

4.4 Cartesian Product and Join

Cartesian product is defined only on pairs of compatible SPOs. Intuitively, a cartesian product of two probability distributions is the joint probability distribution of random variables involved in both original distributions. Here, we will restrict ourselves to the assumption of independence between the probability distributions in cartesian products and joins. In our future research we will remove this restriction.

Intuitively, the SPOs are compatible for cartesian product if their participating variables are disjoint, but their conditionals coincide.

Definition 13 Two SPOs $S = \langle T, V, P, C \rangle$ and $S' = \langle T', V', P', C' \rangle$ are cartesian product-compatible (cp-compatible) iff $V \cap V' = \emptyset$ and $C = C'$.

We can now define the cartesian product.

Definition 14 Let $S = \langle T, V, P, C \rangle$ and $S' = \langle T', V', P', C' \rangle$ be two cp-compatible SPOs. Then, the result of their cartesian product (under the assumption of independence), denoted $S \times S'$, is defined as follows:

$S \times S' = S'' = \langle T'', V'', P'', C'' \rangle$, where
• $T'' = (T, T')$;
• $V'' = V \cup V'$;
• $P'' : dom(V'') \to [0, 1]$. For all $\bar{z} \in dom(V'')$, $\bar{z} = (\bar{x}, \bar{y})$, $\bar{x} \in dom(V)$, $\bar{y} \in dom(V')$: $P''(\bar{z}) = P(\bar{x}) \cdot P'(\bar{y})$;
• $C'' = C = C'$.

We can now define probabilistic joins. Two SPOs are join-compatible if they share some participating variables (these will be the "join attributes") and their conditionals coincide.

Definition 15 Two SPOs $S = \langle T, V, P, C \rangle$ and $S' = \langle T', V', P', C' \rangle$ are join-compatible iff $V \cap V' \neq \emptyset$ and $C = C'$.

In classical relational algebra the equijoin of two relations can be computed by first taking the cartesian product of the two relations, then selecting rows where the values of duplicate attributes coincide and then projecting out one of the duplicate attribute columns.
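For reference, the three classical steps can be sketched in Python as follows. This is only an illustration: the `equijoin` function and the sample rows are hypothetical, relations are modeled as lists of dictionaries, and the sketch assumes the join attribute is the only attribute name the two relations share.

```python
def equijoin(rows1, rows2, attr):
    """Classical equijoin on attr, computed in three explicit steps."""
    # Step 1: cartesian product of the two relations.
    pairs = [(r1, r2) for r1 in rows1 for r2 in rows2]
    # Step 2: select the pairs whose values of the duplicate attribute coincide.
    matching = [(r1, r2) for r1, r2 in pairs if r1[attr] == r2[attr]]
    # Step 3: project out one copy of the duplicate attribute column.
    return [{**r1, **{k: v for k, v in r2.items() if k != attr}}
            for r1, r2 in matching]


# Hypothetical relations.
students = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
grades = [{"id": 1, "course": "DB"}, {"id": 1, "course": "OS"},
          {"id": 3, "course": "DS"}]

joined = equijoin(students, grades, "id")
print(joined)
```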

Our approach to a probabilistic join is very similar. Suppose $S = \langle T, V, P, C \rangle$ and $S' = \langle T', V', P', C' \rangle$ are two (join-compatible) SPOs to be joined. Let $V = V_1 \cup V_c$ and $V' = V'_1 \cup V_c$, i.e. $V_c = V \cap V'$. Then, we want the join $S \bowtie S'$ of $S$ and $S'$ to contain the joint probability distribution of the set $V_1 \cup V_c \cup V'_1$. This distribution can be obtained in one of two ways:

1. Random variables $V_c$ are projected out of $S$, and the marginal probability distribution obtained as a result is then used in a cartesian product with $S'$.

2. Random variables $V_c$ are projected out of $S'$, and the marginal probability distribution obtained as a result is then used in a cartesian product with $S$.

S (major: CS; conditional: Logic = A)
DB  OS  P
A   A   0.25
A   B   0.25
B   A   0.25
B   B   0.25

S' (college: ENGR; conditional: Logic = A)
OS  DS  P
A   A   0.2
A   B   0.3
B   A   0.3
B   B   0.2

$\pi_{DB}(S)$ (major: CS; conditional: Logic = A)
DB  P
A   0.5
B   0.5

$\pi_{DS}(S')$ (college: ENGR; conditional: Logic = A)
DS  P
A   0.5
B   0.5

$S \bowtie_1 S' = S \times \pi_{DS}(S')$ (major: CS; college: ENGR; conditional: Logic = A)
DB  OS  DS  P
A   A   A   0.125
A   A   B   0.125
A   B   A   0.125
A   B   B   0.125
B   A   A   0.125
B   A   B   0.125
B   B   A   0.125
B   B   B   0.125

$S \bowtie_2 S' = \pi_{DB}(S) \times S'$ (major: CS; college: ENGR; conditional: Logic = A)
DB  OS  DS  P
A   A   A   0.1
A   A   B   0.15
A   B   A   0.15
A   B   B   0.1
B   A   A   0.1
B   A   B   0.15
B   B   A   0.15
B   B   B   0.1

Figure 5. Join in SP-Algebra

If the distribution of random variables from $V_c$ in $S$ and $S'$ were the same, both methods would be equivalent. We, however, do not have this guarantee. The existence of different probability distributions on the same variables and the means of dealing with it present an interesting and important problem which we are currently investigating. For now, we just accept that the results produced by the two procedures described above can be different. Thus we define two join operations: one favoring the variables in the first argument, and the other favoring the variables in the second.

Definition 16 Let $S = \langle T, V, P, C \rangle$ and $S' = \langle T', V', P', C' \rangle$ be two join-compatible SPOs with $V \cap V' = V_c$. Then we define the results of two join operations on $S$ and $S'$ as follows:
• $S \bowtie_1 S' = S \times \pi_{V' - V_c}(S')$.
• $S \bowtie_2 S' = \pi_{V - V_c}(S) \times S'$.
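The cartesian product and the two joins can be sketched in Python as follows, using the probability tables of Figure 5 as input. This is a simplified illustration under assumptions, not the authors' implementation: the `SPO` class, its field names and the functions `project`, `cartesian_product`, `join1` and `join2` are hypothetical, and the combined context is modeled as a merged dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class SPO:
    """Simplified SPO; names are illustrative assumptions."""
    variables: tuple                                   # V, in column order
    table: dict                                        # P: value tuple -> probability
    conditionals: dict = field(default_factory=dict)   # C
    context: dict = field(default_factory=dict)        # T


def project(spo, keep):
    """Marginal distribution over the variables in keep (pi_keep)."""
    kept = tuple(v for v in spo.variables if v in keep)
    idx = [spo.variables.index(v) for v in kept]
    out = {}
    for row, p in spo.table.items():
        key = tuple(row[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return SPO(kept, out, dict(spo.conditionals), dict(spo.context))


def cartesian_product(s1, s2):
    """S x S' under the independence assumption (Definition 14)."""
    if set(s1.variables) & set(s2.variables) or s1.conditionals != s2.conditionals:
        raise ValueError("SPOs are not cp-compatible")
    table = {r1 + r2: p1 * p2
             for r1, p1 in s1.table.items()
             for r2, p2 in s2.table.items()}
    return SPO(s1.variables + s2.variables, table,
               dict(s1.conditionals), {**s1.context, **s2.context})


def _shared(s1, s2):
    shared = set(s1.variables) & set(s2.variables)
    if not shared or s1.conditionals != s2.conditionals:
        raise ValueError("SPOs are not join-compatible")
    return shared


def join1(s1, s2):
    """S join_1 S' = S x pi_{V' - Vc}(S')."""
    return cartesian_product(s1, project(s2, set(s2.variables) - _shared(s1, s2)))


def join2(s1, s2):
    """S join_2 S' = pi_{V - Vc}(S) x S'."""
    return cartesian_product(project(s1, set(s1.variables) - _shared(s1, s2)), s2)


# The two SPOs of Figure 5.
S = SPO(("DB", "OS"),
        {("A", "A"): 0.25, ("A", "B"): 0.25, ("B", "A"): 0.25, ("B", "B"): 0.25},
        {"Logic": {"A"}}, {"major": "CS"})
S_prime = SPO(("OS", "DS"),
              {("A", "A"): 0.2, ("A", "B"): 0.3, ("B", "A"): 0.3, ("B", "B"): 0.2},
              {"Logic": {"A"}}, {"college": "ENGR"})

j1 = join1(S, S_prime)   # columns DB, OS, DS; every row has probability 0.125
j2 = join2(S, S_prime)   # columns DB, OS, DS; probabilities follow S'
print(j1.table[("A", "B", "A")], j2.table[("A", "B", "A")])
```

Both results use the column order DB, OS, DS, so a row such as (A, B, A) can be compared directly between the two joins.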

Example 6 Consider two simple SPOs $S$ and $S'$ as presented in Figure 5. $S$ and $S'$ share one random variable (OS) and their conditional parts coincide (Logic = A). Hence, $S$ and $S'$ are join-compatible.

Computation of the two joins of $S$ and $S'$, $S \bowtie_1 S'$ and $S \bowtie_2 S'$, is presented in the rest of Figure 5. First, OS is projected out of $S$ and $S'$ (second column in Figure 5). Then, the formulas $S \bowtie_1 S' = S \times \pi_{V' - V_c}(S')$ and $S \bowtie_2 S' = \pi_{V - V_c}(S) \times S'$ are applied to compute the respective joins (results are shown in the last two columns in Figure 5).

In the resulting SPOs, the context will be a union of the contexts of the two original objects and the conditional part will be the same as in $S$ and $S'$. The probability table is formed by joining together the random variables from $S$ and $\pi_{DS}(S')$ ($\pi_{DB}(S)$ and $S'$, respectively) and multiplying the appropriate probabilities. For example, the probability value for the row A,B,A in $S \bowtie_1 S'$ is computed by multiplying the probability value from the row A,B of $S$ (0.25) by the probability value from the row A of $\pi_{DS}(S')$ (0.5). In $S \bowtie_2 S'$ the probability value for the same row is computed by multiplying the probability value from the row A in $\pi_{DB}(S)$ (0.5) by the probability value from the row B,A in $S'$ (0.3).

5 Future Work

There are three major foci of our current work: (i) implementation of a semistructured DBMS based on this model; (ii) extension of the data model and the algebra to handle interval probabilities; and (iii) the study of data fusion and conflict resolution problems that arise in this framework. In [13], the data model and SP-algebra are translated into XML [4]. This is the first step in the implementation of our DBMS.

6 Related Work

Cavallo and Pittarelli [6], Barbara, Garcia-Molina and Porter [2], Dey and Sarkar [9] and Lakshmanan et al. [15] have described different probabilistic relational models. Our work combines and extends the ideas contained in these works and applies them to a semistructured data model, which provides us with the added benefit of schema flexibility. Kornatzky and Shimony [14] and Eiter et al. [10] have developed probabilistic object models. Their approach is different from ours, as the probabilistic object (e.g. as described in [10]) represents a single real world entity with uncertain attribute values. In our case, an SPO represents a probability distribution of one or more random variables.

For more information about the Semistructured Data Model and its relationship to XML we refer the reader to [1, 5, 18]. Fernandez et al. [12] recently proposed a general-purpose algebra for XML as part of an effort to standardize querying of XML data. This algebra formed the basis of W3C's XML Query Algebra Working Draft [11]. Both [12] and [11] focus on describing the semantics of general-purpose queries to XML documents. In particular, they do not deal with querying probabilistic data. Also, the algebra presented here works on semistructured data irrespective of the format in which it is actually stored, while [12, 11] provide syntax directed at XML data. The semistructured probabilistic algebra presented here has no data format-specific syntax. Future development will take this work as well as existing work on XML query languages [3] into account.

Acknowledgements

We'd like to thank the anonymous reviewers, whose suggestions improved this paper. We also want to thank V.S. Subrahmanian for helpful comments and the students working in the Bayesian Advisor group for their input at various stages, and their fabulous energy.

References

[1] S. Abiteboul, P. Buneman, D. Suciu. (1999) Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann.

[2] D. Barbara, H. Garcia-Molina and D. Porter. (1992) The Management of Probabilistic Data, IEEE Trans. on Knowledge and Data Engineering, Vol. 4, pp. 487–502.

[3] A. Bonifati, S. Ceri. (2000) Comparative Analysis of Five XML Query Languages, SIGMOD Record, Vol. 29, No. 1, pp. 68–79.

[4] T. Bray, J. Paoli, C.M. Sperberg-McQueen (Eds.) (1998) Extensible Markup Language (XML) 1.0, World Wide Web Consortium Recommendation, http://www.w3.org/TR/1998/REC-xml-19980210.

[5] P. Buneman. (1997) Semistructured Data, in Proc. PODS'97, pp. 117–121.

[6] R. Cavallo and M. Pittarelli. (1987) The Theory of Probabilistic Databases, Proc. VLDB'87, pp. 71–81.

[7] A. Dekhtyar, J. Goldsmith, H. Li, B. Young. (2001) The Bayesian Advisor Project I: Modeling Academic Advising, University of Kentucky Dept. of Computer Science Technical Report # 323-01.

[8] A. Dekhtyar and V.S. Subrahmanian. (2000) Hybrid Probabilistic Logic Programs, Journal of Logic Programming, Vol. 43, No. 3, pp. 187–250.

[9] D. Dey and S. Sarkar. (1996) A Probabilistic Relational Model and Algebra, ACM Transactions on Database Systems, Vol. 21, No. 3, pp. 339–369.

[10] T. Eiter, J. Lu, T. Lukasiewicz, V.S. Subrahmanian. (2001) Probabilistic Object Bases, accepted pending revisions to ACM Transactions on Database Systems.

[11] P. Fankhauser, M. Fernandez, A. Malhotra, M. Rys, J. Simeon, P. Wadler (Eds.) (2001) The XML Query Algebra, World Wide Web Consortium Working Draft, http://www.w3.org/TR/query-algebra/.

[12] M. Fernandez, J. Simeon, P. Wadler. (2000) An Algebra for XML Query, in Proc. Found. of Softw. Technology and Theor. CS, pp. 11–45.

[13] S. Hawkes, A. Dekhtyar. (2001) Designing Markup Languages for Probabilistic Information, University of Kentucky Tech. Report TR 319-01.

[14] E. Kornatzky, S.E. Shimony. (1994) A Probabilistic Object Data Model, Data and Knowledge Engineering, Vol. 12, pp. 143–166.

[15] L.V.S. Lakshmanan, N. Leone, R. Ross and V.S. Subrahmanian. (1997) ProbView: A Flexible Probabilistic Database System, ACM Transactions on Database Systems, Vol. 22, No. 3, pp. 419–469.

[16] R. Ng and V.S. Subrahmanian. (1993) Probabilistic Logic Programming, Information and Computation, Vol. 101, No. 2, pp. 150–201.

[17] J. Pearl. (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.

[18] D. Suciu. (1998) Semistructured Data and XML, in Proc. 5th Intl. Conf. on Foundation of Data Organization, pp. 1–12.