provenance management & citations in curated databases

PROVENANCE MANAGEMENT & CITATIONS IN CURATED DATABASES

Kleisarchaki Sophia

HY561, 05/05/09

ABOUT THE AUTHOR – PETER BUNEMAN

He works in the Database Group in the Laboratory for Foundations of Computer Science (University of Edinburgh).

He spent many years in the Database Group of the Department of Computer and Information Science at the University of Pennsylvania.

You can find him... in polynomial time.

CONTENTS

Before All..

“Curated Databases” Peter Buneman, James Cheney, Wang-Chiew Tan

“Provenance in Databases (Tutorial Outline)” Peter Buneman, Wang-Chiew Tan

1st paper: “Provenance Management In Curated Databases” Peter Buneman, Adriane P. Chapman, James Cheney

2nd paper: “How to cite curated databases and how to make them citable” Peter Buneman

CURATED DATABASES

What is a Curated Database?

The term “curated” comes from the Latin curare – to care for.

They are the result of a great deal of annotation, correction and transfer of data from other sources.

They are databases that are populated and updated with a great deal of human effort through the consultation, verification and aggregation of existing sources and the interpretation of new raw data.

CURATED DATABASES

What is a Curated Database NOT?

Curated databases are not warehouses. They are manually constructed by highly skilled scientists.

They are not views.

They are not computed automatically from existing datasets.

CURATED DATABASES

Notable examples of curated databases

UniProt (formerly called SwissProt): used in molecular biology.

CIA World Factbook: source of demographic data.

IUPHAR: receptor database. Maintained by volunteers.

Such databases are not confined to biology; they are also being developed in areas such as astronomy and geology. Wikipedia and other wikis are also curated in that they are the product of direct human effort.

Nuclear Protein Database (NPD).

Reference manuals, dictionaries and gazetteers.

CURATED DATABASES

What are the characteristics of a Curated Database?

Source. Data that is copied and edited from existing sources, perhaps other curated databases. Knowing the origin – provenance – is important.

Annotation. In addition to core data, curated databases also contain annotations that carry additional pieces of information such as provenance.

Update. A common practice is to keep a working database up to date and to “publish” versions of it.

Schema and structure. Constructed “on the cheap”, usually stored in a text file. Almost inevitably the structure of the entries evolves over time.


PROVENANCE IN DATABASES (1/2)

Provenance – also called lineage and pedigree – describes the source and derivation of data.

Why is provenance important? It helps to:

Determine the authenticity of a work.

Establish the historical importance of a work by suggesting other artists who might have seen and been influenced by it.

Determine the legitimacy of current ownership.

Trust the data.

PROVENANCE IN DATABASES (2/2)

Overview of provenance

Provenance: describes the source and derivation of data.

  Workflow or coarse-grain provenance: records a complete history of the derivation of some data set.

  Dataflow or fine-grain provenance: derivation of part of the resulting data set.

    Why-provenance: keeps the justification for the element appearing in the output.

    Where-provenance: the identification of the source elements where the data in the target is copied from.

WHERE-, WHY- PROVENANCE

NYHotels (Source table)

Hotel            Zip    Rating
Waldorf Astoria  10022  4.5
Holiday Inn DT   10013  4.0

Second source table (restaurants)

Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022

View (JOIN, PROJECT)

Hotel            Restaurant          Cost  Rating
Waldorf Astoria  Peacock Alley       $$$   4.5
Waldorf Astoria  Bull & Bear         $$$   4.5
Holiday Inn DT   Pacifica            $     4.0
Waldorf Astoria  Soho Kitchen & Bar  $     4.5

Why? (Why-provenance)

Where? (Where-provenance)

CONTENTS



1st paper 2nd paper

WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?

Database technology is employed not only to provide access to source data, but also to the derived knowledge of scientists who have interpreted the data.

Provenance or metadata describing creation, recording, ownership, processing, or version history is essential for assessing the value of such data.

What information should be retained?

How should it be managed?

WHAT IS THIS PAPER ABOUT?

Investigates general-purpose techniques for recording provenance for data that is copied among databases.

Describes an approach that tracks the user’s actions in order to record them in a convenient, queryable form.

Presents an implementation of this technique and uses it to evaluate the feasibility of database support for provenance management.

CURATED DATABASES - EXAMPLE

Example: a curator

a) Copies records of some interesting proteins from a SwissProt webpage into her database.

b) Fixes the new entries so that the PTM (post-translational modification) found in SwissProt is not confused with her own.

c) Copies some publications from OMIM and NCBI.

d) One year later, finds a discrepancy between two PTMs.

THE PROBLEM

It is necessary to retain provenance information describing the source and version history of the data.

We focus on “fine-grained” provenance, which describes how data has moved through a network of databases.

Need to record both local modifications to the database (insert, delete, update) and global operations such as copying data from external sources.

Constraints:

1. There is no standard for storing or exchanging provenance.

2. Varying practices for identifying or locating data.

3. Past versions may not be archived.

4. Curators employ a variety of application programs that cannot be changed.

OUR APPROACH (1/2)

User’s actions are captured as a sequence of insert, delete, copy and paste operations by a provenance-aware application.

Figure: Provenance architecture (external source databases, local database, auxiliary provenance database).

OUR APPROACH (2/2)

Implemented a naïve approach and several more sophisticated ones.

The naïve approach increases the time to process each update by 28%. The amount of provenance information stored is proportional to the size of the changed data.

Optimization techniques:

Transactional provenance management.

Hierarchical provenance management.

Together these optimizations reduce the added processing cost of provenance tracking to less than 5-10% per operation and reduce the storage cost by a factor of 5-7 relative to the naïve approach.

Typical provenance queries can be executed more efficiently.

MANUAL UPDATES AND PROVENANCE (1/2)

“Where does a piece of data come from?”

We need a means of describing the location of any data element.

Two assumptions:

Database can be viewed as a tree.

Labels on edges occur on at most one path.

(SwissProt/Release{20}/Q01780 identifies a specific entry)

MANUAL UPDATES AND PROVENANCE (2/2)

Update operations are of the form:

u ::= ins {a : v} into p | del a from p | copy q into p

ins {a : v} into p: inserts an edge labeled a with value v into the subtree at p.

del a from p: deletes an edge and its subtree.

copy q into p: replaces the subtree at p with a copy of the subtree at location q.
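The three operations can be sketched on a nested dictionary standing in for the tree-shaped database. This is only an illustration: the function names, the "/"-separated path syntax and the toy data are assumptions, not part of the paper.

```python
import copy

def resolve(db, path):
    """Follow a '/'-separated path of edge labels from the root of a nested dict."""
    node = db
    for label in [s for s in path.split("/") if s]:
        node = node[label]
    return node

def ins(db, a, v, p):
    """ins {a : v} into p: add an edge labeled a with value v under p."""
    parent = resolve(db, p)
    if a in parent:
        raise KeyError(f"edge {a!r} already exists under {p!r}")
    parent[a] = v

def delete(db, a, p):
    """del a from p: remove edge a and its whole subtree."""
    del resolve(db, p)[a]

def copy_into(db, q, p):
    """copy q into p: replace the subtree at p with a deep copy of the subtree at q."""
    parent_path, _, label = p.rpartition("/")
    resolve(db, parent_path)[label] = copy.deepcopy(resolve(db, q))

# Hypothetical source S and target T living in one tree.
db = {"S": {"a": {"x": 1, "y": 2}}, "T": {}}
ins(db, "a", {}, "T")
copy_into(db, "S/a", "T/a")
delete(db, "y", "T/a")
print(db["T"])  # {'a': {'x': 1}}
```

The deep copy matters: after the paste, editing T/a must not change S/a.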

PROVENANCE TRACKING

Prov(Tid, Op, Loc, Src)



NAÏVE PROVENANCE

Store one provenance record for each copied, inserted or deleted node.

Wasteful in terms of space.

Retains the maximum possible information about the user’s actions.

Figure: one transaction per line.
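A minimal sketch of naïve tracking for a paste, assuming the Prov columns mean transaction id, operation, target location and source location (a reading of the slide, not a quoted definition). The helper names and the sample subtree are hypothetical.

```python
from typing import NamedTuple, Optional

class Prov(NamedTuple):
    tid: int            # transaction identifier (assumed meaning of Tid)
    op: str             # "I" insert, "D" delete, "C" copy
    loc: str            # path of the affected node in the target
    src: Optional[str]  # source path for copies, None otherwise

def paths(tree, prefix):
    """Yield the path of every node in the subtree rooted at prefix."""
    yield prefix
    if isinstance(tree, dict):
        for label, child in tree.items():
            yield from paths(child, f"{prefix}/{label}")

def naive_copy(store, tid, src_tree, src_path, dst_path):
    """Naive tracking of a paste: one record per node of the copied subtree."""
    for p in paths(src_tree, src_path):
        suffix = p[len(src_path):]
        store.append(Prov(tid, "C", dst_path + suffix, p))

store = []
subtree = {"b": {"c": 1}, "d": 2}  # hypothetical subtree stored at S/a
naive_copy(store, tid=1, src_tree=subtree, src_path="S/a", dst_path="T/a")
for row in store:
    print(row)
```

Copying this four-node subtree produces four records, which is why storage grows with the size of the copied data.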

TRANSACTIONAL PROVENANCE

Actions are grouped into transactions larger than a single operation.

Store only provenance links describing the net changes resulting from a transaction.

Details about intermediate states are not retained.

Less precise than the naïve approach.

Number of transactional provenance records: i + d + c

i: number of inserted nodes in the output.

d: number of nodes deleted in the input.

c: number of copied nodes in the output.

Figure: entire update as one transaction.
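The net-change idea can be sketched as a small tracker that keeps pending links in memory and writes them only on commit. The class, its method names and the per-node calling convention are assumptions made for illustration; this is not the authors' implementation.

```python
class TransactionalTracker:
    """Keeps only the net provenance links of the open transaction (a sketch)."""

    def __init__(self, existing_paths=()):
        self.base = set(existing_paths)  # nodes present when the transaction began
        self.provlist = {}               # loc -> (op, src): pending net changes
        self.store = []                  # committed (tid, op, loc, src) records
        self.tid = 0

    def insert(self, loc):
        self.provlist[loc] = ("I", None)

    def copy(self, loc, src):
        self.provlist[loc] = ("C", src)

    def delete(self, loc):
        self.provlist.pop(loc, None)         # forget links to temporary data
        if loc in self.base:                 # node existed before this transaction
            self.provlist[loc] = ("D", None)

    def commit(self):
        self.tid += 1
        for loc, (op, src) in self.provlist.items():
            self.store.append((self.tid, op, loc, src))
            if op == "D":
                self.base.discard(loc)
            else:
                self.base.add(loc)
        self.provlist.clear()

t = TransactionalTracker({"T/old"})
t.copy("T/a", "S/a")
t.copy("T/tmp", "S/b")   # pasted, then removed again: leaves no trace
t.delete("T/tmp")
t.insert("T/n")
t.delete("T/old")
t.commit()
print(t.store)
```

In this run one node is copied, one inserted and one deleted, so three records are stored (i + d + c = 1 + 1 + 1), and the temporary paste at T/tmp is not recorded.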

HIERARCHICAL PROVENANCE (1/2)

It is not necessary to store all of the provenance links explicitly.

The provenance of a child of a copied node can often be inferred from its parent’s provenance using a simple rule.

Does not discard any information.

Does not require the user to group operations into transactions.

Figure: hierarchical version of the naïve approach. 25% smaller than Prov, but much larger savings are possible.

HIERARCHICAL PROVENANCE (2/2)

We can define the full provenance table as a view of the hierarchical table as follows:

If the provenance is specified in HProv, then it is just copied into Prov.

Otherwise, the provenance of every target path p/a not mentioned in HProv is q/a, provided p was copied from q.

Infer(t, p) ← ¬(∃x, q. HProv(t, x, p, q))

Prov(t, op, p, q) ← HProv(t, op, p, q)

Prov(t, C, p/a, q/a) ← Prov(t, C, p, q), Infer(t, p)

Prov(t, I, p/a, ⊥) ← Prov(t, I, p, ⊥), Infer(t, p)

Prov(t, D, p/a, ⊥) ← Prov(t, D, p, ⊥), Infer(t, p)
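The inference can be sketched as a lookup that walks up the tree until it finds a stored record. The code follows the prose reading of the rule (a path with no record of its own inherits from its parent); the dictionary layout and the function name are assumptions.

```python
def lookup(hprov, t, p):
    """Return (op, src) for target path p in transaction t, or None if p was untouched.

    hprov maps (tid, loc) -> (op, src); src is None for inserts and deletes.
    """
    if (t, p) in hprov:                 # explicitly stored: just copy it
        return hprov[(t, p)]
    parent, sep, label = p.rpartition("/")
    if not sep:                         # reached the root without finding a record
        return None
    inherited = lookup(hprov, t, parent)
    if inherited is None:
        return None
    op, src = inherited
    if op == "C":                       # child of a copied node: p/a came from q/a
        return ("C", f"{src}/{label}")
    return (op, None)                   # children of inserted or deleted nodes

# Hypothetical hierarchical store: one copy record plus one later insert.
hprov = {
    (1, "T/a"): ("C", "S/a"),
    (1, "T/a/b/x"): ("I", None),
}
print(lookup(hprov, 1, "T/a/b"))    # ('C', 'S/a/b')
print(lookup(hprov, 1, "T/a/b/x"))  # ('I', None)
print(lookup(hprov, 1, "T/z"))      # None
```

Only the root of the copied subtree is stored; every descendant's link is recomputed on demand, which trades a little query time for storage.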

TRANSACTIONAL-HIERARCHICAL PROVENANCE

Combination of transactional and hierarchical provenance techniques.

Storage is: i + d + C

i: number of inserted nodes in the output.

d: number of nodes deleted in the input.

C: number of roots of copied subtrees that appear in the output.

Figure: hierarchical version of (b). Entire update as one transaction.

PROVENANCE QUERIES

Define some convenient views of the raw Prov table.

Ins(t, p) ← Prov(t, I, p, ⊥)
“p was inserted during transaction t”

Del(t, p) ← Prov(t, D, p, ⊥)
“p was deleted during transaction t”

Copy(t, p, q) ← Prov(t, C, p, q)
“p was copied from q during transaction t”

Unch(t, p) ← ¬(∃x, q. Prov(t, x, p, q))
“p was unchanged during transaction t”

PROVENANCE QUERIES

Define some convenient views of the raw Prov table.

From(t, p, q): “node p comes from q during transaction t”

From(t, p, q) ← Copy(t, p, q)

From(t, p, p) ← Unch(t, p)

Trace(p, t, q, u): “the data at location p at the end of transaction t “came from” the data at location q at the end of transaction u”

Trace(p, t, p, t).

Trace(p, t, q, u) ← Trace(p, t, r, s), Trace(r, s, q, u).

Trace(p, t, q, t-1) ← From(t, p, q).

Let’s answer some… “simple” questions!

PROVENANCE QUERIES (1/2)

Q1: Src

What transaction first created the data at a location? (e.g. who entered your telephone number incorrectly?)

Src(p) = {u | ∃q. Trace(p, tnow, q, u), Ins(u, q)}

Q2: Hist

What is the sequence of all transactions that copied a node to its current position?

Hist(p) = {u | ∃q, r. Trace(p, tnow, q, u), Copy(u, q, r)}

Q3: Mod

What transactions are responsible for the creation or modification of the subtree under a node?

Mod(p) = {u | ∃q, r. p ≤ q, Trace(q, tnow, r, u), ¬Unch(u, r)}
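As a rough illustration, Src and Hist can be evaluated by following From links backwards one transaction at a time. The sketch assumes the full (non-hierarchical) Prov table is available as a dictionary and that transactions are numbered 1..tnow; all names and the sample data are hypothetical.

```python
def step_back(prov, t, p):
    """From(t, p, q): where the data at p after transaction t sat just before t.

    prov maps (tid, loc) -> (op, src). Returns q, or None when the chain ends
    (p was inserted or deleted in t).
    """
    rec = prov.get((t, p))
    if rec is None:
        return p                 # Unch(t, p): the data stayed where it was
    op, src = rec
    return src if op == "C" else None

def trace(prov, p, tnow):
    """Yield every (q, u) with Trace(p, tnow, q, u), newest first."""
    q, u = p, tnow
    while q is not None and u >= 1:
        yield q, u
        q, u = step_back(prov, u, q), u - 1

def src(prov, p, tnow):
    """Transactions that inserted the data now found at p."""
    return {u for q, u in trace(prov, p, tnow) if prov.get((u, q), ("", None))[0] == "I"}

def hist(prov, p, tnow):
    """Transactions that copied the data on its way to p."""
    return {u for q, u in trace(prov, p, tnow) if prov.get((u, q), ("", None))[0] == "C"}

# Hypothetical history: S/a inserted in transaction 1, copied to T/a in transaction 3.
prov = {(1, "S/a"): ("I", None), (3, "T/a"): ("C", "S/a")}
print(src(prov, "T/a", tnow=4))   # {1}
print(hist(prov, "T/a", tnow=4))  # {3}
```

Mod could be built the same way by running the trace for every path under p and collecting the transactions in which the traced location has a Prov record.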

PROVENANCE QUERIES (2/2)

There are many interesting queries that mention both provenance and the raw data.

Q4: Project the A field out of relation R(Id, A, B) along with its current provenance.

Q(x, Px) ← R(k, x, y), From(tnow, “R/” + k + “/A”, Px)

Such queries are tricky to write by hand.

Providing advanced support for provenance queries is future work.

Note: If some source databases do not track provenance then queries stop following the chain of provenance.


IMPLEMENTATION

Figure: wrappers for source and target databases. Source database: OrganelleDB. Target database: MiMI. Auxiliary provenance database.

IMPLEMENTATION OF PROVENANCE TRACKING (1/2)

Naïve provenance

Is a straightforward process of recording target and source information for every transaction that affects the target database.

For a paste operation we add one record per node in the copied subtree.

Transactional provenance

When a commit action occurs, CPDB stores the provenance links connecting the current version with its predecessor.

No links corresponding to temporary data are stored.

The implementation maintains a provlist of provenance links that will be added to the provenance store when the user commits.

IMPLEMENTATION OF PROVENANCE TRACKING (2/2)

Hierarchical Provenance

Stores at most one record per operation.

For a copy, stores the record connecting the root of the copied tree to the root of the source.

Hierarchical Transactional Provenance

Maintains hierarchical provenance records instead of naïve provenance records in provlist.

Checks and removes redundant links from provlist.

E.g. copy S/a to T/a, copy S/a/b to T/a/b: redundant links.

PROVENANCE QUERIES - IMPLEMENTATION

Src, Mod, Hist implemented as programs.

For naïve and transactional provenance, query directly the provenance store.

For hierarchical provenance, the provenance store corresponds to the HProv relation. Query the provenance store directly and compute the appropriate provenance links on the fly.

EVALUATION

The experiments focused primarily on the storage and processing requirements of provenance tracking for the different approaches.

Query optimization and database tuning are left for future work.

Chose to use random sequences of copy-paste operations to simulate worst-case behavior.

EXPERIMENTAL SETUP

Performed five sets of experiments.

Used six patterns of update operations.

Update patterns Deletion patterns

FIRST TWO EXPERIMENTS

First experiment. Figure 7: Number of entries in the provenance store after a variety of update patterns of length 3500.

Second experiment. Figure 8: Number of entries in the provenance store after mix and real update patterns of length 14000. The number at the top of each bar shows the physical size of the table.

N and T store 4 records per copy. H and HT store only 1 record.

SECOND EXPERIMENT

Figure 9 shows the time spent on storing provenance information for all the techniques.

Figure 9: The average amount of time for target database processing and for add, delete, copy and commit operations on the provenance store during 14000-mix update.

Copying in T is close to zero, because copies do not involve interaction with the provenance store.

SECOND EXPERIMENT

Figure 10: The overhead of provenance tracking per operation as a percentage of the time to perform each basic operation.

For the naïve approach, all operations require less than 30% of the processing time needed for interaction with the target DB.

H-provenance requires more time to process inserts than copies.

H-provenance treats deletes the same way as naïve provenance.

T-provenance: inserts and copies run essentially instantaneously, because no interaction with the target database or provenance store is needed.

THIRD EXPERIMENT

Measured the effects of deletes on provenance storage.

Figure 11: The effect of deletion on the provenance store. The notation (ac) indicates provenance table size when only add and copy operations are performed while (acd) includes deletes.

HT-provenance stores the fewest records among the approaches for each update pattern.

FOURTH EXPERIMENT

Figure 12: The effect of transaction size on provenance processing time.

Time to process a commit grows approximately linearly with transaction length.

FIFTH EXPERIMENT

Displays the time needed to perform basic provenance queries.

Figure 13: The time needed to perform basic provenance queries.

All three queries ran fastest for transactional provenance.

CONCLUSIONS

The experimental results affirm that provenance can be tracked and managed efficiently using our approach.

This is a promising first step towards providing powerful, general-purpose tools that will make life easier for scientific data curators and increase the reliability and transparency of the scientific record.

CONTENTS



1st paper 2nd paper

WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?

Importance of citing databases.

Citing something that:

Has internal structure.

Evolves over time.

Propose a stable citation system for IUPHAR.

Describe:

How to publish the database in a form that can be cited.

How to ensure that the citations remain valid.

How to generate and validate the citations automatically.

PRELIMINARIES (1/4)

Bard JB and Davies JA. Development, Databases and the Internet. Bioessays 17:999-1001

PRELIMINARIES (1/4)

Citations are used to identify the source material and provide some additional information.

Example: the citations

Ann. Phys., Lpz 18 639-641

Nature, 171, 737-738

while adequate for identification, hardly convey the importance of these publications.

PRELIMINARIES (2/4)

A citation does not give us a specific mechanism for retrieving a document.

It is useful for finding what we are looking for.

It is a structure that can be used by a variety of mechanisms such as online indexes and search engines.

A citation consists of two kinds of information.

Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001.

Location

Descriptive Information (authorship, title, date)

PRELIMINARIES (3/4)

Requirements concerning citations:

There is some “thing” that is being cited.

The thing should be accessible.

The thing should not change over time.

There are few accepted practices for supporting citation of data.

Few standards.

Little supporting technology.

PRELIMINARIES (4/4)

D1 For any citation C, <C> should remain fixed.

Since databases change, this simple requirement is not always easy to maintain.

D2 Any citable thing T should contain a citation C such that <C> = T.

Anything we cite should provide us with at least one way of citing it.

This is not always done in journal publications. It is essential because:

1. One wants confirmation that we have found the correct citation. Even if we have found T using some other citation C’ (<C> = <C’>), we want to be sure that they refer to the same thing.

2. If we found <C> by some other means (e.g. a search engine), we want to know how to cite it.

CURRENT PRACTICE

On-line databases frequently give recommendations on how to cite them.

They often omit version information.

They fail to provide adequate location.

The Columbia Guide to Online Style, although it discusses issues of permanence of links, does not mention D1 as one of its citation “principles”.

ISO690 standard deals with citations of parts of electronic documents.

STRUCTURAL ISSUES (1/5)

Databases have explicit structure.

This offers the possibility of a citation using this structure to home in on the relevant data.

Example (IUPHAR database)

Figure 1: Rough structure of the IUPHAR web interface.

The structure of what the user sees is not the same as the underlying database.


Consider the following:

1. The IUPHAR database (C1) contains no information about Ginandtonicin.

2. The IUPHAR database (C2) lists five ligands for Melatonin receptor MT1.

3. The IUPHAR database (C3) asserts that luzindole is an antagonist ligand for receptor MT1.

Making the context too narrow can be as counterproductive as making it too wide.

<C1> should refer to the whole database.

<C2> should be the web page for that receptor or maybe the receptor family page.

<C3>: citing just that row or the table? Better, cite the receptor or its family.


One citation is coarser than another if it refers to a higher structure (<C1> is coarser than <C2>).

D3 It should be possible to cite a database at varying degrees of coarseness.

In order to make further progress we have to look at the internal structure of citation.

Life Sci., 53, 393-398

(journal, volume number, pages)

Our understanding is based on a common structure of all journals.


A “concrete syntax” for citations is a sequence {k1 = v1, k2 = v2, ..}, where k1, k2, ... are keywords and v1, v2, … are associated values.

Example: {Journal = “Life Sci.”, Number = 53, Pages = 393-398}

There is a natural “part of” relationship among citations.

Example: {Journal = “Life Sci.”} is part of {Journal = “Life Sci.”, Number = 53}
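One way to picture the “part of” relationship is as key/value containment between dictionaries. This is an illustrative encoding only; the function name and the dictionary representation are assumptions.

```python
def is_part_of(coarse: dict, fine: dict) -> bool:
    """True if every key/value pair of `coarse` also appears in `fine`."""
    return all(k in fine and fine[k] == v for k, v in coarse.items())

journal = {"Journal": "Life Sci."}
volume = {"Journal": "Life Sci.", "Number": 53}
article = {"Journal": "Life Sci.", "Number": 53, "Pages": "393-398"}

print(is_part_of(journal, volume))   # True
print(is_part_of(volume, article))   # True
print(is_part_of(article, journal))  # False
```

The relation is transitive, so the journal citation is also part of the article citation.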


D4 If C and C’ are citations and <C’> is coarser than <C> then the location information in C’ should be a part of the location information in C.

C’: {DB=IUPHAR, IUPHAR-Receptor-family=Melatonin}

C: {IUPHAR-Receptor-family=Melatonin}

TEMPORAL ISSUES (1/2)

The obvious way to deal with change in citation is to provide, in the citation, a version number.

{DB=IUPHAR, Version=17, Family=Melatonin}

Using time may be misleading.

To what does the version refer?

D5 Versions should be recorded at the database level.

The rate of publication of versions is much slower than the rate of updates.

Having such a citation obliges someone to keep past versions.

It is possible to cite a range of versions {..Version=2-8..}

TEMPORAL ISSUES (2/2)

Now, what is <{DB=IUPHAR, Family=Melatonin}>, a citation without a version number?

The latest version of the database.

So, we need two words:

One for a fixed citation.

One for a “current link”, the place at which you may find the latest information.

A good job of distinguishing between “this” version, the “latest” version and previous versions of documents was presented.

PRESENTATION, CONTENT AND PRESERVATION

The structure of the cited “thing” is not necessarily the same as the structure of the underlying database.

The underlying database contains information – working notes etc – that is not intended as part of the published material.

We should not be making direct citations to the internal structure of the database.

The hierarchy that the user sees should be represented as an XML document.

AUTOMATICALLY GENERATING CITATIONS (1/3)

Inserting citation data manually is both time-consuming and error-prone.

Automatic generation of citations is a good check on the integrity of the document.

Guarantees that the contents of the document are consistent with the citation.

Gives guarantees on the descriptive information (e.g. there is at most one Title).


A rule that generates location information:

{DB=IUPHAR, Version=$v, Family=$f} ← /Root[]/Version[Number=$’v]/Data[]/Family[FamilyName=$’f]

The pattern is expressed in the syntax of XPath.

{DB=IUPHAR, Version=$v, Family=$f}: a concrete syntax of citations with variables.

/Root[]: the database or document has a unique root.

Version[Number=$’v]: each Version must have a Number that uniquely identifies the node and provides a value for $v.

Data[]: indicates that for each Version, there is precisely one Data node.

Family[FamilyName=$’f]: each Family node has a FamilyName which uniquely identifies the family.
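The rule's effect can be approximated with Python's standard xml.etree.ElementTree. The XML layout below (Number and FamilyName as child elements) and the helper names are assumptions made for illustration; the slides do not show the actual IUPHAR document.

```python
import xml.etree.ElementTree as ET

# Hypothetical published document following the element names used in the rule.
DOC = """
<Root>
  <Version>
    <Number>11</Number>
    <Data>
      <Family><FamilyName>Calcitonin</FamilyName></Family>
      <Family><FamilyName>Melatonin</FamilyName></Family>
    </Data>
  </Version>
</Root>
"""

def only(parent, tag):
    """Return the single child with this tag, enforcing a 'precisely one' constraint."""
    found = parent.findall(tag)
    if len(found) != 1:
        raise ValueError(f"expected exactly one <{tag}>, found {len(found)}")
    return found[0]

def location_citations(root, db="IUPHAR"):
    """Generate one {DB, Version, Family} citation per Family node."""
    citations = []
    seen_versions = set()
    for version in root.findall("Version"):
        v = only(version, "Number").text
        if v in seen_versions:                  # Number must identify the Version
            raise ValueError(f"duplicate version number {v}")
        seen_versions.add(v)
        seen_families = set()
        for family in only(version, "Data").findall("Family"):
            f = only(family, "FamilyName").text
            if f in seen_families:              # FamilyName must identify the Family
                raise ValueError(f"duplicate family name {f}")
            seen_families.add(f)
            citations.append({"DB": db, "Version": v, "Family": f})
    return citations

root = ET.fromstring(DOC.strip())
for c in location_citations(root):
    print(c)
```

Because the uniqueness checks run while citations are generated, a failure signals a structural problem in the document itself, which is the integrity check the slides describe.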


A rule that generates description information:

{DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i} ← /Root[]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d]/Data[]/Family[FamilyName=$’f, Contributor-list/Contributor=$+a]/Receptor[ReceptorName=$’r]

Generates:

{DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner}, Editor=Tony Harmar, Date=Jan, 2006, DOI=10.1234}

$. : exactly one value.

$? : at most one value.

$+ : one or more values expected.

CONCLUSIONS

We have to do a modest amount of work in structuring the data appropriately in XML, after which citations can be specified and generated by some simple rules.
