provenance management & citations in curated databases

Page 1: Provenance Management & Citations in  Curated  Databases

PROVENANCE MANAGEMENT & CITATIONS IN CURATED DATABASES

Kleisarchaki Sophia,

HY561, 05/05/09

Page 2: Provenance Management & Citations in  Curated  Databases

ABOUT THE AUTHOR – PETER BUNEMAN

He works in the Database Group in the Laboratory for Foundations of Computer Science (University of Edinburgh).

He spent many years in the Database Group of the Department of Computer and Information Science at the University of Pennsylvania.

You can find him... in polynomial time.

Page 3: Provenance Management & Citations in  Curated  Databases

CONTENTS

Before all:

“Curated Databases” Peter Buneman, James Cheney, Wang-Chiew Tan

“Provenance in Databases (Tutorial Outline)” Peter Buneman, Wang-Chiew Tan

1st paper: “Provenance Management in Curated Databases” Peter Buneman, Adriane P. Chapman, James Cheney

2nd paper: “How to cite curated databases and how to make them citable” Peter Buneman

Page 4: Provenance Management & Citations in  Curated  Databases

CURATED DATABASES

What is a Curated Database?

The term “curated” comes from the Latin curare – to care for.

They are the result of a great deal of annotation, correction and transfer of data from other sources.

They are populated and updated with a great deal of human effort, through the consultation, verification and aggregation of existing sources and the interpretation of new raw data.

Page 5: Provenance Management & Citations in  Curated  Databases

CURATED DATABASES

What is a Curated Database NOT?

Curated databases are not warehouses. They are manually constructed by highly skilled scientists.

They are not views.

They are not computed automatically from existing datasets.

Page 6: Provenance Management & Citations in  Curated  Databases

CURATED DATABASES

Notable examples of curated databases

UniProt (which incorporates Swiss-Prot): used in molecular biology.

CIA World Factbook: source of demographic data.

IUPHAR: receptor database. Maintained by volunteers.

Such databases are not confined to biology; they are also being developed in areas such as astronomy and geology. Wikipedia and other wikis are also curated in that they are the product of direct human effort.

Nuclear Protein Database (NPD).

Reference manuals, dictionaries and gazetteers.

Page 7: Provenance Management & Citations in  Curated  Databases

CURATED DATABASES

What are the characteristics of a Curated Database?

Source. Data is copied and edited from existing sources, perhaps other curated databases. Knowing the origin – provenance – is important.

Annotation. In addition to core data, curated databases also contain annotations that carry additional pieces of information such as provenance.

Update. A common practice is to maintain an up-to-date working database and to “publish” versions of it.

Schema and structure. Constructed “on the cheap”, usually stored in a text file. Almost inevitably the structure of the entries evolves over time.

Page 9: Provenance Management & Citations in  Curated  Databases

PROVENANCE IN DATABASES (1/2)

Provenance – also called lineage and pedigree – describes the source and derivation of data.

Why is provenance important?

It helps to:

Determine the authenticity of a work.

Establish the historical importance of a work by suggesting other artists who might have seen and been influenced by it.

Determine the legitimacy of current ownership.

Trust the data.

Page 10: Provenance Management & Citations in  Curated  Databases

PROVENANCE IN DATABASES (2/2)

Overview of provenance

Provenance: describes the source and derivation of data. It splits into two kinds:

1. Workflow or coarse-grain provenance: records a complete history of the derivation of some data set.

2. Dataflow or fine-grain provenance: the derivation of part of the resulting data set. It splits further into:

Why-provenance: keeps the justification for the element appearing in the output.

Where-provenance: the identification of the source elements from which the data in the target is copied.

Page 11: Provenance Management & Citations in  Curated  Databases

WHERE-, WHY- PROVENANCE

NYHotels (source table)

Hotel            Zip    Rating
Waldorf Astoria  10022  4.5
Holiday Inn DT   10013  4.0

Restaurant source table

Restaurant          Cost  Type      Zip
Peacock Alley       $$$   French    10022
Bull & Bear         $$$   Seafood   10022
Pacifica            $     Chinese   10013
Soho Kitchen & Bar  $     American  10022

View (JOIN, PROJECT)

Hotel            Restaurant          Cost  Rating
Waldorf Astoria  Peacock Alley       $$$   4.5
Waldorf Astoria  Bull & Bear         $$$   4.5
Waldorf Astoria  Soho Kitchen & Bar  $     4.5
Holiday Inn DT   Pacifica            $     4.0

Why? (Why-provenance): which source tuples justify a row appearing in the view.

Where? (Where-provenance): which source field a value in the view, such as a Rating, was copied from.
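To make the two notions concrete, here is a minimal Python sketch of the join on this slide that annotates every view row with its why-provenance (the pair of source tuples that justify it) and its where-provenance (the source field each value was copied from). The table name "Restaurants", the function name and the dictionary layout are assumptions made for illustration; they do not come from the papers.

```python
# Source tables from the slide. The name "Restaurants" is an assumption:
# the transcript only names the NYHotels table.
NYHOTELS = [
    ("Waldorf Astoria", "10022", 4.5),
    ("Holiday Inn DT", "10013", 4.0),
]
RESTAURANTS = [
    ("Peacock Alley", "$$$", "French", "10022"),
    ("Bull & Bear", "$$$", "Seafood", "10022"),
    ("Pacifica", "$", "Chinese", "10013"),
    ("Soho Kitchen & Bar", "$", "American", "10022"),
]


def join_with_provenance(hotels, restaurants):
    """Join on Zip, project (Hotel, Restaurant, Cost, Rating), keep provenance."""
    view = []
    for h, (hotel, hotel_zip, rating) in enumerate(hotels):
        for r, (restaurant, cost, _cuisine, rest_zip) in enumerate(restaurants):
            if hotel_zip != rest_zip:
                continue
            view.append({
                "row": (hotel, restaurant, cost, rating),
                # Why-provenance: the source tuples that justify this row.
                "why": {("NYHotels", h), ("Restaurants", r)},
                # Where-provenance: the source field each value was copied from.
                "where": {
                    "Hotel": ("NYHotels", h, "Hotel"),
                    "Restaurant": ("Restaurants", r, "Restaurant"),
                    "Cost": ("Restaurants", r, "Cost"),
                    "Rating": ("NYHotels", h, "Rating"),
                },
            })
    return view


VIEW = join_with_provenance(NYHOTELS, RESTAURANTS)
```

In this sketch the Rating of a Waldorf Astoria row has the same where-provenance in all three rows (the Rating field of the first NYHotels tuple), while each row has a different why-provenance because a different restaurant tuple takes part in the join.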

Page 12: Provenance Management & Citations in  Curated  Databases

CONTENTS

1st paper: “Provenance Management in Curated Databases” Peter Buneman, Adriane P. Chapman, James Cheney

2nd paper: “How to cite curated databases and how to make them citable” Peter Buneman

Page 13: Provenance Management & Citations in  Curated  Databases

WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?

Database technology is employed not only to provide access to source data, but also to the derived knowledge of scientists who have interpreted the data.

Provenance, or metadata describing creation, recording, ownership, processing, or version history, is essential for assessing the value of such data.

What information should be retained?

How should it be managed?

Page 14: Provenance Management & Citations in  Curated  Databases

WHAT IS THIS PAPER ABOUT?

Investigates general-purpose techniques for recording provenance for data that is copied among databases.

Describes an approach in which the authors track the user’s actions, in order to record them in a convenient, queryable form.

Presents an implementation of this technique and uses it to evaluate the feasibility of database support for provenance management.

Page 15: Provenance Management & Citations in  Curated  Databases

CURATED DATABASES - EXAMPLE

Example: a curator

a) Copies records of some interesting proteins from a SwissProt webpage into her database.

b) Fixes the new entries so that the PTM (post-translational modification) found in SwissProt is not confused with hers.

c) Copies some publications from OMIM and NCBI.

d) One year later she finds a discrepancy between two PTMs.

Page 16: Provenance Management & Citations in  Curated  Databases

THE PROBLEM

It is necessary to retain provenance information describing the source and version history of the data.

We focus on “fine-grained” provenance, which describes how data has moved through a network of databases.

We need to record both local modifications to the database (insert, delete, update) and global operations such as copying data from external sources.

Constraints:

1. There is no standard for storing or exchanging provenance.

2. Practices for identifying or locating data vary.

3. Past versions may not be archived.

4. Curators employ a variety of application programs that cannot be changed.

Page 17: Provenance Management & Citations in  Curated  Databases

OUR APPROACH (1/2)

User’s actions are captured as a sequence of insert, delete, copy and paste operations by a provenance-aware application.

Provenance architecture (figure): external source databases, local database, auxiliary provenance database.

Page 18: Provenance Management & Citations in  Curated  Databases

OUR APPROACH (2/2)

Implemented a naïve approach and several more sophisticated ones.

The naïve approach increases the time to process each update by 28%. The amount of provenance information stored is proportional to the size of the changed data.

Optimization techniques:

Transactional provenance management.

Hierarchical provenance management.

Together these optimizations reduce the added processing cost of provenance tracking to less than 5-10% per operation and reduce the storage cost by a factor of 5-7 relative to the naïve approach.

Typical provenance queries can be executed more efficiently.

Page 19: Provenance Management & Citations in  Curated  Databases

MANUAL UPDATES AND PROVENANCE (1/2)

“Where does a piece of data come from?”

We need a means for describing the location of any data element.

Two assumptions:

The database can be viewed as a tree.

A given sequence of edge labels occurs on at most one path from the root, so a path identifies at most one data element.

Example: SwissProt/Release{20}/Q01780 identifies a specific entry.

Page 20: Provenance Management & Citations in  Curated  Databases

MANUAL UPDATES AND PROVENANCE (2/2)

Update operations are of the form:

u ::= ins {a : v} into p | del a from p | copy q into p

ins {a : v} into p: inserts an edge labeled a with value v into the subtree at p.

del a from p: deletes the edge labeled a under p, together with its subtree.

copy q into p: replaces the subtree at p with a copy of the subtree at location q.
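As a rough illustration of these three update operations, the sketch below models the databases as one nested Python dictionary in which edge labels are keys and paths are slash-separated strings. The function names, the dictionary encoding and the sample data are assumptions for illustration only, not the papers' implementation.

```python
import copy


def _split(path):
    """Turn 'T/c1/y' into ['T', 'c1', 'y']."""
    return [label for label in path.split("/") if label]


def _walk(tree, labels):
    """Follow a list of edge labels down from the root."""
    node = tree
    for label in labels:
        node = node[label]
    return node


def ins(db, label, value, p):
    """ins {label : value} into p: add a new edge under the node at p."""
    parent = _walk(db, _split(p))
    if label in parent:
        raise KeyError(f"{p}/{label} already exists")
    parent[label] = value


def delete(db, label, p):
    """del label from p: remove the edge and the subtree below it."""
    del _walk(db, _split(p))[label]


def copy_into(db, q, p):
    """copy q into p: replace the subtree at p with a copy of the one at q."""
    *parent_labels, last = _split(p)
    subtree = copy.deepcopy(_walk(db, _split(q)))
    _walk(db, parent_labels)[last] = subtree


# Hypothetical source database S and target database T under one root.
db = {"S": {"a1": {"x": "1", "y": "2"}}, "T": {}}
copy_into(db, "S/a1", "T/c1")
ins(db, "z", "3", "T/c1")
delete(db, "x", "T/c1")
```

After these three updates the target holds T/c1 with children y and z, and the source subtree S/a1 is unchanged because the copy is deep.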

Page 21: Provenance Management & Citations in  Curated  Databases

PROVENANCE TRACKING

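Prov(Tid, Op, Loc, Src): transaction identifier, operation (I for insert, D for delete, C for copy), target location, source location.

Provenance architecture (figure): external source databases, local database, auxiliary provenance database.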

Page 22: Provenance Management & Citations in  Curated  Databases

NAÏVE PROVENANCE

Store one provenance record for each copied, inserted or deleted node.

Wasteful in terms of space.

Retains the maximum possible information about the user’s actions.

(Figure: one transaction per line.)
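The naïve scheme can be sketched as follows: a copy produces one Prov(Tid, Op, Loc, Src) record for every node of the copied subtree. The helper names, the nested-dictionary tree encoding and the transaction number are hypothetical and chosen only for illustration.

```python
def node_paths(tree, prefix):
    """Yield the path of every node in the subtree rooted at prefix."""
    yield prefix
    if isinstance(tree, dict):
        for label, child in tree.items():
            yield from node_paths(child, f"{prefix}/{label}")


def naive_copy_records(tid, source_tree, q, p):
    """One (Tid, Op, Loc, Src) record per node copied from q to p."""
    records = []
    for src in node_paths(source_tree, q):
        loc = p + src[len(q):]  # same relative path below the target root
        records.append((tid, "C", loc, src))
    return records


# Hypothetical subtree stored at S/a1: four nodes in total.
subtree = {"x": "1", "y": {"w": "2"}}
records = naive_copy_records(121, subtree, "S/a1", "T/c1")
```

For this four-node subtree the naïve table gains four records, which is why its size grows in proportion to the amount of data changed.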

Page 23: Provenance Management & Citations in  Curated  Databases

TRANSACTIONAL PROVENANCE

Actions are grouped into transactions larger than a single operation.

Store only provenance links describing the net changes resulting from a transaction.

Details about intermediate states are not retained.

Less precise than the naïve approach.

Number of transactional provenance records: i + d + c, where

i: number of inserted nodes in the output.

d: number of nodes deleted in the input.

c: number of copied nodes in the output.

(Figure: the entire update as one transaction.)

Page 24: Provenance Management & Citations in  Curated  Databases

HIERARCHICAL PROVENANCE (1/2)

It is not necessary to store all of the provenance links explicitly.

The provenance of a child of a copied node can often be inferred from its parent’s provenance using a simple rule.

Does not discard any information.

Does not require the user to group operations into transactions.

(Figure: hierarchical version of the naïve approach. 25% smaller than Prov, but much larger savings are possible.)

Page 25: Provenance Management & Citations in  Curated  Databases

HIERARCHICAL PROVENANCE (2/2)

We can define the full provenance table as a view of the hierarchical table as follows:

If the provenance is specified in HProv, then it is just copied into Prov.

Otherwise, the provenance of every target path p/a not mentioned in HProv is q/a, provided p was copied from q.

Infer(t, p) ← ¬(∃x, q. HProv(t, x, p, q))

Prov(t, op, p, q) ← HProv(t, op, p, q)

Prov(t, C, p/a, q/a) ← Prov(t, C, p, q), Infer(t, p/a)

Prov(t, I, p/a, ⊥) ← Prov(t, I, p, ⊥), Infer(t, p/a)

Prov(t, D, p/a, ⊥) ← Prov(t, D, p, ⊥), Infer(t, p/a)
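One way to read these rules is as a recursive lookup: if a path has no explicit HProv entry for a transaction, its provenance is derived from its parent's. The sketch below assumes HProv is a Python dictionary keyed by (transaction, path) and that the lookup is only called on paths that actually exist in the target; both are assumptions for illustration, as are all names used.

```python
def prov_lookup(hprov, t, p):
    """Return the (t, op, loc, src) record for path p in transaction t, or None.

    hprov maps (t, path) to (op, src); src is None for inserts and deletes.
    """
    if (t, p) in hprov:  # explicitly stored: copy it into Prov
        op, src = hprov[(t, p)]
        return (t, op, p, src)
    if "/" not in p:  # reached the root without finding a record
        return None
    parent, label = p.rsplit("/", 1)
    inherited = prov_lookup(hprov, t, parent)
    if inherited is None:
        return None
    _, op, _, src = inherited
    # A child of a copied node comes from the matching child of the source;
    # children of inserted or deleted nodes have no source.
    return (t, op, p, f"{src}/{label}" if op == "C" else None)


# Hypothetical hierarchical table: one stored record for a whole copied subtree.
hprov = {(1, "T/c1"): ("C", "S/a1")}
inferred = prov_lookup(hprov, 1, "T/c1/y/w")
```

Here a single stored record for the root of the copy is enough to recover the provenance of every node beneath it.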

Page 26: Provenance Management & Citations in  Curated  Databases

TRANSACTIONAL-HIERARCHICAL PROVENANCE

Combination of transactional and hierarchical provenance techniques.

Storage is i + d + C, where

i: number of inserted nodes in the output.

d: number of nodes deleted in the input.

C: number of roots of copied subtrees that appear in the output.

(Figure: hierarchical version of (b); the entire update as one transaction.)

Page 27: Provenance Management & Citations in  Curated  Databases

PROVENANCE QUERIES

Define some convenient views of the raw Prov table.

Ins(t, p) ← Prov(t, I, p, ⊥): “p was inserted during transaction t”

Del(t, p) ← Prov(t, D, p, ⊥): “p was deleted during transaction t”

Copy(t, p, q) ← Prov(t, C, p, q): “p was copied from q during transaction t”

Unch(t, p) ← ¬(∃x, q. Prov(t, x, p, q)): “p was unchanged during transaction t”
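These views translate almost directly into predicates over a list of (Tid, Op, Loc, Src) tuples. The following sketch uses None for the missing source of inserts and deletes; the function names and the sample table are hypothetical.

```python
# Hypothetical raw Prov table: (Tid, Op, Loc, Src).
PROV = [
    (1, "I", "T/a", None),
    (2, "C", "T/b", "S/x"),
    (3, "D", "T/a", None),
]


def was_inserted(prov, t, p):
    """Ins(t, p): p was inserted during transaction t."""
    return (t, "I", p, None) in prov


def was_deleted(prov, t, p):
    """Del(t, p): p was deleted during transaction t."""
    return (t, "D", p, None) in prov


def copy_source(prov, t, p):
    """Copy(t, p, q): return q if p was copied from q during t, else None."""
    for tid, op, loc, src in prov:
        if tid == t and op == "C" and loc == p:
            return src
    return None


def was_unchanged(prov, t, p):
    """Unch(t, p): no record of any kind mentions p in transaction t."""
    return not any(tid == t and loc == p for tid, _op, loc, _src in prov)
```

Note that Unch is defined by the absence of a record, so it holds for any path that a transaction did not touch.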

Page 28: Provenance Management & Citations in  Curated  Databases

PROVENANCE QUERIES

Define some convenient views of the raw Prov table.

From(t, p, q): “node p comes from q during transaction t”

From(t, p, q) ← Copy(t, p, q)

From(t, p, p) ← Unch(t, p)

Trace(p, t, q, u): “the data at location p at the end of transaction t came from the data at location q at the end of transaction u”

Trace(p, t, p, t).

Trace(p, t, q, t-1) ← From(t, p, q).

Trace(p, t, q, u) ← Trace(p, t, r, s), Trace(r, s, q, u).

Page 29: Provenance Management & Citations in  Curated  Databases

Let’s answer some… “simple” questions!

Page 30: Provenance Management & Citations in  Curated  Databases

PROVENANCE QUERIES (1/2)

Q1: Src. What transaction first created the data at a location? (e.g. who entered your telephone number incorrectly?)

Src(p) = {u | ∃q. Trace(p, tnow, q, u), Ins(u, q)}

Q2: Hist. What is the sequence of all transactions that copied a node to its current position?

Hist(p) = {u | ∃q, r. Trace(p, tnow, q, u), Copy(u, q, r)}

Q3: Mod. What transactions are responsible for the creation or modification of the subtree under a node?

Mod(p) = {u | ∃q, r. p ≤ q, Trace(q, tnow, r, u), ¬Unch(u, r)}
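A minimal sketch of how Src, Hist and Mod could be evaluated by walking the provenance chain backwards, one transaction at a time. It assumes a naïve Prov table, consecutive integer transaction numbers on a single timeline, and that an unchanged path comes from itself; these simplifications and all names are assumptions for illustration, not the papers' implementation.

```python
def record_at(prov, t, p):
    """Return (op, src) recorded for path p in transaction t, or None."""
    for tid, op, loc, src in prov:
        if tid == t and loc == p:
            return op, src
    return None


def came_from(prov, t, p):
    """From(t, p, q): q for a copy, p itself if unchanged, None otherwise."""
    rec = record_at(prov, t, p)
    if rec is None:
        return p
    op, src = rec
    return src if op == "C" else None


def trace(prov, p, t):
    """All (q, u) such that the data at p after t came from q after u."""
    states = [(p, t)]
    while t > 0:
        q = came_from(prov, t, p)
        if q is None:  # inserted or deleted here: the chain ends
            break
        p, t = q, t - 1
        states.append((p, t))
    return states


def src_query(prov, p, tnow):
    return {u for q, u in trace(prov, p, tnow) if record_at(prov, u, q) == ("I", None)}


def hist_query(prov, p, tnow):
    found = set()
    for q, u in trace(prov, p, tnow):
        rec = record_at(prov, u, q)
        if rec is not None and rec[0] == "C":
            found.add(u)
    return found


def mod_query(prov, p, tnow, paths):
    """paths: locations that currently exist; those at or below p are checked."""
    below = [q for q in paths if q == p or q.startswith(p + "/")]
    return {u for q in below for r, u in trace(prov, q, tnow)
            if record_at(prov, u, r) is not None}


# Hypothetical history: insert T/a, copy it to T/b, then copy T/b to T/c.
PROV = [(1, "I", "T/a", None), (2, "C", "T/b", "T/a"), (3, "C", "T/c", "T/b")]
```

With this history, the data at T/c was created by transaction 1 and reached its current position through the copies in transactions 2 and 3.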

Page 31: Provenance Management & Citations in  Curated  Databases

PROVENANCE QUERIES (2/2)

There are many interesting queries that mention both provenance and the raw data.

Q4: Project the A field out of relation R(Id, A, B) along with its current provenance.

Q(x, Px) ← R(k, x, y), From(tnow, “R/” + k + “/A”, Px)

Such queries are tricky to write by hand.

Providing advanced support for provenance queries is future work.

Note: If some source databases do not track provenance, then queries stop following the chain of provenance.

Page 32: Provenance Management & Citations in  Curated  Databases

IMPLEMENTATION

Provenance architecture (figure): source database (OrganelleDB), target database (MiMI), auxiliary provenance database, with wrappers for the source and target databases.

Page 33: Provenance Management & Citations in  Curated  Databases

IMPLEMENTATION OF PROVENANCE TRACKING (1/2)

Naïve provenance

A straightforward process of recording target and source information for every transaction that affects the target database.

For a paste operation we add one record per node in the copied subtree.

Transactional provenance

When a commit action occurs, CPDB stores the provenance links connecting the current version with its predecessor.

No links corresponding to temporary data are stored.

The implementation maintains a provlist of provenance links that will be added to the provenance store when the user commits.

Page 34: Provenance Management & Citations in  Curated  Databases

IMPLEMENTATION OF PROVENANCE TRACKING (2/2)

Hierarchical provenance

Stores at most one record per operation.

For a copy, stores the record connecting the root of the copied tree to the root of the source.

Hierarchical transactional provenance

Maintains hierarchical provenance instead of naïve provenance records in provlist.

Checks and removes redundant links from provlist.

E.g. after “copy S/a to T/a”, the link for “copy S/a/b to T/a/b” is redundant.
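The redundancy check can be sketched as follows: a link is dropped when it is exactly what the hierarchical rule would infer from the link of its parent. The sketch assumes provlist holds at most one (target, source) link per target path; the function name and this representation are hypothetical.

```python
def remove_redundant(provlist):
    """Drop copy links that can be inferred from the parent's link."""
    links = dict(provlist)  # target path -> source path
    kept = []
    for target, source in provlist:
        if "/" in target and "/" in source:
            t_parent, t_label = target.rsplit("/", 1)
            s_parent, s_label = source.rsplit("/", 1)
            # Redundant: same last label, and the parent was copied
            # from the source's parent.
            if t_label == s_label and links.get(t_parent) == s_parent:
                continue
        kept.append((target, source))
    return kept


provlist = [("T/a", "S/a"), ("T/a/b", "S/a/b"), ("T/c", "S/d")]
pruned = remove_redundant(provlist)
```

The link for T/a/b disappears because it follows from the copy of S/a to T/a, while the link for T/c stays because nothing above it implies it.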

Page 35: Provenance Management & Citations in  Curated  Databases

PROVENANCE QUERIES - IMPLEMENTATION

Src, Mod, Hist implemented as programs.

For naïve and transactional provenance, query the provenance store directly.

For hierarchical provenance, the provenance store corresponds to the HProv relation. Query the provenance store directly and compute the appropriate provenance links on the fly.

Page 36: Provenance Management & Citations in  Curated  Databases

EVALUATION

The experiments focused primarily on the storage and processing requirements of provenance tracking for the different approaches.

Query optimization and database tuning are left for future work.

Random sequences of copy-paste operations were used to simulate worst-case behavior.

Page 37: Provenance Management & Citations in  Curated  Databases

EXPERIMENTAL SETUP

Performed five sets of experiments.

Used six patterns of update operations.

(Tables: update patterns; deletion patterns.)

Page 38: Provenance Management & Citations in  Curated  Databases

FIRST TWO EXPERIMENTS

First experiment. Figure 7: Number of entries in the provenance store after a variety of update patterns of length 3500.

Second experiment. Figure 8: Number of entries in the provenance store after mix and real update patterns of length 14000. The number at the top of each bar shows the physical size of the table.

N and T store 4 records per copy; H and HT store only 1 record.

Page 39: Provenance Management & Citations in  Curated  Databases

SECOND EXPERIMENT

Figure 9 shows the time spent on storing provenance information for all the techniques.

Figure 9: The average amount of time for target database processing and for add, delete, copy and commit operations on the provenance store during 14000-mix update.

Copying in T is close to zero, because copies do not involve interaction with the provenance store.

Page 40: Provenance Management & Citations in  Curated  Databases

SECOND EXPERIMENT

Figure 10: The overhead of provenance tracking per operation as a percentage of the time to perform each basic operation.

For the naïve approach, all operations require less than 30% of the processing time needed for interaction with the target DB.

H-provenance requires more time to process inserts than copies.

H-provenance treats deletes in the same way as naïve provenance.

T-provenance: inserts and copies run essentially instantaneously, because no interaction with the target database or provenance store is needed.

Page 41: Provenance Management & Citations in  Curated  Databases

THIRD EXPERIMENT

Measured the effects of deletes on provenance storage.

Figure 11: The effect of deletion on the provenance store. The notation (ac) indicates provenance table size when only add and copy operations are performed while (acd) includes deletes.

HT-provenance stores the fewest records among the approaches for each update pattern.

Page 42: Provenance Management & Citations in  Curated  Databases

FOURTH EXPERIMENT

Figure 12: The effect of transaction size on provenance processing time.

Time to process a commit grows approximately linearly with transaction length.

Page 43: Provenance Management & Citations in  Curated  Databases

FIFTH EXPERIMENT

Figure 13: The time needed to perform basic provenance queries.

All three queries ran fastest for transactional provenance.

Page 44: Provenance Management & Citations in  Curated  Databases

CONCLUSIONS

The experimental results affirm that provenance can be tracked and managed efficiently using our approach.

This is a promising first step towards providing powerful, general-purpose tools that will make life easier for scientific data curators and increase the reliability and transparency of the scientific record.

Page 45: Provenance Management & Citations in  Curated  Databases

CONTENTS

1st paper: “Provenance Management in Curated Databases” Peter Buneman, Adriane P. Chapman, James Cheney

2nd paper: “How to cite curated databases and how to make them citable” Peter Buneman

Page 46: Provenance Management & Citations in  Curated  Databases

WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?

Importance of citing databases.

Citing something that has internal structure and evolves over time.

Proposes a stable citation system for IUPHAR.

Describes:

How to publish the database in a form that can be cited.

How to ensure that the citations remain valid.

How to generate and validate the citations automatically.

Page 47: Provenance Management & Citations in  Curated  Databases

PRELIMINARIES (1/4)

Bioessays 17:999-1001

Bard JB and Davies JA. Development, Databases and the Internet.

Page 48: Provenance Management & Citations in  Curated  Databases

PRELIMINARIES (1/4)

Citations are used to identify the source material and provide some additional information.

Example: the citations

Ann. Phys., Lpz 18 639-641

Nature, 171, 737-738

while adequate for identification, hardly convey the importance of these publications.

Page 49: Provenance Management & Citations in  Curated  Databases

PRELIMINARIES (2/4)

A citation does not give us a specific mechanism for retrieving a document.

It is useful for finding what we are looking for.

It is a structure that can be used by a variety of mechanisms such as online indexes and search engines.

A citation consists of two kinds of information:

Location

Descriptive information (authorship, title, date)

Example: Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001.

Page 52: Provenance Management & Citations in  Curated  Databases

PRELIMINARIES (3/4)

Requirements concerning citations: There is some “thing” that is being cited. The thing should be accessible. The thing should not change over time.

There are few accepted practices for supporting citation of data. Few standards. Little supporting technology.

Page 53: Provenance Management & Citations in  Curated  Databases

PRELIMINARIES (4/4)

D1 For any citation C, <C> (the thing that C cites) should remain fixed.

Since databases change, this simple requirement is not always easy to maintain.

D2 Any citable thing T should contain a citation C such that <C> = T.

Anything we cite should provide us with at least one way of citing it. This is not always done in journal publications.

It is essential because:

1. One wants confirmation that we have found the correct citation. Even if we have found T using some other citation C’ (<C> = <C’>), we want to be sure that they refer to the same thing.

2. If we found <C> by some other means (e.g. a search engine), we want to know how to cite it.

Page 54: Provenance Management & Citations in  Curated  Databases

CURRENT PRACTICE

On-line databases frequently give recommendations on how to cite them.

They often omit version information.

They fail to provide an adequate location.

The Columbia Guide to Online Style, although it discusses issues of permanence of links, does not mention D1 as one of its citation “principles”.

ISO690 standard deals with citations of parts of electronic documents.

Page 55: Provenance Management & Citations in  Curated  Databases

STRUCTURAL ISSUES (1/5)

Databases have explicit structure.

This offers the possibility of a citation using this structure to home in on the relevant data.

Example (IUPHAR database)

Figure 1: Rough structure of the IUPHAR web interface.

The structure of what the user sees is not the same as the underlying database.

Page 56: Provenance Management & Citations in  Curated  Databases

STRUCTURAL ISSUES (2/5)

Consider the following:

1. The IUPHAR database (C1) contains no information about Ginandtonicin.

C1 should refer to the whole database.

2. The IUPHAR database (C2) lists five ligands for Melatonin receptor MT1.

<C2> should be the web page for that receptor or maybe the receptor family page.

3. The IUPHAR database (C3) asserts that luzindole is an antagonist ligand for receptor MT1.

Citing just that row or the table? Better, cite the receptor or its family.

Making the context too narrow can be as counterproductive as making it too wide.

Page 57: Provenance Management & Citations in  Curated  Databases

STRUCTURAL ISSUES (3/5)

One citation is coarser than another if it refers to a higher structure (<C1> is coarser than <C2>).

D3 It should be possible to cite a database at varying degrees of coarseness.

In order to make further progress we have to look at the internal structure of a citation.

Example: Life Sci., 53, 393-398 (journal; volume number; pages)

Our understanding is based on a common structure of all journals.

Page 58: Provenance Management & Citations in  Curated  Databases

STRUCTURAL ISSUES (4/5)

A “concrete syntax” for citations is a sequence {k1 = v1, k2 = v2, ..}, where k1, k2, ... are keywords and v1, v2, … are associated values.

Example: {Journal = “Life Sci.”, Number = 53, Pages = 393-398}

There is a natural “part of” relationship among citations.

Example: {Journal = “Life Sci.”} is part of {Journal = “Life Sci.”, Number = 53}
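As an informal illustration, a citation in this concrete syntax can be modelled as a Python dictionary, and "part of" as containment of key-value pairs. The function name is hypothetical, and treating "part of" as plain dictionary containment is an assumption made for this sketch.

```python
def is_part_of(coarse, fine):
    """True if every keyword/value pair of `coarse` also appears in `fine`."""
    return all(key in fine and fine[key] == value for key, value in coarse.items())


journal = {"Journal": "Life Sci."}
volume = {"Journal": "Life Sci.", "Number": 53}
article = {"Journal": "Life Sci.", "Number": 53, "Pages": "393-398"}
```

Under this reading, the citation for the journal is part of the citation for volume 53, which in turn is part of the citation for the article, so coarser citations carry fewer pairs.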

Page 59: Provenance Management & Citations in  Curated  Databases

STRUCTURAL ISSUES (5/5)

D4 If C and C’ are citations and <C’> is coarser than <C> then the location information in C’ should be a part of the location information in C.

C’: {DB=IUPHAR, IUPHAR-Receptor-family=Melatonin}

C: {IUPHAR-Receptor-family=Melatonin}

Page 60: Provenance Management & Citations in  Curated  Databases

TEMPORAL ISSUES (1/2)

The obvious way to deal with change in citation is to provide, in the citation, a version number.

{DB=IUPHAR, Version=17, Family=Melatonin}

Using time may be misleading.

To what does the version refer?

D5 Versions should be recorded at the database level.

The rate of publication of versions is much slower than the rate of updates.

Having such a citation obliges someone to keep past versions.

It is possible to cite a range of versions: {..Version=2-8..}

Page 61: Provenance Management & Citations in  Curated  Databases

TEMPORAL ISSUES (2/2)

Now, what is <{DB=IUPHAR, Family=Melatonin}>, a citation without a version number? The latest version of the database.

So, we need two words:

One for a fixed citation.

One for a “current link”, the place at which you may find the latest information.

A good job of distinguishing between “this” version, the “latest” version and previous versions of documents was presented.

Page 62: Provenance Management & Citations in  Curated  Databases

PRESENTATION, CONTENT AND PRESERVATION

The structure of the cited “thing” is not necessarily the same as the structure of the underlying database.

The underlying database contains information (working notes etc.) that is not intended as part of the published material.

We should not be making direct citations to the internal structure of the database.

The hierarchy that the user sees should be represented as an XML document.

Page 63: Provenance Management & Citations in  Curated  Databases

AUTOMATICALLY GENERATING CITATIONS (1/3)

Inserting citation data manually is both time-consuming and error-prone.

Automatic generation of citations is a good check on the integrity of the document.

It guarantees that the contents of the document are consistent with the citation.

It gives guarantees on the descriptive information (e.g. there is at most one Title).

Page 64: Provenance Management & Citations in  Curated  Databases

AUTOMATICALLY GENERATING CITATIONS (2/3)

A rule that generates location information:

{DB=IUPHAR, Version=$v, Family=$f} ← /Root[]/Version[Number=$’v]/Data[]/Family[FamilyName=$’f]

Left-hand side: a concrete syntax of citations with variables.

Right-hand side: the pattern is expressed in the syntax of XPath.

/Root[]: the database or document has a unique root.

Version[Number=$’v]: each Version must have a Number that uniquely identifies the node and provides a value for $v.

Data[]: indicates that for each Version, there is precisely one Data node.

Family[FamilyName=$’f]: each Family node has a FamilyName which uniquely identifies the family.
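A rough sketch of how such a location rule might be applied with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. The XML sample, the element layout and the function names are hypothetical; they only mimic the Root/Version/Data/Family structure named in the rule and are not the real IUPHAR schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical published view of the database.
SAMPLE = """
<Root>
  <Version>
    <Number>11</Number>
    <Data>
      <Family><FamilyName>Calcitonin</FamilyName></Family>
      <Family><FamilyName>Melatonin</FamilyName></Family>
    </Data>
  </Version>
</Root>
"""


def location_citations(root):
    """Generate {DB, Version, Family} citations, checking the rule's constraints."""
    citations = []
    for version in root.findall("Version"):
        number = version.findtext("Number")
        data_nodes = version.findall("Data")
        if number is None or len(data_nodes) != 1:
            raise ValueError("each Version needs a Number and exactly one Data node")
        for family in data_nodes[0].findall("Family"):
            name = family.findtext("FamilyName")
            if name is None:
                raise ValueError("each Family needs a FamilyName")
            citations.append({"DB": "IUPHAR", "Version": number, "Family": name})
    return citations


def resolve(root, citation):
    """Find the unique Family node a citation refers to (values without quotes)."""
    path = "Version[Number='{Version}']/Data/Family[FamilyName='{Family}']".format(**citation)
    matches = root.findall(path)
    if len(matches) != 1:
        raise ValueError("citation does not identify exactly one node")
    return matches[0]


root = ET.fromstring(SAMPLE)
citations = location_citations(root)
```

Generating the citation from the document and resolving it back to a node is one simple way to check that a citation and the cited content stay consistent.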

Page 65: Provenance Management & Citations in  Curated  Databases

AUTOMATICALLY GENERATING CITATIONS (3/3)

A rule that generates description information:

{DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i} ← /Root[]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d]/Data[]/Family[FamilyName=$’f]/Contributor-list/Contributor=$+a]/Receptor[ReceptorName=$’r]

$. : exactly one value.

$? : at most one value.

$+ : one or more values expected.

Generates:

{DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner}, Editor=Tony Harmar, Date=Jan, 2006, DOI=10.1234}

Page 66: Provenance Management & Citations in  Curated  Databases

CONCLUSIONS

We have to do a modest amount of work in structuring the data appropriately in XML, after which citations can be specified and generated by some simple rules.