user requirements - xg provenance wiki.pdf

User RequirementsFrom XG Provenance Wiki

Contents1 Requirements for Provenance on the Web2 About this document3 Introduction4 Motivating Scenarios: The Need for Provenance

4.1 News Aggregator Scenario4.2 Disease Outbreak Scenario4.3 Business Contract Scenario

5 Requirements for Provenance on the Web5.1 Content5.2 Management5.3 Use

6 Concluding Remarks

Requirements for Provenance on the WebAbout this documentSource: W3C Provenance Incubator Group (http://www.w3.org/2005/Incubator/prov/wiki)

Authors: James Cheney, Yolanda Gil, Paul Groth (Editor), and Simon Miles

Acknowledgements: Luc Moreau, Yogesh Simmhan, Paolo Missier for providing detailed comments; Jun Zhao,Oleksiy Chayka, Jose Manuel Gomez Perez, Lalana Kagal, James McCusker, James Myers, Luc Moreau, SimonMiles, Paul Groth, Yolanda Gil, Christine Runnegar, Satya Sahoo for specifying requirements; Rocio Aldeco-Perez, Jitin Arora, Chris Bizer, Aleksey Chayka, James Cheney, Kai Eckert, Andre Freitas, Irini Fundulaki,Yolanda Gil, Paul Groth, Olaf Hartig, Laura Hollink, M. Scott Marshall, Simon Miles, Paolo Missier, LucMoreau, Jim Myers, Iman Naja, Joshua Phillips, Paulo Pinheiro da Silva, Ronald Reck, Marco Roos, SatyaSahoo, Amit Sheth, Marieke van Erp, Brent Weatherly, Jun Zhao for contributing use cases.

Release Date: April 9, 2010.

Description: This document provides an accessible discussion of the requirements for provenance on the Webdeveloped by the W3C Provenance Incubator Group (http://www.w3.org/2005/Incubator/prov/wiki). Thedocument focuses on user requirements: those requirements that users have on software systems with respectto provenance. These requirements are organized into a number of dimensions focusing broadly on three aspectsof provenance: the content that needs to be represented and maintained in provenance records, the managementof provenance records, and the uses of provenance information. To illustrate these requirements, we use threescenarios that were synthesized from a large set of use cases collected by the the group.

User Requirements - XG Provenance Wiki http://www.w3.org/2005/Incubator/prov/wiki/User_Requirements

1 of 11 5/3/15, 6:15 PM

IntroductionThe provenance of information is crucial to making determinations about whether information is trusted, how tointegrate diverse information sources, and how to give credit to originators when reusing information. Broadlyconstrued, provenance encompasses the initial sources of information used as well as any entity and processinvolved in producing a result. In an open and inclusive environment such as the Web, users find informationthat is often contradictory or questionable. People make trust judgments based on provenance that may or maynot be explicitly offered to them. Reasoners in the Semantic Web will need explicit representations ofprovenance information in order to make trust judgments about the information they use. With the arrival ofmassive amounts of Semantic Web data (eg, via the Linked Open Data community) information about theprovenance of that data becomes an important factor in developing new Semantic Web applications. Therefore,a crucial enabler of the Semantic Web deployment is the explicit representation of provenance that is accessibleto machines, not just to humans.

The W3C Provenance Group (http://www.w3.org/2005/Incubator/prov/wiki) was formed in September 2009 asa W3C Incubator Activity (http://www.w3.org/2005/Incubator/). Its charter (http://www.w3.org/2005/Incubator/prov/charter) was provide a state-of-the art understanding and develop a roadmap in the area of provenance forSemantic Web technologies, development, and possible standardization. At the time of writing this document,the group has been in existence for six months, which is a half of its expected activity. The group has produceda number of documents, including:

an overview of key dimensions for provenancemore than thirty use cases spanning many areas and contexts that illustrate these key dimensionsa broad set of user requirements and technical requirements derived from those use cases.

The present document is intended for a broad audience and summarizes the group's findings regardingrequirements for provenance motivated by use cases collected by the group. We refer the reader to theProvenance Group's web site (http://www.w3.org/2005/Incubator/prov/wiki) for details and documentation. Inparticular, there is a detailed elaboration of what use cases and requirements are represented in the scenarios inthe group's interim document on requirements.

We begin describing three broad scenarios that illustrate the need for provenance:

News Aggregator Scenario: a news aggregator site that assembles news items from a variety of sources(such as news sites, blogs, and tweets), where provenance records can help with verification, credit, andlicensing

1.

Disease Outbreak Scenario: a data integration and analysis activity for studying the spread of a disease,involving public policy and scientific research, where provenance records support combining data fromvery diverse sources, justification of claims and analytic results, and documentation of data analysisprocesses for validation and reuse

2.

Business Contract Scenario: an engineering contract scenario, where the compliance of a deliverablewith the original contract and the analysis of its design process can be done through provenance records

3.

These three scenarios were motivated by the original use cases collected by the Provenance Group. They weredesigned to cover different subsets of requirements and address different communities of interest.

After describing the three scenarios, we discuss the requirements for provenance using the scenarios to illustrateand clarify. We group the requirements for provenance into three major areas of concern: content, management,and use. Content refers to the type of information that provenance records need to contain. Management refersto the mechanisms that make provenance available and accessible in an open system like the Web. Uses of


2 of 11 5/3/15, 6:15 PM

provenance refer to the purposes and usage of provenance information.

Motivating Scenarios: The Need for ProvenanceNews Aggregator ScenarioMany web users would like to have mechanisms to automatically determine whether a web document orresource can be used, based on the original source of the content, the licensing information associated with theresource, and any usage restrictions on that content. Furthermore, in cases of mashed-up content it would beuseful to ascertain automatically whether or not to trust it by examining the processes that created, processed,and delivered it. To illustrate these issues, we present the following scenario of a fictitious website, BlogAgg,that aggregates news information and opinion from the Web.

BlogAgg aims to provide rich real time news to its readers automatically. It does this by aggregatinginformation from a wider variety of sources including microblogging websites, news websites, publiclyavailable blogs and other opinion. It is imperative for BlogAgg to present only credible and trusted content onits site to satisfy its customer, attract new business, and avoid legal issues. Importantly, it wants to ensure thatthe news that it aggregates are correctly attributed to the right person so that they may receive credit.Additionally, it wants to present the most attractive content it can find including images and video. However,unlike other aggregators it wants to track many different web sites to try and find the most up to the date newsand information.

Unfortunately for BlogAgg, the source of the information is not often apparent from the data that it aggregatesfrom the web. In particular, it must employ teams of people to check that selected content is both high-qualityand can be used legally. The site would like this quality control process to be handled automatically.

For example, one day BlogAgg discovers that #panda is a trendy topic on Twitter. It finds that a tweet "#pandabeing moved from Chicago Zoo to Florida! Stop it from sweating http://bit.ly/ssurl", is being retweeted acrossmany different microblogging sites. BlogAgg wants to find the correct originator of the microblog who first gotthe word out. It would like to check if it is a trustworthy source and verify the news story. It would also like tocredit the original source in its site, and in these credits it would like to include the email address of theoriginator if allowed by that person.

Following the tiny-url, BlogAgg discovers a site protesting the move of the panda. BlogAgg wants to determinewhat organization is responsible for the site so that its name can run next to the snippet of text that BlogAggruns. In determining the snippet of text to use, BlogAgg needs to determine whether or not the text originated atthe panda protest site or was a quoted from another site. Additionally, the site contains an image of a panda thatappears as if it is sweating. BlogAgg would like to automatically use a thumbnail version of this image in itssite, therefore, it needs to determine if the image license allows this or if that license has expired and is nolonger in force. Furthermore, BlogAgg would like to determine if the image was modified, by whom, and if theunderlying image can be reused (i.e. whether the license of the image is actually correct). Additionally, it wantsto find out whether any modifications were just touch-ups or were significant modifications. Using the variousinformation about the content it has retrieved, BlogAgg creates an aggregated post. For the post, it provides avisual seal showing how much the site trusts the information. By clicking on the seal, the user can inspect howthe post was constructed and from what sources and a description of how the trust rating was derived from thosesources.

Note, that BlogAgg would want to do this same sort of process for thousands to hundreds of thousands a sites aday. It would want to automate its aggregation process as much as possible. It is important for BlogAgg to be


3 of 11 5/3/15, 6:15 PM

able to detect when this aggregation process does not work in particular when it can not determine the origins ofthe content it uses. These need to be flagged and reported to BlogAgg's operators.

Disease Outbreak ScenarioMany uses of the web involve the combination of data from diverse sources. Data can only be meaningfullyreused if the collection processes are exposed to users. This enables the assessment of the context in which thedata was created, its quality and validity, and the appropriate conditions for use. Often data is reused acrossdomains of interest that end up mutually influencing each other. This scenario focuses on the reuse of dataacross disciplines in both anticipated and unanticipated ways.

Alice is an epidemiologist studying the spread of a new disease called owl flu (a fictitious disease made up forthis example), with support from a government grant. Many studies relevant to public policy are funded by thegovernment, with the expectation that the conclusions will provide guidance for public policy decision-makers.These decisions need to be justified by a cost-benefit analysis. In practice, this means that results of studies notonly need to be scientifically valid, but the source data, intermediate steps and conclusions all need to beavailable for other scientists or non-experts to evaluate. In the United Kingdom, for example, there arepublished guidelines that social scientists like Alice are required to follow in reporting their results, called TheGreen Book: Appraisal and Evaluation in Central Government (http://www.hm-treasury.gov.uk/d/green_book_complete.pdf).

Alices study involves collecting reports from hospitals and other health services as well as recruitingparticipants, distributing and collecting surveys, and performing telephone interviews to understand how thedisease spreads. The data collected through this process are initially recorded on paper and then transcribed intoan electronic form. The paper records are confidential but need to be retained for a set period of time.

Alice will also use data from public sources (Data.gov, Data.gov.uk), or repositories such as the UK DataArchive (data-archive.ac.uk). Currently, a number of e-Social Science archives (such as NeISS, e-Stat andNESSTAR) are being developed for keeping and adding value to such data sets, which Alice may also use.Alice may also seek out "natural experiment" data sources that were originally gathered for some other purpose(such as customer sales records, Tweets on a given day, or free geospatial data).

Once the data are collected and transcribed, Alice processes and interprets the results and writes a reportsummarizing the conclusions. Processing the data is labor-intensive, and may involve using a combination ofmany tools, ranging from spreadsheets, generic statistics packages (such as SPSS or Stata), or analysis packagesthat cater specifically to Alice's research area. Some of the data sources may also provide some querying oranalysis processing that Alice uses at the time of obtaining the data.

Alice may find many challenges in integrating the data in different sources, since units or semantics of fields arenot always documented. When different data sources use different representations, Alice may need to recode orintegrate this data by hand or by guessing an appropriate conversion function - subjective choices that may needto be revisited later. To prepare the final report and make it accessible to other scientists and non-experts, Alicemay also make use of advanced visualization tools, for example to plot results on maps.

The conclusions of the report may then be incorporated into policy briefing documents used by civil servants orexperts on behalf of the government to identify possible policy decisions that will be made by (non-expert)decision makers, to help avoid or respond to future outbreaks of owl flu. This process may involve consideringhypotheticals or reevaluating the primary data using different methodologies than those applied originally byAlice. The report and its linked supporting data may also be provided online for reuse by others or archivedpermanently in order to permit comparing the predictions made by the study with the actual effects of decisions.


4 of 11 5/3/15, 6:15 PM

Bob is a biologist working to develop diagnostic and chemotherapeutic targets in the human pathogenresponsible for owl flu. Bobs experimental data may be combined with data produced in Alicesepidemiological study, and Bobs level of trust in this data will be influenced by the detail and consistency ofthe provenance information accompanying Alices data. Bob generates data using different experimentprotocols such as expression profiling, proteome analysis, and creation of new strains of pathogens through geneknockout. These experiment datasets are also combined with data from external sources such as biologydatabases (NCBI Entrez Gene/GEO, TriTrypDB, EBI's ArrayExpress) and information from biomedicalliterature (PubMed) that have different curation methods and quality associated with them. Biologists need tojudge the quality, timeliness and relevance of these sources, particularly when data needs to be integrated orcombined.

These sources are used to perform "in silico" experiments via scientific workflows or other programmingtechniques. These experiments typically involve running several computational steps, for example runningmachine learning algorithms to cluster gene expression patterns or to classify patients based on genotype andphenotype information. The results need to meet a high standard of evidence so that the biologists can publishthem in peer-reviewed journals. Therefore, justification information documenting the process used to derive theresults is required. This information can be used to validate the results (by ensuring that certain common errorsdid not occur), to track down obvious errors or understand surprising correct results, to understand how differentresults arising from similar processes and data were obtained, and to infer heuristics for data quality.

As more data of owl flu outbreaks and treatments become available over time, Alice's epidemiological studiesand Bob's research on the behavior of the owl flu virus will need to be updated by repeating the same analyticalprocesses incorporating the new data.

Business Contract ScenarioIn scientific collaborations and in business, individual entities often enter into some form of contract to providespecific services and/or to follow certain procedures as part of the overall effort. Proof that work was performedin conformance with the expectations of the project leadership (as expressed in the contract) is often required inorder to receive payment and/or to settle disputes. Such proof must, for example, document work that wasperformed on specific items (samples, artifacts), provide a variety of evidence that would preclude various typesof fraud, allow a combination of evidence from multiple witnesses, and be robust to providing partialinformation to protect privacy or trade secrets. To illustrate such demands of proof, and the other requirementswhich stem from having such information, we consider the following use case.

Bob's Website Factory (BWF) is a fictitious company that creates websites which include secured functionality,such as for payments. Customers choose a template website structure, enter specifications according to aquestionnaire, upload company graphics, and BWF will create an attractive and distinct website. The final stageof purchasing a website is for the customer to agree to a contract setting out the responsibilities of the customerand BWF. The final contract document, including BWF's digital signature will be automatically sent by email toboth parties.

BWF has agreed a contract with a customer, Customers Inc., for the supply of a website. Customers Inc. are nothappy with the product they receive, and assert that contractual requirements on quality were not met.Specifically, BWF finds it must defend itself against claims that work to create a site to the specifications wasnot performed or was performed by someone without adequate training, that the security of payments to the siteis faulty due to improper quality control and testing procedures not being followed. Finally, Customers Inc.claim that records were tampered with to remove evidence of the latter improper practices.


5 of 11 5/3/15, 6:15 PM

BWF wish to defend themselves by providing proof that the contract was executed as agreed. However, theyhave concerns about what information to include in such a proof. Many websites are not designed from scratchbut are based on an existing design in response to the customer's requests or problems. Also, sometimes parts ofthe design of multiple different sites, designed for other customers, are combined to make a new website. Bothprotecting its own intellectual property and confidential information regarding other customers mean that BWFwishes to reveal only that information needed to defend against Customers Inc.'s claims.

There are many kinds of entities relevant to describing BWF's development processes, from source code text inprograms to images. To provide adequate proof, BWF may need to include documentation on all of thesedifferent objects, making it possible to follow the development of a final design through multiple stages. Thecontract number of a site is not enough to unambiguously identify that artifact, as designs move throughmultiple versions: the proof requires showing that a site was not developed from a version of a design which didnot meet requirements or use security software known to be faulty.

With regard to showing that there was adequate quality control and testing of the newly designed or redesignedsites, BWF needs to demonstrate that a design was approved in independent checks by at least two experts. Inparticular, the records should show that the experts were truly independent in their assessments, i.e. they did notboth base their judgement on one, possibly erroneous, summarized report of the design.

Finally, Customers Inc. claim that there is a discrepancy in the records, suggesting tampering by BWF to hidetheir incompetence, as the development division apparently claimed to have received instructions from theexperts checking the design that it was OK before the experts themselves claim to have supplied such a report.BWF suspects, and wishes to check, that this is due to a difference in semantics between the reported times, e.g.in one case it regards the receipt of the report by the developers, in the other it regards the receipt of theacknowledgement of the report from the developers by the experts. These reports should be shared in a formatthat both parties understand.

Requirements for Provenance on the WebWe organize the requirements for provenance into three related categories: content, management, and use. Wewill refer to the three scenarios described above: the News Aggregator Scenario, the Disease Outbreak Scenario,and the Business Contract Scenario.

ContentContent refers to what types of information would need to be represented in a provenance record. That is, whatstructures and attributes would need to be defined in order to contain the kinds of provenance information thatwe envision need to be captured.

We need to establish first what is the artifact that is the object of any statements about provenance, and be ableto refer to it. This object can be a variety of things. In the Web this will be a web resource, essentially anythingthat can be identified with a URI, such as web documents, datasets, assertions, or services. In the NewsAggregator Scenario the object of provenance was the final news item posted by the news aggregator. It can alsobe a set or collection of items, as we may wish to associate provenance information to a group of items, such asthe pages of the website sold in the Business Contract Scenario. Conversely, sometimes provenance informationneeds to refer to a particular portion or aspect of an artifact. A challenge to keep in mind is being able to keeptrack of provenance during an object's lifetime. For example, objects may be organized in collections, thensubgroups selected, then portions of some objects modified, etc.


6 of 11 5/3/15, 6:15 PM

One important aspect of content is Attribution. Attribution refers to the sources (i.e., typically any webresource that has an associated URI such as documents, web sites, or data) or entities (i.e., people,organizations, and other identifiable groups) that contributed to create the artifact in question. In the NewsAggregator Scenario, sources would be the URIs for a blog or for a news item at a website, while entities wouldinclude Twitter.com and the person who created the original microblog. Typically an artifact would beassociated with several sources and entities. One challenge in this respect is to represent which of them hasresponsibility for the artifact at hand, meaning which of the many entities that were associated with its creationactually endorses or stands by it. For example, in the News Aggregator Scenario the image of the panda mayhave been modified before it was posted on the protest site but some entity must have originally taken the actualpicture of the unhappy panda and endorsed it. Comparably, in the Business Contract Scenario, it is a critical partof the argument to distinguish the expert's checking a design's validity from the developer's creating a site fromthat design. In addition, the provenance representation of attribution should also enable us to see the true originof any statement of attribution to an entity. It is important to represent whether the statement was recorded bythat entity and can be verified (for example with a digital signature). Alternatively, we would want an indicationthat the attribution statement was stated by the original entity or was reconstructed and then asserted by a thirdparty.

Another important aspect of provenance content is Process. This refers to the activities (or steps) that werecarried out to generate the artifact at hand. These activities encompass the execution of a computer program thatwe can explicitly point to, a physical act that we can only refer to, and some action performed by a person thatcan only be partially represented. An example of such activity in the Disease Outbreak Scenario is the executionof a machine learning algorithm to create a classification for a clinical trial dataset. In the News AggregatorScenario, examples of activities are each of the retweets (representing the action that different people did oftaking an assertion and including it in a new tweet) and each of the edits to the panda image done by differentpeople. Activities are typically related to one another temporally or causally, and the provenance representationshould capture such relationships. Provenance information may need to refer to descriptions of the activities, sothat it becomes possible to support reasoning about processes involved in provenance and support descriptivequeries about them. For example, in the Disease Outbreak Scenario, the machine learning algorithm used maybe described as a Bayes algorithm or a decision tree algorithm, and a user may query for whether the classifier ishuman readable or not (which the latter is while the former is not). Processes can be represented at a veryabstract level, focusing only on important aspects, or at a very fine-grained level to include minute details. Forexample, in the Disease Outbreak Scenario, users would like to know how a query result was computed. Morespecifically, users may like to know the data elements and query language operators that were used to computethe result of a query.

With respect to the detail of provenance content, an important consideration is to represent enough processdetails to allow reproducibility, that is, the ability to re-create the artifact by re-enacting (re-executing) theprocess as described by its provenance record. For example, in the Disease Outbreak Scenario a scientist maywant to re-run the same machine learning algorithm to include more recent data and check whether they obtainthe same classification or a different one. For this they would need to have access to the code and perhaps theparameter settings, while it would not be important to know what particular machine was used or how muchmemory was originally allocated to run the algorithm. Process provenance information needs to be connectedwith attribution information, so that it is possible to represent how sources or entities were involved in whatactivities and what their role was. For example, in the Disease Outbreak Scenario the clinical dataset was asource that played the role of being input to the machine learning algorithm, while the scientist was associatedwith that same activity as the entity that set its parameters and submitted it for execution. An important aspectof representing process provenance are resource access aspects. This includes matters such as the access time,the server accessed, and the party responsible for the server. For example, in the Disease Outbreak Scenario thedataset of patients would be larger over time, and if so the results of the analysis process could changedrastically. In the News Aggregator Scenario, any of the tweets and posts may have been corrected after the


7 of 11 5/3/15, 6:15 PM

access occurred and if so it would be important to know at what time the content of any resource was accessed.

Evolution and versioning must be treated with special status in a provenance representation. As an artifactevolves over time, its provenance should be augmented in specific ways that reflect the changes made over priorversions and what entities and processes were associated with those changes. When one has full control over anartifact and its provenance records this may be a simple matter of housekeeping, but this is a challenge in opendistributed environments such as the web. Consider the representation of provenance when republishing, forexample by retweeting, reblogging, or repackaging a document. It should be possible to represent when a set ofchanges grant the denomination of a new version of the object in question. In some cases, it may be desirable tospecify the reasons for some changes, for example to summarize a document to facilitate readability or tocompress a dataset in order to reduce its size. In addition, it is an open question whether each new versionpublished should publish not only the artifact but also its full provenance as available in the original source. Inthe Business Contract Scenario, when the design of a website developed for one customer has aspects drawnfrom designs of other customers, publishing the full provenance may breach confidentiality: information mightbe revealed about past customers. Another open question is how to associate provenance information to each ofthe updates to an artifact. One must consider whether the entire provenance trail should be associated with eachupdate, or whether each change should only represent the delta difference and refer to prior versions of theartifact for more extensive provenance.

A particular kind of provenance information is justifications of decisions. The purpose of a justification is toallow those decisions to be discussed and understood. In the Business Contract Scenario, the criticalrequirement is for a company to justify the processes it undertook (or did not undertake), involvement ofparticular people, intermediate data (designs) used and so on, leading to a decision to provide some website to acustomer under a contractual agreement. This decision may then be discussed in a court. To allow justificationsto be made, the justifying party must be prepared, i.e. they must adequately store the provenance data which canbe used to make their case and set up procedures to gather that information, ideally automatically. Anotherexample of the importance of using provenance to justify decisions is in the Disease Outbreak Scenario where itis important to capture the arguments for and against the conclusions of the owl flu report as well as the abilityto capture the evidence behind particular hypotheses in the report.

Inference may be required to derive information from the original provenance records. Some provenanceinformation may be directly asserted by the relevant sources of some data or actors in a process, while otherinformation may be derived from that which was asserted. In general, one fact may entail another, and this isimportant in the case of provenance data which is inherently describing the past, for which the majority of factscannot now be known. To an example from the Business Contract Scenario, the complainants in the case derivefrom the (ambiguous) records held by the company that a check on the quality of security-related code wasperformed only after development based on that code had started. This derivation should not be taken as plainfact by the court, but understood as due to a particular derivation process, with a set of assumptions, and by aparticular (biased) source.

ManagementProvenance management refers to the mechanisms that make provenance available and accessible in a system.

An important issue is the publication of provenance. Provenance information must be made available on theweb. Related issues include how is provenance exposed, discovered, and distributed. A provenancerepresentation language must be chosen and made available so others can refer to it in interpreting theprovenance. The publisher of provenance information should be associated with provenance records.

Once provenance is available, it must be accessible. That is, it must be possible to find it by specifying the


8 of 11 5/3/15, 6:15 PM

artifact of interest. For example, in the News Aggregator Scenario the aggregator must be able to point to atweet and query about its provenance. In some cases, it must be possible to determine what the authoritativesource of provenance is for some class of entities. For example, in the Business Contract Scenario, the websitedevelopment company will argue that their own records are the authoritative sources regarding their owninternal processes. Query formulation and execution mechanisms must be defined for provenancerepresentation.

In realistic settings, provenance information will have to be subject to dissemination control. Provenanceinformation may be associated with access policies about what aspects of provenance can be made availablegiven a requestor's credentials. Provenance may have associated use policies about how an artifact can be usedgiven its origins. This may include licensing information stated by the artifact's creators regarding what rightsof use are possible for the artifact. For example, in the News Aggregator Scenario the aggregator needs to beable to determine what license the panda image actually has. Finally, provenance information may be withheldfrom access for privacy protection. In the Business Contract Scenario, the website developer has a need tofilter what provenance information is revealed about a design due to its entanglement with their own intellectualproperty and confidential information about other customers.

The scale of provenance information is a major concern, as the size of the provenance records may by farexceed the scale of the artifacts themselves. Despite the presence of large amounts of provenance, efficientaccess to provenance records must be possible. Tradeoffs must be made regarding the granularity of theprovenance records kept and the actual amount of detail needed by users of provenance. For instance, in theDisease Outbreak Scenario, the complete provenance about the result of an analysis may include a huge portionof the biomedical literature distributed across many published repositories.

UseWe need to take into account requirements for provenance based on the use of any provenance information thatwe have recorded. The same provenance records may need to accommodate a variety of uses as well as diverseusers/consumers.

An important consideration is how to make provenance information understandable to its users/consumers.Just because the information that they need is recorded in the provenance representation it does not mean thatthey would be able to use it for their purposes. An important challenge that we face is to allow for multiplelevels of abstraction in the provenance records of an artifact as well as multiple perspectives or views oversuch provenance. In the Disease Outbreak Scenario a scientist may want to start with a high-level descriptionthat includes only the machine learning algorithms used but not any details of the data format conversionoperations that were carried out, and then the latter may be shown upon request. In addition, appropriatepresentation and visualization of provenance information is an important consideration, as users will likelyprefer something other than a set of provenance statements. For example, in the Disease Outbreak Scenario ascientist may prefer to look at a causal diagram of the processes used to generate a result, while in the NewsAggregator Scenario a simple logo signifying approved provenance may be more appropriate to use instead.Transitions between different presentations of provenance need to be defined so that users understand how thedifferent views are related. To achieve understandability, it is important to be able to combine generalprovenance information with domain-specific information. This is the case in the Business Contract Scenariowhere general information about how one item derives from another (a website from a design) needs to becombined with information specific to the case in hand, the specifications unique to the customer. Differentpresentations may be needed for end users with different levels of expertise in the subject matter.

Because provenance information may be obtained from heterogeneous systems and different representations,interoperability is an important requirement. A query may be issued to retrieve information from provenance


9 of 11 5/3/15, 6:15 PM

records created by different systems that then need to be integrated. At a finer grain, the provenance of a givenartifact may be specified by multiple systems and need to be combined. It could be that each system contributeda type of provenance information, or that different systems contributed the provenance of sources used to createthe artifact. In the News Aggregator Scenario, each news site may use a different representation of provenance,and the aggregator would need to integrate them to provide a coherent provenance picture to the user. Usersmay want to also know what sources contributed what provenance information, so they can make decisions incase of conflicts. If the different systems use different vocabularies, shared standards or mappings would beneeded facilitate interoperability and usability.

Another important use of provenance is for comparison of artifacts based on their origins. Two artifacts mayseem very different while their provenance may indicate significant commonalities. Conversely, two artifactsmay seem alike, and their provenance may reveal important differences. For example in the Disease OutbreakScenario two scientists may want to compare results from their experiments by comparing the processprovenance information as well as the entities (e.g., reagents) that they each used.

Provenance data can be used for accountability, such as in the Business Contract Scenario, where the websiteengineering company uses provenance data to account for its actions. Specifically, accountability may meanallowing users to verify that work performed meets a contract decided upon earlier, determining the license thata composite object has due to the licenses of its components, or comparing that an account of the past suggestedby a provenance record is compliant with regulations about what should have occurred. Accountability requiresthat the users can rely on the provenance record and authenticate its sources. This could be achieved withsignatures, or possibly by combining multiple independent accounts to collectively provide enough weight tothe claim that the correct thing happened, as occurs with the two experts' accounts in the engineering scenario.

A very important use of provenance is trust. Provenance information is used to make trust judgments on agiven entity. For example, in the News Aggregator Scenario, the aggregator site makes decisions about whetherto include a news item based on its provenance. Users can make similar decisions based on provenance as well,for example by defining filters to eliminate news items based on specific properties of their provenanceinformation. Trust is often based on attribution information, by checking the reputation of the entities involvedin the provenance, perhaps based on past reliability ratings, known authorities, or third-party recommendations.Similarly, measures of the relative information quality can be used to choose among competing evidence fromdiverse sources. For example, in the Disease Outbreak Scenario, researchers will need to assess the quality ofthe data they are using, particularly if they are obtained from open public sources that do not follow formalcreation or curation processes. Finally, users should be able to access and understand how trust assessments arederived from provenance.

Using provenance information may imply handling imperfections. Provenance information may be incompletein that some information may be missing, or incorrect if there are errors. Provenance information may also beprovided with some uncertainty or be of a probabilistic nature. These imperfections may be caused because ofproblems with the recording of the provenance information. But they may also arise because the user does nothave access to the complete and accurate provenance records even if they exist. This is the case whenprovenance is summarized or compressed, in which case many details may be abstracted away or missing. Thisis also the case if the user does not have the right permissions to access some aspects of the provenance records.This would also be the case if some aspect of the provenance records needs to be withheld for privacy reasons.Finally, provenance may also be subject to deception and be fraudulent partially or in its entirety. Several of thelatter cases can be seen in the Business Contract Scenario: the engineering company hides some records due toconfidentiality, leading to gaps in the account, while the complainant claims that this and the apparent timediscrepancies suggest that the records have been fraudulently altered to support the engineering company's case.

Another use of provenance is debugging. Users may want to detect failure symptoms in the provenance records


10 of 11 5/3/15, 6:15 PM

and diagnose problems in the process that generated an artifact, whether conducted in a software system or bypeople. An example is shown in the Business Contract Scenario, where the company tries to determine whetherit has made an error due to two apparently independent evaluations being dependent on the same faulty sourceinformation. Without a record of the provenance of those evaluations, it would be impossible to determinewhether such a 'bug' in the process had indeed occurred.

Concluding RemarksThis document focuses on user requirements for provenance, motivated by an extensive set of use casescontributed by members of the Provenance Group and that cover a wide variety of perspectives on the topic ofprovenance.

This document will be followed by a state-of-the-art report on provenance, which will place existing work onprovenance in the context of the user requirements described here.

Retrieved from "http://www.w3.org/2005/Incubator/prov/wiki/index.php?title=User_Requirements&oldid=2242"

This page was last modified on 7 December 2010, at 23:52.This page has been accessed 86,328 times.


11 of 11 5/3/15, 6:15 PM

user requirements - xg provenance wiki.pdf

Documents