W3C Incubator Report

Provenance XG Final Report

W3C Incubator Group Report 08 December 2010

This Version:
http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/
Latest Published Version:
http://www.w3.org/2005/Incubator/prov/XGR-prov/
Editors:
Yolanda Gil, University of Southern California Information Sciences Institute (USC / ISI)
James Cheney, University of Edinburgh (School of Informatics)
Paul Groth, VU University Amsterdam
Olaf Hartig, Humboldt-Universität zu Berlin
Simon Miles, King's College London
Luc Moreau, University of Southampton
Paulo Pinheiro da Silva, Rensselaer Polytechnic Institute
Contributors:
Sam Coppens, IBB TELIS, Ghent University
Daniel Garijo, Universidad Politécnica de Madrid
Jose Manuel Gomez, Isoco
Paolo Missier, University of Manchester
Jim Myers, RPI
Satya Sahoo, Case Western Reserve University
Jun Zhao, University of Oxford

Abstract

Given the increased interest in provenance in the Semantic Web area and in the Web community at large, the W3C established the Provenance Incubator Group as part of the W3C Incubator Activity with a charter to provide a state-of-the-art understanding and develop a roadmap in the area of provenance and possible recommendations for standardization efforts. This document summarizes the findings of the group.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of Final Incubator Group Reports is available. See also the W3C technical reports index at http://www.w3.org/TR/.

This document was published by the W3C Provenance Incubator Group as an Incubator Group Report. If you wish to make comments regarding this document, please send them to public-xg-prov@w3.org. All feedback is welcome.

Publication of this document by W3C as part of the W3C Incubator Activity indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. Participation in Incubator Groups and publication of Incubator Group Reports at the W3C site are benefits of W3C Membership.

Incubator Groups have as a goal to produce work that can be implemented on a Royalty Free basis, as defined in the W3C Patent Policy. Participants in this Incubator Group have agreed to offer patent licenses according to the W3C Royalty-Free licensing requirements described in Section 5 of the W3C Patent Policy for any portions of the XG Reports produced by this XG that are subsequently incorporated into a W3C Recommendation produced by a Working Group which is chartered to take the XG Report as an input.

1. Introduction

Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact. The provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable. People make trust judgments based on provenance that may or may not be explicitly offered to them. Reasoners in the Semantic Web would benefit from explicit representations of provenance to make informed trust judgments about the information they use. With the arrival of massive amounts of Semantic Web data (e.g., Linked Open Data), information about the origin of that data, i.e., provenance, becomes an important factor in developing new Semantic Web applications. Therefore, a crucial enabler of Semantic Web deployment is the ability to explicitly express provenance in a way that is accessible and understandable to machines and humans.

Provenance is concerned with a very broad range of sources and uses. For example, businesses may exploit provenance in their quality assurance procedures for manufacturing processes. The provenance of a document or an image, in terms of its origins and prior ownership, is crucial to describing publication rights and therefore to determining its legal use. In a scientific context, data is integrated depending on the collection and pre-processing methods used. Further, decisions about the validity or reliability of an experimental result are based on how each analysis step was carried out. In this context, provenance can enable reproducibility. Despite this diversity, there are many common threads underpinning the representation, capture, and use of provenance that need to be better understood to support a new generation of Web applications enabled by provenance.

There are many pockets of research and development that have studied relevant aspects of provenance. The Semantic Web and agents communities have developed algorithms for reasoning about unknown information sources in a distributed network. Logic reasoners can produce justifications for how an answer was derived, and explanations that can help find and fix errors in ontologies. The information retrieval and argumentation communities have investigated ways to amalgamate alternative views and sources of contradictory and complementary information, taking into account its origins. The database and distributed systems communities have looked into the issue of provenance in their respective areas. Provenance has also been studied for workflow systems in e-Science as a means to represent the processes that generate new scientific results. Licensing standards bodies take into account the attribution of information as it is reused in new contexts. However, by and large, these results tend to be shared within their respective communities, without broad dissemination and uptake by the Web community.

Given the increased interest in provenance in the Semantic Web area and in the Web community at large, the W3C established the Provenance Incubator Group as part of the W3C Incubator Activity with a charter to provide a state-of-the-art understanding and develop a roadmap in the area of provenance and possible recommendations for standardization efforts. This document summarizes the findings of the group.

2. What is provenance

Provenance is too broad a term to admit a single, universal definition. Like related terms such as "process", "accountability", "causality" or "identity", its meaning can be argued about indefinitely (philosophers have indeed debated concepts such as identity and causality for thousands of years without converging). Our goal was to develop a working definition reflecting how the W3C Provenance Incubator Group views provenance in the context of the Web.

To develop this view, we first carried out the activities reported in the rest of this document. That is, we did not start out trying to agree on a definition of provenance; rather, the group came to a shared view once we had a common background and context, based on months of discussions.

2.1 A Working Definition of Provenance

Provenance is a very broad topic that has many meanings in different contexts. The W3C Provenance Incubator Group developed a working definition of provenance on the Web:

Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

2.2 Provenance, Metadata, and Trust

Provenance is often conflated with metadata and trust. These terms are related, but they are not the same.

2.2.1 Provenance and Metadata

Metadata is used to represent properties of objects (e.g. an image). Many of those properties have to do with provenance, so the two are often equated. How does metadata relate to provenance?

Descriptive metadata of a resource only becomes part of its provenance when one also specifies its relationship to deriving the resource. For example, a file can have a metadata property that states its size, but this is not typically considered provenance information since it does not relate to how the file was created. The same file can have metadata regarding its creation date, which would be considered provenance-relevant metadata. So even though a lot of metadata potentially has to do with provenance, the two terms are not equivalent. In summary, provenance is often represented as metadata, but not all metadata is necessarily provenance.
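The distinction can be illustrated with a small sketch. In the hedged Python example below (not taken from the report), the property names and the set of provenance-relevant keys are purely illustrative assumptions; a real system would define them through a metadata vocabulary:

```python
# Hypothetical metadata record for a single file. The property names are
# illustrative only, not drawn from any particular metadata vocabulary.
metadata = {
    "size_bytes": 10485760,          # descriptive only: says nothing about origin
    "format": "image/png",           # descriptive only
    "created": "2010-12-08",         # provenance-relevant: relates to derivation
    "creator": "alice@example.org",  # provenance-relevant: attribution
}

# Assumed (for illustration) to be the properties that describe how the
# resource came to be.
PROVENANCE_KEYS = {"created", "creator", "derived_from", "modified_by"}

# The provenance-relevant subset of the metadata.
provenance = {k: v for k, v in metadata.items() if k in PROVENANCE_KEYS}
```

The filter makes the point of the paragraph concrete: provenance here is represented as metadata, but only the properties that relate to the file's derivation qualify.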

2.2.2 Provenance and Trust

Trust is a term with many definitions and uses, but in many cases establishing trust in an object or an entity involves analyzing its origins and authenticity. How does trust relate to provenance?

Trust is often equated with provenance, and it is indeed related but it is not the same. Trust is derived from provenance and from other data quality metrics, and typically is a subjective judgment that depends on context and use. With provenance, the focus is on how to represent, manage, and use information about resource origins, but not on detailed approaches as to how trust may be derived from it. In essence, provenance is a platform for trust algorithms and approaches on the Web.

Authentication is often conflated with provenance because it leads to establishing trust. However, current mechanisms for authentication, such as digital signatures and access control, address the verification of an identity or of access to a resource. Provenance information may be used for authentication purposes (for example, the creator of a document may provide a signature that can be verified by a third party), but it is only one component of authentication.

2.3 Alternative Views on Provenance

There are other definitions in the literature that emphasize different views of provenance. These include: 1) Provenance as Process, 2) Provenance as a Directed Acyclic Graph, 3) Why-Provenance, 4) Where-Provenance, 5) How-Provenance, 6) Provenance as Annotations, and 7) an Event-oriented view. See Chapter 3 of the survey Foundations of Provenance on the Web for a description of these different views of provenance.

A number of published literature surveys summarize provenance research over the last 10 years.

3. Importance of provenance

In order to provide a better understanding of the need for provenance and its importance, the group carried out the activities described in the following subsections.

3.1 Original Use Cases

The group compiled 33 use cases with the aim of developing an understanding of the need for provenance in a variety of situations and application areas. Each use case follows a template that includes some background and current practices, a description of what can be achieved without the use of provenance and then with the use of some provenance technology.

The use cases covered a broad range of motivations for provenance. The topics included eScience, eGovernment, business, manufacturing, cultural heritage, library sciences, engineering design, emergency response, public policy, privacy, linked data, attribution, licensing, trust, metadata management, and others.

The use cases were reviewed by a team of curators and revised by the authors to ensure sufficient details. Six use cases were merged as a result of their curation because they overlapped with other use cases.

3.2 Flagship Scenarios

After the analysis of the use cases and the identification of the dimensions of concern, the group decided that three broad motivating scenarios should be developed to concisely illustrate the need for provenance. These scenarios were designed to be representative of the larger set of use cases and cover a broad range of provenance issues:

  1. News Aggregator Scenario: a news aggregator site that assembles news items from a variety of sources (such as news sites, blogs, and tweets), where provenance records can help with verification, credit, and licensing
  2. Disease Outbreak Scenario: a data integration and analysis activity for studying the spread of a disease, involving public policy and scientific research, where provenance records support combining data from very diverse sources, justification of claims and analytic results, and documentation of data analysis processes for validation and reuse
  3. Business Contract Scenario: an engineering contract scenario, where the compliance of a deliverable with the original contract and the analysis of its design process can be done through provenance records

3.2.1 News Aggregator Scenario

Many web users would like to have mechanisms to automatically determine whether a web document or resource can be used, based on the original source of the content, the licensing information associated with the resource, and any usage restrictions on that content. Furthermore, in cases of mashed-up content it would be useful to be able to ascertain automatically whether or not to trust it by examining the processes that created, processed, and delivered it. To illustrate these issues, we present the following scenario of a fictitious website, BlogAgg, that aggregates news information and opinion from the Web.

BlogAgg aims to provide rich real-time news to its readers automatically. It does this by aggregating information from a wide variety of sources including microblogging websites, news websites, publicly available blogs, and other opinion sites. It is imperative for BlogAgg to present only credible and trusted content on its site to satisfy its customers, attract new business, and avoid legal issues. Importantly, it wants to ensure that the news it aggregates is correctly attributed to the right person so that they may receive credit. Additionally, it wants to present the most attractive content it can find, including images and video. However, unlike other aggregators, it wants to track many different web sites to try to find the most up-to-date news and information.

Unfortunately for BlogAgg, the source of the information is often not apparent from the data that it aggregates from the Web. As a result, it must employ teams of people to check that selected content is both high-quality and can be used legally. The site owner would like this quality control process to be handled automatically. It is also important to check that it can legally use any pictures or information, and credit the appropriate sources.

For example, one day BlogAgg discovers that #panda is a trending topic on Twitter. It finds that a tweet, "#panda being moved from Chicago Zoo to Florida! Stop it from sweating http://bit.ly/ssurl", is being retweeted across many different microblogging sites. BlogAgg wants to find the correct originator of the microblog who first got the word out. It would like to check whether this is a trustworthy source and verify the news story. It would also like to credit the original source on its site, and in these credits it would like to include the email address of the originator if allowed by that person.

Following the shortened URL, BlogAgg discovers a site protesting the move of the panda. BlogAgg wants to determine what organization is responsible for the site so that its name can run next to the snippet of text that BlogAgg runs. In determining the snippet of text to use, BlogAgg needs to determine whether the text originated at the panda protest site or was quoted from another site. Additionally, the site contains an image of a panda that appears as if it is sweating. BlogAgg would like to automatically use a thumbnail version of this image on its site; therefore, it needs to determine if the image license allows this or if that license has expired and is no longer in force. Furthermore, BlogAgg would like to determine if the image was modified, by whom, and if the underlying image can be reused (i.e. whether the license of the image is actually correct). Additionally, it wants to find out whether any modifications were just touch-ups or were significant. Using the various information about the content it has retrieved, BlogAgg creates an aggregated post. For the post, it provides a visual seal showing how much the site trusts the information. By clicking on the seal, the user can inspect how the post was constructed, from what sources, and a description of how the trust rating was derived from those sources.

Note that BlogAgg would want to carry out this same sort of process for thousands to hundreds of thousands of sites a day, so it would want to automate its aggregation process as much as possible. It is important for BlogAgg to be able to detect when this aggregation process does not work, in particular when it cannot determine the origins of the content it uses. Such cases need to be flagged and reported to BlogAgg's operators.

The provenance issues highlighted in this scenario include:

3.2.2 Disease Outbreak Scenario

Many uses of the Web involve the combination of data from diverse sources. Data can only be meaningfully reused if the collection processes are exposed to users. This enables the assessment of the context in which the data was created, its quality and validity, and the appropriate conditions for use. Often data is reused across domains of interest that end up mutually influencing each other. This scenario focuses on the reuse of data across disciplines in both anticipated and unanticipated ways.

Alice is an epidemiologist studying the spread of a new disease called owl flu (a fictitious disease made up for this example), with support from a government grant. Many studies relevant to public policy are funded by the government, with the expectation that the conclusions will provide guidance for public policy decision-makers. These decisions need to be justified by a cost-benefit analysis. In practice, this means that results of studies not only need to be scientifically valid, but the source data, intermediate steps and conclusions all need to be available for other scientists or non-experts to evaluate and reproduce. In the United Kingdom, for example, there are published guidelines that social scientists like Alice are required to follow in reporting their results.

Alice's study involves collecting reports from hospitals and other health services as well as recruiting participants, distributing and collecting surveys, and performing telephone interviews to understand how the disease spreads. The data collected through this process are initially recorded on paper and then transcribed into an electronic form. The paper records are confidential but need to be retained for a set period of time.

Alice will also use data from public sources (Data.gov, Data.gov.uk), or repositories such as the UK Data Archive (data-archive.ac.uk). Currently, a number of e-Social Science archives (such as NeISS, e-Stat and NESSTAR) are being developed for keeping and adding value to such data sets, which Alice may also use. Alice may also seek out "natural experiment" data sources that were originally gathered for some other purpose (such as customer sales records, Tweets on a given day, or free geospatial data).

Once the data are collected and transcribed, Alice processes and interprets the results and writes a report summarizing the conclusions. Processing the data is labor-intensive, and may involve using a combination of many tools, ranging from spreadsheets and generic statistics packages (such as SPSS or Stata) to analysis packages that cater specifically to Alice's research area. Some of the data sources may also provide querying or analysis processing that Alice uses at the time of obtaining the data.

Alice may find many challenges in integrating the data from different sources, since units or semantics of fields are not always documented. When different data sources use different representations, Alice may need to recode or integrate the data by hand or by guessing an appropriate conversion function, subjective choices that may need to be revisited later. To prepare the final report and make it accessible to other scientists and non-experts, Alice may also make use of advanced visualization tools, for example to plot results on maps.

The conclusions of the report may then be incorporated into policy briefing documents used by civil servants or experts on behalf of the government to identify possible policy decisions that will be made by (non-expert) decision makers, to help avoid or respond to future outbreaks of owl flu. This process may involve considering hypotheticals or reevaluating the primary data using different methodologies than those applied originally by Alice. The report and its linked supporting data may also be provided online for reuse by others or archived permanently in order to permit comparing the predictions made by the study with the actual effects of decisions.

Bob is a biologist working to develop diagnostic and chemotherapeutic targets in the human pathogen responsible for owl flu. Bob's experimental data may be combined with data produced in Alice's epidemiological study, and Bob's level of trust in this data will be influenced by the detail and consistency of the provenance information accompanying Alice's data. Bob generates data using different experiment protocols such as expression profiling, proteome analysis, and creation of new strains of pathogens through gene knockout. These experiment datasets are also combined with data from external sources such as biology databases (NCBI Entrez Gene/GEO, TriTrypDB, EBI's ArrayExpress) and information from biomedical literature (PubMed) that have different curation methods and quality associated with them. Biologists need to judge the quality, timeliness and relevance of these sources, particularly when data needs to be integrated or combined.

These sources are used to perform "in silico" experiments via scientific workflows or other programming techniques. These experiments typically involve running several computational steps, for example running machine learning algorithms to cluster gene expression patterns or to classify patients based on genotype and phenotype information. The results need to meet a high standard of evidence so that the biologists can publish them in peer-reviewed journals. Therefore, justification information documenting the process used to derive the results is required. This information can be used to validate the results (by ensuring that certain common errors did not occur), to track down obvious errors or understand surprising results, to understand how different results arising from similar processes and data were obtained, and to reproduce results.

As more data of owl flu outbreaks and treatments become available over time, Alice's epidemiological studies and Bob's research on the behavior of the owl flu virus will need to be updated by repeating the same analytical processes incorporating the new data.

This scenario illustrates several distinctive aspects of provenance:

3.2.3 Business Contract Scenario

In scientific collaborations and in business, individual entities often enter into some form of contract to provide specific services and/or to follow certain procedures as part of the overall effort. Proof that work was performed in conformance with the contract is often required in order to receive payment and/or to settle disputes. Such proof must, for example, document work that was performed on specific items (samples, artifacts), provide a variety of evidence that would preclude various types of fraud, allow a combination of evidence from multiple witnesses, and be robust enough to allow provision of partial information to protect privacy or trade secrets. To illustrate such demands of proof, and the other requirements which stem from having such information, we consider the following use case.

Bob's Website Factory (BWF) is a fictitious company that creates websites which include secured functionality, e.g. for payments. Customers choose a template website structure, enter specifications according to a questionnaire, upload company graphics, and BWF will create an attractive and distinct website. The final stage of purchasing a website involves the customer agreeing to a contract setting out the respective responsibilities of the customer and BWF. The final contract document, including BWF's digital signature, will be automatically sent by email to both parties.

BWF has agreed a contract with a customer, Customers Inc., for the supply of a website. Customers Inc. are not happy with the product they receive, and assert that contractual requirements on quality were not met. Specifically, BWF finds it must defend itself against claims that work to create a site to the specifications was not performed or was performed by someone without adequate training, and that the security of payments on the site is faulty because proper quality control and testing procedures were not followed. Finally, Customers Inc. claim that records were tampered with to remove evidence of the latter improper practices.

BWF wish to defend themselves by providing proof that the contract was executed as agreed. However, they have concerns about what information to include in such a proof. Many websites are not designed from scratch but are based on an existing design, adapted in response to the customer's requests or problems. Also, sometimes parts of the designs of multiple different sites, created for other customers, are combined to make a new website. The need to protect both its own intellectual property and confidential information regarding other customers means that BWF wishes to reveal only the information needed to defend against Customers Inc.'s claims.

There are many kinds of objects relevant to describing BWF's development processes, from source code text in programs to images. To provide adequate proof, BWF may need to include documentation on all of these different objects, making it possible to follow the development of a final design through its multiple stages. The contract number of a site is not enough to unambiguously identify that artifact, as designs move through multiple versions: the proof requires showing that the site was not developed from a version of the design that failed to meet requirements or that used security software known to be faulty.

With regard to showing that there was adequate quality control and testing of the newly designed or redesigned sites, BWF needs to demonstrate that a design was approved in independent checks by at least two experts. In particular, the records should show that the experts were truly independent in their assessments, i.e. they did not both base their judgement on one, possibly erroneous, summarized report of the design.

Finally, Customers Inc. claim that there is a discrepancy in the records, suggesting tampering by BWF to hide their incompetence: the development division apparently claimed to have received word from the experts checking the design that it was approved before the experts themselves claim to have supplied such a report. BWF suspects, and wishes to check, that this is due to a difference in semantics between the reported times, e.g. in one case the time regards the receipt of the report by the developers, in the other it regards the receipt of the developers' acknowledgement of the report by the experts. These records should be shared in a format that both parties understand.

The provenance issues highlighted in this scenario include:

4. Requirements for provenance

As seen in the previous section, provenance touches on many different domains and applications, each with its own requirements for provenance. Here, we present the requirements extracted by the group from the collected use cases, drawing on the several detailed documents the group produced.

4.1 Provenance Dimensions

The group found it useful to organize requirements and use cases in terms of key dimensions that concern provenance. An overview of these dimensions is shown in the following table:

Category: Content
  Object: The artifact that a provenance statement is about.
  Attribution: The sources or entities that contributed to create the artifact in question.
  Process: The activities (or steps) that were carried out to generate or access the artifact at hand.
  Versioning: Records of changes to an artifact over time and what entities and processes were associated with those changes.
  Justification: Documentation recording why and how a particular decision is made.
  Entailment: Explanations showing how facts were derived from other facts.

Category: Management
  Publication: Making provenance available on the Web.
  Access: The ability to find the provenance for a particular artifact.
  Dissemination: Defining how provenance should be distributed and its access controlled.
  Scale: Dealing with large amounts of provenance.

Category: Use
  Understanding: How to enable end-user consumption of provenance.
  Interoperability: Combining provenance produced by multiple different systems.
  Comparison: Comparing artifacts through their provenance.
  Accountability: Using provenance to assign credit or blame.
  Trust: Using provenance to make trust judgments.
  Imperfections: Dealing with imperfections in provenance records.
  Debugging: Using provenance to detect bugs or failures of processes.


4.2 Content Requirements

Content refers to the types of information that would need to be represented in a provenance record, that is, the structures and attributes that would need to be defined in order to capture the kinds of provenance information we envision.

We need to be able to establish the artifact or object that statements of provenance are about, and be able to refer to that object. This object can be a variety of things. On the Web, this will be a web resource, essentially anything that can be identified with a URI, such as web documents, datasets, assertions, or services. Provenance may refer to aspects or portions of an object. For example, objects may be organized in collections, then subgroups selected, then portions of some objects modified, etc.

Attribution is a critical component of provenance. It refers to the sources (i.e., typically any web resource that has an associated URI, such as documents, web sites, or data) or entities (i.e., people, organizations, and other identifiable groups) that contributed to the creation of the artifact in question. In addition, the provenance representation of attribution should also enable us to see the true origin of any statement of attribution to an entity. Technically, this may require that the statement be verified through the use of an authentication system (perhaps with the use of digital signatures). Additionally, one may want to know whether a statement was made by the original entity or was reconstructed and then asserted by a third party. Attribution may also require some form of anonymization, for privacy or identity protection reasons.
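A minimal sketch of verifying an attribution statement follows. This is an illustrative assumption, not a mechanism defined by the report: a real deployment would use public-key digital signatures so that third parties can verify statements, whereas the sketch uses a shared-secret HMAC from the Python standard library simply to keep the example self-contained. The key and the statement text are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared key; a real system would use public-key signatures
# rather than a secret shared between signer and verifier.
SECRET = b"example-shared-key"

def sign_statement(statement: str) -> str:
    # Produce an integrity tag over an attribution statement.
    return hmac.new(SECRET, statement.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_statement(statement: str, tag: str) -> bool:
    # Check that the statement has not been altered since it was signed.
    return hmac.compare_digest(sign_statement(statement), tag)

statement = "document-42 was created by alice@example.org"
tag = sign_statement(statement)
```

Any change to the statement after signing causes verification to fail, which is the property the requirement asks of an authentication mechanism for attribution records.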

Process refers to the activities (or steps) that were carried out to generate the artifact. These activities range from the execution of a computer program that we can explicitly point to, through a physical act that we can only refer to, to an action performed by a person that can only be partially represented. Provenance representations should represent how activities are related to form a process. Provenance information may need to refer to descriptions of the activities, so that it becomes possible to support reasoning about the processes involved in provenance and support descriptive queries about them. Processes can be represented at a very abstract level, focusing only on important aspects, or at a very fine-grained level, including minute details sufficient to enable exact reproduction of the process.
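One way to picture related activities forming a process is the hedged sketch below. The class and field names (Activity, used, generated) and the three-step process are illustrative assumptions, loosely inspired by common provenance vocabularies rather than defined by this report:

```python
from dataclasses import dataclass, field

@dataclass
class Activity:
    # One step of a process; the field names are illustrative only.
    name: str
    used: list = field(default_factory=list)       # input artifacts
    generated: list = field(default_factory=list)  # output artifacts

# A hypothetical three-step process, recorded as related activities:
# artifacts generated by one step are used by the next.
process = [
    Activity("collect-survey", used=[], generated=["raw-data"]),
    Activity("transcribe", used=["raw-data"], generated=["dataset-v1"]),
    Activity("analyze", used=["dataset-v1"], generated=["report"]),
]

def producing_activity(artifact: str, activities: list):
    # A simple descriptive query: which activity generated this artifact?
    for activity in activities:
        if artifact in activity.generated:
            return activity.name
    return None
```

Linking steps through the artifacts they use and generate is what lets such a record answer descriptive queries about the process, at whatever granularity the activities are recorded.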

Dealing with evolution and versioning is a critical requirement for a provenance representation. As an artifact evolves over time, its provenance should be augmented in specific ways that reflect the changes made over prior versions and what entities and processes were associated with those changes. When one has full control over an artifact and its provenance records this may be a simple matter of good record-keeping, but it is a challenge in open distributed environments such as the Web. Consider the representation of provenance when republishing, for example by retweeting, reblogging, or repackaging a document. It should also be possible to represent when a set of changes warrants designating a new version of the object in question. An important aspect to consider is how resource access properties, such as the access time, the server accessed, the party accessing the resource, and the party responsible for the server, impact the version of an artifact.
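The augmentation described above can be sketched as a chain of version records, each linking to its prior version together with the entity associated with the change. The field names, URIs, and change descriptions below are illustrative assumptions, not a vocabulary defined by the report:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Version:
    # One version of an artifact; field names are illustrative only.
    uri: str
    prior: Optional["Version"]  # None for the original version
    changed_by: str             # entity associated with the change
    change: str                 # summary of the change over the prior version

v1 = Version("http://example.org/doc/v1", None, "alice", "initial publication")
v2 = Version("http://example.org/doc/v2", v1, "bob", "republished with commentary")

def lineage(version: Optional[Version]) -> list:
    # Walk back through prior versions, newest first.
    uris = []
    while version is not None:
        uris.append(version.uri)
        version = version.prior
    return uris
```

Each republication (a retweet, a reblog) would add one more record to the chain, so the full history of who changed what remains recoverable from the latest version.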

A particular kind of provenance information is justifications of decisions. The purpose of a justification is to allow those decisions to be discussed and understood. Justifications should be backed up by supporting evidence, which needs to be collected according to a well-defined procedure, ideally in an automatic fashion. It is important to capture both the arguments for and against particular conclusions, as well as the evidence behind particular hypotheses. Technically, this may require systems that are provably correct and cater to long-term preservation. Additionally, if justifications are based on external information, systems must have a mechanism to exchange provenance information.

Inference may be required to derive information from the original provenance records. Some provenance information may be directly asserted by the relevant sources of some data or actors in a process, while other information may be derived from that which was asserted. In general, one fact may entail another, and this is important in the case of provenance data, which inherently describes the past, for which the majority of facts may not now be known. It is also important to capture the assumptions that were used when performing inference. For example, an RDF triple store that supports inference may be required to provide the provenance of the results it returns for a SPARQL query.

4.3 Management Requirements

Provenance management refers to the mechanisms that make provenance available and accessible in a system.

An important issue is the publication of provenance. Provenance information must be made available on the Web. Related issues include how provenance is exposed, discovered, and distributed. A transparent provenance representation language must be chosen and made available so others can refer to it in interpreting the provenance. The publisher of provenance information should be associated with provenance records. Technically, this requires tools to enable such publication.

Once provenance is available, it must be accessible. That is, it must be possible to find it by specifying the artifact of interest. In some cases, it must be possible to determine what the authoritative source of provenance is for some class of entities. Query formulation and execution mechanisms must be defined for provenance representation.

In realistic settings, provenance information will have to be subject to dissemination control. Provenance information may be associated with access policies about what aspects of provenance can be made available given a requestor's credentials. Provenance may have associated use policies about how an artifact can be used given its origins. This may include licensing information stated by the artifact's creators regarding what rights of use are granted for the artifact. Finally, provenance information may be withheld from access for privacy protection or intellectual property protection. Dissemination control must also take into account how security can be integrated.

The scale of provenance information is a major concern, as the size of the provenance records may far exceed that of the artifacts themselves. Despite the presence of large amounts of provenance, efficient access to provenance records must be possible. Tradeoffs must be made with respect to the granularity of the provenance records kept and the actual amount of detail needed by users of provenance.

4.4 Use Requirements

We need to take into account requirements for provenance based on the use of any provenance information that we have recorded. The same provenance records may need to accommodate a variety of uses as well as diverse users/consumers.

Important considerations are how to make provenance information understandable and usable for its users/consumers. Just because the information they need is recorded in the provenance representation does not mean that they would be able to use it for their purposes. An important challenge that we face is how to allow for multiple levels of abstraction in the provenance records of an artifact as well as multiple perspectives or views concerning such provenance. In addition, appropriate presentation and visualization of provenance information is an important consideration, as users will likely prefer something other than a raw set of provenance statements. To achieve understandability and usability, it is important to be able to combine general provenance information with domain-specific information.

Because provenance information may be obtained from heterogeneous systems and different representations and used across multiple applications, interoperability is an important requirement. A query may be issued to retrieve information from provenance records created by different systems that then need to be integrated. At a finer grain, the provenance of a given artifact may be specified by multiple systems and need to be combined. Users may want to also know what sources contributed specific provenance statements, so they can make decisions in case of conflicts.

Another important use of provenance is for comparison of artifacts based on their origins. Two artifacts may seem very different while their provenance may indicate significant commonalities. Conversely, two artifacts may seem alike, and their provenance may reveal important differences. Technically, this would require that the underlying provenance representation be amenable to comparison, for example, through graph comparison.

Provenance data can be used for accountability. Specifically, accountability may mean allowing users to verify that work performed meets a contract decided upon earlier, determining the license that a composite object has due to the licenses of its components, or checking that an account of the past suggested by a provenance record complies with regulations. Accountability requires that users can rely on the provenance record and authenticate its sources.

A very important use of provenance is trust. Provenance information is used to make trust judgments on a given entity. Trust is often based on attribution information, by checking the reputation of the entities involved in the provenance, perhaps based on past reliability ratings, known authorities, or third-party recommendations. Similarly, measures of the relative information quality can be used to choose among competing evidence from diverse sources based on provenance. Finally, users should be able to access and understand how trust assessments are derived from provenance.

Using provenance information may imply handling imperfections. Provenance information may be incomplete in that some information may be missing, or incorrect if there are errors. Provenance information may also be provided with some uncertainty or be of a probabilistic nature. These imperfections may be caused by problems with the recording of the provenance information but they may also arise because the user does not have access to the complete and accurate provenance records even if they exist. This would be the case when provenance is summarized or compressed, in which case many details may be abstracted away or missing, or when the user does not have the right permissions to access some aspects of the provenance records, potentially because of privacy concerns. Finally, provenance may also be subject to deception and be fraudulent, partially or in its entirety.

Another use of provenance is debugging. Users may want to detect failure symptoms in the provenance records and diagnose problems in the process that generated an artifact, whether conducted in a software system or by people. For example, users may want to detect when an error was produced because of the use of a common source of information. Debugging may also involve comparison of records from multiple witnesses, e.g. a workflow engine and a computer operating system, to catch instances where fine-grained records are inconsistent with the coarser-grained view.

4.5 Requirements on RDF for supporting Provenance

The group found that the following requirements should be considered in the context of future extensions of RDF:

Based on these requirements, the group argued for additional desirable capabilities that the current standard RDF model does not offer, including proper identification of RDF statements and an annotation framework permitting a standard approach for linking meta-information like provenance with sets of RDF triples. The group also argued for the development of a common approach to exchange provenance information between systems and publish it on the Web. This requirement would begin to address the remaining three provenance requirements. Later in this report, we discuss the basis for the development and specification of such an approach.

4.6 Summary

The provenance dimensions described here summarize the requirements that users have with respect to provenance. Users need to be able to point to and ask for the provenance of an object. They want to know who is responsible for information and if there is adequate justification for that information. They may want to understand how information was processed to produce a particular query result or artifact and how information evolves over time. They want to use provenance information to ascertain trust, compare artifacts, debug programs, and hold other parties to account. They need easy access to large amounts of provenance information across many different systems, and those systems should take into account their privacy when making provenance information available.

5. State of the art and technology gaps

Prior and ongoing work on provenance is spread out in many areas. The published literature on provenance is vast and rapidly growing, with a reported half of the articles published in the last two years (Moreau 2010).

To obtain a comprehensive understanding of existing work on provenance the group carried out several activities, including:

These materials provide a checkpoint on the state of the art of provenance during the Incubator Group's activities in 2009-2010, and provide a basis for the recommendations and future roadmap of this report.

This section summarizes the most salient results from these activities.

5.1 Provenance Bibliography

The group did not think it necessary to conduct a comprehensive literature review of provenance, since several surveys about provenance and trust have already been published. Instead, the group agreed to make a limited effort to create an annotated bibliography collection that would give concrete entry points to the literature in this area.

Driven by the three flagship scenarios and their requirements, the group constructed a shared bibliography of relevant publications and technologies. We then used tagging to organize the entries according to the three scenarios as well as the provenance dimensions above.

The bibliography collection can be browsed here, organized by author, keyword, type (journal versus conference), and year.

The entire set of references can be downloaded here in BibTeX format.

5.2 Provenance Vocabularies

Among the relevant technologies, a few stood out as covering approximately similar applications and provenance needs using roughly comparable ontologies or data models. These Provenance Vocabularies include:

The group developed concrete mappings between the terms in these different provenance vocabularies. These mappings can help users better understand the similarities and differences between the provenance terminologies, facilitate the development of applications that can utilize the mappings for provenance interoperability, and enable the provenance research community to move towards the adoption of a common provenance terminology.

The mappings use the Open Provenance Model (OPM) as a reference vocabulary. The mappings between the provenance terms are formally encoded using the W3C-recommended Simple Knowledge Organization System (SKOS) vocabulary. The rationale for the mappings was documented in detail.

5.3 Analysis of The State of the Art

We carried out detailed analyses of the requirements and provenance dimensions exhibited by the flagship scenarios, and the extent to which current practice and ongoing research addresses these needs. The State of the Art Report presents these detailed analyses along with references to the bibliography. Here, we summarize the discussion of how current technology is used to address these kinds of problems.

5.3.1 State of the Art Analysis of the News Aggregator Scenario

The News Aggregator Scenario envisions a system that can automatically tell where a piece of content (i.e. object) on the Web comes from and who is responsible for that content after it has been aggregated.

Aggregation Today

Content aggregation is widely used on the Web. Examples of content aggregation for news include sites such as The Huffington Post, Digg, and Google News. Personal aggregation is facilitated by feed technologies (RSS, Atom) and their associated readers (e.g. Google Reader). Newer aggregators can provide a merged view of content, thus hiding some of the provenance of the information to increase visual appeal. This is similar to what is envisioned in the News Aggregator scenario.

Tracking Content

A number of systems have looked at tracking content, in particular quotes, across the Web. Some sites provide ways for tracking distinctive phrases through the blogosphere. Other work has expanded on this to track how information is propagated through the network and thus which blogs and media have the greater influence. Similarly, there has also been work on influence in the microblogging service Twitter, including the difference between the influence of content and the influence of users. Researchers have also studied how the social networks of both Twitter and Digg impact the propagation of information through these networks.

Most of these systems rely on crawls of the Web that are produced uniquely for each application. However, there are also shared tools and services for producing such crawls, for example by periodically crawling and then analyzing blogs.

Need for Explicit Provenance

It is important to note that these systems deduce the provenance of an object (e.g. a piece of text) after the fact from crawled data. However, determining provenance after the fact is often difficult. For example, during the Iranian Green Revolution protests it was extremely difficult to determine the actual origins of tweets about what was happening on the ground. Later in 2009, Twitter launched its own service to explicitly capture the notion of retweeting. There is documented evidence of the difficulty in determining provenance for accountability and trust from crawled documents. There are mechanisms for explicit tracking of the origin of blogs and microblogs. These are used to notify a blog when another blog has linked to it. We note that these systems are isolated to specific technology platforms and do not encompass the whole of the Web.

Licensing

A crucial reason for tracking provenance in the News Aggregator Scenario is the ability to determine if an image (or other content) can be reused. There are sites to track and maintain licensing rights to music or books. Fundamental technology related to this is the digital representation of licenses.

5.3.2 State of the Art Analysis of the Disease Outbreak Scenario

Data provenance

Provenance is central to the requirement that scientific research be reproducible. Paper laboratory notebooks have been in use for hundreds of years as a primary means of recording provenance. In the last few decades, automated mechanisms including laboratory information management systems (LIMS), databases, and electronic notebooks have seen growing use. However, such systems have had significant adoption barriers and continue to have significant limitations, particularly as community-scale infrastructure. Tracking provenance across laboratories, sharing data with provenance, and integrating data from multiple sources remains very labor intensive today. Provenance records for reference information and community data sets are either produced by hand (i.e. by scientists filling in data entry forms on submission to a data archive or repository), produced by ad hoc applications developed specifically for a given kind of data, or not produced at all.

Within curated biological databases, provenance is often recorded manually in the form of human-readable change logs. Some scientists are starting to use wikis to build community databases.

There has been a stream of research on provenance in databases but so far little of it has been transferred into practice, in part because most such proposals are difficult to implement, and often involve making changes to the core database system and query language behavior. Such techniques have not yet matured to the point where they are supported by existing database systems or vendors, which limits their accessibility to non-researchers.

Workflow provenance and e-Science

On the other hand, a number of workflow management systems and Semantic Web systems have been developed and are in active use by communities of scientists. Many of these systems implement some form of provenance tracking internally, and have begun to standardize on a few common representations for the provenance data to enable interchange and integration. Some systems also track the provenance of the workflow templates themselves. Moreover, research on storing and querying such data effectively can more easily be transferred to practice since it relies on an explicit model of data and control flow between computational steps.

Development of electronic notebooks, e-Science systems, and underlying content and provenance middleware have begun to address end-to-end provenance management of physical, computational, and coordination processes. Over the last two decades, numerous open source and commercial electronic notebook systems have been developed. Electronic notebook capabilities range from wiki-style free-text annotation to LIMS/database-style automated documentation of repetitive work using standardized experimental protocols. State-of-the-art systems provide integrated management of provenance across many types of activities. Some systems track provenance from experiment planning through laboratory work to eventual deposition in a reference data repository. Standardization would be a significant benefit to researchers trying to implement such end-to-end capabilities using tools from multiple commercial and open source providers.

Provenance as justification for public policy

Justification information, describing how the data has been processed from raw observations to processed results that support scientific conclusions, is legally required in some scientific settings. Regulatory agencies, e.g. in areas such as food safety and medical research, specifically require research and testing documentation and also define acceptable practice for managing such records electronically. Some researchers in social sciences are developing community-scale techniques directly addressing these problems through computer systems, but most such justification information is created and maintained by user effort.

5.3.3 State of the Art Analysis of the Business Contract Scenario

The Business Contract scenario envisages mechanisms to explain the details of procedures, in this case instances of design and production, such that they can be compared with the normative statements made in contracts, specifications or other legal documents. It further assumes the ability to filter this evidence on the basis of confidentiality. While we are not aware of any system providing exactly the functionality described (standards-based, machine-interpretable records that respect privacy and other concerns), there are many 'business/contract fulfillment' and related services which address particular aspects.

Tracking Design

It is critical to track which products a design decision or production action ultimately affects, so that it is possible to show that what was done in producing an individual product fulfilled obligations and to determine which set of products a decision affected. Product recalls, for example, occur due to particular manufacturing issues. Without knowing the connection between the manufacturing actions and products, many vehicles unaffected by the problem may be recalled, thereby costing a company a great deal more than necessary.

Computer-Aided Design and Requirements Management

Computer-aided design (CAD) systems can include features to capture what is occurring as a design is created. Aside from storing a history of changes to a design, requirements management systems allow the rationale behind design choices to be captured, which can be an essential part of the record in explaining how contractual obligations were intended to be met (particularly where a contract states that some factor must be 'taken into account') and in assessing whether additional design changes will impact a product's ability to meet stated requirements.

The provenance of a design goes beyond just the changes made through a single CAD system, both because one design may be based on another previously developed design for another customer and because the design is just a part of a larger manufacturing and retail process. Ways in which the parts of a process can be interconnected include the use of common formats for describing designs and standardised archiving mechanisms used by all stages of a process.

5.4 Gap Analysis

Given the state of the art in the area of provenance, the group wanted to identify major gaps in technology that stand in the way of making shared provenance on the Web a reality. Since the major issues are illustrated by the three flagship scenarios, this gap analysis was done with the scenarios as a point of reference.

5.4.1 Gap Analysis for the News Aggregator Scenario

Existing provenance solutions only address a small portion of the scenario and are not interlinked or accessible among one another. For each step within the News Aggregator scenario, there are existing technologies or relevant research that could solve that step. For example, one can properly insert licensing information into a photo using a Creative Commons license and the Extensible Metadata Platform. One can track the origin of tweets either through retweets or using extraction technologies within Twitter. However, the problem is that across multiple sites there is no common format and API to access and understand provenance information, whether it is explicitly indicated or implicitly determined. To inquire about retweets or about trackbacks, one needs to use different APIs and understand different formats. Furthermore, there is no (widely deployed) mechanism to point to provenance information on another site. For example, once a tweet is traced back to its entry into Twitter, there is no way to follow where that tweet came from.

System developers rarely include provenance management or publish provenance records. Systems largely do not document the software by which changes were made to data and what those pieces of software did to data. However, there are existing technologies that allow this to be done, for example to document the transformations of images. There are also general provenance models that would allow this to be expressed, but they are not currently widely deployed. There are no widely accepted architectural solutions to managing the scale of the provenance records, as they may be significantly larger than the base information itself, in addition to evolving over time.

While many sites provide for identity and there are several widely deployed standards for identity, there are no existing mechanisms for tying identity to objects or provenance traces. This is a fundamental open problem on the Web, and affects provenance solutions in that provenance records must be attached to the object(s) they describe.

Finally, although there have been proposals for how to use provenance to make trust judgments on open information sources, there are no broadly accepted methodologies to automatically derive trust from provenance records. Another issue that has been largely unaddressed is the incompleteness of provenance records and the potential for errors and inconsistencies in a widely distributed and open setting such as the web.

5.4.2 Gap Analysis for the Disease Outbreak Scenario

This scenario is data-centric, and there is currently a large gap between ideas that have been explored in research and techniques that have been adopted in practice, particularly for provenance in databases. This is an important problem that already imposes a high cost on curators and users, because provenance needs to be added by hand and then interpreted by human effort, rather than being created, maintained, and queried automatically. However, there are several major obstacles to automation, including the heterogeneity of systems that need to communicate with each other to maintain provenance, and the difficulty of implementing provenance-tracking efficiently within classical relational databases. Thus, further research is needed to validate practical techniques before this gap can be addressed.

In the workflow provenance and Semantic Web systems area, provenance techniques are closer to maturity, in part because the technical problems are less daunting: the information is coarser-grained, typically describing larger computation steps rather than individual data items, and it focuses on computations from immutable raw data rather than on versioning and data evolution. There is already some consensus on graph-based data models for exchanging provenance information, and this technology gap can probably be addressed by a focused standardization effort.

Guidance to users about how to publish provenance at different granularities is also very important, for example whether to publish the provenance of an individual object or of a collection of objects. Users need to know how to use the different existing provenance vocabularies to express such different types of provenance, and what the consequences will be, for example, how people will use this information and what information is needed to make it useful.

5.4.3 Gap Analysis for the Business Contract Scenario

Overall, there is a gap in practice: provenance technologies are simply not being used for the described purpose. Even with encouragement, the existing state of the art does not make it a simple task to achieve the scenario, because of a lack of standards, guidelines and tools. Specifically, we can consider the gaps with regards to content, management and use, as modelled above.

There is no established way to express what needs to be expressed regarding the provenance of an engineered product, in such a way that this can be used by both the developer and the customer (and any independent third party). Each needs to first identify and retrieve the provenance of some product, then determine from the provenance where one item is a later version of another, where implementation activity was triggered by a successful quality check, etc. Moreover, even if a commonly interpretable form for this provenance information were agreed, there is no established way to then augment the data so that we can verify the quality of the process it describes, e.g. how to make it comparable against contractual obligations, or to ensure it cannot be tampered with prior to providing such proof. While provenance models, digital signatures, deontic formalisms, versioning schemes and so on provide an adequate basis for solutions, there is a substantial gap in establishing how these should be applied in practice. Without common expectations on how provenance is represented, a producer cannot realistically document its activities in a way which all of its customers can use.

Assuming that common representations are established, there is a gap in what the technology provides for storing and accessing that information. This goes beyond simply using existing databases, due to the form of information conveyed in provenance. For example, if an independent expert documents, in some store local to themselves, that they checked a website design, this must be connected to the rest of the website's manufacturing process to provide a full account. While web technologies could be harnessed for the purpose, there is no established approach to interlinking across parts of provenance data. Moreover, the particular characteristics of provenance data may affect how other data storage requirements are met: how to scale up to the quantities of interlinked information we can expect with automatic provenance recording, how to limit access to only non-confidential information, etc. Finally, the provenance, once recorded, has to be accessible, and there are no existing standards for exposing provenance data on the web.

With regards to use of provenance, the scenario requires various sophisticated functions, and it is not obvious how existing technologies would realise them. For example, the various parties need to be able to understand the engineering processes at different levels of granularity, to resolve apparent conflicts in what appears to be expressed in the provenance data, to acquire indications and evidence of which parts of the record can be relied on, to determine whether the provenance shows that the engineering process conformed to a contract, to check whether two supposedly independent assessments rely on the same, possibly faulty, source, and so on. To encourage mass adoption of provenance technologies, it must be apparent how to achieve these kinds of functions, through guidelines and standardised approaches.

5.5 Summary of Major Technology Gaps in the State of the Art

There are many proposed approaches and technology solutions that are relevant to provenance. Despite this large body of growing work, there are several major technology gaps to realizing the requirements exemplified by the flagship scenarios. Organized by the three major provenance dimensions described above, key technology gaps are:

6. Provenance in Web Architecture

There are no standard mechanisms to publish and access provenance on the Web. It is not sufficient to agree on a language to represent provenance; there must also be agreement on how the provenance can be found for any given resource.

This section discusses possibilities for integrating provenance into the Web architecture. It focuses on how provenance information about Web resources could be exposed as part of the HTTP-based message exchange through which these resources are accessed. A more detailed discussion can be found in the Provenance and Web Architecture report.

6.1 Overview

According to the Web architecture, an HTTP server serves representations of Web resources. These resources might be static documents (i.e. files on the server), but they can also be dynamically generated documents. Each resource has a URL. When an HTTP client issues an HTTP GET request on that URL, the server responds with a representation of the resource referred to by the URL. This representation can be understood as a specific (negotiable) serialization that represents a specific state of the resource. Negotiation can be done along three dimensions: media type, encoding (i.e. charset), and language.

We aim to extend this access pattern in a way that enables clients to access provenance information about the retrieved representation and the represented Web resource.

6.2 Subject of Provenance Information

The provenance information could be about the Web resource. This option might be problematic because Web resources may change over time. Nonetheless, some provenance statements always hold, irrespective of the state of a Web resource. Note that OPM in its current form cannot be used to represent this kind of provenance because OPM focuses on "immutable pieces of state."

The provenance information could be about the representation of the Web resource. Each representation served by an HTTP server may have a unique provenance; this holds in particular for representations that are created on the fly. However, representations of the same state of a Web resource have at least some provenance information in common.

The provenance information could be about the state of the Web resource. While this option ignores the creation of the representation that the HTTP server actually serves, it might be more feasible for passing provenance by reference (see below) because it may avoid establishing a provenance record for each representation served.

6.3 Provenance Passing

We distinguish three patterns for passing provenance via the HTTP-based message exchange: by value, by reference, and a mixed form.

6.3.1 Passing by Value

The idea of this provenance passing pattern is to add all (known) provenance information about the served representation directly to the HTTP response. This information could be embedded in the HTTP response header or in the representation itself as discussed later.


The integration of a mechanism for provenance negotiation (see below) may address the disadvantages of this pattern.

6.3.2 Passing by Reference

The idea of this provenance passing pattern is to understand the provenance record as another Web resource and to add the URI of the corresponding record to the HTTP response. This reference could also be embedded in the HTTP response header or in the representation itself.

Supporting this provenance passing pattern requires a server to mint a new URI for each provenance record. Furthermore, look-up of these provenance records must be supported. In response to such a look-up, the server could either reconstruct the provenance information on the fly or access a provenance store to which it added provenance records generated when the original response was sent. Reconstructing provenance records might be problematic because a reconstructed record may provide a different account than what actually happened, and may vary depending on the reconstruction methods used.
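The store-based variant of this pattern can be sketched as follows. This is an illustrative Python sketch, not an implementation: the `Provenance` response header, the `example.org` URIs, and the in-memory store are all assumptions made for the example.

```python
# Sketch of "passing by reference": mint a URI for the provenance record,
# store the record for later look-up, and return only the reference.
# The "Provenance" header name and all URIs are illustrative assumptions.

import uuid

provenance_store = {}  # maps minted provenance URIs to provenance records

def serve_with_provenance_reference(resource_url, body, record):
    """Build a response that references, rather than embeds, its provenance."""
    # Mint a fresh URI for this representation's provenance record ...
    prov_uri = "http://example.org/provenance/" + str(uuid.uuid4())
    # ... and store the record now, so that a later look-up of the URI can
    # serve it without having to reconstruct what actually happened.
    provenance_store[prov_uri] = dict(record, subject=resource_url)
    return {
        "status": 200,
        "headers": {"Content-Type": "text/html", "Provenance": prov_uri},
        "body": body,
    }

response = serve_with_provenance_reference(
    "http://example.org/report",
    "<html>...</html>",
    {"creator": "http://example.org/alice"},
)
print(response["headers"]["Provenance"])
```

Storing the record at response time avoids the reconstruction problem noted above, at the cost of maintaining a provenance store.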


6.3.3 Passing Partially by Value and by Reference

The idea of this provenance passing pattern is to pass some provenance information by value while providing additional references. The references could be separate from the embedded provenance information and refer either (i) to the complete provenance record or (ii) to a more detailed one. Alternatively, references could be included within the embedded provenance information as URIs that identify common pieces of provenance (e.g. an agent or a common source artifact), assuming that a look-up of these URIs yields additional information.

6.4 Embedding Provenance

As mentioned before, the provenance information or references could be embedded at the HTTP level or in the representation itself. Both of these provenance embedding patterns can be used for each of the provenance passing patterns discussed above.

6.4.1 Embedded at the HTTP Level

It might be possible to pass provenance (by value or by reference) in the header of an HTTP response. This would require the use of an appropriate header field. For large provenance records passed by value, this option might not be feasible due to limits on header size. Furthermore, it is not clear how provenance passed by value could be encoded in a header field.

Alternatively, provenance could be embedded at the HTTP level via a multipart MIME message.
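The multipart alternative can be sketched with Python's standard `email.mime` machinery, which builds MIME structures of the kind HTTP multipart responses use. The choice of `multipart/related`, the Turtle serialization, and the part names are assumptions for illustration.

```python
# Sketch of embedding provenance at the HTTP level as a multipart MIME message:
# one part carries the representation, a second part carries the provenance.
# The subtype "related" and the provenance serialization are assumptions.

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_multipart_response(representation, provenance_turtle):
    message = MIMEMultipart("related")
    # First part: the representation of the Web resource itself.
    message.attach(MIMEText(representation, "html"))
    # Second part: the provenance record, here serialized as Turtle text.
    prov_part = MIMEText(provenance_turtle, "plain")
    prov_part.add_header("Content-Disposition", "inline", name="provenance")
    message.attach(prov_part)
    return message

msg = build_multipart_response(
    "<html>...</html>",
    "<#doc> <http://purl.org/dc/terms/creator> <#alice> .",
)
print(msg.get_content_type())  # multipart/related
```

This sidesteps header-size limits, since the provenance travels in the message body rather than in a header field.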

6.4.2 Embedded in the Representations

Instead of embedding provenance at the HTTP level, it can be embedded (by value or by reference) in the representation itself. This provenance embedding pattern is only possible for representations serialized using a media type with metadata capabilities. How provenance is embedded in a representation depends on its media type.

For representations of RDF graphs, serialized in RDF/XML, Turtle, N3, etc., it is not clear yet how the embedded provenance description can be associated with the embedding representation (i.e. what should be the subject of provenance statements). Once this problem is solved, provenance passed by value can be represented by additional RDF triples using a suitable provenance vocabulary. To pass provenance by reference an appropriate RDF property has to be established (dct:provenance might be an option).
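A by-reference embedding for an RDF representation might look like the following sketch, which emits Turtle with a `dct:provenance` triple as the text suggests. All URIs are illustrative, and the open question of what the subject of the statement should be is sidestepped here by simply naming the document itself.

```python
# Sketch of passing provenance by reference inside a Turtle representation,
# using dct:provenance as suggested above. URIs are illustrative assumptions.

def turtle_with_provenance_reference(doc_uri, prov_uri, triples):
    """Append a dct:provenance reference to an existing Turtle serialization."""
    return (
        "@prefix dct: <http://purl.org/dc/terms/> .\n"
        + triples + "\n"
        + "<" + doc_uri + "> dct:provenance <" + prov_uri + "> .\n"
    )

doc = turtle_with_provenance_reference(
    "http://example.org/data",
    "http://example.org/data/provenance",
    "<http://example.org/data> a <http://example.org/Dataset> .",
)
print(doc)
```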

For representations of web pages, serialized in (X)HTML, provenance passed by reference could be embedded using the link element. This option requires the registration of a suitable link type (e.g. "provenance") to be used in the rel attribute. Passing provenance by value could be done using RDFa; however, as with RDF graphs, it is not yet clear how the embedded provenance description can be associated with the embedding representation (i.e. with the actual HTML serialization that embeds the provenance description).
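The link-element option can be illustrated concretely. The sketch below generates an HTML head with such a link; the "provenance" link type is, as noted above, a proposal that would require registration, and the URIs are made up for the example.

```python
# Sketch of referencing a provenance record from an (X)HTML page via a
# link element, assuming a registered "provenance" link type as proposed.

def html_with_provenance_link(title, prov_uri):
    return (
        "<html>\n"
        "  <head>\n"
        "    <title>" + title + "</title>\n"
        '    <link rel="provenance" href="' + prov_uri + '"/>\n'
        "  </head>\n"
        "  <body>...</body>\n"
        "</html>\n"
    )

page = html_with_provenance_link(
    "Report", "http://example.org/report/provenance"
)
print(page)
```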

6.5 Provenance Negotiation

Different clients / users may have different needs (e.g. provenance described at different levels of detail, or no provenance at all). Hence, the aforementioned options need not be considered exclusive, one-size-fits-all approaches. Instead, it should be possible for clients to negotiate the kind of response they want to receive. This would require the introduction of another dimension of content negotiation. Such an extension of HTTP might be particularly relevant for provenance passed by value.
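Such a negotiation dimension could be sketched as follows. The `Accept-Provenance` request header and the three detail levels are purely hypothetical, chosen here to illustrate the idea; HTTP defines no such header.

```python
# Sketch of "provenance negotiation" as an additional negotiation dimension,
# using a hypothetical Accept-Provenance request header (not part of HTTP).

DETAIL_LEVELS = ("none", "summary", "full")  # illustrative levels of detail

def provenance_preference(headers):
    """Return the client's requested provenance detail, defaulting to none."""
    value = headers.get("Accept-Provenance", "none").strip().lower()
    return value if value in DETAIL_LEVELS else "none"

print(provenance_preference({"Accept-Provenance": "summary"}))  # summary
print(provenance_preference({}))                                # none
```

A server could then decide, per request, whether to embed a full record by value, a summary, or only a reference.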

7. A Roadmap for Provenance on the Web

The group analyzed the major gaps in light of the immediate needs for provenance in the use cases that it considered. This section outlines a roadmap towards addressing those gaps; more details can be found in the report on broad recommendations and priorities.

7.1 Broad Recommendations

The group synthesized eight general recommendations:

  1. a handle (URI) to refer to an object (resource)
  2. a person/entity that the object is attributed to
  3. a processing step done by a person/entity to an object to create a new object

Recommendations #1, #2, and #3 are present in all three flagship scenarios. Recommendations #4, #5, and #6 are more central to the second scenario, while #7 and #8 are more central to the third scenario.

7.2 Short-Term and Long-Term Priorities

The group agreed that recommendations #1, #2, and #3 from the group's effort were the highest priorities and that they should be addressed within a provenance standardization effort. While acknowledging that the priorities of the recommendations depend on the context, those three recommendations are considered to be the most common and to represent the core set of issues to be addressed. Moreover, there already exist a number of systems that provide some form of provenance capability in these areas, demonstrating convergent practices. At the same time, there are a number of developments, such as the growth in popularity of open data initiatives and Linked Data, that create a pressing need for standard representation of provenance and interoperability across systems. Failure to address effective standardization in the recommended areas now could impede effective reuse of open data, and potentially dissipate the existing community momentum and enthusiasm.

Thus, there is both a clear need and a clear starting point for a standardization effort to address these recommendations within the next two years.

The remaining recommendations #4, #5, #6, #7, and #8 represent issues for which there is less understanding or community agreement about the best approaches. While a number of research projects have considered these problems, proposed mechanisms, and deployed solutions in specific projects, there is not yet consensus on how, for example, to represent fine-grained provenance for RDF, XML, or relational data, how to integrate provenance with versioning information, or even how to version and archive past versions of data on the web (an important problem in its own right). Some aspects of these problems involve issues related to the description and categorization of processes, and how to describe mutable resources, that have been debated for thousands of years. It is likely that as basic technology and standards are developed, the need for practical techniques to address these (and other) provenance issues will lead to progress. Nevertheless, there is a case to be made for providing an upgrade path for these elements; standardization should therefore consider them when catering for extensibility.

Therefore, while there may not yet be a clear case for standardization in some areas, it is important that: a core standard be developed with future extension in mind; these problems continue to receive attention in the research community and more broadly; and the question of standardization in these areas be revisited in 3-5 years once the problems and possible solutions are better understood.

7.3 Bootstrapping Provenance Standards

The first recommendation implies that the community would be able to reach agreement on a representation of basic provenance entities. This task could require major effort, but the group found solid grounds for immediate progress in work that has already been done.

The group found that there is a core set of provenance terms that are common across the different provenance terminologies. Despite the diverse motivations and perspectives that led to these terminologies, the group was able to establish mappings among them and successfully demonstrate that there are many common concepts in provenance.

In addition, the group identified that the core OPM terms, along with a set of provenance terms from the nine other provenance terminologies, can serve as a bootstrapping basis for a future common provenance model.

7.4 Potential for Immediate Impact

If consensus on provenance technologies is reached and progress is made on the technical challenges described above, then we can anticipate the provenance of resources being readily available to users in all domains involving the production or exchange of data. This could in turn change common practice in many domains, as knowing the origins of data may allow users to interpret it more accurately and feel more able to rely on it.

The group found evidence of the widespread demand for knowing the provenance of data in a broad range of use cases. These domains spanned scientific research (in biology, physics, geology etc.), medicine, emergency services, museum collections, legal disputes, government policy, industrial engineering, environmental data collection, as well as in everyday usage of public web resources and personal data. Each domain will have specific issues regarding provenance, but we found questions and technical requirements to be fairly generalisable, e.g. who said this and when, why did I end up with results which look like this, can I rely on this data, etc.

If both common standards for provenance are adopted and data regarding provenance is routinely automatically recorded, then a side-effect of provenance could be to connect data across application domains. For example, consider the following scenario. Environmental sensor data is gathered and made public, a social science study is conducted investigating social effects of environmental changes, and conclusions from this finally make their way into an online newspaper article. While each process is conducted independently with domain-specific tools, due to the automatic recording of provenance data, a connection is apparent between the newspaper article and the environmental sensors on which its analysis ultimately depends.

8. Recommendations

The group articulated the requirements for provenance in a variety of contexts on the Web. It has identified technology gaps in terms of standard mechanisms to represent and access provenance to address those needs. The group formulated a roadmap for provenance on the Web that includes short term and long term priorities. The group also agreed to concrete starting points to ensure rapid progress towards a standardization effort, as the potential impact of provenance standards on the Web is very broad and immediate.

The group recommends the formation of a Working Group to address core concepts of provenance with the charter below.

While the proposed charter leaves out aspects of provenance where there is not yet a clear case for standardization, it is important that the core standard be developed with future extension in mind as those aspects continue to receive attention in the research community and can be revisited in 3-5 years when the issues and possible solutions are better understood.

8.1 Proposed Charter for a Provenance Interchange Working Group

Introduction/Mission

The mission of the proposed Provenance Interchange Working Group is to support the widespread publication and use of the provenance of Web documents, data, and resources. It will define a language for exchanging provenance, and publish concrete specifications of the language using existing W3C standards.

1. Background

The W3C Incubator Group on Provenance identified rapidly growing needs for provenance in social, scientific, industry, and government contexts, involving data and information integration across the Web. Provenance is unique in that it inherently draws on distributed information. Therefore, collecting provenance and making sense of it requires consulting different heterogeneous systems.

Over time, multiple techniques to capture and represent various forms of provenance have been devised, sometimes known under the names of lineage, pedigree, proof, or traceability. As noted in the group's state-of-the-art report, the lack of a standard model is a significant impediment to realizing such applications. This matters because provenance is key to establishing trust in documents, data, and resources. However, the group's work also indicates that many provenance models exist with significantly different expressivity, fundamentally different assumptions about the system they are embedded in, and radically different performance impact. The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today.

A pragmatic approach is to consider a core provenance language and extension mechanisms that allow any provenance model to be translated into such a lingua franca and exchanged between systems. Heterogeneous systems can then export their provenance into such a core language, and applications that need to make sense of provenance in heterogeneous systems can then import it and reason over it.
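The export step of this lingua-franca approach can be sketched as a term mapping. The native vocabulary, the core term names, and the mapping below are all invented for illustration; they are not taken from OPM or any other model.

```python
# Sketch of the "lingua franca" idea: a system exports its native provenance
# record by mapping local terms onto assumed core interchange terms.
# Both vocabularies here are illustrative, not from any actual standard.

CORE_TERM_MAP = {
    "made_by": "Agent",
    "ran": "ProcessExecution",
    "produced": "Generation",
    "read": "Use",
}

def export_to_core(native_record):
    """Translate a native provenance record into the core vocabulary,
    passing through any terms without a known mapping unchanged."""
    return {CORE_TERM_MAP.get(key, key): value
            for key, value in native_record.items()}

core = export_to_core(
    {"made_by": "alice", "ran": "cleanup.py", "produced": "out.csv"}
)
print(core)
```

An importing application then reasons only over the core terms, regardless of which system produced the record.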


2. Scope

The Provenance Interchange Working Group has the following objectives:

3. Deliverables and Schedule


3.1 Deliverables

By "language", we refer to the provenance interchange language to be defined, and the inferences allowed over it. The Working Group will select an appropriate name for the language.



Comments about the deliverables:


3.2 Milestones

Reports will undergo the W3C development process: Working Draft (WD), Working Draft in Last Call (LC), Candidate Recommendation (CR), Proposed Recommendation (PR) and Recommendation (Rec).

Specification   FPWD  LC    CR    PR    Rec
D1              T+6   T+9   T+12  T+15  T+18
D2              T+6   T+9   T+12  T+15  T+18
D3 (Optional)   T+12  T+18  n/a   n/a   n/a
D4              T+9   T+15  n/a   n/a   n/a
D5              T+9   T+12  n/a   n/a   n/a
D6              T+15  T+18  n/a   n/a   n/a
D7              T+12  T+18  n/a   n/a   n/a

4. Provenance Concepts

The proposed Working Group will leverage the activities of the W3C Provenance Incubator Group, its understanding of the state-of-the-art, extensive requirements capture, use cases and flagship scenarios, and mapping of provenance vocabularies.

Drawing on existing vocabularies/ontologies (namely: Changeset Vocabulary, Dublin Core, Open Provenance Model (OPM), PREMIS, Proof Markup Language (PML), Provenance Vocabulary, Provenir ontology, SWAN Provenance Ontology, Semantic Web Publishing Vocabulary, WOT Schema), a set of concepts has been identified to constitute the core of a standard provenance interchange language. The number of concepts is intentionally limited, so as to ensure a cohesive and tractable core. Other concepts can be relevant to provenance, but it is anticipated that they would be defined by means of the envisaged extension mechanism of the provenance interchange language.

In the following list, the names appearing as titles are used for intuition. Concepts with similar intuition in existing vocabularies are provided. Examples from one of the three flagship scenarios from the W3C Provenance Incubator Group are also shown.

  1. Resource: a resource may be static or dynamic (mutable or immutable); the proposed Working Group can decide whether to subclass this concept to make the distinction.
  2. Process execution: refers to execution of a computation, workflow, program, service, etc. Does not refer to a query.
  3. Recipe link: we will not define what the recipe is; what we mean here is just a standard way to refer to a recipe (a pointer). The development of standard ways to describe these recipes is out of scope.
  4. Agent: entity (human or otherwise) involved in the process execution. An agent can be the creator or a contributor.
  5. Role
  6. Location: a link to a description of location. Defining how the spatial information will be represented is out of scope; it will point to an existing ontology.
  7. Derivation
  8. Generation
  9. Use
  10. Ordering of Processes
  11. Version
  12. Participation
  13. Control: a subclass of Participation. Related to this is a notion of "responsibility", i.e. an entity that stands behind the artifact that was produced (Alice controls the process, but the organization that she worked for is responsible, so that even after she leaves, the organization is still responsible). It may be a useful shortcut to add.
  14. Provenance Container
  15. Views or Accounts
  16. Time
  17. Collections: Should be a lightweight notion, mainly focused on "part of". Might be treated as a resource ultimately.
Acknowledgements:
Chris Bizer, James Cheney, Sam Coppens, Kai Eckert, Andre Freitas, Irini Fundulaki, Daniel Garijo, Yolanda Gil, Jose Manuel Gomez, Paul Groth, Olaf Hartig, Deborah McGuinness, Simon Miles, Paolo Missier, Luc Moreau, James Myers, Michael Panzer, Paulo Pinheiro da Silva, Christine Runnegar, Satya Sahoo, Yogesh Simmhan, Raphaël Troncy and Jun Zhao.
We thank all the outside participants who consulted with the group over its lifetime.