Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
RDF defines the concept of RDF datasets, a structure composed of a distinguished RDF graph and zero or more named graphs, each being a pair comprising an IRI or blank node and an RDF graph. While RDF graphs have a formal model-theoretic semantics that determines what arrangements of the world make an RDF graph true, no agreed formal semantics exists for RDF datasets. This document presents some issues to be addressed when defining a formal semantics for datasets, as they have been discussed in the RDF 1.1 Working Group, and specifies several semantics in terms of model theory, each corresponding to a certain design choice for RDF datasets.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the RDF Working Group as a First Public Working Draft. The group does not expect this document to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-rdf-comments@w3.org. All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The Resource Description Framework (RDF) version 1.1 defines the concept of RDF datasets, a notion first introduced by the SPARQL specification [RDF-SPARQL-QUERY]. An RDF dataset is defined as a collection of RDF graphs in which all but one are named graphs, each associated with an IRI or blank node (the graph name), and the remaining one is the unnamed default graph [RDF11-CONCEPTS]. Given that RDF is a data model equipped with a formal semantics [RDF11-MT], it is natural to try to define what the semantics of datasets should be.
The RDF Working Group was chartered to provide such semantics in its recommendation:
Required features
- Standardize a model and semantics for multiple graphs and graphs stores [...]
However, discussions within the Working Group revealed that very different assumptions currently exist among practitioners, who are using RDF datasets with their own intuition of the meaning of datasets. Defining the semantics of RDF datasets requires an understanding of the two following issues:
Possible choices for the denotation of graph names are:
ex:hasGraph with the graph inside the pair;
Even with an intuitive understanding of what the truth of an RDF dataset should be, the precise model-theoretic formalization can be subject to many variations.
Possible choices for the meaning of the triples in the named graphs include:
Depending on the assumptions taken with respect to these two issues, the formalization of the semantics of RDF datasets can vary very much.
In this Working Group Note, we examine the propositions that were given by Working Group members in the course of a year and a half of debate.
We first take a look at existing specifications that could shed light on how the semantics of datasets should be defined. There are three important documents that closely relate to the issue:
As described in RDF 1.1 Semantics, a set of RDF graphs can be interpreted as either the union of the graphs or as their merge ([RDF11-MT], Technical note, Section 5.2).
So, a first intuition could be that an RDF dataset, being presented as a collection of graphs, should mean exactly what the set of its named graphs and default graph means. However, this completely leaves out the potential meaning of graph names, which could be valuable indicators for the truth of a dataset.
Formally, the semantics of RDF defines a notion of interpretation for a set of triples (i.e., an RDF graph), which then can extend to a set of RDF graphs. A dataset is neither a set of triples nor a set of RDF graphs. It is a set of pairs (name,graph) together with a distinguished RDF graph and the RDF semantics does not itself specify a meaning for these pairs.
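The structural point made above can be illustrated with a small sketch. The Python below models a dataset as a default graph plus a mapping from graph names to graphs. The names `Dataset`, `default` and `named` are illustrative assumptions and are not taken from any specification.

```python
from dataclasses import dataclass, field

# A triple is a (subject, predicate, object) tuple of strings;
# an RDF graph is modelled as a frozen set of triples.
Triple = tuple[str, str, str]
Graph = frozenset[Triple]


@dataclass(frozen=True)
class Dataset:
    """Hypothetical model: one distinguished default graph plus (name, graph) pairs."""
    default: Graph = frozenset()
    # Graph names are IRIs or blank node labels; each name maps to one graph.
    named: dict[str, Graph] = field(default_factory=dict)


ds = Dataset(
    default=frozenset({(":s", ":p", ":o")}),
    named={":g1": frozenset({(":a", ":b", ":c")})},
)

# The dataset is neither a set of triples nor a set of graphs:
# the pairing of names with graphs is part of its structure.
pairs = set(ds.named.items())
```

RDF semantics assigns truth values to `ds.default` and to each graph in `ds.named`, but says nothing by itself about the elements of `pairs`.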
Conceptually, it is problematic since one of the reasons for separating triples into distinct (named) graphs is to avoid propagating the knowledge of one graph to the entire triple base. Sometimes, contradicting graphs need to coexist in a store. Sometimes named graphs are not endorsed by the system as a whole, they are merely quoted.
In Carroll et al. [CARROLL-05], a named graph is defined as a pair comprising an IRI and an RDF graph. The notion of RDF interpretation is extended to named graphs by saying that the graph IRI in the pair must denote the pair itself. This non-ambiguously answers the question of what the graph IRI denotes. This can then be used to define proper dataset semantics, as shown in Section 3.3. Note that it is deliberate that the graph IRI is forced to denote the pair rather than the RDF graph. This is done in order to differentiate two occurrences of the same RDF graph that could have been published at different times, or authored by different people. A simple reference to the RDF graph would simply identify a mathematical set, which is the same wherever it occurs.
RDF 1.1 borrows the notion of RDF dataset from the SPARQL specification [RDF-SPARQL-QUERY], with the notable difference that RDF 1.1 allows graph names to be blank nodes. So, in order to understand the semantics of datasets, it is worthwhile looking at how SPARQL uses datasets. SPARQL defines what the answers to queries posed against a dataset are, but it never defines the notions that are key to a model-theoretic formal semantics: it presents neither interpretations nor entailment. Still, it is worth noticing that an ASK query that only contains a basic graph pattern without variables yields the same result as asking whether the RDF graph in the query is entailed by the default graph. Based on this observation, one may extrapolate that an ASK query containing no variables and only GRAPH graph patterns would yield the same result as dataset entailment.
This can be used as a guide for formalizing the semantics of datasets, as can be seen in Section 3.7.
This section presents the different options proposed, together with their formal definitions. We include each time a discussion of the merits of the choice, and some properties.
Each subsection here describes the option informally, before presenting the formal definitions. The formal part assumes familiarity with the definitions given in RDF Semantics. We rely heavily on the notions of interpretation and entailment, which are key in model theory.
All proposed options share some commonalities:
The first item above reflects the indication given in [RDF11-MT] (Section "RDF Datasets") with respect to dataset semantics: “a dataset SHOULD be understood to have at least the same content as its default graph”.
The dependency on RDF semantics is such that most of the dataset semantics below reuse RDF semantics as a black box. More precisely, it is not necessary to be specific about how truth of RDF graphs is defined as long as there is a notion of interpretation that determines the truth of a set of triples. In fact, RDF Semantics does not define a single formal semantics, but multiple ones, depending on what standard vocabularies are endorsed by an application (such as the RDF, RDFS, XSD vocabularies). Consequently, we parameterize most of the definitions below with an unspecified entailment regime E. RDF 1.1 defines the following entailment regimes: simple entailment, D-entailment, RDF-entailment, RDFS-entailment. Additionally, OWL defines two other entailment regimes, based on the OWL 2 direct semantics [OWL2-DIRECT-SEMANTICS] and the OWL 2 RDF-based semantics [OWL2-RDF-BASED-SEMANTICS].
For an entailment regime E, we will say E-interpretation, E-entailment, E-equivalence, E-consistency to describe the notions of interpretations, entailment, equivalence and consistency associated with the regime E. Similarly, we will use the terms dataset-interpretation, dataset-entailment, dataset-equivalence, dataset-consistency for the corresponding notions in dataset semantics.
The simplest semantics defines an interpretation of a dataset as an RDF interpretation of the default graph. The dataset is true, according to the interpretation, if and only if the default graph is true. In this case, any datasets that have equivalent default graphs are dataset-equivalent.
This means that the named graphs in a dataset are irrelevant to determining the truth of a dataset. Therefore, arbitrary modifications of the named graphs in a graph store always yield a logically equivalent dataset, according to this semantics.
Considering an entailment regime E, a dataset-interpretation with respect to E is an E-interpretation. Given an interpretation I and a dataset D having default graph G and named graphs NG, I(D) is true if and only if I(G) is true.
Consider the following dataset:
{ :s :p :o . } :g1 { :a :b :c }
does not dataset-entail:
{ :s :p :o . :a :b :c .}
but dataset-entails:
{} # empty default graph
:g2 { :x :y :z }
Since graph names are not particularly constrained, one can use them in triples, for instance:
{ :g1 :author :Bob . :g1 :created "2013-09-17"^^xsd:date .} :g1 { :a :b :c }
However, because named graphs are ignored, this dataset dataset-entails:
{ :g1 :author :Bob . :g1 :created "2013-09-17"^^xsd:date .} :g1 { :x :y :z }
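The examples above can be checked mechanically with a minimal sketch. It assumes simple entailment restricted to ground graphs (no blank nodes), for which entailment reduces to set inclusion. The function names and the `(default_graph, named_graphs)` representation are illustrative assumptions.

```python
def ground_entails(g, h):
    """Simple entailment between ground graphs: every triple of h occurs in g."""
    return h <= g


def dataset_entails_default_only(d1, d2):
    """Dataset-entailment when only the default graph carries meaning.

    Each dataset is a (default_graph, named_graphs) pair; named graphs are ignored.
    """
    default1, _named1 = d1
    default2, _named2 = d2
    return ground_entails(default1, default2)


d = ({(":s", ":p", ":o")}, {":g1": {(":a", ":b", ":c")}})

# Triples that occur only in a named graph are not entailed.
merged_default = ({(":s", ":p", ":o"), (":a", ":b", ":c")}, {})
# An empty default graph is entailed, whatever the named graphs contain.
other_named = (set(), {":g2": {(":x", ":y", ":z")}})

print(dataset_entails_default_only(d, merged_default))  # False
print(dataset_entails_default_only(d, other_named))     # True
```

Handling blank nodes or richer entailment regimes would require replacing `ground_entails` with a proper E-entailment check.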
Assuming this semantics is convenient since it merely ignores named graphs in a dataset for any reasoning task. As a result, datasets can be simply treated as regular RDF graphs by extracting the default graph. Named graphs can still be used to preserve useful information, but they bear no more meaning than a comment in program source code.
The obvious disadvantage is that, since named graphs are completely disregarded in terms of meaning, there is no guarantee that any information intended to be conveyed by the named graphs is preserved by inference.
It is sometimes assumed that named graphs are simply a convenient way of sorting the triples but all the triples participate in a united knowledge base that takes the place of the default graph. More precisely, a dataset is considered to be true if all the triples in all the graphs, named or default, are true together. This description allows two formalizations of dataset semantics, depending on how blank nodes spanning several named graphs are treated. Indeed, if one blank node appears in several named graphs, it may be intentional, to indicate the existence of only one thing across the graphs, in which case union is appropriate. If the sharing of blank nodes is incidental, merge is also an applicable solution.
We define a dataset-interpretation with respect to an entailment regime E as an E-interpretation. Given a dataset-interpretation I and a dataset D having default graph G and named graphs NG, I(D) is true if and only if I(G) is true and for all ng in NG, I(ng) is true.
Equivalently, I(D) is true if and only if I(H) is true, where H is the merge of all the RDF graphs, named or default, appearing in D.
We define a dataset-interpretation with respect to an entailment regime E as an E-interpretation. Given a dataset-interpretation I and a dataset D having default graph G and named graphs NG, I(D) is true if and only if I(H) is true where H is the union of all the RDF graphs, named or default, appearing in D.
An alternative presentation of this variant is the following: define I+A to be an extended interpretation which is like I except that it uses A to give the interpretation of blank nodes; define blank(D) to be the set of blank nodes in D. Then I(D) is true if and only if [I+A](D) is true for some mapping A from blank(D) to the set of resources in I.
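The difference between the two variants lies only in how blank nodes shared between graphs are treated. The following sketch illustrates this under the assumption that terms are plain strings and blank nodes carry a `_:` prefix. The function names and the fresh-label scheme `_:m0`, `_:m1`, ... are hypothetical.

```python
from itertools import count


def is_blank(term):
    """Blank nodes are written with the usual "_:" label prefix."""
    return term.startswith("_:")


def union(graphs):
    """Set union: a blank node label shared by two graphs stays a single node."""
    return set().union(*graphs)


def merge(graphs):
    """Merge: blank nodes are renamed apart, graph by graph, before the union."""
    fresh = count()
    merged = set()
    for graph in graphs:
        renaming = {}  # per-graph map from old blank node label to fresh label
        for triple in graph:
            renamed = []
            for term in triple:
                if is_blank(term):
                    if term not in renaming:
                        renaming[term] = f"_:m{next(fresh)}"
                    term = renaming[term]
                renamed.append(term)
            merged.add(tuple(renamed))
    return merged


def blank_nodes(graph):
    return {term for triple in graph for term in triple if is_blank(term)}


g1 = {("_:x", ":name", '"Alice"')}
g2 = {("_:x", ":age", '"30"')}

print(len(blank_nodes(union([g1, g2]))))  # 1: one thing has both a name and an age
print(len(blank_nodes(merge([g1, g2]))))  # 2: the two graphs talk about possibly different things
```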
Consider the following dataset:
{ :s :p :o . } # default graph
:g1 { :a :b :c }
dataset-entails:
{ :s :p :o . :a :b :c .}
If the entailment regime E is RDFS with the recognized datatype xsd:integer, then the following RDF dataset is RDFS-dataset-inconsistent:
{ } # empty default graph
:g1 { :age rdfs:range xsd:integer . }
:g2 { :bob :age "twenty" . }
This semantics allows one to partition the triples of an RDF graph into multiple named graphs for easier data management, yet retaining the meaning of the overall RDF graph. Note that this choice of semantics does not impact the way graph names are interpreted: it is possible to further constrain the graph names to denote the RDF graph associated with it, or other possible constraints. The possible interpretations of graph names, and their consequences, are presented in the next sections.
This semantics is implicitly assumed by existing graph store implementations. The OWLIM RDF database management system implements reasoning techniques over RDF datasets that materialize inferred statements into the database [[citation needed]]. This is done by taking the union of the graphs in the named graphs, applying standard entailment regimes over this RDF graph and putting the inferred triples into the default graph.
This dataset semantics makes all triples in the named graphs contribute to a global knowledge base, thus making the whole dataset inconsistent whenever two graphs are mutually contradictory. In situations where named graphs are used to store RDF graphs obtained from various sources on the open Web, inconsistencies or contradictions can easily occur. Notably, Web crawlers of search engines harvest all RDF documents, and it is well known that the Web contains documents serializing inconsistent RDF graphs as well as documents that are mutually contradictory yet consistent on their own. In this case, this semantics can be seen as problematic.
It is common to use the graph name as a way to identify the RDF graph inside the named graphs, or rather, to identify a particular occurrence of the graph. This allows one to describe the graph or the graph source in triples. For instance, one may want to say who the creator of a particular occurrence of a graph is. Assuming this semantics for graph names amounts to saying that each named graph pair is an assertion that sets the referent of the graph name to be the associated graph or named graph pair.
The following paragraph refers to speech and asserting, while dataset semantics never refers to such notions. This may be confusing.
Intuitively, this semantics can be seen as quoting the RDF graphs inside the named graphs. In this sense, :alice {:bob :is :smart} has to be understood as Alice said: “Bob is smart”, which does not entail Alice said: “Bob is intelligent”, because Alice did not use the word “intelligent”, even though “smart” and “intelligent” can be understood as equivalent.
We reuse the notation presented in [RDF11-MT]:
Suppose I is an interpretation and A is a mapping from a set of blank nodes to the universe IR of I. Define the mapping [I+A] to be I on names, and A on blank nodes on the set: [I+A](x)=I(x) when x is a name and [I+A](x)=A(x) when x is a blank node; and extend this mapping to triples and RDF graphs using the rules given above for ground graphs.
A dataset-interpretation I with respect to an entailment regime E is an E-interpretation extended to named graphs and datasets as follows:
Consider the following dataset:
{ } # empty default graph
:g1 { :a :b :c }
:g2 { :x :y :z }
dataset-entails:
{ } _:b { :a :b :c } :g2 { :x :y :z }
but does not dataset-entail:
{ } :g1 { [] :b :c } :g2 { :x :y :z }
nor:
{ } :g1 { }
If the entailment regime E is RDFS with the recognized datatype xsd:integer, then the following RDF dataset is RDFS-dataset-inconsistent:
{ :age rdfs:range xsd:integer .
  :me :age :g1 . } # default graph
:g1 { :s :p :o }
The graph name can be used in triples to attach metadata (here :hasNextVersion is a custom term that does not enforce a formal constraint, so it is up to the implementation to decide how to treat it):
{ :g1 :published "2013-08-26"^^xsd:date . :g1 :hasNextVersion :g2 .} :g1 { :s1 :p1 :o1 . :s2 :p2 :o2 } :g2 { :s1 :p1 :o1 }
There are important implications with this semantics. In this case, a named graph pair can only entail itself or a graph that is structurally equivalent if the graph name is a blank node. Graph names have to be handled almost like literals. Unlike other IRIs or blank nodes, their denotation is strictly fixed, like literals are. This means that graph IRIs may possibly clash with constraints on datatypes, as in the example above.
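The behaviour described above, and the entailment examples given earlier for this semantics, can be approximated by the following sketch. It covers only the named-graph part of dataset-entailment, compares graphs by plain set equality (so it does not handle graph isomorphism up to blank node renaming), and assumes that blank graph names do not also occur inside triples. The function name and representation are hypothetical.

```python
def is_blank(name):
    return name.startswith("_:")


def quoting_entails(named1, named2):
    """Named-graph part of dataset-entailment when graph names denote their pairs.

    named1 and named2 map graph names to frozensets of triples.
    """
    for name, graph in named2.items():
        if is_blank(name):
            # A blank graph name only asserts that some pair with this exact graph exists.
            if graph not in named1.values():
                return False
        elif named1.get(name) != graph:
            # An IRI graph name is pinned to its pair, so the graph must match exactly.
            return False
    return True


g1 = frozenset({(":a", ":b", ":c")})
g2 = frozenset({(":x", ":y", ":z")})
source = {":g1": g1, ":g2": g2}

print(quoting_entails(source, {"_:b": g1, ":g2": g2}))                               # True
print(quoting_entails(source, {":g1": frozenset({("_:n", ":b", ":c")}), ":g2": g2}))  # False
print(quoting_entails(source, {":g1": frozenset()}))                                  # False
```

Note that even a weaker graph under the same IRI name is rejected, which is what makes graph names behave almost like literals here.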
A variant of this dataset semantics imposes that the graph name denotes the RDF graph itself, rather than the pair. This means that two occurrences of the same graph in different named graph pairs actually identify the same thing. Thus, the graph names associated with the same RDF graphs are interchangeable in any triple in this case.
Named graphs in RDF datasets are sometimes used to delimit a context in which the triples of the named graphs are true. The truth of the named graph pair then follows from the truth of these triples according to the graph semantics. An example of such a situation occurs when one wants to keep track of the evolution of facts with time. Another example is when one wants to allow different viewpoints to be expressed and reasoned with, without creating a conflict or inconsistency. By having inferences done at the named graph level, one can, for instance, prevent triples coming from untrusted parties from influencing trusted knowledge. Yet it does not disallow reasoning with and drawing conclusions from untrusted information.
Intuitively, this semantics can be seen as interpreting the RDF graphs inside the named graphs. In this sense, :alice {:bob :is :smart} has to be understood as “Alice said that Bob is smart”, which entails “Alice said that Bob is intelligent”, because it is what Alice means, whether she used the term “smart”, “intelligent”, or “bright”. Neither sentence implies that Alice used these actual words.
This does not take into account blank nodes as graph names.
There are several possible formalizations of this. One way is to interpret the graph name as denoting a graph, and a named graph pair is true if this graph entails the graph inside the pair. In this case, a dataset-interpretation with respect to an entailment regime E is an E-interpretation such that:
Consider the following dataset:
{ } # empty default graph
:g1 { :YoutubeEmployee rdfs:subClassOf :GoogleEmployee .
      :steveChen rdf:type :YoutubeEmployee . }
:g2 { :chadHurley rdf:type :YoutubeEmployee }
RDFS-dataset-entails:
{ } :g1 { :steveChen rdf:type :GoogleEmployee }
but does not RDFS-dataset-entail:
{ } :g2 { :chadHurley rdf:type :GoogleEmployee }
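The two entailment results above can be reproduced with a small sketch in which inference is carried out separately inside each named graph. It implements only two RDFS rules (rdfs9 and rdfs11, which deal with rdfs:subClassOf), handles ground graphs with IRI graph names only, and uses prefixed names as plain strings. All function names are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs_subclass_closure(graph):
    """Close a graph under rdfs11 (subClassOf transitivity) and rdfs9 (type inheritance)."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(closed):
            for (s2, p2, o2) in list(closed):
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == TYPE:
                    new = (s, TYPE, o2)        # rdfs9
                elif p == SUBCLASS:
                    new = (s, SUBCLASS, o2)    # rdfs11
                else:
                    continue
                if new not in closed:
                    closed.add(new)
                    changed = True
    return closed


def context_entails(named1, named2):
    """Each named graph of named2 must follow from the graph with the same name in named1."""
    return all(
        name in named1 and graph <= rdfs_subclass_closure(named1[name])
        for name, graph in named2.items()
    )


dataset = {
    ":g1": {(":YoutubeEmployee", SUBCLASS, ":GoogleEmployee"),
            (":steveChen", TYPE, ":YoutubeEmployee")},
    ":g2": {(":chadHurley", TYPE, ":YoutubeEmployee")},
}

print(context_entails(dataset, {":g1": {(":steveChen", TYPE, ":GoogleEmployee")}}))   # True
print(context_entails(dataset, {":g2": {(":chadHurley", TYPE, ":GoogleEmployee")}}))  # False
```

The subclass axiom stated in :g1 is not visible when reasoning inside :g2, which is why the second check fails.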
Graph names used in triples that express metadata do not necessarily generate inconsistency:
{ :g1 :validAfter "2006"^^xsd:gYear .
  :g1 :published "2013-08-26"^^xsd:date .
  :g2 :validAt "2005"^^xsd:gYear . }
:g1 { :YoutubeEmployee rdfs:subClassOf :GoogleEmployee .
      :steveChen rdf:type :YoutubeEmployee . }
:g2 { :chadHurley rdf:type :YoutubeEmployee }
(here, :validAfter and :validAt are custom terms that do not enforce a formal constraint, but may be used internally for, e.g., checking the temporal validity of triples in the named graph).
This semantics assumes that the truth of named graphs is preserved when replacing the RDF graphs inside named graphs with equivalent graphs. This means in particular, that one can normalize literals and still preserve the truth of a named graph. This means too that standard RDF inferences that can be drawn from the RDF graphs inside named graphs can be added to the graph associated with the graph name without impacting the truth of the RDF dataset.
While this semantics does not guarantee that reasoning with RDF datasets will preserve the exact triples of an original dataset, it is semantically valid to store both the original and any entailed datasets.
An example implementation of such a context-based semantics is Sindice [DELBRU-ET-AL-2008].
There are several variants of this type of dataset semantics.
In accordance with linked data principles, an IRI may be assumed to reference the document that is obtained by dereferencing it. If the document contains an RDF graph, it can be assumed that the graph in the named graph is in a special relationship (such as equals or entails) with this RDF graph.
In such a case, the truth of an RDF dataset is dependent on the state of the Web, and the same dataset may entail different statements at different times.
Let d be the function that maps an IRI to an RDF graph that can be obtained from dereferencing the IRI. For an IRI u, d(u) is empty when dereferencing returns an error or a document that does not encode an RDF graph.
A dataset-interpretation I with respect to an entailment regime E is an E-interpretation such that:
Entailments in this semantics depend not only on the content of a dataset but also on the content of the Web and the ability of a reasoner to accept this content. Moreover, the entailments vary whether the considered relation is “equals”, or “subgraph of”, or “entailed by”.
For instance, if the reasoner is offline, then the dereferencing function d in the previous definition always returns an empty graph. In this case, if the relation is “equals” or “subgraph of”, only empty named graphs can be true; if the relation is “entailed by”, then only named graphs containing axiomatic triples are true. In general, if the relationship is “equals”, named graphs do not provide extra entailments.
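The dependency on the state of the Web can be simulated without any network access. In the sketch below, the Web is a hypothetical in-memory dictionary from IRIs to graphs, and the function d is built from such a snapshot. Only the “equals” and “subgraph of” relations are covered, since “entailed by” would require an entailment checker. All names are illustrative assumptions.

```python
def make_d(web):
    """Build the dereferencing function d from a snapshot of the Web (a dict IRI -> graph)."""
    def d(iri):
        # Errors and non-RDF documents are modelled as a missing key: d(u) is then empty.
        return web.get(iri, frozenset())
    return d


RELATIONS = {
    "equals": lambda graph, fetched: graph == fetched,
    "subgraph of": lambda graph, fetched: graph <= fetched,
}


def named_graphs_true(named, d, relation):
    """All named graphs must stand in the chosen relation to the dereferenced graph."""
    holds = RELATIONS[relation]
    return all(holds(graph, d(name)) for name, graph in named.items())


named = {"http://example.com/ng1": frozenset({(":a", ":b", ":c")})}

online = make_d({
    "http://example.com/ng1": frozenset({(":a", ":b", ":c"), (":d", ":e", ":f")}),
})
offline = make_d({})  # an offline reasoner: d always returns the empty graph

print(named_graphs_true(named, online, "subgraph of"))   # True
print(named_graphs_true(named, online, "equals"))        # False
print(named_graphs_true(named, offline, "subgraph of"))  # False
print(named_graphs_true({"http://example.com/ng1": frozenset()}, offline, "equals"))  # True
```

The same `named` mapping is true or false depending solely on which snapshot is passed in, which mirrors the time dependence discussed above.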
The distinguishing characteristic of this dataset semantics is the fact that a single RDF dataset can lead to different entailments, depending on the state of the Web. This can be seen as a feature for systems that need to be in line with what is found online, but is a drawback for systems that must retain consistency even when they go offline.
This approach consists in considering named graphs as sets of quadruples, having the subject, predicate and object of the triples as the first three components, and the graph IRI as the fourth element. Each quadruple is interpreted similarly to a triple in RDF, except that the predicate denotes a ternary relation rather than a binary one.
This semantics is extending the semantics of RDF rather than simply reusing it.
A quad-interpretation is a tuple (IR,IP,IEXT,IS,IL,LV) where IR, IP, IS, IL and LV are defined as in RDF and IEXT is a mapping from IP into the powerset of (IR × IR) ∪ (IR × IR × IR).
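Written in symbols, the type of IEXT given above is the first formula below. The second formula is an assumed truth condition for a quadruple with subject s, predicate p, object o and graph name g, stated by analogy with the truth condition for ground triples in RDF Semantics. It is an illustration under that assumption, not a definition taken from this note.

```latex
\[
  \mathrm{IEXT} : \mathrm{IP} \to
  \mathcal{P}\bigl((\mathrm{IR} \times \mathrm{IR}) \cup (\mathrm{IR} \times \mathrm{IR} \times \mathrm{IR})\bigr)
\]
\[
  I(s\ p\ o\ g) = \mathrm{true}
  \iff
  I(p) \in \mathrm{IP} \ \text{and}\ 
  \langle I(s), I(o), I(g) \rangle \in \mathrm{IEXT}(I(p))
\]
```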
Since this option modifies the notion of simple-interpretation, which is the basis for all E-interpretations in any entailment regime E, it is not clear how it can be extended to arbitrary entailment regimes. For instance, it is unclear whether the following quad set:
:a rdf:type :c :x .
:c rdfs:subClassOf :d :x .
RDFS-dataset-entails:
:a rdf:type :d :x .
With this semantics, all inferences that are valid with normal RDF triples are preserved, but it is necessary to extend RDFS in order to accommodate ternary relations. There are several existing proposals that extend this quad semantics by dealing with a specific “dimension”, such as time, uncertainty, or provenance. For instance, temporal RDF [TEMPORAL-RDF] uses the fourth element to denote a time frame, and reasoning can be performed per time frame. Special semantic rules allow one to combine triples in overlapping time frames. Fuzzy RDF [FUZZY-RDF] extends the semantics to deal with uncertainty. stRDF [ST-RDF] extends temporal RDF to deal with spatial information. Annotated RDF [ANNOTATED-RDF] generalizes the previous proposals.
Quoted graphs are a way to associate information to a specific RDF graph without constraining the relationship between a graph name and the graph associated with it in a dataset. An RDF graph is “quoted” by using a literal having a lexical form that is a syntactic expression of the graph. For instance:
{ :g :quotes ":a :b []"^^:turtle . } :g { :b rdf:type rdf:Property . :a :b _:x . }
This technique allows one to assume a dataset semantics of contexts (as in Section 3.4) and still preserve an initial version of a graph. However, quoting big graphs may be cumbersome and would require a custom datatype to be recognized.
There is a strong relationship between SPARQL ASK queries with an entailment regime [SPARQL11-ENTAILMENT] and inferences in the regime. If an ASK query does not contain variables and its WHERE clause only contains a basic graph pattern, then the query can be seen as an RDF graph. If such a graph query Q returns true when issued against an RDF graph G with entailment regime E, then G E-entails Q. If it returns false, then G does not E-entail Q.
A dataset semantics can also be compared to what ASK queries return when they do not contain variables but may contain basic graph patterns or GRAPH graph patterns. For instance, consider the dataset:
{ } :g1 { :x rdf:type :c . :c rdfs:subClassOf :d . } :g2 { :y rdf:type :c . }
Then the query:
ASK WHERE { GRAPH :g1 { :x rdf:type :d } }
with the RDFS entailment regime would answer true, but the query:
ASK WHERE { GRAPH :g1 { :x rdf:type :d } GRAPH :g2 { :y rdf:type :d } }
would answer false.
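The two answers above can be reproduced with a sketch that evaluates a variable-free ASK query one GRAPH pattern at a time. As a stand-in for the RDFS entailment regime it applies a single RDFS rule (rdfs9) once, which is enough for this example but is not a complete RDFS reasoner. The functions `rdfs9` and `ask`, and the representation of a query as a list of (graph name, pattern) pairs, are assumptions made for illustration.

```python
def rdfs9(graph):
    """Add the consequences of rdfs9: x rdf:type c, c rdfs:subClassOf d => x rdf:type d."""
    inferred = {
        (x, "rdf:type", d)
        for (x, p, c) in graph if p == "rdf:type"
        for (c2, q, d) in graph if q == "rdfs:subClassOf" and c2 == c
    }
    return graph | inferred


def ask(named_graphs, graph_patterns):
    """Variable-free ASK: every GRAPH pattern must hold in the graph it names."""
    return all(
        pattern <= rdfs9(named_graphs.get(name, frozenset()))
        for name, pattern in graph_patterns
    )


dataset = {
    ":g1": frozenset({(":x", "rdf:type", ":c"), (":c", "rdfs:subClassOf", ":d")}),
    ":g2": frozenset({(":y", "rdf:type", ":c")}),
}

q1 = [(":g1", {(":x", "rdf:type", ":d")})]
q2 = [(":g1", {(":x", "rdf:type", ":d")}), (":g2", {(":y", "rdf:type", ":d")})]

print(ask(dataset, q1))  # True
print(ask(dataset, q2))  # False: the subclass axiom is only available inside :g1
```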
This can lead to a classification of dataset semantics in terms of whether they are compatible with SPARQL ASK queries or not. It can be noted that a semantics where each named graph defines its own context is “SPARQL-ASK-compatible”, while a semantics where the graph name denotes the graph or named graph is not compatible in this sense.
The RDF Working Group did not define a formal semantics for a multiple graph data model because none of the semantics presented above could obtain consensus. Choosing any one of these propositions would have gone against some deployed implementations. Therefore, the Working Group discussed the possibility of defining several semantics, among which an implementation could choose, and of providing the means to declare which semantics is adopted.
This approach was eventually not retained because of the lack of experience, so there is no definitive option for it. Nonetheless, for completeness, we describe possible solutions here.
A dataset can be described in RDF using vocabularies like voiD [VOID] and the SPARQL service description vocabulary [SPARQL11-SERVICE-DESCRIPTION]. VoiD is used to describe how a collection of RDF triples is organized in a web site or across web sites, giving information about the size of the datasets, the location of the dump files, the IRI of the query endpoints, and so on. The notion of dataset in voiD is used as a more informal and broader concept than RDF dataset. However, an RDF dataset and the graphs in it can be described as voiD datasets, and the information can be completed with the SPARQL service description vocabulary:
@prefix er: <http://www.w3.org/ns/entailment/> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
[] a sd:Dataset ;
   sd:defaultEntailmentRegime er:RDF ;
   sd:namedGraph [
     sd:name <http://example.com/ng1> ;
     sd:entailmentRegime er:RDFS
   ] .
A vocabulary specifically tailored for describing the intended dataset semantics could be defined in a future specification.
Communication of the intended semantics could be performed in various ways, from having the author tell the consumers directly, to inventing a protocol for this. Use of the HTTP protocol and content negotiation could be a possible way too. Special syntactic markers in the concrete serialization of datasets could convey the intended meaning. All of those are solutions that do not follow current practices.
This section is non-normative.
This document is the result of extensive discussions that involved many members of the RDF 1.1 Working Group. The editor especially acknowledges valuable contributions from Richard Cyganiak, Sandro Hawke, Pat Hayes, Ivan Herman, Peter F. Patel-Schneider, Guus Schreiber, and David Wood.