Kendall Clark & Evren Sirin, Clark & Parsia LLC
30 June 2013
We begin with a brief overview of the existing RDF validation landscape. Then we discuss Stardog's Integrity Constraint Validation in more detail. We conclude with some general considerations about what an RDF validation spec should and shouldn't do.
There are three primary systems to consider: SPIN, IBM's Resource Shapes, and Stardog ICV. Each of these systems can be used to validate RDF (i.e., Linked Data). SPIN works with TopBraid's toolchain; our ICV works with Stardog, our RDF database, and with any other system that can evaluate SPARQL queries; and IBM's Resource Shapes works (or will work) with parts of IBM's Rational suite of OSLC tools.
The differences between these three RDF validation tools are largely superficial. The most obvious difference, from a user's point of view, is surface syntax, that is, the syntax used to capture the constraints.
There are other, older systems that do RDF validation but they are (at this point) of primarily historical significance.
Stardog can provide ICV services even for RDF databases that don't support ICV natively, by converting users' constraints into ordinary SPARQL queries that those databases can evaluate.
Using high-level languages to represent RDF validation constraints is largely about concision and abstraction. Constraint languages should be syntactically expressive and (to the greatest degree feasible, technically) independent of or abstracted from graph details. Consider some examples from the Stardog ICV documentation which use Manchester OWL syntax to great effect:
Class: Employee
    SubClassOf: works_on some Project or
                supervises some (Employee and works_on some Project) or
                manages some Department
Translating this constraint into natural language:
Each employee either works on at least one project, supervises at least one employee that works
on at least one project, or manages at least one department.
Because the triple-level representation of RDF is complex, a low-level syntax is necessarily messy with respect to those triples: it fails to shield users from that messiness.
Consider some further examples:
The manager of a department must work for that department.
In a high-level syntax like OWL's Manchester syntax, this becomes:
manages subPropertyOf worksFor
It's hard to imagine anything simpler. But in all fairness the equivalent SPARQL isn't difficult, particularly for people who already know SPARQL, whom we anticipate to be the primary users of an RDF validation technology.
SELECT * WHERE {
    ?x :manages ?y .
    FILTER NOT EXISTS {
        ?x :worksFor ?y .
    }
}
The OWL version is obviously much shorter than the SPARQL version. We expect an equivalent SPARQL encoding to be more verbose in general, but in this case it is still easy to read and understand.
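The translation from the one-line subproperty constraint to its SPARQL form is mechanical, which a few lines of code can make concrete. The following is a minimal sketch, not Stardog's implementation or API: the function name `subproperty_violation_query` is hypothetical, and the property names are assumed to be prefixed names using the default `:` prefix.

```python
def subproperty_violation_query(sub_property: str, super_property: str) -> str:
    """Build a SPARQL query that selects the violations of the constraint
    `sub_property subPropertyOf super_property`.

    A violation is any pair (?x, ?y) related by the sub-property
    but not by the super-property.
    """
    return (
        "SELECT * WHERE {\n"
        f"    ?x {sub_property} ?y .\n"
        "    FILTER NOT EXISTS {\n"
        f"        ?x {super_property} ?y .\n"
        "    }\n"
        "}"
    )


# Hypothetical usage: the "manager must work for the department" constraint.
query = subproperty_violation_query(":manages", ":worksFor")
print(query)
```

An empty result set for the generated query means the data satisfies the constraint; every returned row is one violation.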
Let's look at a more complex example.
If a project is funded by only internal funding sources then it should be approved by
the internal budget office.
In OWL as interpreted by Stardog ICV that becomes
Project and fundedBy only InternalFundingSource subClassOf (approvedBy value InternalBudgetOffice)
And the same constraint in SPARQL:
SELECT * WHERE {
    ?x a :Project .
    FILTER NOT EXISTS {
        ?x :fundedBy ?y .
        FILTER NOT EXISTS {
            ?y a :InternalFundingSource .
        }
    }
    FILTER NOT EXISTS {
        ?x :approvedBy :InternalBudgetOffice .
    }
}
Now the SPARQL version is harder to understand because of the nested negations, whereas the OWL version stays very close to the natural language rendering. Admittedly some of the terms here are a bit artificial, but not in a way that makes much difference in the different encodings.
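To see what the nested negations compute, it can help to evaluate the same logic by hand over a handful of triples. The sketch below is illustrative only: the data, the identifiers such as `p1` and `f1`, and the `violations` function are all assumptions made for the example, and a plain Python set of tuples stands in for an RDF graph.

```python
# Toy graph: a set of (subject, predicate, object) triples. All names are illustrative.
TRIPLES = {
    ("p1", "a", "Project"),
    ("p1", "fundedBy", "f1"),
    ("f1", "a", "InternalFundingSource"),
    ("p2", "a", "Project"),
    ("p2", "fundedBy", "f2"),  # f2 is not typed as an internal funding source
    ("p3", "a", "Project"),
    ("p3", "fundedBy", "f1"),
    ("p3", "approvedBy", "InternalBudgetOffice"),
    ("p4", "a", "Project"),  # no funding sources at all
}


def violations(triples):
    """Return projects funded only by internal sources but lacking approval."""
    found = []
    projects = sorted(s for (s, p, o) in triples if p == "a" and o == "Project")
    for x in projects:
        # Inner pattern of the first FILTER NOT EXISTS: some funding source
        # of x that is not typed as InternalFundingSource.
        has_non_internal_source = any(
            (y, "a", "InternalFundingSource") not in triples
            for (s, p, y) in triples
            if s == x and p == "fundedBy"
        )
        # Inner pattern of the second FILTER NOT EXISTS.
        approved = (x, "approvedBy", "InternalBudgetOffice") in triples
        if not has_non_internal_source and not approved:
            found.append(x)
    return found


print(violations(TRIPLES))  # prints ['p1', 'p4']
```

Note that `p4` is reported as well: a project with no funding sources satisfies "funded by only internal funding sources" vacuously, which is also how the double negation in the SPARQL query behaves.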
We think RDF validation makes the most sense, conceptually, as something applied orthogonally to reasoning: either to an explicit RDF graph or to an RDF graph under the semantics of a SPARQL 1.1 entailment regime. That's how Stardog ICV works: an explicit triple or triples may violate (or satisfy) one or more constraints; likewise, an inferred (that is, implicit) triple or triples may violate (or satisfy) one or more constraints.
Note, too, that this issue is orthogonal to which syntax or syntaxes are used to represent the constraints themselves. Stardog ICV works with SPARQL 1.1 entailment regimes whether the constraints themselves are in OWL, SWRL, or SPARQL.
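The difference between validating the explicit graph and validating under an entailment regime can be shown with a toy example. This is a sketch under stated assumptions, not Stardog's reasoner: the only inference rule applied is subproperty entailment, the `leads` property and all data are invented for illustration, and the constraint checked is the earlier "a manager of a department must work for that department".

```python
# Schema axiom used for inference (an assumption for this example):
# leads subPropertyOf manages.
SUBPROPERTY_OF = {"leads": "manages"}

# Explicit triples only; all names are illustrative.
EXPLICIT = {
    ("alice", "leads", "sales"),
    ("bob", "manages", "hr"),
    ("bob", "worksFor", "hr"),
}


def with_subproperty_inferences(triples, subproperty_of):
    """Return the triples plus those entailed by the subPropertyOf axioms."""
    closure = set(triples)
    changed = True
    while changed:  # repeat until no new triple is added
        changed = False
        for (s, p, o) in list(closure):
            sup = subproperty_of.get(p)
            if sup is not None and (s, sup, o) not in closure:
                closure.add((s, sup, o))
                changed = True
    return closure


def manager_violations(triples):
    """Constraint: whoever manages a department must work for it."""
    return sorted(
        (s, o)
        for (s, p, o) in triples
        if p == "manages" and (s, "worksFor", o) not in triples
    )


explicit_result = manager_violations(EXPLICIT)
inferred_result = manager_violations(
    with_subproperty_inferences(EXPLICIT, SUBPROPERTY_OF)
)
print(explicit_result)  # prints []
print(inferred_result)  # prints [('alice', 'sales')]
```

The same constraint yields no violations over the explicit triples and one violation once the inferred triple `alice manages sales` is taken into account. The constraint and its violation check are identical in both runs; only the graph being validated changes.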
We think a polyglot approach makes the most sense because
Of course in a W3C spec one of the cost drivers would be multiple surface syntaxes. While we don't require multiple syntaxes, we do very much support the idea that multiple syntaxes be permitted (even if not specified) in the sense that the resulting SPARQL translation is the canonical representation of constraints from the point of view of execution and exchange.
This section is intended to establish the maturity of ICV as an approach to RDF validation. As such, you may skip it with no great loss.
A few words about Stardog ICV's history. We described the idea in a research proposal to NIST, which they funded, in early 2008. That was the culmination of about 18 months of behind-the-scenes conversations in the OWL research community about how to do RDF validation. At that early stage, we were already focused on how to re-use OWL syntax to provide a high-level constraint language, an approach we eventually generalized to SPARQL and SWRL syntaxes, too.
The earliest published (peer reviewed, no less) description of this work from us came at OWLED 2008: "Opening, Closing Worlds: On Integrity Constraints".
We delivered the first prototype to NIST in early 2009; that prototype was based on the SPARQL query engine in Pellet. So the ICV work that's in Stardog now is based on work that was done before Stardog development even started. Sometimes research to market is a series of long lines between vague dots.
We released the first version of ICV integrated with Stardog in 2011 and have been working on extending it since then, including the ability to explain ICV results automatically. That explanation work is ongoing today as we're working on automated repair plans for ICV violations. That means Stardog ICV doesn't merely tell users that data is wrong in some way; it also tells them why it's wrong and what they can do to repair it.
Things we care about with respect to a future standard in this area:
From the user's point of view, you can
Some constraints are easier to write in one syntax than in the others. There isn't any particular reason to force users to use one and only one syntax for writing all constraints since the only reasonable basis of interoperability is SPARQL queries.
The expressivity of RDF validation should be precisely the expressivity of SPARQL query evaluation against RDF data (including, optionally, as SPARQL 1.1 does, entailment regimes). No more, no less.
By and large, it will be RDF databases that provide RDF validation services and the lingua franca of RDF databases is SPARQL: not nested for-loops in Jena, or Sesame SAILs, or OWL axioms, or SWRL rules, or RDF vocabularies. An RDF validation spec should use SPARQL (SPARQL 1.1) queries as the basis of interoperability and exchange and as many surface syntaxes as the market cares to support.
Depending on how the market turns and how the W3C takes up these matters in a future standardization effort, Stardog will add support for the standard. In fact Stardog is very likely to support any constraint syntax that can be efficiently translated into legal, valid SPARQL because life is too short to obsess about syntax.
An RDF validation spec should not focus on