Copyright © 2016 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document provides a framework in which the quality of a dataset can be described, whether by the dataset publisher or by a broader community of users. It does not provide a formal, complete definition of quality; rather, it sets out a consistent means by which information can be provided such that potential users of a dataset can make their own judgment about its fitness for purpose.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
The model for the Data Quality Vocabulary is nearing maturity, but the Working Group is seeking feedback on a number of specific issues highlighted in the document below.
This document was published by the Data on the Web Best Practices Working Group as a Working Draft. If you wish to make comments regarding this document, please send them to public-dwbp-comments@w3.org ( subscribe , archives ). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 September 2015 W3C Process Document .
The Data on the Web Best Practices Working Draft has pointed out the relevance of publishing information about the quality of data published on the Web. Accordingly, the Data on the Web Best Practices Working Group has been chartered to create a vocabulary for expressing data quality. The Data Quality Vocabulary (DQV) presented in this document is foreseen as an extension to DCAT [ vocab-dcat ] to cover the quality of the data, how frequently it is updated, whether it accepts user corrections, persistence commitments, etc. When used by publishers, this vocabulary will foster trust in the data amongst developers.
This vocabulary does not seek to determine what "quality" means. We believe that quality lies in the eye of the beholder and that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs. Accordingly, we attach a lot of importance to allowing many actors to assess the quality of datasets and publish their annotations, certificates and opinions about a dataset. A dataset's publisher should seek to publish metadata that helps data consumers determine whether they can use the dataset to their benefit. However, publishers should not be the only ones to have a say on the quality of data published in an open environment like the Web. Certification agencies, data aggregators and data consumers can make relevant quality assessments, too.
We want to stimulate this by making it easier to publish, exchange and consume quality metadata, for every step of a dataset's lifecycle. This is why, next to rather expected constructs like quality measures, the Data Quality Vocabulary puts a lot of emphasis on feedback, annotation and agreements.
Note that DQV elements can be applied not only to express metadata on the quality of datasets; they can also be used to express statements about the quality of the metadata that describes them. This is especially true when it comes to representing the provenance of that metadata or its conformance with respect to established metadata standards.
The namespace for DQV is provisionally set as http://www.w3.org/ns/dqv# . DQV, however, seeks to re-use elements from other vocabularies, notably DCAT, following the best practices for data vocabularies identified by the Data on the Web Best Practices Working Group.
The table below indicates the full list of namespaces and prefixes used in this document.
Prefix | Namespace |
---|---|
daq | http://purl.org/eis/vocab/daq# |
dcat | http://www.w3.org/ns/dcat# |
dcterms | http://purl.org/dc/terms/ |
dqv | http://www.w3.org/ns/dqv# |
duv | http://www.w3.org/ns/duv# |
oa | http://www.w3.org/ns/oa# |
prov | http://www.w3.org/ns/prov# |
sdmx-attribute | http://purl.org/linked-data/sdmx/2009/attribute# |
skos | http://www.w3.org/2004/02/skos/core# |
The following vocabulary is based on DCAT [ vocab-dcat ], which it extends with a number of additional properties and classes suitable for expressing the quality of a dataset.
The quality of a given dataset or distribution is assessed via a number of observed properties. For instance, one may consider a dataset to be of high quality because it complies with a specific standard, while for other use cases the quality of the data will depend on its level of interlinking with other datasets. To express these properties, an instance of dcat:Dataset or dcat:Distribution can be related to five different types of quality information, each represented by its own class.
DQV defines quality measures as specific instances of Quality Measurements, adapting the daQ quality metrics framework [ DaQ ], [ DaQ-RDFCUBE ]. It relies on quality dimensions and quality metrics. For example, a dimension could be "multilinguality", and two metrics could be "ratio of literals with language tags" and "number of different language tags".
Besides quality measurements, DQV considers certificates, standards, and quality policies, which can also be organized according to dimensions. Quality metadata containers ( dqv:QualityMetadata ) can group together different quality statements, so that their provenance can be tracked jointly.
N.B.: "containment" refers to the inclusion of quality statements into "containers", which may or may not be treated as (RDF) graphs (see later example and the usage note for the class dqv:QualityMetadata ).
Quality information can be derived from other quality information. For example, a quality annotation can be derived from a standard or a quality measurement. Quality measurements can be derived from other measurements. Metrics can be derived from other metrics. A standard can be built on another standard or a (set of) metrics. DQV models such derivations through the property prov:wasDerivedFrom as illustrated in the diagram below.
This section is work in progress. More tables specifying individual classes and properties will be included later.
The following properties should be used on this class: dqv:isMeasurementOf , dqv:value , qb:dataSet .
Should (and if yes, how) DQV represent parameters for a metric applied for computing a specific quality measurement (e.g., a specific setting of weights)? ( Issue-223 )
RDF Class: | dqv:QualityMeasurement |
---|---|
Definition: | A quality |
Subclass of: | qb:Observation |
Equivalent class | daq:Observation |
Usage note: | The unit of measure in a quality measurement should be specified through the property sdmx-attribute:unitMeasure , as recommended by RDF Data Cube [ Vocab-Data-Cube ]. The Ontology of units of Measure (OM) [ RijgersbergEtAl ] provides a list of HTTP-dereferenceable units of measure which can be used as values for sdmx-attribute:unitMeasure . |
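As a minimal sketch of the usage note above, the following Turtle attaches a unit of measure to a quality measurement. The resources :exampleMeasurement, :exampleDistribution, :exampleCompletenessMetric and :percent are hypothetical placeholders in the http://example.org/ namespace and are not defined by DQV; in practice the unit would typically be a dereferenceable unit URI, for instance one taken from OM.

```turtle
@prefix : <http://example.org/> .
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical measurement: a completeness of 50, expressed in percent.
:exampleMeasurement
    a dqv:QualityMeasurement ;
    dqv:computedOn :exampleDistribution ;
    dqv:isMeasurementOf :exampleCompletenessMetric ;
    dqv:value "50.0"^^xsd:double ;
    # :percent stands in for a dereferenceable unit URI (assumption).
    sdmx-attribute:unitMeasure :percent .
```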
RDF Property: | dqv:isMeasurementOf |
---|---|
Definition: | Indicates the metric being observed. |
Instance of: | qb:DimensionProperty |
Domain: | qb:Observation |
Range: | dqv:Metric |
Equivalent Property | daq:metric |
RDF Property: | qb:dataSet |
---|---|
Definition: | Indicates the dataset to a quality |
Domain: | qb:Observation |
Range: | qb:DataSet |
RDF Property: | dqv:computedOn |
---|---|
Definition: | Refers to the resource (e.g., a dataset, a linkset, a graph, a set of triples) on which the quality measurement is performed. In the DQV context, this property is generally expected to be used in statements in which objects are instances of dcat:Dataset and dcat:Distribution . |
Instance of: | qb:DimensionProperty |
Domain: | dqv:QualityMeasurement |
Equivalent property: | daq:computedOn |
Inverse property: | dqv:hasQualityMeasurement |
RDF Property: | dqv:value |
---|---|
Definition: | Refers to the values computed by a metric. |
Instance of: | qb:MeasureProperty , owl:DatatypeProperty |
Domain: | dqv:QualityMeasurement |
Equivalent property: | daq:value |
The following properties should be used on this class: dqv:inDimension .
RDF Class: | dqv:Metric |
---|---|
Definition: | A standard to measure a quality dimension. An observation (instance of |
Equivalent class | daq:Metric |
RDF Property: | |
---|---|
Definition: | Represents the |
Domain: | dqv:Metric |
Range: | |
The following properties should be used on this class: dqv:inCategory .
RDF Class: | dqv:Dimension |
---|---|
Definition: | Represents criteria relevant for assessing quality. Each quality dimension must have one or more metrics to measure it. A dimension is linked with a category using the dqv:inCategory property. |
Subclass of: | skos:Concept |
Equivalent class | daq:Dimension |
RDF Property: | dqv:inCategory |
---|---|
Definition: | Represents the category a dimension is grouped in. |
Domain: | dqv:Dimension |
Range: | dqv:Category |
Inverse: | daq:hasDimension |
Usage note: | Categories are meant to systematically organize dimensions. The Data Quality Vocabulary defines no specific cardinality constraints for |
RDF Class: | dqv:Category |
---|---|
Definition: | Represents a group of quality dimensions in which a common type of information is used as quality indicator. |
Subclass of: | skos:Concept |
Equivalent class | daq:Category |
Dimensions and categories are abstract entities. We represent instances of dqv:Dimension and dqv:Category as instances of skos:Concept , which we think enables features similar to those available for dimensions and categories in daQ. Our representation choice differs more significantly for metrics, however. daQ uses RDFS/OWL classes and subclasses to represent constraints on measurements (e.g., on the type of values). RDFS/OWL, however, makes an 'open world' assumption that does not allow one to capture all constraints entirely. Additionally, languages such as SHACL are currently being defined to represent constraints in more appropriate ways. We therefore think it is not appropriate at this time to recommend treating specific metrics as subclasses of dqv:Metric, and we refer implementers to future progress on SHACL and related technology.
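To illustrate the kind of constraint that a shapes language could capture, here is a rough sketch of a SHACL shape for quality measurements. It is not part of DQV and assumes the SHACL Core syntax as it was later standardized; the shape name :QualityMeasurementShape is hypothetical. The shape requires every measurement to refer to exactly one metric and to carry exactly one value.

```turtle
@prefix : <http://example.org/> .
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# Hypothetical shape, shown only to illustrate closed-world style constraints.
:QualityMeasurementShape
    a sh:NodeShape ;
    sh:targetClass dqv:QualityMeasurement ;
    # Each measurement observes exactly one metric.
    sh:property [
        sh:path dqv:isMeasurementOf ;
        sh:class dqv:Metric ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] ;
    # Each measurement carries exactly one value.
    sh:property [
        sh:path dqv:value ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
```

Constraints specific to one metric (for example, a required datatype for its values) would need a narrower target than the whole dqv:QualityMeasurement class.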
RDF Class: | |
---|---|
Definition: | Represents a dataset of quality |
Subclass of: | qb:DataSet |
Equivalent class | daq:QualityGraph |
RDF Class: | dqv:QualityPolicy |
---|---|
Definition: | Represents a policy or agreement that is chiefly governed by data quality concerns. |
RDF Class: | dqv:QualityAnnotation |
---|---|
Definition: | Represents quality annotations, including rating, quality certificate, feedback that can be associated to datasets or distributions. Quality annotations must have one oa:motivatedBy statement with an instance of oa:Motivation (and skos:Concept), which reflects a quality assessment purpose. We define this instance as dqv:qualityAssessment. |
Subclass of: | oa:Annotation |
Equivalent class | EquivalentClasses( dqv:QualityAnnotation ObjectHasValue( oa:motivatedBy dqv:qualityAssessment ) ) |
To make the document more self-contained, we might consider describing some properties of oa:Annotation, such as oa:hasBody and oa:hasTarget.
RDF Class: | dqv:QualityCertificate |
---|---|
Definition: | An annotation that associates a resource (especially, a dataset or a distribution) to another resource (for example, a document) that certifies the resource's quality according to a set of quality assessment rules. |
Subclass of: | dqv:QualityAnnotation |
RDF Class: | dqv:UserQualityFeedback |
---|---|
Definition: | Represents feedback users might want to associate with datasets or distributions. Besides dqv:qualityAssessment, which is the motivation required by all quality annotations, one of the predefined instances of oa:Motivation should be indicated as motivation to distinguish among the different kinds of feedback, e.g., classifications or questions. |
Subclass of: | dqv:QualityAnnotation , duv:UserFeedback |
RDF Class: | dqv:QualityMetadata |
---|---|
Definition: | Represents quality metadata, it is defined to |
Subclass of: | rdfg:Graph |
Usage note: | QualityMetadata containers do not necessarily include all types of quality statements DQV can support. Implementers decide the |
RDF Property: | dqv:inDimension |
---|---|
Definition: | Represents the |
Range: | dqv:Dimension |
Equivalent to: | SubObjectPropertyOf( ObjectInverseOf( daq:hasMetric ) dqv:inDimension ) |
Usage note: | Dimensions are meant to systematically organize metrics, quality certificates and quality annotations. The Data Quality Vocabulary defines no specific cardinality constraints for dqv:inDimension, since distinct quality |
RDF Property: | dqv:hasQualityMeasurement |
---|---|
Definition: | Refers to the performed quality measurements. Quality measurements can be performed on any kind of resource (e.g., a dataset, a linkset, a graph, a set of triples). However, in the DQV context, this property is generally expected to be used in statements in which subjects are instances of dcat:Dataset and dcat:Distribution . |
Range: | dqv:QualityMeasurement |
Inverse property: | dqv:computedOn |
RDF Property: | prov:wasDerivedFrom |
---|---|
Definition: | A derivation is |
Domain: | prov:Entity |
Range: | prov:Entity |
Usage note: | prov:wasDerivedFrom expresses a quite abstract relation of derivation. More specialized relations of derivation can be defined as subproperties of prov:wasDerivedFrom, whenever this is required by applications. |
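As a sketch of such a specialization, an application could declare its own subproperty of prov:wasDerivedFrom. The property :wasAggregatedFrom and the two measurement resources below are hypothetical and only serve as an illustration; they are not defined by DQV or PROV.

```turtle
@prefix : <http://example.org/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical application-specific derivation relation.
:wasAggregatedFrom
    a rdf:Property ;
    rdfs:subPropertyOf prov:wasDerivedFrom ;
    rdfs:label "was aggregated from"@en ;
    rdfs:comment "Relates a quality measurement to a measurement it aggregates."@en .

# Usage: under RDFS entailment this statement also implies prov:wasDerivedFrom.
:datasetMeasurement :wasAggregatedFrom :distributionMeasurement .
```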
The section entitled "Expressing derivation between quality metrics, measurements and annotations" shows some examples that illustrate the uses of this property.
RDF Instance: | dqv:qualityAssessment |
---|---|
Definition: | Motivation that must be specified for quality annotations. |
Instance of: | oa:Motivation |
This section is still work in progress. Further examples will be provided as soon as some of the pending issues are resolved. We invite the public to contact the editors and submit relevant examples of quality data, even if they are not yet represented in DQV. We welcome your input!
Whenever DQV implementers need to extend the motivations for quality annotations, they should follow the instructions provided by the Web Annotation Data Model, and the concepts in the extension should be defined as specializations of dqv:qualityAssessment .
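A minimal sketch of such an extension is given below. The motivation :licenseReview is hypothetical, and the use of skos:broader as the specialization relation is an assumption based on the way the Web Annotation model relates new motivations to existing ones.

```turtle
@prefix : <http://example.org/> .
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Hypothetical motivation that specializes dqv:qualityAssessment.
:licenseReview
    a oa:Motivation ;
    skos:broader dqv:qualityAssessment ;
    skos:prefLabel "License review"@en .
```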
RDF Instance: | dqv:precision |
---|---|
Definition: | Precision is a quality dimension which refers to the recorded level of details. It represents the exactness of measurement or description. |
Instance of: | dqv:Dimension |
Equivalent to | iso:precision |
NB: in the remainder of this section, the prefix ":" refers to http://example.org/
The examples use a dataset, myDataset , and its distribution, myDatasetDistribution :
:myDataset
    a dcat:Dataset ;
    dcterms:title "My dataset" ;
    dcat:distribution :myDatasetDistribution .

:myDatasetDistribution
    a dcat:Distribution ;
    dcat:downloadURL <http://www.example.org/files/mydataset.csv> ;
    dcterms:title "CSV distribution of dataset" ;
    dcat:mediaType "text/csv" ;
    dcat:byteSize "87120"^^xsd:decimal .
An automated quality checker has provided a quality assessment with two (CSV) quality measurements for myDatasetDistribution .
:myDatasetDistribution
    dqv:hasQualityMeasurement :measurement1, :measurement2 .

:measurement1
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:isMeasurementOf :downloadURLAvailabilityMetric ;
    dqv:value "true"^^xsd:boolean .

:measurement2
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:isMeasurementOf :csvCompletenessMetric ;
    dqv:value "0.5"^^xsd:double .

# Definition of dimensions and metrics
:availability
    a dqv:Dimension ;
    skos:prefLabel "Availability"@en ;
    skos:definition "Availability of a dataset is the extent to which data (or some portion of it) is present, obtainable and ready for use."@en ;
    dqv:inCategory :accessibility .

:completeness
    a dqv:Dimension ;
    skos:prefLabel "Completeness"@en ;
    skos:definition "Completeness refers to the degree to which all required information is present in a particular dataset."@en ;
    dqv:inCategory :intrinsicDimensions .

:downloadURLAvailabilityMetric
    a dqv:Metric ;
    skos:definition "It checks if dcat:downloadURL is available and if its value is dereferenceable."@en ;
    dqv:expectedDataType xsd:boolean ;
    dqv:inDimension :availability .

:csvCompletenessMetric
    a dqv:Metric ;
    skos:definition "Ratio between the number of objects represented in the csv and the number of objects expected to be represented according to the declared dataset scope."@en ;
    dqv:expectedDataType xsd:double ;
    dqv:inDimension :completeness .
Categories and dimensions might be more extensively defined; see the section 'Dimensions and metrics hints' for further examples. Any quality framework is free to define its own dimensions and categories.
The results of metrics obtained in the previous assessment are stored in the myQualityMetadata graph.
# :myQualityMetadata is a graph
:myQualityMetadata {
    :myDatasetDistribution
        dqv:hasQualityMeasurement :measurement1, :measurement2 .
    # The graph contains the rest of the statements presented in the previous example.
}

# :myQualityMetadata has been created by :qualityChecker and it is the result of the
# :qualityChecking activity
:myQualityMetadata
    a dqv:QualityMetadata ;
    prov:wasAttributedTo :qualityChecker ;
    prov:generatedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:wasGeneratedBy :qualityChecking .

# :qualityChecker is a service computing some quality metrics
:qualityChecker
    a prov:SoftwareAgent ;
    rdfs:label "a quality assessment service"^^xsd:string
    # Further details about quality service/software can be provided, for example,
    # deploying vocabularies such as Data Usage Vocabulary (DUV), Dublin Core or ADMS.SW
    .

# :qualityChecking is the activity that has generated :myQualityMetadata starting from
# :myDatasetDistribution
:qualityChecking
    a prov:Activity ;
    rdfs:label "the checking of myDatasetDistribution's quality"^^xsd:string ;
    prov:wasAssociatedWith :qualityChecker ;
    prov:used :myDatasetDistribution ;
    prov:generated :myQualityMetadata ;
    prov:endedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:startedAtTime "2015-05-27T00:52:02Z"^^xsd:dateTime .
The group has discussed provenance at different levels of granularity (dqv:QualityMeasurement and dqv:QualityMetadata). The previous example showed how to track provenance at the level of quality metadata. The following example tracks provenance for the quality measurement :measurement .
:myDatasetDistribution
    dqv:hasQualityMeasurement :measurement .

# :measurement has been created by :qualityChecker and it is the result of the
# :qualityChecking activity
:measurement
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:isMeasurementOf :downloadURLAvailabilityMetric ;
    dqv:value "true"^^xsd:boolean ;
    prov:wasAttributedTo :qualityChecker ;
    prov:generatedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:wasGeneratedBy :qualityChecking .

:downloadURLAvailabilityMetric
    a dqv:Metric ;
    skos:definition "It checks if dcat:downloadURL is available and if its value is dereferenceable."@en ;
    dqv:expectedDataType xsd:boolean ;
    dqv:inDimension :availability .

# :qualityChecker is a service computing some quality metrics
:qualityChecker
    a prov:SoftwareAgent ;
    rdfs:label "a quality assessment service"^^xsd:string
    # Further details about quality service/software can be provided, for example,
    # deploying vocabularies such as Data Usage Vocabulary (DUV), Dublin Core or ADMS.SW
    .

# :qualityChecking is the activity that has generated :measurement starting from
# :myDatasetDistribution
:qualityChecking
    a prov:Activity ;
    rdfs:label "the checking of myDatasetDistribution's quality"^^xsd:string ;
    prov:wasAssociatedWith :qualityChecker ;
    prov:used :myDatasetDistribution ;
    prov:generated :measurement ;
    prov:endedAtTime "2015-05-27T02:52:02Z"^^xsd:dateTime ;
    prov:startedAtTime "2015-05-27T00:52:02Z"^^xsd:dateTime .
Statements similar to the ones applied to the resource myQualityMetadata above can be applied to the resource myDataset to indicate the provenance of the dataset. That is, a dataset can be generated by a specific software agent, be generated at a certain time, etc. The HCLS Community Profile for describing datasets provides further examples.
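For instance, dataset-level provenance could be sketched as follows. The agent :datasetGenerator and the timestamp are hypothetical values chosen for illustration only.

```turtle
@prefix : <http://example.org/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The same PROV properties used for :myQualityMetadata, applied to the dataset.
:myDataset
    a dcat:Dataset ;
    prov:wasAttributedTo :datasetGenerator ;
    prov:generatedAtTime "2015-05-20T10:00:00Z"^^xsd:dateTime .

# Hypothetical software agent that produced the dataset.
:datasetGenerator
    a prov:SoftwareAgent ;
    rdfs:label "a data export service"@en .
```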
Let us express that an ODI certificate for the "City of Raleigh Open Government Data" dataset is available at the URL <https://certificates.theodi.org/en/datasets/393/certificate>.
<https://certificates.theodi.org/en/datasets/393>
    a dcat:Dataset ;
    dqv:hasQualityAnnotation :myDatasetQA .

:myDatasetQA
    a dqv:QualityCertificate ;
    oa:hasTarget <https://certificates.theodi.org/en/datasets/393> ;
    oa:hasBody <https://certificates.theodi.org/en/datasets/393/certificate> ;
    oa:motivatedBy dqv:qualityAssessment .
Let us ask a question about the completeness of the "City of Raleigh Open Government Data" dataset.
<https://certificates.theodi.org/en/datasets/393>
    a dcat:Dataset ;
    dqv:hasQualityAnnotation :questionQA .

:questionQA
    a dqv:UserQualityFeedback ;
    oa:hasTarget <https://certificates.theodi.org/en/datasets/393> ;
    oa:hasBody :textBody ;
    oa:motivatedBy dqv:qualityAssessment, oa:questioning ;
    dqv:inDimension :completeness .

:textBody
    a cnt:ContentAsText, dctypes:Text ;
    cnt:chars "Could you please provide information about the completeness of your dataset?" ;
    dc:language "en" ;
    dc:format "text/plain" .
Let us express that the "City of Raleigh Open Government Data" dataset is classified as a four-star dataset according to the 5 Stars linked open data rating system.
<https://certificates.theodi.org/en/datasets/393>
    a dcat:Dataset ;
    dqv:hasQualityAnnotation :classificationQA .

:classificationQA
    a dqv:UserQualityFeedback ;
    oa:hasTarget <https://certificates.theodi.org/en/datasets/393> ;
    oa:hasBody :four_stars ;
    oa:motivatedBy dqv:qualityAssessment, oa:classifying ;
    dqv:inDimension :availability .

:four_stars
    a skos:Concept ;
    skos:inScheme :OpenData5Star ;
    skos:prefLabel "Four stars"@en ;
    skos:definition "Dataset available on the web with structured machine-readable non proprietary format. It uses URIs to denote things."@en .
DQV models derivation with the property prov:wasDerivedFrom . For example, the accessibility of the dataset :myDataset can be derived from the accessibility of its distributions :myCSVDatasetDistribution and :mySPARQLDatasetDistribution .
:myDataset
    a dcat:Dataset ;
    dcterms:title "My dataset" ;
    dcat:distribution :myCSVDatasetDistribution, :mySPARQLDatasetDistribution .

:myCSVDatasetDistribution
    a dcat:Distribution ;
    dcat:downloadURL <http://www.example.org/files/mydataset.csv> ;
    dcterms:title "CSV distribution of dataset" ;
    dcat:mediaType "text/csv" ;
    dcat:byteSize "87120"^^xsd:decimal .

:mySPARQLDatasetDistribution
    a dcat:Distribution ;
    dcat:accessURL <http://www.example.org/sparql> ;
    dcterms:title "SPARQL access to the dataset" ;
    dcat:mediaType "sparql-results+json" .

# Definition of dimensions and metrics
:availability
    a dqv:Dimension ;
    skos:prefLabel "Availability"@en ;
    skos:definition "Availability of a dataset is the extent to which data (or some portion of it) is present, obtainable and ready for use."@en ;
    dqv:inCategory :accessibility .

:downloadURLAvailabilityMetric
    a dqv:Metric ;
    skos:definition "Checks if dcat:downloadURL is available and if its value is dereferenceable."@en ;
    dqv:expectedDataType xsd:boolean ;
    dqv:inDimension :availability .

:SPARQLAvailabilityMetric
    a dqv:Metric ;
    skos:definition "Checks if an URL specified in dcat:accessURL is available and if at that URL a SPARQL endpoint is active."@en ;
    dqv:expectedDataType xsd:boolean ;
    dqv:inDimension :availability .

:datasetAvailabilityMetric
    a dqv:Metric ;
    prov:wasDerivedFrom :downloadURLAvailabilityMetric, :SPARQLAvailabilityMetric ;
    skos:definition "Checks the availability of the specified distributions."@en ;
    dqv:expectedDataType xsd:boolean ;
    dqv:inDimension :availability .
Depending on the specific application context, the expression of this derivation can be kept at the level of the quality measurements. In the following, the measurement :measurement3 of the availability of :myDataset is derived from :measurement1 and :measurement2 .
:myCSVDatasetDistribution
    dqv:hasQualityMeasurement :measurement1 .

:mySPARQLDatasetDistribution
    dqv:hasQualityMeasurement :measurement2 .

:myDataset
    dqv:hasQualityMeasurement :measurement3 .

:measurement1
    a dqv:QualityMeasurement ;
    dqv:computedOn :myCSVDatasetDistribution ;
    dqv:isMeasurementOf :downloadURLAvailabilityMetric ;
    dqv:value "true"^^xsd:boolean .

:measurement2
    a dqv:QualityMeasurement ;
    dqv:computedOn :mySPARQLDatasetDistribution ;
    dqv:isMeasurementOf :SPARQLAvailabilityMetric ;
    dqv:value "false"^^xsd:boolean .

:measurement3
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDataset ;
    dqv:isMeasurementOf :datasetAvailabilityMetric ;
    prov:wasDerivedFrom :measurement1, :measurement2 ;
    dqv:value "false"^^xsd:boolean .
The classification of :myDataset as :three_stars can be derived from the result of a quality measurement :measurement2 .
:myDataset
    dqv:hasQualityAnnotation :myDatasetClassification .

:myDatasetClassification
    a dqv:UserQualityFeedback ;
    prov:wasDerivedFrom :measurement2 ;
    oa:hasTarget :myDataset ;
    oa:hasBody :three_stars ;
    oa:motivatedBy dqv:qualityAssessment, oa:classifying ;
    dqv:inDimension :availability .

:three_stars
    a skos:Concept ;
    skos:inScheme :OpenData5Star ;
    skos:prefLabel "Three stars"@en ;
    skos:definition "Dataset available on the web with structured machine-readable non proprietary format."@en .
Let’s consider myControlledVocabulary , a controlled vocabulary made available on the Web using SKOS [ SKOS-reference ] and DCAT [ vocab-dcat ].
:myControlledVocabulary
    a dcat:Dataset ;
    dcterms:title "My controlled vocabulary" .

:myControlledVocabularyDistribution
    a dcat:Distribution ;
    dcat:downloadURL <http://www.example.org/files/myControlledVocabulary.csv> ;
    dcterms:title "SKOS/RDF distribution of my controlled vocabulary" ;
    dcat:mediaType "text/turtle" ;
    dcat:byteSize "190120"^^xsd:decimal .
qSKOS is an open source tool, which detects quality issues affecting SKOS vocabularies [ qSKOS ]. It considers 26 quality issues including, for example, “Incomplete Language Coverage” and “Label Conflicts” which are grouped in the category “Labeling and Documentation issues”. Quality issues addressed by qSKOS can be considered as DQV quality dimensions, whilst the number of concepts in which a quality issue occurs can be the metric deployed for each quality dimension.
# Definition of instances for some of the metrics, dimensions and categories deployed
# in qSKOS.
:numOfConceptsWithLabelConflicts
    a dqv:Metric ;
    skos:prefLabel "Conflicting concepts"@en ;
    skos:definition "Number of concepts having conflicting labels"@en ;
    dqv:expectedDataType xsd:integer ;
    dqv:inDimension :LabelConflicts .

:numOfConceptsWithIncompleteLanguageCoverage
    a dqv:Metric ;
    skos:prefLabel "Language incomplete concepts"@en ;
    skos:definition "Number of concepts having an incomplete language coverage"@en ;
    dqv:expectedDataType xsd:integer ;
    dqv:inDimension :incompleteLanguageCoverage .

:LabelConflicts
    a dqv:Dimension ;
    skos:prefLabel "Label Conflicts"@en ;
    skos:definition "Dimension corresponding to the label conflicts quality issue"@en ;
    dqv:inCategory :labelingDocumentationIssues .

:incompleteLanguageCoverage
    a dqv:Dimension ;
    skos:prefLabel "Incomplete Language Coverage"@en ;
    skos:definition "Dimension corresponding to the incomplete language coverage issue"@en ;
    dqv:inCategory :labelingDocumentationIssues .

:labelingDocumentationIssues
    a dqv:Category ;
    skos:prefLabel "Labeling and Documentation Issues"@en ;
    skos:definition "Category grouping labeling and documentation issues"@en .
DQV represents the qSKOS quality assessment on myControlledVocabulary for the dimensions “Incomplete Language Coverage” and “Label Conflicts”.
:myControlledVocabulary
    dqv:hasQualityMeasurement :measurement1, :measurement2 .

:measurement1
    a dqv:QualityMeasurement ;
    dqv:computedOn :myControlledVocabulary ;
    dqv:isMeasurementOf :numOfConceptsWithLabelConflicts ;
    dqv:value "1500"^^xsd:integer .

:measurement2
    a dqv:QualityMeasurement ;
    dqv:computedOn :myControlledVocabulary ;
    dqv:isMeasurementOf :numOfConceptsWithIncompleteLanguageCoverage ;
    dqv:value "450"^^xsd:integer .
(VoID) linksets are collections of (RDF) links between two datasets. Linksets are as important as datasets when it comes to the joint exploitation of independently served datasets in linked data. The representation of quality for a linkset offers a further example of how DQV can be exploited.
Let’s define three DCAT datasets, including one VoID linkset, which connects the two others:
:myDataset1
    a dcat:Dataset ;
    dcterms:title "My dataset 1" .

:myDataset2
    a dcat:Dataset ;
    dcterms:title "My dataset 2" .

:myLinkset
    a dcat:Dataset, void:Linkset ;
    dcterms:title "A Linkset between My dataset 1 and My dataset 2" ;
    void:linkPredicate skos:exactMatch ;
    void:target :myDataset1 ;
    void:target :myDataset2 .
We can represent information about the quality of :myLinkset using the “Multilingual importing” [ MultilingualImporting ] linkset quality metric. This metrics works on linksets between datasets that include SKOS concepts [ SKOS-reference ]. It quantifies the information gain when adding the preferred labels or the alternative labels of the concepts from a linked dataset to the descriptions of the concepts from the other dataset, which these concepts have been matched with a skos:exactMatch statement from the linkset. We must first define the proper metric, dimension and category.
    # Definition of instances for Metric, Dimension and Category.
    :importingForPropertyPercentage
        a dqv:Metric ;
        skos:definition "Ratio between novel preferred or alternative labels gained via skos:exactMatch links and preferred or alternative labels already in the dataset."@en ;
        dqv:expectedDataType xsd:double ;
        dqv:inDimension :completenessGain .

    :completenessGain
        a dqv:Dimension ;
        skos:prefLabel "Completeness Gain"@en ;
        skos:definition "Degree to which a linkset contributes to obtaining all required information in a particular dataset."@en ;
        dqv:inCategory :complementationGain .

    :complementationGain
        a dqv:Category ;
        skos:definition "Category that groups dimensions measuring the data quality gain obtained by exploiting linksets."@en .
The quality assessment of the "label importing" can depend on two extra parameters, onProperty and onLanguage: respectively, the SKOS property and the language tag considered for measuring the completeness gains. We extend DQV to represent these parameters.
    :onLanguage
        a qb:DimensionProperty, owl:DatatypeProperty ;
        rdfs:comment "language on which label importing is assessed."@en ;
        rdfs:domain dqv:QualityMeasurement ;
        rdfs:label "label import assessment language"@en .

    :onProperty
        a qb:DimensionProperty, rdf:Property ;
        rdfs:comment "property on which label importing is assessed."@en ;
        rdfs:domain dqv:QualityMeasurement ;
        rdfs:label "label import assessment property"@en ;
        rdfs:range rdf:Property .
Let us add actual quality assessments:
    :qualityMeasurementDataset
        a dqv:QualityMeasurementDataset ;
        qb:structure :dsd .

    :importingForPropertyPercentage
        dqv:hasObservation
            :measurement_exactMatchAltLabelItDataset1,
            :measurement_exactMatchAltLabelItDataset2,
            :measurement_exactMatchAltLabelEnDataset1,
            :measurement_exactMatchAltLabelEnDataset2,
            :measurement_exactMatchPrefLabelItDataset1,
            :measurement_exactMatchprefLabelItDataset2 .

    # Adding quality observations
    ## for Italian alternative labels
    :measurement_exactMatchAltLabelItDataset1
        a dqv:QualityMeasurement ;
        dqv:computedOn :myLinkset ;
        dqv:value "1.0"^^xsd:double ;
        dqv:isMeasurementOf :importingForPropertyPercentage ;
        qb:dataSet :qualityMeasurementDataset ;
        :onLanguage "it" ;
        :onProperty skos:altLabel .

    :measurement_exactMatchAltLabelItDataset2
        a dqv:QualityMeasurement ;
        dqv:computedOn :myLinkset ;
        dqv:value "1.0"^^xsd:double ;
        dqv:isMeasurementOf :importingForPropertyPercentage ;
        qb:dataSet :qualityMeasurementDataset ;
        :onLanguage "it" ;
        :onProperty skos:altLabel .

    ## for English alternative labels
    :measurement_exactMatchAltLabelEnDataset1
        a dqv:QualityMeasurement ;
        dqv:computedOn :myLinkset ;
        dqv:value "0.1"^^xsd:double ;
        dqv:isMeasurementOf :importingForPropertyPercentage ;
        qb:dataSet :qualityMeasurementDataset ;
        :onLanguage "en" ;
        :onProperty skos:altLabel .

    :measurement_exactMatchAltLabelEnDataset2
        a dqv:QualityMeasurement ;
        dqv:computedOn :myLinkset ;
        dqv:value "1.0"^^xsd:double ;
        dqv:isMeasurementOf :importingForPropertyPercentage ;
        qb:dataSet :qualityMeasurementDataset ;
        :onLanguage "en" ;
        :onProperty skos:altLabel .

    ## for Italian preferred labels
    :measurement_exactMatchPrefLabelItDataset1
        a dqv:QualityMeasurement ;
        dqv:computedOn :myLinkset ;
        dqv:value "0.5"^^xsd:double ;
        dqv:isMeasurementOf :importingForPropertyPercentage ;
        qb:dataSet :qualityMeasurementDataset ;
        :onLanguage "it" ;
        :onProperty skos:prefLabel .

    :measurement_exactMatchprefLabelItDataset2
        a dqv:QualityMeasurement ;
        dqv:computedOn :myLinkset ;
        dqv:value "0.5"^^xsd:double ;
        dqv:isMeasurementOf :importingForPropertyPercentage ;
        qb:dataSet :qualityMeasurementDataset ;
        :onLanguage "it" ;
        :onProperty skos:prefLabel .

Let us specify the RDF Data Cube data structure definition:
    :dsd a qb:DataStructureDefinition ;
        ## Copying the structure of daq:dsq
        qb:component [ qb:dimension dqv:computedOn ; qb:order 2 ] ;
        qb:component [ qb:measure dqv:value ] ;
        qb:component [ qb:dimension <http://purl.org/linked-data/sdmx/2009/dimension#timePeriod> ; qb:order 3 ] ;
        qb:component [ qb:dimension dqv:isMeasurementOf ; qb:order 1 ] ;
        # Attribute (here: unit of measurement)
        qb:component [ qb:attribute sdmx-attribute:unitMeasure ;
                       qb:componentRequired false ;
                       qb:componentAttachment qb:DataSet ] ;
        ## Extending the structure of lds:dsq with two new dimensions
        qb:component [ qb:dimension :onProperty ; qb:order 4 ] ;
        qb:component [ qb:dimension :onLanguage ; qb:order 5 ] .
It is often desirable to indicate that metadata about datasets in a catalogue is compliant with a metadata standard, or with an application profile of an existing metadata standard. A typical example is the GeoDCAT Application Profile [ GeoDCAT-AP ], an extension of the DCAT vocabulary [ vocab-dcat ] to represent metadata for geospatial data portals. GeoDCAT-AP makes it possible to express that a dataset's metadata conforms to an existing standard, following the recommendations of ISO 19115, ISO 19157 and the EU INSPIRE directive. DCAT partly supports the expression of such metadata conformance statements. The following example illustrates how a (DCAT) catalog record can be said to be conformant with the GeoDCAT-AP standard itself.
    :myDataset a dcat:Dataset .

    :myDatasetRecord
        a dcat:CatalogRecord ;
        foaf:primaryTopic :myDataset ;
        dcterms:conformsTo :geoDCAT-AP .

    :geoDCAT-AP
        a dcterms:Standard ;
        dcterms:title "GeoDCAT Application Profile. Version 1.0" ;
        dcterms:comment "GeoDCAT-AP is developed in the context of the Interoperability Solutions for European Public Administrations (ISA) Programme"@en ;
        dcterms:issued "2015-12-23"^^xsd:date ;
        foaf:page <https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/geodcat-ap-v10> .
Note that this example does not include the metadata about the dataset :myDataset itself. We assume this is present in an RDF data source accessible via the URI :myDatasetRecord. We also assume that :geoDCAT-AP is a reference URI that denotes the GeoDCAT-AP standard, which can be re-used across many catalog record descriptions, not just a locally introduced URI.
Finer-grained representations of conformance statements can be found in the literature, and applications with more complex requirements may implement them, including for example the requirement of representing 'non-conformance' tested by specific procedures. The GeoDCAT Application Profile, for example, suggests a "provisional mapping" for extended profiles, which re-uses the PROV data model for provenance (see Annex II.14 at [ GeoDCAT-AP ]). Such patterns come, however, at the cost of having to publish and exchange representations that are much more elaborate. They will also have to be further aligned with the results of other ongoing efforts on data validation and the reporting thereof, as currently discussed around SHACL, for example. We have thus decided to postpone addressing these requirements for now.
DQV introduces the class dqv:QualityPolicy to express that a Dataset or Distribution follows a policy or agreement that is chiefly defined by data quality concerns. DQV does not provide a complete framework for expressing policies. The class dqv:QualityPolicy is rather meant as an anchor point, through which DQV implementers can relate properties and classes of policy-dedicated vocabularies, such as ODRL [ ODRL ], to the core elements that define quality of datasets and distributions.

The example below specifies that a data provider grants the permission to access a dataset and commits to serve the data with a certain quality: more concretely, 99% availability of a SPARQL endpoint (distribution) associated with the dataset. This is expressed in ODRL as an offer with a duty on the service provider that states a constraint defined using a DQV metric ( sparqlEndpointUptime ), for which measurements have to be greater than or equal to a certain percentage (99). The odrl:assigner is the issuer of the policy statement; it is also the assignee of the duty to deliver the distribution as the policy requires it. There is no explicitly mentioned recipient for the policy itself, since this example is about a generic data access scenario. Note that instances of dqv:QualityPolicy could be instances of the class odrl:Agreement, in which case an odrl:assignee is likely to appear for the policy.
    :serviceProvider a odrl:Party .

    :myDataset
        a dcat:Dataset ;
        dcat:distribution :myDatasetSparqlDistribution .

    :myDatasetSparqlDistribution a dcat:Distribution .

    :policy1
        a odrl:Offer, dqv:QualityPolicy ;
        odrl:permission [
            a odrl:Permission ;
            odrl:target :myDataset ;
            odrl:action odrl:read ;
            odrl:assigner :serviceProvider ;
            odrl:duty [
                a odrl:Duty ;
                odrl:assignee :serviceProvider ;
                odrl:target :myDatasetSparqlDistribution ;
                odrl:constraint [
                    a odrl:Constraint ;
                    prov:wasDerivedFrom :sparqlEndpointUptime ;
                    odrl:percentage "99"^^xsd:double ;
                    odrl:operator odrl:gteq
                ]
            ]
        ] .
The expression of constraints in ODRL seems quite unfit for expressing general constraints on values in RDF graphs, as we would require here. However, ODRL can be easily extended, and is scheduled to undergo refinement in the context of the W3C Permissions & Obligations Expression Working Group. In the future, implementers should investigate whether a general constraint expression language like the coming SHACL [ SHACL ] provides a more appropriate mechanism to be used on top of ODRL permissions and duties.
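
As a rough illustration of the SHACL alternative mentioned above, the uptime requirement could be stated as a shape over quality measurements. This is only a sketch under stated assumptions: it uses the SHACL vocabulary as published under http://www.w3.org/ns/shacl#, and `:SparqlUptimeShape`, `:uptimeMeasurement1`, the value 99.5 and the `http://example.org/` namespace are hypothetical and do not appear elsewhere in this document.

```turtle
@prefix :    <http://example.org/> .            # hypothetical namespace
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical measurement of the uptime metric used in the policy example.
:uptimeMeasurement1
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDatasetSparqlDistribution ;
    dqv:isMeasurementOf :sparqlEndpointUptime ;
    dqv:value "99.5"^^xsd:double .

# Shape stating that this measurement must report at least 99 (percent).
:SparqlUptimeShape
    a sh:NodeShape ;
    sh:targetNode :uptimeMeasurement1 ;
    sh:property [
        sh:path dqv:isMeasurementOf ;
        sh:hasValue :sparqlEndpointUptime
    ] ;
    sh:property [
        sh:path dqv:value ;
        sh:datatype xsd:double ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:minInclusive "99"^^xsd:double
    ] .
```

Targeting a single node keeps the sketch small; a deployment would more likely select all measurements of the metric, which this example does not attempt.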
The need for documenting data precision (also sometimes referred to as "resolution") is a common requirement, in particular when dealing with spatial data. The following example shows how DQV can meet this requirement.
    :myDataset
        a dcat:Dataset ;
        dqv:hasQualityMeasurement :myDatasetPrecision, :myDatasetAccuracy .

    :myDatasetPrecision
        a dqv:QualityMeasurement ;
        dqv:isMeasurementOf :spatialResolutionAsDistance ;
        dqv:value "1000"^^xsd:decimal ;
        sdmx-attribute:unitMeasure <http://www.wurvoc.org/vocabularies/om-1.8/metre> .

    :spatialResolutionAsDistance
        a dqv:Metric ;
        skos:definition "Spatial resolution of a dataset expressed as distance"@en ;
        dqv:expectedDataType xsd:decimal ;
        dqv:inDimension dqv:precision .
Precision can alternatively be expressed without a unit of measure, by specifying spatial resolution as an "equivalent scale" given as a fraction (e.g., 1:1,000, 1:1,000,000)
:myDataset a dcat:Dataset; dqv:hasQualityMeasurement :myDatasetPrecisionES . :spatialResolutionAsEquivalentScale a dqv:Metric; skos:definition "Spatial resolution of a dataset expressed as equivalent scale, by using a representative fraction (e.g., 1:1,000, 1:1,000,000)."@en ; dqv:expectedDataType xsd:decimal ; dqv:inDimension dqv:precision . :myDatasetPrecisionES a dqv:QualityMeasurement ; dqv:isMeasurementOf :spatialResolutionAsEquivalentScale ; dqv:value "0.000001"^^xsd:decimal .
or specifying the angular distance.
    :myDataset
        a dcat:Dataset ;
        dqv:hasQualityMeasurement :myDatasetPrecisionAS .

    :spatialResolutionAsAngularDistance
        a dqv:Metric ;
        skos:definition "Spatial resolution of a dataset expressed as angular distance"@en ;
        dqv:expectedDataType xsd:decimal ;
        dqv:inDimension dqv:precision .

    :myDatasetPrecisionAS
        a dqv:QualityMeasurement ;
        dqv:isMeasurementOf :spatialResolutionAsAngularDistance ;
        dqv:value "3.5"^^xsd:decimal ;
        sdmx-attribute:unitMeasure <http://www.wurvoc.org/vocabularies/om-1.8/degree> .
Note that the precision (or resolution) of a dataset is not equivalent to its accuracy. High precision values are not necessarily accurate. High precision values can even be pointless, as when one asserts that Magna Carta was signed at 1215-06-15T00:00:00. Accuracy is nonetheless an important dimension of data quality. Data accuracy metrics and measurements can be represented with DQV, as in the following example:
    :myDatasetAccuracy
        a dqv:QualityMeasurement ;
        dqv:isMeasurementOf :spatialAccuracy ;
        dqv:value "98.2"^^xsd:decimal ;
        sdmx-attribute:unitMeasure <http://www.wurvoc.org/vocabularies/om-1.8/Percentage> .

    :spatialAccuracy
        a dqv:Metric ;
        skos:definition "Percentage of spatial elements that are found accurate according to methodology XYZ"@en ;
        dqv:expectedDataType xsd:decimal ;
        dqv:inDimension ldqd:semanticAccuracy .
This section gathers relevant quality dimensions and ideas for corresponding metrics, which might eventually be represented as instances of dqv:Category, dqv:Dimension and dqv:Metric. The goal of the Data Quality Vocabulary is not to define a normative list of dimensions and metrics. There are already several reference classifications available, which are the result of a lot of community work. Unifying them here seems both hard and undesirable, as fundamental approaches to quality vary between domains or even applications. This section instead provides a set of examples, starting from use cases included in the Use Cases & Requirements document. In particular, we offer the quality dimensions proposed in ISO 25012 [ ISOIEC25012 ] and Zaveri et al. [ ZaveriEtAl ] as two starting points.

Ultimately, implementers will need to choose for themselves the approach that best fits their needs. They can extend these starting points, creating their own refinements of categories and dimensions, and of course their own metrics. They can mix existing approaches: we show that the proposals from ISO and Zaveri et al. are not completely disjoint. Implementers can also adopt completely different classifications, if existing ones do not fit their specific application scenarios. They should however be aware that relying on existing classifications and metrics increases interoperability, i.e., the chance that human and machine agents can properly understand and exploit their quality assessments.
The following table gives examples of statistics that can be computed on a dataset and interpreted as quality indicators by the data consumer. Some of them can be relevant for the dimensions listed in the rest of this section. The properties come from the VoID extension created for the Aether tool.
Observation | Suggested term |
---|---|
Number of distinct external resources linked to | http://ldf.fi/void-ext#distinctIRIReferenceObjects |
Number of distinct external resources used (including schema terms) | http://ldf.fi/void-ext#distinctIRIReferences |
Number of distinct literals | http://ldf.fi/void-ext#distinctLiterals |
Number of languages used | http://ldf.fi/void-ext#languages |
The Aether VoID extension represents statistics as direct statements that have a dataset as subject and an integer as object. This pattern, which can be expected to be rather common, is different from the pattern that DQV inherits from daQ. Guidance on how DQV/daQ can work with other quality statistics vocabularies will be provided.
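
To make the difference between the two patterns concrete, the sketch below states the same statistic twice: once as a direct statement with the Aether property listed in the table above, and once as a DQV measurement. The metric `:numberOfLanguages`, the value 3, the choice of ldqd:versatility as its dimension, the `http://example.org/` namespace and the `ldqd` namespace URI ending in `#` are illustrative assumptions, not definitions from this document.

```turtle
@prefix :         <http://example.org/> .       # hypothetical namespace
@prefix dcat:     <http://www.w3.org/ns/dcat#> .
@prefix dqv:      <http://www.w3.org/ns/dqv#> .
@prefix ldqd:     <http://www.w3.org/2016/05/ldqd#> .   # assumed namespace URI
@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
@prefix void-ext: <http://ldf.fi/void-ext#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .

# Pattern 1: direct statement, dataset as subject and integer as object.
:myDataset
    a dcat:Dataset ;
    void-ext:languages 3 .

# Pattern 2: the same statistic as a DQV measurement of a metric.
:numberOfLanguages
    a dqv:Metric ;
    skos:definition "Number of languages used in the dataset."@en ;
    dqv:expectedDataType xsd:integer ;
    dqv:inDimension ldqd:versatility .

:myDataset dqv:hasQualityMeasurement :languagesMeasurement .

:languagesMeasurement
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDataset ;
    dqv:isMeasurementOf :numberOfLanguages ;
    dqv:value "3"^^xsd:integer .
```

The second pattern is more verbose, but the measurement becomes a resource of its own that can carry further statements, such as the extra parameters used in the linkset example.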
ISO/IEC 25012 provides an example of quality dimensions grouped in three categories that can be adopted to document the quality of datasets. These quality dimensions and categories are listed in the table below.
Category | Dimension | Definition |
---|---|---|
Inherent Data Quality | Accuracy | The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use. |
Inherent Data Quality | Completeness | The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use. |
Inherent Data Quality | Consistency | The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. It can be either or both among data regarding one entity and across similar data for comparable entities. |
Inherent Data Quality | Credibility | The degree to |
Inherent Data Quality | Currentness | The degree to which data has attributes that are of the |
Inherent and System-Dependent Data Quality | Accessibility | The degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability. |
Inherent and System-Dependent Data Quality | Compliance | The degree to which data has attributes that adhere to standards, conventions or regulations in force and similar rules relating to data quality in a specific context of use. |
Inherent and System-Dependent Data Quality | Confidentiality | The degree to which data |
Inherent and System-Dependent Data Quality | Efficiency | The degree to which data has attributes that can be processed and provide the |
Inherent and System-Dependent Data Quality | Precision | The degree to which data has attributes that are exact or that provide discrimination in a specific context of use. |
Inherent and System-Dependent Data Quality | Traceability | The degree to which data has attributes that provide an audit trail of access to the data and of any changes made to the data in a specific context of use. |
Inherent and System-Dependent Data Quality | Understandability | The degree to which data has attributes that enable it to be |
System-Dependent Data Quality | Availability | The degree to which data has attributes that |
System-Dependent Data Quality | Portability | The degree to which data has attributes that enable it to be installed, replaced or moved from one system to another preserving the |
System-Dependent Data Quality | Recoverability | The degree to which data has attributes that enable it to maintain and preserve a specified level of operations and quality, even in the event of failure, in a specific context of use. |
DQV can express the dimensions and categories listed in the table above. The following example is only an illustration of the ISO dimensions and categories, which should be authoritatively provided by ISO. Semantic relations defined in SKOS can be exploited to relate categories and dimensions. For example, in the following, skos:broader is used to define iso:inherentSystemDependentDataQuality as a specialization of iso:inherentDataQuality and iso:systemDependentDataQuality.
    # definition of ISO categories
    iso:inherentDataQuality
        a dqv:Category ;
        skos:prefLabel "Inherent Data Quality"@en .

    iso:systemDependentDataQuality
        a dqv:Category ;
        skos:prefLabel "System-Dependent Data Quality"@en .

    iso:inherentSystemDependentDataQuality
        a dqv:Category ;
        skos:prefLabel "Inherent and System-Dependent Data Quality"@en ;
        skos:broader iso:inherentDataQuality, iso:systemDependentDataQuality .

    # definition of ISO dimensions
    iso:accuracy
        a dqv:Dimension ;
        dqv:inCategory iso:inherentDataQuality ;
        skos:prefLabel "Accuracy"@en ;
        skos:definition "The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use."@en .

    iso:completeness
        a dqv:Dimension ;
        dqv:inCategory iso:inherentDataQuality ;
        skos:prefLabel "Completeness"@en ;
        skos:definition "The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use."@en .

    iso:consistency
        a dqv:Dimension ;
        dqv:inCategory iso:inherentDataQuality ;
        skos:prefLabel "Consistency"@en ;
        skos:definition "The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. It can be either or both among data regarding one entity and across similar data for comparable entities."@en .

    # ... ...

    iso:accessibility
        a dqv:Dimension ;
        dqv:inCategory iso:inherentSystemDependentDataQuality ;
        skos:prefLabel "Accessibility"@en ;
        skos:definition "The degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability."@en .

    # ... etc ...
Zaveri et al. provide a review of quality dimensions that is specifically suited for linked data [ ZaveriEtAl ].
Category | Dimension | Definition |
---|---|---|
Accessibility dimensions | Availability | Availability of a dataset is the extent to which data (or some portion of it) is present, obtainable and ready for use. |
Accessibility dimensions | Licensing | Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions. |
Accessibility dimensions | Interlinking | Interlinking refers to the degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources. |
Accessibility dimensions | Security | Security is the extent to |
Accessibility dimensions | Performance | Performance refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data |
Intrinsic dimensions | Syntactic validity | Syntactic validity is defined as the degree to |
Intrinsic dimensions | Semantic accuracy | Semantic accuracy is defined as the degree to which data values correctly |
Intrinsic dimensions | Consistency | Consistency means that a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms. |
Intrinsic dimensions | Conciseness | Conciseness refers to the minimization of redundancy of entities at the schema and the data |
Intrinsic dimensions | Completeness | Completeness refers to the degree to which all required information is present in |
Contextual dimensions | Relevancy | Relevancy refers to the |
Contextual dimensions | Trustworthiness | Trustworthiness is defined as the degree to which the information is accepted to be correct, true, real and |
Contextual dimensions | Understandability | Understandability refers to the ease with which data can be comprehended without |
Contextual dimensions | Timeliness | Timeliness measures how up-to-date data is relative to a specific task. |
Representational dimensions | Representational-conciseness | Representational-conciseness refers to the representation of |
Representational dimensions | Interoperability | Interoperability is the degree to which the format and structure of the information conforms to previously returned information as well as data from other sources. |
Representational dimensions | Interpretability | Interpretability refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether the machine is able to process the data. |
Representational dimensions | Versatility | Versatility refers to the availability of the data in different representations and in an internationalized way. |
DQV can express these dimensions and categories as shown in the following example. The encoding of all the dimensions and categories mentioned above can be found at http://www.w3.org/2016/05/ldqd .
    # definition of categories from Zaveri et al.
    ldqd:accessibilityDimensions
        a dqv:Category ;
        skos:prefLabel "Accessibility"@en .

    ldqd:intrinsicDimensions
        a dqv:Category ;
        skos:prefLabel "Intrinsic dimensions"@en .

    ldqd:contextualDimensions
        a dqv:Category ;
        skos:prefLabel "Contextual dimensions"@en .

    ldqd:representationalDimensions
        a dqv:Category ;
        skos:prefLabel "Representational Dimensions"@en .

    # definition of dimensions from Zaveri et al.
    ldqd:availability
        a dqv:Dimension ;
        dqv:inCategory ldqd:accessibilityDimensions ;
        skos:prefLabel "Availability"@en ;
        skos:definition "Availability of a dataset is the extent to which data (or some portion of it) is present, obtainable and ready for use."@en .

    ldqd:licensing
        a dqv:Dimension ;
        dqv:inCategory ldqd:accessibilityDimensions ;
        skos:prefLabel "Licensing"@en ;
        skos:definition "Licensing is defined as the granting of permission for a consumer to re-use a dataset under defined conditions."@en .

    ldqd:interlinking
        a dqv:Dimension ;
        dqv:inCategory ldqd:accessibilityDimensions ;
        skos:prefLabel "Interlinking"@en ;
        skos:definition "Interlinking refers to the degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources."@en .

    # ... etc ...
In Zaveri et al. [ ZaveriEtAl ], dimensions are not completely independent and may be related. These relationships can be represented in DQV by using the appropriate SKOS properties, or by specializing the SKOS properties if more specific semantics must be expressed. For example, availability is related to performance and semantic accuracy, whilst semanticAccuracy is related to timeliness, trustworthiness, consistency, syntaticValidity and completeness.
ldqd:availability skos:related ldqd:performance , ldqd:interlinking . ldqd:semanticAccuracy skos:related ldqd:timeliness , ldqd:trustworthiness , ldqd:consistency , ldqd:syntaticValidity , ldqd:completeness , ldqd:interlinking . ldqd:consistency skos:related ldqd:conciseness , ldqd:syntaticValidity , ldqd:interoperability . ldqd:interoperability skos:related ldqd:conciseness , ldqd:syntaticValidity . ldqd:conciseness skos:related ldqd:completeness , ldqd:representationalConciseness . ldqd:interpretability skos:related ldqd:versatility . # Note: skos:related is a symmetric property, hence from every statement # ex:subject skos:related ex:object in this example, one can infer that # the statement ex:object skos:related ex:subject is true.
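
Where skos:related is too weak, a specialized property can be declared as one of its sub-properties. The following is a minimal sketch of that option: the property `:influences`, its label and comment, the `http://example.org/` namespace and the assumed `ldqd` namespace URI are all hypothetical, and the direction chosen between the two dimensions is purely illustrative.

```turtle
@prefix :     <http://example.org/> .            # hypothetical namespace
@prefix ldqd: <http://www.w3.org/2016/05/ldqd#> .   # assumed namespace URI
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Hypothetical directed relation between quality dimensions.
:influences
    a rdf:Property ;
    rdfs:subPropertyOf skos:related ;
    rdfs:label "influences"@en ;
    rdfs:comment "The subject dimension has an effect on assessments made for the object dimension."@en .

# Illustrative use: a directed statement replaces the plain skos:related link.
ldqd:performance :influences ldqd:availability .
```

Because `:influences` is a sub-property of skos:related, an RDFS-aware consumer can still infer the generic skos:related link, while the specialized property itself is not required to be symmetric.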
Dimensions can also be related across different categorizations. For example, in the following, we present two possible links between dimensions from ISO/IEC 25012 [ ISOIEC25012 ] and Zaveri et al. Here we assume that completeness is equivalent across both classifications and that ISO's credibility is one specific facet of trustworthiness in Zaveri et al. (see Definition 12 in [ ZaveriEtAl ]). We pencil more such possible relationships in Annex C.
ldqd:completeness skos:exactMatch iso:completeness . ldqd:trustworthiness skos:narrowMatch iso:credibility .
This section presents examples of metrics inspired by those reviewed in Zaveri et al. [ ZaveriEtAl ], in order to further illustrate how dqv:Metric can be instantiated. Note that they are not all specific to linked data quality, as some dimensions in Zaveri et al. match the dimensions of ISO/IEC 25012 (see previous sub-section and Annex). These examples are just some of the possible ones. They show metrics for different dimensions and kinds of dataset distributions.

We might consider reorganizing examples around specific criteria (e.g., include at least a metric for each dimension, or focus on metrics for a specific kind of distribution, e.g., RDF, JSON, CSV). We might also consider adding further examples about derived metrics, multivalued metrics and extra parameters, once we have solved the remaining issues.
    :downloadURLAvailabilityMetric
        a dqv:Metric ;
        skos:definition "It checks if dcat:downloadURL is available and if its value is dereferenceable."@en ;
        dqv:inDimension ldqd:availability ;
        dqv:expectedDataType xsd:boolean .

    :sparqlAvailabilityMetric
        a dqv:Metric ;
        skos:definition "It checks if a void:sparqlEndpoint is specified for a dataset and if the server responds to a SPARQL query."@en ;
        dqv:inDimension ldqd:availability ;
        dqv:expectedDataType xsd:boolean .

    :misreportedContentTypeMetric
        a dqv:Metric ;
        skos:definition "It detects whether the HTTP response contains the header field stating the appropriate content type of the returned file, e.g. application/rdf+xml"@en ;
        dqv:inDimension ldqd:availability ;
        dqv:expectedDataType xsd:boolean .

    :licensingMetric
        a dqv:Metric ;
        skos:definition "It detects the indication of a license in the DCAT/VoID description or in the dataset itself."@en ;
        dqv:inDimension ldqd:licensing ;
        dqv:expectedDataType xsd:boolean .

    :highThroughput
        a dqv:Metric ;
        skos:definition "It represents the maximum number of answered HTTP-requests per second."@en ;
        dqv:inDimension ldqd:performance ;
        dqv:expectedDataType xsd:integer .

    :sparqlScalability
        a dqv:Metric ;
        skos:definition "It detects whether the time to answer an amount of ten requests divided by ten is not longer than the time it takes to answer one request."@en ;
        dqv:inDimension ldqd:performance ;
        dqv:expectedDataType xsd:boolean .

    :noRDFSyntaxError
        a dqv:Metric ;
        skos:definition "It returns the number of syntax errors detected by an RDF validator."@en ;
        dqv:inDimension ldqd:syntacticValidity ;
        dqv:expectedDataType xsd:integer .

    :noJSONSyntaxError
        a dqv:Metric ;
        skos:definition "It returns the number of syntax errors detected by a JSON validator."@en ;
        dqv:inDimension ldqd:syntacticValidity ;
        dqv:expectedDataType xsd:integer .

    :populationCompletenessMetric
        a dqv:Metric ;
        skos:definition "Ratio between the number of objects represented in the dataset and the number of objects expected to be represented according to the declared dataset scope."@en ;
        dqv:inDimension ldqd:completeness ;
        dqv:expectedDataType xsd:double .
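
Once such metrics are defined, assessments are recorded as measurements that point back to them. The sketch below shows this for two of the metrics above; the distribution `:myDistribution`, its download URL, the measurement names, the reported values and the `http://example.org/` namespace are hypothetical and serve only to illustrate the pattern.

```turtle
@prefix :     <http://example.org/> .            # hypothetical namespace
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical distribution that is being assessed.
:myDistribution
    a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/mydataset.csv> ;
    dqv:hasQualityMeasurement :availabilityMeasurement, :throughputMeasurement .

# Boolean result, matching the expected datatype of the availability metric.
:availabilityMeasurement
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDistribution ;
    dqv:isMeasurementOf :downloadURLAvailabilityMetric ;
    dqv:value "true"^^xsd:boolean .

# Integer result, matching the expected datatype of the throughput metric.
:throughputMeasurement
    a dqv:QualityMeasurement ;
    dqv:computedOn :myDistribution ;
    dqv:isMeasurementOf :highThroughput ;
    dqv:value "120"^^xsd:integer .
```

In each measurement the datatype of dqv:value follows the dqv:expectedDataType declared on the corresponding metric.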
The UCR document lists relevant requirements for data quality and granularity:
The aforementioned requirements have been further elaborated and extended by new use cases and examples, following discussions on the DWBP WG's mailing list and wiki pages (see here and here ), as well as external contributions during the review process (see the general list of issues, which includes external feedback).
The W3C Health Care and Life Sciences Community Group has created a DCAT profile for describing datasets. This work is highly visible and used in the HCLS community. DQV should be aligned with this profile if there are overlapping areas. Are there such areas? ( Issue-221 )
The editors acknowledge the chairs of this Working Group: Hadley Beeman, Yaso Córdova, Deirdre Lee and the staff contact Phil Archer.
The editors also gratefully acknowledge the contributions made to this document by all members of the working group, especially the contributions received from Ghislain Auguste Atemezing, Carlos Laufer, Annette Greiner, Michel Dumontier, Eric Stephan.
The editors would also like to thank those who sent comments from outside this working group, such as Andrea Perego, Rachel E. Heaven, Linda van den Brink, Werner Bailer, Jon Blower, Guillaume Duffes, Davide Ceolin, Anisa Rula.
Changes since the previous version include:
The dimensions listed in ISO/IEC 25012 [ ISOIEC25012 ] and Zaveri et al. [ ZaveriEtAl ] are not disjoint. Assuming that dimensions are expressed as instances of skos:Concept, the following table includes some of the correspondences that can be considered between these two classifications.
Dimension from Zaveri et al. | Dimension from ISO/IEC 25012 | Suggested mapping relation |
---|---|---|
Availability | Availability | skos:exactMatch |
Completeness | Completeness | skos:exactMatch |
Consistency | Consistency | skos:exactMatch |
Timeliness | Currentness | skos:exactMatch |
Interoperability | Portability | skos:relatedMatch |
Interoperability | Compliance | skos:relatedMatch |
Semantic Accuracy | Accuracy | skos:broadMatch |
Trustworthiness | Credibility | skos:narrowMatch |
Trustworthiness | Traceability | skos:relatedMatch |
Understandability | Understandability | skos:exactMatch |
Interpretability | Understandability | skos:relatedMatch |
Versatility | Understandability | skos:broadMatch |
Syntactic Validity | Accuracy | skos:broadMatch |
Syntactic Validity | Compliance | skos:broadMatch |
Licensing | Accessibility | skos:relatedMatch |
Security | Traceability | skos:relatedMatch |
Security | Confidentiality | skos:relatedMatch |
Performance | Efficiency | skos:exactMatch |
Interlinking | Availability | skos:broadMatch |
Representational-conciseness | Compliance | skos:broadMatch |
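
A few rows of the table above can be written down as SKOS mapping statements, in the same style as the completeness and trustworthiness links shown earlier. This is a sketch only: the `iso` namespace URI is a placeholder, since this document does not define one, the `ldqd` namespace URI is assumed, and the local names simply follow the naming pattern used in the earlier examples.

```turtle
@prefix iso:  <http://example.org/iso25012#> .   # hypothetical placeholder namespace
@prefix ldqd: <http://www.w3.org/2016/05/ldqd#> .   # assumed namespace URI
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Each statement reads: Zaveri et al. dimension, mapping relation, ISO/IEC 25012 dimension.
ldqd:availability     skos:exactMatch   iso:availability .
ldqd:timeliness       skos:exactMatch   iso:currentness .
ldqd:performance      skos:exactMatch   iso:efficiency .
ldqd:interoperability skos:relatedMatch iso:portability, iso:compliance .
ldqd:semanticAccuracy skos:broadMatch   iso:accuracy .
```

As in SKOS generally, skos:broadMatch states that the object is the broader concept, so the last line says that ISO's accuracy is broader than semantic accuracy.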