Share-PSI Best Practice: Publish Statistical Data In Linked Data Format

Outline

Publish statistical data as Linked Data using W3C’s RDF Data Cube vocabulary, which specifies a standardised, machine-readable way to express the data and identifies a recommended set of metadata terms for describing datasets.

Links to the Revised PSI Directive

Techniques

Challenge

Statistical data is currently published in a range of formats and standards that do not allow linking across datasets. It is the foundation for policy prediction, planning and adjustment, and therefore has a significant impact on society (from citizens to businesses to governments). The process of collecting and monitoring socio-economic indicators can be considerably improved if the data produced by government organisations such as statistical offices, national banks and employment services is published in Linked Data format.

Solution

The Linked Data paradigm has opened new possibilities and perspectives for government organisations to open data and interchange information. Data is open if it is technically open (available in a machine-readable standard format, which means it can be retrieved and meaningfully processed by a computer application) and legally open (explicitly licensed in a way that permits commercial and non-commercial use and re-use without restrictions); see the World Bank Open Data Essentials.

The Linked Data approach enables datasets to be linked together through references to common concepts. A dataset is represented in the form of a graph, using the Resource Description Framework (RDF) as a general-purpose language. The Linked Data publication process is the set of activities related to the extraction, transformation, validation, exploration and publication on the Web of RDF datasets originating from different sources (e.g., databases). The ready-to-use RDF datasets can be either stored locally or registered in a metadata catalogue, e.g. one built with the open-source tool CKAN.
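As a rough illustration of linking through common concepts, the sketch below represents two small datasets as sets of (subject, predicate, object) triples in plain Python and joins them on shared subject URIs. All URIs, property names and figures are invented for the example, and the helper `link_by_subject` is hypothetical; a real publication would normally use an RDF library and a triple store rather than Python sets.

```python
# Hypothetical URIs and values, invented for illustration only.
REGION = "http://example.org/region/"
PROP = "http://example.org/prop/"

# Dataset A (e.g. from a statistical office): population per region.
population = {
    (REGION + "R1", PROP + "population", 120000),
    (REGION + "R2", PROP + "population", 450000),
}

# Dataset B (e.g. from an employment service): unemployment rate per region.
unemployment = {
    (REGION + "R1", PROP + "unemploymentRate", 7.5),
    (REGION + "R3", PROP + "unemploymentRate", 9.1),
}


def link_by_subject(graph_a, graph_b):
    """Merge two triple sets and keep the subjects described in both."""
    merged = {}
    for subject, predicate, value in graph_a | graph_b:
        merged.setdefault(subject, {})[predicate] = value
    subjects_a = {s for s, _, _ in graph_a}
    subjects_b = {s for s, _, _ in graph_b}
    return {s: merged[s] for s in subjects_a & subjects_b}


linked = link_by_subject(population, unemployment)
```

Because both publishers refer to the region by the same URI, the two datasets can be combined without any prior agreement on file layout; only region R1 appears in `linked`, since it is the only region described in both sources.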

In 2014, the RDF Data Cube Vocabulary was published by the W3C Government Linked Data Working Group as a Recommendation for publishing multi-dimensional data on the Web.

Why is this a Best Practice?

The approach contributes to the standardisation of the process of publishing and re-using multi-dimensional data on the Web. It is based on the RDF Data Cube vocabulary, which is mature enough to be used for publishing statistical data; it improves interoperability and allows comparison of data from different statistical sources. The model behind the vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organisations. The vocabulary provides a layer on top of the data to describe domain semantics, dataset metadata, and other crucial information needed in the process of statistical data exchange.

Cost implication: Costs of publication should be minimised unless there are clear benefits. A public sector body should analyse the current availability of data and the demand for it, and thus avoid unnecessary costs of transforming data into Linked Data format. Public sector bodies publishing information SHOULD either:

Publish it in the manner that involves lowest cost, consistent with making it available effectively and openly, or
Carry out cost-benefit analyses of the possible measures to assess potential use and stimulate take-up, methods of publication, and formats for publication, and select measures, methods and formats in the light of those analyses.

The risk of deciding what publication form will best deliver value (commercial or other value of public information), and the work of converting it to that form, could be left to commercial product and service providers, and other consumers. If, due to cost implications, it is not possible to publish statistical data in Linked Data format, it is important to ensure that third parties can transform the provided format into the RDF Data Cube Vocabulary. The multidimensional data model used by the RDF Data Cube Vocabulary (n-dimensional data cubes as datasets, with observations, dimensions and measures) is sufficiently generic not to restrict publishers.
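To make the multidimensional model more concrete, here is a minimal sketch in plain Python. The `Cube` class, its method names and the sample figures are all hypothetical and are not part of the RDF Data Cube Vocabulary; the sketch only mirrors the idea that a dataset consists of observations, each giving a value for every dimension and measure.

```python
from dataclasses import dataclass, field


@dataclass
class Cube:
    """A dataset: named dimensions and measures plus a list of observations."""
    dimensions: list
    measures: list
    observations: list = field(default_factory=list)

    def add(self, **values):
        """Add one observation; it must cover exactly the declared components."""
        required = set(self.dimensions) | set(self.measures)
        given = set(values)
        if given != required:
            missing = sorted(required - given)
            extra = sorted(given - required)
            raise ValueError(f"missing={missing} unexpected={extra}")
        self.observations.append(values)

    def slice(self, **fixed):
        """Return the observations whose component values match `fixed`."""
        return [o for o in self.observations
                if all(o[k] == v for k, v in fixed.items())]


# Hypothetical example cube: unemployment rate by region, year and sex.
cube = Cube(dimensions=["region", "year", "sex"], measures=["rate"])
cube.add(region="R1", year=2013, sex="F", rate=7.9)
cube.add(region="R1", year=2014, sex="F", rate=7.5)
cube.add(region="R2", year=2014, sex="F", rate=9.1)
```

Fixing one or more dimensions, for example `cube.slice(region="R1")`, selects a subset of the cube. Any tabular statistical dataset whose columns can be classified as dimensions or measures fits this shape, which is why the model places few constraints on publishers.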

Such transformations have been demonstrated for other common statistical data formats such as SDMX, XBRL, and the Dataset Publishing Language. If sufficient metadata is provided, transformation scripts can also be written for CSV and spreadsheet (e.g., Microsoft Excel) data.
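A transformation script for CSV could look roughly like the following sketch, which uses only the Python standard library and writes Turtle text directly. The terms `qb:DataSet`, `qb:Observation` and `qb:dataSet` come from the RDF Data Cube Vocabulary; everything under the `ex:` prefix, the sample data and the function `csv_to_turtle` are assumptions made for illustration. The lists `DIMENSIONS` and `MEASURES` stand in for the "sufficient metadata" that tells the script how to interpret each column.

```python
import csv
import io

# Hypothetical input; in practice this would be read from a file.
CSV_TEXT = """region,year,rate
R1,2013,7.9
R1,2014,7.5
"""
DIMENSIONS = ["region", "year"]  # columns that identify an observation
MEASURES = ["rate"]              # columns that hold observed values

PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix ex: <http://example.org/> .\n"
)


def csv_to_turtle(text, dataset="ex:dataset1"):
    """Emit one qb:Observation per CSV row as Turtle text."""
    blocks = [PREFIXES, f"{dataset} a qb:DataSet ."]
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        pairs = ["a qb:Observation", f"qb:dataSet {dataset}"]
        # Simplification: dimension values are plain literals here;
        # a real cube would usually point to URIs from a code list.
        pairs += [f'ex:{col} "{row[col]}"' for col in DIMENSIONS]
        pairs += [f"ex:{col} {float(row[col])}" for col in MEASURES]
        blocks.append(f"ex:obs{i} " + " ;\n    ".join(pairs) + " .")
    return "\n\n".join(blocks)


turtle = csv_to_turtle(CSV_TEXT)
```

The sketch leaves out the data structure definition (`qb:DataStructureDefinition` with its dimension and measure properties) that a complete cube also needs, and it does no escaping or validation of cell values.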

How do I implement this Best Practice?

This best practice relies on a set of tools for automating the data extraction and publication process. The EU research community has delivered many open-source tools for publishing statistical data in Linked Data format; see e.g. the LOD2 Statistical Workbench and the OpenCube toolkit.

Where has this best practice been implemented?

Country	Implementation	Contact Point
Italy	LinkedStat	SpazioDati and Istat
UK	Scottish Government Statistics	Scottish Government

References

Samos Workshop presentation: A Methodology for Publishing Linked Open Statistical Data (PDF), George Papastefanatos IMIS / RC Athena, Greece
Samos Workshop presentation: Publishing and Consuming Linked Open Data with the LOD Statistical Workbench, Valentina Janev, Institut Mihajlo Pupin

Contact Info

Original Author & editor: Valentina Janev, Institute Mihajlo Pupin; contributor: Benedikt Kämpgen, FZI Research Center for Information Technology