W3C

The Web Is More Than A USB Stick

Proposal for an Impact Contribution at the European Data Forum 2015
General Topics: Optimised architectures, Semantic Technologies & the Web of Data, Language Resources, Geospatial Techniques
Phil Archer, Data Activity Lead, World Wide Web Consortium. phila@w3.org
Relevant projects: Share-PSI Thematic Network, SmartOpenData, Big Data Europe

This paper is also the basis for a talk at Conferência Web.br 2015 Vamos "re-descentralizar a Web."


A USB stick on a traditional record card with some basic metadata
How the Web is being used to share data. Photo credit: Rosie Sutton

The Digital Economy as we know it became possible with the development of the Internet, beginning in the 1960s, and the invention of the Web in March 1989. The original proposal for the Web describes how people, ideas, experiments, projects and documents can all be linked using an infrastructure that requires minimal agreement.

With some notable exceptions, this is not how it's used.

In reality, the Internet is used as a glorified USB stick, allowing access to discrete datasets, or packages of datasets, that have no connection to the outside world or to each other. Metadata is posted on the Web, but the data itself generally isn't, so speed and convenience are the only differences between making data available online and responding to a query by putting a USB stick in the post.

In many cases, this is perfectly fine. A lot of eye-catching and innovative applications essentially visualise and interpret a single dataset or provide a user interface to a single dataset's API. These can provide real insights into a particular topic, such as daily energy consumption by government departments in Scotland; useful, actionable information, like current roadblocks in Ljubljana or current average speeds on Dutch roads; or transparency information, like Monithon, which follows EU money in Italy.

It's when you combine multiple datasets that previously hidden connections may be revealed. For example, the Slovenian Supervizor service connects public expenditure, contract data and the company register. This quickly allowed humans to spot that a school was giving an unusually high number of contracts to a local company, which might be evidence that the company offered excellent services or, as it turned out, that it was run by the head teacher's wife. In the Czech Republic, Léková encyklopedie and Justinian combine multiple datasets within their domains to offer services that save doctors and lawyers, respectively, a lot of time cross-referencing information to find important relationships. It's notable that high-quality services like these, the Portuguese Local Transparency Portal, or BMW's use of open data are developed by highly skilled engineers. They're expensive. They're not going to come out of a hackathon.

This is, at least in part, because it is hard to mix different datasets. It requires the investment of companies like Shoothill, or infomediaries like Euroalert, to understand that two different lines of data are talking about the same thing or the same place.
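To make the difficulty concrete, here is a minimal Python sketch of linking two datasets that label the same places differently. The records, the field names and the `normalise` and `link` helpers are all invented for illustration; they are not taken from any of the services mentioned above, and real record linkage needs far more than this crude normalisation.

```python
# Invented records: two datasets describing the same places with different labels.
expenditure = [
    {"place": "Ljubljana", "spend_eur": 1200},
    {"place": "MARIBOR ", "spend_eur": 800},
]
contracts = [
    {"municipality": "ljubljana", "contracts": 14},
    {"municipality": "Maribor", "contracts": 9},
]


def normalise(label):
    """Crude, hypothetical normalisation: trim whitespace and ignore case."""
    return label.strip().casefold()


def link(left, left_field, right, right_field):
    """Join two lists of records on a normalised place label."""
    index = {normalise(row[right_field]): row for row in right}
    linked = []
    for row in left:
        match = index.get(normalise(row[left_field]))
        if match is not None:
            # Merge the two records that appear to describe the same place.
            linked.append({**row, **match})
    return linked


linked = link(expenditure, "place", contracts, "municipality")
for record in linked:
    print(record)
```

Even in this toy case the matching logic has to be written by hand for each pair of datasets. If both publishers used the same Web identifier for each place, the normalisation step would not be needed at all.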

If only there were a means to link all this data together in a way that developers can easily get their hands on and use in their applications without having to work as hard as the folks at The Locator; if it were as easy to visualise multiple Slovenian Environmental Indicators as it is to create a map showing rates of violence against women in a region of Brazil.

That's what the Web is for. That's what is possible if data owners use the Web as a platform for the data itself, and not just a means to advertise the availability of zipped up data dumps.

Linked Data is part of the answer, but it's not the whole answer. It's not the whole answer because, for many developers, it's no less opaque than anything else and really, all they need is a RESTful API that returns JSON. And it's not the whole answer because office software (be it Microsoft or Open) doesn't generate or understand it either. The data analysis tool of choice for a great many people is Excel.

How do non-specialists put data on the Web? What's the minimum that the public sector must do to enable the Web of data? How can we expose multiple datasets in a way that makes it easy for regular Web developers to build applications that discover cross-domain relationships?

Since early 2013, the Share-PSI thematic network and the W3C Data on the Web Best Practices working group have been addressing this, developing and collating best practices that facilitate the broader data ecosystem. For Share-PSI, this means providing a set of best practices that can be referenced as Member States implement the revised PSI Directive.

More recently, initiated by the SmartOpenData project, the Open Geospatial Consortium and W3C have been collaborating to develop best practices for publishing spatial data on the Web. This is an essential component of the data infrastructure since the point of commonality between disparate datasets is often the locations to which they refer. The aim is to make better connections between the rich data available in geospatial infrastructures and the mass of information available on the Web. As an example of this, SmartOpenData applied Linked Data principles to data concerning natural habitats, protected sites and rural communities to conduct a series of pilots across Europe. As part of this work, a set of RDF schemas was developed that provides a practical bridge between INSPIRE and Linked Data.

W3C is proud to be part of the Big Data Europe project that is using semantics within its suite of tools designed to give non-specialists access to the potential of big data.

Other relevant work at W3C includes the standardisation of metadata for CSV and of algorithms for generating JSON or RDF from CSV described in that way. Future directions are likely to include standardised methods for simple, Web-developer-friendly access to arbitrarily complex data structures, providing an interface between communities as much as between technical formats, perhaps based around data owners (or infomediaries) providing sets of pre-cooked queries.
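The general idea of pairing a CSV file with metadata and generating JSON from the two can be sketched in a few lines of Python. This is an illustration only and does not implement the W3C algorithms: the `METADATA` structure, the column names, the sample rows and the `csv_to_records` helper are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical metadata describing each CSV column's name and datatype.
METADATA = {
    "columns": [
        {"name": "site", "datatype": "string"},
        {"name": "year", "datatype": "integer"},
        {"name": "area_ha", "datatype": "number"},
    ]
}

# Invented sample data; a real file would be fetched from the Web.
CSV_TEXT = "site,year,area_ha\nWetland A,2014,12.5\nForest B,2015,340\n"

# Map the assumed datatype names onto Python conversions.
CASTS = {"string": str, "integer": int, "number": float}


def csv_to_records(csv_text, metadata):
    """Turn CSV rows into typed dictionaries, guided by the metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        record = {}
        for column in metadata["columns"]:
            cast = CASTS[column["datatype"]]
            record[column["name"]] = cast(row[column["name"]])
        records.append(record)
    return records


records = csv_to_records(CSV_TEXT, METADATA)
print(json.dumps(records, indent=2))
```

Without the metadata, every value in the CSV would be an untyped string; with it, a developer receives JSON in which years are integers and areas are numbers, without having to open the file in a spreadsheet first.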

Today, the only viable platform for Europe's multilingual, cross-border, cross-domain data infrastructure is the Web. And that means more than just posting a few bits of metadata to describe data you could get on a USB stick.

Acknowledgements

This abstract refers to many examples of open data usage contributed by members of the Share-PSI Thematic Network.

Phil Archer
W3C Data Activity Lead
May 2015