Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document lists use cases, compiled by the Data on the Web Best Practices Working Group, that represent scenarios of how data is commonly published on the Web and how it is used. This document also provides a set of requirements derived from these use cases that will be used to guide the development of the set of Data on the Web Best Practices and the development of two new vocabularies: Quality and Granularity Description Vocabulary and Data Usage Description Vocabulary.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is considered to be approaching its final version. Only one or, at most, two further iterations are expected during the lifetime of the working group.
This document was published by the Data on the Web Best Practices Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-dwbp-comments@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
This section is non-normative.
There is a growing interest in publishing and consuming data on the Web. Both government and non-government organizations already make a variety of data available on the Web, some openly, some with access restrictions, covering many domains like education, the economy, security, cultural heritage, eCommerce and scientific data. Developers, journalists and others manipulate this data to create visualizations and to perform data analysis. Experience in this field shows that several important issues need to be addressed in order to meet the requirements of both data publishers and data consumers.
To address these issues, the Data on the Web Best Practices Working Group seeks to provide guidance to all stakeholders that will improve consistency in the way data is published, managed, referenced and used on the Web. The guidance will take two forms: a set of best practices that apply to multiple technologies, and vocabularies that are currently missing but that are needed to support the data ecosystem on the Web.
In order to determine the scope of the best practices and the requirements for the new vocabularies, a set of use cases has been compiled. Each use case provides a narrative describing an experience of publishing and using Data on the Web. The use cases cover different domains and illustrate some of the main challenges faced by data publishers and data consumers. A set of requirements, used to guide the development of the set of best practices as well as the development of the vocabularies, has been derived from the compiled use cases.
Interpretations of each use case could lead to an unmanageably large number of requirements and so before including them, each potential requirement has been assessed against three specific criteria:
Only requirements meeting those three criteria have been included.
A use case illustrates an experience of publishing and using Data on the Web. The information gathered from the use cases should be helpful for the identification of the best practices that will guide the publishing and usage of Data on the Web. In general, a use case will be described at least by a statement and a discussion of how the use case is currently implemented. Use case descriptions demonstrate some of the main challenges faced by publishers or developers. Information about challenges will be helpful to identify areas where Best Practices are necessary. According to the challenges, a set of requirements are abstracted in such a way that a requirement motivates the creation of one or more best practices.
(Contributed by Lewis John McGibbney, NASA Jet Propulsion Laboratory/California Institute of Technology)
URL: http://aso.jpl.nasa.gov/
The two most critical properties for understanding snowmelt runoff and timing are the spatial and temporal distributions of snow water equivalent (SWE) and snow albedo. Despite their importance in controlling volume and timing of runoff, snowpack albedo and SWE are still largely unquantified in the US and not at all in most of the globe, leaving runoff models poorly constrained. NASA/JPL, in partnership with the California Department of Water Resources, has developed the Airborne Snow Observatory (ASO), an imaging spectrometer and scanning Lidar system, to quantify SWE and snow albedo, generate unprecedented knowledge of snow properties for cutting edge cryospheric science, and provide complete, robust inputs to water management models and systems of the future.
Elements:
Positive aspects:
This use case provides insight into what a NASA funded demonstration mission looks like (from a data provenance, archival point of view).
It is an excellent opportunity to delve into an earth science mission which is actively addressing the global problem of water resource management. Recently senior officials have declared a statewide (CA) drought emergency and are asking all Californians to reduce their water use by 20 percent. California and other U.S. states are experiencing a serious drought and the state will be challenged to meet its water needs in the upcoming year. Calendar year 2013 was the driest year in recorded history for many areas of California, and current conditions suggest no change is in sight for 2014. ASO is at the front line of cutting edge scientific research, meaning that the data that backs the mission, as well as the practices adopted within the project execution, are extremely important to addressing this issue.
Project collaborators and stakeholders are sent data and information when it is produced and curated. For some stakeholders, the data (in an operational sense) they require is very small in size and in such cases ASO emphasizes speed. It's more like a sharing of information than delivering a product for the short-term turnaround of information.
Negative aspects:
Demonstration missions of this caliber also have downsides. With regard to data best practices, more work is required in the following areas:
Challenges:
Requires: R-AccessUpToDate, R-Citable, R-DataIrreproducibility, R-DataMissingIncomplete, R-FormatMachineRead, R-GeographicalContext, R-GranularityLevels, R-LicenseLiability, R-MetadataAvailable, R-ProvAvailable, R-QualityCompleteness, R-QualityMetrics, R-TrackDataUsage, R-UsageFeedback and R-VocabDocum.
Contributors: Ghislain Atemezing (EURECOM)
URL: http://www.bbc.co.uk/ontologies
Overview: the BBC provides a list of the ontologies they implement and use for their Linked Data platform. The site provides access to the ontologies the BBC is using to support its audience using their applications, such as BBC Sport or BBC Education. Each ontology has a short description with metadata information, an introduction, sample data, an ontology diagram and the terms used in the ontology. The metadata includes 6 fields that are generally filled: mailto authors, created date, version (current version number), prior version (decimal), license (a link to the license) and a link for downloading the RDF version. For example, see the description of the “Core concepts ontology.” However, this metadata that is available in the HTML page is NOT present in a machine-readable format, i.e. in the ontology itself.
Versioning: each ontology uses a decimal notation for the version and the URL for accessing each version file of the ontology is constructed as {BASE-URI}/{ONTO-PREFIX}/{VERSION}.ttl, where {BASE-URI} is http://www.bbc.co.uk/ontologies/. For example, the file of version 1.9 of the “core concepts” ontology is located at http://www.bbc.co.uk/ontologies/coreconcepts/1.9.ttl. However, between different versions, the URI of the ontology used is the same and is of the form: {BASE-URI}/{ONTO-PREFIX}/.
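The URL pattern described above can be illustrated with a minimal Python sketch. The function names `version_file_url` and `ontology_uri` are hypothetical helpers introduced here for illustration; only the base URI, the "coreconcepts" prefix and the version number 1.9 come from the text. The sketch strips the trailing slash of the base URI before joining, which is an assumption made to reproduce the example URL given above.

```python
# Base URI as given in the use case description.
BASE_URI = "http://www.bbc.co.uk/ontologies/"


def version_file_url(onto_prefix: str, version: str, base_uri: str = BASE_URI) -> str:
    """Return the URL of the Turtle file for one version of an ontology.

    Follows the pattern {BASE-URI}/{ONTO-PREFIX}/{VERSION}.ttl.
    The trailing slash of the base URI is removed to avoid a double slash
    (an assumption, chosen so the result matches the example in the text).
    """
    return f"{base_uri.rstrip('/')}/{onto_prefix}/{version}.ttl"


def ontology_uri(onto_prefix: str, base_uri: str = BASE_URI) -> str:
    """Return the version-independent URI of the ontology: {BASE-URI}/{ONTO-PREFIX}/."""
    return f"{base_uri.rstrip('/')}/{onto_prefix}/"


if __name__ == "__main__":
    print(version_file_url("coreconcepts", "1.9"))
    print(ontology_uri("coreconcepts"))
```

The two functions make the versioning issue visible: the file URL changes with each version, while the ontology URI returned by `ontology_uri` stays the same across versions.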
Elements:
Challenges
Requires R-MetadataDocum, R-MetadataMachineRead, R-FormatMultiple, R-MetadataStandardized and R-VocabVersion.
(Contributed by Carlos Laufer)
URL: http://bio2rdf.org/
Bio2RDF is an open source project that uses Semantic Web technologies to make possible the distributed querying of integrated life sciences data. Since its inception, Bio2RDF has made use of the Resource Description Framework (RDF) and the RDF Schema (RDFS) to unify the representation of data obtained from diverse fields (molecules, enzymes, pathways, diseases, etc.) and heterogeneously formatted biological data (e.g. flat-files, tab-delimited files, SQL, dataset specific formats, XML etc.). Once converted to RDF, this biological data can be queried using the SPARQL Protocol and RDF Query Language (SPARQL), which can be used to federate queries across multiple SPARQL endpoints.
Elements:
wasDerivedFrom such that one can query the dataset SPARQL endpoint to retrieve all provenance records for datasets created on different dates. Each resource in the dataset is linked to the date-unique dataset IRI that is part of the provenance record using the VoID inDataset predicate. Other important features of the provenance record include the use of the Dublin Core creator term to link a dataset to the script on Github that was used to generate it, the VoID predicate sparqlEndpoint to point to the dataset SPARQL endpoint, and VoID predicate dataDump to point to the data download URL.
Dataset metrics
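As an illustration of how the provenance features described above might be retrieved, the following Python sketch builds a SPARQL query string. It is a sketch under stated assumptions, not the project's actual query: the function name `provenance_query` is hypothetical, and the Dublin Core creator term is assumed to be `dcterms:creator` from the DCMI Terms namespace. The query is only constructed, not sent to any endpoint.

```python
# Namespace IRIs for the vocabularies named in the text.
# Using the DCMI Terms namespace for "creator" is an assumption.
PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/",
}


def provenance_query(limit: int = 10) -> str:
    """Build a SPARQL SELECT query listing, for each dataset, the generating
    script (creator), the SPARQL endpoint and the data download URL."""
    prefix_lines = "\n".join(
        f"PREFIX {prefix}: <{iri}>" for prefix, iri in PREFIXES.items()
    )
    return (
        f"{prefix_lines}\n"
        "SELECT ?dataset ?script ?endpoint ?dump WHERE {\n"
        "  ?dataset dcterms:creator ?script ;\n"
        "           void:sparqlEndpoint ?endpoint ;\n"
        "           void:dataDump ?dump .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(provenance_query())
```

The resulting string could be submitted to a dataset's SPARQL endpoint with any SPARQL client; which endpoint to use is left open here.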
References:
Challenges:
Potential Requirements:
Requires: R-AccessLevel, R-AccessUpToDate, R-DataLifecyclePrivacy, R-FormatMultiple, R-FormatStandardized, R-PersistentIdentification and R-VocabReference.
(Contributed by Deirdre Lee)
URL: http://mypp.ie/
Buildingeye.com makes building and planning information easier to find and understand by mapping what's happening in your city. In Ireland local authorities handle planning applications and usually provide some customized views of the data (PDFs, maps, etc.) on their own Web site. However there isn't an easy way to get a nationwide view of the data. BuildingEye, an independent SME, built http://mypp.ie/ to achieve this. However as each local authority didn't have an Open Data portal, BuildingEye had to directly ask each local authority for its data. It was granted access to some authorities, but not all. The data it did receive was in different formats and of varying quality/detail. BuildingEye harmonized this data for its own system. However, if another SME wanted to use this data, they would have to go through the same process and again go to each local authority asking for the data.
Elements:
Challenges:
Potential Requirements:
Requires: R-AccessBulk, R-AccessRealTime, R-DataLifecyclePrivacy, R-DataMissingIncomplete, R-DataProductionContext, R-AccessLevel, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-GeographicalContext, R-LicenseAvailable, R-MetadataAvailable, R-MetadataDocum, R-QualityCompleteness, R-QualityComparable, R-SensitivePrivacy, R-SensitiveSecurity and R-VocabDocum.
(Contributed by Yasodara)
URL: http://dados.gov.br/
Dados.gov.br is the open data portal of Brazil's Federal Government. The site was built by a community network pulled together by three technicians from the Ministry of Planning. They managed the group from INDA or "National Infrastructure for Open Data." CKAN was chosen because it is free software and presents independent solutions for the placement of a data catalog of the Federal Government provided on the internet.
Elements:
Challenges:
Requires: R-AccessLevel, R-DataLifecyclePrivacy, R-DataLifecycleStage, R-DataMissingIncomplete, R-FormatStandardized, R-LicenseAvailable, R-MetadataAvailable, R-GeographicalContext, R-MetadataDocum, R-ProvAvailable, R-QualityOpinions, R-UsageFeedback, R-VocabReference and R-VocabVersion.
(Contributed by Christophe Guéret)
URL: http://dans.knaw.nl/
Digital archives, such as DANS in the Netherlands, have so far been concerned with the preservation of what could be defined as "frozen" datasets. A frozen dataset is a finished, self-contained set of data that does not evolve after it has been constituted. The goal of the preserving institution is to ensure this dataset remains available and readable for as many years as possible. This can for example concern an audio recording, a digitized image, e-books or database dumps. Consumers of the data are expected to look for specific content based on its associated identifier, download it from the archive and use it. Now comes the question of the preservation of Linked Open Data. In contrast to "frozen" datasets, linked data can be qualified as "live" data. The resources it contains are part of a larger entity to which third parties contribute; one of the design principles indicates that other data producers and consumers should be able to point to data. As LD publishers stop offering their data (e.g. at the end of a project), taking the LD off-line as a dump and putting it in an archive effectively turns it into a frozen dataset, like SQL dumps and other kinds of databases. The question then is to what extent this is an issue.
Challenges: The archive has to decide whether dereferencing of resources found in preserved datasets is required, and whether to provide a SPARQL endpoint. If data consumers and publishers are fine with having RDF data dumps to be downloaded from the archive prior to their usage, just like any other digital item so far, the technical challenges could be limited to handling the size of the dumps and taking care of serialization evolution over time (e.g. from N-Triples to TriG, or from RDF/XML to HDT) as the preference for these formats evolves. Turning a live dataset into a frozen dump also raises the question of the scope. Considering that LD items are only part of a much larger graph that gives them meaning through context, the only valid dump would be a complete snapshot of the entire connected component of the Web of Data graph the target dataset is part of.
Potential Requirements: Decide on the importance of the dereferenceability of resources and the potential implications for domain names and naming of resources. Decide on the scope of the step that will turn a connected sub-graph into an isolated data dump.
Requires: R-AccessLevel, R-PersistentIdentification, R-UniqueIdentifier and R-VocabReference.
(Contributed by Christophe Guéret)
URL: http://www.e-overheid.nl/onderwerpen/stelselinformatiepunt/stelsel-van-basisregistraties
The Netherlands has a set of registers that are under consideration for exposure as Linked (Open) Data in the context of the "PiLOD" project. The registers contain information about buildings, people and businesses that other individual public bodies may want to refer to for their daily activities. One of them is, for instance, the service of public taxes ("BelastingDienst"), which regularly pulls out data from several registers, stores this data in a big Oracle instance and curates it. This costly and time-consuming process could be optimized by providing on-demand access to up-to-date descriptions provided by the register owners.
Challenges:
In terms of challenges, linking is for once not much of an issue as registers already cross-reference unique identifiers (see also http://www.wikixl.nl/wiki/gemma/index.php/Ontsluiting_basisgegevens). A URI scheme with predictable and persistent URIs is being considered for implementation. Actual challenges include:
Requires: R-AccessLevel, R-FormatMultiple, R-PersistentIdentification, R-SensitivePrivacy, R-UniqueIdentifier and R-VocabReference.
(Contributed by Mark Harrison, University of Cambridge & Eric Kauz, GS1)
Retailers and Manufacturers / Brand Owners are beginning to understand that there can be benefits to openly publishing structured data about products and product offerings on the Web as Linked Open Data. Some of the initial benefits may be enhanced search listing results (e.g. Google Rich Snippets) that improve the likelihood of consumers choosing such a product or product offer over an alternative product that lacks the enhanced search results. However, the longer term vision is that an ecosystem of new product-related services can be enabled if such data is available. Many of these will be consumer-facing and might be accessed via smartphones and other mobile devices, to help consumers to find the products and product offers that best match their search criteria and personal preferences or needs — and to alert them if a particular product is incompatible with their dietary preferences or other criteria such as ethical / environmental impact considerations — and to suggest an alternative product that may be a more suitable match. A more complete description of this use case is available.
Elements:
Challenges:
Potential Requirements:
Requires: R-AccessUpToDate, R-Citable, R-FormatMultiple, R-FormatStandardized, R-LicenseLiability, R-PersistentIdentification and R-ProvAvailable.
(Contributed by Ghislain Atemezing)
ISO GEO manages catalog records of geographic information in XML that conform to ISO-19139, a French adaptation of ISO-19115 (data sample). They export thousands of such records today but need to manage them better. In their platform, they store the information in a more conventional manner and use this standard to export datasets compliant with the INSPIRE standards or via the OGC's CSW protocol. Sometimes, they have to enrich their metadata using tools like GeoSource, accessed through an SDI with their own metadata records. ISO GEO wants to be able to integrate all the different implementations of ISO-19139 in different tools in a single framework to better understand the thousands of metadata records they use in their day-to-day business. Types of information recorded in each file include: contact info (metadata) [data issued], spatial representation, reference system info [code space], spatial resolution, geographic extension of the data, file distribution, data quality and process step (example).
Challenges:
Requires: R-AccessUpToDate, R-DataEnrichment, R-FormatLocalize, R-FormatMachineRead, R-GranularityLevels, R-LicenseAvailable, R-MetadataMachineRead, R-MetadataStandardized, R-PersistentIdentification, R-ProvAvailable and R-VocabReference.
(Contributed by Carlos Iglesias)
URL: http://landportal.info/
The IFAD Land Portal platform has been completely rebuilt as an Open Data collaborative platform for the Land Governance community. Among the new features, the Land Portal provides access to more than 100 indicators from more than 25 different sources on land governance issues for more than 200 countries across the world, as well as a repository of land-related content and documentation. Thanks to the new platform people could
Elements:
Challenges:
Potential Requirements:
Requires: R-AccessBulk, R-AccessRealTime, R-DataEnrichment, R-DataVersion, R-FormatLocalize, R-FormatMachineRead, R-FormatMultiple, R-FormatStandardized, R-GeographicalContext, R-GranularityLevels, R-MetadataAvailable, R-MetadataMachineRead, R-MetadataStandardized, R-ProvAvailable, R-PersistentIdentification, R-QualityCompleteness, R-QualityMetrics, R-TrackDataUsage, R-UniqueIdentifier, R-VocabDocum, R-VocabOpen, R-VocabReference and R-VocabVersion.
(Contributed by Phil Archer)
URL: http://articles.latimes.com/2014/mar/27/local/la-me-ln-gender-wage-gap-city-government-20140327
On 27 March 2014, the LA Times published a story, "Women earn 83 cents for every $1 men earn in L.A. city government". It was based on an Infographic released by LA's City Controller, Ron Galperin. The Infographic was based on a dataset published on LA's open data portal, Control Panel LA. That portal uses the Socrata platform, which offers a number of spreadsheet-like tools for examining the data, the ability to download it as CSV, embed it in a Web page and see its metadata.
Positive aspects:
Negative aspects:
Challenges:
Other Data Journalism blogs:
Requires: R-Citable, R-DataMissingIncomplete, R-DataProductionContext, R-FormatMultiple, R-FormatOpen, R-GeographicalContext, R-LicenseAvailable, R-MetadataAvailable, R-MetadataStandardized, R-QualityMetrics, R-UniqueIdentifier and R-TrackDataUsage.
(Contributed by Riccardo Albertoni, CNR-IMATI, Genoa, Italy)
URL: http://linkeddata.ge.imati.cnr.it/
LusTRE is a framework that combines existing thesauri to support the management of environmental resources. It considers the heterogeneity in scope and levels of abstraction of existing environmental thesauri as an asset when managing environmental data, thus it exploits Linked Data (SKOS, RDF etc.) in order to provide a multi-thesauri solution for INSPIRE data themes related to nature conservation.
LusTRE is intended to support metadata compilation and data/service discovery according to ISO 19115/19119. The development of LusTRE includes:
Quality of thesauri and linksets is an issue that is not necessarily limited to the initial review of thesauri; it should be monitored and promptly documented.
In this respect, a standardized vocabulary for expressing dataset and linkset quality would be needed to make accessible the quality assessment of thesauri included in LusTRE. Considering the importance of linkset quality in the achievement of an effective cross-walking among thesauri, further services for assessing the quality of linksets are going to be investigated. Such services might be developed extending the measure proposed in Albertoni et al, 2013 (PDF), so that linksets among thesauri can be assessed considering their potential when exploiting interlinks for thesaurus complementarities.
LusTRE is currently under development within the EU project eENVplus (CIP-ICT-PSP grant No. 325232). It extends the common thesaurus framework De Martino et al. 2011 previously resulting from the EU project NatureSDIplus (ECP-2007-GEO-317007).
Elements:
Positive aspects: The use case includes publication as well as consumption of data.
Challenges:
Requires: R-AccessBulk, R-Citable, R-DataEnrichment, R-DataVersion, R-FormatMachineRead, R-FormatMultiple, R-FormatOpen, R-FormatStandardized, R-LicenseAvailable, R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead, R-MetadataStandardized, R-PersistentIdentification, R-ProvAvailable, R-QualityComparable, R-QualityCompleteness, R-QualityMetrics, R-QualityOpinions, R-TrackDataUsage, R-UniqueIdentifier, R-UsageFeedback, R-VocabDocum, R-VocabOpen, R-VocabReference and R-VocabVersion.
(Contributed by Deirdre Lee, based on post by Leigh Dodds)
There are many different licenses available under which data can be published on the Web, e.g. Creative Commons, Open Data Commons, national licenses, etc. It is important that the license is available in a machine-readable format. Leigh Dodds has done some work towards this with the Open Data Rights Statement Vocabulary including guides for publishers and reusers. Another issue is when data under different licenses are combined, the license terms under which the data is available also have to be merged. This interoperability of licenses is a challenge.
Challenges:
Requires: R-LicenseAvailable
NB there is also a requirement for licenses to be interoperable but this is out of scope as defined by the Working Group's charter.
(Contributed by Annette Greiner, Lawrence Berkeley National Laboratory, California)
URL: https://openmsi.nersc.gov/
Mass spectrometry imaging (MSI) is widely applied to image complex samples for applications spanning health, microbial ecology, and high throughput screening of high-density arrays. MSI has emerged as a technique suited to resolving metabolism within complex cellular systems, where understanding the spatial variation of metabolism is vital for making a transformative impact on science. Unfortunately, the scale of MSI data and complexity of analysis present an insurmountable barrier to scientists: a single 2D-image may be many gigabytes and comparison of multiple images is beyond the capabilities available to most scientists. The OpenMSI project will overcome these challenges, allowing broad use of MSI by researchers by providing a Web-based gateway for management and storage of MSI data, the visualization of the hyper-dimensional contents of the data, and the statistical analysis.
Elements:
Positive aspects: huge improvement in ease of analysis over traditional methods, ability to readily share results with other researchers, ability to download relevant subsets of data, provides metadata for each sample, self-describing data format, fast and flexible Web API, interactive Web-based exploration that enables user to view data that cannot be opened using standard MSI tools.
Negative aspects: submission of metadata should be easier and automated. As it scales, we'll need to facilitate discovery of datasets of interest via search.
Challenges: Project is largely unfunded and resources are vitally needed for project to succeed.
Requires: R-AccessRealTime, R-APIDocumented, R-DataEnrichment, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead and R-SensitiveSecurity.
(Contributed by Deirdre Lee based on the 2012 ePSI Open Transport Data Manifesto)
The Context: Transportation is an important contemporary issue that has a direct impact on economic strength, environmental sustainability and social equity. Accordingly, transport data — largely produced or gathered by public sector organisations or semi-private entities, quite often locally — represents one of the most valuable sources of public sector information (PSI, also called ‘open data’), a key policy area for many, including the European Commission.
The Challenge: Combined with the advancement of Web technologies and the increasing use of smart phones, the demand for high quality machine-readable and openly licensed transport data, allowing for reuse in commercial and non-commercial products and services, is rising rapidly. Unfortunately this demand is not met by current supply: many transport data producers and holders (from the public and private sectors) have not managed to respond adequately to these new challenges set by society and technology.
So what do we need?
Why is this not happening?
Requires: R-AccessBulk, R-AccessLevel, R-AccessRealTime, R-AccessUpToDate, R-APIDocumented, R-DataMissingIncomplete, R-DataProductionContext, R-FormatLocalize, R-FormatMachineRead, R-FormatOpen, R-GeographicalContext, R-LicenseAvailable, R-LicenseLiability, R-MetadataAvailable, R-QualityComparable, R-QualityCompleteness, R-QualityMetrics, R-SLAAvailable, R-UsageFeedback and R-VocabOpen.
(Contributed by Deirdre Lee, based on a presentation by Axel Polleres at EDF 2014)
The Open City Data Pipeline aims to provide an extensible platform to support citizens and city administrators by providing city key performance indicators (KPIs), leveraging open data sources. An assumption of open data is that “Added value comes from comparable open datasets being combined.” Open data needs stronger standards to be useful, in particular for industrial uptake. Industrial usage has different requirements than that of an application-building hobbyist or civil society so it's important to think how open data can be used by industry at the time of publication. The Open City Data Pipeline project has developed a data pipeline to:
Current Data Summary
Base assumption (for our use case): Added value comes from comparable open datasets being combined
Challenges & Lessons Learnt:
:populationDensity = :population/:area
dbpedia:populationTotal, dbpedia:populationCensus
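The equation :populationDensity = :population/:area suggests how a missing indicator could be derived from two others. The following Python sketch illustrates that idea only; the function names, the dictionary keys and the sample city record are hypothetical and are not taken from the Open City Data Pipeline itself.

```python
from typing import Optional


def population_density(population: Optional[float], area: Optional[float]) -> Optional[float]:
    """Compute population / area, or None when an input is missing or area is not positive."""
    if population is None or area is None or area <= 0:
        return None
    return population / area


def fill_missing_density(record: dict) -> dict:
    """Return a copy of the record with populationDensity derived when it is absent.

    An existing populationDensity value is left untouched.
    """
    filled = dict(record)
    if filled.get("populationDensity") is None:
        filled["populationDensity"] = population_density(
            filled.get("population"), filled.get("area")
        )
    return filled


if __name__ == "__main__":
    # Hypothetical city record, for illustration only.
    city = {"population": 1_000_000, "area": 400.0, "populationDensity": None}
    print(fill_missing_density(city)["populationDensity"])  # 2500.0
```

A derived value is only as comparable as its inputs: if one source reports a total population and another a census population, the resulting densities may not be directly comparable.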
Challenges:
Requires: R-DataMissingIncomplete, R-DataProductionContext, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-FormatLocalize, R-GeographicalContext, R-LicenseAvailable, R-MetadataAvailable, R-MetadataStandardized, R-QualityComparable, R-QualityCompleteness, R-VocabDocum, R-VocabOpen and R-VocabReference.
(Contributed by Eric Stephan)
In 2013 the United States White House published an executive order on Open Data to help make publicly available data understandable, accessible, and searchable. A number of historical and on-going atmospheric studies fall into this category but are not currently open. This use case describes characteristics of laboratory experiments and field studies that could be published as open data.
For measurements to be considered useful and comparable to other findings, scientists need to track every aspect of their laboratory and field experiments. This can include: background describing the purpose of the experiment, field site selected, instrumentation deployed, configuration settings, housekeeping data, types of measurements that need to be taken, work performed on field visits, processing the raw measurements, intermediate processing data, value added data products, quality assurance, problem reporting, and standards relied upon for disseminating the study results including selected data formats, quality control codes selected, engineering units selected, and metadata vocabularies relied upon for describing the measurements.
Traditionally knowledge and data about the studies have either been kept in separate local databases, file systems and spreadsheets, or in non-record keeping systems. If kept electronically the experiment in its entirety may be kept in bulk by way of archive files (tar, zip etc). Measurements from the study may be shared along with background information in the form of a summarized report or publication, content management system or wiki site and the bulk of knowledge is largely retained internally by data providers.
Elements:
Positive aspects: The Web of Things (instruments), Linked Services (processing software), and Linked Data communities offer an opportunity to field or laboratory experiments by coupling all the elements of the experiment into one composite product. Leveraging these technologies it is possible to construct a catalog that acts as a concierge to any collaborator giving them perspectives on things, services, and data.
Negative aspects:
Challenges:
Requires: R-AccessRealTime, R-DataIrreproducibility, R-DataLifecycleStage, R-DataProductionContext, R-FormatMachineRead, R-FormatMultiple, R-FormatStandardized, R-TrackDataUsage, R-UsageFeedback, R-VocabOpen and R-VocabReference.
(Contributed by Sumit Purohit)
URL: http://rdesc.org/
RDESC's objective is to develop a capability for describing, linking, searching and discovering scientific resources used in collaborative science. For the purpose of capturing semantic context, RDESC adopts sets of existing ontologies where possible such as FOAF, BIBO and schema.org. RDESC also introduced new concepts in order to provide a semantically integrated view of the data. Such concepts have two distinct functions. The first is to preserve the semantics of the source that are more specific than what already existed in the ontology. The second is to provide broad categorization of existing concepts as it becomes clear that concepts are forming general groups. These generalizations enable users to work with concepts they understand, rather than needing to understand the semantics of many different systems. It strives to provide a lightweight enough framework to be used as a component in any software system such as desktop user environments or dashboards but also be scalable to millions of resources.
Elements
Positive aspects:
Negative aspects:
Challenges:
Potential Requirements:
Requires: R-AccessLevel, R-AccessRealTime, R-Citable, R-DataLifecyclePrivacy, R-DataMissingIncomplete, R-FormatStandardized, R-PersistentIdentification, R-ProvAvailable, R-SensitiveSecurity, R-SLAAvailable, R-TrackDataUsage, R-UniqueIdentifier, R-VocabOpen, R-VocabReference and R-UsageFeedback.
(Contributed by Bernadette Lóscio)
URL: http://dados.recife.pe.gov.br/
Recife is a city in the Northeast of Brazil, famous for being one of Brazil’s biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organizations for public use as open data. The Open Data Portal Recife was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats and every dataset has metadata that helps in the understanding and usage of the data. However, the metadata is not provided using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported in CSV format and then published in the data catalog. Currently, work is under way to dynamically generate data from relational databases so that data will be available as soon as it is created. The main phases of the development of this initiative were: educating people with appropriate knowledge concerning open data; identifying relevant data, that is, the sources of data that potential consumers could find useful; extracting and transforming data from the original data sources to open formats; configuring and installing the open data catalog tool; and publishing the data and releasing the portal.
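The static export step described above (relational tables dumped to CSV before publication) might look like the following minimal sketch. It assumes Python with the standard sqlite3 and csv modules; the table name `schools`, its columns and rows, and the in-memory database are hypothetical stand-ins and are not details of the Recife portal.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV, header row first."""
    # `table` is interpolated into the SQL text, so it must come from a
    # trusted list of table names, never from user input.
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(col[0] for col in cursor.description)  # column names
    writer.writerows(cursor)  # remaining data rows


# Hypothetical example data standing in for a municipal database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (id INTEGER, name TEXT, district TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [(1, "Escola A", "District 1"), (2, "Escola B", "District 2")],
)

buffer = io.StringIO()
export_table_to_csv(conn, "schools", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Running the same export on a schedule, or on each database change, is one possible route towards the dynamic generation mentioned in the use case.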
Elements:
Challenges:
Requires: R-MetadataMachineRead, R-MetadataDocum, R-MetadataStandardized, R-QualityComparable, R-QualityCompleteness, R-VocabDocum, R-VocabOpen and R-VocabReference.
(Contributed by Yasodara)
URL: https://github.com/dataviz/retrato-da-violencia.org
This is a data visualization made in 2012 by Vitor Batista, Léo Tartari and Thiago Bueno for a W3C Brazil Office challenge about data from Rio Grande do Sul (a Brazilian region). The data was released in a .zip package; the original format was .csv. The code and the documentation of the project are in its GitHub repository.
Elements:
Positive Aspects: the decision to transform the CSV into JSON was based on the need for hierarchical data. The ability to map the CSV structure to XML or JSON was considered a positive, since JSON can cover more complex structures.
Negative Aspects: the data is already outdated (in 2014), there is no provision for new releases, and there's no associated metadata.
Requires: R-AccessUpToDate, R-MetadataAvailable, R-MetadataStandardized, R-PersistentIdentification, R-QualityCompleteness and R-SensitiveSecurity.
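The CSV-to-JSON transformation mentioned under the positive aspects of this use case can be sketched roughly as follows. This is an illustration only, written in Python with the standard library; the column names (`city`, `year`, `incidents`) and the sample rows are hypothetical and are not taken from the project's data or code.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical flat input: one row per (city, year) pair with a count.
FLAT_CSV = """city,year,incidents
City A,2010,12
City A,2011,9
City B,2010,4
"""


def csv_to_nested(csv_text):
    """Group flat CSV rows into a hierarchy: {city: {year: incidents}}."""
    nested = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        nested[row["city"]][row["year"]] = int(row["incidents"])
    return dict(nested)


hierarchy = csv_to_nested(FLAT_CSV)
print(json.dumps(hierarchy, indent=2, ensure_ascii=False))
```

The nested form lets a visualization look up all years for a city in one step, which a flat CSV table can only express through repeated rows.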
(Contributed by Luis Polo)
URL: http://www.tabulaeapp.com/
Tabul.ae is a framework to publish and visually explore data that can be used to deploy powerful and easy-to-exploit open data platforms, allowing organizations to unleash the potential of their data. The aim is to enable data owners (public organizations) and consumers (citizens and business reusers) to transform the information they manage into added-value knowledge, empowering them to easily create data-centric Web applications. These applications are built upon interactive and powerful graphs, and take the shape of interactive charts, dashboards, infographics and reports. Tabulae provides a high degree of assistance in creating these apps and also automates several data visualization tasks (e.g. recognition of geographical entities to automatically generate a map). In addition, the charts and maps are portable outside the platform and can be smartly integrated with any Web content, enhancing the reusability of the information.
Elements:
Challenges:
Potential Requirements:
Requires: R-AccessUpToDate, R-FormatLocalize, R-FormatMachineRead, R-FormatMultiple, R-FormatStandardized, R-ProvAvailable, R-QualityComparable, R-QualityCompleteness, R-SensitiveSecurity, R-VocabReference and R-VocabVersion.
(Contributed by Phil Archer)
URL: http://www.researchinfonet.org/wp-content/uploads/2014/07/Joint-statement-of-principles-June-2014.pdf
(PDF)
In 2013, the Royal Society led the formation of the UK Open Research Data Forum. This effort is a national reflection of a global trend towards the open publication of research data; see, for instance, the work of the Research Data Alliance, DataCite and the US National Institutes of Health as described in a talk by its Associate Director for Data Science, Philip Bourne. Following a workshop in April 2014, the UK Open Research Data Forum and US Committee on Coherence at Scale issued a joint statement (PDF) of the principles of open research data.
At the time of writing, these are undergoing review and refinement but the aims are clear. In the context of the Data on the Web Best Practices Working Group, many requirements stem from this list.
Challenges
Each principle listed here represents one or more challenges, with points 1, 3 and 5 being particularly relevant to Data on the Web Best Practices. Matters of policy and culture within any domain, whilst certainly challenging, are out of scope for the current work.
Elements:
Requires: R-AccessBulk, R-Citable, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-LicenseAvailable, R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead, R-MetadataStandardized, R-PersistentIdentification, R-ProvAvailable, R-SensitivePrivacy, R-SensitiveSecurity, R-TrackDataUsage and R-UniqueIdentifier.
(Contributed by AGESIC)
URL: http://datauy.org/
Uruguay's open data portal was launched in December 2012 and at the time of writing holds 85 datasets containing 114 resources. The open data initiative prioritizes the “use of data” rather than “quantity of data”, which is why the catalog also promotes a number of applications using data resources in some way (in common with many other data portals). It’s important for the project to keep the ratio 1:3 between applications and datasets. Most of the resources are CSV and ESRI Shapefiles, making this a catalog of 2 and 3 star resources according to the 5 Stars of Linked Open Data scheme. AGESIC does not have sufficient resources at government agencies to implement an open data liberation strategy and go to the next level. So when we are asked about opening data, “keep it simple” is the answer, and CSV is by far the easiest and smartest way to start. Uruguay has an access to public information law but doesn't have legislation about open data. The open data initiative is led by AGESIC with the support of an open data working group drawn from multiple government agencies.
Elements:
Challenges: Consolidation of tools to manage datasets, improve visualizations and transform resources to higher level (4 – 5 stars). Automated publication process using harvesting or similar tools. Alerts or control panels to keep data updated.
Requires: R-DataMissingIncomplete, R-AccessLevel, R-VocabReference and R-TrackDataUsage.
Contributors: Adriano C. Machado Pereira, Adriano Veloso, Gisele Pappa, Wagner Meira Jr.
City/country: Belo Horizonte, Brazil
URL: http://observatorio.inweb.org.br/english.html
Overview: There are almost 65 million Brazilians connected to the Internet, 36% of the Brazilian population, according to Comitê Gestor da Internet no Brasil. As a consequence, events such as the Brazilian Election Running have become popular topics on the Web, mainly in Online Social Networks. Our goal is to understand this new reality and present new ways to watch facts, events and entities on the fly using the Web and user-generated content available in Online Social Networks and Blogs. The Web Observatory is a research project that is part of the Instituto Nacional de Ciência e Tecnologia para a Web (INWEB), sponsored by CNPq and Fapemig. There are over 30 experts involved in the project, from four different Federal Universities: Universidade Federal de Minas Gerais (UFMG), Centro Federal de Educação Tecnológica de Minas Gerais (CEFET-MG), Universidade Federal do Amazonas (UFAM) and Universidade Federal do Rio Grande do Sul (UFRGS). The INWEB researchers use a set of new techniques related to information retrieval, data mining and data visualization to understand and summarize what the media and users are talking about on the Web. This provides the fundamental basis for an evaluation of the impact of the Olympic Campaigns and how users react to news and discussions. One new feature in this project is the possibility to see the propagation of the Tweets.
Elements:
Challenges
Requires: R-DataEnrichment, R-GranularityLevels, R-MetadataDocum, R-MetadataMachineRead, R-MetadataStandardized, R-ProvAvailable, R-VocabDocum, R-VocabOpen and R-VocabReference.
(Contributed by Eric Stephan)
This use case describes a data management facility being constructed to support scientific offshore wind energy research for the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) Wind and Water Power Program. The Reference Facility for Renewable Energy (RFORE) project is responsible for collecting wind characterization data from remote sensing and in-situ instruments located on an offshore platform. This raw data is collected by the Data Management Facility (DMF) and processed into a standardized NetCDF format. Both the raw measurements and processed data are archived in the PNNL Institutional Computing (PIC) petascale computing facility. The DMF will record all processing history, quality assurance work, problem reporting, and maintenance activities for both instrumentation and data. All datasets, instrumentation, and activities are cataloged, providing a seamless knowledge representation of the scientific study. The DMF catalog relies on linked open vocabularies and domain vocabularies to make the study data searchable. Scientists will be able to use the catalog for faceted browsing, ad-hoc searches and query by example. For accessing individual datasets, a REST GET interface to the archive will be provided.
Challenges:
For accessing numerous datasets scientists will be accessing the archive directly using other protocols such as sftp, rsync, scp, and access techniques such as HPN-SSH.
Requires: R-AccessRealTime, R-FormatStandardized, R-VocabOpen and R-VocabReference.
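As an illustration of the REST GET access to individual datasets described in this use case, the sketch below builds (but does not send) such a request using Python's standard library. The base URL, the dataset identifier and the `application/x-netcdf` media type in the Accept header are all assumptions made for the example; the use case does not specify the actual interface.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request

# Hypothetical base URL; the real archive endpoint is not given in this document.
ARCHIVE_BASE = "https://archive.example.org/datasets/"


def build_dataset_request(dataset_id):
    """Build (but do not send) a GET request for a single dataset."""
    # Percent-encode the identifier so it is safe as one path segment.
    url = urljoin(ARCHIVE_BASE, quote(dataset_id, safe=""))
    # The media type is an assumption for NetCDF content.
    return Request(url, headers={"Accept": "application/x-netcdf"}, method="GET")


request = build_dataset_request("lidar-2015-03-01")  # hypothetical identifier
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual retrieval; bulk access over sftp, rsync or scp, as mentioned under the challenges, falls outside this kind of interface.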
The use cases presented in the previous section illustrate a number of challenges faced by data publishers and data consumers. These challenges show that some guidance is required on specific areas and therefore best practices should be provided. According to the challenges, a set of requirements was defined in such a way that a requirement motivates the creation of one or more best practices. Challenges related to Data Quality and Data Usage motivated the definition of specific requirements for the Quality and Granularity Description Vocabulary and the Data Usage Description Vocabulary.
The Open Knowledge Foundation defines open data most succinctly as data that can be freely used, modified, and shared by anyone for any purpose. Data on the Web may be open but Web technologies are equally applicable to data that is not open, or to scenarios where open and closed data are combined. There are a number of areas where data may be on the Web but not open.
Closed data may be generated in an organization that then blocks general access using a firewall or other access control system. Generated data may have links to other "open" data hosted elsewhere and it may be represented using open Web standards but this cannot be considered "open data."
Data can be closed through the policies of the data publisher and data provider. Business-sensitive data that is not made accessible to the rest of the world is an example of closed data. Data controlled by law or government policies are further examples of closed data, e.g. national security data, law enforcement, health care etc.
There is often a period between the generation of data and its publication as open data and data in this state should be considered as "closed." The data may remain in a closed state for an indefinite period of time while it is validated and analyzed, and insights and discoveries are published. It may also remain closed because the data publisher prefers to maximize their advantage gained by availability of data before they publish it openly. This is current common practice in scientific research.
Historically, data has been exposed using various non-HTTP IETF protocol-based end points including, but not limited to, FTP, SFTP, SCP and Rsync. While these protocols are considered "open," their inter-operability with HTTP-based Web protocols is currently a limiting factor. From an open data perspective, data only available using these non-HTTP protocols should be considered as closed data and, by definition, is not on the Web. It follows that data accessible by private or application-specific proprietary access protocol end points is also deemed both closed data and out of scope for data on the Web.
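The rule of thumb in the paragraph above, that only data reachable over HTTP-based protocols counts as being on the Web, could be expressed as a simple URL scheme check. The following Python sketch is illustrative only; the function name and the example URLs are hypothetical, and treating exactly `http` and `https` as the Web schemes is an assumption of the example.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only these schemes count as "on the Web".
WEB_SCHEMES = {"http", "https"}


def is_on_the_web(url):
    """Return True if the URL uses an HTTP-based scheme."""
    return urlsplit(url).scheme.lower() in WEB_SCHEMES


for candidate in (
    "https://data.example.org/stations.csv",
    "ftp://data.example.org/stations.csv",
    "sftp://data.example.org/stations.csv",
):
    print(candidate, is_on_the_web(candidate))
```

A check like this says nothing about whether the data is open in the licensing or access-control sense; it only reflects the protocol criterion discussed here.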
In the following section we summarize the requirements derived from all the use cases, grouped according to theme. Closed data cuts across those themes (it's all data on the Web) but it's worth highlighting R-AccessLevel, R-DataMissingIncomplete, R-DataLifecyclePrivacy and R-SensitiveSecurity as being of particular relevance to closed data.
The table below groups the requirements derived from the use cases according to the challenges faced by producers and users of data on the Web.
Data should be available for bulk download
Motivation: BuildingEye, LandPortal, LusTRE, OKFNTransport and UKOpenResearchForum.
The access level of the data should be provided along with conditions of access, for example, open, restricted or closed.
Motivation: Bio2RDF, DadosGovBr, DigitalArchiving, DutchBaseReg, OKFNTransport, RDESC and UruguayOpenData.
Where data is produced in real-time, it should be available on the Web in real-time
Motivation: BuildingEye, LandPortal, MSI, OKFNTransport, OEFS, RDESC, SharePSI (Share-PSI Gijon) and WindCharacterization
Data should be available in an up-to-date manner and the update cycle made explicit
Motivation: ASO, Bio2RDF, GS1Digital, ISO GEO Story, OKFNTransport, RetratoDaViolencia, SharePSI (Share-PSI Emergency Response) and Tabulae
If the data is available via an API, the API should be documented.
Motivation: MSI and OKFNTransport.
It should be possible to perform some data enrichment tasks in order to aggregate value to data, therefore providing more value for user applications and services.
Motivation: ISO GEO Story, LandPortal, LusTRE, MSI and WebObservatory.
Information about locale parameters (date and number formats, language) should be made available
Motivation: ISO GEO Story, LandPortal, OKFNTransport and Tabulae.
Data should be available in a machine-readable format that is adequate for its intended or potential use
Motivation: ASO, BuildingEye, ISO GEO Story, LandPortal, LusTRE, MSI, OKFNTransport, OpenCityDataPipeline, OEFS, Tabulae and UKOpenResearchForum.
Data should be available in multiple formats
Motivation: BBC, Bio2RDF, DutchBaseReg, GS1Digital, LandPortal, LATimes, LusTRE, OEFS and Tabulae.
Data should be available in an open format
Motivation: BuildingEye, LATimes, LusTRE, MSI, OKFNTransport, OpenCityDataPipeline and UKOpenResearchForum.
Data should be available in a standardized format. Through standardization, interoperability is also expected.
Motivation: Bio2RDF, BuildingEye, DadosGovBr, GS1Digital, LandPortal, LusTRE, MSI, OpenCityDataPipeline, OEFS, Tabulae, UKOpenResearchForum and WindCharacterization.
Each data resource should be associated with a unique identifier
Motivation: DigitalArchiving, DutchBaseReg, LandPortal, LATimes, LusTRE, RDESC and UKOpenResearchForum.
Data should be designated if it is irreproducible.
Preliminary steps in the data lifecycle should not infringe upon an individual’s intellectual property rights.
Motivation: Bio2RDF, BuildingEye, DadosGovBr and RDESC.
Data should be identified by a designated lifecycle stage
Motivation: DadosGovBr and OEFS
Vocabularies should be clearly documented
Motivation: ASO, BuildingEye, LandPortal, LusTRE, OpenCityDataPipeline, RecifeOpenData and WebObservatory.
Vocabularies should be shared in an open way
Motivation: LandPortal, LusTRE, OKFNTransport, OpenCityDataPipeline, OpenExperimentalFieldStudies, RDESC, RecifeOpenData, WebObservatory and WindCharacterization.
Existing reference vocabularies should be reused where possible
Motivation: Bio2RDF, BuildingEye, DadosGovBr, DigitalArchiving, DutchBaseReg, ISO GEO Story, LandPortal, LusTRE, OpenCityDataPipeline, OpenExperimentalFieldStudies, RDESC, RecifeOpenData, SharePSI, Tabulae, UruguayOpenData, WebObservatory and WindCharacterization.
Vocabularies should include versioning information
Motivation: BBC, DadosGovBr, LandPortal, LusTRE and Tabulae.
Note: SLAs are a form of metadata and so inherit metadata requirements
Service Level Agreements (SLAs) for industry reuse of the data should be available if requested (via a defined contact point). An SLA is a type of metadata, so all metadata requirements also apply here.
Motivation: OKFNTransport and RDESC.
Note: Licenses are a form of metadata and so inherit metadata requirements.
Data should be associated with a license.
Motivation: BuildingEye, DadosGovBr, ISO GEO Story, LATimes, LusTRE, OKFNTransport, OpenCityDataPipeline and UKOpenResearchForum.
Liability terms associated with usage of Data on the Web should be clearly outlined
Motivation: ASO, GS1Digital and OKFNTransport.
Production context information should be associated with data if relevant, e.g. service/process descriptions. DataProductionContext is a type of metadata, so all metadata requirements also apply here.
Motivation: BuildingEye, LATimes, OKFNTransport, OpenCityDataPipeline and OpenExperimentalFieldStudies.
GeographicalContext (countries, regions, cities etc.) must be referred to consistently. GeographicalContext is a type of metadata, so all metadata requirements also apply here.
Motivation: ASO, BuildingEye, LandPortal, LATimes, OKFNTransport, OpenCityDataPipeline, SharePSI (Share-PSI Location, Share-PSI Emergency Response).
Metadata should be available
Motivation: ASO, BuildingEye, DadosGovBr, LandPortal, LATimes, LusTRE, MSI, OKFNTransport, OpenCityDataPipeline, RetratoDaViolencia, UKOpenResearchForum.
Metadata vocabulary, or values if vocabulary is not standardized, should be well-documented
Motivation: BBC, BuildingEye, DadosGovBr, LusTRE, MSI, RecifeOpenData, SharePSI (Share-PSI Austria), UKOpenResearchForum and WebObservatory.
Metadata should be machine-readable
Motivation: BBC, ISO GEO Story, LandPortal, LATimes, LusTRE, MSI, RecifeOpenData, UKOpenResearchForum and WebObservatory.
Metadata should be standardized. Through standardization, interoperability is also expected.
Motivation: BBC, ISO GEO Story, LandPortal, LATimes, LusTRE, OpenCityDataPipeline, RecifeOpenData, RetratoDaViolencia, SharePSI (Share-PSI Federation), UKOpenResearchForum and WebObservatory.
An identifier for a particular resource should be resolvable on the Web and associated for the foreseeable future with a single resource or with information about why the resource is no longer available.
Motivation: Bio2RDF, DigitalArchiving, DutchBaseReg, GS1Digital, ISO GEO Story, LandPortal, LusTRE, RDESC, RetratoDaViolencia and UKOpenResearchForum.
Note: Provenance data is a form of metadata and so inherits metadata requirements.
If different versions of data exist, data versioning should be provided.
Motivation: BBC, LandPortal, LusTRE
Data provenance information should be available. Provenance data is a type of metadata, so all metadata requirements also apply here.
Motivation: ASO, DadosGovBr, GS1Digital, ISO GEO Story, LandPortal, LusTRE, RDESC, SharePSI (Share-PSI 270a), Tabulae, UKOpenResearchForum and WebObservatory.
Data should not infringe a person's right to privacy
Motivation: BuildingEye, DutchBaseReg, SharePSI (Share-PSI T Lights and Share-PSI Snap), and UKOpenResearchForum.
Data should not infringe an organization's security (local government, national government, business)
Motivation: BuildingEye, MSI, RDESC, RetratoDaViolencia, SharePSI (Share-PSI Emergency Response), Tabulae and UKOpenResearchForum.
Publishers should indicate if data is partially missing or if the dataset is incomplete
Motivation: ASO, BuildingEye, DadosGovBr, LATimes, OKFNTransport, OpenCityDataPipeline, RDESC and UruguayOpenData.
Data should be comparable with other datasets
Motivation: BuildingEye, LusTRE, OKFNTransport, OpenCityDataPipeline, RecifeOpenData, SharePSI (Share-PSI Emergency Response), and Tabulae.
Data should be complete
Motivation: ASO, BuildingEye, LandPortal, LusTRE, OKFNTransport, OpenCityDataPipeline, RecifeOpenData, RetratoDaViolencia and Tabulae.
Data should be associated with a set of documented, objective and, if available, standardized quality metrics. This set of quality metrics may include user-defined or domain-specific metrics.
Motivation: ASO, LandPortal, LATimes, LusTRE and OKFNTransport.
Subjective quality opinions on the data should be supported
Motivation: DadosGovBr, LusTRE, SharePSI (Share-PSI Feedback, Share-PSI Feedback 2, Share-PSI France).
Data available at different levels of granularity should be accessible and modelled in a common way
Motivation: ASO, ISO GEO Story and LandPortal.
It should be possible to cite data on the Web
Motivation: ASO, GS1Digital, LATimes, LusTRE, RDESC, SharePSI (Share-PSI Emergency Albania), and UKOpenResearchForum.
It should be possible to track the usage of data
Motivation: ASO, LandPortal, LusTRE, OEFS, RDESC and UKOpenResearchForum.
Data consumers should have a way of sharing feedback and rating data.
Motivation: ASO, DadosGovBr, LusTRE, OKFNTransport, OEFS, SharePSI (Share-PSI Feedback, Share-PSI Feedback 2, Share-PSI France).
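Several of the requirements above (metadata that is available, machine-readable and standardized, together with an associated license) can be illustrated by a single dataset description. The sketch below, in Python, builds such a description as JSON-LD using terms from the DCAT and Dublin Core vocabularies. It is one possible approach rather than a recommendation of this document, and the dataset, its title and all `example.org` URLs are hypothetical placeholders.

```python
import json

# Hypothetical dataset description; titles and URLs are placeholders.
dataset_metadata = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://data.example.org/dataset/bus-stops",
    "@type": "dcat:Dataset",
    "dct:title": "Bus stops",
    "dct:description": "Locations of bus stops (placeholder description).",
    # The license is given as a resolvable identifier, not free text.
    "dct:license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "https://data.example.org/bus-stops.csv"},
            "dcat:mediaType": "text/csv",
        }
    ],
}

metadata_json = json.dumps(dataset_metadata, indent=2)
print(metadata_json)
```

Because the description reuses existing vocabulary terms, a consumer that already understands DCAT can find the license and download location without dataset-specific code.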
The editors wish to thank all those who have contributed use cases or commented on those provided by others.
Changes since the previous version include a re-ordering of the use cases as well as the following: