Data on the Web Best Practices Use Cases & Requirements

Abstract

This document lists use cases, compiled by the Data on the Web Best Practices Working Group, that represent scenarios of how data is commonly published on the Web and how it is used. This document also provides a set of requirements derived from these use cases that will be used to guide the development of the set of Data on the Web Best Practices and the development of two new vocabularies: Quality and Granularity Description Vocabulary and Data Usage Description Vocabulary.

2. Use Cases

A use case illustrates an experience of publishing and using Data on the Web. The information gathered from the use cases should be helpful for the identification of the best practices that will guide the publishing and usage of Data on the Web. In general, a use case will be described at least by a statement and a discussion of how the use case is currently implemented. Use case descriptions demonstrate some of the main challenges faced by publishers or developers. Information about challenges will be helpful to identify areas where Best Practices are necessary. From the challenges, a set of requirements is abstracted in such a way that each requirement motivates the creation of one or more best practices.

2.1 ASO: Airborne Snow Observatory

(Contributed by Lewis John McGibbney, NASA Jet Propulsion Laboratory/California Institute of Technology)
URL: http://aso.jpl.nasa.gov/

The two most critical properties for understanding snowmelt runoff and timing are the spatial and temporal distributions of snow water equivalent (SWE) and snow albedo. Despite their importance in controlling volume and timing of runoff, snowpack albedo and SWE are still largely unquantified in the US and not quantified at all in most of the globe, leaving runoff models poorly constrained. NASA/JPL, in partnership with the California Department of Water Resources, has developed the Airborne Snow Observatory (ASO), an imaging spectrometer and scanning Lidar system, to quantify SWE and snow albedo, generate unprecedented knowledge of snow properties for cutting-edge cryospheric science, and provide complete, robust inputs to water management models and systems of the future.

Elements:

Domains: Digital Earth Modeling, Digital Surface Modeling, Spatial Distribution Measurement, Snow Depth, Snow Water Equivalent, Snow Albedo.
Obligation/motivation: Funding provided by NASA Terrestrial Hydrology, NASA Applied Sciences, and California Department of Water Resources.
Usage: Example data usage includes a turnaround of less than 24 hours for flight data, which is passed on to numerous water resource managers to aid water conservation, policy and decision-making processes. Accurate, weekly, spatially distributed SWE has never been produced before, and is highly informative to reservoir managers who must make tradeoffs between storing water for summer water supply versus using water before snowmelt recedes for generation of clean hydropower. Accurate SWE information, when coupled with runoff forecasting models, can also have ecological benefits through avoidance of late-spring high flows released from reservoirs that are not part of the natural seasonal variability.
Quality: Available in a number of scientific formats to customers and stakeholders based on customer requirements.
Lineage: All ASO data stems directly from on-board imaging spectrometer and scanning Lidar system instruments.
Size: Many terabytes in total. Raw data acquisition is dependent on the basin/survey size. Recent individual flights generate on the order of 500 GB, which includes imaging spectrometer and Lidar data. This does, however, shrink considerably if we consider only the data that would be distributed.
Type/format: Digital Elevation Model / binary image (not currently public), Lidar (Raw Point Clouds) / las (not currently public), Raster Zonal Stats / text (not currently public), Snow Water Equivalent / tiff, Snow Albedo / tiff
Rate of change: Recent weekly flights have provided information on a scale and timing that has never occurred before. Distributed SWE increases after storms, and decreases during melt events in patterns that have never before been measured and will be studied by snow hydrologists for years to come. Once data is captured it is not updated; however, subsequent data is generated from the original data within processing pipelines, which include screening for data quality control and assurance.
Data lifespan: For immediate operational purposes, the last flight's data become obsolete when a new flight is made. However, the annual sequence of data sets will be leveraged by snow hydrologists and runoff forecasters during the next decade as they are used to improve models and understanding of the spatial nature of the mountain snowpack.
Potential audience: (snow) hydrologists, hydrologic modelers, runoff forecasters, and reservoir operators and reservoir managers.

Positive aspects:

This use case provides insight into what a NASA-funded demonstration mission looks like from a data provenance and archival point of view.

It is an excellent opportunity to delve into an earth science mission which is actively addressing the global problem of water resource management. Recently, senior officials have declared a statewide (CA) drought emergency and are asking all Californians to reduce their water use by 20 percent. California and other U.S. states are experiencing a serious drought, and the state will be challenged to meet its water needs in the upcoming year. Calendar year 2013 was the driest year in recorded history for many areas of California, and current conditions suggest no change is in sight for 2014. ASO is at the front line of cutting-edge scientific research, meaning that the data that backs the mission, as well as the practices adopted within the project execution, are extremely important to addressing this issue.

Project collaborators and stakeholders are sent data and information when it is produced and curated. For some stakeholders, the data they require (in an operational sense) is very small in size, and in such cases ASO emphasizes speed. For this short-term turnaround, the exchange is more like a sharing of information than the delivery of a product.

Negative aspects:

Demonstration missions of this caliber also have downsides. With regards to data best practices, more work is required in the following areas:

Documentation of processes including data acquisition, provenance tracking, and curation of data products such as bare earth digital elevation models (DEM), full surface digital surface models (DSM), snow products, snow water equivalents (SWE), etc.
Currently data is not searchable, which makes retrieval of specific data difficult when data volumes grow to this size and nature.
There is no publicly available guidance regarding suggested tools which can be used to interact with the data sources.
Quick turnarounds of operational data may be compromised when ASO moves beyond a demonstration mission and picks up new customers, etc. This will most likely be attributable to the time required for the generation and distribution of science-grade products.

Challenges:

Data volumes are large, and will grow year on year. The volume of generated data grew by 50% between 2013 and 2014.
On many occasions we require a very quick turnaround on inferences which can be made from the data. This sometimes (but not always) comes at the cost of reducing the emphasis on best practices for the generation, storage and archival of project data.
The data takes the form of science-oriented representational formats. Such formats are not typical of the data most people publish on the Web. A lot of thought needs to be put into how this data can be better accessed.

Requires: R-AccessUpToDate, R-Citable, R-DataIrreproducibility, R-DataMissingIncomplete, R-FormatMachineRead, R-GeographicalContext, R-GranularityLevels, R-LicenseLiability, R-MetadataAvailable, R-ProvAvailable, R-QualityCompleteness, R-QualityMetrics, R-TrackDataUsage, R-UsageFeedback and R-VocabDocum.

2.2 BBC

Contributors: Ghislain Atemezing (EURECOM)

URL: http://www.bbc.co.uk/ontologies

Overview: The BBC provides a list of the ontologies it implements and uses for its Linked Data platform. The site provides access to the ontologies the BBC uses to support its audience through applications such as BBC Sport or BBC Education. Each ontology has a short description with metadata, an introduction, sample data, an ontology diagram and the terms used in the ontology. The metadata includes six fields that are generally filled in: authors (as mailto links), created date, version (current version number), prior version (decimal), license (a link to the license) and a link for downloading the RDF version. For example, see the description of the “Core concepts ontology.” However, the metadata available in the HTML page is not present in a machine-readable format, i.e. in the ontology itself.

Versioning: each ontology uses a decimal notation for the version, and the URL for accessing each version file of the ontology is constructed as {BASE-URI}/{ONTO-PREFIX}/{VERSION}.ttl, where {BASE-URI} is http://www.bbc.co.uk/ontologies/. For example, the file for version 1.9 of the “core concepts” ontology is located at http://www.bbc.co.uk/ontologies/coreconcepts/1.9.ttl. However, the URI of the ontology itself stays the same across versions and is of the form {BASE-URI}/{ONTO-PREFIX}/.
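
The URL pattern described above can be sketched in a few lines of Python. This is only an illustration of the pattern as stated in this use case; the function names are hypothetical and are not part of any BBC API. Because the stated base URI already ends with a slash, the sketch normalizes slashes so that the result matches the example URL given in the text.

```python
# Base URI as stated in the use case description.
BASE_URI = "http://www.bbc.co.uk/ontologies/"


def ontology_uri(prefix: str) -> str:
    """Version-independent ontology URI: {BASE-URI}/{ONTO-PREFIX}/ (hypothetical helper)."""
    # Strip surplus slashes so the pieces join with exactly one "/" each.
    return f"{BASE_URI.rstrip('/')}/{prefix.strip('/')}/"


def version_file_url(prefix: str, version: str) -> str:
    """URL of one Turtle version file: {BASE-URI}/{ONTO-PREFIX}/{VERSION}.ttl (hypothetical helper)."""
    return f"{ontology_uri(prefix)}{version}.ttl"


if __name__ == "__main__":
    print(ontology_uri("coreconcepts"))
    print(version_file_url("coreconcepts", "1.9"))
```

Note that every version file maps back to the same ontology URI, which is why a consumer cannot tell from the ontology URI alone which version it is looking at.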

Elements:

Domains: vocabulary catalog, versioning, metadata
Obligation/motivation: Provide a single point of access to the vocabularies built within the BBC.
Usage: The site provides access to the ontologies the BBC is using to support its audience using their applications.
Quality: High level and domain vocabularies adapted to BBC applications.
Size: currently, there are 12 ontologies of different sizes, from 40 triples to 750 triples.
Type/format: RDF/TURTLE, and html pages describing each ontology
Rate of change: Depends on the vocabulary and on its different versions, although no such metadata is provided.
Data lifespan: n/a
Potential audience: BBC applications and any user interested in the domains of the vocabularies (publishers, researchers or developers)

Challenges:

It would be useful, and more consistent, to systematically include in the RDF vocabulary the metadata provided in the HTML pages describing each BBC ontology.
How to dereference, from a single URI, different versions of the ontology in different RDF serializations (RDF/XML, Turtle, etc.).
Need to add the modified date along with the version of each ontology.

Requires: R-MetadataDocum, R-MetadataMachineRead, R-FormatMultiple, R-MetadataStandardized and R-VocabVersion.

2.3 Bio2RDF

(Contributed by Carlos Laufer)
URL: http://bio2rdf.org/

Bio2RDF¹ is an open source project that uses Semantic Web technologies to make possible the distributed querying of integrated life sciences data. Since its inception², Bio2RDF has made use of the Resource Description Framework (RDF) and the RDF Schema (RDFS) to unify the representation of data obtained from diverse fields (molecules, enzymes, pathways, diseases, etc.) and heterogeneously formatted biological data (e.g. flat-files, tab-delimited files, SQL, dataset specific formats, XML etc.). Once converted to RDF, this biological data can be queried using the SPARQL Protocol and RDF Query Language (SPARQL), which can be used to federate queries across multiple SPARQL endpoints.

Elements:

Domains: Biological data
Obligation/motivation: Biological researchers are often confronted with the inevitable and unenviable task of having to integrate their experimental results with those of others. This task usually involves a tedious manual search and assimilation of often isolated and diverse collections of life sciences data hosted by multiple independent providers, including organizations such as the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) that provide dozens of user-submitted and curated datasets, as well as smaller institutions such as the Donaldson group that publishes iRefIndex³, a database of molecular interactions aggregated from 13 data sources. While these mostly isolated silos of biological information occasionally provide links between their records (e.g. UniProt links its entries to hundreds of other datasets), the links are typically serialized in either HTML elements or in flat file data dumps that lack the semantic richness required to express the intent of the linkage between data records. With thousands of biological databases and hundreds of thousands of datasets, the ability to find relevant data is hampered by non-standard database interfaces and an enormous number of haphazard data formats⁴. Moreover, metadata about these biological data providers (dataset source data information, dataset versioning, licensing information, date of creation, etc.) is often difficult to obtain. Taken together, the inability to easily navigate through available data presents an overwhelming barrier to their reuse.
Usage: Biological research
Quality: Bio2RDF scripts generate provenance records using the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV) and the Dublin Core vocabulary. Each data item is linked to a provenance object that indicates the source of the data, the time at which the RDF was generated, licensing (if available from the data source provider), the SPARQL endpoint in which the resource can be found, and the downloadable RDF file where the data item is located. Each dataset provenance object has a unique IRI and label based on the dataset name and creation date. The date-specific dataset IRI is linked to a unique dataset IRI using the PROV predicate wasDerivedFrom, such that one can query the dataset SPARQL endpoint to retrieve all provenance records for datasets created on different dates. Each resource in the dataset is linked to the date-unique dataset IRI that is part of the provenance record using the VoID inDataset predicate. Other important features of the provenance record include the use of the Dublin Core creator term to link a dataset to the script on GitHub that was used to generate it, the VoID predicate sparqlEndpoint to point to the dataset SPARQL endpoint, and the VoID predicate dataDump to point to the data download URL.
Dataset metrics:
1. total number of triples
2. number of unique subjects
3. number of unique predicates
4. number of unique objects
5. number of unique types
6. unique predicate-object links and their frequencies
7. unique predicate-literal links and their frequencies
8. unique subject type-predicate-object type links and their frequencies
9. unique subject type-predicate-literal links and their frequencies
10. total number of references to a namespace
11. total number of inter-namespace references
12. total number of inter-namespace-predicate references
Size: At the time of writing, thirty-five datasets have been generated as part of Bio2RDF Release 3. Several of the datasets are themselves collections of datasets that are now available as one resource. Each dataset has been loaded into a dataset-specific SPARQL endpoint using OpenLink Virtuoso. All updated Bio2RDF linked data and their corresponding Virtuoso DB files are available for download.

Type/format: RDF
Rate of change: depends on data source
Data lifespan: depends on data source
Potential audience: Biological researchers
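
As a rough illustration of the first five dataset metrics listed above, the following Python sketch counts them over a small in-memory set of triples. It is not Bio2RDF's actual implementation: the function name, the representation of triples as plain string tuples, and the `ex:` sample data are all assumptions made for this example.

```python
# The rdf:type predicate IRI, used to identify class membership statements.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def dataset_metrics(triples):
    """Compute metrics 1-5 for an iterable of (subject, predicate, object) tuples."""
    graph = set(triples)  # an RDF graph is a set, so duplicate triples count once
    return {
        "triples": len(graph),
        "unique_subjects": len({s for s, _, _ in graph}),
        "unique_predicates": len({p for _, p, _ in graph}),
        "unique_objects": len({o for _, _, o in graph}),
        # "Types" are taken here to be the objects of rdf:type statements.
        "unique_types": len({o for _, p, o in graph if p == RDF_TYPE}),
    }


if __name__ == "__main__":
    # Hypothetical sample data; the ex: names do not come from Bio2RDF.
    sample = [
        ("ex:a", RDF_TYPE, "ex:Gene"),
        ("ex:b", RDF_TYPE, "ex:Gene"),
        ("ex:a", "ex:interactsWith", "ex:b"),
        ("ex:b", "ex:label", "gene b"),
    ]
    print(dataset_metrics(sample))
```

The remaining metrics (link frequencies and namespace references) would need a representation that distinguishes literals from IRIs, which this sketch deliberately leaves out.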

References:

1. Callahan A, Cruz-Toledo J, Ansell P, Klassen D, Tummarello G, Dumontier M: Improved dataset coverage and interoperability with Bio2RDF Release 2. SWAT4LS 2012, Proceedings of the 5th International Workshop on Semantic Web Applications and Tools for Life Sciences, Paris, France, November 28-30, 2012.
2. Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform 2008, 41(5):706-716.
3. Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 2008, 9:405.
4. Goble C, Stevens R: State of the nation in data integration for bioinformatics. J Biomed Inform 2008, 41(5):687-693.

Challenges:

Lack of human-readable metadata.
Data variability (models, sources, etc.).
RDFization of datasets.
Wide variety of formats and technologies.

Potential Requirements:

Dataset versioning and updating mechanisms
Standardization of schemas
Integration with other platforms/services
Data Persistence

Requires: R-AccessLevel, R-AccessUpToDate, R-DataLifecyclePrivacy, R-FormatMultiple, R-FormatStandardized, R-PersistentIdentification and R-VocabReference.

2.4 BuildingEye: SME use of public data

(Contributed by Deirdre Lee)
URL: http://mypp.ie/

Buildingeye.com makes building and planning information easier to find and understand by mapping what's happening in your city. In Ireland, local authorities handle planning applications and usually provide some customized views of the data (PDFs, maps, etc.) on their own Web sites. However, there isn't an easy way to get a nationwide view of the data. BuildingEye, an independent SME, built http://mypp.ie/ to achieve this. Because the local authorities did not have Open Data portals, BuildingEye had to ask each local authority directly for its data. It was granted access by some authorities, but not all. The data it did receive was in different formats and of varying quality/detail. BuildingEye harmonized this data for its own system. However, if another SME wanted to use this data, it would have to go through the same process and again ask each local authority for the data.

Elements:

Domains: Planning data
Obligation/motivation: demand from SME
Usage: Commercial usage
Quality: standardized, interoperable across local authorities
Size: medium
Type/format: structured according to legacy system schema
Rate of change: daily
Potential audience: Business, citizens
Governance: local authorities

Challenges:

Access to data is currently a manual process, on a case by case basis
Data is provided in different formats, e.g. database dumps, spreadsheets
Data is structured differently, depending on the legacy system schema; concepts and terms are not interoperable
No official Open license associated with the data
Data is not available for further reuse by other parties

Potential Requirements:

Creation of top-down policy on open data to ensure common understanding and approach
Top-down guidance on recommended Open license usage
Standardized, non-proprietary formats
Availability of recommended domain-specific vocabularies.

Requires: R-AccessBulk, R-AccessRealTime, R-DataLifecyclePrivacy, R-DataMissingIncomplete, R-DataProductionContext, R-AccessLevel, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-GeographicalContext, R-LicenseAvailable, R-MetadataAvailable, R-MetadataDocum, R-QualityCompleteness, R-QualityComparable, R-SensitivePrivacy, R-SensitiveSecurity and R-VocabDocum.

2.5 Dados.gov.br

(Contributed by Yasodara)
URL: http://dados.gov.br/

Dados.gov.br is the open data portal of Brazil's Federal Government. The site was built by a community network pulled together by three technicians from the Ministry of Planning. They managed the group from INDA, the "National Infrastructure for Open Data." CKAN was chosen because it is free software and offers an independent solution for hosting the Federal Government's data catalog on the Internet.

Elements:

Domains: federal budget, addresses, Infrastructure information, e-gov tools usage, social data, geographic information, political information, Transport information.
Obligation/motivation: Data that must be provided to the public under a legal obligation, the so-called LAI (Brazilian Information Access Act), in force since 2012.
Usage: Data that is the basis for services to the public; Data that has commercial reuse potential.
Quality: Authoritative, clean data, vetted and guaranteed.
Lineage/Derivation: Data came from various publishers. As a catalog, the site has faced several challenges, one of which was to integrate the various technologies and formulas used by publishers to provide datasets in the portal.
Type/format: Tabular data, text data.
Rate of change: Some data is static and some has a high rate of change.

Challenges:

Data integration (lack of vocabularies).
Collaborative construction of the portal: managing online sprints and balancing public expectations.
Licensing the data of the portal. Most of the data in the portal does not have a specific licence, so different types of licence are applied to different datasets.

Requires: R-AccessLevel, R-DataLifecyclePrivacy, R-DataLifecycleStage, R-DataMissingIncomplete, R-FormatStandardized, R-LicenseAvailable, R-MetadataAvailable, R-GeographicalContext, R-MetadataDocum, R-ProvAvailable, R-QualityOpinions, R-UsageFeedback, R-VocabReference and R-VocabVersion.

2.6 Digital archiving of Linked Data

(Contributed by Christophe Guéret)
URL: http://dans.knaw.nl/

Digital archives, such as DANS in the Netherlands, have so far been concerned with the preservation of what could be defined as "frozen" datasets. A frozen dataset is a finished, self-contained set of data that does not evolve after it has been constituted. The goal of the preserving institution is to ensure this dataset remains available and readable for as many years as possible. This can, for example, concern an audio recording, a digitized image, e-books or database dumps. Consumers of the data are expected to look for specific content based on its associated identifier, download it from the archive and use it. Now comes the question of the preservation of Linked Open Data. In contrast to "frozen" datasets, Linked Data (LD) can be qualified as "live" data. The resources it contains are part of a larger entity to which third parties contribute; one of the design principles indicates that other data producers and consumers should be able to point to the data. When LD publishers stop offering their data (e.g. at the end of a project), taking the LD offline as a dump and putting it in an archive effectively turns it into a frozen dataset, just like SQL dumps and other kinds of database exports. The question then is to what extent this is an issue.

Challenges: The archive has to decide whether dereferencing of resources found in preserved datasets is required, and whether to provide a SPARQL endpoint. If data consumers and publishers are fine with having RDF data dumps to be downloaded from the archive prior to their use, just like any other digital item so far, the technical challenges could be limited to handling the size of the dumps and taking care of serialization evolution over time (e.g. from N-Triples to TriG, or from RDF/XML to HDT) as the preference for these formats evolves. Turning a live dataset into a frozen dump also raises the question of scope. Considering that LD items are only part of a much larger graph that gives them meaning through context, the only valid dump would be a complete snapshot of the entire connected component of the Web of Data graph that the target dataset is part of.

Potential Requirements: Decide on the importance of the dereferenceability of resources and the potential implications for domain names and naming of resources. Decide on the scope of the step that will turn a connected sub-graph into an isolated data dump.

Requires: R-AccessLevel, R-PersistentIdentification, R-UniqueIdentifier and R-VocabReference.

2.7 Dutch Base Registers

(Contributed by Christophe Guéret)
URL: http://www.e-overheid.nl/onderwerpen/stelselinformatiepunt/stelsel-van-basisregistraties

The Netherlands has a set of registers that are under consideration for exposure as Linked (Open) Data in the context of the "PiLOD" project. The registers contain information about buildings, people and businesses that other public bodies may want to refer to for their daily activities. One of them is, for instance, the service of public taxes ("BelastingDienst"), which regularly pulls data from several registers, stores this data in a big Oracle instance and curates it. This costly and time-consuming process could be optimized by providing on-demand access to up-to-date descriptions provided by the register owners.

Challenges:

In terms of challenges, linking is for once not much of an issue, as registers already cross-reference unique identifiers (see also http://www.wikixl.nl/wiki/gemma/index.php/Ontsluiting_basisgegevens). A URI scheme with predictable and persistent URIs is being considered for implementation. Actual challenges include:

Capacity: at this point, it is considered unreasonable to ask every register to publish its own data. Some of them export what they have on the national open data portal. This data has been used to do some testing with third-party publications from PiLOD project members but this is rather sensitive as a long term strategy (governmental data has to be traceable/trustable as such). The middle ground solution currently deployed is the PiLOD platform, a (semi)-official platform for publishing register data.
Privacy: some of the register data is personal or may become so when linked to other data (e.g. when addresses are used to disambiguate personal data). Some registers will need to restrict access to some of their data to certain people only (an example of non-open Linked Data). Others can go along with open data as long as they get a precise log of who is using what.
Revenue: institutions working under mixed government/non-government funding generate part of their revenue by selling some of the data they curate. Switching to an open data model will cause a direct loss in revenue that has to be compensated for by other means. This does not have to mean closing the data, e.g. a model of open dereferencing plus paid dumps can be considered, as well as other indirect revenue streams.

Requires: R-AccessLevel, R-FormatMultiple, R-PersistentIdentification, R-SensitivePrivacy, R-UniqueIdentifier and R-VocabReference.

2.8 GS1 Digital

(Contributed by Mark Harrison, University of Cambridge & Eric Kauz, GS1)

Retailers and Manufacturers / Brand Owners are beginning to understand that there can be benefits to openly publishing structured data about products and product offerings on the Web as Linked Open Data. Some of the initial benefits may be enhanced search listing results (e.g. Google Rich Snippets) that improve the likelihood of consumers choosing such a product or product offer over an alternative product that lacks the enhanced search results. However, the longer term vision is that an ecosystem of new product-related services can be enabled if such data is available. Many of these will be consumer-facing and might be accessed via smartphones and other mobile devices, to help consumers to find the products and product offers that best match their search criteria and personal preferences or needs — and to alert them if a particular product is incompatible with their dietary preferences or other criteria such as ethical / environmental impact considerations — and to suggest an alternative product that may be a more suitable match. A more complete description of this use case is available.

Elements:

Domains:
- Product master data (e.g. technical specifications, ingredients, nutritional information, dimensions, weight, packaging).
- Product offerings (e.g. sales price, availability (online, locally), payment options, delivery/collection options).
- Ethical / environmental claims about a product and its production process.
Obligation/motivation:
- initially, enhanced search result listings (e.g. Google Rich Snippets);
- vision is to enable an ecosystem of new digital apps around product data;
- the food sector in the EU is already obliged under new food labelling legislation (EU 1169 / 2011, Article 14) to provide the same amount of information about a food product that is sold online to consumers as the information that would be available to them from the product packaging if they picked up the product in-store. Although the legislation does not suggest that Linked Open Data technology should be used to make the same information available in a machine-readable format, there is currently significant investment and effort to upgrade Web sites to provide accurate and detailed information about food products; the GS1 Digital team consider that for a relatively small amount of effort, these companies could gain some tangible benefits (e.g. enhanced search results) from such compliance efforts by using Linked Open Data technology within their Web pages.
Usage:
- data providing transparency about product characteristics
- data used to help consumers make informed choices about which products to buy/consume
Quality: Very important to have trustworthy authoritative data from respective organizations.
Size: Typically 20+ factual claims per product - probably 40+ RDF triples.
Type/format: HTML + RDFa / JSON-LD / Microdata.
Rate of change: mostly static data initially — but subject to some variation over time
Data lifespan: data should remain accessible until products are no longer considered to be in circulation; this represents a challenge for deprecated product lines. Data that is stated authoritatively by one organization might be embedded / referenced in the data asserted by another organization; this raises concerns about whether embedded data becomes stale if it is inadequately synchronized, and whether referenced data is not dereferenced (and therefore not discovered / gathered) by consumers of the data. From a liability perspective, there also needs to be clarity about which organization asserted which factual information, and about which organization has the authority to assert specific factual claims.
Potential audience: machine consumers (search engines, data aggregators, mobile apps, etc.)

Challenges:

Linked Open Data about products is likely to be highly distributed in nature and various parties have authority over specific claims.
Accreditation agencies have authority over ethical/environmental claims.
Brand owners / manufacturers have authority over product master data.
Retailers have authority over facts related to product offerings (price, availability etc.).
An organization (e.g. retailer) might embed authoritative data asserted by another organization (e.g. brand owner) and there is the risk that such embedded information becomes stale if it is not continuously synchronized.
An organization (e.g. retailer) might reference a graph of authoritative data that can be retrieved via an HTTP request to a remote HTTP URI. There is a risk that software or search engines consuming Linked Open Data containing such references may fail to dereference such HTTP URIs and in doing so may fail to gather all of the relevant data.
Organizations are currently faced with a choice of whether to embed machine-readable structured data in their Web pages using a block approach (e.g. using JSON-LD) or using an inline approach (e.g. using RDFa, RDFa Lite or Microdata). A block approach (JSON-LD) may be simpler and less brittle than inline annotation, especially as it can be easily decoupled from structural changes to the body of the Web page that may happen over time in the redesign of a Web site. At present, tool support for the 3 major markup approaches for embedded Linked Open Data (RDFa, JSON-LD, Microdata) is unequal across the three formats and some tools may not export or import / ingest all 3 formats - some tools even fail to extract data from JSON-LD markup created by their corresponding export tool. There are some significant challenges to ensure that the structured data embedded within a Web page is correctly linked to form coherent RDF triples, without any dangling nodes that should be connected to the Subject or other nodes.
Only through the provision of best-in-class tool support that recognizes all three major formats on a completely equal footing can organizations have confidence that they can use any of the three major markup formats, and be able to verify / validate that their own markup results in the correct RDF triples.
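
To illustrate the "block approach" mentioned in the challenges above, the following Python sketch builds a JSON-LD block that could be placed in a Web page independently of the page's visible markup. It is a minimal sketch under stated assumptions: the function name is hypothetical, the use of the schema.org `Product` type and its `name`, `gtin13` and `description` properties is an assumption rather than a GS1 recommendation, and the product values are invented.

```python
import json


def product_jsonld_block(name: str, gtin13: str, description: str) -> str:
    """Return an HTML script element carrying a JSON-LD product description."""
    data = {
        "@context": "http://schema.org/",  # assumed vocabulary for this sketch
        "@type": "Product",
        "name": name,
        "gtin13": gtin13,
        "description": description,
    }
    payload = json.dumps(data, indent=2)
    # Escape "</" so that text values cannot close the script element early;
    # "\/" is a valid JSON escape for "/", so the data itself is unchanged.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    # Invented example values, not real product data.
    print(product_jsonld_block("Example Muesli", "0000000000000", "Contains nuts."))
```

Because the block is generated separately from the page body, a redesign of the visible HTML does not require the structured data to be re-annotated, which is the decoupling advantage described above.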

Potential Requirements:

The ability to determine who asserted various facts — and whether they are the organization that can assert those facts authoritatively.
Where data from other sources is embedded, there is a risk that the embedded data might be stale. It is therefore helpful to indicate which graph of triples is a snapshot in time from data from another source - and to provide a link to the original source, so that the consumer of the data has the opportunity to obtain a fresh version of the live data rather than relying on a potentially stale snapshot graph of data. DWBP could provide guidance about how to indicate which graph of data is a snapshot and where it came from.
Consumers of Linked Open Data about products might rely on it for making decisions — not only about purchase but even consumption. If the data about a product is inaccurate or out-of-date, we might need to provide some guidance about how liability terms and disclaimers can be expressed in Linked Open Data. We’re not suggesting that we define such terms from a legal perspective, but perhaps there is an existing framework in a similar way that there is an existing framework for expressing various licences of the data? If not, perhaps such a framework needs to be developed - but outside of the DWBP group? Licensing generally says what you’re allowed to do with the data - but I don’t think it says anything about liability for using the data or making decisions based on that data. This area probably needs some clarification, particularly if there is a risk of injury or death (due to inaccurate information about allergens in a food product).
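
The snapshot requirement above (marking an embedded graph as a copy and linking back to its source) could be sketched as follows. This is only an illustration: the field names graph, source and retrieved, the example.org URIs and the 30-day threshold are all assumptions, not terms defined by any vocabulary.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical description of an embedded graph copied from a brand owner.
snapshot = {
    "graph": "http://example.org/retailer/graphs/product-1-copy",
    "source": "http://example.org/brand/product/1",  # where the live data lives
    "retrieved": datetime(2014, 9, 1, tzinfo=timezone.utc),
}


def is_stale(snapshot, now, max_age=timedelta(days=30)):
    """True when the snapshot is older than max_age and should be refreshed."""
    return now - snapshot["retrieved"] > max_age


now = datetime(2014, 10, 15, tzinfo=timezone.utc)
if is_stale(snapshot, now):
    print("Stale snapshot, refresh from", snapshot["source"])
```

Because the snapshot records both its retrieval time and its source, a consumer can decide for itself whether to trust the copy or fetch the live data.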

Requires: R-AccessUpToDate, R-Citable, R-FormatMultiple, R-FormatStandardized, R-LicenseLiability, R-PersistentIdentification and R-ProvAvailable.

2.9 ISO GEO Story

(Contributed by Ghislain Atemezing)

ISO GEO manages catalog records of geographic information in XML that conform to ISO-19139, a French adaptation of ISO-19115 (data sample). They export thousands of records like that today but they need to manage them better. In their platform, they store the information in a more conventional manner and use this standard to export datasets compliant with the INSPIRE standards or via the OGC's CSW protocol. Sometimes, they have to enrich their metadata using tools like GeoSource and accessed through an SDI with their own metadata records. ISO GEO wants to be able to integrate all the different implementations of ISO-19139 in different tools in a single framework to better understand the thousands of metadata records they use in their day-to-day business. Types of information recorded in each file include: contact info (metadata) [data issued], spatial representation, reference system info [code space], spatial resolution, geographic extension of the data, file distribution, data quality and process step (example).

Challenges:

Achieve interoperability between supporting applications, e.g. validation and discovery services built over a metadata repository.
Capture the semantics of the current metadata records with respect to ISO-19139.
A unified way to have access to each record within the catalog at different levels: local, regional, national or EU level.

Requires: R-AccessUpToDate, R-DataEnrichment, R-FormatLocalize, R-FormatMachineRead, R-GranularityLevels, R-LicenseAvailable, R-MetadataMachineRead, R-MetadataStandardized, R-PersistentIdentification, R-ProvAvailable and R-VocabReference.

2.10 The Land Portal

(Contributed by Carlos Iglesias)
URL: http://landportal.info/

The IFAD Land Portal platform has been completely rebuilt as an Open Data collaborative platform for the Land Governance community. Among the new features, the Land Portal provides access to more than 100 indicators from more than 25 different sources on land governance issues for more than 200 countries around the world, as well as a repository of land-related content and documentation. Thanks to the new platform, people can

curate and incorporate new data and metadata by means of different data importers and making use of the underlying common data model;
search, explore and compare the data through countries and indicators; and
consume and reuse the data by different means (i.e. raw data download at the data catalog; linked data and SPARQL endpoint at RDF triplestore; RESTful API; and built-in graphic visualization framework).

Elements:

Domains: Land Governance; Development
Obligation/motivation: To find reliable data-driven indicators on land governance and put them all together to facilitate access, study, analysis, comparison and the detection of data gaps.
Usage: Research; Policy Making, Journalism; Development; Investments; Governance; Food security; Poverty; Gender issues.
Quality: Every sort of data, from high quality to unverified.
Size: Varies, but low-medium in general.
Type/format: Varies: APIs; JSON; spreadsheets; CSV; HTML; XML; PDF...
Rate of change: Usually yearly, but also higher rates (monthly, quarterly...).
Data lifespan: Unlimited.
Potential audience: Practitioners; Policy makers; Activists; Researchers; Journalists.

Challenges:

Data coverage.
Quality of data and metadata.
Lack of machine-readable metadata.
Inconsistency between different data sources.
Wide variety of formats and technologies.
Some non machine-readable formats.
Data variability (models, sources, etc.).
Data provenance.
Diversity and (sometimes) complexity of Licenses.
Internationalization issues (e.g. different formats for numbers, dates, etc.) and multilingualism.
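
To make the internationalization challenge concrete, the sketch below normalizes numbers and dates that arrive in different regional formats. It assumes that the decimal separator and date pattern of each source are known in advance; the function names and sample values are purely illustrative.

```python
from datetime import datetime


def parse_number(text, decimal_sep):
    """Parse a localized number given the decimal separator used by its source."""
    thousands_sep = "." if decimal_sep == "," else ","
    cleaned = text.replace(thousands_sep, "").replace(" ", "")
    return float(cleaned.replace(decimal_sep, "."))


def parse_date(text, fmt):
    """Parse a date with a source-specific strptime format into a date object."""
    return datetime.strptime(text, fmt).date()


print(parse_number("1.234,5", ","))   # comma as decimal separator
print(parse_number("1,234.5", "."))   # point as decimal separator
print(parse_date("27/03/2014", "%d/%m/%Y").isoformat())
print(parse_date("03/27/2014", "%m/%d/%Y").isoformat())
```

Without metadata stating which convention a source uses, a value such as 03/04/2014 cannot be interpreted reliably, which is why machine-readable locale information matters.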

Potential Requirements:

Availability of general use taxonomies (countries, topics, etc.).
Data interoperability i.e. domain-specific vocabularies for a common data model with reference formats and protocols.
Data persistence.
Versioning mechanisms.

Requires: R-AccessBulk, R-AccessRealTime, R-DataEnrichment, R-DataVersion, R-FormatLocalize, R-FormatMachineRead, R-FormatMultiple, R-FormatStandardized, R-GeographicalContext, R-GranularityLevels, R-MetadataAvailable, R-MetadataMachineRead, R-MetadataStandardized, R-ProvAvailable, R-PersistentIdentification, R-QualityCompleteness, R-QualityMetrics, R-TrackDataUsage, R-UniqueIdentifier, R-VocabDocum, R-VocabOpen, R-VocabReference and R-VocabVersion.

2.11 LA Times' Reporting of Ron Galperin's Infographic

(Contributed by Phil Archer)
URL: http://articles.latimes.com/2014/mar/27/local/la-me-ln-gender-wage-gap-city-government-20140327

On 27 March 2014, the LA Times published a story, "Women earn 83 cents for every $1 men earn in L.A. city government". It was based on an Infographic released by LA's City Controller, Ron Galperin. The Infographic was based on a dataset published on LA's open data portal, Control Panel LA. That portal uses the Socrata platform, which offers a number of spreadsheet-like tools for examining the data, as well as the ability to download it as CSV, embed it in a Web page and see its metadata.

Positive aspects:

The LA Times story makes its sources clear (it also links to a related Pew Research Center article).
It offers readers a commentary on the particular issue raised and is easy for anyone to digest.
Data sources are cited directly and can be followed up on by (human) readers.

Negative aspects:

The Infographic itself only cites the data portal, not the specific dataset.
The metadata provided on the data portal is very sparse with many fields left empty.
The dataset is itself the result of an analysis (there are only 8 lines in the table); the raw data on which it is based is not cited, let alone made available, and the methods used are not described.

Challenges:

Data Citation - how could Ron Galperin have referred to the source data in the Infographic? (the URI is way too long). QR code? Short PURL?
How could the publisher of the data link to the Infographic as a visualization of it?
In this case, the creator of the underlying data is the same as the creator of the Infographic, but if they were different, how could the data creator discover the Infographic, still less the media report about it?
The methodology used is not explained - making it hard to assess trustworthiness. How can provenance be described?
The metadata is incomplete and does not use a recognized standard vocabulary, making automated discovery and use by anyone other than the data creator difficult.

Other Data Journalism blogs:

Requires: R-Citable, R-DataMissingIncomplete, R-DataProductionContext, R-FormatMultiple, R-FormatOpen, R-GeographicalContext, R-LicenseAvailable, R-MetadataAvailable, R-MetadataStandardized, R-QualityMetrics, R-UniqueIdentifier and R-TrackDataUsage.

2.12 LusTRE: Linked Thesaurus fRamework for Environment

(Contributed by Riccardo Albertoni, CNR-IMATI, Genoa, Italy)
URL: http://linkeddata.ge.imati.cnr.it/

LusTRE is a framework that combines existing thesauri to support the management of environmental resources. It considers the heterogeneity in scope and levels of abstraction of existing environmental thesauri as an asset when managing environmental data, thus it exploits Linked Data (SKOS, RDF etc.) in order to provide a multi-thesauri solution for INSPIRE data themes related to nature conservation.

LusTRE is intended to support metadata compilation and data/service discovery according to ISO 19115/19119. The development of LusTRE includes:

a review of existing environmental thesauri and their characteristics in terms of multilingualism, openness and quality;
the publication of environmental thesauri as Linked Data;
the creation of linksets among published thesauri as well as well-known thesauri exposed as Linked Data by third-parties;
the exploitation of aforementioned linksets to take advantage of thesaurus complementarities in terms of domain specificity and multilingualism.

Quality of thesauri and linksets is an issue that is not necessarily limited to the initial review of thesauri; it should be monitored and promptly documented.

In this respect, a standardized vocabulary for expressing dataset and linkset quality would be needed to make the quality assessment of thesauri included in LusTRE accessible. Considering the importance of linkset quality in achieving effective cross-walking among thesauri, further services for assessing the quality of linksets are going to be investigated. Such services might be developed by extending the measure proposed in Albertoni et al, 2013 (PDF), so that linksets among thesauri can be assessed considering their potential when exploiting interlinks for thesaurus complementarities.

LusTRE is currently under development within the EU project eENVplus (CIP-ICT-PSP grant No. 325232). It extends the common thesaurus framework De Martino et al. 2011, which resulted from the earlier EU project NatureSDIplus (ECP-2007-GEO-317007).

Elements:

Domains: Geographic information. Thesauri and controlled vocabularies provided within LusTRE are meant to ease the management of geographical data and services.
Obligation/motivation: Activity foreseen in EU project which encourages the adoption of INSPIRE metadata implementation rules.
Usage: Data that is the basis for services to the public.
Quality: Largely variable.
Lineage: Thesauri and controlled vocabulary provided come from third parties.
Size: Small, most of the thesauri size is less than 100MB.
Type/format: LusTRE publishes SKOS/RDF, but the thesauri considered for inclusion in LusTRE are not necessarily in that format.
Rate of change: Depends on the thesaurus; on average the rate of change is low.
Data lifespan: Beyond the lifespan of the eENVplus project (2013 – 2015).
Potential audience: Public administrations involved in the cataloguing of geographical information and Spatial Data Infrastructure. Decision makers searching in Spatial Data Infrastructure.

Positive aspects: The use case includes publication as well as consumption of data.

Challenges:

Diversity and (sometimes) complexity of Licenses.
Issues pertaining to multilingualism.
Assessment and documentation of dataset and linkset quality with domain-dependent quality metrics.

Requires: R-AccessBulk, R-Citable, R-DataEnrichment, R-DataVersion, R-FormatMachineRead, R-FormatMultiple, R-FormatOpen, R-FormatStandardized, R-LicenseAvailable, R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead, R-MetadataStandardized, R-PersistentIdentification, R-ProvAvailable, R-QualityComparable, R-QualityCompleteness, R-QualityMetrics, R-QualityOpinions, R-TrackDataUsage, R-UniqueIdentifier, R-UsageFeedback, R-VocabDocum, R-VocabOpen, R-VocabReference and R-VocabVersion.

2.13 Machine-readability and Interoperability of Licenses

(Contributed by Deirdre Lee, based on post by Leigh Dodds)

There are many different licenses available under which data can be published on the Web, e.g. Creative Commons, Open Data Commons, national licenses, etc. It is important that the license is available in a machine-readable format. Leigh Dodds has done some work towards this with the Open Data Rights Statement Vocabulary, including guides for publishers and reusers. Another issue is that when data under different licenses are combined, the license terms under which the data is available also have to be merged. This interoperability of licenses is a challenge.

Challenges:

Standard vocabulary for data licenses.
Machine-readability of data licenses.
Interoperability of data licenses.
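
The value of machine-readable license statements can be illustrated with a small sketch. It assumes each dataset description carries its license as a URI under a key named license (a hypothetical field name); the dataset titles are invented. The code only detects missing or mixed licenses. It does not attempt to decide whether the terms are compatible, which remains a legal question.

```python
# Hypothetical machine-readable metadata for three datasets to be combined.
datasets = [
    {"title": "Bus stops", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"title": "Timetables", "license": "http://opendatacommons.org/licenses/odbl/1.0/"},
    {"title": "Fares"},  # no license statement at all
]


def license_report(datasets):
    """Group dataset titles by license URI; None collects unlicensed datasets."""
    report = {}
    for ds in datasets:
        report.setdefault(ds.get("license"), []).append(ds["title"])
    return report


report = license_report(datasets)
missing = report.get(None, [])
distinct = [uri for uri in report if uri is not None]
print("Missing license:", missing)
print("Combined terms need review:", len(distinct) > 1)
```

A check like this is only possible when the license is expressed as data rather than as free text on a Web page.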

Requires: R-LicenseAvailable

NB there is also a requirement for licenses to be interoperable but this is out of scope as defined by the Working Group's charter.

2.14 Mass Spectrometry Imaging (MSI)

(Contributed by Annette Greiner, Lawrence Berkeley National Laboratory, California)
URL: https://openmsi.nersc.gov/

Mass spectrometry imaging (MSI) is widely applied to image complex samples for applications spanning health, microbial ecology, and high throughput screening of high-density arrays. MSI has emerged as a technique suited to resolving metabolism within complex cellular systems, where understanding the spatial variation of metabolism is vital for making a transformative impact on science. Unfortunately, the scale of MSI data and the complexity of analysis present an insurmountable barrier to scientists: a single 2D image may be many gigabytes, and comparison of multiple images is beyond the capabilities available to most scientists. The OpenMSI project will overcome these challenges, making MSI broadly available to researchers by providing a Web-based gateway for management and storage of MSI data, visualization of the hyper-dimensional contents of the data, and statistical analysis.

Elements:

Domains: imaging mass spectrometry, life sciences, microscopy, analytical chemistry.
Obligation/motivation: scientific analysis, reporting results, collaboration
Usage: Data sets can be contributed by researchers anywhere in the world and perused/analyzed by anyone. Users can share their data with individuals and the public using a familiar group and users view/edit/own permission scheme. Once their dataset is in the system, a researcher can select subsets of the data for viewing as an image or spectrum. Researchers can perform statistical analysis of their data, e.g., via non-negative matrix factorization, while the API and online viewers enable users to interact with derived analytics in the same way as with raw data. Users can also download individual images. A REST API provides programmatic access to enable custom remote data analytics and retrieval of data subsets.
Quality: varies with mass spectrometry instrument used, preparation of sample.
Size: Average sizes typically range from 10-50 GB per sample (before compression). Larger images of 50 - 500GB can already be generated today. Each lab with an OpenMSI account generates typically 2-5 samples per week.
Type/format: Multiscale, multimodal, and multidimensional data stored using the OpenMSI file format based on HDF5.
Rate of change: Underlying data for an experiment does not generally change, though new analyses and metadata will be added over time.
Data lifespan: years to decades.
Potential audience: working scientists interested in obtaining spatially resolved chemical information about samples including scientists researching cancer, agriculture, and synthetic biology.

Positive aspects: huge improvement in ease of analysis over traditional methods, ability to readily share results with other researchers, ability to download relevant subsets of data, provides metadata for each sample, self-describing data format, fast and flexible Web API, interactive Web-based exploration that enables user to view data that cannot be opened using standard MSI tools.

Negative aspects: submission of metadata should be easier and automated. As it scales, we'll need to facilitate discovery of datasets of interest via search.

Challenges: Project is largely unfunded and resources are vitally needed for project to succeed.

Requires: R-AccessRealTime, R-APIDocumented, R-DataEnrichment, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead and R-SensitiveSecurity.

2.15 OKFN Transport WG

(Contributed by Deirdre Lee based on the 2012 ePSI Open Transport Data Manifesto)

The Context: Transportation is an important contemporary issue that has a direct impact on economic strength, environmental sustainability and social equity. Accordingly, transport data — largely produced or gathered by public sector organisations or semi-private entities, quite often locally — represents one of the most valuable sources of public sector information (PSI, also called ‘open data’), a key policy area for many, including the European Commission.

The Challenge: Combined with the advancement of Web technologies and the increasing use of smart phones, the demand for high quality machine-readable and openly licensed transport data, allowing for reuse in commercial and non-commercial products and services, is rising rapidly. Unfortunately this demand is not met by current supply: many transport data producers and holders (from the public and private sectors) have not managed to respond adequately to these new challenges set by society and technology.

So what do we need?

Access to any transport data of any operator, of high quality, in real time, against free or at least fair standard conditions.
An inclusive infrastructure, based on common open, non-discriminatory and interoperable standards and APIs, to which operators, service providers, developers and users can connect.
An ecosystem wherein universal access and re-usability of transport data is the rule, not the exception.

Why is this not happening?

Data that is necessary for integrated personal transportation solutions is rich and encompasses several domains (geospatial data, environmental data, private service provider data), involving a wide array of data holders from the public and private sectors. Because of its very nature, transport data is often held locally.
Legacies create lock-ins that prevent adoption of open standards and hamper interoperability.
Many operators and incumbent service providers, in particular those relying on income from sales of data, still regard selective and exclusive access to transport data as a competitive advantage, restricting access and reuse through the exercise of intellectual property rights.
Perceived liability risks, often associated with data quality issues, prevent operators from opening up their data.
Significant differences between countries, regions and transport modalities in terms of level of development, market maturity and associated business models prevent a ‘one size fits all’ solution.
A lack of leadership in the value chain, either by the industry or from the authorities (whatever the level), limits governance capabilities as to establishment of access, accessibility and other framework conditions, creating a need for a subtle mix of mostly bottom-up instruments and a dash of top-down measures.
Existing market players with associated interests turn governmental actions into a delicate matter, in particular as to the question of where the role of the government should start and end within the value chain and where the market parties should take over and become the driving factor.
Where market parties need to step in, the lack of a clear and predictable environment prevents businesses from establishing a long-term perspective, whereby fair competition needs to be safeguarded.

Requires: R-AccessBulk, R-AccessLevel, R-AccessRealTime, R-AccessUpToDate, R-APIDocumented, R-DataMissingIncomplete, R-DataProductionContext, R-FormatLocalize, R-FormatMachineRead, R-FormatOpen, R-GeographicalContext, R-LicenseAvailable, R-LicenseLiability, R-MetadataAvailable, R-QualityComparable, R-QualityCompleteness, R-QualityMetrics, R-SLAAvailable, R-UsageFeedback and R-VocabOpen.

2.16 Open City Data Pipeline

(Contributed by Deirdre Lee, based on a presentation by Axel Polleres at EDF 2014).

The Open City Data Pipeline aims to provide an extensible platform to support citizens and city administrators by providing city key performance indicators (KPIs), leveraging open data sources. An assumption of open data is that “Added value comes from comparable open datasets being combined.” Open data needs stronger standards to be useful, in particular for industrial uptake. Industrial usage has different requirements than that of an application-building hobbyist or civil society so it's important to think how open data can be used by industry at the time of publication. The Open City Data Pipeline project has developed a data pipeline to:

(semi-)automatically collect and integrate various open data sources in different formats;
compose and calculate complex city KPIs from the collected data.

Current Data Summary

Ca. 475 different indicators
Categories: Demography, Geography, Social Aspects, Economy, Environment, etc.
from 32 sources (html, CSV, RDF …)
Wikipedia, urbanaudit.org, Statistics from City homepages, country Statistics, iea.org.
Covering 350+ cities in 28 European countries.
District data for selected cities (Vienna, Berlin).
Mostly snapshots, partially covering timelines.
On average ca. 285 facts per city.

Base assumption (for our use case): Added value comes from comparable open datasets being combined

Challenges & Lessons Learnt:

Incomplete Data: can be partially overcome by:
- ontological reasoning (RDF & OWL), by aggregation, or by rules & equations, e.g. :populationDensity = :population/:area;
- By statistical methods or Multi-dimensional Matrix Decomposition (unfortunately only partially successful, because these algorithms assume normally-distributed data).
Incomparable data:
- dbpedia:populationTotal
- dbpedia:populationCensus
Heterogeneity across open government data efforts:
- different indicators, different temporal and spatial granularity;
- different licenses of open data: e.g. CC-BY, country specific licences, etc.
- Heterogeneous formats and heterogeneity within formats, especially CSV.
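
The rule-based completion mentioned above (:populationDensity = :population/:area) can be sketched in a few lines of Python. This is a minimal illustration with invented city records, not the pipeline's actual implementation; it fills in a missing value only when the other two values of the equation are known.

```python
# Hypothetical city records; None marks a value the source did not publish.
cities = {
    "CityA": {"population": 1_800_000, "area": 400.0, "populationDensity": None},
    "CityB": {"population": None, "area": 900.0, "populationDensity": 4000.0},
    "CityC": {"population": None, "area": None, "populationDensity": 1200.0},
}


def complete(record):
    """Fill at most one missing value using density = population / area."""
    pop, area, dens = (record[k] for k in ("population", "area", "populationDensity"))
    filled = dict(record)
    if dens is None and pop is not None and area:
        filled["populationDensity"] = pop / area
    elif pop is None and dens is not None and area is not None:
        filled["population"] = dens * area
    elif area is None and pop is not None and dens:
        filled["area"] = pop / dens
    return filled


for name, record in cities.items():
    print(name, complete(record))
```

CityC stays incomplete because two of its three values are missing, which is the situation where statistical methods would have to take over.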

Challenges:

Incomplete data (can be overcome using semantic technologies and/or statistical methods).
Heterogeneity (indicators, licenses, formats).
Open data needs stronger standards to be useful (in particular for industrial uptake), at a metadata level, and dataset level.
Metadata is not always uniform, not only titles of columns, but standardization about units, etc.

Requires: R-DataMissingIncomplete, R-DataProductionContext, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-FormatLocalize, R-GeographicalContext, R-LicenseAvailable, R-MetadataAvailable, R-MetadataStandardized, R-QualityComparable, R-QualityCompleteness, R-VocabDocum, R-VocabOpen and R-VocabReference.

2.17 Open Experimental Field Studies

(Contributed by Eric Stephan)

In 2013 the United States White House published an executive order on Open Data to help make publicly available data understandable, accessible, and searchable. A number of historical and on-going atmospheric studies fall into this category but are not currently open. This use case describes characteristics of laboratory experiments and field studies that could be published as open data.

For measurements to be considered useful and comparable to other findings, scientists need to track every aspect of their laboratory and field experiments. This can include: background describing the purpose of the experiment, field site selected, instrumentation deployed, configuration settings, housekeeping data, types of measurements that need to be taken, work performed on field visits, processing the raw measurements, intermediate processing data, value added data products, quality assurance, problem reporting, and standards relied upon for disseminating the study results including selected data formats, quality control codes selected, engineering units selected, and metadata vocabularies relied upon for describing the measurements.

Traditionally knowledge and data about the studies have either been kept in separate local databases, file systems and spreadsheets, or in non-record keeping systems. If kept electronically the experiment in its entirety may be kept in bulk by way of archive files (tar, zip etc). Measurements from the study may be shared along with background information in the form of a summarized report or publication, content management system or wiki site and the bulk of knowledge is largely retained internally by data providers.

Elements:

Domains: Open scientific experimental research relying upon in situ and remote sensing instruments. E.g. wind studies that may use anemometers and Lidar to study wind measurements.
Obligation/motivation: Answer scientific questions about the characteristics and behavior of the physical system being studied.
Usage: Data may be analyzed and visualized by applications, used in computational models or combined in larger data sets for larger studies.
Quality: Housekeeping data, problem reporting, maintenance history, calibration history.
Size: Dependent on the length of the study, measurement rate, and the size of each sample. Size can vary from kilobytes to tens of gigabytes daily for a single instrument.
Type/format: raw data is dictated by the instrument producing the measurements. Intermediate results and value added products can be in binary, delimited text file, NetCDF, or stored in other formats.
Rate of change: depends on the measurement rate.
Data lifespan: This may vary between scientific communities. For atmosphere field studies data cannot be reproduced and may be retained forever. If a laboratory experiment can be repeated, it may have a limited lifespan. In cases where data is cited even repeatable experiments will be available to back up the published research findings.
Potential audience: domain experts and scientific peers, science teachers and students. Other domains will use these results.

Positive aspects: The Web of Things (instruments), Linked Services (processing software), and Linked Data communities offer an opportunity to field or laboratory experiments by coupling all the elements of the experiment into one composite product. Leveraging these technologies it is possible to construct a catalog that acts as a concierge to any collaborator giving them perspectives on things, services, and data.

Negative aspects:

When data is published on the Web there is no mechanism for users to rate and review data.
Data providers usually are unaware of new user communities using measurements.

Challenges:

Publishing experiments to publicly accessible Web-based archives.
Advertising experiments in catalogs that include comprehensive information about the things and services used in the experiment.
Providing a composite experiment in such a way that it is useful to users who are not fellow collaborators.
Identifying new emerging target user communities.
Without specific best practices guidance data may not be published and irreproducible data risks being lost.
Policies need to be provided in the experimental design stating when it is acceptable to publish data and when to keep it initially private.

Requires: R-AccessRealTime, R-DataIrreproducibility, R-DataLifecycleStage, R-DataProductionContext, R-FormatMachineRead, R-FormatMultiple, R-FormatStandardized, R-TrackDataUsage, R-UsageFeedback, R-VocabOpen and R-VocabReference.

2.18 Resource Discovery for Extreme Scale Collaboration (RDESC)

(Contributed by Sumit Purohit)
URL: http://rdesc.org/

RDESC's objective is to develop a capability for describing, linking, searching and discovering scientific resources used in collaborative science. For the purpose of capturing semantic context, RDESC adopts sets of existing ontologies where possible such as FOAF, BIBO and schema.org. RDESC also introduced new concepts in order to provide a semantically integrated view of the data. Such concepts have two distinct functions. The first is to preserve the semantics of the source that are more specific than what already existed in the ontology. The second is to provide broad categorization of existing concepts as it becomes clear that concepts are forming general groups. These generalizations enable users to work with concepts they understand, rather than needing to understand the semantics of many different systems. It strives to provide a lightweight enough framework to be used as a component in any software system such as desktop user environments or dashboards but also be scalable to millions of resources.

Elements

Domains:
- Scientific Resources: Instruments, Organizations, People
- Bibliographic Resources : Publications, Citations
- Physical Properties : soil moisture
- Digital Data Curation.
Obligation/motivation:
- Show value to data publishers in publishing High Quality Linked Data based resources.
- Search/Browse/Discover semantically tagged data.
- Recommend "Similar" data.
- Use of RDFa, Schema.org in HTML to let standard Web search engine index published pages.
Usage:
- User providing more expressive queries to search data.
- User able to get as close as possible to the source of data.
- User able to find "Similar" data.
Quality: is important to maintain correctness and quality of search result.
Size: Order of 1-2B triples as of 19 September 2014
Type/format: RDF LinkedData
Rate of change: No Formal Update Cycle as of now but data has been updated Quarterly
Potential audience: Scientific Community, Decision Makers

Positive aspects:

Persistent URIs with content negotiation. RDESC uses persistent URIs to describe all the entities in the system.
Use of existing ontologies such as foaf, bibo, schema.org
Published specialized RDESC ontology for scientific resources : RDESC Ontology (Turtle)
Enable application developers to use any kind of user interface suitable for their user needs.
The provision of examples.
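
The combination of persistent URIs and content negotiation listed above can be illustrated with a simplified sketch of server-side media type selection. The list of available representations is an assumption, and the parsing handles only exact media types, the */* wildcard and q-values; partial wildcards such as text/* are deliberately left out.

```python
# Representations a hypothetical server can produce for one persistent URI.
AVAILABLE = ["text/html", "text/turtle", "application/rdf+xml"]


def parse_accept(header):
    """Turn an Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((fields[0], q))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header, available=AVAILABLE):
    """Pick the best available media type, or None if nothing is acceptable."""
    for media_type, q in parse_accept(header):
        if q == 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None


print(negotiate("text/turtle, text/html;q=0.8"))
print(negotiate("application/json, */*;q=0.1"))
```

With this approach the same URI serves HTML to a browser and Turtle to an RDF client, so the identifier can stay stable while the representations evolve.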

Negative aspects:

Difficulties in Data Curation

Challenges:

Scalability of such systems.
Automated data curation pipelines.
Metadata about Quality of Published Data.
Frequency of Data Update.
User Feedback for data correction/annotation.

Potential Requirements:

Use of Persistent URIs.
Recommending abstract and domain specific ontologies/vocabularies.
Requirements to publish quality of published data.

Requires: R-AccessLevel, R-AccessRealTime, R-Citable, R-DataLifecyclePrivacy, R-DataMissingIncomplete, R-FormatStandardized, R-PersistentIdentification, R-ProvAvailable, R-SensitiveSecurity, R-SLAAvailable, R-TrackDataUsage, R-UniqueIdentifier, R-VocabOpen, R-VocabReference and R-UsageFeedback.

2.19 Recife Open Data Portal

(Contributed by Bernadette Lóscio)
URL: http://dados.recife.pe.gov.br/

Recife is a city situated in the Northeast of Brazil and is famous for being one of Brazil's biggest tech hubs. Recife is also one of the first Brazilian cities to release data generated by public sector organizations for public use as open data. The Open Data Portal Recife was created to offer access to a repository of governmental machine-readable data about several domains, including finances, health, education and tourism. Data is available in CSV and GeoJSON formats and every dataset has metadata that helps in the understanding and usage of the data. However, the metadata is not provided using standard vocabularies or taxonomies. In general, data is created in a static way: data from relational databases is exported in CSV format and then published in the data catalog. Currently, work is under way to dynamically generate data from relational databases so that data will be available as soon as it is created. The main phases of the development of this initiative were: educating people about open data; identifying the data sources that potential consumers could find useful; extracting and transforming data from the original sources into open formats; configuring and installing the open data catalog tool; and publishing the data and releasing the portal.
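
The static publication step described above, exporting a relational table to CSV, might look roughly like the following sketch. It uses an in-memory SQLite database as a stand-in for the city's systems; the table name, columns and rows are invented for illustration and do not reflect the portal's actual data.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV with a header row."""
    # The table name is interpolated, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


# Hypothetical in-memory database standing in for a city system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health_units (name TEXT, district TEXT, beds INTEGER)")
conn.executemany(
    "INSERT INTO health_units VALUES (?, ?, ?)",
    [("Unit A", "District 1", 40), ("Unit B", "District 2", 25)],
)

buffer = io.StringIO()
export_table_to_csv(conn, "health_units", buffer)
print(buffer.getvalue())
```

Running such an export on a schedule, or on each database change, is one way to move from static snapshots towards the dynamic generation the portal is working on.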

Elements:

Domains: Base registers, cultural heritage information, geographic information, infrastructure information, social data and tourism information
Obligation/motivation: Data that must be provided to the public under a legal obligation (Brazilian Information Access Act, edited in 2012); Provide public data to citizens.
Usage: Data that supports democracy and transparency; data used by application developers.
Quality: Verified and clean data.
Size: in general small to medium CSV files.
Type/format: CSV, GeoJSON
Rate of change: different rates of change depending on the data source.
Potential audience: application developers, startups, government organizations.

Challenges:

Use of common vocabularies to facilitate data integration.
Provide structural metadata to help understanding and usage.
Automate the data publishing process to keep data up to date and accurate.

Requires: R-MetadataMachineRead, R-MetadataDocum, R-MetadataStandardized, R-QualityComparable, R-QualityCompleteness, R-VocabDocum, R-VocabOpen and R-VocabReference.

2.20 Retrato da Violência (Violence Map)

(Contributed by Yasodara)
URL: https://github.com/dataviz/retrato-da-violencia.org

This is a data visualization made in 2012 by Vitor Batista, Léo Tartari and Thiago Bueno for a W3C Brazil Office challenge about data from Rio Grande do Sul (a Brazilian region). The data was released in a .zip package; the original format was .csv. The code and the documentation of the project are in its GitHub repository.

Elements:

Domains: political information, regional security information.
Obligation/motivation: Data that must be provided to the public under a legal obligation, the LAI or Brazilian Information Access Act, edited in 2012.
Quality: not guaranteed.
Type/format: Tabular data.
Rate of change: There are no new releases of the data; this was a one-off.

Positive Aspects: the decision to transform the CSV into JSON was based on the need for hierarchical data. The ability to map the CSV structure to XML or JSON was considered a positive, since JSON can cover more complex structures.
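As a rough illustration of why a hierarchical format helps here, the sketch below groups flat CSV rows under the values of one column and serializes the result as JSON. The column names and sample rows are hypothetical; the source does not describe the actual structure of the dataset or how the project performed the conversion.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample; the real dataset's columns are not described in the source.
SAMPLE_CSV = """city,year,incidents
City A,2010,12
City A,2011,9
City B,2010,4
"""


def csv_to_nested(text, group_by):
    """Group flat CSV rows under the distinct values of one column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        key = row.pop(group_by)  # the grouping column becomes the parent key
        groups[key].append(row)
    return dict(groups)


nested = csv_to_nested(SAMPLE_CSV, "city")
print(json.dumps(nested, indent=2))
```

Each city now appears once as a key with its yearly records nested beneath it, a shape that a flat CSV row cannot express directly.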

Negative Aspects: the data is already outdated (in 2014), there is no provision for new releases, and there's no associated metadata.

Requires: R-AccessUpToDate, R-MetadataAvailable, R-MetadataStandardized, R-PersistentIdentification, R-QualityCompleteness and R-SensitiveSecurity.

2.21 Share-PSI 2.0: Uses of Open Data Within Government for Innovation and Efficiency

(Contributed by Phil Archer on behalf of Share-PSI 2.0)
URL: http://www.w3.org/2013/share-psi/workshop/samos/report

The Share-PSI 2.0 Thematic Network, co-funded by the European Commission, is running a series of workshops throughout 2014 and 2015 examining different aspects of how to share Public Sector Information (PSI). This is in the context of the revised European Directive on Public Sector Information. The network's focus is therefore slightly different from that of Data on the Web Best Practices, as it covers a number of policy issues that are out of scope for W3C and only covers public sector information. However, the overlap is substantial. There are more than 40 partners in the Share-PSI 2.0 network from 25 countries, including many government departments as well as academics, consultants, citizens' organizations and standards bodies involved directly with PSI provision.

The report from the first Share-PSI 2.0 workshop, held as part of the Samos Summit 30 June - 1 July 2014, summarizes the many papers and discussions held at that event. From it, we can derive a long list of requirements.

Elements and challenges are not included here, as the report summarizes many use cases.

Requires: R-AccessRealTime, R-AccessUpToDate, R-Citable, R-GeographicalContext, R-MetadataDocum, R-MetadataStandardized, R-ProvAvailable, R-QualityComparable, R-QualityOpinions, R-SensitivePrivacy, R-SensitiveSecurity, R-UsageFeedback and R-VocabReference.

2.22 Tabulae - how to get value out of data

(Contributed by Luis Polo)
URL: http://www.tabulaeapp.com/

Tabul.ae is a framework to publish and visually explore data that can be used to deploy powerful and easy-to-exploit open data platforms, allowing organizations to unleash the potential of their data. The aim is to enable data owners (public organizations) and consumers (citizens and business reusers) to transform the information they manage into added-value knowledge, empowering them to easily create data-centric Web applications. These applications are built upon interactive and powerful graphs, and take the shape of interactive charts, dashboards, infographics and reports. Tabulae provides a high degree of assistance to create these apps and also automates several data visualization tasks (e.g. recognition of geographical entities to automatically generate a map). In addition, the charts and maps are portable outside the platform and can be smartly integrated with any Web content, enhancing the reusability of the information.

Elements:

Domains: Quantitative and geographical information: stats, biodiversity, socio-economic indicators, environment, security, etc.
Obligation/motivation: to help citizens and companies (especially, consultancy firms) to understand and create value from open data by means of reusable, user-made visualizations.
Usage: Data used by citizens, public employees and companies.
Quality: The information must be at least semi-structured (for instance, a spreadsheet).
Size: Medium and large datasets (hundreds of thousands to millions of rows).
Type/format: Tabulae can manage relational databases, GeoJSON, CSV files and spreadsheets, and provides an API for programmatic access.
Rate of change: depending on the original datasets. The platform enables automatic update from original sources.
Data lifespan: depending on the original datasets.
Potential audience: Organizations that want to publish their catalogue of datasets and aim to maximize their impact and consumption.

Challenges:

Quality of data and metadata.
Inconsistency between different data sources.
Wide variety of formats and technologies.
Different data schemas that complicate the integration of data sources.
Diversity and (sometimes) complexity of Licenses.
Data persistence.
Internationalization and format issues (e.g., languages, numbers, dates, etc.)
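The internationalization challenge in the list above can be made concrete with a small sketch that normalizes locale-specific numbers and dates before datasets are combined. The default separators and the date pattern are assumptions chosen for illustration; nothing here reflects how Tabulae itself handles these cases.

```python
from datetime import datetime


def parse_decimal(text, decimal_sep=",", thousands_sep="."):
    """Convert a localized number string such as '1.234,56' to a float."""
    # Drop grouping separators first, then switch the decimal mark to '.'.
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


def to_iso_date(text, source_format="%d/%m/%Y"):
    """Re-express a date written in `source_format` as ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text.strip(), source_format).date().isoformat()


print(parse_decimal("1.234,56"))
print(parse_decimal("1,234.56", decimal_sep=".", thousands_sep=","))
print(to_iso_date("31/12/2014"))
```

Making the separators and the date pattern explicit parameters, rather than guessing them from the data, avoids silently misreading values such as 01/02/2014.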

Potential Requirements:

Dataset versioning and updating mechanisms.
Standardization of schemas.
Integration with other platforms/services.

Requires: R-AccessUpToDate, R-FormatLocalize, R-FormatMachineRead, R-FormatMultiple, R-FormatStandardized, R-ProvAvailable, R-QualityComparable, R-QualityCompleteness, R-SensitiveSecurity, R-VocabReference and R-VocabVersion.

2.23 UK Open Research Data Forum

(Contributed by Phil Archer)
URL: http://www.researchinfonet.org/wp-content/uploads/2014/07/Joint-statement-of-principles-June-2014.pdf (PDF)

In 2013, the Royal Society led the formation of the UK Open Research Data Forum. This effort is a national reflection of a global trend towards the open publication of research data; see, for instance, the work of the Research Data Alliance, DataCite and the US National Institutes of Health as described in a talk by its Associate Director for Data Science, Philip Bourne. Following a workshop in April 2014, the UK Open Research Data Forum and US Committee on Coherence at Scale issued a joint statement (PDF) of the principles of open research data.

1. The data that provide the evidence for the concepts in a published paper or its equivalent, together with the relevant metadata and computer code, must be concurrently available for scrutiny and consistent with the criteria of “intelligent openness”. The data must be:
- discoverable – readily found to exist by online search;
- accessible – when discovered they can be interrogated;
- intelligible – they can be understood;
- assessable – e.g. the provenance and reliability of data;
- reusable – they can be reused and re-combined with other data.
2. The data generated by publicly or charitably funded research that is not used as evidence for a published scientific concept should also be made intelligently open after a pre-specified period in which originators have exclusive access.
3. Those who reuse data but were not their originators must formally acknowledge their originators.
4. The cost of creating intelligently open data from a research project is an intrinsic part of the cost of research, and should not be considered as an optional extra.
5. Although the default position for data generated by publicly or charitably funded research should be one of “intelligent openness”, there are justifiable limits to openness. These are where commercial exploitation is in the public interest and the sectoral business model requires limitations on openness; in preserving the privacy of individuals whose personal information is contained in databases; where data release would endanger safety (unintended accidents) or security (deliberate attack). However, these instances do not provide justification for blanket exceptions to the default position for those researchers or research institutions whose role is to disseminate openly their findings, and should be argued on a case-by-case basis.
6. Existing processes, reward structures and norms of behavior that inhibit or prevent data sharing or new forms of open collaboration should, wherever possible, be reformed so that data sharing and collaboration are encouraged, facilitated and rewarded.

At the time of writing, these are undergoing review and refinement but the aims are clear. In the context of the Data on the Web Best Practices Working Group, many requirements stem from this list.

Challenges:

Each principle listed here represents one or more challenges, with points 1, 3 and 5 being particularly relevant to Data on the Web Best Practices. Matters of policy and culture within any domain, whilst certainly challenging, are out of scope for the current work.

Elements:

Domains: Research data.
Obligation/motivation: Cultural/professional obligation.
Usage: Data that supports the scientific method.
Quality: Variable - often empirical, often messy. Some of the data may not be repeatable.
Size: Highly variable but it's noteworthy that research data can be very large (e.g. genomics).
Type/format: variable including some specialist formats, XML dialects etc. but often CSV.
Rate of change: Usually the data is static.
Data lifespan: Publication often associated with a journal publication that marks the end of the cycle.
Potential audience: Research peers.

Requires: R-AccessBulk, R-Citable, R-FormatMachineRead, R-FormatOpen, R-FormatStandardized, R-LicenseAvailable, R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead, R-MetadataStandardized, R-PersistentIdentification, R-ProvAvailable, R-SensitivePrivacy, R-SensitiveSecurity, R-TrackDataUsage and R-UniqueIdentifier.

2.24 Uruguay Open Data Catalog

(Contributed by AGESIC)
URL: http://datauy.org/

Uruguay's open data portal was launched in December 2012 and at the time of writing holds 85 datasets containing 114 resources. The open data initiative prioritizes the “use of data” over the “quantity of data”, which is why the catalog also promotes a number of applications that use data resources in some way (in common with many other data portals). It is important for the project to keep a 1:3 ratio between applications and datasets. Most of the resources are CSV files and ESRI Shapefiles, making this a catalog of 2 and 3 star resources according to the 5 Stars of Linked Open Data scheme. AGESIC does not have sufficient resources at government agencies to implement an open data liberation strategy and go to the next level. So when we are asked about opening data, the answer is to keep it simple, and CSV is by far the easiest and smartest way to start. Uruguay has an access to public information law but does not have legislation about open data. The open data initiative is led by AGESIC with the support of an open data working group drawn from multiple government agencies.

Elements:

Domains:
- Infrastructure: Most of the datasets are shapefiles
- Transportation: Shapefiles and CSV, containing information about public transportation (stops and frequency), roads, accidents, etc.
- Tourism: data about regional events, cultural agenda, hotels, camp sites, statistics
- Economics: budget, consumer price declarations, etc.
- Social development
- Environment
- Health
- Education
- Culture
Obligation/motivation: There is no obligation for the government agencies to publish open data. All initiatives were carried out by agencies that want to support the initiative.
Usage: Develop applications and new services for citizens, agencies interoperability (exchange of information in open data formats), transparency.
Quality: Most of the data is realized properly, with complete or near complete metadata.
Size: Small; most of the datasets are less than 1 GB.
Type/format: At the time of writing: ESRI Shapefile (35), CSV (26), TXT (19), ZIP (12), HTML (7), XLS (6), PDF (4), XML (3), RAR (2)
Rate of change: Depends on the dataset.
Data lifespan: Depends on the dataset; some change in real time, others monthly, every 6 months or annually, and some are static.
Potential audience: Developers, journalists, civil society, entrepreneurs.

Challenges: Consolidation of tools to manage datasets, improve visualizations and transform resources to a higher level (4 to 5 stars). Automated publication process using harvesting or similar tools. Alerts or control panels to keep data updated.
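To show what a simple control-panel figure for the star levels could look like, the sketch below tallies the resource counts listed under Type/format by an assumed star level. The star mapping is an assumption for illustration only: the source states just that CSV and ESRI Shapefile resources make this a catalog of 2 and 3 star resources, so formats without an assumed level are counted separately rather than guessed.

```python
# Star levels are assumptions for illustration; only the CSV/Shapefile
# "2 and 3 star" statement comes from the use case text.
ASSUMED_STARS = {"ESRI Shapefile": 2, "XLS": 2, "CSV": 3, "XML": 3, "PDF": 1}

# Resource counts per format, as listed under Type/format.
FORMAT_COUNTS = {
    "ESRI Shapefile": 35, "CSV": 26, "TXT": 19, "ZIP": 12, "HTML": 7,
    "XLS": 6, "PDF": 4, "XML": 3, "RAR": 2,
}


def tally_by_stars(counts, stars):
    """Sum resource counts per star level; unmapped formats go under None."""
    tally = {}
    for fmt, count in counts.items():
        level = stars.get(fmt)  # None when no star level is assumed
        tally[level] = tally.get(level, 0) + count
    return tally


tally = tally_by_stars(FORMAT_COUNTS, ASSUMED_STARS)
for level, total in tally.items():
    print(level, total)
```

The per-format counts add up to the 114 resources mentioned in the overview, which gives a quick consistency check on the listing.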

Requires: R-DataMissingIncomplete, R-AccessLevel, R-VocabReference and R-TrackDataUsage.

2.25 Web Observatory

Contributors: Adriano C. Machado Pereira, Adriano Veloso, Gisele Pappa, Wagner Meira Jr.

City/country: Belo Horizonte, Brazil

URL: http://observatorio.inweb.org.br/english.html

Overview: There are almost 65 million Brazilians connected to the Internet, 36% of the Brazilian population, according to Comitê Gestor da Internet no Brasil. As a consequence, events such as the Brazilian election race have become popular topics on the Web, mainly in Online Social Networks. Our goal is to understand this new reality and present new ways to watch facts, events and entities on the fly using the Web and user-generated content available in Online Social Networks and blogs. The Web Observatory is a research project that is part of the Instituto Nacional de Ciência e Tecnologia para a Web (INWEB), sponsored by CNPq and Fapemig. There are over 30 experts involved in the project, from four different Federal Universities: Universidade Federal de Minas Gerais (UFMG), Centro Federal de Educação Tecnológica de Minas Gerais (CEFET-MG), Universidade Federal do Amazonas (UFAM) and Universidade Federal do Rio Grande do Sul (UFRGS). The INWEB researchers use a set of new techniques related to information retrieval, data mining and data visualization to understand and summarize what the media and users are talking about on the Web. This provides the fundamental basis for an evaluation of the impact of the Olympic Campaigns and how users react to news and discussions. One new feature in this project is the possibility to see the propagation of the Tweets.

Elements:

Domains: Different contexts or domains, related to data from the Web. For example: Health (for example, diseases); Tourism; Sports (for example, soccer championship and Olympic games); Politics; Finance; Etc.
Obligation/motivation: Data must be obtained from different public data sources from the Web.
Usage: Provide different data analyses, indicators or visualizations to allow a better understanding of a context.
Quality: Variable, depends on the data source; can be structured or not.
Size: Variable, from small data instances to a huge amount of data, depending on the context under investigation. In general, there is a huge amount of data.
Type/format: Diverse, like CSV, HTML, JSON, XML, etc.
Rate of change: Different rates of change, usually very dynamic.
Data lifespan: n/a
Potential audience: Diverse, different Web users.

Challenges:

Data volume;
Data velocity;
Data variety;
Data value;
Complexity

Requires: R-DataEnrichment, R-GranularityLevels, R-MetadataDocum, R-MetadataMachineRead, R-MetadataStandardized, R-ProvAvailable, R-VocabDocum, R-VocabOpen and R-VocabReference.

2.26 Wind Characterization Scientific Study

(Contributed by Eric Stephan)

This use case describes a data management facility being constructed to support scientific offshore wind energy research for the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) Wind and Water Power Program. The Reference Facility for Renewable Energy (RFORE) project is responsible for collecting wind characterization data from remote sensing and in-situ instruments located on an offshore platform. This raw data is collected by the Data Management Facility (DMF) and processed into a standardized NetCDF format. Both the raw measurements and processed data are archived in the PNNL Institutional Computing (PIC) petascale computing facility. The DMF will record all processing history, quality assurance work, problem reporting, and maintenance activities for both instrumentation and data. All datasets, instrumentation, and activities are cataloged, providing a seamless knowledge representation of the scientific study. The DMF catalog relies on linked open vocabularies and domain vocabularies to make the study data searchable. Scientists will be able to use the catalog for faceted browsing, ad-hoc searches, and query by example. For accessing individual datasets, a REST GET interface to the archive will be provided.
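A client-side view of the planned REST GET access might look like the sketch below, which builds a request for a single dataset without sending it. The base URL, the dataset identifier and the `application/x-netcdf` media type are all assumptions for illustration; the source does not specify the archive's address or its API.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request

# Hypothetical base URL; the source does not give the archive's address.
ARCHIVE_BASE = "https://archive.example.org/datasets/"


def build_dataset_request(dataset_id):
    """Build (but do not send) a GET request for one archived dataset."""
    # Percent-encode the identifier so it is safe as a single path segment.
    url = urljoin(ARCHIVE_BASE, quote(dataset_id, safe=""))
    # The Accept value is an assumed media type for NetCDF content.
    return Request(url, headers={"Accept": "application/x-netcdf"}, method="GET")


request = build_dataset_request("lidar 2015-01")
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual retrieval against a real archive endpoint.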

Challenges:
To access numerous datasets, scientists will access the archive directly using other protocols such as sftp, rsync and scp, and access techniques such as HPN-SSH.

Requires: R-AccessRealTime, R-FormatStandardized, R-VocabOpen and R-VocabReference.

Challenge	Section	Requirements
Data Access	Requirements for Data Access	R-AccessBulk, R-AccessLevel, R-AccessRealTime, R-AccessUpToDate, R-APIDocumented
Data Enrichment	Requirements for Data Enrichment	R-DataEnrichment
Data Formats	Requirements for Data Formats	R-FormatLocalize, R-FormatMachineRead, R-FormatMultiple, R-FormatStandardized, R-FormatOpen
Data Granularity	Requirements for Data Granularity	R-GranularityLevels
Data Identification	Requirements for Data Identification	R-UniqueIdentifier
Data Quality	Requirements for Data Quality	R-DataMissingIncomplete, R-QualityComparable, R-QualityCompleteness, R-QualityMetrics, R-QualityOpinions
Data Selection	Requirements for Data Selection	R-DataIrreproducibility, R-DataLifecyclePrivacy, R-DataLifecycleStage
Data Usage	Requirements for Data Usage	R-TrackDataUsage, R-UsageFeedback, R-Citable
Data Vocabularies	Requirements for Data Vocabularies	R-VocabDocum, R-VocabOpen, R-VocabReference, R-VocabVersion
Licenses	Requirements for Licenses	R-LicenseAvailable, R-LicenseLiability
Metadata	Requirements for Metadata	R-DataProductionContext, R-GeographicalContext, R-MetadataAvailable, R-MetadataDocum, R-MetadataMachineRead, R-MetadataStandardized, R-SLAAvailable
Preservation	Requirements for Preservation	R-PersistentIdentification
Provenance	Requirements for Provenance	R-ProvAvailable, R-DataVersion
Sensitive Data	Requirements for Sensitive Data	R-SensitivePrivacy, R-SensitiveSecurity
