Experiences with the conversion of SenseLab databases to RDF/OWL

W3C Interest Group Note 4 June 2008

This version: http://www.w3.org/TR/2008/NOTE-hcls-senselab-20080604/
Latest version: http://www.w3.org/TR/hcls-senselab/
Previous version: http://www.w3.org/TR/2008/WD-hcls-senselab-20080404/
Editors: Matthias Samwald (Yale Center for Medical Informatics; Semantic Web Company); Kei-Hoi Cheung (Yale Center for Medical Informatics)

Abstract

One of the challenges facing Semantic Web for Health Care and Life Sciences is that of converting relational databases into Semantic Web format. The issues and the steps involved in such a conversion have not been well documented. To this end, we have created this document to describe the process of converting SenseLab databases into OWL. SenseLab is a collection of relational (Oracle) databases for neuroscientific research. The conversion of these databases into RDF/OWL format is an important step towards realizing the benefits of Semantic Web in integrative neuroscience research. This document describes how we represented some of the SenseLab databases in Resource Description Framework (RDF) and Web Ontology Language (OWL), and discusses the advantages and disadvantages of these representations. Our OWL representation is based on the reuse of existing standard OWL ontologies developed in the biomedical ontology communities. The purpose of this document is to share our implementation experience with the community.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is an Interest Group Note of the Semantic Web in Health Care and Life Sciences Interest Group (HCLS), part of the W3C Semantic Web Activity. It is considered stable and expected to be published as an Interest Group Note in May 2008. This document serves as a companion to A Prototype Knowledge Base for the Life Sciences and describes the process for integrating new data into an existing biological database. We hope other groups who plan to convert their databases into RDF/OWL format will benefit from this document.

The document was produced by the Semantic Web in Health Care and Life Sciences Interest Group (HCLS), part of the W3C Semantic Web Activity (see charter). Comments may be sent to the publicly archived public-semweb-lifesci@w3.org mailing list. Feedback is encouraged, as is participation in the recently re-chartered HCLSIG. A list of changes since the last publication is available.

Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the disclosure obligations of the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information to public-semweb-lifesci@w3.org [public archive] in accordance with section 6 of the W3C Patent Policy.

Conversion process

Original data sources

The SenseLab databases can be accessed through a web interface at the SenseLab website [ SENSELAB-WEB ]. SenseLab is divided into a number of specialised databases, of which we have converted three to Semantic Web formats. These databases are NeuronDB, BrainPharm and ModelDB. All databases are based on compartmental models of neurons. NeuronDB contains descriptions of anatomic locations, cell architecture and physiologic parameters of neuronal cells. The pilot BrainPharm database is intended to support research on drugs for the treatment of neurological disorders. It enhances the descriptions in a portion of NeuronDB with descriptions of the actions of pathological and pharmacological agents. ModelDB is a large repository of computational neuroscience models and simulations. The mathematical models in ModelDB are annotated with references to NeuronDB. Taken together, these databases allow the researcher to query information and to run simulations pertaining to the function of neurons in healthy and disease states. All databases contain extensive literature references and excerpts from texts that have been used to curate the database entries.

The databases are based on the "entity-attribute-value with classes and relationships" (EAV/CR) schema [ EAV-CR ]. The data can also be downloaded from the SenseLab Semantic Web developments portal [ SENSELAB-SW ] as a database dump in Microsoft Access MDB format and as text.

Initial RDF and OWL conversions

Motivation

Our motivation was to make the SenseLab databases available in RDF [ RDF ] (without OWL) and in OWL DL [ OWL Overview ]. The two versions were developed in parallel in order to compare the difference between the conversion processes and the outcomes. We wanted to explore the issues in mapping relational databases to RDF/OWL structure. In addition, we wanted to explore the possibility of automatic translation from EAV/CR to RDF.

Process

We developed a converter application in Java that queried the SenseLab database and wrote RDF/XML files. The conversion was fully automatic for the RDF version, but required some manual editing for the OWL version.

Outcome

These conversions were too tied to the original database structure, which resulted in inconsistent OWL ontologies. Some shortcomings of the first conversion to OWL were:

¹ Declaring classes as disjoint in OWL asserts that they have no members in common. A reasoner can use such assertions to flag inconsistent models.

Revised OWL conversions

The revised OWL conversion was based on the first OWL conversions. The design of the SenseLab ontologies follows the "ontological realism" approach [ SMITH-2004 ]. This means that the ontologies are focused on direct representations of physical objects and processes (e.g., neuronal cells, ionic currents), and not on their abstractions (e.g., concepts or database entries).

Motivation

The goals of the revision were to manually correct the logical inconsistencies in the first version of the OWL ontology, to make use of foundational ontologies (BFO, Relation Ontology) where possible, and to map the ontology to other neuroscience ontologies.

Process

An ontology containing basic class hierarchies and relations was manually created, based on the structure of existing SenseLab databases. This basic ontology could not be created from the database structure in an automated process because this would not have resulted in a logically consistent ontology. This ontology was edited by a domain expert, based on inspection and manual editing with Protege 3.2 [ PROTEGE ] and Topbraid Composer [ TOPBRAID ]. The ontologies were built upon established foundational ontologies in order to maximize the interoperability with other existing and forthcoming biomedical Semantic Web resources. These ontologies were:

Based on this manually created basic ontology, the data from the SenseLab databases were then automatically converted to OWL using programs written in Java and Python. The automated export scripts extended the manually created basic ontology through the creation of subclasses, OWL property restrictions and individuals. The resulting ontologies show no clearly distinguishable divide between a 'schema' and 'data'.

The OWL export of NeuronDB was based on a transformation from the EAV/CR model of the SenseLab database to RDF/XML by a Java program. The export from ModelDB and BrainPharm was based on a simple flat text file export of the databases. The text file exports were converted to RDF/XML files with a Python script.

For mappings to external bioinformatics databases that do not yet offer stable URIs for reference on the Semantic Web, we used the URI scheme for database record identifiers established by Science Commons [ SC-URI ]. URIs for database records could simply be generated by concatenating the record identifier to a predefined namespace. For example, the Entrez Gene record with ID '3579' was identified by the URI http://purl.org/commons/record/ncbi_gene/3579, the Uniprot record 'P46663' was identified by http://purl.org/commons/record/uniprotkb/P46663 and the Pubmed record with ID '11160518' was identified by http://purl.org/commons/record/pmid/11160518.

The database entries were connected to the ontological representations of real-world entities through relations such as has_nucleotide_sequence_described_by. For example, the gene of the Dopamine Receptor D1 (DRD1) is defined through a reference to NCBI record 1812, which contains a description of the sequence of this specific gene:

<http://purl.org/ycmi/senselab/neuron_ontology.owl#DRD1_Gene> owl:equivalentClass _:property_restriction1 .
_:property_restriction1 owl:onProperty senselab:has_nucleotide_sequence_described_by .
_:property_restriction1 owl:hasValue <http://purl.org/commons/record/ncbi_gene/1812> .
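
The URI construction and the restriction pattern described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the converter used for the project: the helper names `record_uri` and `restriction_triples` are hypothetical, the blank-node label is arbitrary, and the `owl:` and `senselab:` prefixes are left unexpanded exactly as in the listing above.

```python
# Namespaces of the Science Commons URI scheme, as quoted in the text.
RECORD_NAMESPACES = {
    "ncbi_gene": "http://purl.org/commons/record/ncbi_gene/",
    "uniprotkb": "http://purl.org/commons/record/uniprotkb/",
    "pmid": "http://purl.org/commons/record/pmid/",
}


def record_uri(database, record_id):
    """Concatenate a record identifier to the predefined namespace (hypothetical helper)."""
    return RECORD_NAMESPACES[database] + str(record_id)


def restriction_triples(class_uri, property_name, value_uri,
                        node="_:property_restriction1"):
    """Return the three statements of the hasValue pattern, one string per statement."""
    return [
        f"<{class_uri}> owl:equivalentClass {node} .",
        f"{node} owl:onProperty {property_name} .",
        f"{node} owl:hasValue <{value_uri}> .",
    ]


if __name__ == "__main__":
    statements = restriction_triples(
        "http://purl.org/ycmi/senselab/neuron_ontology.owl#DRD1_Gene",
        "senselab:has_nucleotide_sequence_described_by",
        record_uri("ncbi_gene", 1812),
    )
    print("\n".join(statements))
```

Keeping the namespace table in one place means that a change in the URI scheme of an external database only has to be made once in the export script.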

Figure 1: Import hierarchy of OWL ontologies. Ontologies printed in bold have been created by the SenseLab team, other ontologies have been created by other groups. The arrows point from the imported ontology to the importing ontology, e.g., the NeuronDB Ontology imports the Relation Ontology. Import statements are transitive, e.g., the ModelDB Ontology imports both the NeuronDB ontology and the Relation ontology.

Figure 2: Examples of relations ('mappings') spanning between classes from the NeuronDB ontology (in the middle) and classes from external ontologies.

Terse rdfs:labels were replaced by more descriptive ones that could be better understood without knowledge of the context. For example, the rdfs:label "Ded" was changed to "Distal part of equivalent dendrite (Ded)". Note that, in this case, the original label was also preserved (in brackets), because it might still be useful for people who do know the context.

The ontology development was moved to a Subversion (SVN) system on a central webserver. During most of the development, the ontologies were simply developed on the client side and were periodically uploaded via FTP. Of course this led to problems when more than one person was working on the ontologies at a time, and it was also impossible for users of the ontology to access previous versions of the ontology, since only the most recent version was available on the website.

The namespaces / ontology locations were changed to PURL-based URIs. For example, the URI http://neuroweb.med.yale.edu/senselab/neuron_ontology.owl#Dopamine was changed to http://purl.org/ycmi/senselab/neuron_ontology.owl#Dopamine. PURL-based URIs are easier to maintain when server configurations change or (in the worst case) the original server is unavailable and the ontologies need to be served from a different location. The increased stability of PURLs encourages the re-use of entities in ontologies developed by other groups, which is a key factor in the creation of a coherent Semantic Web.
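
A namespace change of this kind amounts to a prefix substitution on every affected URI. The following Python sketch shows the idea; it assumes that all affected URIs share the server-specific base seen in the Dopamine example, which the text does not state explicitly, and the function name `to_purl` is hypothetical.

```python
# Bases taken from the single example in the text; treating them as the
# common prefix of all SenseLab ontology URIs is an assumption.
OLD_BASE = "http://neuroweb.med.yale.edu/senselab/"
NEW_BASE = "http://purl.org/ycmi/senselab/"


def to_purl(uri):
    """Rewrite a URI under the old server-specific base to the PURL base.

    URIs outside the old base are returned unchanged.
    """
    if uri.startswith(OLD_BASE):
        return NEW_BASE + uri[len(OLD_BASE):]
    return uri


if __name__ == "__main__":
    print(to_purl("http://neuroweb.med.yale.edu/senselab/neuron_ontology.owl#Dopamine"))
```

Matching on the full base rather than on the host name alone avoids rewriting unrelated URIs that happen to live on the same server.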

A SPARQL endpoint for the SenseLab ontologies was set up using the open source version of the OpenLink Virtuoso server [ VIRTUOSO ]. A SPARQL endpoint is a service that allows clients to query an RDF store with the SPARQL query language through simple HTTP GET requests. The ontologies were loaded into the triplestore of the server to make them accessible to SPARQL queries. Each ontology file was put into a separate labeled graph; the label of each graph was identical to the URL of the ontology file. For example, the ontology located at http://purl.org/ycmi/senselab/neuron_ontology.owl was loaded into a graph labeled http://purl.org/ycmi/senselab/neuron_ontology.owl. Loading each ontology into a separate graph makes it possible to restrict SPARQL queries to certain graphs and hence to certain ontologies. This has the advantage that queries can be more selective and can be executed with better performance.
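
As an illustration of querying such an endpoint over HTTP GET, the Python sketch below builds a request URL that restricts a query to the graph holding the NeuronDB ontology. The endpoint address is a placeholder, since the text does not give the real one, and the function name `build_query_url` and the sample query are hypothetical. The `query` and `default-graph-uri` parameters are those defined by the SPARQL protocol.

```python
from urllib.parse import urlencode

# Placeholder endpoint address (assumption); substitute the real service URL.
ENDPOINT = "http://example.org/sparql"

# Graph label as described in the text: identical to the ontology file URL.
NEURON_GRAPH = "http://purl.org/ycmi/senselab/neuron_ontology.owl"

# Illustrative query: list a few labeled entities.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label
WHERE { ?entity rdfs:label ?label }
LIMIT 10
"""


def build_query_url(endpoint, query, graph_uri=None):
    """Build the GET URL for a SPARQL query, optionally restricted to one graph."""
    params = [("query", query)]
    if graph_uri is not None:
        # Restrict the query to a single named graph, i.e. a single ontology.
        params.append(("default-graph-uri", graph_uri))
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url(ENDPOINT, QUERY, NEURON_GRAPH))
```

The resulting URL can be fetched with any HTTP client. The same restriction can also be written inside the query itself with a GRAPH clause.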

Outcome

Advantages

The use of OWL significantly eased the integration of SenseLab data with ontologies developed by other projects. OWL-based data integration does not require the development and maintenance of central mediators, reducing development and maintenance costs. The ontology integration can be accomplished by creating meaningful relations between entities in distributed ontologies.

Ontologies can be modularized; dependencies between ontologies can be made explicit through 'owl:imports' statements. This makes distributed development of ontology modules feasible and encourages the re-use of selected ontology modules by other groups.

Good OWL ontologies are self-descriptive because every entity can be annotated with text.

Reasoners can be used to identify errors and real (i.e., conscious) contradictions in submitted data sets. You might find more errors and contradictions than you expected.

Ontologies can be used to directly represent biological reality without introducing unnecessary abstractions such as database tables, data dictionaries, and documents.

Disadvantages

The open-source ontology editors used for this project were relatively unreliable. A lot of time was spent working around software bugs. Future versions of freely available editors (e.g. Protege 4) or currently available commercial ontology editors (e.g. Topbraid Composer) might be preferable.

Descriptions of OWL classes and their relations (i.e., OWL property restrictions) result in very complex and unintuitive RDF graphs. This makes it hard to generate them automatically, or to write SPARQL queries that query such ontologies.

Current reasoners can still have performance problems when checking / classifying complex OWL ontologies.

The RDF/XML serialisation of RDF is not very easy to work with. It is often a source of errors.

Future directions and plans

Future OWL conversions are planned to be based on the intermediate, syntactic RDF conversion. The SenseLab ontologies will be further integrated with other neuroscientific and biomedical ontologies. User friendly applications will be developed to query a multitude of interrelated ontologies in a scientifically meaningful way.

Suggestions based on our experiences

Try to create consistent OWL DL ontologies. Pure RDF(S) without OWL constructs is not much simpler than OWL DL and often leads to the creation of too many properties because pure RDF(S) does not support property restrictions.

If you do not want to import another ontology in its entirety (e.g. because it would be too large, too buggy or would introduce unnecessary constructs), you can still 'copy & paste' portions of the ontology into your own.

Try to base your ontology on a foundational ontology like BFO, OBO Relation Ontology or DOLCE [ DOL ].

Where possible, use the rdfs:label property to give clear, understandable labels to each entity and property in the ontology. Try to formulate labels in a way that makes them understandable without too much additional context (e.g. a certain user interface).

Make a habit out of running your ontology through the RDF validator [ RDF-VALID ] periodically, especially when you create RDF/XML with scripts that you wrote yourself. Keep in mind that the RDF validator does not throw an error message when URIs contain blanks. Blanks in URIs are problematic for many Semantic Web applications, so try to make sure that your URIs do not contain blanks.

Check the consistency of your OWL ontology periodically. We used the Pellet reasoner [ PELLET ], which seems to be the best choice at the moment.

Use purl.org URIs for your ontologies. You can easily register a sub-domain at purl.org free of charge.

If you write a program that generates RDF/OWL, do NOT try to write RDF/XML code directly. RDF/XML is relatively complicated and messy, and it is very easy to produce syntactic or even semantic errors because of that. So if you write a program that generates RDF, use an RDF or OWL API for writing triples. If that is not possible, generate your RDF in the much simpler TURTLE syntax instead of RDF/XML. The TURTLE syntax is a subset of the N3 syntax [ N3 ]. You can save the resulting RDF in TURTLE format to a text file. If you need RDF/XML for another application, you can convert the TURTLE to RDF/XML in a second step.
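
Two of the suggestions above, avoiding blanks in URIs and emitting TURTLE rather than RDF/XML, can be combined in a small export helper. The Python sketch below is one possible shape for such a helper and is not the script used for SenseLab: the function names are hypothetical, the label value in the usage example is illustrative, and only whitespace is checked, although TURTLE forbids further characters in URIs.

```python
import re

RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def iri(uri):
    """Format a URI for TURTLE output, rejecting any whitespace (including blanks)."""
    if re.search(r"\s", uri):
        raise ValueError(f"URI contains whitespace: {uri!r}")
    return f"<{uri}>"


def literal(text):
    """Format a plain string literal, escaping backslashes, quotes and line breaks."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def label_statement(subject_uri, label):
    """Return one TURTLE statement that assigns an rdfs:label to a subject."""
    return f"{iri(subject_uri)} {iri(RDFS + 'label')} {literal(label)} ."


if __name__ == "__main__":
    # The label text is illustrative; the URI is the one quoted in this document.
    print(label_statement(
        "http://purl.org/ycmi/senselab/neuron_ontology.owl#Dopamine",
        "Dopamine",
    ))
```

Because every statement is a single self-contained line, the output can be written to a text file incrementally and converted to RDF/XML afterwards with an existing tool.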
