Enabling Semantic Access to
(Position Paper for W3C workshop on RDF Access to
Relational Databases)
{jun.yuan, david.h.jones}@boeing.com
Mathematics & Computing Technology
Boeing Phantom Works
Motivations & Benefits
Relational databases have been an important part of
the enterprise IT infrastructure, mainly because of their proven efficiency in
dealing with huge volumes of data. However, the advent of the Internet has
brought many end users into the information cyberspace. They usually have
little, if any, training in either query languages or database systems, and
thus have difficulties using database languages, e.g. SQL. A new semantic query
language, such as SPARQL, is much preferred, and will enable users to retrieve
information solely by the semantic understanding of the applicable domain.
Another drawback of relational systems is that
database schemas change over time, even though the semantics of the data
have not changed. Non-semantic changes may be caused by many different things,
including schema normalization or de-normalization, migrating from one DBMS
product to another, changing of data type for a particular data field, or by using
different techniques, for example using a stored procedure instead of a
traditional view. We have also observed
that the addition of a new set of data or archiving existing data may also
result in schema changes.
For any software-based information consumer, all these
non-semantic changes imply one thing: the existing pre-defined or pre-cached
query statements have to be modified accordingly. Otherwise, an exception may
be returned instead of the query answers. Given the fact that a database
usually has many software applications on top of it, it is really challenging
and expensive to modify all the applications appropriately and in a timely
fashion. The larger question, then, is: is there an approach that is able to
hide these non-semantic changes from software applications so that it is not
necessary to modify an application as long as there isn't any change in
semantics?
With the ever-growing information sharing requirement
in almost every enterprise, retrieved information (query answers) needs to be shared
among many information consumers, not only humans but also computers and their
software components. This implies that the semantics of query answers must be
both human and machine understandable. We know that query answers from
databases are usually represented by a flat table, with multiple rows, and each
row may have a number of fields. Suppose that you are receiving a table with
two columns, one being aircraft tail number and the other being a part number,
would you be able to understand what the data really means without asking any
questions to another person?
Semantic access to existing RDB data holds the promise
of bringing explicit semantics at several different levels and for different
parties: The semantics of the domain, the semantics of data content, the
semantics of a user-defined query itself, and the semantics of returned query
answers. It not only provides information consumers with a more convenient and
user-friendly interface to retrieve information, but it also offers a
foundation for better system maintainability, better semantic interoperability
among multiple data systems, and, hence, better data leverage.
Some Challenges
1. Ontology
It is obvious to this audience that an
ontology plays a very important role with regard to semantic access
to RDB data. Where and how shall we obtain this ontology? Is it difficult (or
not) to derive this ontology? As a matter of fact, a semantic model is commonly
used in database design. People are familiar with three-level database design,
starting from the conceptual design, then the logical design, and finally the
physical design. Each design phase has its own data schema: Conceptual,
logical, and physical. A conceptual schema, usually an Entity-Relationship
diagram, is actually a kind of semantic model. Based on the above it may appear
that every data system should have a conceptual schema, which could serve as
the basis of the ontology. Unfortunately, in practice there are two major
reasons why this is not the case.
First, people usually start
the database design with a conceptual model, but seem to forget about it entirely
as the implementation goes on. In fact, it is common, though incorrect, for
database developers to update the physical model directly when a semantic
change is required in a later phase of database design, without referring back
to the conceptual model. The original conceptual model thus quickly becomes
out of date.
Secondly, the semantic model disappears or gets
embedded as a normal part of the implementation process, due to such factors as
schema normalization. Schema normalization is a very common practice in
database design, and is mostly driven by functional dependency. Without getting
into details of the schema normalization, the main thing to point out here is:
schema normalization is introduced mainly for the purpose of guaranteeing the
integrity of data in the database. While there is no doubt that data integrity
is one of the most important aspects in databases, the end result is usually
that the implementation/maintenance process makes the conceptual schema
obsolete, even though for a select-only operation data integrity is not an
issue.
2. Mappings
Because of the proven efficiency of an RDB query
engine, it is both desirable and reasonable to push the semantic query
evaluation down into the RDB query engine as much as possible. This often
requires advanced query rewriting techniques to generate either high-level SQL
query statements or low-level relational algebra expressions, which are
semantically equivalent (if possible) to the original semantic query. We argue
that mappings are the key enablers of successful rewriting.
Mappings fall into two categories: one maps the
semantic model (ontology) to the underlying data model; and the other maps the
semantic query primitives to the relational query primitives. For the second
category, instead of reinventing the wheel, we can use a lot of existing research
results from previous deductive database research work. For example, there has
been extensive research on mapping first order logic or description logic to
the relational calculus. In order to successfully push down the semantic query
evaluation as much as possible into an RDB query engine, a generic mapping
structure needs to be developed between ontologies
and relational data models, and a theoretical foundation needs to be built between
the semantic query formalisms and the relational calculus/algebra.
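
To make the first category of mappings concrete, the following is a minimal
sketch, not a proposed design. It assumes a hypothetical mapping in which each
ontology property corresponds to one table with a subject column and an object
column, and it rewrites a conjunction of triple patterns into a single SQL
statement so that the join is evaluated inside the RDB engine. All table,
column, and property names (aircraft, installed_part, ex:tailNumber,
ex:hasPart) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyMapping:
    """Hypothetical mapping of one ontology property onto one table."""
    table: str
    subject_column: str  # column identifying the subject entity
    object_column: str   # column holding the property value


# Assumed ontology-to-schema mapping; none of these names come from a real system.
MAPPINGS = {
    "ex:tailNumber": PropertyMapping("aircraft", "aircraft_id", "tail_no"),
    "ex:hasPart": PropertyMapping("installed_part", "aircraft_id", "part_no"),
}


def rewrite(patterns):
    """Rewrite (subject_var, property, object_var) patterns into one SQL SELECT.

    A variable that occurs in more than one pattern becomes a join condition,
    so the join is pushed down into the relational engine.
    """
    from_items = []
    conditions = []
    bindings = {}  # variable -> first column expression bound to it
    for i, (subject, prop, obj) in enumerate(patterns):
        mapping = MAPPINGS[prop]
        alias = f"t{i}"
        from_items.append(f"{mapping.table} AS {alias}")
        for var, column in ((subject, mapping.subject_column),
                            (obj, mapping.object_column)):
            expr = f"{alias}.{column}"
            if var in bindings:
                conditions.append(f"{bindings[var]} = {expr}")
            else:
                bindings[var] = expr
    select = ", ".join(f"{expr} AS {var.lstrip('?')}"
                       for var, expr in bindings.items())
    sql = f"SELECT {select} FROM {', '.join(from_items)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


query = rewrite([
    ("?a", "ex:tailNumber", "?tail"),
    ("?a", "ex:hasPart", "?part"),
])
```

Because applications refer only to the property names, a non-semantic schema
change in this sketch would require editing the mapping table rather than the
queries themselves. Constants, filters, and optional patterns are deliberately
left out.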
3. Result transformation
Query answer transformation, i.e. converting query
answers out of the RDB engine into an instantiation of ontology, is another
challenge. One of the most important questions is how to formulate the URI for
each instance. Each database table may have its primary key defined, which,
in many cases, is a good resource for generating URIs.
However, this is not always enough. A
more challenging situation arises when accessing multiple databases that
do not share the same primary key for the same real-world entity.
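
As an illustration of the single-database case, the sketch below mints a URI
from a table name and its primary key values, then turns one row into
triples. The base URI, the URI layout, and the helper names mint_uri and
row_to_triples are assumptions made for this example only.

```python
from urllib.parse import quote

# Hypothetical base URI; a real deployment would choose its own namespace.
BASE = "http://example.org/data/"


def mint_uri(table, key_values):
    """Build an instance URI from a table name and its primary key values.

    Each key value is percent-encoded so that composite keys and values
    containing '/' or spaces still yield an unambiguous URI.
    """
    key = "/".join(quote(str(value), safe="") for value in key_values)
    return f"{BASE}{quote(table, safe='')}/{key}"


def row_to_triples(table, primary_key, row):
    """Turn one row (a dict of column -> value) into (s, p, o) triples."""
    subject = mint_uri(table, [row[column] for column in primary_key])
    return [
        (subject, f"{BASE}{table}#{column}", value)
        for column, value in row.items()
        if column not in primary_key and value is not None
    ]


triples = row_to_triples(
    "aircraft", ["aircraft_id"], {"aircraft_id": 7, "tail_no": "N123 AB"}
)
```

This sketch does not address the multi-database case: two databases that key
the same real-world entity differently would produce two different URIs, and
reconciling them would need extra information such as a cross-reference table.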
Another big challenge in result transformation is how
to preserve the intermediate query answers efficiently. For traditional RDB
engines, the intermediate data is usually abandoned during the query evaluation
process. To recover the semantics of query answers, however, not
only the query answers need to be returned, but also much of the intermediate
data. These intermediate data tell exactly how data elements in the final query
answers are semantically related, and need to be preserved all the way through
the query execution.
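
The tail number and part number example from the motivation section can be
used to sketch this point. The code below assumes a hypothetical two-table
schema and uses an in-memory SQLite database. The query returns the join keys
alongside the requested values, and those keys are what allow the answer to be
expressed as triples that say how a part number relates to a tail number. The
schema, the ex: names, and the function name are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE aircraft (aircraft_id INTEGER PRIMARY KEY, tail_no TEXT);
CREATE TABLE installation (
    install_id INTEGER PRIMARY KEY,
    aircraft_id INTEGER REFERENCES aircraft(aircraft_id),
    part_no TEXT
);
INSERT INTO aircraft VALUES (1, 'N100');
INSERT INTO installation VALUES (10, 1, 'P-42');
""")


def answers_with_intermediate_keys(connection):
    """Return triples built from the answers plus the keys used in the join.

    A plain SQL answer would keep only (tail_no, part_no). Selecting the
    intermediate keys as well preserves how the two values are related.
    """
    rows = connection.execute("""
        SELECT a.tail_no, i.part_no, a.aircraft_id, i.install_id
        FROM aircraft AS a
        JOIN installation AS i ON i.aircraft_id = a.aircraft_id
        ORDER BY i.install_id
    """).fetchall()
    triples = []
    for tail_no, part_no, aircraft_id, install_id in rows:
        aircraft = f"ex:aircraft/{aircraft_id}"
        installation = f"ex:installation/{install_id}"
        triples += [
            (aircraft, "ex:tailNumber", tail_no),
            (installation, "ex:installedOn", aircraft),
            (installation, "ex:partNumber", part_no),
        ]
    return triples


result = answers_with_intermediate_keys(conn)
```

Carrying the extra key columns through the query is the simplest way to keep
this information. Whether it can be done efficiently for large intermediate
results is the open question raised above.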
4. Performance
Performance is key! The
challenge here is how to keep semantic query performance at a level comparable to
that of SQL. The performance challenge has two aspects: one is how to
push down query execution as much as possible, and the other is how to
efficiently transform query answers back into an ontological format.