This article compares the SPARQL and SQL query languages, which are designed to query respectively RDF and relational data. You may be reading this article because you know one and want to learn the other, or because you need to make some decisions about which to use for some purpose. First, let's look at their data models, that is, the way we consider their data to be structured.
Many people ask what can be done with SPARQL that can't be done with SQL, when in fact they care about what can be done in RDF that can't be done with relational databases. Both of these languages give the user access to create combine and consume structured data; SQL accessing tables in relational databases and SPARQL accessing a web of Linked Data. Of course, SPARQL can be used to access relational data as well, but it was designed to merge disparate sources of data.
Relational data is made up of rows of data collected into tables (also called a "relations" in formal relational literature). The rows in a table conform to a set data types and constraints called a schema. The subset of SQL called DDL (data definition language) asserts that schema:
CREATE TABLE Person ( ID INT, fname CHAR(10), addr INT, FOREIGN KEY(addr) REFERENCES Address(ID) );
CREATE TABLE Address ( ID INT, city CHAR(10), state CHAR(2) )
This constrains the rows in the Person
table in some database to have three columns, "ID", "fname" and "addr", which are respectively an integer, 10 characters and another integer. It similarly defines an Address
table and requires that the non-NULL values in the "addr" column in Person
correspond to the value of the "ID" column in Address
. This allows the database to capture relationships between the real world entities (things) being represented in the database. The values below, for instance, state that Bob lives in Cambridge MA and we don't know where Sue lives:
ID | fname | addr |
---|---|---|
7 | Bob | 18 |
8 | Sue | NULL |
ID | city | state |
---|---|---|
18 | Cambridge | MA |
The numbers 7, 8 and 18 were invented to identify the rows and capture linkages between them. SQL queries about these relations will generally not mention the specific numbers but will recapitulate the constraints of the relationship, i.e. that Person.addr=Address.ID. This same sort of expression can capture relationships which aren't intrinsic relationships, e.g. that a Person's addr number be the same as their shoe size.
RDF captures both entity attributes and relationships between entities as statements of the form entity1
has propertyA
relationship to entity2
.
Using a language called Turtle, we can say that there is a person named "Bob" with an address in Cambridge MA:
<PersonA> a <Person> . <AddressB> a <Address> . <PersonA> <Person#fname> "Bob" . <AddressB> <Address#city> "Cambridge" . <PersonA> <Person#addr> <AddressB> . <AddressB> <Address#state> "MA" .
and that there's another person, "Sue", but we won't say anything about her address because we don't know it:
<PersonF> a <Person> . <PersonF> <Person#fname> "Sue" .
We call the three terms in every RDF statement the subject, predicate and object. The terms used in the statements above are relative URLs in<>s, literals in ""s and the keyword "a" which is just a shortcut for the URL used to identify a "has type" relationship. There is no concept in RDF corresponding to SQL's NULL as there is no RDF requirement corresponding to SQL's structural constraint that every row in a relational database must conform to the same schema. The object of one assertion, e.g.<AddressB> above, may be the subject or object of other assertions. In this way, a set of RDF statements create a "graph"<http://en.wikipedia.org/wiki/Graph_%28mathematics%29>. You will frequently hear the term "RDF graph". These graphs may be cyclic, for example stating that Bob lives someplace where he is also the owner:
<PersonC> <Person#homeAddress> <AddressK> . <PerconC> <Person#fname> "Bob" . <AddressK> <Address#owner> <PersonC> .
The examples above illustrate some of the structural similarities and differences between RDF and relational data. A core philosophical difference is that RDF is a post-Web language, that is, it allows one to use web identifiers for the entities we want to describe, and for the attributes and relationships we use to describe them. If I trust the publishers not to lie to me, I can merge information from different parties. An example of this trust can be voiced as "Do I trust that whenever these parties assert that someone has a<http://xmlns.com/foaf/0.1/givenName>, that value will indeed be that person's given name". This means that RDF data is mergable in a way which for relational data would require an intermediate process of mapping the terms between databases and assuring that no tables used the same column name to mean different things.
Experts that we are on RDF and relational data, we can now examine their query languages. A SQL query to get the addresses for each person living in MA could look like:
SELECT Person.fname, Address.city FROM Person INNER JOIN Address WHERE Person.addr=Address.ID AND Address.state="MA"
Conceptually, we are SELECTing a list of attributes FROM a set of tables WHERE certain constraints are met. These constraints capture the relationships implicit in the scheme, Person.addr=Addresses.ID, and the selection criteria, e.g. Address.state="MA".
A SPARQL query of the same data could look like
SELECT ?name ?city WHERE { ?who < Person#fname> ?name ; < Person#addr> ?adr . ?adr < Address#city> ?city ; < Address#state> "MA" }
For better or worse, SPARQL reuses some key words familiar to SQL users: SELECT, FROM, WHERE, UNION, GROUP BY and most aggregate function names.
Looking at the examples above, we see this general form for SQL queries:
SELECT <attribute list> FROM <table list> WHERE <test expression>
The test expression captures both the rows relevent to a particular query (the rows for people who live in the state "MA") and the structure of the database (Person.addr references Address.ID). Documentation and literature tends to avoid the ambiguous word "selection", instead using restriction
for selecting rows and projection
for selecting specific columns from those rows. The result of the query is a list of rows, each with the SELECTed list of attributes. Executing the example query over the sample database at the top of the article yields one solution, corresponding to the one person living in "MA":
fname | city |
---|---|
Bob | Cambridge |
The SPARQL query above has a similar structure:
SELECT <variable list> WHERE {<graph pattern> }
The variables in the variable list are bound by the graph pattern. (Remember, "graph" just means that thee is a set of potentially interconnected statements.) The graph pattern looks like the data statements, but the subject, predicate or object may be a variable (terms beginning with a "?").The pattern above finds all values of ?name, ?city, ?who and ?addr which match the data, projecting out only ?name and ?city:
?fname | ?city |
---|---|
"Bob" | "Cambridge" |
The query had one solution with the variables fname
and city
bound to the literals "Bob" and "Cambridge".
The terms in SPARQL solution sets are represented the same way as in SPARQL queries or Turtle statements.
Note that the column headings in the SPARQL solution set are variables which appeared in the WHERE { <graph pattern> }
while the column headings in the SQL results are the names of attributes in the SQL schema.
@@ UNION, OPTIONAL, MINUS, etc. here or use this as intro for another article?
SQL uses the token NULL
to indicate that data is not available or not applicable.
SELECTs match table rows even if the selected attributes are NULL.
JOINs, however, the join constraints will typically eliminate rows if there are no rows with corresponding values.
The LEFT OUTER JOIN operator performs a regular ("inner") join but does not eliminate solutions if the join constraints are not met.
The following query would select each person's fname and, if available, their city:
SELECT Person.fname, Address.city FROM Person LEFT OUTER JOIN Address ON Person.addr=Address.ID WHERE Address.state="MA"
fname | city |
---|---|
Bob | Cambridge |
Sue | NULL |
SPARQL uses the operator OPTIONAL
instead of LEFT OUTER JOIN
, but the effect is similar:
SELECT ?name ?city WHERE { ?who < Person#fname> ?name . OPTIONAL { ?who < Person#addr> ?adr . ?adr < Address#city> ?city ; < Address#state> "MA" } }
Though the join semantics are analogous between the two languages, there's a noticeable difference in the treatment of missing data. Missing data is simply not expressed in RDF. Also (and consequentially), SPARQL graph patterns will not bind if there are missing attributes, e.g. Sue's addr above. A SPARQL query selecting each person's name and the identifier for their address record must make the addr attribute OPTIONAL in order to match Sue's record:
SELECT ?name ?city WHERE { ?who < Person#fname> ?name . OPTIONAL { ?who < Person#addr> ?adr } }
A truly analogous SQL query must prevent bindings to NULL, i.e.:
SELECT Person.fname, Person.addr
FROM Person
LEFT OUTER JOIN Address ON Person.addr=Address.ID
WHERE Address.state="MA" AND Person.addr IS NOT NULL
SPARQL and SQL have very similar UNION and MINUS operators, which respectively add and remove solutions from a solution set.
Because the datatypes of an SQL table are assumed to be uniform across all rows, care must be taken to align the datatypes of the SELECT. Some SQL databases enforce this rule, leaving the user with some helpful error messages to find the misalignment. Others amiably return heterogeneous columns, with e.g. the latter rows with values like 2012-05-28 in a column of floats.
SQL databases are respositories of data, with a set of tables populated by rows of data. SQL queries operate over a given database. SPARQL services vary in whether or not they have a pre-determined RDF database. The example query above presumes that a set of triples is available for querying, which one would expect from service which queries an already populated database. If the RDF database were by default empty, or didn't contain the data needed for the query, we would need to specify where to load that data. SPARQL re-uses the SQL keyword "FROM" to identify web resources which need to be loaded in order to complete a query:
FROM<http://example.org/AddressBook> SELECT ?name ?city WHERE { ... }
Combining navigation with exploration is very easy in SPARQL. If I want to what my organization knows about reactions involving illudium phosdex, I can write a query which finds those reactions and explores their attributes:
SELECT ?reaction ?p ?o WHERE { ?compound ex:name "illudium phosdex"; ?reaction ex:involves ?compound ; ?reaction ?p ?o }
In SQL, this would be like:
SELECT reactions.* FROM reactions, compounds WHERE reactions.compoundID=compounds.ID AND compounds.name="illudium phosdex"
Generic Linked Data browsers currently leverage SPARQL to explore data, and will likely lead to a generation of purpose-built interfaces which enable knowledge users to understand and capitolize on information assets.
Probably the first feature of SPARQL which will impress SQL users is the ability to federate queries across different resources. RDF theory provides a foundation to integrate graphs of data, and RDF tools put that power in the hands of users by allowing them to trivially retrieve multiple documents of data. Integrating large databases is also trivial, but instead of retrieving the data and merging it locally, one writes SPARQL queries which delegate portions of the query to remote query services, e.g. this week's unshipped orders:
SELECT ?order ?handler WHERE { SERVICE<sales> { ?order ex:soldBy ?handler ; ex:dueDate ?due FILTER (?due> "2012-02-22"&& ?due< "2012-02-29" } MINUS { SERVICE<fulfillment> { ?order ex:shipped ?shipped } } }
SQL has no standard system for query federation. Various products offer tools which more or less leverage the SQL syntax to manage access to a pre-assigned set of databases. MySQL FROM directives can join data from different virtual databases running in the same MySQL server, Oracle Database Streams and SQL Server Integration Services use that syntax to connect to databases which have been mapped to local schema names by some manual configuration.
The bold assertion that SPARQL can be used to trivially integrate data requires that the data of course be in intersecting domains. To make the integration truly trivial, it helps that the data be expressed in a similar fashion. If we want to connect one database's information about the physiological impact of certain chemicals to another database's ingredients of medications, it is certainly easier if the chemicals are represented the same way in both databases. This may seem like a tall order, but SPARQL and RDF's foundation in the web make it easy to explore, embrace and extend existing schemas. Where the collaborative spirit may be insufficient to inspire folks to use the same schema, we still have laziness to help us fight entropy.
This article introduced the relational and RDF data models and highlighted the structural differences between them. These differences become apparent in the capabilities of their query languages. We discussed the culture of re-use and integration that pervades the Semantic Web and mentioned that SPARQL can be used to access relational data as well as RDF. Two recent specifications, A Direct Mapping of Relational Data to RDF and R2RML: RDB to RDF Mapping Language, define this process. The next article contrasts SPARQL and SQL queries over two databases containing exactly the same information.