Comparing SPARQL and SQL by Example

In an earlier article, I introduced the RDF and SQL datamodels. That article explored a simple addressbook to describe at a high-level the structure of SPARQL and SQL queries. This article compares more of the language features of SPARQL and SQL, comparing two representations of a database of space craft. The first is the Talis Space Database and the second is an SQL database derived from it. This article will cover:

The relational database used in the examples is described by this UML diagram:

MissionRole +mission +actor +role +n Mission +id +title Role +id +label Crew +id +name SpacecraftDiscipline +craft +discipline Discipline +id +label LaunchSite +id +city Launch +id +date +site +vehicle Spacecraft +id +launch +mission +name +altname +designator +agency +mass PlaceName +id +label CountryName +id +label AltCraftName +craft +name

Property Selection

Property selection is similar in SPARQL and SQL; both use the keyword SELECT to introduce a list if "columns" to return to the querier. I'm using "columns" to correspond to the information slots in SQL and SPARQL. In SQL, these are named attributes in the tables listed in the later FROM clause. In SPARQL, these are variables bound in the later WHERE clause. For both SPARQL and SQL, the symbol * calls for the return of all columns.

All Properties of a Launch

The trivial SQL way to investigate the properties of a record is to use SELECT * FROM someTable WHERE someAttribute=someValue. Let's use Apollo 7's launch for example:

SELECT *
  FROM Launch
 WHERE launchword="1968-089";
-- ask for all attributes
-- from the records in the Launch table
-- which have a particular launchword.

id date launchword site vehicDesc
764 1968-10-11 1968-089 94 Saturn 1B

The analog in SPARQL is similar, though the launchword above is embedded in a URL used for the subject of some triples:

PREFIX craft: <http://nasa.dataincubator.org/launch/>
SELECT * {
  launch:1968-089 ?p ?o
}
# Prefixes shorten the body of the query.
# Return tbe predicate ?p and object ?o
#   for all triples with the given subject.

?p ?o
<http://purl.org/net/schemas/space/launched> "1968-10-11"^^<http://www.w3.org/2001/XMLSchema#date>
<http://purl.org/net/schemas/space/launchsite> <http://nasa.dataincubator.org/launchsite/capecanaveral>
<http://purl.org/net/schemas/space/launchvehicle> "Saturn 1B"
<http://purl.org/net/schemas/space/spacecraft> <http://nasa.dataincubator.org/spacecraft/1968-089A>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/Launch>

The first thing we notice is that the axes are different; factoids come in triples in RDF and in n-ary units in relational databases. We see also that SPARQL gave us URLs where SQL gave us the integers used in foreign keys. These integers are more compact, and, if you know the schema and have access to this particular database, can be used to join against other tables to get the attributes of the refrenced items as well. By contrast, the Semantic Web identifiers reported by SPARQL are in a global information space. Authors and custodians of other data are invited to re-use the same identifiers, promoting connectivity between databases and greatly enhancing data integration.

SQL Joins

The queries in the last example returned the attributes of a launch where, if we wanted to see some attributes of a launch, we'd have to look in the LaunchSite table, which in turn references the PlaceName and CountryName tables.

Attributes of Spacecraft Launched from Spain

SELECT Spacecraft.*
  FROM Spacecraft
  INNER JOIN Launch
     ON Launch.id=Spacecraft.launch
  INNER JOIN CountryName
     ON CountryName.site=Launch.site
  WHERE CountryName.label REGEXP "Spain";
-- All attributes of Spacecraft.
-- Step from Spacecraft
--  through Spacecraft.launch
--  to Launch.id .
-- Launch's foreign key to LaunchSite
--  = CountryName's foreign key to LaunchSite .
-- Launched by Spain.

id launchmissiondesignatornameagencymasshomepagecraftDesc
4500 3923 NULL1997-018AMinisat 01Spain 200<htt…8A>Minis…es.

As it turns out, the agency that oversaw the launch happened to be "Spain" as well, but that's certainly not a rule. Also, because this database only has "mission" data for Apolly missions, the mission attribute is NULL.

For comparison, here's that same query in SPARQL:

PREFIX space: <http://purl.org/net/schemas/space/>
SELECT ?id ?p ?o {
  ?site space:country ?country
  FILTER REGEX (?country, "Spain")
  ?launch space:launchsite ?site .
  ?id space:launch ?launch .
  ?id ?p ?o .
}

	  

(This is this same as

PREFIX space: <http://purl.org/net/schemas/space/>
SELECT ?id ?p ?o {
  ?id ?p ?o .
  ?id space:launch [ space:launchsite [ space:country ?country ] ]
  FILTER REGEX (?country, "Spain")
}
# Similar to previous SPARQL query
# but looking for a spacecraft.

, rewritten for speed.)

?p ?o
<http://purl.org/dc/elements/1.1/description> "Minis…es. "
<http://purl.org/net/schemas/space/agency> "Spain"
<http://purl.org/net/schemas/space/alternateName> "24779"
<http://purl.org/net/schemas/space/discipline> <http://nasa.dataincubator.org/discipline/engineering>
<http://purl.org/net/schemas/space/discipline> <http://nasa.dataincubator.org/discipline/astronomy>
<http://purl.org/net/schemas/space/internationalDesignator> "1997-018A"
<http://purl.org/net/schemas/space/launch> <http://nasa.dataincubator.org/launch/1997-018>
<http://purl.org/net/schemas/space/mass> "200.0"
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/Spacecraft>
<http://xmlns.com/foaf/0.1/homepage> <http://nssdc.gsfc.nasa.gov/database/MasterCatalog?sc=1997-018A>
<http://xmlns.com/foaf/0.1/name> "Minisat 01"

There are lot of attibutes to sort through. To make it easy, here they are lined up:

row identifier discipline disciplinedesignatorlaunch mission designator name agency masshomepage craftDesc
4500 3923 NULL 1997-018A Minisat 01 Spain 200<htt…8A>Minis…es.
<htt…1997-018A>engineeringastronomy "24779" <htt…1997-018> "1997-018A""Minisat 01""Spain" 200<htt…8A>Minis…es.

We can see that where SQL returned a NULL mission, SPARQL didn't return any triple at all. We also see that SPARQL gave us many more properties than did the SQL query. That's because all of the many-to-many properties in SQL are normalized out into their own tables. A more appropriate analog for the above SPARQL query is:

SELECT Spacecraft.*, disc.labelword
  FROM Spacecraft
  INNER JOIN Launch
     ON Launch.id=Spacecraft.launch
  INNER JOIN CountryName
     ON CountryName.site=Launch.site
   LEFT OUTER JOIN AltCraftName
     ON AltCraftName.spacecraft=Spacecraft.id
   LEFT OUTER JOIN
     (
      SELECT SpacecraftDiscipline.spacecraft,
             Discipline.labelword
        FROM SpacecraftDiscipline
       INNER JOIN Discipline
          ON Discipline.id=SpacecraftDiscipline.discipline
     ) AS disc ON disc.spacecraft=Spacecraft.id
  WHERE CountryName.label REGEXP "Spain";
-- 
-- 
-- 
-- 
-- 
-- 
-- LEFT OUTER JOIN (described below) so the query
-- succeeds even without corresponding AltCraftNames.
-- 
-- The ()s demark a subquery described below.
-- Match against 0 or more disciplines
--  by joining through SpacecraftDiscipline
--  to Discipline.
-- 
-- 
-- 
-- 

id craftwordlaunchmissiondesignatorname agencymasshomepage craftDesclabelword
45001997-018A 3923 NULL1997-018A Minisat 01Spain 200<ht…18A>Min…es. astronomy
45001997-018A 3923 NULL1997-018A Minisat 01Spain 200<ht…18A>Min…es. engineering

Optionals

The previous SQL query .

@@


	  

	  


	  

	  

Constraints

Constraints in SPARQL and SQL are somewhat similar. In SPARQL, they expressed in a FILTER clause. We've seen an example of this in the previous section. In SQL, constraints in the WHERE and ON clauses additionally capture the foreign keys connecting the tables. Examining all of the constraints in the previous SQL query:

…    ON Launch.id=Spacecraft.launch
…    ON CountryName.site=Launch.site
…         ON Discipline.id=SpacecraftDiscipline.discipline
…              ON disc.spacecraft=Spacecraft.id
…    ON AltCraftName.spacecraft=Spacecraft.id
… WHERE CountryName.label REGEXP "Spain"
-- foreign key
-- foreign key
-- foreign key
-- foreign key
-- solution restriction criteria

A couple more minor differences between SQL and SPARQL constraints

are illustrated in this query:

Largest US and USSR Spacecraft

SELECT agency, name, mass
  FROM Spacecraft
 WHERE
    (agency="United States" AND mass>100000)
 OR (agency="U.S.S.R"       AND mass>70000);

	  

PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?agency ?name ?mass {
  ?craft space:agency ?agency ; space:mass ?mass ; foaf:name ?name
  FILTER (
    (?agency="United States" && xsd:float(?mass)>100000)
  ||(?agency="U.S.S.R"       && xsd:float(?mass)>70000))
}

	  

?agency ?name ?mass
"U.S.S.R" "Buran" 73000.0
"United States" "STS 35/Astro 1" 102420.0
"United States" "STS 95" 103069.0

Functions (and Fix-ups)

The last SPARQL query invoked a function, xsd:float(?mstr), in the SELECT clause and the FILTER clause. This was required because the database has strings where it should have floats. Most of the SPARQL 1.1 functions and operators are inspired by SQL. SPARQL's functions are extensible directives, like xsd:float(X), and the operators built in features of the language, e.g. X || Y or ?mass > 70000.

Aggregates

While we're on the subject of functions, and hurling heavy things into space, let's compare some minor stats for the US and USSR. Above, we stumbled across the largest vehicles the two superpowers launched. SPARQL and SQL have functions like min, max, avg, sum and count for this.

Max Vehicle Weight by Agency

What are the heaviest payloads that each agency has launched?

SELECT agency, MAX(mass) AS m
 FROM Spacecraft
GROUP BY agency
ORDER BY m

	  

PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?agency (MAX(xsd:float(?mstr)) AS ?m) {
  ?craft space:agency ?agency ; space:mass ?mstr
} GROUP BY ?agency ORDER BY ?m

	  

?agency ?m
"Denmark" 3.0
"The Netherlands" 6.5
…31 rows elided…
"Thailand" 6505.0
"Peoples Republic of China" 7600.0
"International" 11000.0
"Russia" 19640.0
"Luxembourg" 23000.0
"U.S.S.R" 73000.0
"United States" 103069.0

Apparently Luxembourg launched Luxembourg.

Unions

In both SPARQL and SQL, a UNION is equivalent to performing the queries on either side of the UNION and concatenating the results. The Largest US and USSR Spacecraft query we performed earlier can be re-expressed as a UNION to illustrate this point:

UNION of US and USSR Largest Spacecraft

 SELECT agency, name, mass
   FROM Spacecraft
  WHERE agency="United States" AND mass>100000
UNION
 SELECT agency, name, mass
   FROM Spacecraft
  WHERE agency="U.S.S.R"       AND mass>70000;

	  

Note the requirements for type consistency in UNIONs described previous article.

SPARQL results carry bindings of variables, insulating the user from errors in ordering SELECTed variables. They also carry the type in every returned term, so even if a variable is bound to terms of different type in different rows, the results include those types.

Below is a SPARQL equivalent to the SQL UNION:

PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?agency ?name ?mass {
  {
  ?craft space:agency ?agency ; space:mass ?mass ; foaf:name ?name
  FILTER ((?agency="United States" && xsd:float(?mass)>100000))
  }
UNION
  {
  ?craft space:agency ?agency ; space:mass ?mass ; foaf:name ?name
  FILTER ((?agency="U.S.S.R"       && xsd:float(?mass)>70000))
  }
}

	  

A common use of UNION in SPARQL is to find all of the arcs connecting to an RDF node.

Arcs In and Out

Find all of the arcs with <http://nasa.dataincubator.org/launch/1997-018> as the subject or object:

SELECT * {
  { ?sIn ?pIn <http://nasa.dataincubator.org/launch/1997-018> }
UNION
  { <http://nasa.dataincubator.org/launch/1997-018> ?pOut ?oOut }
}

	  

?sIn ?pIn ?pOut ?oOut
<http://nasa.dataincubator.org/spacecraft/1997-018A> <http://purl.org/net/schemas/space/launch> NULL NULL
<http://nasa.dataincubator.org/spacecraft/1997-018B> <http://purl.org/net/schemas/space/launch> NULL NULL
NULL NULL <http://purl.org/net/schemas/space/launched> "1997-04-21"^^<http://www.w3.org/2001/XMLSchema#date>
NULL NULL <http://purl.org/net/schemas/space/launchsite> <http://nasa.dataincubator.org/launchsite/canaryislands>
NULL NULL <http://purl.org/net/schemas/space/launchvehicle> "Pegasus XL"
NULL NULL <http://purl.org/net/schemas/space/spacecraft> <http://nasa.dataincubator.org/spacecraft/1997-018B>
NULL NULL <http://purl.org/net/schemas/space/spacecraft> <http://nasa.dataincubator.org/spacecraft/1997-018A>
NULL NULL <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/net/schemas/space/Launch>

These sorts of queries are very common in getting to know a dataset. In RDF, the graph of data is explicit while in SQL the connections within the data are implicitly defined via the foreign keys. While in principle, SQL's INFORMATION_SCHEMA provides users with a way to query for foreign keys, the implementation varies wildly across databases. Even if it were consistent, it is syntactically impossible in SQL to derive an attribute name and use it in a selection or constraint. For example, in order to get data analogous to the results of the SPARQL query, we'd need query the INFORMATION_SCHEMA to compose the joins against referenced tables.

SELECT Spacecraft.designator AS spacecraft, Launch.launchword,
       Launch.date, LaunchSite.labelword AS launchsite, Launch.vehicDesc
  FROM Launch
 INNER JOIN Spacecraft
    ON Spacecraft.launch=Launch.id
 INNER JOIN LaunchSite
    ON LaunchSite.id=Launch.site
 WHERE Launch.launchword="1997-018";

	  

Subqueries

The second SQL query for attributes of spacecraft launched from Spain, we saw an example of a subquery. In SPARQL, subqueries are nested inside a pair of { }s, i.e. { SELECT … { … } }. Subqueries are less common in SPARQL than in SQL as idioms

Validating Range of Launch Date


	  

	  

id
764
@@@
@@@

?p
<http://purl.org/dc/elements/1.1/description>

Irreconcilable Differences

Another difference is that in SQL, those table headings you see are like smoke; you cannot calculate them.

XX


	  

	  

id
764
@@@
@@@

?p
<http://purl.org/dc/elements/1.1/description>

$Revision: 1.5 $ of $Date: 2012/06/03 21:54:14 $ by $Author: eric $