Copyright ©2001 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document introduces the XML Query Algebra (``the Algebra'') as a formal basis for an XML query language.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
This is a Public Working Draft for review by W3C Members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". This is work in progress and does not imply endorsement by the W3C membership.
This document has been produced as part of the W3C XML Activity, following the procedures set out for the W3C Process. The document has been written by the XML Query Working Group.
The purpose of this document is to present the current state of the XML Query Algebra and to elicit feedback on its current state. The XML Query Working Group feels that it has made good progress on this document but that it is subject to change in future versions. Comments on this document should be sent to the W3C mailing list www-xml-query-comments@w3.org (archived at http://lists.w3.org/Archives/Public/www-xml-query-comments/). Important issues remain open - see [B.3.1 Open Issues]. In particular, the reader should note the following issues related to compatibility of the Algebra with related XML activities.
[Issue-0018: Align algebra types with schema]: The Algebra's type system is not yet aligned with XML Schema.
[Issue-0056: Operators on Simple Types]: A joint XSLT/Schema/Query task force is chartered to define the operators on Schema simple types. The Algebra will adopt the operators defined by that group.
[Issue-0081: Lexical representation of Schema simple types]: The lexical representation of Schema simple types must be defined for the Algebra and XQuery.
A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.
This document proposes an algebra for XML Query (``the Algebra''). In this document, ``query-analysis time'' refers to when an Algebra expression is parsed and type checked, that is before the value of the expression is computed, and ``query-evaluation time'' refers to when an Algebra expression is evaluated, that is, when it is reduced to a value. We sometimes use the phrases query-analysis time and ``compile time'' interchangeably as well as the phrases query-evaluation time and ``run time''.
This work builds on long-standing traditions in the database community. In particular, we have been inspired by systems such as SQL, OQL, and nested relational algebra (NRA). We have also been inspired by systems such as Quilt, UnQL, XDuce, XML-QL, XPath, XQL, XSLT, and YaTL. We give citations for all these systems below.
In the database world, it is common to translate a query language into an algebra; this happens in SQL, OQL, and NRA, among others. The purpose of the algebra is twofold. First, an algebra is used to give a semantics for the query language, so the operations of an algebra should be well-defined. Second, an algebra is used to support query optimization, so it should possess a rich set of laws. The Algebra is powerful enough to capture the semantics of several XML query languages, such as XML-QL, XQL, and XQuery. The laws we give include analogues of most of the laws of relational algebra.
It is also common for a query language to exploit schemas or types; this happens in SQL, OQL, and NRA, among others. The purpose of types is twofold. Types can be used to detect certain kinds of errors at query-analysis time and to support query optimization. Given the type of input values, a query applied to those values, and the expected type of the query's output value, the Algebra's type system can detect at query-analysis time if the query's output value has the expected output type.
DTDs and XML Schema can be thought of as providing something like types for XML. The XML Query algebra uses a simple type system that captures the essence of [XML Schema Part 1]. The type system is close to that used in XDuce [HP2000]. On this basis, the XML Query algebra is statically typed. This allows an implementation of the Algebra to determine and check at query-analysis time the output type of a query on documents conforming to an input type. Compare this to an untyped or dynamically typed query language, where each individual output has to be validated against a schema at query-evaluation time, and there is no guarantee that this check will always succeed.
To define the Algebra completely, we present a static semantics and a dynamic semantics. The static semantics is presented as type inference rules, which relate Algebra expressions to types and and specify under what conditions an expression is well typed. The semantics is static, because ill-typed expressions are identified at query-analysis time, i.e., before the query is evaluated. The dynamic, or operational, semantics is presented as value inference rules, which relate Algebra expressions to values. The Algebra's values are defined in the [XML Query Data Model]; for example, they include XML atomic values, attributes, and elements, as well as other values. A dynamic semantics guarantees that every expression can be reduced to a value and may serve as the basis for a query interpreter or compiler.
The document is organized as follows. A tutorial introduction is presented in [2 The Algebra by Example]. After reading the tutorial, the reader will have seen the entire algebra and should be able to write example queries. The grammar for the Algebra's types and expressions are given in [3 Algebra Components]. We give the static typing rules for the Algebra in [4 Static Semantics : Type-Inference Rules] and then the dynamic semantics for the Algebra in [5 Dynamic Semantics : Value-Inference Rules]. These sections formalize the information presented informally in [2 The Algebra by Example]. Although these two sections contain the most challenging material, we have tried to make the content as accessible as possible. Readers only interested in learning about the Algebra need not read these sections, however, we expect that implementors of the Algebra will read them.
In [B.2 Issues list], we discuss open issues and problems. We present some equivalence and optimization laws of the Algebra in [A Equivalences].
Cited literature includes: monads [Mog89], [Mog91], [Wad92], [Wad93], [Wad95], NRA [BNTW95], [Col90], [LW97], [LMW96], OQL [BK93], [BKD90], [CM93], Quilt [Quilt], SQL [Date97], UnQL [BFS00], XDuce [HP2000], XMSL-QL [XMLQL99], XPath [XPath], XQL [XQL99], XSLT [XSLT 99], and YaTL [YAT99].
This section introduces the main features of the Algebra, using examples based on accessing a database of books.
Consider the following sample data:
<bib> <book year="1999" isbn="1-55860-622-X"> <title>Data on the Web</title> <author>Abiteboul</author> <author>Buneman</author> <author>Suciu</author> </book> <book year="2001" isbn="1-XXXXX-YYY-Z"> <title>XML Query</title> <author>Fernandez</author> <author>Suciu</author> </book> </bib>
Here is a fragment of an XML Schema for such data:
<xsd:group name="Bib"> <xsd:element name="bib"> <xsd:complexType> <xsd:group ref="Book" minOccurs="0" maxOccurs="unbounded"/> </xsd:complexType> </xsd:element> </xsd:group> <xsd:group name="Book"> <xsd:element name="book"> <xsd:complexType> <xsd:attribute name="year" type="xsd:integer"/> <xsd:attribute name="isbn" type="xsd:string"/> <xsd:element name="title" type="xsd:string"/> <xsd:element name="author" type="xsd:string" maxOccurs="unbounded"/> </xsd:complexType> </xsd:element> </xsd:group>
This data and schema is represented in the Algebra as follows:
type Bib = bib [ Book{0, *} ] type Book = book [ @year [ Integer ] & @isbn [ String ], title [ String ], author [ String ]{1, *} ] let bib0 : Bib = bib [ book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ], book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] ]
The expression above defines two types, Bib
and Book
, and defines one global variable,
bib0
.
The Bib
type corresponds to a single
bib
element, which contains a
forest of zero or more Book
elements. We use the term forest to refer to a sequence of (zero or more) attributes and elements.
Every attribute or element can be viewed as a forest of length one.
The Book
type corresponds to a single
book
element, which
which also contains a forest of zero or more attributes and elements.
It contains one
year
attribute
and one isbn
attribute, followed by one
title
element,
followed by one or more
author
elements.
The &
operator, called the interleave operator,
indicates that the year
and
isbn
attributes may occur in any order.
A isbn
attribute and
a title
or
author
element contains a string value, and a
year
attribute contains an integer.
The let expression above binds the
variable bib0
to a literal XML
value, which is the data model representation of the earlier
XML document.
The variable bib0
is globally available, i.e.,
it may appear in any other expression in the query.
A global variable is defined once. It is an error to redefine a
global variable.
The Algebra also provides a let expression for binding a local
variable.
It is introduced in [2.6 Quantification].
The value of a global or local variable is immutable, that is, once
a variable is defined, its value does not change.
The value of bib0
is a
bib
element that contains two book
elements.
The Algebra is a strongly typed language, therefore the value of
bib0
must be an instance of its declared type, or the
expression is ill-typed. Here the value of bib0
is an
instance of the Bib
type, because it contains one
bib
element, which contains two
book
elements, each of which contain
an integer-valued year
attribute, a string-valued
isbn
attribute, a string-valued
title
element,
and one or more string-valued author
elements.
For convenience, we define a second global variable
book0
also bound to a literal value, which is
equal to the first book in bib0
.
let book0 : Book = book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ]
One of the Algebra's most basic operations is projection. The Algebra uses a notation similar in
syntax and meaning to path navigation in XPath. The following expression
returns all author
elements contained in
book
elements contained in
bib0
:
bib0/book/author ==> author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ], author [ "Fernandez" ], author [ "Suciu" ] : author [ String ] {0, *}
Note that in the result, the document order of
author
elements is preserved.
The above example and the ones that follow have three parts. First is an
expression in the Algebra. Second, following the
==>
is the value of this expression. Third,
following the :
is the type of the expression, which
is (of course) also a legal type for the value.
It may be unclear why the type of bib0/book/author
contains zero or more authors, even though the type of
a book
element contains one
or more authors. Let's look at the derivation of the result type by looking at
the type of each sub-expression:
bib0 : Bib bib0/book : Book{0, *} bib0/book/author : author [ String ]{0, *}
Recall that Bib
, the type of bib0
, may
contain zero or more Book
elements, therefore the expression bib0/book
might
contain zero book
elements, in which case,
bib0/book/author
would contain no authors.
This illustrates an important feature of the type system: the type of an
expression depends only on the type of its sub-expressions. It also illustrates
the difference between an expression's value at query-evaluation time and its
type at query-analysis time.
Since the type of bib0
is
Bib
, the best type for
bib0/book/author
is one listing zero or more authors,
even though for the given value of bib0
, the expression will
always contain exactly five authors.
Its also possible to project on attributes. This expression
produces the year
attribute of book0
whose
type is @year [ String ]
.
book0/@year ==> @year [ 1999 ] : @year [ String ]
One may access atomic data (strings, integers, or booleans) using
the keyword data()
. For instance, if we wish to select all
author names in a book, rather than all author elements, we could
write the following.
book0/author/data() ==> "Abiteboul", "Buneman", "Suciu" : String {1, *}
Similarly, it is possible to project the atomic values of attributes. The following returns the year the book was published.
book0/@year/data() ==> 1999 : Integer
This notation is similar to the use of text()
in XPath.
We chose the keyword data()
because, as the second example
shows, not all data items are strings.
Another common operation is to iterate over elements in a document so that their content can be transformed into new content. Here is an example of how to process each book to list the author before the title, and remove the year and isbn.
for b in bib0/book do book [ b/author, b/title ] ==> book [ author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ], title [ "Data on the Web" ] ], book [ author [ "Fernandez" ], author [ "Suciu" ], title [ "XML Query" ] ] : book [ author[ String ]{1, *}, title[ String ] ]{0, *}
The for
expression iterates over all
book
elements in bib0
, and binds the
variable b
to each such element. For each element
bound to b
, the inner expression constructs a
new book
element containing the book's authors followed
by its title. The transformed elements appear in the same order as they occur
in bib0
.
In the result type, a book
element is guaranteed
to contain one or more authors followed by one title. Let's look at the
derivation of the result type to see why:
bib0/book : Book{0, *} b : Book b/author : author [ String ]{1, *} b/title : title [ String ]
The type system can determine that b
is
always Book
, therefore the type
of b/author
is author[ String ]{1, *}
,
and the type of b/title
is title[ String
]
.
In general, the value of a for
expression is a forest,
i.e., zero or more data-model values as defined in [XML Query Data Model].
If the body of the for
expression itself yields a forest, then all of
the forests are concatenated together. For instance, the
expression:
for b in bib0/book do b/author
is exactly equivalent to the expression bib0/book/author
.
Here we have explained the typing of for
expressions by example.
In fact, the typing rules are rather subtle
and will be explained in detail in [4 Static Semantics : Type-Inference Rules].
To select values that satisfy some predicate, we use
the where
expression. For example, the following
expression selects all book
elements in
bib0
that were published before 2000.
for b in bib0/book do where b/@year/data() <= 2000 do b ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book{0, *}
In general, an expression of the form:
where
e1 do e2
is converted to the form
if
e1
then
e2 else
()
where e1 and e2 are
expressions. Here ()
is an expression that stands for
the empty sequence, a forest that contains no attributes or elements. We also write
()
for the type of the empty sequence.
According to this rule, the expression above translates to
for b in bib0/book do if b/@year/data() <= 2000 then b else ()
and this has the same value and the same type as the preceding expression.
The following expression selects all book
elements
in bib0
that have some author
named "Buneman".
for b in bib0/book do for a in b/author/data() do where a = "Buneman" do b ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book{0, *}
In contrast, we can use the empty
operator to find
all books that have no author whose name is
Buneman:
for b in bib0/book do where empty(for a in b/author do where a/data() = "Buneman" do a) do b ==> book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] : Book{0, *}
The empty
expression checks that its argument is
the empty sequence ()
.
We can also use the empty
operator to find all
books where all the authors are Buneman, by checking that there are no authors
that are not Buneman:
for b in bib0/book do where empty(for a in b/author do where a/data() != "Buneman" do a) do b ==> () : Book{0, *}
There are no such books, so the result is the empty sequence. Appropriate
use of empty
(possibly combined with
not
) can express universally or existentially quantified
expressions.
Here is a good place to introduce the let
expression, which binds a local variable to a value. Introducing local
variables may improve readability. For example, the following expression is
exactly equivalent to the previous one.
for b in bib0/book do let nonbunemans = (for a in b/author do where a/data() != "Buneman" do a) do where empty(nonbunemans) do b
Local variables can also be used to avoid repetition when the same subexpression appears more than once in a query.
The scope of a local variable v is the body of the let
expression that follows the keyword do
up to any nested
let
expression that redefines v.
Another common operation is to join values from one or more documents. To illustrate joins, we give a second data source that defines book reviews:
type Reviews = reviews [ book [ title [ String ], review [ String ] ]{0, *} ] let review0 : Reviews = reviews [ book [ title [ "XML Query" ], review [ "A darn fine book." ] ], book [ title [ "Data on the Web" ], review [ "This is great!" ] ] ]
The Reviews
type contains one
reviews
element, which contains zero or more
book
elements;
each book
contains a title and a review.
We can use nested for
expressions to join the two
sources review0
and bib0
on
title values. The result combines the title, authors, and reviews for each
book.
for b in bib0/book do for r in review0/book do where b/title/data() = r/title/data() do book [ b/title, b/author, r/review ] ==> book [ title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] review [ "A darn fine book." ] ], book [ title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] review [ "This is great!" ] ] : book [ title [ String ], author [ String ] {1, *}, review [ String ] ] {0, *}
Note that the outer-most for
expression determines
the order of the result. Readers familiar with optimization of relational join
queries know that relational joins commute, i.e., they can be evaluated in any
order. This is not true for the Algebra: changing the order of the first
two for
expressions would produce different output.
In [2.8 Unordered forests], we introduce support for
unordered forests, which permits commutable joins.
It is beyond the scope of this document to describe algorithms for evaluating nested loop joins. See [Graefe93] for a survey.
As discussed in [2.7 Join] joins do not
commute on ordered forests. In databases, ordering often does not matter. To
permit commutable joins, and to allow for other query
optimization techniques, the Algebra also supports
unordered forests. At type level, unordered forests are
indicated by {Type}
. This type specifies that the
order of
book
elements in the reviews
element is insignificant:
type Reviews1 = reviews [{ book [ title [String], review [String] ]{0,*} }]
To transform an ordered forest to an unordered forest, we
use the listtobag
operator:
let review1 : Reviews1 = reviews [listtobag(review0/*)] ==> reviews [{ book [ title [ "XML Query" ], review [ "A darn fine book." ] ], book [ title [ "Data on the Web" ], review [ "This is great!" ] ] }]: Reviews1
This query transforms the ordered forest returned by
review0/*
into an unordered forest. The order
of nodes in the forest is now insignificant, thus review1
could also be bound to:
let review1 : Reviews1 = reviews [listtobag(review0/*)] ==> reviews [{ book [ title [ "Data on the Web" ], review [ "This is great!" ] ], book [ title [ "XML Query" ], review [ "A darn fine book." ] ] }]: Reviews1
Likewise, we can transform the ordered forest of books in
bib0
to an unordered forest. Now we can perform a commutable
join:
for b in listtobag(bib0/book) do for r in review1/book do where b/title/data() = r/title/data() do book [b/title, b/author, r/review] ==> {book [ title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] review [ "A darn fine book." ] ], book [ title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] review [ "This is great!" ] ]} : {book [ title [ String ], author [ String ] {1, *}, review [ String ] ] {0, *}}
Because both the inputs and the results are unordered forests, the outer for expression and the inner for expression can be exchanged without changing the semantics:
for r in review1/book do for b in listtobag(bib0/book) do where b/title/data() = r/title/data() do book [b/title, b/author, r/review]
Note that it is not necessary to explicitly transform bib0/book into an unordered forest, whenever one forest of a nested iteration is unordered, all other forests are cast to an unordered forest as well.
Sometimes it is convenient to transform an unordered forest to an
ordered forest. The operator bagtolist
sorts an unordered
forest by document order.
Many of the previous queries select values and return them or use them in the construction of new values. Once a value is selected, however, the previous queries do not access the original source of the selected value, i.e., the document or hierarchy of elements in which the selected value is contained. It is sometimes useful, however, to access or preserve the original context of selected nodes.
Consider the following example, which contains a new bibliography of
articles in bib1
:
type Bib1 = bib [ Article{0, *} ] type Article = article [ @year [ Integer ], title [ String ], journal [ String ], author [ String ]{1, *} ] let bib1 : Bib1 = bib [ article [ @year [ 2000 ], title [ "Queries and computation on the web" ], journal [ "Theoretical Computer Science" ], author [ "Abiteboul" ], author [ "Vianu" ] ] ]
Assume there exists a full-text search function,
contains
,
which given a set of documents,
selects elements that contain a particular keyword.
(This function is not defined in the Algebra, but is used
here to illustrate a point.)
The details of function application and declaration are given in
[2.18 Functions].
The following expression returns those elements in bib0
and bib1
that contain the keyword "Abiteboul".
Note that the result type of the expression is
AnyTree{0,*}
. This is because the contains
function cannot know apriori which elements, if any, contain a given keyword.
for a in contains(bib0,bib1; "Abiteboul") do a ==> author [ "Abiteboul" ], author [ "Abiteboul" ] : AnyTree{0,*}
The result above does not provide the context in which
the two author
elements occur. Even if the
contains
function did return more context,
it might be useful to browse the context in
which they occurred, for example, by accessing their parent and/or
sibling elements.
The built-in function parent
accesses the parent of an
attribute or element. For example, this expression returns more
useful information than the previous one:
for a in contains(bib0,bib1; "Abiteboul") do parent(a) ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ], article [ @year [ 2000 ], title [ "Queries and computation on the web" ], journal [ "Theoretical Computer Science" ], author [ "Abiteboul" ], author [ "Vianu" ] ] : AnyElement{0,*}
Note that the result type of the expression is
AnyElement{0,*}
. When applied to
an attribute or element value, the parent
function
always has return type AnyElement{0,1}
.
This is because the Algebra's type system only
preserves type information about an attribute or element's content,
not about its containing parent.
It is possible to recover more precise type information
with the dynamic or run-time cast
operator, which attempts to
cast at run time an expression to a given type.
If the expression does not have the given type,
a run-time error is raised.
Dynamic casts are necessary when it not possible to determine at
query-analysis time the most precise type of a value; they are sometimes
called ``down casts''.
For example, the use of parent
in the following expression
loses some useful type information, that is that
p
is a Book
. We can recover more
precise information by casting p
to the Book
type:
for p in parent(book0/title) do cast p : Book ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book{0,1}
The result type is Book{0,1}
because the
parent
function has type AnyElement{0,1}
.
If we try erroneously to cast p
to an
Article
, the error value is returned.
Its type is Ø, the empty choice:
for p in parent(book0/title) do cast p : Article ==> error : 0
We have already seen many examples of static or
compile-time casting. A static cast permits
the type of an expression to be changed and checked at query-analysis time;
they are sometimes called ``up casts''.
For example, consider the type Book0
, which permits a
book to have zero or more authors.
type Book0 = book [ @year [ Integer ] & @isbn [ String ], title [ String ], author [ String ]{0, *} ]
The explicit-type expression e : t
statically casts a value to the given type.
For example, the expression below statically casts book0
to
Book0
; this is permissible
because the type of book0
at query-analysis time is a sub-type of
Book0
.
book0 : Book0 ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book0
If we try erroneously to cast book0
to a more precise type (e.g., a book with 4 or more authors),
a type error will occur at query-analysis time.
The uses of the parent
operator
in [2.9 Parent and cast operators] show
that it is possible to access the original context of document nodes. This is
possible because the XML Query Data Model supports node
identity, that is, every instance of a node (e.g., element,
attribute, processing instruction, and comment) in the data model has
a unique identity. We can compare the identity
of two nodes for equality using the ==
operator.
For example, in the following expression, two distinct element nodes
are created and bound
to variables a1
and a2
.
Although the two nodes are structurally equal, their identities
are not equal:
let a1 = author [ "Suciu" ] do let a2 = author [ "Suciu" ] do a1 == a2 ==> false : Boolean
In general, all the Algebra's operators preserve node identity.
There is one exception: the element constructor, which given a tag name
and a forest of children nodes, constructs a new element.
A new element's content does not refer directly to
the given children nodes, but to copies of these nodes.
For example, the following expression constructs
an element with name newbook
and content
book0/author
, book0/title
:
let book1 = newbook [ book0/author, book0/title ] do book1 ==> newbook [ author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ], title [ "Data on the Web" ] ] : book [ author[ String ]{1, *}, title[ String ] ]
The newbook
element contains copies of the nodes in the forest
book0/author
, book0/title
,
not the original nodes in book0
.
Copying guarantees that a node is always the
parent of its child nodes and a node is always
the child of its parent; these constraints are invariants of
[XML Query Data Model].
For example, we would expect that
the following expression is always true:
parent(book1/author) == book1If the element constructor did not copy its arguments, anomalies such as the following could occur:
parent(book1/author) == book0that is, the parent of
book1
's child node
is not book1
, and
this would violate the XML Query Data Model's parent-child invariant.
Sometimes it is useful to construct elements that do
preserve the identity of its child nodes, for example,
when constructing a view
of one or more XML documents.
In this case,
we want the new element to
contain references to, not copies of,
the original nodes.
The ref
operator constructs a reference
to a node. For example, book2
below
contains references to the nodes in book0
:
let book2 = newbook [ for a in book0/author do ref(a), ref(book0/title) ] do book2 ==> newbook [ ref(author [ "Abiteboul" ]), ref(author [ "Buneman" ]), ref(author [ "Suciu" ]), ref(title [ "Data on the Web" ]) ] : newbook [ ref(author[ String ]){1, *}, ref(title[ String ]) ]
Note that the type of the expression contains reference types.
The deref
operator dereferences a reference value.
In the following, it returns the elements in book0
.
for v in book2/* do deref(v) ==> author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ], title [ "Data on the Web" ] : author[ String ]{1, *}, title[ String ]
For convenience, the expression above can be also be written as:
book2/deref()
Often it is useful to regroup elements in an XML document. For example,
each book
element in bib0
groups one title with multiple authors. This expression groups each author
with the titles of his/her publications.
for a in distinct_value(bib0/book/author/data()) do biblio [ author[a], for b in bib0/book do for a2 in b/author/data() do where a = a2 do b/title ] ==> biblio [ author [ "Abiteboul" ], title [ "Data on the Web" ] ], biblio [ author [ "Buneman" ], title [ "Data on the Web" ] ], biblio [ author [ "Suciu" ], title [ "Data on the Web" ], title [ "XML Query" ] ], biblio [ author [ "Fernandez" ], title [ "XML Query" ] ] : biblio [ author [ String ], title [ String ] {0, *} ] {0, *}
Readers may recognize this expression as a self-join of books on authors.
The expression distinct_value(bib0/book/author/data())
produces a
forest of author names whose values are all distinct.
The outer
for
expression binds a
to the name of each
author element, and the inner for
expression selects the
title of each book that has some author whose name equals a
.
Here distinct_value
is an example of a built-in function.
It produces a forest of nodes whose values are all distinct,
i.e., there are no duplicate values; the order of the
resulting forest is not defined.
The builtin function distinct_node
produces a forest of nodes whose identities are all distinct.
The type of the result expression may seem surprising:
each biblio
element may contain
zero or more title
elements,
even though in bib0
every author
co-occurs with a title
. Recognizing such a constraint is
outside the scope of the type system, so the resulting type is not as precise
as we would like.
Often it is useful to query the order of elements in an ordered forest or a document. There are two kinds of order among elements: local order and document (or global) order. The Algebra supports querying of local and global order.
Local order refers to the order among sibling elements in an ordered forest.
To query local order, the index
function pairs an integer index with
each element in an ordered forest:
index(book0/author) ==> pair [ fst [ 1 ], snd [ author [ ref("Abiteboul") ] ] ], pair [ fst [ 2 ], snd [ author [ ref("Buneman") ] ] ], pair [ fst [ 3 ], snd [ author [ ref("Suciu") ] ] ] : pair [ fst [ Integer ], snd [ ref(author [ String ]) ] ] {1, *}
The index
function uses reference in order to preserve node identity
when accessing local order. Note that the result type takes into
account that at least one pair exists in the result, as book0/author
always contains one or more authors.
Once we have paired authors with an integer index, we can select the first two authors:
for p in index(book0/author) do where (p/fst/data() <= 2) do p/snd/deref() ==> author [ "Abiteboul" ], author [ "Buneman" ] : author [ String ] {0, *}
The for
expression iterates over all pair elements produced by the
index
expression. It selects elements whose index value in the fst
element is between one and two inclusive, and it returns the original
content by dereferencing the content of the snd
element.
The result type may be surprising, because the Book
type
guarantees that each book has at least one author. However, the type
system cannot determine that the conditional where
expression will always
succeed, so the inner expression may produce zero results. (A
sophisticated analysis might improve type precision, but is likely to
require much work for little benefit.)
Document (or global) order refers to the total order among all elements in a document. Global order is defined as the order of appearance of the element nodes when performing a pre-order, depth-first traveral of a tree. This corresponds to the order of appearance of their opening tags in the XML serialization. This is equivalent to the definition used in [XPath].
To query global order, the <
operator
can be applied between two
element nodes. It returns true if the first node is before (and
different from) the second node in document order. It returns false if
the first node is equal to or after the second node in document
order. It raises an error if the nodes are in different documents.
For example, the nodes bib0
and review0
are unrelated therefore comparing their order raises an error:
bib0 < review0 ==> error : 0
The >
operator, defined
similiarly, can be derived using the <
operator and
the node equality operator ==
.
Using global order, the following expression returns all author nodes appearing after a book written in 2001:
for b in bib0/book do where b/@year/data() = 2001 do for a in bib0/book/author do where b < a do a ==> author[ "Fernandez" ], author[ "Suciu" ] : author[ String ]{0,*}
Note that the root element of a document is before any other
element. More generally, an element is before all of its children.
For example, the set of elements that are before bib0
is
empty:
empty(for b in bib0/book do where bib0 > b do b) ==> true : Boolean
The Algebra supports global order only for elements within the same document. Support for global order among elements in distinct documents is discussed in [Issue-0003: Global Order].
To sort a forest, the Algebra provides a sort
expression, whose form is:
sort
Var in
Exp1
by
Exp2.
The variable Var ranges over the items in the forest Exp1
and sorts those items using the key value
defined by Exp2.
For example, this expression sorts book
elements in
review0
by their titles.
Note that the variable b
appears in the key expression
b/title
.
sort b in review0/book by b/title ==> book [ title [ "Data on the Web" ], review [ "This is great!" ] ], book [ title [ "XML Query" ], review [ "This is pretty good too!" ] ] : book [ title [ String ], review [ String ] ] ] {0, *}
The sort
expression is a restricted
form of higher-order
function, i.e., it takes a function as an argument.
In this case, sort
takes a single function, which
extracts the key value from each element.
The sort
expression requires that the less-than
inequality, <
, be defined for
the type of Var.
We have already seen several several built-in
functions: index
, distinct_value
,
and
sort
. In addition to these
functions, the Algebra has five built-in aggregation
functions: avg
, count
,
max
, min
, and sum
.
This expression selects books that have more than two authors:
for b in bib0/book do where count(b/author) > 2 do b ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : Book{0, *}
All the aggregation functions take a forest with repetition type and return
an integer value; count
returns the number of elements
in the forest.
So far, all our examples of attributes and elements use unqualified local names, i.e., names that do not include an explicit namespace URI. It is also possible to specify and match on the expanded name of an attribute or element. The expanded name {Namespace}LocalName consists of a namespace URI Namespace and a local name LocalName.
Consider an inventory of books that contains data from
both http://www.BooksRUs.com
and
http://www.cheapBooks.com
.
In this example, the first element contains values whose names are
defined in the BooksRUs.com
namespace, and the second element
contains values whose names are defined in the
cheapBooks.com
namespace:
let inventory = inv [ {"http://www.BooksRUs.com/books.xsd"}book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], {"http://www.BooksRUs.com/books.xsd"}title [ "Data on the Web" ], {"http://www.BooksRUs.com/books.xsd"}author [ "Abiteboul" ], {"http://www.BooksRUs.com/books.xsd"}author [ "Buneman" ], {"http://www.BooksRUs.com/books.xsd"}author [ "Suciu" ] ], {"http://www.cheapBooks.com/ourschema.xsd"}book [ @year [ 2001 ], {"http://www.cheapBooks.com/ourschema.xsd"}title [ "XML Query" ], {"http://www.cheapBooks.com/ourschema.xsd"}author [ "Fernandez" ], {"http://www.cheapBooks.com/ourschema.xsd"}author [ "Suciu" ] {"http://www.cheapBooks.com/ourschema.xsd"}isbn [ "1-XXXXX-YYY-Z" ], ] ] ] : Inventory type Inventory = inv [ InvBook{0, *}]
In this example, elements imported from existing schemas each refer
to a single namespace, thus the definition of InvBook
is:
type BooksRUBook = {"http://www.BooksRUs.com/books.xsd"}book [ @year [ Integer ] & @isbn [ String ], {"http://www.BooksRUs.com/books.xsd"}title [ String ], {"http://www.BooksRUs.com/books.xsd"}author [ String ]{1, *} ] type CheapBooksBook = {"http://www.cheapBooks.com/ourschema.xsd"}book [ @year [ Integer ], {"http://www.cheapBooks.com/ourschema.xsd"}title [ String ], {"http://www.cheapBooks.com/ourschema.xsd"}author [ String ]{1, *}, {"http://www.cheapBooks.com/ourschema.xsd"}isbn [ String ], ] type InvBook = BooksRUBook | CheapBooksBook
Here vertical bar (|
) is used to indicate a
choice between types: each InvBook
is either a BooksRUBook
or
a CheapBooksBook
.
We have already seen how to project on the constant name of an attribute or
element.
It is also useful to project on wildcards, which
are used to match names with any namespace and/or any local name.
For example, this expression matches elements
with any local name and with
namespace URI
http://www.BooksRUs.com/books.xsd
:
inventory/{"http://www.BooksRUs.com/books.xsd"}* ==> {"http://www.BooksRUs.com/books.xsd"}book [ {"http://www.BooksRUs.com/books.xsd"}title [ "Data on the Web" ], {"http://www.BooksRUs.com/books.xsd"}author [ "Abiteboul" ], {"http://www.BooksRUs.com/books.xsd"}author [ "Buneman" ], {"http://www.BooksRUs.com/books.xsd"}author [ "Suciu" ] ] : {"http://www.BooksRUs.com/books.xsd"}book [ {"http://www.BooksRUs.com/books.xsd"}title [ String ], {"http://www.BooksRUs.com/books.xsd"}author [ String ] {0, *}, ]
Similarly, this expression first projects elements
in any namespace whose local name is book
and then projects on their year
attributes:
inventory/{*}book/@{*}year ==> @year [ 1999 ], @year [ 2001 ] : (@year [ Integer ] | @year [ Integer ]){0,*}
The expression Exp/a is shorthand for
Exp/{}a, that is,
the expanded name with the null namespace.
Similarly, *
is shorthand for {}*
.
In an XML document, comments and processing instructions (PIs) may appear anywhere outside other markup[XML]. Processing instructions (PIs) permit documents to contain instructions for applications. Comments and PIs are not part of the document's character data. An XML processor may, but need not, make the text of comments available to an application, but it must pass PIs to the application. The PI begins with a target used to identify the application to which the instruction is directed.
The Algebra supports comments and processing instructions
in types and expressions.
The type expression pic
t denotes
a value in which zero or more PIs and comments may be interleaved
arbitrarily with the nodes in type t.
For example, the two element types BibPIC
and
BookPIC
permit PIs and comments to be interleaved
with their content:
type BibPIC = bib [ pic BookPIC{0,*} ] type BookPIC = book [ @year [ Integer ] & @isbn [ String ], pic (title [ String ], author [ String ]{1, *}) ]Note that in the
book
element, the pic
operator is only applied to its element content, not its attribute content,
because comments and processing instructions may not occur in
attributes.
We can construct PI and comment values using the
built-in constructors pi
and comment
:
let bibpc0 : BibPIC = bib [ comment("Canonical XML Query Algebra example."), bib [ book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], comment("First book example"), processing_instruction("Publisher.asp", "publisher=http://www.mkp.com") title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ], book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], comment("Second book example"), author [ "Fernandez" ], author [ "Suciu" ] ] ] ]
Finally, we can project on processing instructions and comments, in the same way we project on children, attributes, and atomic content:
bibpc0/book/comment() ==> comment("First book example"), comment("Second book example") : Comment{0,*} bibpc0/book/processing_instruction() ==> processing_instruction("Publisher.asp", "publisher=http://www.mkp.com") : PI{0,*}
Comments and processing instructions may be ignored by an XML processor, in which case they would not even be accessible to a query processor. If they are not ignored, however, comments and processing instructions are typed values and are treated like any other value in an algebraic expression.
An element type has mixed content when elements of that type may
contain character data, optionally interspersed with child
elements[XML].
The type expression mixed
t denotes
a value in which zero or more String values may be interleaved
arbitrarily with the nodes in type t.
For example, the content of the review
element
contains a reviewer
element, which may be interleaved
with string values:
type ReviewsMixed = reviews [ book [ title [ String ], review [ mixed reviewer [ String ] ] ]{0, *} ]
Here are two examples of mixed content, in which the text of the book review may be interleaved with the name of the reviewer:
let reviewmix0 : ReviewsMixed = reviews [ book [ title [ "XML Query" ], review [ "A darn fine book: ", reviewer [ "XML On-line" ] ] ], book [ title [ "Data on the Web" ], review [ "The ", reviewer [ "publisher" ], " says 'This is great!'" ] ] ]
It is often useful to
concatenate all the string values of a mixed-content element to
recover
its complete text value.
We use the builtin function string
;
as defined in XPath [XPath],
the string value of a node is determined by its kind,
e.g., element, attribute, etc.
for b in reviewmix0/book do string(b/review) ==> "A darn fine book : XML On-line", "The publisher says 'This is great!'"
Functions can make queries more modular and concise. Recall that we used the following query to find all books that do not have "Buneman" as an author.
for b in bib0/book do where empty(for a in b/author do where a/data() = "Buneman" do a) do b ==> book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] : Book{0, *}
A different way to formulate this query is to first define a function that
takes a string s
and a book b
as arguments, and returns true if book b
does not have
an author with name s
.
fun notauthor (s : String; b : Book) : Boolean = empty(for a in b/author do where a/data() = s do a)
The query can then be re-expressed as follows.
for b in bib0/book do where notauthor("Buneman"; b) do b ==> book [ @year [ 2001 ], @isbn [ "1-XXXXX-YYY-Z" ], title [ "XML Query" ], author [ "Fernandez" ], author [ "Suciu" ] ] : Book{0, *}
We use semicolon rather than comma to separate function arguments, since comma is used to concatenate forests.
Note that a function declaration includes the types of all its arguments and the type of its result. This is necessary for the type system to guarantee that applications of functions are type correct.
In general, any number of functions may be declared at the top-level. The order of function declarations does not matter, and each function may refer to any other function. Among other things, this allows functions to be recursive (or mutually recursive), which supports structural recursion, the subject of the next section.
Functions make the Algebra extensible. We have seen examples of built-in
functions (sort
and
distinct_value
) and examples of user-defined functions (notauthor
).
In addition to built-in and user-defined
functions, the Algebra could support externally defined functions, i.e.,
functions that are not defined in the Algebra itself, but in some external
language. This would make special-purpose implementations of, for example,
full-text search functions available in the Algebra. We discuss support for
externally defined functions in [Issue-0009:
Externally defined functions].
XML documents can be recursive in structure, for example, it is possible to
define a part
element that directly or indirectly
contains other part
elements. In the Algebra, we use
recursive types to define documents with a recursive structure, and we use
recursive functions to process such documents. (We can also use mutually
recursive functions for more complex recursive structures.)
For instance, here is a recursive type defining a part hierarchy.
type Part = Basic | Composite type Basic = basic [ cost [ Integer ] ] type Composite = composite [ assembly_cost [ Integer ], subparts [ Part{1, *} ] ]
And here is some sample data.
let part0 : Part = composite [ assembly_cost [ 12 ], subparts [ composite [ assembly_cost [ 22 ], subparts [ basic [ cost [ 33 ] ] ] ], basic [ cost [ 7 ] ] ] ]
Here vertical bar (|
) is used to indicate a
choice between types: each part is either basic (no subparts), and has a cost,
or is composite, and includes an assembly cost and subparts.
We might want to translate to a second form, where every part has a total cost and a list of subparts (for a basic part, the list of subparts is empty).
type Part2 = part [ total_cost [ Integer ], subparts [ Part2{0, *} ] ]
Here is a recursive function that performs the desired transformation. It
uses a new construct, the match
expression.
fun convert(p : Part) : Part2 = match p case b : Basic do part [ total_cost [ b/cost/data() ], subparts [] ] case c : Composite do let s = (for y in c/subparts/*) do convert(y)) do part [ total_cost [ q/assembly_cost/data() + sum(s/total_cost/data()) ], subparts[ s ] ] else error
Each branch of the match expression is labeled with a type,
Basic
or Composite
, and with a corresponding variable, b
or c
.
The evaluator checks the type of the value of p
at
query-evaluation time, also known as run time,
and evaluates the corresponding branch.
If the first branch is taken then b
is bound to the
value of p
, and the branch returns a new part with total cost the
same as the cost of b
, and with no subparts. If the second
branch is taken, then c
is bound to the value of p
. The
function is recursively applied to each of the subparts of c
,
giving an ordered forest of new subparts s
. The branch returns a new part
with total cost computed by adding the assembly cost of c
to the
sum of the total cost of each subpart in s
, and with subparts
s
.
One might wonder why b
and c
are required,
since they have the same value as p
. The reason why is
that p
, b
, and c
have different
types.
p : Part b : Basic c : Composite
The types of b
and c
are more precise than
the type of p
, because which branch is taken depends upon
the type of value in p
.
Applying the query to the given data gives the following result.
convert(part0) ==> part [ total_cost [ 74 ], subparts [ part [ total_cost [ 55 ], subparts [ part [ total_cost [ 33 ], subparts [] ] ] ], part [ total_cost [ 7 ], subparts [] ] ] ]
Of course, a match
expression may be used in any
query, not just in a recursive one.
Recursive types allow us to define a type that matches any
well-formed XML document. This type is called AnyTree
:
type AnyTree = AnyScalar | AnyElement | AnyAttribute type AnySimpleType = AnyScalar | AnyScalar{0,*} type AnyAttribute = @*[ AnySimpleType ] type AnyElement = *[ AnyComplexType ] type AnyComplexType = AnyTree{0, *} type AnyType = AnySimpleType | AnyComplexType
AnyTree
stands for any scalar, element, or attribute
value.
Here AnyScalar
is a built-in scalar type. It stands
for the most general scalar type, and all other scalar types
(like Integer
or String
) are
subtypes of it.
AnySimpleType
stands for the most
general simple type, which is a scalar or a list of scalars.
AnyAttribute
stands for the most general attribute type.
The asterisk (*
) is used to indicate a
wild-card type, i.e., a type whose name is not known statically.
The type AnyAttribute
may have any name, and its content must have type
AnySimpleType
, i.e., it may contain atomic values,
but no elements.
The AnyElement
stands for the most general element type,
which may have any name, and its content must be a
complex type, which is a repetition of zero or more
AnyTree
s.
Finally, an AnyType
is either
an AnySimpleType
or an AnyComplexType
.
In other words, any element, attribute, or atomic value
has type AnyType
.
The use of * is a significant extension to XML Schema, because
XML Schema has no type corresponding to *[t],
where t is some type other than
AnyComplexType
.
It is not clear that this extension is necessary, since the more
restrictive expressiveness of XML Schema wildcards may be adequate.
In particular, our earlier data also has type AnyTree
.
book0 : AnyTree ==> book [ @year [ 1999 ], @isbn [ "1-55860-622-X" ], title [ "Data on the Web" ], author [ "Abiteboul" ], author [ "Buneman" ], author [ "Suciu" ] ] : AnyTree
A specific type can be indicated for any expression in the query language, by writing a colon and the type after the expression.
As an example of a function that can be applied to all well-formed documents, we define a recursive function that converts any XML data into HTML. We first give a simplified definition of HTML.
type HTML_body = ( AnyScalar | b [ HTML_body ] | ul [ li [ HTML_body ] {0, *} ] ) {0, *}
An HTML body consists of a sequence of zero or more items, each of which is
either a scalar, or a b
element, where the content is
an HTML body, or an ul
element, where the children
are li
elements, each of which has as content an HTML
body.
Now, here is the function that performs the conversion.
fun html_of_xml( x : AnyTree ) : Html_Body = match x case s : AnyScalar do s case v1 : AnyAttribute do b [ name(v1) ], ul [ for y in v1/* do li [ html_of_xml(y) ] ], case v2 : AnyElement do b [ name(v2) ], ul [ for y in attributes(v2) do li [ html_of_xml(y) ], ul [ for y in v2/* do li [ html_of_xml(y) ] ] else error
The first branch of the match
expression checks whether the value
of x
is a subtype of AnyScalar
, and
if so then s
is bound to that value, so if this branch
is taken then s
is the same as x
, but
with a more precise type (it must be a scalar, not an element). This branch
returns the scalar.
The second branch checks whether the value
of x
is a subtype of AnyAttribute
. As before,
v1
is the same as
x
but with a more precise type (it must be an attribute,
not a scalar). This branch returns a b
element
containing the name of the attribute, and a ul
element
containing one li
element for each value of the
attribute. The function is recursively applied to get the content of the
li
element.
The last branch is analogous to the second, but it matches an element
instead of an attribute, and it applies html_of_xml
to each of the element's attributes and children.
Applying the query to the book element above gives the following result.
html_of_xml(book0) ==> b [ "book" ], ul [ li [ b [ "year" ], ul [ li [ 1999 ] ] ], li [ b [ "isbn" ], ul [ li [ "1-55860-622-X" ] ] ], li [ b [ "title" ], ul [ li [ "Data on the Web" ] ] ], li [ b [ "author" ], ul [ li [ "Abiteboul" ] ] ], li [ b [ "author" ], ul [ li [ "Buneman" ] ] ], li [ b [ "author" ], ul [ li [ "Suciu" ] ] ] ] : HTML_Body
A query consists of a sequence of top-level expressions, or query items, where each query item is either a type declaration, a function declaration, a global variable declaration, or a query expression. The order of query items is immaterial; all type, function, and global variable declarations may be mutually recursive.
Each query expression is evaluated in the environment specified by all of the declarations. (Typically, all of the declarations will precede all of the query expressions, but this is not required.) We have already seen examples of type, function, and global variable declarations. An example of a top-level query is:
query html_of_xml(book0)
To transform any expression into a top-level query, we simply precede the
expression by the query
keyword.
The result of a top-level query can be serialized into an XML document
by the application
in which the Algebra is used.
In this section, we summarize the Algebra and present the grammars for types and expressions. We start by defining the non-terminal and terminal symbols that appear in the type and expression grammars.
A Namespace is a URI, and a LocalName is an NCName, as in the Namespace recommendation [XML Names]. We let Namespace range over namespaces and the null value, which represents the absence of a namespace, and LocalName range over NCNames. The symbol Name ranges over expanded names. An expanded name is a (namespace, local name) pair.
|
The unqualified expression LocalName is a syntactic shorthand for {}LocalName. In the remainder of the document, all expanded names include a namespace qualifier.
A wildcard denotes a set of names. A wildcard item is
of the form {*}*
denoting any name in any namespace,
{Namespace}*
denoting any name in namespace Namespace,
{*
}LocalName
denoting the local name LocalName in any namespace,
or
{Namespace}LocalName denoting just the
name with namespace Namespace and local name LocalName.
A wildcard consists
of wildcard items, union of wildcards, or difference of wildcards.
We let WItem range over wildcard items, and
Wild range over wildcard expressions.
|
For example,
the wildcard {*}*!{"http://www.foo.org/baz}*!{"http://www.xsd.com/xsi}String
denotes any
name in any namespace except for names in namespace
http://www.foo.org/baz
and for local name
String
in namespace http://www.xsd.com/xsi
.
Wildcards denote sets of expanded names, so we can define a containment relation, <:wild that relate sets of wildcards. The inequalities that hold for this relation include:
{Namespace}LocalName | <:wild | {*}LocalName |
{Namespace}LocalName | <:wild | {Namespace}* |
{Namespace}* | <:wild | {*}* |
{*}LocalName | <:wild | {*}* |
{Namespace}LocalName | <:wild | {*}* |
The last inequality holds by transitivity. Note, however, that {*}LocalName <:wild {Namespace}* does not necessarily hold.
The Algebra takes as given the atomic datatypes from XML Schema Part 2 [XML Schema Part 2]. We let p range over all atomic datatypes.
|
Typically, an atomic datatype is either a primitive datatype, or is derived from another atomic datatype by specifying a set of facets. A type hierarchy is induced between scalar types by containment of facets. Note that lists of atomic datatypes are specified using repetition and unions are specified using alternation, as defined in [3.4 Types : abstract syntax]
The built-in atomic data type
AnyScalar
stands for the most
general scalar type, and all other scalar types (like Integer
or
String
) are subtypes of it.
[Figure 1] contains the abstract syntax for the Algebra's type system. A type is closely related to a model group as defined in [MSL]. MSL uses standard regular expression notation for model groups and we do the same for algebra types. This type system captures the essence of [XML Schema Part 1] and is similar to the type system used in XDuce [HP2000].
type variable y type t ::= y type variable | () empty sequence | Ø empty choice | Wild wildcard type | u unit type | t1 , t2 sequence, t1 followed by t2 | t1 & t2 interleaved product, nodes in t1 interleaved with nodes in t2 | t1 | t2 choice, t1 or t2 | t {m, n} repetition of t between m and n times | {t} unordered forest of nodes in t | pic t t interleaved with comments and processing instructions | mixed t t interleaved with strings bound m, n ::= natural number or *
unit type u ::= p atomic datatype | @Wild[t] attribute with name in Wild and content in t | Wild[t] element with name in Wild and content in t | PI processing instruction | Comment comment | ref (t) reference to t prime type q ::= u | q | q
Figure 1: Abstract Syntax for Types
The Wild type is necessary, because the Algebra's syntax permits the construction of name expressions (See NameExp in [3.6 Expressions]), whose type is Wild.
A unit type is is either an atomic type, an attribute or element type with a constant or wildcard name, a comment, or a processing instruction. A prime type is a unit type or a disjunction of prime types. Unit and prime types appear in several typing rules in [4 Static Semantics : Type-Inference Rules].
The empty sequence matches only the empty document; it is an identity for sequence and all. The empty choice matches no document; it is an identity for choice.
() , t | = t = | t , () |
() & t | = t = | t & () |
Ø | t | = t = | t | Ø |
The interleaved product (&
, also known as the shuffle product) is a generalization of XML Schema's [XML Schema Part 1]
allgroups. t1 &
t2
matches any sequence that is an interleaving of a sequence that matches t1
and a sequence that matches t2. For example,
(a[],b[]) & c[] = a[],b[],c[] | a[],c[],b[] | c[],a[],b[]
As another example, a[]{0,*}&b[]
matches any sequence of a[]
and b[]
that has exactly one b[]
.
Allgroups in XML Schema may only consist of global or local element declarations
with lower bound 0 or 1, and upper bound 1. With these restrictions, an allgroup
in XML Schema is equivalent to
p1{m1,1} & ... &
pn{mn,1}, where
pi
are element declarations and 0 <=
mi <=
1.
The repetition t{m,n}
denotes a sequence of at least m and at most n t.
The bounds on a repetition type will be either a natural number (that is,
either a positive integer or zero) or the special value *
,
meaning unbounded. We extend arithmetic to include *
in the obvious way:
|
For technical reasons, we allow the lower bound of a repetition to
be *
. A repetition t
{m, n} is equivalent to the empty
choice Ø if m > n or if
m is *
.
The bag type {t} ignores the order of nodes in t. Note that this semantics differs from the conventional one. {t} does not denote a bag of nodes, each with type t, but a bag of the nodes in the list with type t.
The type expression pic
t is a convenient
shorthand for the type (PI | Comment){0,*} &
t, that is, a type in which PIs and comments may
be interleaved with nodes in type t.
Similarly, the type expression mixed
t is a convenient
shorthand for the type String{0,*} &
t.
The Algebra's external type system, that is, the type definitions associated
with input and output documents, is XML Schema. The internal types are in some
ways more expressive than XML Schema, for example, XML Schema has no type
corresponding to
{*}*[t] where t is some
type other than AnyComplexType
. In general, mapping XML Schema
types into internal types will not lose information, however, mapping internal
types into XML Schema may lose information.
It is useful to define a concrete syntax for the Algebra's types using syntactic categories that specify various subsets of types. Syntactic classes can indicate, for example, that the content of an attribute should be a simple type and that the content of an element should consist of attributes followed by elements. These restrictions guarantee that errors in type construction can be caught during parsing of a query.
In the concrete syntax for types, we capitalize non-terminal symbols. We first distinguish between type variables used to name attribute groups, element groups, and the content types of elements.
TypeVar | ::= | AttrGroupVar |
| | ElemGroupVar | |
| | ContentTypeVar |
A simple type is either an atomic type, a list of atomic types, or a choice of atomic types, which allows a choice of atomic lists.
SimpleType | ::= | p | atomic |
| | p{m,n} | list of atomic | |
| | SimpleType | SimpleType | choice of atomic |
An attribute group defines the syntactic form of attributes: their content may only be a simple type and they are combined only with the interleaving operator, but not the sequence operator.
AttrGroup | ::= | @Wild[SimpleType] | |
| | @Wild[SimpleType]{0,1} | optional attribute | |
| | AttrGroup & AttrGroup | ||
| | AttrGroupVar | ||
| | (AttrGroup) | ||
| | () |
An element group contains elements with constant or wildcard names and they are combined in sequences, choices, interleavings, and repetitions.
ElemGroup | ::= | Wild[ContentType] | element |
| | ElemGroup , ElemGroup | ||
| | ElemGroup | ElemGroup | ||
| | ElemGroup & ElemGroup | ||
| | ElemGroup{m,n} | ||
| | ElemGroupVar | ||
| | pic ElemGroup |
||
| | mixed ElemGroup |
||
| | (ElemGroup) | ||
| | () | ||
| | Ø |
The content type of an element is either an attribute group, an element group, a sequence of an attribute group followed by an element group, or a content-type variable.
ContentType | ::= | AttrGroup | |
| | ElemGroup | ||
| | AttrGroup , ElemGroup | ||
| | ContentTypeVar | content-type variable |
Finally, a type in an algebraic expression may be an attribute group, an element group, or a content type:
Type | ::= | AttrGroup | ElemGroup | ContentType |
Ed. Note: MF (Nov-16-2000): This needs work.
[Figure 2] contains the concrete syntax for the Algebra's ``core'' expressions. We define the Algebra's typing rules on these expressions. Typing rules for the Algebra are defined in [4 Static Semantics : Type-Inference Rules]. We have seen examples of most of the expressions, so we will only point out details here.
name Name function FuncName variable Var constant Const atomic-datatype constant equality/inequality operator Opeq ::= =
|==
|!=
|<
|<=
|>
|>=
arithmetic operator Oparith ::= +
|-
|*
|/
|mod
|div
boolean operator Opbool ::= and
|or
binary operator BinaryOp ::= Opeq | Oparith | Opbool unary operator UnaryOp ::= +
|-
|not
name expression NameExp ::= Name | {Exp}Exp computed name expression Exp ::= Const atomic constant | NameExp name expression | Var variable | @NameExp[Exp] attribute | NameExp[Exp] element | Exp, Exp sequence or union | Exp - Exp difference | Exp /\ Exp intersection | () empty sequence | if
Expthen
Expelse
Expconditional | let
Var=
Expdo
Explocal binding | FuncName (Exp ;...; Exp) function application | Exp : Type explicit type | error
error | Exp BinaryOp Exp binary operator | UnaryOp Exp unary operator | sort
vin
Expby
Expsort expression | for
Varin
Expdo
Expiteration | match
Exp CaseRulesmatch case rules CaseRules ::= case
Var : Typedo
Exp CaseRules| else
Expquery item Query ::= type
TypeVar = Typetype declaration | fun
FuncName (Var : Type ; ... ; Var : Type): Type =Exp function declaration | let
Var : Type = Expglobal variable declaration | query
Expquery expression
Figure 2: Core Algebra Expression Syntax
Many of the expressions that appear in the examples, for example
Exp/*
, do not
appear in [Figure 2], because they are
reducible to expressions in the core syntax.
[Figure 3] lists these reducible expressions.
In [A Equivalences], we give the laws that relate a reducible
expression with its equivalent expression in the core algebra.
attributes
(Exp)attributes children
(Exp)children Exp / @Wild attribute projection Exp / Wild element projection Exp / data()
atomic projection Exp / comment()
comment projection Exp / processing_instruction()
processing instruction projection Exp / deref()
dereference projection where
Expdo
Expwhere conditional empty
(Exp)emptiness predicate cast
Exp : Typerun-time type cast let
Var : Type=
Expdo
Explocal variable with type
Figure 3: Reducible Algebra Expressions
In addition to the core grammar, the Algebra has
a set of operators and built-in functions.
The binary and unary operators are enumerated in [Figure 2].
They include two equality operators, ==
and
=
, which are also required by the [XML Query Data Model],
and five inequality operators, <=
, <
,
>=
, >
, and !=
.
The <
operator is overloaded: when applied to two
document nodes, it compares their global document order.
We have not defined the semantics of all the binary operators in
the Algebra.
In particular, it might be useful to define more than one type of
equality over scalar and element values, or to define implicit
coercions between values of related types.
A joint task force on operators with members from the [XSLT 99],
XML
Schema, and XML Query working groups is chartered to define
operators.
The Algebra will adopt the decisions of that group
(See [Issue-0056:
Operators
on Simple Types]).
Some of the built-in functions are constructors and accessors defined in the [XML Query Data Model]. They are listed in [Figure 4]. The remaining built-in functions are listed in [Figure 5]. One benefit of having built-in functions is that more precise types can be given to these functions than to user-defined functions. The type rules for these functions appear in [4 Static Semantics : Type-Inference Rules].
bagtolist
(Exp)Sorts unordered forest by document order. comment
(Exp)Constructs a comment. deref
(Exp)Dereferences a node reference. listtobag
(Exp)Disregards order of nodes in argument. localname
(Exp)Extracts local NCName from a QName value. name
(Exp)Returns element or attribute's tag name. namespace
(Exp)Extracts URI namespace from a QName value. nodes
(Exp)Returns nodes in content of element or attribute. parent
(Exp)Returns the parent of a document node. processing_instruction
(Exp, Exp)Constructs a processing instruction. ref
(Exp)Constructs a node reference. string
(Exp)Returns string value of given node, as defined in [XPath]. target
(Exp)Returns the target name of a processing instruction. value
(Exp)Returns the value of a comment or processing instruction.
Figure 4: Data Model Constructor and Accessor Functions
agg(Exp) Aggregation functions, where agg is one of avg
,count
,min
,max
,sum
.distinct_node
(Exp)Removes duplicate nodes from a forest. distinct_value
(Exp)Removes duplicate values from a forest. index
(Exp)Pairs each element of an ordered forest with integer index.
Figure 5: Built-In Functions
The Algebra's static semantics is presented as type inference rules, which relate Algebra expressions to types and and specify under what conditions an expression is well typed. The rules for typing expressions are described with an inference rule notation, which is now widely used for describing type systems and semantics of programming languages. For a textbook introduction to type systems, see, for example, [Mitchell].
In inference notation, when all judgements above the line hold, then the judgement below the line holds as well. Here is an example of a rule used later on:
|- Data1: t1 |- Data2: t2 |
|
|- (Data1, Data2): (t1, t2) |
Data is the subset of expressions that consists only of scalar constant, attribute, element, sequence, and empty-sequence expressions.
Data | ::= | Const | atomic constant | |
| | @Name[Data] | attribute | ||
| | Name[Data] | element | ||
| | Data, Data | sequence | ||
| | () | empty sequence |
The judgement "|- Data : t"
is read as "in the empty environment, the value Data has type
t."
The symbol |- is called a turnstyle, and is usually preceded by an
environment symbol, which represents
a mapping from variables to values.
In this example, there is no environment symbol,
which means the judgement holds in the empty environment.
In [4.4 Expressions], we will see examples of rules that
have non-empty environments.
The rule states that if both
Data1 : t1
and
Data2 : t2
hold, then
(Data1,
Data2):
(t1,
t2)
holds as well.
For instance,
take
Data1 = a[1]
,
Data2 = b["two"]
,
t1 = a[Integer]
, and
t2 = b[String]
.
Then since both
a[1]
: a[Integer]
and b["two"]
: b[String]
hold, we may conclude that
(a[1]
, b["two"]
) : (a[Integer]
,
b[String]
)
holds.
The following type rules relate Data values to types. The next rule states that for any constant whose lexical form defines a value of type p, the constant has type p:
|
|- Constp: p |
The next rule states that any constant also has type AnyScalar
:
|
|-
Const :
AnyScalar
|
The next two rules are for attribute and element construction. The first rule states that if Data has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.
|- Data : t |
|
|- @Name[Data]: @Name[t] |
|- Data : t |
|
|- Name[Data]: Name[t] |
The next rule is for typing sequences and was already described at the beginning of [4 Static Semantics : Type-Inference Rules].
|- Data1: t1 |- Data2: t2 |
|
|- (Data1, Data2): (t1, t2) |
The next rule states that the empty sequence value always has the empty sequence type:
|
|- (): () |
The rules above associate the most specific type possible with a Data value. The remaining rules in this section associate more general types with a Data value, which are necessary when the type system must determine whether a Data value with most specific type t is permissible when a value of type t' is expected. This occurs during type checking of a query.
The next two rules are also for attribute and element construction, but these rules specify more general types. The first rule states that if Exp has type t, and Name is in the set of names defined by the wildcard expression Wild, then the given attribute expression also has type @Wild[t]. The subsequent rule is analogous for element expressions.
|
||
|
||
|- @Name[Data] : @Wild[t] |
|- Data : t |
|
|- Name[Data] : Wild[t] |
The next rule states that if the value Data has type t1, then it also has type t1| t2. The subsequent rule is analogous and states that the choice operator is associative.
|- Data : t1 |
|
|- Data : (t1| t2) |
|- Data : t2 |
|
|- Data : (t1| t2) |
The next rule states that the sequence of one value with type t followed by a value with repetition type t{m,n} has repetition type t{m+1,n +1}.
|- Data1: t |- Data2: t {m, n} |
|
|- (Data1, Data2): t {m +1, n +1} |
The next rule states that the sequence of one value with type t followed by a value with repetition type t{0,n} has repetition type t{0,n +1}.
|- Data1: t |- Data2: t {0, n} |
|
|- (Data1, Data2): t {0, n +1} |
The next rule states that the empty sequence is a repetition of any type with lower bound 0:
|
|- (): t {0, n} |
The symbol <: denotes the subtype relation. We write t1<: t2 if for every data Data such that |- Data : t1, it is also the case that |- Data : t2, i.e., is t1 is a subtype of t2. The subtyping relation is used in many of the type rules that follow. It is easy to see that the subtype relation <: is a partial order, i.e., it is reflexive, t <: t, and it is transitive, if t1<: t2and t2<: t3then t1<: t3. Here are some of the inequalities that hold:
|
Further, if Name <:wild Wild and t <: t', then
Name[t] | <: | Wild[t'] |
If t <: t'
and m >=
m' and
n <=
n' then
t {m, n} | <: | t' {m', n'} |
If t <: t' then
t | <: | {t'} |
If t1<: t1' and t2<: t2' then
t1, t2 | <: | t1', t2' |
t1, t2 | <: | t1' & t2' |
t1 & t2 | <: | t1' & t2' |
We write t1= t2 if t1<:t2 and t2<: t1. Here are some of the equalities that hold:
|
Ed. Note: PF, Dec 11 2000: The last two equivalence rules are necessary to reuse the auxiliary typing rules for iteration by
for
for iteration over unordered forests (see [4.8 Iteration expressions].)
Ed. Note: MF, Jan 09 2000: Peter, do you mean := above or =?
Ed. Note: PF, Jan 29 2000: yes, these equivalences are by definition
We also have that t1<: t2 if and only if t1| t2= t2.
We define the intersection t1 /\ t2 of two types t1 and t2 to be the largest type t that is smaller than both t1 and t2. That is, t = t1 /\ t2 if t <: t1 and t <: t2 and if for any t' such that t' <: t1 and t' <: t2, we have t' <: t.
The type rules use an environment that specifies the types of variables and functions. The type environment is denoted by G, and is composed of a comma-separated list of variable types, Var : t or function types, FuncName :(t1 ; ... ; tn) : t. We retrieve type information from the environment by writing (Var : t) in G to look up a variable, or by writing (FuncName : (t1;...; tn) : t) in G to look up a function.
The type checking starts with an environment that contains all the types declared for functions and global variables. For instance, before typing the first query of [2.2 Projection], the environment contains:
G = bib0 : Bib, book0 : Book
While doing the
type-checking, new
variables will be added in the environment. For instance, when typing the query
of [2.4 Iteration],
variable b
will be typed with Book
,
and
added in the
environment. This will result in a new environment:
G' = G, b : Book
We write
G |- Exp : t if
in
environment G
the
expression
Exp has type
t.
Below are
all the rules except those for for
and match
expressions, which are discussed in
later subsections.
The next rule states that in any environment G,
a constant whose lexical form
defines a value of type p has type
p. So, for example, G |- 1 : Integer
and G |- "two" : String
.
|
G |-
Constp:
p
|
The next five rules are for typing expressions that define names.
|
|- LocalName: {}LocalName |
|
|- {Namespace}LocalName: {Namespace}LocalName |
G |-
Exp :
NCName
|
|
G |- {Namespace}Exp: {Namespace}* |
G |-
Exp : URI
|
|
G |- {Exp}LocalName: {*}LocalName |
G |-
Exp1:
URI
G |-
Exp2:
NCName
|
|
G |- {Exp1}Exp2: {*}* |
The next rule is a trivial definition of the variable Var having type t in environment G:
(Var : t) in G |
|
G |- Var : t |
The next two rules are for attribute and element construction with a constant tag name. The first rule states that if Exp has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.
G |- Exp : t |
|
G |- @Name[Exp]: @Name[t] |
G |- Exp : t |
|
G |- Name[Exp]: Name[t] |
The next two rules are for attribute and element construction in which the tag name is computed. The first rule states that if Exp1 is in the set of names defined by Wild, and Exp2 has type t, then the given computed attribute expression has type @Wild[t]. The subsequent rule is analogous for element expressions.
G |- Exp1: Wild G |- Exp2: t |
|
G |- @Exp1[ Exp2]: @Wild[t] |
G |- Exp1: Wild G |- Exp2: t |
|
G |- Exp1[ Exp2]: Wild[t] |
The following two rules are analogous to the sequence and empty sequence rules in [4.1 Relating data to types].
G |- Exp1: t1 G |- Exp2: t2 |
|
G |- Exp1, Exp2: t1, t2 |
|
G |- () : () |
Note that in the type rule for a conditional expression, the result type is the choice (t2|t3).
G |-
Exp1:
Boolean
G |-
Exp2:
t2
G |-
Exp3:
t3
|
|
G |-
if
Exp1
then
Exp2
else
Exp3:
(t2 |
t3)
|
A let
expression extends the current environment
Gamma; with the variable Var with type
t.
Note that Exp2,
the body of the let
expression,
is typed in the extended environment, and the type of the entire
let
expression is t2.
G |- Exp1: t1 G, Var : t1 |- Exp2: t2 |
|
G |-
let
Var =
Exp1
do
Exp2:
t2
|
The next rule is for function application. In a function application, the type of each actual argument to the function must be a subtype of the corresponding formal argument to the function, i.e., it is not necessary for the actual and formal types to be equal.
|
|||||
|
|||||
G |- FuncName (Exp1;...; Expn) : t |
The next rule states that is always permissible to explicitly type an expression with a type t' that is a supertype of the expression's type t. In programming-language terminology, this operation is sometimes called an ``upcast''.
G |- Exp : t' t' <: t |
|
G |- (Exp : t) : t |
The error
expression always has the empty choice type.
|
G |-
error
: Ø
|
The complete set of operators are not yet defined, so it is not possible to give the typing rules for operators. A joint task force on operators with members from the XSLT, XML Schema, and XML Query working groups is chartered to define operators. The Algebra will adopt the operators defined by that group (See [Issue-0056: Operators on Simple Types]). In general, however, arithmetic operators will have a type rule such as the following, in which t1 and t2 are numeric types and appropriate conversion exist between the two:
G |- Exp1 : p1 G |- Exp2 : p2 |
|
G |- Exp1 Oparith Exp2 : t |
Equality and inequality operators are typed similarly.
G |- Exp1 : t1 G |- Exp2 : t2 |
|
G |-
Exp1
Opeq
Exp2
: Boolean
|
Boolean operators require that their subexpressions be boolean expressions.
G |-
Exp1
: Boolean
G |-
Exp2
: Boolean
|
|
G |-
Exp1
Opbool
Exp2
: Boolean
|
The following type rules define the input and output types for built-in functions.
G |-
Exp
: t
|
|
G |-
count (
Exp) :
Integer
|
The next two rules are for comment and processing instruction constructors.
G |-
Exp : String
|
|
G |-
comment (Exp) : Comment
|
|
||
|
||
G |-
processing_instruction (Exp1; Exp2) : PI
|
The next two rules extract the type of an attribute or element's name from its type.
G |- Exp : @Wild[t] |
|
G |-
name (Exp) : Wild
|
G |- Exp : Wild[t] |
|
G |-
name (Exp) : Wild
|
The nodes
operator only applies to values with attribute
or element type.
G |- Exp : @Wild[t] |
|
G |-
nodes (Exp) : t
|
G |- Exp : Wild[t] |
|
G |-
nodes (Exp) : t
|
The next two rules are for name expressions: they extract the constituent parts of a name.
G |- Exp : Wild |
|
G |-
namespace (Exp)
: URI
|
G |- Exp : Wild |
|
G |-
localname (Exp) : NCName
|
The next rule is for the parent
function, which may be
applied to any unit type:
G |- Exp : u |
|
G |-
parent (Exp) : AnyElement {0,1}
|
The next rule two rules are for the reference constructor and
dereference function. Note that
ref
requires that its argument have a unit type.
G |- Exp : u |
|
G |-
ref (Exp) : ref(u)
|
G |- Exp : ref(t) |
|
G |-
deref (Exp) : t
|
The next rule two rules are for the target
and
value
accessor function on comments and processing
instructions.
G |-
Exp : PI
|
|
G |-
target (Exp) : NCName
|
G |- Exp : (PI | Comment) |
|
G |-
value (Exp) : String
|
Many of the operators in the algebra disregard the relative order of components
in their input type. This includes the built-in operators index
,
sort
, bag
.
For example, consider the following.
index (children (book0)) ==> pair [ fst [ 1 ], snd [ title [ "Data on the Web" ] ], pair [ fst [ 2 ], snd [ author [ "Abiteboul" ] ] ], pair [ fst [ 3 ], snd [ author [ "Buneman" ] ] ], pair [ fst [ 4 ], snd [ author [ "Suciu" ] ] ], pair [ fst [ 5 ], snd [ year [ 1999 ] ]
How should we describe the type of the result? By nesting the children of
book0
under snd
the original sequence of
title, author+, year
gets lost. snd
can contain
either a title
, an author
, or a year
.
More formally, we need to find
q, m, and n
such
that
children(book0) : q{m, n}
and then the type is given by
index(children(book0)) : pair [ fst [ String ], snd [q] ]{m, n}
In the case of books, the values of q are:
title [ String ] | author [ String ] | year [ Integer]
the value of m is 3 (because there will be
one
title
,
at
least one author
, and one year
element) and the value of
n
is *
(because
there may be any number of author elements).
We call a type like q a prime type. In general, it may contain scalar, attribute, element, choice, and empty choice types, but it will not contain repetition, sequence, or empty sequence types (except, perhaps, within an element or attribute type). The definition of prime types appears in [Figure 1].
factor (p) = p {1, 1} factor (Wild[t]) = Wild[t]{1, 1} factor (@Wild[t]) = @Wild[t]{1, 1} factor (t1 , t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti) factor (t1 & t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti) factor (t1| t2) = (q1| q2){m1min m2, n1max n2} where qi{mi, ni} = factor (ti) factor (t {m, n}) = q {m' · m, n' · n} where q {m', n'} = factor (t) factor (()) = Ø{0, 0} factor (Ø) = Ø{*, 0}
Figure 6: Definition of factoring
factor, as shown in [Figure 6], converts any type t to a type of the form q {m, n}, where t <: q {m, n}, so that any value that has type t also has type q {m, n}. For example,
|
We can see here that the factored type is less specific than the unfactored type. For mnemonic convenience we write q {m, n} = factor (t), but one should actually think of the function as returning a triple consisting of a prime type q and two bounds m and n.
Just as factoring a number yields a product of prime
numbers, factoring a type yields a repetition of prime types. Further, the
result yielded by factoring is in some sense optimal. If
q {m, n}
=
factor (t) then
t
<:
q {m, n}
and for any
q', m', and
n' such that t <:
q'{m',
n'} we have
that
q <: q' and
m
>=
m'
and
n <= n'. Also, if
t =
t',
then factor (t) =
factor (t'). In particular,
the
choice of the lower bound *
for
factor (Ø)
guarantees that
factor (t) =
factor (t |
Ø), since m min
*
= m.
Using factor we can type index
, sort
and the operations on unordered forests bag
, union
,
difference
, and intersection
as follows. Note
that factor is only used by the type inference.
rules, and thus is not part of the algebra expressions.
The type rule for index
requires that
its argument be a factored type. The second expression
above the judgement line converts t into a factored type.
G |- Exp : t q {m, n} = factor (t) |
|
G |-
index (Exp) : pair
[ fst
[ Integer ],
snd
[ref (q)]]{m,
n}
|
The types of aggregated expressions must be factored, and their prime type must be a numeric type.
G |-
Exp
:
t
q{m,
n} =
factor (t)
q <: Integer | Float | Decimal, etc.
|
|||
|
|||
|
bag
transforms an ordered forest to an unordered forest, and
accordingly transforms the input type to its factor
.
G |-
Exp :
t
q{m,n} = factor (t)
|
|
G |-
bag (Exp) : {q{m,n}}
|
sort
returns the factor
of its input type.
We distinguish between two kinds of sorts. Stable sort
operates on ordered forests, unstable sort
operates on unordered forests.
In both cases, the result is an ordered forest. The result type for the unstable
sort needs not be explicitly factored, because the input type is already factored.
The key expression
Exp2 is typed in the extended
environment G |-
Var : q.
|
||
|
||
G |-
sort Var in
Exp1 by Exp2:
q{m,n}
|
|
||
|
||
G |-
sort Var in
Exp1 by Exp2:
q{m,n}
|
Ed. Note: MF, Oct 23/2000: This definition assumes that the equality operator on t2 is defined. An alternative is requiring Exp2 to have
AnyScalar
, but that seems too restrictive.
distinct_node
removes all duplicate nodes from its input. Therefore,
the lower bound of the cardinality of the factor type can be at most 1.
As with sort
, we distinguish two kinds of distinct_node
.
Stable distinct_node
operates on an ordered forest,
and returns an ordered forest that preserves the relative order of nodes in
the original input, and unstable distinct_node
operates on and returns an unordered forest and, naturally, does not preserve
input order.
|
||
|
||
G |-
distinct_node (Exp) : q{min 1 m, n}
|
G |- Exp : {q{m,n}} |
|
G |-
distinct_node (Exp) :
{q{min 1 m,n}}
|
Ed. Note: PF (Jan 29, 2000): a more accurate lower bound may be derived by counting the disjoint constituent types of q. But this is probably too complex.
distinct_value
removes all duplicate values from its input. The typing
rules are analogous to the typing rules of distinct_node
.
|
||
|
||
G |-
distinct_value (Exp) : q{min 1 m, n}
|
G |- Exp : {q{m,n}} |
|
G |-
distinct_value (Exp) :
{q{min 1 m,n}}
|
Sequence "," on unordered forests is interpreted as disjoint union.
G |- Exp1 : {q1{m1, n1}} G |- Exp2 : {q2{m2, n2}} |
|
G |- Exp1 , Exp2 : {(q1 | q2){m1 + m2, n1 + n2}} |
Similar to distinct_node
, intersection (/\) and difference (-)
both need to set the lower bound of the cardinality of the input type to 0.
The upper bound of the cardinality of the intersection of two types can be at
most the minimum of upper bounds of their individual cardinalities.
G |- Exp1 : {q1{m1, n1}} G |- Exp2 : {q2{m2, n2}} |
|
G |-
Exp1 /\ Exp2 :
{(q1 /\ q2){0,
min n1 n2}}
|
G |- Exp1 : {q1{m1, n1}} G |- Exp2 : {q2{m2, n2}} |
|
G |- Exp1 - Exp2 : {q1{0,n1}} |
Note that for Exp1 /\
Exp2 :
t and
Exp1, Exp2 -
(Exp1 - Exp2)
- (Exp2 - Exp1) : t',
t <: t'. Thus, although the operational semantics of these
two expressions is equivalent, their static semantics is not necessarily equivalent.
Ed. Note: PF, Jan 29 2000: If this is considered a problem, we can treat /\ as a derived operator, losing some precision at type level.
The typing of for
expressions is rather subtle. We give an intuitive
explanation first and then the detailed typing rules below.
A unit type is defined in [Figure 1];
it is either an atomic type,
an attribute or element type with a constant or wildcard
name, a comment, or a processing instruction. A for
expression
for
Var in
Exp1 do
Exp2
is typed as follows. First, one finds the type of expression Exp1. Next, for each unit type in this type one assumes the variable Var has the unit type and one types the body Exp2. Note that this means we may type the body of Exp2 several times, once for each unit type in the type of Exp1. Finally, the types of the body Exp2 are combined, according to how the types were combined in Exp1. That is, if the type of Exp1 is formed with sequencing, then sequencing is used to combine the types of Exp2, and similarly for choice or repetition.
For example, consider the following expression, which selects all author
elements from a book.
for c in nodes(book0) do match c case a : author[AnyType] do a else ()
The type of nodes(book0)
is
title[String], year[Integer], author[String]{1,*}
This is composed of three unit types, and so the body is typed three times.
|
The three result types are then combined in the same way the original unit types were, using sequencing and iteration.
(), (), author[String]{1,*}
as the type of the iteration, and simplifying yields
author[String]{1,*}
as the final type.
As a second example, consider the following expression, which selects all title
and author
elements from a book, and renames them.
for c in book0/* do match c case t : title[String] do titl [ t/data() ] case y : year[Integer] do () case a : author[String] do auth [ a/data() ] else error()
Again, the type of book0/*
is
title[String], year[Integer], author[String]{1,*}
This is composed of three unit types, and so the body is typed three times.
|
The three result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields
titl[String], (), auth[String]{1,*}
as the type of the iteration, and simplifying yields
titl[String], auth[String]{1,*}
as the final type. Note that the title occurs just once and the author occurs one or more times, as one would expect.
As a third example, consider the following expression, which selects all basic parts from a sequence of parts.
for p in part0/subparts/* do match p case b : Basic do b case c : Composite do () else error()
The type of part0/subparts/*
is
(Basic | Composite){1,*}
This is composed of two unit types, and so the body is typed two times.
|
The two result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields
(Basic | ()){1,*}
as the type of the iteration, and simplifying yields
Basic{0,*}
as the final type. Note that although the original type involves repetition one or more times, the final result is a repetition zero or more times. This is what one would expect, since if all the parts are composite the final result will be an empty sequence.
In this way, we see that for
expressions can be combined with match
expressions
to select and rename elements from a sequence, and that the result is given a
sensible type.
In order for this approach to typing to be sensible, it is necessary that the unit types can be uniquely identified. However, the type system given here satisfies the following law.
a [t1 | t2] = a [t1] | a [t2]
This has one unit type on the left, but two distinct unit types on the right, and so might cause trouble. Fortunately, our type system inherits an additional restriction from XML Schema: we insist that the regular expressions can be recognized by a top-down deterministic automaton. In that case, the regular expression must have the form on the left, the form on the right is outlawed because it requires a non-deterministic recognizer. With this additional restriction, there is no problem.
The method of translating projection to iteration described in the previous section combined with the typing rules given here yield optimal types for projections, in the following sense. Say that variable x has type t, and the projection x / a has type t'. The type assignment is sound if for every value of type t, the value of x / a has type t'. The type assignment is complete if for every value y of type t' there is a value x of type t such that x / a = y. In symbols, we can see that these conditions are complementary.
|
Any sensible type system must be sound, but it is rare for a type system to be complete. But, remarkably, the type assignment given by the above approach is both sound and complete.
The type rule for for expressions uses the following auxiliary judgement. We
write G |- for
Var : t do
Exp : t', if in environment G when the bound variable
Var of an iteration has type t
then the body Exp of the iteration has type
t'.
G, Var : t |- Exp : t' |
|
G |-
for Var : t do Exp : t'
|
|
G |-
for Var : () do Exp : ()
|
G |-
for Var :
t1 Exp : t'1
G |-
for
Var : t2 Exp : t'2
|
|
G |-
for Var : t1, t2 do
Exp : t'1, t'2
|
G |-
for Var :
t1 Exp : t'1
G |-
for
Var : t2 Exp : t'2
|
|
G |-
for Var : t1 & t2 do
Exp : t'1 & t'2
|
|
G |-
for Var : Ø do Exp : Ø
|
G |-
for
Var : t1
Exp : t'1
G |-
for
Var : t2
Exp : t'2
|
|
G |-
for Var : t1 | t2 do
Exp : t'1 | t'2
|
G |-
for Var : t Exp : t'
|
|
G |-
for Var : t{m,n} do
Exp : t'{m,n}
|
Given the above rules, the type rules for for
expressions are immediate.
Again, we need to distinguish between ordered and unordered forests.
|
|||
|
|||
G |-
for Var in Exp1
do Exp2 :
t2
|
G |-
Exp1 : {f1}
G |-
for Var : f1
do Exp2 :
{f2}
|
|
G |-
for Var in Exp1
do Exp2 :
{f2}
|
Iteration over an ordered forests produces an ordered forest. Iteration over an unordered forest produces an unordered forest.
The typing of match
expressions is closely related to
the typing of for
expressions.
Due to the typing rules of
for
expressions, it is possible that the body of an iteration is checked
many times. Thus, when a match
expression is checked, it is possible that quite
a lot is known about the type of the expression being matched, and one can
determine that only some of the clauses of the match
apply. The definition of
match
uses the auxiliary judgments to check whether a given
clause is applicable.
We
write G |- case
Var : t do
Exp : t', if in environment G, the bound variable
Var of the case
has type t,
then the body Exp of case
has type
t'. Note the type of the body is irrelevant if t = Ø.
¬(t = Ø) G, Var : t |- Exp : t' |
|
G |-
case Var : t do Exp : t'
|
|
G |-
case Var : Ø do Exp : Ø
|
We write G |- t <: t' else
Exp: t''
if in environment G when t <: t'
does not hold, then
the body Exp of the match
expression's
else
clause has type t''. Note that
the type of the body is irrelevant if t < t'.
t <: t' |
|
G |-
t <: t'
else Exp : Ø
|
¬ t <: t' G |- Exp : t'' |
|
G |-
t <: t'
else Exp : t''
|
Given the above, it is straightforward to construct the typing rule for a
match
expression. Recall that we write t /\ t'
for the intersection of two types.
|
||||||||
|
||||||||
|
We write G |- Query if in environment G, the query item Query is well-typed. A type declaration is always well typed:
|
G |-
type
x =
t
|
A function declaration is well-typed if in the environment extended with the type assignments for its formal variables, its body is well-typed.
G, Var1 : t1, ..., Varn : tn |- Exp : t' t' <: t |
|
G |-
fun
FuncName (Var1 :
t1; ...;
Varn :
tn)
: t =
Exp
|
A top-level let
expression is well-typed if the type of
its body, t', is a subtype of the bound variable's type.
G |- Exp : t' t' <: t |
|
G |-
let
Var : t =
Exp
|
Finally, a top-level query expression is well-typed if its body is well-typed.
G |- Exp : t |
|
G |-
query
Exp
|
We extract the relevant component of a type environment from a query item Query with the function environment (Query).
|
We write |- Query1 ... Queryn if the sequence of query items Query1 ... Queryn is well typed.
|
|||
|
|||
|- Query1... Queryn |
Ed. Note: MF : 02-Feb-2001: This material is not yet stable.
The Algebra's dynamic, or operational, semantics is presented as value inference rules. The value inference rules are similar to the type inference rules in [4 Static Semantics : Type-Inference Rules], but they relate expressions to values or semantic objects. The Algebra's semantic objects are defined in the [XML Query Data Model]. An operational semantics specifies the order in which an Algebra expression is evaluated and guarantees that every expression can be reduced to a simple semantic object. The Algebra's dynamic semantics is modeled on the dynamic semantics presented in [Milner].
error
a in AtomicValues n in NodeValues u in UnitValues = AtomicValues U
NodeValuesl in OrderedForestValues = [ UnitValues ] b in UnorderedForestValues = { UnitValues } d in DataModelValues = UnitValues U
OrderedForestValuesU
UnorderedForestValuesv in Values = DataModelValues U
{error
}f in FunctionClosure = Exp X
EnvX
Env
Figure 7: Semantic objects
There is a single error value, named error
.
AtomicValues is the class of values denoted by the Const
expressions, i.e., the values contained in the XML Schema atomic
datatypes.
NodeValues is the class of Node values
defined in the [XML Query Data Model].
UnitValues is the union of all values in AtomicValues and NodeValues;
they are instances of the unit type [Figure 1].
DataModelValues is the class of values defined in the XML Query Data
model, which includes UnitValues and ordered and unordered forests.
Values is the class of values that includes DataModelValues and error
.
A FunctionClosure is the class of triples each containing an Algebra
expression and two environments.
All the above classes, except error
, are infinite sets.
[Figure 8] lists several basic functions used in the dynamic semantics.
val
: Const -> AtomicValue def
: Type-> Def_T Defined in [XML Query Data Model]: []
Construct an ordered forest {}
Construct an unordered forest append
Appends two ordered forests head
Returns the first node in a non-empty, ordered forest tail
Returns items in a ordered forest excluding its first item. union
Returns the union of two unordered forests diff
Returns the difference of two unordered forests intersect
Returns the intersection of two unordered forests
Figure 8: Functions used in dynamic semantics
The function val
takes a constant expression Const and
returns the atomic value it denotes.
In [XML Query Data Model], Def_Type
denotes a reference to the data-model representation of a schema type
Type.
The function def
takes a type expression
Type and returns the reference node that represents the
given type.
The remaining functions used in the operational semantics are
defined in [XML Query Data Model].
Ed. Note: MF : (Jan-15-2001) To define the sort expression completely, we need to specify what basic sort operators are available.
The value inference rules use an environment comprised of a type environment, a value environment, and a function environment, defined in [Figure 9]
G Type environment VE in ValueEnv = Var -> Values Value environment FE in FuncEnv = FuncName -> FunctionClosures Function environment E in Env = TypeEnv U
ValueEnvU
FuncEnv
Figure 9: Environments
The type environment G is defined by the static type rules [4 Static Semantics : Type-Inference Rules]. The value environment VE is a finite map from variables to values. The function environment FE is a finite map from function names to function closures. An environment E is a tuple containing a type enviroment, a value environment, and a function environment.
To select the type-environment component of an environment, we use the notation G of E; similarly, we use VE of E to select the value environment and FE of E for the function environment.
We look up the type of a variable by writing (G of E)(Var) = t, and we write (VE of E)(Var) = v to look up the value of a variable or (FE of E)(FuncName) = f to look up a function.
It is often necessary to modify an environment, for example, when defining variables or functions. The modification of one environment E by another E' is written E, E' and it denotes:
E, E' = ((G of E), (G' of E'); (VE of E), (VE' of E'); (FE of E), (FE' of E')) |
The finite map f, g
is the finite map with domain Dom f U
Dom g,
and values
(f, g)(a) = if a in Dom g then g(a) else f(a)This definition means that when ``looking up'' a variable in the environment E, E', the environment E' will always be searched first.
Each rule of the dynamic semantics are inferences among sentences of the form: E |- phrase => v, where E is an environment, phrase is a phrase in the core syntax [Figure 2], and v is a semantic object.
As in [4 Static Semantics : Type-Inference Rules], when all judgements above the line of an inference rule hold, then the judgement below the line holds as well. The folowing rule states that in any environment, a constant expression reduces to the value it denotes:
|
E |-
Const => val (Const)
|
The following rule states that an unqualified LocalName
denotes a QName value with a null-valued URI reference.
The function qnameValue
is a constructor in the [XML Query Data Model]; it constructs a new QName value.
E |- LocalName => v |
|
E |-
LocalName =>
qnameValue (null , v, def (QName))
|
The next rule constructs a fully qualified QName by evaluating its literal namespace and local-name components and then applying the QName constructor to the two values:
|
||
|
||
E |-
{Namespace}LocalName =>
qnameValue (v1, v2, def (QName))
|
The next rule constructs a fully qualified QName by evaluating its computed namespace and local-name components and then applying the QName constructor to the two values:
|
||
|
||
E |-
{Exp1}Exp2 =>
qnameValue (v1, v2, def (QName))
|
The next rule determines the value of a variable, which is simply the variable's value in the current value environment VE
(VE of E)(Var) = v |
|
E |- Var => v |
The next rule constructs an attribute node from its name and value sub-expressions.
|
||
|
||
E |- @NameExp[Exp] => attrNode(v1, v2) |
The next rule constructs an element node from
its name and children sub-expressions.
Before constructing the element node, a ``deep'' copy
of the children is created; this enforces the data-model
constraint that the
new node be the parent of all of its children.
This is the first rule that uses the type environment:
the run-time type def
(t) of the element is
obtained from the type environment
G.
|
||
|
||
E |- NameExp[Exp] => elemNode(v1, {}, copy([v2]), def(t)) |
Ed. Note: MF : (Jan-08-2001) NB: 1. The data model now has one elemNode constructor that takes a list of nodes containing attributes and children. 2. When creating a copy of the children, each new child should have the new element defined as its parent.
The next rule constructs a sequence. It requires that Exp1 and Exp2 be "ordered" values, i.e., values with prime type or not bags. The side conditions that check the expressions' static type are necessary because the ',' operator is overloaded : it operates on both ordered and unordered collections.
|
|||
|
|||
E |- Exp1, Exp2 => append([l1], [l2]) |
We note here that in the [XML Query Data Model], the ordered forest
constructor []
when applied to an ordered forest is the
identity function, i.e., it returns the original forest.
Similarly, the
unordered forest constructor {}
is the identity
function on unordered forests.
The next rule constructs the union of two bags. It requires that Exp1 and Exp2 be "unordered" values, i.e., values with prime type or bags, because the ',' operator is overloaded.
|
|||
|
|||
E |- Exp1, Exp2 => union({b1}, {b2}) |
The next two rules construct the difference, and intersection of two bags. They do not require any side conditions because the operators are not overloaded.
|
||
|
||
E |- Exp1 - Exp2 => diff({v1}, {v2}) |
|
||
|
||
E |- Exp1 /\ Exp2 => intersect({v1}, {v2}) |
The next rule constructs the empty sequence, which is always an ordered forest.
|
E |- () => [] |
The next rule evaluates a conditional expression. If the conditional's boolean expression Exp1 evaluates to true, Exp2 is evaluated and its value is produced. If the conditional's boolean expression evaluates to false, Exp3 is evaluated and its value is produced.
E |- Exp1 => true E |- Exp2 => v2 |
|
E |- if Exp1 then Exp2 else Exp3 => v2 |
E |- Exp1 => false E |- Exp3 => v3 |
|
E |- if Exp1 then Exp2 else Exp3 => v3 |
The next rule evaluates a local binding expression: it evaluates the body of the expression Exp2 in the environment E extended with the variable Var bound to the value v.
|
||
|
||
E |- let Var = Exp1 in Exp2 => v2 |
The next rule evaluates a function application. It evaluates all the function's arguments in the current environment, then extracts the definition of the function's closure from the function environment. It then evaluates the body of the function applied to its arguments in the environment E' provided in its closure.
|
|||
|
|||
E |- FuncName( Exp1; · · · ;Expn) => v |
Ed. Note: MF (Jan-08-2001) Define and explain the Rec function on environments and why its needed for (possibly) recursive function applications.
The next rule states that in any environment, the
error
expression evalutes to the error
value.
|
E |-
error => error
|
The Algebra's two equality operators, '=' and '==', are defined in [XML Query Data Model].
E |- Exp1 => v1 E |- Exp2 => v2 v3 = v1 Opeq v2 |
|
E |- Exp1 Opeq Exp2 => v3 |
The next two rule define the semantics of the boolean operators
and
, or
, and not
.
E |-
Exp1 =>
v1
E |-
Exp2 =>
v2
v3 = apply (Opbool, v1,
v2)
|
|
E |- Exp1 Opbool Exp2 => v3 |
E |-
Exp1 =>
v1
E |-
Exp2 =>
v2
v3 = apply (not , v1,
v2)
|
|
E |-
Exp1 not
Exp2 =>
v3
|
For the boolean operators and
, or
, and not
the apply
function is defined as expected:
apply(and, true, true) = true | apply(and, true, false) = false | apply(and, false, true) = false | apply(and, false, false) = false | apply(or, true, true) = true | apply(and, true, false) = true | apply(and, false, true) = true | apply(and, false, false) = false | apply(not, true) = false | apply(not, false) = true
We have not defined the semantics of all the arithmetic operators in
the Algebra.
A joint task force on operators with members from the [XSLT 99],
XML Schema, and XML Query working groups is chartered to define arithmetic
operators.
The Algebra will adopt the decisions of that group
(See [Issue-0056:
Operators
on Simple Types]).
In the following two rules, we assume that the binary and unary
operators
are implemented by the apply
function whose semantics are
defined by the operator task force.
E |-
Exp1 =>
v1
E |-
Exp2 =>
v2
v3 = apply (Oparith, v1,
v2)
|
|
E |- Exp1 Oparith Exp2 => v3 |
E |-
Exp1
=>
v1
v2 = apply (Op, v1)
|
|
E |- Op Exp1 => v2 |
There is a near one-to-one correspondence between the Algebra expressions for constructing and accessing data model values [Figure 4] and the data model constructors and accessors.
The next three rules construct comment, processing instruction, and reference nodes, respectively.
E |- Exp => v |
|
E |-
comment (Exp)
=>
commentNode (v)
|
E |- Exp1 => v1 E |- Exp2 => v2 |
|
E |-
processing_instruction (Exp1,
Exp2)
=>
piNode (v1, v2)
|
E |- Exp => v |
|
E |-
ref (Exp)
=>
ref (v)
|
A single rule defines all of the accessor and other data-model functions.
In the following rule, the function name f denotes any
one of the following accessors: bagtolist
,
deref
, listtobag
, localname
, namespace
,
name
, nodes
, parent
, value
,
target
.
E |- Exp => v |
|
E |- f(Exp) => f(v) |
The sort
expression can return any possible permutation of its input.
Stable sort
operates on ordered forests, unstable
sort
operates on unordered forests.
In both cases, the result is an ordered forest.
The easiest way to explain the dynamic semantics of sort
is by assuming the existence of two functions:
stable_sort_pairs
defined on ordered forests
and
unstable_sort_pairs
defined on unordered forests.
Each function takes a forest of
pair
elements that contain pairs
of values and sorts in increasing order the
pair
elements based on the value in the
fst
subelement.
The purpose of is stable_sort_pairs
and
unstable_sort_pairs
is only to explain the semantics of the
sort
expression.
An implementation of the Algebra is free
to implement the sort
expression as it chooses, as long as it guarantees
stability on ordered forests.
|
||||
|
||||
G |-
sort Var in
Exp1 by
Exp2 =>
v3
|
|
||||
|
||||
G |-
sort Var in
Exp1 by
Exp2 =>
v3
|
Ed. Note: MF (Jan-09-2001) Need definition for built-in function string(Exp)
The next two rules evaluate iteration expressions. If the iteration expression evaluates to the empty ordered forest, then the entire expression evalutes to the empty ordered forest.
E |- Exp1 => [] |
|
E |- for Var in Exp1 do Exp2 => [] |
The next rule first evaluates the iteration expression Exp1, which produces the value v. It evaluates the body of the iteration expression in the environment E extended with the variable Var bound to the head of the ordered forest v, which produces v2. It evaluates the entire iteration expression applied to the tail of the ordered forest v, which produces v3. Since the values v2 and v3 may be of any type, the sequence/union operator is applied to produce the value of the complete iteration expression.
|
|||
|
|||
E |- for Var in Exp1 do Exp2 => v4 |
Ed. Note: MF : Jan-09-2001. The rule above works for lists, but not for bags, because head/tail only defined on lists. Need a second rule for bags.
When evaluating a match expression, the value v to match occurs on the left of the turnstile |-. Each case rule of a match expression is always evaluated against this value. Alternative case rules are tried from left to right. The rule for the match expression evaluates its expression and sets up the appropriate environment for the case rules:
E |- Exp => v E, v |- CaseRules => v1 |
|
E, v |- match Exp CaseRules => v1 |
The next rule evaluates the body of a case rule if the intersection of the dynamic or run-time type of v and the type expression Type is not empty. It extends the environment by binding the variable Var to v.
type (v) /\ Type != Ø
E, { Var |-> v } |-
Exp => v2
|
|
E, v |- case Var : Type do Exp CaseRules => v2 |
The next rule evaluates the case rules following the current one if the intersection of dynamic or run-time type of v and the type expression Type is empty. The body of the given case rule is not evaluated if the intersection is empty.
(type (v) /\ Type) = Ø
E, v |-
CaseRules =>
v2
|
|
E, v |- case Var : Type do Exp CaseRules => v2 |
The next rule states that the else branch of a match expression always evaluates to its given expression.
E |- Exp => v1 |
|
E, v |- else Exp => v1 |
The top-level type declaration has no effect on the dynamic environment. It is only used during static type checking.
The function declaration and top-level let expressions extend the global environment E.
|
E |- fun FuncName (Var1 : Type1; · · · Varn : Typen) : Type = Exp => E, { FuncName |-> (Exp, E, {}) } |
E |- Exp => v |
|
E |- let Var = Exp => E, { Var |-> v } |
The top-level query expression always evaluates to a value, which is returned to the programming environment in which the Algebra expression is evaluated. It has no effect on the global environment E
E |- Exp => v |
|
E |- query Exp => v, E |
A sequence of query expressions evaluates to a value and a global environment.
|- Query1 => v1, E1 E1 |- Query2 => v2, E2 · · · En-1 |- Queryn => vn, En |
|
Query1 · · · Queryn -> vn, En |
Ed. Note: MF : Jan-09-2000 This definition assumes that the order of the global declarations is significant, that is, a global variable declaration is in scope for those query expressions that follow it, but not for those that precede it. For mutually recursive declarations, this needs to be changed.
The examples use the /
operator liberally, but in fact
we use /
as a convenient abbreviation for expressions built from
lower-level operators: for
expressions, the nodes
function,
and match
expressions.
For example, the expression:
book0/author
is equivalent to the expression:
for v in nodes(book0) do match v case a : author[AnyComplexType] do a else ()
Here the nodes()
function returns a forest consisting of the
content of the element book0
, namely, a title element
and three author elements (the order is preserved). The
for
expression binds the variable v
successively to each
of these elements. Then the match
expression selects a branch
based on the type of v
. If it is an author
element then
the first branch is evaluated, otherwise the second branch. If the
first branch is evaluated, it returns a
.
The variable a
contains the same
value as v
, but the
type of a
is author[String]
, which is
the intersection of the type of v
and the type author[AnyComplexType]
. If the
second branch is evaluated, then the branch returns ()
, the empty sequence.
To compose several expressions using /
, we again use for
expressions.
For example, the expression:
bib0/book/author
is equivalent to the expression:
for b in nodes(bib0) do match b case b : book[AnyComplexType] do for d in nodes(b) do match d case a : author[AnyComplexType] do a else () else ()
The for
expression iterates over all book
elements in
bib0
and binds the variable b
to each such element. For
each element bound to b
, the inner expression returns all the
author
elements in b
, and the resulting forests are
concatenated together in order.
In general, an expression of the form e / a is converted to the form
for v1 in e do for v2 in nodes(v1) do match v2 case v3 : a[AnyComplexType] do v3 else ()
where e is an expression, a is an element name, and v1, v2, and v3 are fresh variables (ones that do not appear in the expression being converted).
According to this rule, the expression bib0/book
translates to
for v1 in bib0 do for v2 in nodes(v1) do match v2 case v3 : book[AnyComplexType] do v3 else ()
In [A Equivalences], we discuss laws which allow us to simplify this to the previous expression:
for v2 in nodes(bib0) do match v2 case v3 : book[AnyComplexType] do v3 else ()
Similarly, the expression bib0/book/author
translates to
for v4 in (for v2 in nodes(bib0) do match v2 case v3 : book[AnyComplexType] do v3 else ()) do for v5 in nodes(v4) do match v5 case v6 : author[AnyComplexType] do v6 else ()
Again, the laws will allow us to simplify this to the previous expression
for v2 in nodes(bib0) do match v2 case v3 : book[AnyComplexType] do for v5 in nodes(v3) do match v5 case v6 : author[AnyComplexType] do v6 else () else ()
These examples illustrate an important feature of the Algebra: high-level operators may be defined in terms of low-level operators, and the low-level operators may be subject to algebraic laws that can be used to further simplify the expression.
[Figure 10] contains the laws that relate the reducible expressions (i.e., those labeled with "*" in [Figure 2]) to equivalent expressions.
In Rule 1, the projection expression
e / Wild is rewritten as a
for
expression, which binds v to
each element in the forest e, and returns the children
elements of v whose name is in Wild.
Similarly, Rule 2 rewrites the attribute projection expression
e / @Wild as a match embedded
in a for
expression.
Rule 3
rewrites the e / processing_instruction()
expression
by iterating over items in e.
The nested
match
expression returns
e's children if e
is a PI
. Otherwise, it returns the empty sequence.
Rules 4 and 5 are analogous for comments and atomic data.
Rule 6
rewrites the e / deref()
expression
by iterating over items in e.
The nested
match
expression returns
e's children if e
is ref(AnyTree)
. Otherwise, it returns the empty sequence.
Rule 7 simply rewrites a
where
expression by an if
- then
- else
expression.
Rule 8 rewrites empty
(e) as a
match
expression that returns true if the type of
e is the empty sequence.
Rule 9 rewrites cast
e : t as a
match
expression that returns the value of e
with type t, if it has that type at runtime, otherwise
it raises an error.
Rule 10 rewrites the let
expression with a type as
a let
expression without a type by moving the type
constraint into the expression.
e / Wild => for
v1in
edo
for
v2in
nodes
(v1)do
match
v2case
v3 : Wild[AnyComplexType]
do
v3else ()
(1) e / @Wild => for
v1in
edo
for
v2in
nodes
(v1)do
match
v2case
v3 : @Wild[AnySimpleType]
do
v3else ()
(2) e / data()
=> for
vin
nodes
(e)do
match
vcase
v1 :AnySimpleType do
v1else ()
(3) e / processing_instruction()
=> for
vin
nodes
(e)do
match
vcase
v1 :PI do
v1else ()
(4) e / comment()
=> for
vin
nodes
(e)do
match
vcase
v1 :Comment do
v1else ()
(5) e / deref()
=> for
vin
nodes
(e)do
match
vcase
v1 :ref(AnyTree) do
deref
(v1)else ()
(6) where
e1do
e2=> if
e1then
e2else
()(7) empty
(e)=> match
ecase
v :() do true else false
(8) cast
e : t=> match
ecase
v : tdo
velse error
(9) let
v : Type=
e1do
e2=> let
v=
(e1: Type)do
e2(10)
Figure 10: Definitions
In this section, we describe some laws that hold for the Algebra.
These laws are important for defining rules that simplify
algebraic expressions, such as eliminating
unnecessary for
or match
expressions.
The iteration construct of the Algebra is closely related to an important mathematical object called a monad. A monad, among other things, generalizes set, bag, and list types. In functional languages, the comprehension construct is used to express iteration over set, bag, and list types [BNTW95], [LW97]. A comprehension corresponds directly to a monad [Wad92], [Wad93], [Wad95].
The correspondence between the Algebra's iteration construct
and a monad is close, but not exact.
Each monad is
based on a unary type constructor, such as Set{t} or List{t},
representing a homogenous set or list where all elements are of type
t. In the Algebra, we have more complex and heterogenous types,
such as a forest consisting of a title, a year, and a sequence of one
or more authors. Also, one important component of a monad is the unit
operator, which converts an element to a set or list. If x has type
t, then set{x} is a unit set of type Set{t} or [x] is a unit
list of type List{t}. In the Algebra, we simply write, say,
author["Buneman"]
, which stands for both a tree and for the unit
forest containing that tree.
Monads satisfy three laws, and three corresponding laws are satisfied by the the Algebra's for expression.
First, iteration over a unit forest can be replaced by substition. This is called the left unit law.
for v in e1 do e2 = e2{v := e1}
provided that e1 is a unit type (e.g., is an element or a scalar constant). We write e2{v := e1} to denote the result of taking expression e2 and replacing occurrences of the variable v by the expression e1. For example,
for v in author["Buneman"] do auth[v/data()] = auth["Buneman"]
The iteration over a forest of one item can always be eliminated using variable substitution.
Second, an iteration that returns the iteration variable is equivalent to the identity. This is called the right unit law.
for v in e do v = e
For example
for v in book0 do v = book0
An important feature of the type system described here is that the left side of the above equation always has the same type as the right side.
Third, there are two ways of writing an iteration over an iteration, both of which are equivalent. This is called the associative law.
for v2 in (for v1 in e1 do e2) do e3 = for v1 in e1 do (for v2 in e2 do e3)
For example, a projection over a forest includes an implicit
iteration, so e / a =
for
v in
e
do
v / a. Say we
define a forest of bibliographies, bib1
=
bib0, bib0
. Then bib1/book/author
is equivalent
to the first expression below, which in turn is equivalent to the
second.
for b in (for a in bib1 do a/book) do b/author = for a in bib1 do (for b in a/book do b/author)
With nested relational algebra, the monad laws play a key role in optimizing queries. Similarly, the monad laws can also be exploited for optimization in this context.
For example, if b
is a book, the following finds all authors
of the book that are not Buneman:
for a in b do where a/data() != Buneman do a
If l
is a list of authors, the following renames all author
elements to auth
elements:
for a' in l do auth[ a'/data() ]
Combining these, we select all authors that are not Buneman, and rename the elements:
for a' in (for a in b do where a/data() != Buneman do a) do auth[ a'/data() ]
Applying the associative law for a monad, we get:
for a in b do for a' in (where a/data() != Buneman do a) do auth[ a'/data() ]
Expanding the where
clause to a conditional, we get:
for a in b do for a' in (if a/data() != Buneman then a else ()) do auth[ a'/data() ]
Applying a standard law for distributing loops over conditionals gives:
for a in b do if a/data() != Buneman then for a' in a do auth[ a'/data() ] else ()
Applying the left unit law for a monad, we get:
for a in b do if a/data() != Buneman then auth[ a/data() ] else ()
And replacing the conditional by a where
clause, we get:
for a in b do where a/data() != Buneman do auth[ a/data() ]
Thus, simple manipulations, including the monad laws, fuse the two loops.
[A.1 Relating projection to iteration] ended with two examples of simplification. Returning to these, we can now see that the simplifications are achieved by application of the left unit and associative monad laws.
[Figure 11] contains a dozen algebraic simplification
laws. In a relational query engine, algebraic simplifications are often applied
by a query optimizer before a physical execution plan is generated; algebraic
simplification can often reduce the size of the intermediate results computed
by a query evaluator. The purpose of our laws is similar -- they eliminate
unnecessary for
or match
expressions, or they enable other optimizations by reordering or distributing
computations. The set of laws given is suggestive, rather than complete.
E ::= if
[]then
e1else
e2| let
v=
[]in
e| for
vin
[]do
e| match
[]case
v : tdo
e...
case
v : tdo
eelse
efor
vin
()do
e=> () (8) for
vin
(e1, e2)do
e3=> ( for
vin
e1do
e3), (for
vin
e2do
e3)(9) for
vin
e1do
e2=> e2{e1/ v}, if e1 : u (10) for
vin
edo
v=> e (11)
e1 : {t1}, e2 : {t2}, v1 free in e2 for
v1in
e1do
for
v2in
e2do
e3=> for
v2in
e2do
for
v1in
e1do
e3(12) E[ if
e1then
e2else
e3]=> if
e1then
E[e2]else
E[e3](13) E[ let
v=
e1do
e2]=> let
v=
e1do
E[e2](14) E[ for
vin
e1do
e2]=> for
vin
e1do
E[e2](15)
E[ match
e1case
v : tdo
e2case
v : tdo
en-1... else
en ]=>
match
e1case
v : tdo
E[ e2 ]... case
v : tdo
E[ en-1 ]else
E[en ](16)
Figure 11: Optimization Laws
Rules 8, 9, and 10 simplify iterations. Rule 8 rewrites an iteration over the empty sequence as the empty sequence. Rule 9 distributes iteration through sequence: iterating over the sequence (e1, e2) is equivalent to the sequence of two iterations, one over e1 and one over e2. Rule 10 eliminates an iteration over a single element or scalar. If e1 is a unit type, then e1 can be substituted for occurrences of v in e2.
Ed. Note: MF (Oct-18-2000) The rules for eliminating trivial
match
expressions need to be written. They are more complex than those for the oldcase
expressions.
Rule 11 eliminates an iteration when the result expression is simply the iteration variable v.
Rule 12 commutes a nested iteration over unordered forests. This equivalence does not hold for so called dependent joins, where the outer iteration variable v1 is bound in the inner expression e2.
Rules 13--16 are used to commute expressions. Each rule actually
abbreviates a number of other rules, since the context
variable
E stands for a number of different
expressions. The notation E[e] stands for one of the four expressions given with
expression e replacing the hole [] that appears in
each of the alternatives. For instance, one of the expansions of Rule 15
is the following, when e is taken to
be for
v
in
[] do
e :
|
The issues in [B.2 Issues list] serve as a design history for this document. The ordering of issues is irrelevant. Each issue has a unique id of the form Issue-<dddd> (where d is a digit). This can be used for referring to the issue by <url-of-this-document>#Issue-<dddd>. Furthermore, each issue has a mnemonic header, a date, an optional description, and an optional resolution. For convenience, resolved issues are displayed in green. Some of the issues contain references to W3C internal archives. These are marked with "W3C-members only". Some of the descriptions of the resolved issues are obsolete w.r.t. to the current version of the document.
Ed. Note: PF (Aug-05-2000): For the sake of archival, there are some duplicate issues raised in multiple instances. Duplicate issues are marked as "resolved" with reference to the representative issue.
Unless stated explicitly otherwise, [Issue-0001: Attributes] through [Issue-0039: Dereferencing semantics] have been raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0142.html (W3C-members only).
Issue-0001: Attributes
Date: Jul-26-2000
One example of the need for support of [Issue-0049: Unordered Collections], but also: Attributes need to be constrained to contain white space separated lists of simple types only.
Attributes are represented by @attribute-name[content]. See [3.4 Types : abstract syntax], [3.5 Types : concrete syntax] for the constraint on white space seperated lists of simple types, and [3.6 Expressions] for selecting and constructing attributes.
Issue-0002: Namespaces
Date: Jul-26-2000
Namespaces are represented by {uri-of-namespace}localname. See [3.1 Expanded names] and [3.2 Wildcards].
Issue-0003: Global Order
Date: Jul-26-2000
The data model and algebra do not define a global order on documents. Querying global order is often required in document-oriented queries.
See the thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0179.html (W3C-members only).
Resolved by adding < operator defined on nodes in same document. See [2.12 Querying order]. See [Issue-0079: Global order between nodes in different documents] for order between nodes in different documents.
Issue-0004: References vs containment
Date: Jul-26-2000
The query-algebra datamodel currently does not explicitly model children-elements by references (other than the XML-Query Datamodel. This facilitates presentation, but may be an oversimplification with regard to [Issue-0005: Element identity].
This issue is resolved by subsumption as follows: (1) As [Figure ] points out, all child-elements are (implicit) references to nodes. (2) Thus, having resolved [Issue-0005: Element identity] this issue is resolved too.
Issue-0005: Element identity
Date: Jul-26-2000
Do expressions preserve element identity or don't they? And does "=" and distinct use comparison by reference or comparison by value?
The first part of the question has been resolved by resolution of [Issue-0010: Construct values by copy]. The second part raises a more specific issue [Issue-0066: Shallow or Deep Equality?].
Issue-0006: Source and join syntax instead of "for"
Date: Jul-26-2000
Another term for "source and join syntax" is "comprehension". See [A Equivalences] for a discussion of the relationship between iteration by for and comprehension syntax.
This issue is resolved by subsumption under [Issue-0021: Syntax]. List comprehension is a syntactic alternative to "for v in e1 do e2", which has been favored by the WG in the resolution of [Issue-0021: Syntax].
Issue-0007: References: IDREFS, Keyrefs, Joins
Date: Jul-26-2000
Currently,
the Algebra does not support reference values, such as IDREF, or Keyref
(not to be mixed up with "node-references" - see [Issue-0005:
Element identity],
which are defined in the XML
Query Data Model. The Algebra's type system should be extended to support
reference types and the data model operators ref
, and
deref
should be supported (similar to id() in XPath).
Issue-0008: Fixed point operator or recursive functions
Date: Jul-26-2000
It may be useful to add a fixed-point operator, which can be used in lieu of recursive functions to compute, for example, the transitive closure of a collection.
Currently, the Algebra does not guarantee termination of recursive expressions. In order to ensure termination, we might require that a recursive function take one argument that is a singleton element, and any recursive invocation should be on a descendant of that element; since any element has a finite number of descendants, this avoids infinite regress. (Ideally, we should have a simple syntactic rule that enforces this restriction, but we have not yet devised such a rule.)
Impacts optimization; hard to do static type inference; current algebra is first-order
See for the subproblem of typing "//" or desc() http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0187.html (W3C-members only).
Issue-0009: Externally defined functions
Date: Jul-26-2000
There is no explicit support for externally defined functions.
The set of built-in functions may be extended to support other important operators.
See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0003.html (W3C-members only).
Issue-0010: Construct values by copy
Date: Jul-26-2000
Need to be able to construct new types from bits of old types by reference and by copy. Related to [Issue-0005: Element identity].
The WG wishes to support both: construction of values by copy, as well as references to original nodes ( http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)) See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only) (W3C-members only). This needs some further investigation to sort out all technical difficulties (see [Issue-0062: Open questions for constructing elements by reference]) so the change has not yet been reflected in the Algebra document.
Issue-0011: XPath tumbler syntax instead of index?
Date: Jul-26-2000
XPath provides as a shorthand syntax [integer] to select child-elements by their position on the sibling axes, whereas the xml-query algebra uses a combination of a built-in function index() and iteration. See http://lists.w3.org/Archives/Member/w3c-archive/2000Sep/0168.html (W3C-members only) for a suggestion to to support indexed iteration in the form "for v sub i in e1 do e2", and to express index() as a function (or macro).
Addendum by JS (submitted by MF) Dec 19/2000:
The typing of index is lossy : it produces a factored type.
Jerome suggests the more precise range
operator:
e : q{m,n} n' - (m'-1) = r m' >= m n' <= n ----------------------------------------------- range(e;m';n') : q{r,r} nth(e;n) == range(e;n;n)The
range
operator takes a repetition of prime types and
those values in the range m'
to n'
;
if the repetition does not include that range,
a run-time error is raised.
The range
and nth
operators
could also be defined in terms of head
and
tail
and polymorphic recursive functions.
In the absence of parameteric polymorphism, it is not possible
to define range
and nth
with precise types.
Here are Peter's rules:
e : p{m,n} n!=* ------------------------------------------------- range(e;m';n') : p{n'-max(m,m')+1,min(n',n)-m'+1} For example: let v1 = a[]{2,4} range(v1;3;3): a[]{1,1} range(v1;1;3): a[]{2,3} range(v1;3;5): a[]{1,2} range(v1;1;5): a[]{2,4} e : p{m,*} ----------------------------- range(e;m';n') : p{0,n'-m'+1} let v2 = a[]{0,*} range(v2;1;3): a[]{0,2} this follows the typical semantics for head() and tail(): head(()) = tail(()) = () and the semantics behind range(e;m',n') = tail o ...(m' times) ... o tail o head, tail o ...(m'+1 times) ... o tail o head, ... tail o ...(n' times) ... o tail o head
I would have no troubles in restricting ourselves to nth() instead of range() in the algebra (range can always be enumerated by nth()). Furthermore, we should consider whether m',n' can be computed numbers.
Issue-0012: GroupBy - needs second order functions?
Date: Jul-26-2000
The type system is currently first order: it does not support function types nor higher-order functions. Higher-order functions are useful for specifying, for example, sorting and grouping operators, which take other functions as arguments.
The WG has decided to express groupBy
by a combination of for
and distinct
(see also
http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only) and [Issue-0042:
GroupBy]):. Thus w.r.t. to GroupBy this
Issue is resolved. Because GroupBy is not the only use case for higher order functions,
a new issue [Issue-0063:
Do we need (user defined) higher order functions?] is raised.
Issue-0013: Collations
Date: Jul-26-2000
Collations identify the ordering to be applied for sorting strings. Currently, it is considered to have an (optional parameter) collation "name" as follows: "SORT variable IN exp BY +(expression {ASCENDING|DESCENDING} {COLLATION name}) (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)). An alternative would be to model a collation as a simple type derived from string, and use type-level casting, i.e. expression :collationtype (which is already supported in the XML Query Algebra), for specifying the collation. That would make: "SORT variable IN exp BY +(expression:collationname {ASCENDING|DESCENDING}). But that requires some support from XML-Schema.
More generally, collations are important for any operator in the Algebra that involves string comparison, among them: sort, distinct, "=" and "<".
Issue-0014: Polymorphic types
Date: Jul-26-2000
The type system is currently monomorphic: it does not permit the definition of a function over generalized types. Polymorphic functions are useful for factoring equivalent functions, each of which operate on a fixed type.
The current type system has already a built-in polymorphic type (lists) and is likely to have more (unordered collections). The question is, whether to allow for user-defined polymorphic types and user defined polymorphic functions.
See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0111.html (W3C-members only) (W3C-members only).
Issue-0015: 3-valued logic to support NULLs
Issue-0016: Mixed content
Date: Jul-26-2000
The XML-Query Algebra allows to generate elements with an arbitrary mixture of data (of simple type) and elements. XML-Schema only allows for a combination of strings interspersed with elements (aka mixed content). We need to figure out whether and how to constrain the XML-Query Algebra accordingly (e.g. by typing rules?)
The type system has been extended
to support the interleaving operator &
- see
[3.4 Types : abstract syntax]. Mixed content is
defined in terms of &
.
Issue-0017: Unordered content
Date: Jul-26-2000
All-groups in XML-Schema, not to be mixed up with [Issue-0049: Unordered Collections]
The type system has been extended with the support of all-groups - see [3.4 Types : abstract syntax].
Issue-0018: Align algebra types with schema
Date: Jul-26-2000
The Algebra's internal type system is the type system of XDuce. A potentially significant problem is that the Algebra's types may lose information when converted into XML Schema types, for example, when a result is serialized into an XML document and XML Schema.
James Clark points out : "The definition of AnyComplexType doesn't match the concrete syntax for types since it applies unbounded repetition to AnyTree and one alternative for AnyTree is AnyAttribute." This is another example of an alignment issue.
This issue comprises also issues [Issue-0016: Mixed content], [Issue-0017: Unordered content], [Issue-0053: Global vs. local elements], [Issue-0054: Global vs. local complex types], [Issue-0019: Support derived types], substitution groups.
Issue-0019: Support derived types
Date: Jul-26-2000
The current type system does not support user defined type hierarchies (by extension or by restriction).
Issue-0020: Structural vs. name equivalence
Date: Jul-26-2000
The subtyping rules in [4.1 Relating data to types] only define structural subtyping. We need to extend this with support for subtyping via user defined type hierarchies - this is related to [Issue-0019: Support derived types].
Issue-0021: Syntax
Date: Jul-26-2000
(e.g. for.<-.in vs for.in.do)
The WG has voted for several syntax changes (see also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt(W3C-members only) (W3C-members only), [3.6 Expressions]: "for v in e do e", "let v = e do", "sort v in e by e ...", "distinct", "match case v:t e ... else e".
Issue-0022: Indentation, Whitespaces
Date: Jul-26-2000
Is indentation significant?
The WG has consensus that indentation is not significant (see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0043.html (W3C-members only) (W3C-members only)), i.e., all documents are white space normalized.
Issue-0023: Catch exceptions and process in algebra?
Date: Jul-26-2000
Does the Algebra give explicit support for catching exceptions and processing them?
Subsumed by new issue [Issue-0064: Error code handling in Query Algebra].
Issue-0024: Value for empty forests
Date: Jul-26-2000
What does "value" do with empty forests?
The definition of value(e) has changed to:
value(e) = match children(e) case v: AnyScalar do v else()
Furthermore, the typing rules for "for v in e1 do e2" have been changed such that the variable v is typed-checked seperately for each unit-type occuring in expression e1.
Consequently the example in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html (W3C-members only) (W3C-members only) would be typed as follows:
query for b in b0/book do value(b/year): Integer{0,*}
rather than leading to an error.
Issue-0025: Treatment of empty results at type level
Date: Jul-26-2000
According to http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html(W3C-members only) (W3C-members only) this is closely related to [Issue-0024: Value for empty forests].
Resolved by resolution of [Issue-0025: Treatment of empty results at type level].
Issue-0026: Project - one tag only
Date: Jul-26-2000
Project is only parameterized by one tag. How can we translate a0/(b | c)?
With the new syntax (and type system) a0/(b | c) can be translated to "for v in a0 do match case v1:b[AnyType] do v1 case v2:c[AnyType] do c else ()" - see also [A.1 Relating projection to iteration].
Issue-0027: Case syntax
Date: Jul-26-2000
N-ary case can be realized by nested binary cases. For design alternatives of case see: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0017.html (W3C-members only) (W3C-members only)
New (n-ary) case syntax is introduced in [3.6 Expressions].
Issue-0028: Fusion
Date: Jul-26-2000
Does the Algebra support fusion as introduced by query languages such as LOREL? This is related to [Issue-0005: Element identity], because fusion only makes sense with support of element identity.
Issue-0029: Views
Date: Jul-26-2000
One of the problems in views: Can we undeclare/hide things in environment? For example, if we support element-identity, can we explicitly discard a parent, and/or children from an element in the result-set? Related to [Issue-0005: Element identity]. See also description in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0047.html (W3C-members only) (W3C-members only).
Issue-0030: Automatic type coercion
Date: Jul-26-2000
What do we do if a value does not have a type or a different type from what is required? See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0071.html (W3C-members only) (W3C-members only). This link also contains a recommendation, which has been agreed as the general direction to go in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only):
We believe that the XML Query Language should specify default type coercions for mixed mode arithmetic should be performed according to a fixed precedence hierarchy of types, specifically integer to fixed decimal, fixed decimal to float, float to double. This policy has the advantage of simplicity, tradition, and static type inference. Programmers could explicitly specify alternative type coercions when desirable.
Issue-0031: Recursive functions
Date: Jul-26-2000
subsumed by [Issue-0008: Fixed point operator or recursive functions]
Issue-0032: Full regular path expressions
Date: Jul-26-2000
Full regular path expressions allow to constrain recursive navigation along paths by means of regular expressions, e.g. a/b*/c denotes all paths starting with an a, proceeding with arbitrarily many b's and ending in a c. Currently the XML-Query Algebra can express this by means of (structurally) recursive functions. An alternative may be the introduction of a fixpoint operator [Issue-0008: Fixed point operator or recursive functions].
Issue-0033: Metadata Queries
Date: Jul-26-2000
Metadata queries are queries that require runtime access to type information. See also discussion starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0087.html (W3C-members only) (W3C-members only).
Issue-0034: Fusion
Date: Jul-26-2000
Identical with [Issue-0028: Fusion]
Issue-0035: Exception handling
Date: Jul-26-2000
Subsumed by [Issue-0023: Catch exceptions and process in algebra?] and [Issue-0064: Error code handling in Query Algebra].
Issue-0036: Global-order based operators
Date: Jul-26-2000
Subsumed by [Issue-0003: Global Order]
Issue-0037: Copy vs identity semantics
Date: Jul-26-2000
subsumed by [Issue-0005: Element identity]
Issue-0038: Copy by reachability
Date: Jul-26-2000
Is it possible to copy children as well as IDREFs, Links, etc.? Related to [Issue-0005: Element identity] and [Issue-0008: Fixed point operator or recursive functions]
Resolved by addition of "deep" copy operator in [XML Query Data Model].
Issue-0039: Dereferencing semantics
Date: Jul-26-2000
subsumed by [Issue-0005: Element identity]
[Issue-0040: Case Syntax] through [Issue-0047: Attributes] are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0010.html (W3C-members only) (W3C-members only)
Issue-0040: Case Syntax
Date: Aug-01-2000
We suggest that the syntax for "case" be made more regular. At present, it takes only two branches, the first labelled with a tag-name and the second labelled with a variable. A more traditional syntax for "case" would have multiple branches and label them in a uniform way. If the algebra is intended only for semantic specification, "case" may not even be necessary.
subsumed by [Issue-0027: Case syntax]
Issue-0041: Sorting
Date: Aug-01-2000
We are not happy about the three-step sorting process in the Algebra. We would prefer a one-step sorting operator such as the one illustrated below, which handles multiple sort keys and mixed sorting directions: SORT emp <- employees BY emp/deptno ASCENDING emp/salary DESCENDING
The WG has decided to go for the above syntax, with an (optional) indication of COLLATION. (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only), [2.13 Sorting]).
Issue-0042: GroupBy
Date: Aug-01-2000
We do not think the algebra needs an explicit grouping operator. Quilt and other high-level languages perform grouping by nested iteration. The algebra can do the same.
related to [Issue-0012: GroupBy - needs second order functions?]
The WG has decided (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)) to skip groupBy for the time being (see also revised [2.11 Restructuring and grouping] and raise [Issue-0069: Organization of Document] for a possible future revision of this resolution.
Issue-0043: Recursive Descent for XPath
Date: Aug-01-2000
The very important XPath operator "//" is supported in the Algebra only by writing a recursive function. This is adequate for a semantic specification, but if the Algebra is intended as an optimizable target language it will need better support for "//" (possibly in the form of a fix-point operator.)
Resolved by subsumption under [Issue-0043: Recursive Descent for XPath] (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt(W3C-members only) (W3C-members only)).
Issue-0044: Keys and IDREF
Date: Aug-01-2000
We think the algebra needs some facility for dereferencing keys and IDREFs (exploiting information in the schema.)
Subsumed by [Issue-0007: References: IDREFS, Keyrefs, Joins]
Issue-0045: Global Order
Date: Aug-01-2000
We are concerned about absence of support for operators based on global document ordering such as BEFORE and AFTER.
subsumed by [Issue-0003: Global Order]
Issue-0046: FOR Syntax
Date: Aug-01-2000
We agree with comments made in the face-to-face meeting about the aesthetics of the Algebra's syntax for iteration. For example, the following syntax is relatively easy to understand: FOR x IN some_expr EVAL f(x) whereas we find the current algebra equivalent to be confusing and misleading: FOR x <- some_expr IN f(x) This syntax appears to assign the result of some_expr to variable x, and uses the word IN in a non-intuitive way.
subsumed by [Issue-0021: Syntax]
Issue-0047: Attributes
Date: Aug-01-2000
subsumed by [Issue-0001: Attributes]
[Issue-0048: Explicit Type Declarations] through [Issue-0050: Recursive Descent for XPath] are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0148.html (W3C-members only) (W3C-members only)
Issue-0048: Explicit Type Declarations
Date: Jul-27-2000
Type Declaration for the results of a query: The issue is whether to auto construct the result type from a query or to pre-declare the type of the result from a query and check for correct type on the return value. Suggestion: Support for pre-declared result data type and as well as to coerce the output to a new type is desirable. Runtime or compile time type checking is to be resolved? Once you attach a name to a type, it is preserved during the query processing.
W.r.t. compile time type casts this is already possible with e:t (see [3.6 Expressions]). For run-time casts an issue has been raised in [Issue-0062: Open questions for constructing elements by reference].
Issue-0049: Unordered Collections
Date: Jul-27-2000
Currently, all forests in the data model are
ordered. It may
be useful to have unordered forests. The distinct_node
function, for example,
produces an inherently unordered forest. Unordered forests can benefit from
many
optimizations for the relational algebra, such as commutable joins.
Handling of collection of attributes is easy but the collection of elements is complex due to complex type support for the elements. It makes sense to allow casting from unordered to ordered collection and vice versa. It is not clear whether the new ordered or unordered collection is a new type or not. It affects function resolution, optimization.
See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0135.html (W3C-members only) (W3C-members only).
Our request to Schema to represent insignificance of ordering at schema level has not been fulfilled - see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0136.html (W3C-members only) (W3C-members only). Thus we need to be aware that this information may get lost, when mapping to schema.
Unordered collections are described by {t} see [3.4 Types : abstract syntax], some operators (sort, distinct_node, for, and sequence) are overloaded, and some operators (difference, intersection) are added). A new issue [Issue-0076: Unordered types] is raised.
Issue-0050: Recursive Descent for XPath
Date: Jul-27-2000
Suggestion: The group likes to add a support for fixed-point operator in the query language that will allow us to express the semantics of the // operator in an xpath expression. A path expression of the form a//b may be represented by a fixed-point operator fp(a, "/.")/b.
subsumed by [Issue-0043: Recursive Descent for XPath]
Issue-0051: Project redundant?
Date: Aug-05-2000
It appears that project a e could be reduced to sth. like
for v <- e in case v of a[v1] => a[v1] | v2 => ()
... or would that generate a less precise type?
With the new type system and handling of the for operator, project is indeed redundant. See [A.1 Relating projection to iteration].
Issue-0052: Axes of XPath
Date: Aug-05-2000
The current algebra makes navigation to parents difficult to impossible. With support of Element Identity [Issue-0005: Element identity] and recursive functions [Issue-0008: Fixed point operator or recursive functions] one can express parent() by a recursive function via the document root. More direct support needs to be investigated w.r.t its effect on the type system.
The WG wishes to support a built-in operator parent() (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)). For the current state of affairs see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0074.html (W3C-members only) (W3C-members only). For some use-cases see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0011.html (W3C-members only) (W3C-members only).
Issue-0053: Global vs. local elements
Date: Aug-05-2000
The current type system cannot represent global element-declarations of XML-Schema. All element declarations are local.
Issue-0054: Global vs. local complex types
Date: Aug-05-2000
The current type system does not distinguish between global and local types as XML-Schema does. All types appear to be fully nested (i.e. local types)
Issue-0055: Types with non-wellformed instances
Date: Aug-05-2000
The type system and algebra allows for sequences of simple types, which can usually be not represented as a well-formed document. How shall we constrain this? Related to [Issue-0016: Mixed content].
Issue-0056: Operators on Simple Types
Date: Jul-15-2000
We intentionally did not define equality or relational operators on element and atomic type. These operators should be defined by consensus.
See also first designs for support of arithmetic operators http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0138.html (W3C-members only) (W3C-members only) and for support of operators for date/time http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0113.html (W3C-members only) (W3C-members only).
Ed. Note: MF, 15-Jan-2001 A joint task force on operators with members from the XSLT, XML Schema, and XML Query working groups is chartered to define arithmetic operators.
Issue-0057: More precise type system; choice in path
Date: Aug-07-2000
(This subsumes [Issue-0051: Project redundant?]). If the type system were more precise, then (project a e) could be replaced by:
for v <- e in case v of a[v1] => a[v1] | v2 => ()
One could also represent (e/(a|b)) directly in a similar style.
for v <- e in case v of a[v1] => a[v1] | v2 => case v2 of b[v3] => b[v3] | v4 => ()
Currently, there is no way to represent (e/(a|b)) without loss of precision, so if we do not change the type system, we may need to have some way to represent (e/(a|b)) and similar terms without losing precision. (The LA team has a design for this more precise type system, but it is too large to fit in the margin of this web page!)
See resolution of [Issue-0051: Project redundant?]
Issue-0058: Downward Navigation only?
Date: Aug-07-2000
Related to [Issue-0052: Axes of XPath]. The current type system (and the more precise system alluded to in [Issue-0057: More precise type system; choice in path]) seems well suited for handling XPath children and descendant axes, but not parent, ancestor, sibling, preceding, or following axes. Is this limitation one we can live with?
Subsumed by [Issue-0052: Axes of XPath]
Issue-0059: Testing Subtyping
Date: Aug-07-2000
One operation required in the Algebra is to test whether XML type t1 is a subtype of XML type t2, indicated by writing t1 <: t2. There is a well-known algorithm for this, based on tree automata, which is a straightforward variant of the well-known algorithm for testing whether the language generated by one regular-expression is a subset of the language generated by another. (The algorithm involves generating deterministic automata for both regular expressions or types.)
However, the naive implementation of the algorithm for comparing XML types can be slow in practice, whereas the naive algorithm for regular expressions is tolerably fast. The only acceptably fast implementation of a comparison for XML types that the LA team knows of has been implemented by Haruo Hasoya, Jerome Voullion, and Benjamin Pierce at the University of Pennsylvania, for their implementation of Xduce. (Our implementation of the Algebra re-uses their code, with permission.)
So, should we adopt a simpler definition of subtyping which is easier to test? One possibility is to adopt the sibling restriction from Schema, which requires that any two elements which appear a siblings in the same content model must themselves have contents of the same type. Jerome Simeon and Philip Wadler discovered that adopting the sibling restriction reduces the problem of checking subtyping of XML types to that of checking regular languages for inclusion, so it may be worth adopting the restriction for that reason.
Issue-0060: Internationalization aspects for strings
Date: Jun-26-2000
These issues are taken from the comments on the Requirements Document by I18N (http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jun/0137.html (W3C-members only) (W3C-members only)).
Further information can be found at http://www.w3.org/TR/WD-charreq.
It is a goal of i18n that queries involving string matching ("select x where x='some_constant'") treat canonically equivalent strings (in the Unicode sense) as matching. If the query and the target are both XML, early normalization (as per the Character Model) is assumed and binary comparison ensures that the equivalence requirement is satisfied. However, if the target is originally a legacy database which logically has a layer that exports the data as XML, that XML must be exported in normalized form. The XML Query spec must impose the normalization requirement upon such layers.
Similarly, the query may come from a user-interface layer that creates the XML query. The XML Query spec must impose the normalization requirement upon such layers.
Provided that the query and the target are in normalized form C, the output of the query must itself be in normalized form C.
Queries involving string matching should support various kinds of loose matching (such as case-insensitivity, katakana-hiragana equivalence, accent-accentless equivalence, etc.)
If such features as case-insensitivity are present in queries involving string matching, these features must be properly internationalized (e.g. case folding works for accented letters) and language-dependence must be taken into account (e.g. Turkish dotless-i).
Queries involving character counting and indexing must take into account the Character Model. Specifically, they should follow Layer 3 (locale-independent graphemes). Additional details can be found in The Unicode Standard 3.0 and UTR#18. Queries involving word counting and indexing should similarly follow the recommendations in these references.
Issue-0061: Model for References
Date: Aug-16-2000
Raised in: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0063.html (W3C-members only) (W3C-members only). Related to a number of issues around [Issue-0005: Element identity].
Use Cases
Table of Contents
REF *could* do this well if it were restructured - it does not maintain unforeseen relationships or use them...
Bibliographies
Recursive parts
RDF assertions
Inversion of simple parent/child references (related to [Issue-0058: Downward Navigation only?]).
What can we leave out?
can we leave out transitive closure?
can we limit recursion?
can we leave out fixed point recursion?
related to [Issue-0008: Fixed point operator or recursive functions]
Do we need to be able to...
a. Find the person with the maximum number of descendants?
b. Airplane routes: how can I get from RDU to Raleigh? (fixed point: guaranteeing termination in reasonable time...)
c. Given children and their mothers, can I get mothers and their children? (without respect to the form of the original reference...)
related to [Issue-0008: Fixed point operator or recursive functions].
Should we abstract out the difference between different kinds of references? If so, should we be able to cast to a particular kind of reference in the output?
a. abstracting out the differences is cheaper, which is kewl...
b. the kind of reference gives me useful information about: locality (same document, same repository, big bad internet...) static vs. dynamic (xpointer *may* be resolved dynamically, or *may* be resolved at run time, ID/IDREF is static).
related to [Issue-0007: References: IDREFS, Keyrefs, Joins].
do we need to be able to generate ids, e.g. using skolem functions?
for a document in RAM, or in a persistent tree, identity may be present, implicit, system dependent, and cheap - it's nice to have an abstraction that requires no more than the implicit identity
persistable ID is more expensive, may want to be able to serialize with ID/IDREF to instantiate references in the data model
can use XPath instead of generating ID/IDREF, but these references are fragile, and one reason for queries is to create data that may be processed further
persistable ID unique within a repository context
persistable ID that is globally unique
related to [Issue-0005: Element identity].
copy vs. reference semantics
"MUST not preclude updates..."
in a pure query environment, sans update, we do not need to distinguish these
if we have update, we may need to distinguish, perhaps in a manner similar to "updatable cursors" in SQL
programs may do queries to get DOM nodes that can that be modified. It is essential to be able to distinguish copies of nodes from the nodes themselves.
copy semantics - what does it mean?
copy the descendant hierarchy?
copy the reachability tree? (to avoid dangling references)
related to [Issue-0038: Copy by reachability].
The following issues have been raised since Sep-25-2000.
Issue-0062: Open questions for constructing elements by reference
Date: Sep-25-2000
(1) What is the value of parent() when constructing new elements with children refering to original nodes? See also discussion at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only) (W3C-members only).
(2) Is an approach to either make copies for all children or provide references to all children, or should we allow for a more flexible combination of copies and references?
Operational semantics specifies that element node constructor creates copies of all its children. Addition of RefNode in [XML Query Data Model] supports explicit reference value.
Issue-0063: Do we need (user defined) higher order functions?
Date: Oct-16-2000
The current XML-Query-Algebra does not allow functions to be parameters of another function - so called higher order functions. However, most of the Algebra operators are (built-in) higher functions, taking expressions as an argument ("sort", "for", "case" to name a few). Even a fixpoint operator, "fun f(x)=e, fix f(x) in e" (see also [Issue-0008: Fixed point operator or recursive functions]), would be a built-in higher order function.
As agreed in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only) the XML Query Algebra will not support user defined higher order functions. It does support a number of built-in higher order functions.
Issue-0064: Error code handling in Query Algebra
Date: Oct-04-2000
How do we return an error code from a function defined in current Query algebra. Do we need to create an array (or a structure) to merge the return value and error code to do this. If that is true, it may be inefficient to implement. In order for cleaner and efficient implementation, it may be necessary to allow a function declaration to take a parameter of type "output" and allow it to return an error code as part of the function definition. See also thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0051.html(W3C-members only) (W3C-members only).
One does not need to create a structure to combine return values with error codes, provided each operator or function /either/ returns a value /or/ raises an error. The XML-Query Algebra supports means to raise errors, but does not define standard means to catch errors. Raising errors is accomplished by the expression "error" of type Ø (empty choice). Because Ø | t = t, such runtype errors do not influence static typing. The surface syntax and/or detailed specification of operators on simple types (see [Issue-0056: Operators on Simple Types]) may choose to differentiate errors into several error-codes.
Issue-0065: Built-In GroupBy?
Date: Oct-16-2000
As discussed in http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only), we may revisit the resolution of [Issue-0042: GroupBy] and reintroduce GroupBy along the lines of sort: "group v in e1 by [e2 {collation}]". One reason for this may be that this allows to use collation for deciding about the equality of strings.
In http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only) the WG has decided to close this issue, and for the time being not consider GroupBy as a built-in operator. Furthermore, [Issue-0013: Collations] is ammended to deal with collations for all operators involving a comparison of strings.
Issue-0066: Shallow or Deep Equality?
Date: Oct-16-2000
What is the meaning of "=" and "distinct"? Equality of references to nodes or deep equality of data?
[XML Query Data Model] defines "=" (value equality) and "==" (identity equality) operators. Description of distinct states that it uses "==".
Issue-0067: Runtime Casts
Date: Sep-21-2000
In some contexts it may be desirable to cast values at runtime. Such runtime casts lead to an error if a value cannot be cast to a given type. See also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only), where the Algebra team has been put in charge of introducing run-time casts into the Algebra.
cast
e :
t
has been introduced as a reducible operator expressed in terms of match
(see [A.2 Reducible expressions]).
Issue-0068: Document Collections
Date: Oct-16-2000
Per our requirements document we are chartered to support document collections. The current XML-Query Algebra deals with single documents only. There are a number of subissues:
(a) Do we need a more elaborate notion of node-references? E.g. pair of (URI of root-node, local node-ref)
(b) Does the namespace mechanism suffice to type collections of nodes from different documents? Probably yes.
(c) Provided (a) and (b) can be settled, will the approach taken for [Issue-0049: Unordered Collections] do the rest?
Issue-0069: Organization of Document
Date: Oct-16-2000
The current document belongs more to the genre (scientific) paper than to the genre specification. One may consider the following modifications: (a) reorganize intro to give a short overview and then state the purpose (strongly typed, neutral syntax with formal semantics as a basis for possibly multiple syntaxes, etc.) (compared to version Aug-23, this version has already gone a good deal in this direction). (b) Equip various definitions and type rules with id's. (c) Elaborate appendices on mapping XML-Query-Algebra Model vs. XML-Query-Datamodel, XML-Query-Type System vs. XML-Schema-Type System. (d) Maybe add an appendix on use-case-solutions. The problem is of course: Part of this is a lot of work, and we may not achieve all for the first release.
At http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only) the WG decided to dispose of this issue. The current overall organization of the document is quite adequate, but of course editorial decisions will have to made all the time.
Issue-0070: Stable vs. Unstable Sort/Distinct
Date: Oct-02-2000
Should sort (and distinct) be stable on ordered collections, i.e. lists, and unstable on unordered collections (see [Issue-0049: Unordered Collections])? For more details see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0007.html (W3C-members only) (W3C-members only).
sort and distinct are stable on ordered collections, and unstable on unordered collections - see [4.7 Typing unordered forests].
Issue-0071: Alignment with the XML Query Datamodel
Date: Sep-26-2000
Currently, the XML Query Algebra Datamodel does not model PI's and comments. For more details see thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0167.html (W3C-members only) (W3C-members only).
Addition of operational semantics defines relationship of Algebra to Data Model.
Issue-0072: Facet value access in Query Algebra
Date: Oct-04-2000
Each of the date-time data types have facet values as defined by the schema data types draft spec. This problem is general enough to be applied to other atomic data types.
The question is : Should we provide access to these facet values on an instance of a particular data types? If so, what type of access? My take is the facets are to be treated like read-only attributes of a data instance and one should have a read access to them. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0044.html (W3C-members only) (W3C-members only).
Issue-0073: Facets for simple types and their role for typechecking
Date: Oct-16-2000
XML-Schema introduces a number of constraining facets http://www.w3.org/TR/xmlschema-2/#dc-facets for simple types (among them: length, pattern, enumeration, ...). We need to figure out whether and how to use these constraining facets for type-checking. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0146.html(W3C-members only) (W3C-members only).
Issue-0074: Operational semantics for expressions
Date: Nov-16-2000
It is necessary to add an operational semantics that formally defines each operator in the Algebra.
Issue-0075: Overloading user defined functions
Date: Nov-17-2000
User defined functions can not be overloaded in the XML Query Algebra, i.e., a function is exclusively identified by its name, and not by its signature. Should this restriction be relaxed and if so - to which extent?
Issue-0076: Unordered types
Date: Dec-11-2000
Currently unorderedness is represented at type level by {t}, and some (built-in) operators are overloaded such they have different semantics (and potentially different return type) depending on their input type. An alternative is to not represent unorderedness at type level, but rather support unordered for, unordered (unstable) sort, unordered (unstable) distinct.
Issue-0077: Interleaved repetition and closure
Date: Dec-12-2000
Regular Languages are closed w.r.t. to the interleaved product. However, they are not closed w.r.t. to interleaved repetition, which can (e.g) generate the 1 degree Dyck language D[1] = () | a D[1] b | D[1] D[1] = (a,b)^{0,*}, and more generally, any language that coordinates cardinalities of individual members from an alphabeth: E.g. (a ^ b)^{0,*} = all strings with equally many a's and b's. These are beyond regular languages. Should we thus try to do without interleaved repetition?
Issue-0078: Generation of ambiguous types
Date: Dec-12-2000
Unambiguous content-models in XML 1.0 and XML Schema are not closed w.r.t. union. It appears that the XML Query-Algebra can generate result types which can not be transformed to an unambiguous content-model.
Issue-0079: Global order between nodes in different documents
Date: Dec-16-2000
The global order operator < is defined on nodes in the same document, but not between nodes in different documents.
Issue-0080: Typing of parent
Date: Dec-16-2000
Currently, the parent
operator yields an
imprecise type : AnyElement{0,1}
. It might be
possible to type parent
more precisely, for
example, by using the normalized names in MSL, which encode
containment of types.
Issue-0081: Lexical representation of Schema simple types
Date: Jan-17-2001
Schema simple types must be defined for the Algebra and XQuery.
Issue-0082: Type and expression operator precedence
Date: Jan-17-2001
The precedence of expression operators and type operators is not defined.
Issue-0083: Expressive power and complexity of match expression
Date: Jan-17-2001
When processing an XML document without schema information, i.e., the type of the document is AnyComplexType, then match expressions may be very expensive to evaluate:
match x case t1 : AnyTree do 1 case t2 : AnyTree{0,2} do 2 case t3 : *[*[*[*[* ... [AnyAttribute] ]]]] do 3 else errormatch itself is not the issue. The real problem is having very liberal type patterns. We could restrict the kinds of type patterns that we permit.
Issue-0084: Execution model
Date: Jan-17-2001
Need prose describing execution model scenarios : interpretor vs. compile/runtime vs. translation into another query language. Explain relationship between static and dynamic semantics.
Issue-0085: Semantics of Wildcard type
Date: Jan-17-2001
Cite: wildcard types cannot be implemented ( Section 2.12: Expanded names, paragraph 11 http://www.w3.org/XML/Group/2000/12/xmlquery-algebra20001204.html; critical, core) If x!y means any name in x except names in y, what does x!y!z mean? In general, how do ! and | operate (precedence, associativity)? Parentheses are required to force the desired grouping of these two operators. Also, what does x!* mean? (There's an infinite family of such examples.)
Issue-0086: Syntactic rules
Date: Jan-17-2001
Need rules for specifying syntactic correctness of query: symbol spaces; variable def'ns precede uses; list of keywords, etc.
Issue-0087: More examples of Joins
Date: Jan-17-2001
Cite: no join operator; wants example of many-to-many joins, inner join, left and full outer joins.
[Issue-0015:
3-valued logic to support NULLs]
[Issue-0018:
Align algebra types with schema]
[Issue-0030:
Automatic type coercion]
[Issue-0052:
Axes of
XPath]
[Issue-0013:
Collations]
[Issue-0068:
Document Collections]
[Issue-0084:
Execution model]
[Issue-0083:
Expressive power and complexity of match expression]
[Issue-0009:
Externally defined functions]
[Issue-0073:
Facets for simple types and their role for typechecking]
[Issue-0072:
Facet value access in Query Algebra]
[Issue-0008:
Fixed point operator or recursive
functions]
[Issue-0032:
Full regular
path expressions]
[Issue-0028:
Fusion]
[Issue-0078:
Generation of ambiguous types ]
[Issue-0079:
Global order between nodes in different documents]
[Issue-0054:
Global vs. local complex
types]
[Issue-0053:
Global vs. local
elements]
[Issue-0077:
Interleaved repetition and closure]
[Issue-0060:
Internationalization aspects for strings]
[Issue-0081:
Lexical representation of Schema simple types]
[Issue-0033:
Metadata Queries]
[Issue-0061:
Model for References]
[Issue-0087:
More examples of Joins]
[Issue-0074:
Operational semantics for expressions]
[Issue-0056:
Operators
on Simple Types]
[Issue-0075:
Overloading user defined functions]
[Issue-0014:
Polymorphic types]
[Issue-0007:
References: IDREFS, Keyrefs, Joins]
[Issue-0085:
Semantics of Wildcard type]
[Issue-0020:
Structural vs. name equivalence]
[Issue-0019:
Support
derived types]
[Issue-0086:
Syntactic rules]
[Issue-0059:
Testing Subtyping]
[Issue-0082:
Type and expression operator precedence]
[Issue-0055:
Types with
non-wellformed instances]
[Issue-0080:
Typing of parent]
[Issue-0076:
Unordered types]
[Issue-0029:
Views]
[Issue-0011:
XPath tumbler syntax instead of index?]
[Issue-0071:
Alignment with the XML Query Datamodel]
[Issue-0001:
Attributes]
[Issue-0047:
Attributes]
[Issue-0065:
Built-In GroupBy?]
[Issue-0027:
Case syntax]
[Issue-0040:
Case
Syntax]
[Issue-0023:
Catch exceptions and process in algebra?]
[Issue-0010:
Construct values by copy]
[Issue-0038:
Copy by reachability]
[Issue-0037:
Copy vs identity semantics]
[Issue-0039:
Dereferencing semantics]
[Issue-0063:
Do we need (user defined) higher order functions?]
[Issue-0058:
Downward Navigation only?]
[Issue-0005:
Element identity]
[Issue-0064:
Error code handling in Query Algebra]
[Issue-0035:
Exception handling]
[Issue-0048:
Explicit Type Declarations]
[Issue-0046:
FOR
Syntax]
[Issue-0034:
Fusion]
[Issue-0003:
Global Order]
[Issue-0045:
Global
Order]
[Issue-0036:
Global-order based operators]
[Issue-0042:
GroupBy]
[Issue-0012:
GroupBy - needs second order functions?]
[Issue-0022:
Indentation, Whitespaces]
[Issue-0044:
Keys and IDREF]
[Issue-0016:
Mixed content]
[Issue-0057:
More precise type system; choice in path]
[Issue-0002:
Namespaces]
[Issue-0062:
Open questions for constructing elements by reference]
[Issue-0069:
Organization of Document]
[Issue-0026:
Project - one tag only]
[Issue-0051:
Project redundant?]
[Issue-0043:
Recursive Descent for
XPath]
[Issue-0050:
Recursive Descent for
XPath]
[Issue-0031:
Recursive functions]
[Issue-0004:
References vs containment]
[Issue-0067:
Runtime Casts]
[Issue-0066:
Shallow or Deep Equality?]
[Issue-0041:
Sorting]
[Issue-0006:
Source and join syntax instead of "for"]
[Issue-0070:
Stable vs. Unstable Sort/Distinct]
[Issue-0021:
Syntax]
[Issue-0025:
Treatment of empty results at type level]
[Issue-0049:
Unordered
Collections]
[Issue-0017:
Unordered content]
[Issue-0024:
Value for empty forests]