This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
A while back I introduced an implementation of collection() in Saxon that uses a URI to identify a set of XML documents in filestore, with the ability to do pattern matching on the file names, recursively traverse the directory structure, and so on. This has proved very popular with XSLT users in particular: for example it allows you to build an index over a large set of source documents. The problem is that it isn't stable: if you call collection() again with the same URI, and files have been created or deleted, you will get a different result the next time. I've tried various devices to get around this problem, but the only conformant solution I can think of is to abandon using the collection() function for this purpose and introduce a proprietary extension function instead, which doesn't seem to be in anyone's interests. As far as I can tell there are only two ways of making the collection() function stable. One is to lock the stored collection against updates for the duration of the query or transformation. This is only possible where you have exclusive access to the data; it's not a practical solution for files in filestore. The other approach is to take a snapshot of the entire collection. But that's hideously expensive, given that the collection will usually be too big to fit in memory, and that the chances are that 99% of the time it will only be read once, often with each document being processed to completion before the next one is examined. So I think there's a strong case for relaxing the requirement that collection() should be stable. David Carlisle made an interesting suggestion: one could define the semantics so that collection() is guaranteed to create new nodes rather than return existing nodes. Since our processing model already allows a function to create new nodes each time it is called, this shouldn't be problematic. 
Of course for XQuery scenarios involving a database that might be updated, one does want a reference to the existing node, which suggests a need for two options or modes. Michael Kay
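The difference between a stable and a non-stable collection() over a directory can be sketched roughly as follows. This is a Python illustration only, not Saxon's implementation: the names list_collection, StableCollections and demo are hypothetical, and only the list of file names is snapshotted. A fully stable implementation would also have to pin the document contents, which is the expensive part described above.

```python
import os
import tempfile


def list_collection(directory):
    """Non-stable: re-scan the directory on every call."""
    return sorted(n for n in os.listdir(directory) if n.endswith(".xml"))


class StableCollections:
    """Stable: remember the first result per directory for the whole run."""

    def __init__(self):
        self._snapshots = {}

    def collection(self, directory):
        if directory not in self._snapshots:
            self._snapshots[directory] = list_collection(directory)
        return self._snapshots[directory]


def touch(path):
    # Create an empty file (stands in for a source document).
    with open(path, "w"):
        pass


def demo():
    with tempfile.TemporaryDirectory() as d:
        touch(os.path.join(d, "a.xml"))
        stable = StableCollections()
        stable_before = list(stable.collection(d))
        unstable_before = list_collection(d)
        # Simulate another process adding a file in mid-transformation.
        touch(os.path.join(d, "b.xml"))
        stable_after = list(stable.collection(d))
        unstable_after = list_collection(d)
        return stable_before, stable_after, unstable_before, unstable_after
```

In this sketch the stable variant keeps returning the first listing, while the non-stable variant picks up the new file on the second call.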
I implemented collection for a directory of files in a stable manner, by keeping all documents in storage. Of course this does not scale well. We could have two different functions, as Michael suggests - say collection and stable-collection. Alternatively, we could have a form with three arguments, the third being a boolean flag. In this case the stability could be computed at runtime. I'm not sure though that there are any use cases for this.
Similar comments would apply to doc() and XSLT's document() functions. Consider (to take a real example) processing the set of XQueryX files in the XQuery test suite (as input documents, to do some kind of transformation, or query-on-queries). The files are (typically) on the filestore, so Michael's comments about the non-availability of database-style locking mechanisms apply, but you wouldn't normally want to load them via collection, but rather something like <xsl:for-each select="descendant::qt:test-case"> ... select="doc(concat(@FilePath,query/@name,'.xqx'))" ... If you try doing this on a system that holds all documents in memory you run out of memory pretty quickly (or have a much bigger machine than I have). Saxon's discard-document extension function has proved invaluable in making this type of operation, processing large numbers of smallish files, feasible. Something in the standard that addresses this (it doesn't have to be exactly discard-document) would be really useful, I think; otherwise I suspect that discard-document is going to be 2.0's xx:node-set(), i.e. a must-have extension that every implementer has to implement, and which causes confusion to beginners who don't know why they need an extension anyway, and causes interoperability problems for non-beginners (due to the usual extension function issues about differences of behaviour in edge cases, different namespaces, etc.). I know it's late in the process, but CR is about implementation experience, and a worryingly large proportion of my XSLT 2 stylesheets are using this extension already. David
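The memory argument can be illustrated with a small sketch of a document pool that supports discarding. This is a Python model under stated assumptions, not Saxon's discard-document: DocumentPool, its loader callback and process_all are hypothetical names, and the loader stands in for reading a file from filestore.

```python
import xml.etree.ElementTree as ET


class DocumentPool:
    """Keeps parsed documents so repeated doc(uri) calls return the same tree."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callback: uri -> XML text
        self._docs = {}

    def doc(self, uri):
        if uri not in self._docs:
            self._docs[uri] = ET.fromstring(self._loader(uri))
        return self._docs[uri]

    def discard(self, uri):
        # Drop the tree; a later doc(uri) re-parses and yields new nodes.
        self._docs.pop(uri, None)

    def size(self):
        return len(self._docs)


def process_all(pool, uris):
    """Process each document to completion, then discard it."""
    tags = []
    peak = 0
    for uri in uris:
        root = pool.doc(uri)
        tags.append(root.tag)
        peak = max(peak, pool.size())
        pool.discard(uri)
    return tags, peak
```

With discarding, the pool in this sketch never holds more than one document at a time; without it, every document read stays resident until the run ends. The cost is that a document fetched again after being discarded is a different tree, so node identity is not preserved across the discard.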
Re. David's comment that similar considerations should apply to doc() and XSLT's document(), and the reference to Saxon's discard-document() extension function. This is fundamentally unsound, I believe. If after a call to discard-document(), the stylesheet later encounters a reference to nodes within that same document, then the document will have to be re-parsed. Although it is not difficult to implement generate-id() in such a way that its results are guaranteed to be the same before and after re-parsing a document for a given node (determined by its numbering in document order), this is not sufficient to guarantee node identity for the same generated id, as the same URI may no longer refer to the same document contents (this is often the case for HTTP URIs referring to dynamically generated documents).
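The point about ids derived from document order can be shown with a short sketch. It is a Python illustration only: ids_by_document_order and the "d1e" id prefix are hypothetical and are not how any particular processor implements generate-id().

```python
import xml.etree.ElementTree as ET


def ids_by_document_order(xml_text):
    """Parse the text and number every element in document order."""
    root = ET.fromstring(xml_text)
    # Element.iter() walks the tree in document order, starting at the root.
    return {f"d1e{i}": element for i, element in enumerate(root.iter())}


original = "<a><b/><c/></a>"
regenerated = "<a><x/><b/><c/></a>"  # same URI, new dynamically generated content

first_parse = ids_by_document_order(original)
second_parse = ids_by_document_order(original)
changed_parse = ids_by_document_order(regenerated)
```

Re-parsing unchanged content gives the same id for the same position, so first_parse["d1e1"] and second_parse["d1e1"] are both the b element (though distinct objects). Once the content behind the URI changes, the id "d1e1" in changed_parse refers to the x element instead, so an equal id no longer implies the same node.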
David's comment that the same arguments apply to document()/doc() is true in principle: there are use cases where you read a large number of documents using doc(), and where you only access each document once, and where the "stability" provision therefore gives you a lot of pain and no gain by locking all the documents into memory. Perhaps we can solve this as follows:
(a) we specify that doc() and collection() are stable by default (in SQL terms, the default isolation level is SERIALIZABLE)
(b) we specify that implementations may provide an option to select a different isolation level
(c) we specify that a call on doc() or collection() may fail if the implementation cannot provide access to the requested resource with the requested isolation level
This is anticipating a more comprehensive treatment of transactions and isolation levels in a future version of the spec. Michael Kay
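Rules (a) to (c) might be modelled along the following lines. This is a minimal Python sketch under stated assumptions, not a description of any implementation: Resolver, StabilityError, and the fetch and can_pin callbacks are all hypothetical names introduced for illustration.

```python
class StabilityError(Exception):
    """Raised when a stable result was requested but cannot be provided."""


class Resolver:
    def __init__(self, fetch, can_pin, stable=True):
        self._fetch = fetch      # hypothetical callback: uri -> document
        self._can_pin = can_pin  # hypothetical callback: uri -> bool
        self._stable = stable    # (a) stable unless the user opts out
        self._pinned = {}

    def doc(self, uri):
        if not self._stable:
            # (b) relaxed level: fetch afresh, no guarantee of the same result
            return self._fetch(uri)
        if uri in self._pinned:
            return self._pinned[uri]
        if not self._can_pin(uri):
            # (c) fail rather than silently return an unstable result
            raise StabilityError(f"cannot provide a stable result for {uri}")
        self._pinned[uri] = self._fetch(uri)
        return self._pinned[uri]
```

In the default mode the same URI yields the same result for the whole run or an error, never an unstable result. Only a resolver created with stable=False re-fetches on every call.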
(In reply to comment #3)
> Re. David's comment that similar considerations should apply to doc() and XSLT's
> document(), and the reference to Saxon's discard-document() extension function.
>
> This is fundamentally unsound, I believe.
>
Yes, but your comments re soundness apply equally to collection(), so I don't think you were really disagreeing with my comment that doc() and collection() could be considered equally. I was going to answer (but I see Michael already made a similar suggestion) that the solution may be along the lines of having a mode that drops the requirement that the same nodes are returned if you call doc() twice with the same URI. There are use cases where the guaranteed stability is a good thing but it isn't really an essential part of the language. Other functions that generate nodes do not have this feature. A function definition like declare function x:f () {<x/>} means that x:f() returns a new node each time, so it is not a pure function in that sense. A mode in which doc() acted the same way would be very useful, I think. David
You are right that the lack of soundness applies to collection too. I like Michael's suggestion that an implementation may provide non-default isolation levels, and that an error may result in these cases. I'm going to implement such a scheme starting today.
At the joint meeting on 10 Jan the change was accepted in principle and I was actioned to supply detailed text. Here is the proposal:

1. In the XQuery book, section 2.4.4, delete the paragraph "If one of the above functions is invoked repeatedly with arguments that resolve to the same absolute URI during the processing of a single query, each invocation must return the same node sequence. This rule applies also to repeated invocations of fn:collection with zero arguments during the processing of a single query." (This information is currently redundant.)

2. In F+O, 1.7, under the definition of the term "stable", add a note after the first paragraph: "Note: in the case of fn:collection() and fn:doc(), the requirement for stability may be relaxed: see the function definitions for details"

3. In F+O 15.5.6 fn:collection, delete the sentence "This function is ·stable·." and replace it with the following paragraph:

By default, this function is stable. This means that repeated calls on the function specifying the same URI will return the same result each time. However, for performance reasons, implementations may provide a user option to evaluate the function without a guarantee of stability. The manner in which any such option is provided is implementation-defined. If the user has not selected such an option, a call of the function must either return a stable result or must raise an error.

4. In F+O 15.5.4, fn:doc, change "This function is stable" to "By default, this function is stable". After the example explaining what this means, add the paragraph:

However, for performance reasons, implementations may provide a user option to evaluate the function without a guarantee of stability. The manner in which any such option is provided is implementation-defined. If the user has not selected such an option, a call of the function must either return a stable result or must raise an error.

At the end of this section, add a fifth bullet:

* Implementations may provide user options that relax the requirement for the function to return stable results.

5. In F+O 15.5.5, fn:doc-available, add a final sentence: "However, if non-stable processing has been selected for the fn:doc function, this guarantee is lost."

6. In XSLT 5.4.3 (Initializing the Dynamic Context) add after the second paragraph:

As specified in [F+O], implementations may provide user options that relax the requirement for the doc and collection functions (and therefore, by implication, the document function) to return stable results. By default, however, the functions must be stable. The manner in which such user options are provided, if at all, is implementation-defined.
Fixed as per the wording suggested by Michael Kay and modified in the minutes of the joint XSL/XQuery telcon Jan 24, 2006.