Copyright © 2004 Xerox Corporation.
The Xerox Document Services Document Model (XDSDM) is part of the Xerox Document Services Platform, a prototype XML application under development for sequencing invocations of operations on a document model. The model represents the identity, content, and meta-data of compound documents, with a focus on scanned (image) documents and documents that are derived from them, although it is applicable to other document types as well. The document model, the sequencing mechanism, and the operations themselves are defined in a modular and extensible way. This paper discusses work in progress on the document model.
The structure of the XDSDM is designed to allow easy access and manipulation of the model by XPath expressions (see [XPath 1.0]), as is done in XForms (see [XForms 1.0]). Document Services that operate on the XDSDM can retrieve documents, metadata, and renditions from the model by using XPath expressions, and can update the model to add or change renditions or metadata through the use of XML Events (see [XML Events]) with DOM mutation actions, whose location is specified by XPath expressions and whose contents are described in the event payload.
The document content itself is not stored in the model, but is referred to by URI (see [RFC 2396]).
The following issues arise related to combining multiple vocabularies:
[Issue xml:id]
[Issue mustUnderstand]
[Issue Foreign Elements and Ordering]
The following issue arises related to web applications:
[Issue XML Handlers]
This document is a position paper for [The W3C Workshop on Web Applications and Compound Documents]. It represents a work in progress, and is not supported by Xerox Corporation.
1 Introduction
1.1 Background
1.2 Documentation Conventions
2 Document Structure
2.1 Common Attributes
2.1.1 Attribute doc:id
2.1.2 Attribute doc:mustUnderstand
2.2 Elements related to documents
2.2.1 Element documents
2.2.2 Element document
2.3 Elements related to renditions
2.3.1 Element doc:renditions
2.3.2 Element doc:rendition
2.3.3 Element doc:renditionSequence
2.3.4 Namespace http://www.example.com/rendition-info
2.4 Elements related to Meta Data
2.4.1 Element doc:metadata
2.4.2 Dublin Core Elements
3 Glossary Of Terms
A Schemas for Xerox Document Service Document Model
A.1 Schema for Document Model
A.2 Schema for Rendition Information
B References
B.1 Normative References
B.2 Informative References
C Xerox Document Service Document Model Use Example (Non-Normative)
C.1 Before OCR
C.2 After OCR
D Changelog (Non-Normative)
E Acknowledgments (Non-Normative)
F Production Notes (Non-Normative)
1 Introduction
1.1 Background
While electronic documents are a dominant and increasing part of the business world, paper documents have not disappeared, and are in fact increasing in absolute terms. Paper documents offer significant advantages in reading, understanding, some kinds of editing, certain legal settings, and the many other affordances of paper, yet electronic documents are undeniably the coin of the realm in today's business world. Increasingly, business deals with compound documents containing both paper and electronic forms. Storing copies of paper documents electronically gives the best of both worlds: the full ease of electronic document operations can be applied to paper documents, and paper access to electronic documents remains available when necessary and valuable. Capturing paper documents in a usable electronic form, and being able to print, copy, or otherwise operate on them from a desktop computer or a networked document appliance, is of great importance to business.
Documents are most valuable when they are situated, and can serve as memory association triggers. Many paper documents are situated by physical filing systems and human spatial memory, and not all of the information about the documents (meta-data) is in a form that is easily captured electronically. Electronic documents are usually situated by use of meta-data, or named properties of documents. Scanned electronic documents rapidly lose their value if the association between the document identity, document content, and document meta-data is lost.
Capturing paper documents has traditionally been an expensive business process. First, documents are scanned and saved to removable media or an extranet, then checked for quality and possibly re-scanned, then sent to a "coding bureau" to have meta-data typed in and associated, and finally shipped to a document repository. The total cost of this cycle is tremendous, as information present at each step is lost before the next and must be recreated at great cost. In summary, capturing meta-data for scanned documents is labor intensive, and is best done closest to the source of the documents, as it can be expensive to recover the meta-data at a later date, or by someone other than the document's owner.
Document Services re-envisions the paper-electronic boundary, and uses a capture technology that associates the meta-data with the document as soon as possible, when the document is still situated, and gives immediate feedback about the document quality, thus reducing the cost of both the capture and QA steps of document capture.
A typical paper business document achieves importance by being in the hands of a knowledge worker, who not only knows the value of the document, but also knows the context. Thus, he or she is an ideal person to capture the meta-data associated with the value and context of the document, and to approve the quality of its capture. Unfortunately, traditional document capture technologies are cumbersome and time-consuming, so it is not cost-effective to pay knowledge workers to handle their own documents. A Document Services-based system aims to reduce the cost of handling and capturing documents to produce rich repositories of electronic knowledge, at low cost, by integrating the handling of paper and electronic documents into the normal work practice of knowledge workers, with operations that are defined in their terms, rather than focused on traditional scanning procedures.
This specification defines one important part of a compound paper-electronic document processing system: the Xerox Document Services Document Model, an XML instance document that models the documents under processing and holds their rendition and meta-data information. Other key components of a document processing system are beyond the scope of this paper, for example a Document Service Orchestrator, which accepts a workflow definition XML document describing a process for performing document services. Document Services include capturing a document, adding meta-data to it, performing quality assurance, applying transformations such as OCR to both renditions and meta-data, storing the document in a document repository, and dispatching it to a target such as a printer or e-mail address.
Manipulation of the XDSDM by document services is done through XPath expressions (see [XPath 1.0]), as is done in XForms (see [XForms 1.0]). In fact, the XDSDM itself is similar to the instance document in XForms, but instead of being modified through user interface controls, it is modified by document services.
Document Services that operate on the XDSDM can retrieve documents, metadata, and renditions from the model by using XPath expressions, and can update the model to add or change renditions or metadata through the use of XML Events (see [XML Events]) with DOM mutation actions, whose location is specified by XPath expressions and whose contents are described in the event payload. Unfortunately, the XML Events specification does not specify action handlers, so the DOM mutation handlers are presently implemented as shorthand for an XSLT transformation (see [XSLT 1.0]): the action handler body is transformed into an XSLT transformation, which is then applied to the identified element document in the XDSDM, with the event payload available as the result of an XSLT extension function in XPath expressions.
Issue (issue-xml-event-handlers):
XML Handlers
A recommendation for DOM Mutation and scripting in XML Event handlers would be most welcome.
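The retrieve-and-update pattern described above can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the XML Events and XSLT mechanism the platform actually uses: the tree is modified directly. The attribute names mediaType and href, the helper names, and the repository URIs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DOC_NS = "http://www.example.com/document"
NS = {"doc": DOC_NS}
ET.register_namespace("doc", DOC_NS)

# Hypothetical instance: the href and mediaType attributes and the URIs are assumed.
MODEL = """\
<doc:documents xmlns:doc="http://www.example.com/document">
  <doc:document doc:id="d1">
    <doc:renditions>
      <doc:rendition mediaType="image/tiff" href="http://repository.example.com/d1/scan.tif"/>
    </doc:renditions>
  </doc:document>
</doc:documents>
"""


def image_renditions(root):
    """Retrieval: select every rendition by path, then keep the image/* ones."""
    found = root.findall(".//doc:rendition", NS)
    return [r for r in found if r.get("mediaType", "").startswith("image/")]


def append_rendition(root, doc_id, media_type, href):
    """Update: append a new rendition to the document with the given handle."""
    for document in root.findall("doc:document", NS):
        if document.get(f"{{{DOC_NS}}}id") != doc_id:
            continue
        container = document.find("doc:renditions", NS)
        if container is None:
            container = ET.SubElement(document, f"{{{DOC_NS}}}renditions")
        return ET.SubElement(
            container,
            f"{{{DOC_NS}}}rendition",
            {"mediaType": media_type, "href": href},
        )
    raise KeyError(doc_id)


if __name__ == "__main__":
    root = ET.fromstring(MODEL)
    for image in image_renditions(root):
        # A real OCR service would read image.get("href") and write the text elsewhere.
        append_rendition(root, "d1", "text/plain", "http://repository.example.com/d1/ocr.txt")
    print(ET.tostring(root, encoding="unicode"))
```

The update is addressed by document handle and location in the tree, which is the role the XPath expression plays in a DOM mutation action; the new rendition element plays the role of the event payload.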
The document content itself is not stored in the model, but is referred to by URI (see [RFC 2396]). This is compatible with the XForms 1.0 element upload and with the multipart-related serialization of the XForms 1.0 submission method.
1.2 Documentation Conventions
Throughout this document, the following namespace prefixes and corresponding namespace identifiers are used:
doc: The Document Services Document Model namespace (http://www.example.com/document); see A.1 Schema for Document Model
ri: The Document Services Rendition Information namespace (http://www.example.com/rendition-info); see A.2 Schema for Rendition Information
xsd: The XML Schema namespace (http://www.w3.org/2001/XMLSchema); see [XML Schema part 1]
xsi: The XML Schema for instances namespace (http://www.w3.org/2001/XMLSchema-instance); see [XML Schema part 1]
my: Any user defined namespace
2 Document Structure
The XDSDM is derived from the document model of [System 33], in which a document is separated into a triple:
an identity with a unique handle
a set of parallel renditions representing the content of the document, each rendition having a series of named properties
a set of named metadata items
In XDSDM, each of the items in this triple is represented by an element: the identity by the element document, the renditions by a sequence of elements rendition, and the metadata items by a containing element metadata.
The content of the renditions themselves is not stored in the model, but is referenced by an attribute on rendition. An XDSDM element documents contains zero or more documents, each of which can have zero or more renditions (content), and zero or more pieces of metadata. XML Schema descriptions for the XDSDM instance, document, rendition, and metadata structures are given. These schemas use XML namespaces for extensibility. Other XML applications such as [Guidelines for implementing Dublin Core in XML] are used where appropriate.
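To make the containment structure concrete, the following is a minimal sketch in Python (standard library only) that parses a small instance and lists each document with its renditions. The attribute names mediaType and href, the function name summarize, and the repository URIs are assumptions for illustration; the paper states only that an attribute on rendition references the content.

```python
import xml.etree.ElementTree as ET

DOC_NS = "http://www.example.com/document"
NS = {"doc": DOC_NS}

# Hypothetical instance: one document with two parallel renditions and an
# empty metadata container, and one document with no renditions at all.
XDSDM = """\
<doc:documents xmlns:doc="http://www.example.com/document">
  <doc:document doc:id="d1">
    <doc:renditions>
      <doc:rendition mediaType="image/tiff" href="http://repository.example.com/d1/scan.tif"/>
      <doc:rendition mediaType="text/plain" href="http://repository.example.com/d1/ocr.txt"/>
    </doc:renditions>
    <doc:metadata/>
  </doc:document>
  <doc:document doc:id="d2"/>
</doc:documents>
"""


def summarize(xml_text):
    """Return a dict mapping each document handle to its (mediaType, URI) pairs."""
    root = ET.fromstring(xml_text)
    summary = {}
    for document in root.findall("doc:document", NS):
        handle = document.get(f"{{{DOC_NS}}}id")
        renditions = document.findall("doc:renditions/doc:rendition", NS)
        summary[handle] = [(r.get("mediaType"), r.get("href")) for r in renditions]
    return summary


if __name__ == "__main__":
    for handle, renditions in summarize(XDSDM).items():
        print(handle, renditions)
```

Only references to content appear in the instance, so a model of many large scanned documents stays small.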
2.1 Common Attributes
2.1.1 Attribute doc:id
The attribute doc:id is common to most elements in this proposal; however, the use of multiple namespaces complicates the question of which namespace should declare the attribute id.
xml:id
An attribute xml:id added to the XML namespace would simplify matters greatly for XML applications using containing languages and multiple namespaces.
Foreign attributes are generally allowed, but services may ignore them.
2.1.2 Attribute doc:mustUnderstand
Services must process all elements and attributes in the following namespaces:
http://www.example.com/document
http://www.example.com/rendition-info
http://purl.org/dc/elements/1.1/
The attribute doc:mustUnderstand is used on any child element of metadata or rendition to indicate that any service processing the document must understand that element, and must not process the document if it does not. This concept is borrowed from [SOAP 1.2] and [XForms 1.0].
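A service might apply this rule along the following lines. This is a sketch under stated assumptions rather than a normative algorithm: it assumes that doc:mustUnderstand takes the boolean values "true" or "1", that "understanding" is decided per namespace, and that the my namespace URI and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC_NS = "http://www.example.com/document"
# Namespaces every service must process, as listed in this section.
REQUIRED_NAMESPACES = {
    DOC_NS,
    "http://www.example.com/rendition-info",
    "http://purl.org/dc/elements/1.1/",
}
MUST_UNDERSTAND = f"{{{DOC_NS}}}mustUnderstand"

# Hypothetical document with one flagged element in a user defined namespace.
SAMPLE = """\
<doc:document xmlns:doc="http://www.example.com/document"
              xmlns:dc="http://purl.org/dc/elements/1.1/"
              xmlns:my="http://www.example.com/my" doc:id="d1">
  <doc:metadata>
    <dc:title>Lease agreement</dc:title>
    <my:retention doc:mustUnderstand="true">7y</my:retention>
  </doc:metadata>
</doc:document>
"""


def namespace_of(element):
    """Return the namespace URI of an element, or '' if it has none."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def can_process(document, understood=()):
    """Return False if a flagged child of metadata or rendition is not understood."""
    known = REQUIRED_NAMESPACES | set(understood)
    containers = (f"{{{DOC_NS}}}metadata", f"{{{DOC_NS}}}rendition")
    for parent in document.iter():
        if parent.tag not in containers:
            continue
        for child in parent:
            # Assumption: the flag uses xsd:boolean style values.
            flagged = child.get(MUST_UNDERSTAND) in ("true", "1")
            if flagged and namespace_of(child) not in known:
                return False
    return True


if __name__ == "__main__":
    document = ET.fromstring(SAMPLE)
    print(can_process(document))
    print(can_process(document, understood={"http://www.example.com/my"}))
```

Unflagged foreign elements are simply skipped, which matches the rule that services may ignore foreign content they are not required to understand.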
Issue (issue-mustUnderstand-attribute):
mustUnderstand
A common namespace for this concept would be beneficial to producers and services of loosely coupled multiple-namespace documents.
2.2 Elements related to documents
2.2.1 Element documents
XDSDM provides a containing element documents which holds a sequence of elements document.
Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.
2.2.2 Element document
In XDSDM, the document identity is provided by element document, with the unique handle provided by attribute id, which is unique only to the particular model. The model is composed of an XML document containing a sequence of zero or more elements document.
Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.
Issue (issue-foreign-elements-ordering):
Foreign Elements and Ordering
We would like element document to contain at most one element renditions and at most one element metadata, but any number of foreign elements. It is difficult to express the unordered choice of zero or one of these specified elements and at the same time allow any number of unordered foreign elements. This problem puts uncomfortable constraints on documents with multiple XML applications.
2.3 Elements related to renditions
2.3.1 Element doc:renditions
The element doc:renditions serves as a containing element for elements doc:rendition.
Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.
2.3.2 Element doc:rendition
A document can have zero or more child elements doc:rendition. Each rendition is a whole rendition of the document, though the content type, quality, fidelity, and other attributes of the rendition may vary.
Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.
2.3.3 Element doc:renditionSequence
Some documents are composed of an ordered sequence of renditions; for example, a document consisting of a scanned TIFF file [TIFF 6.0] followed by a PDF file [PDF 3.0] would have a doc:rendition containing a doc:renditionSequence containing a sequence of two doc:rendition elements.
Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.
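The TIFF-then-PDF case can be sketched as follows: a small Python function walks a rendition and returns its content URIs in order, descending into any doc:renditionSequence it finds. The href and mediaType attribute names, the function name, and the URIs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"doc": "http://www.example.com/document"}

# Hypothetical rendition made of a scanned TIFF followed by a PDF.
SAMPLE = """\
<doc:rendition xmlns:doc="http://www.example.com/document">
  <doc:renditionSequence>
    <doc:rendition mediaType="image/tiff" href="http://repository.example.com/d1/cover.tif"/>
    <doc:rendition mediaType="application/pdf" href="http://repository.example.com/d1/body.pdf"/>
  </doc:renditionSequence>
</doc:rendition>
"""


def content_uris(rendition):
    """Yield content URIs in document order, descending into nested sequences."""
    sequence = rendition.find("doc:renditionSequence", NS)
    if sequence is None:
        # A simple rendition: its own attribute references the content.
        href = rendition.get("href")
        if href is not None:
            yield href
        return
    for part in sequence.findall("doc:rendition", NS):
        yield from content_uris(part)


if __name__ == "__main__":
    for uri in content_uris(ET.fromstring(SAMPLE)):
        print(uri)
```

Because the function recurses, a part of a sequence may itself be a sequence; whether the schema permits such nesting is not stated here, so the recursion is a convenience rather than a requirement.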
2.3.4 Namespace http://www.example.com/rendition-info
While the A.1 Schema for Document Model provides basic information about the existence of renditions and the location of their content, it provides no information about the rendition itself. Elements from any namespace are allowed as children of doc:rendition, but for interoperability, this paper proposes a canonical set of rendition information elements in the namespace http://www.example.com/rendition-info.
While all renditions of a document are in some sense equivalent, they do have different properties; for example, an original scanned image will have near 100% fidelity to the paper document, but an OCR'd version of the document as plain text would have a low fidelity, perhaps 10%, and an uncorrected accuracy of perhaps 85%. The Schema for these and other common properties of renditions is given in A.2 Schema for Rendition Information.
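A service could use such properties to choose among parallel renditions. The sketch below picks the rendition with the highest fidelity. The element names ri:fidelity and ri:accuracy, their encoding as percentages, and the href and mediaType attributes are assumptions for illustration only; the actual names are defined by the rendition information schema.

```python
import xml.etree.ElementTree as ET

NS = {
    "doc": "http://www.example.com/document",
    "ri": "http://www.example.com/rendition-info",
}

# Hypothetical renditions using the fidelity and accuracy figures from the text.
SAMPLE = """\
<doc:renditions xmlns:doc="http://www.example.com/document"
                xmlns:ri="http://www.example.com/rendition-info">
  <doc:rendition mediaType="image/tiff" href="http://repository.example.com/d1/scan.tif">
    <ri:fidelity>100</ri:fidelity>
  </doc:rendition>
  <doc:rendition mediaType="text/plain" href="http://repository.example.com/d1/ocr.txt">
    <ri:fidelity>10</ri:fidelity>
    <ri:accuracy>85</ri:accuracy>
  </doc:rendition>
</doc:renditions>
"""


def fidelity(rendition):
    """Return the fidelity percentage of a rendition, or 0.0 if none is recorded."""
    return float(rendition.findtext("ri:fidelity", default="0", namespaces=NS))


def highest_fidelity(renditions_element):
    """Return the child rendition with the highest fidelity, or None if empty."""
    candidates = renditions_element.findall("doc:rendition", NS)
    return max(candidates, key=fidelity) if candidates else None


if __name__ == "__main__":
    best = highest_fidelity(ET.fromstring(SAMPLE))
    print(best.get("mediaType"), best.get("href"))
```

A printing service would likely prefer the high-fidelity image, while a search service would prefer the text rendition despite its low fidelity, so the selection key depends on the service.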
2.4 Elements related to Meta Data
2.4.1 Element doc:metadata
The element doc:metadata specifies a sequence of any items in any other namespace. It is up to the application using the document model to place constraints on the type of metadata to be gathered; however, see 2.4.2 Dublin Core Elements.
Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.
2.4.2 Dublin Core Elements
The [Guidelines for implementing Dublin Core in XML] specify an embedding of Dublin Core elements in XML, and it is proposed that Dublin Core metadata items be used where practical. Services must understand these elements, and must not process a document if they do not.
3 Glossary Of Terms
[Definition: Document Services is a generic term for systems and services that process documents, but in this paper it is used specifically to refer to services on scanned image documents and their derivatives. Services include document capture, transformation, and distribution.]
[Definition: A Document Services Document Model is a single element documents which serves as a container for a series of documents.]
[Definition: A Document Service Orchestrator is a processor designed to apply a sequence of document services to a document model.]
[Definition: OCR is a class of document service that accepts a rendition of mediaType image/*, produces a new rendition of type text/* (or similar coded type), and optionally also produces new metadata.]
[Definition: A repository is a document storage and retrieval facility, such as a file server, web server, or other system.]
[Definition: Situated documents obtain meaning from physical context. ]
[Definition: A target is a destination for a document, such as a repository or a printer.]
[Definition: Metadata is data about a document, separate from its content. For example, the type of document (contract, letter, newspaper clipping) is metadata.]
[Definition: In this paper, "document" refers to a scanned image document or a coded document derived from one.]
[Definition: A rendition of a document is a reference to the content of the document, as distinct from the location or identity of the document, or its meta-data. Documents can have multiple renditions, each with different properties; for example, there may be both an image and a text rendition of a document.]
A Schemas for Xerox Document Service Document Model
The example XML Schemas for XDSDM and the related Rendition Information and Meta-Data namespaces are below:
A.1 Schema for Document Model
This is the XML Schema for the Document Model.
A.2 Schema for Rendition Information
This is the XML Schema for Rendition Information. Rendition Information is a common set of rendition properties that are expected to be understood by all services, but are not the exclusive set of properties.
C Xerox Document Service Document Model Use Example (Non-Normative)
This section presents an example use of the XDSDM. The first example shows a job document before OCR, and the second shows how the documents instance is updated by the OCR service.
D Changelog (Non-Normative)
This section summarizes changes since the previous draft of this document.
Approved for Publication May 17, 2004.
E Acknowledgments (Non-Normative)
This model was produced with the participation of the following individuals:
F Production Notes (Non-Normative)
This document was encoded in the XMLspec DTD (which has documentation available). The XML sources were transformed using the xmlspec.xsl style sheet. The XML Schemas and examples were rendered with the xmlverbatim XSLT stylesheet. Emacs was used for editing. The XML was validated using XMLLint (part of the GNOME libxml package) and transformed using XSLTProc (part of the GNOME libxslt package).