3 The abstract model of content labels

The requirements in the previous section can be expressed in a more programmatic way as follows. A Content Label (cLabel) can carry a variety of statements such as:

cLabel {
  That resource R has the property P1 is true
  That resource R has Property P2 that has value V
  That resource R meets WCAG 1.0 AA is true
  That resource R was created in accordance with satisfactory procedures is true
}

Where R may be either a single resource identified by its own URI or a group of resources. Membership of a group R may be defined either by pattern matching based on URIs or with reference to specified properties of resources. The latter case includes, but is not limited to, properties such as creation date, ISAN number etc.

Further, it is necessary to be able to make statements like:

metadata {
  cLabel was created by $organization
  {
    has the e-mail address mail@organization.org
    has a homepage at $url
    has a feedback page at $URL
    ...
  }
  cLabel was created on $date
  cLabel was last reviewed by $person
}

Finally, it is necessary to be able to send a real-time request to $organization seeking automatic confirmation that it was responsible for creating the cLabel, i.e. authenticating the label and the claims made. This amounts to making statements like:

authentication {
  metadata and cLabel verified by $organization
  { has email sss ... }
  verified on $date
}

Such a discussion leads us to the abstract model of cLabels as described in figure 3.1 and the following text.

UML diagram of abstract model for content labels

Figure 3.1 Schematic diagram of the abstract model for content labels.

A Vocabulary collects together a number of related properties or aspects of a resource that may be useful in saying things about that resource. Those properties or aspects are identified by terms of the vocabulary. Each term is identified by a descriptor, may have constraints associated with values that are appropriate for use with that descriptor, and possibly other information such as test suites for checking value assignment. The scope of use of the vocabulary as a whole, as well as the scope of individual terms, may be noted.

An expression is a statement in respect of a resource that the aspect of the resource denoted by a descriptor chosen from a specific vocabulary has a certain value. A valid expression makes reference to a vocabulary term that exists and whose chosen value conforms to the constraints specified in the vocabulary.

An assertion is a specific type of expression which is said to be true by the entity that makes it.

A claim is a specific type of assertion, whose veracity can be ascertained - either by reference to the test conditions described in the relevant vocabulary term, or by observation and interpretation of the meaning of the term as given in its scope notes.

A label is an abstraction, which is a specific type of resource and contains Assertions and Claims in respect of a Group of Resources.

A cLabel is a specific type of label that has assertions describing the circumstances of its creation.

cLabelMetadata makes assertions about a cLabel. In turn, assertions may be made about both a cLabel and its metadata in a certificate.
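The relationship between vocabularies, expressions and assertions can be sketched in code. The following is a minimal illustration only, not part of the WCL model: the class names, the `constraint` callable and the `asserted_by` field are assumptions introduced here to show how a valid expression must refer to an existing term and carry a value that satisfies that term's constraints.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Term:
    """A vocabulary term: a descriptor plus a constraint on acceptable values."""
    descriptor: str
    # Hypothetical representation of a constraint: a predicate over values.
    constraint: Callable[[object], bool] = lambda value: True


@dataclass
class Vocabulary:
    """A collection of related terms, keyed by descriptor."""
    terms: Dict[str, Term] = field(default_factory=dict)

    def add(self, term: Term) -> None:
        self.terms[term.descriptor] = term


@dataclass
class Expression:
    """States that the aspect named by `descriptor` has `value` for `resource`."""
    resource: str
    descriptor: str
    value: object

    def is_valid(self, vocabulary: Vocabulary) -> bool:
        # Valid only if the term exists and the value meets its constraint.
        term = vocabulary.terms.get(self.descriptor)
        return term is not None and term.constraint(self.value)


@dataclass
class Assertion(Expression):
    """An expression that some entity states to be true."""
    asserted_by: str = ""
```

In this sketch a claim would be an assertion whose term also supplies test conditions; that refinement is omitted to keep the example short.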

3.1 The WCL vocabulary

In order to support the abstract model we define a limited vocabulary.

Firstly, in order to aid the creation of a descriptive vocabulary, we define two terms:

Category: A thematically-linked grouping for descriptors.
Descriptor: As defined in the glossary

In addition to whatever descriptive vocabulary or vocabularies are used within a cLabel, the following terms are available.

Summary: The summary is a general text field within a cLabel that summarizes the assertions for display to end users.
Include: Allows one cLabel to include another
Classification: Links to a classification that is defined elsewhere.

There are well established vocabularies for describing people and organizations, such as Dublin Core and FOAF, that should be used when providing data about who created cLabels and cLabelMetadata. We define one further term to supplement these for the context of Content Labels.

Authority for: A namespace for a vocabulary for which the label creator is an authority. The element can occur any number of times to declare multiple vocabulary namespaces.

This term is provided because although it is usual for an LA to issue labels from its own vocabulary, it may wish to include other vocabularies as well. Furthermore, additional vocabularies may be included in cLabels by the content provider or others and this term enables an LA to specify exactly for which descriptions it is and is not responsible.

As noted above, an important aspect of Content Labels is the cLabelMetadata: who created the labels, when, and so on. To this end we define the following terms:

Last reviewed: The date on which the labeled resources were last reviewed. This is a specialization of the Dublin Core Date element and SHOULD be expressed in the W3C date & time format [W3CDTF].
Reviewed by: The individual who reviewed the resource and verified the claims made in the cLabel.
Approved: The individual who checked the reviewer's verification of the claims made in the label.
Valid until: The date until which the cLabel or certificate SHOULD be treated as valid. This is a specialization of the Dublin Core Date element and SHOULD be expressed in the W3C date & time format [W3CDTF].
Withdrawn: The date on which the cLabel creator or certification authority withdrew the cLabel or certificate. This is a specialization of the Dublin Core Date element and SHOULD be expressed in the W3C date & time format [W3CDTF].
Test Result: A link to test results, such as an EARL assertion.

NB: The Dublin Core term 'Issued' SHOULD be used to declare when a cLabel or certificate was issued (the DC term 'Created' is more suited as a descriptor for when the labeled resource was created). As with the WCL descriptors 'Last reviewed', 'Valid until' and 'Withdrawn', the DC term 'Issued' is a specialization of the Dublin Core Date element and SHOULD be expressed in the W3C date & time format [W3CDTF].

3.2 Grouping resources

The grouping of resources is a fundamental aspect of content labels and is the subject of a separate paper [URIPM] that sets out an abstract model for this specific area. The aim is to be able to define a group in such a way that it is programmatically possible to determine whether a particular URI is a member of that group. This then makes it possible to identify the correct cLabel for the initial URI.

cLabels are usually created by people but read by machines. It is therefore important that associating cLabels with resources is a simple task, with the burden of processing the data pushed to the client side. That said, cLabels may be processed in real time as requests for resources are made, so the processing burden should be as light as possible to minimize any latency introduced by systems that make use of cLabels.

As much from a policy perspective as a technical one, it is important that the scope of a set of cLabels be clearly defined. At the topmost level this means that it must be possible to link a set of cLabels to one or more domains since resources on those domains are either produced by the domain owner or produced according to a set of policies set down by the domain owner. Furthermore, policies such as:

everything on my three domains except this bit has label A
Unless told otherwise, everything on this domain has label B
The cLabel only applies to this group of things

are all supported. More formally, the model supports the definition of groups of resources based on a list of domain names, with support for specific inclusion or exclusion of sub-domains. Further support is provided for pattern matching against URIs in as intuitive a way as possible.

Finally, group definition by matching against URIs, or parts of a URI, is not always possible. As a result, WCL also supports the definition of groups based on properties of resources once resolved from a URI, through simple lists of URIs and merely by a resource pointing to a cLabel.

3.2.1 The WCL vocabulary for URI matching

The following terms are taken directly or derived from the abstract model for URI pattern matching [URIPM]. For each term we define a data type for the value it can take.

The following descriptors take a case insensitive string:

Scheme (encoded as scheme): Matches scheme from L to R
Exact scheme (encoded as exactScheme): Scheme must be an exact match
Exclude scheme (encoded as exclScheme): Explicitly excludes a scheme
Host (encoded as host): Match given host and all subdomains
Exact Host (encoded as exactHost): Match given host only
Exclude Host (encoded as exclHost): Explicitly exclude given host and any subdomains

The following descriptors take digits (with or without ancillary punctuation):

Port (encoded as port): Match the given port
Port range (encoded as portRange): Match a range of port numbers or a comma separated list. e.g. 80, 100-200, 630
Exclude Port (encoded as exclPort): Explicitly exclude a port
Exclude Port Range (encoded as exclPortRange): Explicitly exclude a range of port numbers or a comma separated list.
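As an illustration of the portRange value format, the sketch below expands a value such as "80, 100-200, 630" into a set of port numbers. The function name `parse_port_range` is hypothetical, and treating ranges as inclusive at both ends is an assumption, since the text above does not say.

```python
from typing import Set


def parse_port_range(value: str) -> Set[int]:
    """Expand a portRange value such as '80, 100-200, 630' into port numbers."""
    ports: Set[int] = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low, high = (int(p) for p in part.split("-", 1))
            # Assumption: both ends of the range are included.
            ports.update(range(low, high + 1))
        else:
            ports.add(int(part))
    return ports
```

A candidate URI's port can then be tested with a simple membership check, for example `port in parse_port_range("80, 100-200, 630")`.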

The following descriptors take a case sensitive string (except any percent escaped characters which are case insensitive):

Path (encoded as path): Match anywhere in path
Path begins (encoded as pathBegins): Match from L to R
Path ends (encoded as pathEnds): Match from R to L
Exact path (encoded as exactPath): Match path exactly
Exclude path (encoded as exclPath): Match anywhere in path and negate
Query (encoded as query): Match anywhere in query string
Exact query (encoded as exactQuery): Match query string exactly
Exclude query (encoded as exclQuery): Match anywhere in query string and negate
Fragment (encoded as fragment): Match anywhere in fragment
Exact fragment (encoded as exactFragment): Match fragment exactly
Exclude Fragment (encoded as exclFragment): Match anywhere in fragment and negate

The following descriptors take a Perl 5 Regular Expression:

Has URI (encoded as hasURI): Match against whole URI
Exclude URI (encoded as exclURI): Match against whole URI and negate

In order to define a group of resources independently of their URI, additional terms are provided:

Property (encoded as hasProperty): Match the stated property. This descriptor can take any value but it is usually constrained by the particular descriptive vocabulary in use.
Exclude property (encoded as exclProperty): Match the stated property and negate
URI List (encoded as listURI): A list of URIs separated by whitespace. Resources resolved from those URIs are members of the group irrespective of all other conditions.
Exclude URI list (encoded as exclListURI): A list of URIs separated by whitespace. Resources resolved from those URIs are NOT members of the group, irrespective of all other conditions.

These terms can be used to define a group of URIs. A candidate URI can then be processed and a determination made as to whether it is a member of the group or not. The terms MUST be combined according to the following rules.

scheme and exactScheme are combined with logical OR; exclScheme is combined with logical AND NOT.

Example data

scheme = http
exactScheme = ftp

Candidate URIs

http://www.example.org is a member (since its scheme is http).

https://www.example.org/payments?param1=value1&param2=value2 is a member. Scheme matching is done left to right so http matches against https.

ftp://download.example.org is a member since ftp is explicitly quoted.

ftps://secure.download.example.org is not a member since the match for ftp must be exact.
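The combination rule for schemes can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name `scheme_in_group` is hypothetical, exclScheme is assumed to require an exact match, and a definition with no scheme or exactScheme terms is assumed to place no restriction on the scheme.

```python
from urllib.parse import urlsplit


def scheme_in_group(uri, scheme=(), exact_scheme=(), excl_scheme=()):
    """Apply (scheme OR exactScheme) AND NOT exclScheme to a candidate URI."""
    candidate = urlsplit(uri).scheme.lower()  # schemes are case insensitive
    if not scheme and not exact_scheme:
        included = True  # assumption: no scheme terms means any scheme
    else:
        included = (
            # 'scheme' matches left to right, so http also matches https
            any(candidate.startswith(s.lower()) for s in scheme)
            # 'exactScheme' must match the whole scheme
            or any(candidate == s.lower() for s in exact_scheme)
        )
    # Assumption: exclusions are compared as exact matches.
    excluded = any(candidate == s.lower() for s in excl_scheme)
    return included and not excluded
```

Called with `scheme=["http"]` and `exact_scheme=["ftp"]`, the function reproduces the four membership results listed above.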

The various terms for host and port are combined in exactly the same way. The 'excl' terms are provided for cases where it is easier to state what is not covered than what is.

Example data

host = example.com
exclScheme = https
exclHost = test.example.com
exclPort = 443

In this instance, all URIs on the example.com domain are in the group irrespective of scheme, except those served over https. Likewise, the test.example.com subdomain, and any subdomain thereof, is not in the group. Any URI that specifies port 443 is excluded from the group (if the port is not specified in the candidate URI then this rule does not apply).
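A sketch of this particular group definition is given below, with the four terms hard-coded to keep it short. The helper names `covers` and `in_group` are hypothetical, and exclScheme is again assumed to be an exact match.

```python
from urllib.parse import urlsplit


def covers(host: str, pattern: str) -> bool:
    """True if `host` is `pattern` itself or any subdomain of it."""
    host, pattern = host.lower(), pattern.lower()
    return host == pattern or host.endswith("." + pattern)


def in_group(uri: str) -> bool:
    """Membership test for the example data above."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    if not covers(host, "example.com"):       # host = example.com
        return False
    if parts.scheme.lower() == "https":       # exclScheme = https
        return False
    if covers(host, "test.example.com"):      # exclHost = test.example.com
        return False
    if parts.port == 443:                     # exclPort = 443
        return False                          # (port is None when not stated)
    return True
```

Because `urlsplit` reports the port as `None` when the URI does not state one, the exclPort rule only fires for URIs that explicitly specify port 443, as the text requires.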

Paths, queries, fragments, properties and hasURI/exclURI regular expressions are combined with logical AND. If multiple elements of these types are quoted in the group definition, then they MUST all be true for the match to be positive.

Example data

scheme = http
host = example.com
path = current
hasURI = sport|fashion

URIs with a scheme of either http or https on the example.com domain are in the group if the path includes the string 'current' AND the URI matches either 'sport' or 'fashion'.
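The AND combination in this example can be sketched as shown below. The function name `in_group` is hypothetical and the terms are hard-coded. Python's `re` module is used in place of Perl 5 regular expressions; the two agree for a simple alternation like this one, but that equivalence is not guaranteed for every pattern.

```python
import re
from urllib.parse import urlsplit


def in_group(uri: str) -> bool:
    """Membership test for the example data above; all conditions must hold."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    scheme_ok = parts.scheme.lower().startswith("http")              # scheme = http
    host_ok = host == "example.com" or host.endswith(".example.com")  # host
    path_ok = "current" in parts.path                                # path = current
    uri_ok = re.search(r"sport|fashion", uri) is not None            # hasURI
    return scheme_ok and host_ok and path_ok and uri_ok
```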

The URI list and its inverse allow resources to be grouped by simple enumeration. In order to avoid any possible ambiguity, any URI given in a list IS a member of the group (or is NOT a member if on the excluded URI list), even if this clashes with other definition terms.

Example data

host = example.com
pathBegins = images
listURI = http://www.example.com/news.jpg
exclListURI = http://search.example.com/images/logo.png

The news.jpg image is included in the group even though it is not in the images directory. Following the same logic, the logo.png image is NOT a member of the group, even though the other criteria are satisfied.
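The precedence of the URI lists over the other terms can be sketched as follows. The names are hypothetical and two points are assumptions rather than statements from the model: the leading "/" of the path is ignored when applying pathBegins, and the exclusion list is checked first in case a URI appears on both lists.

```python
from urllib.parse import urlsplit

LIST_URI = {"http://www.example.com/news.jpg"}
EXCL_LIST_URI = {"http://search.example.com/images/logo.png"}


def in_group(uri: str) -> bool:
    """Membership test for the example data above."""
    # Enumerated URIs are decided before any pattern matching takes place.
    if uri in EXCL_LIST_URI:
        return False
    if uri in LIST_URI:
        return True
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    host_ok = host == "example.com" or host.endswith(".example.com")
    # Assumption: pathBegins is compared after dropping the leading slash.
    path_ok = parts.path.lstrip("/").startswith("images")
    return host_ok and path_ok
```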

3.3 The Ruleset

A group of resources, as discussed above, must be associated with a particular cLabel. Practically, there must also be a defined starting point for an agent to begin processing the data to extract the correct cLabel for a given resource. This is the function of the Ruleset which has the following elements, all of which are optional.

Scope: This is the widest scope for the cLabels in the data set. Typically this will be one or more domain names, perhaps with further parameters to limit the scope. Such limits might be needed, for example, where a single domain name is shared between several users, such as a blog site or personal web pages offered by an ISP.
Default label: The default cLabel for the resources within the Scope. Where no Scope is defined, the default label applies to any resource linked to the Ruleset unless a Rule is matched.
Rules: In the context of a WCL Ruleset, a Rule defines a group of resources and associates that group with a cLabel. If a resource is a member of the defined group then the associated cLabel MUST be applied instead of the default cLabel. Rules MUST be processed in order and the cLabel associated with the first rule that matches the candidate URI is the one to apply.

The processing model that follows from this is shown in figure 3.2.

Figure 3.2 Diagrammatic representation of the WCL processing model.

For a given resource, processing a Ruleset MUST return either 0 or 1 cLabels.

If it is intended to associate more than one cLabel with a particular resource then there are two points worthy of note:

Although a single Ruleset may only return 0 or 1 cLabels, a resource may point to any number of Rulesets.
A cLabel may include any number of other cLabels.
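The behavior of a Ruleset described in this section can be sketched as a single function. This is an illustration based on the text only, not a transcription of figure 3.2: the name `resolve_label`, the use of predicates to stand for Scope and Rule group definitions, and the choice to return `None` for out-of-scope resources are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

# A matcher decides whether a candidate URI belongs to a group of resources.
Matcher = Callable[[str], bool]


def resolve_label(
    uri: str,
    scope: Optional[Matcher],
    default_label: Optional[str],
    rules: List[Tuple[Matcher, str]],
) -> Optional[str]:
    """Return at most one cLabel identifier for `uri`."""
    # Outside the declared Scope, the Ruleset says nothing about the resource.
    if scope is not None and not scope(uri):
        return None
    # Rules are processed in order; the first matching rule wins.
    for matches, label in rules:
        if matches(uri):
            return label
    # No rule matched, so fall back to the default label (which may be absent).
    return default_label
```

Since every element of a Ruleset is optional, `scope` and `default_label` may both be `None` and `rules` may be empty, in which case no cLabel is returned.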

5 System architecture

Content Labels can be used in a variety of systems and it is not the XG's intention to define a single architecture. However, we do recommend that the following elements are present in any complete system.

Content Labels served either from the labeled site or from a labeling authority
Metadata about a given cLabel giving detail of who created it, when, its period of validity and so on (see WCL Vocabulary)
An authentication route through which a client can make an automated request to the LA
An API through which an LA makes available labels for a given resource
An API through which an LA makes all its labels available as a single download

These elements can be combined in any number of ways, some of which are highlighted in the following sections.

5.1 The Quatro model

The Quatro project, co-funded by the European Union [SIP], combines the elements listed above into a basic architecture with 3 variations. In all cases, content is linked to the cLabel and the cLabelMetadata points to the Labeling Authority.

It should be noted that although every resource must be linked to the cLabel in this model, a client seeking that cLabel will only have to retrieve it once. After the data has been retrieved, whether from the labeled site or the LA database, no further network request is required for as long as the cLabel is held in cache by the client.

5.1.1 Onsite labels that may be edited

Fig 5.1 Diagrammatic representation of simple architecture in which data about resources is held on the same server as those resources. The cLabelMetadata provides a link to the LA which is able to authenticate the data.

Labels are hosted near to the resources they describe (typically the same website) and the LA's database supplies simple authentication. Since in this model content providers are allowed to edit their label to reflect changes in their content, the cLabel may not be exactly the same as the one issued. However the LA is able to assert that it trusts the content provider to make such changes faithfully.

5.1.2 Onsite labels that may not be edited

Fig 5.2 Diagrammatic representation of simple architecture in which data about resources is held on the same server as those resources. A hash of this data is sent to the LA (identified in the cLabelMetadata) which is then able to authenticate the data by comparing it with the hash stored in its database.

This is similar to the previous model but differs in the important respect that the LA does not allow content providers to edit their cLabels. Therefore a hash of the label can be checked against data held by the LA to ensure label integrity.
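A minimal sketch of such an integrity check follows. The hash algorithm and the exact bytes to be hashed are not specified in this document, so the use of SHA-256 over the raw label bytes is an assumption, as are the function names.

```python
import hashlib
import hmac


def label_digest(label_bytes: bytes) -> str:
    """Hex digest of the label as served (assumed: SHA-256 over the raw bytes)."""
    return hashlib.sha256(label_bytes).hexdigest()


def label_is_intact(label_bytes: bytes, digest_held_by_la: str) -> bool:
    """Compare the digest of the served label with the one stored by the LA."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(label_digest(label_bytes), digest_held_by_la)
```

In practice the two serializations would need to be byte-for-byte identical, or canonicalized before hashing, for the comparison to succeed.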

5.1.3 cLabels delivered directly from the LA database

Fig 5.3 Diagrammatic representation of architecture in which all cLabel data is held by the Labeling Authority.

In this architecture, cLabels are delivered directly from the LA's database. Therefore there is no possibility for the label to be modified by the content provider and the source of the labels carries greater inherent trust. On the downside, this model places greater demands on the server infrastructure, system integration and bandwidth usage by the LA.

5.2 Third Party cLabels and certificates

The 3 variants of the Quatro architecture above all show a linkage between the content and the cLabel and a link from the cLabelMetadata to the LA. It is possible to create and use cLabels independently of such links.

An important point to note in this regard is that cLabels, like the resources they describe, have URIs. A certification authority could add its own certificate to the architecture shown in figures 5.1 and 5.2; likewise a client might be configured to seek cLabels from an independent organization's database without following links from the content itself.

5.3 Existing Semantic Web models

The issue of trust is a key issue for the Semantic Web and several relevant models have been proposed. Any of these could potentially apply here, especially if cLabels are encoded in RDF.

For example, TriQL.P is a query language that allows trust to be evaluated by comparing data drawn from different sources. In the WCL context one might see this working by querying cLabels from different sources describing the same group of URIs.

The University of Maryland's Trust Project looks at ways of building trust on the Web using shared personal movie ratings, shared contacts files etc.

W3C's Annotea project offers mechanisms for sharing annotations and bookmarks - ideas that find expression in things like social network sites and bookmark sharing systems. These networks can readily be leveraged to add trust to, and make use of, cLabels in a variety of ways.

6 Encodings

Given the use cases and the detailed requirements derived from them, the Content Label Incubator Group believes that RDF provides the best and most appropriate technology for content labels. Section 6.1 describes the RDF-based model and a small set of classes and properties that are the basis for defining labeling schemes. A specific labeling scheme is created by defining instances of these classes and using the properties to define the relationships between those instances.

That said, the group has been careful to express its aims independently of any particular technology and recognizes that other approaches are equally possible and may be more appropriate, or just feel more natural, to some people. For this reason, we have sketched out some initial ideas for other encodings in section 6.2. These are not fully worked through but are offered as hints for others to follow if they so wish.

6.1 RDF Model

The RDF model is based substantially on work done under the QUATRO project which ran under the European Union's Safer Internet Programme [SIP]. The project in turn took ideas developed under previous working groups. The model developed under QUATRO is known as RDF Content Labels [RDF-CL], the original documentation for which remains available. Note: the model presented here is not a simple copy of RDF-CL. Many changes and simplifications have been made under the WCL-XG.

The concepts used in the RDF model are first introduced by way of an extended example. These are then formalized in the schema description.

6.1.1 Introducing the Extended example

A typical labeling scheme consists of one or more categories that group together related descriptors for which there are defined legal values and zero or more descriptors that have no value. The presence of the latter indicates that a particular feature is claimed to be true of the resource being described.

To create a trivial example, a labeling authority (LA) might define "Appearance" as a category within which there are descriptors for:

Color with legal values of Black, Red and Green
Percentage transparency (pt) with legal values between 0 and 30.

The LA further defines "Matt" and "Shiny" properties that are either true or false.

Assuming relevant namespace declarations have been made, an example of a cLabel written in RDF and serialized XML might then be:

<wcl:ContentLabel>
  <ex:color>Green</ex:color>
  <ex:pt>20</ex:pt>
  <ex:shiny>true</ex:shiny>
</wcl:ContentLabel>

Example 1: The core of a basic cLabel

This simply means that, according to the example labeling scheme, the labeled resource is green, 20% transparent and shiny.

To extend the example, a complete set of cLabels might be created as shown in Example 2 and made available at http://resources.example.com/labels.rdf.

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#"
  xmlns:ex="http://labelingauthority.example.org/vocabulary#">

  <wcl:ContentLabel rdf:ID="label_1">
    <ex:color>Green</ex:color>
    <ex:pt>20</ex:pt>
    <ex:shiny>true</ex:shiny>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_2">
    <ex:color>Red</ex:color>
    <ex:pt>10</ex:pt>
    <ex:matt>true</ex:matt>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_3">
    <ex:color>Black</ex:color>
    <ex:pt>0</ex:pt>
  </wcl:ContentLabel>

</rdf:RDF>

Example 2: An RDF/XML instance containing 3 cLabels

Resources can now be created that link to specific cLabels. Any number of resources that appear shiny and green with 20% transparency can include the link tag below (or its HTTP Response header equivalent):

Similar tags can be included to identify resources to be red, 10% transparent and matt, or, black and 0% transparent.

6.1.2 Restricting the scope of a cLabel

Any resource, anywhere on the Web can link to a cLabel using the mechanism described above. There are circumstances where this is a useful facility. Equally, however, a content provider or labeling authority may wish to restrict the scope of their labels. This is achieved by using the wcl:Scope class within the wcl:Ruleset as shown in example 3.

The Scope class is provided so that hosts can easily be listed in a separate RDF instance if required. Whether this is done or not, the same SPARQL query will give the list of hosts.

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#"
  xmlns:ex="http://labelingauthority.example.org/vocabulary#">

  <wcl:Ruleset>
    <wcl:hasScope>
      <wcl:Scope>
        <wcl:host>resources.example.co.uk</wcl:host>
        <wcl:host>resources.example.com</wcl:host>
      </wcl:Scope>
    </wcl:hasScope>
  </wcl:Ruleset>

  <wcl:ContentLabel rdf:ID="label_1">
    <ex:color>Green</ex:color>
    <ex:pt>20</ex:pt>
    <ex:shiny>true</ex:shiny>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_2">
    <ex:color>Red</ex:color>
    <ex:pt>10</ex:pt>
    <ex:matt>true</ex:matt>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_3">
    <ex:color>Black</ex:color>
    <ex:pt>0</ex:pt>
  </wcl:ContentLabel>

</rdf:RDF>

Example 3: A repeat of example 2 with host restrictions applied

Example 3 declares the same cLabels as example 2; however, an agent MUST only consider the labels applicable to URIs that match either of the wcl:host elements declared within the wcl:Scope class. That is, URIs on the resources.example.com or resources.example.co.uk hosts. If a resource from another host links to the cLabel, an agent MUST disregard the description.

cLabels may be applied to subdomains of the listed hosts. In our examples with declared hosts of resources.example.com and resources.example.co.uk, cLabels would also be applicable to, for instance, http://search.resources.example.com, and http://support.resources.example.co.uk.

If it is necessary to restrict the scope of the labels further, this can be achieved by providing one or more wcl:hasURI properties for the wcl:Scope class. Labels MUST only be applied to URIs that are within at least one host AND that match all Perl 5 regular expressions given in wcl:hasURI elements. This is particularly useful where several users may have space on a common host, as is often the case with personal websites on their ISP's domain.
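The scope test described in this section can be sketched as follows. The function name `in_scope` is hypothetical, Python's `re` module stands in for Perl 5 regular expressions, and the `/~alice/` pattern in the usage note is an invented illustration of a personal web space.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit


def in_scope(uri: str, hosts: Iterable[str], uri_patterns: Iterable[str] = ()) -> bool:
    """True if `uri` is on at least one host AND matches every hasURI pattern."""
    host = (urlsplit(uri).hostname or "").lower()
    # A wcl:host value covers the host itself and all of its subdomains.
    on_host = any(
        host == h.lower() or host.endswith("." + h.lower()) for h in hosts
    )
    # all() over an empty sequence is True, so hosts alone define the scope
    # when no wcl:hasURI properties are given.
    return on_host and all(re.search(p, uri) for p in uri_patterns)
```

For example, with hosts `resources.example.com` and `resources.example.co.uk` and the single pattern `/~alice/`, only URIs within that user's space on either host would be in scope.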

6.1.3 Identifying default labels

It is possible to define a default cLabel for the group of resources identified in the wcl:Ruleset class using the hasDefaultLabel property.

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#"
  xmlns:ex="http://labelingauthority.example.org/vocabulary#">

  <wcl:Ruleset>
    <wcl:hasScope>
      <wcl:Scope>
        <wcl:host>resources.example.co.uk</wcl:host>
        <wcl:host>resources.example.com</wcl:host>
      </wcl:Scope>
    </wcl:hasScope>
    <wcl:hasDefaultLabel rdf:resource="#label_1" />
  </wcl:Ruleset>

  <wcl:ContentLabel rdf:ID="label_1">
    <ex:color>Green</ex:color>
    <ex:pt>20</ex:pt>
    <ex:shiny>true</ex:shiny>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_2">
    <ex:color>Red</ex:color>
    <ex:pt>10</ex:pt>
    <ex:matt>true</ex:matt>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_3">
    <ex:color>Black</ex:color>
    <ex:pt>0</ex:pt>
  </wcl:ContentLabel>

</rdf:RDF>

Example 4: A repeat of example 3 with a default label declared

In this example, everything on the resources.example.com and resources.example.co.uk hosts is described by label 1. This can be overridden by linking the resource directly to specific labels or via rules (discussed below).

It is expected that, for the sake of optimization, agents will locate the wcl:Ruleset class first to quickly ascertain whether the RDF instance carries information that can be applied to the resource in question (by checking for hosts) and, if so, find any default data supplied.

6.1.4 Rules for identifying a label

In examples 2 and 3, a resource was linked to a specific cLabel. A content provider will need to include a specific link to the correct label in each resource. This is shown diagrammatically in figure 6.1 below.

Figure 6.1. Each resource includes a link to a specific label within the RDF instance at labels.rdf.

This will be a convenient approach for some providers. However, we began to extend this in example 4 with the introduction of a default label. We now add a simple set of application rules that can be used to override that default. This will allow resources to be linked to a single RDF instance containing multiple labels. An agent can process those rules to select the correct cLabel for a given resource. This is shown diagrammatically in figure 6.2.

Figure 6.2. A simple rule set allows all content to link to the same RDF instance and the correct label to be identified.

The advantage of this system is that a content management system or suite of servers can be configured to include exactly the same link tag with all resources, for example:

The labels for an entire website, no matter its size, can be managed by editing a single file. Example 5 shows how such rules can be encoded.

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#"
  xmlns:ex="http://labelingauthority.example.org/vocabulary#">

  <wcl:Ruleset>
    <wcl:hasScope>
      <wcl:Scope>
        <wcl:host>resources.example.co.uk</wcl:host>
        <wcl:host>resources.example.com</wcl:host>
      </wcl:Scope>
    </wcl:hasScope>
    <wcl:hasDefaultLabel rdf:resource="#label_1" />

    <wcl:rules rdf:parseType="Collection">

      <rdf:Description>
        <wcl:hasURI>sectionA</wcl:hasURI>
        <wcl:hasLabel rdf:resource="#label_2" />
      </rdf:Description>

      <rdf:Description>
        <wcl:hasProperty>
          <ex:ManufacturingTime>
            <ex:after>12.00</ex:after>
          </ex:ManufacturingTime>
        </wcl:hasProperty>
        <wcl:hasLabel rdf:resource="#label_3" />
      </rdf:Description>

    </wcl:rules>

  </wcl:Ruleset>

  <wcl:ContentLabel rdf:ID="label_1">
    <ex:color>Green</ex:color>
    <ex:pt>20</ex:pt>
    <ex:shiny>true</ex:shiny>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_2">
    <ex:color>Red</ex:color>
    <ex:pt>10</ex:pt>
    <ex:matt>true</ex:matt>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_3">
    <ex:color>Black</ex:color>
    <ex:pt>0</ex:pt>
  </wcl:ContentLabel>

</rdf:RDF>

Example 5. A similar listing to previous examples with added application rules.

Example 5 shows that the rules are presented in an ordered list. The first rule states that the resources resolved by dereferencing any URI that matches the Perl 5 regular expression "sectionA" will be associated with label 2. When combined with the rest of the available data, it can be seen that everything in section A of resources.example.com and resources.example.co.uk is red, 10% transparent and matt.

The second rule relates not to the URI of the resource but to its properties. Specifically in this case, the example vocabulary allows us to construct a rule that says "anything created after midday is Black and opaque."

It is possible to construct rules that have more than one hasURI and/or hasProperty property. In such circumstances, all conditions must be met for the rule to be satisfied.
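To illustrate how an agent might process such rules, the sketch below reads a cut-down version of the Ruleset from Example 5 and selects a label for a candidate URI. It is a simplification under several assumptions: it walks the RDF/XML element tree with Python's standard library instead of using a full RDF parser, it ignores the Scope, it skips any rule containing wcl:hasProperty because evaluating such rules needs knowledge of the resource itself, and the function name `select_label` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WCL = "http://www.w3.org/2004/12/q/contentlabel#"

# A reduced copy of the Ruleset in Example 5 (URI-based rule only).
RULESET_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#">
  <wcl:Ruleset>
    <wcl:hasDefaultLabel rdf:resource="#label_1" />
    <wcl:rules rdf:parseType="Collection">
      <rdf:Description>
        <wcl:hasURI>sectionA</wcl:hasURI>
        <wcl:hasLabel rdf:resource="#label_2" />
      </rdf:Description>
    </wcl:rules>
  </wcl:Ruleset>
</rdf:RDF>"""


def select_label(rdf_xml: str, uri: str):
    """Return the label reference chosen for `uri`, or None if there is none."""
    ruleset = ET.fromstring(rdf_xml).find(f"{{{WCL}}}Ruleset")
    if ruleset is None:
        return None
    rules = ruleset.find(f"{{{WCL}}}rules")
    # Child elements are visited in document order, i.e. rule order.
    for rule in (rules if rules is not None else []):
        if rule.find(f"{{{WCL}}}hasProperty") is not None:
            continue  # property-based rules are outside this sketch
        patterns = [e.text or "" for e in rule.findall(f"{{{WCL}}}hasURI")]
        label = rule.find(f"{{{WCL}}}hasLabel")
        # Every hasURI pattern in the rule must match for the rule to apply.
        if patterns and label is not None and all(re.search(p, uri) for p in patterns):
            return label.get(f"{{{RDF}}}resource")
    default = ruleset.find(f"{{{WCL}}}hasDefaultLabel")
    return default.get(f"{{{RDF}}}resource") if default is not None else None
```

A production agent would more likely load the data into an RDF store and query it, since equivalent RDF can be serialized in ways that this tree walk would not recognize.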

6.1.5 Provenance of a label

As the use cases and detailed requirements make clear, cLabel metadata is of great importance. Since a wcl:Ruleset and cLabels are themselves resources, the full expressivity of RDF and related technologies can be used to describe them. Where cLabels and a wcl:Ruleset are delivered in a single RDF instance, an RDF description rdf:about that instance can be created that typically will give details of things like who created the data and when. In example 2 we suggested that the RDF instance be available at http://resources.example.com/labels.rdf. Thus we could create a description like that shown in Example 6 below.

<rdf:Description rdf:about="">
  <dc:creator>
    <foaf:Organization>
      <foaf:isPrimaryTopicOf rdf:resource="http://example.org/foaf.rdf" />
    </foaf:Organization>
  </dc:creator>
  <wcl:authorityFor>http://labelingauthority.example.org/vocabulary#</wcl:authorityFor>
</rdf:Description>

Example 6: The creator of the RDF instance is declared, along with the namespace(s) for which it is responsible.

As a single RDF instance may contain cLabels from any number of LAs, and, indeed, any other RDF data, WCL provides a facility for an LA to declare for which descriptions (i.e. which namespaces) it is responsible. We further define a set of vocabulary terms that SHOULD be used when providing cLabel metadata [Link to this].

6.1.6 Classifications

In the examples given so far, we have created content labels that provide detailed descriptions of resources. As well as this granular approach, there is support for broad classifications that may or may not need further processing to be understood. Example 7 below shows the wcl:hasDefaultClassification and wcl:hasClassification properties.

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#"
  xmlns:ex="http://movies.example.org/vocabulary#">

  <wcl:Ruleset>
    <wcl:hasScope>
      <wcl:Hosts>
        <wcl:host>resources.example.com</wcl:host>
      </wcl:Hosts>
    </wcl:hasScope>
    <wcl:hasDefaultLabel rdf:resource="#label_1" />
    <wcl:hasDefaultClassification rdf:resource="http://movies.example.org/ratings#PG" />
    <wcl:rules rdf:parseType="Collection">

      <rdf:Description>
        <wcl:hasURI>disasterShip</wcl:hasURI>
        <wcl:hasLabel rdf:resource="#label_2" />
        <wcl:hasClassification rdf:resource="http://movies.example.org/ratings#Teen" />
      </rdf:Description>

    </wcl:rules>

  </wcl:Ruleset>

  <wcl:ContentLabel rdf:ID="label_1">
    <ex:violence>mild</ex:violence>
    <ex:peril>mild</ex:peril>
    <ex:language>none</ex:language>
    <ex:sex>mildInnuendo</ex:sex>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_2">
    <ex:violence>mild</ex:violence>
    <ex:peril>frequent</ex:peril>
    <ex:language>occasional</ex:language>
    <ex:sex>occasional</ex:sex>
  </wcl:ContentLabel>

</rdf:RDF>

Example 7: Showing the use of broad classifications as well as granular descriptions

Example 7 shows that, by default, everything on resources.example.com is classified as "PG" by movies.example.org. The rule states, however, that if the URI matches the Perl 5 regular expression disasterShip then the resource has been classified as "Teen" and that a full description is provided by label 2. The wcl:hasClassification property MAY either be included in a rule as shown or in a cLabel.
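As a rough illustration of how a consumer might resolve a classification, the following Python sketch reads a cut-down version of the Example 7 ruleset with the standard library's xml.etree.ElementTree. It treats the RDF/XML as plain XML, which works only for this particular serialization; a real processor would be expected to use an RDF parser. The function name classification_for is hypothetical, and Python's re module is used in place of Perl 5 regular expressions.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WCL = "http://www.w3.org/2004/12/q/contentlabel#"
NS = {"rdf": RDF, "wcl": WCL}

# Cut-down copy of the Example 7 ruleset (labels and scope omitted).
RULESET_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#">
  <wcl:Ruleset>
    <wcl:hasDefaultClassification rdf:resource="http://movies.example.org/ratings#PG" />
    <wcl:rules rdf:parseType="Collection">
      <rdf:Description>
        <wcl:hasURI>disasterShip</wcl:hasURI>
        <wcl:hasClassification rdf:resource="http://movies.example.org/ratings#Teen" />
      </rdf:Description>
    </wcl:rules>
  </wcl:Ruleset>
</rdf:RDF>"""


def classification_for(uri, ruleset_xml=RULESET_XML):
    """Return the classification URI that applies to `uri`, or None."""
    root = ET.fromstring(ruleset_xml)
    ruleset = root.find("wcl:Ruleset", NS)
    resource_attr = f"{{{RDF}}}resource"  # expanded name of rdf:resource

    # Rules are tried in document order; all hasURI patterns must match.
    for rule in ruleset.findall("wcl:rules/rdf:Description", NS):
        patterns = [el.text for el in rule.findall("wcl:hasURI", NS)]
        classification = rule.find("wcl:hasClassification", NS)
        if classification is not None and all(re.search(p, uri) for p in patterns):
            return classification.get(resource_attr)

    # No rule applied: fall back to the default classification, if any.
    default = ruleset.find("wcl:hasDefaultClassification", NS)
    return default.get(resource_attr) if default is not None else None
```

A URI containing disasterShip resolves to the "Teen" classification; any other URI falls back to the default "PG" classification.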

6.1.7 Inclusions

It is recognized that in providing detailed descriptions of resources, differences between those descriptions may be relatively slight. This has the potential to lead to a great deal of repetition between labels. For this reason, a wcl:include is provided, as exemplified below.

<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:wcl="http://www.w3.org/2004/12/q/contentlabel#"
  xmlns:ex1="http://labelingauthority.example.org/vocabulary#"
  xmlns:ex2="http://morelabels.example.org/vocabulary#">

  <wcl:Ruleset>
    <wcl:hasScope>
      <wcl:Hosts>
        <wcl:host>resources.example.com</wcl:host>
      </wcl:Hosts>
    </wcl:hasScope>
    <wcl:hasDefaultLabel rdf:resource="#label_1" />
    <wcl:rules rdf:parseType="Collection">

      <rdf:Description>
        <wcl:hasURI>sectionA</wcl:hasURI>
        <wcl:hasLabel rdf:resource="#label_2" />
      </rdf:Description>

    </wcl:rules>

  </wcl:Ruleset>

  <wcl:ContentLabel rdf:ID="label_1">
    <wcl:include rdf:resource="#blockA" />
    <wcl:include rdf:resource="#blockC" />
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="label_2">
    <wcl:include rdf:resource="#blockB" />
    <wcl:include rdf:resource="#blockC" />
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="blockA">
    <ex1:color>Green</ex1:color>
    <ex1:tp>20</ex1:tp>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="blockB">
    <ex1:color>Red</ex1:color>
    <ex1:tp>10</ex1:tp>
  </wcl:ContentLabel>

  <wcl:ContentLabel rdf:ID="blockC">
    <ex2:shape>Square</ex2:shape>
    <ex2:material>Glass</ex2:material>
  </wcl:ContentLabel>

</rdf:RDF>

Example 8. Showing the use of inclusions to avoid repeating descriptions

As can be seen in example 8, we define three blocks of description. Block C describes shape and material whereas blocks A and B are the ones we saw in earlier examples. By including these blocks we can see that, by default, all resources at resources.example.com are square, made of glass, green and 20% transparent. Those in section A, however, are still square and made of glass but are now red and 10% transparent.
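The effect of inclusion can be sketched as a simple merge of property sets. The Python below is an illustration only: the dictionary representation of labels and the expand_label function are hypothetical, and the sketch assumes that properties stated directly in a label take precedence over included ones, a case that does not arise in Example 8 and that the text does not address.

```python
# Labels from Example 8 as plain dictionaries (hypothetical representation).
LABELS = {
    "label_1": {"include": ["blockA", "blockC"]},
    "label_2": {"include": ["blockB", "blockC"]},
    "blockA": {"properties": {"ex1:color": "Green", "ex1:tp": "20"}},
    "blockB": {"properties": {"ex1:color": "Red", "ex1:tp": "10"}},
    "blockC": {"properties": {"ex2:shape": "Square", "ex2:material": "Glass"}},
}


def expand_label(label_id, labels, _path=frozenset()):
    """Flatten a label by merging in every block it includes."""
    if label_id in _path:
        # Guard against a label that, directly or indirectly, includes itself.
        raise ValueError(f"circular wcl:include involving {label_id!r}")
    path = _path | {label_id}

    label = labels[label_id]
    merged = {}
    for included in label.get("include", []):
        merged.update(expand_label(included, labels, path))
    # Assumption: a label's own properties override included ones.
    merged.update(label.get("properties", {}))
    return merged
```

Expanding label_1 yields the green, 20% transparent, square, glass description, and expanding label_2 yields the red, 10% transparent variant that shares the same shape and material.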

NB. WCL makes it a condition that a given wcl:Ruleset can only define one default label and this can be overridden by one other cLabel at a time. This does not prevent multiple Rulesets defining multiple labels for the same resources.

To Do

Extend Scope discussion and rules in light of Grouping discussion
Write in actual schema description
Include clabel metadata vocab definitions
Show an extension - suggest use RDF-CL's frequent/occasional scenes stuff
All the stuff we've forgotten so far...

6.2 Other methods

RSS/Atom

Steve Ives has sent this:

<?xml version="1.0"?>
<rss>
   <channel>
      <title>Mobilestreams</title>
      <link>http://www.tastefultunes.com/</link>
      <description>Find tasteful ringtones at this site.</description>

      <!-- use one of these for crawl/refresh/frequency -->
      <ttl>1440</ttl>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <lastBuildDate>Tue, 10 Jun 2003 09:41:01 GMT</lastBuildDate>

      <!-- future idea: define a collection of default values for
           item fields up top, saves repeating prices etc -->

      <!-- the tags below start off from RSS, but then
           mix in tag names from Dublin Core and iTunes RSS.
           Namespacing could be added, but has been left out
           to keep the appearance more simple. -->

           
      <!-- list of items -->

      <item>
         <title>Crazy</title>
         <link>URL to page that will provide the item (buy link)</link>
         <description>Optional description of item</description>

         <!-- date when the item was added/released,
              could help annotate a search result with
              "new" if it's recently published -->
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>

         
         <!-- unique ID for the item to assist in comparing new
              crawls of this file with the previous crawl.
              Any string will do, but using a URL helps
              ensure uniqueness.
               -->
         <guid>http://www.tastefultunes.com/catalog.html#item573</guid>

         <!-- 'type' is dublin core, and we could spec a starting
              vocabulary, e.g.
                 music
                 realtone
                 polyphonic
                 video ringtone
                 video
                 game
                 wallpaper
                 theme
                 -->
         <type>music</type>

         <!-- 'creator' is DC,
              'author' is itunes (with podcasts in mind) -->
         <creator>Grarls Barkley</creator>

         <!-- 'rights' is DC -->
         <rights>Copyright 2006 Grarls Barkley</rights>
         
         <!-- borrow the image tag idea from itunes -->
         <image href="http://www.tastefultunes.com/images/item573"/>
         
         <!-- 'format' is dublin core, but DC doesn't spec
              the contents. So we could propose IMT, or "auto" -->
         <format>audio/mpeg</format>

         <!-- borrow duration if wanted from itunes -->
         <duration>180</duration>
         
         <!-- scope for many other properties of the asset, e.g.
            dimensions
            color depth
            frame rate
            etc
            -->

         <!-- borrow category from itunes, maybe don't nest the
              levels like itunes does.
              Need a vocabulary for these categories.
              -->
         <category>Alternative</category>


         <!-- price could be a simple string, including
              currency symbol, idea is just to repeat the
              string in any search result rather than try to
              understand it.
              Pricing needs to be able to address all kinds
              of options, e.g. subscriptions, multibuys,
              conditional discounts, first time purchases.
              A plain string might be the easiest way to
              leave this open - after all, it's the owner's
              site that will be performing the transaction. -->
         <price>$11.99</price>

         
         <!-- One problem still to address: how to handle multiple instances of the same item, but in different formats. Writing a script to build this file from a database may need logic to spot this. Yahoo are proposing media groups, where each different version is its own <item> but has a tag that declares itself part of a group. It might be better instead to list the available formats inside the one item tag, but this is a bit (not much) harder to generate with a script.
            However, if this version of the system assumes the target site will do the handset type resolution, then we don't need to catalog the type information, but the extraction script will still need to list one item per collection of formats that item is available in. -->
      </item>

      <item>
        ....
      </item>

      ....
      
  </channel>
</rss>

Steve also notes areas for future work

need a vocabulary for categories
need more thought on how to represent price
we may need namespaces in each tag name to properly extend RSS

Need to do the same for Atom

Kjetil's stuff on XPath

Pantelis's stuff on microformats here

Appendix 1 The use cases

Use Case 1: Profile matching

The original use case given in the charter has been simplified by reducing the number of essential actors to three:

CONTENT PROVIDER (metadata provider)
PORTAL PROVIDER (metadata consumer)
END USER

One can imagine a range of scenarios with very similar characteristics that amount to "sub-use cases."

Sub use case 1A: END USER discovers content appropriate to their device ["MobileOK"]

Diagrammatic representation of use case 1A

Fig 1. Diagrammatic version of sub-use case 1A.

END USER visits portal
END USER's device profile is extracted with reference to a separate metadata store
END USER searches for a topic of interest.
PORTAL PROVIDER matches END USER's device profile with content profiles provided by CONTENT PROVIDER.
PORTAL PROVIDER provides search results matching this topic.
PORTAL PROVIDER filters results based on the metadata encoded in the content with regard to the "mobile friendliness" of the content/presentation in question and the known properties of the device profile according to business rules.

Sub use case 1B: END USER discovers content appropriate to their age-group ["Child Protection"]

Diagrammatic representation of use case 1B

Fig 2. Diagrammatic version of sub-use case 1B.

END USER visits portal
END USER's user profile is extracted from a repository, perhaps the portal's own.
END USER searches for a topic of interest.
PORTAL PROVIDER matches END USER's age with content profiles provided by CONTENT PROVIDER.
PORTAL PROVIDER provides search results matching this topic.
PORTAL PROVIDER filters results based on the metadata encoded in the content with regard to the "child friendliness" of the content/presentation in question and the known age of the user according to local business rules.

Use case 2: Trustmark Scheme operator to content portal

The Example Trustmark Scheme reviews online traders, providing a trustmark for those that meet a set of published criteria. The scheme operator wishes to make its trustmark available as machine readable code as well as a graphic so that content aggregators, search engines and end-user tools can recognize and process them in some way.

The trustmark operator maintains a database of sites it has approved and makes this available in two ways:

First, the labelled site includes a link to the database. This can be achieved in a variety of ways such as an XHTML Link tag, an HTTP Response Header or even a digital watermark in an image. A user agent visiting the site detects and follows the link to the trustmark scheme's database from which it can extract the description of the particular site in real time.

Secondly, the scheme operator makes the full database available in a single file for download and processing offline.

Since the actual data comes directly from the trustmark scheme operator, it is not open to corruption by the online trader and can therefore be considered trustworthy to a large degree. To reduce the risk of spoofing, however, the data is digitally signed.

Use case 3: Website to end-user

Mrs Chaplin teaches 7 year olds at her local school. An IT enthusiast, she makes her teaching materials available through her personal website. She adds metadata to her material that describes the subject matter and curriculum area. In order to gain wider trust in her work she submits her site for review by her local education authority and a trustmark scheme. Both reviewers offer Mrs Chaplin a digitally signed, machine-readable version of their trustmark that she can add to her site. She merges these into a single pool of metadata to which she adds content descriptors from a recognized vocabulary that declare the site to contain no sex or violent content. She adds her own digital signature to the metadata. The set of digital signatures allow user-agents to identify the origin of the various assertions made. As in use case 2, links from the content itself point to this metadata.

Since the metadata is on the website itself, user agents are unlikely to take the assertions made in the metadata at face value. Unlike the trustmark operator, the local authority does not operate a web service that can support the label. It does, however, digitally sign its labels and publish its public key on its website. This can be used to verify that it is indeed the local education authority that issued the relevant data in the label.

Separately, a user-agent can interrogate the trustmark operator's database in real time to check whether Mrs Chaplin is authorized to make the assertions relevant to their namespace. Furthermore, the use of a recognized vocabulary for the content description means that a content analyser trained to work with that vocabulary can give a probabilistic assessment of the accuracy of the relevant data.

Taken together, these multiple sources of data can provide confidence in the quality of the content and the local authority trustmark which is not directly testable. The multiple data sources may be further supported by recognising that Mrs Chaplin's work is cited in many online bookmarks, blog entries and postings to education-related message boards.

Use Case 4: Rich Metadata for RSS/ATOM

Dave Cook's website offers reviews of children's films and the site is summarized in both RSS and ATOM feeds. Most of the films reviewed have an MPAA rating of G and/or British Board of Film Classification rating of U. This is declared in a rating for the channel as a whole. However, Dave includes reviews of some films rated PG-13 or 12 respectively which is declared at the item level and overrides the channel level metadata.

The actual rating information comes from an online service operated by the relevant film classification board itself and is identified using a URL and human-readable text. The movie itself is identified by either an ISAN number or the relevant Internet Movie Database entry ID number. As with use case 2, trust is implicit given the source of the data, which is indicated by a link to Dave's site's policy.

Separately, Fred combines Dave Cook's and other review feeds to provide alternative reviews of the movies by transforming the ATOM feeds into RDF and creating an aggregate view using SPARQL queries.

Use Case 5: MLK and the KKK

Fred operates an antiracism education site which aggregates and curates content from around the Web. Fred wants to label the resources that he aggregates such that educational and other institutions may harvest the resources and associated commentary and metadata automatically for reuse within their instructional support systems, etc.

One of the ways in which Fred wants to curate resources is to say about them that they are pedagogically useful but politically noxious. For example, some sites on the Web make claims about Martin Luther King, Jr that are motivated by a racist ideology and are historically indefensible. Fred's vocabulary allows him to claim that such resources are pedagogically useful for purposes of analysis, but that they are otherwise suspicious and should only be consumed by students in an age-appropriate manner or with appropriate supervision, etc. In other words, Fred needs to be able to make sharply divergent claims about resources: (1) that they are noteworthy, and (2) that they are, from his perspective, dangerous or noxious or troublesome.

Use Case 6: Scalar Classification

A company named Advance Medical Inc. reviews medical literature on the Web based on a range of quality criteria such as effectiveness and research evidence. The criteria may be changed according to current scientific and professional developments. The review process leads to literature being classified as belonging to one of 5 levels as follows.

Level A : clear evidence.
Level B : supportive evidence.
Level C : poor evidence.
Level D : expert opinion with explicit critical appraisal.
Level E : no evidence.

The company produces label data that declares the classification level value and provides a summary of each document. The label data is stored in a metadata repository which can be accessed via the Web.

M.D. Smith uses the label data in the repository to make decisions about health care for specific clinical circumstances.

Requirements

The following requirements have been approved by the group.

It must be possible to group resources and to make assertions that apply to the group as a whole (This is fundamental to all use cases)
It must be possible to self-label (use cases 2 - 4)
To provide as complete a description as possible, labels must be able to contain unambiguous assertions using more than one vocabulary (all use cases, especially 3)
It must be possible for a content provider to make reference to third party labels (use case 2)
It must be possible to make assertions about the accuracy of claims made in a label (use case 2)
The system must be readily usable within a commercial workflow, allowing a content provider to apply metadata to a large number of resources in one step and to separate the activity of labelling from that of content creation, where desired (use case 1).
The system must support a concept of default and override metadata. The mechanism that is used to determine where overrides apply should be based on the full concept of a URI rather than, for example, just a web URL. (use cases 1, 2 and 4)
It should be possible to ascertain unambiguously who created the label, using techniques such as digital signatures, S/MIME etc. (use cases 2, 3 and perhaps 5)
It must be possible for a labeling organization to make all its labels available as a single database (use case 2)
It should be possible to include assertions from an unlimited number of vocabularies in a single content label. Assertions from each vocabulary may be subject to its own verification mechanism (use case 3)
Labels should support a human-readable summary as well as the machine-readable code (all).
Labels should validate to formal published grammars (all)
It must be possible to encode labels in a compact/efficient form (all)
It must be possible to identify whether labels are self-applied or created by a third party. (use case 2)
It must be possible to discover a feedback mechanism for reporting false claims (all, especially use case 2)
It must be possible to associate labels with a 'time to live' and/or 'expiry date' (all, especially use case 2)
It must be possible to discover the date and time when a label was last verified and by whom. (all, especially use case 2)
It must be possible to describe the process by which data in labels is to be verified (use case 3)

Although not a testable requirement, the group has further resolved the principle that adding labels to resources should be easy and intuitive. It is recognized that this is likely to be made so through implementation but the design of the system should nonetheless be mindful of the principle (use case 3).

W3C Content Labels

W3C Incubator Group Report Draft 0.7.1 12 July 2006
