Copyright ©2001 Republica Corporation.
This document outlines the Data Extraction Language. DEL is an XML format for describing data conversion processes from other data formats to XML. A DEL script specifies how to locate and extract fragments from input data and where to insert them in the resulting XML format. The DEL processor executing the DEL script can use the extracted data to either create a new XML document or modify an existing XML document by creating new elements and attributes at locations specified with XPath expressions.
This document was submitted to the World Wide Web Consortium on June 21, 2001 (see Submission Request, W3C Staff Comment) with the intention that the W3C use it as a basis for furthering the work on any-to-XML transformations. For a full list of all acknowledged Submissions, please see Acknowledged Submissions to W3C.
This document is a NOTE made available by the W3C for discussion only. Publication of this Note by W3C indicates no endorsement by W3C or the W3C Team, or any W3C Members. W3C has had no editorial control over the preparation of this Note. This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.
A list of current W3C technical documents can be found at the Technical Reports page.
1 INTRODUCTION
2 DEFINITION OF DATA EXTRACTION LANGUAGE
2.1 wrapper
2.2 template
2.3 repeat
2.4 map
2.5 extract
2.6 test
2.7 runtemplate
2.8 set
2.9 charsetmap (values)
2.10 stringconvert (register)
2.11 count
3 BASIC EXAMPLE OF DATA EXTRACTION LANGUAGE
3.1 input data
3.2 DEL script
3.3 output XML
4 ADVANCED EXAMPLE WITH CHARACTER SET MAPPING AND OUTPUT TEMPLATE
4.1 input data
4.2 DEL script
4.3 XML template
4.4 output XML
APPENDIX 1. DATA EXTRACTION LANGUAGE DTD
1 INTRODUCTION
This document outlines the Data Extraction Language. DEL is an XML [XML] format for describing data conversion processes from other data formats to XML. A DEL script specifies how to locate and extract fragments from input data and where to insert them in the resulting XML format.
The DEL processor executing the DEL script can use the extracted data to either create a new XML document or modify an existing XML document by creating new elements and attributes at locations specified with XPath [XPath] expressions. A DEL script along with the source data are given to the DEL processor which performs the actual data conversion according to the script. The output from the DEL processor is a well-formed XML document containing the desired parts of the source data.
Data fragments in the input data can be located by searching for patterns and matching regular expressions [REGEX]. The extracted data fragments are first stored temporarily in the DEL processor's registers (or stack), where they can be refined before output and possibly re-used as search patterns. The data is then read from the registers or stack and placed into its proper position in the DOM [DOM] tree of the resulting XML document. When placing the data in the XML, a cursor function keeps track of the current position. The cursor position can be modified using XPath expressions.
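As a rough illustration of the cursor idea described above, the following Python sketch models an output tree with a current position. It is not part of DEL: the class Output and its method names are hypothetical, only the cursor paths "/root-name" and ".." are handled, and it assumes that creating an element moves the cursor into the new element (an assumption inferred from the way the example script in 3.2 steps back with "..").

```python
import xml.etree.ElementTree as ET


class Output:
    """Hypothetical model of an output tree plus a cursor (the current element)."""

    def __init__(self, root_name):
        self.root = ET.Element(root_name)
        self.path = [self.root]  # ancestors of the cursor, root first

    @property
    def cursor(self):
        return self.path[-1]

    def create_element(self, name, content=None):
        child = ET.SubElement(self.cursor, name)
        child.text = content
        self.path.append(child)  # assumption: the cursor moves into the new element

    def create_attribute(self, name, content):
        self.cursor.set(name, content)

    def move_cursor(self, xpath):
        # Only two path forms are handled in this sketch, not full XPath.
        if xpath == "..":
            if len(self.path) > 1:
                self.path.pop()
        elif xpath == "/" + self.root.tag:
            self.path = [self.root]
        else:
            raise NotImplementedError(xpath)


out = Output("root")
out.create_element("row")
out.create_element("field1", "3")
out.move_cursor("..")                    # back up to <row>
out.create_element("field2", "4")
out.create_attribute("test", "success")  # lands on <field2>, the current element
print(ET.tostring(out.root, encoding="unicode"))
# <root><row><field1>3</field1><field2 test="success">4</field2></row></root>
```

Keeping the list of ancestors makes ".." a simple pop. A real processor would evaluate arbitrary XPath expressions against the DOM instead.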
2 DEFINITION OF DATA EXTRACTION LANGUAGE
The following sections describe the use of Data Extraction Language elements, their attributes and attribute values.
NOTE: The attribute values "stack" and "regX" (where X is a user-defined register name of at least one character) are parsed before use: the actual value is then taken from memory (the stack or the named register). For any other attribute value, the given value itself is used.
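The parsing rule in the note above can be sketched in Python as follows. This is only one possible reading: the function name resolve is hypothetical, and the sketch assumes that reading "stack" removes the top item, which this document does not state.

```python
def resolve(value, registers, stack):
    """Return the actual value for a parsed attribute value."""
    if value == "stack":
        return stack.pop()       # assumption: reading the stack pops its top item
    if value.startswith("reg") and len(value) > 3:
        return registers[value]  # "regX": X is at least one character long
    return value                 # any other value is used as given


registers = {"reg1": "4"}
stack = ["Some numbers"]
print(resolve("stack", registers, stack))    # Some numbers
print(resolve("reg1", registers, stack))     # 4
print(resolve("success", registers, stack))  # success
```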
2.1 wrapper
Function: wrapper element is the container (root element) for DEL script rules.
NOTE: When creating a new output XML, the first map element should include maptype attribute with value "createDocument".
Contains: First template and map elements and then optionally one of the following elements: repeat, map, extract, test, set or runtemplate.
Attributes: No attributes.
Syntax:
<wrapper>
</wrapper>
2.2 template
Function: template element is used as a container for common sequences of other elements. runtemplate element (for loading the template, see 2.7) is then used to call the content of template element.
Contains: Optionally one of the following elements:
repeat, map, extract, test, set or runtemplate.
Attributes: name (REQUIRED)
Syntax:
<template name="attribute_value">
</template>
Example:
<template name="MakeDate">
</template>
2.3 repeat
Function: repeat element repeats its content, i.e. the elements located under it.
Contains: Optionally one of the following elements:
repeat, map, extract, test, set or runtemplate.
Attributes: times (REQUIRED)
Times attribute:
NOTE: In a repeat loop, when the DEL processor tries to extract data but does not find the given (regular) expression where expected (at the beginning of the data under processing), it sets the "dataStreamError" status to stop the repeat loop. If the "dataStreamError" status is set outside a repeat loop, the whole wrapping process stops.
Syntax:
<repeat times="attribute_value">
</repeat>
<repeat times="*">
</repeat>
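The loop behaviour described in the note above can be sketched in Python. The names repeat, DataStreamError and read_next are hypothetical; the sketch assumes that a numeric times value is an iteration count and that "*" means "until dataStreamError", as suggested by the syntax lines above.

```python
class DataStreamError(Exception):
    """Stands in for the "dataStreamError" status."""


def repeat(times, body):
    """Run body() `times` times, or until DataStreamError when times is "*".

    Returns the number of completed iterations.
    """
    limit = None if times == "*" else int(times)
    completed = 0
    while limit is None or completed < limit:
        try:
            body()
        except DataStreamError:
            break  # the status ends this loop only; processing continues after it
        completed += 1
    return completed


pending = ["1", "2", "3"]
seen = []


def read_next():
    # Hypothetical loop body: fails once the data runs out.
    if not pending:
        raise DataStreamError("no more data")
    seen.append(pending.pop(0))


print(repeat("*", read_next))  # 3
```

An error raised outside any repeat call would propagate to the caller, which corresponds to the whole wrapping process stopping.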
2.4 map
Function: map element inserts content to the output as XML node(s). It moves the cursor in the output XML to specify the insertion point for the node(s). It also keeps track of the current element.
Contains: No elements
Attributes: The attributes and their possible values are:
Syntax:
Example:
2.5 extract
Function: extract element gets data from the source data.
Contains: Optionally one of the following elements:
repeat, map, extract, test, set or runtemplate.
Attributes: exptype (REQUIRED), expression (REQUIRED, will be parsed), save (OPTIONAL, will be parsed)
Exptype and expression attributes: exptype attribute tells the DEL processor which part of the data should be extracted and where it is saved.
expression attribute specifies exptype.
Possible exptype attribute values are:
Save attribute: save attribute tells where the extractable data is saved.
Possible save values are:
Syntax:
Example
Tip!
In case of large source data, try a multi-level extraction script. In such a script, extract element contains other DEL elements (which in turn can contain more). This allows you to 'chop up' the data into more manageable chunks.
Consider an XML source file with <table> as root element containing <tr> elements (table rows) and those containing <td> (cells):
<table>
<tr>
<td>1</td>
<td>4</td>
</tr>
<tr>
<td>5</td>
</tr>
<tr>
<td>74</td>
<td>99</td>
<td>100</td>
</tr>
</table>
Now look at the script below. The first (top-level) extract element moves the cursor to a larger section of the source data (a section within "<tr>" tags). The second extract element then puts the cursor within the "<td>" tags located in that larger section. The third extract element extracts the content within the "<td>" tags and saves it to a register, from where it will later be placed in the result XML:
<extract exptype="over" expression="</tr>">
<extract exptype="over" expression="<td>"/>
<extract exptype="upto" expression="</td>" save="reg1"/>
</extract>
Such a multi-level script can also be reused and edited more effectively than one where all extract elements are on the same level.
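To make the multi-level idea concrete, here is a small Python sketch of the script above. It rests on assumptions that this document does not spell out: "over" is taken to consume the data up to and including the expression, "upto" to consume the data before the expression, and a nested extract to work only on the chunk consumed by its parent. The class Data and the exception DataStreamError are hypothetical names.

```python
class DataStreamError(Exception):
    """Raised when an expression is not found in the remaining data."""


class Data:
    """A piece of source data with a read position."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _find(self, expr):
        idx = self.text.find(expr, self.pos)
        if idx < 0:
            raise DataStreamError(expr)
        return idx

    def over(self, expr):
        """Consume the data up to and including expr; return the consumed chunk."""
        end = self._find(expr) + len(expr)
        chunk, self.pos = self.text[self.pos:end], end
        return chunk

    def upto(self, expr):
        """Consume the data before expr; return the consumed chunk."""
        end = self._find(expr)
        chunk, self.pos = self.text[self.pos:end], end
        return chunk


source = Data("<table>\n<tr>\n<td>1</td>\n<td>4</td>\n</tr>\n"
              "<tr>\n<td>5</td>\n</tr>\n</table>")

row = Data(source.over("</tr>"))  # top level: the chunk holding the first row
row.over("<td>")                  # second level: step inside the first cell
reg1 = row.upto("</td>")          # second level: save the cell content
print(reg1)  # 1
```

Because the inner steps see only the row chunk, a missing "<td>" cannot make them run on into the next row.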
2.6 test
Function: test element compares two values. The content of the test element will be processed when the test result is "true".
Contains: Optionally one of the following elements:
repeat, map, extract, test, set or runtemplate.
Attributes: testtype (REQUIRED), value1 (REQUIRED, parsed), value2 (OPTIONAL, parsed)
Testtype, value1 and value2: testtype attribute indicates what kind of test is processed.
Possible testtype values are:
NOTE: The contents of the attributes value1 and value2 are parsed before comparing.
Syntax:
Example
2.7 runtemplate
Function: runtemplate element runs elements from a predefined template (for defining the template, see 2.2).
Contains: No elements.
Attributes: nameref (REQUIRED)
nameref attribute:
Syntax:
<runtemplate nameref="attribute_value"/>
<runtemplate nameref="MakeDate"/>
2.8 set
Function: set element gives instructions to the DEL parser for processing the data.
Contains: No elements.
Attributes:
parameter attribute values (value1 and value2):
NOTE: The place for storing the combination ("append") must be a register (e.g. "regCombo"). Other values can either be registers or strings.
Syntax:
<set parameter="attribute_value" value1="attribute_value" value2="attribute_value"/>
Example:
<set parameter="memory" value1="stack" value2="Testing"/>
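The two parameter values that appear in this document's examples, "memory" and "append", can be sketched in Python as follows. The function names are hypothetical, and the behaviour is inferred from the examples only: "memory" is assumed to store value2 in the stack or register named by value1, and "append" to store the concatenation of value1 and value2 in the register named by value3 (used in 4.2).

```python
def is_register(name):
    # "regX", where X is at least one character long
    return name.startswith("reg") and len(name) > 3


def read(value, registers):
    """Registers are looked up; any other value is a literal string."""
    return registers[value] if is_register(value) else value


def set_memory(value1, value2, registers, stack):
    """Sketch of parameter="memory": store value2 in the place named by value1."""
    if value1 == "stack":
        stack.append(read(value2, registers))
    elif is_register(value1):
        registers[value1] = read(value2, registers)
    else:
        raise ValueError("target must be the stack or a register")


def set_append(value1, value2, value3, registers):
    """Sketch of parameter="append": value3 receives value1 followed by value2."""
    if not is_register(value3):
        raise ValueError("the combination must be stored in a register")
    registers[value3] = read(value1, registers) + read(value2, registers)


registers, stack = {}, []
set_memory("stack", "Testing", registers, stack)
set_memory("regCC", "1", registers, stack)
set_append("value_", "regCC", "regRes", registers)
print(stack, registers["regRes"])  # ['Testing'] value_1
```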
2.9 charsetmap (values)
Function: Creates a character set map where you can define which characters to replace with which characters.
For running the character set map, stringconvert needs to be defined (see below).
Attributes: name
Contains: values
Syntax:
<charsetmap name="value">
<values search="replaceable" replace="replacer"/>
</charsetmap>
Example:
<charsetmap name="MyMap">
<values search="AAABBB" replace="GISSE"/>
</charsetmap>
2.10 stringconvert (register)
Function: Loads and runs a character set map defined by charsetmap (see above).
Attributes:
Contains: register
Syntax:
<stringconvert use="mapname" overlapping="true|false">
<register nameref="registername"/>
</stringconvert>
Example
<stringconvert use="MyMap" overlapping="true">
<register nameref="reg1"/>
<register nameref="reg2"/>
</stringconvert>
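The difference between the two overlapping settings is not defined in this section, so the following Python sketch is only one plausible interpretation, chosen because it reproduces the "0.12-10" and "0-12-10" results shown in 4.4. With overlapping "false", each position of the input is replaced at most once. With overlapping "true", each search/replace pair is applied to the result of the previous pair. The function name stringconvert mirrors the element, and MAP1 mirrors the map1 definition in 4.2.

```python
def stringconvert(text, rules, overlapping):
    """Apply (search, replace) pairs to text."""
    if overlapping:
        # Later rules also see the output of earlier rules.
        for search, replace in rules:
            text = text.replace(search, replace)
        return text
    # Non-overlapping: scan once; replaced text is never examined again.
    out, i = [], 0
    while i < len(text):
        for search, replace in rules:
            if search and text.startswith(search, i):
                out.append(replace)
                i += len(search)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


MAP1 = [(",", "."), (".", "-"), ("a", "b"), ("b", "c")]
print(stringconvert("0,12.10", MAP1, overlapping=False))  # 0.12-10
print(stringconvert("0,12.10", MAP1, overlapping=True))   # 0-12-10
print(stringconvert("abba", MAP1, overlapping=True))      # cccc
```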
2.11 count
Function: A simple arithmetic calculator.
Contains: No elements.
Attributes: parameter (REQUIRED), value1 (REQUIRED), value2, value3
parameter attribute values:
Syntax:
<count parameter="function" value1="number|register" value2="number|register" value3="registername"/>
Example:
<count parameter="minus" value1="23" value2="20" value3="reg3"/>
<count parameter="decimal" value1="1"/>
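A minimal Python sketch of the four arithmetic functions is given below. It assumes that value1 and value2 are either numbers or register names and that the result is written as text to the register named by value3, as the syntax line above suggests. The "decimal" parameter is left out because its meaning is not described here. The function name count mirrors the element; everything else is hypothetical.

```python
import operator
from decimal import Decimal

OPERATIONS = {
    "plus": operator.add,
    "minus": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def count(parameter, value1, value2, value3, registers):
    """Compute value1 <parameter> value2 and store the result in register value3."""

    def operand(value):
        # A register name is replaced by its content; anything else is a number.
        return Decimal(registers[value] if value in registers else value)

    result = OPERATIONS[parameter](operand(value1), operand(value2))
    registers[value3] = str(result)
    return registers[value3]


registers = {"regValue": "3", "regValue2": "5"}
print(count("minus", "23", "20", "reg3", registers))              # 3
print(count("plus", "regValue", "regValue2", "regCC", registers))  # 8
```

Decimal is used instead of float so that whole-number results print without a trailing ".0".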
3 BASIC EXAMPLE OF DATA EXTRACTION LANGUAGE
Below are example rules (3.2) that extract data from normal HTML pages (3.1), producing a result XML file (3.3).
3.1 input data
Here is an example input data HTML file that will be processed below:
<html>
<body>
<p>Test Material</p>
<table>
<tr><td>Some numbers<td>Other numbers</tr>
<tr><td>1<td>2
<tr><td>3</td><td>4</tr>
</table>
</body>
</html>
3.2 DEL script
This example DEL script contains the following rules:
<wrapper>
<map maptype="createDocument" node="root"/>
<map maptype="createElement" node="description"/>
<map maptype="createTextNode" node="Extracted data"/>
<extract exptype="over" expression="<table>"/>
<extract exptype="upto" expression="</table>">
<repeat times="*">
<extract exptype="over" expression="<tr>"/>
<extract exptype="over" expression="<td>"/>
<extract exptype="upto" expression="<" save="stack"/>
<map maptype="moveCursor" node="/root"/>
<map maptype="createElement" node="row"/>
<map maptype="createElement" node="field1" content="stack"/>
<map maptype="moveCursor" node=".."/>
<extract exptype="over" expression="<td>"/>
<extract exptype="upto" expression="<" save="reg1"/>
<map maptype="createElement" node="field2" content="reg1"/>
<test testtype="equal" value1="4" value2="reg1">
<map maptype="createAttribute" node="test" content="success"/>
</test>
</repeat>
</extract>
</wrapper>
3.3 output XML
The output XML from the above input data (3.1) and rules (3.2) is as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
<description>Extracted data</description>
<row>
<field1>Some numbers</field1>
<field2>Other numbers</field2>
</row>
<row>
<field1>1</field1>
<field2>2</field2>
</row>
<row>
<field1>3</field1>
<field2 test="success">4</field2>
</row>
</root>
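For readers who want to trace the example by hand, the following Python sketch walks through the same steps as the rules in 3.2 on the input data in 3.1. It is not a DEL processor. The helpers over and upto are hypothetical stand-ins for the "over" and "upto" extraction types. The sketch also assumes that surrounding whitespace is trimmed from extracted values, and it creates each row only after both fields have been read.

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<body>
<p>Test Material</p>
<table>
<tr><td>Some numbers<td>Other numbers</tr>
<tr><td>1<td>2
<tr><td>3</td><td>4</tr>
</table>
</body>
</html>"""


def over(text, pos, expr):
    """Return the position just after the next occurrence of expr."""
    idx = text.find(expr, pos)
    if idx < 0:
        raise LookupError(expr)  # plays the role of dataStreamError
    return idx + len(expr)


def upto(text, pos, expr):
    """Return (data before the next occurrence of expr, position of expr)."""
    idx = text.find(expr, pos)
    if idx < 0:
        raise LookupError(expr)
    return text[pos:idx], idx


root = ET.Element("root")
ET.SubElement(root, "description").text = "Extracted data"

table, _ = upto(HTML, over(HTML, 0, "<table>"), "</table>")
pos = 0
while True:  # corresponds to <repeat times="*">
    try:
        pos = over(table, pos, "<tr>")
        pos = over(table, pos, "<td>")
        field1, pos = upto(table, pos, "<")
        pos = over(table, pos, "<td>")
        field2, pos = upto(table, pos, "<")
    except LookupError:
        break
    row = ET.SubElement(root, "row")
    ET.SubElement(row, "field1").text = field1.strip()  # trimming is an assumption
    cell = ET.SubElement(row, "field2")
    cell.text = field2.strip()
    if cell.text == "4":  # corresponds to the test element
        cell.set("test", "success")

print(ET.tostring(root, encoding="unicode"))
```

Apart from indentation and the XML declaration, the printed tree has the same elements, text and attribute as the output shown in 3.3.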
4 ADVANCED EXAMPLE WITH CHARACTER SET MAPPING AND OUTPUT TEMPLATE
Below is another example, in which a DEL script modifies an output XML template with source data. The character set mapping feature is used to search for and replace characters.
4.1 input data
Input data can be complex and hard to extract, as follows:
<area id=#2202>
#0,12.10
#23,1.514
#1,4.444
#abba
<area id=#2203>
#amount
#contact
<area id=#2204>
#3
#5
<area id=#2205>
#end
4.2 DEL script
This example script makes a character set conversion using charsetmap and stringconvert elements (first with "false" overlapping and then with "true" overlapping). It produces two result XML files using "getTemplate". It also uses count to add register values and "append" to combine register values.
<wrapper>
<!-- The character set map below defines the following replacements:
',' (comma) -> '.' (dot)
'.' (dot) -> '-' (hyphen)
'a' -> 'b'
'b' -> 'c'
Note that no data conversion is made yet. The actual conversion is made using the stringconvert element. -->
<charsetmap name="map1">
<values search="," replace="."/>
<values search="." replace="-"/>
<values search="a" replace="b"/>
<values search="b" replace="c"/>
</charsetmap>
<repeat times="1">
<map maptype="getTemplate" node="document"/>
<extract exptype="mark" expression="regBegin"/>
<extract exptype="re_over" expression="\r\n<">
<map maptype="moveCursor" node="/document/charsetmap/original"/>
<set parameter="memory" value1="regC" value2="0"/>
<repeat times="*">
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regValue"/>
<count parameter="plus" value1="regC" value2="1" value3="regCC"/>
<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>
<map maptype="createElement" node="regRes" content="regValue"/>
<map maptype="moveCursor" node=".."/>
</repeat>
</extract>
<extract exptype="set" expression="regBegin"/>
<extract exptype="re_over" expression="\r\n<">
<map maptype="moveCursor" node="/document/charsetmap/overlappingfalse"/>
<set parameter="memory" value1="regC" value2="0"/>
<repeat times="*">
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regValue"/>
<stringconvert use="map1" overlapping="false">
<register nameref="regValue"/>
</stringconvert>
<count parameter="plus" value1="regC" value2="1" value3="regCC"/>
<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>
<map maptype="createElement" node="regRes" content="regValue"/>
<map maptype="moveCursor" node=".."/>
</repeat>
</extract>
<extract exptype="set" expression="regBegin"/>
<extract exptype="re_over" expression="\r\n<">
<map maptype="moveCursor" node="/document/charsetmap/overlappingtrue"/>
<set parameter="memory" value1="regC" value2="0"/>
<repeat times="*">
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regValue"/>
<stringconvert use="map1" overlapping="true">
<register nameref="regValue"/>
</stringconvert>
<count parameter="plus" value1="regC" value2="1" value3="regCC"/>
<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>
<map maptype="createElement" node="regRes" content="regValue"/>
<map maptype="moveCursor" node=".."/>
</repeat>
</extract>
<extract exptype="mark" expression="regMiddle"/>
<extract exptype="over" expression="2204>"/>
<extract exptype="mark" expression="regCount"/>
<extract exptype="re_over" expression="<">
<map maptype="moveCursor" node="/document/count/original"/>
<repeat times="*">
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regValue"/>
<stringconvert use="map1" overlapping="false">
<register nameref="regValue"/>
</stringconvert>
<count parameter="plus" value1="regC" value2="1" value3="regCC"/>
<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>
<map maptype="createElement" node="regRes" content="regValue"/>
<map maptype="moveCursor" node=".."/>
</repeat>
</extract>
<extract exptype="set" expression="regCount"/>
<map maptype="moveCursor" node="/document/count/result"/>
<set parameter="memory" value1="regC" value2="0"/>
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regValue"/>
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regValue2"/>
<extract exptype="re_over" expression="\r\n"/>
<count parameter="plus" value1="regValue" value2="regValue2" value3="regCC"/>
<map maptype="createTextNode" node="" content="regCC"/>
<extract exptype="set" expression="regMiddle"/>
<map maptype="moveCursor" node="/document/append/original"/>
<set parameter="memory" value1="regC" value2="0"/>
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regV1"/>
<extract exptype="re_over" expression="\r\n"/>
<extract exptype="over" expression="#"/>
<extract exptype="re_upto" expression="\r\n" save="regV2"/>
<map maptype="createElement" node="value1" content="regV1"/>
<map maptype="moveCursor" node=".."/>
<map maptype="createElement" node="value1" content="regV2"/>
<map maptype="moveCursor" node=".."/>
<map maptype="moveCursor" node="/document/append/result"/>
<set parameter="append" value1="regV1" value2="regV2" value3="regV3"/>
<map maptype="createTextNode" node="" content="regV3"/>
<map maptype="documentReady" />
</repeat>
</wrapper>
4.3 XML template
The DEL script (in 4.2) uses the following kind of XML template (using "getTemplate" to call this template):
<document>
<charsetmap>
<original/>
<overlappingfalse/>
<overlappingtrue/>
</charsetmap>
<count>
<original/>
<result/>
</count>
<append>
<original/>
<result/>
</append>
</document>
4.4 output XML
The output XML from the above input data and rules is as follows:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<document>
<charsetmap>
<original>
<value_1>0,12.10</value_1>
<value_1>23,1.514</value_1>
<value_1>1,4.444</value_1>
<value_1>abba</value_1>
</original>
<overlappingfalse>
<value_1>0.12-10</value_1>
<value_1>23.1-514</value_1>
<value_1>1.4-444</value_1>
<value_1>bccb</value_1>
</overlappingfalse>
<overlappingtrue>
<value_1>0-12-10</value_1>
<value_1>23-1-514</value_1>
<value_1>1-4-444</value_1>
<value_1>cccc</value_1>
</overlappingtrue>
</charsetmap>
<count>
<original>
<value_1>3</value_1>
<value_1>5</value_1>
</original>
<result>8</result>
</count>
<append>
<original>
<value1>amount</value1>
<value1>contact</value1>
</original>
<result>amountcontact</result>
</append>
</document>
APPENDIX 1. DATA EXTRACTION LANGUAGE DTD
<!DOCTYPE wrapper [
<!ELEMENT wrapper (template*, map, (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)*) >
<!ELEMENT template (repeat | test | map | extract | set |runtemplate | charsetmap | stringconvert | count)*>
<!ATTLIST template
name ID #REQUIRED >
<!ELEMENT repeat (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)* >
<!ATTLIST repeat
times CDATA #REQUIRED >
<!ELEMENT test (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)* >
<!ATTLIST test
testtype (equal | unequal | less | greater | re_equal | re_unequal | contains) #REQUIRED
value1 CDATA #REQUIRED
value2 CDATA #IMPLIED >
<!ELEMENT map EMPTY >
<!ATTLIST map
maptype (getTemplate | documentReady | moveCursor | createDocument | createElement | createElementBefore | createElementAfter | createAttribute | createTextNode | createComment | createCDATA | createProcessingInstructions) #REQUIRED
node CDATA #REQUIRED
content CDATA #IMPLIED >
<!ELEMENT extract (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)* >
<!ATTLIST extract
exptype (length | upto | over | re_upto | re_over| content) #REQUIRED
expression CDATA #REQUIRED
save CDATA #IMPLIED >
<!ELEMENT set EMPTY >
<!ATTLIST set
parameter (doctype | encoding | memory | append | serializeOutput) #REQUIRED
value1 CDATA #REQUIRED
value2 CDATA #IMPLIED
value3 CDATA #IMPLIED >
<!ELEMENT runtemplate EMPTY >
<!ATTLIST runtemplate
nameref IDREF #REQUIRED >
<!ELEMENT charsetmap (values)*>
<!ATTLIST charsetmap
name CDATA #REQUIRED >
<!ELEMENT values EMPTY >
<!ATTLIST values
search CDATA #REQUIRED
replace CDATA #REQUIRED >
<!ELEMENT stringconvert (register)*>
<!ATTLIST stringconvert
use CDATA #REQUIRED
overlapping (true | false) #REQUIRED >
<!ELEMENT register EMPTY>
<!ATTLIST register
nameref CDATA #REQUIRED >
<!ELEMENT count EMPTY >
<!ATTLIST count
parameter (plus | minus | multiply | divide | decimal) #REQUIRED
value1 CDATA #REQUIRED
value2 CDATA #REQUIRED
value3 CDATA #IMPLIED >
]>