For "structured data" think of such things as
spreadsheets, address books, configuration parameters, financial
transactions, technical drawings, etc. Programs that produce such data often
also store it on disk, for which they can use either a binary format or a
text format. The latter allows you, if necessary, to look at the data without
the program that produced it. XML is a set of rules, guidelines, conventions,
whatever you want to call them, for designing text formats for such data, in
a way that produces files that are easy to generate and read (by a computer),
that are unambiguous, and that avoid common pitfalls, such as lack of
extensibility, lack of support for internationalization/localization, and
platform-dependency.
Like HTML, XML makes use of tags (words bracketed by '<' and
'>') and attributes (of the form
name="value"
), but
while HTML specifies what each tag & attribute means (and often how the
text between them will look in a browser), XML uses the tags only to delimit
pieces of data, and leaves the interpretation of the data completely to the
application that reads it. In other words, if you see "<p>" in an XML
file, don't assume it is a paragraph. Depending on the context, it may be a
price, a parameter, a person, a p... (b.t.w., who says it has to be a word
with a "p"?)
XML files are text files, as I said above, but even less than HTML are they
meant to be read by humans. They are text files, because that allows experts
(such as programmers) to more easily debug applications, and in
emergencies, they can use a simple text editor to fix a broken XML file. But
the rules for XML files are much stricter than for HTML. A forgotten tag, or
an attribute without quotes makes the file unusable, while in HTML such
practice is often explicitly allowed, or at least tolerated. It is written in
the official XML specification: applications are not allowed to try
to second-guess the creator of a broken XML file; if the file is broken, an
application has to stop right there and issue an error.
There is XML 1.0, the specification
that defines what "tags" and "attributes" are, but around XML 1.0, there is a
growing set of optional modules that provide sets of tags & attributes,
or guidelines for specific tasks. There is, e.g., Xlink (still in development as of November
1999), which describes a standard way to add hyperlinks to an XML file.
XPointer & XFragments (also still being developed)
are syntaxes for pointing to parts of an XML document. (An XPointer is a bit
like a URL, but instead of pointing to documents on the Web, it points to
pieces of data inside an XML file.) CSS, the style sheet language, is
applicable to XML as it is to HTML. XSL (autumn 1999) is the advanced language for expressing style
sheets. It is based on XSLT, a
transformation language that is often useful outside XSL as well, for
rearranging, adding or deleting tags & attributes. The DOM is a standard set of function
calls for manipulating XML (and HTML) files from a programming language.
XML Namespaces is a
specification that describes how you can associate a URL with every single
tag and attribute in an XML document. What that URL is used for is up to the
application that reads the URL, though. (RDF, W3C's standard for metadata, uses it
to link every piece of metadata to a file defining the type of that data.) XML Schemas 1 and 2 help developers to
precisely define their own XML-based formats. There are several more modules
and tools available or under development. Keep an eye on W3C's technical reports page.
Since XML is a text format, and it uses tags to delimit
the data, XML files are nearly always larger than comparable binary formats.
That was a conscious decision by the XML developers. The advantages of a text
format are evident (see 3 above), and the disadvantages can usually be
compensated at a different level. Disk space isn't as expensive anymore as it
used to be, and programs like zip and gzip can compress files
very well and very fast. Those programs are available for nearly all
platforms (and are usually free). In addition, communication protocols such
as modem protocols and HTTP/1.1 (the core protocol of
the Web) can compress data on the fly, thus saving bandwidth as effectively
as a binary format.
Development of XML started in 1996 and it is a W3C standard since February
1998, which may make you suspect that this is rather immature technology. But
in fact the technology isn't very new. Before XML there was SGML, developed
in the early '80s, an ISO standard since 1986, and widely used for large
documentation projects. And of course HTML, whose development started in
1990. The designers of XML simply took the best parts of SGML, guided by the
experience with HTML, and produced something that is no less powerful than
SGML, but vastly more regular and simpler to use. Some evolutions, however,
are hard to distinguish from revolutions... And it must be said that while
SGML is mostly used for technical documentation and much less for other kinds
of data, with XML it is exactly the opposite.
These I don't know yet.
By
choosing XML as the basis for some project, you buy into a large and growing
community of tools (one of which may already do what you need!) and engineers
experienced in the technology. Opting for XML is a bit like choosing SQL for
databases: you still have to build your own database and your own
programs/procedures that manipulate it, but there are many tools available
and many people that can help you. And since XML, as a W3C technology, is
license-free, you can build your own software around it without paying
anybody anything. The large and growing support means that you are also not
tied to a single vendor. XML isn't always the best solution, but it is
always worth considering.
Copyright © 1999-2000 W3C® ( MIT, INRIA, Keio), All Rights Reserved.