This section describes the syntax
for URIs as used in the World Wide
Web initiative. The generic syntax
provides a framework for new schemes
for names to be resolved using as
yet undefined protocols.
URI syntax
A complete URL consists of a naming
scheme specifier followed by a string
whose format is a function of the
naming scheme. For locators of information
on the internet, a common syntax
is used for the IP address part.
A BNF description of the URL syntax
is given in a later section. The
components are as follows. Fragment
identifiers and relative URIs are
not involved in the basic URL definition.
Scheme
Within the URL of an object, the first
element is the name of the scheme,
separated from the rest of the object
by a colon.
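As an illustration of the rule above, the following minimal sketch
separates the scheme from the rest of a URL at the first colon. The
function name split_scheme is an assumption made for this example and
is not part of any specification.

```python
def split_scheme(url):
    """Split a URL at its first colon into (scheme, rest).

    Hypothetical helper: only the "first element, then a colon" rule
    from the text is applied; the rest is returned uninterpreted.
    """
    scheme, sep, rest = url.partition(":")
    if not sep or not scheme:
        raise ValueError("no scheme found in %r" % (url,))
    return scheme, rest


print(split_scheme("http://www.w3.org/albert/bertram/marie-claude"))
# ('http', '//www.w3.org/albert/bertram/marie-claude')
```

Everything after the colon is left untouched here, since its format
depends on the scheme.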
Path
The rest of the URL follows the colon
in a format depending on the scheme.
The path is interpreted in a manner
dependent on the protocol being used.
However, when it contains slashes,
these must imply a hierarchical structure.
Reserved characters
The path in the URI has a significance
defined by the particular scheme.
Typically it is used to encode a
name in a given name space, or an
algorithm for accessing an object.
In either case, the encoding may
use those characters allowed by the
BNF syntax, or hexadecimal encodings
of other characters.
Some of the reserved characters have
special uses as defined here.
The percent sign
The percent sign ("%", ASCII 25 hex)
is used in the encoding scheme and
is never allowed for anything else.
Hierarchical forms
The slash ("/", ASCII 2F hex) character
is reserved for the delimiting of
substrings whose relationship is
hierarchical. This enables partial
forms of the URI. Substrings consisting
of single or double dots ("." or
"..") are similarly reserved.
Note
The similarity to Unix and MS-DOS
filename conventions should be taken
as purely coincidental, and should
not be taken to indicate that URIs
should be interpreted as filenames.
Hash for Fragment Identifiers
The hash ("#", ASCII 23 hex) character
is reserved as a delimiter to separate
the URI of an object from a fragment
identifier.
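A minimal sketch of how the hash delimiter might be applied is given
here. The helper name split_fragment and the fragment value "marie"
are assumptions chosen for illustration only.

```python
def split_fragment(uri):
    """Separate the URI of an object from its fragment identifier.

    Returns (object_uri, fragment); fragment is None when the URI
    contains no hash character. Hypothetical helper for illustration.
    """
    base, sep, fragment = uri.partition("#")
    return base, (fragment if sep else None)


print(split_fragment("http://www.w3.org/albert/bertram#marie"))
# ('http://www.w3.org/albert/bertram', 'marie')
```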
Query strings
The question mark ("?", ASCII 3F
hex) is used to delimit the boundary
between the URL of a queryable object,
and a set of words used to express
a query on that object. When this
form is used, the combined URI stands
for the object which results from
the query being applied to the original
object.
Within the query string, the plus
sign is reserved as shorthand notation
for a space. Therefore, real plus
signs must be encoded. This method
was used to make query URLs easier
to pass in systems which did not
allow spaces.
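The plus-sign convention can be tried out with the Python standard
library, whose quote_plus and unquote_plus functions follow the same
shorthand. This is a sketch only; the query text is a made-up example.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical query words containing a space and a real plus sign.
words = "marie claude+bertram"

encoded = quote_plus(words)
print(encoded)                # marie+claude%2Bbertram
print(unquote_plus(encoded))  # marie claude+bertram
```

The space becomes a plus sign, while the real plus sign is encoded as
%2B so that the two cannot be confused on decoding.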
Unsafe characters
The URI specification states
that in canonical form, certain
characters such as spaces, control
characters, and some characters whose
ASCII code is used differently in
different national character variant
7 bit sets, are not used unencoded.
This is a recommendation for trouble-free
interchange, and as indicated below,
the safe set may, under certain
circumstances, be extended or reduced.
When a system uses a local addressing
scheme, it is useful to provide a
mapping from local addresses into
URLs so that objects
within the addressing scheme may
be referred to globally, and possibly
accessed through gateway servers.
For a new naming scheme, any mapping
scheme may be defined provided it
is unambiguous, reversible, and provides
valid URIs. It is recommended that
where hierarchical aspects of the
local naming scheme exist, they be
mapped onto the hierarchical URL
path syntax in order to allow the
partial form to be used.
It is also recommended that the
conventional scheme below be used
in all cases except for any scheme
which encodes binary data as opposed
to text, in which case a more compact
encoding such as pure hexadecimal
or base 64 might be more appropriate.
For example, the conventional URI
encoding method is used for mapping
WAIS, FTP, Prospero and Gopher addresses
in the URL specification.
Conventional URI encoding scheme
Where the local naming scheme uses
ASCII characters which are not allowed
in the URL, these may be represented
in the URL by a percent sign "%"
followed by two hexadecimal digits
(0-9, A-F) giving the ISO Latin 1
code for that character. Character
codes other than those allowed by
the syntax shall not be used unencoded
in a URL.
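The conventional encoding can be sketched as follows. The ALLOWED set
is only a stand-in, since the characters actually allowed are defined
by the BNF syntax, which is not reproduced in this section; the
function name encode_local_name is likewise an assumption.

```python
import string

# Assumption: a stand-in for the characters the BNF allows unencoded.
ALLOWED = set(string.ascii_letters + string.digits + "-_./")


def encode_local_name(name, allowed=ALLOWED):
    """Replace each disallowed character by "%" and two hex digits.

    The two digits give the ISO Latin 1 code of the character, so a
    character outside ISO Latin 1 raises UnicodeEncodeError.
    """
    out = []
    for ch in name:
        if ch in allowed:
            out.append(ch)
        else:
            code = ch.encode("latin-1")[0]
            out.append("%{:02X}".format(code))
    return "".join(out)


print(encode_local_name("annual report.txt"))  # annual%20report.txt
print(encode_local_name("100%"))               # 100%25
```

Note that a literal percent sign is itself encoded, as %25, so that
every percent sign in the result marks an encoding.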
Reduced or increased safe character sets
The same encoding method may be used
for encoding characters whose use,
although technically allowed in a
URL, would be unwise due to problems
of corruption by imperfect gateways
or misrepresentation due to the use
of variant character sets, or which
would simply be awkward in a given
environment. Because a % sign always
indicates an encoded character, a
URL may be made "safer" simply by
encoding any characters considered
unsafe, while leaving already encoded
characters still encoded. Similarly,
in cases where a larger set of characters
is acceptable, % signs can be selectively
and reversibly expanded.
Before two URIs can be compared,
it is therefore necessary to bring
them to the same encoding level.
However, the reserved characters
mentioned above have a quite different
significance when encoded, and so
may NEVER be encoded and unencoded
in this way.
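One possible way of bringing two URIs to the same encoding level
before comparison is sketched here. The SAFE and KEEP sets and the
function names are assumptions made for this example: SAFE stands in
for the characters considered safe at the chosen level, and KEEP holds
the reserved characters named in this section together with the colon
that follows the scheme.

```python
import re
import string

# Assumption: characters carried unencoded at the chosen level.
SAFE = set(string.ascii_letters + string.digits + "-_.")

# Reserved characters from this section, plus the scheme colon.
# These are never converted between encoded and unencoded forms.
KEEP = set("%/#?+:")

# Matches either a well-formed escape or any single character.
_TOKEN = re.compile(r"%[0-9A-Fa-f]{2}|.", re.DOTALL)


def comparison_key(uri):
    """Return the URI rewritten at one fixed encoding level."""
    out = []
    for token in _TOKEN.findall(uri):
        if len(token) == 3:
            # An escape: expand it only if the character is safe.
            ch = chr(int(token[1:], 16))
            out.append(ch if ch in SAFE else token.upper())
        elif token in SAFE or token in KEEP:
            out.append(token)
        else:
            # Any other unencoded character: encode its Latin 1 code.
            out.append("%{:02X}".format(token.encode("latin-1")[0]))
    return "".join(out)


def same_uri(a, b):
    """Compare two URIs after bringing both to the same level."""
    return comparison_key(a) == comparison_key(b)


base = "http://www.w3.org/albert/bertram/marie-claude"
print(same_uri(base, "http://www.w3.org/albert/bertram/marie%2Dclaude"))  # True
print(same_uri(base, "http://www.w3.org/albert/bertram%2Fmarie-claude"))  # False
```

An encoded slash stays encoded and an unencoded slash stays unencoded,
so the two forms never compare equal. The key is intended for
comparison only and is not necessarily the preferred form of the URI.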
The percent sign intended as such
must always be encoded, as its presence
otherwise always indicates an encoding.
Sequences which start with a percent
sign but are not followed by two
hexadecimal characters are reserved
for future extension.
Example 1
The URIs
http://www.w3.org/albert/bertram/marie-claude
and
http://www.w3.org/albert/bertram/marie%2Dclaude
are identical, as the %2D
encodes a hyphen character.
Example 2
The URIs
http://www.w3.org/albert/bertram/marie-claude
and
http://www.w3.org/albert/bertram%2Fmarie-claude
are NOT identical, as in the second
case the encoded slash does not have
hierarchical significance.
Example 3
The URIs
fxqn:/us/va/reston/cnri/ietf/24/asdf%*.fred
and
news:12345667123%asdghfh@info.cern.ch
are illegal, as all % characters
imply encodings, and there is no
decoding defined for "%*" or "%as"
in this recommendation.
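The rule illustrated by Example 3 can be checked mechanically. The
following sketch reports every percent sign that is not followed by
two hexadecimal digits; the function name undefined_escapes is an
assumption made for this example.

```python
import re

# A percent sign NOT followed by two hexadecimal digits.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def undefined_escapes(uri):
    """Return the offsets of percent signs with no defined decoding."""
    return [m.start() for m in _BAD_PERCENT.finditer(uri)]


for uri in (
    "http://www.w3.org/albert/bertram/marie%2Dclaude",
    "fxqn:/us/va/reston/cnri/ietf/24/asdf%*.fred",
    "news:12345667123%asdghfh@info.cern.ch",
):
    bad = undefined_escapes(uri)
    print(uri, "is legal" if not bad else "is illegal")
```

A URI for which the list is empty contains only well-formed encodings;
the check says nothing about the rest of the syntax.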