This section describes the syntax
for "Uniform Resource Locators" (URLs):
that is, basically physical addresses
of objects which are retrievable
using protocols already deployed
on the net. The generic syntax provides
a framework for new schemes for names
to be resolved using as yet undefined
protocols.
The syntax is described in two parts.
Firstly, we give the syntax rules
of a completely specified name; secondly,
we give the rules under which parts
of the name may be omitted in a well-defined
context.
URL syntax
A complete URL consists of a naming
scheme specifier followed by a string
whose format is a function of the
naming scheme. For locators of information
on the internet, a common syntax
is used for the IP address part.
A BNF description of the URL syntax
is given in a later section. The
components are as follows. Fragment
identifiers and partial URLs are
not involved in the basic URL definition.
Scheme
Within the URL of an object, the first
element is the name of the scheme,
separated from the rest of the object
by a colon. The rest of the URL follows
the colon in a format depending on
the scheme.
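The split at the first colon can be sketched as follows. This is only
an illustration: the function name split_scheme and the example URL
are assumptions, not part of the syntax definition.

```python
def split_scheme(url: str) -> tuple:
    """Split a complete URL at the first colon into (scheme, rest)."""
    scheme, colon, rest = url.partition(":")
    if not colon or not scheme:
        raise ValueError("no scheme found in %r" % url)
    return scheme, rest


# Hypothetical example URL; the format of `rest` depends on the scheme.
print(split_scheme("ftp://ftp.example.org/pub/readme.txt"))
# -> ('ftp', '//ftp.example.org/pub/readme.txt')
```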
Internet protocol parts
Those schemes which refer to internet
protocols mostly have a common syntax
for the rest of the object name.
This starts with a double slash "//"
to indicate its presence, and continues
until the following slash "/". Within
that section are:
- An optional user name, if this must
be quoted to the server, followed
by a commercial at sign "@". (Use
of this field is discouraged. A password
could be encoded after the user name,
delimited by a colon, but this is only
useful when the password is public,
in which case it should not be necessary,
so that is also discouraged.)
- The internet domain name of the host
in RFC1037 format (or, optionally
and less advisably, the IP address
as four decimal numbers separated
by periods).
- The port number, if it is not the
default number for the protocol,
given in decimal notation after
a colon.
Path
The rest of the locator is known
as the "path". It may define details
of how the client should communicate
with the server, including information
to be passed transparently to the
server without any processing by
the client.
The path is interpreted in a manner
dependent on the protocol being used.
However, when it contains slashes,
these must imply a hierarchical structure.
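As a rough sketch of how the parts described above fit together, the
following function takes everything after the scheme's colon and
separates the user name, host, port and path. The function name, the
returned dictionary layout and the example host are all assumptions
made for illustration; real schemes may impose further rules.

```python
def parse_internet_part(rest: str) -> dict:
    """Parse '//[user@]host[:port][/path]' into its components.

    `rest` is the part of the URL that follows the scheme's colon.
    """
    if not rest.startswith("//"):
        raise ValueError("no '//' internet protocol part present")
    # The section runs from "//" up to the following slash.
    section, _, path = rest[2:].partition("/")
    user = None
    if "@" in section:
        # Optional (and discouraged) user name, ended by "@".
        user, _, section = section.partition("@")
    host, _, port_text = section.partition(":")
    # None here means "use the default port for the protocol".
    port = int(port_text) if port_text else None
    # The path is returned without the delimiting slash.
    return {"user": user, "host": host, "port": port, "path": path}


parts = parse_internet_part("//anonymous@ftp.example.org:2121/pub/docs/readme.txt")
print(parts)
# -> {'user': 'anonymous', 'host': 'ftp.example.org', 'port': 2121, 'path': 'pub/docs/readme.txt'}

# Slashes in the path imply a hierarchy, so it can be split into levels.
print(parts["path"].split("/"))
# -> ['pub', 'docs', 'readme.txt']
```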
When a system uses a local addressing
scheme, it is useful to provide a
mapping from local addresses into
URLs so that objects
within the addressing scheme may
be referred to globally, and possibly
accessed through gateway servers.
Any mapping scheme may be defined
provided it is unambiguous, reversible,
and produces valid URLs. It is recommended
that where hierarchical aspects to
the local naming scheme exist, they
be mapped onto the hierarchical URL
path syntax in order to allow the
partial form to be used.
The following encoding method shall
be used for mapping WAIS, FTP, Prospero
and Gopher addresses onto URLs. Where
the local naming scheme uses ASCII
characters which are not allowed
in the URL, these may be represented
in the URL by a percent sign "%"
followed by two hexadecimal digits
(0-9, A-F) giving the ISO Latin 1
code for that character. Character
codes other than those allowed by
the syntax shall not be used unencoded
in a URL.
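A minimal sketch of this encoding is given below. The set ALLOWED is
an assumption chosen for illustration only; the authoritative set of
characters is whatever the BNF for URLs permits. The function names
encode and decode are likewise hypothetical.

```python
import string

# Assumption: an illustrative set of characters passed through unencoded.
ALLOWED = set(string.ascii_letters + string.digits + "$-_@.&+/")


def encode(local: str) -> str:
    """Replace each disallowed character by "%" plus two hex digits."""
    out = []
    for ch in local:
        if ch in ALLOWED:
            out.append(ch)
        else:
            # ISO Latin 1 code of the character; characters outside
            # Latin 1 raise UnicodeEncodeError here.
            out.append("%{:02X}".format(ch.encode("latin-1")[0]))
    return "".join(out)


def decode(encoded: str) -> str:
    """Reverse encode(): turn each %XX back into its Latin 1 character."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "%":
            out.append(chr(int(encoded[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


print(encode("read me.txt"))   # -> read%20me.txt
print(encode("caf\xe9"))       # -> caf%E9
print(decode("read%20me.txt")) # -> read me.txt
```

Because "%" is itself outside the allowed set, it is encoded as %25,
which keeps the mapping reversible.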
The same encoding method may be used
for encoding characters whose use,
although technically allowed in a
URL, would be unwise due to problems
of corruption by imperfect gateways
or misrepresentation due to the use
of variant character sets, or which
would simply be awkward in a given
environment. Because a % sign always
indicates an encoded character, a
URL may be made safer simply by encoding
any characters considered unsafe,
while leaving already encoded characters
still encoded. Similarly, in cases
where a larger set of characters
is acceptable, % signs can be selectively
and reversibly expanded.
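The two operations described above might be sketched as follows. The
names make_safer and expand, and the choice of "~" as an unsafe
character, are assumptions for illustration. Input characters are
assumed to lie within ISO Latin 1.

```python
def make_safer(url: str, unsafe: str) -> str:
    """Encode characters listed in `unsafe`, leaving existing %XX intact."""
    out = []
    i = 0
    while i < len(url):
        ch = url[i]
        if ch == "%":
            # Already-encoded character: copy the whole escape untouched.
            out.append(url[i:i + 3])
            i += 3
            continue
        out.append("%{:02X}".format(ord(ch)) if ch in unsafe else ch)
        i += 1
    return "".join(out)


def expand(url: str, acceptable: str) -> str:
    """Decode %XX only where the character is listed in `acceptable`."""
    out = []
    i = 0
    while i < len(url):
        if url[i] == "%":
            ch = chr(int(url[i + 1:i + 3], 16))
            # Never expand "%" itself, or the result would be ambiguous.
            keep_encoded = ch == "%" or ch not in acceptable
            out.append(url[i:i + 3] if keep_encoded else ch)
            i += 3
        else:
            out.append(url[i])
            i += 1
    return "".join(out)


safer = make_safer("//host/a~b%20c", "~")
print(safer)                   # -> //host/a%7Eb%20c
print(make_safer(safer, "~"))  # -> //host/a%7Eb%20c
print(expand(safer, "~"))      # -> //host/a~b%20c
```

Applying make_safer a second time changes nothing, since every "%"
is taken to introduce an encoded character and is skipped.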
(Note: If a new naming scheme is
introduced which encodes binary data
as opposed to text, then a more compact
encoding such as pure hexadecimal
or base 64 would be more appropriate.)