Henrik Frystyk, July 94

World-Wide Web Software at CERN

This document is an overview of the World-Wide Web software developed at CERN. The development of the World-Wide Web code was started by Tim Berners-Lee in 1990. Since then, the code has been subject to changes due to modifications in the architectural model, additions of new features, etc. The code is freely available as public domain software with a very mild Copyright statement.

During the last two years, more and more World-Wide Web applications have become available from a large number of software providers on almost every platform connected to the Internet. Many of them are based on the same architectural model as the CERN software but with additional functionality and increased performance. Most of this software is freely available as public domain for educational institutions and other non-profit organizations, whereas commercial companies must pay a fee for using it.

The CERN World-Wide Web software is written in plain C and is especially designed to be used on a large set of different platforms. It has often been discussed whether the CERN code, or especially the Common Code Library, should be rewritten in ANSI C using the IEEE Std 1003.1-1990 standard (commonly referred to as POSIX.1), but this would limit the portability on many platforms currently supported. A newly started collaboration between WWW software providers will expand this portability to also include MS-DOS and Macintosh so that the most popular platforms are covered, from large computers down to PCs. This document describes the following software products maintained at CERN:

The Library of Common Code
The Line Mode Browser
The HTTP Server
The Proxy Server

The Library of Common Code

The CERN World-Wide Web Library of Common Code is a general code base that can be used to build World-Wide Web clients and servers. It contains code for accessing HTTP, FTP, Gopher, News, WAIS, Telnet servers, and the local file system. Furthermore, it provides modules for parsing, managing and presenting hypertext objects to the user and a wide spectrum of generic programming utilities. The Library is the basis of many World-Wide Web applications and all the CERN WWW software is built on top of it. Even though it is written in plain C, many of the data structures used are highly object oriented - an implementation form often referred to as "a poor man's C++". The following figure is an overview of the current architecture of the library. The view is mainly that of the client side of the library. The CERN Proxy Server and the CERN HTTP server have slightly different views of the architecture.

The flow of the library shows that all network communication and data object parsing is handled internally. Only the presentation to the user is left to the client, as this is a very platform-dependent task. The main elements in the figure are explained in the following sections. A more specific description of the implementation of the library is given in Internals and Programmer's guide.
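The "poor man's C++" style mentioned above can be sketched in a few lines of C. This is only an illustration: every name in it (Object, ObjectClass, Document, and the functions) is hypothetical and is not taken from the library. It shows a class as a shared table of function pointers, a subclass as a structure that embeds its base as the first member, and a call dispatched through the table.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names throughout; this is not the library's actual API. */

typedef struct Object Object;

/* The "class": a table of methods shared by all instances. */
typedef struct {
    const char *name;
    /* "Virtual method": writes a description into buf, returns its length. */
    int (*describe)(const Object *self, char *buf, size_t size);
} ObjectClass;

/* The "base class" instance: its first member points to the class. */
struct Object {
    const ObjectClass *isa;
};

/* A "subclass" embeds the base as its first member, so a pointer to a
   Document can also be used as a pointer to an Object. */
typedef struct {
    Object base;
    const char *title;
} Document;

static int document_describe(const Object *self, char *buf, size_t size)
{
    const Document *doc = (const Document *)self;
    return snprintf(buf, size, "%s: %s", self->isa->name, doc->title);
}

static const ObjectClass DocumentClass = { "Document", document_describe };

/* "Constructor": binds the instance to its class. */
void document_init(Document *doc, const char *title)
{
    assert(doc != NULL && title != NULL);
    doc->base.isa = &DocumentClass;
    doc->title = title;
}

/* Dynamic dispatch: the caller only knows that it holds an Object. */
int object_describe(const Object *obj, char *buf, size_t size)
{
    assert(obj != NULL && buf != NULL);
    return obj->isa->describe(obj, buf, size);
}
```

A caller would initialise a Document with document_init and then pass the address of its base member to object_describe; the same call works unchanged for any other "subclass" that supplies its own method table.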

Graphic Object

A graphic object is a displayable entity handled and maintained by the client. It is built from the data contained in a server response upon a successful request initiated by the client. The object can either be built directly from the data, e.g., if the data object returned is an HTML document, or it can be generated by a format converter within the library. An example of the latter is the generation of an HTML object from an FTP directory listing (7-bit ASCII).

Graphic objects generally have to be coded differently for different window systems. The graphic object is responsible for displaying itself, catching mouse clicks, and calling the navigation object in order to follow links. Often the more common term "document" is used to describe the logical entity which a graphic object represents and displays.

For the moment, a graphic object is created and maintained in the client side of the library and the client itself. However, it would be possible to extend the definition of a graphic object to also describe a data object being transferred from the server to the client using the HTTP protocol. The client can then use meta information given in the graphic object to display the raw data in the representation desired or available in the client.

Anchor Manager

Anchors represent parts of graphic objects which may be the sources or destinations of links. There are basically two types of anchors: Parent anchors and child anchors.

Parent Anchors

These represent whole graphic objects (documents). Every graphic object has an associated parent anchor. Associated with a parent anchor is data including:

The title of the associated document, if known. This allows the document's title to be displayed in lists of previous nodes visited, etc., even when the document itself has been freed.
A flag as to whether the document is an index
The address of the document. When a new anchor is created, the code ensures that if an anchor with that address already exists, then this anchor is returned instead, so no duplicates can exist. A problem concerning parent anchors is how to know when two URLs are equivalent, i.e., they point to the same resource on the Web. Host aliases, soft links and other constraints make a direct comparison impossible even if the URLs are canonicalized.
A list of children
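The "no duplicates" rule for parent anchors can be sketched as a find-or-create function. This is a minimal illustration under stated assumptions: the structure, the linked list of known anchors, and the function names are all hypothetical, the list of children is omitted, and addresses are compared with a plain string comparison. As noted above, such a comparison cannot detect host aliases or soft links.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure; field names are illustrative only. */
typedef struct ParentAnchor {
    char *address;              /* address of the document */
    char *title;                /* NULL until the title is known */
    int is_index;               /* nonzero if the document is an index */
    struct ParentAnchor *next;  /* next anchor in the list of known anchors */
} ParentAnchor;

static ParentAnchor *known_anchors = NULL;

static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Returns the existing anchor for this address, or creates a new one.
   Returns NULL only if memory runs out. */
ParentAnchor *anchor_find_or_create(const char *address)
{
    ParentAnchor *a;

    assert(address != NULL);
    for (a = known_anchors; a != NULL; a = a->next)
        if (strcmp(a->address, address) == 0)
            return a;           /* already known: no duplicate is made */

    a = calloc(1, sizeof *a);
    if (a == NULL)
        return NULL;
    a->address = copy_string(address);
    if (a->address == NULL) {
        free(a);
        return NULL;
    }
    a->next = known_anchors;
    known_anchors = a;
    return a;
}
```

Calling anchor_find_or_create twice with the same address returns the same pointer, so a title recorded on the first visit is still available on the second.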

Child Anchors

These represent parts of documents. The graphic object stores the correlation between the identity of the anchor and the actual space shape which is referred to. Child anchors contain:

A pointer to the parent
Address of this anchor relative to the parent anchor

The relationship between parent anchors and child anchors can be illustrated as follows:

An anchor can be the source of zero, one, or many links. It has one "main" link for the (common) case in which it is the source of one link. When posting a data object to, e.g., an NNTP News Group, or when using the POST method in the HTTP Protocol, it is common to have more than one recipient for the data object to be posted. The recipients are all in the "link list" of the anchor; this is explained in a later section on Put and Post.

An anchor may be the destination of zero, one, or many links. The anchor module stores all links known by the program, and so in fact manages a copy of a small part of the Web.
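The link bookkeeping described above (one "main" link plus a list of further links) can be sketched as follows. The structures and function names are hypothetical, and for brevity a single Anchor type stands for both parent and child anchors; the library's own layout may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types; not the library's actual declarations. */
typedef struct Anchor Anchor;
typedef struct Link Link;

struct Link {
    Anchor *dest;
    Link *next;
};

struct Anchor {
    Anchor *parent;       /* NULL for a parent anchor */
    const char *tag;      /* address relative to the parent, or NULL */
    Anchor *main_dest;    /* destination of the "main" link, or NULL */
    Link *more_links;     /* the "link list" holding any further links */
};

/* Adds a link from source to dest. The first link becomes the main link;
   later ones go on the link list. Returns 0 on success, -1 if out of memory. */
int anchor_link(Anchor *source, Anchor *dest)
{
    Link *link;

    assert(source != NULL && dest != NULL);
    if (source->main_dest == NULL) {
        source->main_dest = dest;
        return 0;
    }
    link = malloc(sizeof *link);
    if (link == NULL)
        return -1;
    link->dest = dest;
    link->next = source->more_links;
    source->more_links = link;
    return 0;
}

/* Number of links for which this anchor is the source. */
int anchor_link_count(const Anchor *source)
{
    const Link *l;
    int n;

    assert(source != NULL);
    n = (source->main_dest != NULL) ? 1 : 0;
    for (l = source->more_links; l != NULL; l = l->next)
        n++;
    return n;
}
```

Keeping the main link outside the list means the common single-link case needs no allocation, while a post to several recipients simply adds entries to the list.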

Cache Manager

This is a local cache module specifically for WWW Clients. It is used to save data objects once they have been downloaded from the Internet. The CERN Proxy server has its own cache manager to handle a large-scale cache that can serve hundreds of clients with documents once they have been received from the remote host. The client cache is made for clients that do not use a proxy cache, or that have a very slow link but a large local temporary storage.

Navigation and History

This module keeps track of the part of the Web that the user has visited during the World-Wide Web session. When a request is passed from the client to the library, this module searches the list of previously requested resources managed by the Anchor Manager. If the resource has already been accessed, the module first checks whether the graphic object is still in memory on the client machine; if not, it asks the cache manager whether the object is held in temporary storage (the local file system) on the client side. The difference between a data object in the cache and in memory is that the memory version is a graphic object, whereas the cache version is a resource that has to be loaded into memory and parsed in order to be transformed into a graphic object.
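The lookup order described above (memory first, then the local cache, then the network) can be written as a small decision function. The Resource structure and its fields are assumptions made for this sketch; they stand in for whatever the Anchor Manager and Cache Manager actually record.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FROM_MEMORY, FROM_CACHE, FROM_NETWORK } Source;

/* Hypothetical per-resource record. */
typedef struct {
    const char *address;
    int visited;              /* nonzero if requested earlier in this session */
    void *graphic_object;     /* non-NULL while the object is still in memory */
    const char *cache_file;   /* non-NULL if a local cached copy exists */
} Resource;

/* Decides where a requested resource should be taken from. */
Source choose_source(const Resource *r)
{
    assert(r != NULL);
    if (!r->visited)
        return FROM_NETWORK;  /* never seen before: must be fetched */
    if (r->graphic_object != NULL)
        return FROM_MEMORY;   /* ready to display as it is */
    if (r->cache_file != NULL)
        return FROM_CACHE;    /* must be loaded and parsed again */
    return FROM_NETWORK;
}
```

Only the FROM_MEMORY case avoids parsing; a cached copy saves the network transfer but still has to be turned into a graphic object.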

Protocol Manager

The Protocol Manager is invoked by the client in order to access a document. Each protocol module is responsible for extracting information from a local file or remote server using a particular protocol. Depending on the protocol, the protocol module either builds a graphic object (e.g. hypertext) itself, or it passes a socket descriptor to the format manager for parsing by one of the parser modules. As mentioned in the Graphic Object section, it can also perform a conversion of the raw data returned from the remote server into, e.g., an HTML object.
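One way to picture how a protocol module is selected is a lookup on the access scheme at the start of the address. The sketch below is an assumption about how such a dispatch could look, not the library's code: the enum, the table, and the function names are hypothetical. A fuller version would store a pointer to each module's load function in the table instead of an enum value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ACCESS_UNKNOWN, ACCESS_HTTP, ACCESS_FTP, ACCESS_GOPHER,
    ACCESS_NEWS, ACCESS_WAIS, ACCESS_TELNET, ACCESS_FILE
} Access;

/* Hypothetical registry: one entry per protocol module. */
typedef struct {
    const char *scheme;   /* lower-case scheme name */
    Access access;
} ProtocolEntry;

static const ProtocolEntry protocols[] = {
    { "http", ACCESS_HTTP },     { "ftp", ACCESS_FTP },
    { "gopher", ACCESS_GOPHER }, { "news", ACCESS_NEWS },
    { "wais", ACCESS_WAIS },     { "telnet", ACCESS_TELNET },
    { "file", ACCESS_FILE }
};

/* Case-insensitive comparison of the first len characters of url
   against a lower-case scheme name. */
static int scheme_equals(const char *url, size_t len, const char *scheme)
{
    size_t i;

    if (strlen(scheme) != len)
        return 0;
    for (i = 0; i < len; i++)
        if (tolower((unsigned char)url[i]) != scheme[i])
            return 0;
    return 1;
}

/* Picks the protocol module responsible for an address. */
Access protocol_for(const char *url)
{
    const char *colon;
    size_t i, len;

    assert(url != NULL);
    colon = strchr(url, ':');
    if (colon == NULL)
        return ACCESS_UNKNOWN;
    len = (size_t)(colon - url);
    for (i = 0; i < sizeof protocols / sizeof protocols[0]; i++)
        if (scheme_equals(url, len, protocols[i].scheme))
            return protocols[i].access;
    return ACCESS_UNKNOWN;
}
```

Because the table is data, supporting another protocol in this sketch means adding one entry and one module rather than changing the dispatch logic.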

Stream Manager

Streams are unidirectional objects which accept characters, strings, and blocks of data to be written to them. The Stream Manager handles a generic representation of a stream class so that the interface is always the same for all types of different input and output streams to the manager.

Streams can be thought of as files open for writing. The stream-based architecture allows the software to be event-driven in the sense that when input arrives, it is put into a stream, and any necessary actions then cascade off that.

Streams may be cascaded so that one stream writes into another stream after having performed some processing on the data. An output stream is often referred to as the "target" or "sink" stream.
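The cascading of streams can be illustrated with two small stream "classes": a filter that processes each character and passes it on, and a sink that collects the result. All names here are hypothetical, and the interface is reduced to a single put_character method; this is a sketch of the idea, not the library's stream interface.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Generic stream interface: every stream accepts characters. */
typedef struct Stream Stream;
struct Stream {
    void (*put_character)(Stream *me, char c);
};

/* Strings are written through the same single method. */
void stream_put_string(Stream *s, const char *str)
{
    assert(s != NULL && str != NULL);
    while (*str != '\0')
        s->put_character(s, *str++);
}

/* Sink stream: collects what is written into a fixed buffer. */
typedef struct {
    Stream base;          /* must be first so the casts below are valid */
    char data[128];
    size_t used;
} BufferSink;

static void sink_put(Stream *me, char c)
{
    BufferSink *sink = (BufferSink *)me;
    if (sink->used + 1 < sizeof sink->data) {   /* keep room for '\0' */
        sink->data[sink->used++] = c;
        sink->data[sink->used] = '\0';
    }
}

void buffer_sink_init(BufferSink *sink)
{
    assert(sink != NULL);
    sink->base.put_character = sink_put;
    sink->used = 0;
    sink->data[0] = '\0';
}

/* Filter stream: upper-cases each character, then writes it to its target. */
typedef struct {
    Stream base;
    Stream *target;
} UpperFilter;

static void upper_put(Stream *me, char c)
{
    UpperFilter *f = (UpperFilter *)me;
    f->target->put_character(f->target, (char)toupper((unsigned char)c));
}

void upper_filter_init(UpperFilter *f, Stream *target)
{
    assert(f != NULL && target != NULL);
    f->base.put_character = upper_put;
    f->target = target;
}
```

Writing "hello" to an UpperFilter whose target is a BufferSink leaves "HELLO" in the sink. The writer only sees a Stream, so further filters can be placed in the chain without changing it.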

Structured streams

A structured stream is a subclass of a stream, but instead of just accepting data, it also accepts SGML events such as begin and end elements. A structured stream therefore represents a structured document. A structured stream can be thought of as the output from an SGML parser. It is more efficient for modules which generate hypertext objects to output a structured stream than to output SGML which is then parsed.

The elements and entities in the stream are referred to by numbers, rather than strings. The DTD contains the mapping between element names and numbers, so each structured stream, when created, is associated with the DTD which it is using. Any instance of a structured stream has a related DTD which gives the rules and the element and entity names for events on the structured stream. The only DTD which is currently in the library is an extended version of an HTML DTD version 1.0.

The SGML parser uses a DTD to output to a structured stream from a stream of SGML. A hypertext editor will output to a structured stream when writing out a document. Many protocol modules output to a structured stream when generating their data structures.
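To make the idea of numbered elements concrete, the sketch below defines a tiny DTD table and a structured-stream target that receives begin and end events by element number and writes the corresponding SGML text. The four-element table, the TextTarget type, and all function names are hypothetical; the library's HTML DTD is far larger and also covers entities and attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical DTD: maps element numbers (array indexes) to names. */
typedef struct {
    const char *const *element_names;
    int element_count;
} DTD;

static const char *const tiny_names[] = { "A", "H1", "P", "TITLE" };
const DTD tiny_dtd = { tiny_names, 4 };

/* Looks up the number of an element; returns -1 if it is not in the DTD. */
int dtd_element_number(const DTD *dtd, const char *name)
{
    int i;

    assert(dtd != NULL && name != NULL);
    for (i = 0; i < dtd->element_count; i++)
        if (strcmp(dtd->element_names[i], name) == 0)
            return i;
    return -1;
}

/* A structured-stream target that regenerates SGML text from events. */
typedef struct {
    const DTD *dtd;       /* the DTD this stream was created with */
    char out[128];
    size_t used;
} TextTarget;

static void append(TextTarget *t, const char *s)
{
    size_t n = strlen(s);
    if (t->used + n < sizeof t->out) {
        memcpy(t->out + t->used, s, n + 1);   /* copies the '\0' too */
        t->used += n;
    }
}

void target_init(TextTarget *t, const DTD *dtd)
{
    assert(t != NULL && dtd != NULL);
    t->dtd = dtd;
    t->used = 0;
    t->out[0] = '\0';
}

/* Events carry element numbers; names come from the associated DTD. */
void start_element(TextTarget *t, int element)
{
    assert(element >= 0 && element < t->dtd->element_count);
    append(t, "<");
    append(t, t->dtd->element_names[element]);
    append(t, ">");
}

void end_element(TextTarget *t, int element)
{
    assert(element >= 0 && element < t->dtd->element_count);
    append(t, "</");
    append(t, t->dtd->element_names[element]);
    append(t, ">");
}

void put_string(TextTarget *t, const char *s)
{
    append(t, s);
}
```

A module that generates hypertext looks up the number for TITLE once and then sends a start event, the text, and an end event; no element name is compared again along the way.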

Format Conversion and Stream Stacks

Often it is desired to perform a format conversion between the entry point and the output point of the stream. As illustrated in the Figure of the library, the stream manager is the node between the input format given by the protocol modules and the desired output format specified by the client. However, it is often desirable to perform more than one data conversion on a data object. Therefore, the stream manager is designed as a stream stack where several streams can be cascaded, each one performing a part of the total data conversion.
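How a stack of converters might be chosen can be sketched as a search over a table of available conversions. Everything in this example is an assumption: the format names, the converter table, and the depth-limited search are hypothetical and only illustrate how several partial conversions can be chained from an input format to an output format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical converter registry: each entry turns one format into another. */
typedef struct {
    const char *from;
    const char *to;
} Converter;

static const Converter converters[] = {
    { "ftp-listing", "html" },
    { "html", "plain-text" },
    { "gopher-menu", "html" }
};

#define CONVERTER_COUNT (sizeof converters / sizeof converters[0])

/* Fills stack[] with the converters needed to get from one format to
   another, in the order the data passes through them. Returns the number
   of converters used, or -1 if no chain of at most max_depth steps exists.
   stack must have room for max_depth entries. The search is depth-first,
   so it returns the first chain found, not necessarily the shortest. */
int build_stack(const char *from, const char *to,
                const Converter **stack, int max_depth)
{
    size_t i;

    assert(from != NULL && to != NULL && stack != NULL);
    if (strcmp(from, to) == 0)
        return 0;                     /* nothing left to convert */
    if (max_depth <= 0)
        return -1;                    /* give up: chain too long */
    for (i = 0; i < CONVERTER_COUNT; i++) {
        if (strcmp(converters[i].from, from) == 0) {
            int rest = build_stack(converters[i].to, to,
                                   stack + 1, max_depth - 1);
            if (rest >= 0) {
                stack[0] = &converters[i];
                return rest + 1;
            }
        }
    }
    return -1;
}
```

With this table, a request to present an FTP listing as plain text yields a stack of two converters, first to HTML and then to plain text, which matches the cascade described above.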

The Line Mode Browser

The CERN Line Mode Browser is a character-based World-Wide Web Browser. It is developed for use on dumb terminals and as a test tool for the CERN Common Code Library. It can be run in interactive mode, in non-interactive mode, as a proxy client, and in a set of other run modes that are all explained in Command Line Options. Even though it is not often used as a World-Wide Web browser, the possibility of executing it in the background or from a batch job makes it a useful tool. Furthermore, it gives a variety of possibilities for data format conversion, filtering, etc.

The easiest way to get an idea of what the Line Mode Browser is all about is to try it directly from the info server at CERN. No userid or password is needed.

The HTTP Server

CERN httpd is a generic hypertext server which can be used as a regular HTTP server. The allocated port for HTTP connections is TCP port 80, but the server can be set up to listen on any other TCP port (above 1024 if not running as root). The CERN server includes features such as Access Authentication, Clickable Images, etc.

The Proxy Server

The CERN server can also run as a proxy server. A proxy is a special HTTP server that typically runs on a firewall machine. The proxy waits for a request from inside the firewall, forwards the request to the remote server outside the firewall, reads the response and then sends it back to the client. Kevin Altis, Ari Luotonen and Lou Montulli have been the principal designers behind the current proxy standard, which is illustrated in the following figure:

As seen from the figure, all communication between the client inside the firewall and the Proxy server is done using HTTP. This makes the client application much more effective as it can concentrate on the user interface and not on the Internet interface including presentation protocol clients etc.
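The forwarding step can be sketched for the HTTP case. The example assumes that the client sends the proxy a request line containing the full URL, and that the proxy contacts the named host with a request line containing only the path; the function name, buffer sizes, and the URL used in the usage note are made up for illustration. Other schemes, headers, and the actual network connection are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Splits a proxy request line such as
       GET http://host/path HTTP/1.0
   into the host to contact and the request line to forward to it.
   Returns 0 on success, -1 if the line is not a usable http request. */
int proxy_rewrite(const char *line, char *host, size_t host_size,
                  char *forward, size_t forward_size)
{
    char method[16], url[256], version[16];
    const char *authority, *path;
    size_t host_len;
    int n;

    assert(line != NULL && host != NULL && forward != NULL);
    if (sscanf(line, "%15s %255s %15s", method, url, version) != 3)
        return -1;
    if (strncmp(url, "http://", 7) != 0)
        return -1;                    /* this sketch handles http only */

    authority = url + 7;              /* host, possibly followed by :port */
    path = strchr(authority, '/');
    host_len = (path != NULL) ? (size_t)(path - authority)
                              : strlen(authority);
    if (host_len == 0 || host_len >= host_size)
        return -1;
    memcpy(host, authority, host_len);
    host[host_len] = '\0';

    /* An address without a path refers to the root of the server. */
    n = snprintf(forward, forward_size, "%s %s %s",
                 method, (path != NULL) ? path : "/", version);
    if (n < 0 || (size_t)n >= forward_size)
        return -1;
    return 0;
}
```

Given the line "GET http://info.cern.ch/index.html HTTP/1.0", this sketch yields the host "info.cern.ch" and the forwarded line "GET /index.html HTTP/1.0". A request line that carries only a path is rejected, since it does not say which remote server to contact.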

In the usual case, the same proxy is used by all the clients within a given subnet. This gives another advantage of using a proxy server as it is possible for the proxy to do efficient caching of documents that are requested by a number of clients. The ability to cache documents also makes proxies attractive to groups of clients not inside a firewall as it cuts down the network traffic costs to remote hosts.

The CERN server has had gateway features for a long time, provided by Tim Berners-Lee, but these have recently been extended to support all the methods in the HTTP protocol used by WWW clients. Clients don't lose any functionality by going through a proxy, except special processing they may have done for non-native Web protocols such as Gopher and FTP.

Henrik Frystyk, frystyk@info.cern.ch, July 1994