NOTE-GDIFF
This document is a NOTE made available by the W3 Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE.
This document is a submission to W3C from Marimba. Please see Acknowledged Submissions to W3C regarding its disposition.
W3C staff note:
This document was submitted and should be considered along with NOTE-drp.The Consortium is pleased to have received this specification as a submission, and acknowledges the interest in such a mechanism indicated by several member companies.
The proposal will be discussed at the W3C workshop on "push" technology in September.
That workshop will give participants a chance tocomment on the path the consortium should take. At this time we are not promising any allocation of further W3C resources tothis proposal after the Push Workshop.
This technology impacts URI specifications, HTTP, and includes anew XML mime type. Much of the functionality should be able to be provided by the Resource Description Framework currently being developed in the metadata activity. There are other components of the DRP proposal that have requirements in common with other W3C and IETF activities(notably DSig Manifests and WEBDAV versioning) and should this proposal be actively considered, W3C will work to insure the harmonization of the related activities.
This document provides a specification of a generic file format for representing the differences between two files.
1. Introduction
2. The Generic Diff Format
3. Conclusion
3.1 References
This document describes the Generic Diff Format (GDIFF). The GDIFF format can be used to efficiently describe the differences between two arbirary files. The format does not make any assumptions about the type or contents of the files, and thus can be used to describe the differences between text files as well as binary files. The GDIFF format is itself a binary file format.
This proposal does not describe how to compute the differences between two files. It only defines how the resulting differences can be described in an efficient and generic manner. The proposal describes the GDIFF file format and how to interpret it.
The GDIFF format is primarily useful in applications which compute the differences between two versions of a file. The resulting differences can be stored in a file using the GDIFF format. The differences described by the GDIFF file can later be applied to the old file to obtain the new file.
The GDIFF format is particularly useful in situations where it is more efficient to distribute the differences between two versions of a file, rather than the entire new version of the file.
Any file differencing algorithm can be used to compute the differences between the old and new versions of a file. An example is the rsync algorithm. The result can be expressed using the GDIFF format that is described here.
To apply a GDIFF file, you need random access to the old version of the file, and sequential access to the GDIFF file. The new version of the file is produced on the output stream.
The GDIFF format is a binary format. The mime type of a GDIFF file is "application/gdiff". All binary numbers in a GDIFF file are stored in big endian format (most significant byte first).
Each diff stream starts with the 4-byte magic number (value
0xd1ffd1ff
), followed by a 1-byte version number (value 4).
The version number is followed by a sequence of 1 byte commands which are
interpreted in order. The last command in the stream is the end-of-file command
(value 0).
The GDIFF commands are listed in the table below.
Name |
Cmd |
Followed By |
Action |
EOF |
0 |
|
End of File |
DATA |
1 |
1 byte |
append 1 data byte |
DATA |
2 |
2 bytes |
append 2 data bytes |
DATA |
<n> |
<n> bytes |
append <n> data bytes |
DATA |
246 |
246 bytes |
append 246 data bytes |
DATA |
247 |
ushort, <n> bytes |
append <n> data bytes |
DATA |
248 |
int, <n> bytes |
append <n> data bytes |
COPY |
249 |
ushort, ubyte |
copy <position>, <length> |
COPY |
250 |
ushort, ushort |
copy <position>, <length> |
COPY |
251 |
ushort, int |
copy <position>, <length> |
COPY |
252 |
int, ubyte |
copy <position>, <length> |
COPY |
253 |
int, ushort |
copy <position>, <length> |
COPY |
254 |
int, int |
copy <position>, <length> |
COPY |
255 |
long, int |
copy <position>, <length> |
There are two kinds of GDIFF commands. The first kind is the DATA command (1 through 248). Each data command is followed by a number of data bytes which are copied onto the output stream.
The second kind of GDIFF command is the COPY command (249 through 255). Each COPY command is followed by two arguments: position and length. The arguments specify the portion of the old file that must be copied onto the output stream.
If a number larger than 1^31-1
bytes is needed for a command
command that takes only int
arguments, the command must be split
into multiple commands. This may be necessary when dealing with very large
files.
To illustrate the use of the GDIFF format we will use two input streams old and new as an example and prepare a simple GDIFF file by hand:
0 1 2 3 4 5 6 7 8 9 ----------------------------------------------- old: A B C D E F G new: A B X Y C D B C D E |
A diff algorithm can be used to discover patterns which occur in both the old and new files, and a set of GDIFF commands can be computed:
Comments | GDIFF | Output |
magic version COPY 0, 2 DATA "XY" COPY 2, 2 COPY 1, 4 EOF |
0xd1, 0xff, 0xd1, 0xff 4 249, 0, 0, 2 2, 'X', 'Y' 249, 0, 2, 2 249, 0, 1, 4 0 |
A, B X, Y C, D B, C, D, E |
Note that in this case the resulting GDIFF file is larger than the new file. This is not normally the case when the files get larger. Also note that there can be many different GDIFF files which produce the same result. The size of the resulting GDIFF file largely depends on the similarity between the two input files, and the ability of the diff algorithm to find the most optimal set of differences.
The GDIFF format is a simple diff format that can efficiently describe the differences between a two files of any type. It is defined in this proposal to provide a simple interoperability format for binary differencing.