NOTE-GDIFF

Generic Diff Format Specification

Submitted to W3C August 21, 1997

This document is available at:: http://www.w3.org/XXX
Editors:: Arthur van Hoff, Marimba Inc.; Jonathan Payne, Marimba Inc.

Status of this Document

This document is a NOTE made available by the W3 Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE.

This document is a submission to W3C from Marimba. Please see Acknowledged Submissions to W3C regarding its disposition.

W3C staff note:

This document was submitted and should be considered along with NOTE-drp.
The Consortium is pleased to have received this specification as a submission, and acknowledges the interest in such a mechanism indicated by several member companies.
The proposal will be discussed at the W3C workshop on "push" technology in September.
That workshop will give participants a chance tocomment on the path the consortium should take. At this time we are not promising any allocation of further W3C resources tothis proposal after the Push Workshop.
This technology impacts URI specifications, HTTP, and includes anew XML mime type. Much of the functionality should be able to be provided by the Resource Description Framework currently being developed in the metadata activity. There are other components of the DRP proposal that have requirements in common with other W3C and IETF activities(notably DSig Manifests and WEBDAV versioning) and should this proposal be actively considered, W3C will work to insure the harmonization of the related activities.

Abstract

This document provides a specification of a generic file format for representing the differences between two files.

1. Introduction

2. The Generic Diff Format

3. Conclusion

3.1 References

1. Introduction

This document describes the Generic Diff Format (GDIFF). The GDIFF format can be used to efficiently describe the differences between two arbirary files. The format does not make any assumptions about the type or contents of the files, and thus can be used to describe the differences between text files as well as binary files. The GDIFF format is itself a binary file format.

This proposal does not describe how to compute the differences between two files. It only defines how the resulting differences can be described in an efficient and generic manner. The proposal describes the GDIFF file format and how to interpret it.

2. The Generic Diff Format

The GDIFF format is primarily useful in applications which compute the differences between two versions of a file. The resulting differences can be stored in a file using the GDIFF format. The differences described by the GDIFF file can later be applied to the old file to obtain the new file.

The GDIFF format is particularly useful in situations where it is more efficient to distribute the differences between two versions of a file, rather than the entire new version of the file.

Any file differencing algorithm can be used to compute the differences between the old and new versions of a file. An example is the rsync algorithm. The result can be expressed using the GDIFF format that is described here.

To apply a GDIFF file, you need random access to the old version of the file, and sequential access to the GDIFF file. The new version of the file is produced on the output stream.

File Format

The GDIFF format is a binary format. The mime type of a GDIFF file is "application/gdiff". All binary numbers in a GDIFF file are stored in big endian format (most significant byte first).

Each diff stream starts with the 4-byte magic number (value 0xd1ffd1ff), followed by a 1-byte version number (value 4). The version number is followed by a sequence of 1 byte commands which are interpreted in order. The last command in the stream is the end-of-file command (value 0).

The GDIFF commands are listed in the table below.

Name
Cmd

Followed By

Action

EOF

0

End of File

DATA

1

1 byte

append 1 data byte

DATA

2

2 bytes

append 2 data bytes

DATA

<n>

<n> bytes

append <n> data bytes

DATA

246

246 bytes

append 246 data bytes

DATA

247

ushort, <n> bytes

append <n> data bytes

DATA

248

int, <n> bytes

append <n> data bytes

COPY

249

ushort, ubyte

copy <position>, <length>

COPY

250

ushort, ushort

copy <position>, <length>

COPY

251

ushort, int

copy <position>, <length>

COPY

252

int, ubyte

copy <position>, <length>

COPY

253

int, ushort

copy <position>, <length>

COPY

254

int, int

copy <position>, <length>

COPY

255

long, int

copy <position>, <length>

There are two kinds of GDIFF commands. The first kind is the DATA command (1 through 248). Each data command is followed by a number of data bytes which are copied onto the output stream.

The second kind of GDIFF command is the COPY command (249 through 255). Each COPY command is followed by two arguments: position and length. The arguments specify the portion of the old file that must be copied onto the output stream.

If a number larger than 1^31-1 bytes is needed for a command command that takes only int arguments, the command must be split into multiple commands. This may be necessary when dealing with very large files.

Types

byte - 8 bit signed
ubyte - 8 bit unsigned
ushort - 16 bit unsigned, most significant byte first
int - 32 bit signed, most significant byte first
long - 64 bit signed, most significant byte first

Example

To illustrate the use of the GDIFF format we will use two input streams old and new as an example and prepare a simple GDIFF file by hand:

0 1 2 3 4 5 6 7 8 9 ----------------------------------------------- old: A B C D E F G new: A B X Y C D B C D E

A diff algorithm can be used to discover patterns which occur in both the old and new files, and a set of GDIFF commands can be computed:

Comments GDIFF Output

magic version COPY 0, 2 DATA "XY" COPY 2, 2 COPY 1, 4 EOF

0xd1, 0xff, 0xd1, 0xff 4 249, 0, 0, 2 2, 'X', 'Y' 249, 0, 2, 2 249, 0, 1, 4 0

A, B X, Y C, D B, C, D, E

Note that in this case the resulting GDIFF file is larger than the new file. This is not normally the case when the files get larger. Also note that there can be many different GDIFF files which produce the same result. The size of the resulting GDIFF file largely depends on the similarity between the two input files, and the ability of the diff algorithm to find the most optimal set of differences.

3. Conclusion

The GDIFF format is a simple diff format that can efficiently describe the differences between a two files of any type. It is defined in this proposal to provide a simple interoperability format for binary differencing.

3.1 References

"The rsync algorithm" by Andrew Tridgell and Paul Mackerras.
http://cs.anu.edu.au/techreports/1996/TR-CS-96-05.html

Name	Cmd	Followed By	Action
EOF	0		End of File
DATA	1	1 byte	append 1 data byte
DATA	2	2 bytes	append 2 data bytes
DATA	<n>	<n> bytes	append <n> data bytes
DATA	246	246 bytes	append 246 data bytes
DATA	247	ushort, <n> bytes	append <n> data bytes
DATA	248	int, <n> bytes	append <n> data bytes
COPY	249	ushort, ubyte	copy <position>, <length>
COPY	250	ushort, ushort	copy <position>, <length>
COPY	251	ushort, int	copy <position>, <length>
COPY	252	int, ubyte	copy <position>, <length>
COPY	253	int, ushort	copy <position>, <length>
COPY	254	int, int	copy <position>, <length>
COPY	255	long, int	copy <position>, <length>

Comments	GDIFF	Output
magic version COPY 0, 2 DATA "XY" COPY 2, 2 COPY 1, 4 EOF	0xd1, 0xff, 0xd1, 0xff 4 249, 0, 0, 2 2, 'X', 'Y' 249, 0, 2, 2 249, 0, 1, 4 0	A, B X, Y C, D B, C, D, E

Generic Diff Format Specification

Submitted to W3C August 21, 1997

Status of this Document

Abstract

Table of Contents

1. Introduction

2. The Generic Diff Format

File Format

Types

Example

3. Conclusion

3.1 References