The Hackerlab at regexps.com

The Theory of Patches and Revisions


This appendix briefly explains "patch sets" and "revision control" in the abstract.

The Theory of Patches

A patch set is an expression of the differences between two revisions of a tree of files (usually, mostly text files). A patch set tells you what files and directories have been added or removed between the two revisions, what files have been renamed, and what files have changed. For files added or removed, a patch set tells you the complete contents of those files. For files modified, a patch set contains a description of the changes in the form of a context diff (see the man page for diff(1)). If a file is a symbolic link, and the link target has changed, the patch set records that fact. If a regular file or directory is replaced by a symbolic link (or vice versa), the patch set records that fact. Finally, if any files have had their permissions or modification times changed, a patch set records that too.

Some notation will be helpful. A shorter name for "patch set" is delta. Let's suppose that A and B are two revisions of a source tree. Then:

        delta (A, B)

is the name for a patch set describing the differences between A and B.

Any time you make a series of changes to a tree, perhaps using shell utilities and text editors, the entire series of changes can be summarized as a single patch set. In a sense, a patch set (or "delta") is the fundamental editing operation, in terms of which all others, and all combinations of others, can be described.

You can apply a patch -- which means to make the changes it describes. In our notation:

        delta (A, B) [A] == B

says "the patch set describing the differences between A and B, when applied to A, gives B". A patch set can also be applied "in reverse":

        delta (A, B) {B} == A

"the delta from A to B, applied in reverse to B, gives A".

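The notation can be made concrete with a small sketch. It assumes a toy model that is not arch's actual patch set format: a tree is a Python dict mapping paths to file contents, and a patch set records whole-file changes rather than context diffs. The names delta, apply and apply_reverse are illustrative only.

```python
# Toy model (an assumption for illustration): a tree is a dict mapping
# file paths to file contents. Real patch sets also track renames,
# symlinks and permissions, and use context diffs rather than whole files.

def delta(a, b):
    """Compute a whole-file patch set describing how to turn tree a into tree b."""
    return {
        "added": {p: b[p] for p in b if p not in a},
        "removed": {p: a[p] for p in a if p not in b},
        "modified": {p: (a[p], b[p]) for p in a if p in b and a[p] != b[p]},
    }

def apply(patch, tree):
    """delta (A, B) [A]: make the changes the patch set describes."""
    result = dict(tree)
    for path in patch["removed"]:
        del result[path]
    result.update(patch["added"])
    for path, (_old, new) in patch["modified"].items():
        result[path] = new
    return result

def apply_reverse(patch, tree):
    """delta (A, B) {B}: undo the changes the patch set describes."""
    result = dict(tree)
    for path in patch["added"]:
        del result[path]
    result.update(patch["removed"])
    for path, (old, _new) in patch["modified"].items():
        result[path] = old
    return result

A = {"README": "hello\n", "main.c": "int main() { return 0; }\n"}
B = {"README": "hello, world\n", "util.c": "/* helpers */\n"}

d = delta(A, B)
assert apply(d, A) == B          # delta (A, B) [A] == B
assert apply_reverse(d, B) == A  # delta (A, B) {B} == A
```

Because the patch set keeps both the old and the new contents of each changed file, the same data serves for forward and reverse application.
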
A patch set can also be applied (or reverse-applied) to a tree which is not the same as either A or B. For example, suppose that we have a tree A_prime which is similar to A, but has some slight differences. Then:

        delta (A, B) [A_prime] ~== B_prime

where ~== means "approximately equals". When a patch set is applied to a tree which is not one of the trees used to compute the patch set, the edits might or might not go well. For example, if the patch set wants to modify a file F, but A_prime doesn't contain F, the patch can't be applied perfectly. Similarly, if the patch set wants to modify F, but the version of F in A_prime is already very different from the version in A, then those edits can't be done automatically.

Nevertheless, for the kinds of changes people typically make to trees of source code, approximately applied patch sets are very useful. For example, suppose we start with tree A, and create two revisions B_one and B_two. Then:

        delta (A, B_one) [B_two] ~== delta (A, B_two) [B_one]

To make that more concrete: if programmer Alice makes a set of changes to give us B_one, and programmer Bob makes a set of changes to give us B_two, then the patch sets between A and those two revisions give us a way to get B_three -- a tree that contains both Alice's and Bob's changes. Even when a patch set can't be perfectly applied in that way, it can often be applied to do "80%" of the work, making it much easier to finish merging the two sets of changes by hand.
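The Alice and Bob scenario can be sketched in a toy model. The assumptions here are not arch's behavior: trees are dicts of path to contents, patch sets are whole-file, and apply_approx is a hypothetical helper that applies whatever fits and lists the paths left for repair by hand. A real tool working from context diffs can often merge two edits to the same file; this whole-file model cannot.

```python
def delta(a, b):
    """Whole-file patch set describing how to turn tree a into tree b."""
    return {
        "added": {p: b[p] for p in b if p not in a},
        "removed": {p: a[p] for p in a if p not in b},
        "modified": {p: (a[p], b[p]) for p in a if p in b and a[p] != b[p]},
    }

def apply_approx(patch, tree):
    """Apply what fits; return (new_tree, paths needing repair by hand)."""
    result, rejects = dict(tree), []
    for path, old in patch["removed"].items():
        if result.get(path) == old:
            del result[path]
        else:
            rejects.append(path)          # already gone, or changed
    for path, new in patch["added"].items():
        if result.get(path, new) == new:
            result[path] = new
        else:
            rejects.append(path)          # a different file is in the way
    for path, (old, new) in patch["modified"].items():
        if result.get(path) == old:
            result[path] = new
        else:
            rejects.append(path)          # missing, or edited differently
    return result, sorted(rejects)

A     = {"parser.c": "v1", "lexer.c": "v1", "README": "v1"}
B_one = {"parser.c": "alice", "lexer.c": "v1", "README": "alice"}  # Alice's tree
B_two = {"parser.c": "v1", "lexer.c": "bob", "README": "bob"}      # Bob's tree

# delta (A, B_one) [B_two]  and  delta (A, B_two) [B_one]
left, left_rejects = apply_approx(delta(A, B_one), B_two)
right, right_rejects = apply_approx(delta(A, B_two), B_one)

# Both orders combine the independent edits; only README needs a human.
assert left["parser.c"] == right["parser.c"] == "alice"
assert left["lexer.c"] == right["lexer.c"] == "bob"
assert left_rejects == right_rejects == ["README"]
```

The two results agree everywhere except on the file both programmers touched, which is the sense in which the two sides are "approximately equal".
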

Incidentally, patch sets have a useful algebraic property if we think of them as functions that can be composed. Using the notation F o G to mean "the function F composed with the function G":

        For all trees, A, B, and C:

        delta (B, C) o delta (A, B) == delta (A, C)

so to build C from A, we can use:

        delta (A, C) [ A ] == C

but that is the same as:

        (delta (B, C) o delta (A, B)) [ A ] == C

or, in other words:

        delta (B, C) [ delta (A, B) [ A ]] == C
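The identity can be checked in a toy model. This is a sketch under assumptions that are not part of arch: trees are dicts of path to contents, patch sets are whole-file, and compose and size are hypothetical helpers. The check only covers exact application to A.

```python
def delta(a, b):
    """Whole-file patch set describing how to turn tree a into tree b."""
    return {
        "added": {p: b[p] for p in b if p not in a},
        "removed": {p: a[p] for p in a if p not in b},
        "modified": {p: (a[p], b[p]) for p in a if p in b and a[p] != b[p]},
    }

def apply(patch, tree):
    """Apply a patch set to the tree it was computed from."""
    result = {p: c for p, c in tree.items() if p not in patch["removed"]}
    result.update(patch["added"])
    result.update({p: new for p, (_old, new) in patch["modified"].items()})
    return result

def compose(*patches):
    """F o G o ...: the rightmost patch set is applied first."""
    def composed(tree):
        for patch in reversed(patches):
            tree = apply(patch, tree)
        return tree
    return composed

def size(patch):
    """Number of per-file changes recorded in a patch set."""
    return sum(len(changes) for changes in patch.values())

A = {"f": "1"}
B = {"f": "2", "tmp": "scratch"}   # B adds a scratch file ...
C = {"f": "3"}                     # ... which C removes again

two_steps = compose(delta(B, C), delta(A, B))
one_step = delta(A, C)

# delta (B, C) [ delta (A, B) [ A ]] == delta (A, C) [ A ] == C
assert two_steps(A) == apply(one_step, A) == C

# The direct delta never mentions the scratch file, so it is smaller.
assert size(delta(A, B)) + size(delta(B, C)) == 4
assert size(one_step) == 1
```
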

The algebraic property suggests that if I want to apply:

        delta (B, C) o delta (A, B)

I can save time by instead applying:

        delta (A, C)

and if I want to have a record of all three:

        delta (A, B)
        delta (B, C)
        delta (A, C)

I can save space by storing only:

        delta (A, B)
        delta (B, C)

And if we're applying patch sets to trees that might need to be touched up by hand, and I want to apply delta (A, C), then I have a choice between applying:

        delta (A, C)

and having just one, possibly large, set of errors to clean up by hand, or applying:

        delta (A, B) then delta (B, C)

and having two, but possibly smaller, sets of errors to clean up by hand.

Patch sets have many uses, but three important ones are:

Compression: If A and B are large trees, and the differences between them are small, then the patch set between them will be much smaller than either tree. You can save disk space by not storing both A and B, but instead storing only A (or only B) along with delta (A, B). That makes patch sets an ideal form for space-efficient archival of multiple revisions of a tree.

Similarly, if someone has downloaded a copy of A (or B) and they want a copy of B (or A), they can save download time and bandwidth by downloading only delta (A, B). That makes patch sets a good way to distribute multiple revisions of a tree.

Inspection: If a patch set is stored in a human-readable format, it provides a useful way to quickly see precisely what has changed between two revisions of a tree. For example, a patch set is handy for reviewing the changes made by programmers to a large source tree.

Combining Separate Efforts: For some kinds of trees, patch sets are good at merging (combining) changes made by people working separately (as in the example of Alice and Bob, above). This is especially true of program source code. That makes patch sets a very handy tool for making a team of programmers more effective: it allows them to work separately up to a point, then combine their efforts by creating and applying patch sets.
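The compression and inspection uses can be illustrated for a single file with the context diff format mentioned earlier. This sketch uses Python's standard difflib module rather than arch itself; the file names and contents are made up for the example.

```python
import difflib

# Two revisions of one 200-line file that differ in a single line.
old = [f"line {i}\n" for i in range(1, 201)]
new = list(old)
new[99] = "line 100 (edited)\n"

# A context diff: human-readable, and limited to the changed region
# plus a few surrounding lines of context.
patch_lines = difflib.context_diff(
    old, new, fromfile="A/notes.txt", tofile="B/notes.txt"
)
patch_text = "".join(patch_lines)
print(patch_text)

full_size = len("".join(new))   # cost of storing revision B outright
patch_size = len(patch_text)    # cost of storing only the difference
print(f"full file: {full_size} bytes, patch: {patch_size} bytes")
```

Changed lines are marked with "!" in the output, which is what makes the format convenient for review, and the patch is far smaller than the file because only the neighborhood of the edit is recorded.
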

The Theory of Revisions

Suppose that we start with a tree, A0, and make a set of changes resulting in the tree A1:

        A0
        A1

We can repeat that process several times:

        A0
        A1
        A2
        A3
        ...

Each instance of the tree is called a revision. Between each revision and its successor, we can compute a patch set:

        delta (A0,A1) [A0] == A1
        delta (A1,A2) [A1] == A2
        delta (A2,A3) [A2] == A3
        ...

Something we can usefully do is create an archive of revisions. We might store the first tree verbatim, and every successive tree as a delta:

        A0: "complete copy of tree"
        A1: delta (A0,A1)
        A2: delta (A1,A2)
        A3: delta (A2,A3)
        ...

If we want to retrieve some revision An, we start with A0 and apply the first n deltas:

        delta (An-1, An) [delta (An-2, An-1) [ ... [A0] ... ]] == An

or, making our notation more concise by letting each revision's name on the left stand for the delta stored under that name:

        An [ An-1 [ An-2 [ ... [A0] ... ]]] == An
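The archive layout and the retrieval rule can be sketched as follows. The model is an assumption for illustration only (trees as dicts of path to contents, whole-file patch sets); make_archive and retrieve are hypothetical names, not arch commands.

```python
def delta(a, b):
    """Whole-file patch set describing how to turn tree a into tree b."""
    return {
        "added": {p: b[p] for p in b if p not in a},
        "removed": {p: a[p] for p in a if p not in b},
        "modified": {p: (a[p], b[p]) for p in a if p in b and a[p] != b[p]},
    }

def apply(patch, tree):
    """Apply a patch set to the tree it was computed from."""
    result = {p: c for p, c in tree.items() if p not in patch["removed"]}
    result.update(patch["added"])
    result.update({p: new for p, (_old, new) in patch["modified"].items()})
    return result

def make_archive(revisions):
    """Store the first tree verbatim and every later tree as a delta."""
    deltas = [delta(prev, cur) for prev, cur in zip(revisions, revisions[1:])]
    return {"base": dict(revisions[0]), "deltas": deltas}

def retrieve(archive, n):
    """Rebuild revision n: start from the base tree, apply the first n deltas."""
    tree = dict(archive["base"])
    for patch in archive["deltas"][:n]:
        tree = apply(patch, tree)
    return tree

revisions = [
    {"main.c": "v0"},                      # A0
    {"main.c": "v1"},                      # A1
    {"main.c": "v1", "util.c": "v0"},      # A2
    {"main.c": "v2", "util.c": "v0"},      # A3
]
archive = make_archive(revisions)
assert all(retrieve(archive, n) == revisions[n] for n in range(len(revisions)))
```

Retrieval cost grows with n in this layout, since every delta up to the requested patch level has to be applied in turn.
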

Each revision in a series like this is called a patch level. The entire series is called a development path.

At any point along the way, we might make a copy of some An, and start a new development path. For example, we might copy A2 to form B0:

        A0
        A1
        A2 ----------------> B0
        A3                   B1
        ...                  B2
                             ...

When we have multiple, related development paths, each is called a branch. The tree we copied to start a new branch (e.g. A2) is called a branch point.

If we're building an archive, we can store B0 as a pointer to revision A2 on the A0 development path, and every successive revision of the B0 path as an ordinary delta:

        A0                      B0: "equal to A2"
        A1: delta (A0,A1)       B1: delta (B0, B1)
        A2: delta (A1,A2)       B2: delta (B1, B2)
        A3: delta (A2,A3)       B3: delta (B2, B3)
        ...                     ...
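A branch stored this way might be modeled as below. This is only a sketch under assumed conventions: trees are dicts of path to contents, patch sets are whole-file, and the entry kinds "full", "delta" and "same-as" are invented for the example rather than taken from arch's archive format.

```python
def delta(a, b):
    """Whole-file patch set describing how to turn tree a into tree b."""
    return {
        "added": {p: b[p] for p in b if p not in a},
        "removed": {p: a[p] for p in a if p not in b},
        "modified": {p: (a[p], b[p]) for p in a if p in b and a[p] != b[p]},
    }

def apply(patch, tree):
    """Apply a patch set to the tree it was computed from."""
    result = {p: c for p, c in tree.items() if p not in patch["removed"]}
    result.update(patch["added"])
    result.update({p: new for p, (_old, new) in patch["modified"].items()})
    return result

archive = {}

def store(name, tree, prev=None):
    """Store a first revision verbatim, later ones as a delta from prev."""
    if prev is None:
        archive[name] = ("full", dict(tree))
    else:
        archive[name] = ("delta", prev, delta(retrieve(prev), tree))

def store_branch_point(name, target):
    """Record that revision `name` is equal to an existing revision."""
    archive[name] = ("same-as", target)

def retrieve(name):
    """Rebuild a revision by following pointers and applying deltas."""
    entry = archive[name]
    if entry[0] == "full":
        return dict(entry[1])
    if entry[0] == "same-as":
        return retrieve(entry[1])
    _kind, prev, patch = entry
    return apply(patch, retrieve(prev))

store("A0", {"main.c": "v0"})
store("A1", {"main.c": "v1"}, prev="A0")
store("A2", {"main.c": "v2"}, prev="A1")
store_branch_point("B0", "A2")                     # B0: "equal to A2"
store("A3", {"main.c": "v3"}, prev="A2")
store("B1", {"main.c": "v2", "feature.c": "draft"}, prev="B0")
store("B2", {"main.c": "v2", "feature.c": "done"}, prev="B1")

assert retrieve("B0") == retrieve("A2")
assert retrieve("B2") == {"main.c": "v2", "feature.c": "done"}
```

Starting the branch costs one pointer entry; no copy of the tree is stored for B0.
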

To make all this more concrete, imagine that the A0 development path is successive revisions of a program we're working on. Alice wants to add a very complicated feature. Rather than make many small changes to the A0 development path, she makes a branch, the B0 development path -- and works on the complicated feature there. Each new revision of the B0 branch is a small step on the way to the complicated feature.

Eventually, Alice is done with the feature -- but meanwhile, the A0 development path has added several changes of its own. What we'd like to do next is to make a revision of the tree that has both sets of changes -- from both development paths.

        A0                      B0: "equal to A2"
        A1: delta (A0,A1)       B1: delta (B0, B1)
        A2: delta (A1,A2)       B2: delta (B1, B2)
        A3: delta (A2,A3)       B3: delta (B2, B3)
        A4: delta (A3,A4)       B4: delta (B3, B4)

              How can we make revision A5, which
               includes all the changes made on
                        both branches?

The "theory of patches" gives us several possible solutions.

One solution is to make this tree:

        B4 [ B3 [ B2 [ B1 [ A4 ]]]]

That solution is called replaying the patches from the B0 branch against the A0 branch.

That might work reasonably, but patch set B1 wasn't formed from A4 -- it was formed from B0, which is the same as A2. So when we apply B1 to A4, there might be problems that have to be resolved "by hand". The same will happen again when we apply B2, B3, and B4. In some situations, the risk and complexity of doing all that work by hand is worth it -- but not in other situations. What other options do we have?

Another solution is to make this tree:

        A4 [ A3 [ B4 ] ]

That solution is also replaying: replaying the patches from the A0 branch against the B0 branch. The same kind of problem might occur (having to fix things up by hand), but in this case, we're only applying two patches instead of four -- so this might be a simpler solution.

Here's a third solution:

        delta (A2, B4) [A4]

That solution is based on the fact that:

        B0 == A2

and the algebraic property that:

        B4 o B3 o B2 o B1 == delta (B0, B4)

That solution is called updating the B0 development path with respect to the A0 path. The difference between "replaying" and "updating" is a little bit subtle. When we "replay" from another development path, that means that we take all patches we're missing from that other path, and apply them in order. When we "update" from another development path, that means we take the latest revision on that other path, and apply to it a delta between the branch point and our own most-recent revision.

Applying the patch during an update can certainly fail to work perfectly -- it might require fixing up by hand. On the other hand, an update only ever applies one patch; often, therefore, the amount of by-hand repair is minimized. "Nine times out of ten," updating is the preferred technique for joining two previously branched revisions.
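The difference between replaying and updating can be seen in a toy model. The assumptions are the same illustrative ones as before and are not arch's behavior: trees are dicts of path to contents, patch sets are whole-file, and apply_approx is a hypothetical helper returning the paths it could not patch. The revisions are contrived so that the B0 branch tries a change to x and then backs it out.

```python
def delta(a, b):
    """Whole-file patch set describing how to turn tree a into tree b."""
    return {
        "added": {p: b[p] for p in b if p not in a},
        "removed": {p: a[p] for p in a if p not in b},
        "modified": {p: (a[p], b[p]) for p in a if p in b and a[p] != b[p]},
    }

def apply_approx(patch, tree):
    """Apply what fits; return (new_tree, paths needing repair by hand)."""
    result, rejects = dict(tree), []
    for path, old in patch["removed"].items():
        if result.get(path) == old:
            del result[path]
        else:
            rejects.append(path)
    for path, new in patch["added"].items():
        if result.get(path, new) == new:
            result[path] = new
        else:
            rejects.append(path)
    for path, (old, new) in patch["modified"].items():
        if result.get(path) == old:
            result[path] = new
        else:
            rejects.append(path)
    return result, sorted(rejects)

A2 = {"x": "x0", "y": "y0"}               # branch point, equal to B0
A4 = {"x": "x1", "y": "y0", "z": "z0"}    # latest revision on the A0 path
B = [
    A2,                                   # B0
    {"x": "xb", "y": "y0"},               # B1: experiment on x
    {"x": "x0", "y": "y0"},               # B2: experiment backed out
    {"x": "x0", "y": "y1"},               # B3
    {"x": "x0", "y": "y2"},               # B4
]

# Replay: B4 [ B3 [ B2 [ B1 [ A4 ]]]], four patches applied in order.
replayed, replay_rejects = dict(A4), []
for prev, cur in zip(B, B[1:]):
    replayed, rejects = apply_approx(delta(prev, cur), replayed)
    replay_rejects.extend(rejects)

# Update: delta (A2, B4) [A4], a single patch.
updated, update_rejects = apply_approx(delta(A2, B[-1]), A4)

assert replayed == updated == {"x": "x1", "y": "y2", "z": "z0"}
assert replay_rejects == ["x", "x"]   # B1 and B2 both trip over A3's edit to x
assert update_rejects == []           # the single delta never touches x
```

Here both approaches reach the same tree, but replaying stops twice for problems that the single delta from the branch point never raises.
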

There are other, more obscure solutions too. To choose one arbitrarily, we might try building:

        B4 [ A3 [ B2 [ A4 [ B1 [ B3 [ A2 ]]]]]]

A bizarre solution like that is so rare it doesn't really have a name, but "one time in ten thousand" it's the solution that works best.

No matter what solution we choose, if we store the resulting revision back on the B0 path, we'll wind up with:

        A0                      B0: "equal to A2"
        A1: delta (A0,A1)       B1: delta (B0, B1)
        A2: delta (A1,A2)       B2: delta (B1, B2)
        A3: delta (A2,A3)       B3: delta (B2, B3)
        A4: delta (A3,A4)       B4: delta (B3, B4)
                                B5: delta (B4, B5) "has changes A3, A4"

We can store that same revision back on the A0 development path:

        A0                      B0: "equal to A2"
        A1: delta (A0,A1)       B1: delta (B0, B1)
        A2: delta (A1,A2)       B2: delta (B1, B2)
        A3: delta (A2,A3)       B3: delta (B2, B3)
        A4: delta (A3,A4)       B4: delta (B3, B4)
        A5: delta (A4, A5) ==   B5: delta (B4, B5) "has changes A3, A4"

A5 and B5 are called a merge point. For all practical purposes, a merge point is also a branch point, since, using the example, B5 and A5 are equal, just as the two revisions of the original branch point (A2 and B0) were equal. If additional development happens on the two branches, we no longer have to worry about merging all changes since A2 and B0; we can instead merge only the changes since A5 and B5.

What is a Revision Control System?

So what is a revision control system?

A revision control system is a set of tools for computing and applying patch sets, for archiving patch sets, for distributing patch sets, and for helping to merge changes on the basis of patch sets.

A revision control system has to come up with a reasonable way of naming and cataloging revisions. It has to be able to represent branch points and help with merges. When merges occur, a good revision control system should help figure out what patches to apply to which revisions in order to minimize hand-editing.
