The main reference for this tutorial is the Pro Git book section on GIT internals.
This tutorial uses three libraries:
- CodeMirror, released under the MIT license
- sha1.js, released under the MIT license
- pako 2.0.3, released under the MIT and Zlib licenses, see the project page for details.
Introduction
GIT is based on a simple model, with many shorthands for common use cases. This model is sometimes hard to guess just from the everyday commands. To illustrate how GIT works, we'll implement a stripped-down clone of GIT in a few lines of JavaScript (not counting empty lines and lines holding only a closing brace; a few more in total).
The Operating System's filesystem
Model of the filesystem
We will simulate the Operating System's filesystem with a very simple key-value store. In this very simple filesystem, directories are entries mapped to null and files are entries mapped to strings. The path to the current directory is stored in a separate variable.
Filesystem access functions (read, write, mkdir, exists, cd)
The filesystem exposes functions to read an entire file, create or replace an entire file, create a directory, test the existence of a filesystem entry, and change the current directory.
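The model and access functions described above can be sketched as follows. This is only a minimal illustration, not the tutorial's original code: the names filesystem, currentDirectory and resolvePath, and the contents written to README, are assumptions made for this example.

```javascript
// Hypothetical in-memory filesystem: directories map to null, files map to strings.
const filesystem = { '/': null };
let currentDirectory = '/';

// Hypothetical helper: resolve a relative path against the current directory.
function resolvePath(path) {
  if (path.startsWith('/')) return path;
  return currentDirectory.replace(/\/$/, '') + '/' + path;
}

// Test the existence of a filesystem entry (file or directory).
function exists(path) {
  return Object.prototype.hasOwnProperty.call(filesystem, resolvePath(path));
}

// Create a directory.
function mkdir(path) {
  filesystem[resolvePath(path)] = null;
}

// Create or replace an entire file.
function write(path, contents) {
  filesystem[resolvePath(path)] = String(contents);
}

// Read an entire file.
function read(path) {
  const absolute = resolvePath(path);
  if (typeof filesystem[absolute] !== 'string') {
    throw new Error('not a file: ' + absolute);
  }
  return filesystem[absolute];
}

// Change the current directory.
function cd(path) {
  const absolute = resolvePath(path);
  if (!exists(absolute) || filesystem[absolute] !== null) {
    throw new Error('not a directory: ' + absolute);
  }
  currentDirectory = absolute;
}

// Example session (file contents invented for the example).
mkdir('proj');
cd('proj');
write('README', 'hello\n');
```

Storing absolute paths as keys keeps every function a one-line lookup, at the cost of a linear scan when a directory has to be listed.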
Filesystem access functions (listdir)
It will be handy for some operations to list the contents of a directory.
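A possible listdir over such a key-value store is sketched below. It assumes that keys are absolute paths, which is a choice made for this example and may differ from the tutorial's code; exampleStore and its contents are invented.

```javascript
// Hypothetical store: directories map to null, files map to strings.
const exampleStore = {
  '/proj': null,
  '/proj/README': 'hello\n',
  '/proj/src': null,
  '/proj/src/main.scm': '(display 1)\n',
  '/other': null,
};

// Return the names of the direct children of a directory, sorted.
function listdir(store, directory) {
  const prefix = directory.replace(/\/$/, '') + '/';
  return Object.keys(store)
    .filter(path => path.startsWith(prefix) && path.length > prefix.length)
    .map(path => path.slice(prefix.length))
    .filter(name => !name.includes('/')) // skip entries nested deeper
    .sort();
}

const children = listdir(exampleStore, '/proj');
```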
Example working directory
Our imaginary user will create a proj directory, and start filling in some files.
git init (creating .git)
The first thing to do is to initialize the GIT directory. For now, only the .git folder is needed. The rest of the function implementing git init will be written later.
git hash-object (storing a copy of a file in .git)
The most basic element of a GIT repository is an object. It is a copy of a file that is stored in GIT's database. That copy is stored under a unique name. The unique name is obtained by hashing the contents of the file.
Adding a file to the GIT database
So far, our GIT database does not know about any of the user's files. In order to add the contents of the README file to the database, we use git hash-object -w -t blob README, where -w tells GIT to write the object to its database, and -t blob indicates that we want to create a blob object, i.e. the contents of a file.
The objects stored in the GIT database are compressed with zlib (using the "deflate" compression method). The filesystem view shows the marker "deflated:" followed by the uncompressed data. Click on the file contents to toggle between this pretty-printed view and the raw compressed data.
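As a rough sketch of what git hash-object -w does: the contents are prefixed with a header made of the object type, the size in bytes and a NUL byte; the result is hashed with SHA-1 to obtain the object's name, and compressed with zlib before being stored. The example below uses Node's built-in crypto and zlib modules in place of the sha1.js and pako libraries used by this tutorial, and the function name hashObject is an assumption.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Build a GIT object from its type ("blob", "tree", "commit") and contents.
function hashObject(type, contents) {
  const body = Buffer.from(contents);
  // Header: type, space, size in bytes, NUL byte.
  const header = Buffer.from(`${type} ${body.length}\0`);
  const object = Buffer.concat([header, body]);
  const hash = crypto.createHash('sha1').update(object).digest('hex');
  return {
    hash,
    // The first two hex digits name the directory, the remaining 38 the file.
    path: `.git/objects/${hash.slice(0, 2)}/${hash.slice(2)}`,
    compressed: zlib.deflateSync(object),
  };
}

const emptyBlob = hashObject('blob', '');
const helloBlob = hashObject('blob', 'hello\n');
console.log(emptyBlob.hash); // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
console.log(helloBlob.path); // .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```

Since the name depends only on the type and contents, storing the same file twice yields the same object.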
You will notice that the database does not contain the name of the file, only its contents, stored under a unique identifier which is derived by hashing its contents. Let's add the second user file to the database.
zlib compression
The real implementation of GIT compresses objects with zlib. To view a zlib-compressed object in your terminal, simply write this declaration in your shell, and then call e.g. unzlib .git/objects/95/d318ae78cee607a77c453ead4db344fc1221b7
unzlib() {
  python3 -c \
    "import sys,zlib; \
     sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], 'rb').read()));" \
    "$1"
}
Storing trees (list of hashed files and subtrees)
Now GIT knows about the contents of both of the user's files, but it would be nice to also store the filenames. This is done by creating a tree object.
A tree object can contain files (by associating the file's blob with its name), or directories (by associating the hash of other subtrees with their name).
The mode (100644 for the file and 40000 for the directory) indicates the permissions, and is given in octal using the values used by *nix.
This function needs a small utility to convert hashes encoded in hexadecimal to a binary form.
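The layout of a tree object can be sketched as follows: each entry is the mode, a space, the name, a NUL byte, and then the 20 raw bytes of the hash. The function names treeObject and hexToBinary are assumptions for this example (the tutorial's own function is store_tree), Node's crypto module stands in for sha1.js, and the entries are assumed to be already sorted by name, as GIT expects.

```javascript
const crypto = require('crypto');

// Convert a 40-character hexadecimal hash to its 20-byte binary form.
function hexToBinary(hex) {
  return Buffer.from(hex, 'hex');
}

// entries: [{ mode: '100644', name: 'README', hash: '<40 hex digits>' }, ...]
function treeObject(entries) {
  const body = Buffer.concat(entries.map(entry => Buffer.concat([
    Buffer.from(`${entry.mode} ${entry.name}\0`),
    hexToBinary(entry.hash),
  ])));
  // Same header scheme as for blobs: type, space, size, NUL byte.
  const object = Buffer.concat([Buffer.from(`tree ${body.length}\0`), body]);
  const hash = crypto.createHash('sha1').update(object).digest('hex');
  return { object, hash };
}

const emptyTree = treeObject([]);
const oneFile = treeObject([
  { mode: '100644', name: 'README', hash: 'ce013625030ba8dba906f756967f9e9ca394464a' },
]);
console.log(emptyTree.hash); // 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

The hashes are stored in binary rather than hexadecimal inside trees, which is why the conversion utility is needed.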
Example use of store_tree
Storing a tree from a list of paths
Making trees out of the subfolders one by one is cumbersome. Here's a utility function which takes a list of paths, and builds a tree from those.
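One way to approach this, shown here only as a sketch, is to first group the flat list of paths into a nested structure, and then store each nested directory as a tree, deepest first. The function below covers the grouping step only; the name groupPaths and the example paths are assumptions.

```javascript
// Group a flat list of paths into nested objects.
// Directories become objects, files are marked with null.
function groupPaths(paths) {
  const root = {};
  for (const path of paths) {
    const parts = path.split('/');
    let node = root;
    for (const directory of parts.slice(0, -1)) {
      node[directory] = node[directory] || {};
      node = node[directory];
    }
    node[parts[parts.length - 1]] = null;
  }
  return root;
}

const grouped = groupPaths(['README', 'src/main.scm', 'src/lib/util.scm']);
console.log(JSON.stringify(grouped));
// {"README":null,"src":{"main.scm":null,"lib":{"util.scm":null}}}
```

A recursive walk over this structure can then hash each file as a blob and each nested object as a subtree, passing the resulting hashes up to the parent.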
Storing a commit in the GIT database
Now that the GIT database contains the entire tree for the current version, a commit can be created. A commit contains
- a pointer to the tree
- a pointer to the previous ("parent") commit (or pointers to several parents for a merge commit, or no parent for the initial commit)
- information about the author (the person who initially wrote the code)
- information about the committer (the person who adds the code to the GIT database, often the same person as the author, but it can be a different person e.g. when someone else makes changes to the history or applies a patch received by e-mail)
- a description
The author and committer information contain
- the person's name
- the person's email
- the *nix timestamp at which the version was authored or committed
- the timezone for that timestamp
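The fields listed above are serialized as plain text lines, followed by an empty line and the description. The sketch below shows that layout; the function names commitObject and formatPerson, and the example person, are assumptions, and Node's crypto module stands in for sha1.js.

```javascript
const crypto = require('crypto');

// "author Name <email> timestamp timezone", same layout for "committer".
function formatPerson(role, person) {
  return `${role} ${person.name} <${person.email}> ${person.timestamp} ${person.timezone}\n`;
}

function commitObject({ tree, parents, author, committer, description }) {
  const body = `tree ${tree}\n`
    + parents.map(parent => `parent ${parent}\n`).join('') // none for the first commit
    + formatPerson('author', author)
    + formatPerson('committer', committer)
    + '\n'
    + description;
  const bodyBytes = Buffer.from(body);
  const object = Buffer.concat([Buffer.from(`commit ${bodyBytes.length}\0`), bodyBytes]);
  return { body, hash: crypto.createHash('sha1').update(object).digest('hex') };
}

// Invented example data.
const ada = { name: 'Ada', email: 'ada@example.com', timestamp: 1700000000, timezone: '+0000' };
const firstCommit = commitObject({
  tree: '4b825dc642cb6eb9a060e54bf8d69288fbee4904',
  parents: [],
  author: ada,
  committer: ada,
  description: 'Initial commit\n',
});
```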
Storing an example commit
It is now possible to store a commit in the database. This saves a copy of the tree along with some metadata about this version. The first commit has no parent, which is represented by passing the empty list.
Resolving references
git symbolic-ref
git rev-parse
git branch
A branch is a pointer to a commit, stored in a file in .git/refs/heads/name_of_the_branch.
The branch can be overwritten with git branch -f. Also, as will be explained later, git commit can update the pointer of a branch.
HEAD
The HEAD indicates the "current" commit. It is first set as part of the git init routine.
git commit
If the HEAD points to a commit hash, then git commit updates the HEAD to point to the new commit.
Otherwise, when the HEAD points to a branch, then the target branch (represented by a file named .git/refs/heads/the_branch_name) is updated.
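The two cases can be sketched as follows, using a plain object as a stand-in for the files under .git. The function name updateAfterCommit, the branch name main and the shortened hashes are assumptions made for this example.

```javascript
// files: hypothetical key-value store mapping paths to file contents.
function updateAfterCommit(files, newCommitHash) {
  const head = files['.git/HEAD'].trim();
  if (head.startsWith('ref: ')) {
    // HEAD points to a branch: update the branch file, leave HEAD untouched.
    const branchFile = '.git/' + head.slice('ref: '.length);
    files[branchFile] = newCommitHash + '\n';
  } else {
    // HEAD holds a commit hash directly: overwrite it.
    files['.git/HEAD'] = newCommitHash + '\n';
  }
}

// HEAD pointing to a branch.
const onBranch = { '.git/HEAD': 'ref: refs/heads/main\n' };
updateAfterCommit(onBranch, 'aaaa');

// HEAD pointing directly to a commit.
const detached = { '.git/HEAD': 'bbbb\n' };
updateAfterCommit(detached, 'cccc');
```

In the first case the branch file is created if it did not exist yet, which is what happens on the very first commit of a repository.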
git tag
Tags are like branches, but are stored in .git/refs/tags/the_tag_name and a tag is not normally modified. Once created, it's supposed to always point to the same version.
GIT does offer a git tag -f existing-tag new-hash command, but using it should be a rare occurrence.
git checkout
Checkout, branches and other references
More importantly, the HEAD does not normally point to a tag. Although nothing actually prevents writing ref: refs/tags/v1.0 into .git/HEAD, the GIT commands will not automatically do this. For example, git checkout tag-or-branch-or-hash will put a symbolic ref: in .git/HEAD only if the argument is a branch.
Checking out files
Assert
The parsers will check that their input looks reasonably well-formed, using assert().
Reading compressed objects
Parsing tree objects
The parse_tree function above needs a small utility to convert hashes in binary form to a hexadecimal representation.
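For reference, here is a sketch of how the body of a tree object (after decompression, with the "tree size" header already removed) can be parsed, together with the binary-to-hexadecimal utility. The names parseTree, binaryToHex and check are assumptions; check plays the role of assert() in this example.

```javascript
// Minimal stand-in for assert(): fail loudly on malformed input.
function check(condition, message) {
  if (!condition) throw new Error(message);
}

// Convert a 20-byte binary hash to 40 hexadecimal digits.
function binaryToHex(bytes) {
  return Buffer.from(bytes).toString('hex');
}

// body: Buffer holding a sequence of "mode name\0" followed by 20 hash bytes.
function parseTree(body) {
  const entries = [];
  let position = 0;
  while (position < body.length) {
    const space = body.indexOf(0x20, position);
    check(space !== -1, 'missing space after mode');
    const nul = body.indexOf(0x00, space);
    check(nul !== -1 && nul + 21 <= body.length, 'truncated tree entry');
    entries.push({
      mode: body.toString('utf8', position, space),
      name: body.toString('utf8', space + 1, nul),
      hash: binaryToHex(body.subarray(nul + 1, nul + 21)),
    });
    position = nul + 21;
  }
  return entries;
}

const sampleBody = Buffer.concat([
  Buffer.from('100644 README\0'),
  Buffer.from('ce013625030ba8dba906f756967f9e9ca394464a', 'hex'),
]);
const parsedEntries = parseTree(sampleBody);
```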
Parsing commit objects
Example checkout
git init
The git init command creates the .git directory and points .git/HEAD to the default branch (a file which does not exist yet, as this branch does not contain any commit at this point).
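In terms of the key-value filesystem, this amounts to the sketch below. The function name gitInit and the branch name main are assumptions; the real git init also creates further directories and files which are left out here.

```javascript
// files: hypothetical key-value store; directories map to null, files to strings.
function gitInit(files, defaultBranch) {
  files['.git'] = null;
  // HEAD is a symbolic reference to a branch file that does not exist yet.
  files['.git/HEAD'] = `ref: refs/heads/${defaultBranch}\n`;
  return files;
}

const freshRepository = gitInit({}, 'main');
```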
The index
When adding files with git add, GIT does not immediately create a commit object. Instead, it adds the files to the index, which uses a binary format with lots of metadata. The mock filesystem used here lacks most of these pieces of information, so the value 0 will be used for most fields. See this blog post for a more in-depth study of the index.
Playground
The implementation is now sufficiently complete to create a small repository.
By clicking on "Copy commands to recreate in *nix terminal.", it is possible to copy a series of mkdir … and printf … > … commands that, when executed, will recreate the virtual filesystem on a real system. The resulting folder is binary-compatible with the official git log, git status, git checkout etc. commands.
Conclusion
This article shows that a large part of the core of GIT can be re-implemented in a few source lines of code (not counting empty lines and lines holding only a closing brace; a few more in total).
Whereas darcs tries to expose an interface which matches the intuition that a commit is a set of changes, it is clear that the implementation of GIT considers commits as copies of the entire repository, linked to the previous version solely by the parent metadata in the commit headers.
A few core commands like git diff and git apply are not described in this tutorial. They are little more than improved versions of the classical *nix commands diff and patch.
Most other commands provided by GIT are merely convenience wrappers around these commands. For example, git cherry-pick is simply a combination of git diff between the tree of a commit and the tree of its parent, followed by git apply to apply the patch and git commit to create a new commit whose diff is equivalent to the diff of the original commit. As another example, the command git rebase performs a succession of cherry-pick operations.
By keeping in mind the internal model of GIT, it becomes easier to understand the usual commands and their quirks. By understanding the design philosophy behind the implementation, the day-to-day usage can become, hopefully, less surprising.