Under construction
The main reference for this tutorial is the Pro Git book section on GIT internals.
This tutorial uses four libraries:
- CodeMirror, released under the MIT license
- sha1.js, released under the MIT license
- pako 2.0.3, released under the MIT and Zlib licenses, see the project page for details.
- Viz.js (v1.8.2 which has a synchronous API), released under the MIT license
Introduction
GIT is based on a simple model, with a lot of shorthands for common use cases. This model is sometimes hard to guess just from the everyday commands. To illustrate how GIT works, we'll implement a stripped down clone of GIT in a few lines of JavaScript (empty lines and lines holding a single closing brace are not counted).
The Operating System's filesystem
Model of the filesystem
The Operating System's filesystem will be simulated by a very
simple key-value store. In this very simple filesystem, directories
are entries mapped to null
and files are entries mapped
to strings. The path to the current directory is stored in a separate
variable.
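As an illustration of the model described above, here is a minimal sketch. The names `filesystem`, `currentDirectory`, `isDirectory` and `isFile`, as well as the file contents, are assumptions made for this example and are not necessarily those used by the tutorial's own code.

```javascript
// Hypothetical key-value store: keys are paths, values are
// null (directory) or a string (file contents).
const filesystem = {
  'proj': null,
  'proj/README': 'Example contents of the README\n',
  'proj/src': null,
  'proj/src/main.scm': '(display "hello")\n',
};

// The path to the current directory lives in a separate variable.
let currentDirectory = 'proj';

// Directories are entries mapped to null.
function isDirectory(path) {
  return filesystem[path] === null;
}

// Files are entries mapped to strings.
function isFile(path) {
  return typeof filesystem[path] === 'string';
}

console.log(isDirectory('proj/src'), isFile('proj/README'));
```

A path that is absent from the store is neither a file nor a directory, since looking it up yields `undefined`.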
Filesystem access functions (read, write, mkdir, exists, remove, cd)
The filesystem exposes functions to read an entire file, create or replace an entire file, create a directory, test the existence of a filesystem entry, remove an entry, and change the current directory.
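The access functions could be sketched as follows. The function names mirror the ones listed in the heading, but the bodies (and the helper `resolve`) are assumptions; in particular this `cd` only descends into a subdirectory of the current directory.

```javascript
const filesystem = {};
let currentDirectory = '';

// Hypothetical helper: turn a relative path into a key of the store.
function resolve(path) {
  return currentDirectory === '' ? path : currentDirectory + '/' + path;
}

function exists(path) {
  return Object.prototype.hasOwnProperty.call(filesystem, resolve(path));
}

function mkdir(path) { filesystem[resolve(path)] = null; }

// Create or replace an entire file.
function write(path, data) { filesystem[resolve(path)] = String(data); }

// Read an entire file.
function read(path) {
  const entry = filesystem[resolve(path)];
  if (typeof entry !== 'string') throw new Error('not a file: ' + path);
  return entry;
}

function remove(path) { delete filesystem[resolve(path)]; }

function cd(path) {
  if (!exists(path) || filesystem[resolve(path)] !== null) {
    throw new Error('not a directory: ' + path);
  }
  currentDirectory = resolve(path);
}

mkdir('proj');
cd('proj');
write('README', 'hello\n');
console.log(read('README'));
```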
Filesystem access functions (listdir)
It will be handy for some operations to list the contents of a directory.
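A possible sketch of such a listing function is given here, assuming the key-value model with slash-separated paths as keys. The signature (the store is passed explicitly as `files`) is an assumption made to keep the example self-contained.

```javascript
// Return the names of the direct children of `dir`, sorted.
function listdir(files, dir) {
  const prefix = dir + '/';
  return Object.keys(files)
    // keep entries below `dir` ...
    .filter(p => p.startsWith(prefix))
    // ... strip the directory prefix ...
    .map(p => p.slice(prefix.length))
    // ... and drop entries located in deeper subdirectories.
    .filter(name => !name.includes('/'))
    .sort();
}

const files = {
  'proj': null,
  'proj/README': 'hi\n',
  'proj/src': null,
  'proj/src/main.scm': '(display "hello")\n',
};
console.log(listdir(files, 'proj'));     // [ 'README', 'src' ]
console.log(listdir(files, 'proj/src')); // [ 'main.scm' ]
```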
Example working directory
Our imaginary user will create a proj
directory,
and start filling in some files.
git init (creating .git)
The first thing to do is to initialize the GIT directory.
For now, only the .git folder is needed. The rest of the function implementing git init will be written later.
Click on the eval button to see the files and directories that were created so far.
git hash-object (storing a copy of a file in .git)
The most basic element of a GIT repository is an object. Objects have a type which can be
blob
(individual files), tree
(directories),
commit
(pointers to a specific version of the root directory,
with a description and some metadata) and tag
(named pointers to a specific commit,
with a description and some metadata).
When a file is added to the git repository, a compressed copy is stored in GIT's database, in the .git/objects/ folder. This copy is a blob object.
The compressed copy is given a unique filename, which is obtained by hashing the contents of the original file.
Some filesystems have poor performance when a single directory contains a large number of files, and some filesystems
have a limit on the number of files that a directory may contain. To circumvent these issues, the first two characters
of the hash are used as the name of an intermediate directory: if a file's hash is 0a1bd…
, its compressed
copy will be stored in .git/objects/0a/1bd…
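The hashing and naming scheme can be sketched as follows. This is not the tutorial's code: it assumes Node.js and uses its built-in `crypto` and `zlib` modules in place of sha1.js and pako, and the name `hashObject` is hypothetical. GIT hashes the content prefixed with a small header of the form `type size\0`.

```javascript
// Assumption: Node.js built-ins stand in for sha1.js and pako.
const crypto = require('node:crypto');
const zlib = require('node:zlib');

// Compute the hash, storage path and compressed bytes of an object.
function hashObject(type, content) {
  const body = Buffer.from(content, 'utf8');
  // Header: object type, a space, the body size in bytes, a null byte.
  const header = Buffer.from(type + ' ' + body.length + '\0', 'latin1');
  const store = Buffer.concat([header, body]);
  const hash = crypto.createHash('sha1').update(store).digest('hex');
  // The first two characters of the hash name the intermediate directory.
  const path = '.git/objects/' + hash.slice(0, 2) + '/' + hash.slice(2);
  return { hash, path, compressed: zlib.deflateSync(store) };
}

const obj = hashObject('blob', 'hello world\n');
console.log(obj.hash); // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
console.log(obj.path); // .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The printed hash matches what `printf 'hello world\n' | git hash-object --stdin` reports on a real installation.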
This function creates a file that looks like this:
The objects stored in the GIT database are compressed with zlib (using the "deflate" compression method). The filesystem view shows the marker deflated: followed by the uncompressed data. Click on the (un)compressed data to toggle between this pretty-printed view and the raw compressed data.
When creating some blob
objects, the result could be, for example:
This function reproduces faithfully the behaviour of (a subset of the options of)
the git hash-object
command which can be called on a real git command-line.
Adding a file to the GIT database
So far, our GIT database does not know about any of the user's
files. In order to add the contents of the README
file in
the database, we use git hash-object -w -t blob README
,
where -w
tells GIT to write the object in its
database, and -t blob
indicates that we want to create
a blob object, i.e. the contents of a file.
Click on the eval button to see the file that was created by this call.
You can notice that the database does not contain the name of the original file, only its content, stored under a unique identifier which is derived by hashing that content. Let's add the second user file to the database.
zlib compression
GIT compresses objects with zlib. The deflate()
function used in
the script above comes from the pako 2.0.3 library.
To view a zlib-compressed object in your *nix terminal, simply write this
declaration in your shell.
unzlib() {
  python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], 'rb').read()))" "$1"
}
You can then inspect git objects as follows, using hexdump
to view the null bytes and other non-printable bytes.
unzlib .git/objects/95/d318ae78cee607a77c453ead4db344fc1221b7 | hexdump -Cv
Storing trees (list of hashed files and subtrees)
At this point GIT knows about the contents of both of the user's files, but it would be nice to also store the filenames. This is done by creating a tree object.
A tree object can contain files (by associating the blob's hash to its name), or directories (by associating the hash of other subtrees to their name).
The mode (100644 for the file and 40000 for the folder) indicates the permissions, and is given in octal using the values used by *nix.
In the contents of a tree, subdirectories (trees) are listed before files (blobs); within each group the entries are ordered alphabetically.
This function needs a small utility to convert hashes encoded in hexadecimal to raw bytes.
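A sketch of that utility, together with the serialization of a tree, could look like this. The names `hexToBytes`, `serializeTree` and `treeHash` are hypothetical, Node.js built-ins are assumed, and the entries are written in the order in which the caller supplies them. Each entry is stored as the mode, a space, the name, a null byte, then the 20 raw bytes of the hash.

```javascript
const crypto = require('node:crypto'); // assumption: Node.js instead of sha1.js

// Convert a hash encoded in hexadecimal to raw bytes.
function hexToBytes(hex) {
  const bytes = [];
  for (let i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return Buffer.from(bytes);
}

// Build the uncompressed tree object: "tree <size>\0" followed by the entries.
function serializeTree(entries) {
  const body = Buffer.concat(entries.map(e => Buffer.concat([
    Buffer.from(e.mode + ' ' + e.name + '\0', 'utf8'),
    hexToBytes(e.hash),
  ])));
  return Buffer.concat([Buffer.from('tree ' + body.length + '\0'), body]);
}

function treeHash(entries) {
  return crypto.createHash('sha1').update(serializeTree(entries)).digest('hex');
}

// The empty tree has a well-known hash.
console.log(treeHash([])); // 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```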
Example use of store_tree()
The following code, once uncommented, stores into the GIT database the trees for src
and for the root directory of the GIT project.
The store_tree()
function needs to be called for the contents of subdirectories
first, and that result can be used to store the trees of upper directories. In the next section,
we will write a function which takes a list of paths, constructs an internal representation of
the hierarchy, and stores the corresponding trees bottom-up.
Storing a tree from a list of paths
Making trees out of the subfolders one by one is cumbersome. The following utility function takes a list of paths, and builds a tree from those.
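One way to build the intermediate hierarchy is sketched here. The names `pathsToHierarchy` and `visitBottomUp` are hypothetical, and the representation (nested objects for directories, the full path as the value of a file) is an assumption; a real implementation would call the tree-storing function at each visit.

```javascript
// Turn a list of slash-separated paths into nested objects.
function pathsToHierarchy(paths) {
  const root = {};
  for (const path of paths) {
    const parts = path.split('/');
    const fileName = parts.pop();
    let node = root;
    for (const dir of parts) {
      if (!Object.prototype.hasOwnProperty.call(node, dir)) node[dir] = {};
      node = node[dir];
    }
    node[fileName] = path; // leaf: remember the full path of the file
  }
  return root;
}

// Visit subdirectories before the directory that contains them.
function visitBottomUp(node, name, visit) {
  for (const [child, value] of Object.entries(node)) {
    if (typeof value === 'object') visitBottomUp(value, child, visit);
  }
  visit(name, node);
}

const hierarchy = pathsToHierarchy(['README', 'src/main.scm']);
visitBottomUp(hierarchy, '(root)', (name) => console.log('store tree for', name));
```

With this input the `src` directory is visited first and the root last, which is the order needed to know the hashes of subtrees before storing their parent.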
Storing a commit in the GIT database
Now that the GIT database contains the entire tree for the current version, a commit can be created. A commit contains
- the hash of the tree object,
- the hash of the previous commit, which is dubbed the parent (merge commits have two or more parents, and the initial commit has no parent commit),
- information about the author (the person who initially wrote the code),
- information about the committer (the person who adds the code to the GIT database, often the same person as the author, but it can be a different person e.g. when someone else rewrites the history with a rebase or applies a patch received by e-mail),
- and a description.
The author and committer information contain
- the person's name,
- the person's email,
- the *nix timestamp at which the version was authored or committed,
- and the timezone for that timestamp.
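The fields listed above are laid out as plain text inside the commit object. The following sketch formats them; `formatPerson` and `formatCommit` are hypothetical names, and the result still has to be prefixed with the `commit size\0` header, hashed and compressed like any other object.

```javascript
// "Name <email> timestamp timezone"
function formatPerson(p) {
  return p.name + ' <' + p.email + '> ' + p.timestamp + ' ' + p.timezone;
}

// Header lines, an empty line, then the description.
function formatCommit(c) {
  const lines = ['tree ' + c.tree];
  for (const parent of c.parents) lines.push('parent ' + parent);
  lines.push('author ' + formatPerson(c.author));
  lines.push('committer ' + formatPerson(c.committer));
  return lines.join('\n') + '\n\n' + c.message + '\n';
}

const ada = { name: 'Ada Lovelace', email: 'ada@analyti.cal', timestamp: 0, timezone: '+0000' };
const text = formatCommit({
  tree: '4b825dc642cb6eb9a060e54bf8d69288fbee4904',
  parents: [], // the initial commit has no parent
  author: ada,
  committer: ada,
  message: 'Initial commit',
});
console.log(text);
```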
Storing an example commit
It is now possible to store a commit in the database. This saves a copy of the tree along with some metadata about this version. The first commit has no parent, which is represented by passing the empty list.
Resolving references
The next few subsections will introduce symbolic references
and other references like branch names, the special name HEAD
or tag names.
Most GIT commands accept as an argument a commit hash or a named reference to a hash. In order to implement those, we need to be able to resolve these references first.
Symbolic references are nothing more than regular files containing a hexadecimal
hash or a string of the form ref: path/to/other/symbolic/reference
.
The HEAD
reference is stored in .git/HEAD
, and can point
directly to a commit hash like
0123456789abcdef0123456789abcdef01234567,
or can point to another symbolic reference, in which case the .git/HEAD file will contain e.g. ref: refs/heads/main.
Branches are simple files stored in .git/refs/heads/name-of-the-branch
and usually contain a hash like
0123456789abcdef0123456789abcdef01234567.
Tags are identical to branches in terms of representation. It seems that the only difference
between tags and branches is the behaviour of git checkout
and similar commands.
These commands, as explained in the section about git checkout
below,
normally write ref: refs/heads/name-of-branch
in .git/HEAD
when
checking out a branch, but write the hash of the target commit when checking out a tag or
any other non-branch reference.
We'll start with a small utility to remove the newline at the end of a string. GIT references are usually files containing a hexadecimal hash, and following *NIX tradition these files finish with a newline byte. When reading these references, we need to get rid of the newline first.
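Such a utility can be as small as the following sketch (the name `trimNewline` is an assumption). It removes at most one trailing newline and leaves other strings untouched.

```javascript
// Remove a single trailing newline, if there is one.
function trimNewline(s) {
  return s.endsWith('\n') ? s.slice(0, -1) : s;
}

console.log(trimNewline('0123456789abcdef0123456789abcdef01234567\n'));
```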
git symbolic-ref
git symbolic-ref
is a low-level command which reads
(and in the official GIT implementation also writes and updates)
symbolic references given a path relative to .git/
.
For example, git symbolic-ref HEAD
will read the
contents of the file .git/HEAD
, and if that file starts
with ref:
, the rest of the line will be returned.
The official implementation of GIT follows references recursively
and returns the path/to/file
of the last file of the
form ref: path/to/file
. In the example below,
git symbolic-ref HEAD
would
- read the file proj/.git/HEAD which contains ref: refs/heads/main,
- follow that indirection and read the file proj/.git/refs/heads/main which contains ref: refs/heads/other,
- follow that indirection and read the file proj/.git/refs/heads/other which contains a hash,
- return the last file path that contained a ref:, i.e. return the string refs/heads/other.
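The steps above can be sketched as follows, using a plain object as the filesystem. The function name `gitSymbolicRef`, its signature and the depth limit are assumptions; this sketch returns null when the starting reference is not symbolic, whereas the real command reports an error in that case.

```javascript
const files = {
  'proj/.git/HEAD': 'ref: refs/heads/main\n',
  'proj/.git/refs/heads/main': 'ref: refs/heads/other\n',
  'proj/.git/refs/heads/other': '0123456789abcdef0123456789abcdef01234567\n',
};

// Follow "ref: ..." indirections and return the last named reference.
function gitSymbolicRef(files, gitDir, ref) {
  let last = null;
  let current = ref;
  for (let depth = 0; depth < 10; depth++) { // guard against cycles
    const contents = files[gitDir + '/' + current];
    if (typeof contents !== 'string' || !contents.startsWith('ref: ')) {
      return last; // reached a hash (or a missing file): stop here
    }
    current = contents.slice('ref: '.length).replace(/\n$/, '');
    last = current;
  }
  throw new Error('too many levels of symbolic references');
}

console.log(gitSymbolicRef(files, 'proj/.git', 'HEAD')); // refs/heads/other
```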
git rev-parse
git rev-parse
is another low-level command. It takes a symbolic reference or other reference,
and returns the hash. The difference with git symbolic-ref
is that symbolic-ref
follows indirections
to other references, and returns the last named reference in the chain of indirections, whereas rev-parse
goes one step further and returns the hash pointed to by the last named reference.
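To make the difference concrete, here is a sketch of the resolution down to a hash. The name `gitRevParse` and its signature are assumptions, and only full paths relative to .git/ (such as refs/heads/main) or full 40-character hashes are handled; the real command also accepts short names and abbreviated hashes.

```javascript
const files = {
  'proj/.git/HEAD': 'ref: refs/heads/main\n',
  'proj/.git/refs/heads/main': 'ref: refs/heads/other\n',
  'proj/.git/refs/heads/other': '0123456789abcdef0123456789abcdef01234567\n',
};

// Follow references until a full hexadecimal hash is found.
function gitRevParse(files, gitDir, ref) {
  let current = ref;
  for (let depth = 0; depth < 10; depth++) { // guard against cycles
    if (/^[0-9a-f]{40}$/.test(current)) return current;
    const contents = files[gitDir + '/' + current];
    if (typeof contents !== 'string') {
      throw new Error('unknown reference: ' + current);
    }
    const line = contents.replace(/\n$/, '');
    current = line.startsWith('ref: ') ? line.slice('ref: '.length) : line;
  }
  throw new Error('too many levels of references');
}

console.log(gitRevParse(files, 'proj/.git', 'HEAD'));
// 0123456789abcdef0123456789abcdef01234567
```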
git branch
A branch is a pointer to a commit, stored in a file in .git/refs/heads/name_of_the_branch
.
The branch can be overwritten with git branch -f
. Also, as will be explained later,
git commit
can update the pointer of a branch.
When we call git branch main HEAD
or equivalently
git branch main 0123456789012345678901234567890123456789
,
a file containing that hash is created in .git/refs/heads/main
. This file acts as a pointer
to the branch, and this pointer can be read e.g. by git rev-parse
.
After creating the branch, we show how the file .git/refs/heads/main
can be overwritten
using git branch -f
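A sketch of this behaviour is given here, with a plain object standing in for the filesystem. The name `gitBranch` and its parameters are assumptions, and the hash is expected to be already resolved (e.g. by a rev-parse step).

```javascript
// Create a branch pointing to `hash`; refuse to overwrite unless `force` is set.
function gitBranch(files, gitDir, name, hash, force = false) {
  const path = gitDir + '/refs/heads/' + name;
  if (!force && Object.prototype.hasOwnProperty.call(files, path)) {
    throw new Error("a branch named '" + name + "' already exists");
  }
  files[path] = hash + '\n';
}

const files = {};
gitBranch(files, '.git', 'main', '0123456789012345678901234567890123456789');
// Equivalent of `git branch -f main <other-hash>`:
gitBranch(files, '.git', 'main', 'abcdefabcdefabcdefabcdefabcdefabcdefabcd', true);
console.log(files['.git/refs/heads/main']);
```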
HEAD
The HEAD
indicates the "current" commit. It is set at first as part of the git init
routine.
Usually, the HEAD
is a symbolic reference to a branch, i.e. the
file .git/HEAD
contains ref: refs/heads/name-of-branch
.
When checking out a commit by specifying its hash directly, or when checking out
a non-branch reference, the file .git/HEAD
contains the hash of the
commit instead.
The state in which .git/HEAD
contains a commit hash is called
"detached HEAD", and often sounds alarming to people who have not encountered this
before. As we will see in the following sections, the only difference between detached
HEAD and the normal state is that git commit
updates the branch to point
to the new commit in the normal mode of operation. When the HEAD
is detached,
it does not point to a specific branch, and git commit
updates the HEAD
directly instead, overwriting it with the new commit hash.
Since the HEAD is supposed to be a transient pointer, it is easy to lose track of the hash of an important commit. For example, the following sequence of operations:
git checkout 0123456789abcdef0123456789abcdef01234567
touch new_file
git add new_file
git commit -m 'This is a commit adding a new file'
git checkout branch-of-feature-foobar

roughly means:

HEAD = 0123456789abcdef0123456789abcdef01234567
// overwrite the contents of the working directory with
// the contents of commit 0123456789abcdef0123456789abcdef01234567
checkout(0123456789abcdef0123456789abcdef01234567)
// create commit with the new file:
HEAD = commit(…)
// Checkout other branch
HEAD = git_rev_parse('branch-of-feature-foobar')
The hash of the new commit which is stored in HEAD on the second step is overwritten in the third step. In order to later retrieve that specific version with the precious new_file, one needs that hash. It would be possible to note down these hashes in a simple text file, but GIT offers a mechanism for that: branches. After all, branches are merely named text files containing the hash of the latest commit in that line of work.
The hash of a commit created with git commit
does not only exist in the
HEAD file (when in detached HEAD) or in the current branch file (normal mode). The official
implementation of GIT keeps a log of the changes being made to the various references.
.git/logs/HEAD
contains a log of the hashes pointed to by .git/HEAD
,
and .git/logs/refs/heads/main
contains a log of the hashes pointed to by
.git/refs/heads/main
, and the commands git reflog
and
git reflog main
pretty-print these files.
There are a few more ways to find a lost commit hash, including a careful invocation of
git fsck
which checks that the files stored in .git/
are not
corrupted, and that no reference (to another reference or a commit, tree or blob) points
to a non-existing file. The git fsck --unreachable
option tells this command
to print all object hashes which are not pointed to indirectly by any named reference
(so-called unreachable objects, which are well-formed but are not indirectly linked to
from a branch or other kind of named pointer).
The reflog can be used to recover a lost hash but handling hashes manually like this is somewhat error-prone, and most new users are not aware of those features; for this reason GIT commands tend to display a warning when switching to a detached HEAD state.
git config
The official implementation of GIT stores the settings in various files (.git/config
within a repository,
~/.gitconfig
in the user's home folder, and several other places).
These files use a .ini
syntax
with key = value
lines grouped under some [section]
headings. The configuration above could be
stored in ~/.gitconfig
or .git/config
using the following syntax:
[user]
    name = Ada Lovelace
    email = ada@analyti.cal
The $EDITOR
variable is a traditional *NIX environment variable, and could e.g. be declared with
EDITOR=nano
in ~/.profile
or ~/.bashrc
.
git commit
The git commit
command stores a commit (metadata and a pointer to a tree
containing the files given on the command-line), and updates the HEAD
or
current branch to point to the new commit.
If the HEAD
points to a commit hash, then git commit
updates the HEAD
to point to the new commit.
Otherwise, when the HEAD
points to a branch, then the target branch (represented by a file named .git/refs/heads/the_branch_name
) is updated.
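The two cases can be sketched as follows. The name `updateHeadAfterCommit` is hypothetical, a plain object stands in for the filesystem, and only a single level of indirection from .git/HEAD is followed.

```javascript
// After storing a commit, move either the current branch or HEAD itself.
function updateHeadAfterCommit(files, gitDir, newCommitHash) {
  const headPath = gitDir + '/HEAD';
  const head = files[headPath].replace(/\n$/, '');
  if (head.startsWith('ref: ')) {
    // Normal mode: HEAD names a branch, so the branch file is updated.
    files[gitDir + '/' + head.slice('ref: '.length)] = newCommitHash + '\n';
  } else {
    // Detached HEAD: HEAD holds a hash and is overwritten directly.
    files[headPath] = newCommitHash + '\n';
  }
}

const onBranch = { '.git/HEAD': 'ref: refs/heads/main\n' };
updateHeadAfterCommit(onBranch, '.git', 'a'.repeat(40));

const detached = { '.git/HEAD': '0123456789abcdef0123456789abcdef01234567\n' };
updateHeadAfterCommit(detached, '.git', 'b'.repeat(40));
console.log(onBranch, detached);
```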
The official implementation of git commit
makes use of the index.
When a file is scheduled for the next commit using git add path/to/file
, it is added to
the index. The index is a representation of a collection of copies of files, which can efficiently be
compared to the working directory. It uses a different representation, but its role is very similar
to that of a tree object along with the subtrees and blob objects of individual files. When
git commit
is called without specifying any files, it creates a commit containing the
version of the files stored in the index.
In this simplified implementation, we only support creating commits by specifying all the files that
must be present in the commit (including unchanged files). This contrasts with the official implementation
which would create a tree containing the files from the current HEAD, as well as the added, modified or
deleted files specified by git add
or specified directly on the git commit
command-line.
git tag
Tags behave like branches, but are stored in .git/refs/tags/the_tag_name
and a tag is not normally modified. Once created, it's supposed to always point
to the same version.
GIT does offer a git tag -f existing-tag new-hash
command,
but using it should be a rare occurrence.
Intuitively, tags differ from branches in the following way: when a branch is checked out and a subsequent commit is made, the branch is updated to point to the new commit's hash.
As we've seen in the implementation of git commit
, the difference is actually
in the contents of the .git/HEAD
file. If it is a symbolic reference (generally
a pointer to a branch), then the target of that reference is updated every time a new commit
is created. If the .git/HEAD
file contains the hash of a commit, then the
.git/HEAD
file itself is updated every time a new commit is created.
Therefore, tags and branches differ only in their usage and in the path under which they are
stored (.git/refs/heads/name-of-the-branch
vs. .git/refs/tags/name-of-the-tag
).
The file .git/HEAD
is overwritten by git commit
and git checkout
.
It is the latter command which will behave differently for tags and branches; git checkout branch-name
turns the HEAD into a symbolic reference, whereas git checkout tag-name
resolves the tag name to
a commit hash, and writes that hash directly into .git/HEAD
.
git checkout
The git checkout commit-hash-or-reference
command modifies the HEAD to point to the given commit,
and modifies the working directory to match the contents of the tree object pointed to by that commit.
Checkout, branches and other references
The HEAD does not normally point to a tag. Although nothing actually
prevents writing ref: refs/tags/v1.0
into .git/HEAD
, the GIT
commands will not automatically do this. For example, git checkout tag-or-branch-or-hash
will put a symbolic ref:
in .git/HEAD
only if the argument is a branch.
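This rule for .git/HEAD can be sketched as follows. The name `checkoutHead` is hypothetical, a plain object stands in for the filesystem, and tags are assumed to be files containing a commit hash directly; updating the working directory is left out.

```javascript
function has(files, path) {
  return Object.prototype.hasOwnProperty.call(files, path);
}

// Update .git/HEAD for `git checkout target`.
function checkoutHead(files, gitDir, target) {
  const branchPath = 'refs/heads/' + target;
  if (has(files, gitDir + '/' + branchPath)) {
    // A branch: HEAD becomes a symbolic reference to it.
    files[gitDir + '/HEAD'] = 'ref: ' + branchPath + '\n';
    return;
  }
  // A tag or a hash: HEAD receives the commit hash itself (detached HEAD).
  const tagPath = gitDir + '/refs/tags/' + target;
  const hash = has(files, tagPath) ? files[tagPath].replace(/\n$/, '') : target;
  if (!/^[0-9a-f]{40}$/.test(hash)) throw new Error('unknown reference: ' + target);
  files[gitDir + '/HEAD'] = hash + '\n';
}

const files = {
  '.git/refs/heads/main': 'a'.repeat(40) + '\n',
  '.git/refs/tags/v1.0': 'b'.repeat(40) + '\n',
};
checkoutHead(files, '.git', 'v1.0');
console.log(files['.git/HEAD']); // the hash stored in the tag
```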
Checking out files
In order to replace the contents of the working directory with those of the given commit, we recursively compare the subtrees, deleting from the working directory the files or directories that are not present in the tree object, and overwriting the others.
The official implementation of GIT will record the diff between the current working directory
and the current commit, and will re-apply these changes on top of the freshly checked-out commit.
The official git checkout
command will print warnings and refuse to proceed when
these changes cannot be re-applied without conflict, encouraging the user to create a commit
containing this updated version or to stash the changes (effectively creating a temporary commit
containing this version, pointed to by .git/refs/stash
). Our simple implementation
will always overwrite the changes.
Assert
The checkout_tree()
function needs to read the commit, tree and blob objects from the
.git/
folder. The following sections will introduce some parsers for these objects.
The parsers will check that their input looks reasonably well-formed, using assert()
.
Reading compressed objects
Parsing tree objects
The parse_tree
function above needs a small utility to convert hashes represented using raw bytes to a hexadecimal representation.
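Since the parser itself is not reproduced here, the following sketch shows both the byte-to-hexadecimal utility and one possible way of parsing tree entries. The names `bytesToHex` and `parseTree` follow the text, but the bodies are assumptions; the input is the tree contents after decompression and after the `tree size\0` header has been removed.

```javascript
// Convert raw bytes to a hexadecimal string.
function bytesToHex(bytes) {
  return Array.from(bytes, b => b.toString(16).padStart(2, '0')).join('');
}

// Each entry is "<mode> <name>\0" followed by 20 raw hash bytes.
function parseTree(body) {
  const entries = [];
  let i = 0;
  while (i < body.length) {
    const space = body.indexOf(0x20, i);
    const nul = body.indexOf(0x00, space);
    if (space < 0 || nul < 0 || nul + 21 > body.length) {
      throw new Error('malformed tree entry at offset ' + i);
    }
    const decoder = new TextDecoder();
    entries.push({
      mode: decoder.decode(body.subarray(i, space)),
      name: decoder.decode(body.subarray(space + 1, nul)),
      hash: bytesToHex(body.subarray(nul + 1, nul + 21)),
    });
    i = nul + 21;
  }
  return entries;
}

// Hand-built example with made-up hashes.
const enc = new TextEncoder();
const body = new Uint8Array([
  ...enc.encode('100644 README\0'), ...new Uint8Array(20).fill(0xab),
  ...enc.encode('40000 src\0'), ...new Uint8Array(20).fill(0x01),
]);
const entries = parseTree(body);
console.log(entries);
```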
Parsing commit objects
Example checkout
git init
The git init
command creates the .git
directory and points .git/HEAD
to the default branch (a file which does not exist yet, as this branch does not contain any commit at this point).
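A minimal sketch of this initialization is shown here. The name `gitInit`, the plain-object filesystem and the default branch name main are assumptions; the objects and refs directories are created as well because the official GIT commands expect them in a repository.

```javascript
// Create the .git directory and point HEAD to the default branch.
function gitInit(files, dir, defaultBranch = 'main') {
  const gitDir = dir + '/.git';
  files[gitDir] = null;
  files[gitDir + '/objects'] = null;
  files[gitDir + '/refs'] = null;
  files[gitDir + '/refs/heads'] = null;
  // The branch file itself does not exist yet: there is no commit so far.
  files[gitDir + '/HEAD'] = 'ref: refs/heads/' + defaultBranch + '\n';
}

const files = { 'proj': null };
gitInit(files, 'proj');
console.log(files['proj/.git/HEAD']); // ref: refs/heads/main
```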
The index
When adding files with git add
, GIT does not immediately create a commit object.
Instead, it adds the files to the index, which uses a binary format with lots of metadata.
The mock filesystem used here lacks most of these pieces of information, so the value 0
will be used for most fields. See this blog post
for a more in-depth study of the index.
Playground
The implementation is now sufficiently complete to create a small repository.
By clicking on "Copy commands to recreate in *nix terminal.", it is possible to copy a series of mkdir …
and printf … > …
commands that, when executed, will recreate the virtual filesystem on a real system. The resulting
folder is bit-compatible with the official git log
, git status
, git checkout
etc.
commands.
Suggested exercises
The reader willing to improve their grasp of GIT's mental model, and reduce their reliance on a few learned recipes, might be interested in the following warm-up exercises:
- Inspect an existing repository, starting with cat .git/HEAD and using git cat-file -p some-hash to pretty-print an object given its hash.
- Inspect an existing repository, starting with cat .git/HEAD and using the zlib decompression tool from the zlib compression section.
- Run git init new-directory in a terminal, and create an initial single-file commit from scratch, using only git hash-object, printf and overwriting .git/HEAD. This will involve retracing the steps in this tutorial to create a blob object for the file, a tree object to be the directory containing just that file, and a commit object.
- For a couple of weeks, only use the GIT commands commit, diff, checkout, merge, cherry-pick, log, clone, fetch and push remote hash-of-commit:refs/heads/name-of-the-branch. In particular, don't use rebase which is just a wrapper around a sequence of cherry-pick commands, don't use pull which is just a wrapper around fetch and merge, don't use git push as-is and instead explicitly give the name (origin) or URL of the remote, the hash of the commit to push, and the path that should be updated on the remote (git push while the main branch is checked out locally is equivalent to git push origin HEAD:refs/heads/main, where HEAD can be replaced by the actual hash of the commit).
- Try not even using git cherry-pick or git diff a few times, instead make two copies of the git directory, check out the two different commits in each copy, and use the traditional *NIX commands diff and patch.
Conclusion
This article shows that a large part of the core of GIT can be re-implemented in a few source lines of code (empty lines and lines holding a single closing brace are not counted).
While darcs tries to expose an interface which matches the intuition that commits are diffs, it is clear that the implementation of GIT considers commits as copies of the entire repository, which are linked to the previous version solely by the parent metadata in the commit headers.
A few core commands like git diff and git apply are not described in this tutorial.
They are little more than improved versions of the classical *nix commands diff
and patch
.
Most other commands provided by GIT are merely convenience wrappers around these commands. For example, git cherry-pick
is simply a combination of git diff
between the tree of a commit and the tree of its parent, followed by git apply
to apply the patch and git commit
to create a new commit whose diff is equivalent to the diff of the original commit. As another example, the command git rebase performs a succession of cherry-pick operations.
By keeping in mind the internal model of GIT, it becomes easier to understand the usual commands and their quirks. By understanding the design philosophy behind the implementation, the day-to-day usage can become, hopefully, less surprising.