git-book

Introduction

GIT is based on a simple model, with many shorthands layered on top for common use cases. This model can be hard to infer from the everyday commands alone. To illustrate how GIT works, we'll implement a stripped-down clone of GIT in a few lines of JavaScript.

The Operating System's filesystem

We will simulate the operating system's filesystem with a simple key-value store. In this filesystem, directories are entries mapped to null and files are entries mapped to strings.

The filesystem exposes functions to read an entire file, create or replace an entire file, and create a directory.

It will be handy for some operations to list the contents of a directory.
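The operations described above could be sketched as follows. This is only a minimal illustration: the names filesystem, mkdir, write, read and listdir are assumptions made for this example, and so are the sample file names and contents.

```javascript
// Hypothetical in-memory filesystem: path -> string (file) or null (directory).
const filesystem = {};

// Create a directory: the entry is mapped to null.
function mkdir(path) {
  filesystem[path] = null;
}

// Create or replace an entire file: the entry is mapped to a string.
function write(path, contents) {
  filesystem[path] = String(contents);
}

// Read an entire file; directories and missing paths are rejected.
function read(path) {
  const contents = filesystem[path];
  if (typeof contents !== 'string') {
    throw new Error('not a file: ' + path);
  }
  return contents;
}

// List the names of the direct children of a directory.
function listdir(dir) {
  const prefix = dir.endsWith('/') ? dir : dir + '/';
  return Object.keys(filesystem)
    .filter((p) => p.startsWith(prefix) && p.length > prefix.length)
    .map((p) => p.slice(prefix.length))
    .filter((name) => !name.includes('/'))
    .sort();
}

// Sample data (invented for this sketch).
mkdir('proj');
write('proj/README', 'This is my project.\n');
mkdir('proj/src');
write('proj/src/main.scm', "(map display '(1 2 3))\n");
console.log(listdir('proj')); // [ 'README', 'src' ]
```

A flat key-value store keeps the sketch short: listing a directory is just a matter of filtering the keys by prefix.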

Example working directory

Our imaginary user will create a proj directory, and start filling in some files.

git init (creating .git)

The first thing to do is to initialize the GIT directory. For now, only the .git folder is needed. The rest of the function implementing git init will be added later.

git hash-object (storing a copy of a file in .git)

The most basic element of a GIT repository is an object. It is a copy of a file that is stored in GIT's database. That copy is stored under a unique name. The unique name is obtained by hashing the contents of the file.

So far, our GIT database does not know about any of the user's files. In order to add the contents of the README file to the database, we use git hash-object -w -t blob README, where -w tells GIT to write the object to its database, and -t blob indicates that we want to create a blob object, i.e. the contents of a file.

The objects stored in the GIT database are compressed with zlib (using the "deflate" compression method). In the filesystem view, a compressed entry is shown as the marker deflated: followed by the uncompressed data.
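As a rough sketch of what git hash-object -w does, the function below hashes and stores a blob using Node.js's built-in crypto and zlib modules. The name hashObject and the objects map standing in for the .git/objects directory are assumptions made for this example. Note that real GIT does not hash the bare contents: it first prepends a small header made of the object type, the size in bytes and a NUL byte.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

// Hypothetical stand-in for the .git/objects directory: path -> deflated Buffer.
const objects = new Map();

// Hash `contents` as an object of the given type and, if `write` is true,
// store the deflated object under a path derived from the hash.
function hashObject(type, contents, write = true) {
  const body = Buffer.from(contents);
  // Header: "<type> <size in bytes>\0", e.g. "blob 12\0".
  const header = Buffer.from(`${type} ${body.length}\0`);
  const data = Buffer.concat([header, body]);
  const hash = crypto.createHash('sha1').update(data).digest('hex');
  if (write) {
    // The first two hex digits name the directory, the other 38 the file.
    const path = `.git/objects/${hash.slice(0, 2)}/${hash.slice(2)}`;
    objects.set(path, zlib.deflateSync(data));
  }
  return hash;
}

const hash = hashObject('blob', 'hello world\n');
console.log(hash);
console.log([...objects.keys()]);
```

Because the name is derived only from the type, size and contents, storing the same contents twice produces the same path, so identical files are stored once.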

You will notice that the database does not contain the name of the file, only its contents, stored under a unique identifier which is derived by hashing its contents. Let's add the second user file to the database.

zlib compression

The real implementation of GIT compresses objects with zlib. To view a zlib-compressed object in your terminal, define the following function in your shell, then call e.g. unzlib .git/objects/95/d318ae78cee607a77c453ead4db344fc1221b7

unzlib() {
  # sys.stdout.buffer only exists in Python 3, so call python3 explicitly.
  python3 -c \
    "import sys,zlib; \
     sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], 'rb').read()));" \
    "$1"
}

Storing trees (list of hashed files and subtrees)

Now GIT knows about the contents of both of the user's files, but it would be nice to also store the filenames. This is done by creating a tree object.

A tree object can contain files (by associating the hash of the file's blob with its name), or directories (by associating the hash of another tree with its name). The mode (100644 for a regular file and 40000 for a directory) indicates the entry's type and permissions, and is given in octal using the values used by *nix.
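A tree object could be encoded along the following lines. This is a simplified sketch: the names hexToBinary, encodeTree and hashTree are assumptions made for this example, and entries are sorted by plain name comparison, whereas real GIT applies a slightly different ordering rule to directories.

```javascript
const crypto = require('crypto');

// Convert a 40-digit hexadecimal hash to its 20-byte binary form.
function hexToBinary(hex) {
  return Buffer.from(hex, 'hex');
}

// entries: [{ mode: '100644', name: 'README', hash: '<40 hex digits>' }, ...]
// Each entry is encoded as "<mode> <name>\0" followed by the binary hash.
function encodeTree(entries) {
  const sorted = [...entries].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0);
  const body = Buffer.concat(sorted.map((entry) =>
    Buffer.concat([
      Buffer.from(`${entry.mode} ${entry.name}\0`),
      hexToBinary(entry.hash),
    ])));
  // Same header scheme as for blobs: "<type> <size in bytes>\0".
  const header = Buffer.from(`tree ${body.length}\0`);
  return Buffer.concat([header, body]);
}

// The tree's own name is the SHA-1 of the encoded tree.
function hashTree(entries) {
  return crypto.createHash('sha1').update(encodeTree(entries)).digest('hex');
}

// A tree with a single (empty) README file.
const readmeEntry = {
  mode: '100644',
  name: 'README',
  hash: 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391',
};
console.log(hashTree([readmeEntry]));
```

Since a tree refers to its children only by hash, a subtree has to be stored before the tree that contains it.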

This function needs a small utility to convert hashes encoded in hexadecimal to a binary form.

Making trees out of the subfolders one by one is cumbersome. Here's a utility function which takes a list of paths, and builds a tree from those.
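One possible shape for such a utility is sketched below. It only builds the nested structure and works out the order in which the trees would have to be stored; hashing is left out to keep it short. The names pathsToNestedTree and treeOrder, as well as the sample paths, are assumptions made for this example.

```javascript
// Turn a flat list of file paths into nested objects:
// directories become objects, files map to their full path.
function pathsToNestedTree(paths) {
  const root = {};
  for (const path of paths) {
    const parts = path.split('/');
    const fileName = parts.pop();
    let node = root;
    for (const dir of parts) {
      if (!Object.prototype.hasOwnProperty.call(node, dir)) {
        node[dir] = {};
      }
      node = node[dir];
    }
    node[fileName] = path;
  }
  return root;
}

// List directories children-first: a parent tree needs the hashes of its
// subtrees, so subtrees must be stored before the tree containing them.
function treeOrder(node, prefix = '') {
  const order = [];
  for (const [name, child] of Object.entries(node)) {
    if (typeof child === 'object') {
      order.push(...treeOrder(child, prefix + name + '/'));
    }
  }
  order.push(prefix || '/');
  return order;
}

const nested = pathsToNestedTree(['README', 'src/main.scm', 'src/lib/util.scm']);
console.log(JSON.stringify(nested, null, 2));
console.log(treeOrder(nested)); // [ 'src/lib/', 'src/', '/' ]
```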

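Now that the GIT database contains the entire tree for the current version, a commit can be created. A commit contains the hash of that tree, the hashes of its parent commits (if any), the author and committer with their timestamps, and a commit message.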

It is now possible to store a commit in the database. This records the hash of the tree along with some metadata about this version. The first commit has no parent, which is represented by passing an empty list.
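A commit object could be encoded roughly as follows. The name encodeCommit, the shape of its argument and the sample author are assumptions made for this example; the tree hash used below is the well-known hash of the empty tree.

```javascript
const crypto = require('crypto');

// Encode a commit: a few "key value" lines, a blank line, then the message.
// `author` and `committer` look like "Name <email> <unix time> <timezone>".
function encodeCommit({ tree, parents, author, committer, message }) {
  const lines = [`tree ${tree}`];
  for (const parent of parents) {
    lines.push(`parent ${parent}`); // one line per parent, none for a root commit
  }
  lines.push(`author ${author}`);
  lines.push(`committer ${committer}`);
  const body = Buffer.from(lines.join('\n') + '\n\n' + message);
  // Same header scheme as for blobs and trees.
  const header = Buffer.from(`commit ${body.length}\0`);
  return Buffer.concat([header, body]);
}

const who = 'Alice <alice@example.com> 0 +0000'; // hypothetical author
const firstCommit = encodeCommit({
  tree: '4b825dc642cb6eb9a060e54bf8d69288fbee4904',
  parents: [], // the first commit has no parent
  author: who,
  committer: who,
  message: 'Initial commit\n',
});
const firstCommitHash = crypto.createHash('sha1').update(firstCommit).digest('hex');
console.log(firstCommit.toString());
console.log(firstCommitHash);
```

A later commit would pass the hash of the previous commit in parents, which is how the history is chained together.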

Branches

HEAD

The HEAD indicates the "current" commit. It is first set as part of the git init routine.
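The sketch below shows one way the HEAD could be set up and read back. The names gitInit and resolveHead, the plain object used as a filesystem, and the branch name main are assumptions made for this example. In real GIT, .git/HEAD usually contains a line of the form ref: refs/heads/<branch>, or a bare commit hash.

```javascript
// Hypothetical in-memory filesystem: path -> string (file) or null (directory).
const filesystem = {};

// Create the .git skeleton and point HEAD at a branch that has no commit yet.
function gitInit() {
  filesystem['.git'] = null;
  filesystem['.git/objects'] = null;
  filesystem['.git/refs'] = null;
  filesystem['.git/refs/heads'] = null;
  // Assumed default branch name; real GIT makes this configurable.
  filesystem['.git/HEAD'] = 'ref: refs/heads/main\n';
}

// Resolve HEAD to a commit hash, or null if its branch has no commit yet.
function resolveHead() {
  const head = filesystem['.git/HEAD'].trim();
  if (head.startsWith('ref: ')) {
    const branchFile = '.git/' + head.slice('ref: '.length);
    const target = filesystem[branchFile];
    return typeof target === 'string' ? target.trim() : null;
  }
  return head; // HEAD contains a commit hash directly
}

gitInit();
console.log(resolveHead()); // null
```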

Tags

git commit

If the HEAD contains a commit hash directly, then git commit updates the HEAD to point to the new commit. Otherwise, the HEAD points to a branch, and it is the target branch (represented by a file named .git/refs/heads/the_branch_name) that is updated.
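The two cases could be handled as in the following sketch. The name advanceHead, the plain object used as a filesystem and the branch name main are assumptions made for this example.

```javascript
// Hypothetical in-memory filesystem, with HEAD pointing at a branch.
const filesystem = {
  '.git': null,
  '.git/refs': null,
  '.git/refs/heads': null,
  '.git/HEAD': 'ref: refs/heads/main\n',
};

// Record `newCommitHash` as the current commit.
function advanceHead(newCommitHash) {
  const head = filesystem['.git/HEAD'].trim();
  if (head.startsWith('ref: ')) {
    // HEAD points to a branch: update the branch file, leave HEAD alone.
    const branchFile = '.git/' + head.slice('ref: '.length);
    filesystem[branchFile] = newCommitHash + '\n';
  } else {
    // HEAD contains a commit hash: overwrite it with the new hash.
    filesystem['.git/HEAD'] = newCommitHash + '\n';
  }
}

advanceHead('1'.repeat(40));
console.log(filesystem['.git/refs/heads/main']);
console.log(filesystem['.git/HEAD']);
```

Keeping the indirection in HEAD means that committing on a branch moves the branch itself, so the branch name keeps following the latest commit.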

git init
