git-tutorial/index.html
2021-06-23 02:06:06 +01:00

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>GIT tutorial</title>
<!-- Third-party libraries: -->
<link rel="stylesheet" href="codemirror-5.60.0/lib/codemirror.css">
<script src="codemirror-5.60.0/lib/codemirror.js"></script>
<script src="codemirror-5.60.0/mode/javascript/javascript.js"></script>
<script src="sha1.js"></script>
<script src="pako.min.js"></script>
<script src="viz.js"></script>
<!-- Implementation of the tutorial's helper tools (code editor, graph view, table of contents, table output and arrows): -->
<link rel="stylesheet" href="git-tutorial.css">
<script src="git-tutorial.js"></script>
<script class="example">
var examples=[];
function ___h2f(hash) { return 'proj/.git/objects/'+hash.substr(0,2)+'/'+hash.substr(2); }
function ___example(id, f) {
examples.push(function () {
var result = f();
var fs = {};
for (var i = 0; i < result.names.length; i++) {
fs[result.names[i]] = filesystem[result.names[i]];
}
var previous_fs = {};
for (var i = 0; i < result.previous_names.length; i++) {
previous_fs[result.previous_names[i]] = filesystem[result.previous_names[i]];
}
___eval_result_to_html(id, fs, previous_fs, [], true);
});
}
</script>
</head>
<body>
<article id="git-tutorial">
<h1>Under construction</h1>
<p>The main reference for this tutorial is the <a href="https://git-scm.com/book/en/v2/Git-Internals-Git-Objects">Pro Git book</a> section on GIT internals.</p>
<p>This tutorial uses four libraries:</p>
<ul>
<li><a href="https://codemirror.net/">CodeMirror</a>, released under the MIT license</li>
<li><a href="https://www.movable-type.co.uk/scripts/sha1.html">sha1.js</a>, released under the MIT license</li>
<li><a href="https://github.com/nodeca/pako">pako 2.0.3</a>, released under the MIT and Zlib licenses, see the project page for details.</li>
<li><a href="https://github.com/mdaines/viz.js">Viz.js</a> (<a href="https://github.com/mdaines/viz.js/releases/tag/v1.8.2">v1.8.2</a> which has a synchronous API), released under the MIT license</li>
</ul>
<section id="introduction">
<h1>Introduction</h1>
<p>
GIT is based on a simple model, with a lot of shorthands for common
use cases. This model is sometimes hard to guess just from the
everyday commands. To illustrate how GIT works, we'll implement a
stripped down clone of GIT in <span class="loc-count">a few</span> lines of
JavaScript.
<span style="font-size: small">*&nbsp;empty lines and single closing braces
excluded, <span class="loc-count-total">a few more</span> in total.</span>
</p>
</section>
<section id="os-filesystem">
<h1>The Operating System's filesystem</h1>
<section id="os-filesystem-model">
<h1>Model of the filesystem</h1>
<p>The Operating System's filesystem will be simulated by a very
simple key-value store. In this very simple filesystem, directories
are entries mapped to <code>null</code> and files are entries mapped
to strings. The path to the current directory is stored in a separate
variable.</p>
<textarea id="in0">
var filesystem = {};
var current_directory = '';
</textarea>
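<p>As a rough illustration of this model, the sketch below fills a separate scratch object (the names <code>example_fs</code>, <code>example_is_directory</code> and <code>example_is_file</code> are assumptions made for this example, not part of the tutorial's code) and tells directories from files by checking for <code>null</code>.</p>

```javascript
// Hypothetical scratch store, kept separate from the tutorial's `filesystem`.
var example_fs = {};
example_fs['proj'] = null;                 // a directory is mapped to null
example_fs['proj/README'] = 'Hello\n';     // a file is mapped to a string

// Helper names are assumptions made for this sketch.
function example_is_directory(fs, path) {
  return fs[path] === null;
}
function example_is_file(fs, path) {
  return typeof fs[path] === 'string';
}
```

<p>A path that was never created maps to <code>undefined</code>, so both helpers return <code>false</code> for it.</p>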
</section>
<section id="os-filesystem-functions">
<h1>Filesystem access functions<span class="notoc"> (<code>read</code>, <code>write</code>, <code>mkdir</code>, <code>exists</code>, <code>remove</code>, <code>cd</code>)</span></h1>
<p>The filesystem exposes functions to read an entire file, create or
replace an entire file, create a directory, test the existence of a filesystem entry, change the current directory, and remove an entry (optionally along with everything it contains).</p>
<textarea id="in1">
function read(filename) {
return filesystem[filename];
}
function write(filename, data) {
filesystem[filename] = String(data);
}
function exists(filename) {
return typeof(filesystem[filename]) !== 'undefined';
}
function mkdir(dirname) {
filesystem[dirname] = null;
}
function cd(dirname) {
current_directory = dirname;
}
function remove(path, recursive) {
if (recursive && filesystem[path] === null) {
var children = listdir(path);
for (var i = 0; i < children.length; i++) {
remove(path + '/' + children[i], true);
}
}
delete filesystem[path];
}
</textarea>
</section>
<section id="os-filesystem-listdir">
<h1>Filesystem access functions<span class="notoc"> (<code>listdir</code>)</span></h1>
<p>It will be handy for some operations to list the contents of a
directory.</p>
<textarea id="in2">
function listdir(dirname) {
var depth = dirname.split('/').length;
// Get all paths in the filesystem
var paths = Object.keys(filesystem);
// Filter to keep only the paths starting with the given dirname
var prefix = dirname + '/';
var descendents = paths
.filter(function (filename) { return filename.startsWith(prefix) && (filename.length > prefix.length); });
// Keep only the next path component
var children = descendents
.map(function (filename) { return filename.split('/')[depth]; });
// remove duplicates, listdir('a') with paths a/b/c and a/b/d and a/x
// should only return ['b', 'x'], not 'b', 'b', x.
return Array.from(new Set(children));
}
</textarea>
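<p>The behaviour described in the comments can be checked on a scratch store. The sketch below uses a parameterised copy of the function (the names <code>listdir_demo</code> and <code>listdir_demo_fs</code> are assumptions made for this example) so that the tutorial's own <code>filesystem</code> is not modified.</p>

```javascript
// Scratch store; the entry 'ab' shares a prefix with 'a' but is not inside it.
var listdir_demo_fs = {
  'a': null,
  'a/b': null,
  'a/b/c': 'file c',
  'a/b/d': 'file d',
  'a/x': 'file x',
  'ab': 'not inside a'
};

// Same logic as listdir(), with the store passed in as a parameter.
function listdir_demo(fs, dirname) {
  var depth = dirname.split('/').length;
  var prefix = dirname + '/';
  var children = Object.keys(fs)
    .filter(function (p) { return p.startsWith(prefix) && p.length > prefix.length; })
    .map(function (p) { return p.split('/')[depth]; });
  // Remove duplicates while keeping the first occurrence of each name.
  return Array.from(new Set(children));
}

var listdir_demo_result = listdir_demo(listdir_demo_fs, 'a');   // ['b', 'x']
```

<p>Appending <code>/</code> to the directory name before filtering is what keeps <code>ab</code> out of the result.</p>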
</section>
</section>
<section id="example-working-directory">
<h1>Example working directory</h1>
<p>Our imaginary user will create a <code>proj</code> directory,
and start filling in some files.</p>
<textarea id="in3">
mkdir('proj');
cd('proj');
write('proj/README', 'This is my Scheme project.\n');
mkdir('proj/src');
write('proj/src/main.scm', '(map (lambda (x) (+ x 1)) (list 1 2 3))\n');
</textarea>
</section>
<section id="git-init">
<h1><code>git init</code> (creating <code>.git</code>)</h1>
<p>The first thing to do is to initialize the GIT directory.
For now, only the <code>.git</code> folder is needed; the rest
of the function implementing <code>git init</code> will be
written later.</p>
<textarea id="in4">
function join_paths(a, b) {
return (a == "") ? b : (a + "/" + b);
}
// git init (partial implementation: create the .git directory)
function git_init_mkdir() {
mkdir(join_paths(current_directory, '.git'));
}
git_init_mkdir();
</textarea>
<p>Click on the <em>eval</em> button to see the files and directories that were
created so far.</p>
</section>
<section id="git-hash-object">
<h1><code>git hash-object</code><span class="notoc"> (storing a copy of a file in <code>.git</code>)</span></h1>
<p>The most basic element of a GIT repository is an <em>object</em>. Objects have a type which can be
<code>blob</code> (individual files), <code>tree</code> (directories),
<code>commit</code> (pointers to a specific version of the root directory,
with a description and some metadata) and <code>tag</code> (named pointers to a specific commit,
with a description and some metadata).
When a file is added to the git repository, a compressed copy is stored in GIT&apos;s database,
in the <code>.git/objects/</code> folder. This copy is a <em>blob</em> object.</p>
<p>The compressed copy is given a unique filename, which is obtained by hashing the contents of the original file.
Some filesystems have poor performance when a single directory contains a large number of files, and some filesystems
have a limit on the number of files that a directory may contain. To circumvent these issues, the first two characters
of the hash are used as the name of an intermediate directory: if a file's hash is <code>0a1bd…</code>, its compressed
copy will be stored in <code>.git/objects/0a/1bd…</code>.</p>
<p>This function creates a file that looks like this:</p>
<div id="example-blob-object-template"></div>
<script class="example">
___example('example-blob-object-template', function() {
var object_contents = 'type length\000Contents of path_or_data';
var hash = sha1(object_contents);
var path = ___h2f(hash);
write(path, deflate(object_contents));
return { filesystem: filesystem, names: [path], previous_names: [] };
});
</script>
<p>The objects stored in the GIT database are compressed with zlib
(using the "deflate" compression method). The filesystem view shows
the marker <span class="deflated">deflated:</span> followed by the
uncompressed data. Click on the (un)compressed data to toggle between
this pretty-printed view and the raw compressed data.</p>
<p>When creating some <code>blob</code> objects, the result could be, for example:</p>
<div id="example-blob-objects"></div>
<script class="example">
___example('example-blob-objects', function() {
var names = [
___h2f(hash_object(true, 'blob', false, 'src/main.scm')),
___h2f(hash_object(true, 'blob', false, 'README')),
];
return { filesystem: filesystem, names: names, previous_names: [] };
});
</script>
<p>This function reproduces faithfully the behaviour of (a subset of the options of)
the <code>git hash-object</code> command which can be called on a real git command-line.</p>
<textarea id="in5">
// git hash-object [-w] -t <type> [--stdin] [path]
function hash_object(must_write, type, is_data, path_or_data) {
var data = is_data ? path_or_data : read(join_paths(current_directory, path_or_data));
var object_contents = type + ' ' + data.length + '\0' + data;
var hash = sha1(object_contents);
if (must_write) {
mkdir(join_paths(current_directory, '.git/objects'));
mkdir(join_paths(current_directory, '.git/objects/' + hash.substr(0,2)));
var object_path = join_paths(current_directory, '.git/objects/' + hash.substr(0,2) + '/' + hash.substr(2));
// deflate() compresses using zlib
write(object_path, deflate(object_contents));
}
return hash;
}
</textarea>
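<p>The object format can be cross-checked outside the browser. The following is a minimal sketch that assumes Node.js and its built-in <code>crypto</code> module instead of the <code>sha1.js</code> library used by this page; the helper names <code>node_blob_hash</code> and <code>node_object_path</code> are made up for this example. It counts the length in bytes, which coincides with <code>data.length</code> for ASCII-only content.</p>

```javascript
// Assumes Node.js; `crypto` here is Node's built-in module.
var node_crypto = require('crypto');

// Hash "blob <length>\0<data>" the way hash_object() does for blobs.
function node_blob_hash(data) {
  var body = Buffer.from(data, 'utf8');
  var header = Buffer.from('blob ' + body.length + '\0', 'utf8');
  return node_crypto.createHash('sha1')
    .update(Buffer.concat([header, body]))
    .digest('hex');
}

// The first two hex digits name the intermediate directory.
function node_object_path(hash) {
  return '.git/objects/' + hash.slice(0, 2) + '/' + hash.slice(2);
}

var hello_hash = node_blob_hash('hello\n');
var hello_path = node_object_path(hello_hash);
// hello_hash is 'ce013625030ba8dba906f756967f9e9ca394464a'
```

<p>The same value should be printed by <code>echo hello | git hash-object --stdin</code> on a real git command-line.</p>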
<section id="add-file-to-git">
<h1>Adding a file to the GIT database</h1>
<p>So far, our GIT database does not know about any of the user&apos;s
files. In order to add the contents of the <code>README</code> file in
the database, we use <code>git hash-object -w -t blob README</code>,
where <code>-w</code> tells GIT to <em>write</em> the object in its
database, and <code>-t blob</code> indicates that we want to create
a <em>blob</em> object, i.e. the contents of a file.</p>
<textarea id="in6">
// git hash-object -w -t blob README
hash_object(true, 'blob', false, 'README');
</textarea>
<p>Click on the <em>eval</em> button to see the file that was
created by this call.</p>
<p>You can notice that the database does not contain the name of the
original file, only its content, stored under a unique identifier which is
derived by hashing that content. Let&apos;s add the second user file
to the database.</p>
<textarea id="in7">
// git hash-object -w -t blob src/main.scm
hash_object(true, 'blob', false, 'src/main.scm');
</textarea>
</section>
</section>
<section id="zlib-compression-note">
<h1><code>zlib</code> compression</h1>
<p>GIT compresses objects with zlib. The <code>deflate()</code> function used in
the script above comes from the <a href="https://github.com/nodeca/pako">pako 2.0.3</a> library.
To view a zlib-compressed object in your *nix terminal, simply write this
declaration in your shell.</p>
<pre>
unzlib() {
python -c \
"import sys,zlib; \
sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], 'rb').read()));" \
"$1"
}
</pre>
<p>You can then inspect git objects as follows, using <code>hexdump</code> to view the null bytes and other non-printable bytes.</p>
<pre>unzlib .git/objects/95/d318ae78cee607a77c453ead4db344fc1221b7 | hexdump -Cv</pre>
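<p>The same inspection can be sketched in JavaScript, assuming Node.js and its built-in <code>zlib</code> module rather than the pako library used by this page. The helper name <code>parse_object</code> is an assumption made for this example. To stay self-contained, the sketch compresses an object in memory instead of reading one from a repository.</p>

```javascript
// Assumes Node.js; `zlib` here is Node's built-in module.
var node_zlib = require('zlib');

// Build a compressed blob object in memory.
var raw_object = Buffer.from('blob 6\0hello\n', 'utf8');
var compressed_object = node_zlib.deflateSync(raw_object);

// Inflate an object and split the "type length" header from the body.
function parse_object(compressed) {
  var inflated = node_zlib.inflateSync(compressed);
  var nul = inflated.indexOf(0);   // the NUL byte ends the header
  var header = inflated.subarray(0, nul).toString('utf8').split(' ');
  return {
    type: header[0],
    length: Number(header[1]),
    body: inflated.subarray(nul + 1)
  };
}

var parsed_object = parse_object(compressed_object);
```

<p>For a real repository, the buffer returned by Node's <code>fs.readFileSync()</code> on an object file could be passed to <code>parse_object()</code> in the same way.</p>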
</section>
<section id="storing-trees">
<h1>Storing trees (list of hashed files and subtrees)</h1>
<p>At this point GIT knows about the contents of both of the user's
files, but it would be nice to also store the filenames.
This is done by creating a <em>tree</em> object.</p>
<p>A tree object can contain files (by associating the blob's hash to its name), or directories (by associating the hash of other subtrees to their name).
The mode (<code>100644</code> for the file and <code>40000</code> for the folder) indicates the entry type and permissions, and is given in octal using <a href="https://unix.stackexchange.com/a/145118/19059">the values used by *nix</a>.</p>
<div id="example-tree-objects"></div>
<script class="example">
___example('example-tree-objects', function() {
var main = ___h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = ___h2f(hash_object(true, 'blob', false, 'README'));
var src = ___h2f(store_tree("src", ["main.scm"], []));
var proj = ___h2f(paths_to_tree(["README", "src/main.scm"]));
var previous_names = [ main, readme ];
var names = [ main, readme, src, proj ];
return { filesystem: filesystem, names: names, previous_names: previous_names };
});
</script>
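<p>In the contents of a tree as written by the function below, files (blobs) are listed before
subdirectories (trees), each group in the order in which it was given. The official implementation
of GIT instead sorts all the entries of a tree together by name; for the example directory
(<code>README</code> and <code>src</code>) both orders coincide.</p>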
<textarea id="in8">
// base_directory is a string
// filenames is a list of strings
// subtrees is a list of {name, hash} objects.
function store_tree(base_directory, filenames, subtrees) {
function get_file_hash(filename) {
var path = join_paths(base_directory, filename);
var hash = hash_object(true, 'blob', false, path);
return hex_to_raw_bytes(hash);
}
var blobs = filenames.map(function (filename) {
return "100644 " + filename + "\0" + get_file_hash(filename);
});
var trees = subtrees.map(function (subtree) {
return "40000 " + subtree.name + "\0" + hex_to_raw_bytes(subtree.hash);
});
// blobs are listed before subtrees
var tree_content = blobs.join('') + trees.join('');
// cat tree_content | git hash-object -w -t tree --stdin
return hash_object(true, 'tree', true, tree_content);
}
</textarea>
<p>This function needs a small utility to convert hashes encoded in hexadecimal to raw bytes.</p>
<textarea id="in9">
function hex_to_raw_bytes(hex) {
var hex = String(hex);
var str = "";
for (var i = 0; i < hex.length; i+=2) {
str += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
}
return str;
}
</textarea>
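<p>To make the layout of a single tree entry concrete, here is a small sketch. It uses a copy of the conversion function under the name <code>demo_hex_to_raw_bytes</code>, plus an inverse helper <code>demo_raw_bytes_to_hex</code>; both names, and the placeholder hash, are assumptions made for this example.</p>

```javascript
// Copy of hex_to_raw_bytes(): two hex digits become one character.
function demo_hex_to_raw_bytes(hex) {
  var str = '';
  for (var i = 0; i < hex.length; i += 2) {
    str += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
  }
  return str;
}

// Inverse helper (not part of the tutorial), handy when decoding a tree.
function demo_raw_bytes_to_hex(raw) {
  var hex = '';
  for (var i = 0; i < raw.length; i++) {
    hex += ('0' + raw.charCodeAt(i).toString(16)).slice(-2);
  }
  return hex;
}

// Placeholder hash, not the hash of a real object.
var demo_entry_hash = '0123456789abcdef0123456789abcdef01234567';

// One entry: mode, space, name, NUL byte, then the 20 raw bytes of the hash.
var demo_entry = '100644 README\0' + demo_hex_to_raw_bytes(demo_entry_hash);
var demo_entry_length = demo_entry.length;   // 7 + 6 + 1 + 20 = 34
```

<p>Storing the hash as 20 raw bytes instead of 40 hexadecimal digits is why tree objects look partly unreadable when printed.</p>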
<section id="store-tree-example">
<h1>Example use of <code>store_tree()</code></h1>
<p>The following code, once uncommented, stores into the GIT database the trees for <code>src</code>
and for the root directory of the GIT project.</p>
<textarea id="in10">
//hash_src_tree = store_tree("src", ["main.scm"], []);
//hash_root_tree = store_tree("", ["README"], [{name:"src", hash:hash_src_tree}]);
</textarea>
<p>The <code>store_tree()</code> function needs to be called for the contents of subdirectories
first, and that result can be used to store the trees of upper directories. In the next section,
we will write a function which takes a list of paths, constructs an internal representation of
the hierarchy, and stores the corresponding trees bottom-up.</p>
</section>
<section id="store-tree-from-paths">
<h1>Storing a tree from a list of paths</h1>
<p>Making trees out of the subfolders one by one is cumbersome.
The following utility function takes a list of paths, and builds
a tree from those.</p>
<textarea id="in11">
function paths_to_tree(paths) {
// This temporary mutable object will store a hierarchy of
// subfolders and files, e.g.
// {
// subfolders: { src: { subfolders: {}, files: ['main.scm'] } },
// files: ['README']
// }
var hierarchy = { subfolders: {}, files: [] };
// This splits the input paths on occurrences of "/",
// and inserts them into the "hierarchy" object.
for (var i = 0; i < paths.length; i++) {
var path_components = paths[i].split('/');
var h = hierarchy;
for (var j = 0; j < path_components.length - 1; j++) {
if (! h.subfolders.hasOwnProperty(path_components[j])) {
h.subfolders[path_components[j]] = {
subfolders: {},
files: []
};
}
h = h.subfolders[path_components[j]];
}
h.files[h.files.length] = path_components[path_components.length - 1];
}
// This function takes the path to a directory, e.g. "src",
// and a hierarchy object e.g. { subfolders: {}, files: ['main.scm'] }.
// It recursively stores the tree object for that directory into
// GIT's database.
var to_tree = function(base_directory, hierarchy) {
var subtrees = [];
for (var i in hierarchy.subfolders) {
if (hierarchy.subfolders.hasOwnProperty(i)) {
subtrees[subtrees.length] = {
name: i,
hash: to_tree(join_paths(base_directory, i), hierarchy.subfolders[i])
};
}
}
return store_tree(base_directory, hierarchy.files, subtrees);
}
// Store the trees for the whole hierarchy, starting from the
// root directory of the GIT repository (which is represented
// as an empty path "")
return to_tree("", hierarchy);
}
// git add README src/main.scm
paths_to_tree(["README", "src/main.scm"]);
</textarea>
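<p>The intermediate hierarchy is easier to picture on its own. The sketch below isolates the first half of the function under the hypothetical name <code>demo_build_hierarchy</code>, without storing anything in the GIT database, and applies it to a slightly deeper example than the one used in this tutorial (the path <code>src/lib/util.scm</code> is invented for the illustration).</p>

```javascript
// Build the nested { subfolders, files } structure from a list of paths.
function demo_build_hierarchy(paths) {
  var root = { subfolders: {}, files: [] };
  paths.forEach(function (path) {
    var parts = path.split('/');
    var node = root;
    // Every component except the last one names a directory.
    for (var j = 0; j < parts.length - 1; j++) {
      if (!node.subfolders.hasOwnProperty(parts[j])) {
        node.subfolders[parts[j]] = { subfolders: {}, files: [] };
      }
      node = node.subfolders[parts[j]];
    }
    // The last component is the file name.
    node.files.push(parts[parts.length - 1]);
  });
  return root;
}

var demo_hierarchy = demo_build_hierarchy(
  ['README', 'src/main.scm', 'src/lib/util.scm']);
```

<p>Here <code>demo_hierarchy.subfolders.src.subfolders.lib.files</code> contains <code>util.scm</code>, so the tree for <code>src/lib</code> has to be stored before the tree for <code>src</code>, which in turn has to be stored before the root tree.</p>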
</section>
</section>
<section id="store-commit">
<h1>Storing a commit in the GIT database</h1>
<p>Now that the GIT database contains the entire tree for the current version,
a commit can be created. A commit contains</p>
<ul>
<li>the hash of the tree object,</li>
<li>the hash of the previous commit, which is dubbed the <code>parent</code> (merge commits have two or more parents, and the initial commit has no parent commit),</li>
<li>information about the author (the person who initially wrote the code),</li>
<li>information about the committer (the person who adds the code to the GIT
database, often the same person as the author, but it can be a different person
e.g. when someone else rewrites the history with a rebase or applies a patch received
by e-mail),</li>
<li>and a description.</li>
</ul>
<div id="example-commit-object"></div>
<script class="example">
___example('example-commit-object', function() {
var main = ___h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = ___h2f(hash_object(true, 'blob', false, 'README'));
var src = ___h2f(store_tree("src", ["main.scm"], []));
var proj = ___h2f(paths_to_tree(["README", "src/main.scm"]));
var initial_commit = ___h2f(store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit'));
var previous_names = [ main, readme, src, proj ];
var names = [ main, readme, src, proj, initial_commit ];
return { filesystem: filesystem, names: names, previous_names: previous_names };
});
</script>
<p>The author and committer information contain</p>
<ul>
<li>the person's name,</li>
<li>the person's email,</li>
<li>the *nix timestamp at which the version was authored or committed,</li>
<li>and the <a href="https://www.youtube.com/watch?v=q2nNzNo_Xps">timezone for that timestamp</a>.</li>
</ul>
<textarea id="in12">
function store_commit(tree, parents, author, committer, message) {
var commit_contents = '';
commit_contents += 'tree ' + tree + '\n';
for (var i = 0; i < parents.length; i++) {
commit_contents += 'parent ' + parents[i] + '\n';
}
commit_contents += 'author ' + author.name
+ ' <' + author.email + '> '
+ format_date(author.date) + ' '
+ format_timezone(author.timezoneMinutes) + '\n';
commit_contents += 'committer ' + committer.name
+ ' <' + committer.email + '> '
+ format_date(committer.date) + ' '
+ format_timezone(committer.timezoneMinutes) + '\n';
commit_contents += '\n';
commit_contents += '' + message + (message[message.length-1] == '\n' ? '' : '\n');
// cat commit_contents | git hash-object -w -t commit --stdin
return hash_object(true, 'commit', true, commit_contents);
}
function format_date(d) {
return Math.floor((+d) / 1000);
}
function left_pad(s, char, len) {
while ((''+s).length < len) { s = '' + char + s; }
return s;
}
function format_timezone(tm) {
var h = Math.floor(Math.abs(+tm)/60);
var m = Math.abs(+tm)%60;
return (tm >= 0 ? '+' : '-') + left_pad(h, '0', 2) + left_pad(m, '0', 2);
}
</textarea>
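<p>As a worked example of the author and committer lines, the sketch below formats one such line with standalone copies of the helpers. The names <code>demo_left_pad</code>, <code>demo_format_timezone</code> and <code>demo_person_line</code> are assumptions made for this example.</p>

```javascript
function demo_left_pad(s, char, len) {
  s = String(s);
  while (s.length < len) { s = char + s; }
  return s;
}

// Offset in minutes to the "+HHMM" / "-HHMM" form used in commits.
function demo_format_timezone(tm) {
  var h = Math.floor(Math.abs(tm) / 60);
  var m = Math.abs(tm) % 60;
  return (tm >= 0 ? '+' : '-') + demo_left_pad(h, '0', 2) + demo_left_pad(m, '0', 2);
}

// "author" or "committer", then name, e-mail, timestamp in seconds, timezone.
function demo_person_line(role, person) {
  return role + ' ' + person.name + ' <' + person.email + '> '
    + Math.floor(+person.date / 1000) + ' '
    + demo_format_timezone(person.timezoneMinutes);
}

var demo_author_line = demo_person_line('author', {
  name: 'Ada Lovelace',
  email: 'ada@analyti.cal',
  date: new Date(1617120803000),
  timezoneMinutes: +60
});
// demo_author_line is 'author Ada Lovelace <ada@analyti.cal> 1617120803 +0100'
```

<p>Negative and non-hour offsets follow the same pattern: <code>demo_format_timezone(-330)</code> gives <code>-0530</code>.</p>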
<section id="store-commit-example">
<h1>Storing an example commit</h1>
<p>It is now possible to store a commit in the database. This saves
a copy of the tree along with some metadata about this version.
The first commit has no parent, which is represented by passing
the empty list.</p>
<textarea id="in13">
initial_commit = store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit');
</textarea>
</section>
</section>
<section id="resolving-references">
<h1>Resolving references</h1>
<p>The next few subsections will introduce <em>symbolic references</em>
and other references like branch names, the special name <code>HEAD</code>
or tag names.</p>
<p>Most GIT commands accept as an argument a commit hash or a named reference to a hash.
In order to implement those, we need to be able to resolve these references first.</p>
<p>Symbolic references are nothing more than regular files containing a hexadecimal
hash or a string of the form <code>ref: path/to/other/symbolic/reference</code>.
The <code>HEAD</code> reference is stored in <code>.git/HEAD</code>, and can point
directly to a commit hash like
<span id="example-reference-head-hash">0123456789abcdef0123456789abcdef01234567</span>,
or can point to another symbolic reference, in which case the <code>.git/HEAD</code> file
will contain e.g. <code>ref: refs/heads/main</code>.</p>
<p>Branches are simple files stored in <code>.git/refs/heads/name-of-the-branch</code>
and usually contain a hash like
<span id="example-reference-branch-hash">0123456789abcdef0123456789abcdef01234567</span>.</p>
<p>Tags are identical to branches in terms of representation. It seems that the only difference
between tags and branches is the behaviour of <code>git checkout</code> and similar commands.
These commands, as explained in <a href="#git-checkout">the section about <code>git checkout</code></a> below,
normally write <code>ref: refs/heads/name-of-branch</code> in <code>.git/HEAD</code> when
checking out a branch, but write the hash of the target commit when checking out a tag or
any other non-branch reference.</p>
<div id="example-reference"></div>
<script class="example">
___example('example-reference', function() {
var h2f = function(hash) { return 'proj/.git/objects/'+hash.substr(0,2)+'/'+hash.substr(2); }
var main = h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = h2f(hash_object(true, 'blob', false, 'README'));
var src = h2f(store_tree("src", ["main.scm"], []));
var proj = h2f(paths_to_tree(["README", "src/main.scm"]));
var initial_commit_hash = store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit');
var initial_commit = h2f(initial_commit_hash);
git_branch('main', initial_commit_hash, true);
var main_branch = 'proj/.git/refs/heads/main';
git_tag('v1.0', initial_commit_hash, true);
var v1_0_tag = 'proj/.git/refs/tags/v1.0';
git_init_head();
var head = 'proj/.git/HEAD';
document.getElementById('example-reference-head-hash').innerText = initial_commit_hash;
document.getElementById('example-reference-branch-hash').innerText = initial_commit_hash;
var previous_names = [ main, readme, src, proj, initial_commit ];
var names = [ main, readme, src, proj, initial_commit, main_branch, v1_0_tag, head ];
return { filesystem: filesystem, names: names, previous_names: previous_names }
});
</script>
<p>We'll start with a small utility to remove the newline at the end of a string.
GIT references are usually files containing a hexadecimal hash, and following
*NIX tradition these files finish with a newline byte. When reading these
references, we need to get rid of the newline first.</p>
<textarea>
// Removes the newline at the end of a string, if present.
function trim_newline(s) {
if (s.endsWith('\n')) { return s.substr(0, s.length-1); } else { return s; }
}
</textarea>
<section id="git-symbolic-ref">
<h1><code>git symbolic-ref</code></h1>
<p><code>git symbolic-ref</code> is a low-level command which reads
(and in the official GIT implementation also writes and updates)
symbolic references given a path relative to <code>.git/</code>.
For example, <code>git symbolic-ref HEAD</code> will read the
contents of the file <code>.git/HEAD</code>, and if that file starts
with <code>ref: </code>, the rest of the line will be returned.</p>
<textarea>
function git_symbolic_ref(ref) {
var ref_file = join_paths(current_directory, '.git/' + ref);
if (exists(ref_file) && read(ref_file).startsWith('ref: ')) {
var result = trim_newline(read(ref_file)).substr('ref: '.length);
var recursive = git_symbolic_ref(result);
return recursive || result;
} else {
return false;
}
}
</textarea>
<div class="trivia">
<p>The official implementation of GIT follows references recursively
and returns the <code>path/to/file</code> of the last file of the
form <code>ref: path/to/file</code>. In the example below,
<code>git symbolic-ref HEAD</code> would
<ul>
<li>read the file <code>proj/.git/HEAD</code> which contains <code>ref: refs/heads/main</code>,</li>
<li>follow that indirection and read the file <code>proj/.git/refs/heads/main</code> which contains <code>ref: refs/heads/other</code>,</li>
<li>follow that indirection and read the file <code>proj/.git/refs/heads/other</code> which contains a hash,</li>
<li>return the last file path that contained a <code>ref:</code>, i.e. return the string <code>refs/heads/other</code>.</li>
</ul>
<div id="example-recursive-ref"></div>
<script class="example">
___example('example-recursive-ref', function() {
var h2f = function(hash) { return 'proj/.git/objects/'+hash.substr(0,2)+'/'+hash.substr(2); }
var main = h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = h2f(hash_object(true, 'blob', false, 'README'));
var src = h2f(store_tree("src", ["main.scm"], []));
var proj = h2f(paths_to_tree(["README", "src/main.scm"]));
var initial_commit_hash = store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada', email:'ada@...', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada', email:'ada@...', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit');
var initial_commit = h2f(initial_commit_hash);
write('proj/.git/refs/heads/main', 'ref: refs/heads/other\n');
var main_branch = 'proj/.git/refs/heads/main';
git_branch('other', initial_commit_hash, true);
var other_branch = 'proj/.git/refs/heads/other';
git_init_head();
var head = 'proj/.git/HEAD';
document.getElementById('example-reference-head-hash').innerText = initial_commit_hash;
document.getElementById('example-reference-branch-hash').innerText = initial_commit_hash;
var previous_names = [ initial_commit ];
var names = [ initial_commit, main_branch, other_branch, head ];
return { filesystem: filesystem, names: names, previous_names: previous_names }
});
</script>
</div>
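<p>The chain of indirections described above can be traced on an in-memory stand-in for the files under <code>.git/</code>. In this sketch the names <code>demo_refs</code> and <code>demo_symbolic_ref</code>, as well as the placeholder hash, are assumptions made for the example.</p>

```javascript
// Keys are paths relative to .git/, values are the file contents.
var demo_refs = {
  'HEAD': 'ref: refs/heads/main\n',
  'refs/heads/main': 'ref: refs/heads/other\n',
  'refs/heads/other': '0123456789abcdef0123456789abcdef01234567\n'
};

function demo_symbolic_ref(refs, name) {
  var contents = refs[name];
  if (typeof contents !== 'string' || !contents.startsWith('ref: ')) {
    return false;   // missing file, or a file that contains a hash
  }
  var target = contents.replace(/\n$/, '').slice('ref: '.length);
  // Keep following the chain; fall back to the last symbolic target found.
  return demo_symbolic_ref(refs, target) || target;
}

var demo_symbolic_result = demo_symbolic_ref(demo_refs, 'HEAD');
// demo_symbolic_result is 'refs/heads/other'
```

<p>Calling the function on <code>refs/heads/other</code> returns <code>false</code>, because that file holds a hash rather than a <code>ref:</code> line.</p>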
</section>
<section id="git-rev-parse">
<h1><code>git rev-parse</code></h1>
<p><code>git rev-parse</code> is another low-level command. It takes a symbolic reference or other reference,
and returns the hash. The difference with <code>git symbolic-ref</code> is that <code>symbolic-ref</code> follows indirections
to other references, and returns the last named reference in the chain of indirections, whereas <code>rev-parse</code>
goes one step further and returns the hash pointed to by the last named reference.</p>
<textarea>
function git_rev_parse(ref) {
var symbolic_ref_target = git_symbolic_ref(ref);
if (symbolic_ref_target) {
// symbolic ref like "ref: refs/heads/main"
return git_rev_parse(symbolic_ref_target);
} else if (/^[0-9a-f]{40}$/.test(ref)) {
// hash like "0123456789abcdef0123456789abcdef01234567"
return ref;
} else if (ref == 'HEAD') {
// user-friendly reference like "HEAD"
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/' + ref))));
} else if (ref.startsWith('refs/') && exists(join_paths(current_directory, '.git/' + ref))) {
// user-friendly reference like "refs/heads/main"
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/' + ref))));
} else if (exists(join_paths(current_directory, '.git/refs/heads/' + ref))) {
// user-friendly reference like "main" (a branch)
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/refs/heads/' + ref))));
} else if (exists(join_paths(current_directory, '.git/refs/tags/' + ref))) {
// user-friendly reference like "v1.0" (a tag)
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/refs/tags/' + ref))));
} else {
// unknown ref
return false;
}
}
</textarea>
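<p>When a function like this one decides whether its argument is a literal hash, it matters whether the regular expression is anchored at both ends. The short sketch below (variable names and the placeholder hash are illustrative only) compares the two variants.</p>

```javascript
var demo_full_hash = '0123456789abcdef0123456789abcdef01234567';

// Matches any string that merely contains 40 hexadecimal digits.
var demo_unanchored = /[0-9a-f]{40}/;
// Matches only a string that is exactly 40 hexadecimal digits.
var demo_anchored = /^[0-9a-f]{40}$/;

var demo_checks = {
  unanchored_on_path: demo_unanchored.test('refs/tags/' + demo_full_hash), // true
  anchored_on_path: demo_anchored.test('refs/tags/' + demo_full_hash),     // false
  anchored_on_hash: demo_anchored.test(demo_full_hash),                    // true
  anchored_on_short: demo_anchored.test(demo_full_hash.slice(0, 7))        // false
};
```

<p>With the unanchored pattern, a reference name that happens to embed a hash would be returned as if it were the hash itself.</p>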
</section>
</section>
<section id="git-branch">
<h1><code>git branch</code></h1>
<p>A branch is a pointer to a commit, stored in a file in <code>.git/refs/heads/name_of_the_branch</code>.
The branch can be overwritten with <code>git branch -f</code>. Also, as will be explained later,
<code>git commit</code> can update the pointer of a branch.</p>
<textarea id="in14">
function git_branch(branch_name, commit_ref, force) {
var commit_hash = git_rev_parse(commit_ref);
mkdir(join_paths(current_directory, '.git/refs'));
mkdir(join_paths(current_directory, '.git/refs/heads'));
if (!force && exists(join_paths(current_directory, '.git/refs/heads/' + branch_name))) {
alert("branch already exists");
return false;
} else {
write(join_paths(current_directory, '.git/refs/heads/' + branch_name), commit_hash + '\n');
return true;
}
}
</textarea>
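<p>The effect of the <code>force</code> argument can be seen in isolation with a reduced sketch that keeps the branches in a plain object. The names <code>demo_heads</code> and <code>demo_branch</code> and the placeholder hashes are assumptions made for this example; the real function above also resolves the commit reference and reports the refusal to the user.</p>

```javascript
// Branch name -> contents of the corresponding file in .git/refs/heads/.
var demo_heads = {};

function demo_branch(heads, name, hash, force) {
  if (!force && heads.hasOwnProperty(name)) {
    return false;               // refuse to overwrite an existing branch
  }
  heads[name] = hash + '\n';
  return true;
}

var demo_hash_a = 'a'.repeat(40);   // placeholder hashes
var demo_hash_b = 'b'.repeat(40);

var demo_created = demo_branch(demo_heads, 'main', demo_hash_a, false);     // true
var demo_refused = demo_branch(demo_heads, 'main', demo_hash_b, false);     // false
var demo_overwritten = demo_branch(demo_heads, 'main', demo_hash_b, true);  // true
```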
<p>When we call <code>git branch main HEAD</code> or equivalently
<code>git branch main <span id="example-git-branch-head-hash">0123456789012345678901234567890123456789</span></code>,
a file containing that hash is created in <code>.git/refs/heads/main</code>. This file acts as a pointer
to the commit, and this pointer can be read e.g. by <code>git rev-parse</code>.</p>
<div id="example-git-branch"></div>
<script class="example">
___example('example-git-branch', function() {
var h2f = function(hash) { return 'proj/.git/objects/'+hash.substr(0,2)+'/'+hash.substr(2); }
var main = h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = h2f(hash_object(true, 'blob', false, 'README'));
var src = h2f(store_tree("src", ["main.scm"], []));
var proj = h2f(paths_to_tree(["README", "src/main.scm"]));
var initial_commit_hash = store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada', email:'ada@...', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada', email:'ada@...', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit');
var initial_commit = h2f(initial_commit_hash);
git_branch('main', initial_commit_hash, true);
var main_branch = 'proj/.git/refs/heads/main';
//git_init_head();
//var head = 'proj/.git/HEAD';
document.getElementById('example-git-branch-head-hash').innerText = initial_commit_hash;
var previous_names = [ main, readme, src, proj, initial_commit ];
var names = [ main, readme, src, proj, initial_commit, main_branch ];
return { filesystem: filesystem, names: names, previous_names: previous_names }
});
</script>
<p>After creating the branch, we show how the file <code>.git/refs/heads/main</code> can be overwritten
using <code>git branch -f</code>.</p>
<textarea id="inex14">
// git branch main 0123456789012345678901234567890123456789
git_branch('main', initial_commit, false);
// git branch -f main 0123456789012345678901234567890123456789
git_branch('main', initial_commit, true);
</textarea>
</section>
<section id="HEAD">
<h1><code>HEAD</code></h1>
<p>
The <code>HEAD</code> indicates the "current" commit. It is set at first as part of the <code>git init</code> routine.
</p>
<textarea id="in15">
function git_init_head() {
write(join_paths(current_directory, '.git/HEAD'), 'ref: refs/heads/main\n');
}
git_init_head();
</textarea>
<p>
Usually, the <code>HEAD</code> is a symbolic reference to a branch, i.e. the
file <code>.git/HEAD</code> contains <code>ref: refs/heads/name-of-branch</code>.
When checking out a commit by specifying its hash directly, or when checking out
a non-branch reference, the file <code>.git/HEAD</code> contains the hash of the
commit instead.
</p>
<p>
The state in which <code>.git/HEAD</code> contains a commit hash is called
"detached HEAD", a name which often sounds alarming to people who have not encountered it
before. As we will see in the following sections, the only difference between detached
HEAD and the normal state lies in what <code>git commit</code> updates. In the normal state,
<code>git commit</code> updates the current branch to point to the new commit. When the
<code>HEAD</code> is detached, it does not point to any branch, so <code>git commit</code>
updates the HEAD directly instead, overwriting it with the new commit hash.
</p>
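<p>The two possible forms of <code>.git/HEAD</code> can be told apart by looking at the contents of the file.
The following is a minimal, self-contained sketch which is not part of the implementation in this tutorial;
the helper names <code>describeHead</code> and <code>fileUpdatedByCommit</code> are hypothetical.
It shows which file a new commit would overwrite in each state.</p>

```javascript
// Hypothetical helper: classify the raw contents of .git/HEAD.
function describeHead(contents) {
  var text = contents.replace(/\n$/, '');
  if (text.indexOf('ref: ') === 0) {
    // Normal state: HEAD is a symbolic reference to a branch file.
    return { detached: false, target: text.substr('ref: '.length) };
  }
  if (/^[0-9a-f]{40}$/.test(text)) {
    // Detached HEAD: the file holds a commit hash directly.
    return { detached: true, target: text };
  }
  throw new Error('unrecognized HEAD contents: ' + text);
}

// Hypothetical helper: which file would a new commit overwrite?
function fileUpdatedByCommit(headContents) {
  var head = describeHead(headContents);
  return head.detached ? '.git/HEAD' : '.git/' + head.target;
}
```

<p>With <code>ref: refs/heads/main</code> in the HEAD, the file to update is <code>.git/refs/heads/main</code>;
with a bare hash, it is <code>.git/HEAD</code> itself.</p>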
<p>
Since the HEAD is supposed to be a transient pointer, it is easy to lose track of the hash of
an important commit. For example, the following sequence of operations:
</p>
<pre>
git checkout 0123456789abcdef0123456789abcdef01234567
touch new_file
git add new_file
git commit -m 'This is a commit adding a new file'
git checkout branch-of-feature-foobar
</pre>
<p>
roughly means:
</p>
<pre>
// First step: detach the HEAD and overwrite the contents of the working
// directory with the contents of commit 0123456789abcdef0123456789abcdef01234567
HEAD = 0123456789abcdef0123456789abcdef01234567
checkout(0123456789abcdef0123456789abcdef01234567)
// Second step: create a commit with the new file
HEAD = commit(…)
// Third step: check out the other branch
HEAD = 'ref: refs/heads/branch-of-feature-foobar'
</pre>
<p>
The hash of the new commit, which is stored in HEAD in the second step, is overwritten
in the third step. In order to later retrieve that specific version with the precious
<code>new_file</code>, one needs that hash. It would be possible to note down these hashes in a
simple text file, but GIT offers a mechanism for that: branches. After all, branches are
merely named text files containing the hash of the latest commit in that line of work.
</p>
<p>
The hash of a commit created with <code>git commit</code> does not only exist in the
HEAD file (when in detached HEAD) or in the current branch file (normal mode). The official
implementation of GIT keeps a log of the changes being made to the various references.
<code>.git/logs/HEAD</code> contains a log of the hashes pointed to by <code>.git/HEAD</code>,
and <code>.git/logs/refs/heads/main</code> contains a log of the hashes pointed to by
<code>.git/refs/heads/main</code>, and the commands <code>git reflog</code> and
<code>git reflog main</code> pretty-print these files.
</p>
<p>
There are a few more ways to find a lost commit hash, including a careful invocation of
<code>git fsck</code>. This command checks that the files stored in <code>.git/</code> are not
corrupted, and that no reference (to another reference or to a commit, tree or blob) points
to a non-existing file. Its <code>--unreachable</code> option tells the command
to print the hashes of all objects which are not pointed to, directly or indirectly, by any
named reference (so-called unreachable objects, which are well-formed but cannot be reached
from a branch or other kind of named pointer).
</p>
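<p>The notion of unreachable object can be illustrated with a small graph traversal. This is only a sketch on a
made-up object database: the short names standing in for hashes, the <code>objects</code> and <code>refs</code>
tables and the function <code>unreachableObjects</code> are all hypothetical, and details of the real command are left aside.</p>

```javascript
// Hypothetical object database: each "hash" maps to the hashes it references
// (a commit references its tree and parents, a tree references its entries).
var objects = {
  'commit2': ['tree2', 'commit1'],
  'commit1': ['tree1'],
  'lost-commit': ['tree1', 'commit1'],
  'tree2': ['blob-a', 'blob-b'],
  'tree1': ['blob-a'],
  'blob-a': [],
  'blob-b': []
};
// Hypothetical named references and the commits they point to.
var refs = { 'refs/heads/main': 'commit2' };

function unreachableObjects(objects, refs) {
  var reachable = {};
  // Start from the targets of all named references...
  var todo = Object.keys(refs).map(function (name) { return refs[name]; });
  // ...and follow every link until no new object is discovered.
  while (todo.length > 0) {
    var hash = todo.pop();
    if (reachable[hash]) { continue; }
    reachable[hash] = true;
    todo = todo.concat(objects[hash] || []);
  }
  return Object.keys(objects).filter(function (hash) { return !reachable[hash]; });
}

var unreachable = unreachableObjects(objects, refs);
```

<p>Here <code>lost-commit</code> is the only object reported: its tree and its parent are still reachable
from <code>main</code>, but nothing points to the commit itself.</p>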
<p>
The reflog can be used to recover a lost hash but handling hashes manually like this is
somewhat error-prone, and most new users are not aware of those features; for this reason
GIT commands tend to display a warning when switching to a detached HEAD state.
</p>
</section>
<section id="git-config">
<h1>git config</h1>
<p>
The official implementation of GIT stores the settings in various files (<code>.git/config</code> within a repository,
<code>~/.gitconfig</code> in the user's home folder, and several other places).
</p>
<textarea id="in16">
var gitconfig = {
user: {
name: 'Ada Lovelace',
email: 'ada@analyti.cal',
}
};
var $EDITOR = function() { return window.prompt('Commit message:'); }
</textarea>
<p>
These files use a <code>.ini</code> syntax
with <code>key = value</code> lines grouped under some <code>[section]</code> headings. The configuration above could be
stored in <code>~/.gitconfig</code> or <code>.git/config</code> using the following syntax:
</p>
<pre>
[user]
name = Ada Lovelace
email = ada@analyti.cal
</pre>
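<p>To make the correspondence between the two notations concrete, here is a minimal sketch of a parser for this syntax.
It is an illustration only: the function <code>parseSimpleConfig</code> is hypothetical, and it assumes that the file
contains nothing but <code>[section]</code> headings and <code>key = value</code> lines, whereas the real
configuration files support more (comments, quoting, subsections).</p>

```javascript
// Minimal sketch: parse "[section]" headings and "key = value" lines.
// Assumption: no comments, quoting or subsections.
function parseSimpleConfig(text) {
  var config = {};
  var section = null;
  text.split('\n').forEach(function (line) {
    line = line.trim();
    if (line === '') { return; }
    var heading = line.match(/^\[([^\]]+)\]$/);
    if (heading) {
      section = heading[1];
      config[section] = config[section] || {};
      return;
    }
    var pair = line.match(/^([^=]+?)\s*=\s*(.*)$/);
    if (!pair || section === null) { throw new Error('cannot parse line: ' + line); }
    config[section][pair[1]] = pair[2];
  });
  return config;
}

var parsed = parseSimpleConfig('[user]\n    name = Ada Lovelace\n    email = ada@analyti.cal\n');
```

<p>The resulting object has the same shape as the <code>gitconfig</code> variable above, with
<code>parsed.user.name</code> and <code>parsed.user.email</code> holding the two values.</p>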
<p>
The <code>$EDITOR</code> variable is a traditional *NIX environment variable, and could e.g. be declared with
<code>EDITOR=nano</code> in <code>~/.profile</code> or <code>~/.bashrc</code>.
</p>
</section>
<section id="git-commit">
<h1><code>git commit</code></h1>
<p>
The <code>git commit</code> command stores a commit (metadata and a pointer to a tree
containing the files given on the command-line), and updates the <code>HEAD</code> or
current branch to point to the new commit.
</p>
<textarea>
function git_commit(file_paths, message) {
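var now = new Date();
// getTimezoneOffset() returns UTC minus local time in minutes, hence the sign flip.
var timezoneMinutes = -(now.getTimezoneOffset());
// The parent is the commit currently pointed to by HEAD, if there is one.
var parent = git_rev_parse('HEAD');
var parents = parent ? [parent] : [];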
var new_commit_hash = store_commit(
paths_to_tree(file_paths),
parents,
{name:gitconfig.user.name, email:gitconfig.user.email, date:now, timezoneMinutes:timezoneMinutes },
{name:gitconfig.user.name, email:gitconfig.user.email, date:now, timezoneMinutes:timezoneMinutes },
message || $EDITOR());
advance_head_or_branch(new_commit_hash);
return new_commit_hash;
}
</textarea>
<p>If the <code>HEAD</code> points to a commit hash, then <code>git commit</code> updates the <code>HEAD</code> to point to the new commit.
Otherwise, when the <code>HEAD</code> points to a branch, then the target branch (represented by a file named <code>.git/refs/heads/the_branch_name</code>) is updated.</p>
<textarea>
function advance_head_or_branch(new_commit_hash) {
var referenced_branch = git_symbolic_ref('HEAD');
if (referenced_branch) {
// Update the target of the ref:
write(join_paths(current_directory, '.git/' + referenced_branch), new_commit_hash + '\n');
} else {
// Detached HEAD, update .git/HEAD directly.
write(join_paths(current_directory, '.git/HEAD'), new_commit_hash + '\n');
}
}
</textarea>
<p>
The official implementation of <code>git commit</code> makes use of <a href="#index">the index</a>.
When a file is scheduled for the next commit using <code>git add path/to/file</code>, it is added to
the index. The index is a representation of a collection of copies of files, which can efficiently be
compared to the working directory. It uses a different representation, but its role is very similar
to that of a tree object along with the subtrees and blob objects of individual files. When
<code>git commit</code> is called without specifying any files, it creates a commit containing the
version of the files stored in the index.
</p>
<p>
In this simplified implementation, we only support creating commits by specifying all the files that
must be present in the commit (including unchanged files). This contrasts with the official implementation
which would create a tree containing the files from the current HEAD, as well as the added, modified or
deleted files specified by <code>git add</code> or specified directly on the <code>git commit</code>
command-line.
</p>
<textarea>
write('proj/README', 'This is my Scheme project -- with updates!');
var second_commit = git_commit(['README', 'src/main.scm'], 'Some updates');
</textarea>
</section>
<section id="git-tag">
<h1><code>git tag</code></h1>
<p>Tags behave like branches, but are stored in <code>.git/refs/tags/the_tag_name</code>
and are not normally modified. Once created, a tag is supposed to always point
to the same version.</p>
<p>GIT does offer a <code>git tag -f existing-tag new-hash</code> command,
but using it should be a rare occurrence.</p>
<textarea id="in17">
function git_tag(tag_name, commit_hash, force) {
mkdir(join_paths(current_directory, '.git/refs'));
mkdir(join_paths(current_directory, '.git/refs/tags'));
if (!force && exists(join_paths(current_directory, '.git/refs/tags/' + tag_name))) {
alert("tag already exists");
return false;
} else {
write(join_paths(current_directory, '.git/refs/tags/' + tag_name), commit_hash + '\n');
return true;
}
}
</textarea>
<p>Intuitively, tags differ from branches in the following way: when a branch is checked out
and a subsequent commit is made, the branch is updated to point to the new commit's hash,
whereas a tag keeps pointing to the same commit.
As we've seen in the implementation of <code>git commit</code>, the difference is actually
in the contents of the <code>.git/HEAD</code> file. If it is a symbolic reference (generally
a pointer to a branch), then the target of that reference is updated every time a new commit
is created. If the <code>.git/HEAD</code> file contains the hash of a commit, then the
<code>.git/HEAD</code> file itself is updated every time a new commit is created.
</p>
<p>
Therefore, tags and branches differ only in their usage and in the path under which they are
stored (<code>.git/refs/heads/name-of-the-branch</code> vs. <code>.git/refs/tags/name-of-the-tag</code>).
The file <code>.git/HEAD</code> is overwritten by <code>git commit</code> and <code>git checkout</code>.
It is the latter command which will behave differently for tags and branches; <code>git checkout branch-name</code>
turns the HEAD into a symbolic reference, whereas <code>git checkout tag-name</code> resolves the tag name to
a commit hash, and writes that hash directly into <code>.git/HEAD</code>.
</p>
<textarea id="inex17">
// git tag v1.0 0123456789012345678901234567890123456789
git_tag('v1.0', second_commit);
</textarea>
</section>
<section id="git-checkout">
<h1><code>git checkout</code></h1>
<section id="checkout-branch-vs-other">
<p>
The <code>git checkout commit-hash-or-reference</code> command modifies the HEAD to point to the given commit,
and modifies the working directory to match the contents of the tree object pointed to by that commit.
</p>
<textarea id="in18">
function git_checkout(tag_or_branch_or_hash) {
if (exists(join_paths(current_directory, '.git/refs/heads/' + tag_or_branch_or_hash))) {
// Normal (attached) HEAD, points to 'ref: refs/heads/the_branch_name'
write(join_paths(current_directory, '.git/HEAD'), 'ref: refs/heads/' + tag_or_branch_or_hash + '\n');
} else {
// Detached HEAD, points directly to commit hash
write(join_paths(current_directory, '.git/HEAD'), git_rev_parse(tag_or_branch_or_hash) + '\n');
}
checkout_files(git_rev_parse('HEAD'));
}
</textarea>
<h1>Checkout, branches and other references</h1>
<p>The HEAD does not normally point to a tag. Although nothing actually
prevents writing <code>ref: refs/tags/v1.0</code> into <code>.git/HEAD</code>, the GIT
commands will not automatically do this. For example, <code>git checkout tag-or-branch-or-hash</code>
will put a symbolic <code>ref: </code> in <code>.git/HEAD</code> only if the argument is a branch.</p>
</section>
<section id="checkout-files">
<h1>Checking out files</h1>
<p>
In order to replace the contents of the working directory with those of the given commit, we
recursively compare the subtrees, deleting from the working directory the files or directories
that are not present in the tree object, and overwriting the others.
</p>
<p>
The official implementation of GIT will record the diff between the current working directory
and the current commit, and will re-apply these changes on top of the freshly checked-out commit.
The official <code>git checkout</code> command will print warnings and refuse to proceed when
these changes cannot be re-applied without conflict, encouraging the user to create a commit
containing this updated version or to stash the changes (effectively creating a temporary commit
containing this version, pointed to by <code>.git/refs/stash</code>). Our simple implementation
will always overwrite the changes.
</p>
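<p>The kind of check performed by the official implementation can be approximated as follows. This is a rough sketch
and not GIT's actual algorithm: it works on whole files instead of diffs, it represents the working directory and
the two commits as hypothetical flat tables from paths to contents, and the function name
<code>conflictingPaths</code> is made up for this illustration.</p>

```javascript
// Each argument maps file paths to file contents (hypothetical flat snapshots).
function conflictingPaths(workingDirectory, currentCommit, targetCommit) {
  var all = {};
  [workingDirectory, currentCommit, targetCommit].forEach(function (files) {
    Object.keys(files).forEach(function (path) { all[path] = true; });
  });
  return Object.keys(all).filter(function (path) {
    var locallyModified = workingDirectory[path] !== currentCommit[path];
    var changedByCheckout = currentCommit[path] !== targetCommit[path];
    // A local change can be kept when the target commit leaves that file
    // alone; otherwise the local change and the checkout collide.
    return locallyModified && changedByCheckout;
  });
}

var current = { 'README': 'v1', 'src/main.scm': 'v1' };
var target  = { 'README': 'v1', 'src/main.scm': 'v2' };
var safe    = conflictingPaths({ 'README': 'edited', 'src/main.scm': 'v1' }, current, target);
var unsafe  = conflictingPaths({ 'README': 'v1', 'src/main.scm': 'edited' }, current, target);
```

<p>In the first case the edited <code>README</code> is identical in both commits, so the checkout could proceed and keep the edit.
In the second case <code>src/main.scm</code> was edited locally and also differs between the two commits, so a
careful implementation would refuse to overwrite it.</p>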
<textarea>
function checkout_files(hash) {
var commit = parse_commit(hash);
checkout_tree(current_directory, commit.tree);
}
function checkout_tree(path_prefix, hash) {
var entries = parse_tree(hash);
var entries_names = entries.map(function (entry) { return entry.name; });
var working_directory_contents = listdir(path_prefix);
for (var i = 0; i < working_directory_contents.length; i++) {
if (entries_names.indexOf(working_directory_contents[i]) == -1
&& working_directory_contents[i] != '.git') {
// The file or directory exists in the working directory, but
// not in the commit that is being checked out, remove it recursively.
remove(join_paths(path_prefix, working_directory_contents[i]), true);
}
}
for (var i = 0; i < entries.length; i++) {
var o = parse_object(entries[i].hash);
var entry_path = join_paths(path_prefix, entries[i].name);
if (o.type == 'blob') {
write(entry_path, o.contents);
} else {
checkout_tree(entry_path, entries[i].hash)
}
}
}
</textarea>
</section>
<section id="parse-assert">
<h1>Assert</h1>
<p>
The <code>checkout_tree()</code> function needs to read the commit, tree and blob objects from the
<code>.git/</code> folder. The following sections will introduce some parsers for these objects.
The parsers will check that their input looks reasonably well-formed, using <code>assert()</code>.</p>
<textarea>
function assert(boolean, text) {
if (! boolean) { alert("assertion failed: " + text); throw new Error(text); }
}
</textarea>
</section>
<section id="parsed-compressed">
<h1>Reading compressed objects</h1>
<textarea>
function parse_object(hash) {
var compressed = read(join_paths(current_directory, '.git/objects/' + hash.substr(0,2) + '/' + hash.substr(2)));
var inflated = inflate(compressed);
var split = inflated.match(/^([\s\S]*?) ([\s\S]*?)\0([\s\S]*)$/);
assert(split, "ill-formed object");
var type = split[1];
var length = split[2];
var contents = split[3];
assert(contents.length == length, "object has incorrect length");
return { type: type, length: length, contents: contents };
}
</textarea>
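<p>Once inflated, an object consists of its type, a space, the length of its contents, a null byte, and the contents themselves.
The following self-contained sketch applies the same checks to a literal string, without any compression involved;
the function <code>splitObject</code> is a hypothetical stand-in which assumes that each character of the string represents one byte.</p>

```javascript
// Split an already-inflated object into its header fields and its contents.
function splitObject(inflated) {
  var nul = inflated.indexOf('\0');
  if (nul === -1) { throw new Error('ill-formed object: no null byte'); }
  var header = inflated.substring(0, nul).split(' ');
  if (header.length !== 2) { throw new Error('ill-formed object header'); }
  var contents = inflated.substring(nul + 1);
  var length = parseInt(header[1], 10);
  if (contents.length !== length) { throw new Error('object has incorrect length'); }
  return { type: header[0], length: length, contents: contents };
}

var blob = splitObject('blob 6\0hello\n');
```

<p>Here the header announces a <code>blob</code> of 6 bytes, which matches the five letters of <code>hello</code> plus the newline.</p>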
</section>
<section id="parse-tree">
<h1>Parsing tree objects</h1>
<textarea>
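function parse_tree(hash) {
var tree = parse_object(hash);
var contents = tree.contents;
var entries = [];
var pos = 0;
// Each entry is "mode name\0" followed by the 20 raw bytes of the hash.
// We look for the \0 which ends the name instead of splitting with a regexp,
// because the raw bytes of a hash may themselves contain a \0.
while (pos < contents.length) {
var end_of_name = contents.indexOf('\0', pos);
assert(end_of_name != -1 && end_of_name + 21 <= contents.length, 'invalid contents of tree object');
entries.push(parse_tree_entry(contents.substring(pos, end_of_name + 21)));
pos = end_of_name + 21;
}
return entries;
}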
</textarea>
<textarea>
function parse_tree_entry(entry) {
var split = entry.match(/^([0-9]+) ([\s\S]*)\0([\s\S]{20})$/);
assert(split, 'invalid entry in tree object');
var mode = split[1];
var name = split[2];
var hash = to_hex(split[3]);
return { mode: mode, name: name, hash: hash };
}
</textarea>
<p>The <code>parse_tree_entry</code> function above needs a small utility to convert hashes represented using raw bytes to a hexadecimal representation.</p>
<textarea id="in19">
function to_hex(bin) {
var bin = String(bin);
var hex = "";
for (var i = 0; i < bin.length; i++) {
hex += left_pad(bin.charCodeAt(i).toString(16), '0', 2);
}
return hex;
}
</textarea>
</section>
<section id="parse-commit">
<h1>Parsing commit objects</h1>
<textarea>
function parse_commit(hash) {
var commit = parse_object(hash);
var lines = commit.contents.split('\n');
var tree = null;
var parents = [];
var author = null;
var committer = null;
var i;
// A blank line separates the headers from the message.
for (i = 0; i < lines.length && lines[i] != ''; i++) {
var split = lines[i].match(/^(.*?) (.*)$/);
assert(split, "ill-formed commit header: " + lines[i]);
var header = split[1];
var value = split[2];
switch (header) {
case 'tree':
assert(!tree, 'duplicate tree header in commit');
assert(/^[0-9a-f]{40}$/.test(value), "invalid tree header in commit");
tree = value;
break;
case 'parent':
assert(/^[0-9a-f]{40}$/.test(value), "invalid parent header in commit");
parents.push(value);
break;
case 'author':
assert(!author, 'duplicate author header in commit');
author = parse_author(value, 'author');
break;
case 'committer':
assert(!committer, 'duplicate committer header in commit');
committer = parse_author(value, 'committer');
break;
default: /* unknown field, skipping */ break;
}
}
// The message is everything after the blank line.
var message = lines.slice(i+1).join('\n');
assert(tree, 'commit lacks tree header');
assert(author, 'commit lacks author header');
assert(committer, 'commit lacks committer header');
return {
tree: tree,
parents: parents,
author: author,
committer: committer,
message: message
};
}
</textarea>
</section>
<section id="parse-author-committer">
<h1>Parsing author and committer metadata</h1>
<textarea>
function parse_author(value, field) {
var split = value.match(/^(.*?) <(.*?)> ([0-9]+) ([+-])([0-9][0-9])([0-9][0-9])$/);
assert(split, 'ill-formed ' + field)
var name = split[1];
var email = split[2];
var date = new Date(parseInt(split[3], 10) * 1000);
var timezone_sign = (split[4] == '+' ? 1 : -1);
var timezone_hours = parseInt(split[5], 10);
var timezone_minutes = parseInt(split[6], 10);
var timezone = timezone_sign * (timezone_hours * 60 + timezone_minutes);
// The offset is in minutes, under the same field name as the one given to store_commit().
return { name: name, email: email, date: date, timezoneMinutes: timezone };
}
</textarea>
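<p>The timezone is stored as a sign followed by two digits for the hours and two digits for the minutes, and is parsed
above into a signed number of minutes. As a sanity check of that arithmetic, here is a sketch of the reverse conversion;
the function <code>formatTimezone</code> is hypothetical and is not used elsewhere in this tutorial.</p>

```javascript
// Hypothetical inverse of the timezone parsing: signed minutes to "+HHMM" or "-HHMM".
function formatTimezone(minutes) {
  var sign = minutes < 0 ? '-' : '+';
  var absolute = Math.abs(minutes);
  var hours = String(Math.floor(absolute / 60));
  var mins = String(absolute % 60);
  // Pad both fields to two digits.
  if (hours.length < 2) { hours = '0' + hours; }
  if (mins.length < 2) { mins = '0' + mins; }
  return sign + hours + mins;
}
```

<p>For instance, an offset of 60 minutes is written <code>+0100</code>, and an offset of -330 minutes is written <code>-0530</code>.</p>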
</section>
<section id="checkout-example">
<h1>Example checkout</h1>
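<p>As an example, we check out the initial commit by giving its hash. Since the argument is not the name of a branch, the hash is written directly into <code>.git/HEAD</code>.</p>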
<textarea id="in20">
git_checkout(initial_commit);
</textarea>
</section>
</section>
<section id="git-init">
<h1><code>git init</code></h1>
<p>The <code>git init</code> command creates the <code>.git</code> directory and points <code>.git/HEAD</code>
to the default branch (a file which does not exist yet, as this branch does not contain any commit at this point).</p>
<textarea id="in21">
function git_init() {
git_init_mkdir();
git_init_head();
}
</textarea>
</section>
<section id="index">
<h1>The index</h1>
<p>When adding files with <code>git add</code>, GIT does not immediately create a commit object.
Instead, it adds the files to the index, which uses a binary format with lots of metadata.
The mock filesystem used here lacks most of these pieces of information, so the value <code>0</code>
will be used for most fields. See <a href="https://mincong.io/2018/04/28/git-index/">this blog post</a>
for a more in-depth study of the index.</p>
<textarea id="index-raw-bytes-utils">
function raw_bytes(val, bytes) {
return hex_to_raw_bytes(left_pad(val.toString(16), '0', bytes*2));
}
function raw_bytes16(val) { return raw_bytes(val, 2); }
function raw_bytes32(val) { return raw_bytes(val, 4); }
function raw_bytes64(val) { return raw_bytes(val, 8); }
</textarea>
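<p>The numeric fields of the index are stored as fixed-width integers, most significant byte first. The following sketch
shows the same encoding with the bytes listed as plain numbers, which is easier to read than a raw string;
the function <code>bigEndianBytes</code> is a hypothetical helper written for this illustration.</p>

```javascript
// Encode a non-negative integer as a fixed number of big-endian bytes,
// returned as an array of numbers for readability.
function bigEndianBytes(value, byteCount) {
  var bytes = [];
  for (var i = 0; i < byteCount; i++) {
    // Take the lowest byte, then shift the rest down by 8 bits.
    bytes.unshift(value % 256);
    value = Math.floor(value / 256);
  }
  if (value !== 0) { throw new Error('value does not fit in ' + byteCount + ' bytes'); }
  return bytes;
}

var version = bigEndianBytes(2, 4);
var mode = bigEndianBytes(parseInt('100644', 8), 4);
```

<p>The version number 2 becomes the four bytes 0, 0, 0, 2, and the octal mode 100644 (hexadecimal 81a4) becomes 0, 0, 129, 164.</p>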
<textarea id="make-index">
function store_index(paths) {
var magic = 'DIRC' // DIRectory Cache
var version = raw_bytes32(2);
var entries = raw_bytes32(paths.length);
var header = magic + version + entries;
var index = header;
for (var i = 0; i < paths.length; i++) {
var ctime = raw_bytes64(0);
var mtime = raw_bytes64(0);
var device = raw_bytes32(0);
var inode = raw_bytes32(0);
// default permissions for files, in octal.
var mode = raw_bytes32(0o100644);
var uid = raw_bytes32(0);
var gid = raw_bytes32(0);
var size = raw_bytes32(read(join_paths(current_directory, paths[i])).length);
var hash = hex_to_raw_bytes(hash_object(true, 'blob', false, paths[i]));
// for this simple index, the flags (the 4 higher bits) are 0.
assert(paths[i].length < 0xfff, 'file path too long for the 12-bit length field');
var flags_and_file_path_length = raw_bytes16(paths[i].length);
var file_path = paths[i] + '\0';
var entry = ctime + mtime + device + inode + mode + uid + gid + size
+ hash + flags_and_file_path_length + file_path;
while (entry.length % 8 != 0) {
// pad with null bytes to a multiple of 8 bytes (64-bits).
entry += '\0';
}
index += entry;
}
index += hex_to_raw_bytes(sha1(index));
write(join_paths(current_directory, '.git/index'), index)
}
</textarea>
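<p>The padding rule used above can be checked with a little arithmetic: the fixed fields of an entry add up to 62 bytes,
to which the path and its terminating null byte are appended before rounding up to a multiple of 8.
The function <code>indexEntryLength</code> below is a hypothetical helper which only computes that size.</p>

```javascript
// Size in bytes of one index entry, as built by store_index(), for a path of the given length.
function indexEntryLength(pathLength) {
  var fixedFields = 8 + 8     // ctime, mtime
                  + 4 * 6     // device, inode, mode, uid, gid, size
                  + 20        // raw SHA-1 hash
                  + 2;        // flags and path length
  var unpadded = fixedFields + pathLength + 1; // path plus its terminating null byte
  // Round up to the next multiple of 8 bytes.
  return Math.ceil(unpadded / 8) * 8;
}
```

<p>For <code>README</code> (6 characters), the entry takes 62 + 6 + 1 = 69 bytes, padded to 72.
For <code>src/main.scm</code> (12 characters), it takes 75 bytes, padded to 80.</p>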
</section>
<section id="playground">
<h1>Playground</h1>
<p>The implementation is now sufficiently complete to create a small repository.</p>
<textarea id="playground-reset">
// Reset the filesystem to its initial state
filesystem = {};
current_directory = '';
</textarea>
<textarea id="playground-play">
mkdir('proj');
cd('proj');
write('proj/README', 'This is my implementation of GIT.\n');
mkdir('proj/src');
write('proj/src/main.scm', "(define filesystem '())\n...\n");
git_init();
git_commit(['README', 'src/main.scm'], 'A well-understood initial commit.');
git_branch('dev', 'HEAD');
git_checkout('dev');
write('proj/src/main.scm', "(define filesystem '())\n(define current_directory \"\")\n");
git_commit(['README', 'src/main.scm'], 'What an update!');
git_checkout('main');
// update the cache of the working directory. Without this,
// GIT finds an empty cache, and thinks all files are scheduled
// for deletion, until "git add ." allows it to realize that
// the working directory matches the contents of HEAD.
store_index(['README', 'src/main.scm']);
</textarea>
<p>By clicking on "Copy commands to recreate in *nix terminal.", it is possible to copy a series of <code>mkdir …</code> and <code>printf … > …</code> commands that, when executed, will recreate the virtual filesystem on a real system. The resulting
folder is bit-compatible with the official <code>git log</code>, <code>git status</code>, <code>git checkout</code> etc.
commands.</p>
</section>
<section id="suggested-exercises">
<h1>Suggested exercises</h1>
<p>
The reader willing to improve their grasp of GIT's mental model, and reduce their reliance on a few learned recipes, might
be interested in the following warm-up exercises:
</p>
<ul>
<li>
Inspect an existing repository, starting with <code>cat .git/HEAD</code> and using <code>git cat-file -p some-hash</code>
to pretty-print an object given its hash.
</li>
<li>
Inspect an existing repository, starting with <code>cat .git/HEAD</code> and using the <code>zlib</code> decompression tool
from the <a href="#zlib-compression-note"><code>zlib</code> compression</a> section.
</li>
<li>
Run <code>git init new-directory</code> in a terminal, and create an initial single-file commit from scratch, using only
<code>git hash-object</code>, <code>printf</code> and overwriting <code>.git/HEAD</code>. This will involve retracing the
steps in this tutorial to create a blob object for the file, a tree object to be the directory containing just that file,
and a commit object.
</li>
<li>
For a couple of weeks, only use the GIT commands <code>commit</code>, <code>diff</code>, <code>checkout</code>,
<code>merge</code>, <code>cherry-pick</code>, <code>log</code>, <code>clone</code>, <code>fetch</code> and
<code>push remote hash-of-commit:refs/heads/name-of-the-branch</code>. In particular, don't use <code>rebase</code>
which is just a wrapper around a sequence of <code>cherry-pick</code> commands, don't use <code>pull</code> which is
just a wrapper around <code>fetch</code> and <code>merge</code>, don't use <code>git push</code> as-is and instead
explicitly give the name (origin) or URL of the remote, the hash of the commit to push, and the path that should be
updated on the remote (<code>git push</code> while the <code>main</code> branch is checked out locally is equivalent
to <code>git push origin HEAD:refs/heads/main</code>, where <code>HEAD</code> can be replaced by the actual hash of
the commit).
</li>
<li>
Try not even using <code>git cherry-pick</code> or <code>git diff</code> a few times, instead make two copies of the git
directory, check out the two different commits in each copy, and use the traditional *NIX commands <code>diff</code> and
<code>patch</code>.
</li>
</ul>
</section>
<section id="conclusion">
<h1>Conclusion</h1>
<p>This article shows that a large part of the core of GIT can be re-implemented in <span class="loc-count">a few</span> source lines of code* (<a href="javascript:___copy_all_code(); void(0);">copy all the code</a>).
<span style="font-size: small">*&nbsp;empty lines and single closing braces excluded, <span class="loc-count-total">a few more</span> in total.</span></p>
<div id="copy-all-code" style="display: none;"></div>
<ul>
<li>Some of the features which may appear mysterious at first sight (e.g. detached HEAD) should be clearer with the knowledge of how GIT works behind the scenes.</li>
<li>Furthermore, branches are often associated with an intuition (containers into which commits are added) which does not match the implementation (mutable pointers to commits).</li>
<li>Finally, it is tempting to think of commits as patches. While <code>darcs</code> tries to expose an interface which matches this intuition, it is clear that the implementation of GIT considers commits as copies of the entire repository, which are linked to the previous version solely by the <code>parent</code> metadata in the commit headers.</li>
</ul>
<p>A few core commands like <code>git diff</code> and <code>git apply</code> are not described in this tutorial.
They are little more than improved versions of the classical *nix commands <code>diff</code> and <code>patch</code>.</p>
<p>Most other commands provided by GIT are merely convenience wrappers around these commands. For example, <code>git cherry-pick</code> is simply a combination of <code>git diff</code> between the tree of a commit and the tree of its parent, followed by <code>git apply</code> to apply the patch and <code>git commit</code> to create a new commit whose diff is equivalent to the diff of the original commit. As another example, the command <code>git rebase</code> performs a succession of <code>cherry-pick</code> operations.</p>
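<p>The decomposition of <code>git cherry-pick</code> can be sketched on snapshots represented as tables from paths to contents.
This is a deliberately crude illustration: the diff works on whole files instead of lines, conflicts are ignored, and the functions
<code>diffSnapshots</code>, <code>applyChanges</code> and <code>cherryPick</code> are hypothetical names which do not belong to GIT or to this tutorial's implementation.</p>

```javascript
// Hypothetical whole-file diff between two snapshots (path -> contents).
function diffSnapshots(before, after) {
  var changes = {};
  Object.keys(before).concat(Object.keys(after)).forEach(function (path) {
    if (before[path] !== after[path]) {
      // An undefined value means that the change deletes the file.
      changes[path] = after[path];
    }
  });
  return changes;
}

// Apply the changes to a copy of the snapshot, leaving the original untouched.
function applyChanges(snapshot, changes) {
  var result = {};
  Object.keys(snapshot).forEach(function (path) { result[path] = snapshot[path]; });
  Object.keys(changes).forEach(function (path) {
    if (changes[path] === undefined) { delete result[path]; }
    else { result[path] = changes[path]; }
  });
  return result;
}

// cherry-pick = diff between the picked commit and its parent, applied on top of HEAD.
function cherryPick(pickedParent, picked, head) {
  return applyChanges(head, diffSnapshots(pickedParent, picked));
}

var result = cherryPick(
  { 'README': 'v1' },
  { 'README': 'v1', 'notes.txt': 'new' },
  { 'README': 'v2' });
```

<p>The picked commit only added <code>notes.txt</code>, so the result keeps the <code>README</code> of the current HEAD and gains the new file.
Committing that snapshot with the current HEAD as its parent would complete the operation.</p>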
<p>By keeping in mind the internal model of GIT, it becomes easier to understand the usual commands and their quirks. By understanding the design philosophy behind the implementation, the day-to-day usage can become, hopefully, less surprising.</p>
</section>
<div id="toc"></div>
</article>
<script>
(function() {
var script = ___script_log_header;
var ta = document.getElementsByTagName('textarea');
for (var j = 0; j < ta.length; j++) {
if (ta[j] == document.getElementById('playground-reset')) {
break;
}
script += ta[j].value + "\n\n";
}
var js = document.getElementsByTagName('script');
for (var j = 0; j < js.length; j++) {
if (js[j].className.indexOf('example') != -1) {
script += js[j].innerText;
}
}
script += '\nfor (var i = 0; i < examples.length; i++) { examples[i](); }';
eval(script);
})();
___git_tutorial_onload()
</script>
</body>
</html>