<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>GIT tutorial</title>
<link rel="stylesheet" href="codemirror-5.60.0/lib/codemirror.css">
<script src="codemirror-5.60.0/lib/codemirror.js"></script>
<script src="codemirror-5.60.0/mode/javascript/javascript.js"></script>
<script src="sha1.js"></script>
<script src="pako.min.js"></script>
<link rel="stylesheet" href="git-tutorial.css">
<script src="git-tutorial.js"></script>
<script class="example">
var examples=[];
function ___h2f(hash) { return 'proj/.git/objects/'+hash.substr(0,2)+'/'+hash.substr(2); }
function ___example(id, f) {
examples.push(function () {
var result = f();
var fs = {};
for (var i = 0; i < result.names.length; i++) {
fs[result.names[i]] = filesystem[result.names[i]];
}
var previous_fs = {};
for (var i = 0; i < result.previous_names.length; i++) {
previous_fs[result.previous_names[i]] = filesystem[result.previous_names[i]];
}
document.getElementById(id).innerHTML = ___filesystem_to_string(fs, true, previous_fs);
});
}
</script>
</head>
<body>
<article id="git-tutorial">
<h1>Under construction</h1>
<p>The main reference for this tutorial is the <a href="https://git-scm.com/book/en/v2/Git-Internals-Git-Objects">Pro Git book</a> section on GIT internals.</p>
<p>This tutorial uses three libraries:</p>
<ul>
<li><a href="https://codemirror.net/">CodeMirror</a>, released under the MIT license</li>
<li><a href="https://www.movable-type.co.uk/scripts/sha1.html">sha1.js</a>, released under the MIT license</li>
<li><a href="https://github.com/nodeca/pako">pako 2.0.3</a>, released under the MIT and Zlib licenses, see the project page for details.</li>
</ul>
<section id="introduction">
<h1>Introduction</h1>
<p>
GIT is based on a simple model, with a lot of shorthands for common
use cases. This model is sometimes hard to guess just from the
everyday commands. To illustrate how GIT works, we'll implement a
stripped down clone of GIT in <span class="loc-count">a few</span> lines of
JavaScript.
<span style="font-size: small">*&nbsp;empty lines and single closing braces
excluded, <span class="loc-count-total">a few more</span> in total.</span>
</p>
</section>
<section id="os-filesystem">
<h1>The Operating System's filesystem</h1>
<section id="os-filesystem-model">
<h1>Model of the filesystem</h1>
<p>The Operating System's filesystem will be simulated by a very
simple key-value store. In this very simple filesystem, directories
are entries mapped to <code>null</code> and files are entries mapped
to strings. The path to the current directory is stored in a separate
variable.</p>
<textarea id="in0">
var filesystem = {};
var current_directory = '';
</textarea>
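<p>As an illustration only, here is a sketch of what such a store could look like for a small project.
The names <code>demo_store</code>, <code>demo_is_directory</code> and <code>demo_is_file</code> are
hypothetical and are not used elsewhere in this tutorial.</p>

```javascript
// Hypothetical store following the model above:
// directories map to null, files map to strings.
var demo_store = {
  'proj': null,
  'proj/README': 'This is my Scheme project.\n',
  'proj/src': null,
  'proj/src/main.scm': '(map (lambda (x) (+ x 1)) (list 1 2 3))\n'
};

// An entry is a directory when its value is null, a file when it is a string.
function demo_is_directory(store, path) {
  return store[path] === null;
}
function demo_is_file(store, path) {
  return typeof store[path] === 'string';
}

console.log(demo_is_directory(demo_store, 'proj/src'));  // true
console.log(demo_is_file(demo_store, 'proj/README'));    // true
console.log(demo_is_file(demo_store, 'proj/missing'));   // false
```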
</section>
<section id="os-filesystem-functions">
<h1>Filesystem access functions<span class="notoc"> (<code>read</code>, <code>write</code>, <code>mkdir</code>, <code>exists</code>, <code>cd</code>)</span></h1>
<p>The filesystem exposes functions to read an entire file, create or
replace an entire file, create a directory, test the existence of a filesystem entry, and change the current directory.</p>
<textarea id="in1">
function read(filename) {
return filesystem[filename];
}
function write(filename, data) {
filesystem[filename] = String(data);
}
function exists(filename) {
return typeof(filesystem[filename]) !== 'undefined';
}
function mkdir(dirname) {
filesystem[dirname] = null;
}
function cd(dirname) {
current_directory = dirname;
}
</textarea>
</section>
<section id="os-filesystem-listdir">
<h1>Filesystem access functions<span class="notoc"> (<code>listdir</code>)</span></h1>
<p>It will be handy for some operations to list the contents of a
directory.</p>
<textarea id="in2">
function listdir(dirname) {
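// index of the path component located just below dirname
var depth = dirname.split('/').length;
// filesystem is a plain object, so its keys must be enumerated explicitly
var descendents = Object.keys(filesystem)
.filter(function (filename) { return filename.startsWith(dirname + '/'); });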
var children = descendents
.map(function (filename) { return filename.split('/')[depth]; });
// remove duplicates:
return Array.from(new Set(children));
}
</textarea>
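<p>The following sketch shows the expected behaviour of a directory listing on a small standalone store.
It assumes that the keys of the store are enumerated with <code>Object.keys</code>; the names
<code>demo_fs</code> and <code>demo_listdir</code> are hypothetical and exist only for this example.</p>

```javascript
// Hypothetical store; directories map to null, files map to strings.
var demo_fs = {
  'proj': null,
  'proj/README': 'readme\n',
  'proj/src': null,
  'proj/src/main.scm': '(display 1)\n'
};

// Lists the direct children of dirname in the given store.
function demo_listdir(store, dirname) {
  // Index of the path component located just below dirname.
  var depth = dirname.split('/').length;
  var children = Object.keys(store)
    .filter(function (filename) { return filename.startsWith(dirname + '/'); })
    .map(function (filename) { return filename.split('/')[depth]; });
  // remove duplicates:
  return Array.from(new Set(children));
}

console.log(demo_listdir(demo_fs, 'proj'));      // [ 'README', 'src' ]
console.log(demo_listdir(demo_fs, 'proj/src'));  // [ 'main.scm' ]
```

<p>Both <code>proj/src</code> and <code>proj/src/main.scm</code> contribute the child name <code>src</code>,
which is why the duplicates are removed at the end.</p>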
</section>
</section>
<section id="example-working-directory">
<h1>Example working directory</h1>
<p>Our imaginary user will create a <code>proj</code> directory,
and start filling in some files.</p>
<textarea id="in3">
mkdir('proj');
cd('proj');
write('proj/README', 'This is my Scheme project.\n');
mkdir('proj/src');
write('proj/src/main.scm', '(map (lambda (x) (+ x 1)) (list 1 2 3))\n');
</textarea>
</section>
<section id="git-init">
<h1><code>git init</code> (creating <code>.git</code>)</h1>
<p>The first thing to do is to initialize the GIT directory.
For now, only the <code>.git</code> folder is needed. The rest
of the function implementing <code>git init</code> will be
written later.</p>
<textarea id="in4">
function join_paths(a, b) {
return (a == "") ? b : (a + "/" + b);
}
// git init (partial implementation: create the .git directory)
function git_init_mkdir() {
mkdir(join_paths(current_directory, '.git'));
}
git_init_mkdir();
</textarea>
<p>Click on the <em>eval</em> button to see the files and directories that were
created so far.</p>
</section>
<section id="git-hash-object">
<h1><code>git hash-object</code><span class="notoc"> (storing a copy of a file in <code>.git</code>)</span></h1>
<p>The most basic element of a GIT repository is an <em>object</em>. Objects have a type which can be
<code>blob</code> (individual files), <code>tree</code> (directories),
<code>commit</code> (pointers to a specific version of the root directory,
with a description and some metadata) and <code>tag</code> (named pointers to a specific commit,
with a description and some metadata).
When a file is added to the GIT repository, a compressed copy is stored in GIT&apos;s database,
in the <code>.git/objects/</code> folder. This copy is a <em>blob</em> object.</p>
<p>The compressed copy is given a unique filename, which is obtained by hashing the contents of the original file.
Some filesystems have poor performance when a single directory contains a large number of files, and some filesystems
have a limit on the number of files that a directory may contain. To circumvent these issues, the first two characters
of the hash are used as the name of an intermediate directory: if a file's hash is <code>0a1bd…</code>, its compressed
copy will be stored in <code>.git/objects/0a/1bd…</code>.</p>
<p>The <code>hash_object()</code> function defined below creates a file that looks like this:</p>
<div id="example-blob-object-template"></div>
<script class="example">
___example('example-blob-object-template', function() {
var object_contents = 'type length\000Contents of path_or_data';
var hash = sha1(object_contents);
var path = ___h2f(hash);
write(path, deflate(object_contents));
return { filesystem: filesystem, names: [path], previous_names: [] };
});
</script>
<p>The objects stored in the GIT database are compressed with zlib
(using the "deflate" compression method). The filesystem view shows
the marker <span class="deflated">deflated:</span> followed by the
uncompressed data. Click on the (un)compressed data to toggle between
this pretty-printed view and the raw compressed data.</p>
<p>When creating some <code>blob</code> objects, the result could be, for example:</p>
<div id="example-blob-objects"></div>
<script class="example">
___example('example-blob-objects', function() {
var names = [
___h2f(hash_object(true, 'blob', false, 'src/main.scm')),
___h2f(hash_object(true, 'blob', false, 'README')),
];
return { filesystem: filesystem, names: names, previous_names: [] };
});
</script>
<p>This function reproduces faithfully the behaviour of (a subset of the options of)
the <code>git hash-object</code> command which can be called on a real git command-line.</p>
<textarea id="in5">
// git hash-object [-w] -t <type> [--stdin] [path]
function hash_object(must_write, type, is_data, path_or_data) {
var data = is_data ? path_or_data : read(join_paths(current_directory, path_or_data));
var object_contents = type + ' ' + data.length + '\0' + data;
var hash = sha1(object_contents);
if (must_write) {
mkdir(join_paths(current_directory, '.git/objects'));
mkdir(join_paths(current_directory, '.git/objects/' + hash.substr(0,2)));
var object_path = join_paths(current_directory, '.git/objects/' + hash.substr(0,2) + '/' + hash.substr(2));
// deflate() compresses using zlib
write(object_path, deflate(object_contents));
}
return hash;
}
</textarea>
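<p>As a cross-check outside the browser, the same kind of hash can be computed with the
<code>crypto</code> module of Node.js. This is only a sketch: <code>demo_hash_object</code> is a
hypothetical name, and the code assumes that the length in the header is counted in UTF-8 bytes,
which agrees with <code>data.length</code> only for ASCII contents.</p>

```javascript
var crypto = require('crypto');

// Computes the GIT object hash of a string under Node.js.
// Assumption: the header length is the number of UTF-8 bytes of the data.
function demo_hash_object(type, data) {
  var body = Buffer.from(data, 'utf8');
  var header = Buffer.from(type + ' ' + body.length + '\0', 'utf8');
  return crypto.createHash('sha1')
    .update(Buffer.concat([header, body]))
    .digest('hex');
}

var demo_hash = demo_hash_object('blob', 'hello world\n');
console.log(demo_hash);
// 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

// The first two characters of the hash name the intermediate directory.
console.log('.git/objects/' + demo_hash.substring(0, 2) + '/' + demo_hash.substring(2));
// .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```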
<section id="add-file-to-git">
<h1>Adding a file to the GIT database</h1>
<p>So far, our GIT database does not know about any of the user&apos;s
files. In order to add the contents of the <code>README</code> file in
the database, we use <code>git hash-object -w -t blob README</code>,
where <code>-w</code> tells GIT to <em>write</em> the object in its
database, and <code>-t blob</code> indicates that we want to create
a <em>blob</em> object, i.e. the contents of a file.</p>
<textarea id="in6">
// git hash-object -w -t blob README
hash_object(true, 'blob', false, 'README');
</textarea>
<p>Click on the <em>eval</em> button to see the file that was
created by this call.</p>
<p>Notice that the database does not contain the name of the
original file, only its content, stored under a unique identifier which is
derived by hashing that content. Let&apos;s add the second user file
to the database.</p>
<textarea id="in7">
// git hash-object -w -t blob src/main.scm
hash_object(true, 'blob', false, 'src/main.scm');
</textarea>
</section>
</section>
<section id="zlib-compression-note">
<h1><code>zlib</code> compression</h1>
<p>GIT compresses objects with zlib. The <code>deflate()</code> function used in
the script above comes from the <a href="https://github.com/nodeca/pako">pako 2.0.3</a> library.
To view a zlib-compressed object in your *nix terminal, define the following
function in your shell (it requires Python&nbsp;3).</p>
<pre>
unzlib() {
python3 -c \
"import sys,zlib; \
sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], 'rb').read()));" \
"$1"
}
</pre>
<p>You can then inspect git objects as follows, using <code>hexdump</code> to view the null bytes and other non-printable bytes.</p>
<pre>unzlib .git/objects/95/d318ae78cee607a77c453ead4db344fc1221b7 | hexdump -Cv</pre>
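<p>The same round trip can be sketched in JavaScript. The example below assumes it is run under
Node.js and uses its built-in <code>zlib</code> module rather than pako; the <code>demo_</code>
variables are hypothetical names.</p>

```javascript
var zlib = require('zlib');

// Round trip through zlib, as done when storing and then reading an object.
var demo_object = Buffer.from('blob 12\0hello world\n', 'utf8');
var demo_compressed = zlib.deflateSync(demo_object);
var demo_restored = zlib.inflateSync(demo_compressed);

// A zlib stream written with the default settings starts with the byte 0x78.
console.log(demo_compressed[0].toString(16));    // 78
console.log(demo_restored.equals(demo_object));  // true
```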
</section>
<section id="storing-trees">
<h1>Storing trees (list of hashed files and subtrees)</h1>
<p>At this point GIT knows about the contents of both of the user's
files, but it would be nice to also store the filenames.
This is done by creating a <em>tree</em> object.</p>
<p>A tree object can contain files (by associating the blob's hash to its name), or directories (by associating the hash of other subtrees to their name).
The mode (<code>100644</code> for the file and <code>40000</code> for the folder) indicates the permissions, and is given in octal using <a href="https://unix.stackexchange.com/a/145118/19059">the values used by *nix</a>.</p>
<div id="example-tree-objects"></div>
<script class="example">
___example('example-tree-objects', function() {
var main = ___h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = ___h2f(hash_object(true, 'blob', false, 'README'));
var src = ___h2f(store_tree("src", ["main.scm"], []));
var proj = ___h2f(paths_to_tree(["README", "src/main.scm"]));
var previous_names = [ main, readme ];
var names = [ main, readme, src, proj ];
return { filesystem: filesystem, names: names, previous_names: previous_names };
});
</script>
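<p>In this simplified implementation, the files (blobs) of a tree are listed before its subdirectories (trees),
and within each group the entries keep the order in which they were given. Real GIT instead sorts all the entries
of a tree together by name; for the example project both orders coincide.</p>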
<textarea id="in8">
// base_directory is a string
// filenames is a list of strings
// subtrees is a list of {name, hash} objects.
function store_tree(base_directory, filenames, subtrees) {
function get_file_hash(filename) {
var path = join_paths(base_directory, filename);
var hash = hash_object(true, 'blob', false, path)
return hex_to_raw_bytes(hash);
}
var blobs = filenames.map(function (filename) {
return "100644 " + filename + "\0" + get_file_hash(filename);
});
var trees = subtrees.map(function (subtree) {
return "40000 " + subtree.name + "\0" + hex_to_raw_bytes(subtree.hash);
});
// blobs are listed before subtrees
var tree_content = blobs.join('') + trees.join('');
// cat tree_content | git hash-object -w -t tree --stdin
return hash_object(true, 'tree', true, tree_content);
}
</textarea>
<p>This function needs a small utility to convert hashes encoded in hexadecimal to raw bytes.</p>
<textarea id="in9">
function hex_to_raw_bytes(hex) {
var hex = String(hex);
var str = ""
for (var i = 0; i < hex.length; i+=2) {
str += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
}
return str;
}
</textarea>
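<p>To make the byte layout of a tree entry more concrete, here is a sketch using Node.js buffers
instead of strings. The helper names <code>demo_tree_entry</code> and <code>demo_hash_tree</code> are
hypothetical. The two hashes appearing in the example are the well-known identifiers of the empty blob
and of the empty tree.</p>

```javascript
var crypto = require('crypto');

// Builds one tree entry: mode, space, name, NUL byte, then the 20 raw hash bytes.
function demo_tree_entry(mode, name, hex_hash) {
  return Buffer.concat([
    Buffer.from(mode + ' ' + name + '\0', 'utf8'),
    Buffer.from(hex_hash, 'hex')
  ]);
}

// Hashes a tree object whose contents are the given entries, in the given order.
function demo_hash_tree(entries) {
  var body = Buffer.concat(entries);
  var header = Buffer.from('tree ' + body.length + '\0', 'utf8');
  return crypto.createHash('sha1').update(header).update(body).digest('hex');
}

// e69de29b... is the hash of the empty blob.
var demo_entry = demo_tree_entry('100644', 'README', 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391');
console.log(demo_entry.length);  // 34: 14 bytes for "100644 README" and NUL, plus 20 hash bytes

// A tree without any entry hashes to the "empty tree" identifier.
console.log(demo_hash_tree([]));
// 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```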
<section id="store-tree-example">
<h1>Example use of <code>store_tree()</code></h1>
<p>The following code, once uncommented, stores into the GIT database the trees for <code>src</code>
and for the root directory of the GIT project.</p>
<textarea id="in10">
//hash_src_tree = store_tree("src", ["main.scm"], []);
//hash_root_tree = store_tree("", ["README"], [{name:"src", hash:hash_src_tree}]);
</textarea>
<p>The <code>store_tree()</code> function needs to be called for the contents of subdirectories
first, and that result can be used to store the trees of upper directories. In the next section,
we will write a function which takes a list of paths, constructs an internal representation of
the hierarchy, and stores the corresponding trees bottom-up.</p>
</section>
<section id="store-tree-from-paths">
<h1>Storing a tree from a list of paths</h1>
<p>Making trees out of the subfolders one by one is cumbersome.
The following utility function takes a list of paths, and builds
a tree from those.</p>
<textarea id="in11">
function paths_to_tree(paths) {
// This temporary mutable object will store a hierarchy of
// subfolders and files, e.g.
// {
// subfolders: { src: { subfolders: {}, files: ['main.scm'] } },
// files: ['README']
// }
var hierarchy = { subfolders: {}, files: [] };
// This splits the input paths on occurrences of "/",
// and inserts them into the "hierarchy" object.
for (var i = 0; i < paths.length; i++) {
var path_components = paths[i].split('/');
var h = hierarchy;
for (var j = 0; j < path_components.length - 1; j++) {
if (! h.subfolders.hasOwnProperty(path_components[j])) {
h.subfolders[path_components[j]] = {
subfolders: {},
files: []
};
}
h = h.subfolders[path_components[j]];
}
h.files[h.files.length] = path_components[path_components.length - 1];
}
// This function takes the path to a directory, e.g. "src",
// and a hierarchy object e.g. { subfolders: {}, files: ['main.scm'] }.
// It recursively stores the tree object for that directory into
// GIT's database.
var to_tree = function(base_directory, hierarchy) {
var subtrees = [];
for (var i in hierarchy.subfolders) {
if (hierarchy.subfolders.hasOwnProperty(i)) {
subtrees[subtrees.length] = {
name: i,
hash: to_tree(join_paths(base_directory, i), hierarchy.subfolders[i])
};
}
}
return store_tree(base_directory, hierarchy.files, subtrees);
}
// Store the trees for the whole hierarchy, starting from the
// root directory of the GIT repository (which is represented
// as an empty path "")
return to_tree("", hierarchy);
}
// git add README src/main.scm
paths_to_tree(["README", "src/main.scm"]);
</textarea>
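<p>The first half of this function can be tried on its own. The sketch below builds only the
intermediate hierarchy, without hashing or storing anything; <code>demo_build_hierarchy</code> is a
hypothetical name used for this example.</p>

```javascript
// Builds the intermediate hierarchy from a list of paths (no hashing involved).
function demo_build_hierarchy(paths) {
  var hierarchy = { subfolders: {}, files: [] };
  for (var i = 0; i < paths.length; i++) {
    var components = paths[i].split('/');
    var h = hierarchy;
    // Walk (and create on demand) every directory leading to the file.
    for (var j = 0; j < components.length - 1; j++) {
      if (!Object.prototype.hasOwnProperty.call(h.subfolders, components[j])) {
        h.subfolders[components[j]] = { subfolders: {}, files: [] };
      }
      h = h.subfolders[components[j]];
    }
    // The last component is the file name itself.
    h.files.push(components[components.length - 1]);
  }
  return hierarchy;
}

console.log(JSON.stringify(demo_build_hierarchy(['README', 'src/main.scm'])));
// {"subfolders":{"src":{"subfolders":{},"files":["main.scm"]}},"files":["README"]}
```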
</section>
</section>
<section id="store-commit">
<h1>Storing a commit in the GIT database</h1>
<p>Now that the GIT database contains the entire tree for the current version,
a commit can be created. A commit contains</p>
<ul>
<li>the hash of the tree object,</li>
<li>the hash of the previous commit, which is dubbed the <code>parent</code> (merge commits have two or more parents, and the initial commit has no parent commit),</li>
<li>information about the author (the person who initially wrote the code),</li>
<li>information about the committer (the person who adds the code to the GIT
database, often the same person as the author, but it can be a different person
e.g. when someone else rewrites the history with a rebase or applies a patch received
by e-mail),</li>
<li>and a description.</li>
</ul>
<div id="example-commit-object"></div>
<script class="example">
___example('example-commit-object', function() {
var main = ___h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = ___h2f(hash_object(true, 'blob', false, 'README'));
var src = ___h2f(store_tree("src", ["main.scm"], []));
var proj = ___h2f(paths_to_tree(["README", "src/main.scm"]));
var initial_commit = ___h2f(store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit'));
var previous_names = [ main, readme, src, proj ];
var names = [ main, readme, src, proj, initial_commit ];
return { filesystem: filesystem, names: names, previous_names: previous_names };
});
</script>
<p>The author and committer information contain</p>
<ul>
<li>the person's name,</li>
<li>the person's email,</li>
<li>the *nix timestamp at which the version was authored or committed,</li>
<li>and the <a href="https://www.youtube.com/watch?v=q2nNzNo_Xps">timezone for that timestamp</a>.</li>
</ul>
<textarea id="in12">
function store_commit(tree, parents, author, committer, message) {
var commit_contents = '';
commit_contents += 'tree ' + tree + '\n';
for (var i = 0; i < parents.length; i++) {
commit_contents += 'parent ' + parents[i] + '\n';
}
commit_contents += 'author ' + author.name
+ ' <' + author.email + '> '
+ format_date(author.date) + ' '
+ format_timezone(author.timezoneMinutes) + '\n';
commit_contents += 'committer ' + committer.name
+ ' <' + committer.email + '> '
+ format_date(committer.date) + ' '
+ format_timezone(committer.timezoneMinutes) + '\n';
commit_contents += '\n';
commit_contents += '' + message + (message[message.length-1] == '\n' ? '' : '\n');
// cat commit_contents | git hash-object -w -t commit --stdin
return hash_object(true, 'commit', true, commit_contents);
}
function format_date(d) {
return Math.floor((+d) / 1000);
}
function left_pad(s, char, len) {
while ((''+s).length < len) { s = '' + char + s; }
return s;
}
function format_timezone(tm) {
var h = Math.floor(Math.abs(+tm)/60);
var m = Math.abs(+tm)%60;
return (tm >= 0 ? '+' : '-') + left_pad(h, '0', 2) + left_pad(m, '0', 2);
}
</textarea>
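<p>The timestamp and timezone at the end of the author and committer headers can be checked in
isolation. This sketch uses a hypothetical helper, <code>demo_format_when</code>, which combines the two
formatting steps into one function.</p>

```javascript
// Formats a date and a UTC offset (in minutes) the way they appear at the
// end of the author and committer headers, e.g. "1617120803 +0100".
function demo_format_when(date, timezone_minutes) {
  var seconds = Math.floor(date.getTime() / 1000);
  var abs = Math.abs(timezone_minutes);
  var hh = String(Math.floor(abs / 60)).padStart(2, '0');
  var mm = String(abs % 60).padStart(2, '0');
  var sign = timezone_minutes >= 0 ? '+' : '-';
  return seconds + ' ' + sign + hh + mm;
}

console.log(demo_format_when(new Date(1617120803000), 60));    // 1617120803 +0100
console.log(demo_format_when(new Date(1617120803000), -330));  // 1617120803 -0530
console.log(demo_format_when(new Date(0), 0));                 // 0 +0000
```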
<section id="store-commit-example">
<h1>Storing an example commit</h1>
<p>It is now possible to store a commit in the database. This saves
a copy of the tree along with some metadata about this version.
The first commit has no parent, which is represented by passing
the empty list.</p>
<textarea id="in13">
initial_commit = store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit');
</textarea>
</section>
</section>
<section id="resolving-references">
<h1>Resolving references</h1>
<p>The next few sections will introduce <em>symbolic references</em>
like branch names, the special name <code>HEAD</code> or tag names.</p>
<p>Symbolic references are nothing more than regular files containing a hexadecimal
hash or a string of the form <code>ref: path/to/other/symbolic/reference</code>.
The <code>HEAD</code> reference is stored in <code>.git/HEAD</code>, and can point
directly to a commit hash like
<span id="example-reference-head-hash">0123456789abcdef0123456789abcdef01234567</span>,
or can point to another symbolic reference, in which case the <code>.git/HEAD</code> file
will contain e.g. <code>ref: refs/heads/main</code>.</p>
<p>Branches are simple files stored in <code>.git/refs/heads/name-of-the-branch</code>.</p>
<div id="example-reference"></div>
<script class="example">
___example('example-reference', function() {
var h2f = function(hash) { return 'proj/.git/objects/'+hash.substr(0,2)+'/'+hash.substr(2); }
var main = h2f(hash_object(true, 'blob', false, 'src/main.scm'));
var readme = h2f(hash_object(true, 'blob', false, 'README'));
var src = h2f(store_tree("src", ["main.scm"], []));
var proj = h2f(paths_to_tree(["README", "src/main.scm"]));
var initial_commit_hash = store_commit(
paths_to_tree(["README", "src/main.scm"]),
[],
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
{name:'Ada Lovelace', email:'ada@analyti.cal', date:new Date(1617120803000), timezoneMinutes: +60},
'Initial commit');
var initial_commit = h2f(initial_commit_hash);
git_branch('main', initial_commit_hash, true);
var main_branch = 'proj/.git/refs/heads/main';
git_tag('v1.0', initial_commit_hash, true);
var v1_0_tag = 'proj/.git/refs/tags/v1.0';
git_init_head();
var head = 'proj/.git/HEAD';
document.getElementById('example-reference-head-hash').innerText = initial_commit_hash;
var previous_names = [ main, readme, src, proj, initial_commit ];
var names = [ main, readme, src, proj, initial_commit, main_branch, v1_0_tag, head ];
return { filesystem: filesystem, names: names, previous_names: previous_names }
});
</script>
<p>We'll start with a small utility to remove the newline at the end of a string.
GIT references are usually files containing a hexadecimal hash, and following
*NIX tradition these files finish with a newline byte. When reading these
references, we need to get rid of the newline first.</p>
<textarea>
// Removes the newline at the end of a string, if present.
function trim_newline(s) {
if (s.endsWith('\n')) { return s.substr(0, s.length-1); } else { return s; }
}
</textarea>
<section id="git-symbolic-ref">
<h1><code>git symbolic-ref</code></h1>
<textarea>
function git_symbolic_ref(ref) {
var ref_file = join_paths(current_directory, '.git/' + ref);
if (exists(ref_file) && read(ref_file).startsWith('ref: ')) {
return trim_newline(read(ref_file)).substr('ref: '.length);
} else {
return false;
}
}
</textarea>
</section>
<section id="git-rev-parse">
<h1><code>git rev-parse</code></h1>
<textarea>
function git_rev_parse(ref) {
var symbolic_ref_target = git_symbolic_ref(ref);
if (symbolic_ref_target) {
// symbolic ref like "ref: refs/heads/main"
return git_rev_parse(symbolic_ref_target);
} else if (/^[0-9a-f]{40}$/.test(ref)) {
// hash like "0123456789abcdef0123456789abcdef01234567"
return ref;
} else if (ref == 'HEAD') {
// user-friendly reference like "HEAD"
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/' + ref))));
} else if (ref.startsWith('refs/') && exists(join_paths(current_directory, '.git/' + ref))) {
// user-friendly reference like "refs/heads/main"
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/' + ref))));
} else if (exists(join_paths(current_directory, '.git/refs/heads/' + ref))) {
// user-friendly reference like "main" (a branch)
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/refs/heads/' + ref))));
} else if (exists(join_paths(current_directory, '.git/refs/tags/' + ref))) {
// user-friendly reference like "v1.0" (a tag)
return git_rev_parse(trim_newline(read(join_paths(current_directory, '.git/refs/tags/' + ref))));
} else {
// unknown ref
return false;
}
}
</textarea>
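<p>The resolution process can be illustrated on a standalone table of references. The following is a
simplified sketch, not the function above: <code>demo_refs</code> and <code>demo_resolve</code> are
hypothetical names, the keys are paths relative to <code>.git/</code>, and the hash is a placeholder.</p>

```javascript
// Hypothetical contents of a few files below .git/ (keys are relative to .git/).
var demo_refs = {
  'HEAD': 'ref: refs/heads/main\n',
  'refs/heads/main': '0123456789abcdef0123456789abcdef01234567\n',
  'refs/tags/v1.0': '0123456789abcdef0123456789abcdef01234567\n'
};

// Follows "ref: ..." indirections until a 40-digit hash is found.
// Returns false for unknown references.
function demo_resolve(refs, name) {
  if (/^[0-9a-f]{40}$/.test(name)) { return name; }
  // Try the name as given, then as a branch, then as a tag.
  var candidates = [name, 'refs/heads/' + name, 'refs/tags/' + name];
  for (var i = 0; i < candidates.length; i++) {
    if (Object.prototype.hasOwnProperty.call(refs, candidates[i])) {
      var contents = refs[candidates[i]].replace(/\n$/, '');
      if (contents.startsWith('ref: ')) {
        return demo_resolve(refs, contents.substring('ref: '.length));
      }
      return demo_resolve(refs, contents);
    }
  }
  return false;
}

console.log(demo_resolve(demo_refs, 'HEAD'));  // 0123456789abcdef0123456789abcdef01234567
console.log(demo_resolve(demo_refs, 'v1.0'));  // 0123456789abcdef0123456789abcdef01234567
console.log(demo_resolve(demo_refs, 'nope'));  // false
```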
</section>
</section>
<section id="git-branch">
<h1><code>git branch</code></h1>
<p>A branch is a pointer to a commit, stored in a file in <code>.git/refs/heads/name_of_the_branch</code>.
The branch can be overwritten with <code>git branch -f</code>. Also, as will be explained later,
<code>git commit</code> can update the pointer of a branch.</p>
<textarea id="in14">
function git_branch(branch_name, commit_ref, force) {
var commit_hash = git_rev_parse(commit_ref);
mkdir(join_paths(current_directory, '.git/refs'));
mkdir(join_paths(current_directory, '.git/refs/heads'));
if (!force && exists(join_paths(current_directory, '.git/refs/heads/' + branch_name))) {
alert("branch already exists");
return false;
} else {
write(join_paths(current_directory, '.git/refs/heads/' + branch_name), commit_hash + '\n');
return true;
}
}
// git branch main 0123456789012345678901234567890123456789
git_branch('main', initial_commit, false);
// git branch -f main 0123456789012345678901234567890123456789
git_branch('main', initial_commit, true);
</textarea>
</section>
<section id="HEAD">
<h1><code>HEAD</code></h1>
<p>
The HEAD indicates the "current" commit. It is first set as part of the <code>git init</code> routine.
</p>
<textarea id="in15">
function git_init_head() {
write(join_paths(current_directory, '.git/HEAD'), 'ref: refs/heads/main\n');
}
git_init_head();
</textarea>
</section>
<section id="git-commit">
<h1><code>git commit</code></h1>
<p>If the <code>HEAD</code> points to a commit hash, then <code>git commit</code> updates the <code>HEAD</code> to point to the new commit.
Otherwise, when the <code>HEAD</code> points to a branch, then the target branch (represented by a file named <code>.git/refs/heads/the_branch_name</code>) is updated.</p>
<textarea id="in16">
var gitconfig = {
user: {
name: 'Ada Lovelace',
email: 'ada@analyti.cal',
}
};
var $EDITOR = function() { return window.prompt('Commit message:'); }
</textarea>
<textarea>
function git_commit(file_paths, message) {
var now = new Date();
var timestamp = (+now)/1000;
var timezoneMinutes = -(now.getTimezoneOffset());
var parent = git_rev_parse('HEAD');
var parents = parent ? [parent] : []
var new_commit_hash = store_commit(
paths_to_tree(file_paths),
parents,
{name:gitconfig.user.name, email:gitconfig.user.email, date:now, timezoneMinutes:timezoneMinutes },
{name:gitconfig.user.name, email:gitconfig.user.email, date:now, timezoneMinutes:timezoneMinutes },
message || $EDITOR());
advance_head(new_commit_hash);
return new_commit_hash;
}
</textarea>
<textarea>
function advance_head(new_commit_hash) {
var referenced_branch = git_symbolic_ref('HEAD');
if (referenced_branch) {
// Update the target of the ref:
write(join_paths(current_directory, '.git/' + referenced_branch), new_commit_hash + '\n');
} else {
// Detached HEAD, update .git/HEAD directly.
write(join_paths(current_directory, '.git/HEAD'), new_commit_hash + '\n');
}
}
</textarea>
<textarea>
write('proj/README', 'This is my Scheme project -- with updates!');
var second_commit = git_commit(['README', 'src/main.scm'], 'Some updates');
</textarea>
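<p>The difference between an attached and a detached <code>HEAD</code> can be observed on a standalone
table of references. This is a sketch under the same conventions as above; <code>demo_advance_head</code>
and the placeholder hashes are hypothetical.</p>

```javascript
// Moves HEAD forward in a hypothetical store of files below .git/.
function demo_advance_head(refs, new_commit_hash) {
  var head = refs['HEAD'];
  if (head.startsWith('ref: ')) {
    // Attached HEAD: update the branch file it designates.
    var branch = head.replace(/\n$/, '').substring('ref: '.length);
    refs[branch] = new_commit_hash + '\n';
  } else {
    // Detached HEAD: overwrite HEAD itself.
    refs['HEAD'] = new_commit_hash + '\n';
  }
}

// Placeholder hashes (40 hexadecimal digits each).
var demo_old_hash = '0'.repeat(40);
var demo_new_hash = 'f'.repeat(40);

var demo_attached = { 'HEAD': 'ref: refs/heads/main\n', 'refs/heads/main': demo_old_hash + '\n' };
demo_advance_head(demo_attached, demo_new_hash);
console.log(demo_attached['HEAD'] === 'ref: refs/heads/main\n');         // true: HEAD is unchanged
console.log(demo_attached['refs/heads/main'] === demo_new_hash + '\n');  // true: the branch moved

var demo_detached = { 'HEAD': demo_old_hash + '\n' };
demo_advance_head(demo_detached, demo_new_hash);
console.log(demo_detached['HEAD'] === demo_new_hash + '\n');             // true: HEAD itself moved
```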
</section>
<section id="git-tag">
<h1><code>git tag</code></h1>
<p>Tags are like branches, but are stored in <code>.git/refs/tags/the_tag_name</code>
and a tag is not normally modified. Once created, it's supposed to always point
to the same version.</p>
<p>GIT does offer a <code>git tag -f existing-tag new-hash</code> command,
but using it should be a rare occurrence.</p>
<textarea id="in17">
function git_tag(tag_name, commit_hash, force) {
mkdir(join_paths(current_directory, '.git/refs'));
mkdir(join_paths(current_directory, '.git/refs/tags'));
if (!force && exists(join_paths(current_directory, '.git/refs/tags/' + tag_name))) {
alert("tag already exists");
return false;
} else {
write(join_paths(current_directory, '.git/refs/tags/' + tag_name), commit_hash + '\n');
return true;
}
}
// git tag v1.0 0123456789012345678901234567890123456789
git_tag('v1.0', second_commit);
</textarea>
</section>
<section id="git-checkout">
<h1><code>git checkout</code></h1>
<section id="checkout-branch-vs-other">
<h1>Checkout, branches and other references</h1>
<p>The HEAD does not normally point to a tag. Although nothing actually
prevents writing <code>ref: refs/tags/v1.0</code> into <code>.git/HEAD</code>, the GIT
commands will not automatically do this. For example, <code>git checkout tag-or-branch-or-hash</code>
will put a symbolic <code>ref: </code> in <code>.git/HEAD</code> only if the argument is a branch.</p>
<textarea id="in18">
function git_checkout(tag_or_branch_or_hash) {
if (exists(join_paths(current_directory, '.git/refs/heads/' + tag_or_branch_or_hash))) {
// Normal (attached) HEAD, points to 'ref: refs/heads/the_branch_name'
write(join_paths(current_directory, '.git/HEAD'), 'ref: refs/heads/' + tag_or_branch_or_hash + '\n');
} else {
// Detached HEAD, points directly to commit hash
write(join_paths(current_directory, '.git/HEAD'), git_rev_parse(tag_or_branch_or_hash) + '\n');
}
checkout_files(git_rev_parse('HEAD'));
}
</textarea>
</section>
<section id="checkout-files">
<h1>Checking out files</h1>
<textarea>
function checkout_files(hash) {
var commit = parse_commit(hash);
checkout_tree(current_directory, commit.tree);
}
function checkout_tree(path_prefix, hash) {
var entries = parse_tree(hash);
for (var i = 0; i < entries.length; i++) {
var o = parse_object(entries[i].hash);
var entry_path = join_paths(path_prefix, entries[i].name);
if (o.type == 'blob') {
write(entry_path, o.contents);
} else {
checkout_tree(entry_path, entries[i].hash)
}
}
}
</textarea>
</section>
<section id="parse-assert">
<h1>Assert</h1>
<p>The parsers will check that their input looks reasonably well-formed, using <code>assert()</code>.</p>
<textarea>
function assert(boolean, text) {
if (! boolean) { alert("assertion failed: " + text); throw new Error(text); }
}
</textarea>
</section>
<section id="parsed-compressed">
<h1>Reading compressed objects</h1>
<textarea>
function parse_object(hash) {
var compressed = read(join_paths(current_directory, '.git/objects/' + hash.substr(0,2) + '/' + hash.substr(2)));
var inflated = inflate(compressed);
var split = inflated.match(/^([\s\S]*?) ([\s\S]*?)\0([\s\S]*)$/);
assert(split, "ill-formed object");
var type = split[1];
var length = split[2];
var contents = split[3];
assert(contents.length == length, "object has incorrect length");
return { type: type, length: length, contents: contents };
}
</textarea>
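<p>The header of an object can also be split without a regular expression. The sketch below takes an
already inflated string and uses <code>indexOf</code> to locate the NUL byte; it is an alternative
shown for illustration, and <code>demo_parse_object</code> is a hypothetical name.</p>

```javascript
// Splits an inflated object into its type, declared length and contents.
function demo_parse_object(inflated) {
  var nul = inflated.indexOf('\0');
  if (nul === -1) { throw new Error('ill-formed object'); }
  var header = inflated.substring(0, nul).split(' ');
  if (header.length !== 2) { throw new Error('ill-formed object header'); }
  var length = parseInt(header[1], 10);
  var contents = inflated.substring(nul + 1);
  if (contents.length !== length) { throw new Error('object has incorrect length'); }
  return { type: header[0], length: length, contents: contents };
}

var demo_parsed = demo_parse_object('blob 12\0hello world\n');
console.log(demo_parsed.type);                      // blob
console.log(demo_parsed.length);                    // 12
console.log(JSON.stringify(demo_parsed.contents));  // "hello world\n"
```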
</section>
<section id="parse-tree">
<h1>Parsing tree objects</h1>
<textarea>
function parse_tree(hash) {
var tree = parse_object(hash);
var split = tree.contents.split(/(?<=\0[\s\S]{20})/);
assert(split, 'invalid contents of tree object');
var entries = [];
for (var i = 0; i < split.length; i++) {
entries.push(parse_tree_entry(split[i]));
}
return entries;
}
</textarea>
<textarea>
function parse_tree_entry(entry) {
var split = entry.match(/^([0-9]+) ([\s\S]*)\0([\s\S]{20})$/);
assert(split, 'invalid entry in tree object');
var mode = split[1];
var name = split[2];
var hash = to_hex(split[3]);
return { mode: mode, name: name, hash: hash };
}
</textarea>
<p>The <code>parse_tree</code> function above needs a small utility to convert hashes represented using raw bytes to a hexadecimal representation.</p>
<textarea id="in19">
function to_hex(bin) {
var bin = String(bin);
var hex = "";
for (var i = 0; i < bin.length; i++) {
hex += left_pad(bin.charCodeAt(i).toString(16), '0', 2);
}
return hex;
}
</textarea>
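<p>Converting a hash to raw bytes and back should give the original hash. The following sketch checks
this round trip with two hypothetical helpers, <code>demo_hex_to_raw</code> and
<code>demo_raw_to_hex</code>, on a placeholder hash.</p>

```javascript
// Hexadecimal string to a string holding one character per byte.
function demo_hex_to_raw(hex) {
  var raw = '';
  for (var i = 0; i < hex.length; i += 2) {
    raw += String.fromCharCode(parseInt(hex.substring(i, i + 2), 16));
  }
  return raw;
}

// String holding one character per byte to a hexadecimal string.
function demo_raw_to_hex(raw) {
  var hex = '';
  for (var i = 0; i < raw.length; i++) {
    // Each byte must produce exactly two digits, hence the padding.
    hex += raw.charCodeAt(i).toString(16).padStart(2, '0');
  }
  return hex;
}

var demo_hex = '0123456789abcdef0123456789abcdef01234567';
var demo_raw = demo_hex_to_raw(demo_hex);
console.log(demo_raw.length);                         // 20
console.log(demo_raw_to_hex(demo_raw) === demo_hex);  // true
```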
</section>
<section id="parse-commit">
<h1>Parsing commit objects</h1>
<textarea>
function parse_commit(hash) {
var commit = parse_object(hash);
var lines = commit.contents.split('\n');
var tree = null;
var parents = [];
var author = null;
var committer = null;
var i;
// A blank line separates the headers from the message.
for (i = 0; i < lines.length && lines[i] != ''; i++) {
var split = lines[i].match(/^(.*?) (.*)$/);
assert(split, "ill-formed commit header: " + lines[i]);
var header = split[1];
var value = split[2];
switch (header) {
case 'tree':
assert(!tree, 'duplicate tree header in commit');
assert(/^[0-9a-f]{40}$/.test(value), "invalid tree header in commit");
tree = value;
break;
case 'parent':
assert(/^[0-9a-f]{40}$/.test(value), "invalid parent header in commit");
parents.push(value);
break;
case 'author':
assert(!author, 'duplicate author header in commit');
author = parse_author(value, 'author');
break;
case 'committer':
assert(!committer, 'duplicate committer header in commit');
committer = parse_author(value, 'committer');
break;
default: /* unknown field, skipping */ break;
}
}
// The message is everything after the blank line.
var message = lines.slice(i+1).join('\n');
assert(tree, 'commit lacks tree header');
assert(author, 'commit lacks author header');
assert(committer, 'commit lacks committer header');
return {
tree: tree,
parents: parents,
author: author,
committer: committer,
message: message
};
}
</textarea>
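<p>The function above relies on the layout of a commit object: a block of <code>name value</code> header lines, a blank line, then the message. The sketch below shows that split on a made-up merge commit. <code>sketch_split_commit</code> is a hypothetical helper, the hashes are placeholders, and the author and committer headers are left out for brevity.</p>

```javascript
// Hypothetical helper, not part of the tutorial's code: splits the raw
// contents of a commit object into header pairs and a message.
function sketch_split_commit(contents) {
  var lines = contents.split('\n');
  var headers = [];
  var i = 0;
  // Headers run until the first blank line.
  while (i < lines.length) {
    if (lines[i] == '') { break; }
    var split = lines[i].match(/^(.*?) (.*)$/);
    if (split) { headers.push([split[1], split[2]]); }
    i++;
  }
  // Everything after the blank line is the message, blank lines included.
  return { headers: headers, message: lines.slice(i + 1).join('\n') };
}

// Made-up merge commit with two parents (placeholder hashes).
var sketch_commit = [
  'tree ' + new Array(41).join('a'),
  'parent ' + new Array(41).join('b'),
  'parent ' + new Array(41).join('c'),
  '',
  'Merge branch dev',
  '',
  'Longer description.',
  ''
].join('\n');

var sketch_result = sketch_split_commit(sketch_commit);
console.log(sketch_result.headers.length, JSON.stringify(sketch_result.message));
```

<p>Only the first blank line acts as a separator, so blank lines inside the message are preserved.</p>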
</section>
<section id="parse-author-committer">
<h1>Parsing author and committer metadata</h1>
<textarea>
function parse_author(value, field) {
var split = value.match(/^(.*?) <(.*?)> ([0-9]+) ([+-])([0-9][0-9])([0-9][0-9])$/);
assert(split, 'ill-formed ' + field);
var name = split[1];
var email = split[2];
var date = new Date(parseInt(split[3], 10) * 1000);
var timezone_sign = (split[4] == '+' ? 1 : -1);
var timezone_hours = parseInt(split[5], 10);
var timezone_minutes = parseInt(split[6], 10);
var timezone = timezone_sign * (timezone_hours * 60 + timezone_minutes);
return { name: name, email: email, date: date, timezone: timezone };
}
</textarea>
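<p>To make the format concrete, the following sketch assembles an author line from its parts and decodes it with the same pattern as above. The person, e-mail address and timestamp are made up, and the <code>sketch_</code> names are hypothetical.</p>

```javascript
// Hypothetical helper: assembles "Name <email> seconds zone".
function sketch_author_line(name, email, seconds, zone) {
  return name + ' <' + email + '> ' + seconds + ' ' + zone;
}

var sketch_value = sketch_author_line('Ada Lovelace', 'ada@example.com', 1617184800, '+0130');

// Same pattern as in parse_author.
var sketch_split = sketch_value.match(/^(.*?) <(.*?)> ([0-9]+) ([+-])([0-9][0-9])([0-9][0-9])$/);

// The timestamp is in seconds since the Unix epoch; Date expects milliseconds.
var sketch_date = new Date(parseInt(sketch_split[3], 10) * 1000);

// The zone is a sign followed by hours and minutes; convert it to minutes.
var sketch_sign = (sketch_split[4] == '+' ? 1 : -1);
var sketch_offset = sketch_sign *
  (parseInt(sketch_split[5], 10) * 60 + parseInt(sketch_split[6], 10));

console.log(sketch_split[1], sketch_split[2], sketch_date.getTime(), sketch_offset);
```

<p>The offset is expressed in minutes east of UTC, so <code>+0130</code> becomes <code>90</code> and <code>-0500</code> would become <code>-300</code>.</p>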
</section>
<section id="checkout-example">
<h1>Example checkout</h1>
<p>With the parsing functions above in place, the initial commit can be checked out:</p>
<textarea id="in20">
git_checkout(initial_commit);
</textarea>
</section>
</section>
<section id="git-init">
<h1><code>git init</code></h1>
<p>The <code>git init</code> command creates the <code>.git</code> directory and points <code>.git/HEAD</code>
to the default branch (a file which does not exist yet, as this branch does not contain any commit at this point).</p>
<textarea id="in21">
function git_init() {
git_init_mkdir();
git_init_head();
}
</textarea>
</section>
<section id="index">
<h1>The index</h1>
<p>When adding files with <code>git add</code>, GIT does not immediately create a commit object.
Instead, it adds the files to the index, which uses a binary format with lots of metadata.
The mock filesystem used here lacks most of these pieces of information, so the value <code>0</code>
will be used for most fields. See <a href="https://mincong.io/2018/04/28/git-index/">this blog post</a>
for a more in-depth study of the index.</p>
<textarea id="index-raw-bytes-utils">
function raw_bytes(val, bytes) {
return hex_to_raw_bytes(left_pad(val.toString(16), '0', bytes*2));
}
function raw_bytes16(val) { return raw_bytes(val, 2); }
function raw_bytes32(val) { return raw_bytes(val, 4); }
function raw_bytes64(val) { return raw_bytes(val, 8); }
</textarea>
<textarea id="make-index">
function store_index(paths) {
var magic = 'DIRC'; // DIRectory Cache
var version = raw_bytes32(2);
var entries = raw_bytes32(paths.length);
var header = magic + version + entries;
var index = header;
for (var i = 0; i < paths.length; i++) {
var ctime = raw_bytes64(0);
var mtime = raw_bytes64(0);
var device = raw_bytes32(0);
var inode = raw_bytes32(0);
// default permissions for regular files (octal 100644).
var mode = raw_bytes32(0o100644);
var uid = raw_bytes32(0);
var gid = raw_bytes32(0);
var size = raw_bytes32(read(join_paths(current_directory, paths[i])).length);
var hash = hex_to_raw_bytes(hash_object(true, 'blob', false, paths[i]));
// for this simple index, the flags (the 4 higher bits) are 0,
// and the 12 lower bits hold the length of the path.
assert(paths[i].length < 0xfff);
var flags_and_file_path_length = raw_bytes16(paths[i].length);
var file_path = paths[i] + '\0';
var entry = ctime + mtime + device + inode + mode + uid + gid + size
+ hash + flags_and_file_path_length + file_path;
while (entry.length % 8 != 0) {
// pad with null bytes to a multiple of 8 bytes (64-bits).
entry += '\0';
}
index += entry;
}
// the index ends with the SHA-1 checksum of everything before it.
index += hex_to_raw_bytes(sha1(index));
write(join_paths(current_directory, '.git/index'), index);
}
</textarea>
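<p>The layout written by <code>store_index</code> can be checked with a small stand-alone sketch that encodes the 12-byte header, reads it back, and computes the padded size of an entry. All <code>sketch_</code> helpers are hypothetical and do not reuse the tutorial's functions; integers are stored most significant byte first, as <code>raw_bytes</code> does.</p>

```javascript
// Hypothetical helper: encode an unsigned integer as big-endian raw bytes.
function sketch_raw_bytes(val, bytes) {
  var out = '';
  for (var i = bytes - 1; i >= 0; i--) {
    out += String.fromCharCode(Math.floor(val / Math.pow(256, i)) % 256);
  }
  return out;
}

// Hypothetical helper: decode big-endian raw bytes back into an integer.
function sketch_read_uint(str, offset, bytes) {
  var val = 0;
  for (var i = 0; i < bytes; i++) {
    val = val * 256 + str.charCodeAt(offset + i);
  }
  return val;
}

// Size of one entry: 62 bytes of fixed fields (two 8-byte timestamps,
// six 4-byte integers, a 20-byte hash, 2 bytes of flags), then the path
// and its null terminator, padded to a multiple of 8 bytes.
function sketch_entry_length(path_length) {
  var length = 62 + path_length + 1;
  while (length % 8 != 0) { length++; }
  return length;
}

// Header of a version 2 index announcing 3 entries.
var sketch_header = 'DIRC' + sketch_raw_bytes(2, 4) + sketch_raw_bytes(3, 4);
var sketch_parsed_header = {
  magic: sketch_header.substr(0, 4),
  version: sketch_read_uint(sketch_header, 4, 4),
  entries: sketch_read_uint(sketch_header, 8, 4)
};
console.log(sketch_parsed_header, sketch_entry_length('README'.length));
```

<p>With these assumptions, the entry for <code>README</code> occupies 72 bytes: 69 bytes of data followed by 3 null bytes of padding.</p>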
</section>
<section id="playground">
<h1>Playground</h1>
<p>The implementation is now sufficiently complete to create a small repository.</p>
<textarea id="playground-reset">
// Reset the filesystem to its initial state
filesystem = {};
current_directory = '';
</textarea>
<textarea id="playground-play">
mkdir('proj');
cd('proj');
write('proj/README', 'This is my implementation of GIT.\n');
mkdir('proj/src');
write('proj/src/main.scm', "(define filesystem '())\n...\n");
git_init();
git_commit(['README', 'src/main.scm'], 'A well-understood initial commit.');
git_branch('dev', 'HEAD');
git_checkout('dev');
write('proj/src/main.scm', "(define filesystem '())\n(define current_directory \"\")\n");
git_commit(['README', 'src/main.scm'], 'What an update!');
git_checkout('main');
// update the cache of the working directory. Without this,
// GIT finds an empty cache, and thinks all files are scheduled
// for deletion, until "git add ." allows it to realize that
// the working directory matches the contents of HEAD.
store_index(['README', 'src/main.scm']);
</textarea>
<p>By clicking on "Copy commands to recreate in *nix terminal.", it is possible to copy a series of <code>mkdir …</code> and <code>printf … > …</code> commands that, when executed, will recreate the virtual filesystem on a real system. The resulting
folder is bit-compatible with the official <code>git log</code>, <code>git status</code>, <code>git checkout</code> etc.
commands.</p>
</section>
<section id="conclusion">
<h1>Conclusion</h1>
<p>This article shows that a large part of the core of GIT can be re-implemented in <span class="loc-count">a few</span> source lines of code* (<a href="javascript:___copy_all_code(); void(0);">copy all the code</a>).
<span style="font-size: small">*&nbsp;empty lines and single closing braces excluded, <span class="loc-count-total">a few more</span> in total.</span></p>
<div id="copy-all-code" style="display: none;"></div>
<ul>
<li>Some of the features which may appear mysterious at first sight (e.g. detached HEAD) should be clearer with the knowledge of how GIT works behind the scenes.</li>
<li>Furthermore, branches are often associated with an intuition (containers into which commits are added) which does not match the implementation (mutable pointers to commits).</li>
<li>Finally, it is tempting to think of commits as patches. While <code>darcs</code> tries to expose an interface which matches this intuition, the implementation of GIT treats each commit as a copy of the entire repository, linked to the previous version solely by the <code>parent</code> metadata in the commit headers.</li>
</ul>
<p>A few core commands like <code>git diff</code> and <code>git apply</code> are not described in this tutorial.
They are little more than improved versions of the classical *nix commands <code>diff</code> and <code>patch</code>.</p>
<p>Most other commands provided by GIT are merely convenience wrappers around these commands. For example, <code>git cherry-pick</code> is simply a combination of <code>git diff</code> between the tree of a commit and the tree of its parent, followed by <code>git apply</code> to apply the patch and <code>git commit</code> to create a new commit whose diff is equivalent to the diff of the original commit. As another example, the command <code>git rebase</code> performs a succession of <code>cherry-pick</code> operations.</p>
<p>By keeping in mind the internal model of GIT, it becomes easier to understand the usual commands and their quirks. By understanding the design philosophy behind the implementation, the day-to-day usage can hopefully become less surprising.</p>
</section>
<div id="toc"></div>
</article>
<script>
(function() {
var script = '';
var ta = document.getElementsByTagName('textarea');
for (var j = 0; j < ta.length; j++) {
if (ta[j] == document.getElementById('playground-reset')) {
break;
}
script += ta[j].value + "\n\n";
}
var js = document.getElementsByTagName('script');
for (var j = 0; j < js.length; j++) {
if (js[j].className.indexOf('example') != -1) {
script += js[j].innerText;
}
}
script += '\nfor (var i = 0; i < examples.length; i++) { examples[i](); }';
eval(script);
})();
___git_tutorial_onload()
</script>
</body>
</html>