Dissecting .git/

One of my favorite pieces of software is Magit, the self-described "Git porcelain inside Emacs." A commenter on Hacker News claimed to finally grok git after switching to Magit, so I decided to try it out. I wasn't sold immediately; I had a hard time discovering the available actions and keybindings, so I frequently got stuck and fell back to the git command-line interface. But I slowly grew to like Magit, particularly the ability to commit parts of a file, which I had no idea was possible!

Somewhere along the way I read something in the Magit docs about "hunks", and made the connection that "git add-ing the changed parts of a file" was the same thing as "staging a hunk." I had a working understanding of the git index: I knew that git add would put things in it, and that git commit would commit its contents.

So in my mental model, git tracked "hunks", each of which was just a set of lines in a file that changed between two commits. In other words, a "hunk" was something like "the delta between two revisions of a file." Hunks were stored in the index until they were committed. A commit was just some metadata with a set of hunks and a pointer to the previous commit.

This mental model makes it easy to imagine how git diff would work: git diff A B would just walk the tree of commits from root to HEAD, incrementally adding hunks until it reached commit B, at which point it would print out a pretty representation of the accumulated deltas between commits A and B.

In retrospect, it's a credit to the interface of both git and Magit that I was able to use these tools for years even though I totally misunderstood what was going on under the hood.

I realized that my understanding was off when I stumbled across this StackOverflow answer 1 to the question "Is there a “theirs” version of “git merge -s ours”?" The answer demonstrates how to make branch A look exactly like branch B with a clever sequence of git commands.

The summary is:

  1. git merge -s ours B to create a "dummy" merge-commit
  2. git branch TEMP to mark this commit with a temp branch
  3. git reset --hard B to make the contents of the working tree and index look exactly like B
  4. git reset --soft TEMP to put HEAD back to the dummy merge-commit
  5. git commit --amend to overwrite the dummy merge-commit

This totally blew up my mental model. If it were true that commits only store incremental changes between files, then this sequence of steps shouldn't work: discarding all the hunks created by git merge and overwriting them with hunks from another branch should have resulted in a jumbled Frankenstein commit full of hunks that didn't fit together at all. And yet, it worked beautifully.

I spent some time digging around in the .git/ subdirectory, and I now have a better understanding.

What follows is a guided tour of my journey into .git/. When I got stuck, I consulted Chapter 10 of the Git Book 2.

1 A first look at .git/

Everything that git stores is in the .git/ directory. Let's start by creating a dummy repo and inspecting its contents.

git init
tree .git
.git
├── HEAD
├── branches
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

9 directories, 16 files

We don't really care about git hooks or the repo metadata for now, so let's filter those out:

alias gittree="tree -I 'hooks|description|info|config|branches' .git"
gittree

.git
├── HEAD
├── objects
│   └── pack
└── refs
    ├── heads
    └── tags

5 directories, 1 file

So we've got HEAD, which presumably points to the tip of our current branch.

file .git/HEAD
cat .git/HEAD
.git/HEAD: ASCII text
ref: refs/heads/master

Hey, it's just a plain ol' ASCII file. That's easy. Ok, we need some content for this repo. Let's create a file with a list of numbers.

for i in {1..15}; do echo "$i" >> numbers.txt; done
head -n5 numbers.txt

1
2
3
4
5

git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	numbers.txt

nothing added to commit but untracked files present (use "git add" to track)

Now let's add our file to the index. We should expect something to change here…

git add numbers.txt
gittree

.git
├── HEAD
├── index
├── objects
│   ├── 97
│   │   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
│   └── pack
└── refs
    ├── heads
    └── tags

6 directories, 3 files

Aha! We've got a new file: index! Let's crack it open…

2 .git/index

file .git/index
.git/index: Git index, version 2, 1 entries

Ok, this seems to be some sort of git-internal binary format. A bit of googling turns up the command git ls-files, which according to the manpage is used to "show information about files in the index and working tree."

git ls-files --stage
100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0	numbers.txt

The manpage says the format is:

[<tag> ]<mode> <object> <stage> <file>

<mode> (roughly) corresponds to the Unix file mode; <object> looks like some kind of hash; <stage> = 0 isn't exactly self-explanatory, but we can always revisit that; and finally, <file>, which is self-explanatory.
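To make that format concrete, here is a minimal Python sketch that splits one line of git ls-files --stage output into its fields. The names IndexEntry and parse_stage_line are my own, not anything git provides, and the sketch assumes a line without the optional <tag> prefix and a path that git did not need to quote.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str    # file mode as an octal string, e.g. "100644"
    obj: str     # 40-character hex object name
    stage: int   # stage number (0 in our example)
    path: str


def parse_stage_line(line: str) -> IndexEntry:
    """Parse one line of `git ls-files --stage` output (no <tag> prefix)."""
    # A tab separates the metadata from the path; the metadata fields
    # themselves are separated by single spaces.
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, obj, stage = meta.split(" ")
    return IndexEntry(mode, obj, int(stage), path)


entry = parse_stage_line(
    "100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0\tnumbers.txt"
)
print(entry.path, entry.mode, entry.stage)
print(entry.obj)
```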

Pssst. That object hash 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd - does it look familiar?

tree --prune .git/objects
.git/objects
└── 97
    └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd

1 directory, 1 file

Concatenate the directory name with the name of the file inside it, and you get the <object> corresponding to numbers.txt in the index. Now let's see what the object is…

3 .git/objects

file .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
.git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd: VAX COFF executable not stripped

Don't trust that guess: the object is actually zlib-compressed data, which file has misidentified. No problem, Python has a built-in zlib module. Let's whip up a quick script to decompress and print zlib-compressed bytes.

#!/usr/bin/env python3
"""
Inflate zlib-compressed bytes and print them as UTF-8
"""

import sys
import textwrap
import zlib
from typing import BinaryIO


def as_hex(buf: bytes) -> str:
    """Render bytes as a string of lowercase hex digits."""
    return buf.hex()


def inflate_and_print(buf: BinaryIO) -> None:
    """
    Decompress with zlib and dump the uncompressed bytes to stdout.

    If we can't decode as UTF-8 then print two representations:

        1. Python's string representation (nice for when a payload contains
           both Unicode and non-Unicode)
        2. Hex
    """
    data = zlib.decompress(buf.read())
    try:
        s = data.decode('utf8')
    except UnicodeDecodeError:
        s = repr(data)
        # Wrap the hex dump at 16 characters per line
        s += '\n[hex repr]\n' + textwrap.fill(as_hex(data), 16)

    print(s)


# Read from the file named on the command line, or from stdin for '-' / no argument
file_arg = sys.argv[1] if len(sys.argv) > 1 else '-'
if file_arg == '-':
    inflate_and_print(sys.stdin.buffer)
else:
    with open(file_arg, 'rb') as f:
        inflate_and_print(f)

With the script saved as an executable named zprint, we're ready to decompress and print the object:

zprint .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
blob 361
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Hey, that's just the full content of numbers.txt, prefixed with some sort of header: blob 36\0 (the null byte isn't rendered in the HTML export, but it's there). Could "36" be the size of the content that follows?

wc -c numbers.txt
36 numbers.txt

36 bytes - probably not a coincidence.

So to recap what happened when we staged our new file:

  • git add numbers.txt created a new entry in the index
  • That entry included a filename (numbers.txt) and some sort of ID
  • That ID corresponds to the name of a file in .git/objects
  • The file in .git/objects is just a metadata header + the content of numbers.txt, compressed together with zlib
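The recap above can be sketched in a few lines of Python. One thing the tour hasn't established yet, and which I'm taking from Chapter 10 of the Git Book, is where the ID comes from: it is the SHA-1 of the header plus the content (this assumes a repository using git's default SHA-1 object format). The function name make_blob is hypothetical.

```python
import hashlib
import zlib
from typing import Tuple


def make_blob(content: bytes) -> Tuple[str, bytes]:
    """Return (object name, compressed bytes) for a blob holding `content`."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    store = header + content
    name = hashlib.sha1(store).hexdigest()   # assumption: SHA-1 repository
    return name, zlib.compress(store)        # header and content are compressed together


# The same bytes that the shell loop wrote to numbers.txt
content = b"".join(b"%d\n" % i for i in range(1, 16))
name, compressed = make_blob(content)
print(len(content))  # 36
print(f".git/objects/{name[:2]}/{name[2:]}")
```

If those assumptions hold, the printed path should be the same one that showed up under .git/objects after git add. The compressed bytes need not match git's file byte for byte, since that depends on the compression level, but they inflate to the same header and content.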

So far, so good. Now let's create a commit.

4 git commit

git commit -m "First commit: Add numbers.txt"
gittree
[master (root-commit) 0b7291f] First commit: Add numbers.txt
 1 file changed, 15 insertions(+)
 create mode 100644 numbers.txt
.git
├── COMMIT_EDITMSG
├── HEAD
├── index
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── 0b
│   │   ├── 4252fee2e097732e264bea210e35be1cb63345
│   │   └── 7291fe750a527d1df16caa5c6828027859512c
│   ├── 97
│   │   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

10 directories, 9 files

Ok, we've got a lot of new stuff to look at:

  • COMMIT_EDITMSG
  • .git/refs/heads/master
  • .git/logs/
  • 2 new objects in .git/objects

Let's take them in order…

cat .git/COMMIT_EDITMSG
First commit: Add numbers.txt

Ok, that's simple enough. Next:

cat .git/refs/heads/master
0b7291fe750a527d1df16caa5c6828027859512c

So the ref master is just a pointer to one of our .git/objects. How about .git/logs?

cat .git/logs/HEAD
   
1583704643 +0000 commit (initial): First commit: Add numbers.txt
cat .git/logs/refs/heads/master
   
1583704643 +0000 commit (initial): First commit: Add numbers.txt

The HEAD and master logs are identical (no surprise there), and each entry appears to hold just a timestamp, a reference to the commit, and the commit message.
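Before moving on, the chain we've seen so far (HEAD names a ref, the ref file holds an object name, and the object name maps to a path under .git/objects) can be sketched in Python. The helper names resolve_head and object_path are made up for illustration, and the sketch assumes refs stored as plain files like the one above; it knows nothing about any other way git might store refs.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to the object name it ultimately refers to."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # HEAD names a file under refs/ that holds the object name.
        return (git_dir / head[len("ref: "):]).read_text().strip()
    return head  # otherwise HEAD already holds an object name


def object_path(git_dir: Path, sha: str) -> Path:
    """Loose objects live at objects/<first 2 hex chars>/<remaining 38>."""
    return git_dir / "objects" / sha[:2] / sha[2:]


# Recreate just the two files from the walkthrough in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    (git_dir / "refs" / "heads" / "master").write_text(
        "0b7291fe750a527d1df16caa5c6828027859512c\n")
    sha = resolve_head(git_dir)
    print(object_path(git_dir, sha).relative_to(tmp))
```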

That brings us to the interesting part: our 2 new objects. We already know that HEAD/master is pointing at a file in .git/objects. What's in the file?

sha=$(cat .git/refs/heads/master)
object="${sha:0:2}/${sha:2}"
file .git/objects/${object}
.git/objects/0b/7291fe750a527d1df16caa5c6828027859512c: VAX COFF executable

Ok, more zlib-compression. We can handle that:

zprint .git/objects/${object}

Let's take a closer look at the header. Recall the header from the last object we looked at (the compressed content of numbers.txt):

zprint .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd | head -n1
blob 361

This time, instead of "blob" we have "commit":

zprint .git/objects/${object} | head -n1

We can now deduce that the header is something like <object type> <content size in bytes>\0<content>, where <object type> is one of "blob" or "commit". Blobs contain compressed file content, and commits contain:

  1. A reference to what looks like our mysterious 3rd object
  2. Two timestamps (identical, in this case)
  3. The commit message
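Here is a small Python sketch of that deduced header format, applied to bytes that have already been decompressed. The name split_object is hypothetical; the sample bytes are the blob header and content we saw for numbers.txt.

```python
from typing import Tuple


def split_object(raw: bytes) -> Tuple[str, bytes]:
    """Split decompressed object bytes into (object type, content)."""
    header, _, content = raw.partition(b"\x00")   # header ends at the first NUL
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError(f"header says {size} bytes, found {len(content)}")
    return obj_type, content


raw = b"blob 36\x00" + b"".join(b"%d\n" % i for i in range(1, 16))
obj_type, content = split_object(raw)
print(obj_type, len(content))  # blob 36
```

The size check is there only to confirm the guess that the number in the header counts the bytes after the null byte.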

Finally, let's crack open that final file in .git/objects which is referenced by the commit.

zprint .git/objects/0b/4252fee2e097732e264bea210e35be1cb63345

As usual, the payload has a header of the form <object type> <content size in bytes>\0. This time we've got a new object type: tree. The tree object appears to contain:

  1. A file mode (100644)
  2. A filename (numbers.txt)
  3. A reference to object 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd, which is the blob object corresponding to numbers.txt
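For the curious, here is a rough Python sketch of how those three pieces are laid out in a tree's content, following the description in Chapter 10 of the Git Book: each entry is the mode, a space, the filename, a null byte, and then the object name as 20 raw bytes rather than 40 hex characters. Those raw bytes are why a tree can't be decoded as UTF-8 and zprint falls back to its repr and hex output. The function name tree_entries is my own.

```python
from typing import Iterator, Tuple


def tree_entries(body: bytes) -> Iterator[Tuple[str, str, str]]:
    """Yield (mode, name, object name) for each entry in a tree's content."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)        # end of the mode
        nul = body.index(b"\x00", space)     # end of the filename
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        sha = body[nul + 1:nul + 21].hex()   # 20 raw bytes, shown as hex
        yield mode, name, sha
        pos = nul + 21


# The content of our one-entry tree (everything after the "tree <size>\0" header)
body = (b"100644 numbers.txt\x00"
        + bytes.fromhex("97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd"))
print(list(tree_entries(body)))
```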

So to recap what we've found:

  • There are three types of objects: blobs, commits, and trees
  • Blobs are just compressed file content
  • Commits point to a tree and store some metadata
  • Trees consist of entries, each with a filename, a file mode, and an associated blob

Let's add a new file in a new commit and validate our understanding:

echo "red\nblue\ngreen" > colors.txt
git add colors.txt

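Note that echo left the \n sequences uninterpreted here, so colors.txt is a single line (which is why the commit below reports only 1 insertion). Now we check the index: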

git ls-files --stage
100644 ae981935c385a7575d2e992c626cc72fbf552c90 0	colors.txt
100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0	numbers.txt

We should have 1 new object: ae981935c385a7575d2e992c626cc72fbf552c90. Do we?

tree --prune .git/objects
.git/objects
├── 0b
│   ├── 4252fee2e097732e264bea210e35be1cb63345
│   └── 7291fe750a527d1df16caa5c6828027859512c
├── 97
│   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
└── ae
    └── 981935c385a7575d2e992c626cc72fbf552c90

3 directories, 4 files

Indeed we do. Now let's commit it.

git commit -m "Add colors" -m "Here's a commit message with an actual body, not just a subject line"
[master 66e8fff] Add colors
 1 file changed, 1 insertion(+)
 create mode 100644 colors.txt

git commit should have produced (at least) 2 new objects: a tree and a commit.

tree --prune .git/objects
.git/objects
├── 0b
│   ├── 4252fee2e097732e264bea210e35be1cb63345
│   └── 7291fe750a527d1df16caa5c6828027859512c
├── 66
│   └── e8fff6f795b91081f5d05eade083f18e600fc4
├── 97
│   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
├── ae
│   └── 981935c385a7575d2e992c626cc72fbf552c90
└── fd
    └── 8cf227b67f57d753468f6f0319a5558d37bd0d

5 directories, 6 files

Here's the commit object:

sha=$(cat .git/refs/heads/master)
object="${sha:0:2}/${sha:2}"
zprint .git/objects/${object}

Notice that the commit has a field we haven't seen before: parent. This looks like a pointer to our previous commit.
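As a sketch of how a commit's content is organized, the Python below splits it into header fields and a message. The tree and parent hashes and the subject line come from the walkthrough; the author and committer identity and timestamps are placeholders I made up, since they depend on who ran the commands. The name parse_commit is hypothetical, and the sketch assumes simple one-line header fields.

```python
from typing import Dict, List, Tuple


def parse_commit(content: bytes) -> Tuple[Dict[str, List[str]], str]:
    """Split a commit's content into header fields and the message."""
    text = content.decode("utf-8")
    head, _, message = text.partition("\n\n")     # first blank line ends the headers
    fields: Dict[str, List[str]] = {}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value)  # a list, in case a field repeats
    return fields, message


# Placeholder identity and timestamps; hashes and subject are from the walkthrough.
sample = (
    b"tree fd8cf227b67f57d753468f6f0319a5558d37bd0d\n"
    b"parent 0b7291fe750a527d1df16caa5c6828027859512c\n"
    b"author A U Thor <author@example.com> 0 +0000\n"
    b"committer A U Thor <author@example.com> 0 +0000\n"
    b"\n"
    b"Add colors\n"
)
fields, message = parse_commit(sample)
print(fields["tree"], fields["parent"])
print(message, end="")
```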

And now for our tree:

zprint .git/objects/fd/8cf227b67f57d753468f6f0319a5558d37bd0d

As expected, this tree object has two entries - one for each file.

5 What about hunks?

This is all great, but I still don't know how to incorporate the concept of "hunks" into this mental model. Let's stage a hunk and see what our index object looks like.

mv numbers.txt{,.bak}
awk '{ if (NR == 1) print "Stage me!"; else if (NR == 14) print "But not me!"; else print ; }' numbers.txt.bak > numbers.txt
rm numbers.txt.bak
cat numbers.txt
Stage me!
2
3
4
5
6
7
8
9
10
11
12
13
But not me!
15

Normally I would use Magit or git add -i to interactively stage hunks, but for the sake of reproducibility, I'm going to script it. First, we capture the unstaged changes to our file as a patch:

git diff | tee /tmp/stageme.patch
diff --git a/numbers.txt b/numbers.txt
index 97b3d1a..8a10488 100644
--- a/numbers.txt
+++ b/numbers.txt
@@ -1,4 +1,4 @@
-1
+Stage me!
 2
 3
 4
@@ -11,5 +11,5 @@
 11
 12
 13
-14
+But not me!
 15

This patch has two hunks. Let's remove the second one:

perl -lne 'if (m/^@@/) { $hunk_count++ }; print if $hunk_count < 2' /tmp/stageme.patch | tee /tmp/onehunk.patch
diff --git a/numbers.txt b/numbers.txt
index 97b3d1a..8a10488 100644
--- a/numbers.txt
+++ b/numbers.txt
@@ -1,4 +1,4 @@
-1
+Stage me!
 2
 3
 4

Now let's stage only the first hunk. Note that we use --cached so that the change is applied to the index but not to our working tree, i.e., the version of numbers.txt that we have on disk.

git apply --cached /tmp/onehunk.patch

Let's check the contents of the index and confirm that only the first hunk is staged:

git --no-pager diff --cached
diff --git a/numbers.txt b/numbers.txt
index 97b3d1a..576880b 100644
--- a/numbers.txt
+++ b/numbers.txt
@@ -1,4 +1,4 @@
-1
+Stage me!
 2
 3
 4

Looks good. Let's also double-check that our working tree includes both hunks.

cat numbers.txt
Stage me!
2
3
4
5
6
7
8
9
10
11
12
13
But not me!
15

Ok, good - the version of numbers.txt in the git index now differs from the version in our working tree. Let's look at the git object corresponding to numbers.txt in the index:

git ls-files --stage | grep numbers.txt
100644 576880b02860bc2fb8ccfd0437884cfbff621cc5 0	numbers.txt

zprint .git/objects/57/6880b02860bc2fb8ccfd0437884cfbff621cc5
blob 44Stage me!
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Interesting… so the blob object that git apply --cached creates is a standalone object. That is, there's no structural sharing or "delta" that's layered on top of some other object; git has created a new, independent object with the file's contents. Furthermore, the content in this object can't be found anywhere in the working tree - it lives only in .git/objects.
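To see why staging a single hunk still yields a whole new object, here is a small Python sketch. It rebuilds the committed content of numbers.txt, applies only the first hunk, and names both versions the way Chapter 10 of the Git Book describes: the SHA-1 of the header plus the content, assuming a SHA-1 repository. The helper blob_name is hypothetical.

```python
import hashlib


def blob_name(content: bytes) -> str:
    """Object name of a blob: SHA-1 over the header plus the full content."""
    store = b"blob %d\x00" % len(content) + content
    return hashlib.sha1(store).hexdigest()


committed = b"".join(b"%d\n" % i for i in range(1, 16))

# Apply only the first hunk: line 1 changes, line 14 is left alone.
lines = committed.splitlines(keepends=True)
lines[0] = b"Stage me!\n"
staged = b"".join(lines)

print(len(committed), len(staged))                 # 36 44
print(blob_name(committed) != blob_name(staged))   # True
print(blob_name(staged))
```

Because the name is derived from the complete content, changing even one line produces a different object that has to be stored in full. If the assumptions hold, the last line printed should be the 576880b... name that the index listed for numbers.txt.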

6 Conclusion

Git's data model is both simple and elegant. Consider how much is accomplished with so little; almost all of git's functionality rests on a foundation of just three kinds of objects: blobs, trees, and commits (annotated tags add a fourth, which we didn't run into here). Nothing that can be computed needs to be materialized or stored, so it isn't.
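A diff is a good example of something that can be computed rather than stored. As an illustration only, the sketch below uses Python's difflib on two complete versions of numbers.txt; difflib is a stand-in here and not the algorithm git itself uses, but the idea is the same: given two snapshots, the hunk falls out on demand.

```python
import difflib

# Two full snapshots: the committed file and the version with line 1 changed
old = [f"{i}\n" for i in range(1, 16)]
new = ["Stage me!\n"] + old[1:]

# The delta is derived from the snapshots whenever someone asks for it.
diff = list(difflib.unified_diff(old, new, "a/numbers.txt", "b/numbers.txt"))
print("".join(diff), end="")
```

The single hunk it prints has the same shape as the first hunk of the patch we staged earlier.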

This exercise got me thinking about Steve Yegge's "Kingdom of Nouns" 3, in which he describes the elevation of Nouns over Verbs in Object Oriented Programming. I can't imagine a citizen of Javaland designing anything that remotely resembles git's data model.

In OOP, you model a problem domain by identifying the important Nouns and turning them into classes. For a version control system, you'd have Commits, sure, and maybe you'd have Trees, too. But you'd also have Diffs, because diffs are a Very Important Thing, and it's practically inconceivable that an important thing would not be modeled as an object.

And what do you do with important objects? You build them with Factories, you persist them in databases via an ORM, or maybe you serialize them as XML and store them on disk. You could have the insight that Diffs are cheaply computed given Blobs and Trees, so they need not be first-class citizens in your data model; it's just so improbable that this would occur to you if you're thinking in terms of objects and methods.

This quote from Linus Torvalds now makes more sense to me:

In fact, I'm a huge proponent of designing your code around the data, rather
than the other way around, and I think it's one of the reasons git has
been fairly successful (*).

...

(*) I will, in fact, claim that the difference between a bad programmer
and a good one is whether he considers his code or his data structures
more important. Bad programmers worry about the code. Good programmers
worry about data structures and their relationships.

– Linus Torvalds 4

I generally avoid engaging in hero-worship, and I'm especially wary of making a hero of anyone who casually divides the world into "bad programmers" and "good programmers." But I do have a renewed sense of respect for both the design and the designer of git.

Footnotes: