Dissecting .git/

1. A first look at .git/
2. .git/index
3. .git/objects
4. git commit
5. What about hunks?
6. Conclusion

One of my favorite pieces of software is Magit, the self-described "Git porcelain inside Emacs." A commenter on Hacker News claimed to finally grok git after switching to Magit, so I decided to try it out. I wasn't sold immediately; I had a hard time discovering the available actions and keybindings, so I frequently got stuck and fell back to the git command-line interface. But I slowly grew to like Magit, particularly the ability to commit parts of a file, which I had no idea was possible!

Somewhere along the way I read something in the Magit docs about "hunks", and made the connection that "git add-ing the changed parts of a file" was the same thing as "staging a hunk." I had a working understanding of the git index: I knew that git add would put things in it, and that git commit would commit its contents.

So in my mental model, git tracked "hunks", which were just a set of lines in a file that changed between two commits. In other words, a "hunk" was something like "the delta between two revisions of a file." Hunks were stored in the index until they were committed. A commit was just some metadata with a set of hunks and a pointer to the previous commit.

This mental model makes it easy to imagine how git diff would work: git diff A B would just walk the tree of commits from root to HEAD, incrementally adding hunks until it reached commit B, at which point it would print out a pretty representation of the accumulated deltas between commits A and B.

In retrospect, it's a credit to the interface of both git and Magit that I was able to use these tools for years even though I totally misunderstood what was going on under the hood.

I realized that my understanding was off when I stumbled across this Stack Overflow answer ¹ to the question "Is there a 'theirs' version of 'git merge -s ours'?" The answer demonstrates how to make branch A look exactly like branch B with a clever sequence of git commands.

The summary is:

1. git merge -s ours B to create a "dummy" merge-commit
2. git branch TEMP to mark this commit with a temp branch
3. git reset --hard B to make the contents of the working tree and index look exactly like B
4. git reset --soft TEMP to put HEAD back to the dummy merge-commit
5. git commit --amend to overwrite the dummy merge-commit

This totally blew up my mental model. If it were true that commits only store incremental changes between files, then this sequence of steps shouldn't work: discarding all the hunks created by git merge and overwriting them with hunks from another branch should have resulted in a jumbled Frankenstein commit full of hunks that didn't fit together at all. And yet, it worked beautifully.

I spent some time digging around in the .git/ subdirectory and I now have a better understanding. Here's a quick summary of what I learned.

- .git/ is a lot like a key-value store; it's also a lot like a filesystem, where git blobs = Unix inodes and git trees = Unix directory entries.
- git does not store hunks, deltas, or diffs; it stores the full content of files (compressed).
- Almost everything git stores is one of three types of objects:
  - blob: the compressed content of a file
  - tree: each tree node consists of entries; each entry consists of a mode, a filename, and the ID of either a blob (i.e., a file) or another tree node (i.e., a directory)
  - commit: a pointer to a tree, some metadata associated with provenance (e.g., author, commit message), and a pointer to the previous commit object (the parent)

What follows is a guided tour of my journey into .git/. When I got stuck, I consulted Chapter 10 of the Git Book ².

1 A first look at .git/

Everything that git stores is in the .git/ directory. Let's start by creating a dummy repo and inspecting its contents.

git init

tree .git

.git
├── HEAD
├── branches
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

9 directories, 16 files

We don't really care about git hooks or the repo metadata for now, so let's filter those out:

alias gittree="tree -I 'hooks|description|info|config|branches' .git"
gittree


.git
├── HEAD
├── objects
│   └── pack
└── refs
    ├── heads
    └── tags

5 directories, 1 file

So we've got HEAD, which presumably points to the tip of our current branch.

file .git/HEAD
cat .git/HEAD

.git/HEAD: ASCII text
ref: refs/heads/master

Hey, it's just a plain ol' ASCII file. That's easy. Ok, we need some content for this repo. Let's create a file with a list of numbers.

for i in {1..15}; do echo "$i" >> numbers.txt; done
head -n5 numbers.txt

git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	numbers.txt

nothing added to commit but untracked files present (use "git add" to track)

Now let's add our file to the index. We should expect something to change here…

git add numbers.txt
gittree


.git
├── HEAD
├── index
├── objects
│   ├── 97
│   │   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
│   └── pack
└── refs
    ├── heads
    └── tags

6 directories, 3 files

Aha! We've got a new file: index! Let's crack it open…

2 .git/index

file .git/index

.git/index: Git index, version 2, 1 entries

Ok, this seems to be some sort of git-internal binary format. A bit of googling turns up the command git ls-files, which according to the manpage is used to "show information about files in the index and working tree."

git ls-files --stage

100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0	numbers.txt

The manpage says the format is:

[<tag> ]<mode> <object> <stage> <file>

<mode> (roughly) corresponds to the Unix file mode; <object> looks like some kind of hash; <stage> = 0 isn't exactly self-explanatory, but we can always revisit that; and finally, <file>, which is self-explanatory.
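To make the field layout concrete, here's a rough sketch of how one line of that output could be pulled apart. IndexEntry and parse_stage_line are names I made up for illustration, and the sketch assumes the optional <tag> is absent and the path isn't quoted.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One line of `git ls-files --stage` output (hypothetical helper type)."""
    mode: str       # e.g. "100644": a regular, non-executable file
    object_id: str  # 40 hex characters
    stage: int      # 0 for an ordinary entry; 1-3 appear during merge conflicts
    path: str


def parse_stage_line(line: str) -> IndexEntry:
    # The three metadata fields are space-separated; a tab precedes the path,
    # so a path containing spaces survives the split.
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, object_id, stage = meta.split(" ")
    return IndexEntry(mode, object_id, int(stage), path)


entry = parse_stage_line(
    "100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0\tnumbers.txt"
)
print(entry)
```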

Pssst. That object hash 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd - does it look familiar?

tree --prune .git/objects

.git/objects
└── 97
    └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd

1 directory, 1 file

Concatenate the directory name with the name of the file inside it, and you get the <object> corresponding to numbers.txt in the index. Now let's see what the object is…
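That mapping from object ID to file path is simple enough to write down. A minimal sketch (loose_object_path is a hypothetical helper name, and it only covers "loose" objects like the one above, not anything under pack/):

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str, git_dir: str = ".git") -> PurePosixPath:
    # The first two hex characters name the directory;
    # the remaining characters name the file inside it.
    return PurePosixPath(git_dir) / "objects" / object_id[:2] / object_id[2:]


print(loose_object_path("97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd"))
```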

3 .git/objects

file .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd

.git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd: VAX COFF executable not stripped

file guesses wrong here: the files in .git/objects are zlib-compressed. No problem, Python has a built-in zlib module. Let's whip up a quick script to decompress and print zlib-compressed bytes.

"""
Inflate zlib-encoded bytes and print as UTF-8
"""

import sys
import textwrap
import zlib
from typing import BinaryIO


def as_hex(buf: bytes) -> str:
    """Render bytes as a lowercase hex string."""
    return buf.hex()


def inflate_and_print(buf: BinaryIO) -> None:
    """
    Decompress with zlib and dump the uncompressed bytes to stdout.

    If we can't decode as UTF-8 then print two representations:

        1. Python's string representation (nice for when a payload contains
           both Unicode and non-Unicode)
        2. Hex
    """
    data = zlib.decompress(buf.read())
    try:
        s = data.decode('utf8')
    except UnicodeDecodeError:
        s = repr(data)
        s += '\n[hex repr]\n' + textwrap.fill(as_hex(data), 16)

    print(s)


# Read from the named file, or from stdin when the argument is "-" or absent
file_arg = sys.argv[1] if len(sys.argv) > 1 else '-'
if file_arg == '-':
    inflate_and_print(sys.stdin.buffer)
else:
    with open(file_arg, 'rb') as f:
        inflate_and_print(f)

With the script installed as zprint, we're ready to decompress and print the object:

zprint .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd

env: can't execute 'python3': No such file or directory

Hey, that's just the full content of numbers.txt, prefixed with some sort of header: blob 36\0 (the null byte isn't rendered in the HTML export, but it's there). Could "36" be the size of the content that follows?

wc -c numbers.txt

36 numbers.txt

36 bytes - probably not a coincidence.

So to recap what happened when we staged our new file:

git add numbers.txt created a new entry in the index
That entry included a filename (numbers.txt) and some sort of ID
That ID corresponds to the name of a file in .git/objects
The file in .git/objects is just a metadata header + the compressed content of numbers.txt

So far, so good. Now let's create a commit.

4 git commit

git commit -m "First commit: Add numbers.txt"
gittree

[master (root-commit) 0b7291f] First commit: Add numbers.txt
 1 file changed, 15 insertions(+)
 create mode 100644 numbers.txt
.git
├── COMMIT_EDITMSG
├── HEAD
├── index
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── 0b
│   │   ├── 4252fee2e097732e264bea210e35be1cb63345
│   │   └── 7291fe750a527d1df16caa5c6828027859512c
│   ├── 97
│   │   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

10 directories, 9 files

Ok, we've got a lot of new stuff to look at:

COMMIT_EDITMSG
.git/refs/heads/master
.git/logs/
2 new objects in .git/objects

Let's take them in order…

cat .git/COMMIT_EDITMSG

First commit: Add numbers.txt

Ok, that's simple enough. Next:

cat .git/refs/heads/master

0b7291fe750a527d1df16caa5c6828027859512c

So the ref master is just a pointer to one of our .git/objects. How about .git/logs?

cat .git/logs/HEAD


1583704643 +0000	commit (initial): First commit: Add numbers.txt

cat .git/logs/refs/heads/master


1583704643 +0000	commit (initial): First commit: Add numbers.txt

The logs for HEAD and master are identical - no surprise there - and each entry appears to hold just a timestamp, a reference to the commit, and the commit message.
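Putting .git/HEAD and .git/refs/heads/master together, resolving HEAD to a commit ID looks like a two-step file lookup. Here's a sketch of that idea; resolve_head is a hypothetical helper, and it assumes refs are stored as plain files like the ones above (I haven't looked at any other way git might store them). The demo builds a throwaway directory rather than touching a real repo.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Return the commit ID that HEAD currently points at."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic ref: follow it to the file that holds the commit ID.
        ref = head[len("ref: "):]
        return (git_dir / ref).read_text().strip()
    # Otherwise HEAD holds a commit ID directly (a "detached" HEAD).
    return head


# Demo on a fake .git directory that mirrors the layout seen above.
with tempfile.TemporaryDirectory() as tmp:
    fake_git = Path(tmp)
    (fake_git / "refs" / "heads").mkdir(parents=True)
    (fake_git / "HEAD").write_text("ref: refs/heads/master\n")
    (fake_git / "refs" / "heads" / "master").write_text(
        "0b7291fe750a527d1df16caa5c6828027859512c\n"
    )
    print(resolve_head(fake_git))
```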

That brings us to the interesting part: our 2 new objects. We already know that HEAD/master is pointing at a file in .git/objects. What's in the file?

sha=$(cat .git/refs/heads/master)
object="${sha:0:2}/${sha:2}"

file .git/objects/${object}

.git/objects/0b/7291fe750a527d1df16caa5c6828027859512c: VAX COFF executable

Ok, more zlib-compression. We can handle that:

zprint .git/objects/${object}

env: can't execute 'python3': No such file or directory

Let's take a closer look at the header. Recall the header from the last object we looked at (the compressed content of numbers.txt):

zprint .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd | head -n1

env: can't execute 'python3': No such file or directory

This time, instead of "blob" we have "commit":

zprint .git/objects/${object} | head -n1

env: can't execute 'python3': No such file or directory

We can now deduce that the header is something like <object type> <content size in bytes>\0<content>, where <object type> is one of "blob" or "commit". Blobs contain compressed file content, and commits contain:

A reference to what looks like our mysterious 3rd object
Two timestamps (identical, in this case)
The commit message
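The header layout deduced above is easy to check in code. A sketch (parse_object is a made-up name) that splits a decompressed object into its type and content and verifies the declared size; the demo uses a hand-built blob rather than a real file from .git/objects.

```python
import zlib
from typing import Tuple


def parse_object(raw: bytes) -> Tuple[str, bytes]:
    """Split a decompressed object into (type, content), checking the size."""
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError(f"header says {size} bytes, found {len(content)}")
    return obj_type, content


# Round trip on a hand-built blob: compress it, then inflate and parse it.
compressed = zlib.compress(b"blob 4\x00red\n")
print(parse_object(zlib.decompress(compressed)))
```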

Finally, let's crack open that final file in .git/objects which is referenced by the commit.

zprint .git/objects/0b/4252fee2e097732e264bea210e35be1cb63345

env: can't execute 'python3': No such file or directory

As usual, the payload has a header of the form <object type> <content size in bytes>\0. This time we've got a new object type: tree. The tree object appears to contain:

A file mode (100644)
A filename (numbers.txt)
A reference to object 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd, which is the blob object corresponding to numbers.txt
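One wrinkle: inside a tree, the object reference is stored as 20 raw bytes rather than 40 hex characters, so the payload isn't valid UTF-8 and zprint has to fall back to its repr/hex output. Here's a sketch of a parser for the tree body (the part after the header). parse_tree is a hypothetical name, and the demo body is built by hand to match the entry described above.

```python
from typing import Iterator, Tuple


def parse_tree(body: bytes) -> Iterator[Tuple[str, str, str]]:
    """Yield (mode, filename, object_id) for each entry in a tree body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        # The object ID is stored as 20 raw bytes, not 40 hex characters.
        object_id = body[nul + 1:nul + 21].hex()
        yield mode, name, object_id
        pos = nul + 21


# Hand-built body equivalent to a tree holding only numbers.txt.
body = b"100644 numbers.txt\x00" + bytes.fromhex(
    "97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd"
)
print(list(parse_tree(body)))
```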

So to recap what we've found:

There are three types of objects: blobs, commits, and trees
Blobs are just compressed file content
Commits point to a tree and store some metadata
Trees consist of entries, each pairing a filename and file mode with an associated blob

Let's add a new file in a new commit and validate our understanding:

echo "red\nblue\ngreen" > colors.txt
git add colors.txt

Now we check the index:

git ls-files --stage

100644 ae981935c385a7575d2e992c626cc72fbf552c90 0	colors.txt
100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0	numbers.txt

We should have 1 new object: ae981935c385a7575d2e992c626cc72fbf552c90. Do we?

tree --prune .git/objects

.git/objects
├── 0b
│   ├── 4252fee2e097732e264bea210e35be1cb63345
│   └── 7291fe750a527d1df16caa5c6828027859512c
├── 97
│   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
└── ae
    └── 981935c385a7575d2e992c626cc72fbf552c90

3 directories, 4 files

Indeed we do. Now let's commit it.

git commit -m "Add colors" -m "Here's a commit message with an actual body, not just a subject line"

[master 66e8fff] Add colors
 1 file changed, 1 insertion(+)
 create mode 100644 colors.txt

git commit should have produced (at least) 2 new objects: a tree and a commit.

tree --prune .git/objects

.git/objects
├── 0b
│   ├── 4252fee2e097732e264bea210e35be1cb63345
│   └── 7291fe750a527d1df16caa5c6828027859512c
├── 66
│   └── e8fff6f795b91081f5d05eade083f18e600fc4
├── 97
│   └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
├── ae
│   └── 981935c385a7575d2e992c626cc72fbf552c90
└── fd
    └── 8cf227b67f57d753468f6f0319a5558d37bd0d

5 directories, 6 files

Here's the commit object:

sha=$(cat .git/refs/heads/master)
object="${sha:0:2}/${sha:2}"

zprint .git/objects/${object}

env: can't execute 'python3': No such file or directory

Notice that the commit has a field we haven't seen before: parent. This looks like a pointer to our previous commit.
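The commit payload is plain text: a run of "key value" lines, a blank line, then the message. A rough sketch of a parser (parse_commit is my own name for it). The tree and parent IDs in the example are the ones from this repo, but the author and committer lines are made up, and the sketch ignores multi-line fields such as signatures.

```python
from typing import Dict, List, Tuple


def parse_commit(body: str) -> Tuple[Dict[str, List[str]], str]:
    """Split a commit body into header fields and the message."""
    head, _, message = body.partition("\n\n")
    fields: Dict[str, List[str]] = {}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        # Keep a list per key, because a merge commit has several parents.
        fields.setdefault(key, []).append(value)
    return fields, message


# Illustrative body: the author/committer lines are invented.
example = (
    "tree fd8cf227b67f57d753468f6f0319a5558d37bd0d\n"
    "parent 0b7291fe750a527d1df16caa5c6828027859512c\n"
    "author A U Thor <author@example.com> 1583704643 +0000\n"
    "committer A U Thor <author@example.com> 1583704643 +0000\n"
    "\n"
    "Add colors\n"
)
fields, message = parse_commit(example)
print(fields["tree"], fields["parent"], message.splitlines()[0])
```

A root commit like our first one simply has no parent key at all.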

And now for our tree:

zprint .git/objects/fd/8cf227b67f57d753468f6f0319a5558d37bd0d

env: can't execute 'python3': No such file or directory

As expected, this tree object has two entries - one for each file.

5 What about hunks?

This is all great, but I still don't know how to incorporate the concept of "hunks" into this mental model. Let's stage a hunk and see what our index object looks like.

mv numbers.txt{,.bak}
awk '{ if (NR == 1) print "Stage me!"; else if (NR == 14) print "But not me!"; else print ; }' numbers.txt.bak > numbers.txt
rm numbers.txt.bak

cat numbers.txt

Stage me!
2
3
4
5
6
7
8
9
10
11
12
13
But not me!
15

Normally I would use Magit or git add -i to interactively stage hunks, but for the sake of reproducibility, I'm going to script it. First, we create a patch from the unstaged changes in our working tree:

git diff | tee /tmp/stageme.patch

diff --git a/numbers.txt b/numbers.txt
index 97b3d1a..8a10488 100644
--- a/numbers.txt
+++ b/numbers.txt
@@ -1,4 +1,4 @@
-1
+Stage me!
 2
 3
 4
@@ -11,5 +11,5 @@
 11
 12
 13
-14
+But not me!
 15

This patch has two hunks. Let's remove the second one:

perl -lne 'if (m/^@@/) { $hunk_count++ }; print if $hunk_count < 2' /tmp/stageme.patch | tee /tmp/onehunk.patch

diff --git a/numbers.txt b/numbers.txt
index 97b3d1a..8a10488 100644
--- a/numbers.txt
+++ b/numbers.txt
@@ -1,4 +1,4 @@
-1
+Stage me!
 2
 3
 4

Now let's stage only the first hunk. Note that we use --cached so that the change is applied to the index but not to our working tree, i.e., the version of numbers.txt that we have on disk.

git apply --cached /tmp/onehunk.patch

Let's check the contents of the index and confirm that only the first hunk is staged:

git --no-pager diff --cached

diff --git a/numbers.txt b/numbers.txt
index 97b3d1a..576880b 100644
--- a/numbers.txt
+++ b/numbers.txt
@@ -1,4 +1,4 @@
-1
+Stage me!
 2
 3
 4

Looks good. Let's also double-check that our working tree includes both hunks.

cat numbers.txt

Stage me!
2
3
4
5
6
7
8
9
10
11
12
13
But not me!
15

Ok, good - the version of numbers.txt in the git index and the version in our working tree now differ. Let's look at the git object corresponding to numbers.txt in the index:

git ls-files --stage | grep numbers.txt

100644 576880b02860bc2fb8ccfd0437884cfbff621cc5 0	numbers.txt

zprint .git/objects/57/6880b02860bc2fb8ccfd0437884cfbff621cc5

env: can't execute 'python3': No such file or directory

Interesting… so the blob object that git apply --cached creates is a standalone object. That is, there's no structural sharing or "delta" that's layered on top of some other object; git has created a new, independent object with the file's contents. Furthermore, the content in this object can't be found anywhere in the working tree - it lives only in .git/objects.
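So where do hunks come from, if they aren't stored? They can be computed on demand from two full snapshots. Here's a sketch using Python's difflib as a stand-in; it isn't git's diff algorithm, but the output has the same unified format. The two lists reproduce the committed numbers.txt and the staged version from the index.

```python
import difflib

committed = [f"{i}\n" for i in range(1, 16)]  # numbers.txt as first committed
staged = ["Stage me!\n"] + committed[1:]      # the staged version: first hunk only

# Nothing here is read from a stored delta: the hunk is derived
# purely by comparing the two complete file contents.
diff = list(difflib.unified_diff(
    committed, staged, fromfile="a/numbers.txt", tofile="b/numbers.txt"
))
print("".join(diff), end="")
```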

6 Conclusion

Git's data model is both simple and elegant. Consider how much is accomplished with so little; all of git's functionality rests on a foundation of just three data structures: blobs, trees, and commits. Nothing that can be computed needs to be materialized or stored, so it isn't.

This exercise got me thinking about Steve Yegge's "Kingdom of Nouns" ³, in which he describes the elevation of Nouns over Verbs in Object Oriented Programming. I can't imagine a citizen of Javaland designing anything that remotely resembles git's data model.

In OOP, you model a problem domain by identifying the important Nouns and turning them into classes. For a version control system, you'd have Commits, sure, and maybe you get Trees, too. But you'd also have Diffs, because diffs are a Very Important Thing, and it's practically inconceivable that an important thing would not be modeled as an object.

And what do you do with important objects? You build them with Factories, you persist them in databases via an ORM, or maybe you serialize them as XML and store them on disk. You could have the insight that Diffs are cheaply computed given Blobs and Trees, so they need not be first-class citizens in your data model; it's just so improbable that this would occur to you if you're thinking in terms of objects and methods.

This quote from Linus Torvalds now makes more sense to me:

In fact, I'm a huge proponent of designing your code around the data, rather
than the other way around, and I think it's one of the reasons git has
been fairly successful (*).

...

(*) I will, in fact, claim that the difference between a bad programmer
and a good one is whether he considers his code or his data structures
more important. Bad programmers worry about the code. Good programmers
worry about data structures and their relationships.

– Linus Torvalds ⁴

I generally avoid engaging in hero-worship, and I'm especially wary of making a hero of anyone who casually divides the world into "bad programmers" and "good programmers." But I do have a renewed sense of respect for both the design and the designer of git.

Footnotes:

¹ https://stackoverflow.com/a/4969679/895769

² https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

³ http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom-of-nouns.html

⁴ https://lwn.net/Articles/193245/