Dissecting .git/
Table of Contents
One of my favorite pieces of software is Magit, the self-described "Git porcelain inside Emacs." A commenter on Hacker News claimed to finally grok git after switching to Magit, so I decided to try it out. I wasn't sold immediately; I had a hard time discovering the available actions and keybindings, so I frequently got stuck and fell back to the git command-line interface. But I slowly grew to like Magit, particularly the ability to commit parts of a file, which I had no idea was possible!
Somewhere along the way I read something in the magit docs about
"hunks", and made the connection that "git add-ing the changed parts
of a file" was the same thing as "staging a hunk." I had a working
understanding of the git index: I knew that git add
would put things
in it, and that git commit
would commit its contents.
So in my mental model, git tracked "hunks", which were just a set of lines in a file that changed between two commits. In other words, a "hunk" was something like "the delta between two revisions of a file." Hunks were stored in the index until they were committed. A commit was just some metadata with a set of hunks and a pointer to the previous commit.
This mental model makes it easy to imagine how git diff
would work:
git diff A B
would just walk the tree of commits from root to HEAD,
incrementally adding hunks until it reached commit B
, at which point
it would print out a pretty representation of the accumulated deltas
between commits A
and B
.
In retrospect, it's a credit to the interface of both git
and
Magit
that I was able to use these tools for years even though I
totally misundestood what was going on under the hood.
I realized that my understanding was off when I stumbled across this StackOverflow answer 1 to the question "Is there a “theirs” version of “git merge -s ours”?" The answer demonstrates how to make branch A look exactly like branch B with a clever sequence of git commands.
The summary is:
git merge -s ours B
to create a "dummy" merge-commitgit branch TEMP
to mark this commit with a temp branchgit reset --hard B
to make the contents of the working tree and index look exactly like Bgit reset --soft TEMP
to put HEAD back to the dummy merge-commitgit commit --amend
to overwrite the dummy merge-commit
This totally blew up my mental model. If it were true that commits
only store incremental changes between files, then this sequence of
steps shouldn't work: discarding all the hunks created by git merge
and overwriting them with hunks from another branch should have
resulted in a jumbled Frankenstein commit full of hunks that didn't
fit together at all. And yet, it worked beautifully.
I spent some time digging around in the .git/
subdirectory and I now
have a better understanding. Here's a quick summary of what I learned.
.git/
is a lot like a key-value store; it's also a lot like a filesystem, where git blobs = Unix inodes and git trees = Unix directory entries.- git does not store hunks, deltas, or diffs; it stores the full content of files (compressed)
- Almost everything git stores is one of three types of objects:
- blob: the compressed content of a file
- tree: each tree node consists of entries; each entry consist of a mode, a type, and a filename, and corresponds to either a blob (i.e., a file) or another tree node (i.e., a directory)
- commit: a pointer to a tree, some metadata associated with provenance (e.g., author, commit message), and a pointer to the previous commit object (the parent)
What follows is a guided tour of my journey into .git/
. When I got
stuck, I consulted Chapter 10 of the Git Book 2.
1 A first look at .git/
Everything that git stores is in the .git/
directory. Let's start by
creating a dummy repo and inspecting its contents.
git init
tree .git
.git ├── HEAD ├── branches ├── config ├── description ├── hooks │ ├── applypatch-msg.sample │ ├── commit-msg.sample │ ├── fsmonitor-watchman.sample │ ├── post-update.sample │ ├── pre-applypatch.sample │ ├── pre-commit.sample │ ├── pre-merge-commit.sample │ ├── pre-push.sample │ ├── pre-rebase.sample │ ├── pre-receive.sample │ ├── prepare-commit-msg.sample │ └── update.sample ├── info │ └── exclude ├── objects │ ├── info │ └── pack └── refs ├── heads └── tags 9 directories, 16 files
We don't really care about git hooks or the repo metadata for now, so let's filter those out:
alias gittree="tree -I 'hooks|description|info|config|branches' .git" gittree
.git ├── HEAD ├── objects │ └── pack └── refs ├── heads └── tags 5 directories, 1 file
So we've got HEAD, which presumably points to the tip of our current branch.
file .git/HEAD cat .git/HEAD
.git/HEAD: ASCII text ref: refs/heads/master
Hey, it's just a plain ol' ASCII file. That's easy. Ok, we need some content for this repo. Let's create a file with a list of numbers.
for i in {1..15}; do echo "$i" >> numbers.txt; done head -n5 numbers.txt
1 2 3 4 5
git status
On branch master No commits yet Untracked files: ..." to include in what will be committed) numbers.txt nothing added to commit but untracked files present (use "git add" to track)
Now let's add our file to the index. We should expect something to change here…
git add numbers.txt gittree
.git ├── HEAD ├── index ├── objects │ ├── 97 │ │ └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd │ └── pack └── refs ├── heads └── tags 6 directories, 3 files
Aha! We've got a new file: index
! Let's crack it open…
2 .git/index
file .git/index
.git/index: Git index, version 2, 1 entries
Ok, this seems to be some sort of git-internal binary format. A bit of
googling turns up the command git ls-files
, which according to the
manpage is used to "show information about files in the index and
working tree."
git ls-files --stage
100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0 numbers.txt
The manpage says the format is:
[<tag> ]<mode> <object> <stage> <file>
<mode>
(roughly) corresponds to the Unix file mode; <object>
looks
like some kind of hash; <stage> = 0
isn't exactly self-explanatory,
but we can always revisit that; and finally, <file>
, which is
self-explanatory.
Pssst. That object hash 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
-
does it look familiar?
tree --prune .git/objects
.git/objects └── 97 └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 1 directory, 1 file
Concatenate the directory name with the name of the file inside it,
and you get the <object>
corresponding to numbers.txt
in the
index. Now let's see what the object is…
3 .git/objects
file .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
.git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd: VAX COFF executable not stripped
Ok, it's zlib-compressed. No problem, Python has a built-in zlib
module. Let's whip up a quick script to decompress and print zlib
compressed bytes.
""" Deflate zlib-encoded bytes and print as UTF-8 """ import zlib import sys import _io import textwrap def as_hex(buf: bytes): return ''.join('{:02x}'.format(b) for b in buf) def deflate_and_print(buf: _io.BufferedReader): """ Decompress with zlib and dump the uncompressed bytes to stdout. If we can't decode as UTF-8 then print two representations: 1. Python's string representation (nice for when a payload contains both Unicode and non-Unicode) 2. Hex """ _bytes = zlib.decompress(buf.read()) try: s = _bytes.decode('utf8') except UnicodeDecodeError: s = repr(_bytes) s += '\n[hex repr]\n' + textwrap.fill(as_hex(_bytes), 16) print(s) file_arg = sys.argv[1] if len(sys.argv) > 1 else '-' if file_arg == '-': deflate_and_print(sys.stdin.buffer) else: with open(file_arg, 'rb') as f: deflate_and_print(f)
Now we're ready to decompress and print it:
zprint .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
env: can't execute 'python3': No such file or directory
Hey, that's just the full content of numbers.txt
, prefixed with some
sort of header: blob 36\0
(the null byte isn't rendered in the HTML
export, but it's there). Could "36" be the size of the content that
follows?
wc -c numbers.txt
36 numbers.txt
36 bytes - probably not a coincidence.
So to recap what happened when we staged our new file:
git add numbers.txt
created a new entry in the index- That entry included a filename (
numbers.txt
) and some sort of ID - That ID corresponds to the name of a file in
.git/objects
- The file in
.git/objects
is just a metadata header + the compressed content ofnumbers.txt
So far, so good. Now let's create a commit.
4 git commit
git commit -m "First commit: Add numbers.txt"
gittree
[master (root-commit) 0b7291f] First commit: Add numbers.txt 1 file changed, 15 insertions(+) create mode 100644 numbers.txt .git ├── COMMIT_EDITMSG ├── HEAD ├── index ├── logs │ ├── HEAD │ └── refs │ └── heads │ └── master ├── objects │ ├── 0b │ │ ├── 4252fee2e097732e264bea210e35be1cb63345 │ │ └── 7291fe750a527d1df16caa5c6828027859512c │ ├── 97 │ │ └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd │ └── pack └── refs ├── heads │ └── master └── tags 10 directories, 9 files
Ok, we've got a lot of new stuff to look at:
COMMIT_EDITMSG
.git/refs/heads/master
.git/logs/
- 2 new objects in
.git/objects
Let's take them in order…
cat .git/COMMIT_EDITMSG
First commit: Add numbers.txt
Ok, that's simple enough. Next:
cat .git/refs/heads/master
0b7291fe750a527d1df16caa5c6828027859512c
So the ref master
is just a pointer to one of our
.git/objects
. How about .git/logs
?
cat .git/logs/HEAD
1583704643 +0000 | commit (initial): First commit: Add numbers.txt |
cat .git/logs/refs/heads/master
1583704643 +0000 | commit (initial): First commit: Add numbers.txt |
HEAD
and master
are identical - no surprise there - and it appears
to just have a timestamp, a reference to the commit, and the commit
message.
That brings us to the interesting part: our 2 new objects. We already
know that HEAD=/=master
is pointing at a file in
.git/objects
. What's in the file?
sha=$(cat .git/refs/heads/master) object="${sha:0:2}/${sha:2}"
file .git/objects/${object}
.git/objects/0b/7291fe750a527d1df16caa5c6828027859512c: VAX COFF executable
Ok, more zlib-compression. We can handle that:
zprint .git/objects/${object}
env: can't execute 'python3': No such file or directory
Let's take a closer look at the header. Recall the header from the
last object we looked at (the compressed content of numbers.txt
):
zprint .git/objects/97/b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd | head -n1
env: can't execute 'python3': No such file or directory
This time, instead of "blob" we have "commit":
zprint .git/objects/${object} | head -n1
env: can't execute 'python3': No such file or directory
We can now deduce that the header is something like <object type>
<content size in bytes>\0<content>
, where <object type>
is one of
"blob" or "commit". Blobs contain compressed file content, and
commits contain:
- A reference to what looks like our mysterious 3rd object
- Two timestamps (identical, in this case)
- The commit message
Finally, let's crack open that final file in .git/objects
which is
referenced by the commit.
zprint .git/objects/0b/4252fee2e097732e264bea210e35be1cb63345
env: can't execute 'python3': No such file or directory
As usual, the payload has a header of the form <object type> <content
size in bytes>\0
. This time we've got a new object type: tree
. The
tree object appears to contain:
- A file mode (
100644
) - A filename (
numbers.txt
) - A reference to object
97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd
, which is the blob object corresponding tonumbers.txt
So to recap what we've found:
- There are three types of objects: blobs, commits, and trees
- Blobs are just compressed file content
- Commits point to a tree and store some metadata
- Trees are comprised of filenames, file modes, and an associated blob
Let's add a new file in a new commit and validate our understanding:
echo "red\nblue\ngreen" > colors.txt git add colors.txt
Now we check the index:
git ls-files --stage
100644 ae981935c385a7575d2e992c626cc72fbf552c90 0 colors.txt 100644 97b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd 0 numbers.txt
We should have 1 new object:
ae981935c385a7575d2e992c626cc72fbf552c90
. Do we?
tree --prune .git/objects
.git/objects ├── 0b │ ├── 4252fee2e097732e264bea210e35be1cb63345 │ └── 7291fe750a527d1df16caa5c6828027859512c ├── 97 │ └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd └── ae └── 981935c385a7575d2e992c626cc72fbf552c90 3 directories, 4 files
Indeed we do. Now let's commit it.
git commit -m "Add colors" -m "Here's a commit message with an actual body, not just a subject line"
[master 66e8fff] Add colors 1 file changed, 1 insertion(+) create mode 100644 colors.txt
git commit
should have produced (at least) 2 new objects: a tree and
a commit.
tree --prune .git/objects
.git/objects ├── 0b │ ├── 4252fee2e097732e264bea210e35be1cb63345 │ └── 7291fe750a527d1df16caa5c6828027859512c ├── 66 │ └── e8fff6f795b91081f5d05eade083f18e600fc4 ├── 97 │ └── b3d1a5707f8a11fa5fa8bc6c3bd7b3965601fd ├── ae │ └── 981935c385a7575d2e992c626cc72fbf552c90 └── fd └── 8cf227b67f57d753468f6f0319a5558d37bd0d 5 directories, 6 files
Here's the commit object:
sha=$(cat .git/refs/heads/master) object="${sha:0:2}/${sha:2}"
zprint .git/objects/${object}
env: can't execute 'python3': No such file or directory
Notice that the commit has a field we haven't seen before:
parent
. This looks like a pointer to our previous commit.
And now for our tree:
zprint .git/objects/fd/8cf227b67f57d753468f6f0319a5558d37bd0d
env: can't execute 'python3': No such file or directory
As expected, this tree object has two entries - one for each file.
5 What about hunks?
This is all great, but I still don't know how to incorporate the concept of "hunks" into this mental model. Let's stage a hunk and see what our index object looks like.
mv numbers.txt{,.bak}
awk '{ if (NR == 1) print "Stage me!"; else if (NR == 14) print "But not me!"; else print ; }' numbers.txt.bak > numbers.txt
rm numbers.txt.bak
cat numbers.txt
Stage me! 2 3 4 5 6 7 8 9 10 11 12 13 But not me! 15
Normally I would use Magit or git add -i
to interactively stage
hunks, but for the sake of reproducibility, I'm going to script
it. First, we'll add our file to the index, then we create a patch:
git diff | tee /tmp/stageme.patch
diff --git a/numbers.txt b/numbers.txt index 97b3d1a..8a10488 100644 --- a/numbers.txt +++ b/numbers.txt @@ -1,4 +1,4 @@ -1 +Stage me! 2 3 4 @@ -11,5 +11,5 @@ 11 12 13 -14 +But not me! 15
This patch has two hunks. Let's remove the second one:
perl -lne 'if (m/^@@/) { $hunk_count++ }; print if $hunk_count < 2' /tmp/stageme.patch | tee /tmp/onehunk.patch
diff --git a/numbers.txt b/numbers.txt index 97b3d1a..8a10488 100644 --- a/numbers.txt +++ b/numbers.txt @@ -1,4 +1,4 @@ -1 +Stage me! 2 3 4
Now let's stage only the first hunk. Note that we use --cached
so
that the change is applied to the index but not to our working tree,
i.e., the version of numbers.txt
that we have on disk.
git apply --cached /tmp/onehunk.patch
Let's check the contents of the index and confirm that only the first hunk is staged:
git --no-pager diff --cached
diff --git a/numbers.txt b/numbers.txt index 97b3d1a..576880b 100644 --- a/numbers.txt +++ b/numbers.txt @@ -1,4 +1,4 @@ -1 +Stage me! 2 3 4
Looks good. Let's also double-check that our working tree includes both hunks.
cat numbers.txt
Stage me! 2 3 4 5 6 7 8 9 10 11 12 13 But not me! 15
Ok, good - the version of numbers.txt
in the git index and the
version in our working tree are completely different. Let's look at
the git object corresponding to numbers.txt
in the index:
git ls-files --stage | grep numbers.txt
100644 576880b02860bc2fb8ccfd0437884cfbff621cc5 0 numbers.txt
zprint .git/objects/57/6880b02860bc2fb8ccfd0437884cfbff621cc5
env: can't execute 'python3': No such file or directory
Interesting… so the blob object that git apply --cached
creates is
a standalone object. That is, there's no structural sharing or "delta"
that's layered on top of some other object; git has created a new,
independent object with the file's contents. Furthermore, the content
in this object can't be found anywhere in the working tree - it lives
only in .git/objects
.
6 Conclusion
Git's data model is both simple and elegant. Consider how much is accomplished with so little; all of git's functionality rests on a foundation of just three data structures: blobs, trees, and commits. Nothing that can be computed needs to be materialized or stored, so it isn't.
This exercise got me thinking about Steve Yegge's "Kingdom of Nouns" 3, in which he describes the elevation of Nouns over Verbs in Object Oriented Programming. I can't imagine a citizen of Javaland designing anything that remotely resembles git's data model.
In OOP, you model a problem domain by identifying the important Nouns and turning them into classes. For a version control system, you'd have Commits, sure, and maybe you get Trees, too. But you'd also have Diffs, because diffs are a Very Important Thing, and it's practically inconceivable that an important thing would not be modeled as an object.
And what do you do with important objects? You build them with Factories, you persist them in databases via an ORM, or maybe you serialize them as XML and store them on disk. You could have the insight that Diffs are cheaply computed given Blobs and Trees, so they need not be first-class citizens in your data model; it's just so improbable that this would occur to you if you're thinking in terms of objects and methods.
This quote from Linus Torvalds now makes more sense to me:
In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful (*). ... (*) I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
– Linus Torvalds 4
I generally avoid engaging in hero-worship, and I'm especially wary of making a hero of anyone who casually divides the world into "bad programmers" and "good programmers." But I do have a renewed sense of respect for both the design and the designer of git.