Understanding Git's Index and Staging Area
This document explains what the Git index is, how its binary format is laid out on disk, and how the three-way snapshot model (HEAD, index, worktree) drives git status and git commit.
The Big Picture: Three Snapshots
At any moment, Git is tracking three versions of your project simultaneously:
HEAD Index (staging area) Working tree
(last commit) (.git/index) (your files)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ tree │ │ entry per │ │ actual │
│ from last │ ───► │ staged file │ ───────► │ files on │
│ commit │ │ + SHA + stat│ │ disk │
└─────────────┘ └─────────────┘ └─────────────┘
↑ ↑
│ │
└── git diff --cached ──┘──── git diff ────────┘
The index is the snapshot you are building toward the next commit. git add copies a file's current content into the object database as a blob and records the blob's SHA in the index. git commit turns the index into a tree object, wraps it in a commit, and advances the branch pointer. The working tree is never committed directly.
This intermediate layer is what lets you build a commit piece by piece: you can stage individual files or hunks (for example with git add -p) while leaving other working-tree changes unstaged.
Seeing the Index in Action
# Stage a file
$ echo "hello" > greeting.txt
$ git add greeting.txt
# Inspect the index
$ git ls-files --stage
100644 ce013625030ba8dba906f756967f9e9ca394464a 0 greeting.txt
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^
# blob SHA of "hello\n" stage path
The blob SHA ce013625030ba8dba906f756967f9e9ca394464a is the same in every Git repository on Earth that has ever staged the string "hello\n". You can verify:
$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a
An empty index produces the well-known empty tree:
$ git write-tree # from an empty index
4b825dc642cb6eb9a060e54bf8d69288fbee4904
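Both IDs above follow from how Git names objects: the SHA-1 of a short header (`<type> <size>\0`) followed by the raw content. The sketch below reproduces them with only `hashlib`. The helper name `git_object_id` is illustrative and not part of gitpy, and the sketch assumes the default SHA-1 object format.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to `content` of the given type."""
    # Header is "<type> <decimal size>" followed by a single NUL byte.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


print(git_object_id("blob", b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
print(git_object_id("tree", b""))         # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because the ID depends only on type, size, and content, the same bytes hash to the same ID in any repository.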
The DIRC Binary Format
The index is stored at .git/index as a binary file. Understanding the layout helps when debugging corruption or building compatible tools.
High-level structure
┌──────────────────────────────────────────────────────────┐
│ 4 bytes "DIRC" (magic signature) │
│ 4 bytes version (big-endian uint32, value = 2) │
│ 4 bytes entry count (big-endian uint32) │
├──────────────────────────────────────────────────────────┤
│ Entry 0 (variable length, multiple of 8 bytes) │
│ Entry 1 │
│ ... │
├──────────────────────────────────────────────────────────┤
│ 20 bytes SHA-1 checksum of everything above │
└──────────────────────────────────────────────────────────┘
The 4-byte magic DIRC stands for "directory cache", an early Git name for the index.
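As a rough illustration of the 12-byte header described above, the following sketch reads the magic, version, and entry count with `struct`. The function name `parse_index_header` is hypothetical and is not gitpy's API.

```python
import struct

# Magic (4 bytes), version (uint32), entry count (uint32), all big-endian.
HEADER = struct.Struct(">4sII")


def parse_index_header(data: bytes) -> tuple[int, int]:
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    if len(data) < HEADER.size:
        raise ValueError("index file too short for a header")
    magic, version, count = HEADER.unpack_from(data, 0)
    if magic != b"DIRC":
        raise ValueError(f"bad magic: {magic!r}")
    return version, count
```

Inside a repository it could be called as `parse_index_header(open(".git/index", "rb").read())`.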
Entry layout
Each entry stores the cached stat(2) metadata alongside the blob SHA and file path:
Offset Size Field
0 4 ctime seconds ─┐
4 4 ctime nanoseconds │ file metadata
8 4 mtime seconds │ for fast
12 4 mtime nanoseconds │ change
16 4 device id │ detection
20 4 inode number │
24 4 file mode │
28 4 uid │
32 4 gid │
36 4 file size ─┘
40 20 SHA-1 (binary, 20 bytes)
60 2 flags (uint16)
62 N path (UTF-8, NUL-terminated)
62+N P NUL padding (so total entry size is multiple of 8)
All multi-byte integers are big-endian.
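The fixed 62-byte prefix in the table above maps directly onto a single `struct` format string. This is a minimal sketch, not gitpy's parser: the names `ENTRY_FIXED`, `FIELD_NAMES`, and `parse_entry_fixed` are made up for illustration, and it assumes a version 2 entry with a UTF-8 path and no extended flags.

```python
import struct

# Ten uint32 stat fields, the 20-byte SHA-1, then the uint16 flags word.
ENTRY_FIXED = struct.Struct(">10I20sH")  # 62 bytes, no alignment padding

FIELD_NAMES = (
    "ctime_s", "ctime_ns", "mtime_s", "mtime_ns",
    "dev", "ino", "mode", "uid", "gid", "size",
)


def parse_entry_fixed(data: bytes, offset: int = 0) -> dict:
    """Decode one entry's fixed fields and its NUL-terminated path."""
    *stat_fields, sha, flags = ENTRY_FIXED.unpack_from(data, offset)
    fields = dict(zip(FIELD_NAMES, stat_fields))
    fields["sha"] = sha.hex()
    fields["flags"] = flags
    # The path starts right after the fixed fields and ends at the first NUL.
    path_start = offset + ENTRY_FIXED.size
    path_end = data.index(b"\0", path_start)
    fields["path"] = data[path_start:path_end].decode("utf-8")
    return fields
```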
The flags field
The 16-bit flags word packs four pieces of information:
Bit 15 assume-valid flag (skip this entry in status checks)
Bit 14 extended flag (set = version 3 extended flags follow)
Bits 12-13 merge stage (0 = normal, 1-3 = conflict stages)
Bits 0-11 name length (min(path_length, 0xFFF))
The stage bits are what make three-way merge conflict tracking possible: a single conflicted file can appear in the index up to three times, each with a different stage number.
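The bit layout above can be unpacked with a few masks and shifts. The helper below is an illustrative sketch; `decode_flags` is not a gitpy function.

```python
def decode_flags(flags: int) -> dict:
    """Split the 16-bit flags word of an index entry into its fields."""
    return {
        "assume_valid": bool(flags & 0x8000),  # bit 15
        "extended": bool(flags & 0x4000),      # bit 14
        "stage": (flags >> 12) & 0x3,          # bits 12-13
        "name_length": flags & 0x0FFF,         # bits 0-11
    }


# 0x200E: stage 2 ("ours") with a 14-byte path such as "conflicted.txt"
print(decode_flags(0x200E))
```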
8-byte padding
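Each entry is padded with NUL bytes so that it occupies a whole multiple of 8 bytes. The path's NUL terminator counts toward the total, so an entry ends with between 1 and 8 NUL bytes. If 62 + len(path) + 1 is already a multiple of 8, the terminator alone is sufficient and no extra padding is added. If 62 + len(path) is itself a multiple of 8, the terminator pushes the entry past the boundary and seven more NUL bytes follow.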
Total bytes = 62 (fixed fields) + len(path_utf8) + 1 (NUL)
Pad to next multiple of 8 if not already aligned.
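The rule works out to a one-line rounding expression. The sketch below (the function name `entry_size` is illustrative) shows three cases: one already aligned, one just past a boundary, and the `greeting.txt` entry from earlier.

```python
def entry_size(path: bytes) -> int:
    """On-disk size of a version 2 index entry for `path`, including padding."""
    unpadded = 62 + len(path) + 1   # fixed fields + path + NUL terminator
    return (unpadded + 7) & ~7      # round up to the next multiple of 8


print(entry_size(b"a"))             # 64: 62 + 1 + 1 is already aligned
print(entry_size(b"ab"))            # 72: 65 rounds up, so 8 NULs end the entry
print(entry_size(b"greeting.txt"))  # 80: 75 rounds up, so 6 NULs end the entry
```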
SHA-1 trailer
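The final 20 bytes of the file are the SHA-1 digest of all preceding bytes. A reader can recompute the digest to detect truncation or corruption, and a mismatch is reported as a corrupt index. Recent versions of Git skip this check on ordinary reads for speed and leave full verification to git fsck.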
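Checking the trailer takes only a few lines. This sketch assumes the whole file has been read into memory and that a real digest is present. `index_checksum_ok` is a hypothetical helper, not gitpy's API.

```python
import hashlib


def index_checksum_ok(data: bytes) -> bool:
    """Return True if the last 20 bytes are the SHA-1 of everything before them."""
    if len(data) < 12 + 20:  # header plus trailer is the minimum valid size
        return False
    body, trailer = data[:-20], data[-20:]
    return hashlib.sha1(body).digest() == trailer
```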
Entries Keyed by (path, stage)
For normal files the stage is always 0. During a merge conflict the index can contain up to three entries for the same path:
| Stage | Meaning |
|---|---|
| 0 | Normal (no conflict) |
| 1 | Common ancestor (merge base) |
| 2 | "Ours" (current branch version) |
| 3 | "Theirs" (incoming branch version) |
$ git ls-files --stage conflicted.txt
100644 aaa... 1 conflicted.txt ← base
100644 bbb... 2 conflicted.txt ← ours
100644 ccc... 3 conflicted.txt ← theirs
$ git status
both modified: conflicted.txt
A conflicted path has no stage-0 entry; the file is "unresolved". Writing a stage-0 entry (via git add after editing) removes the stage 1-3 entries for that path and resolves the conflict.
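Since unresolved paths are exactly those with non-zero stage entries, finding them is a simple grouping pass. The sketch below works on plain `(path, stage)` tuples rather than gitpy's `IndexEntry` objects, and `unresolved_paths` is an illustrative name.

```python
from collections import defaultdict


def unresolved_paths(entries) -> dict[str, list[int]]:
    """Map each conflicted path to the sorted list of stages present for it."""
    stages = defaultdict(set)
    for path, stage in entries:
        if stage != 0:  # stage 0 means "normal, no conflict"
            stages[path].add(stage)
    return {path: sorted(found) for path, found in stages.items()}


entries = [
    ("README.md", 0),
    ("conflicted.txt", 1),
    ("conflicted.txt", 2),
    ("conflicted.txt", 3),
]
print(unresolved_paths(entries))  # {'conflicted.txt': [1, 2, 3]}
```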
Stat Caching: How Status Stays Fast
Git avoids re-hashing every file on every git status call by caching stat(2) results in the index. The check sequence is:
For each index entry:
1. stat() the working-tree file
2. Compare size, mtime, inode, ctime, and mode
3. If all match → "probably unchanged" (skip SHA computation)
4. If any differ → hash the file and compare SHAs
This is why git status is nearly instant on large repos: the expensive SHA computation is only triggered when the filesystem says something changed.
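The check sequence above can be sketched as follows. This is a simplified illustration rather than gitpy's or Git's actual code: `CachedStat`, `blob_id`, and `is_modified` are hypothetical names, and the sketch ignores refinements such as files modified within the same timestamp tick as the index write.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedStat:
    """The stat fields compared on the fast path."""
    size: int
    mtime_ns: int
    ctime_ns: int
    ino: int
    mode: int

    @classmethod
    def from_stat(cls, st: os.stat_result) -> "CachedStat":
        return cls(st.st_size, st.st_mtime_ns, st.st_ctime_ns, st.st_ino, st.st_mode)


def blob_id(content: bytes) -> str:
    """SHA-1 of a Git blob: header 'blob <size>' + NUL, then the content."""
    return hashlib.sha1(f"blob {len(content)}\0".encode("ascii") + content).hexdigest()


def is_modified(path: str, cached: CachedStat, cached_sha: str) -> bool:
    """Return True if the file at `path` differs from the staged blob."""
    if CachedStat.from_stat(os.stat(path)) == cached:
        return False  # stat data matches: assume unchanged, skip hashing
    with open(path, "rb") as fh:
        return blob_id(fh.read()) != cached_sha  # slow path: hash and compare
```

A file that was only touched (new mtime, same bytes) takes the slow path once but is still reported as unmodified.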
Operations: read_tree, write_tree, get_status
Three functions in gitpy/index/operations.py bridge the index and the object database.
write_tree: index → commit tree
write_tree converts the current index into a tree object and returns the root SHA. If the index is empty, it writes the empty tree:
$ git write-tree
4b825dc642cb6eb9a060e54bf8d69288fbee4904 ← always the same empty-tree SHA
The function recurses through the path hierarchy, grouping entries by directory prefix to build nested tree objects bottom-up.
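To make the bottom-up recursion concrete, here is a self-contained sketch that builds tree objects in memory from a mapping of path to `(mode, blob SHA)`. It is not gitpy's `write_tree`: the names `object_id` and `write_tree_sketch` and the in-memory `objects` dict are assumptions made for illustration, and paths are assumed to be UTF-8 with `/` separators.

```python
import hashlib


def object_id(obj_type: str, content: bytes) -> str:
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def write_tree_sketch(entries: dict[str, tuple[int, str]]) -> tuple[str, dict[str, bytes]]:
    """Return (root tree SHA, {tree SHA: raw tree content}) for the given entries."""
    objects: dict[str, bytes] = {}

    def build(level: dict[str, tuple[int, str]]) -> str:
        files: dict[str, tuple[int, str]] = {}
        subdirs: dict[str, dict[str, tuple[int, str]]] = {}
        # Group entries by their first path component.
        for path, value in level.items():
            name, sep, rest = path.partition("/")
            if sep:
                subdirs.setdefault(name, {})[rest] = value
            else:
                files[name] = value

        rows = []  # (sort key, mode, name, 20-byte SHA)
        for name, (mode, sha) in files.items():
            raw = name.encode("utf-8")
            rows.append((raw, f"{mode:o}".encode("ascii"), raw, bytes.fromhex(sha)))
        for name, children in subdirs.items():
            raw = name.encode("utf-8")
            child_sha = build(children)  # subtrees are built before their parent
            # Directories sort as if their name ended in "/".
            rows.append((raw + b"/", b"40000", raw, bytes.fromhex(child_sha)))
        rows.sort(key=lambda row: row[0])

        content = b"".join(mode + b" " + name + b"\0" + sha for _, mode, name, sha in rows)
        tree_sha = object_id("tree", content)
        objects[tree_sha] = content
        return tree_sha

    return build(entries), objects


root, objects = write_tree_sketch({})
print(root)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

A real implementation would write each tree to the object database instead of collecting it in a dict.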
read_tree: tree → index
read_tree is the inverse: it populates the index from an existing tree object. This is what git checkout uses internally before writing files to disk.
$ git read-tree HEAD^{tree}
# Populates index from the HEAD commit's tree
Entries produced by read_tree have all stat fields set to zero because there is no working-tree file to stat yet.
get_status: three-way comparison
get_status computes git status output by comparing three sources:
- HEAD tree: what the last commit recorded
- Index: what is staged
- Working tree: what is on disk
For each path it returns a StatusEntry with two FileStatus values:
| index_status | Meaning |
|---|---|
| ADDED | staged but not in HEAD |
| MODIFIED | staged with different SHA than HEAD |
| DELETED | removed from index but in HEAD |
| UNMODIFIED | same as HEAD |
| worktree_status | Meaning |
|---|---|
| MODIFIED | working-tree file differs from index |
| DELETED | file missing from working tree |
| UNTRACKED | file in working tree, not in index |
| UNMODIFIED | working tree matches index |
$ git status
Changes to be committed: ← index_status != UNMODIFIED
modified: README.md
Changes not staged for commit: ← worktree_status != UNMODIFIED
modified: src/main.py
Untracked files: ← UNTRACKED
scratch.py
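The two status columns can be derived from three path-to-SHA mappings. The sketch below is a simplified stand-in for the comparison, not gitpy's `get_status`: the `Status` enum and `classify` function are hypothetical, and the working tree is represented as already-hashed content rather than files on disk.

```python
from enum import Enum


class Status(Enum):
    UNMODIFIED = "unmodified"
    ADDED = "added"
    MODIFIED = "modified"
    DELETED = "deleted"
    UNTRACKED = "untracked"


def classify(head: dict, index: dict, worktree: dict) -> dict:
    """Return {path: (index_status, worktree_status)} from path -> SHA mappings."""
    result = {}
    for path in sorted(set(head) | set(index) | set(worktree)):
        # HEAD vs index: what is staged for the next commit.
        if path in index:
            if path not in head:
                index_status = Status.ADDED
            elif index[path] != head[path]:
                index_status = Status.MODIFIED
            else:
                index_status = Status.UNMODIFIED
        else:
            index_status = Status.DELETED if path in head else Status.UNMODIFIED

        # Index vs working tree: what is not yet staged.
        if path in index:
            if path not in worktree:
                worktree_status = Status.DELETED
            elif worktree[path] != index[path]:
                worktree_status = Status.MODIFIED
            else:
                worktree_status = Status.UNMODIFIED
        else:
            worktree_status = Status.UNTRACKED if path in worktree else Status.UNMODIFIED

        result[path] = (index_status, worktree_status)
    return result
```

With the sample output above, README.md would classify as `(MODIFIED, UNMODIFIED)`, src/main.py as `(UNMODIFIED, MODIFIED)`, and scratch.py as `(UNMODIFIED, UNTRACKED)`.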
Well-Known SHA Values
These hash values are identical in every Git repository that uses the default SHA-1 object format:
| Content | SHA-1 |
|---|---|
| Empty blob (empty file) | e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 |
| Empty tree (empty index → write-tree) | 4b825dc642cb6eb9a060e54bf8d69288fbee4904 |
| Blob for "hello\n" | ce013625030ba8dba906f756967f9e9ca394464a |
You can verify locally:
$ git hash-object -t blob /dev/null
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
$ git hash-object -t tree /dev/null
4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ printf 'hello\n' | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a
Implementation in gitpy
The staging area lives in gitpy/index/:
gitpy/index/
├── __init__.py
├── entry.py # IndexEntry dataclass
├── index.py # Index, IndexFile
└── operations.py # read_tree, write_tree, get_status, conflict helpers
IndexEntry
IndexEntry is a @dataclass(slots=True) holding every field from the binary format, plus computed properties:
from pathlib import Path

from gitpy.index.entry import IndexEntry
entry = IndexEntry.from_path(
path="src/main.py",
sha="ce013625030ba8dba906f756967f9e9ca394464a",
worktree=Path("/my/repo"),
stage=0,
)
entry.stage # 0
entry.is_executable # False (mode 0o100644)
entry.is_symlink # False
st = Path("/my/repo/src/main.py").stat() # fresh stat of the working-tree file
entry.matches_stat(st) # fast-path change detection
from_path() reads the integer nanosecond timestamps (st_ctime_ns, st_mtime_ns) directly, which avoids the precision loss of the float-valued st_ctime and st_mtime.
Index
Index stores entries in a dict[tuple[str, int], IndexEntry] keyed by (path, stage). Iteration yields entries sorted by (path, stage), which is the order Git uses for entries in the on-disk index.
from gitpy.index.index import Index
idx = Index()
idx.add(entry)
entry = idx.get("src/main.py", stage=0)
idx.remove("src/main.py") # removes all stages
idx.remove("src/main.py", stage=2) # removes only stage 2
len(idx) # number of entries
Serialisation produces a byte string with the DIRC header, all entries (padded to multiples of 8), and the 20-byte SHA-1 trailer. Parsing verifies the checksum before reading any entries.
IndexFile
IndexFile manages the .git/index file with atomic write semantics using an exclusive-create lock file (.git/index.lock):
from pathlib import Path

from gitpy.index.index import IndexFile
idx_file = IndexFile(Path(".git"))
idx = idx_file.read() # returns empty Index if file absent
# ... mutate idx ...
idx_file.write(idx) # atomic rename over .git/index
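The lock-then-rename pattern can be sketched with the standard library alone. This is an assumption about the general technique, not a copy of `IndexFile.write`: the function name `write_atomically` is hypothetical.

```python
import os
from pathlib import Path


def write_atomically(target: Path, data: bytes) -> None:
    """Write `data` to `target` via an exclusive `<target>.lock` file."""
    lock = target.with_name(target.name + ".lock")
    # O_EXCL makes creation fail with FileExistsError if another writer holds the lock.
    fd = os.open(lock, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        # Rename over the target: readers see the old file or the new one, never a mix.
        os.replace(lock, target)
    except BaseException:
        # Clean up our own lock file if writing or renaming failed.
        try:
            os.unlink(lock)
        except FileNotFoundError:
            pass
        raise
```

If the lock already exists, the function raises before touching anything, so a stale lock from a crashed process has to be removed by hand, which matches how Git itself treats .git/index.lock.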
Operations
from gitpy.index.operations import (
read_tree,
write_tree,
get_status,
FileStatus,
StatusEntry,
has_conflicts,
get_conflicts,
add_conflict,
resolve_conflict,
)
# Populate index from a tree SHA
read_tree(index, tree_sha, db)
# Convert index to a tree and return root SHA
root_sha = write_tree(index, db)
# Three-way status
entries = get_status(index, head_tree_sha, worktree, db)
for e in entries:
    print(e.path, e.index_status, e.worktree_status)
# Conflict helpers
if has_conflicts(index):
    for path, stages in get_conflicts(index).items():
        print(path, [s.stage for s in stages])
Key Takeaways
- The index is a snapshot cache: it holds the staged version of every file, plus enough stat(2) metadata to detect changes quickly.
- Binary DIRC format: 4-byte magic, big-endian header, entries padded to 8-byte multiples, SHA-1 trailer.
- Stage bits enable merge conflicts: the same path appears up to three times (stages 1-3) during a conflict; stage 0 is normal.
- write_tree of an empty index is always 4b825dc642cb6eb9a060e54bf8d69288fbee4904.
- get_status does a three-way comparison: HEAD tree vs index (staged changes), index vs working tree (unstaged changes).
What's Next?
- References: How HEAD, branches, and tags point into the commit graph.
- Diff: How gitpy computes line-level differences between blobs using the Myers algorithm.
- Object Model: The blob and tree objects that the index refers to.
- Object Storage: How those objects are compressed and stored on disk.