Understanding Git's Pack Files

This document explains how Git efficiently stores objects using pack files. If you've ever wondered why Git repositories are so compact or what happens during git gc, this guide will reveal Git's impressive compression techniques.

The Big Picture: From Loose to Packed

In our previous exploration of object storage, we learned that Git stores objects as individual compressed files called "loose objects." While this works well for small repositories, it becomes inefficient at scale:

Filesystem overhead: Thousands of small files strain the filesystem
No delta compression: Similar files are stored separately
Network inefficiency: Transferring many files is slow

Pack files solve all of these by combining many objects into a single file with sophisticated delta compression.

Before: .git/objects/
├── 00/a1b2c3...  # loose object
├── 01/d4e5f6...  # loose object
├── 02/g7h8i9...  # loose object
├── ... (thousands more)

After: .git/objects/pack/
├── pack-abc123.idx   # ~200KB index
└── pack-abc123.pack  # ~5MB packed data (was 50MB loose!)

Why Pack Files Matter

Space Savings

Consider a source file that changes slightly between commits. With loose objects:

commit 1: utils.py (10KB) → 10KB stored
commit 2: utils.py (10.1KB) → 10.1KB stored
commit 3: utils.py (10.2KB) → 10.2KB stored
...
100 commits: ~1MB stored

With pack files and delta compression:

commit 1: utils.py (10KB) → 10KB base
commit 2: utils.py delta → ~100 bytes
commit 3: utils.py delta → ~150 bytes
...
100 commits: ~20KB stored!

Delta compression stores only the differences between similar objects, not complete copies.

Network Transfer

When you git clone or git fetch:

Without packs:
- Client: "I need objects a, b, c, d, e..."
- Server: Sends 1000 individual HTTP responses, one per object
- Time: Minutes to hours

With packs:
- Client: "I need these refs"
- Server: Creates single packfile with all needed objects
- Time: Seconds to minutes

Git was designed for the Linux kernel, a project with tens of thousands of files, millions of objects, and a long history. Pack files make this manageable.

When Pack Files Are Created

Automatic Triggers

git gc (garbage collection):

```bash
# Explicit gc
$ git gc

# Auto gc (triggered by git periodically)
$ git gc --auto
```

git push / git fetch:
Objects sent over the network are always packed
The server creates a pack on-the-fly

Automatic thresholds:

# Git auto-packs when:
# - Loose objects exceed gc.auto (default: 6700)
# - Packs exceed gc.autoPackLimit (default: 50)
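The loose-object threshold can be approximated by counting files directly. The following is a minimal sketch under stated assumptions: `count_loose_objects` and `exceeds_gc_auto` are hypothetical helpers, not part of Git or gitpy, and the exact count shown here is simpler than what Git does internally (Git uses a cheaper estimate rather than a full count).

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def count_loose_objects(git_dir: Path) -> int:
    """Count files under objects/xx/, where xx is a two-hex-digit directory."""
    objects_dir = git_dir / "objects"
    if not objects_dir.is_dir():
        return 0
    total = 0
    for entry in objects_dir.iterdir():
        # Skip "pack" and "info"; loose objects live in 00..ff
        if entry.is_dir() and len(entry.name) == 2 and set(entry.name) <= HEX_DIGITS:
            total += sum(1 for child in entry.iterdir() if child.is_file())
    return total


def exceeds_gc_auto(git_dir: Path, gc_auto: int = 6700) -> bool:
    """Compare the exact count with the gc.auto threshold (hypothetical check)."""
    return count_loose_objects(git_dir) > gc_auto
```

Calling `exceeds_gc_auto(Path(".git"))` from a repository root gives a rough idea of whether an automatic gc is likely to be due.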

Manual Packing

# Create new pack from all loose objects
$ git repack -a -d

# Aggressive repacking (better compression, slower)
$ git repack -a -d -f --depth=50 --window=250

Pack File Anatomy

Overall Structure

A pack file (.pack) has three sections:

┌──────────────────────────────────────────────────────────────┐
│                      HEADER (12 bytes)                        │
│   Signature: "PACK"  │  Version: 2  │  Object Count: N       │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│                      OBJECT ENTRIES                           │
│                                                               │
│   Entry 1: [header] [data]                                   │
│   Entry 2: [header] [delta-base-ref] [delta]                 │
│   Entry 3: [header] [data]                                   │
│   ...                                                         │
│   Entry N: [header] [data]                                   │
│                                                               │
├──────────────────────────────────────────────────────────────┤
│                      TRAILER (20 bytes)                       │
│                   SHA-1 of all above content                  │
└──────────────────────────────────────────────────────────────┘

The Header

Bytes 0-3:  "PACK" (magic signature)
Bytes 4-7:  Version number (2 = current standard)
Bytes 8-11: Number of objects in pack (big-endian)

# Reading the header
PACK_SIGNATURE = b"PACK"

def read_pack_header(data: bytes) -> tuple[int, int]:
    """Parse pack header."""
    if data[:4] != PACK_SIGNATURE:
        raise ValueError("Invalid pack signature")

    version = int.from_bytes(data[4:8], "big")
    object_count = int.from_bytes(data[8:12], "big")

    return version, object_count

Object Types in Packs

Packs store six types of entries:

Type	Value	Description
`OBJ_COMMIT`	1	Commit object
`OBJ_TREE`	2	Tree object
`OBJ_BLOB`	3	Blob object
`OBJ_TAG`	4	Annotated tag
`OBJ_OFS_DELTA`	6	Delta with offset reference
`OBJ_REF_DELTA`	7	Delta with SHA reference

Types 1-4 are "undeltified"—stored in full (but still compressed). Types 6-7 are deltas that reference a base object.

Note: Type 5 is reserved and unused.
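In code, the type values from the table map naturally onto an enumeration. This is a small sketch; `PackObjectType` and `is_delta` are illustrative names and are not claimed to match gitpy's own definitions.

```python
from enum import IntEnum


class PackObjectType(IntEnum):
    """Entry types as stored in the 3-bit type field of a pack entry header."""
    COMMIT = 1
    TREE = 2
    BLOB = 3
    TAG = 4
    # 5 is reserved and deliberately has no member
    OFS_DELTA = 6
    REF_DELTA = 7

    @property
    def is_delta(self) -> bool:
        """True for entries that must be resolved against a base object."""
        return self in (PackObjectType.OFS_DELTA, PackObjectType.REF_DELTA)
```

Because type 5 has no member, `PackObjectType(5)` raises `ValueError`, which is a convenient way to reject malformed entries early.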

Object Entry Format

Each entry begins with a variable-length header:

┌───────────────────────────────────────────────────────────┐
│                    First byte                             │
│  ┌───┬───────┬───────────────────────────────────────┐   │
│  │MSB│ type  │         size (bits 0-3)               │   │
│  │(1)│  (3)  │              (4)                      │   │
│  └───┴───────┴───────────────────────────────────────┘   │
├───────────────────────────────────────────────────────────┤
│                 Continuation bytes                        │
│  ┌───┬───────────────────────────────────────────────┐   │
│  │MSB│              size (bits 4-10, 11-17, ...)     │   │
│  │(1)│                      (7)                      │   │
│  └───┴───────────────────────────────────────────────┘   │
│              (repeat while MSB=1)                         │
└───────────────────────────────────────────────────────────┘

The variable-length encoding allows arbitrary sizes without wasting bytes on small objects.

def read_pack_object_header(data: bytes, offset: int) -> tuple[int, int, int]:
    """
    Read pack object header.

    Returns: (object_type, uncompressed_size, bytes_consumed)
    """
    byte = data[offset]
    obj_type = (byte >> 4) & 0x07  # Bits 4-6
    size = byte & 0x0F             # Bits 0-3

    consumed = 1
    shift = 4

    # Continue while MSB is set
    while byte & 0x80:
        byte = data[offset + consumed]
        size |= (byte & 0x7F) << shift
        shift += 7
        consumed += 1

    return obj_type, size, consumed
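Writing the header is the mirror image of reading it. The encoder below is a sketch of the inverse operation; the name `encode_pack_object_header` is an assumption made for illustration.

```python
def encode_pack_object_header(obj_type: int, size: int) -> bytes:
    """Encode (type, size) as a variable-length pack entry header."""
    if not 1 <= obj_type <= 7 or size < 0:
        raise ValueError("invalid object type or size")

    # First byte: 3-bit type plus the low 4 bits of the size
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4

    out = bytearray()
    while size:
        out.append(byte | 0x80)  # MSB set: another size byte follows
        byte = size & 0x7F       # next 7 bits of the size
        size >>= 7
    out.append(byte)             # final byte has MSB clear
    return bytes(out)
```

For a blob (type 3), a size of 15 encodes as the single byte `3f`, while a size of 16 needs two bytes, `b0 01`. Sizes up to 2047 fit in two bytes and sizes up to 262143 fit in three.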

Delta Base References

For delta objects (types 6 and 7), an additional reference follows the header:

OFS_DELTA (type 6): Negative offset to base object in same pack

[header] [variable-length offset] [compressed delta]

REF_DELTA (type 7): SHA-1 of base object (may be in another pack)

[header] [20-byte SHA] [compressed delta]

OFS_DELTA is more common in stored packs (faster to resolve). REF_DELTA is used in network transfers where the base might already exist on the receiver.

Variable-Length Integer Encoding

Git uses clever encodings to minimize bytes:

Size Encoding (in headers)

Size 0-15:     1 byte  (4 bits in first byte)
Size 16-2047:  2 bytes (4 + 7 bits)
Size 2048-262143: 3 bytes (4 + 7 + 7 bits)
...

OFS_DELTA Offset Encoding

This encoding is particularly clever—it handles arbitrary offsets with minimal bytes:

def read_ofs_delta_offset(data: bytes, offset: int) -> tuple[int, int]:
    """
    Read OFS_DELTA negative offset.

    The encoding adds 1 to each continuation byte to eliminate
    ambiguity in the representation.
    """
    byte = data[offset]
    result = byte & 0x7F
    consumed = 1

    while byte & 0x80:
        byte = data[offset + consumed]
        # The +1 is crucial: it makes the encoding unambiguous
        result = ((result + 1) << 7) | (byte & 0x7F)
        consumed += 1

    return result, consumed

Why the +1? Without it, there would be multiple ways to encode the same number. The +1 ensures each value has exactly one encoding.
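A round trip makes the effect of the +1 visible. The sketch below pairs a decoder equivalent to the one above with a matching encoder; both function names are chosen for illustration only.

```python
def encode_ofs_delta_offset(value: int) -> bytes:
    """Encode a non-negative offset in the biased base-128 form used by OFS_DELTA."""
    if value < 0:
        raise ValueError("offset must be non-negative")
    out = [value & 0x7F]          # last byte: MSB clear
    value >>= 7
    while value:
        value -= 1                # undo the +1 the decoder will apply
        out.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(out))   # most significant group comes first


def decode_ofs_delta_offset(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an offset; returns (value, bytes_consumed)."""
    byte = data[pos]
    value = byte & 0x7F
    consumed = 1
    while byte & 0x80:
        byte = data[pos + consumed]
        value = ((value + 1) << 7) | (byte & 0x7F)
        consumed += 1
    return value, consumed
```

One byte covers 0 to 127. With the bias, the two-byte form starts at 128 (`80 00`) rather than overlapping the one-byte range, so two bytes reach 16511 instead of 16383.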

Exploring Pack Files

Command-Line Tools

# List all pack files
$ ls .git/objects/pack/
pack-abc123def456.idx
pack-abc123def456.pack

# Verify pack integrity
$ git verify-pack -v .git/objects/pack/*.pack

# Show only the delta chain length histogram (illustrative output)
$ git verify-pack -s .git/objects/pack/*.pack
non delta: 344 objects
chain length = 1: 512 objects
chain length = 2: 378 objects

# List objects in a pack
$ git verify-pack -v .git/objects/pack/*.pack | head -20
abc123... commit 234  456  12
def456... tree   890  123  468
789abc... blob   45   67   591              # Stored in full; base of the next entry
012def... blob   12   34   658 1 789abc...  # Delta against 789abc, depth 1

Understanding verify-pack Output

SHA          type  size  compressed-size  offset  [depth base-SHA]
abc123def... blob  1000  350             12
456789abc... blob  50    30              400     1 abc123def...

size: Uncompressed size of the entry as stored (for a delta, the size of the delta data rather than of the reconstructed object)
compressed-size: Bytes the entry occupies in the pack file
offset: Position in pack file
depth: Delta chain depth (1 = direct delta, 2 = delta of delta, etc.)
base-SHA: For deltas, the base object

Reading Pack Files with Python

import zlib
from pathlib import Path

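def explore_pack(pack_path: Path, limit: int = 5) -> None:
    """Quick pack file exploration: print the header and the first few entries."""
    data = pack_path.read_bytes()

    # Check header
    if data[:4] != b"PACK":
        raise ValueError("Not a pack file")
    version = int.from_bytes(data[4:8], "big")
    count = int.from_bytes(data[8:12], "big")

    print(f"Pack version: {version}")
    print(f"Object count: {count}")

    # Walk the first few entries (uses the header readers defined earlier)
    offset = 12
    for i in range(min(limit, count)):
        obj_type, size, header_len = read_pack_object_header(data, offset)
        print(f"Object {i}: type={obj_type}, size={size}, offset={offset}")
        offset += header_len

        # Deltas carry a base reference between the header and the zlib stream
        if obj_type == 6:    # OFS_DELTA: variable-length negative offset
            _, ref_len = read_ofs_delta_offset(data, offset)
            offset += ref_len
        elif obj_type == 7:  # REF_DELTA: 20-byte base SHA-1
            offset += 20

        # The compressed length is not stored, so inflate to find where it ends.
        # Slicing copies the rest of the file: fine for a demo, slow for big packs.
        remaining = data[offset:]
        inflater = zlib.decompressobj()
        inflater.decompress(remaining)
        offset += len(remaining) - len(inflater.unused_data)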

Delta Compression Deep Dive

Delta compression is the key to Git's storage efficiency. Let's understand how it works.

The Basic Idea

Instead of storing complete copies of similar content:

Version 1: "Hello, World!\nThis is line 2.\nThis is line 3.\n"
Version 2: "Hello, Git World!\nThis is line 2.\nThis is line 3.\n"

Store the first version, then instructions to transform it:

Base: "Hello, World!\nThis is line 2.\nThis is line 3.\n"
Delta: "Copy bytes 0-6, Insert 'Git ', Copy bytes 7-end"
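The transformation above can be tried out before looking at the binary format. This sketch uses a readable, made-up instruction representation (tuples tagged `"copy"` or `"insert"`) rather than Git's encoding; `apply_ops` is a hypothetical helper.

```python
VERSION_1 = b"Hello, World!\nThis is line 2.\nThis is line 3.\n"
VERSION_2 = b"Hello, Git World!\nThis is line 2.\nThis is line 3.\n"


def apply_ops(base: bytes, ops: list[tuple]) -> bytes:
    """Rebuild a target from a base using ("copy", start, length) and ("insert", data)."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start : start + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown instruction: {op[0]!r}")
    return bytes(out)


ops = [
    ("copy", 0, 7),                     # b"Hello, "
    ("insert", b"Git "),                # the only new bytes
    ("copy", 7, len(VERSION_1) - 7),    # everything from b"World!" onward
]
rebuilt = apply_ops(VERSION_1, ops)     # equals VERSION_2
```

Only the four inserted bytes are new data; everything else is described by two short copy instructions.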

Delta Instruction Format

Deltas consist of two instruction types:

INSERT: Add literal bytes

┌─────────────────────────────────────────┐
│ 0xxxxxxx │ literal data (x bytes)       │
└─────────────────────────────────────────┘

First bit is 0
Remaining 7 bits = length (1-127)
Followed by that many literal bytes

COPY: Copy from base object

┌─────────────────────────────────────────┐
│ 1sssoooo │ offset bytes │ size bytes    │
└─────────────────────────────────────────┘

First bit (bit 7) is 1
Bits 4-6 (sss) indicate which of the 3 size bytes are present
Bits 0-3 (oooo) indicate which of the 4 offset bytes are present
Offset and size are little-endian; absent bytes are zero, and a size of 0 means 0x10000

Delta Structure

┌──────────────────────────────────────────┐
│     Source (base) size - varint          │
├──────────────────────────────────────────┤
│     Target (result) size - varint        │
├──────────────────────────────────────────┤
│     Delta instructions                    │
│     [COPY] [INSERT] [COPY] ...           │
└──────────────────────────────────────────┘
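To see the whole structure at the byte level, the sketch below hand-assembles a delta for the "Hello, World!" example. It assumes the layout given in Git's pack format documentation: in a COPY command byte, bits 0-3 flag which offset bytes follow and bits 4-6 flag which size bytes follow. The helper names (`encode_varint`, `encode_copy`, `encode_insert`) are illustrative, not gitpy's API.

```python
def encode_varint(value: int) -> bytes:
    """Little-endian base-128 integer, as used for the two size fields."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_copy(offset: int, size: int) -> bytes:
    """COPY instruction: only the non-zero offset/size bytes are emitted."""
    if not (0 <= offset < 2**32 and 0 < size < 2**24):
        raise ValueError("offset or size out of range")
    cmd = 0x80
    payload = bytearray()
    for i in range(4):                       # offset bytes -> bits 0-3
        byte = (offset >> (8 * i)) & 0xFF
        if byte:
            cmd |= 1 << i
            payload.append(byte)
    for i in range(3):                       # size bytes -> bits 4-6
        byte = (size >> (8 * i)) & 0xFF
        if byte:
            cmd |= 0x10 << i
            payload.append(byte)
    return bytes([cmd]) + bytes(payload)


def encode_insert(data: bytes) -> bytes:
    """INSERT instruction: length byte (1-127) followed by the literal bytes."""
    if not 1 <= len(data) <= 127:
        raise ValueError("insert length must be 1-127")
    return bytes([len(data)]) + data


base = b"Hello, World!\nThis is line 2.\nThis is line 3.\n"
target = b"Hello, Git World!\nThis is line 2.\nThis is line 3.\n"
delta = (
    encode_varint(len(base)) + encode_varint(len(target))
    + encode_copy(0, 7) + encode_insert(b"Git ") + encode_copy(7, len(base) - 7)
)
```

The resulting delta is 12 bytes long (`2e 32 90 07 04 47 69 74 20 91 07 27`) and describes a 50-byte target.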

Example: Applying a Delta

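def read_varint(data: bytes, offset: int) -> tuple[int, int]:
    """Read a little-endian base-128 size; returns (value, bytes_consumed)."""
    value = shift = consumed = 0
    while True:
        byte = data[offset + consumed]
        value |= (byte & 0x7F) << shift
        shift += 7
        consumed += 1
        if not byte & 0x80:
            return value, consumed


def apply_delta(base: bytes, delta_data: bytes) -> bytes:
    """Reconstruct target from base + delta."""
    # Read expected sizes (for verification)
    source_size, offset = read_varint(delta_data, 0)
    target_size, consumed = read_varint(delta_data, offset)
    offset += consumed

    if len(base) != source_size:
        raise ValueError("Base size mismatch")

    result = bytearray()

    while offset < len(delta_data):
        cmd = delta_data[offset]
        offset += 1

        if cmd & 0x80:
            # COPY instruction: bits 0-3 flag offset bytes, bits 4-6 flag size bytes
            copy_offset = copy_size = 0
            for bit in range(4):
                if cmd & (1 << bit):
                    copy_offset |= delta_data[offset] << (8 * bit)
                    offset += 1
            for bit in range(3):
                if cmd & (0x10 << bit):
                    copy_size |= delta_data[offset] << (8 * bit)
                    offset += 1
            if copy_size == 0:
                copy_size = 0x10000  # A size of zero encodes 64 KiB
            result.extend(base[copy_offset : copy_offset + copy_size])

        elif cmd > 0:
            # INSERT instruction: cmd is the number of literal bytes that follow
            result.extend(delta_data[offset : offset + cmd])
            offset += cmd

        else:
            raise ValueError("Invalid delta instruction: 0x00")

    if len(result) != target_size:
        raise ValueError("Target size mismatch")
    return bytes(result)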

Delta Chains

Deltas can be chained: Object A is a delta of B, which is a delta of C, which is stored in full.

Object A (newest) → delta of B
                     ↓
Object B          → delta of C
                     ↓
Object C (oldest) → stored in full

To read Object A:
1. Read C (decompress)
2. Apply delta to get B
3. Apply delta to get A

Git limits chain depth (default: 50) to balance compression vs. read speed.
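Resolving a chain is easiest to write iteratively: walk down to the full object, then apply the deltas back up. The sketch below is an illustration only. The `entries` mapping, the `resolve_chain` name, and the pluggable `apply` callback are assumptions, and the `max_depth` guard is a safety check in this example rather than a description of how Git enforces its limit.

```python
from typing import Callable, Optional

# name -> (base name or None, payload); payload is full content when base is None
Entries = dict[str, tuple[Optional[str], bytes]]


def resolve_chain(
    entries: Entries,
    name: str,
    apply: Callable[[bytes, bytes], bytes],
    max_depth: int = 50,
) -> bytes:
    """Reconstruct an object by walking its delta chain down to a full object."""
    deltas: list[bytes] = []
    base, payload = entries[name]
    while base is not None:
        deltas.append(payload)
        if len(deltas) > max_depth:
            raise ValueError("delta chain too deep")
        base, payload = entries[base]

    content = payload                 # the full object at the bottom of the chain
    for delta in reversed(deltas):    # apply the oldest delta first
        content = apply(content, delta)
    return content


# Toy delta format for demonstration: a "delta" is just a suffix to append
toy_entries: Entries = {
    "C": (None, b"v1"),
    "B": ("C", b"+v2"),
    "A": ("B", b"+v3"),
}
resolved = resolve_chain(toy_entries, "A", lambda base, delta: base + delta)
```

An iterative walk avoids deep recursion, and the cost of a read grows with the number of deltas that have to be applied.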

The Pack Index (.idx)

Pack files alone would require scanning the entire file to find an object. The index makes lookup fast: a fanout table narrows the range, and a binary search over sorted SHAs finds the offset in O(log n).

Index Structure (Version 2)

┌──────────────────────────────────────────────────────────────┐
│                    HEADER (8 bytes)                          │
│           Magic: 0xff744f63  │  Version: 2                   │
├──────────────────────────────────────────────────────────────┤
│                   FANOUT TABLE (1024 bytes)                  │
│                 256 entries × 4 bytes each                   │
│                                                              │
│  fanout[0x00] = count of objects with SHA starting ≤ 0x00   │
│  fanout[0x01] = count of objects with SHA starting ≤ 0x01   │
│  ...                                                         │
│  fanout[0xff] = total object count                          │
├──────────────────────────────────────────────────────────────┤
│                    SHA TABLE                                 │
│              N × 20-byte SHA-1 hashes                        │
│                   (sorted order)                             │
├──────────────────────────────────────────────────────────────┤
│                    CRC32 TABLE                               │
│              N × 4-byte CRC32 values                         │
│           (of compressed data in pack)                       │
├──────────────────────────────────────────────────────────────┤
│                   OFFSET TABLE                               │
│              N × 4-byte pack offsets                         │
│         (high bit set → use large offset table)              │
├──────────────────────────────────────────────────────────────┤
│               LARGE OFFSET TABLE (optional)                  │
│              8-byte offsets for packs > 2GB                  │
├──────────────────────────────────────────────────────────────┤
│                  PACK SHA (20 bytes)                         │
│                  INDEX SHA (20 bytes)                        │
└──────────────────────────────────────────────────────────────┘
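Since every table has a fixed entry size, the position of each section follows from the object count in the last fanout entry. The sketch below computes those positions for a version 2 index; `read_idx_layout` is a hypothetical helper and assumes 20-byte SHA-1 hashes.

```python
import struct

IDX_MAGIC = b"\xfftOc"  # 0xff744f63


def read_idx_layout(data: bytes) -> dict[str, int]:
    """Return the object count and the byte offset of each table in a v2 index."""
    magic, version = struct.unpack_from(">4sI", data, 0)
    if magic != IDX_MAGIC or version != 2:
        raise ValueError("not a version 2 pack index")

    fanout = struct.unpack_from(">256I", data, 8)
    count = fanout[255]                       # last entry = total object count

    sha_table = 8 + 256 * 4                   # header + fanout
    crc_table = sha_table + 20 * count
    offset_table = crc_table + 4 * count
    large_offset_table = offset_table + 4 * count
    return {
        "count": count,
        "sha_table": sha_table,
        "crc_table": crc_table,
        "offset_table": offset_table,
        "large_offset_table": large_offset_table,
    }
```

For example, an index describing three objects has its SHA table at byte 1032, its CRC32 table at 1092, and its offset table at 1104.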

The Fanout Table Trick

The fanout table enables fast binary search:

def find_object(index: PackIndex, sha: str) -> int | None:
    """Find object offset in pack using index."""
    sha_bytes = bytes.fromhex(sha)
    first_byte = sha_bytes[0]

    # Use fanout to narrow search range
    if first_byte == 0:
        start = 0
    else:
        start = index.fanout[first_byte - 1]
    end = index.fanout[first_byte]

    # Binary search within the range
    while start < end:
        mid = (start + end) // 2
        mid_sha = index.shas[mid]

        if mid_sha == sha_bytes:
            return index.offsets[mid]
        elif mid_sha < sha_bytes:
            start = mid + 1
        else:
            end = mid

    return None  # Not found

For an object starting with 0xab:
1. Look up fanout[0xaa] and fanout[0xab]
2. Binary search only between those indices
3. Typically reduces search space by 256x

Large Pack Support

For packs larger than 2GB, 4-byte offsets aren't enough. The index handles this elegantly:

If offset[i] has high bit set:
    actual_offset = large_offsets[offset[i] & 0x7fffffff]

This allows seamless support for packs up to 16 exabytes while keeping the common case (< 2GB) efficient.
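The same rule as runnable Python, as a sketch: `resolve_offset` is an illustrative name, and the two lists stand in for the already-parsed 4-byte and 8-byte tables.

```python
def resolve_offset(offsets: list[int], large_offsets: list[int], i: int) -> int:
    """Return the pack offset for index entry i, following the large-offset table."""
    entry = offsets[i]
    if entry & 0x80000000:
        # High bit set: the low 31 bits index into the 8-byte table
        return large_offsets[entry & 0x7FFFFFFF]
    return entry


offsets = [12, 0x80000000, 0x80000001]
large_offsets = [2**31 + 5, 2**40]
```

Here entry 0 resolves directly to 12, while entries 1 and 2 are redirected to the first and second large offsets.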

Creating Pack Files

The Packing Algorithm

Collect objects: Gather all objects to pack
Sort: Group similar objects together (by type, path hints, size)
Find delta bases: For each object, find good candidates for deltification
Compute deltas: Create deltas that provide good compression
Write pack: Output header, objects (deltified or not), trailer
Write index: Create .idx file for fast lookup
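The write step alone fits in a few lines if sorting and deltification are skipped. The example below is a minimal sketch, not gitpy's writer: `write_simple_pack` and `encode_entry_header` are hypothetical names, every object is stored undeltified, and the trailer assumes SHA-1.

```python
import hashlib
import struct
import zlib


def encode_entry_header(obj_type: int, size: int) -> bytes:
    """Type in bits 4-6 of the first byte, size in 4 + 7 + 7 + ... bit groups."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # MSB set: more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


def write_simple_pack(objects: list[tuple[int, bytes]]) -> bytes:
    """Build a version 2 pack from (type, content) pairs, with no deltas."""
    out = bytearray(struct.pack(">4sII", b"PACK", 2, len(objects)))
    for obj_type, content in objects:
        out += encode_entry_header(obj_type, len(content))
        out += zlib.compress(content)
    out += hashlib.sha1(out).digest()  # trailer: SHA-1 of everything before it
    return bytes(out)


pack = write_simple_pack([(3, b"hello\n")])  # 3 = OBJ_BLOB
```

A pack built this way has no compression benefit beyond zlib, but it shows the framing that the delta-aware writer has to produce as well.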

Object Selection for Delta Bases

Git uses heuristics to find good delta bases:

def find_delta_base(obj, candidates: list) -> GitObject | None:
    """Find best delta base for object."""
    best_delta = None
    best_size = len(obj.data)

    for candidate in candidates:
        # Must be same type
        if candidate.type_name != obj.type_name:
            continue

        # Must be similar size (within factor of 10)
        if len(candidate.data) * 10 < len(obj.data):
            continue
        if len(candidate.data) > len(obj.data) * 10:
            continue

        # Compute delta
        delta = create_delta(candidate.data, obj.data)

        if len(delta) < best_size:
            best_delta = candidate
            best_size = len(delta)

    # Only use delta if it saves significant space
    if best_delta and best_size < len(obj.data) * 0.9:
        return best_delta
    return None

The Sliding Window

Git uses a "sliding window" approach:

def pack_objects(objects: list[GitObject], window_size: int = 10):
    """Pack objects with delta compression."""
    # Sort by type, then by filename hint, then by size
    sorted_objects = sort_for_packing(objects)

    window: list[GitObject] = []

    for obj in sorted_objects:
        # Try to deltify against objects in window
        base = find_delta_base(obj, window)

        if base:
            write_delta_object(obj, base)
        else:
            write_full_object(obj)

        # Update window
        window.append(obj)
        if len(window) > window_size:
            window.pop(0)

Objects are sorted so that similar objects (same file across versions) are adjacent, maximizing delta opportunities.

Pack File Verification

Git provides robust verification:

Integrity Checks

# Verify pack file integrity
$ git verify-pack .git/objects/pack/*.pack

# Verbose output shows any issues
$ git verify-pack -v .git/objects/pack/*.pack 2>&1 | grep -i error

What's Verified

Pack checksum: SHA-1 of pack file content matches trailer
Index checksum: SHA-1 of index content matches trailer
Object checksums: Each object's SHA-1 matches after reconstruction
Delta chain validity: All delta bases exist and resolve correctly
CRC32 values: Compressed data hasn't been corrupted
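The first of these checks is simple enough to reproduce by hand. The sketch below assumes a SHA-1 repository (20-byte trailer); `pack_checksum_ok` is an illustrative helper, not a gitpy or Git function, and it covers only the pack checksum, none of the per-object checks.

```python
import hashlib


def pack_checksum_ok(pack: bytes) -> bool:
    """Check that the 20-byte trailer is the SHA-1 of everything before it."""
    if len(pack) < 12 + 20 or pack[:4] != b"PACK":
        return False
    return hashlib.sha1(pack[:-20]).digest() == pack[-20:]


# A valid empty pack: header with zero objects, followed by its checksum
header = b"PACK" + (2).to_bytes(4, "big") + (0).to_bytes(4, "big")
empty_pack = header + hashlib.sha1(header).digest()
```

Changing any byte of the header or entries invalidates the trailer, so truncation and bit rot in the pack body are caught by this check alone.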

Self-Healing

If verification fails, Git does not repair the pack itself, but the objects can usually be restored from another copy of the repository:

# Remove corrupted pack and re-fetch
$ rm .git/objects/pack/pack-corrupted.*
$ git fetch --all

# Or re-clone
$ git clone --mirror <url>

Network Transfer: Thin Packs

During git fetch/git push, Git uses "thin packs"—packs that reference objects the receiver already has.

Normal Pack vs Thin Pack

Normal pack: All delta bases are included in the pack

Object A (delta of B) ← B is in this pack
Object B (full)

Thin pack: Delta bases may be external

Object A (delta of B) ← B is NOT in this pack
                        (receiver already has B)

Pack Protocol

Client: "I have commits X, Y, Z. Send me everything else."
Server: Creates thin pack with just the new objects
        Uses client's existing objects as delta bases
Client: Receives pack, "thickens" it by resolving deltas

This minimizes network transfer—why send objects the client already has?

Performance Characteristics

Lookup Performance

Operation	Complexity	Notes
Find by SHA (with index)	O(log n)	Binary search in fanout-limited range
Read undeltified object	O(1)	Direct seek + decompress
Read deltified object	O(d)	d = delta chain depth
Scan all objects	O(n)	Linear traversal

Space Efficiency

Scenario	Loose Objects	Packed
Linux kernel (1M objects)	~2GB	~200MB
Typical project (10K objects)	~100MB	~10MB
Similar files (100 versions)	100 × size	~1 × size

Tuning Parameters

# Delta chain depth (default: 50)
$ git config pack.depth 50

# Delta window size (default: 10)
$ git config pack.window 10

# Maximum pack size (default: unlimited)
$ git config pack.packSizeLimit 2g

# Aggressive packing
$ git repack -a -d --depth=250 --window=250

Implementation in gitpy

Module Structure

gitpy/storage/
├── __init__.py       # Module exports
├── compression.py    # Zlib utilities
├── loose.py          # Loose object store
├── database.py       # Object database (updated for packs)
├── delta.py          # Delta encoding/decoding
├── pack.py           # Pack file reader
├── pack_index.py     # Pack index
└── pack_writer.py    # Pack file writer

Key Classes

PackFile: Reads objects from pack files

class PackFile:
    def __init__(self, pack_path: Path): ...
    def read_object(self, sha: str) -> PackObject | None: ...
    def __contains__(self, sha: str) -> bool: ...
    def __iter__(self) -> Iterator[PackObject]: ...

PackIndex: Fast SHA → offset lookup

class PackIndex:
    def get_offset(self, sha: str) -> int | None: ...
    def __contains__(self, sha: str) -> bool: ...

    @classmethod
    def from_file(cls, path: Path) -> PackIndex: ...
    def serialize(self) -> bytes: ...

Delta functions: Encode and decode deltas

def parse_delta(data: bytes) -> tuple[int, int, list[DeltaOp]]: ...
def apply_delta(base: bytes, delta_ops: list[DeltaOp]) -> bytes: ...
def create_delta(source: bytes, target: bytes) -> bytes: ...

Integration with ObjectDatabase

class ObjectDatabase:
    def __init__(self, git_dir: Path):
        self.loose = LooseObjectStore(git_dir)
        self._pack_files: list[PackFile] = []
        self._load_packs()

    def read_raw(self, sha: str) -> bytes:
        # Try loose first (faster for recent objects)
        if self.loose.exists(sha):
            return self.loose.read(sha)

        # Search pack files
        for pack in self._pack_files:
            obj = pack.read_object(sha)
            if obj:
                # Reconstruct with header
                header = f"{obj.type_name} {len(obj.data)}\0".encode()
                return header + obj.data

        raise FileNotFoundError(f"Object not found: {sha}")

Git Compatibility

Reference Hashes

These hashes must match real Git:

Object	SHA-1
Empty blob	`e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`
Empty tree	`4b825dc642cb6eb9a060e54bf8d69288fbee4904`
`hello\n` blob	`ce013625030ba8dba906f756967f9e9ca394464a`
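These values follow from how Git names an object: the SHA-1 of a `<type> <size>\0` header followed by the content. The check below is a short sketch using only `hashlib`; `git_object_id` is an illustrative name rather than a gitpy function.

```python
import hashlib


def git_object_id(type_name: str, data: bytes) -> str:
    """Hash an object the way Git does: header, NUL byte, then the raw content."""
    header = f"{type_name} {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


empty_blob = git_object_id("blob", b"")
empty_tree = git_object_id("tree", b"")
hello_blob = git_object_id("blob", b"hello\n")
```

Because the ID depends only on type and content, an object read from a pack must hash to the same value as the same object stored loose.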

Interoperability Tests

# Create pack with Git, read with gitpy
$ git gc
$ python3 -c "
from pathlib import Path
from gitpy import Repository
repo = Repository(Path('.'))
blob = repo.objects.read_blob('ce013625')  # From pack
print(blob.data)  # b'hello\n'
"

# Create pack with gitpy, verify with Git
$ python3 -c "
from pathlib import Path
from gitpy import Repository
repo = Repository(Path('.'))
repo.objects.repack()
"
$ git verify-pack -v .git/objects/pack/*.pack  # Should succeed

Key Takeaways

Pack files combine many objects into single files for efficiency
Delta compression stores differences instead of full copies
The pack index enables O(log n) lookup by SHA
Variable-length encoding minimizes overhead
Thin packs optimize network transfer
Delta chains are limited to balance compression vs. read speed
All formats are Git-compatible for interoperability

Pack files transform Git from a simple content store into an incredibly efficient version control system. Understanding them reveals how Git can handle massive repositories like the Linux kernel while keeping operations fast.

What's Next?

With pack files understood, explore related topics:

Delta Compression: Deep dive into delta algorithms
Pack Index: Index format details
References (gitpy/refs/): How branches and tags point to commits
Garbage Collection: How Git cleans up unreachable objects