Understanding Git's Pack Index

This document explains how Git enables fast object lookup in pack files through index files. If you've ever wondered how Git finds objects instantly among millions, this guide reveals the elegant data structure that makes it possible.

The Problem: Finding Objects in Packs

Pack files combine thousands or millions of objects into a single file. Without an index, finding an object would require scanning the entire pack:

Without index:
  Find object abc123...
  → Scan from start: 1MB... 10MB... 100MB... found at 150MB!
  → Time: O(pack size) = seconds to minutes

With index:
  Find object abc123...
  → Binary search in index: abc123 → offset 157,286,400
  → Seek directly to offset
  → Time: O(log n) = milliseconds

The pack index (.idx file) provides O(1) amortized lookup time regardless of pack size.

Index File Location

For every .pack file, there's a corresponding .idx file:

.git/objects/pack/
├── pack-abc123def456.pack   # Object data (50MB)
├── pack-abc123def456.idx    # Index (~2MB)
├── pack-789abc012def.pack
└── pack-789abc012def.idx

The filenames match—same SHA identifies both files.

Index Format Evolution

Git has used two index formats:

Version	Introduced	Features
1	Original	Simple but limited to 2GB packs
2	Git 1.6+	Current standard, supports >2GB packs

Version 2 is the standard. Version 1 is obsolete but still parseable.

Version 2 Index Structure

The index has a carefully designed layout:

┌────────────────────────────────────────────────────────────────┐
│                        HEADER (8 bytes)                        │
│     Magic: 0xff 0x74 0x4f 0x63  │  Version: 0x00000002        │
├────────────────────────────────────────────────────────────────┤
│                     FANOUT TABLE (1024 bytes)                  │
│                    256 × 4-byte entries                        │
│                                                                 │
│  fanout[0x00] = count of objects with first SHA byte ≤ 0x00   │
│  fanout[0x01] = count of objects with first SHA byte ≤ 0x01   │
│  ...                                                           │
│  fanout[0xff] = total object count in pack                    │
├────────────────────────────────────────────────────────────────┤
│                      SHA TABLE (N × 20 bytes)                  │
│                     Objects sorted by SHA-1                    │
├────────────────────────────────────────────────────────────────┤
│                     CRC32 TABLE (N × 4 bytes)                  │
│               CRC32 of each object's packed data               │
├────────────────────────────────────────────────────────────────┤
│                    OFFSET TABLE (N × 4 bytes)                  │
│                  Pack file offset for each object              │
│            (high bit set → use large offset table)             │
├────────────────────────────────────────────────────────────────┤
│               LARGE OFFSET TABLE (variable)                    │
│            8-byte offsets for objects beyond 2GB               │
│                  (only if pack > 2GB)                          │
├────────────────────────────────────────────────────────────────┤
│                    PACK SHA (20 bytes)                         │
│               SHA-1 of the pack file content                   │
├────────────────────────────────────────────────────────────────┤
│                   INDEX SHA (20 bytes)                         │
│                  SHA-1 of all above content                    │
└────────────────────────────────────────────────────────────────┘

The Fanout Table: Fast Search Narrowing

The fanout table is the key innovation. It's a 256-entry array where:

fanout[i] = number of objects with first SHA byte ≤ i

This creates a cumulative histogram:

Example with 1000 objects:
fanout[0x00] = 3     # 3 objects start with 00
fanout[0x01] = 7     # 7 objects start with ≤ 01
fanout[0x02] = 12    # 12 objects start with ≤ 02
...
fanout[0x80] = 512   # ~half start with ≤ 80 (expected)
...
fanout[0xfe] = 996   # 996 objects start with ≤ fe
fanout[0xff] = 1000  # 1000 total objects

Using the Fanout Table

To find objects starting with byte 0xab:

def get_search_range(fanout: list[int], first_byte: int) -> tuple[int, int]:
    """Get range of indices for objects with given first byte."""
    if first_byte == 0:
        start = 0
    else:
        start = fanout[first_byte - 1]
    end = fanout[first_byte]
    return start, end

# Example: Find objects starting with 0xab
start, end = get_search_range(fanout, 0xab)
# start = fanout[0xaa] = 670
# end = fanout[0xab] = 674
# Only 4 objects to binary search!

This reduces search space by ~256x on average.

Object Lookup Algorithm

Here's the complete lookup process:

def find_object_offset(index: bytes, sha: str) -> int | None:
    """
    Find object's pack file offset using the index.

    Returns offset or None if not found.
    """
    sha_bytes = bytes.fromhex(sha)
    first_byte = sha_bytes[0]

    # Step 1: Read fanout to narrow search range
    fanout_base = 8  # After header
    if first_byte == 0:
        start = 0
    else:
        start = int.from_bytes(
            index[fanout_base + (first_byte - 1) * 4 :
                  fanout_base + first_byte * 4],
            "big"
        )
    end = int.from_bytes(
        index[fanout_base + first_byte * 4 :
              fanout_base + (first_byte + 1) * 4],
        "big"
    )

    # Get total object count
    total_objects = int.from_bytes(
        index[fanout_base + 255 * 4 : fanout_base + 256 * 4],
        "big"
    )

    # Calculate section offsets
    sha_table_base = fanout_base + 256 * 4
    crc_table_base = sha_table_base + total_objects * 20
    offset_table_base = crc_table_base + total_objects * 4

    # Step 2: Binary search in SHA table
    while start < end:
        mid = (start + end) // 2
        mid_sha = index[sha_table_base + mid * 20 :
                       sha_table_base + (mid + 1) * 20]

        if mid_sha == sha_bytes:
            # Found! Read offset
            offset_bytes = index[offset_table_base + mid * 4 :
                                offset_table_base + (mid + 1) * 4]
            offset = int.from_bytes(offset_bytes, "big")

            # Check for large offset
            if offset & 0x80000000:
                # High bit set - look up in large offset table
                large_idx = offset & 0x7fffffff
                large_base = offset_table_base + total_objects * 4
                offset = int.from_bytes(
                    index[large_base + large_idx * 8 :
                          large_base + (large_idx + 1) * 8],
                    "big"
                )

            return offset

        elif mid_sha < sha_bytes:
            start = mid + 1
        else:
            end = mid

    return None  # Not found

CRC32 Verification

Each object has a CRC32 checksum of its packed data:

Object at offset 1000, ends at offset 1500
CRC32 = crc32(pack_data[1000:1500])

This enables: - Fast corruption detection: No need to decompress - Partial pack verification: Check specific objects - Network transfer verification: Verify without unpacking

import zlib

def verify_object_crc(pack_data: bytes, index: PackIndex, sha: str) -> bool:
    """Verify object hasn't been corrupted."""
    entry = index.find(sha)
    if entry is None:
        return False

    # Get object boundaries
    obj_start = entry.offset
    obj_end = find_object_end(pack_data, obj_start)

    # Compute CRC32
    actual_crc = zlib.crc32(pack_data[obj_start:obj_end]) & 0xffffffff

    return actual_crc == entry.crc32

Large Pack Support (>2GB)

Standard 4-byte offsets can only address up to ~2GB. For larger packs:

Offset table entry interpretation:
  - If high bit (0x80000000) is clear: value is the offset
  - If high bit is set: remaining bits are index into large offset table

Large Offset Table

┌───────────────────────────────────────────────────────────┐
│  Entry 0: 8-byte offset for first large object           │
│  Entry 1: 8-byte offset for second large object          │
│  ...                                                      │
└───────────────────────────────────────────────────────────┘

This allows packs up to 16 exabytes while keeping the common case (< 2GB) efficient.

Example: 5GB Pack

Objects:
  sha_001: offset 100         → offset table: 0x00000064
  sha_002: offset 2,500,000   → offset table: 0x002625A0
  sha_003: offset 3,000,000,000 (3GB) → offset table: 0x80000000
                               → large_offset[0] = 0x00000000B2D05E00

First two objects use direct 4-byte offsets.
Third object uses large offset table.

Building an Index

When creating a new pack file, the index is built simultaneously:

from dataclasses import dataclass
from typing import NamedTuple

@dataclass(slots=True)
class PackIndexEntry:
    """Single entry in pack index."""
    sha: str      # 40-char hex
    offset: int   # Position in pack
    crc32: int    # CRC32 of packed data


class PackIndex:
    """Pack file index for fast object lookup."""

    def __init__(self, pack_sha: str, entries: list[PackIndexEntry]):
        # Sort entries by SHA (required for binary search)
        self.entries = sorted(entries, key=lambda e: e.sha)
        self.pack_sha = pack_sha
        self._build_fanout()
        self._sha_to_idx = {e.sha: i for i, e in enumerate(self.entries)}

    def _build_fanout(self) -> None:
        """Build fanout table from sorted entries."""
        self._fanout = [0] * 256

        for entry in self.entries:
            first_byte = int(entry.sha[:2], 16)
            # Increment counts for this byte and all higher bytes
            for i in range(first_byte, 256):
                self._fanout[i] += 1

    def get_offset(self, sha: str) -> int | None:
        """Look up offset by SHA."""
        idx = self._sha_to_idx.get(sha)
        if idx is not None:
            return self.entries[idx].offset
        return None

    def __contains__(self, sha: str) -> bool:
        return sha in self._sha_to_idx

    @property
    def object_count(self) -> int:
        return len(self.entries)

Serializing the Index

Writing an index file:

import hashlib

IDX_MAGIC = b"\xff\x74\x4f\x63"  # 0xff744f63
IDX_VERSION = 2


def serialize_index(index: PackIndex) -> bytes:
    """Serialize index to version 2 format."""
    result = bytearray()

    # Header
    result.extend(IDX_MAGIC)
    result.extend(IDX_VERSION.to_bytes(4, "big"))

    # Fanout table
    for count in index._fanout:
        result.extend(count.to_bytes(4, "big"))

    # SHA table (sorted)
    for entry in index.entries:
        result.extend(bytes.fromhex(entry.sha))

    # CRC32 table
    for entry in index.entries:
        result.extend(entry.crc32.to_bytes(4, "big"))

    # Offset table (with large offset handling)
    large_offsets: list[int] = []
    for entry in index.entries:
        if entry.offset >= 0x80000000:  # Needs large offset
            large_idx = len(large_offsets)
            large_offsets.append(entry.offset)
            result.extend((0x80000000 | large_idx).to_bytes(4, "big"))
        else:
            result.extend(entry.offset.to_bytes(4, "big"))

    # Large offset table
    for offset in large_offsets:
        result.extend(offset.to_bytes(8, "big"))

    # Pack SHA
    result.extend(bytes.fromhex(index.pack_sha))

    # Index SHA (hash of everything so far)
    index_sha = hashlib.sha1(result).digest()
    result.extend(index_sha)

    return bytes(result)

Parsing an Index

Reading an existing index file:

def parse_index(data: bytes) -> PackIndex:
    """Parse version 2 index file."""
    # Verify header
    if data[:4] != IDX_MAGIC:
        raise ValueError("Invalid index magic")

    version = int.from_bytes(data[4:8], "big")
    if version != 2:
        raise ValueError(f"Unsupported index version: {version}")

    # Read fanout table
    fanout_start = 8
    fanout = []
    for i in range(256):
        count = int.from_bytes(
            data[fanout_start + i * 4 : fanout_start + (i + 1) * 4],
            "big"
        )
        fanout.append(count)

    object_count = fanout[255]

    # Calculate section offsets
    sha_start = fanout_start + 256 * 4
    crc_start = sha_start + object_count * 20
    offset_start = crc_start + object_count * 4
    large_offset_start = offset_start + object_count * 4

    # Read tables
    entries = []
    for i in range(object_count):
        # SHA
        sha = data[sha_start + i * 20 : sha_start + (i + 1) * 20].hex()

        # CRC32
        crc32 = int.from_bytes(
            data[crc_start + i * 4 : crc_start + (i + 1) * 4],
            "big"
        )

        # Offset
        offset = int.from_bytes(
            data[offset_start + i * 4 : offset_start + (i + 1) * 4],
            "big"
        )
        if offset & 0x80000000:
            # Large offset
            large_idx = offset & 0x7fffffff
            offset = int.from_bytes(
                data[large_offset_start + large_idx * 8 :
                     large_offset_start + (large_idx + 1) * 8],
                "big"
            )

        entries.append(PackIndexEntry(sha=sha, offset=offset, crc32=crc32))

    # Extract pack SHA
    pack_sha_offset = len(data) - 40
    pack_sha = data[pack_sha_offset : pack_sha_offset + 20].hex()

    return PackIndex(pack_sha=pack_sha, entries=entries)

Index Verification

The index has multiple integrity checks:

def verify_index(index_data: bytes) -> bool:
    """Verify index file integrity."""
    # 1. Check header
    if index_data[:4] != IDX_MAGIC:
        return False

    version = int.from_bytes(index_data[4:8], "big")
    if version != 2:
        return False

    # 2. Verify index SHA (last 20 bytes)
    expected_sha = index_data[-20:]
    computed_sha = hashlib.sha1(index_data[:-20]).digest()
    if expected_sha != computed_sha:
        return False

    # 3. Verify fanout is monotonically increasing
    fanout_start = 8
    prev = 0
    for i in range(256):
        count = int.from_bytes(
            index_data[fanout_start + i * 4 : fanout_start + (i + 1) * 4],
            "big"
        )
        if count < prev:
            return False
        prev = count

    # 4. Verify SHAs are sorted
    object_count = prev
    sha_start = fanout_start + 256 * 4
    prev_sha = b"\x00" * 20
    for i in range(object_count):
        sha = index_data[sha_start + i * 20 : sha_start + (i + 1) * 20]
        if sha <= prev_sha:
            return False
        prev_sha = sha

    return True

Performance Analysis

Lookup Complexity

Operation	Time	Space
Find by SHA	O(log n)	O(1)
Fanout lookup	O(1)	-
Binary search	O(log(n/256))	-
Overall	O(log n) ≈ O(1)	O(1)

For practical purposes, lookup is effectively O(1) because: - Fanout reduces search space by 256x - Log₂(n/256) is small even for millions of objects - Memory access patterns are cache-friendly

Space Overhead

Component	Size per Object
SHA	20 bytes
CRC32	4 bytes
Offset	4 bytes (+ 8 for large)
Total	~28 bytes

For a pack with 100,000 objects: ~2.8MB index

Multi-Pack Index (MIDX)

Git 2.28+ supports a "multi-pack index" for repositories with many packs:

.git/objects/pack/
├── pack-aaa.pack
├── pack-aaa.idx
├── pack-bbb.pack
├── pack-bbb.idx
├── pack-ccc.pack
├── pack-ccc.idx
└── multi-pack-index   # Single index spanning all packs

Benefits: - Single lookup across all packs - Faster object resolution - Better cache efficiency

This is an advanced optimization—the basic single-pack index is sufficient for most repositories.

Command-Line Tools

Inspect Index

# Show index summary
$ git show-index < .git/objects/pack/pack-*.idx | head
100 abc123def456... (blob)
250 789012345678... (commit)
...

# Verify index integrity
$ git verify-pack -v .git/objects/pack/pack-*.idx

Index Statistics

# Count objects
$ git show-index < .git/objects/pack/*.idx | wc -l
50000

# Distribution by first byte (should be ~uniform)
$ git show-index < .git/objects/pack/*.idx | \
    cut -c1-2 | sort | uniq -c | head
  195 00
  198 01
  202 02
  ...

Implementation in gitpy

Module Structure

# gitpy/storage/pack_index.py

from dataclasses import dataclass
from pathlib import Path
import hashlib

IDX_SIGNATURE = b"\xff\x74\x4f\x63"
IDX_VERSION = 2


@dataclass(slots=True)
class PackIndexEntry:
    """Single entry in pack index."""
    sha: str       # 40-char hex SHA
    offset: int    # Offset in pack file
    crc32: int     # CRC32 of packed object data


class PackIndex:
    """
    Pack index for fast object lookup.

    Provides O(log n) lookup of pack file offset by SHA.
    """

    def __init__(self, pack_sha: str, entries: list[PackIndexEntry]):
        self.pack_sha = pack_sha
        self.entries = sorted(entries, key=lambda e: e.sha)
        self._build_fanout()
        self._sha_to_idx = {e.sha: i for i, e in enumerate(self.entries)}

    def _build_fanout(self) -> None:
        """Build fanout table from entries."""
        self._fanout = [0] * 256
        counts = [0] * 256

        for entry in self.entries:
            first_byte = int(entry.sha[:2], 16)
            counts[first_byte] += 1

        # Convert to cumulative
        cumulative = 0
        for i in range(256):
            cumulative += counts[i]
            self._fanout[i] = cumulative

    def find(self, sha: str) -> PackIndexEntry | None:
        """Look up entry by SHA."""
        idx = self._sha_to_idx.get(sha)
        if idx is not None:
            return self.entries[idx]
        return None

    def get_offset(self, sha: str) -> int | None:
        """Get pack file offset for SHA."""
        entry = self.find(sha)
        return entry.offset if entry else None

    def __contains__(self, sha: str) -> bool:
        return sha in self._sha_to_idx

    @property
    def object_count(self) -> int:
        return len(self.entries)

    @classmethod
    def from_file(cls, path: Path) -> "PackIndex":
        """Load index from .idx file."""
        return parse_index(path.read_bytes())

    def serialize(self) -> bytes:
        """Serialize to index file format."""
        return serialize_index(self)

Testing Index Code

def test_index_roundtrip():
    """Serialize then parse should preserve data."""
    entries = [
        PackIndexEntry(sha="a" * 40, offset=100, crc32=0x12345678),
        PackIndexEntry(sha="b" * 40, offset=500, crc32=0xdeadbeef),
        PackIndexEntry(sha="f" * 40, offset=1000, crc32=0xcafebabe),
    ]

    index = PackIndex(pack_sha="c" * 40, entries=entries)
    data = index.serialize()
    restored = PackIndex.parse(data)

    assert restored.object_count == 3
    assert restored.get_offset("a" * 40) == 100
    assert restored.get_offset("b" * 40) == 500
    assert restored.get_offset("f" * 40) == 1000


def test_fanout_distribution():
    """Fanout should enable efficient lookup."""
    # Create entries with known distribution
    entries = [
        PackIndexEntry(sha="00" + "a" * 38, offset=100, crc32=0),
        PackIndexEntry(sha="00" + "b" * 38, offset=200, crc32=0),
        PackIndexEntry(sha="ff" + "c" * 38, offset=300, crc32=0),
    ]

    index = PackIndex(pack_sha="d" * 40, entries=entries)

    # Check fanout values
    assert index._fanout[0x00] == 2   # Two objects ≤ 0x00
    assert index._fanout[0xfe] == 2   # Still two (none in 01-fe)
    assert index._fanout[0xff] == 3   # Three total


def test_large_offset():
    """Handle offsets > 2GB."""
    large_offset = 0x100000000  # 4GB

    entries = [
        PackIndexEntry(sha="a" * 40, offset=large_offset, crc32=0),
    ]

    index = PackIndex(pack_sha="b" * 40, entries=entries)
    data = index.serialize()
    restored = PackIndex.parse(data)

    assert restored.get_offset("a" * 40) == large_offset

Key Takeaways

Fanout table reduces search space by ~256x
Binary search on sorted SHAs gives O(log n) lookup
CRC32 values enable fast corruption detection
Large offset table supports packs up to 16 exabytes
Trailing checksums verify both pack and index integrity
Compact format keeps index ~28 bytes per object

The pack index is a masterpiece of data structure design—simple, efficient, and elegant. It transforms linear search into near-constant-time lookup, making Git fast regardless of repository size.

What's Next?

Pack Files: How objects are stored in packs
Delta Compression: How similar objects share storage
Implementation: See gitpy/storage/pack_index.py for code