Understanding Git's Object Storage
This document explains how Git stores objects on disk: what lives inside .git/objects/, how loose objects and packfiles are laid out, and which design choices keep reads and writes fast and safe.
The Big Picture: Content-Addressable Storage
Git's object database is a content-addressable filesystem. Instead of organizing files by name or location, Git organizes them by content. The SHA-1 hash of an object's header and content becomes its "address" in the database.
This has profound implications:
- Automatic deduplication: Identical content is stored once
- Integrity verification: Corruption is detectable whenever an object is read and rehashed
- Immutability: Objects never change once created
- Efficient sharing: Same hash = same content across all repositories
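The properties above can be sketched with a toy in-memory store. This is only an illustration, not how Git or gitpy is implemented: `InMemoryStore` and `object_id` are hypothetical names, and the store is a plain dict rather than files on disk.

```python
import hashlib


def object_id(obj_type: str, content: bytes) -> str:
    """Hash "<type> <size>\\0<content>", the bytes Git uses to name an object."""
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


class InMemoryStore:
    """Hypothetical toy store: a dict keyed by object ID."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, obj_type: str, content: bytes) -> str:
        oid = object_id(obj_type, content)
        # Identical content maps to the same key, so it is stored only once.
        self.objects.setdefault(oid, content)
        return oid


store = InMemoryStore()
first = store.put("blob", b"hello\n")
second = store.put("blob", b"hello\n")  # same content, same address
third = store.put("blob", b"")          # different content, new address
print(first == second, len(store.objects))
```

Because the key is derived from the content, deduplication needs no extra bookkeeping: the second `put` finds the key already present.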
The .git/objects Directory
After git init, Git creates objects/info/ and objects/pack/; the two-character prefix directories appear as objects are written. The resulting structure:
.git/
└── objects/
    ├── info/         # Auxiliary information
    ├── pack/         # Packfiles (compressed collections)
    └── <XX>/         # Loose objects (2-char prefix directories)
        └── <YYY...>  # Object files (38-char suffix)
Let's explore each part.
Loose Objects: The Simple Case
How Objects Are Stored
When Git creates a new object, it stores it as a "loose object"—an individual file named by its SHA-1 hash.
Path calculation:
SHA:  ce013625030ba8dba906f756967f9e9ca394464a
      ^^
      First 2 chars → directory name
Path: .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
                   ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                   dir  remaining 38 chars
Why split the hash?
Filesystems become slow with too many files in one directory. By using 256 possible two-character prefixes (00-ff), Git distributes objects across subdirectories.
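As a minimal sketch of the path calculation, assuming a 40-character lowercase hex SHA-1 (the helper name `loose_object_path` is hypothetical):

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def loose_object_path(git_dir: Path, sha: str) -> Path:
    """Map a full SHA-1 to <git_dir>/objects/<first 2 chars>/<remaining 38>."""
    if len(sha) != 40 or not set(sha) <= HEX_DIGITS:
        raise ValueError(f"not a full lowercase SHA-1: {sha!r}")
    return git_dir / "objects" / sha[:2] / sha[2:]


path = loose_object_path(Path(".git"), "ce013625030ba8dba906f756967f9e9ca394464a")
print(path.as_posix())
```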
Compression
All loose objects are zlib-compressed. The compression format:
zlib_compress("<type> <size>\0<content>")
For example, a blob containing hello\n:
import zlib

# What gets compressed:
raw_data = b"blob 6\0hello\n"
# Stored on disk:
compressed = zlib.compress(raw_data)
Exploring Loose Objects
# Find an object
$ ls .git/objects/ce/
013625030ba8dba906f756967f9e9ca394464a
# Decompress and view (Python)
$ python3 -c "
import zlib
data = open('.git/objects/ce/013625030ba8dba906f756967f9e9ca394464a', 'rb').read()
print(zlib.decompress(data))
"
b'blob 6\x00hello\n'
# Or use Git's plumbing
$ git cat-file -p ce013625
hello
Object File Format
Inside the compressed file:
┌──────────────────────────────────────┐
│ zlib-compressed data │
│ ┌──────────────────────────────────┐ │
│ │ type SP size NUL content │ │
│ │ ^^^^ ^^^^ ^^^^^^^ │ │
│ │ blob 6 hello\n │ │
│ └──────────────────────────────────┘ │
└──────────────────────────────────────┘
- type: blob, tree, commit, or tag
- SP: Space character (0x20)
- size: Decimal byte count of content
- NUL: Null byte (0x00) separator
- content: Raw object data
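The layout above can be parsed in a few lines. The following is a sketch rather than gitpy's actual parser; `parse_loose_object` is a hypothetical name, and it only splits the header from the content without interpreting trees or commits.

```python
import zlib

VALID_TYPES = {"blob", "tree", "commit", "tag"}


def parse_loose_object(compressed: bytes) -> tuple[str, bytes]:
    """Decompress a loose object and split "<type> <size>\\0<content>"."""
    raw = zlib.decompress(compressed)
    header, sep, content = raw.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL separator after header")
    obj_type, _, size_text = header.decode("ascii").partition(" ")
    if obj_type not in VALID_TYPES:
        raise ValueError(f"unknown object type: {obj_type!r}")
    if int(size_text) != len(content):
        raise ValueError("declared size does not match content length")
    return obj_type, content


obj_type, content = parse_loose_object(zlib.compress(b"blob 6\0hello\n"))
print(obj_type, content)
```

Checking the declared size against the actual content length catches truncated objects before the content is used.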
Atomic Writes: Preventing Corruption
Git uses a critical pattern for writing objects: write to temp, then rename.
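import os
import zlib

def write_object(sha, data):
    path = f".git/objects/{sha[:2]}/{sha[2:]}"
    temp_path = path + ".tmp"
    # Make sure the two-character prefix directory exists
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # 1. Write to temporary file
    with open(temp_path, "wb") as f:
        f.write(zlib.compress(data))
    # 2. Atomic rename (os.replace also overwrites on Windows)
    os.replace(temp_path, path)
    # 3. Make read-only (immutable)
    os.chmod(path, 0o444)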
Why this matters:
- If Git crashes during write, only the .tmp file is corrupted
- The rename() system call is atomic on most filesystems
- Either the complete object exists, or it doesn't; there are no partial files
The Idempotency Property
Because objects are content-addressable:
# Writing the same content twice is safe
# (here write_object computes the SHA from data and returns it)
sha1 = write_object(data)  # Creates object
sha2 = write_object(data)  # Skips (already exists)
assert sha1 == sha2
Git checks if the object exists before writing. This is safe because:
- Same content → same hash → same object
- Object files are immutable once created
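Putting the atomic write and the existence check together, a runnable sketch might look like the following. It assumes a POSIX-like filesystem and uses a scratch directory instead of a real repository; `write_loose_object` is a hypothetical helper, and the unique temp-file name from `tempfile.mkstemp` is a choice made here, not something the text above prescribes.

```python
import hashlib
import os
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Write an object if it is missing and return its SHA-1."""
    data = f"{obj_type} {len(content)}\0".encode() + content
    sha = hashlib.sha1(data).hexdigest()
    path = objects_dir / sha[:2] / sha[2:]
    if path.exists():
        return sha  # same content already stored; nothing to do
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix="tmp_obj_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(zlib.compress(data))
        os.chmod(tmp_name, 0o444)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return sha


with tempfile.TemporaryDirectory() as scratch:
    objects_dir = Path(scratch) / "objects"
    sha1 = write_loose_object(objects_dir, "blob", b"hello\n")
    sha2 = write_loose_object(objects_dir, "blob", b"hello\n")
    file_count = sum(1 for p in objects_dir.rglob("*") if p.is_file())
print(sha1 == sha2, file_count)
```

The second call returns early without touching the disk, which is what makes repeated writes cheap as well as safe.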
SHA-1 Verification
When reading an object, Git verifies the hash:
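import hashlib
import zlib

class CorruptObjectError(Exception):
    """Raised when an object's bytes do not match its SHA."""

def read_object(sha):
    path = f".git/objects/{sha[:2]}/{sha[2:]}"
    with open(path, "rb") as f:
        compressed = f.read()
    data = zlib.decompress(compressed)
    # Verify integrity: hash the decompressed header + content
    computed = hashlib.sha1(data).hexdigest()
    if computed != sha:
        raise CorruptObjectError(f"Expected {sha}, got {computed}")
    return data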
This catches:
- Disk corruption (bit rot)
- Incomplete writes
- Malicious tampering
If any bit changes, the hash won't match.
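To see the check fail, the sketch below verifies one good object, one with altered content, and one truncated mid-stream. `verify_object` is a hypothetical helper that works on in-memory bytes rather than files.

```python
import hashlib
import zlib


def verify_object(sha: str, compressed: bytes) -> bool:
    """Return True only if the data decompresses and hashes to sha."""
    try:
        data = zlib.decompress(compressed)
    except zlib.error:
        return False  # truncated or damaged zlib stream
    return hashlib.sha1(data).hexdigest() == sha


raw = b"blob 6\0hello\n"
sha = hashlib.sha1(raw).hexdigest()
good = zlib.compress(raw)
tampered = zlib.compress(b"blob 6\0jello\n")  # one byte of content changed
truncated = good[:-3]                          # simulates an incomplete write

results = (
    verify_object(sha, good),
    verify_object(sha, tampered),
    verify_object(sha, truncated),
)
print(results)
```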
Short SHA Resolution
Git allows abbreviated SHA references:
$ git show ce01362 # 7 characters
$ git show ce0136 # 6 characters
$ git show ce01 # 4 characters (minimum)
How it works:
import os

class NotFoundError(LookupError):
    pass

class AmbiguousError(LookupError):
    pass

def resolve_short_sha(prefix):
    if len(prefix) < 4:
        raise ValueError("Minimum 4 characters required")
    dir_prefix = prefix[:2]
    file_prefix = prefix[2:]
    obj_dir = f".git/objects/{dir_prefix}"
    matches = []
    # A missing prefix directory means no loose object can match
    if os.path.isdir(obj_dir):
        for file in os.listdir(obj_dir):
            if file.startswith(file_prefix):
                matches.append(dir_prefix + file)
    if len(matches) == 0:
        raise NotFoundError(f"No object matching {prefix}")
    if len(matches) > 1:
        raise AmbiguousError(f"{prefix} matches {len(matches)} objects")
    return matches[0]
Ambiguity handling:
With enough objects, short prefixes become ambiguous. Git requires increasingly longer prefixes:
$ git show abc1
error: short SHA1 abc1 is ambiguous
hint: The candidates are:
hint: abc1234... blob
hint: abc1567... commit
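The flip side of ambiguity is finding the shortest prefix that is still unique. The following is a simple sketch over an in-memory list of IDs; `shortest_unique_prefix` is a hypothetical helper, the object IDs are made up for illustration, and real Git's abbreviation logic is more involved.

```python
def shortest_unique_prefix(sha: str, all_shas: list[str], minimum: int = 4) -> str:
    """Grow the prefix from `minimum` chars until no other object shares it."""
    others = [s for s in all_shas if s != sha]
    for length in range(minimum, len(sha) + 1):
        prefix = sha[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return sha


# Made-up object IDs, padded to 40 characters for the example.
shas = [
    "abc1234" + "0" * 33,
    "abc1567" + "0" * 33,
    "ce013625030ba8dba906f756967f9e9ca394464a",
]
short_a = shortest_unique_prefix(shas[0], shas)
short_c = shortest_unique_prefix(shas[2], shas)
print(short_a, short_c)
```

The first two IDs share `abc1`, so a fifth character is needed to tell them apart, while the third is already unique at the four-character minimum.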
Packfiles: Efficient Storage at Scale
Loose objects work well for small repositories, but become inefficient at scale:
- Many small files = filesystem overhead
- Similar objects stored separately (no delta compression)
- Slow for network transfer
Packfiles solve this by:
- Collecting many objects into one file
- Delta-compressing similar objects
- Using a sorted index for fast lookups (a fan-out table plus binary search)
Pack Structure
.git/objects/pack/
├── pack-abc123.idx # Index (hash → offset)
└── pack-abc123.pack # Packed objects
When Packing Happens
- git gc (garbage collection)
- git push / git fetch (network transfer)
- Automatically when loose objects exceed threshold
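A threshold check of this kind could be sketched as follows. `count_loose_objects` and `should_pack` are hypothetical helpers; the default of 6700 follows Git's documented `gc.auto` default, but real Git estimates the count rather than scanning every directory, so treat this as an approximation of the idea only.

```python
import re
import tempfile
from pathlib import Path

PREFIX_DIR = re.compile(r"^[0-9a-f]{2}$")
OBJECT_FILE = re.compile(r"^[0-9a-f]{38}$")


def count_loose_objects(objects_dir: Path) -> int:
    """Count files named like loose objects, skipping info/ and pack/."""
    count = 0
    for sub in objects_dir.iterdir():
        if sub.is_dir() and PREFIX_DIR.match(sub.name):
            count += sum(1 for f in sub.iterdir() if OBJECT_FILE.match(f.name))
    return count


def should_pack(objects_dir: Path, threshold: int = 6700) -> bool:
    """Assumed policy: pack once the loose count exceeds the threshold."""
    return count_loose_objects(objects_dir) > threshold


with tempfile.TemporaryDirectory() as scratch:
    objects_dir = Path(scratch)
    (objects_dir / "info").mkdir()
    (objects_dir / "pack").mkdir()
    for sha in ("ce013625030ba8dba906f756967f9e9ca394464a",
                "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"):
        (objects_dir / sha[:2]).mkdir(exist_ok=True)
        (objects_dir / sha[:2] / sha[2:]).write_bytes(b"")
    loose_count = count_loose_objects(objects_dir)
    pack_with_low_threshold = should_pack(objects_dir, threshold=1)
    pack_with_default = should_pack(objects_dir)
print(loose_count, pack_with_low_threshold, pack_with_default)
```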
Pack Format Overview
┌─────────────────────────────────────────────┐
│ PACK header (signature, version, count) │
├─────────────────────────────────────────────┤
│ Object 1 (type, size, compressed data) │
├─────────────────────────────────────────────┤
│ Object 2 (OFS_DELTA: offset, delta) │ ← Delta-compressed!
├─────────────────────────────────────────────┤
│ Object 3 (REF_DELTA: base SHA, delta) │ ← Delta-compressed!
├─────────────────────────────────────────────┤
│ ... more objects ... │
├─────────────────────────────────────────────┤
│ SHA-1 checksum of entire pack │
└─────────────────────────────────────────────┘
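The pack header is small enough to parse directly: a 4-byte signature `PACK`, a 4-byte version, and a 4-byte object count, all in network (big-endian) byte order. The sketch below builds a header in memory instead of reading a real packfile; `parse_pack_header` is a hypothetical name.

```python
import struct

PACK_HEADER = struct.Struct(">4sII")  # signature, version, object count


def parse_pack_header(data: bytes) -> tuple[int, int]:
    """Return (version, object_count) from the first 12 bytes of a pack."""
    if len(data) < PACK_HEADER.size:
        raise ValueError("pack header needs at least 12 bytes")
    signature, version, count = PACK_HEADER.unpack_from(data)
    if signature != b"PACK":
        raise ValueError(f"bad pack signature: {signature!r}")
    return version, count


# A synthetic header describing a version 2 pack with 3 objects.
header = PACK_HEADER.pack(b"PACK", 2, 3)
version, count = parse_pack_header(header)
print(version, count)
```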
Delta compression:
Instead of storing similar objects fully, Git stores:
- One base object (full content)
- Deltas (instructions to reconstruct others from base)
Example: If file.txt changes slightly between commits:
- Version 1: Full blob (1000 bytes)
- Version 2: Delta from v1 (50 bytes of changes)
This is why Git repositories are so compact!
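The idea of "instructions to reconstruct from a base" can be illustrated with a toy delta made of copy and insert operations. This is not Git's binary delta encoding or its algorithm for finding matches: the sketch borrows `difflib.SequenceMatcher` from the standard library, and `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list[tuple]:
    """Describe target as copies from base plus literal inserts."""
    ops: list[tuple] = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # offset and length in base
        elif j2 > j1:                                 # "replace" or "insert"
            ops.append(("insert", target[j1:j2]))     # literal new bytes
        # "delete" needs no instruction: those base bytes are never copied
    return ops


def apply_delta(base: bytes, ops: list[tuple]) -> bytes:
    """Rebuild the target by replaying the instructions against base."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"".join(b"line %d of file.txt\n" % i for i in range(20))
target = base.replace(b"line 7 ", b"LINE 7 ")
delta = make_delta(base, target)
rebuilt = apply_delta(base, delta)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(rebuilt == target, literal_bytes, len(target))
```

Only the changed bytes travel as literals; everything else is a reference into the base, which is where the space saving comes from.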
The Object Database Interface
Git's storage has a clean abstraction:
┌─────────────────────────────────────────────┐
│ ObjectDatabase │
│ - read(sha) → GitObject │
│ - write(obj) → sha │
│ - exists(sha) → bool │
└─────────────────────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ LooseObjects │ │ Packfiles │
│ .git/objects/ │ │ .git/objects/ │
│ XX/YYYY... │ │ pack/*.pack │
└───────────────┘ └───────────────┘
When reading:
1. Check loose objects first
2. Then search packfiles
3. Error if not found
When writing:
1. Always write loose objects
2. git gc later converts to packfiles
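The read and write order described above can be sketched with two interchangeable stores. Both classes are hypothetical stand-ins backed by dicts; they show the lookup order only and are not gitpy's ObjectDatabase.

```python
from typing import Optional


class DictStore:
    """Hypothetical stand-in for a loose or packed store."""

    def __init__(self, objects: Optional[dict] = None) -> None:
        self.objects: dict[str, bytes] = dict(objects or {})

    def exists(self, sha: str) -> bool:
        return sha in self.objects

    def read(self, sha: str) -> bytes:
        return self.objects[sha]


class LayeredDatabase:
    """Reads try loose objects, then packs; writes always go to loose."""

    def __init__(self, loose: DictStore, packed: DictStore) -> None:
        self.loose = loose
        self.packed = packed

    def read(self, sha: str) -> bytes:
        for store in (self.loose, self.packed):
            if store.exists(sha):
                return store.read(sha)
        raise KeyError(f"object {sha} not found")

    def write(self, sha: str, data: bytes) -> None:
        self.loose.objects.setdefault(sha, data)


db = LayeredDatabase(DictStore(), DictStore({"aa11": b"packed object"}))
db.write("bb22", b"loose object")
from_pack = db.read("aa11")
from_loose = db.read("bb22")
print(from_pack, from_loose)
```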
Repository Initialization
When you run git init, Git creates:
.git/
├── HEAD # Points to current branch
├── config # Repository configuration
├── description # For GitWeb (rarely used)
├── objects/
│ ├── info/ # Pack information
│ └── pack/ # Packfiles go here
├── refs/
│ ├── heads/ # Branch references
│ └── tags/ # Tag references
└── info/
    └── exclude # Local ignore patterns
The essential parts:
- HEAD: Tells Git which branch you're on
- objects/: The object database
- refs/: Human-readable names for commits
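HEAD is a small text file. When it names a branch, its content has the form `ref: refs/heads/<branch>`; in a detached state it holds a commit SHA instead. A minimal sketch of reading it, with `parse_head` as a hypothetical helper:

```python
def parse_head(text: str) -> tuple[str, str]:
    """Classify HEAD content as a symbolic ref or a detached commit ID."""
    value = text.strip()
    if value.startswith("ref: "):
        return "symbolic", value[len("ref: "):]
    return "detached", value


on_branch = parse_head("ref: refs/heads/main\n")
detached = parse_head("ce013625030ba8dba906f756967f9e9ca394464a\n")
print(on_branch, detached)
```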
Bare vs Normal Repositories
Normal repository:
myproject/
├── .git/ # Git metadata
└── src/ # Working directory
    └── main.py
Bare repository:
myproject.git/ # Just Git metadata, no working directory
├── HEAD
├── config
├── objects/
└── refs/
Bare repositories are used for:
- Remote servers (GitHub, etc.)
- Sharing without working directories
- git clone --bare
Real-World Storage Patterns
Small Repository
.git/objects/
├── ce/
│ └── 013625... # A blob
├── 4b/
│ └── 825dc6... # Empty tree
├── e6/
│ └── 9de29b... # Empty blob
├── info/
└── pack/ # Empty (no packs yet)
Large Repository After git gc
.git/objects/
├── pack/
│ ├── pack-abc123.idx # ~2MB index
│ └── pack-abc123.pack # ~50MB packed objects
├── info/
│ └── packs # Lists available packs
└── <few loose objects> # Recently created, not yet packed
Performance Characteristics
Loose Objects
| Operation | Complexity | Notes |
|---|---|---|
| Read by SHA | O(1) | Direct path lookup |
| Write | O(1) | Create file |
| Exists check | O(1) | File exists check |
| List all | O(n) | Scan all directories |
Packfiles
| Operation | Complexity | Notes |
|---|---|---|
| Read by SHA | O(log n) | Fan-out table + binary search in the index, then seek |
| Delta resolution | O(chain) | May need to read base objects |
| Write | N/A | git gc handles this |
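A version 2 pack index starts with a 256-entry fan-out table (cumulative object counts by first byte) followed by the sorted object names. A lookup uses the fan-out to narrow the range and then binary-searches within it. The sketch below models this with Python lists; it does not read the on-disk .idx format, and `build_fanout` and `find_in_index` are hypothetical names.

```python
import bisect


def build_fanout(sorted_shas: list[str]) -> list[int]:
    """fanout[b] = number of objects whose first byte is <= b."""
    counts = [0] * 256
    for sha in sorted_shas:
        counts[int(sha[:2], 16)] += 1
    fanout, total = [], 0
    for c in counts:
        total += c
        fanout.append(total)
    return fanout


def find_in_index(sorted_shas: list[str], fanout: list[int], sha: str) -> int:
    """Return the position of sha in the sorted list, or -1 if absent."""
    first_byte = int(sha[:2], 16)
    lo = fanout[first_byte - 1] if first_byte > 0 else 0
    hi = fanout[first_byte]
    pos = bisect.bisect_left(sorted_shas, sha, lo, hi)
    if pos < hi and sorted_shas[pos] == sha:
        return pos
    return -1


index = sorted([
    "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
    "4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    "ce013625030ba8dba906f756967f9e9ca394464a",
])
fanout = build_fanout(index)
hit = find_in_index(index, fanout, "ce013625030ba8dba906f756967f9e9ca394464a")
miss = find_in_index(index, fanout, "ce01" + "f" * 36)
print(hit, miss)
```

In the real format, the position found this way selects the matching entry in a parallel offset table, which gives the seek position in the .pack file.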
Space Efficiency
| Scenario | Loose | Packed |
|---|---|---|
| 1000 small files | ~1000 inodes | 2 files |
| Similar versions | Full copies | Delta compressed |
| Network transfer | Many requests | Single stream |
Implementation in gitpy
Code Organization
gitpy/
├── storage/
│ ├── __init__.py # Module exports
│ ├── compression.py # Zlib utilities
│ ├── loose.py # LooseObjectStore
│ └── database.py # ObjectDatabase
└── repository.py # Repository class
LooseObjectStore
Handles individual object files:
import hashlib
import zlib
from pathlib import Path

class LooseObjectStore:
    def __init__(self, git_dir: Path):
        self.objects_dir = git_dir / "objects"

    def _object_path(self, sha: str) -> Path:
        return self.objects_dir / sha[:2] / sha[2:]

    def exists(self, sha: str) -> bool:
        return self._object_path(sha).exists()

    def read(self, sha: str) -> bytes:
        path = self._object_path(sha)
        compressed = path.read_bytes()
        data = zlib.decompress(compressed)
        # Verify integrity
        if hashlib.sha1(data).hexdigest() != sha:
            raise ValueError("SHA mismatch - corrupted object")
        return data

    def write(self, sha: str, data: bytes) -> Path:
        path = self._object_path(sha)
        if path.exists():
            return path  # Already exists
        path.parent.mkdir(parents=True, exist_ok=True)
        # Atomic write: temp file in the same directory, then rename
        temp = path.with_suffix(".tmp")
        temp.write_bytes(zlib.compress(data))
        temp.replace(path)
        path.chmod(0o444)
        return path
ObjectDatabase
High-level interface with type safety:
class ObjectDatabase:
    def __init__(self, git_dir: Path):
        self.loose = LooseObjectStore(git_dir)

    def read(self, sha: str) -> GitObject:
        data = self.loose.read(sha)
        _, obj = parse_object(data)
        return obj

    def read_blob(self, sha: str) -> Blob:
        obj = self.read(sha)
        if not isinstance(obj, Blob):
            raise TypeError(f"Expected blob, got {obj.type_name}")
        return obj

    def write(self, obj: GitObject) -> str:
        data = create_object_data(obj)
        sha = hashlib.sha1(data).hexdigest()
        self.loose.write(sha, data)
        return sha
Repository
Ties everything together:
from pathlib import Path
from typing import Optional

class Repository:
    def __init__(self, path: Path, bare: bool = False):
        # A bare repository has no working tree; its root is the git dir
        self.worktree = None if bare else path
        self.git_dir = path if bare else path / ".git"
        self.objects = ObjectDatabase(self.git_dir)

    @classmethod
    def init(cls, path: Path, bare: bool = False) -> "Repository":
        git_dir = path if bare else path / ".git"
        # Create directory structure
        (git_dir / "objects" / "info").mkdir(parents=True)
        (git_dir / "objects" / "pack").mkdir()
        (git_dir / "refs" / "heads").mkdir(parents=True)
        (git_dir / "refs" / "tags").mkdir()
        # Create HEAD
        (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
        # Pass bare through so git_dir is not resolved to path/.git
        return cls(path, bare=bare)

    @classmethod
    def find(cls, start: Optional[Path] = None) -> "Repository":
        """Walk up directory tree to find repository."""
        path = (start or Path.cwd()).resolve()
        while True:
            if (path / ".git").is_dir():
                return cls(path)
            if path.parent == path:
                raise ValueError("Not a git repository")
            path = path.parent
Git Compatibility Verification
Known Hashes
These must match real Git:
| Object | SHA-1 |
|---|---|
| Empty blob | e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 |
| Empty tree | 4b825dc642cb6eb9a060e54bf8d69288fbee4904 |
| hello\n blob | ce013625030ba8dba906f756967f9e9ca394464a |
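These values can be checked with nothing but the standard library, which makes a convenient first compatibility test. The sketch below is independent of gitpy; `git_hash` is a hypothetical helper, and the empty tree is simply a tree object with zero bytes of content.

```python
import hashlib


def git_hash(obj_type: str, content: bytes) -> str:
    """SHA-1 over "<type> <size>\\0<content>"."""
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


KNOWN_HASHES = [
    ("blob", b"", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"),
    ("tree", b"", "4b825dc642cb6eb9a060e54bf8d69288fbee4904"),
    ("blob", b"hello\n", "ce013625030ba8dba906f756967f9e9ca394464a"),
]

mismatches = [
    (obj_type, content, expected)
    for obj_type, content, expected in KNOWN_HASHES
    if git_hash(obj_type, content) != expected
]
print("all match" if not mismatches else mismatches)
```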
Interoperability Test
# Create object with Git, read with gitpy
$ git init test-repo && cd test-repo
$ echo "hello" | git hash-object -w --stdin
ce013625030ba8dba906f756967f9e9ca394464a
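$ python3 -c "
from pathlib import Path
from gitpy.repository import Repository
repo = Repository(Path('.'))
# The loose store looks objects up by full SHA, so pass all 40 characters
blob = repo.objects.read_blob('ce013625030ba8dba906f756967f9e9ca394464a')
print(blob.data) # b'hello\n'
"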
# Create object with gitpy, read with Git
$ python3 -c "
from pathlib import Path
from gitpy.repository import Repository
from gitpy.objects import Blob
repo = Repository(Path('.'))
sha = repo.objects.write(Blob(data=b'gitpy wrote this\n'))
print(sha)
"
$ git cat-file -p <sha>
gitpy wrote this
Key Takeaways
- Content-addressable storage enables deduplication and integrity
- Loose objects are simple zlib-compressed files
- Atomic writes prevent corruption
- SHA verification catches errors
- Packfiles provide efficient storage at scale
- Clean abstractions separate concerns
The storage layer is deliberately simple—complexity comes from what you build on top. With objects stored reliably, Git can focus on commits, branches, and history without worrying about data integrity.
What's Next?
With object storage in place, the next layers build on top:
- Pack Files: Efficient storage combining many objects with delta compression
- Delta Compression: How Git stores only differences between similar objects
- Pack Index: Fast object lookup in pack files (fan-out table plus binary search)
- References (gitpy/refs/): Human-readable names (branches, tags, HEAD)
- Index (gitpy/index/): The staging area between working directory and repository
- Commands (gitpy/commands/): Plumbing and porcelain operations
All of these layers ultimately read and write objects through the storage layer we've built.