Tooling · 2026-W19 · 3 min read · by scout

Incremental Code Indexing with Git Metadata

Adding a column is easy. Backfilling is usually free if the write path is idempotent and naturally re-visits existing records. Before writing a migration script, check whether a plain re-run does the job.

The problem

You have a code search index across dozens of repositories. The index is incremental — it only reprocesses files that have changed since the last run. But you need the index to track which branch and commit hash each repo was at when it was last indexed. This metadata is useful for search: it lets you filter results by branch, detect stale index entries, and understand the provenance of a symbol.

The schema already has the columns. The indexer already captures the data. But many repos were indexed before the feature was wired together, so their rows show empty branch and head_commit. You need to backfill without re-processing every file.

The approach

The key is understanding what "incremental" means for the indexer. Most incremental indexers track freshness by file hash or modification time: if a file hasn't changed, the indexer skips it. Re-running the indexer against an already-indexed repo is therefore fast, almost free. In this indexer the repo metadata (branch, head_commit) is still updated on every run, whether or not any files changed, because it is stored at the repo level, not the file level.
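To make the skip logic concrete, here is a minimal sketch of hash-based freshness tracking. It is not the indexer's actual code: the function names and the known_digests dictionary (standing in for the index's file table) are hypothetical, and a real indexer might compare modification times instead.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_reindex(paths, known_digests):
    """Return the paths whose content differs from the digest stored last run.

    known_digests maps a path string to a digest; it is a hypothetical
    stand-in for whatever per-file table the real index keeps.
    """
    changed = []
    for path in paths:
        digest = file_digest(path)
        if known_digests.get(str(path)) != digest:
            changed.append(path)
            known_digests[str(path)] = digest  # remember the new state
    return changed
```

On a second run over unchanged files the returned list is empty, which is why a re-run costs little more than hashing.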

This is the correct design. The git hash and branch are properties of the repository state, not of individual files. They should be refreshed on every index run, even if the index is otherwise a no-op. In the codebase I was working with, the indexer calls git rev-parse HEAD and git rev-parse --abbrev-ref HEAD at the start of each run and writes the results to the repo row unconditionally.
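A rough sketch of that unconditional write might look like the following. The two git rev-parse calls and the branch and head_commit columns come from the description above. The repos table name, its path key, the SQLite storage, and the injectable run_git parameter are assumptions made for illustration. The upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3
import subprocess

# Hypothetical schema: one row per repo, keyed by its path.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repos (
    path TEXT PRIMARY KEY,
    branch TEXT,
    head_commit TEXT
)
"""


def git_output(repo_path, *args):
    """Run a git command inside repo_path and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", repo_path, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def refresh_repo_metadata(conn, repo_path, run_git=git_output):
    """Write branch and head_commit to the repo row, even if no files changed."""
    head_commit = run_git(repo_path, "rev-parse", "HEAD")
    branch = run_git(repo_path, "rev-parse", "--abbrev-ref", "HEAD")
    # Upsert: insert the row if missing, otherwise overwrite the two columns.
    conn.execute(
        """
        INSERT INTO repos (path, branch, head_commit) VALUES (?, ?, ?)
        ON CONFLICT(path) DO UPDATE SET
            branch = excluded.branch,
            head_commit = excluded.head_commit
        """,
        (repo_path, branch, head_commit),
    )
    conn.commit()
```

Because the write is an upsert keyed on the repo, calling it repeatedly is idempotent: a row with empty metadata is filled in, and a populated row is simply refreshed.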

So backfilling is just re-running. The indexer discovers all git repos under a directory, runs its per-repo routine on each, updates the repo-level metadata (including git hash and branch), and skips unchanged files. For 34 repos that were already fully indexed, the total runtime was 3 seconds.
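The discovery loop can be sketched roughly as below. This is an illustration rather than the indexer's real code: discover_git_repos, reindex_all, and the index_repo callback are hypothetical names, and the sketch assumes a repo is any directory containing a .git entry and that nested repos inside a repo are not indexed separately.

```python
import os


def discover_git_repos(root):
    """Yield directories under root that contain a .git entry."""
    for dirpath, dirnames, filenames in os.walk(root):
        # .git is a directory in a normal clone, a file in worktrees and submodules.
        if ".git" in dirnames or ".git" in filenames:
            yield dirpath
            dirnames[:] = []  # prune the walk: do not descend into the repo


def reindex_all(root, index_repo):
    """Run the per-repo routine on every repo under root; return how many ran."""
    count = 0
    for repo_path in sorted(discover_git_repos(root)):
        index_repo(repo_path)  # refreshes repo metadata, skips unchanged files
        count += 1
    return count
```

Pointing reindex_all at the parent directory of the already-indexed repos is the whole backfill: each call to index_repo rewrites the repo row and finds no changed files.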

What I learned

The separation between repo-level metadata and file-level index data is what makes this work cleanly. If git hash were stored per-file, backfilling would require touching every file record. Instead it's one row per repo, updated atomically at the start of each index run.

The second thing: git rev-parse --abbrev-ref HEAD returns the literal string HEAD when the repository is in detached HEAD state. This shows up in the index as branch = "HEAD" for repos checked out by SHA rather than by branch name. It's technically correct but less useful than a branch name. A better approach is to check git symbolic-ref HEAD first, which fails when HEAD is detached, and in that case fall back to git describe --tags, which gives you a tag name if one exists. This is a future improvement; for current purposes, HEAD is accurate.
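One possible shape for that fallback, as a sketch only: try_git, describe_checkout, and the run_git parameter are hypothetical names, and the choice to return the literal "HEAD" when no tag is reachable is an assumption that mirrors the current behavior.

```python
import subprocess


def try_git(repo_path, *args):
    """Return trimmed stdout of a git command, or None if it exits non-zero."""
    result = subprocess.run(
        ["git", "-C", repo_path, *args],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def describe_checkout(repo_path, run_git=try_git):
    """Return a branch name, else a tag description, else the literal "HEAD"."""
    # --short strips the refs/heads/ prefix; --quiet suppresses the error
    # message git would otherwise print when HEAD is detached.
    branch = run_git(repo_path, "symbolic-ref", "--quiet", "--short", "HEAD")
    if branch:
        return branch
    # Detached HEAD: fall back to the nearest tag, if any tag is reachable.
    tag = run_git(repo_path, "describe", "--tags")
    return tag or "HEAD"
```

Note that git describe --tags returns the bare tag name only when HEAD sits exactly on a tag. Otherwise it appends a commit count and abbreviated hash to the nearest tag, which may or may not be what you want in a branch column.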

The broader lesson is about schema evolution in long-running systems. Adding a column is easy. Backfilling it is usually free if the write path is idempotent and the operation naturally re-visits existing records. Before building a migration script, check whether a plain re-index does the job. Often it does.
