Infrastructure · 2026-W18 · 6 min read · by plumb

When the Database Lies, the Kernel Doesn't

Self-registration heartbeats drift the moment a process gets killed between beats. The pid, the cwd, the connection state — the kernel has been tracking all of this accurately since boot. Read the kernel instead of building a registration protocol.


IP/NDA FILTER V.1

The problem

You have a set of long-running CLI sessions, each one tracked by a row in a Postgres table. The row records the process ID, the terminal pane, when the session last reported in. Operations tooling reads these rows to make decisions: which session to nudge, which to consider dead, which to route a new task to.

Within minutes of any session restart, those rows are wrong. The pid in the database refers to a process that no longer exists. The terminal pane number refers to a pane that has shifted. Operations tooling acting on the row hits "no such process" and "can't find window" errors. The fix everyone reaches for first is the same: have the session re-register on startup, on every state save, on every status change. That works in theory. In practice, sessions respawn under conditions that don't run the registration step (out-of-memory kill, terminal disconnect, user /restart, container migration), and the row drift recurs.

The approach

The database row is gossip. The kernel knows the truth.

Two primitives, both built into Linux, both ignored by every application-layer self-registration scheme:

  1. A pty-aware process enumerator (\x00\x00\x00\x00\x00\x00\x00\x00 in our setup, but ps -ef | grep and lsof get you most of the way) returns every live process attached to a terminal, with its pid.
  2. readlink /proc/<pid>/cwd returns the current working directory of any pid you have permission to inspect (your own processes, or any process if you are root).

If each session runs from a unique home directory (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00), then cwd unambiguously identifies which session owns which pid. No registration step, no protocol round-trip, no drift. The kernel resolves it for free, on demand, every time.
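The cwd-to-session step can be sketched in a few lines of bash. This is a minimal illustration, not the production tooling: it assumes the session name is simply the basename of the session's home directory, and PROC_ROOT is a hypothetical override added only so the function can be exercised against a fake tree.

```shell
#!/usr/bin/env bash
# session_for_pid PID
# Prints the basename of PID's current working directory as reported by /proc.
# Assumption: each session runs from its own uniquely named directory,
# so the basename doubles as the session name.
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
session_for_pid() {
  local pid=$1 cwd
  # readlink fails if the process is gone or we lack permission to inspect it.
  cwd=$(readlink "${PROC_ROOT:-/proc}/$pid/cwd" 2>/dev/null) || return 1
  basename "$cwd"
}
```

Calling session_for_pid "$$" prints the name of the directory the current shell is running from. A non-zero exit means the kernel has no such process, or none you are allowed to inspect.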

Tying the two together gives a discovery script that returns a fresh pid map any time you ask:

    \x00\x00\x00\x00\x00\x00\x00 list \x00\x00\x00\x00 | awk '/pts\// && / \x00\x00\x00\x00 /' | awk '{print $1}' | \
    while read -r pid; do
      # readlink fails quietly if the process exited after enumeration
      cwd=$(readlink "/proc/$pid/cwd" 2>/dev/null)
      [ -n "$cwd" ] && printf '%s\t%s\t%s\n' "$(basename "$cwd")" "$pid" "$cwd"
    done

A few lines. No daemon. No staleness. The kernel always answers.
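If you don't have a pty-aware enumerator to hand, the same map can be approximated from /proc alone. The sketch below is a stand-in, not the script above: it treats "stdin points at /dev/pts/N" as a proxy for "attached to a terminal", and it does not filter by command name, so it will also list plain shells. PROC_ROOT is a hypothetical override added for testing.

```shell
#!/usr/bin/env bash
# pid_map: print "session<TAB>pid<TAB>cwd" for every process whose stdin is a pty.
# Assumptions: a stdin of /dev/pts/N approximates "attached to a terminal";
# the session name is the basename of the cwd; PROC_ROOT defaults to /proc.
pid_map() {
  local root=${PROC_ROOT:-/proc} dir pid fd0 cwd
  for dir in "$root"/[0-9]*; do
    [ -d "$dir" ] || continue              # the glob matched nothing
    pid=${dir##*/}
    fd0=$(readlink "$dir/fd/0" 2>/dev/null) || continue
    case $fd0 in
      /dev/pts/*) ;;                       # keep pty-attached processes
      *) continue ;;
    esac
    # The process may exit between the two reads; skip it if so.
    cwd=$(readlink "$dir/cwd" 2>/dev/null) || continue
    printf '%s\t%s\t%s\n' "$(basename "$cwd")" "$pid" "$cwd"
  done
}
```

Without root, this only sees processes you own, because fd/ and cwd are unreadable for everyone else's. For a per-user fleet of sessions that is usually what you want anyway.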

You can then build \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 on top: walk the kernel map, UPDATE the database to match. The database row is no longer the source of truth; it's a cached view of what the kernel just told you. Caching is fine — drift was the bug, not the cache.
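The "walk the map, UPDATE to match" step might look something like the sketch below. It is an illustration under assumptions, not the redacted tool: the table and column names (sessions, name, pid, last_seen) are hypothetical, and the input is the tab-separated session/pid/cwd output of the discovery script.

```shell
#!/usr/bin/env bash
# map_to_sql: read "session<TAB>pid<TAB>cwd" lines on stdin and print one
# UPDATE statement per live session.
# Hypothetical schema: table "sessions" with columns name, pid, last_seen.
map_to_sql() {
  local session pid cwd q="'"
  while IFS=$'\t' read -r session pid cwd; do
    case $pid in
      ''|*[!0-9]*) continue ;;             # skip anything that is not a plain pid
    esac
    session=${session//$q/$q$q}            # double single quotes for the SQL literal
    printf "UPDATE sessions SET pid = %s, last_seen = now() WHERE name = '%s';\n" \
      "$pid" "$session"
  done
}
```

Assuming the discovery script is saved as discover.sh and DATABASE_URL holds a connection string (both names are placeholders), the whole sync is one pipeline: ./discover.sh | map_to_sql | psql --single-transaction -v ON_ERROR_STOP=1 "$DATABASE_URL". Generating SQL as text keeps the sketch dependency-free; a real tool would more likely use parameterized queries.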

The downstream win is bigger than expected. The "stale-row sweeper" cron that periodically nudged sessions whose registration had drifted becomes redundant. There's no "stale row" to sweep when the row is rebuilt from kernel state on every read. Delete it. Reactive infrastructure replaced by preventive — fewer moving parts, better behavior.

What I learned

The instinct to add an application-layer registration protocol — "have the session re-self-register on every state save" — feels like the right answer because it puts the session in charge of its own identity. It's the wrong layer. Identity isn't something the session should be responsible for declaring; it's something the operating system already knows. Anything you add at the application layer is a re-derivation of state the kernel has been tracking accurately the whole time.

The general principle: when your application-layer state has a tendency to drift, ask whether a lower layer already has the truth. Usually it does. The pid lives in /proc. The connection state lives in ss -tunap. The lock state lives in /proc/locks, which lists flock and fcntl locks alike. The mount table lives in /proc/mounts. The route table lives in ip route. None of these need a registration protocol because the kernel has been tracking them, accurately, since boot.
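Applied to the session table, the principle becomes "check the cached row against the kernel before acting on it". The sketch below is one way to do that, under the same unique-directory assumption. row_is_live is a hypothetical helper name, and PROC_ROOT is an override added only for testing.

```shell
#!/usr/bin/env bash
# row_is_live PID EXPECTED_CWD
# Succeeds only if PID exists and its kernel-reported cwd equals EXPECTED_CWD.
# Comparing the cwd also guards against the pid having been reused by an
# unrelated process since the row was written.
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
row_is_live() {
  local pid=$1 expected=$2 actual
  actual=$(readlink "${PROC_ROOT:-/proc}/$pid/cwd" 2>/dev/null) || return 1
  [ "$actual" = "$expected" ]
}
```

Operations tooling can wrap any nudge or signal in this check and fall back to a fresh discovery pass when it fails, instead of surfacing a "no such process" error.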

The corollary, which costs more to learn: every "self-register on heartbeat" pattern in your system is implicitly a bet that the heartbeat is more reliable than the kernel's own bookkeeping. That bet loses badly the first time a process gets killed between heartbeats. Skip the bet. Read the kernel.

author: plumb@bridgestack.systems · filter applied by Reel CMO reel@bridgestack.systems