background agents are a larp
if you've been trying to manage 20 claude code sessions in tmux, or using cursor's background agents — i have bad news for you: you're probably doing yourself more harm than good.
there are a few names that people are giving to this idea right now:
- background agents
- ambient agents
- async agents
i like to think of them as landing somewhere on Karpathy's autonomy slider:

- partial autonomy: human observes partial outputs → intervenes early → constrains trajectory
- fuller autonomy: agent commits to a trajectory → human evaluates only after substantial work exists (work continues without waiting for feedback or results)
thinking in terms of the degree of autonomy is clearer than "background agents", because that label mostly just tells you the agent runs in the cloud. for the purposes of this discussion, only the degree of autonomy matters.
the core distinction isn't where the agent runs but rather the feedback topology. in other words, how tight is the feedback loop on decision making?
moving right means batching more decisions before a human corrects course. that increases both the delay and the amount of output you have to verify and unwind if the early assumptions are wrong.
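
to make the topology concrete, here's a minimal sketch in python. `agent_step` and `human_review` are hypothetical stand-ins, not any particular tool's API; the only knob that changes as you slide toward fuller autonomy is how many decisions pile up between checkpoints.

```python
# minimal sketch of the two feedback topologies.
# `agent_step` and `human_review` are hypothetical stand-ins, not a real framework's API.

def agent_step(decisions, feedback):
    """produce the next decision, conditioned on everything so far."""
    return decisions + [f"decision {len(decisions)} (steered by: {feedback})"]

def human_review(batch):
    """human checkpoint: cost and error rate grow with the size of `batch`."""
    return f"feedback on {len(batch)} decisions"

def run(total_steps, review_every):
    """review_every=1 ~ partial autonomy; review_every=total_steps ~ background agent."""
    decisions, feedback, pending = [], "initial spec", 0
    for _ in range(total_steps):
        decisions = agent_step(decisions, feedback)
        pending += 1
        if pending == review_every:
            feedback = human_review(decisions[-pending:])  # course-correct here
            pending = 0
    return decisions

tight = run(total_steps=20, review_every=1)     # human prunes bad branches immediately
batched = run(total_steps=20, review_every=20)  # human sees everything only at the end
```

the agent loop is identical in both runs; the only difference is how stale `feedback` is allowed to get before the next decision is conditioned on it.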
why delayed feedback compounds errors
today, agents make non-trivial mistakes with meaningful probability. early assumptions are often wrong or underspecified, and downstream code depends on them.
in this world, background work is actively harmful. why?
when feedback is delayed:
- the agent continues generating code conditioned on its own incorrect assumptions.
- wrong decisions accrete dependent structure: APIs, data models, etc.
- by the time a human looks, the result is merely delayed confusion.
agents are highly sensitive to initial specifications, so small early ambiguities are amplified over long sequences of actions, making outcomes effectively unpredictable and late correction disproportionately expensive.
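
a toy illustration of that amplification (the error rate and counts here are assumptions for illustration, not measurements of any model): each decision builds on the ones before it, so the longer a wrong root assumption survives, the more dependent work has to be unwound.

```python
import random

# toy amplification model: decision i builds on all earlier decisions, so a wrong
# early assumption taints everything after it until a human review catches it.
# the error rate and counts are illustrative assumptions, not measurements.

def decisions_to_unwind(n_decisions, p_wrong, review_every, seed=0):
    rng = random.Random(seed)
    bad_root = None   # index of the first uncaught wrong decision
    unwound = 0
    for i in range(n_decisions):
        if bad_root is None and rng.random() < p_wrong:
            bad_root = i                      # an ambiguity slips in
        if (i + 1) % review_every == 0 and bad_root is not None:
            unwound += (i + 1) - bad_root     # everything built on the bad root
            bad_root = None                   # human resets the trajectory
    return unwound

for interval in (1, 5, 50):
    n = decisions_to_unwind(n_decisions=50, p_wrong=0.05, review_every=interval)
    print(f"review every {interval:>2} decisions -> ~{n} dependent decisions to unwind")
```

the first wrong assumption lands at the same step in every run; what changes is how much dependent structure gets built on top of it before a review happens.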
but you might think: “If the agent goes down a wrong path, I’ll just reject the PR and rerun it.”
that only works if rejection is stateless and obvious. but in order to confidently reject, the human must figure out:
- Is this wrong, or just unfamiliar?
- Is it wrong at the root, or salvageable?
- Is it misaligned with the spec, or did the spec change?
and in order to do that, the human must:
- reload context into memory and recall the constraints + initial intent/request (remember, we are now at a later point in time)
- reconstruct the agent’s intent
- diagnose why the approach is wrong (vs merely unfamiliar)
- decide it’s irrecoverable rather than salvageable
it's a mistake to think that you can manage 10-20 agents in realtime and multitask across them effectively. a quick survey of the literature bears this out: comparative studies consistently find that multitasking increases error rates and memory lapses for cognitive work — people get slower, make more mistakes, and lose context compared to doing the same work sequentially. so while you might feel like a god-tier programming wizard managing 15 tmux sessions, Gandalf is unlikely to be impressed.
a simple cost model
call this diagnosis cost. formally, if:

- $p$ is the probability of a bad trajectory (to connect this to per-step error: for a task requiring $n$ meaningful decisions with per-step error rate $\varepsilon$, a simple model gives $p = 1 - (1 - \varepsilon)^n$, so longer autonomous runs raise the chance of going off-trajectory even when $\varepsilon$ is low),
- $d$ is the delay before human feedback,
- $k$ is the amount of accumulated output / number of decisions the human has to verify,
- $C(d, k)$ is the human cost of diagnosing and rejecting after delay $d$ with review load $k$,

then the expected human cost from wrong paths is:

$$\mathbb{E}[\text{cost}] = p \cdot C(d, k)$$

with $C$ non-decreasing in both arguments. in particular, as $k$ grows (batch size / review load), decision fatigue makes review slower + more error-prone.
background agents increase $d$ and $k$ (and with them the expected cost), while tight feedback loops minimize it because they prune bad branches early. in a tight loop, the human remembers what they just asked for because the mental state is warm, so rejection is fast + local — the cognitive overhead of context switching is limited.
in other words, background agents only become useful when they can complete an end-to-end trajectory with a low enough probability of “trajectory-level” failure that the expected cost of delayed correction is small.
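
here's a back-of-the-envelope version of the model in python. the specific functional form of $C(d, k)$ below is an assumption (the argument only needs $C$ to be non-decreasing in both arguments), and the numbers are illustrative.

```python
# back-of-the-envelope version of the cost model above. the form of C(d, k)
# is an assumption; the argument only needs C non-decreasing in d and k.

def p_bad_trajectory(eps, n):
    """p = 1 - (1 - eps)^n: chance that at least one of n decisions goes wrong."""
    return 1 - (1 - eps) ** n

def diagnosis_cost(d, k):
    """assumed form: linear in delay d, superlinear in review load k
    (decision fatigue makes large batches disproportionately expensive)."""
    return (1 + d) * k ** 1.5

def expected_cost(eps, n, d, k):
    return p_bad_trajectory(eps, n) * diagnosis_cost(d, k)

# tight loop: 50 decisions, feedback after each one (d ~ 0, k = 1), so 50 review cycles
tight = 50 * expected_cost(eps=0.02, n=1, d=0, k=1)

# background run: the same 50 decisions reviewed once, hours later
background = expected_cost(eps=0.02, n=50, d=4, k=50)

print(f"tight loop     : {tight:8.1f}")       # ~1.0
print(f"background run : {background:8.1f}")  # ~1124
```

under these assumed numbers the per-decision error rate is identical in both runs; all of the difference comes from $d$, $k$, and the compounding of $\varepsilon$ over an unchecked trajectory.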
until models cross the reliability threshold, the winning workflow is boring:
- keep the loop tight for anything with unclear requirements or high leverage decisions.
- steer aggressively, constantly re-referencing the specs + agreed-upon plans.
- invest in automation that makes mistakes cheap to detect: tests, linters, type checks, evals.
the practical rule is simple: until agents are end-to-end reliable, keep autonomy proportional to how cheap it is to detect + undo mistakes.
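
as a concrete example of "cheap to detect": a small gate you (or the agent) run after every small batch of changes, so a wrong assumption fails loudly before more work is stacked on it. the specific commands are placeholders; substitute whatever checks your repo already has.

```python
# minimal "cheap to detect" gate, run after each small batch of agent changes.
# the commands are placeholders; swap in whatever your repo already uses.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],   # lint
    ["mypy", "."],            # type check
    ["pytest", "-q", "-x"],   # tests, stop at the first failure
]

def gate() -> bool:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print(f"gate failed at: {' '.join(cmd)}", file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if gate() else 1)
```

the cheaper this gate is to run, the further right on the autonomy slider you can afford to sit.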
stay tuned
of course, this will all change very soon. models are being unhobbled at a shocking rate, and we'll have new problems to consider when working in this new mode of autonomy.
more on that to come...