
What goes wrong with autonomous coding agents in production

Six failure patterns we encoded as recovery logic — and why they are hard to see in a demo.

Autonomous coding agents demo beautifully. A clean repository, a well-scoped issue, a model in a good mood, and a 90-second video where the agent produces a clean diff and a passing test. The pitch lands every time.

Production is where the picture changes. IDE copilots, task-to-PR bots, and pipeline-style agents all run into the same set of structural problems once they’re operating against real codebases over real time horizons. The gap between “an agent wrote this in a sandbox” and “this shipped to main with cost, audit, and merge discipline” is where teams get burned.

We catalogued the failure modes as we hit them. Six patterns show up over and over, regardless of which model or framework is underneath. Each one is an engineering problem with an engineering solution. This post walks through them, and how Colony encodes the recovery in the pipeline rather than relying on the model to get it right.

Ephemeral runs

Task runners spin up, produce a PR, and disappear. The next run starts cold. No accumulated context about the codebase, no awareness of prior failures, no continuity across attempts.

A concrete example: an issue fails its first development pass because the agent misreads the acceptance criteria. The task is re-triggered. The new run has no memory of the first attempt, so it makes the same misread, produces a near-identical broken PR, and the team is paying twice for the same wrong answer. Run this a few times across a few issues and the cost line on the dashboard climbs without anything reaching main.

Colony’s pipeline is resident, not ephemeral. State lives in Postgres, not in the agent process. Every prior attempt — surveyor plans, builder transcripts, inspector verdicts, retrospectives — is attached to the issue and available to subsequent runs. When an issue re-enters the pipeline, the Mayor and the executors read the full history. Recovery paths know what was already tried.
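As a rough illustration of what "reading the full history" can mean for a re-triggered run, the sketch below turns prior attempts into context for the next one. The Attempt fields and the build_run_context helper are hypothetical, not Colony's schema or API, and an in-memory list stands in for rows that would come from Postgres.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One prior pass over an issue (hypothetical fields)."""
    number: int
    agent: str      # e.g. "surveyor", "builder", "inspector"
    outcome: str    # e.g. "failed", "changes_requested"
    summary: str


def build_run_context(issue_title, attempts):
    """Render prior attempts as text that a new run can be given up front."""
    lines = [f"Issue: {issue_title}"]
    if not attempts:
        lines.append("No prior attempts.")
        return "\n".join(lines)
    lines.append("Prior attempts (oldest first):")
    for a in sorted(attempts, key=lambda a: a.number):
        lines.append(f"  {a.number}. [{a.agent}] {a.outcome}: {a.summary}")
    return "\n".join(lines)


history = [
    Attempt(2, "inspector", "changes_requested", "acceptance criterion 3 not met"),
    Attempt(1, "builder", "failed", "misread acceptance criteria"),
]
print(build_run_context("Add CSV export", history))
```

The point of the design is that the second run starts from the first run's misread instead of repeating it.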

Unattributed spend

Agent spend is becoming a CFO problem. Most platforms report a monthly total. You can’t answer “what did this specific issue cost” or “which agent burned the budget” or “which repository is responsible for last week’s overrun.”

A concrete example: a single epic gets decomposed by the planner into sub-issues. One of those sub-issues hits a transient failure pattern and the developer retries on it nine times before a circuit breaker fires. At the end of the day, the dashboard says “you spent $X this week” — but the team can’t tell that 60% of it went to one stuck sub-issue, because the cost is rolled up at the org level.

Colony emits a cost_event row to the Pipeline Store for every executor invocation (analyzer, developer, reviewer, merger, planner), recording turn count, token usage, timestamp, and agent name. Cost is attributable per issue, per agent, per repo, and per tenant. Budget enforcement happens at task dequeue time in the worker, not during execution, so a per-issue ceiling stops runaway spend before the next task starts.
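A minimal sketch of a per-issue ceiling checked at dequeue time, under stated assumptions: the CostEvent shape follows the fields named above, but the cost_cents field, the function names, and the in-memory queue are illustrative stand-ins rather than Colony's actual worker code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEvent:
    issue_id: int
    agent: str        # analyzer, developer, reviewer, merger, planner
    turns: int
    tokens: int
    cost_cents: int   # assumed field: spend already priced in cents


def spent_cents(events, issue_id):
    """Total recorded spend for one issue."""
    return sum(e.cost_cents for e in events if e.issue_id == issue_id)


def dequeue_next(pending, events, ceiling_cents):
    """Pop issue ids until one is under its ceiling.

    Returns (issue_id or None, ids blocked for being at or over budget).
    The check runs before a task starts, never in the middle of one.
    """
    blocked = []
    while pending:
        issue_id = pending.popleft()
        if spent_cents(events, issue_id) >= ceiling_cents:
            blocked.append(issue_id)
            continue
        return issue_id, blocked
    return None, blocked


events = [
    CostEvent(1, "developer", 12, 9000, 300),
    CostEvent(1, "developer", 10, 7000, 200),
    CostEvent(2, "analyzer", 3, 1200, 100),
]
print(dequeue_next(deque([1, 2]), events, ceiling_cents=300))
```

Because each event carries its issue and agent, the same rows that enforce the ceiling also answer "what did this issue cost" without a separate ledger.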

No merge discipline

Someone still has to read every PR, run every check, and decide when it’s safe to merge. The “autonomy” stops at the review gate — the most expensive bottleneck in the pipeline. The agent produced the diff, but a person still owns the merge button.

A concrete example: a team adopts a task-to-PR bot. It cheerfully produces PRs. Three weeks in, the engineering lead realizes she’s now reviewing more PRs than before, because the agent files at a rate humans don’t match. The agent is a productivity tool only if her review throughput goes up. It hasn’t. She rolls the experiment back.

Colony’s Inspector is a pipeline agent, not a comment. It runs deterministic checks (build, tests, lint, types), then an LLM review against the issue’s acceptance criteria, and emits a structured verdict: APPROVE, CHANGES_REQUESTED, or PARTIAL with action items. Approved PRs advance to the Marshal, which handles rebase, conflict resolution with confidence scoring, and the merge itself. Humans review only the PRs that the pipeline flags as needs-human, which is a fraction of the total rather than all of them.
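One way to picture the structured verdict is as a small typed value that the rest of the pipeline can branch on. The sketch below is illustrative only: the three verdict names come from the description above, while the Review type, the short-circuit on failed checks, and the "builder" return path for non-approved PRs are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    APPROVE = "APPROVE"
    CHANGES_REQUESTED = "CHANGES_REQUESTED"
    PARTIAL = "PARTIAL"


@dataclass
class Review:
    verdict: Verdict
    action_items: list = field(default_factory=list)


def inspect(check_results, llm_review):
    """Deterministic checks first; consult the LLM only if they all pass.

    check_results maps a check name (build, tests, lint, types) to pass/fail.
    llm_review is a zero-argument callable returning a Review.
    """
    failed = [name for name, ok in check_results.items() if not ok]
    if failed:
        items = [f"fix failing check: {name}" for name in failed]
        return Review(Verdict.CHANGES_REQUESTED, items)
    return llm_review()


def next_stage(review):
    """Approved PRs go to the Marshal; anything else goes back for more work."""
    return "marshal" if review.verdict is Verdict.APPROVE else "builder"


checks = {"build": True, "tests": False, "lint": True, "types": True}
review = inspect(checks, llm_review=lambda: Review(Verdict.APPROVE))
print(review.verdict.value, review.action_items, next_stage(review))
```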

Opaque pipelines

When a PR goes sideways, there is no trail connecting it back to the issue, the decisions, the costs, and the states it passed through. Debugging agent behavior becomes archaeology — read the diff, read the comments, guess at the prompt, give up.

A concrete example: an agent merges a change that breaks an unrelated test the next day. The team wants to understand why the agent thought this change was safe. The answer requires reconstructing the original analyzer plan, the developer’s intermediate decisions, the reviewer’s verdict, and the merger’s rebase resolution. Most of that wasn’t recorded. The team is left with the diff and a guess.

Colony records every state transition in the state_transitions table. Every agent run is in agent_runs. Every cost event is in cost_events. Every dependency edge is in issue_dependencies. The dashboard renders these as a timeline per issue: which agent did what, when, at what cost, with what output, leading to what state. Postgres is the authority — GitHub labels are a projection. When something goes wrong, the question “what happened” has an answer in SQL.
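To make "an answer in SQL" concrete, here is a hedged sketch of a per-issue timeline query. The table names state_transitions and cost_events come from the text; every column name, state name, and sample row is invented for illustration, and an in-memory SQLite database stands in for Postgres so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state_transitions (
    issue_id INTEGER, at TEXT, agent TEXT, from_state TEXT, to_state TEXT
);
CREATE TABLE cost_events (
    issue_id INTEGER, at TEXT, agent TEXT, turns INTEGER, tokens INTEGER
);
""")
# Made-up sample rows for two issues.
conn.executemany(
    "INSERT INTO state_transitions VALUES (?, ?, ?, ?, ?)",
    [
        (42, "2024-01-01T10:00:00Z", "analyzer", "queued", "planned"),
        (42, "2024-01-01T10:20:00Z", "developer", "planned", "in_review"),
        (7, "2024-01-01T10:05:00Z", "analyzer", "queued", "planned"),
    ],
)
conn.executemany(
    "INSERT INTO cost_events VALUES (?, ?, ?, ?, ?)",
    [
        (42, "2024-01-01T10:00:30Z", "analyzer", 3, 1200),
        (42, "2024-01-01T10:19:00Z", "developer", 14, 9800),
    ],
)

# Interleave transitions and cost events for one issue, oldest first.
TIMELINE_SQL = """
SELECT at, agent, 'transition' AS kind,
       from_state || ' -> ' || to_state AS detail
FROM state_transitions WHERE issue_id = :issue
UNION ALL
SELECT at, agent, 'cost' AS kind,
       turns || ' turns, ' || tokens || ' tokens' AS detail
FROM cost_events WHERE issue_id = :issue
ORDER BY at
"""


def issue_timeline(conn, issue_id):
    return conn.execute(TIMELINE_SQL, {"issue": issue_id}).fetchall()


for row in issue_timeline(conn, 42):
    print(*row, sep=" | ")
```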

Brittle failure modes

Agents that worked in demos break on real repositories: rebase conflicts, flaky CI, fabricated evidence, stale worktrees. Every one is an unplanned rescue. The platform doesn’t help you understand the failure; it hands you a stuck issue.

Concrete examples from our own production learnings:

  • A developer agent claimed review items were addressed using invented file contents — it generated plausible-looking code that didn’t exist on disk. Fixed by requiring file-verified evidence: every claim of “addressed” carries a SHA-checked manifest of the actual file state.
  • A Claude subprocess hung indefinitely with no stdout or stderr, freezing the worker. Fixed with dual timeouts: a wall-clock cap plus an inactivity timeout that kills the process if no data events arrive for N seconds.
  • Workers sharing a host-mounted .git/ directory hit binary corruption in .git/config from concurrent worktree writes. Fixed by giving each worker its own clone.
  • The Inspector’s LLM occasionally returned unparseable output — truncated JSON, empty verdicts, natural-language fallbacks. Retrying produced the same garbage. Fixed by escalating to human-review-ready instead of retrying into the same parse failure.

None of these fixes is speculative. Each one is recovery logic that the pipeline runs on every issue today.
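The dual-timeout fix for hung subprocesses can be sketched with the Python standard library. This is a minimal illustration of the technique, not Colony's worker code: the function name, the return values, and the choice to treat each output line as a "data event" are assumptions.

```python
import queue
import subprocess
import sys
import threading
import time


def run_with_dual_timeout(cmd, wall_clock_s, inactivity_s):
    """Run cmd; kill it on total-runtime overrun or on output silence.

    Returns (reason, output), where reason is "exited",
    "wall_clock_timeout", or "inactivity_timeout".
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    events = queue.Queue()

    def pump():
        # Forward each output line as a data event; None marks end of stream.
        for line in proc.stdout:
            events.put(line)
        events.put(None)

    threading.Thread(target=pump, daemon=True).start()

    deadline = time.monotonic() + wall_clock_s
    output = []
    reason = "exited"
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            reason = "wall_clock_timeout"
            break
        try:
            # Wait for the next event, but never past either limit.
            event = events.get(timeout=min(inactivity_s, remaining))
        except queue.Empty:
            # Whichever limit was shorter is the one that just expired.
            reason = ("inactivity_timeout" if inactivity_s <= remaining
                      else "wall_clock_timeout")
            break
        if event is None:
            break
        output.append(event)

    if reason == "exited":
        # The stream closed; the process still has to finish before the deadline.
        try:
            proc.wait(timeout=max(0.0, deadline - time.monotonic()))
        except subprocess.TimeoutExpired:
            reason = "wall_clock_timeout"
    if reason != "exited":
        proc.kill()
    proc.wait()
    return reason, b"".join(output)
```

A process that sleeps silently is stopped by the inactivity limit long before the wall-clock cap, while a process that keeps printing forever is still stopped by the cap. Neither limit alone covers both cases.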

Siloed integrations

IDE copilot in one window. Task-to-PR bot in another tab. Pipeline orchestration in a third system. No shared state, no unified cost ledger, no coherent audit trail. The team is operating three different tools that each think they own the issue lifecycle, and the truth lives nowhere.

A concrete example: an issue gets picked up by an automation that opens a PR. The IDE copilot is then used by an engineer to fix something the automation got wrong. The PR is merged manually. The orchestration system never recorded the merge because it was waiting on the automation, not on the human. Three weeks later, when the team is auditing what shipped this quarter and at what cost, the issue is invisible to one tool and rolled up incorrectly in another.

Colony is a single source of truth for the issue lifecycle, intake to merge to monitoring. Slash commands (/colony:retry, /colony:state, /colony:reimplement) let operators control the pipeline from inside GitHub without bypassing the state machine. Webhooks fan in from GitHub events into the same Postgres store. Cost, state, dependencies, and audit trail share one schema. IDE copilots are complementary — Colony doesn’t replace interactive coding — but everything autonomous lives in one place.
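As a sketch of how slash commands might be picked out of a GitHub comment before being handed to the state machine: the three command names are the ones listed above, while the parsing rules (one command per line, at the start of the line, unknown names ignored) and the parse_commands helper are assumptions for illustration.

```python
import re

# Commands named in this post; anything else is ignored rather than guessed at.
KNOWN_COMMANDS = {"retry", "state", "reimplement"}

# A command must be alone on its line, e.g. "/colony:retry".
COMMAND_RE = re.compile(r"^/colony:([a-z]+)[ \t\r]*$", re.MULTILINE)


def parse_commands(comment_body):
    """Return the known /colony: commands found in a comment, in order."""
    return [
        match.group(1)
        for match in COMMAND_RE.finditer(comment_body)
        if match.group(1) in KNOWN_COMMANDS
    ]


print(parse_commands("Looks stuck on CI.\n/colony:retry\n"))
```

Keeping the parser strict means a comment that merely mentions a command in passing does not trigger a transition.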

These are engineering problems

The pattern in all six is the same. The model is not the problem. The model is fine. The problem is the absence of the layer above the model — the orchestration, the state, the budget, the recovery — that makes any production system actually work.

This is not “AI bad.” We use these models every day. They’re useful. They are also non-deterministic processes operating against eventually-consistent external systems with their own non-deterministic behavior. The engineering response is the same as it would be for any other system with those properties: explicit state, atomic transitions, circuit breakers, retries with classification, observability, cost accounting, and graceful escalation. Colony is what those practices look like when they’re applied to autonomous coding agents.

If your team is operating coding agents in production and any of the failure modes above are familiar, we’d like to talk. The pilot conversation is open. The pipeline is at github.com/RunColony/colony.

