Production learnings

The failure catalogue.

Colony has been running on its own development work long enough to collect the part demos skip: what broke, how it failed, and what recovery logic shipped because of it.

8 Public categories
19 Active failure classes
5 Resolved or mitigated

This is a curated public version of Colony’s internal production learnings document. The internal log stays denser; this page keeps the lessons readable and safe to publish.

Operating principle

Recovery belongs in the pipeline, not in the prompt.

The recurring pattern is not that models are bad. The pattern is that autonomous software work touches non-deterministic models, eventually consistent APIs, mutable git state, flaky subprocesses, budget ceilings, and human-defined merge rules. Colony encodes those failure modes as state, checks, circuit breakers, and escalation paths.

Catalogue

What broke. What shipped.

Each group below is a public summary of production incidents and design changes from Colony operating on real repositories.

Trust boundary

LLM trust boundaries

Every place an agent makes a factual claim needs a verifier. The model can propose, summarize, and judge, but the pipeline checks the filesystem, branch state, and structured output before accepting the claim.

Active

Agents fabricate evidence when cornered

What broke: Builder and Inspector agents produced plausible claims about files, exports, and review fixes that did not match the actual branch.

Recovery shipped: Colony requires file-verified evidence, reconstructs manifests on pickup, and cross-checks Inspector claims against actual branch state.
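
A minimal sketch of the verifier idea, not Colony's actual code: each agent claim is checked against the working tree before it is accepted. The claim shape (`path`, `contains`) and the function name `verify_claims` are assumptions made for illustration.

```python
from pathlib import Path


def verify_claims(repo_root, claims):
    """Return the claims that the filesystem does not support.

    Each claim is a dict with a "path" and an optional "contains" string
    (hypothetical shape, chosen for this sketch).
    """
    root = Path(repo_root)
    rejected = []
    for claim in claims:
        path = root / claim["path"]
        if not path.is_file():
            rejected.append({**claim, "reason": "file missing"})
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        expected = claim.get("contains")
        if expected and expected not in text:
            rejected.append({**claim, "reason": "expected text not found"})
    return rejected
```

An empty result means every claim matched the branch as checked out; anything else is evidence the pipeline can act on without asking the model again.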

Active

LLM parse failures must escalate, not loop

What broke: Empty verdicts, truncated JSON, and natural-language fallbacks blocked merge or sent the Builder into no-op retries.

Recovery shipped: The Inspector retries once with explicit format instructions, then escalates to human review when deterministic checks have passed.
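
The retry-once-then-escalate rule can be sketched as below. The verdict schema, the `ask_inspector` callable, and the "blocked" outcome when deterministic checks have not passed are all assumptions; the page only states the retry and the escalation.

```python
import json

VALID_VERDICTS = {"pass", "fail"}  # hypothetical verdict values


def parse_verdict(raw):
    """Return the parsed verdict dict, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None  # empty, truncated, or natural-language output
    if not isinstance(data, dict) or data.get("verdict") not in VALID_VERDICTS:
        return None
    return data


def review(ask_inspector, deterministic_checks_passed):
    """Ask once, retry once with strict format instructions, then stop."""
    verdict = parse_verdict(ask_inspector(strict_format=False))
    if verdict is None:
        verdict = parse_verdict(ask_inspector(strict_format=True))
    if verdict is not None:
        return {"action": "accept", "verdict": verdict["verdict"]}
    if deterministic_checks_passed:
        return {"action": "needs-human"}
    return {"action": "blocked"}  # assumption: no escalation without passing checks
```

The point of the shape is that there is no third model call: a second parse failure always leaves the loop.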

Active

Prompt paths need explicit case coverage

What broke: Instructions that covered only obvious verdict paths caused the Inspector or Builder to skip newly discovered defects.

Recovery shipped: Prompts now name the case combinations directly: new findings, prior findings, passing priors, and acceptance-criteria mismatches.

Progress detection

Termination and circuit breakers

Autonomous work needs a way to stop that is tied to progress, not hope. Colony treats every retry path as a place where spend, comments, and state can run away.

Resolved

Hard turn limits caused false failures

What broke: Fixed iteration caps stopped useful work early and misclassified tasks that were still making progress.

Recovery shipped: Progress-based termination became the primary signal, with turn limits retained as a backstop and post-run validation.

Active

Looping failure paths need circuit breakers

What broke: Conflict, build, and blocked-retry loops repeated without converging, sometimes posting the same diagnosis over and over.

Recovery shipped: Colony records error signatures, counts repeated failures, blocks or escalates after thresholds, and preserves manual-unblock paths.
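
One way to picture the signature-and-threshold mechanism, as a sketch under stated assumptions: the normalization rule (lowercase, digits collapsed), the threshold of 3, and the class name `CircuitBreaker` are illustrative choices, not Colony's.

```python
import hashlib
import re
from collections import Counter


def error_signature(message):
    """Collapse volatile details (here: digits and case) into a stable key."""
    normalized = re.sub(r"\d+", "N", message.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, issue_id, message):
        """Count one failure; return "block" once the same signature repeats enough."""
        key = (issue_id, error_signature(message))
        self.counts[key] += 1
        return "block" if self.counts[key] >= self.threshold else "retry"

    def manual_unblock(self, issue_id):
        """Forget this issue's failure history so a human can restart it."""
        for key in [k for k in self.counts if k[0] == issue_id]:
            del self.counts[key]
```

Keying on the signature rather than on the raw message is what lets the breaker trip on "the same failure again" even when line numbers or timestamps differ between attempts.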

Active

Retry strategy depends on failure class

What broke: Transient API overloads, permanent build errors, and dependency conflicts were too easy to treat as the same kind of failure.

Recovery shipped: Executor results carry failure classes so transient work can back off and retry while permanent failures block with diagnostics.
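
A small sketch of class-dependent retry handling. The two-member enum, the exponential backoff schedule, and the attempt cap are assumptions for illustration; the page does not describe Colony's actual classes or delays.

```python
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"  # e.g. an API overload
    PERMANENT = "permanent"  # e.g. a build error that will not fix itself


def next_step(failure_class, attempt, base_delay=2.0, max_delay=60.0, max_attempts=5):
    """Return ("retry", delay_seconds) or ("block", None).

    `attempt` is the zero-based count of retries already made.
    """
    if failure_class is FailureClass.PERMANENT or attempt >= max_attempts:
        return ("block", None)
    delay = min(max_delay, base_delay * (2 ** attempt))
    return ("retry", delay)
```

Even transient failures block eventually in this sketch, so a failure that was misclassified as transient still terminates.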

Workers

Subprocess and workspace isolation

The agent process is only one part of the system. Long-running subprocesses, git state, and monorepo caches have their own failure modes, and they need infrastructure-level containment.

Active

Long-running subprocesses need dual timeouts

What broke: Claude Code subprocesses could hang indefinitely with no stdout or stderr, freezing the worker without producing a useful result.

Recovery shipped: Workers enforce both wall-clock and inactivity timeouts, preserving productive long runs while killing silent hangs.
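
The dual-timeout idea can be sketched with the standard library alone. The function name, the result shape, and the polling interval are assumptions; this is an illustration of the technique, not Colony's worker.

```python
import subprocess
import sys
import threading
import time


def run_with_dual_timeout(cmd, wall_clock_s, inactivity_s):
    """Run cmd; kill it if it runs too long overall or goes silent too long."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    last_output = [time.monotonic()]  # list so the reader thread can update it

    def pump():
        for line in proc.stdout:
            lines.append(line)
            last_output[0] = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    started = time.monotonic()
    reason = None
    while proc.poll() is None:
        now = time.monotonic()
        if now - started > wall_clock_s:
            reason = "wall-clock timeout"
        elif now - last_output[0] > inactivity_s:
            reason = "inactivity timeout"
        if reason:
            proc.kill()
            break
        time.sleep(0.05)
    proc.wait()
    reader.join(timeout=1)
    return {"returncode": proc.returncode, "reason": reason, "output": "".join(lines)}


if __name__ == "__main__":
    # A child that prints nothing trips the inactivity timeout first.
    silent = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_dual_timeout(silent, wall_clock_s=60, inactivity_s=1)["reason"])
```

The two limits answer different questions: the wall-clock limit bounds total spend, while the inactivity limit catches a hang without penalizing a long run that is still producing output.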

Active

Shared repo mounts cause cascading failures

What broke: Concurrent workers sharing one host-mounted .git directory caused corrupted config, stale worktrees, and push collisions.

Recovery shipped: Colony moved toward per-worker clone isolation so one worker cannot corrupt another worker’s git state.

Active

Incremental TypeScript caches lie in worktrees

What broke: Stale .tsbuildinfo files let cross-package type errors pass locally and fail later in clean Docker builds.

Recovery shipped: Pre-push checks delete TypeScript build info before type-checking changes that can affect multiple packages.
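
A sketch of the cache-clearing step, assuming build info files end in `.tsbuildinfo` and that anything under `node_modules` should be left alone (that exclusion is an assumption of this example). The type-check itself, for example `tsc --noEmit`, is left to the caller.

```python
from pathlib import Path


def clear_ts_build_info(repo_root):
    """Delete incremental TypeScript build info so the next type-check is cold."""
    root = Path(repo_root)
    removed = []
    # Materialize the matches first so deletion does not race the directory walk.
    for path in sorted(root.rglob("*.tsbuildinfo")):
        if "node_modules" in path.relative_to(root).parts:
            continue  # assumption: dependency caches are not ours to clear
        path.unlink()
        removed.append(path)
    return removed
```

Returning the removed paths gives the pre-push check something to log, which helps when a type error only appears after the cache is cleared.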

Parallel work

Dependency coordination

Parallel autonomous development fails fastest when issues share infrastructure. Colony makes dependency edges explicit and reconciles them when events are missed.

Active

Cross-issue dependencies are the hard part

What broke: One issue shipped a shared type while another was written against old main, producing rebase and reimplementation churn.

Recovery shipped: Planner dependency audits, conflict resolution, and explicit depends_on ordering serialize work where independence is false.

Resolved

Structured dependency graph replaced body parsing

What broke: HTML comments in issue bodies were fragile, race-prone, and expensive to parse on every poll cycle.

Recovery shipped: Postgres now stores dependency edges with typed relationships, cycle checks, and resolution queries.
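
The cycle check is the part that is easiest to show in isolation. This is an in-memory sketch of the idea, assuming edges are held as a mapping from an issue to the set of issues it depends on; how Colony expresses the same check against Postgres is not described here.

```python
def would_create_cycle(edges, issue, depends_on):
    """Return True if adding the edge issue -> depends_on would close a cycle.

    `edges` maps an issue to the set of issues it already depends on.
    """
    stack, seen = [depends_on], set()
    while stack:
        node = stack.pop()
        if node == issue:
            return True  # the new dependency can already reach the issue
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False
```

Running the check before the insert keeps the stored graph acyclic, so resolution queries never have to defend against loops.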

Active

Event-driven resolution still needs reconciliation

What broke: If a worker crashed after resolving a dependency in the database but before removing the blocked label, the issue stayed stuck.

Recovery shipped: Periodic scans verify dependency-blocked issues and repair missed unblock events.
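
A reconciliation pass can be very small. In this sketch the two callables, `open_dependencies` and `unblock`, stand in for a database query and a label update; both names are hypothetical.

```python
def reconcile_blocked(blocked_issues, open_dependencies, unblock):
    """Unblock every issue that is still marked blocked but has nothing open.

    open_dependencies(issue) returns the unresolved dependencies of an issue;
    unblock(issue) repairs the missed event. Both are injected by the caller.
    """
    repaired = []
    for issue in blocked_issues:
        if not open_dependencies(issue):
            unblock(issue)
            repaired.append(issue)
    return repaired
```

Because the scan derives the answer from current state rather than from events, it is safe to run repeatedly and after any crash.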

State machine

Pipeline state and GitHub consistency

GitHub is the collaboration surface, not the state authority. The pipeline needs transactional state, durable event history, and projections that can be repaired.

Resolved

GitHub labels diverged from pipeline state

What broke: Containers read different cached label snapshots, so blocked issues could be re-enqueued and processed twice.

Recovery shipped: Postgres became authoritative for state, blocked flags, and pause flags; GitHub labels are synced as a projection.

Active

Metrics need event-sourced timestamps

What broke: Issue updatedAt changed for comments, edits, and labels, making throughput and cycle-time metrics unreliable.

Recovery shipped: Cycle metrics are derived from label and state-transition events rather than mutable entity timestamps.
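
Deriving cycle time from transition events might look like the sketch below. The event shape (timestamp, state) and the state names passed in are assumptions; the rule used here is first entry into the start state to last later entry into the end state.

```python
from datetime import datetime, timedelta
from typing import Optional


def cycle_time(
    events: list, start_state: str, end_state: str
) -> Optional[timedelta]:
    """Compute cycle time from (timestamp, state) transition events.

    Returns None when the issue never started or never finished.
    """
    started: Optional[datetime] = None
    finished: Optional[datetime] = None
    for timestamp, state in sorted(events):
        if state == start_state and started is None:
            started = timestamp
        elif state == end_state and started is not None:
            finished = timestamp
    if started is None or finished is None:
        return None
    return finished - started
```

Comments and edits never appear in the event list, so they cannot move the result the way a mutable updatedAt field does.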

Active

Paused resume needs visible failure handling

What broke: If the original paused state was missing, removing the pause label could silently leave an issue stuck.

Recovery shipped: The planned recovery posts a diagnostic comment and routes the issue to needs-human rather than failing silently.

Ledger

Cost, budget, and observability

Agent autonomy is not operational unless cost is attributed to the issue, agent, repo, and tenant. Colony treats spend as pipeline data, not billing exhaust.

Active

Cost tracking must be centralized

What broke: Cost events split across comments, memory, and partial tables made per-issue spend incomplete and hard to audit.

Recovery shipped: Executors emit cost events to the Pipeline Store with agent, turn count, token usage, issue, repo, and timestamp.

Active

Budget enforcement happens before the next run

What broke: A stuck issue can compound cost if each retry starts before the system checks the budget ceiling.

Recovery shipped: Workers enforce budgets at task dequeue time, before giving another executor a chance to spend.
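
A sketch of a dequeue-time budget check, assuming a single per-issue ceiling and an in-memory queue; the task shape and the function name are hypothetical.

```python
from collections import deque


def dequeue_next(queue, spent_by_issue, ceiling):
    """Pop tasks until one is within budget.

    Returns (task, blocked), where blocked lists the tasks that were
    skipped because their issue has already reached the ceiling.
    """
    blocked = []
    while queue:
        task = queue.popleft()
        if spent_by_issue.get(task["issue"], 0.0) >= ceiling:
            blocked.append(task)  # never handed to an executor
            continue
        return task, blocked
    return None, blocked


if __name__ == "__main__":
    pending = deque([{"issue": 1}, {"issue": 2}])
    print(dequeue_next(pending, {1: 12.0, 2: 3.0}, ceiling=10.0))
```

Checking at dequeue means the over-budget retry is stopped before it can start, rather than being noticed after it has already spent.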

Active

Dashboards need persistent data

What broke: Activity feeds sourced from in-memory health snapshots disappeared on agent restart.

Recovery shipped: Historical dashboards query persistent transition, run, and cost tables; health snapshots stay limited to liveness.

Merge discipline

Review and merge reliability

The merge gate is where autonomy has to respect the team’s existing controls. Colony separates authoring, review, approval, and merge identity so the pipeline cannot approve itself by accident.

Resolved

Force-push after rebase can close a PR

What broke: GitHub sometimes closed pull requests after conflict resolution pushed a rewritten branch.

Recovery shipped: The Marshal rechecks PR state after force-push, attempts reopen, and creates a replacement PR if needed.

Active

Conflict resolution needs confidence gates

What broke: LLM conflict resolution can make judgment calls that are not safe to merge automatically.

Recovery shipped: Per-file confidence scores gate the next step: high confidence may proceed, low confidence escalates.
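
The gate itself can be sketched in a few lines. The 0.8 threshold, the score range, and the rule that a single low-confidence file escalates the whole resolution are assumptions made for this example.

```python
def gate_resolution(file_scores, threshold=0.8):
    """Decide the next step from per-file confidence scores (0.0 to 1.0)."""
    low = sorted(path for path, score in file_scores.items() if score < threshold)
    if low:
        return {"action": "escalate", "files": low}
    return {"action": "proceed", "files": []}
```

Returning the low-confidence files alongside the decision tells the human reviewer where to look first.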

Active

Bot review state can block human approval

What broke: A stale bot CHANGES_REQUESTED review can keep auto-merge blocked even after a human approves the needs-human PR.

Recovery shipped: The fix path dismisses or neutralizes stale bot reviews when deterministic checks have passed and human approval is present.

Operations

Deployment and infrastructure

Production agent systems inherit the ordinary problems of distributed services: container lifecycle, rate limits, environment propagation, and platform-specific behavior.

Resolved

Orphaned containers caused ENFILE cascades

What broke: Old worker containers survived deploy cycles, leaked file descriptors, and eventually made healthy processes fail too.

Recovery shipped: Deployment scripts detect orphan containers during safestop and shutdown paths.

Active

GitHub API rate limits need layered defense

What broke: At multi-repo scale, enrichment and polling can exhaust installation-level GitHub API budgets.

Recovery shipped: Colony uses conditional requests, caching, batched operations, queued enrichment, and graceful degradation.
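
Of the layers listed, conditional requests are the simplest to illustrate. The sketch below assumes an injected `fetch(url, headers)` callable that returns `(status, headers, body)`; the class name `ETagCache` is hypothetical. It relies on the standard HTTP `ETag` and `If-None-Match` headers, which GitHub's REST API documents support for.

```python
class ETagCache:
    """Serve cached bodies when the server answers 304 Not Modified."""

    def __init__(self, fetch):
        self.fetch = fetch  # fetch(url, headers) -> (status, headers, body)
        self.entries = {}   # url -> (etag, body)

    def get(self, url):
        cached = self.entries.get(url)
        headers = {"If-None-Match": cached[0]} if cached else {}
        status, response_headers, body = self.fetch(url, headers)
        if status == 304 and cached:
            return cached[1]  # unchanged since last poll; reuse the cached body
        if status == 200 and "ETag" in response_headers:
            self.entries[url] = (response_headers["ETag"], body)
        return body
```

Polling loops benefit most from this pattern, since most polls find nothing changed and can be answered from the cache.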

Active

Production config needs extra scrutiny

What broke: A self-improvement change added a startup gate that was directionally correct for OSS but broke the production multi-repo deployment.

Recovery shipped: Config-affecting self-improvement work requires review against production deployment shape, not only the test fixture.

The catalogue keeps growing because the pipeline keeps running.

Bring a real repo. We’ll show you what the same evidence trail looks like on your work.