
Colony has been developing Colony for months

This isn't an announcement. It's a disclosure of evidence already accumulated.

Most launches in this space follow the same script: a demo, a waitlist, a series of claims about production-readiness that nobody outside the company can verify. Every team building an autonomous coding agent will say their system is ready for real work. Some of those claims will turn out to be true. Most won’t.

We decided early that we wouldn’t write that kind of post. So this one is different. Colony has been developing Colony — its own codebase, its own roadmap, its own bug backlog — for months. The data exists. The receipts exist. The launch is the disclosure.

What “Colony builds Colony” actually means

Colony is an autonomous software development pipeline. It processes GitHub issues from intake through merged PR (analyze, implement, review, merge) with no human in the loop on the happy path. Seven agents handle the lifecycle: Mayor, Surveyor, Builder, Inspector, Marshal, Architect, and Sentinel. State lives in Postgres; GitHub labels are only a projection of that state. The pipeline itself is a 13-state state machine.
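To make "state machine with labels as a projection" concrete, here is a minimal sketch. It is not Colony's actual code. The state names are a hypothetical subset (this post does not list the real 13), the transition table is assumed, the `colony:` label prefix is invented for illustration, and an in-memory log stands in for Postgres.

```typescript
// Hypothetical subset of pipeline states. The real pipeline has 13 states,
// and their names are not given in this post.
type State =
  | "ready"
  | "analyzing"
  | "implementing"
  | "reviewing"
  | "merging"
  | "merged"
  | "human-review-ready";

// Allowed transitions (assumed for illustration). Anything not listed is rejected.
const ALLOWED: Record<State, State[]> = {
  ready: ["analyzing"],
  analyzing: ["implementing", "human-review-ready"],
  implementing: ["reviewing", "human-review-ready"],
  reviewing: ["merging", "implementing", "human-review-ready"],
  merging: ["merged", "human-review-ready"],
  merged: [],
  "human-review-ready": ["ready"],
};

interface Transition {
  issue: number;
  from: State;
  to: State;
  seq: number; // position in the log
}

// In-memory stand-in for a Postgres-backed pipeline store.
class PipelineStore {
  private log: Transition[] = [];
  private current: { [issue: number]: State | undefined } = {};

  open(issue: number): void {
    this.current[issue] = "ready";
  }

  transition(issue: number, to: State): void {
    const from = this.current[issue];
    if (from === undefined) throw new Error(`unknown issue #${issue}`);
    if (ALLOWED[from].indexOf(to) === -1) {
      throw new Error(`illegal transition ${from} -> ${to}`);
    }
    // Record the transition, then update the current state.
    this.log.push({ issue, from, to, seq: this.log.length });
    this.current[issue] = to;
  }

  // Replay the history of one issue from the transition log.
  history(issue: number): State[] {
    if (this.current[issue] === undefined) return [];
    const states: State[] = ["ready"];
    for (const t of this.log) {
      if (t.issue === issue) states.push(t.to);
    }
    return states;
  }

  // The label is derived from stored state; it is never read back as truth.
  label(issue: number): string | undefined {
    const state = this.current[issue];
    return state === undefined ? undefined : `colony:${state}`;
  }
}
```

In this sketch, an attempt to jump an issue straight from "implementing" to "merged" throws, and the log and label stay unchanged. That is the property a label-driven design cannot give you.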

The unusual thing is what runs on it. The primary repository the pipeline processes is its own. New features for Colony — the kind that would normally show up in a team’s sprint — are filed as issues, picked up by the Surveyor, decomposed by the Architect when they’re too big, implemented by the Builder, reviewed by the Inspector, and merged by the Marshal. The team writes prose; the pipeline writes the code. Most of what shipped this quarter shipped that way.

This isn’t a closed loop. Humans still file the issues, approve sensitive PRs, decide priorities, and intervene when something goes wrong. But the day-to-day mechanical work of getting an issue from “ready” to “merged” runs without us.

The receipts

The most important numbers come from one specific window: a 36-hour session on March 31 – April 1, 2026.

  • 120+ PRs merged in 36 hours. Each one went through analysis, implementation, code review, and merge. The Inspector ran deterministic checks (build, tests, lint, types) plus an LLM review before any merge.
  • 38,000+ lines of TypeScript. This isn’t a script. It’s a real system — agents, executors, a pipeline store, a worker pool, a webhook receiver, a monitor, a dashboard.
  • 3,700+ tests. Tested like production software because it is production software.
  • 13 packages. A modular architecture, not a monolith. Each agent is a package; the worker is a package; the pipeline store is a package.
  • 9 repos across 3 tenants. Multi-repo, multi-tenant. The pipeline operates across separate codebases owned by separate organizations, not only against Colony itself.
  • A 13-state pipeline. Structured state transitions, not ad-hoc branching. Every transition is recorded in Postgres. You can replay the history of any issue.
  • A production failure catalogue. Every failure mode encountered in production is documented in docs/production-learnings.md and encoded as recovery logic. When the Builder agent hallucinates an export, the Inspector cross-checks against the actual branch state. When a subprocess hangs, dual timeouts catch it. When dependency edges leave an issue blocked after resolution, a periodic failsafe scan unsticks it.
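As one illustration of what "encoded as recovery logic" can look like, here is a rough sketch of the failsafe scan from the last bullet. The issue shape, field names, and the rule for unknown dependencies are assumptions made for this example, not Colony's schema.

```typescript
// Hypothetical issue shape; field names are assumptions for this sketch.
interface TrackedIssue {
  id: number;
  state: "blocked" | "ready" | "merged";
  blockedBy: number[]; // ids of issues this one depends on
}

// Re-check every blocked issue and release those whose dependencies have all
// merged. Returns the ids that were unstuck. Unknown dependencies are treated
// as still blocking, which is the conservative choice.
function failsafeScan(issues: TrackedIssue[]): number[] {
  const byId: { [id: number]: TrackedIssue | undefined } = {};
  for (const issue of issues) byId[issue.id] = issue;

  const unstuck: number[] = [];
  for (const issue of issues) {
    if (issue.state !== "blocked") continue;
    const stillBlocked = issue.blockedBy.some((dep) => {
      const d = byId[dep];
      return d === undefined || d.state !== "merged";
    });
    if (!stillBlocked) {
      issue.state = "ready";
      unstuck.push(issue.id);
    }
  }
  return unstuck;
}
```

The scan is deliberately redundant with whatever event normally clears a dependency edge. It is there for the cases where that event was missed, so running it on a timer costs little and is idempotent.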

We’re not publishing these numbers as marketing. They’re the things you’d find in a status report. The point isn’t that they’re impressive — it’s that they exist, which is what most of this market lacks.

Why this is hard

Demoing an agent that turns one issue into one PR is a solved problem. Many teams can show that on a quiet repository under controlled conditions. The hard part is everything that happens around it.

Issues come in malformed. Branches drift. CI flakes. Reviewers ask for changes that conflict with someone else’s PR that just merged. Subprocesses hang. Models occasionally fabricate evidence to satisfy an instruction. The same model gives a different answer to the same prompt on a different day. Workspaces accumulate stale state across container restarts. Cost runs away if you don’t budget it per issue.

Colony’s design is built around the assumption that all of these will happen, and that the orchestration layer — not the model, not the agent prompt — is what determines whether a system survives them. Postgres is the authority because GitHub labels are eventually consistent. Circuit breakers wrap every looping failure path. The Inspector escalates to humans when it can’t parse an LLM output, instead of retrying into the same failure. Force-pushes after rebase are detected and recovered from. Workers clone their own repos to avoid sharing git state.
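The "escalate instead of retrying into the same failure" idea can be sketched as a small circuit breaker. This is a simplified illustration under assumptions: the class name, the failure threshold, and the three outcome kinds are hypothetical, and a production breaker would also need a persistence and reset policy.

```typescript
// Result of one guarded attempt. "escalate" means: stop looping, hand to a human.
type Outcome<T> =
  | { kind: "ok"; value: T }
  | { kind: "retry"; error: string }
  | { kind: "escalate"; error: string };

// Counts consecutive failures and opens once a threshold is reached.
class CircuitBreaker {
  private failures = 0;
  private readonly maxFailures: number;

  constructor(maxFailures: number) {
    this.maxFailures = maxFailures;
  }

  isOpen(): boolean {
    return this.failures >= this.maxFailures;
  }

  run<T>(attempt: () => T): Outcome<T> {
    // Once open, the attempt is not executed at all.
    if (this.isOpen()) {
      return { kind: "escalate", error: "circuit open" };
    }
    try {
      const value = attempt();
      this.failures = 0; // a success resets the consecutive-failure count
      return { kind: "ok", value };
    } catch (err) {
      this.failures += 1;
      const error = err instanceof Error ? err.message : String(err);
      return this.isOpen()
        ? { kind: "escalate", error }
        : { kind: "retry", error };
    }
  }
}
```

Wrapped around something like parsing a review verdict from model output, a breaker with a threshold of two would allow one retry, escalate on the second consecutive failure, and refuse to run the parser again after that.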

None of this is glamorous. It’s the kind of work you only know to do after you’ve watched it fail in production. We did the failing. The catalogue is the record.

What we are not claiming

Colony is not a panacea. It does not write all of our code. It does not eliminate the need for engineering judgment, code review, or operational care. There are issues it can’t handle — ones that require domain knowledge it doesn’t have, decisions that need a human, ambiguity that needs a conversation. We escalate those to the human-review-ready state and a person picks them up.

What Colony does is take the well-scoped, mechanically tractable work — which turns out to be a lot of the work — and run it through a pipeline you can see end to end. Every decision is traceable. Every dollar is attributed. Every failure is classified.

What this is

This post is the start of a conversation. Colony is launching, but not in the way launches usually work in this category: there is no waitlist, no closed beta gated on a sales call, no “we’ll let you in if you’re a Series B with the right ARR.” We’re opening up the pilot conversation to teams that run their own software and want to operate Colony, either as the open-source pipeline (github.com/RunColony/colony) or on the managed Cloud product.

If you want to talk to us about a pilot, the page is here. We answer the form ourselves.

If you want to read the code, the repo is public.

We’ll keep posting here as the pipeline keeps running. The next post catalogues the six failure modes we encoded as recovery logic — the ones that don’t show up in any demo.
